Know how to do the things in the recent quizzes and homeworks.
Note that the quizzes are necessarily very short, so the questions are broad (e.g. describe in a few sentences how you would implement something), whereas on an exam you will be asked to actually implement the solution for a set of data.
The exam will overwhelmingly favor recent material (after the first midterm).
Understand what a language model is and how to implement it with a neural network. I.e. it's just a function like any other that we model with a neural network, except its inputs are usually (one or more) words and its output is a softmax that is interpreted as a probability distribution over the next word.
Know how to take a dataset consisting of a single long string of text, and convert it into a machine learning dataset for training a language model.
Know how to make an adjacency matrix from a data matrix.
Know the steps for spectral embedding of a graph (sketched below).
Compare and contrast the embedding matrix, the term-document matrix, and the TF-IDF model.
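The two graph items above might look like the following sketch (NumPy only; the Gaussian similarity and its scale are illustrative choices, not the only way to build an adjacency matrix from data):

```python
import numpy as np

def adjacency_from_data(X, sigma=1.0):
    """Build a similarity (adjacency) matrix from an n-samples x d-features data matrix.
    Here a Gaussian kernel on pairwise Euclidean distances is used (an illustrative choice)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    A = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)           # no self-edges
    return A

def spectral_embedding(A, k=2):
    """Embed the graph with adjacency A into k dimensions via the graph Laplacian."""
    D = np.diag(A.sum(axis=1))         # degree matrix
    L = D - A                          # (unnormalized) graph Laplacian
    vals, vecs = np.linalg.eigh(L)     # eigenvectors, sorted by eigenvalue
    return vecs[:, 1:k + 1]            # skip the constant eigenvector; keep the next k

X = np.random.rand(10, 3)              # toy data matrix
coords = spectral_embedding(adjacency_from_data(X), k=2)
```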
Q1: How, generally, do you turn words into vectors using one-hot encoding?
Q2: How can an embedding matrix be implemented with a neural network?
Q3: How would you convert a single long sequence into a collection of samples for training a language model?
A1: For a vocabulary of V unique words, make a dictionary where the key is the word and the value is its unique index k in the vocabulary. Then represent each word as a vector of length V which has a one in the kth element and zeros elsewhere.
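A minimal NumPy sketch of A1 (the toy sentence and vocabulary are made up for illustration):

```python
import numpy as np

words = "the cat sat on the mat".split()       # toy corpus
vocab = sorted(set(words))                     # V unique words
index = {w: k for k, w in enumerate(vocab)}    # word -> its index k

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0                       # one in the kth element, zeros elsewhere
    return v

x = one_hot("cat")
```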
A2: Directly use the embedding matrix as the weight matrix for a dense layer (be able to draw this). I.e. the dense layer should compute $\mathbf y = \sigma(\mathbf E^T \mathbf x + \mathbf b) = \mathbf E^T \mathbf x$, because we set the layer to use no activation function and no bias.
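In NumPy terms (the sizes below are arbitrary), the layer is just $\mathbf E^T \mathbf x$ with no bias and no activation, which for a one-hot $\mathbf x$ simply selects the row of $\mathbf E$ belonging to that word:

```python
import numpy as np

V, d = 5000, 64                   # vocabulary size, embedding dimension (arbitrary)
E = np.random.randn(V, d)         # embedding matrix: one row per word

x = np.zeros(V)
x[17] = 1.0                       # one-hot vector for the word with index 17
y = E.T @ x                       # dense layer with weights E, no bias, no activation
assert np.allclose(y, E[17])      # equivalent to looking up row 17 of E
```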
A3.1: For a feedforward network, slide a window of N words over the text (i.e. for sample i, use the ith word through the (i+N-1)th word). Use the (one-hot encoded) last word of the window as the target $\mathbf y^{(i)}$ and the previous (one-hot encoded) N-1 words as the sample input vector $\mathbf x^{(i)}$. The output layer is a softmax over possible words. There are other ways too.
A3.2: For a recurrent network you can simply use the ith word as the target $\mathbf y^{(i)}$ and the (i-1)th word as the input $\mathbf x^{(i)}$; the recurrent state carries the information from the earlier words.
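A sketch of both constructions over an integer-encoded token sequence (NumPy only; the window length and the random stand-in text are arbitrary, and one-hot encoding or an embedding layer would then be applied to these indices):

```python
import numpy as np

tokens = np.random.randint(0, 5000, size=1000)   # stand-in for an integer-encoded text
N = 5                                            # window length (arbitrary)

# A3.1: feedforward samples -- N-1 context words in, the Nth word as the target.
X_ff = np.stack([tokens[i:i + N - 1] for i in range(len(tokens) - N + 1)])
y_ff = np.array([tokens[i + N - 1] for i in range(len(tokens) - N + 1)])

# A3.2: recurrent samples -- input is the sequence, target is the same sequence shifted by one.
X_rnn = tokens[:-1]
y_rnn = tokens[1:]
```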
Q: Describe the function of an embedding layer?
A: An embedding layer implements the multiplication of an embedding matrix with a one-hot encoded input vector: the input is the one-hot encoded word and the output is the embedded version of that word.
Q: How does an embedding layer differ from a normal neural network layer?
A: An embedding layer is like a dense layer, but with no activation function or bias for the nodes.
Q: The precision matrix is the inverse of: (a) the covariance matrix (b) the correlation matrix (c) the adjacency matrix
A: (a) the covariance matrix
Q: What is the difference between a neural language model and an $N$-gram language model?
A: A neural language model relates embeddings of words rather than the words themselves.
Q: What is the difference between a recurrent neural language model and a feedforward neural language model?
A: A feedforward neural language model predicts the next word using a limited number of inputs, similar to an $N$-gram model, while the output of a recurrent neural language model depends on all previous words.
Q: How is the probability distribution implemented in a neural language model?
A: The output layer is a softmax with one output per word in the vocabulary; the softmax values can be interpreted as probabilities for each word.
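A minimal sketch of such an output layer (assuming Keras/TensorFlow; the vocabulary size and layer width are arbitrary illustration values, not from the course):

```python
import tensorflow as tf

V = 5000  # vocabulary size (arbitrary illustration value)

# A next-word predictor: one-hot previous word in, softmax over the vocabulary out.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(V,)),                      # one-hot encoded previous word
    tf.keras.layers.Dense(64, activation='relu'),    # hidden layer (arbitrary width)
    tf.keras.layers.Dense(V, activation='softmax'),  # one output per word, values sum to 1
])
model.compile(optimizer='adam', loss='categorical_crossentropy')  # targets are one-hot next words
```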
Q: How do you autogenerate text for a particular topic like sports with a recurrent neural language model?
A: Feed a sequence of words from real sports text into the neural language model for some "prefix" length L, then after input L start using the outputs of the language model to choose its inputs. I.e. take the prediction of the most likely next word and use it as the next input word.
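A sketch of that generation loop; `model` and `prefix_ids` are placeholders (the model is assumed to map an integer word sequence to a softmax distribution over the next word, which is an assumed interface, not a specific library API):

```python
import numpy as np

def generate(model, prefix_ids, num_words):
    """Autogenerate text: first feed a real prefix (e.g. sports text), then feed the
    model's own predictions back in as inputs. `model` is assumed to map an integer
    sequence to a probability distribution over the next word (placeholder interface)."""
    seq = list(prefix_ids)                            # the "prefix" of length L from real text
    for _ in range(num_words):
        probs = model.predict(np.array([seq]))[0]     # softmax over the vocabulary
        next_id = int(np.argmax(probs))               # most likely next word (greedy choice)
        seq.append(next_id)                           # use the prediction as the next input
    return seq
```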