Know how to do the things in the recent quizzes and homeworks.
Note that the quizzes are necessarily very short, so the questions are broad (e.g. describe in a few sentences how you would implement something with a computational graph), whereas on an exam you may be asked to actually implement the solution with a computational graph.
The exam will overwhelmingly favor recent material (after the first midterm).
You will not need to memorize code; the APIs will be given if needed (as on the last quiz).
Know how to perform the full machine learning optimization of a model via computational graphs and stochastic gradient descent by hand.
For those still unclear what I am looking for regarding backpropagation, basically I want to see three things: the forward pass through the graph, the backward pass that computes the gradients via the chain rule, and the gradient-descent update of the parameters.
Example(ish):
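For instance, a minimal by-hand run on a single-neuron graph, y = sigmoid(w*x + b) with squared-error loss (the model and numbers here are my own illustration, not a promised exam problem):

import numpy as np

# Forward pass through the computational graph
x, t = 2.0, 1.0             # input and target (made-up numbers)
w, b = 0.5, 0.0             # initial parameters
z = w * x + b               # linear node
y = 1 / (1 + np.exp(-z))    # sigmoid node
L = 0.5 * (y - t) ** 2      # squared-error loss node

# Backward pass: chain rule applied node by node
dL_dy = y - t
dy_dz = y * (1 - y)
dL_dz = dL_dy * dy_dz
dL_dw = dL_dz * x           # gradient for w
dL_db = dL_dz               # gradient for b

# Stochastic gradient descent update
eta = 0.1                   # learning rate
w = w - eta * dL_dw
b = b - eta * dL_db

The three commented stages are the three things referred to above.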
You should understand, fundamentally, what convolution calculates and how it is implemented by a layer in a neural network.
Q1: Give Keras code for a "1x1" Conv2D layer (i.e. the kernel size is (1,1)), where the input is a color image (i.e. 3 channels). Is this useless, or does it calculate something nontrivial?
A1: At each output node you get a combination of the three colors from a pixel in the input image: $y[i,j] = \sigma(w_1 x[i,j,1] + w_2 x[i,j,2] + w_3 x[i,j,3] + b)$. It is not trivial: the layer learns to combine the colors in a way that is useful for the task.
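One possible Keras answer to Q1 (the input image size and the single output filter are my illustrative choices):

import keras

model = keras.Sequential([
    keras.layers.Input(shape=(32, 32, 3)),                  # color image, 3 channels
    keras.layers.Conv2D(1, (1, 1), activation="sigmoid"),   # "1x1" convolution
])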
Be able to write the math function and draw the graph implemented by simple Keras layers, including the simple recurrent layer. You will be given the relevant API pages. You may need to figure out code you haven't used before to test your understanding.
E.g. draw the neural network layers that the following lines implement, assuming an input size of 6.
keras.layers.Conv1D(5,2)
keras.layers.AveragePooling1D(3)
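A quick way to check your drawing against Keras itself; chaining the two layers and using a single input channel are my choices for illustration:

import keras

model = keras.Sequential([
    keras.layers.Input(shape=(6, 1)),
    keras.layers.Conv1D(5, 2),          # 5 filters of width 2, stride 1: length 6 -> 5
    keras.layers.AveragePooling1D(3),   # pool size 3, default stride 3: length 5 -> 1
])
model.summary()                         # prints the output shape of each layer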
Understand what a language model is and how to implement it with a network. I.e. it's just a function like any other we model with a neural network, except its inputs are usually (one or more) words and its output is a softmax that is interpreted as a probability distribution.
Q1: How, in general, do you turn words into vectors using one-hot encoding?
Q2: How can an embedding matrix be implemented with a neural network?
Q3: How would you convert a single long sequence into a collection of samples for training a language model?
A1: For a vocabulary of V unique words, make a dictionary where the key is the word and the value is a unique index k for that word. Then represent each word as a vector of length V which has a one in the kth element and zeros elsewhere; see the sketch below.
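A minimal sketch of A1 in Python (the toy vocabulary is my example):

vocab = ["the", "cat", "sat"]                       # V = 3 unique words
index = {word: k for k, word in enumerate(vocab)}   # word -> unique index k

def one_hot(word):
    v = [0] * len(vocab)
    v[index[word]] = 1
    return v

print(one_hot("cat"))   # [0, 1, 0]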
A2: Directly use the embedding matrix as the weight matrix for a dense layer (be able to draw this). I.e. the dense layer should compute $y = \sigma(\mathbf E^T \mathbf x + \mathbf b) = \mathbf E^T \mathbf x$, because we set the layer to not use activation or bias.
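A sketch of A2 in Keras, loading a made-up embedding matrix E (shape V x d) into a Dense layer with no bias or activation; the sizes and values are illustrative:

import numpy as np
import keras

V, d = 3, 2
E = np.arange(V * d, dtype="float32").reshape(V, d)   # made-up embedding matrix

layer = keras.layers.Dense(d, use_bias=False, activation=None)
layer.build((None, V))
layer.set_weights([E])            # the Dense weight matrix is E, so y = E^T x

x = np.array([[0.0, 1.0, 0.0]])   # one-hot vector for word index 1
print(layer(x))                   # selects row 1 of E: [2. 3.]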
A3.1: For a feedforward network, slide a window of N words along the sequence (i.e. for sample i, use the ith word through the (i+N-1)th word). Use the last (one-hot encoded) word of the window as the target $\mathbf y^{(i)}$ and the previous N-1 (one-hot encoded) words as the sample input vector $\mathbf x^{(i)}$; see the sketch after A3.2. The output layer is a softmax over possible words. There are other ways too.
A3.2: For a recurrent network you can simply use the ith word as the target $\mathbf y^{(i)}$ and the (i-1)th word as the input $\mathbf x^{(i)}$.
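A sketch of the sliding-window construction in A3.1 (the token list and N are my examples; in practice each word would then be one-hot encoded):

tokens = ["the", "cat", "sat", "on", "the", "mat"]
N = 3
samples = []
for i in range(len(tokens) - N + 1):
    window = tokens[i : i + N]
    x, y = window[:-1], window[-1]   # first N-1 words -> input, last word -> target
    samples.append((x, y))

print(samples[0])   # (['the', 'cat'], 'sat')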
Q1: Suppose you have a dataset consisting of faces from mugshots. When you train on it you get a very accurate model for recognizing people in mugshots. However, when you test it on pictures from street cameras it is very inaccurate. How would you use data augmentation to train a better model?
A1: Augment your dataset with many variations of the mugshots. To each variation apply random backgrounds, lighting, and translations, but use the same true label in training.
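One way to express part of A1 with Keras preprocessing layers (the layer choices and factors are illustrative; Keras has no built-in random-background layer, so that step would need custom code):

import keras

augment = keras.Sequential([
    keras.layers.RandomTranslation(0.1, 0.1),   # random shifts
    keras.layers.RandomBrightness(0.2),         # random lighting
    keras.layers.RandomContrast(0.2),
])
# Apply to each training image; the true label is unchanged.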
Q1: When is a softmax typically used?
Q2: When is a ReLU typically used?
Q3: When is a sigmoid used?
A1: final layer activation function for a multi-category classifier; its output can be interpreted as a probability estimate.
A2: hidden layer activations.
A3: final layer activation function for a two-category classifier; its output can be interpreted as a probability estimate.
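Putting the three answers together in a Keras sketch (the layer sizes are arbitrary):

import keras

multiclass = keras.Sequential([
    keras.layers.Dense(32, activation="relu"),      # hidden layers: ReLU
    keras.layers.Dense(10, activation="softmax"),   # multi-category output
])

binary = keras.Sequential([
    keras.layers.Dense(32, activation="relu"),      # hidden layers: ReLU
    keras.layers.Dense(1, activation="sigmoid"),    # two-category output
])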
x=[1,2,3,4,6,7,7,6]
when convolving with [-1,2,-1] your code should have returned [0,0,-1,1,1,1]
The math operation here is a finite-difference second derivative.
when convolving with [1,1,1] your code should have returned [6,9,13,17,20,20]
The math operation here is a smoothing of the sequence.
when convolving with [-1,1] your code should have returned [-1,-1,-1,-2,-1,0,1]
The math operation here is a finite-difference first derivative.
Your answer may differ slightly if you implemented correlation versus convolution.
when convolving with [2] your code should have returned [2,4,6,8,12,14,14,12]
The math operation here is just a scaling by 2.
After implementing a stride of 2, your code should have output every other number from above, e.g., [2,6,12,14] for the last one.
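You can verify all of the above with NumPy; np.convolve flips the kernel, i.e. it implements true convolution:

import numpy as np

x = [1, 2, 3, 4, 6, 7, 7, 6]
for k in ([-1, 2, -1], [1, 1, 1], [-1, 1], [2]):
    print(k, np.convolve(x, k, mode="valid"))

# A stride of 2 just keeps every other output:
print(np.convolve(x, [2], mode="valid")[::2])   # [ 2  6 12 14]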
Using python-like notation...
$$h[m,n] = \sigma\Big(\sum_{i=-1}^{+1}\sum_{j=-1}^{+1}\sum_{k=0}^{1} x[m-i,\, n-j,\, k]\, w[i,\, j,\, k] + b \Big)$$
The actual ranges of the indices are arbitrary.
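A direct NumPy transcription of the formula (index ranges shifted to start at 0, which as noted is arbitrary; the input size, kernel, and bias are made up, and a true convolution would also flip the kernel):

import numpy as np

def sigma(z):
    return 1 / (1 + np.exp(-z))

x = np.random.rand(5, 5, 2)   # 5x5 input with 2 channels
w = np.random.rand(3, 3, 2)   # 3x3 kernel spanning both channels
b = 0.1

h = np.zeros((3, 3))          # "valid" output: (5-3+1) x (5-3+1)
for m in range(3):
    for n in range(3):
        s = sum(x[m + i, n + j, k] * w[i, j, k]
                for i in range(3) for j in range(3) for k in range(2))
        h[m, n] = sigma(s + b)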