Given training data $(\mathbf x_{(i)},y_i)$ for $i=1,...,m$ --> lists of samples $\verb|X|$ and labels $\verb|y|$
Choose a model $f(\cdot)$ where we want to make $f(\mathbf x_{(i)})\approx y_i$ (for all $i$) --> choose sklearn estimator to use
Define a loss function $L(f(\mathbf x), y)$ to minimize by changing $f(\cdot)$ ...by adjusting the weights --> default choices for estimators, sometimes multiple options
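A minimal sklearn sketch of this recipe (the estimator and the toy data are illustrative choices, not from the lecture):

from sklearn.linear_model import LogisticRegression

# Toy stand-ins for X (list of samples) and y (list of labels)
X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
y = [1, 0, 1, 0]

clf = LogisticRegression()        # choose an estimator; its loss function is a built-in default
clf.fit(X, y)                     # adjusts the weights to minimize that loss
print(clf.predict([[0.5, 1.0]]))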
Very flexible machine learning method(s), i.e. there are many more options and variations to choose from (a minimal Keras sketch follows this list)
Make up an architecture - choose layers and their parameters - define a custom $f(\mathbf x_{(i)})$
Choose a loss function - how to compute the error when $f(\mathbf x_{(i)}) \ne y_i$
Choose an optimization method - many variants of the same basic method
Choose "regularization" tricks to prevent overfitting
Handle other important details like initializing and normalizing data
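A minimal Keras sketch of these choices (layer sizes, optimizer, and dropout rate are illustrative assumptions, not a recommended recipe):

import keras                          # or: from tensorflow import keras
from keras import layers

# Architecture: a custom f(x) built from layers (sizes here are arbitrary)
model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),              # one of the "regularization" tricks
    layers.Dense(1, activation='sigmoid'),
])

# Loss function and optimization method are chosen at compile time
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
# model.fit(X_train, y_train, ...)    # data assumed to be normalized beforehand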
This is just a big complex model that computes $y = f(\mathbf x)$
Notice the output shape was 3D, because each sample consisted of a list of words, each embedded as a vector
For each word in an input consisting of $N$ words, the embedding layer replaces the usual dense-layer computation $\mathbf h(\mathbf x) = \sigma(\mathbf z) = \sigma(\mathbf W^T \mathbf x + \mathbf b)$ with a lookup, so the layer output is $$\mathbf h(\mathbf x) = [\mathbf E^T \mathbf x^{(1)},\mathbf E^T \mathbf x^{(2)},\ldots,\mathbf E^T \mathbf x^{(N)}]$$
"Flatten" layer converted this to a single concatenated vector
Dominant architecture type for processing images
Not immediately obvious that they would work well for text, but their major advantages in efficiency and speed allow the use of many layers, and they give good performance when adapted for text processing
Kim 2014: treat text as a sequence of finite length
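A Keras sketch of the idea (not Kim's exact architecture; all sizes are illustrative): a 1D convolution slides a fixed-width window of tokens over the sequence, and pooling keeps each filter's strongest response.

import keras
from keras import layers

vocab_size, embed_dim, seq_len = 10000, 32, 100    # made-up sizes

model = keras.Sequential([
    keras.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, embed_dim),
    layers.Conv1D(32, 7, activation='relu'),       # convolve a 7-token window over the sequence
    layers.GlobalMaxPooling1D(),                   # keep each filter's strongest response
    layers.Dense(1, activation='sigmoid'),
])
model.summary()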
Simple recurrent neural network after Elman (Elman, 1990). The hidden layer includes a recurrent connection as part of its input. That is, the activation value of the hidden layer depends on the current input as well as the activation value of the hidden layer from the previous time step.
Simple recurrent neural network illustrated as a feedforward network.
A simple recurrent neural network shown unrolled in time. Network layers are copied for each time step, while the weights $\mathbf U$, $\mathbf V$ and $\mathbf W$ are shared in common across all time steps
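A minimal numpy sketch of this unrolled computation (the sizes and the tanh/linear activations are assumptions for illustration): the same $\mathbf U$, $\mathbf V$ and $\mathbf W$ are reused at every time step.

import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 4, 3, 2, 5        # made-up sizes; T time steps

# Weights shared across all time steps
W = rng.normal(size=(d_h, d_in))        # input  -> hidden
U = rng.normal(size=(d_h, d_h))         # hidden -> hidden (the recurrent connection)
V = rng.normal(size=(d_out, d_h))       # hidden -> output

x = rng.normal(size=(T, d_in))          # one input sequence
h = np.zeros(d_h)                       # initial hidden state

for t in range(T):                      # the loop "unrolled" over time
    h = np.tanh(U @ h + W @ x[t])       # depends on the current input and the previous h
    y = V @ h                           # output at time t (pre-softmax)
    print(t, np.round(y, 3))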
Effective approach to language modeling
Form the basis for sequence-to-sequence approaches (Ch. 10 and 11)
Autoregressive text generation... demo later
Goal: compute the conditional probability of the next word in a sequence given the preceding words $P(w_{n}|w_1^{n−1})$
$N$-gram models and feedforward networks with sliding windows both model this
The quality of a model is largely dependent on the size of the context and how effectively the model makes use of it.
Both make the limited approximation $P(w_{n}|w_1^{n-1}) \approx P(w_n|w_{n-N+1}^{n-1})$
For the probability of a sequence, combine the probabilities of the words - recall the chain rule of probability:
\begin{align} P(w_{1}^{n}) &= \prod_{k=1}^n P(w_{k}|w_1^{k-1}) \\ &= \prod_{k=1}^n y_k \end{align}
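A tiny numeric illustration (the per-word probabilities below are made up): the $y_k$ multiply together, or equivalently their logs add.

import numpy as np

# Made-up probabilities the model assigns to each actual next word w_k
y = np.array([0.2, 0.05, 0.4, 0.1])

seq_prob = np.prod(y)                # P(w_1^n) as a product of conditionals
log_prob = np.sum(np.log(y))         # the same computation in log space (numerically safer)
print(seq_prob, np.exp(log_prob))    # both ~0.0004
# Each term -log(y_k) is exactly the per-word cross-entropy loss discussed next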
The cross-entropy loss for a single example gives the negative log probability assigned to the correct class, which is the result of applying a softmax to the final output layer:
\begin{align} L_{CE} &= -\log\hat{y}_i \\ &= -\log\frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}} \end{align}
Stacked RNNs: the output of a lower level serves as the input to higher levels, with the output of the last network serving as the final output.
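A Keras sketch of stacking (layer type and sizes are illustrative): the lower layer must return its whole sequence of hidden states so the higher layer has an input at every time step.

import keras
from keras import layers

model = keras.Sequential([
    keras.Input(shape=(None, 16)),                 # (time steps, features per step)
    layers.SimpleRNN(32, return_sequences=True),   # lower level: outputs at every time step
    layers.SimpleRNN(32),                          # higher level: final output only
    layers.Dense(1, activation='sigmoid'),
])
model.summary()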
Simple recurrent network: the hidden state at time $t$ represents everything the network knows about the sequence up to that point, i.e. it is computed from the inputs up through time $t$
$$ h_t^{fwd}=RNN_{fwd}(x_1^t) $$
The same training can be performed backward in time by reversing the sequence and starting at the end
$$ h_t^{bkwd}=RNN_{bkwd}(x_t^n) $$
Bi-RNN: separate models are trained in the forward and backward directions, with the output of each model at each time point concatenated to represent the state of affairs at that point in time.
Bi-RNN for sequence classification. The final hidden units from the forward and backward passes are combined to represent the entire sequence. This combined representation serves as input to the subsequent classifier.
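A Keras sketch of this setup (layer sizes are illustrative): `Bidirectional` runs the wrapped layer forward and backward and, by default, concatenates the two final states.

import keras
from keras import layers

vocab_size, embed_dim, seq_len = 10000, 32, 100    # made-up sizes

model = keras.Sequential([
    keras.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, embed_dim),
    layers.Bidirectional(layers.LSTM(16)),         # forward + backward; final states concatenated
    layers.Dense(1, activation='sigmoid'),         # classifier over the combined representation
])
model.summary()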
The flights the airline was cancelling were full.
"was" should have probability of following "flight"
"were" also follows "flight" in sense, delayed due to phrase in middle
Learn to manage this context, rather than hard-coding a strategy into the architecture, using "gates"
Common design pattern for each gate
To select the information to add to the current context
\begin{align} i_t &= \sigma(\mathbf U_i \mathbf h_{t-1} + \mathbf W_i \mathbf x_t) \\ j_t & = g_{t} \odot i_t \end{align}
Update the context
$$ c_t=j_t+k_t $$
To decide what information is required for the current hidden state (as opposed to what information needs to be preserved for future decisions)
\begin{align} o_t &= \sigma(\mathbf U_o \mathbf h_{t-1} + \mathbf W_o \mathbf x_t) \\ h_t & = o_{t} \odot \tanh c_t \end{align}
GRU: a reset gate is used in gating the recurrent input for an intermediate output
$$ \hat{h}_t=\tanh\big(\mathbf U (\mathbf r_t\odot \mathbf h_{t-1}) + \mathbf W \mathbf x_t\big) $$
An update gate is used in combining it with the recurrent output
$$ h_t= (1-z_t)h_{t-1}+z_t\hat{h}_t $$
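A single LSTM time step in numpy following these equations (the forget gate $f_t$ and candidate $g_t$ are filled in under the same design pattern; sizes and random weights are purely illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_h = 4, 3                               # made-up sizes

# One (U, W) pair per gate, all following the common design pattern
U_f, W_f = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in))   # forget gate
U_i, W_i = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in))   # add (input) gate
U_g, W_g = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in))   # candidate content
U_o, W_o = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in))   # output gate

def lstm_step(x_t, h_prev, c_prev):
    f = sigmoid(U_f @ h_prev + W_f @ x_t)       # what to forget from the old context
    k = c_prev * f
    i = sigmoid(U_i @ h_prev + W_i @ x_t)       # what to add to the context
    g = np.tanh(U_g @ h_prev + W_g @ x_t)
    j = g * i
    c = j + k                                   # updated context c_t
    o = sigmoid(U_o @ h_prev + W_o @ x_t)       # what the current hidden state needs
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c)
print(np.round(h, 3))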
Feed-forward, Simple RNN, LSTM, GRU
The increase in complexity is encapsulated within the unit
Externally, the complexity boils down to the recurrent connection(s); the loop is unrolled for optimization
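In Keras that encapsulation means swapping unit types is a one-line change (sizes here are illustrative):

import keras
from keras import layers

def make_model(recurrent_layer):
    return keras.Sequential([
        keras.Input(shape=(None, 16)),      # variable-length sequences of 16 features
        recurrent_layer,                    # the only line that changes between variants
        layers.Dense(1, activation='sigmoid'),
    ])

for rnn in (layers.SimpleRNN(32), layers.LSTM(32), layers.GRU(32)):
    print(type(rnn).__name__, make_model(rnn).count_params())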
Shannon’s method (Shannon, 1951) was used to generate random sentences in Ch. 3
Use characters as your tokens
Use the predicted character as part of the next input sequence to predict the subsequent character
Keep repeating this process
Recall that the chosen token must be re-encoded (one-hot or embedded) when it is fed back as input
import keras
import numpy as np
path = keras.utils.get_file('nietzsche.txt',origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
text = open(path).read().lower()
print('Corpus length:', len(text))
Downloading data from https://s3.amazonaws.com/text-datasets/nietzsche.txt 606208/600901 [==============================] - 1s 2us/step Corpus length: 600893
# Length of extracted character sequences
maxlen = 60
# We sample a new sequence every `step` characters
step = 3
# This holds our extracted sequences
sentences = []
# This holds the targets (the follow-up characters)
next_chars = []
for i in range(0, len(text) - maxlen, step):
sentences.append(text[i: i + maxlen])
next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences))
# List of unique characters in the corpus
chars = sorted(list(set(text)))
print('Unique characters:', len(chars))
# Dictionary mapping unique characters to their index in `chars`
char_indices = dict((char, i) for i, char in enumerate(chars))
# Next, one-hot encode the characters into binary arrays.
print('Vectorization...')
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)  # plain bool: np.bool was removed in newer NumPy
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
for t, char in enumerate(sentence):
x[i, t, char_indices[char]] = 1
y[i, char_indices[next_chars[i]]] = 1
Number of sequences: 200278 Unique characters: 57 Vectorization...
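The remaining steps are only sketched here (a single 128-unit LSTM, one training epoch, and plain sampling from the softmax are simplified, hedged choices rather than the original recipe); the code reuses `text`, `maxlen`, `chars`, `char_indices`, `x`, and `y` from the cells above.

import numpy as np
import keras
from keras import layers

# Model: a single LSTM over the one-hot character windows built above
model = keras.Sequential([
    keras.Input(shape=(maxlen, len(chars))),
    layers.LSTM(128),
    layers.Dense(len(chars), activation='softmax'),   # distribution over the next character
])
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit(x, y, batch_size=128, epochs=1)             # more epochs give less gibberish

# Autoregressive generation: re-encode each sampled character and feed it back in
generated = text[:maxlen]
for _ in range(200):
    x_pred = np.zeros((1, maxlen, len(chars)))
    for t, char in enumerate(generated[-maxlen:]):
        x_pred[0, t, char_indices[char]] = 1
    probs = model.predict(x_pred, verbose=0)[0].astype('float64')
    probs /= probs.sum()                              # guard against float round-off
    next_char = chars[np.random.choice(len(chars), p=probs)]
    generated += next_char
print(generated)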