Category of Methods
Goal: to assign a label chosen from a small fixed set of labels to each element of a sequence
Canonical Example: Part-of-speech (POS) tagging
Traditional grammar identified eight parts of speech: noun, verb, pronoun, preposition, adverb, conjunction, participle, and article.
These formed the basis for practically all subsequent part-of-speech descriptions of most European languages, suggesting their deep importance.
Modern tagsets may use many more tags; the Penn Treebank tagset has 45.
Part-of-speech tagging: the process of assigning a part-of-speech marker to each word in an input text
Also known as...
All three methods have roughly equal performance but different tradeoffs
(8.1) The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
(8.2) There/EX are/VBP 70/CD children/NNS there/RB
(8.3) Preliminary/JJ findings/NNS were/VBD reported/VBN in/IN today/NN ’s/POS New/NNP England/NNP Journal/NNP of/IN Medicine/NNP ./.
Brown corpus - a million words of samples from 500 written texts from different genres published in the United States in 1961
WSJ corpus contains a million words published in the Wall Street Journal in 1989.
Switchboard corpus consists of 2 million words of telephone conversations collected in 1990-1991.
The 45-tag Penn Treebank tagset was collapsed from the 87-tag Brown tagset
It was designed for a treebank in which sentences were parsed (i.e., given tree structures), so it leaves off syntactic information recoverable from the parse tree
Please read J&M Chapter 8 section 1.
Some examples in the following slides...
Interchange with verbs: "beautiful" becomes "to be beautiful"
Interchange with nouns: "I like rice" vs. "rice is likeable"
These classes may be distinguished or lumped together as interjections or adverbs depending on the purpose of the labeling.
Closed = (relatively) fixed vocabulary, versus Open = able to add new words
Closed classes are generally function words (of, it, and, you), which are important to sentence structure. New prepositions are rarely coined.
Nouns and verbs are open classes: new nouns and verbs like iPhone or fax are continually being created or borrowed. Adjectives and adverbs are open classes as well.
Any given speaker or corpus may use different open class words, but all speakers of a language, and sufficiently large corpora, likely share the same set of closed class words
Proper nouns
Common nouns
"a hodge-podge in both form and meaning"
Directional adverbs or locative adverbs (home, here, downhill)
Degree adverbs (extremely, very, somewhat)
Manner adverbs (slowly, slinkily, delicately)
Temporal adverbs (yesterday, Monday)
Some adverbs (e.g., temporal adverbs like Monday) are tagged in some tagging schemes as nouns.
Disambiguation task - resolving ambiguous words (book a flight vs. read a book).
Baseline: given an ambiguous word, choose the tag that occurs most frequently with that word in the training corpus (a minimal code sketch follows below).
Accuracy when training on the WSJ corpus and testing on sections 22-24 of the same corpus:
Consider how this baseline algorithm relates to describing text with $N$-grams.
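A minimal Python sketch of this most-frequent-tag baseline; the (word, tag) training format and the NN fallback for unknown words are assumptions, not part of the original notes:

from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_corpus):
    # tagged_corpus: iterable of (word, tag) pairs -- an assumed input format
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    # keep only each word's most frequent tag
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_most_frequent(words, word_to_tag, default_tag='NN'):
    # unknown words fall back to a default tag (NN here is an assumption)
    return [(w, word_to_tag.get(w, default_tag)) for w in words]

training = [('book', 'VB'), ('a', 'DT'), ('flight', 'NN'),
            ('read', 'VB'), ('a', 'DT'), ('book', 'NN'), ('book', 'NN')]
lexicon = train_most_frequent_tag(training)
print(tag_most_frequent(['book', 'a', 'flight'], lexicon))
# -> [('book', 'NN'), ('a', 'DT'), ('flight', 'NN')]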
In a neural (RNN) tagger, the inputs are word embeddings and the outputs are tag probabilities generated by a softmax layer over the tagset
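A minimal Keras sketch of such a tagger (an LSTM is used here to mirror the classifier code later in these notes; vocab_size, num_tags, maxlen, and the 64-unit sizes are assumed values, not from the original):

from keras.models import Sequential
from keras.layers import Embedding, LSTM, TimeDistributed, Dense

vocab_size = 10000  # assumed vocabulary size
num_tags = 45       # e.g., the Penn Treebank tagset
maxlen = 100        # assumed (padded) sentence length

tagger = Sequential()
tagger.add(Embedding(vocab_size, 64, input_length=maxlen))          # word embeddings in
tagger.add(LSTM(64, return_sequences=True))                         # one hidden state per word
tagger.add(TimeDistributed(Dense(num_tags, activation='softmax')))  # tag probabilities out
tagger.compile(optimizer='rmsprop',
               loss='sparse_categorical_crossentropy',              # integer tag labels per word
               metrics=['acc'])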
Entire sequences of text are classified as belonging to one of a small number of categories.
Sequence classification using a simple RNN combined with a feedforward network. The final hidden state from the RNN is used as the input to a feedforward network that performs the classification.
from keras.datasets import imdb
from keras.preprocessing import sequence
max_features = 10000 # number of words to consider as features
maxlen = 500 # cut texts after this number of words (among top max_features most common words)
batch_size = 32
print('Loading data...')
(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')
print('Pad sequences (samples x time)')
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)
Loading data...
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
input_train shape: (25000, 500)
input_test shape: (25000, 500)
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import Embedding
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(input_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)
Train on 20000 samples, validate on 5000 samples
Epoch 1/10
20000/20000 [==============================] - 139s 7ms/step - loss: 0.5160 - acc: 0.7434 - val_loss: 0.3616 - val_acc: 0.8498
Epoch 2/10
20000/20000 [==============================] - 144s 7ms/step - loss: 0.2970 - acc: 0.8835 - val_loss: 0.2890 - val_acc: 0.8826
Epoch 3/10
20000/20000 [==============================] - 143s 7ms/step - loss: 0.2381 - acc: 0.9087 - val_loss: 0.5040 - val_acc: 0.8442
Epoch 4/10
20000/20000 [==============================] - 148s 7ms/step - loss: 0.2051 - acc: 0.9239 - val_loss: 0.2921 - val_acc: 0.8918
Epoch 5/10
20000/20000 [==============================] - 120s 6ms/step - loss: 0.1776 - acc: 0.9361 - val_loss: 0.4152 - val_acc: 0.8744
Epoch 6/10
20000/20000 [==============================] - 155s 8ms/step - loss: 0.1617 - acc: 0.9408 - val_loss: 0.2996 - val_acc: 0.8764
Epoch 7/10
20000/20000 [==============================] - 122s 6ms/step - loss: 0.1466 - acc: 0.9465 - val_loss: 0.4075 - val_acc: 0.8600
Epoch 8/10
20000/20000 [==============================] - 115s 6ms/step - loss: 0.1312 - acc: 0.9542 - val_loss: 0.4163 - val_acc: 0.8544
Epoch 9/10
20000/20000 [==============================] - 127s 6ms/step - loss: 0.1223 - acc: 0.9588 - val_loss: 0.3178 - val_acc: 0.8804
Epoch 10/10
20000/20000 [==============================] - 115s 6ms/step - loss: 0.1124 - acc: 0.9616 - val_loss: 0.4207 - val_acc: 0.8640
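After training, the held-out test split can be used to check generalization; a minimal usage sketch (not part of the logged run above):

# Evaluate the trained classifier on the IMDB test split.
test_loss, test_acc = model.evaluate(input_test, y_test, batch_size=128)
print('Test accuracy:', test_acc)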