The representation of text (words, sentences, documents, etc.) with vectors
This allows geometric analysis of these vectors (e.g., words may be close together or far apart, as computed via distance metrics on the vectors).
Distributional hypothesis - words that occur in similar contexts tend to have similar meanings, and so should be close together in a geometric sense.
Synonyms (like oculist and eye-doctor) tended to occur in the same environment (e.g., near words like eye or examined) with the amount of meaning difference between two words “corresponding roughly to the amount of difference in their environments” (Harris, 1954, 157).
Lexical Semantics - The linguistic study of word meanings
Lemma - the word at the top of the definition, the main representative form chosen (the "citation form").
mouse (N)
1. any of numerous small rodents...
2. a hand-operated device that controls a cursor...
Wordforms - the other inflected forms of the lemma, such as "mice".
Word sense - the meaning(s) of the lemma
Synonyms - words with the same word sense, e.g. couch/sofa.
Antonyms - words with an opposite meaning. long/short.
Some degree of similarity exists for many words with different meanings. Cat vs. Dog. See vs. Hear.
SimLex-999 dataset (Hill et al., 2015) gives values on a scale from 0 to 10, like the examples below, which range from near-synonyms (vanish, disappear) to pairs that scarcely seem to have anything in common (hole, agreement). Judged by humans.
E.g., coffee vs. cup: related (they occur together) but not similar.
Multiple kinds of relatedness.
Affective meaning, i.e., emotion/sentiment
Dimensions: valence (pleasantness), arousal (intensity of emotion), dominance (degree of control)
Word as a coordinate in 3D space.
1950s: define a word by its distribution over contexts, i.e., its neighboring words or grammatical environments.
Ex: ongchoi (Cantonese). Given contexts:
(6.1) Ongchoi is delicious sauteed with garlic.
(6.2) Ongchoi is superb over rice.
(6.3) ...ongchoi leaves with salty sauces...
And given other contexts of the neighboring words.
(6.4) ...spinach sauteed with garlic over rice...
(6.5) ...chard stems and leaves are delicious...
(6.6) ...collard greens and other salty leafy greens
Conclusion: ongchoi is similar to spinach & collard greens.
Figure 6.1 A two-dimensional (t-SNE) projection of embeddings for some words and phrases, showing that words with similar meanings are nearby in space. The original 60-dimensional embeddings were trained for a sentiment analysis task. Simplified from Li et al. (2015).
Will cover two models: tf-idf and word2vec
A kind of co-occurrence matrix, which counts how often words co-occur based on being in the same context or document. Bag-of-words (BOW) representation of documents.
Dimensionality of the vectors is the number of words used (here 4), i.e., the vocabulary size.
Gigantic matrix for all words $\times$ all documents. Exploit sparsity.
Used in Information Retrieval field. Document search based on vector similarity to search term.
Perform a search by representing the query as a vector and computing a distance metric between it and the document vectors.
Use rows of matrix to compare vectors
More common to use word-word matrix, a.k.a term-term matrix or term-context matrix.
Numbers are counts of the times the two words appear together in the same context: the same document, or a sliding window within documents.
Note similarity of rows for information and digital, and of rows for apricot and pineapple.
Exercise: compute cosine similarity between Apricot, Digital, and Information.
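For the exercise above, here is a minimal sketch of cosine similarity in Python; the count vectors are made-up placeholders rather than the actual counts from the matrix above.

```python
import numpy as np

def cosine(u, v):
    # cos(u, v) = (u . v) / (|u| |v|)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical context-count vectors standing in for rows of the word-word matrix
apricot     = np.array([1.0, 0.0, 0.0, 2.0])
digital     = np.array([0.0, 3.0, 2.0, 0.0])
information = np.array([0.0, 2.0, 6.0, 1.0])

for name, vec in [("apricot", apricot), ("digital", digital)]:
    print(f"cosine({name}, information) = {cosine(vec, information):.3f}")
```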
Log to dampen dominance of common words
import numpy as np, matplotlib.pyplot as plt
plt.plot(np.linspace(1, 10, 100), 1 + np.log(np.linspace(1, 10, 100)));  # 1 + log(count): growth flattens, damping large counts
Document frequency $df(term)$ = # documents term appears in.
Collection frequency of term = # times term appears in collection.
Expect words that are concentrated in fewer documents to be more discriminative than words spread across many documents. Use inverse document frequency.
$$ idf(term) = \log_{10}\left(\dfrac{N}{df(term)}\right) $$
where $N$ = total number of documents in the collection.
tf-idf weighting is by far the dominant way of weighting co-occurrence matrices in information retrieval
$$ w(term, doc) = tf(term,doc) \times idf(term) $$
A tf-idf weighted term-document matrix for four words in four Shakespeare plays, using the counts in Fig. 6.2. Note that the idf weighting has eliminated the importance of the ubiquitous word good and vastly reduced the impact of the almost-ubiquitous word fool.
Compute the word co-occurrence matrix with counts weighted by tf-idf for each word in each document.
Represent words with vectors (rows or columns) from this matrix - tf-idf vectors.
Compare similarity between words using cosine metric of vectors.
Describe documents using average of tf-idf vectors for all contained words.
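As a concrete illustration of the steps above, a minimal sketch that builds a tf-idf weighted term-document matrix from raw counts and compares two word vectors by cosine; the count matrix is a made-up placeholder.

```python
import numpy as np

# Hypothetical term-document counts: rows = terms, columns = documents
counts = np.array([[10., 0., 3., 0.],
                   [ 0., 8., 0., 5.],
                   [ 2., 2., 2., 2.],
                   [ 7., 0., 0., 1.]])

# Term frequency: 1 + log10(count) for nonzero counts, else 0
tf = np.zeros_like(counts)
nonzero = counts > 0
tf[nonzero] = 1 + np.log10(counts[nonzero])

# Inverse document frequency: idf(term) = log10(N / df(term))
N = counts.shape[1]
df = (counts > 0).sum(axis=1)
idf = np.log10(N / df)

# tf-idf weighting: w(term, doc) = tf(term, doc) * idf(term)
tfidf = tf * idf[:, None]

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(tfidf[0], tfidf[3]))   # similarity of the first and last term's tf-idf vectors
```

Note how the third term, which appears in every document, gets idf = 0 and drops out entirely, mirroring the fate of good in the Shakespeare example.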
Alternative weighting to tf-idf
Pointwise mutual information (Fano, 1961) is one of the most important concepts in NLP. It is a measure of how often two events x and y occur, compared with what we would expect if they were independent:
$$ PMI(x,y) = \log_2\left(\dfrac{P(x,y)}{P(x)P(y)}\right) = \log_2\left(\dfrac{\text{observed co-occurrence}}{\text{co-occurrence if independent}}\right) $$
where $x$ = target word and $y$ = context word.
Positive PMI (PPMI) replaces all negative values with zero; negative values mean the words co-occur less often than chance, which cannot be estimated reliably without enormous corpora.
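A minimal sketch of (P)PMI weighting applied to a word-context count matrix; the counts are made-up placeholders.

```python
import numpy as np

# Hypothetical word-context co-occurrence counts: rows = target words, cols = context words
counts = np.array([[0.,  0.,  1.,  6.],
                   [0.,  0.,  1.,  4.],
                   [8.,  5.,  0.,  0.],
                   [6., 10.,  2.,  1.]])

total = counts.sum()
p_xy = counts / total                   # joint P(x, y)
p_x = p_xy.sum(axis=1, keepdims=True)   # marginal P(x), target words
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal P(y), context words

with np.errstate(divide="ignore"):      # log2(0) gives -inf; clipped to 0 below
    pmi = np.log2(p_xy / (p_x * p_y))

ppmi = np.maximum(pmi, 0)               # PPMI: negative associations are zeroed out
print(np.round(ppmi, 2))
```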
Use short dense vectors rather than long sparse vectors
"It turns out that dense vectors work better in every NLP task than sparse vectors."
Possible reasons why:
In training data, 'car' and 'dashboard' are used together in a sentence.
But in test data, 'dashboard' and 'automobile' are used together instead; with dense vectors the synonyms 'car' and 'automobile' can have similar representations, so the model generalizes.
Feature selection - choose a subset of important features pruning the rest
Feature extraction - form fewer, new features from the original inputs
Here, linear means new features are linear combinations of original features. E.g. mean of multiple features.
Exercise: write the SVD of your dataset and identify the combinations of original inputs
Linear Discriminant Analysis (LDA) - supervised variation on PCA
Factor Analysis - models observed features as linear combinations of a small number of latent factors plus noise
Multidimensional Scaling - finds low-dimensional coordinates for the samples that preserve their pairwise distances
Canonical correlation analysis - finds joint features that relate multiple datasets
Here, nonlinear means new features are not linear combinations of original features. E.g. 2nd & higher order statistics (variance, correlation, skew)
Exercise: describe how to compute nonlinear statistical functions of features in dataset
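One possible answer to the exercise, sketched with numpy/scipy: compute second- and third-order statistics of each feature column (the random data is only a placeholder).

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # placeholder dataset: 100 samples, 5 features

variances = X.var(axis=0)              # 2nd-order statistic of each feature
corr = np.corrcoef(X, rowvar=False)    # pairwise correlations between features
skewness = skew(X, axis=0)             # 3rd-order statistic of each feature

print(variances.shape, corr.shape, skewness.shape)
```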
Isometric feature mapping (Isomap)
Locally linear embedding
Laplacian eigenmaps
plot of variance explained as a function of the number of eigenvectors kept
Exercise: how do you compute "variance explained"?
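A sketch of one common definition (an assumption here, since the notes do not spell it out): the cumulative fraction of total variance captured by the largest eigenvalues of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                  # placeholder data matrix
Xc = X - X.mean(axis=0)                         # center each feature

eigvals = np.linalg.eigvalsh(Xc.T @ Xc)[::-1]   # eigenvalues, largest first

explained = np.cumsum(eigvals) / eigvals.sum()  # variance explained vs. eigenvectors kept
print(np.round(explained, 3))
```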
"Eigendigits" = eigenvectors of handwritten digit image dataset (with images as rows).
Exercise 2: what are the original and extracted features in the geometric figure?
$\mathbf X$ is $N\times d$ data matrix.
"Centered to have zero mean" (row means)
$d\times d$ covariance matrix = $\mathbf X^T \mathbf X$
Eigenvector decomposition $\mathbf X^T \mathbf X = \mathbf W \mathbf D \mathbf W^T$
Truncate (length-$d$) eigenvectors $\mathbf w_i$ for $k$ largest eigenvalues.
Get the coordinates of a data point in eigenvector space by taking the inner product of row $\mathbf x^{(j)}$ with each eigenvector $\mathbf w_i$. Note these are length-$d$ vectors.
Compute embedding coordinates in one shot: $\mathbf X \mathbf W = \mathbf F$ where columns of $\mathbf F$ are the embedded features.
Definition of our eigenvectors: $\mathbf X^T\mathbf X \mathbf W = \mathbf W\mathbf D$.
Multiply both sides by $\mathbf X$ to get $\mathbf X \mathbf X^T\mathbf X \mathbf W = \mathbf X \mathbf W\mathbf D$
Plug in definition of $\mathbf F$ to get $\mathbf X \mathbf X^T\mathbf F = \mathbf F \mathbf D$
We just proved that the embedding vectors (columns of $\mathbf F$) are eigenvectors of $\mathbf X \mathbf X^T$
Summary - two ways to compute embeddings of our data into eigenvector space:
Compute basis $\mathbf W$ as eigenvectors of $\mathbf X^T\mathbf X$, then project rows of $\mathbf X$ onto this basis (this is basically PCA)
Directly compute eigenvectors of $\mathbf X\mathbf X^T$.
Which way is more efficient?
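A numerical sanity check of the two routes on random placeholder data: compute $\mathbf W$ from $\mathbf X^T\mathbf X$, form $\mathbf F = \mathbf X\mathbf W$, and confirm the columns of $\mathbf F$ are eigenvectors of $\mathbf X\mathbf X^T$. As for efficiency, whichever Gram matrix is smaller ($d\times d$ vs. $N\times N$) is the cheaper one to decompose.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 50, 8, 3
X = rng.normal(size=(N, d))
X = X - X.mean(axis=0)                          # center the features

# Route 1: eigenvectors of the d x d matrix X^T X, then project the rows of X
evals, W = np.linalg.eigh(X.T @ X)
W, evals = W[:, ::-1][:, :k], evals[::-1][:k]   # keep the k largest eigenvalues/vectors
F = X @ W                                       # embedding coordinates (N x k)

# The derivation above says (X X^T) F = F D; check it numerically
print(np.allclose((X @ X.T) @ F, F * evals))    # True

# Route 2: eigenvectors of the N x N matrix X X^T give the same directions
# as the columns of F, up to sign and scaling.
```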
Reconsider $\mathbf X \mathbf X^T$ matrix, elements are inner products between rows of $\mathbf X$ -- similarity measure between samples.
So we can view $\mathbf X \mathbf X^T$ as a weighted adjacency matrix of a graph relating the samples to each other.
By taking a truncated set of its eigenvectors we are doing graph embedding!
SVD: $\mathbf X = \mathbf V \mathbf A \mathbf W^T$
Columns of $\mathbf V$ are eigenvectors of $\mathbf X \mathbf X^T$
Columns of $\mathbf W$ are eigenvectors of $\mathbf X^T \mathbf X$
Diagonal of $\mathbf A$ contains the singular values, the square roots of the eigenvalues of $\mathbf X \mathbf X^T$ (equivalently, of $\mathbf X^T \mathbf X$)
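A sketch checking these relations with np.linalg.svd on random placeholder data; numpy names the factors U, S, Vt, where U plays the role of $\mathbf V$, Vt.T the role of $\mathbf W$, and S holds the diagonal of $\mathbf A$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Squared singular values = eigenvalues of X^T X (and of X X^T)
eig = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
print(np.allclose(S**2, eig))                        # True

# Columns of U are eigenvectors of X X^T; columns of Vt.T are eigenvectors of X^T X
print(np.allclose((X @ X.T) @ U, U * S**2))          # True
print(np.allclose((X.T @ X) @ Vt.T, Vt.T * S**2))    # True
```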
Break a matrix into a product of two or more factors: $\mathbf X = \mathbf F \mathbf G$
Analogous to factoring a number, e.g. $28 = 7 \times 2 \times 2$
If $\mathbf X$ is $N \times d$, what are sizes of factors?
$\mathbf G$ as factors in terms of original features
$\mathbf F$ as samples transformed to factor combinations
$\mathbf X$ is a sample of $N$ documents each using a bag of words representation with $d$ words
each factor may be one topic or concept written using a certain subset of words
each document is a certain combination of such factors
$\mathbf F$ relates documents to factors
$\mathbf G$ relates factors to words
We have $N$ customers and we sell $d$ different products
$X_{ij}$ is the number of times customer $i$ bought product $j$
$\mathbf F$ relates customers to factors
$\mathbf G$ relates factors to products
The point of these factorizations: given a new customer or document, project it into the factor space and use its factor coordinates to compute similarities or recommendations, as sketched below.
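A minimal sketch of such a factorization using scikit-learn's NMF on a made-up customer-product count matrix, with $\mathbf F$ relating customers to factors and $\mathbf G$ relating factors to products; the data and the choice of two factors are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical counts: 5 customers (rows) x 4 products (columns)
X = np.array([[5., 3., 0., 0.],
              [4., 4., 0., 1.],
              [0., 0., 6., 5.],
              [1., 0., 5., 4.],
              [3., 2., 1., 0.]])

model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
F = model.fit_transform(X)     # N x k: customers as combinations of factors
G = model.components_          # k x d: factors in terms of products

print(np.round(F @ G, 1))      # approximate reconstruction of X

# A new customer's purchase vector can be projected into factor space
new_customer = np.array([[2., 1., 0., 0.]])
print(model.transform(new_customer))   # factor coordinates for the new customer
```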
Most machine learning methods are based on numerical methods which operate on vectors of numbers, e.g. deep learning.
But when text is converted to numbers via most common approaches, the numbers are not very meaningful.
Example of meaningful vectors:
Example of less-meaningful vectors:
Goal: convert text into vectors whose geometric locations are meaningful, so that similar text passages have similar vectors.
ELMo = Embeddings from Language Models "a deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They can be easily added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment and sentiment analysis."
One-hot encoding converts words (or $n$-grams) into orthogonal vectors $\mathbf e_k$ with a single "1" value and the rest zeros. Vectors for any two words are orthogonal. So to handle 50,000 words requires length-50,000 vectors.
Geometrically these are orthogonal vectors in 50,000-dimensional space. With every word vector equally-distant from every other.
Word embeddings try to squeeze these into fewer dimensions by putting "similar" words closer together, using real-valued vectors (rather than only binary).
Use machine learning method which assigns vectors by learning similarity
Two-step process:
Q: Isn't embedding basically just another name for dimensionality reduction done for the usual reasons (to reduce need for data)?
A: Yes, but some new methods have been devised under this new name.
Draw a dense linear single-layer network with no activation function, with $N$ inputs and $m$ outputs,
write the output for a one-hot encoded input vector $\mathbf e_k$ representing a single word.
write the output for a general input vector $\mathbf v$ representing multiple words in a document.
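A sketch of this exercise in numpy (the weight matrix is random and hypothetical): with an $N \times m$ weight matrix and no activation, a one-hot input returns the corresponding row of the matrix, i.e., an embedding lookup, while a bag-of-words input returns a count-weighted sum of rows.

```python
import numpy as np

vocab_size, embed_dim = 6, 3                   # N inputs, m outputs (small hypothetical sizes)
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, embed_dim))   # weights of the linear layer

# One-hot input e_k: the output is simply row k of W (an embedding lookup)
k = 2
e_k = np.zeros(vocab_size)
e_k[k] = 1.0
print(np.allclose(e_k @ W, W[k]))              # True

# General bag-of-words input v (word counts): a count-weighted sum of rows of W
v = np.array([0., 2., 1., 0., 0., 3.])
print(np.allclose(v @ W, 2 * W[1] + 1 * W[2] + 3 * W[5]))   # True
```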
Recall Matrix Factorization: break a matrix into a product of two or more factors, $\mathbf X = \mathbf F \mathbf G$
$\mathbf X$ is a sample of $N$ documents each using a bag of words representation with $d$ words
each factor may be one topic or concept written using a certain subset of words
each document is a certain combination of such factors
$\mathbf F$ relates documents to factors
$\mathbf G$ relates factors to words
Can download and use result: https://nlp.stanford.edu/projects/glove/
Listing most-similar words for given word based on cosine distance
GloVe:
given word: frog
most similar words: frogs, toad, litoria, leptodactylidae, rana, lizard, and eleutherodactylus
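A sketch of reproducing such a nearest-neighbor list with gensim's downloader and pre-trained GloVe vectors; the model name is an assumption, and any GloVe file from the link above could be loaded instead.

```python
import gensim.downloader as api

# Downloads 100-dimensional GloVe vectors on first use (assumed model name)
model = api.load("glove-wiki-gigaword-100")

# Most similar words to "frog" by cosine similarity
for word, score in model.most_similar("frog", topn=7):
    print(f"{word}: {score:.3f}")

print(model.similarity("frog", "toad"))   # cosine similarity between two words
```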
Clustering words based on cosine distance - hierarchical representation
Project embedding space to 2D
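A sketch of such a 2D projection with scikit-learn's t-SNE; the word list and GloVe model name are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
import gensim.downloader as api
from sklearn.manifold import TSNE

model = api.load("glove-wiki-gigaword-100")       # assumed model name, as above
words = ["frog", "toad", "lizard", "dog", "cat", "car", "truck", "bus",
         "apple", "banana", "grape", "king", "queen", "man", "woman"]
vecs = np.array([model[w] for w in words])

# Project the embedding vectors down to 2D for plotting
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(vecs)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y))
plt.show()
```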
context window size
First order co-occurence (syntagmatic association) - words typically near each other. $wrote$ and $book$
Second-order co-occurrence (paradigmatic association) - have similar neighbors $wrote$ and $said$
Relational meanings (analogous meanings) via word differences: $king-man+woman = queen$
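A sketch of the analogy computation with gensim (same assumed GloVe model as above); most_similar with positive and negative word lists implements the $king - man + woman$ arithmetic.

```python
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-100")   # assumed model name

# king - man + woman ~= queen
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```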
Compare embedded distance to human-estimated similarity
Metric to compare language models
$$ PP(W) = \left( \prod_{i=1}^N \frac{1}{P(w_{i}|w_{i−1})} \right)^\frac{1}{N} $$
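A sketch computing this perplexity from a list of bigram conditional probabilities (made-up placeholders); log probabilities are summed to avoid numerical underflow.

```python
import math

def perplexity(cond_probs):
    """Perplexity of a sequence given P(w_i | w_{i-1}) for each position i."""
    N = len(cond_probs)
    # PP = (prod 1/p_i)^(1/N) = exp(-(1/N) * sum(log p_i))
    return math.exp(-sum(math.log(p) for p in cond_probs) / N)

# Hypothetical conditional probabilities for a 5-word sequence
print(perplexity([0.1, 0.2, 0.05, 0.3, 0.15]))
```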