BDS 761: Data Science and Machine Learning I
Topic 8: Embeddings
This topic:¶
- Intro to embedding
- Vector Semantics
- Dimensionality Reduction
- Manifold embedding
- t-SNE (t-distributed Stochastic Neighbor Embedding)
- Word Embeddings
Reading:¶
- Any machine learning text on Dimensionality Reduction.
- J&M Chapter 6. (Vector Semantics)
- https://scikit-learn.org/1.5/modules/manifold.html
0. Embedding intro¶
Data embedding¶
Recall these 2D data correlation plots.
Sometimes the points can be seen as lying approximately on a lower-dimensional manifold.
This is embedded in the higher dimensional space of the plot.
What are the dimensions of these manifolds? How would we represent a point on them?
Principal Component Analysis¶
Using eigendecomposition to identify embedding spaces, dimensions
- remove means
- compute covariance matrix
- compute eigenvalues
https://medium.com/intuition/mathematical-understanding-of-principal-component-analysis-6c761004c2f8
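A minimal numpy sketch of these three steps on toy data (the data matrix and the number of components kept are arbitrary choices here):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.1],
                                          [0.0, 1.0, 0.2],
                                          [0.0, 0.0, 0.1]])   # toy correlated data

Xc = X - X.mean(axis=0)                  # 1. remove means
C = Xc.T @ Xc / (len(Xc) - 1)            # 2. compute covariance matrix
evals, evecs = np.linalg.eigh(C)         # 3. compute eigenvalues (ascending) and eigenvectors
order = np.argsort(evals)[::-1]          # re-sort by decreasing variance
evals, evecs = evals[order], evecs[:, order]

k = 2                                    # keep the top-k principal directions
Z = Xc @ evecs[:, :k]                    # coordinates of each point in the embedding space
print(evals.round(3), Z.shape)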
Spectral Graph theory¶
Key here is instead of a data matrix of vectors we have an adjacency (or related) matrix of similarities
Consider how the covariance matrix from PCA contains both weak correlations and strong ones.
With spectral embeddings, we basically truncate the small ones using a similarity score or binary threshold, allowing long-range nonlinear variation while maintaining locally linear relationships.
Spectral Graph Drawing (a.k.a. Embedding)¶
The drawing of a graph is a function $\boldsymbol\rho(\cdot)$ which assigns a point in space $\boldsymbol\rho(v_i)$ to each node $v_i$
The matrix $\mathbf R$ of a graph drawing is an $m\times n$ matrix whose $i$th row is $\boldsymbol\rho(v_i)$
Recall energy of drawing:
View edges as springs and our goal is to make a graph with minimum energy stored in the springs
Weight of edge is strength of spring
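As a reminder, one common way to write this energy (with $w_{ij}$ the edge weight, i.e., spring strength) is
$$\mathcal E(\mathbf R) = \sum_{\{v_i, v_j\}\in E} w_{ij}\,\Vert \boldsymbol\rho(v_i) - \boldsymbol\rho(v_j)\Vert^2$$
so minimizing the energy pulls strongly connected nodes close together in the drawing.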
Spectral clustering¶
Combine the nonlinear manifold embedding with clustering
View this as clustering using distances within the manifold, rather than Euclidean distances
Consider manual clustering of 2D points as example
ELI5: Embedding¶
An embedding takes a longer vector (i.e., in high dimensions) and assigns to it a shorter vector (i.e. in a lower dimensional space), in a way that hopefully preserves the important variation, while discarding unimportant, redundant, or mostly-constant dimensions
$$\begin{pmatrix}v_1 \\ v_2 \\ v_3 \\ v_4 \\ v_5 \\ v_6 \\ \vdots \\ v_n \end{pmatrix} \rightarrow \begin{pmatrix}w_1\\ w_2 \\ \vdots \\ w_m \end{pmatrix} $$
There are many different approaches, ranging from very simple to very involved.
Dimensionality Reduction - an older, broader field of methods
How can this be accomplished with a simple matrix-vector multiplication?
What if we want to embed things that aren't vectors? (categorical or text)
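For the matrix-vector question above, one simple sketch of an answer: choose an $m\times n$ matrix $\mathbf A$ with $m < n$ and set
$$\mathbf w = \mathbf A\,\mathbf v$$
so each new coordinate $w_i$ is a linear combination (e.g. a weighted average) of the original coordinates; PCA is one principled way to choose the rows of $\mathbf A$.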
I. Vector Semantics¶
Motivation¶
The representation of text (words, sentences, documents, etc.) with vectors ~ a.k.a. text embedding
This allows geometric analysis of these vectors (e.g. words may be close, far away, etc., as computed via distance metrics on the vectors).
Distributional hypothesis - Words that occur in similar contexts tend to have similar meanings... so they should be close together in a geometric sense.
Synonyms (like oculist and eye-doctor) tended to occur in the same environment (e.g., near words like eye or examined) with the amount of meaning difference between two words “corresponding roughly to the amount of difference in their environments” (Harris, 1954, 157).
Vector semantics¶
- a model which instantiates this linguistic hypothesis by learning representations of the meaning of words directly from their distributions in texts.
- example of representation learning
Lexical Semantics - The linguistic study of word meanings
Lemma¶
word at top of definition, main representative form chosen. "Citation form".
mouse (N)
1. any of numerous small rodents...
2. a hand-operated device that controls a cursor...
Wordforms - other forms of lemma, such as "mice".
Word sense - the meaning(s) of the lemma
Synonym - words with same word sense. couch/sofa.
Antonyms - words with an opposite meaning. long/short.
Word Similarity¶
Some degree of similarity exists for many words with different meanings. Cat vs. Dog. See vs. Hear.
SimLex-999 dataset (Hill et al., 2015) gives values on a scale from 0 to 10, like the examples below, which range from near-synonyms (vanish, disappear) to pairs that scarcely seem to have anything in common (hole, agreement). Judged by humans.
Word relatedness, a.k.a. association¶
E.g., Coffee vs. cup
Multiple kinds of relatedness.
- Semantic field - topic relatedness. scalpel/surgeon/nurse.
- Semantic frame - different perspectives of same thing. buy/sell/pay/customer
- Taxonomic - subclass. animal/dog.
Connotation¶
Affective meaning, i.e., emotion/sentiment
Dimensions:
- valence: the pleasantness of the stimulus - happy vs. unhappy
- arousal: the intensity of emotion provoked by the stimulus - excited vs. calm
- dominance: the degree of control exerted by the stimulus
Word as a coordinate in 3D space.
Historical Vector Semantics¶
1950's: define word as distribution over contexts, the neighboring words or grammatical environments.
Ex: ongchoi (Cantonese). Given contexts:
(6.1) Ongchoi is delicious sauteed with garlic.
(6.2) Ongchoi is superb over rice.
(6.3) ...ongchoi leaves with salty sauces...
And given other contexts of the neighboring words.
(6.4) ...spinach sauteed with garlic over rice...
(6.5) ...chard stems and leaves are delicious...
(6.6) ...collard greens and other salty leafy greens
Conclusion: ongchoi is similar to spinach & collard greens.
Figure 6.1 A two-dimensional (t-SNE) projection of embeddings for some words and phrases, showing that words with similar meanings are nearby in space. The original 60-dimensional embeddings were trained for a sentiment analysis task. Simplified from Li et al. (2015).
Vector semantic models¶
- Embeddings of word in a particular vector space
- Better handle OOV words, only need similar words in training set
- Can be learned automatically from text without any complex labeling or supervision.
- Now the standard way to represent the meaning of words in NLP
Will cover two models: tf-idf and word2vec
Term-document matrix¶
A kind of co-occurrence matrix, which counts how often words co-occur based on being in the same context or document. BOW representation of documents.
Dimensionality of the document vectors equals the vocabulary size used (here, 4 words).
Gigantic matrix for all words $\times$ all documents. Exploit sparsity.
Document vectors¶
Used in Information Retrieval field. Document search based on vector similarity to search term.
Perform search by putting search query into vector and computing a distance metric with document vectors.
Word vectors¶
Use rows of matrix to compare vectors
More common to use word-word matrix, a.k.a term-term matrix or term-context matrix.
Numbers are counts of times the two words appear together in same context: same document, or a sliding window within documents.
Note similarity of rows for information and digital, and of rows for apricot and pineapple.
Cosine similarity metric¶
$$ S(\mathbf v, \mathbf w) = \dfrac{\mathbf v^T \mathbf w}{\Vert\mathbf v\Vert \Vert\mathbf w\Vert} = \cos\theta $$
Exercise: compute cosine similarity between Apricot, Digital, and Information.
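A small sketch of this computation (the count vectors below are made-up stand-ins, not the actual counts from the word-word matrix figure):

import numpy as np

def cosine(v, w):
    return v @ w / (np.linalg.norm(v) * np.linalg.norm(w))

# hypothetical co-occurrence counts over four context words
apricot     = np.array([2.0, 0.0, 0.0, 1.0])
digital     = np.array([0.0, 1.0, 2.0, 1.0])
information = np.array([0.0, 6.0, 1.0, 4.0])

print(cosine(apricot, information), cosine(digital, information))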
Term Frequency¶
- TF: Frequent words more important than rare words. term frequency.
$$ tf(term,doc) = \begin{cases} 1+\log_{10} \text{Count}(term, doc), \text{ if } \text{Count}(term, doc)>0 \\ 0, \text{ else} \end{cases} $$
Log to dampen dominance of common words
import numpy as np, matplotlib.pyplot as plt
plt.plot(np.linspace(1, 10, 100), 1 + np.log10(np.linspace(1, 10, 100)));  # 1 + log10(count), matching the tf formula above
Inverse Document Frequency¶
- Very frequent words very unimportant (a, and, the): inverse document frequency.
Document frequency $df(term)$ = # documents term appears in.
Collection frequency of term = # times term appears in collection.
Expect words that are concentrated in fewer documents to be more discriminatory. Use inverse document frequency.
$$ idf(term) = \log_{10}\left(\dfrac{N}{df(term)}\right) $$
$N$ = total number of documents in collection
Words that appear in every document have low idf¶
tf-idf¶
tf-idf weighting is by far the dominant way of weighting co-occurrence matrices in information retrieval
$$ w(term, doc) = tf(term,doc) \times idf(term) $$
A tf-idf weighted term-document matrix for four words in four Shakespeare plays, using the counts in Fig. 6.2. Note that the idf weighting has eliminated the importance of the ubiquitous word good and vastly reduced the impact of the almost-ubiquitous word fool.
tf-idf Model¶
Compute the word co-occurrence matrix with counts weighted by tf-idf for each word in each document.
Represent words with vectors (rows or columns) from this matrix - tf-idf vectors.
Compare similarity between words using cosine metric of vectors.
Describe documents using average of tf-idf vectors for all contained words.
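A minimal sketch of these steps using the formulas above (the term-document counts here are a made-up toy matrix, not the Shakespeare counts):

import numpy as np

counts = np.array([[10,  0,  3,  0],      # rows = terms, columns = documents (toy values)
                   [ 0,  8,  0,  2],
                   [ 5,  5,  5,  5]])     # a term that appears in every document
N = counts.shape[1]                       # number of documents

tf = np.zeros(counts.shape)
tf[counts > 0] = 1 + np.log10(counts[counts > 0])   # tf(term, doc)

df = (counts > 0).sum(axis=1)             # document frequency of each term
idf = np.log10(N / df)                    # idf(term)
tfidf = tf * idf[:, None]                 # w(term, doc) = tf x idf

print(tfidf.round(3))                     # the appears-everywhere term gets weight 0, like "good" above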
Dense Embeddings¶
Use short dense vectors rather than long sparse vectors
"It turns out that dense vectors work better in every NLP task than sparse vectors."
Possible reasons why:
- Easier to learn fewer parameters
- Less prone to overfitting
- May do a better job of capturing synonymy than sparse vectors
Example¶
In training data, 'car' and 'dashboard' are used together in sentence.
But in test data, 'dashboard' and 'automobile' are used together instead.
Recap¶
We considered a couple ways to represent words as numerical vectors which are meaningful in a sense
Words with similar semantics are hopefully closer in terms of distance measured between their vectors
The key idea is that a word's context (other words nearby, within same sentence or same documents) identifies its meaning.
Classical NLP techniques like tf-idf try to fix issues such as dealing with rare words.
II. Dimensionality Reduction¶
Dimensionality Reduction - Motivation¶
- Reducing problem complexity. Complexity of any classifier or regressor depends on the number of inputs.
- time and space complexity
- necessary number of training examples to train such a classifier or regressor.
- Curse of dimensionality
- Visualization - 2D retina, 3D mental capacity, $n$-dimensional data
Feature selection - choose a subset of important features pruning the rest
Feature extraction - form fewer, new features from the original inputs
Main methods for dim reduction¶
- Feature selection - given $d$ inputs, choose $k<d$ inputs to keep, discard the rest.
- Feature extraction - compute $k<d$ new inputs using combinations of original inputs. E.g. replace multiple inputs with their average.
Feature extraction: Linear Methods¶
Here, linear means new features are linear combinations of original features. E.g. mean of multiple features.
Principal Component Analysis (PCA) - use SVD & keep singular vectors for largest singular values
Linear Discriminant Analysis (LDA) - supervised variation on PCA
Factor Analysis - models observed features as linear combinations of a small number of latent factors plus noise
Multidimensional Scaling - finds low-dimensional coordinates that preserve pairwise distances between samples
Canonical correlation analysis - finds joint features that relate multiple datasets
Exercise:
write the SVD of your dataset and identify the combinations of original inputs
create low-rank approximations to your data covariance matrix and visualize how similar they are.
Exercise: use the PCA function in sklearn for a toy dataset.
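A possible sketch for this exercise (toy data; the rank-2 structure is an arbitrary choice):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 10))   # toy data lying near a 2D subspace of 10D

pca = PCA(n_components=3).fit(X)
Z = pca.transform(X)                      # embedded coordinates, one row per sample
print(pca.explained_variance_ratio_)      # nearly all variance lands in the first two components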
Scree graph¶
plot of variance explained as a function of the number of eigenvectors kept
Exercise: how do you compute "variance explained"?
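One common convention, for the exercise above: the variance explained by component $i$ is its eigenvalue as a fraction of the total,
$$\text{variance explained}_i = \frac{\lambda_i}{\sum_{j=1}^{d}\lambda_j}$$
and the scree graph plots these (or their cumulative sum) against the number of components kept; sklearn's explained_variance_ratio_ in the sketch above reports exactly these fractions.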
"Eigendigits" = eigenvectors of handwritten digit image dataset (with images as rows).
Exercise 2: what are the original and extracted features in the geometric figure?
Feature Embedding¶
$\mathbf X$ is $N\times d$ data matrix.
"Centered to have zero mean" (row means)
$d\times d$ covariance matrix = $\mathbf X^T \mathbf X$
Eigenvector decomposition $\mathbf X^T \mathbf X = \mathbf W \mathbf D \mathbf W^T$
Truncate (length-$d$) eigenvectors $\mathbf w_i$ for $k$ largest eigenvalues.
Get the coordinates of a data point in eigenvector space by taking inner products of its row $\mathbf x^{(j)}$ with the eigenvectors $\mathbf w_i$.
Compute the embedding coordinates in one shot: $\mathbf X \mathbf W = \mathbf F$, where the columns of $\mathbf F$ (length-$N$ vectors) are the embedded features.
Definition of our eigenvectors: $\mathbf X^T\mathbf X \mathbf W = \mathbf W\mathbf D$.
Multiply both sides by $\mathbf X$ to get $\mathbf X \mathbf X^T\mathbf X \mathbf W = \mathbf X \mathbf W\mathbf D$
Plug in definition of $\mathbf F$ to get $\mathbf X \mathbf X^T\mathbf F = \mathbf F \mathbf D$
We just proved that the embedding vectors (columns of $\mathbf F$) are eigenvectors of $\mathbf X \mathbf X^T$
Summary - two ways to compute embeddings of our data into eigenvector space:
Compute basis $\mathbf W$ as eigenvectors of $\mathbf X^T\mathbf X$, then project rows of $\mathbf X$ onto this basis (this is basically PCA)
Directly compute eigenvectors of $\mathbf X\mathbf X^T$.
Which way is more efficient?
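A quick numeric check of the equivalence (and a hint for the efficiency question: you only need to eigendecompose the smaller of the two matrices, $d\times d$ versus $N\times N$); the toy sizes are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                        # centered data matrix, N=100, d=5

# Route 1: eigenvectors of the d x d matrix X^T X, then project the rows
evals, W = np.linalg.eigh(X.T @ X)
W = W[:, np.argsort(evals)[::-1]]
F = X @ W                                     # columns are the embedded features

# Route 2: eigenvectors of the N x N matrix X X^T directly
evals2, U = np.linalg.eigh(X @ X.T)
U = U[:, np.argsort(evals2)[::-1][:5]]        # keep the 5 eigenvectors with nonzero eigenvalues

# Columns of F are eigenvectors of X X^T with norm sqrt(eigenvalue);
# after normalizing and matching signs they agree with U.
F_unit = F / np.linalg.norm(F, axis=0)
signs = np.sign(np.sum(F_unit * U, axis=0))
print(np.allclose(F_unit, U * signs))         # True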
"Embedding"?¶
Reconsider $\mathbf X \mathbf X^T$ matrix, elements are inner products between rows of $\mathbf X$ -- similarity measure between samples.
So we can view $\mathbf X \mathbf X^T$ as a weighted adjacency matrix of a graph relating the samples to each other.
By taking a truncated set of its eigenvectors we are doing graph embedding!
Singular Value Decomposition and Matrix Factorization¶
$$\mathbf X = \mathbf V \mathbf A \mathbf W^T$$
Columns of $\mathbf V$ are eigenvectors of $\mathbf X \mathbf X^T$
Columns of $\mathbf W$ are eigenvectors of $\mathbf X^T \mathbf X$
Diagonal of $\mathbf A$ contains the singular values, the square roots of the eigenvalues of $\mathbf X \mathbf X^T$ (equivalently, of $\mathbf X^T \mathbf X$)
Topic Modeling¶
- Determine small set of topics for corpus of documents, e.g. politics, art, sports
- Unsupervised learning problem - factor word-document matrix - SVD, NNMF, Latent Dirichlet Allocation
Matrix Factorization¶
Break matrix into two or more $\mathbf X = \mathbf F \mathbf G$
Analogous to factoring a number, e.g. $28 = 7 \times 2 \times 2$
If $\mathbf X$ is $N \times d$, what are sizes of factors?
$\mathbf G$ as factors in terms of original features
$\mathbf F$ as samples transformed to factor combinations
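A sketch of one such factorization using sklearn's nonnegative matrix factorization on a toy count matrix; with $\mathbf X$ of size $N\times d$ and $k$ factors, $\mathbf F$ is $N\times k$ and $\mathbf G$ is $k\times d$:

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(6, 8)).astype(float)   # toy N x d nonnegative matrix (e.g. docs x words)

k = 3
model = NMF(n_components=k, init='random', random_state=0, max_iter=1000)
F = model.fit_transform(X)        # N x k: samples expressed as combinations of factors
G = model.components_             # k x d: factors expressed in terms of the original features

print(F.shape, G.shape)
print(np.abs(X - F @ G).max())    # X is approximately F @ G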
Latent Semantic Indexing¶
$\mathbf X$ is a sample of $N$ documents each using a bag of words representation with $d$ words
each factor may be one topic or concept written using a certain subset of words
each document is a certain combination of such factors
$\mathbf G$ relates documents to factors
$\mathbf F$ relates words to factors
Application: Recommender System¶
We have $N$ customers and we sell $d$ different products
$X_{ij}$ is number of times customer $i$ bought product $j$
$\mathbf G$ relates customers to factors
$\mathbf F$ relates products to factors
The point of these factorizations: given a new customer or document, project it into the factor space and compute similarities or recommendations there.
V. Word Embedding¶
Recap: networks, embeddings, and self-supervised models¶
Word embedding - Motivation¶
Most machine learning methods (e.g. deep learning) are based on numerical methods which operate on vectors of numbers.
But when text is converted to numbers via most common approaches, the numbers are not very meaningful.
Example of meaningful vectors:
- array of light levels in image
- list of concentration levels of chemicals
Example of less-meaningful vectors:
- list of base-pairs in DNA converted to 0,1,2,3
- list of ascii code of letters in document
Goal: Convert text into vectors whose geometric locations are meaningful, so that similar text passages have similar vectors.
Word embedding - Approach¶
Use machine learning method which assigns vectors by learning similarity
Two-step process:
- Convert text to numbers via simple approach
- Dimensionality reduction
Word embeddings¶
One-hot encoding converts words (or $n$-grams) into orthogonal vectors $\mathbf e_k$ with a single "1" value and the rest zeros. Vectors for any two words are orthogonal. So to handle 50,000 words requires length-50,000 vectors.
Geometrically these are orthogonal vectors in 50,000-dimensional space. With every word vector equally-distant from every other.
Word embeddings try to squeeze these into a lower number of dimensions by putting "similar" words closer together, using real-valued entries (rather than only binary ones).
Q: Isn't embedding basically just another name for dimensionality reduction done for the usual reasons (to reduce need for data)?
A: Yes. But there's some new methods devised under this new name.
Embedding methods¶
- Principal component analysis & related methods applied to BOW data ~ Latent semantic indexing
- GloVe - Dimensionality reduction on a matrix of co-occurrence statistics.
- Shallow neural network layer - Embedding layer
- Neural embedding: word2vec, doc2vec, x2vec
Exercise - Embedding Layer¶
Draw a dense linear single-layer network with no activation function, with $N$ inputs and $m$ outputs,
write the output for a one-hot encoded input vector $\mathbf e_k$ representing a single word.
write the output for a general input vector $\mathbf v$ representing multiple words in a document.
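One way to write the answers, assuming the layer's weights form an $N\times m$ matrix $\mathbf W$ (no bias):
$$\mathbf y = \mathbf W^{\mathsf T}\mathbf e_k = \mathbf W_{k,:} \qquad \text{(the $k$th row of $\mathbf W$: that word's embedding)}$$
$$\mathbf y = \mathbf W^{\mathsf T}\mathbf v = \sum_k v_k\, \mathbf W_{k,:} \qquad \text{(a weighted sum of the embeddings of the words in the document)}$$
So the rows of the weight matrix are the learned word embeddings, and a bag-of-words input simply sums (or averages) them.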
Latent Semantic Indexing¶
Recall Matrix Factorization: Break matrix into two or more $\mathbf X = \mathbf F \mathbf G$
$\mathbf X$ is a sample of $N$ documents each using a bag of words representation with $d$ words
each factor may be one topic or concept written using a certain subset of words
each document is a certain combination of such factors
$\mathbf G$ relates documents to factors
$\mathbf F$ relates words to factors
GloVe - Global Vectors for Word Representation¶
Can download and use result: https://nlp.stanford.edu/projects/glove/
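For example, one common way to load and query downloaded vectors is through gensim's downloader; the exact model name below is an assumption about what the gensim-data repository provides:

import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")        # pretrained GloVe vectors (a sizable download)
print(glove.most_similar("frog", topn=5))          # nearest words by cosine similarity
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))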
Word2Vec¶
Continuous bag of words - predict $k$th word using neighbors in a window
Skip-gram - predict neighboring words for $k$th word
The weights in the neural layer for prediction give the embedding vectors.
Can download and use result: https://code.google.com/archive/p/word2vec
word2vec general idea: model the probability that a target word occurs together with a context word.¶
“Is word w likely to show up near apricot?”
Use running text as implicitly supervised training data
- no need for hand-labeling
- Bengio 2003 and Collobert 2011 used for neural language model
word2vec = Simplified case of neural language model¶
- Binary classifier - learn to identify neighboring word versus randomly-chosen word ("noise word").
- Logistic regression - then use regression weights as embeddings.
- fast
- efficient to train
- available online with code and pretrained embeddings
word2vec contains two algorithms:¶
- Continuous Skip-gram - predict neighboring words of given word
- Continuous bag-of-words (CBOW) - predict target word given neighbors
Continuous = dense embedding vector rather than sparse binary.
Negative Sampling = additionally minimize the probability assigned to randomly chosen "noise" words from the lexicon. Hence "Skip-gram with Negative Sampling (SGNS)"
The classifier¶
Given tuple of words $(t,c)$, e.g. $(apricot,jam)$:
- $t$ = target word
- $c$ = candidate context word
Predict $P(+|t,c)$ probability $c$ is a context word.
Probability $c$ is not a context word: $P(-|t,c) = 1- P(+|t,c)$
Classifier model - basically just a similarity metric, made into probability by sigmoid function.
\begin{align} P(+|t,c) &= \sigma(\text{"similarity"}) \\ &= \frac{1}{1+e^{-\mathbf t \cdot \mathbf c}} \end{align}
where $\mathbf t$ and $\mathbf c$ are the dense vectors representing the words -- these are the parameters of the model which we will fit using data.
\begin{align} P(-|t,c) &= 1-P(+|t,c) \\ &= \frac{e^{-\mathbf t \cdot \mathbf c}}{1+e^{-\mathbf t \cdot \mathbf c}} \end{align}
"skip-gram"¶
A variation on a bigram which combines target word and context word (not necessarily neighbor), hence it "skips" other context words.
Context words assumed independent of each other, so for a target word and list of context words,
\begin{align} P(+|t,(c_1, c_2,\ldots)) &= \prod_i \frac{1}{1+e^{-\mathbf t \cdot \mathbf c_i}} \end{align}
Bigrams:
- $t$ and immediately preceding word
- $t$ and immediately following word
Skip-grams:
- $t$ and word two words away
- $t$ and word three words away
- ...
Noise word choice¶
Sampled from lexicon according to weighted unigram frequency $P_\alpha(w)$
$$P_\alpha(w) = \frac{[C(w)]^\alpha}{\sum_{w'} [C(w')]^\alpha }$$
$\alpha = 0$: ignore unigram probability altogether. Use rare words and common words equally often.
$\alpha<1$: dampen high probabilities (like for $w=the$). Use common words more, but not as much more as their frequency in lexicon.
Common choice: $\alpha = 0.75$
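A quick numerical illustration with hypothetical counts for a very common word versus a rare one:

import numpy as np

counts = np.array([1000, 10])                 # e.g. "the" vs. a rare word
for alpha in (1.0, 0.75):
    p = counts**alpha / np.sum(counts**alpha)
    print(alpha, p.round(3))
# alpha = 1.0  -> [0.99  0.01 ]
# alpha = 0.75 -> [0.969 0.031]   (the rare word is sampled about 3x more often)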
Optimization of model¶
Start by making training set $D$ of skip grams $(t_i, c_i)$ with target words and context words.
Note that if we simply try to maximize $\prod_i P(+|t_i,c_i) = \prod_i \dfrac{1}{1+e^{-\mathbf t_i \cdot \mathbf c_i}}$ by choosing $\mathbf t_i$ and $\mathbf c_i$, we can get a trivial optimum by making all the vectors identical ($\mathbf t_i = \mathbf t_j = \mathbf c_i = \mathbf c_j$) and arbitrarily long, so every dot product is large.
Hence we also need a term in our optimization objective to force the vectors apart and counter this useless trivial solution. This is the negative samples.
So we augment the training set with a negative set $D'$ with fake skip grams $(t_j,c'_j)$ using noise words for the $c'_j$.
Then try to maximize $\prod_i P(+|t_i,c_i) \prod_j P(-|t_j,c'_j)$.
In other words, we want a model which assigns high probability $P(+|\cdot)$ to all the skip grams in the set $D$, and high probability $P(-|\cdot)$ to all the negative samples in the set $D'$.
Taking the log of the objective we get
\begin{align} &\max_{\mathbf t_k,\mathbf c_k} \sum_{(t_i,c_i)\in D} \log \frac{1}{1+e^{-\mathbf t_i \cdot \mathbf c_i}} + \sum_{(t_j,c'_j)\in D'} \log \frac{e^{-\mathbf t_j \cdot \mathbf c'_j}}{1+e^{-\mathbf t_j \cdot \mathbf c'_j}} \end{align}
Note that the parameters are two sets of vectors, the $\mathbf t_k$ and $\mathbf c_k$ which we can form into matrices $\mathbf T$ and $\mathbf C$.
The result is two different embedding vectors for each word, one when it is target and one when it is context. We can just choose one or combine them (e.g. take average).
Hyperparameters
- $k$ - the ratio of negative samples to context samples
- $\alpha$ - the weighting for negative sampling
- $L$ - the context window size
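Putting the pieces together, a minimal and deliberately naive numpy sketch of SGNS trained with stochastic gradient ascent; the corpus, window size, learning rate, and number of epochs are all toy choices:

import numpy as np

rng = np.random.default_rng(0)

corpus = "the quick brown fox jumps over the lazy dog".split()   # toy corpus
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, dim, L, k, lr = len(vocab), 8, 2, 2, 0.05   # vocab size, embedding dim, window, negatives, step size

T = 0.1 * rng.normal(size=(V, dim))            # target-word embeddings
C = 0.1 * rng.normal(size=(V, dim))            # context-word embeddings

word_counts = np.bincount([idx[w] for w in corpus], minlength=V)
p_neg = word_counts ** 0.75
p_neg = p_neg / p_neg.sum()                    # alpha-weighted unigram noise distribution

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(200):
    for i, w in enumerate(corpus):
        t = idx[w]
        for j in range(max(0, i - L), min(len(corpus), i + L + 1)):
            if j == i:
                continue
            c = idx[corpus[j]]
            # ascend the gradient of log P(+|t,c) = log sigmoid(t . c)
            t_old = T[t].copy()
            g = 1.0 - sigmoid(T[t] @ C[c])
            T[t] += lr * g * C[c]
            C[c] += lr * g * t_old
            # ascend the gradient of log P(-|t,c') for k sampled noise words c'
            for c_neg in rng.choice(V, size=k, p=p_neg):
                t_old = T[t].copy()
                g = -sigmoid(T[t] @ C[c_neg])
                T[t] += lr * g * C[c_neg]
                C[c_neg] += lr * g * t_old

emb = T + C   # one common choice: combine each word's target and context vectors

Real implementations (e.g. gensim's Word2Vec) add subsampling of frequent words, vectorized updates, and learning-rate decay; this sketch only shows the shape of the objective.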
"Visualizing Embeddings"¶
Listing most-similar words for given word based on cosine distance
GloVe:
given word: frog
most similar words: frogs, toad, litoria, leptodactylidae, rana, lizard, and eleutherodactylus
Clustering words based on cosine distance - hierarchical representation
Project embedding space to 2D
Semantic Properties of Embeddings¶
context window size
- small window - semantically-similar words, same parts of speech
- large window - topically related words, not similar
First-order co-occurrence (syntagmatic association) - words typically near each other. $wrote$ and $book$
Second-order co-occurrence (paradigmatic association) - words that have similar neighbors. $wrote$ and $said$
Relational meanings (analogous meanings) via word differences: $king-man+woman = queen$
- gender vector
- comparative & superlative morphology
- temporal changes
- bias: $doctor-man+woman = nurse$
Evaluating embeddings¶
Compare embedded distance to human-estimated similarity
- WordSim-353 (Finkelstein et al., 2002) is a commonly used set of ratings from 0 to 10 for 353 noun pairs;
- SimLex-999 (Hill et al., 2015) quantifies similarity (cup, mug) rather than relatedness (cup, coffee), and including both concrete and abstract adjective, noun and verb pairs.
- TOEFL dataset (Landauer and Dumais, 1997) - 80 questions consisting of a target word with 4 additional word choices for closest synonym.
- Stanford Contextual Word Similarity (SCWS) dataset (Huang et al., 2012) - human judgments on 2,003 pairs of words in their sentential context, including nouns, verbs, and adjectives.
- semantic textual similarity task (Agirre et al. 2012, Agirre et al. 2015) evaluates the performance of sentence-level similarity algorithms, consisting of a set of pairs of sentences, each pair with human-labeled similarity scores
- Analogy task datasets (Mikolov et al. 2013, Mikolov et al. 2013b) a is to b as c is to d