$\Omega$ = set of all possible outcomes of an experiment
Examples:
An event is a subset of the sample space. That is, any collection of outcomes forms an event.
Example:
Toss a coin twice. Sample space: $\Omega = \{HH, \ HT, \ TH, \ TT\}$
Let event $A$ be the event that there is exactly one head
We write: $A = \text{“exactly one head”}$
Then $A = \{HT, \ TH \}$
$A$ is a subset of $\Omega$, and we write $A \subset \Omega$
Theorems:
Notation note: $P(A,B) = P(A \text{ AND } B) = P(A \cap B)$
This sort of diagram representing events in a sample space is called a Venn diagram.
a) $A \cup B$ (e.g., a 6 on either $\text{die}_1$ or $\text{die}_2$, or both)
b) $A \cap B$ (e.g., a 6 on both $\text{die}_1$ and $\text{die}_2$)
c) $B \cap A^c$ (e.g., a 6 on $\text{die}_2$ but not on $\text{die}_1$)
If $B_1, B_2, \dots, B_k$ form a partition of $\Omega$, then $(A \cap B_1), (A \cap B_2), \dots, (A \cap B_k)$ form a partition of the set or event A.
The probability of event A is therefore the sum of its parts:
$$P(A) = P(A \cap B_1) + P(A \cap B_2) + P(A \cap B_3) + P(A \cap B_4)$$

If experiment $A$ has $n$ possible outcomes, and experiment $B$ has $k$ possible outcomes, then there are $nk$ possible outcomes when you perform both experiments.
Let $A$ be the experiment "Flip a coin." Let $B$ be "Roll a die." Then $A$ has two outcomes, $H$ and $T$, and $B$ has six outcomes, $1,...,6$. The joint experiment, called "Flip a coin and roll a die" has how many outcomes?
Explain what this computation means in this case
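A minimal Python sketch of this example (the variable names are mine): enumerating the Cartesian product of the two outcome sets confirms the $nk$ rule.

```python
from itertools import product

coin = ["H", "T"]          # experiment A: 2 outcomes
die = [1, 2, 3, 4, 5, 6]   # experiment B: 6 outcomes

# The joint experiment is the Cartesian product A x B.
joint = list(product(coin, die))
print(joint)               # [('H', 1), ('H', 2), ..., ('T', 6)]
print(len(joint))          # 12 = 2 * 6
```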
The number of $k$-permutations of $n$ distinguishable objects is $$^nP_k=n(n-1)(n-2)\dots(n-k+1) = \frac{n!}{(n-k)!}$$
The number of ways to select $k$ objects from $n$ distinct objects when different orderings constitute different choices
I have five vases, and I want to put two of them on the table. How many different ways are there to arrange the vases?
If order doesn't matter...
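As a quick check on the vase example (a sketch, assuming Python 3.8+ for `math.perm` and `math.comb`), the standard library gives both the ordered and unordered counts:

```python
import math

# Ordered arrangements of 2 of the 5 vases: 5P2 = 5!/(5-2)! = 20
print(math.perm(5, 2))   # 20

# If order doesn't matter: 5C2 = 5!/(2!(5-2)!) = 10
print(math.comb(5, 2))   # 10
```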
Q. How many ways are there to get 4 of a kind in a 5 card draw?
A. Break it down:
Q. How many ways are there to get a full house in a 5 card draw?
A. The matching triple can be any of the 13 denominations, and the pair can be any of the remaining 12 denominations.
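A short sketch of the counting for both hands (the variable names are mine); dividing by $\binom{52}{5}$ gives the corresponding probabilities as a sanity check:

```python
import math

hands = math.comb(52, 5)                 # all 5-card hands: 2,598,960

# Four of a kind: choose the denomination (13 ways), take all 4 of its suits,
# then choose any of the remaining 48 cards as the fifth card.
four_kind = 13 * math.comb(4, 4) * 48    # 624

# Full house: choose the triple's denomination (13) and its 3 suits (C(4,3)),
# then the pair's denomination (12 remaining) and its 2 suits (C(4,2)).
full_house = 13 * math.comb(4, 3) * 12 * math.comb(4, 2)   # 3744

print(four_kind, four_kind / hands)      # 624   ~0.00024
print(full_house, full_house / hands)    # 3744  ~0.00144
```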
Joint probability: $P(A \cap B) = P(A|B) \times P(B)$
Law of Total Probability: If $B_1, \dots, B_k$ partition $S$, then for any event A,
$$ P(A) = \sum_{i = 1}^k P(A \cap B_i) = \sum_{i = 1}^k P(A | B_i) P(B_i) $$

We can write any joint probability as an incremental product of conditional probabilities,
$ P(A_1 \cap A_2) = P(A_1)P(A_2 | A_1) $
$ P(A_1 \cap A_2 \cap A_3) = P(A_1)P(A_2 | A_1)P(A_3 | A_2 \cap A_1) $
In general, for $n$ events $A_1, A_2, \dots, A_n$, we have
$ P (A_1 \cap A_2 \cap \dots \cap A_n) = P(A_1)P(A_2 | A_1) \dots P(A_n | A_{n-1} \cap \dots \cap A_1) $
Frequentist perspective: probability of event is relative frequency
$$P(A) = \frac{\text{# times $A$ occurs}}{\text{total # experiments}}$$

Assign probabilities to sequences of "words". This is a model of language structure.
Most common word is "the"
Is the most common two-word sentence therefore: "the the."?
Spelling & grammar correction: "Their are problems wit this sentence"
"Their are..." vs. "There are..." vs. "They're are..."
Predict next word in this sequence: "5, 4, 3, 2, _"
Speech recognition: “I ate a cherry” or “Eye eight uh Jerry”
Machine translation: 他向记者介绍了发言的主要内容 ("He briefed reporters on the main content of the speech")
Or with vector notation $P(\mathbf w)$, where $w_1$="once", $w_2$="upon", $w_3$="a", $w_4$="time".
Note these are not quite the same as the simple joint probability of the words, i.e.,
$$P(\text{"once"},\text{"upon"},\text{"a"},\text{"time"}) = P(\text{"upon"},\text{"a"},\text{"time"},\text{"once"}) = ...$$Unless we choose to ignore relative location of words (will get to that soon).
Different versions of same information. Both are referred to as "language models".
Problem: most sentences will never appear in our corpus.
We need to assign a probability to every possible sentence
Suppose our vocabulary is (limited to) 10,000 words
How many possible 5-word sentences are there?
$(10,000)^5 = 10^{20}$ -- sampling with replacement
We consider a sequence of words as a vector (really a list of strings here)
$$ \mathbf w = (w_1, w_2, ..., w_n)$$
$$ w_1 = \text{"the"}, \, w_2 = \text{"cat"},\, w_3 = \text{"sat"}, \, ... $$

Position is important. What are the events?
$P(W = \mathbf w)$ is the probability that the random variable $W$ takes the value $\mathbf w$ (our vector)
J&M also writes this as
$$P(w_1 w_2 ... w_n)$$

Note this is not a product.
Think of $P(w_1 w_2 ... w_n)$ as $P(W_1 = w_1, W_2=w_2, ..., W_n = w_n)$, a multivariate distribution.
Sometimes we may get sloppy with the notation. Generally expect order to matter unless we are doing a method where we discard info about order.
How do we do this?
Answer: use the Chain Rule of Probability,
$$P(w^n_1) = P(w_1)P(w_2|w_1)P(w_3|w^2_1)\cdots P(w_n|w^{n-1}_1) = \prod_{k=1}^{n} P(w_k|w^{k-1}_1),$$
where $w^{b}_a \equiv w_aw_{a+1}...w_b$.
Model order limited to length $N$ sequences:
\begin{align} P(w_n|w_{n-1}...w_1) &\approx P(w_n|w_{n-1}...w_{n-(N-1)}) \\ P(w_n|w^{n-1}_1) &\approx P(w_n|w^{n-1}_{n-N+1}) \end{align}

So use the previous $N-1$ words.
Exercise: list all possible bigrams from sentence "The cat sat on the hat."
Exercise: list all possible unigrams.
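A small sketch of both exercises (the tokenization is deliberately naive: lowercase, strip the period, split on whitespace, and add <s> and </s> markers); it also shows the maximum-likelihood bigram estimate $P(w_n|w_{n-1}) = C(w_{n-1}w_n)/C(w_{n-1})$:

```python
from collections import Counter

sentence = "The cat sat on the hat."
tokens = ["<s>"] + sentence.lower().replace(".", "").split() + ["</s>"]

unigrams = Counter(tokens)                   # unigram counts
bigrams = Counter(zip(tokens, tokens[1:]))   # bigram counts

print(list(bigrams))
# [('<s>', 'the'), ('the', 'cat'), ('cat', 'sat'), ('sat', 'on'),
#  ('on', 'the'), ('the', 'hat'), ('hat', '</s>')]

# Maximum-likelihood bigram estimate: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
def p_bigram(prev, word):
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("the", "cat"))   # 0.5: "the" occurs twice, followed by "cat" once
```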
Convert a text string (like a document) into a vector of word frequencies by essentially summing up the one-hot encoded vectors for the words. Perhaps divide by total number of words.
Basically, get a histogram for each document to use as a feature vector. Becomes structured data.
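A minimal bag-of-words sketch (again with naive whitespace tokenization; the function name is mine). Dividing by the total word count gives relative frequencies instead of raw counts:

```python
from collections import Counter

def bag_of_words(text, normalize=False):
    """Convert a text string into a word-frequency histogram."""
    counts = Counter(text.lower().split())
    if normalize:
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}
    return dict(counts)

print(bag_of_words("the cat sat on the hat"))
# {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'hat': 1}
```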
What kind of $N$-gram does this use?
Future state only depends on present state (not past states)
How might this apply to word prediction?
What kind of $N$-gram does this use?
Higher-order Markov processes. Probability of next step depends on current node and previous node.
$N$-grams for $N>2$
What is P(<s>|</s>)?
What is P(eggs|green)?
Trigrams are actually much more commonly used in practice, but bigrams are easier for examples.
In practice, we work with log-probabilities instead to avoid underflow
Exercise: demonstrate how this calculation would be done with log-probabilities
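One way the demonstration might look (the per-word probabilities below are made up for illustration): multiplying many small probabilities eventually underflows to 0.0, while summing their logs stays well-behaved.

```python
import math

# Hypothetical per-word probabilities from some language model (made up)
probs = [0.01, 0.002, 0.05, 0.001, 0.03]

# Direct product: fine for 5 terms, but underflows once there are many small factors
direct = 1.0
for p in probs:
    direct *= p

# Log space: sum the logs; exponentiate only at the end (if at all)
log_total = sum(math.log(p) for p in probs)

print(direct)               # ~3e-11
print(log_total)            # ~ -24.23
print(math.exp(log_total))  # ~3e-11, matching the direct product
```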
Extrinsic versus Intrinsic
Training set, Test set, and "Dev Set"
Use Training Set for computing counts of sequences, then compare to Test set.
Bigram version:
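(Filling in the standard formula from J&M, for a test set of $N$ words:)
$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$$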
Weighted average branching factor - number of words that can follow any given word, weighted by probability.
Note similarity to concept of information entropy.
Assume each digit has probability of $\frac{1}{10}$, independent of prior digit. Compute PP.
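Worked out, under the stated assumption of a length-$N$ digit string with each digit equally likely:
$$PP(W) = \left( \prod_{i=1}^{N} \frac{1}{P(w_i)} \right)^{1/N} = \left( 10^{N} \right)^{1/N} = 10$$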
What happens if probabilities vary from this uniform case?
WSJ dataset: 20k word vocabulary, 1.5M word corpus.
Overfitting and Bias cause misleadingly low PP
A smaller vocabulary tends to give lower PP; perplexities can only be compared between models using the same vocabulary.
Using WSJ treebank corpus:
Note the bias caused by the corpus, and the importance of genre & dialect.
Consider what happens to this calculation if any of our $N$-gram count are zero in the corpus.
Closed Vocabulary System - Vocabulary is limited to certain words. Ex: phone system with limited options.
Open Vocabulary System - Possible unknown words - these are mapped to a special token <UNK>.
Out-of-Vocabulary (OOV) - words encountered in the test set (or real application) which aren't in the training set.
Adjust $N$-gram counts to give positive numbers in place of zeros, which reduces the counts of the others.
https://en.wikipedia.org/wiki/Additive_smoothing
$$P(w_i) = \dfrac{C(w_i)}{N} \approx \dfrac{C(w_i)+1}{N+V}$$

$N$ = total number of words.
$V$ = size of vocabulary.
Adjusted count: $C^*(w_i) = \left(C(w_i)+1\right)\dfrac{N}{N+V}$
Exercise: compute $P(w_i)$ using the adjusted count.
Discount = reduction for words with nonzero counts = $d_c=\dfrac{C^*(w_i)}{C(w_i)}$
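A small sketch of add-one smoothing for unigrams (the toy counts are made up; "unseen" stands in for a zero-count word). Note that $C^*(w_i)/N$ recovers the same smoothed probability $(C(w_i)+1)/(N+V)$, which is the answer to the exercise above.

```python
from collections import Counter

# Toy unigram counts, made up for illustration
counts = Counter({"the": 50, "cat": 10, "sat": 5, "unseen": 0})
N = sum(counts.values())   # total number of word tokens
V = len(counts)            # vocabulary size

for w, c in counts.items():
    p_laplace = (c + 1) / (N + V)               # P(w_i) = (C(w_i)+1)/(N+V)
    c_star = (c + 1) * N / (N + V)              # adjusted count C*(w_i)
    discount = c_star / c if c > 0 else None    # d_c = C*(w_i)/C(w_i)
    print(w, round(p_laplace, 4), round(c_star, 2), discount)
```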
Before Laplace smoothing:
After Laplace smoothing:
Before Laplace smoothing:
After Laplace smoothing:
Note large reductions.
Backoff: If desired $N$-gram not available, use $(N-1)$-gram.
Stupid Backoff: perform backoff but don't bother adjusting normalization properly.
Interpolation: combine $N$-grams for different $N$.
Simple linear interpolation: linear combination of $N$-gram with $(N-1)$-gram and $(N-2)$-gram ... and unigram.
Note the need to adjust normalization (the denominator in the probability estimate) depending on the total number of $N$-grams used.
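A minimal sketch of simple linear interpolation for a trigram model. The component probabilities and $\lambda$ values below are made-up placeholders (in practice the MLEs come from counts and the $\lambda$s are tuned on the dev set); the point is that the interpolated estimate stays nonzero even when the trigram count is zero.

```python
# Toy component estimates (made-up numbers standing in for count-based MLEs)
unigram = {"eggs": 0.01, "ham": 0.008}
bigram = {("green", "eggs"): 0.2}
trigram = {}   # suppose the trigram ("i", "green", "eggs") was never seen

LAMBDAS = (0.6, 0.3, 0.1)   # lambda_3, lambda_2, lambda_1; chosen to sum to 1

def p_interpolated(w, prev2, prev1):
    """P_hat(w | prev2, prev1) = l3*P(w|prev2,prev1) + l2*P(w|prev1) + l1*P(w)."""
    l3, l2, l1 = LAMBDAS
    return (l3 * trigram.get((prev2, prev1, w), 0.0)
            + l2 * bigram.get((prev1, w), 0.0)
            + l1 * unigram.get(w, 0.0))

print(p_interpolated("eggs", "i", "green"))   # 0.061: nonzero despite the zero trigram count
```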
Church & Gale noticed in 1991 using AP Newswire dataset with 22M word training set and 22M word test set:
Bigram Absolute discounting with interpolated Backoff
$$ P_{Abs}(w_i|w_{i-1})= \dfrac{C(w_{i-1}w_i)-d}{\sum_v C(w_{i-1}v)}+\lambda(w_{i-1})P(w_i) $$

Context-dependent weights: $\lambda$ is higher when the count is higher.
Consider $P(\text{kong})>P(\text{glasses})$, but $P(\text{reading glasses})>P(\text{reading kong})$
Bigrams should capture this, but when we don't have any in the training set and need to back off, we still want to maintain the effect.
Replace unigram probability with continuation probability.
Word appears in many different bigrams --> higher $P_{CONTINUATION}$
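The standard J&M definition counts bigram types rather than tokens: the continuation probability of $w$ is the fraction of distinct bigram types that end in $w$,
$$P_{\text{CONTINUATION}}(w) = \frac{\bigl|\{ v : C(vw) > 0 \}\bigr|}{\bigl|\{ (u, v) : C(uv) > 0 \}\bigr|}$$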
Use absolute discounting
Uses continuation probability for low counts
Bigram version:
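(The standard interpolated Kneser-Ney bigram form from J&M, with discount $d$:)
$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max\bigl(C(w_{i-1}w_i) - d,\, 0\bigr)}{C(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{CONTINUATION}}(w_i)$$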
Recursive implementation for $N$-grams.
The task of assigning a label or category to an entire text or document.
Words like great, richly, awesome, pathetic, awful, and ridiculously are very informative cues:
(so a unigram model might work reasonably well)
$P(c|d)$ = posterior probability
$P(d|c)$ = likelihood of data
$P(c)$ = prior probability of class $c$
Where the product index $i$ runs over every word in the document (including repeats).
Note this can be viewed as a linear classification technique: take a linear combination by applying a weight to each feature (word).
Feature vector as a list of 1's (one per word token). A different weight vector for each class, but the applied weights are document-specific (they depend on which tokens appear).
Or with bag-of-words representation, feature vector as histogram. Can apply same weight vectors to different documents.
\begin{align} c_{NB} &= \arg\max_c \left\{ \log P(c) + \sum_i \log P(w_i|c) \right\} \\ &= \arg\max_c \left\{ \log P(c) + \sum_{w \in V} N_w \log P(w|c) \right\} \end{align}

where we sum over all words in the vocabulary and apply the weights.
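A compact sketch of multinomial Naive Bayes with add-one smoothing and log probabilities (the toy training documents and class names are invented for illustration):

```python
import math
from collections import Counter, defaultdict

# Toy training data: (document, class) pairs -- invented for illustration
train = [("great richly awesome film", "pos"),
         ("pathetic awful ridiculously bad", "neg"),
         ("awesome great fun", "pos"),
         ("awful boring bad", "neg")]

class_docs = Counter(c for _, c in train)     # document counts per class (for the prior)
word_counts = defaultdict(Counter)            # word counts per class (for the likelihood)
for doc, c in train:
    word_counts[c].update(doc.split())

vocab = {w for counts in word_counts.values() for w in counts}
V = len(vocab)

def log_posterior(doc, c):
    """log P(c) + sum_i log P(w_i | c), with add-one smoothing."""
    score = math.log(class_docs[c] / sum(class_docs.values()))
    total = sum(word_counts[c].values())
    for w in doc.split():
        if w in vocab:   # ignore words not seen in training
            score += math.log((word_counts[c][w] + 1) / (total + V))
    return score

def classify(doc):
    return max(class_docs, key=lambda c: log_posterior(doc, c))

print(classify("ridiculously awful"))   # 'neg'
```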
Note that outputs aren't class probabilities. How could we make them into probabilities?
Convert a text string (like a document) into a vector of word frequencies $(N_1,N_2, ...)$ by essentially summing up the one-hot encoded vectors for the words. Perhaps divide by total number of words.
Basically, get a histogram for each document to use as a feature vector. Becomes structured data.
How does this relate to $N$-grams?
What probability rule are we using here?
Binary Naive Bayes: rather than using counts (and smoothing), just use a binary indicator of word presence or absence (1 or 2, rather than 0 or 1, to avoid zeros). Essentially just remove word repeats from the document.
Lists of positive vs negative words.
A language model models statistical relationships between words and can be used to predict words of high overall probability for a string of text.
Naive Bayes models statistical relationships between words and classes. It is used to predict a class given words.