Sampling
Mean & Variance
Correlations
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
Sample can have different meanings in different fields
What do statisticians call the Machine Learning version of a "sample"?
What are we calling the statisticians' "sample"?
Not just the data we have: we presume our numbers apply to a population we sampled from:
Example: we want to know the average age of college students in the US, so we ask the members of this class and take the average.
What's the sample here?
Consider what is the population in the examples where we estimate a mean:
Is polling this class a good way to estimate the average age of US college students? Why or why not?
A central concept in statistics. If a sample is taken randomly, so that every member of the population is equally likely to be chosen, then the Central Limit Theorem (CLT), together with the related Law of Large Numbers, lets us compute confidence intervals, $p$-values, etc., describing how good our estimate is.
Ex: If you sample 100 students randomly, your estimate of the average, e.g., 30.5 years, comes with a confidence interval, e.g., $\pm 3.5$ years. For a one-standard-error interval, that means the true average is roughly 68% likely to lie within [30.5-3.5, 30.5+3.5].
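One way to make this concrete is to simulate it: draw many random samples from a made-up population and look at the spread of the resulting sample means. The gamma-distributed "ages" and the sample size below are invented for illustration; the point is that the spread of the sample means matches the CLT's $\sigma/\sqrt{n}$ prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "population" of ages (right-skewed, for illustration only)
population = rng.gamma(shape=5.0, scale=5.0, size=100_000)

n = 100  # sample size
# Draw 2000 random samples of size n and record each sample's mean
means = np.array([rng.choice(population, size=n).mean()
                  for _ in range(2_000)])

# The spread of the sample means (the "standard error") should be
# close to the CLT prediction sigma / sqrt(n)
print(means.std(), population.std() / np.sqrt(n))
```

The two printed numbers agree closely, even though the population itself is far from Gaussian.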
HOWEVER: it all goes out the window if we do a bad job of sampling.
Sampling Bias - error in estimate due to sampling that is not sufficiently random. What kinds of biases might appear in the above list?
The arithmetic average of the data values
$$ \bar{x} = \frac{\sum_{i=1}^n x_i}{n} = \frac{x_1 + x_2 + \ldots + x_n}{n} $$ where n is the sample size.
print( (1+2+3+4+5)/5, np.mean((1,2,3,4,5)), (1+2+3+4+10)/5, np.mean((1,2,3,4,10)) )
3.0 3.0 4.0 4.0
The "middle" number (when the data values are put in order)
Ex: salaries of 1984 UNC geography grads: mean \$1M+, median \$30k.
print(np.median((1,2,3,4,5)), np.median((1,2,3,4,10)) )
3.0 3.0
The most frequently occurring value
# find it in numpy/scipy
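One possible answer, using only numpy (the data values here are made up; `scipy.stats.mode` is the ready-made alternative):

```python
import numpy as np

data = np.array([1, 2, 2, 3, 3, 3, 4])

# numpy has no mode() of its own, but np.unique with return_counts
# gives each distinct value and how often it occurs;
# scipy.stats.mode is the ready-made alternative.
values, counts = np.unique(data, return_counts=True)
mode = values[np.argmax(counts)]
print(mode)   # -> 3
```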
The $p^{th}$ percentile - $p\%$ of the values in the data are less than or equal to this value ($0 \leq p \leq 100$)
* $1^{st}$ quartile = $25^{th}$ percentile
* $2^{nd}$ quartile = $50^{th}$ percentile = median
* $3^{rd}$ quartile = $75^{th}$ percentile
How might you compute this?
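A sketch of one way, with made-up data: sort the values, then let `np.percentile` interpolate between the order statistics for you.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 10])

# By hand: sort, then walk p% of the way through the sorted values
print(np.sort(data))

# numpy handles the interpolation between order statistics
print(np.percentile(data, 25),   # 1st quartile
      np.percentile(data, 50),   # 2nd quartile = median
      np.percentile(data, 75))   # 3rd quartile
```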
...As in the whole population
...As in just the sample
Note that sneaky $n-1$ denominator.
Variance is a kind of average too.
# find it in numpy/scipy - multiple ways
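One possible answer, using the same eight data values as the worked example in this section. Note numpy's `ddof` argument, which switches between the $n$ (population) and $n-1$ (sample) denominators:

```python
import numpy as np

x = np.array([1048, 1059, 1047, 1066, 1040, 1070, 1037, 1073])

# Population variance (divide by n) -- numpy's default
print(np.var(x))           # -> 168.5

# Sample variance (divide by n-1) via the ddof argument
print(np.var(x, ddof=1))   # -> 192.57...

# Or build it straight from the definition
print(np.sum((x - x.mean())**2) / (len(x) - 1))
```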
(1048 + 1059 + 1047 + 1066 + 1040 + 1070 + 1037 + 1073)/8
1055.0
x = np.array([1048, 1059, 1047, 1066, 1040, 1070, 1037, 1073]);
np.mean(x)
1055.0
x = np.matrix([1048, 1059, 1047, 1066, 1040, 1070, 1037, 1073])  # np.matrix is deprecated; np.array([...]).reshape(1, -1) works the same way
x_bar = x - np.mean(x)
x_bar_2 = np.power(x_bar,2.)
table2_1 = np.concatenate((x.T, x_bar.T, x_bar_2.T), axis=1)
print(table2_1)
[[1048.   -7.   49.]
 [1059.    4.   16.]
 [1047.   -8.   64.]
 [1066.   11.  121.]
 [1040.  -15.  225.]
 [1070.   15.  225.]
 [1037.  -18.  324.]
 [1073.   18.  324.]]
np.mean(table2_1,axis=0)
matrix([[1055. , 0. , 168.5]])
print(np.var(x), np.std(x))
168.5 12.98075498574717
\begin{align} \text{Population mean} &= \mu = \frac{\sum_{i=1}^N x_i}{N} \\ \text{Sample mean} &= \bar{x} = \frac{\sum_{i=1}^n x_i}{n} \\ \text{Population variance} &= \sigma^2 = \frac{\sum_{i=1}^N (x_i - \mu)^2}{N} \\ \text{Sample variance} &= s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^n x_i^2 - \frac{1}{n}(\sum_{i=1}^n x_i)^2}{n - 1} \\ \text{Standard deviation} &= \sqrt{\text{Variance}} \end{align}
"Multivariate" dataset ~ a matrix with samples for rows and features for columns. Each column (a.k.a. feature) is also known as a variable.
Just focusing on two variables for the moment, say column1 = $x$, and column2 = $y$.
\begin{align} \text{Variance} &= s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^n x_i^2 - \frac{1}{n}(\sum_{i=1}^n x_i)^2}{n - 1} \\ \text{Standard deviation} &= \sqrt{\text{Variance}} \end{align}
\begin{align} \text{Correlation Coefficient} &= r = \frac{ \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n(y_i - \bar{y})^2}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \\ \text{Covariance} &= S_{xy} = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \end{align}
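These formulas can be checked directly against numpy's built-ins; the two small arrays here are made up for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

n = len(x)
S_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)   # covariance
S_xx = np.sum((x - x.mean())**2) / (n - 1)                 # variance of x
S_yy = np.sum((y - y.mean())**2) / (n - 1)                 # variance of y
r = S_xy / np.sqrt(S_xx * S_yy)

# numpy's built-ins agree: np.cov uses the same n-1 convention,
# and np.corrcoef returns the full correlation matrix
print(S_xy, np.cov(x, y)[0, 1])
print(r, np.corrcoef(x, y)[0, 1])
```

Note that the $\frac{1}{n-1}$ factors cancel in $r$, which is why the first form of the correlation formula has no denominator correction.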
Look kind of familiar? Relate to variance.
What are $S_{xx}$ and $S_{yy}$ also called?
Can you relate the covariance to a distance metric?
\begin{align} \text{Correlation Coefficient} &= r = \frac{ \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n(y_i - \bar{y})^2}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \end{align}
Roughly what are the correlation coefficients here?
Generate N Gaussian random numbers and compute mean and std and plot histograms, for N=10,100,1000.
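A sketch of one solution (the seed and bin count are arbitrary choices):

```python
import numpy as np
from matplotlib import pyplot as plt

rng = np.random.default_rng(42)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, N in zip(axes, (10, 100, 1000)):
    z = rng.standard_normal(N)
    print(N, z.mean(), z.std())   # should approach 0 and 1 as N grows
    ax.hist(z, bins=20)
    ax.set_title(f"N = {N}")
plt.show()
```

As N grows, the sample mean and standard deviation home in on the true values 0 and 1, and the histograms look increasingly bell-shaped.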
Load IRIS dataset and compute pairwise correlations between features.
Now standardize each feature and give scatter plot for pairs of columns with highest pairwise correlation. Also give scatter plot for the pair of columns with lowest pairwise correlation.
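A sketch of one solution for both exercises, assuming scikit-learn is available for the dataset; the pair-picking logic uses $|r|$ to decide "highest" and "lowest":

```python
import numpy as np
from sklearn.datasets import load_iris
from matplotlib import pyplot as plt

iris = load_iris()
X, names = iris.data, iris.feature_names        # 150 samples x 4 features

# Pairwise correlations between the four features
# (rowvar=False: treat columns as variables)
R = np.corrcoef(X, rowvar=False)
print(np.round(R, 2))

# Standardize each column: subtract its mean, divide by its std
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Most and least correlated off-diagonal pairs, by |r|
off = np.abs(R - np.eye(4))
hi = np.unravel_index(np.argmax(off), off.shape)
lo = np.unravel_index(np.argmin(off + np.eye(4)), off.shape)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (i, j), label in zip(axes, (hi, lo), ("highest |r|", "lowest |r|")):
    ax.scatter(Z[:, i], Z[:, j])
    ax.set_xlabel(names[i])
    ax.set_ylabel(names[j])
    ax.set_title(label)
plt.show()
```

The highest-correlation scatter plot shows points hugging a line, while the lowest-correlation one looks like a shapeless cloud.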