1. Statistics with linear algebra¶
(a) Load the IRIS dataset from sklearn and standardize (subtract mean and scale by norm) the flower feature vectors (recall they are rows of the data matrix in this case). Use numpy to compute the mean and std and use broadcasting to subtract or multiply/divide as needed. Demonstrate the mean and std of rows of the standardized data are zero and one, respectively.
(b) Compute the covariance matrix of the dataset using matrix multiplication after you've subtracted the mean, and perhaps applied any appropriate scaling (it should agree with numpy.cov()). Display the covariance matrix using imshow. Can you see the different classes?
(c) using a similar matrix-matrix multiplication with your standardized data matrix, compute a matrix of correlations and display.
(d) compute the precision matrix and display it also. Explain the result. (Hint: you may need to use colorbar() to see what's going on)
from sklearn.datasets import load_iris
import numpy as np
dat = load_iris()
X = dat.data
print(X.shape, 'rows are data vectors')
# subtract means from data vectors
X_0 = X - np.mean(X,1).reshape(150,1) # use broadcasting
# normalize by std of vectors
X_1 = X_0*1/np.std(X_0,1).reshape(150,1) # use broadcasting
print(np.mean(X[0,:]), np.std(X[0,:]))
print(np.mean(X_0[0,:]), np.std(X_0[0,:]))
print(np.mean(X_1[0,:]), np.std(X_1[0,:]))
(150, 4) rows are data vectors 2.55 1.8874586088176872 1.1102230246251565e-16 1.8874586088176872 5.551115123125783e-17 1.0
C = 1/3*X_0@X_0.T
imshow(C);
colorbar();
figure()
imshow(np.cov(X));
colorbar();
corr = X_1@X_1.T
imshow(corr);
colorbar();
prec = np.linalg.inv(C)
imshow(prec)
colorbar();
2. Networkx¶
In class we used networkx to visualize a network made with a simple similarity metrix (inverse of distance) applied to the IRIS dataset.
Perform this visualization using a better similarity metric such as $\exp\left(\frac{-1}{2\sigma}\Vert\mathbf a - \mathbf b\Vert_2^2\right)$. Choose a value of $\sigma$ such that a few major clusters are identifiable. Explain what the result tells you, given what we know about the dataset.
from sklearn.datasets import load_iris
from matplotlib.pyplot import *
import math
import networkx as nx
import numpy as np
from networkx import from_numpy_array
dat = load_iris()
X = dat.data
rows,cols = X.shape
sigma = 1
W = np.zeros((rows,rows))
for r1 in range(0,rows):
for r2 in range(0,rows):
# compute distance between row r1 and row r2
xi = X[r1]
xj = X[r2]
W[r1,r2] = math.exp(-1/2/sigma**2*np.linalg.norm(xi-xj)**2)
#imshow(W);
#colorbar();
figure()
A = (W>0.8).astype(int)
G = from_numpy_array(A, create_using=nx.DiGraph)
layout = nx.spring_layout(G)
nx.draw(G, layout)
3. Eigendecomposition practice¶
Use numpy for this exercise.
(a) Create a 4x4 diagonal matrix where the diagonal consists of random integers between -10 to 10. Compute the eigenvalue decomposition of this matrix, and in a comment identify the eigenvalues and eigenvectors (should be four scalars and four vectors)
(b) create a matrix $C$ which is the covariance matrix formed from the IRIS dataset (covariances between flowers). Compute the eigenvalue decomposition of $C$ and plot its eigenvalues. Do you know how this helps explain the precision matrix result above?
(b) Demonstrate how you can use the outputs of np.linalg.eig to reconstruct $C$ from its eigenvalue decomposition, i.e., $C = U@D@U^T$ for appropriately defined matrices. (hint: $D$ is a diagonal matrix with the eigenvalues on the diagonal, $U$ is the eigenvector matrix returned by eig).
Hint: due to numerical issues, your eigenvalues and/or eigenvectors may become slightly complex, meaning they have a very small imaginary part. This can cause trouble for plotting or imshow. You can remove this imaginary part with np.real().
D = np.diag(np.random.randint(-10,10,4))
print(D)
print(np.linalg.eig(D))
[[-1 0 0 0] [ 0 4 0 0] [ 0 0 -5 0] [ 0 0 0 4]] (array([-1., 4., -5., 4.]), array([[1., 0., 0., 0.], [0., 1., 0., 0.], [0., 0., 1., 0.], [0., 0., 0., 1.]]))
sv = np.linalg.eig((C+C.T)/2)
plot(sv[0]);
S = np.diag(sv[0])
V = sv[1]
imshow(np.real(V@S@V.T - C))
colorbar();