I.e. the data resides in external memory (e.g. hard drives) rather than in RAM; this setting is also known as out-of-core.
The same techniques can also be applied to data arriving from a network, streams of sensor samples, etc.
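As a rough illustration of the out-of-core idea, the sketch below computes a mean over a file on disk while holding only one chunk in memory at a time. The file name, array size, and chunk size are assumptions made for this example; np.memmap is only one of several ways to read data lazily.

```python
import os
import tempfile

import numpy as np

# Hypothetical on-disk dataset: 1,000,000 float64 values written to a temp file.
n_values = 1_000_000
rng = np.random.default_rng(0)
path = os.path.join(tempfile.mkdtemp(), "samples.dat")
rng.standard_normal(n_values).tofile(path)

# Memory-map the file: nothing is loaded until a slice is actually accessed.
data = np.memmap(path, dtype=np.float64, mode="r", shape=(n_values,))

chunk_size = 100_000  # assumed to fit comfortably in RAM
total = 0.0
count = 0
for start in range(0, n_values, chunk_size):
    chunk = data[start:start + chunk_size]  # only this slice is touched
    total += float(chunk.sum())
    count += chunk.size

streamed_mean = total / count
print(streamed_mean)
```

The same loop structure carries over to network or sensor data: replace the slicing with whatever yields the next batch.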
https://docs.scipy.org/doc/numpy/reference/routines.linalg.html
scikit-learn (sklearn) is a popular Python ML toolbox with several functions relevant to this course.
sklearn has an object-oriented interface.
Most models/transforms/objects in sklearn are Estimator objects.
class Estimator(object):
    def fit(self, X, y=None):
        """Fit model to data X (and y)"""
        self.some_attribute = self.some_fitting_method(X, y)
        return self
    def predict(self, X_test):
        """Make prediction based on passed features"""
        pred = self.make_prediction(X_test)
        return pred
model = Estimator()
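The schematic class above can be made concrete with a toy estimator. MeanRegressor below is a hypothetical class written for this note (it is not part of sklearn); it only illustrates the fit/predict contract and the convention of returning self from fit.

```python
import numpy as np


class MeanRegressor:
    """Toy estimator (hypothetical, not part of sklearn): predicts the training mean."""

    def fit(self, X, y):
        # Learned state is stored on the instance; the trailing underscore
        # follows the sklearn naming convention for fitted attributes.
        self.mean_ = float(np.mean(y))
        return self  # returning self allows chaining: fit(...).predict(...)

    def predict(self, X_test):
        # One prediction per row of X_test
        return np.full(len(X_test), self.mean_)


X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 3.0, 6.0])

model = MeanRegressor().fit(X, y)
pred = model.predict(np.array([[10.0], [20.0]]))
print(pred)  # [3. 3.]
```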
Some estimators in the library implement the transformer interface.
Unsupervised in this case refers to any method that does not need labels, including unsupervised classifiers, preprocessing (like tf-idf), dimensionality reduction, etc.
The transformer interface usually defines two additional methods:
model.transform: Given a fitted unsupervised model, transform the input into a new basis (or feature space). This accepts one argument (usually a feature matrix) and returns the transformed input as a matrix. Note: you need to fit() the model before you can transform() with it.
model.fit_transform: For some models you may not need to fit() and transform() separately. In these cases it is more convenient to do both at the same time.
For incremental learning, model.partial_fit updates internal parameters one batch at a time:
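est = SGDClassifier(...)
# Classifiers need the full label set on the first call; all_classes is a placeholder for it.
est.partial_fit(X_train_1, y_train_1, classes=all_classes)
est.partial_fit(X_train_2, y_train_2)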
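The transformer methods and partial_fit described above can be tried together in one small, runnable sketch. The synthetic data, batch size, and choice of StandardScaler are assumptions made for illustration; the point is only that fit_transform() matches fit() followed by transform(), and that a classifier can be trained batch by batch.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): the label depends on feature 0 only
X = rng.standard_normal((2000, 5)) * 10.0 + 3.0
y = (X[:, 0] > 3.0).astype(int)

# Transformer interface: fit_transform() gives the same result as fit() then transform()
X_scaled = StandardScaler().fit_transform(X)
X_scaled_two_step = StandardScaler().fit(X).transform(X)

# partial_fit: feed the classifier one batch at a time
est = SGDClassifier(random_state=0)
all_classes = np.unique(y)  # classifiers need every label on the first call
batch_size = 200
for start in range(0, len(X_scaled), batch_size):
    batch = slice(start, start + batch_size)
    est.partial_fit(X_scaled[batch], y[batch], classes=all_classes)

accuracy = est.score(X_scaled, y)
print(accuracy)
```

Here the scaler still sees all the data at once to keep the sketch short; in a truly out-of-core setting StandardScaler also offers partial_fit, so it could be updated per batch as well.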
import numpy as np
from sklearn import linear_model
n_samples, n_features = 5000, 5
y = np.random.randn(n_samples)
X = np.random.randn(n_samples, n_features)
clf = linear_model.Ridge(alpha=0)
clf.fit(X, y)
print(clf.coef_,clf.intercept_)
X@clf.coef_ + clf.intercept_
y
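With alpha=0 the ridge penalty vanishes, so the fit above should coincide with ordinary least squares. The sketch below checks this against np.linalg.lstsq on freshly generated data of the same shape; the seed and the explicit column of ones for the intercept are choices made for this example.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
n_samples, n_features = 5000, 5
X = rng.standard_normal((n_samples, n_features))
y = rng.standard_normal(n_samples)

clf = linear_model.Ridge(alpha=0)
clf.fit(X, y)

# Ordinary least squares with an explicit column of ones for the intercept
A = np.hstack([X, np.ones((n_samples, 1))])
solution, *_ = np.linalg.lstsq(A, y, rcond=None)
coef_lstsq, intercept_lstsq = solution[:-1], solution[-1]

# Both differences should be at the level of floating-point round-off
print(np.max(np.abs(clf.coef_ - coef_lstsq)), abs(clf.intercept_ - intercept_lstsq))
```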
clfi = linear_model.SGDRegressor(alpha=0)
for k in range(0, 1000):
    k_rand = np.random.randint(0, len(X) - 10)
    clfi.partial_fit(X[k_rand:k_rand+10, :], y[k_rand:k_rand+10])  # one SGD pass over a random window of 10 consecutive samples
    if k % 100 == 0:
        # residual norm of the incremental model (clfi) on the full data
        print(clfi.coef_, clfi.intercept_, np.linalg.norm(X @ clfi.coef_ + clfi.intercept_ - y))
clfi = linear_model.SGDRegressor(alpha=0)
for k in range(0, 10000):
    clfi.partial_fit(X, y)  # each call makes one SGD pass (epoch) over the full data
    if k % 1000 == 0:
        print(clfi.coef_, clfi.intercept_)
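To see that repeated partial_fit calls approach the in-memory solution, the sketch below compares an SGDRegressor trained by 20 partial_fit calls against LinearRegression. Unlike the pure-noise targets used above, it assumes a known linear signal (true_coef, chosen arbitrarily) so that there is something to converge to; the number of epochs is also an arbitrary choice.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
n_samples, n_features = 5000, 5
true_coef = np.array([1.0, -2.0, 0.5, 0.0, 3.0])  # assumed ground truth for this sketch
X = rng.standard_normal((n_samples, n_features))
y = X @ true_coef + 0.1 * rng.standard_normal(n_samples)

# Reference solution from a single in-memory fit
reference = linear_model.LinearRegression().fit(X, y)

# Incremental solution: 20 calls, each one epoch of SGD over the data
sgd = linear_model.SGDRegressor(alpha=0, random_state=0)
for epoch in range(20):
    sgd.partial_fit(X, y)

# Largest coefficient-wise difference between the two solutions
gap = np.max(np.abs(sgd.coef_ - reference.coef_))
print(gap)
```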
https://www.youtube.com/watch?v=RRtqIagk93k
Ecosystem:
Dask on Kubernetes
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.
https://pandas.pydata.org/pandas-docs/stable/index.html
https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
# https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html
import numpy as np
import pandas as pd
d = {
    'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),
}
df = pd.DataFrame(d)
print(df)
pd.DataFrame(d, index=['d', 'b', 'a'])
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
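The constructor calls above rely on label alignment, which is easier to see with the results spelled out. The sketch below repeats the same dictionary and checks what each call produces; the variable names subset and with_missing are introduced here for illustration.

```python
import numpy as np
import pandas as pd

d = {
    'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),
}

# The row index is the union of both Series' labels; 'one' has no entry for 'd'.
df = pd.DataFrame(d)
print(df.index.tolist())             # ['a', 'b', 'c', 'd']
print(np.isnan(df.loc['d', 'one']))  # True

# Passing index= selects and reorders rows by label.
subset = pd.DataFrame(d, index=['d', 'b', 'a'])
print(subset['two'].tolist())        # [4.0, 2.0, 1.0]

# A requested column that is absent from the dict is created and filled with NaN.
with_missing = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
print(with_missing['three'].isna().all())  # True
```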