sklearn

sklearn is a best-in-breed machine learning library for Python that we will use extensively in this class. It also has one of the best API designs out there (a paper has even been written about the design), and it is very modular and flexible. As such it has a bit of a learning curve, but once you can think the sklearn way for one algorithm/model, you can apply that general knowledge to any model.
import pandas as pd
import numpy as np
Typically you will be working with an external dataset, and even if it is clean you will need to manipulate/transform it to create features. As such, you will usually load your dataset with something like numpy or pandas.

We will be performing a simple linear regression on a Lending Club dataset of interest rates for individual loans. To start, we need to do a little preparation with pandas to get the data ready for our model.
# load the Lending Club loan data and peek at the first few rows
df = pd.read_csv('loanf.csv')
df.head()

# count the missing values in each column
np.sum(df.isnull())

# drop any rows that contain missing values, then confirm none remain
df = df.dropna(axis=0)
np.sum(df.isnull())
Remember from lecture that for any machine learning model we have features (a feature matrix) and a target (the response/dependent variable, in statistics parlance). For the sklearn API we need to separate these out of our initial data matrix.
NOTE: sklearn expects a numpy array/matrix as input. Often, if you pass in a DataFrame, it can be coerced into a numpy array just fine, but it is best practice to do this conversion yourself.
# everything except the first column is a feature
features = df.iloc[:, 1:]
features.head()

# the first column (the interest rate) is our target
labels = df.iloc[:, 0]
labels.head()

# convert the pandas DataFrame/Series into numpy arrays
X = features.to_numpy()
y = labels.to_numpy()

print("Features: \n", X)
print("\n\nLabels: \n", y)
sklearn has a very object-oriented interface, and it is important to be aware of this when building models: (almost) every model/transform/object in sklearn is an `Estimator`. What is an `Estimator`?
class Estimator(object):

    def fit(self, X, y=None):
        """Fit model to data X (and y)"""
        self.some_attribute = self.some_fitting_method(X, y)
        return self

    def predict(self, X_test):
        """Make prediction based on passed features"""
        pred = self.make_prediction(X_test)
        return pred

model = Estimator()
The `Estimator` class defines a `fit()` method as well as a `predict()` method. For an instance of an `Estimator` stored in a variable `model`:

- `model.fit`: fits the model with the passed-in training data. For supervised models, it also accepts a second argument `y` that corresponds to the labels (`model.fit(X, y)`). For unsupervised models, there are no labels, so you only need to pass in the feature matrix (`model.fit(X)`). Since the interface is very OO, the instance itself stores the results of the fit internally, and as such you must always `fit()` before you `predict()` on the same object.
- `model.predict`: predicts new labels for any new datapoints passed in (`model.predict(X_test)`) and returns an array equal in length to the number of rows passed in, containing the predicted labels.
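Since `fit()` returns the estimator itself, you can even chain the two calls. A minimal sketch of the pattern, using `LinearRegression` on made-up toy data (purely for illustration, not the loan dataset):

from sklearn.linear_model import LinearRegression

# toy data, invented just to show the fit/predict pattern
X_toy = [[1], [2], [3], [4]]
y_toy = [2, 4, 6, 8]

# fit() returns self, so fit and predict can be chained
preds = LinearRegression().fit(X_toy, y_toy).predict([[5]])
print(preds)  # approximately [10.]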
There are 3(ish) types of subclasses of `Estimator`.

Supervised estimators, in addition to the methods above, typically also have:
- `model.predict_proba`: for classifiers that have a notion of probability (or some measure of confidence in a prediction), this method returns those "probabilities". The label with the highest probability is what is returned by the `model.predict()` method from above.
- `model.score`: for both classification and regression models, this method returns some measure of validation of the model (which is configurable). For example, in regression the default is typically R^2, and in classification it is accuracy.
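To make these two methods concrete, here is a small sketch using `LogisticRegression` on made-up toy data (again, not the loan dataset):

from sklearn.linear_model import LogisticRegression

# toy binary classification data, invented for illustration
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [0, 0, 1, 1]

clf_toy = LogisticRegression().fit(X_toy, y_toy)
print(clf_toy.predict([[1.5]]))        # the predicted label
print(clf_toy.predict_proba([[1.5]]))  # a "probability" per class
print(clf_toy.score(X_toy, y_toy))     # mean accuracy on the passed data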
Some estimators in the library implement what is referred to as the transformer interface. Unsupervised in this case refers to any method that does not need labels, including (but not limited to) unsupervised classifiers, preprocessing (like tf-idf), dimensionality reduction, etc. The transformer interface defines (usually) two additional methods:
- `model.transform`: given an unsupervised model, transform the input into a new basis (or feature space). This accepts one argument (usually a feature matrix) and returns a matrix of the input transformed. Note: you need to `fit()` the model before you transform with it.
- `model.fit_transform`: for some models you may not need to `fit()` and `transform()` separately. In these cases it is more convenient to do both at the same time, and that is precisely what `fit_transform()` does!
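A minimal sketch of the transformer interface, using `StandardScaler` (any transformer would work the same way; the data here is made up):

from sklearn.preprocessing import StandardScaler

X_toy = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]

scaler = StandardScaler()
# fit() learns each column's mean/variance; transform() applies the scaling
X_scaled = scaler.fit_transform(X_toy)
print(X_scaled)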
We will be trying to predict the loan interest rate based on the FICO score, loan length, monthly income, and loan amount:
$$\text{Interest.Rate} = \beta_0 + \beta_1 \cdot \text{FICO.Score} + \beta_2 \cdot \text{Loan.Length} + \beta_3 \cdot \text{Monthly.Income} + \beta_4 \cdot \text{Loan.Amount}$$

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# split the data into a training set and a held-out testing set
X_train, X_test, y_train, y_test = train_test_split(X, y)

print("The training split: \n")
print(len(X_train), len(y_train))
print("\n\nThe testing split: \n")
print(len(X_test), len(y_test))
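By default, train_test_split holds out 25% of the rows for testing. If you want a different split, or a reproducible one, it also accepts test_size and random_state arguments; a quick sketch (the values and variable names here are arbitrary, just for illustration):

# hold out 30% of the data, with a fixed seed so the split is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)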
# create an instance of an estimator
clf = LinearRegression()
# fit the estimator (notice I do not save any return value in a variable)
clf.fit(X_train, y_train)
# predict (but only after we have trained!)
predictions = clf.predict(X_test)
print(len(predictions))
# The coefficients
print ('Coefficients: \n', clf.coef_)
# The mean squared error
print("\n\nMean squared error: %.2f"
      % np.mean((predictions - y_test) ** 2))

# The R^2 score (the default score() for regression): 1 is perfect prediction
print('\n\nR^2 score: %.2f' % clf.score(X_test, y_test))
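As an aside, sklearn also ships these metrics as standalone helpers in sklearn.metrics; a sketch reusing predictions and y_test from above:

from sklearn.metrics import mean_squared_error, r2_score

print("MSE: %.2f" % mean_squared_error(y_test, predictions))
print("R^2: %.2f" % r2_score(y_test, predictions))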