Choose document to maximize $score(query, document)$
How might you apply the I.R. framework to QA?
Given training data $(\mathbf x_{(i)},y_i)$ for $i=1,...,m$.
Choose a model $f(\cdot)$ where we want to make $f(\mathbf x_{(i)})\approx y_i$ (for all $i$)
Define a loss function $L(f(\mathbf x), y)$ to minimize by changing $f(\cdot)$, i.e. by adjusting its weights.
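As a concrete (if toy) instance of this recipe, here is a minimal sketch assuming a one-weight linear model $f(x) = wx$ and squared loss; the data and learning rate are invented for illustration:

import numpy as np

# Toy training data (x_i, y_i), invented for illustration: y ≈ 2x.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

w = 0.0     # the adjustable weight of f(x) = w*x
lr = 0.01   # learning rate
for _ in range(200):
    pred = w * X                          # f(x_i) for all i
    grad = 2 * np.mean((pred - y) * X)    # dL/dw for mean squared error
    w -= lr * grad                        # adjust the weight to reduce L
print(w)    # ≈ 2.0, so f(x_i) ≈ y_i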
How do we apply this to text?
Find $c$ which maximizes $P(c|text)$ where $c \in \{spam,ham\}$.
Exercise: describe an approach to implement this with some canned classifier (e.g. sklearn) and some text.
Find $c$ which maximizes $P(c|text)$, where $c \in \{English,Spanish, German, ...\}$
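One possible approach to either classifier with a canned sklearn estimator (cf. the exercise above) is sketched below; the toy texts, labels, and test string are invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labeled texts, invented for illustration.
texts = ["win money now", "meeting at noon", "free money offer", "lunch tomorrow?"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()                 # turn each text into word-count features x
X = vec.fit_transform(texts)
clf = MultinomialNB().fit(X, labels)    # estimates P(c|text) via Bayes' rule

print(clf.predict(vec.transform(["free money"])))   # -> ['spam']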
When would we want to use something other than accuracy?
Acquiring knowledge by skimming a text for occurrences of particular classes of objects and for relationships among them.
Tasks:
Google Translate, English to Danish to English:
Watson "DeepQA"
Adaptation of machine translation technique
(Baker et al., 1998), object-oriented approach to linguistics
Source of datasets and many NLP tasks, "annual task bake-off"
Relation extraction example: determine relationships between words in text.
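A toy sketch of the idea, matching a hard-coded "X is a Y" pattern (real relation extraction systems are far more sophisticated; the text and pattern here are illustrative only):

import re

# Look for "X is a Y" and record an is-a relation between the two words.
text = "Python is a language. Copenhagen is a city."
for subj, obj in re.findall(r"(\w+) is a (\w+)", text):
    print(f"is-a({subj}, {obj})")
# is-a(Python, language)
# is-a(Copenhagen, city)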
The goal of this project is to use a sklearn classifier to discriminate between sentences or paragraphs of two different corpora, e.g. to try to determine the true author of a string of text. There are two key outcomes for this project:
Those already familiar with sklearn, machine learning, and processing text can choose to do a more sophisticated task.
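A minimal sketch of a starting point, assuming the two corpora have already been split into sentences; the sentence lists and pipeline choices here are illustrative, not prescribed by the project:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy stand-ins for sentences drawn from two corpora.
austen_sents = ["It is a truth universally acknowledged.", "She was obliged to walk."] * 20
doyle_sents = ["You know my methods, Watson.", "The game is afoot."] * 20

texts = austen_sents + doyle_sents
labels = ["austen"] * len(austen_sents) + ["doyle"] * len(doyle_sents)

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.3)

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # held-out accuracy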
Given training data $(\mathbf x_{(i)},y_i)$ for $i=1,...,m$ --> lists of samples $\verb|X|$ and labels $\verb|y|$
Choose a model $f(\cdot)$ where we want to make $f(\mathbf x_{(i)})\approx y_i$ (for all $i$) --> choose sklearn estimator to use
Define a loss function $L(f(\mathbf x), y)$ to minimize by changing $f(\cdot)$ ...by adjusting the weights --> default choices for each estimator, sometimes with multiple options
sklearn has an object-oriented interface. Most models/transforms/objects in sklearn are `Estimator` objects.
class Estimator(object):
    def fit(self, X, y=None):
        """Fit model to data X (and y)"""
        self.some_attribute = self.some_fitting_method(X, y)
        return self

    def predict(self, X_test):
        """Make prediction based on passed features"""
        pred = self.make_prediction(X_test)
        return pred
model = Estimator()
The `Estimator` class defines a `fit()` method as well as a `predict()` method. For an instance of an `Estimator` stored in a variable `model`:

`model.fit`: fits the model with the passed-in training data. For supervised models, it also accepts a second argument `y` that corresponds to the labels (`model.fit(X, y)`). For unsupervised models, there are no labels, so you only need to pass in the feature matrix (`model.fit(X)`). Since the interface is very OO, the instance itself stores the results of the fit internally, and as such you must always `fit()` before you `predict()` on the same object.

`model.predict`: predicts new labels for any new datapoints passed in (`model.predict(X_test)`) and returns an array equal in length to the number of rows of the input, containing the predicted labels.

Supervised estimators, in addition to the above methods, typically also have:
`model.predict_proba`: for classifiers that have a notion of probability (or some measure of confidence in a prediction), this method returns those "probabilities". The label with the highest probability is what the `model.predict()` method above returns.

`model.score`: for both classification and regression models, this method returns some measure of validation of the model (which is configurable). For example, in regression the default is typically $R^2$, and in classification it is accuracy.
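As a quick illustration of both methods, here is a minimal sketch on toy data (the classifier choice and data are arbitrary, invented for illustration):

from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration.
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[1.5]]))  # class "probabilities" for a new point
print(clf.score(X, y))             # default metric: accuracy (here on the training data, for brevity)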
Some estimators in the library also implement a transformer interface. "Unsupervised" in this case refers to any method that does not need labels, including unsupervised classifiers, preprocessing (like tf-idf), dimensionality reduction, etc. The transformer interface usually defines two additional methods:
`model.transform`: given an unsupervised model, transforms the input into a new basis (or feature space). This accepts one argument (usually a feature matrix) and returns a matrix of the input transformed. Note: you need to `fit()` the model before you `transform()` with it.

`model.fit_transform`: for some models you may not need to `fit()` and `transform()` separately; in these cases it is more convenient to do both at once (see the `TfidfVectorizer` sketch after the classifier example below).

For example, a 3-nearest-neighbor classifier:

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)
print(neigh.predict([[1.1]]))
print(neigh.predict_proba([[0.9]]))
[0]
[[0.66666667 0.33333333]]
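And a minimal sketch of the transformer interface mentioned above, using `TfidfVectorizer` (the toy documents are invented for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

# fit() learns the vocabulary and idf weights;
# transform() maps text into the tf-idf feature space.
docs = ["the cat sat", "the dog sat", "the cat ran"]   # toy corpus
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)                   # fit() and transform() in one step
print(X.shape)                                  # (3, n_features)
print(tfidf.transform(["the dog ran"]).shape)   # reuses the fitted vocabulary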
Import the module:
from sklearn.neighbors import KNeighborsClassifier
Construct an instance of the class. Hyperparameters are generally set here:
neigh = KNeighborsClassifier(n_neighbors=1)
Fit the model to your dataset using a subset (e.g. 70 percent) of the samples:
neigh.fit(X_train, y_train)
For supervised methods, test accuracy using samples different from those used for training:
y_pred = neigh.predict(X_test)
accuracy = sum(y_pred == y_test) / len(y_test)  # note: labels/predictions may be non-binary
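One way to produce the train/test split assumed above is scikit-learn's `train_test_split`; a minimal sketch (the 30 percent test fraction mirrors the 70 percent training subset mentioned earlier):

from sklearn.model_selection import train_test_split

# Hold out 30 percent of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)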