Developers Planet

Brian February 2016

how to properly use sklearn to predict the error of a fit

I'm using sklearn to fit a linear regression model to some data. In particular, my response variable is stored in an array y and my features in a matrix X.

I train a linear regression model with the following piece of code

    from sklearn.linear_model import LinearRegression
    model = LinearRegression()
    model.fit(X, y)  # fit the model to the features X and response y

and everything seems to be fine.

Then let's say I have some new data X_new and I want to predict the response variable for it. This can easily be done with

    predictions = model.predict(X_new)

My question is: what is the error associated with this prediction? From my understanding I should compute the mean squared error of the model:

    from sklearn.metrics import mean_squared_error
    model_mse = mean_squared_error(y, model.predict(X))  # signature is (y_true, y_pred)

And basically my real predictions for the new data should be random numbers drawn from a Gaussian distribution with mean predictions and sigma^2 = model_mse. Do you agree with this, and do you know if there's a faster way to do this in sklearn?
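A minimal sketch of that idea, on synthetic data made up purely for illustration (the data, the seed, and the names rng and noisy_predictions are assumptions, not part of the original problem):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): y = 3*x0 - 2*x1 + Gaussian noise
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + rng.normal(scale=0.5, size=200)
X_new = rng.normal(size=(5, 2))

model = LinearRegression().fit(X, y)
predictions = model.predict(X_new)

# In-sample MSE, used here as the noise variance sigma^2
model_mse = mean_squared_error(y, model.predict(X))

# One random draw per new point: mean = prediction, std = sqrt(MSE)
noisy_predictions = rng.normal(loc=predictions, scale=np.sqrt(model_mse))
print(noisy_predictions)
```

Note that an MSE computed on the same data used for fitting tends to understate the error on new data.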


Kris February 2016

You probably want to validate your model on data held out from your training set. I would suggest exploring the cross-validation tools in sklearn.model_selection (called sklearn.cross_validation before scikit-learn 0.18).

The most basic usage is:

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y)
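To make the split concrete, here is a rough sketch that fits on the training part and measures the MSE on the held-out part. It is written against a recent scikit-learn, where train_test_split is imported from sklearn.model_selection; the synthetic data, the 25% test size, and the random_state values are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=300)

# Hold out 25% of the rows; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Error on the data used for fitting vs. error on unseen data
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print("train MSE:", train_mse)
print("test MSE: ", test_mse)
```

The held-out MSE is the number to quote as the error of predictions on new data.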

Tom February 2016

It depends on your training data. If its distribution is a good representation of the "real world" and the data set is sufficiently large (see learning theory, e.g. PAC learning), then I would generally agree.

That said, if you are looking for a practical way to evaluate your model, why not use a test set as Kris has suggested? I usually use grid search for optimizing parameters:

    from sklearn.model_selection import GridSearchCV, train_test_split

    # clf is assumed to be a Pipeline containing a step named "logistic"
    # (e.g. a LogisticRegression), which is what the logistic__C key refers to

    # split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X_data[indices], y_data[indices], test_size=0.25)

    # cross-validated grid search
    params = dict(logistic__C=[0.1, 0.3, 1, 3, 10, 30, 100])
    grid_search = GridSearchCV(clf, param_grid=params, cv=5)
    grid_search.fit(X_train, y_train)

    # print scores and best parameters
    print('best param: ', grid_search.best_params_)
    print('best CV score: ', grid_search.best_score_)
    print('Test score: ', grid_search.best_estimator_.score(X_test, y_test))

The idea is to hide the test set from your learning algorithm (and from yourself): don't train on it and don't optimize parameters with it.

Finally, you should use the test set for performance evaluation (error) only; it should then provide an unbiased estimate of the MSE.
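For the regression case in the question, the same workflow could look roughly like the sketch below. Plain LinearRegression has no hyperparameter to tune, so Ridge and its alpha grid are swapped in here as an assumption, and the data is synthetic. The "neg_mean_squared_error" scorer returns negated MSE values, hence the sign flip.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 3.0]) + rng.normal(scale=1.0, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Tune alpha with 5-fold cross-validation on the training set only
grid_search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid_search.fit(X_train, y_train)

# best_score_ is a negated MSE, so flip the sign
cv_mse = -grid_search.best_score_

# The test set is touched once, after tuning is finished
test_mse = mean_squared_error(y_test, grid_search.predict(X_test))

print("best alpha:", grid_search.best_params_["alpha"])
print("CV MSE:    ", cv_mse)
print("test MSE:  ", test_mse)
```

Here test_mse is the figure to report, because neither the fit nor the choice of alpha ever used the test rows.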

Post Status

Asked in February 2016
Viewed 3,544 times
Voted 7
Answered 2 times

