
Brian February 2016
### How to properly use sklearn to predict the error of a fit

I'm using `sklearn` to fit a linear regression model to some data. In particular, my response variable is stored in an array `y` and my features in a matrix `X`.

I train a linear regression model with the following piece of code:

```
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
```

and everything seems to be fine.

Then let's say I have some new data `X_new`, and I want to predict the response variable for it. This can easily be done with

```
predictions = model.predict(X_new)
```

My question is: what is the error associated with this prediction? From my understanding, I should compute the mean squared error of the model:

```
from sklearn.metrics import mean_squared_error
model_mse = mean_squared_error(y, model.predict(X))
```

And basically, my real predictions for the new data should be random numbers drawn from a Gaussian distribution with mean `predictions` and variance sigma^2 = `model_mse`. Do you agree with this, and do you know if there's a faster way to do this in `sklearn`?
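Putting the steps above together, the sampling idea could be sketched as follows. This is a minimal, self-contained example on synthetic data (the data and true coefficients are made up for illustration); note that `sigma^2 = model_mse` is a variance, so the normal draw uses its square root as the standard deviation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)

# synthetic stand-ins for the question's X, y and X_new
X = rng.rand(100, 3)
true_coef = np.array([1.0, 2.0, -1.0])
y = X.dot(true_coef) + rng.normal(scale=0.5, size=100)
X_new = rng.rand(10, 3)

model = LinearRegression()
model.fit(X, y)

# in-sample MSE; note this tends to underestimate the error on new data
model_mse = mean_squared_error(y, model.predict(X))

# draw one "noisy" prediction per new sample:
# mean = point prediction, variance = model_mse (so scale = its sqrt)
predictions = model.predict(X_new)
noisy_predictions = rng.normal(loc=predictions, scale=np.sqrt(model_mse))
```

One caveat with this approach: the in-sample MSE is an optimistic estimate of the prediction error, which is what the answers below address.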

Kris February 2016

You probably want to validate your model on data it was not trained on, rather than on your training set. I would suggest exploring the cross-validation submodule `sklearn.cross_validation`.

The most basic usage is:

```
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
```
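Building on that split, the held-out error could then be estimated like this (a sketch using synthetic stand-ins for the question's `X` and `y` to keep it self-contained; the `try`/`except` import covers both the module location in this thread's sklearn version and the newer `model_selection` location):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

try:  # sklearn >= 0.18
    from sklearn.model_selection import train_test_split
except ImportError:  # older versions, as used in this thread
    from sklearn.cross_validation import train_test_split

rng = np.random.RandomState(42)

# synthetic stand-ins for the question's X and y
X = rng.rand(200, 3)
y = X.dot(np.array([1.0, -2.0, 0.5])) + rng.normal(scale=0.3, size=200)

# default split holds out 25% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)

# MSE on data the model has never seen: a less biased error estimate
test_mse = mean_squared_error(y_test, model.predict(X_test))
```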

Tom February 2016

It depends on your training data. If its distribution is a good representation of the "real world" and the sample is of sufficient size (see learning theory, e.g. PAC learning), then I would generally agree.

That said, if you are looking for a practical way to evaluate your model, why not use a test set as Kris suggested? I usually use grid search for optimizing parameters:

```
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV

# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_data[indices], y_data[indices], test_size=0.25)

# cross-validated grid search over the regularization strength C
params = dict(logistic__C=[0.1, 0.3, 1, 3, 10, 30, 100])
grid_search = GridSearchCV(clf, param_grid=params, cv=5)
grid_search.fit(X_train, y_train)

# print scores and the best estimator's held-out performance
print('best param:', grid_search.best_params_)
print('best train score:', grid_search.best_score_)
print('Test score:', grid_search.best_estimator_.score(X_test, y_test))
```
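The `logistic__C` key in the grid implies that `clf` is a `Pipeline` whose final step is named `"logistic"`. A minimal sketch of such a classifier (the scaler and step names here are my assumption, not something stated in the post above):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Pipeline steps are (name, estimator) pairs; GridSearchCV addresses the
# parameters of a step via "<step name>__<param>", e.g. "logistic__C"
clf = Pipeline([
    ("scale", StandardScaler()),
    ("logistic", LogisticRegression()),
])
```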

The idea is to hide the test set from your learning algorithm (and from yourself): don't train and don't optimize parameters using this data.

Finally, use the test set for performance evaluation (error) only; it should then provide an unbiased estimate of the MSE.

Asked in February 2016

Viewed 3,544 times

Voted 7

Answered 2 times
