python - scikit ShuffleSplit raising pandas "IndexError: index N is out of bounds for axis 0 with size M"
I'm trying to use scikit's grid search to find the best alpha for a Lasso, and one of the parameters I want to iterate over is the cross-validation split. So, I'm doing:
    # x_train := pandas DataFrame with no index (auto-numbered index) and 62064 rows
    # y_train := pandas 1-column DataFrame with no index (auto-numbered index) and 62064 rows
    from sklearn import linear_model as lm
    from sklearn import cross_validation as cv
    from sklearn import grid_search

    model = lm.LassoCV(eps=0.001, n_alphas=1000)
    params = {"cv": [cv.ShuffleSplit(n=len(x_train), test_size=0.2),
                     cv.ShuffleSplit(n=len(x_train), test_size=0.1)]}
    m_model = grid_search.GridSearchCV(model, params)
    m_model.fit(x_train, y_train)
But it raises this exception:
    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    <ipython-input-113-f791cb0644c1> in <module>()
         10 m_model = grid_search.GridSearchCV(model, params)
         11
    ---> 12 m_model.fit(x_train.as_matrix(), y_train.as_matrix())

    /home/user/programs/repos/pyenv/versions/3.5.2/envs/work/lib/python3.5/site-packages/sklearn/grid_search.py in fit(self, X, y)
        802
        803         """
    --> 804         return self._fit(X, y, ParameterGrid(self.param_grid))
        805
        806

    /home/user/programs/repos/pyenv/versions/3.5.2/envs/work/lib/python3.5/site-packages/sklearn/grid_search.py in _fit(self, X, y, parameter_iterable)
        551                                     self.fit_params, return_parameters=True,
        552                                     error_score=self.error_score)
    --> 553                 for parameters in parameter_iterable
        554                 for train, test in cv)
        555

    /home/user/programs/repos/pyenv/versions/3.5.2/envs/work/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
        798             # dispatched. In particular this covers the edge
        799             # case of Parallel used with an exhausted iterator.
    --> 800             while self.dispatch_one_batch(iterator):
        801                 self._iterating = True
        802             else:

    /home/user/programs/repos/pyenv/versions/3.5.2/envs/work/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
        656                 return False
        657             else:
    --> 658                 self._dispatch(tasks)
        659                 return True
        660

    /home/user/programs/repos/pyenv/versions/3.5.2/envs/work/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
        564
        565         if self._pool is None:
    --> 566             job = ImmediateComputeBatch(batch)
        567             self._jobs.append(job)
        568             self.n_dispatched_batches += 1

    /home/user/programs/repos/pyenv/versions/3.5.2/envs/work/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __init__(self, batch)
        178         # Don't delay the application, to avoid keeping the input
        179         # arguments in memory
    --> 180         self.results = batch()
        181
        182     def get(self):

    /home/user/programs/repos/pyenv/versions/3.5.2/envs/work/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
         70
         71     def __call__(self):
    ---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
         73
         74     def __len__(self):

    /home/user/programs/repos/pyenv/versions/3.5.2/envs/work/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
         70
         71     def __call__(self):
    ---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
         73
         74     def __len__(self):

    /home/user/programs/repos/pyenv/versions/3.5.2/envs/work/lib/python3.5/site-packages/sklearn/cross_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
       1529             estimator.fit(X_train, **fit_params)
       1530         else:
    -> 1531             estimator.fit(X_train, y_train, **fit_params)
       1532
       1533     except Exception as e:

    /home/user/programs/repos/pyenv/versions/3.5.2/envs/work/lib/python3.5/site-packages/sklearn/linear_model/coordinate_descent.py in fit(self, X, y)
       1146                 for train, test in folds)
       1147         mse_paths = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
    -> 1148                              backend="threading")(jobs)
       1149         mse_paths = np.reshape(mse_paths, (n_l1_ratio, len(folds), -1))
       1150         mean_mse = np.mean(mse_paths, axis=1)

    /home/user/programs/repos/pyenv/versions/3.5.2/envs/work/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
        798             # dispatched. In particular this covers the edge
        799             # case of Parallel used with an exhausted iterator.
    --> 800             while self.dispatch_one_batch(iterator):
        801                 self._iterating = True
        802             else:

    /home/user/programs/repos/pyenv/versions/3.5.2/envs/work/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
        656                 return False
        657             else:
    --> 658                 self._dispatch(tasks)
        659                 return True
        660

    /home/user/programs/repos/pyenv/versions/3.5.2/envs/work/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
        564
        565         if self._pool is None:
    --> 566             job = ImmediateComputeBatch(batch)
        567             self._jobs.append(job)
        568             self.n_dispatched_batches += 1

    /home/user/programs/repos/pyenv/versions/3.5.2/envs/work/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __init__(self, batch)
        178         # Don't delay the application, to avoid keeping the input
        179         # arguments in memory
    --> 180         self.results = batch()
        181
        182     def get(self):

    /home/user/programs/repos/pyenv/versions/3.5.2/envs/work/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
         70
         71     def __call__(self):
    ---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
         73
         74     def __len__(self):

    /home/user/programs/repos/pyenv/versions/3.5.2/envs/work/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
         70
         71     def __call__(self):
    ---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
         73
         74     def __len__(self):

    /home/user/programs/repos/pyenv/versions/3.5.2/envs/work/lib/python3.5/site-packages/sklearn/linear_model/coordinate_descent.py in _path_residuals(X, y, train, test, path, path_params, alphas, l1_ratio, X_order, dtype)
        931         avoid memory copies
        932     """
    --> 933     X_train = X[train]
        934     y_train = y[train]
        935     X_test = X[test]

    IndexError: index 60527 is out of bounds for axis 0 with size 41376
I also tried using x_train.as_matrix(), but that didn't work either and gave the same error.
The strange thing is that I can use the split manually:
    cv_split = cv.ShuffleSplit(n=len(x_train), test_size=0.2)
    for tr, te in cv_split:
        print(x_train.as_matrix()[tr], y_train.as_matrix()[tr])

    [[0 0 0 ..., 0 0 1]
     [0 0 0 ..., 0 0 1]
     [0 0 0 ..., 0 0 1]
     ...,
     [0 0 0 ..., 0 0 1]
     [0 0 0 ..., 0 0 1]
     [0 0 0 ..., 0 0 1]] [2 1 1 ..., 1 4 1]
    [[   0    0    0 ...,    0    0    1]
     [1720    0    0 ...,    0    0    1]
     [   0    0    0 ...,    0    0    1]
     ...,
     [ 773    0    0 ...,    0    0    1]
     [   0    0    0 ...,    0    0    1]
     [ 501    1    0 ...,    0    0    1]] [1 1 1 ..., 1 2 1]
What am I not seeing here? Am I doing something wrong, or is this a scikit bug?
Update 1
I just found out that the cv parameter is not a cv.ShuffleSplit object. This is counterintuitive to me, since the docs say
Aren't the cross_validation classes an "object to be used as a cross-validation generator"?
Thanks!
You shouldn't be varying cv in the parameter grid. In a cross-validated grid search, the idea is to have a fixed cross-validation and to use grid search on the other parameters, like this:
    m_model = grid_search.GridSearchCV(
        model,
        # Example grid only; use parameters that your estimator actually accepts.
        {'learning_rate': [0.1, 0.05, 0.02]},
        cv=cv.ShuffleSplit(n=len(x_train), test_size=0.2))
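As for why the original code fails, one plausible reading of the traceback is this. The error reports an array of size 41376, which is exactly 2/3 of the 62064 rows. That is consistent with the outer GridSearchCV first splitting the data itself (assumed here to be its default 3-fold split) and handing only a training fold to LassoCV. The ShuffleSplit objects in the grid were built with n=len(x_train), so they still produce indices for the full data set. Below is a minimal, library-free sketch of that mismatch; `shuffle_split_indices` and `take` are hypothetical helpers written for illustration, not scikit-learn functions.

```python
import random

N_TOTAL = 62064        # rows in the full training set (from the question)
N_OUTER_TRAIN = 41376  # size reported in the IndexError message

# Assumption: the outer grid search used 3 folds, so each inner fit
# receives 2/3 of the rows.
assert N_TOTAL * 2 // 3 == N_OUTER_TRAIN


def shuffle_split_indices(n, test_size, seed=0):
    """Hypothetical stand-in for a shuffle split built with a fixed n."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = int(round(n * test_size))
    return indices[n_test:], indices[:n_test]  # (train, test)


def take(rows, idx):
    """Select rows by position, like X[train] on an array."""
    return [rows[i] for i in idx]


# Inner split built for the FULL data set, as in the question.
train_idx, test_idx = shuffle_split_indices(N_TOTAL, test_size=0.2)

# What the inner estimator actually receives: one outer training fold.
outer_fold = list(range(N_OUTER_TRAIN))

try:
    take(outer_fold, train_idx)
    mismatch = False
except IndexError:
    # Some index >= 41376 was requested from a 41376-row fold.
    mismatch = True

# Using the same indices on the full data works, which matches the
# manual loop in the question.
full_data = list(range(N_TOTAL))
manual_ok = len(take(full_data, train_idx)) == len(train_idx)

print("inner split fails on outer fold:", mismatch)
print("manual use on full data works:", manual_ok)
```

This would also explain why the manual loop in the question works: there the indices are applied to the full arrays. Note that the cross_validation and grid_search modules used here were later replaced by sklearn.model_selection, where ShuffleSplit no longer takes n and instead derives indices from the data passed to split.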