I remember seeing some discussion (possibly in several unrelated threads) of the restrictions that sklearn imposes on modelling as a framework. One such restriction is pipelining `(X, y)` in separate "pipes" (which makes sense because `y` is unavailable at prediction time; it is related to the fact that `LabelEncoder` and `OrdinalEncoder` are separate estimators, and has something to do with metadata routing). Another such restriction is that `n_samples`, the first dimension of `X` and `y`, is the same. I am interested in the latter. I couldn't recall or find where I saw this discussion, so please refer me to such a place if it exists.
If I understand correctly, scikit-learn aims at models that fit this improvised graphical model (plate notation, shading denotes available data, graphviz-online):
First, infer the parameters, as in `model.fit(X_train, y_train)`.
Next, use the fitted parameters to predict: `y_test = model.predict(X_test)`.
There could be some variations (e.g. missing-value imputation in `X_train` and `X_test`), but `X` and `y` being inside the same plate with `n_samples` samples seems to be the unshakeable assumption. Or am I wrong?
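For concreteness, scikit-learn enforces this assumption with length checks such as `sklearn.utils.validation.check_consistent_length` (also called from `check_X_y`). A dependency-free mimic of that constraint (a sketch of the behaviour, not the actual implementation):

```python
# Sketch mimicking scikit-learn's consistent-length validation
# (the real check lives in sklearn.utils.validation.check_consistent_length).
def check_consistent_length(*arrays):
    """Raise ValueError unless all non-None arrays share the same length."""
    lengths = {len(a) for a in arrays if a is not None}
    if len(lengths) > 1:
        raise ValueError(
            "Found input variables with inconsistent numbers of samples: "
            f"{sorted(lengths)}"
        )

X = [[0.0], [1.0], [2.0]]   # n_samples = 3
y = [0, 1, 1]               # n_samples = 3 -> accepted
check_consistent_length(X, y)

try:
    check_consistent_length(X, y[:2])  # 3 vs 2 -> rejected
except ValueError as exc:
    mismatch_error = exc
```

So any estimator that wants to live inside `Pipeline`, `GridSearchCV`, etc. is expected to present `X` and `y` with a shared first axis.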
Here is my problem. I have a model for which I want to use scikit-learn's `GridSearchCV` and cross-validation infrastructure. When I started packaging it as an estimator class, I realised I had not really been thinking in terms of `X` and `y` throughout the pipeline. To be specific, I have graph-like data `A` (nodes and edges), from which I construct a set of "different edges" `B`, and from it I learn node features `C` (the model parameters). Next, the model's task is to predict the "different edges'" properties (orientation, weight, etc.), call them `b`, which has the same number of rows as `B`. So far it seems reasonable to take `X = B` and `y = b`. However, to protect myself from data leakage of all kinds, I want the estimator to take `A` as input and produce `b` as output (the number of elements in `A` is not the same as in `B` or `b`). How should I think about my model? Currently I am thinking in the direction of `model.fit(X=A_train, y=None)`, `y_pred = model.predict(X=A_test)`, and `y_true = ground_truth.fit_predict(X=A_test)`.
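To make the direction I'm describing concrete, here is a minimal sketch of such an estimator. Everything in it is invented for illustration: the `Graph` container, `derive_edges` (the `A → B` construction), and the "predict a weight from node degrees" logic are hypothetical stand-ins for my actual model. It is duck-typed rather than inheriting `sklearn.base.BaseEstimator` (a real version would inherit it), just to keep the sketch dependency-free:

```python
# Sketch: an estimator whose input X is the raw graph A, so the derivation
# of B (the "different edges") and C (node features) happens entirely inside
# fit/predict -- no test-set structure leaks into training.
# All names (Graph, derive_edges, GraphEstimator) are hypothetical.

class Graph:
    """Toy graph container: nodes plus (u, v) edges."""
    def __init__(self, nodes, edges):
        self.nodes = list(nodes)
        self.edges = list(edges)

def derive_edges(graph):
    """Stand-in for the A -> B construction; here B is just the edge list."""
    return list(graph.edges)

class GraphEstimator:
    """Duck-typed sklearn-style estimator: fit(A, y=None), predict(A)."""

    def __init__(self, default_weight=1.0):
        self.default_weight = default_weight

    # get_params/set_params are what GridSearchCV needs to clone and
    # reconfigure the estimator between folds.
    def get_params(self, deep=True):
        return {"default_weight": self.default_weight}

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

    def fit(self, X, y=None):
        B = derive_edges(X)                      # A_train -> B_train, inside fit
        # Learn node features C; here, simply each node's degree within B.
        self.node_features_ = {n: 0 for n in X.nodes}
        for u, v in B:
            self.node_features_[u] += 1
            self.node_features_[v] += 1
        return self

    def predict(self, X):
        B = derive_edges(X)                      # A_test -> B_test, inside predict
        # One prediction per row of B (len(B) rows, not len(X.nodes)).
        return [
            self.default_weight
            * (self.node_features_.get(u, 0) + self.node_features_.get(v, 0))
            for u, v in B
        ]

# Usage: the estimator only ever sees A; b_pred has len(B_test) rows.
A_train = Graph(nodes=[0, 1, 2], edges=[(0, 1), (1, 2)])
A_test = Graph(nodes=[0, 1, 2], edges=[(0, 2)])
model = GraphEstimator().fit(A_train)
b_pred = model.predict(A_test)
```

One catch I can already see with `X = A`: sklearn's cross-validation splitters index the first axis of `X`, so with a single graph as `X` I would presumably also need a custom `cv` splitter (or to pass a collection of subgraphs) for `GridSearchCV` to work.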
I'd appreciate any suggestions! I haven't practiced scikit-learn for a while, so there could be something obvious I am missing. I am also going to look at scikit-network to better understand ML on networks and their framework.