I remember seeing some discussion (possibly in several unrelated threads) of the restrictions that sklearn imposes on modelling as a framework. One such restriction is pipelining `(X, y)` in separate "pipes" (which makes sense because `y` is unavailable at prediction time; it is related to the fact that `LabelEncoder` and `OrdinalEncoder` are separate estimators, and has something to do with metadata routing). Another such restriction is that `n_samples`, the first dimension of `X` and `y`, is the same. I am interested in the latter. I couldn't recall or find where I saw this discussion, so please refer me to such a place if it exists.
If I understand correctly, scikit-learn aims at models that fit this improvised graphical model (plate notation, shading denotes available data, graphviz-online):
First, infer the parameters, as in `model.fit(X_train, y_train)`.
Next, use the fitted parameters to predict: `y_test = model.predict(X_test)`.
There could be some variations (e.g. missing-value imputation in `X_train` and `X_test`), but `X` and `y` being inside the same plate with `n_samples` samples seems to be the unshakeable assumption. Or am I wrong?
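For concreteness, scikit-learn enforces this assumption with length checks such as `sklearn.utils.validation.check_consistent_length` (also called from `check_X_y`). A dependency-free mimic of that constraint (a sketch of the behaviour, not the actual implementation):

```python
# Sketch mimicking scikit-learn's consistent-length validation
# (the real check lives in sklearn.utils.validation.check_consistent_length).
def check_consistent_length(*arrays):
    """Raise ValueError unless all non-None arrays share the same length."""
    lengths = {len(a) for a in arrays if a is not None}
    if len(lengths) > 1:
        raise ValueError(
            "Found input variables with inconsistent numbers of samples: "
            f"{sorted(lengths)}"
        )

X = [[0.0], [1.0], [2.0]]   # n_samples = 3
y = [0, 1, 1]               # n_samples = 3 -> accepted
check_consistent_length(X, y)

try:
    check_consistent_length(X, y[:2])  # 3 vs 2 -> rejected
except ValueError as exc:
    mismatch_error = exc
```

So any estimator that wants to live inside `Pipeline`, `GridSearchCV`, etc. is expected to present `X` and `y` with a shared first axis.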
Here is my problem. I have a model for which I want to use scikit-learn's `GridSearchCV` and cross-validation infrastructure. When I started packaging it as an estimator class, I realised I had not really been thinking in terms of `X` and `y` throughout the pipeline. To be specific, I have graph-like data `A` (nodes and edges), from which I construct a set of "different edges" `B`, and from it I learn node features `C` (the model parameters). Next, the model's task is to predict the "different edges'" properties (orientation, weight, etc.), call them `b`, which has the same number of rows as `B`. So far it seems reasonable to take `X = B` and `y = b`. However, to protect myself from data leakage of all kinds, I want the estimator to take `A` as input and produce `b` as output (the number of elements in `A` is not the same as in `B` or `b`). How should I think about my model? Currently I am thinking in the direction of `model.fit(X=A_train, y=None)`, `y_pred = model.predict(X=A_test)`, and `y_true = ground_truth.fit_predict(X=A_test)`.
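To make the direction I'm describing concrete, here is a minimal sketch of such an estimator. Everything in it is invented for illustration: the `Graph` container, `derive_edges` (the `A → B` construction), and the "predict a weight from node degrees" logic are hypothetical stand-ins for my actual model. It is duck-typed rather than inheriting `sklearn.base.BaseEstimator` (a real version would inherit it), just to keep the sketch dependency-free:

```python
# Sketch: an estimator whose input X is the raw graph A, so the derivation
# of B (the "different edges") and C (node features) happens entirely inside
# fit/predict -- no test-set structure leaks into training.
# All names (Graph, derive_edges, GraphEstimator) are hypothetical.

class Graph:
    """Toy graph container: nodes plus (u, v) edges."""
    def __init__(self, nodes, edges):
        self.nodes = list(nodes)
        self.edges = list(edges)

def derive_edges(graph):
    """Stand-in for the A -> B construction; here B is just the edge list."""
    return list(graph.edges)

class GraphEstimator:
    """Duck-typed sklearn-style estimator: fit(A, y=None), predict(A)."""

    def __init__(self, default_weight=1.0):
        self.default_weight = default_weight

    # get_params/set_params are what GridSearchCV needs to clone and
    # reconfigure the estimator between folds.
    def get_params(self, deep=True):
        return {"default_weight": self.default_weight}

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

    def fit(self, X, y=None):
        B = derive_edges(X)                      # A_train -> B_train, inside fit
        # Learn node features C; here, simply each node's degree within B.
        self.node_features_ = {n: 0 for n in X.nodes}
        for u, v in B:
            self.node_features_[u] += 1
            self.node_features_[v] += 1
        return self

    def predict(self, X):
        B = derive_edges(X)                      # A_test -> B_test, inside predict
        # One prediction per row of B (len(B) rows, not len(X.nodes)).
        return [
            self.default_weight
            * (self.node_features_.get(u, 0) + self.node_features_.get(v, 0))
            for u, v in B
        ]

# Usage: the estimator only ever sees A; b_pred has len(B_test) rows.
A_train = Graph(nodes=[0, 1, 2], edges=[(0, 1), (1, 2)])
A_test = Graph(nodes=[0, 1, 2], edges=[(0, 2)])
model = GraphEstimator().fit(A_train)
b_pred = model.predict(A_test)
```

One catch I can already see with `X = A`: sklearn's cross-validation splitters index the first axis of `X`, so with a single graph as `X` I would presumably also need a custom `cv` splitter (or to pass a collection of subgraphs) for `GridSearchCV` to work.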
I'd appreciate any suggestions! I haven't practiced scikit-learn for a while, so there could be something obvious I am missing. I am also going to look at scikit-network to better understand ML on networks and their framework.