FIX instability of `gcv_mode="svd"` in RidgeCV #32506

Conversation

I triggered the CUDA build out of curiosity. However, I agree that I am worried about the performance impact of using the full SVD on large datasets. Can you benchmark main vs the PR branch for various data shapes?

Here are a few benchmarks, with fit time and peak memory:

```python
from sklearn.linear_model import RidgeCV
from sklearn.datasets import make_regression
from time import time
import tracemalloc

def benchmark(n_samples, n_features, alphas, gcv_mode):
    X, y = make_regression(n_samples=n_samples, n_features=n_features)
    tracemalloc.start()
    reg = RidgeCV(alphas=alphas, gcv_mode=gcv_mode)
    t0 = time()
    reg.fit(X, y)
    elapsed = time() - t0
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    peak /= 1024**2  # MB
    print(f"{elapsed=:.2f}s {peak=:.1f}MB {gcv_mode=} {n_samples=}")

benchmark(n_samples=1000, n_features=100, alphas=[1e-15], gcv_mode="svd")
benchmark(n_samples=10_000, n_features=100, alphas=[1e-15], gcv_mode="svd")
```

```
# main
elapsed=0.02s peak=3.2MB gcv_mode='svd' n_samples=1000
elapsed=0.05s peak=31.1MB gcv_mode='svd' n_samples=10000

# PR using full svd
elapsed=0.05s peak=23.8MB gcv_mode='svd' n_samples=1000
elapsed=7.29s peak=2297.0MB gcv_mode='svd' n_samples=10000
```

I think I'll try using the reduced svd of $X$. EDIT: actually that's a dead end :( Using the reduced svd and the Woodbury identity yields the same formula as main (see the sketch below).

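For the record, a sketch of that dead end (my reconstruction, not text from the PR). With the reduced svd $X = U S V^T$, the Woodbury identity gives

$$G^{-1} = (XX^T + \alpha I)^{-1} = \frac{1}{\alpha} I - \frac{1}{\alpha} X \left(X^T X + \alpha I\right)^{-1} X^T = \frac{1}{\alpha} I + U \, \text{diag}\!\left(\frac{1}{s_i^2 + \alpha} - \frac{1}{\alpha}\right) U^T$$

which is exactly the formula used in main, $1/\alpha$ divisions included.
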
The performance overhead is indeed not acceptable. To better handle the […] EDIT: link to the relevant part of […]

@antoinebaker would you be interested in trying to prototype the above approach (possibly in a Google Colab notebook using the array API) and running some benchmarks to check that it's a worthwhile approach (both from a computational and a numerical precision standpoint)?

Also, I think for […] The […]

I think the idea in #32506 (comment) sounds good and worth pursuing when the number of folds is small (say `cv=5` folds). I'll try in a dedicated PR (or a colab as suggested) to improve `RidgeCV` with a given `cv`.

However, for this PR and issue #32459 specifically, i.e. for the […]

Unfortunately, I still don't see how to make the […]

Wouldn't it be possible to first estimate […]?

Alas, computing […]

For the `n_features > n_samples` case, the PR introduces no overhead compared to main:

```
benchmark(n_samples=1000, n_features=10_000, alphas=[1e-15], gcv_mode="svd")
elapsed=2.14s peak=236.8MB gcv_mode='svd' n_samples=1000 n_features=10000  # PR
elapsed=1.93s peak=236.8MB gcv_mode='svd' n_samples=1000 n_features=10000  # main
```

and fixes the numerical instability observed in main. I don't have a good minimal test for this case. We expect `Ridge(alpha -> 0)` to recover the minimum-norm least squares solution in the noiseless underdetermined case. Plotting the fitted coef vs the minimum-norm least squares solution gives in this PR: […]

```
.../sklearn/linear_model/_ridge.py:2253: RuntimeWarning: divide by zero encountered in divide
  squared_errors = (c / G_inverse_diag) ** 2
```

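As a minimal sketch of that expectation (my own check, not part of the PR; the shapes, `alpha=1e-15`, and the use of `Ridge` rather than `RidgeCV` are arbitrary choices), one could compare against `np.linalg.lstsq`:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_samples, n_features = 100, 1000  # underdetermined: n_features > n_samples
X = rng.standard_normal((n_samples, n_features))
y = X @ rng.standard_normal(n_features)  # noiseless targets

# Minimum-norm least squares solution.
coef_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge with a tiny alpha should approach it.
coef_ridge = Ridge(alpha=1e-15, fit_intercept=False).fit(X, y).coef_

print(np.abs(coef_ridge - coef_lstsq).max())  # expected to be ~0
```
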
However, we still have the performance overhead issue for the `n_samples > n_features` case.

I think we can be pragmatic and fix the numerical stability problem only when `n_features > n_samples`. Maybe we could also expand the docstring of `gcv_mode` […]

BTW @antoinebaker, did you set an explicit value to the […]?

Ah good point, I just tried, but it doesn't affect the solution. However, good news! We recover exactly the minimum-norm least squares solution for […]

Interestingly, the full svd induces less overhead than the "eigen" option, which is very slow in the `n_samples >> n_features` case (`gcv_mode="eigen"` eigendecomposes the `n_samples × n_samples` Gram matrix):

```
# eigen, numerically stable
benchmark(n_samples=1000, n_features=100, alphas=[1e-15], gcv_mode="eigen")
benchmark(n_samples=10_000, n_features=100, alphas=[1e-15], gcv_mode="eigen")
elapsed=0.10s peak=23.8MB gcv_mode='eigen' n_samples=1000 n_features=100
elapsed=118.13s peak=2297.0MB gcv_mode='eigen' n_samples=10000 n_features=100

# this PR, full svd, numerically stable
benchmark(n_samples=1000, n_features=100, alphas=[1e-15], gcv_mode="svd")
benchmark(n_samples=10_000, n_features=100, alphas=[1e-15], gcv_mode="svd")
elapsed=0.03s peak=23.8MB gcv_mode='svd' n_samples=1000 n_features=100
elapsed=7.18s peak=2297.0MB gcv_mode='svd' n_samples=10000 n_features=100

# main, reduced svd, numerically unstable
benchmark(n_samples=1000, n_features=100, alphas=[1e-15], gcv_mode="svd")
benchmark(n_samples=10_000, n_features=100, alphas=[1e-15], gcv_mode="svd")
elapsed=0.02s peak=3.2MB gcv_mode='svd' n_samples=1000 n_features=100
elapsed=0.05s peak=31.1MB gcv_mode='svd' n_samples=10000 n_features=100
```

I am so confused now. My previous comment (#32506 (comment)) was wrong: the default […]

Was the numerical stability problem of the candidate change described in #32506 (comment) observed in the […]?

Yep, it breaks […]




Fixes #32459.

What does this implement/fix? Explain your changes.

`RidgeCV` fit needs to estimate $\text{diag}(G^{-1})$ and $c = G^{-1} y$, where $G = XX^T + \alpha I$. They are computed thanks to the `solve` and `decompose` functions chosen by the `gcv_mode` option.

This PR uses the full svd of the design matrix $X = U S V^T$ ($U$ square of shape `(n_samples, n_samples)`, the singular values $s_i$ padded with zeros):

$$G^{-1} = U \, \text{diag}\!\left(\frac{1}{s_i^2 + \alpha}\right) U^T$$

The implementation in main uses the reduced svd $X = U S V^T$ ($U$ of shape `(n_samples, min(n_samples, n_features))`):

$$G^{-1} = \frac{1}{\alpha} I + U \, \text{diag}\!\left(\frac{1}{s_i^2 + \alpha} - \frac{1}{\alpha}\right) U^T$$

This last formula is numerically unstable in the $\alpha \to 0$ limit because of the divisions by $\alpha$.
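
A standalone NumPy sketch (my own illustration, not the scikit-learn implementation; the shapes and `alpha` are arbitrary) makes the cancellation visible by comparing both formulas against a direct inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, alpha = 20, 50, 1e-15
X = rng.standard_normal((n_samples, n_features))

G = X @ X.T + alpha * np.eye(n_samples)
ref = np.diag(np.linalg.inv(G))  # direct reference, well conditioned here

# Full svd: U is (n_samples, n_samples); pad the singular values with
# zeros so diag(G^{-1}) has one entry per sample (a no-op when
# n_features >= n_samples, as here).
U, s, _ = np.linalg.svd(X, full_matrices=True)
s_full = np.zeros(n_samples)
s_full[: s.size] = s
diag_full = U**2 @ (1.0 / (s_full**2 + alpha))

# Reduced svd, rearranged as in main: the 1/alpha terms are huge and
# cancel catastrophically when alpha is tiny.
Ur, sr, _ = np.linalg.svd(X, full_matrices=False)
w = 1.0 / (sr**2 + alpha) - 1.0 / alpha
diag_reduced = Ur**2 @ w + 1.0 / alpha

print(np.max(np.abs(diag_full - ref) / np.abs(ref)))     # tiny (stable)
print(np.max(np.abs(diag_reduced - ref) / np.abs(ref)))  # large (unstable)
```

The shapes are chosen with `n_features > n_samples` so that $G$ stays well conditioned and the direct inverse is a trustworthy reference.
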
Any other comments?

The full svd will be costly (time and memory) when `n_samples >> n_features`. Maybe we should try a trick similar to main (reduced svd), but without the division by `alpha`. Not sure how to do it though :)