Regularized linear vs. RKHS-regression

Question

I'm studying the difference between regularization in RKHS regression and linear regression, but I have a hard time grasping the crucial difference between the two.

Given input-output pairs $(x_i,y_i)$, I want to estimate a function $f(\cdot)$ as follows \begin{equation}f(x)\approx u(x)=\sum_{i=1}^m \alpha_i K(x,x_i),\end{equation} where $K(\cdot,\cdot)$ is a kernel function. The coefficients $\alpha_i$ can be found by solving \begin{equation} \min _{\alpha\in R^{n}}\frac{1}{n}\|Y-K\alpha\|_{R^{n}}^{2}+\lambda \alpha^{T}K\alpha,\end{equation} where, with some abuse of notation, the $i,j$'th entry of the kernel matrix $K$ is $K(x_{i},x_{j})$. This gives \begin{equation} \alpha^*=(K+\lambda nI)^{-1}Y. \end{equation} Alternatively, we could treat the problem as a normal ridge regression/linear regression problem: \begin{equation} \min _{\alpha\in R^{n}}\frac{1}{n}\|Y-K\alpha\|_{R^{n}}^{2}+\lambda \alpha^{T}\alpha,\end{equation} with solution \begin{equation} \alpha^*=(K^{T}K +\lambda nI)^{-1}K^{T}Y. \end{equation}
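For concreteness, here is a minimal NumPy sketch of the two closed-form solutions above. The Gaussian kernel, the toy data, and the function names (`gaussian_kernel_matrix`, `rkhs_coefficients`, `ridge_coefficients`) are illustrative assumptions, not part of the question itself.

```python
import numpy as np

def gaussian_kernel_matrix(x, length_scale=0.2):
    """Kernel matrix with entries exp(-(x_i - x_j)^2 / (2 * length_scale^2)).

    The Gaussian kernel is an assumed example; any symmetric PSD kernel works.
    """
    diff = x[:, None] - x[None, :]
    return np.exp(-diff**2 / (2.0 * length_scale**2))

def rkhs_coefficients(K, Y, lam):
    """Solve (K + lam * n * I) alpha = Y, the RKHS-penalized solution."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * n * np.eye(n), Y)

def ridge_coefficients(K, Y, lam):
    """Solve (K^T K + lam * n * I) alpha = K^T Y, the ridge-penalized solution."""
    n = K.shape[0]
    return np.linalg.solve(K.T @ K + lam * n * np.eye(n), K.T @ Y)

# Toy data (assumed): noisy samples of a sine wave on [0, 1].
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)
Y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

K = gaussian_kernel_matrix(x)
lam = 0.1
alpha_rkhs = rkhs_coefficients(K, Y, lam)
alpha_ridge = ridge_coefficients(K, Y, lam)

print("RKHS  alpha:", alpha_rkhs)
print("ridge alpha:", alpha_ridge)
```

Both coefficient vectors define a fit $u = K\alpha$ on the sample points; they differ only through the penalty used to select $\alpha$.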

What would be the crucial difference between these two approaches and their solutions?

The first version only makes sense if $Y$ is a sampled version (and moreover of the same size as $x$), but the second version also works if $Y$ is actually a function. Btw, in inverse problems the former is called Lavrentiev regularization while the latter is called Tikhonov regularization.
– Dirk, Commented Feb 16, 2018 at 23:22

R Hahn · Accepted Answer · 2018-02-22 04:52:27Z


Both of the penalties can be thought of as arising from the linear regression setting in a Bayesian framework with predictor matrix $K$ and a Gaussian prior over the vector $\alpha$, centered at zero with prior variance $V$.

In the ridge regression case $V = n^{-1}\lambda^{-1}I$ and in the other case $V = n^{-1}\lambda^{-1}K^{-1}$ (as a kernel matrix $K$ is symmetric and PSD; I'm also assuming it is invertible). This follows just by equating terms; taking unit noise variance, the posterior mean has the form $(K^tK + V^{-1})^{-1}K^tY$. Plugging in $V = n^{-1}\lambda^{-1}K^{-1}$ and using the symmetry of $K$ (so that $K^t=K$ commutes with $K + n\lambda I$) gives $$(K^tK + n\lambda K)^{-1}K^tY = (K^t + n\lambda I)^{-1}K^{-1}K^tY = (K + n\lambda I)^{-1}Y.$$ Anyway, this is all just definitions, but the perspective might be intuition-boosting: the RKHS version stipulates explicitly that the prior over $\alpha$ has higher precision (more regularization) along directions of high variation as defined by the kernel function.
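As a quick numerical sanity check of this identity, the following sketch compares the posterior mean $(K^tK + V^{-1})^{-1}K^tY$ with the two closed-form solutions from the question. The random symmetric positive definite matrix standing in for $K$, the unit noise variance, and the helper name `posterior_mean` are assumptions made for illustration.

```python
import numpy as np

def posterior_mean(K, Y, V):
    """Posterior mean (K^T K + V^{-1})^{-1} K^T Y for prior N(0, V), unit noise variance."""
    return np.linalg.solve(K.T @ K + np.linalg.inv(V), K.T @ Y)

rng = np.random.default_rng(1)
n, lam = 6, 0.05

# Assumed stand-in for a kernel matrix: symmetric, positive definite, invertible.
B = rng.standard_normal((n, n))
K = B @ B.T + 0.5 * np.eye(n)
Y = rng.standard_normal(n)

# Prior covariances for the two penalties.
V_ridge = np.eye(n) / (n * lam)
V_rkhs = np.linalg.inv(K) / (n * lam)

# Closed-form solutions from the question.
ridge_direct = np.linalg.solve(K.T @ K + n * lam * np.eye(n), K.T @ Y)
rkhs_direct = np.linalg.solve(K + n * lam * np.eye(n), Y)

ridge_matches = np.allclose(posterior_mean(K, Y, V_ridge), ridge_direct)
rkhs_matches = np.allclose(posterior_mean(K, Y, V_rkhs), rkhs_direct)
print("ridge prior reproduces ridge solution:", ridge_matches)
print("RKHS prior reproduces RKHS solution:", rkhs_matches)
```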


I don't see how the Bayesian interpretation helps. The Bayesian priors and the immediate look at the penalty terms tell us exactly the same: (i) in the RKHS case, the most penalized $\alpha$'s are the eigenvectors of $K$ with the largest eigenvalue, and the least penalized $\alpha$'s are the eigenvectors of $K$ with the smallest eigenvalue, whereas (ii) in the ridge case, all directions of $\alpha$ are equally penalized. I think the question is this: What actually are the larger- and smaller-eigenvalue directions, depending on the degree of smoothness of the kernel $K$?
– Iosif Pinelis, Commented Feb 23, 2018 at 13:36

@IosifPinelis Yeah, I agree. The OP wrote of "RKHS regression and linear regression" and I was just pointing out that you can think of both as linear regression. I mainly wanted to point out that the prior/penalty shows up in the form of $\alpha^*$ in a particular way; this is obscured because $K$ also shows up from the likelihood portion, so I think it is helpful to write the $\alpha^*$ in terms of $V$ and then make the substitution.
– R Hahn, Commented Feb 23, 2018 at 21:56


Carlo Beenakker · Accepted Answer · 2018-02-16 23:59:43Z

To appreciate the difference, it is helpful to consider the case that $K$ is invertible. For small $\lambda$ the solution should then be close to $\alpha^\ast=K^{-1}Y\equiv\alpha_0$.

For the first solution, the RKHS regularization, expanding $(K+n\lambda I)^{-1}=(I+n\lambda K^{-1})^{-1}K^{-1}$ in powers of $\lambda$ gives $$\alpha^\ast=\alpha_0 -n\lambda K^{-1}\alpha_0 + {\cal O}(\lambda^2).$$ For the second solution, one finds instead $$\alpha^\ast=\alpha_0 -n\lambda (K^TK)^{-1}\alpha_0 + {\cal O}(\lambda^2).$$ When the smallest eigenvalues of $K$ become of order $\epsilon\rightarrow 0$, the deviation of $\alpha^\ast$ from $\alpha_0$ in the first case is of order $n\lambda/\epsilon$, while the deviation in the second case is larger, of order $n\lambda/\epsilon^2$. This is why the RKHS regularization is preferable.
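The scaling of the two deviations can be checked numerically. The sketch below is a toy construction, not part of the answer: it assumes a symmetric $K$ with prescribed eigenvalues (the smallest being `eps`), an unregularized solution $\alpha_0$ with unit weight on every eigenvector, and a value of $n\lambda$ (`n_lam`) much smaller than $\epsilon^2$ so that the first-order regime applies.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5

# Random orthogonal eigenvectors and prescribed eigenvalues (assumed toy setup).
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
eps = 1e-2
eigvals = np.array([1.0, 0.8, 0.6, 0.4, eps])
K = Q @ np.diag(eigvals) @ Q.T
K = (K + K.T) / 2.0  # remove rounding asymmetry

# Unregularized solution with unit weight on each eigenvector, and matching data.
alpha0 = Q @ np.ones(n)
Y = K @ alpha0

n_lam = 1e-6  # the product n * lambda, chosen small compared with eps**2

alpha_rkhs = np.linalg.solve(K + n_lam * np.eye(n), Y)
alpha_ridge = np.linalg.solve(K.T @ K + n_lam * np.eye(n), K.T @ Y)

dev_rkhs = np.linalg.norm(alpha_rkhs - alpha0)
dev_ridge = np.linalg.norm(alpha_ridge - alpha0)

print("RKHS  deviation:", dev_rkhs, " vs n*lambda/eps   =", n_lam / eps)
print("ridge deviation:", dev_ridge, " vs n*lambda/eps^2 =", n_lam / eps**2)
```

In this setup the deviation is dominated by the component along the eigenvector with eigenvalue `eps`, which is why the two norms track $n\lambda/\epsilon$ and $n\lambda/\epsilon^2$ respectively.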

Iosif Pinelis · Accepted Answer · 2018-02-22 05:13:59Z

The difference is of course that the two penalty terms, $\alpha^{T}K\alpha$ and $\alpha^{T}\alpha$, penalize rather differently. Suppose that $n=m$ is very large and $K(x,y)=K(x-y)$ for some (say even) function $K$ (typical cases should be similar to this).

Then we can consider an infinite-dimensional approximation of this finite-dimensional setting. Let us see how the kernel $K$ acts on the harmonic $e_k$ of frequency $k\in\mathbb R$ given by the formula $e_k(x):=e^{ikx}$ for real $x$: \begin{equation} (Ke_k)(x)=\int_{-A}^A K(x-y)e^{iky}\,dy=\int_{x-A}^{x+A} K(u)e^{ik(x-u)}\,du\approx\lambda_k e_k(x), \end{equation} where $A\in(0,\infty)$ is very large and \begin{equation} \lambda_k:=\int_{-\infty}^\infty K(u)e^{-iku}\,du, \end{equation} so that $e_k$ is an approximate eigenvector of $K$ with approximate eigenvalue $\lambda_k$. If $|k|\to\infty$ then, by an appropriate version of the Riemann--Lebesgue lemma, $\lambda_k$ goes to $0$; the smoother $K$ is, the faster this convergence. So, the RKHS penalizer is lenient with respect to high-frequency harmonics $e_k$, with large $|k|$; that is, $e_k^T Ke_k=(Ke_k,e_k)\approx\lambda_k(e_k,e_k)=2A\lambda_k$ with $\lambda_k$ small. Accordingly, with the total size $\|\alpha\|_2$ of the minimizing mixture $\alpha$ of harmonics $e_k$ fixed, the RKHS penalty term penalizes mainly the low-frequency constituent harmonics $e_k$ of $\alpha$, with $|k|$ comparatively small. This behavior may help the minimizer capture fine, high-frequency features of the unknown function $f$ being estimated. However, such behavior may be less desirable when there is prior knowledge that the true $f$ is rather smooth, since some of its smooth, low-frequency constituent harmonics are then partially penalized away.

The ridge penalty term $\alpha^{T}\alpha=(\alpha,\alpha)$ can actually be considered a special case of $\alpha^{T}K\alpha=(K\alpha,\alpha)$ with $K(x-y)=\delta(x-y)$, the delta-function kernel, which is of course very non-smooth. This latter kernel treats all the harmonic frequencies whatsoever absolutely equally: $\delta e_k=e_k$ for all $k$; it is the ultimate "equal opportunity" penalizer, in contrast with the smooth-kernel one.
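The contrast between the two penalties on harmonics can be illustrated numerically. The sketch below is a discrete stand-in under stated assumptions: a periodic grid (so that the kernel matrix is circulant and real harmonics are exact eigenvectors), a Gaussian kernel of assumed width 4 grid points, and a hypothetical helper `harmonic`.

```python
import numpy as np

n = 256
grid = np.arange(n)

# Translation-invariant Gaussian kernel on a periodic grid (assumed setup):
# K[i, j] depends only on the circular distance between i and j.
width = 4.0
circular_dist = np.minimum(grid, n - grid)
first_row = np.exp(-circular_dist**2 / (2.0 * width**2))
K = first_row[(grid[:, None] - grid[None, :]) % n]

def harmonic(k):
    """Unit-norm real harmonic cos(2*pi*k*x/n) of integer frequency k."""
    v = np.cos(2.0 * np.pi * k * grid / n)
    return v / np.linalg.norm(v)

frequencies = (1, 8, 32)
rkhs_penalty = {k: harmonic(k) @ K @ harmonic(k) for k in frequencies}
ridge_penalty = {k: harmonic(k) @ harmonic(k) for k in frequencies}

for k in frequencies:
    print(f"k={k:3d}  alpha^T K alpha = {rkhs_penalty[k]:.4f}  "
          f"alpha^T alpha = {ridge_penalty[k]:.4f}")
```

For unit-norm harmonics the ridge penalty is the same at every frequency, while the kernel penalty shrinks as the frequency grows, which mirrors the decay of $\lambda_k$ described above.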

One should also note that, if $K$ is smooth, the estimate $K\alpha$ of $f$ will to an extent suppress the constituent high-frequency harmonics of $\alpha$, whether $\alpha$ is the RKHS minimizer or the ridge one. However, it should be clear from the above discussion that the overall suppression of the high-frequency harmonics will be relatively less in the RKHS case.
