Let \begin{equation} k(x,y) = \sigma \exp\left(-\frac{(x-y)^2}{2\theta^2}\right)\end{equation} be a squared-exponential (Gaussian) kernel with $\sigma,\theta>0$. For a set of $N$ distinct points $x_1,\ldots, x_N \in \mathbb{R}$, consider the corresponding kernel matrix $\mathbf{K}$ with entries \begin{equation} K_{ij} = k(x_i,x_j). \end{equation} In the book by Rasmussen and Williams (http://www.gaussianprocess.org/gpml/chapters/RW.pdf, page 113), it is stated that the complexity penalty $\log \vert \mathbf{K}\vert$ of a Gaussian process model with kernel matrix $\mathbf{K}$ decreases with the lengthscale, i.e., \begin{equation} \frac{d\log \vert \mathbf{K} \vert}{d\theta} \leq 0. \end{equation} Even though this seems to be common knowledge among people who work with Gaussian processes, I am struggling to prove it and would like to know how.
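
For what it is worth, here is a quick numerical sanity check of the claim (not a proof); the inputs, $\sigma$, and the lengthscale grid below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=8)      # arbitrary distinct inputs
sigma = 1.5                              # arbitrary choice of sigma

def log_det_K(theta):
    """log|K| for the squared-exponential kernel with lengthscale theta."""
    K = sigma * np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * theta ** 2))
    return np.linalg.slogdet(K)[1]

thetas = np.linspace(0.1, 1.5, 30)
logdets = np.array([log_det_K(t) for t in thetas])
print(np.max(np.diff(logdets)))          # should be <= 0 if log|K| is nonincreasing in theta
```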

Edit: I have tried employing the following:

  • First, note that $\vert \mathbf{K} \vert = \prod_{n=1}^N \sigma_n(x_n)$, where $\sigma_n(x_n) = k(x_n,x_n) - \mathbf{k}_n^{\top}\mathbf{K}_n^{-1}\mathbf{k}_n$, with the elements of $\mathbf{k}_n \in \mathbb{R}^{n-1}$ and $\mathbf{K}_n \in \mathbb{R}^{(n-1)\times(n-1)}$ given by $k_{n,i}=k(x_n,x_i)$ and $K_{n,ij}= k(x_i,x_j)$ for $i,j = 1,\ldots,n-1$. That is, $\sigma_n(x_n)$ denotes the posterior GP variance at $x_n$ with respect to the first $n-1$ data points (for $n=1$, $\sigma_1(x_1)=k(x_1,x_1)$). This identity follows directly from the Schur determinant formula.
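
A quick numerical check of this factorization (again with arbitrary choices of the inputs, $\sigma$, and $\theta$):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3.0, 3.0, size=8)
sigma, theta = 1.5, 0.5                   # arbitrary choices

K = sigma * np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * theta ** 2))

# log|K| as the sum of log incremental conditional variances
# sigma_n(x_n) = k(x_n, x_n) - k_n^T K_n^{-1} k_n (Schur complements).
log_prod = 0.0
for n in range(len(x)):
    k_n = K[n, :n]                        # covariances to the first n points
    K_n = K[:n, :n]
    var_n = K[n, n] - (k_n @ np.linalg.solve(K_n, k_n) if n > 0 else 0.0)
    log_prod += np.log(var_n)

print(np.isclose(log_prod, np.linalg.slogdet(K)[1]))   # expect True
```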

Since this factorization gives $\log\vert\mathbf{K}\vert = \sum_{n=1}^N \log \sigma_n(x_n)$, the desired result would follow if every $\sigma_n(x_n)$ decreased monotonically with the lengthscale. To show this, I am trying to apply the following representation, which can be found, e.g., in https://arxiv.org/pdf/1704.00445.pdf:

  • $\sigma_n(x) = \sigma \langle k(x,\cdot), (\Phi_n^{\top} \Phi_n + \sigma I)^{-1} k(x,\cdot) \rangle_k$, where $\langle \cdot, \cdot \rangle_k$ denotes the inner product of the reproducing kernel Hilbert space (RKHS) $H_k$ with reproducing kernel $k(\cdot,\cdot)$, and $\Phi_n: H_k \rightarrow \mathbb{R}^n$ is the linear operator $\Phi_n = (k(x_1,\cdot),\ldots,k(x_n,\cdot))^{\top}$, so that $\Phi_n^{\top}\Phi_n$ maps $H_k$ to itself (the relevant case here is $x = x_n$).
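
Unless I am mistaken, by the operator version of the matrix inversion lemma this RKHS expression can be rewritten in the finite-dimensional form \begin{equation} \sigma \langle k(x,\cdot), (\Phi_n^{\top}\Phi_n + \sigma I)^{-1} k(x,\cdot) \rangle_k = k(x,x) - \mathbf{k}_{1:n}(x)^{\top}(\mathbf{K}_{1:n} + \sigma I)^{-1}\mathbf{k}_{1:n}(x), \end{equation} where $\mathbf{k}_{1:n}(x) \in \mathbb{R}^n$ has entries $k(x,x_i)$ and $\mathbf{K}_{1:n} \in \mathbb{R}^{n\times n}$ has entries $k(x_i,x_j)$ for $i,j=1,\ldots,n$, i.e., the usual GP posterior variance with noise variance $\sigma$. This is the form used in the numerical check further below.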

Now, it can easily be shown that, for two lengthscales $\theta < \tilde{\theta}$ with corresponding kernels $k_{\theta}(\cdot,\cdot)$ and $k_{\tilde\theta}(\cdot,\cdot)$, we have \begin{equation} \langle k_{\theta}(x,\cdot), (\Phi_{\theta,n}^{\top} \Phi_{\theta,n} + \sigma I) k_{\theta}(x,\cdot) \rangle_{k_{\theta}} \leq \langle k_{\tilde\theta}(x,\cdot), (\Phi_{\tilde\theta,n}^{\top} \Phi_{\tilde\theta,n} + \sigma I) k_{\tilde\theta}(x,\cdot) \rangle_{k_{\tilde\theta}} . \end{equation}
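
Indeed, unless I am overlooking something, expanding the quadratic form gives \begin{equation} \langle k_{\theta}(x,\cdot), (\Phi_{\theta,n}^{\top} \Phi_{\theta,n} + \sigma I) k_{\theta}(x,\cdot) \rangle_{k_{\theta}} = \sum_{i=1}^n k_{\theta}(x,x_i)^2 + \sigma\, k_{\theta}(x,x), \end{equation} and for the squared-exponential kernel each $k_{\theta}(x,x_i)$ is nondecreasing in the lengthscale, while $k_{\theta}(x,x)=\sigma$ does not depend on it.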

What I am now wondering is: does this also imply \begin{equation} \langle k_{\theta}(x,\cdot), (\Phi_{\theta,n}^{\top} \Phi_{\theta,n} + \sigma I)^{-1} k_{\theta}(x,\cdot) \rangle_{k_{\theta}} \geq \langle k_{\tilde\theta}(x,\cdot), (\Phi_{\tilde\theta,n}^{\top} \Phi_{\tilde\theta,n} + \sigma I)^{-1} k_{\tilde\theta}(x,\cdot) \rangle_{k_{\tilde\theta}} \,? \end{equation}
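
For what it is worth, here is a quick numerical probe of this conjectured inequality (again not a proof), using the finite-dimensional form of the quadratic form noted above; the conditioning points, the test point, $\sigma$, and the lengthscale grid are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
xs = rng.uniform(-3.0, 3.0, size=6)       # conditioning points (arbitrary)
x_star = 0.3                               # arbitrary test point
sigma = 1.5

def quad_form_inv(theta, x):
    """<k(x,.), (Phi_n^T Phi_n + sigma I)^{-1} k(x,.)>_k via the matrix form above."""
    k_vec = sigma * np.exp(-(x - xs) ** 2 / (2.0 * theta ** 2))
    K = sigma * np.exp(-(xs[:, None] - xs[None, :]) ** 2 / (2.0 * theta ** 2))
    post_var = sigma - k_vec @ np.linalg.solve(K + sigma * np.eye(len(xs)), k_vec)
    return post_var / sigma                # k(x,x) = sigma for this kernel

thetas = np.linspace(0.1, 2.0, 40)
vals = np.array([quad_form_inv(t, x_star) for t in thetas])
print(np.max(np.diff(vals)))   # <= 0 everywhere would be consistent with the conjecture
```

If the printed maximum difference is nonpositive, the quantity is nonincreasing in $\theta$ on this grid, which would be consistent with the conjectured inequality.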
