Posts on Axect's Blog — https://axect.github.io/posts/ (CC BY-NC 4.0)

📊 Piecewise Rejection Sampling
https://axect.github.io/posts/006_prs/
Fri, 18 Nov 2022 17:49:04 +0900

Differential energy spectrum of ALPs from primordial black hole (PBH)${}^{[1]}$

Suppose you’re presented with an unnormalized probability density function graph such as the one shown. If you’re tasked with generating 10,000 data points that conform to this distribution, what would be your approach?

Here are the two most commonly used methods to sample data from an arbitrary probability density function:

  1. Inverse Transform Sampling
  2. Rejection Sampling

Inverse Transform Sampling involves generating data points by calculating the cumulative distribution function (CDF) of the probability density function, deriving its inverse function, and then using this inverse function to produce data points. This method can be quite efficient, but if the exact form of the probability density function is unknown—as in our case—it becomes challenging to apply${}^{[2]}$. On the other hand, Rejection Sampling is versatile and can be utilized regardless of the probability density function’s form. Therefore, we’ll begin with Rejection Sampling.
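To make the first method concrete: for a distribution whose CDF has a closed-form inverse, such as the exponential distribution, Inverse Transform Sampling takes only a couple of lines. Here is a minimal NumPy sketch (my own illustration, not part of the original post's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Exp(lam): CDF F(x) = 1 - exp(-lam * x)  =>  F^{-1}(u) = -ln(1 - u) / lam
lam = 2.0
u = rng.uniform(0.0, 1.0, 100_000)   # U ~ Unif(0, 1)
x = -np.log(1.0 - u) / lam           # X = F^{-1}(U) ~ Exp(lam)

print(x.mean())  # should be close to 1/lam = 0.5
```

This works precisely because the exact CDF (and its inverse) is known in closed form — the very assumption that fails for the distribution in our figure.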


1. Rejection Sampling

   Rejection sampling, a.k.a. the acceptance-rejection method, employs a straightforward algorithm. To describe it, let's represent the target probability density function as $f(x)$. When direct sampling from $f(x)$ is not feasible, we introduce an auxiliary function $g(x)$, which is easy to sample from. For a chosen positive constant $M$, it's required that $f(x) \leq M \cdot g(x)$ across the domain of $x$. This relationship is depicted in the figure below:

A graph commonly seen in particle physics

Typically, the uniform distribution is the simplest to sample from, so we define $g(x)$ as $\text{Unif}(x|0,10)$. Given that the peak value of $f(x)$ is 1, we select $M=10$ to ensure that $M \times g(x)$ consistently surpasses or meets $f(x)$. The sampling process using this graph is as follows:

  1. Draw a sample $y$ from $g(x)$, which in our case would be a value from $\text{Unif}(x|0,10)$.

  2. For the chosen $y$, draw a sample $u$ from $\text{Unif}(u|0,M \cdot g(y))$. This step always utilizes a uniform distribution, regardless of $g(x)$’s configuration.

  3. If $u$ is less than or equal to $f(y)$, then $y$ is accepted as a sampled value from the desired distribution $f(x)$. If not, the process is repeated from step 1.

This method is named ‘rejection sampling’ because samples where $u > f(y)$ are discarded. By following this algorithm, we can simulate sampling from the desired probability distribution $f(x)$. To see why this works, let's examine the cumulative distribution function (CDF) of the sampled values.

Let $X$ denote the random variable obtained by Rejection sampling. For the two auxiliary random variables $Y \sim g(y)$ and $U \sim \text{Unif}(u|0, M\cdot g(y))$, we can establish the following relationship:

$$ P(X \leq x) = P\left(Y \leq x \,|\, U < f(Y)\right) $$

The conditional probability on the right-hand side is the probability that $Y$ is at most $x$ given that it is not rejected, exactly as in step 3 of the algorithm above. Using the definition of conditional probability, we can transform this into:

$$ P(X \leq x) = \frac{P (Y \leq x,~U < f(Y))}{P(U < f(Y))} $$

First, let's expand the numerator using the probability density function:

$$ \begin{aligned} P(Y \leq x,~U < f(Y)) &= \int P(Y \leq x,~U < f(Y) \,|\, Y = y) \cdot g(y) \,dy \\ &= \int P(y \leq x,~U < f(y)) \cdot g(y) \,dy \\ &= \int 𝟙_{y \leq x} \cdot P(U < f(y))\cdot g(y) \, dy \\ &= \int_{-\infty}^x P(U < f(y)) \cdot g(y) \, dy \end{aligned} $$

When moving from the second to the third line, we use the fact that once $Y = y$ is fixed, the event $y \leq x$ is deterministic, so the joint probability factors into the indicator $𝟙_{y \leq x}$ times $P(U < f(y))$. Now, since $U \sim \text{Unif}(u|0,\,M\cdot g(y))$, substituting $\displaystyle P(U < f(y)) = \frac{1}{M\cdot g(y)} \times (f(y) - 0)$ gives:

$$ \begin{aligned} P(Y \leq x,~U < f(Y)) &= \int_{-\infty}^x \frac{f(y)}{M\cdot g(y)}\cdot g(y) \, dy \\ &= \frac{1}{M} \int_{-\infty}^x f(y) \, dy \end{aligned} $$
Now let's find the denominator:

$$ \begin{aligned} P(U < f(Y)) &= \int P(U < f(y)) \cdot g(y) \, dy \\ &= \int \frac{f(y)}{M\cdot g(y)} \cdot g(y) \, dy \\ &= \frac{1}{M} \int f(y) \, dy \\ &= \frac{1}{M} \end{aligned} $$

Note that the last step assumes $f$ is normalized, i.e. $\int f(y) \, dy = 1$. Finally, dividing the numerator by the denominator, we get:

$$ P(X \leq x) = \int_{-\infty}^x f(y) \, dy $$

This is the cumulative distribution function of $f(x)$. Thus, we can conclude that the probability density function of the random variable $X$ obtained by Rejection sampling is $f(x)$.

Now that the mathematical proof is complete, let’s write some code to see if this actually works well in practice.

// Rust
use peroxide::fuga::*;

const M: f64 = 10.0;
const N: usize = 100_000;

fn main() {
    // Create g(x) = Unif(x|0,10); u is drawn from Unif(0, M·g(y)) = Unif(0, 1)
    let g = Uniform(0.0, 10.0);
    let h = Uniform(0.0, M * 0.1); // M·g(y) = 10 × (1/10) = 1

    // Rejection sampling
    let mut x_vec = vec![0f64; N];
    let mut i = 0usize;
    while i < N {
        let x = g.sample(1)[0];
        let y = h.sample(1)[0];

        if y <= f(x) {      // Accept
            x_vec[i] = x;
            i += 1;
        } else {            // Reject
            continue;
        }
    }

    // ...
}

// Test function
fn f(x: f64) -> f64 {
    1f64 / (x+1f64).sqrt() + 0.2 * (-(x-3f64).powi(2) / 0.2).exp()
}

The algorithm itself is simple, so the code is very straightforward. Even so, the results are impressive.

Result of rejection sampling

Rejection sampling is not limited by the form of the probability density function and is easy to implement, but it has a critical downside: computational inefficiency. Every candidate must survive the rejection condition before it counts as a sample. This means that the higher $P(U < f(Y))$ is, the faster samples can be obtained; conversely, the lower it is, the longer it takes to secure a sufficient number of samples. This quantity is known as the Acceptance rate, which we already calculated in the proof above.


Acceptance rate
The Acceptance rate of Rejection sampling is defined as follows: $$ P(U < f(Y)) = \int P(U < f(y)) \cdot g(y) \, dy $$

In the algorithm we used, this is equivalent to $1/M$, which amounts to dividing the area occupied by $f(x)$ by the total area occupied by $M \cdot g(x)$. Therefore, the larger the difference between the two distributions, the lower the Acceptance rate, and the longer it takes to secure samples. Our example exhibits only a minor discrepancy between $g(x)$ and $f(x)$, which is manageable. However, if the probability distribution has extensive near-zero regions, like the one we introduced at the beginning, and we use a uniform distribution for $g(x)$, the majority of samples will be rejected. This not only increases the time required for sampling but may also render it virtually impossible to sample effectively from areas where the probability density is close to zero, such as the distribution's tails. To circumvent this issue, we must seek more efficient methods that can improve the acceptance rate for distributions with extensive zero-value regions.
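To make this concrete, here is a small NumPy experiment (my own sketch, not from the post) that measures the empirical acceptance rate for the test function used in the Rust code above, with a hypothetical envelope of height 1.2 over $[0, 10]$, and compares it to the ratio of the area under $f$ to the area of the envelope:

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    # Same test function as in the Rust example (unnormalized)
    return 1.0 / np.sqrt(x + 1.0) + 0.2 * np.exp(-(x - 3.0) ** 2 / 0.2)

# Envelope: a flat line of height 1.2 over [0, 10], which covers max f ≈ 1
height, a, b = 1.2, 0.0, 10.0
n = 200_000
y = rng.uniform(a, b, n)            # Y ~ g = Unif(0, 10)
u = rng.uniform(0.0, height, n)     # U ~ Unif(0, envelope height)
rate = (u <= f(y)).mean()           # empirical acceptance rate

# Theory: rate = (area under f) / (envelope area); estimate area by a grid mean
grid = np.linspace(a, b, 10_001)
theory = f(grid).mean() / height    # = (mean f × (b-a)) / (height × (b-a))
print(rate, theory)
```

Since $f$ here is unnormalized, the acceptance rate is the area under $f$ divided by the envelope area, rather than exactly $1/M$; the empirical and theoretical values agree closely.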


2. Piecewise Rejection Sampling

  Indeed, the scientific community has developed several techniques to improve upon the basic rejection sampling method. Among these, Adaptive Rejection Sampling (ARS) stands out for its efficiency, though it operates under the assumption that the target function is log-concave. On the other hand, Adaptive Rejection Metropolis Sampling (ARMS) addresses the limitations of ARS by not requiring the log-concavity condition, making it more general, but at the cost of being more complex to implement. While there are comprehensive R packages available that facilitate the use of these methods, there is a simpler alternative that I would like to present: Piecewise Rejection Sampling (PRS).

PRS was developed as a direct response to a challenge I faced during my physics research. It retains the core principles of rejection sampling but proposes a more tailored approach for the auxiliary function $g(x)$. Rather than employing a simple uniform distribution, PRS uses a Weighted uniform distribution that is specifically optimized for $f(x)$. Let’s delve into this method step by step.

2.1. Max-Pooling

  If you have an interest in deep learning, you might already be acquainted with the term Max Pooling. It is a common operation in Convolutional Neural Networks (CNNs), and it is conceptually straightforward. Let's proceed to the mathematical definition of Max Pooling to understand how it works.


Max pooling

Let $f\,:\,[a,b]\to \mathbb{R}$ be a continuous function and consider the equidistant partition of the interval $[a,b]$ $$ a = x_0 < x_1 < \cdots < x_{n-1} < x_n = b $$ The partition size, $(b-a)/n$, is called the stride. Denote by $\displaystyle M_i = \max_{[x_{i-1},x_i]}f(x)$ and consider the simple function

$$ S_n(x) = \sum_{i=1}^n M_i 𝟙_{[x_{i-1},x_i)}(x). $$

The process of approximating the function $f(x)$ by the simple function $S_n(x)$ is called max-pooling.

A simple function, in the context of measure theory, is one that assumes a finite number of constant values, each within a specific interval. For an in-depth explanation, you can refer to Definition 10 and Property 1 in the resource Precise Machine Learning with Rust.

In essence, Max Pooling is a process that yields a simple function from measure theory, which is defined by distinct constant values on various intervals. Specifically, when applied to a function $f(x)$, Max Pooling divides the domain into $n$ bins and within each bin, it identifies the maximum value of $f(x)$. These maximum values are then used as the constant, representative values for their respective intervals, thereby transforming $f(x)$ into a simple function with a finite number of plateaus corresponding to the local maxima within each bin. When we apply this concept to our initial distribution, we can observe the following:
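The max-pooling step described above can be sketched in a few lines of NumPy (the function name `max_pool` is my own, not from any library; the per-bin maximum is taken over a finite grid as a practical stand-in for the true supremum):

```python
import numpy as np

def max_pool(f, a, b, n, grid_per_bin=50):
    """Approximate f on [a, b] by the simple function S_n built from per-bin maxima.

    Returns the n+1 bin edges and the n maxima M_i.
    """
    edges = np.linspace(a, b, n + 1)
    M = np.array([f(np.linspace(lo, hi, grid_per_bin)).max()
                  for lo, hi in zip(edges[:-1], edges[1:])])
    return edges, M

# Example: max-pool sin(x) over [0, pi] with 4 bins; the two middle bins
# touch x = pi/2, so both plateaus reach the peak value 1
edges, M = max_pool(np.sin, 0.0, np.pi, 4)
print(M)
```

The returned `M` is exactly the set of plateau heights of $S_n(x)$, and it is what will serve as the weights of the weighted uniform distribution in the next section.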

Max-pooling for Test Distribution

The solid red line depicted in the figure represents the function $f(x)$, and the dashed blue line illustrates the outcome of applying Max Pooling to $f(x)$. Astute observers might have realized by now that this blue line will serve as $M\cdot g(x)$ in our rejection sampling process. To implement this, we will establish a probability distribution that resembles a uniform distribution within each interval but differs in terms of the representative values. This forms the basis for the Weighted uniform distribution —the optimized $g(x)$ that we’ve referenced earlier—which adapts the uniformity concept to accommodate varying weights across different intervals.

2.2. Weighted Uniform Distribution


Weighted uniform distribution
Let $(S, \mathcal{F}, \mu)$ be a measure space. For a disjoint family $\mathcal{A} = \left\{A_i\right\}_{i=1}^n \in \mathcal{F}$ of measurable sets with non-zero measure and a family $\mathbf{M} = \{M_i\}_{i=1}^n$ of non-negative real numbers (but $\sum_i M_i > 0$), define the weighted uniform distribution on $S$ by $$ \text{WUnif}(x|\mathbf{M}, \mathcal{A}) = \frac{1}{\sum_{j}M_j \cdot \mu(A_j)}\sum_i M_i 𝟙_{A_i}(x) $$

The definition sounds complicated, but when you define it for a one-dimensional interval, you realize that it’s really quite simple.

  • $S = [a,b]$
  • $\mathcal{A} = \left\{[x_{i-1},x_i)\right\}_{i=1}^n$ and $\Delta x_i \equiv x_i - x_{i-1}$
  • $\displaystyle \text{WUnif}(x|\mathbf{M}, \mathcal{A}) = \frac{1}{\sum_{j}M_j \cdot \Delta x_j}\sum_i M_i 𝟙_{A_i}(x)$

Moving forward with our one-dimensional context, we’ll utilize this straightforward definition for our calculations. To begin, we can swiftly demonstrate that this function qualifies as a probability density function.

$$ \begin{aligned} \int_a^b \text{WUnif}(x|\mathbf{M}, \mathcal{A}) dx &= \int_a^b \frac{1}{\sum_{j}M_j \cdot \Delta x_j}\sum_i M_i 𝟙_{A_i}(x) dx \\ &= \frac{1}{\sum_{j}M_j \cdot \Delta x_j}\sum_i M_i \int_{a}^{b} 𝟙_{A_i}(x) dx \\ &= \frac{1}{\sum_{j}M_j \cdot \Delta x_j}\sum_i M_i \cdot \Delta x_i \\ &= 1 \end{aligned} $$

Sampling from a weighted uniform distribution is indeed straightforward, and the process is as follows:

  1. Choose one of the $n$ bins, where the probability of selecting a particular bin is proportional to its area, calculated as $\displaystyle \frac{M_i \cdot \Delta x_i}{\sum_{j}M_j \cdot \Delta x_j}$.

  2. Once a bin is selected, generate a sample from the uniform distribution within that bin.

When Max Pooling is applied to determine the weights $\mathbf{M}$ and the set $\mathcal{A}$ (which includes the bins), and if the bin lengths are equal, the term $\Delta x_i$ cancels out. This simplifies the probability of choosing any bin to $M_i / \sum_{j}M_j$. This simplicity is a significant advantage when implementing the sampling algorithm, making the process computationally efficient.
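The two sampling steps above can be sketched as follows (a minimal NumPy version; the function name `sample_wunif` is my own):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_wunif(edges, M, size):
    """Sample from WUnif(x | M, A), with bins A_i = [edges[i], edges[i+1])."""
    dx = np.diff(edges)
    p = M * dx / np.sum(M * dx)                 # P(bin i) = M_i*dx_i / sum_j M_j*dx_j
    bins = rng.choice(len(M), size=size, p=p)   # step 1: pick a bin by its area
    return rng.uniform(edges[bins], edges[bins] + dx[bins])  # step 2: uniform in bin

# Two bins [0,1) and [1,2) with weights 3 and 1 => ~75% of samples land in [0,1)
x = sample_wunif(np.array([0.0, 1.0, 2.0]), np.array([3.0, 1.0]), 100_000)
print((x < 1.0).mean())
```

With equal bin widths, `dx` cancels in `p`, recovering the simplified $M_i / \sum_j M_j$ probabilities mentioned above.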

2.3. Piecewise Rejection Sampling

  The weighted uniform distribution, which we can readily sample from, is a viable choice for $g(x)$ in rejection sampling. But instead of selecting random weights and bin lengths, we will use Max Pooling to derive $\mathbf{M}$ and $\mathcal{A}$, ensuring that our $g(x)$ always remains above $f(x)$. This procedure is outlined in the following steps:

  1. Determine the number of bins to segment the entire range into, denoted as $n$. Define $\mathcal{A}$ by splitting the interval into $n$ equal-length sections.

  2. Perform Max Pooling on $f(x)$ across the established bins to ascertain $\mathbf{M}$, the set of maximum values within each bin.

  3. Utilize $\mathbf{M}$ and $\mathcal{A}$ to construct a weighted uniform distribution.

  4. Implement rejection sampling using this specifically defined weighted uniform distribution as $g(x)$.

This approach is aptly named Piecewise Rejection Sampling because it involves sampling within discrete bins. When you compute the acceptance rate for this method, it becomes clear that it can substantially improve the acceptance rate over using a simple uniform distribution for rejection sampling.

$$ \begin{aligned} P(U < f(Y)) &= \int_a^b P(U < f(Y) \,|\, Y = y) \cdot g(y) \, dy \\ &= \int_a^b P(U < f(y)) \cdot \text{WUnif}(y|\mathbf{M}, \mathcal{A}) \, dy \\ &= \frac{1}{\sum_{j}M_j \cdot \Delta x_j} \sum_i M_i \int_{A_i} P(U < f(y)) \, dy \end{aligned} $$

Since $U \sim \text{Unif}(u|0, M_i)$ for $y \in A_i$, we have $P(U < f(y)) = f(y)/M_i$, and the expression simplifies to:

$$ \begin{aligned} P(U < f(Y)) &= \frac{1}{\sum_{j}M_j \cdot \Delta x_j} \sum_i \int_{A_i} f(y)\, dy \\ &= \frac{1}{\sum_{j}M_j \cdot \Delta x_j} \geq \frac{1}{M_{\max} \cdot \sum_{j} \Delta x_j} = \frac{1}{M} \end{aligned} $$

The final part of the inequality uses the fact that the greatest value among the $M_i$, multiplied by the total interval length, equals $M$, which corresponds to the area under the envelope of the original uniform distribution. This confirms that the acceptance rate of piecewise rejection sampling always meets or exceeds that of simple rejection sampling with a uniform distribution.
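Putting the pieces together, the four steps can be sketched in plain NumPy (a hypothetical re-implementation for illustration only, not the Peroxide routine the post actually uses):

```python
import numpy as np

rng = np.random.default_rng(2)

def piecewise_rejection_sampling(f, n_samples, domain, n_bins):
    """Sketch of PRS: max-pooled envelope + per-bin rejection sampling."""
    a, b = domain
    edges = np.linspace(a, b, n_bins + 1)       # step 1: equal-length bins
    dx = np.diff(edges)
    # Step 2: max-pooling gives the weights M_i (finite grid ~ supremum)
    M = np.array([f(np.linspace(lo, hi, 50)).max()
                  for lo, hi in zip(edges[:-1], edges[1:])])
    p = M * dx / np.sum(M * dx)                 # step 3: weighted uniform bin probs
    out = []
    while len(out) < n_samples:                 # step 4: per-bin rejection sampling
        k = rng.choice(n_bins, p=p)             # pick a bin proportional to its area
        y = rng.uniform(edges[k], edges[k + 1]) # Y ~ Unif within the bin
        u = rng.uniform(0.0, M[k])              # U ~ Unif(0, M_k)
        if u <= f(y):
            out.append(y)
    return np.array(out)

# A sharply peaked density (near zero almost everywhere), the regime where
# plain uniform rejection sampling is wasteful but PRS stays efficient
f = lambda x: np.exp(-(x - 3.0) ** 2 / 0.02)
samples = piecewise_rejection_sampling(f, 2_000, (0.0, 10.0), 100)
print(samples.mean())
```

Because bins where $M_i \approx 0$ are almost never selected, nearly every draw lands where $f$ actually has mass, which is precisely the acceptance-rate gain derived above.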

To tackle the initial problem we introduced, we’ll employ piecewise rejection sampling. The algorithm for this has been integrated into the Rust numerical library Peroxide, which we will utilize for our solution. Using this library streamlines the process, allowing us to efficiently apply the piecewise rejection sampling method we’ve outlined.

// Before running this code, you need to add peroxide in Cargo.toml
// `cargo add peroxide --features parquet`
use peroxide::fuga::*;

#[allow(non_snake_case)]
fn main() {
    // Read parquet data file
    let df = DataFrame::read_parquet("data/test.parquet").unwrap();
    let E: Vec<f64> = df["E"].to_vec();
    let dNdE: Vec<f64> = df["dNdE"].to_vec();

    // Cubic hermite spline -> Make continuous f(x)
    let cs = cubic_hermite_spline(&E, &dNdE, Quadratic);
    let f = |x: f64| cs.eval(x);

    // Piecewise rejection sampling
    // * # samples = 10000
    // * # bins = 100
    // * tolerance = 1e-6
    let E_sample = prs(f, 10000, (E[0], E[E.len()-1]), 100, 1e-6);

    // Write parquet data file
    let mut df = DataFrame::new(vec![]);
    df.push("E", Series::new(E_sample));
    df.write_parquet(
        "data/prs.parquet", 
        CompressionOptions::Uncompressed
    ).unwrap();
}

If you plot the data in a histogram, it looks like this:

Finally, we get samples!


References

  • Yen-Chi Chen, Lecture 4: Importance Sampling and Rejection Sampling, STAT/Q SCI 403: Introduction to Resampling Methods (2017)

  • Ovidiu Calin, Deep Learning Architectures: A Mathematical Approach, Springer (2020)

  • Tae-Geun Kim (Axect), Precise Machine Learning with Rust (2019)


A. Footnotes

[1]: The figure used here depicts the spectrum of Axion Like Particles (ALPs) emitted at a specific time from a Primordial Black Hole (PBH). You can find more details about this in this paper.

[2]: In this scenario, one could approximate the distribution’s nodes using cubic spline interpolation, calculate the cumulative distribution function via numerical or polynomial integration, and then obtain the inverse function through further interpolation. However, this method may yield inaccurate and inefficient results, which is why it’s not the recommended approach.

💔 Decorrelation + Deep learning = Generalization
https://axect.github.io/posts/005_decov/
Sat, 29 Oct 2022 17:39:54 +0900

arXiv: 1511.06068

  The most pervasive challenge in deep learning is Overfitting. This occurs when the dataset is small and extensive training produces a model that excels on the training data but fails to generalize to validation data or real-world scenarios. Various strategies have been developed to address this issue: historically, statistics employed regularization methods like Ridge and LASSO, while deep learning has adopted approaches such as regularizing weights or applying a range of other techniques to neural networks.

Bejani, M.M., Ghatee, M. A systematic review on overfitting control in shallow and deep neural networks. Artif Intell Rev 54, 6391–6438 (2021)

Among these, Dropout is perhaps the most well-known: it omits certain neurons in a neural network during training to prevent their co-adaptation. In practice this has proven to be an exceedingly effective measure against overfitting, becoming a standard inclusion in neural network architectures. However, dropout is not universally applicable: with limited training data or few neurons, the random removal of neurons can impede the network's capacity to model complex relationships. Additionally, despite its practical success, dropout's simple mechanism can feel theoretically unsatisfying. In this context, I wish to present an intriguing paper that offers both theoretical enrichment and practical effectiveness.


1. Why do we need decorrelation?

1.1. Covariance & Correlation

  The paper under review, Reducing Overfitting in Deep Networks by Decorrelating Representations from 2015, introduces a method termed Decorrelation. The authors demonstrate how this technique can bolster the performance of Dropout in reducing overfitting. Despite its potential, Decorrelation did not surge in popularity, likely due to its novelty and relative complexity. Nevertheless, it remains a valuable and consistently cited work in the field.

To grasp the concept of decorrelation, we must first understand what correlation entails. Correlation denotes the relationship between two sets of data, with covariance being a common measure of this relationship. Covariance is defined as the linear association between two random variables, expressed by the formula:

$$ \text{Cov}(X,\,Y) \equiv \mathbb{E}\left[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])\right] $$

It is generally understood that a positive covariance indicates a direct proportional relationship between variables, a negative value suggests an inverse relationship, and a value near zero implies no apparent relationship. A more precise metric for this relationship is the Pearson correlation coefficient, defined as:

$$ \text{Corr}(X,\,Y) \equiv \frac{\text{Cov}(X,\,Y)}{\sqrt{\text{Var}(X) \text{Var}(Y)}} $$

This coefficient is constrained within a range from -1 to 1. A value closer to 1 indicates a strong direct relationship, closer to -1 indicates a strong inverse relationship, and a value near 0 suggests no correlation. Consider the following data as an illustration.

# Python
import numpy as np

x = np.array([1,2,3,4])
y = np.array([5,6,7,8])

For these two variables, the covariance can be calculated using the following function in Python:

# Python
def cov(x, y):
    N = len(x)
    m_x = x.mean()
    m_y = y.mean()
    return np.dot((x - m_x), (y - m_y)) / (N-1)

Using this function to compute the covariance for x and y above yields a value of 1.6666666666666667. This indicates a positive correlation, but interpreting this raw number can be challenging. To facilitate a clearer interpretation, we employ the Pearson correlation coefficient, as previously mentioned, defined in Python as:

# Python
def pearson_corr(x, y):
    # ddof=1 for sample variance
    return cov(x, y) / np.sqrt(x.var(ddof=1) * y.var(ddof=1))

Applying this to the given data reveals a Pearson correlation coefficient of exactly 1, signifying that the two variables have a perfect linear relationship.

When dealing with multiple features, it is possible to construct a matrix that represents the covariance among them all at once. This is known as the covariance matrix, and is denoted as follows:

$$ C_{ij} = \text{Cov}(x_i,\,x_j) $$

Therefore, the covariance matrix is a square matrix with dimensions equal to the number of features, representing the covariance between each pair of features.
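Continuing the toy example above, NumPy produces this matrix directly with `np.cov` (passing `rowvar=False` so that columns are treated as features):

```python
import numpy as np

# Two features (columns), four observations (rows): x = [1,2,3,4], y = [5,6,7,8]
X = np.array([[1.0, 5.0],
              [2.0, 6.0],
              [3.0, 7.0],
              [4.0, 8.0]])

C = np.cov(X, rowvar=False)   # C[i, j] = Cov(x_i, x_j), sample (N-1) normalization
print(C)
# Diagonal: the variances (5/3 each); off-diagonal: the covariance 1.666... computed above
```

Since both features here have the same variance as their covariance, the matching Pearson correlation is exactly 1, consistent with the earlier result.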


1.2. Overfitting & Decorrelation

  The discussion of correlation in the realm of deep learning arises from the issue that significant correlation among variables, or features, can be detrimental to a model’s performance. Deep neural networks refine their learning by adjusting weights assigned to each feature; however, when two features yield identical information (high correlation), it creates redundancy. This redundancy can cause a form of degeneracy where altering the weights of either feature doesn’t uniquely contribute to the learning process, which in turn can impede accurate learning and introduce bias into the neural network. The aforementioned paper posits that this redundancy is a contributing factor to overfitting.

In fact, the study demonstrates that by defining the extent of overfitting as the discrepancy between training and validation accuracy and quantifying the level of decorrelation with a metric termed DeCov, a notable pattern emerges:

Correlation between Overfitting and Covariance

Correlation between Overfitting and Covariance

The graph above illustrates that with an increase in the number of training samples, the degree of overfitting diminishes, concurrently with a decrease in Cross-Covariance. This correlation suggests that by reducing covariance, or in other words, promoting decorrelation, one can potentially reduce the tendency of a model to overfit.


2. How to decorrelate?

  Within the paper, the authors propose that to combat overfitting, one should aim to decorrelate the activation outputs of the hidden layers. This approach is grounded in the logic that these activations are the actual values being multiplied by subsequent weights in the network. Let’s consider the activation values of one hidden layer as $h^n \in \mathbb{R}^d$, where $n$ is the batch index. We can then define the covariance matrix of these activations as:

$$ C_{ij} = \frac{1}{N} \sum_n (h_i^n - \mu_i)(h_j^n - \mu_j) $$

By reducing the covariance of these activation values, we aim to achieve our objective of decorrelation. However, to integrate this concept into our Loss function, which requires a scalar value, the covariance matrix must be represented in a scalar form. The paper introduces the following Loss function for this purpose:

$$ \mathcal{L}_{\text{DeCov}} = \frac{1}{2}(\lVert C \rVert_F^2 - \lVert\text{diag}(C) \rVert_2^2) $$

Here, $\lVert \cdot \rVert_F$ denotes the Frobenius norm of the matrix, and $\lVert \cdot \rVert_2$ denotes the $l^2$ norm. The subtraction of the diagonal components is deliberate since these represent the variance of individual features, which does not contribute to inter-feature correlation. By incorporating this DeCov Loss function into the overall loss, much like a regularization term, we can enforce decorrelation in the learning process.
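As a quick numerical sanity check of this formula, here is a NumPy sketch (note that the paper normalizes the covariance by $1/N$ while `np.cov` defaults to $1/(N-1)$, which only rescales the loss):

```python
import numpy as np

def decov_loss(H):
    """DeCov loss for activations H of shape (batch, features)."""
    C = np.cov(H, rowvar=False)              # feature-by-feature covariance
    fro2 = np.sum(C ** 2)                    # ||C||_F^2
    diag2 = np.sum(np.diag(C) ** 2)          # ||diag(C)||_2^2
    return 0.5 * (fro2 - diag2)              # = 0.5 * sum of squared off-diagonals

# Decorrelated features -> zero loss; identical (fully correlated) features -> positive
H_uncorr = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
H_corr = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
print(decov_loss(H_uncorr), decov_loss(H_corr))
```

As expected, the loss vanishes exactly when the off-diagonal covariances are zero, so minimizing it pushes the features toward decorrelation without penalizing their individual variances.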

Implementing this decorrelation strategy in practical deep learning applications may initially appear daunting, given the complexity of gradient calculations for the covariance matrix and its norms. However, PyTorch simplifies the process considerably with its built-in automatic differentiation capabilities, which can handle gradients for the covariance matrix and norms with ease. Consequently, this DeCov loss function can be integrated into your model with just a few lines of Python code:

# Python
import torch

def decov(h):
    # h: (batch, features). torch.cov treats rows as variables,
    # so transpose to get the feature-by-feature covariance matrix.
    C = torch.cov(h.T)
    C_diag = torch.diag(C, 0)
    return 0.5 * (torch.norm(C, 'fro')**2 - torch.norm(C_diag, 2)**2)

By defining a function decov, we calculate the covariance matrix C of the activations h. The function then computes the DeCov loss by taking the Frobenius norm of the entire covariance matrix, squaring it, and subtracting the squared $l^2$ norm of the diagonal elements, which represent the variance of each feature. This DeCov loss can then be added to the primary loss function, thereby enabling the model to learn decorrelated features effectively.


3. Apply to Regression

  In the original study, the authors applied their decorrelation technique to various sets of image data. However, to more clearly demonstrate its effectiveness, we’ll apply it to a simple regression problem. Below is the dataset that will be used to train the neural network:

Nonlinear data [Note: Peroxide_Gallery]

To illustrate the impact of the DeCov loss function, we will compare the performance of a standard neural network to one that implements the DeCov function. The architecture of the neural network utilizing DeCov is detailed below.

import pytorch_lightning as pl
from torch import nn
import torch.nn.functional as F
# ...

class DeCovMLP(pl.LightningModule):
    def __init__(self, hparams=None):
        # ...
        
        self.fc_init = nn.Sequential(
            nn.Linear(1, self.hidden_nodes),
            nn.ReLU(inplace=True)
        )
        
        self.fc_mid = nn.Sequential(
            nn.Linear(self.hidden_nodes, self.hidden_nodes),
            nn.ReLU(inplace=True),
            nn.Linear(self.hidden_nodes, self.hidden_nodes),
            nn.ReLU(inplace=True),
            nn.Linear(self.hidden_nodes, self.hidden_nodes),
            nn.ReLU(inplace=True),
        )
        
        self.fc_final = nn.Linear(self.hidden_nodes, 1)
        
        # ...
        
    def forward(self, x):
        x = self.fc_init(x)
        x = self.fc_mid(x)
        return self.fc_final(x)
    
    def training_step(self, batch, batch_idx):
        x, y = batch
        
        h0 = self.fc_init(x)
        loss_0 = decov(h0)
        
        h1 = self.fc_mid(h0)
        loss_1 = decov(h1)
        
        y_hat = self.fc_final(h1)
        loss = F.mse_loss(y,y_hat) + loss_0 + loss_1
        
        return loss
    
    # ...

Upon closer examination, we observe that loss_0 is defined as the loss after the initial layer fc_init, and loss_1 as the loss after the subsequent three layers fc_mid. These losses are then integrated into the mean squared error (MSE) loss as regularization terms. Before delving into the outcomes, it’s insightful to examine the training dynamics of SimpleMLP and DeCovMLP.

SimpleMLP losses (wandb.ai)

DeCovMLP losses (wandb.ai)

In the SimpleMLP model, decov_1 initially increases during training and then plateaus, showing no further decrease. By contrast, within the DeCovMLP model, decov_1 consistently decreases, aligning with the paper's hypothesis: as overfitting intensifies, so does the correlation between features. Now, let's consider the results.

Results!

The red line represents the SimpleMLP outcome, while the blue line depicts DeCovMLP. Noticeably, the red line exhibits overfitting with erratic fluctuations, whereas the blue line maintains closer alignment with the true data and exhibits less jitter. This demonstrates a remarkable improvement, and further extrapolation yields even more compelling insights.

Extrapolate!

When extrapolated, the red line — indicative of overfitting — diverges significantly from the expected trend even with minor deviations, leading to inaccurate predictions. Meanwhile, the blue line remains stable, displaying minimal deviation. This clearly demonstrates the superior generalization capability of the DeCovMLP.


4. Furthermore

  The paper, “Reducing Overfitting in Deep Networks by Decorrelating Representations,” was published in 2015, which is quite a while ago given the field's rapid pace through 2022. Since its publication, there has been a continuous outpouring of research on decorrelation techniques, with studies like Decorrelated Batch Normalization emerging more recently. Decorrelation, having been extensively explored in the field of statistics, offers a robust theoretical foundation that can render the results of neural networks more intuitive and analytically solid for researchers and practitioners alike.

For those interested in delving deeper, the regression code discussed earlier can be accessed at the provided link.
