(NeurIPS 2025 Spotlight) Abstain Mask Retain Core (AMRC): Time Series Prediction by Adaptive Masking Loss with Representation Consistency

This repository contains the implementation of AMRC (Adaptive Masking Loss with Representation Consistency), an optimization framework for time series forecasting that targets redundant feature learning. The method is presented in our paper "Abstain Mask Retain Core: Time Series Prediction by Adaptive Masking Loss with Representation Consistency". The code is still being refined.

Abstract

Time series forecasting plays a pivotal role in critical domains such as energy management and financial markets. Through systematic experimentation, we reveal a counterintuitive phenomenon: appropriately truncating historical data can improve prediction accuracy, which indicates that existing models learn substantial redundant features during training. Building upon information bottleneck theory, we propose AMRC, which has two core components:

Adaptive Masking Loss (AML): Dynamically identifies highly discriminative temporal segments to guide gradient descent
Embedding Similarity Penalty (ESP): Stabilizes mapping relationships among inputs, labels, and predictions

Key Theoretical Insights

1. The Redundancy Learning Phenomenon

Our analysis challenges the prevailing "long-sequence information gain hypothesis" in time series forecasting. Through extensive experiments across multiple benchmarks (Table 1 in paper), we demonstrate that:

Over 50% of samples exhibit improved predictive performance when input sequences are optimally masked
This phenomenon is architecture-agnostic, manifesting in both Transformer-based (iTransformer, PatchTST) and MLP-based models (TSMixer, SOFTS)
The improvement is consistent across diverse datasets with varying temporal characteristics
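
The measurement behind this claim can be sketched roughly as follows. This is a minimal illustration, not the evaluation protocol from the paper: the function name `improved_ratio`, the choice to truncate by keeping only the most recent k steps, and the toy mean forecaster are all assumptions made for the example.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def improved_ratio(samples, predict, keep_lengths):
    """Fraction of samples whose best truncated input beats the full input."""
    improved = 0
    for history, target in samples:
        full_loss = mse(predict(history), target)
        # Assumption: truncation keeps only the most recent k steps.
        best_truncated = min(
            mse(predict(history[-k:]), target)
            for k in keep_lengths
            if 0 < k < len(history)
        )
        if best_truncated < full_loss:
            improved += 1
    return improved / len(samples)


def mean_forecaster(window):
    """Toy stand-in for a trained model: predicts the mean of its input."""
    return [sum(window) / len(window)]


samples = [
    ([0.0, 0.0, 10.0, 10.0], [10.0]),  # older steps mislead the forecaster
    ([5.0, 5.0, 5.0, 5.0], [5.0]),     # the full history is already optimal
]
ratio = improved_ratio(samples, mean_forecaster, keep_lengths=[1, 2, 3])
print(ratio)  # 0.5
```

A real forecaster usually expects a fixed input length, so in practice the truncated part would more likely be masked out than removed.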

2. Information Bottleneck Perspective

According to Information Bottleneck (IB) Theory, a neural network functions as a bottleneck that compresses input information during feature extraction. For time series forecasting, the objective can be formulated as:

max I(Z; Y) - β I(Z; X)

Where:

Z: Latent representation
Y: Prediction target
X: Input sequence
I(·;·): Mutual information
β: Trade-off parameter

Current models focus primarily on maximizing I(Z; Y) and do not explicitly minimize I(Z; X), so they retain redundant features.
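
To make the trade-off concrete, here is a toy discrete example that is not taken from the paper. The input takes four equally likely values and the label depends only on its high bit. The helper names `mutual_information` and `ib_objective`, the two encoders, and β = 0.5 are assumptions chosen for illustration.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """I(A; B) in bits, from a list of equally likely (a, b) outcomes."""
    n = len(pairs)
    joint = Counter(pairs)
    p_a = Counter(a for a, _ in pairs)
    p_b = Counter(b for _, b in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((p_a[a] / n) * (p_b[b] / n)))
        for (a, b), c in joint.items()
    )


def ib_objective(xs, ys, encode, beta):
    """I(Z; Y) - beta * I(Z; X) for the representation Z = encode(X)."""
    zs = [encode(x) for x in xs]
    relevance = mutual_information(list(zip(zs, ys)))
    compression_cost = mutual_information(list(zip(zs, xs)))
    return relevance - beta * compression_cost


xs = [0, 1, 2, 3]            # four equally likely inputs
ys = [x // 2 for x in xs]    # the label depends only on the high bit


def copy_encoder(x):
    return x                 # keeps everything, including the redundant bit


def compressed_encoder(x):
    return x // 2            # keeps only the label-relevant bit


print(ib_objective(xs, ys, copy_encoder, beta=0.5))        # 0.0
print(ib_objective(xs, ys, compressed_encoder, beta=0.5))  # 0.5
```

Both encoders carry the same information about Y (1 bit), but the copying encoder pays for the extra bit of X it retains, so the compressed representation scores higher.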

3. Representation Similarity Paradox

Through t-SNE visualization analysis, we observe that:

Input embeddings maintain natural dispersion patterns
Model representations exhibit abnormal clustering despite diverse labels
This concentration indicates encoding of task-irrelevant features that distort input-output mappings

Methodology

Adaptive Masking Loss (AML)

AML addresses redundancy by guiding the encoder toward minimal sufficient representations:

Stochastic Mask Sampling: Generate m masked variants by randomly sampling mask indices
Optimal Selection: Identify the mask that minimizes prediction loss
Representation Alignment: Minimize distance between original and optimal masked representations

L_AML = β · ||Z - Z_k*||²

Where Z is the encoder output for the original input and Z_k* is the encoder output for the optimally masked input.
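
The three steps can be sketched in plain Python as below. This is a framework-agnostic illustration under stated assumptions, not the repository's implementation: masking is done by zero-filling the sampled positions, the parameters `m`, `mask_ratio`, and `seed` are hypothetical, and `toy_encoder` and `toy_head` stand in for a real model. In a PyTorch training loop the same quantities would be tensors.

```python
import random


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def squared_distance(a, b):
    """Squared Euclidean distance ||a - b||^2."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def aml_loss(x, y, encoder, head, m=8, mask_ratio=0.25, beta=1.0, seed=0):
    """beta * ||Z - Z_k*||^2, with k* the sampled mask of lowest loss."""
    rng = random.Random(seed)
    z = encoder(x)
    n_masked = max(1, int(mask_ratio * len(x)))
    best_loss, best_z = float("inf"), None
    for _ in range(m):
        # Step 1: stochastic mask sampling (zero-filling is an assumption).
        masked_idx = set(rng.sample(range(len(x)), n_masked))
        x_masked = [0.0 if t in masked_idx else v for t, v in enumerate(x)]
        # Step 2: keep the masked variant with the lowest prediction loss.
        z_k = encoder(x_masked)
        loss_k = mse(head(z_k), y)
        if loss_k < best_loss:
            best_loss, best_z = loss_k, z_k
    # Step 3: pull the original representation toward the best masked one.
    return beta * squared_distance(z, best_z)


def toy_encoder(x):
    """Hypothetical encoder: sums each half of the input window."""
    half = len(x) // 2
    return [sum(x[:half]), sum(x[half:])]


def toy_head(z):
    """Hypothetical prediction head: adds the two latent features."""
    return [z[0] + z[1]]


loss = aml_loss([1.0, 2.0, 3.0, 4.0], [9.0], toy_encoder, toy_head)
```

The loop makes the cost visible: each batch needs m extra encoder and head evaluations, and the quality of the selected mask depends on how many candidates are sampled.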

Embedding Similarity Penalty (ESP)

ESP enforces geometric consistency between embedding and output spaces:

L_ESP = (1/n²) ∑_i ∑_j |Δ_E^ij - Δ_O^ij|,  i, j = 1, …, n

Where:

Δ_E^ij: Normalized pairwise distances in embedding space
Δ_O^ij: Normalized pairwise distances in output space
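
A minimal sketch of this penalty follows. The text does not say how the pairwise distances are normalized, so dividing each distance matrix by its largest entry is an assumption here, as are the Euclidean metric and the function names.

```python
import math


def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    return [[math.dist(p, q) for q in points] for p in points]


def normalize(matrix):
    """Scale a distance matrix by its largest entry (assumed normalization)."""
    peak = max(max(row) for row in matrix)
    if peak == 0:
        return matrix
    return [[d / peak for d in row] for row in matrix]


def esp_loss(embeddings, outputs):
    """Mean absolute gap between normalized pairwise distance matrices."""
    n = len(embeddings)
    d_e = normalize(pairwise_distances(embeddings))
    d_o = normalize(pairwise_distances(outputs))
    total = sum(abs(d_e[i][j] - d_o[i][j]) for i in range(n) for j in range(n))
    return total / (n * n)


embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
same_geometry = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]  # scaled copy
distorted = [[0.0], [1.0], [5.0]]                     # different layout

print(esp_loss(embeddings, same_geometry) < 1e-12)  # True
print(esp_loss(embeddings, distorted) > 0)          # True
```

Because the distances are normalized, a uniformly rescaled output space incurs essentially no penalty; only changes in the relative arrangement of samples are penalized.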

Combined Objective

The final training objective integrates both components:

L_total = L_pred + λ_AML · L_AML + λ_ESP · L_ESP
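
As a small worked example of the weighting, the snippet below evaluates the objective for one batch. The loss values and the weights 0.5 and 0.2 are hypothetical numbers picked for illustration, and `amrc_objective` is not a function from this repository.

```python
def amrc_objective(l_pred, l_aml, l_esp, lambda_aml, lambda_esp):
    """L_total = L_pred + lambda_AML * L_AML + lambda_ESP * L_ESP."""
    return l_pred + lambda_aml * l_aml + lambda_esp * l_esp


# Hypothetical per-batch loss values and weights, for illustration only.
total = amrc_objective(l_pred=0.40, l_aml=0.10, l_esp=0.05,
                       lambda_aml=0.5, lambda_esp=0.2)
print(round(total, 4))  # 0.46

# Setting both weights to zero recovers the plain prediction loss.
baseline = amrc_objective(0.40, 0.10, 0.05, lambda_aml=0.0, lambda_esp=0.0)
```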

Experimental Validation

Datasets

ETT (Electricity Transformer Temperature): ETTh1, ETTh2, ETTm1, ETTm2
Solar-Energy: 137-channel solar power production data
Electricity: Hourly electricity consumption
Weather: 21-channel meteorological data

Key Results

Consistent Performance Gains: AMRC achieves improvements across all tested architectures
- Average MSE reduction: 3-7% across different models
- More pronounced on datasets with strong temporal redundancy
Architecture Agnostic: Effective on diverse model families
- Transformer-based: iTransformer, PatchTST
- MLP-based: TSMixer, SOFTS, TimeMixer
Redundancy Reduction: Post-training analysis shows
- Decreased susceptibility to input masking (Ratio* < Ratio)
- More robust feature representations

Implementation Notes

Requirements

PyTorch >= 1.10
NumPy, Pandas, scikit-learn
Model-specific dependencies (see individual model directories)

Integration

AMRC is designed as a plug-and-play training framework that can be integrated into existing time series forecasting models without architectural modifications. The implementation follows these principles:

Non-invasive: No changes to model architecture required
Flexible: Hyperparameters λ_AML and λ_ESP can be tuned per dataset
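Efficient: The added losses are used only during training, so inference cost is unchanged; the training-time cost of AML is discussed under Limitations and Future Work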

Repository Structure

SOFTS_exp/
├── iTransformer/    # iTransformer with AMRC integration
├── PatchTST/        # PatchTST with AMRC integration  
├── TimeMixer/       # TimeMixer with AMRC integration
├── TSMixer/         # TSMixer with AMRC integration
└── SOFTS/           # SOFTS baseline implementation

Limitations and Future Work

Computational Overhead: AML requires m additional forward passes per batch
High-Dimensional Challenges: ESP effectiveness diminishes in very high-dimensional embedding spaces
Approximation Bounds: Optimal mask selection is limited by sampling size m

Acknowledgments

This research investigates a fundamental but overlooked aspect of time series forecasting: the detrimental effects of redundant feature learning. By introducing AMRC, we provide both theoretical insights and practical solutions for improving forecasting accuracy through redundancy suppression.

Note: Due to ongoing code refinement and validation, specific implementation details and usage instructions will be updated upon publication. The theoretical framework and experimental results presented here represent the core contributions of our work.
