A paper presentation of *Extending Context Window of Large Language Models via Positional Interpolation* (Chen et al., 2023)
Large language models (LLMs) come with certain limitations. One of the main ones is the pre-defined context window size, which many tasks and applications will exceed. When the context length goes beyond the pre-trained setting, traditional Transformer models tend to exhibit a significant degradation in performance.
The paper introduces position interpolation (PI) as a fine-tuning technique for extending the context length of LLMs, in contrast to other fine-tuning approaches such as extrapolation. It presents a series of experiments testing the viability of PI and finds that PI can extend the context window by up to 32 times with minimal fine-tuning and only a small trade-off in model performance.
Before discussing what positional interpolation is, we need to discuss what positional embeddings are.
What are positional embeddings and why are they important?
Another way to think about this: what would be the problem if a model perceived sequences as “bags of words” instead?
Hint Answer
The problem is that if we had a bag of words then a Transformer cannot make sense of word ordering. For example, we know that the sentence “Tiffany gives Camille a gift” has a completely different meaning than “Camille gives Tiffany a gift”, but the model cannot tell the difference.
Full Answer
Positional embeddings only contain information about a token’s position in a sequence, so there is no semantic or syntactic information included. The positional embedding is added to the input embeddings to provide the model with information about the position of each item in the sequence, which allows the model to make sense of word ordering.
Many LLMs, such as LLaMA, use positional embeddings to maintain a sense of word order, but context size is still limited: the way positions are encoded is tied to the positions seen during training, which caps the maximum sequence length the model can handle well.
Many LLM applications require long context windows, such as summarizing long documents or carrying on long conversations. The two ways to extend the context window are either training an LLM from scratch with a longer window or fine-tuning an existing one.
- Training from scratch with a longer context window takes a large amount of effort and resources, since the cost and complexity of training grow substantially as the context length increases.
- Fine-tuning with extrapolation to a longer context length also runs into obstacles: training is costly and model performance deteriorates severely.
- Extrapolation means training on short context windows and then running inference on longer ones; the problem is that many LLMs use positional embeddings that do not extrapolate well, which causes a decline in performance.
- For example, LLaMA and Falcon both use Rotary Position Embeddings (RoPE). RoPE encodes positional information by applying a position-dependent rotation in the embedding space via a rotation matrix. This reduces positional collisions and improves the modeling of long-range dependencies, but RoPE by itself does not extrapolate well (a short code sketch of the rotation appears right after this list).
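To make the rotation concrete, here is a minimal NumPy sketch of RoPE (not the paper's or LLaMA's actual implementation; the function name `apply_rope` and the 16-dimensional vectors are illustrative). Each pair of embedding dimensions is rotated by an angle proportional to the token's position, and the resulting query–key dot product depends only on the relative distance between the two positions.

```python
import numpy as np

def apply_rope(x, pos, base=10000.0):
    """Minimal RoPE sketch: rotate each (even, odd) dimension pair of x
    by an angle that grows with the token position `pos`."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)   # per-pair rotation frequencies
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# The query-key dot product depends only on the relative position m - n:
rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)
score_a = apply_rope(q, pos=5)   @ apply_rope(k, pos=2)    # m - n = 3
score_b = apply_rope(q, pos=105) @ apply_rope(k, pos=102)  # m - n = 3
print(np.allclose(score_a, score_b))  # True
```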
Why might RoPE not work well through extrapolation?
Hint:
Extrapolation involves training on short context windows and running inference on longer ones. RoPE creates rotation matrices based on the sequence positions seen during training.
Answer:
RoPE has a fixed rotation pattern because its rotation matrices are determined by the position indices seen during training. When the model encounters longer sequences, the rotations for the new, unseen positions fall outside the trained context length, leading to unpredictable or suboptimal embedding rotations. An effective approach needs to ensure that the positional embeddings for longer sequences fall within the trained range. As a result, pre-trained LLMs that use RoPE suffer a severe performance drop once the input passes the trained context length: the attention scores for extrapolated positions can grow far beyond anything seen during training, which hurts the self-attention mechanism. This is represented in the graphics above, which show that extrapolating RoPE produces much larger attention scores and a corresponding decline in model performance.
The need to keep positional embeddings within the trained range leads to the idea of positional interpolation, where we directly down-scale the position indices so that the maximum position index still matches the original context window limit. Concretely, each position index is divided by the scaling factor L'/L.
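As a small illustration (assuming the paper's example of extending a 2048-token window to 4096 tokens), down-scaling simply maps every position index back into the trained range:

```python
import numpy as np

L, L_prime = 2048, 4096              # original and extended context lengths
positions = np.arange(L_prime)       # 0, 1, ..., 4095
scaled = positions * (L / L_prime)   # 0.0, 0.5, 1.0, ..., 2047.5

# Every down-scaled index now lies inside the trained range [0, L).
print(scaled.min(), scaled.max())    # 0.0 2047.5
```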
Extrapolation vs. Interpolation Down-Scale on RoPE
The top graphic shows a LLaMA model with a 2048-token context window; the red region marks positions beyond the context window reached via extrapolation. The bottom graphic shows positional interpolation: the position indices are down-scaled so that position 4096 still resides within the 2048 context length, which is visible as the increased density of dots.
The paper focuses on how to extend the context window when an LLM uses RoPE. RoPE is defined by

$$f(\mathbf{x}, m) = \left[(x_0 + i x_1)e^{i m \theta_0},\ (x_2 + i x_3)e^{i m \theta_1},\ \ldots,\ (x_{d-2} + i x_{d-1})e^{i m \theta_{d/2 - 1}}\right]^\top$$

where $i$ is the imaginary unit, $d$ is the embedding dimension, and $\theta_j = 10000^{-2j/d}$.

Using RoPE, the self-attention score

$$a(m, n) = \mathrm{Re}\,\langle f(\mathbf{q}, m), f(\mathbf{k}, n)\rangle$$

depends only on the relative position $m - n$, since the rotations applied to the query and key cancel up to their difference.

Thus, to use positional interpolation with RoPE so that each input position index $m$ is scaled down into the range $[0, L)$ of the pre-trained context length, the function $f(\mathbf{x}, m)$ above is replaced by

$$f'(\mathbf{x}, m) = f\!\left(\mathbf{x}, \frac{mL}{L'}\right)$$
Where:
- $\mathbf{x}$ is the word embedding, without position information
- $m$ is the position index of the token
- $L$ is the original context window length (the maximum trained length)
- $L'$ is the longer, extended context window length.
By aligning the ranges of position indices and relative distances before and after the extension, the problems with attention-score computation caused by context-window extension are mitigated. The model adapts more easily because the upper bound on attention scores computed with interpolated positions is much tighter than the corresponding bound under extrapolation.
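Below is a hedged sketch of how interpolation changes the RoPE computation: positions are multiplied by $L/L'$ before the rotation is applied, so even indices beyond the original window map to rotations the model has already seen. The helper `apply_rope` repeats the earlier illustrative definition and is not the paper's code.

```python
import numpy as np

def apply_rope(x, pos, base=10000.0):
    """Minimal RoPE sketch (same illustrative helper as above)."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def rope_interpolated(x, pos, L=2048, L_prime=4096):
    """Positional interpolation: f'(x, m) = f(x, m * L / L')."""
    return apply_rope(x, pos * (L / L_prime))

rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)

# Position 3000 is outside the original 2048-token window, but after
# interpolation it is treated as position 1500, which lies in [0, L).
score = rope_interpolated(q, 3000) @ rope_interpolated(k, 0)
print(score)
```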
Written out as a positional-embedding procedure, interpolation changes only where the embedding is evaluated:

Input:
- $\ell \in [L']$: position of a token in the sequence
- $L$: the original context window length
- $L'$: the longer extended context window length

Output:
- $e_p \in \mathbb{R}^{d_e}$: the vector representation of the position with interpolation

Parameters:
- $W_p$: the positional embedding parameter, whose entry pairs for $0 \leq i < d_e/2$ are evaluated at the interpolated (down-scaled) position $\ell \cdot L / L'$ rather than at $\ell$ itself

Return:
- Retrieve the positional embedding with interpolation: $e_p = W_p[:, \ell]$, i.e. the entries $W_p[2i, \ell]$ and $W_p[2i+1, \ell]$ for $0 \leq i < d_e/2$.
The positional embedding with interpolation is combined with the token embedding to form a token's initial embedding:

$$e = W_e[:, x[\ell]] + W_p[:, \ell]$$

Variables:
- $e$: the token's initial embedding
- $W_e$: the word embedding matrix
- $x$: the document (sequence of token IDs)
- $\ell$: position of the token in the sequence
- $W_p[:, \ell]$: the positional embedding with interpolation
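A minimal sketch of the two steps above, assuming a sinusoidal form for $W_p$ (the paper itself applies interpolation to RoPE; the sinusoidal choice, the vocabulary size, and names such as `positional_embedding_interp` and `W_e` are illustrative assumptions, not the paper's code):

```python
import numpy as np

d_e, vocab_size = 64, 1000        # illustrative embedding size and vocabulary
L, L_prime = 2048, 4096           # original and extended context lengths

def positional_embedding_interp(pos):
    """e_p for a position pos in [0, L'), computed at the interpolated
    (down-scaled) position pos * L / L' so it stays inside the trained range.
    A sinusoidal form is assumed here purely for illustration."""
    scaled = pos * (L / L_prime)
    i = np.arange(d_e // 2)
    angles = scaled / (L ** (2.0 * i / d_e))
    e_p = np.empty(d_e)
    e_p[0::2] = np.sin(angles)
    e_p[1::2] = np.cos(angles)
    return e_p

rng = np.random.default_rng(0)
W_e = rng.normal(size=(d_e, vocab_size))        # word embedding matrix (toy)
x = rng.integers(0, vocab_size, size=L_prime)   # a toy document of token IDs

# Token's initial embedding at a position beyond the original window:
l = 3000
e = W_e[:, x[l]] + positional_embedding_interp(l)
print(e.shape)  # (64,)
```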
Since positional interpolation does not modify the model architecture or attention mechanism, it can be used for a variety of tasks once the context length has been extended. The paper covers several experiments that evaluate the effectiveness of PI compared with extrapolation-fine-tuned and non-fine-tuned models:
- Long Sequence Language Modeling
- Passkey Retrieval (a sketch of this task's prompt follows the list)
- Benchmarks on Original Context Window Size of 2048
- Long Document Summarization
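For instance, the passkey retrieval experiment hides a random passkey inside long stretches of filler text and asks the model to repeat it. A small sketch of such a synthetic prompt is below; the exact wording and lengths are illustrative rather than copied from the paper.

```python
import random

def make_passkey_prompt(max_filler=400):
    """Build a synthetic passkey-retrieval prompt: a random passkey is buried
    inside repetitive filler text, and the model is asked to recall it."""
    passkey = random.randint(10000, 99999)
    filler = "The grass is green. The sky is blue. The sun is yellow. "
    before = filler * random.randint(1, max_filler)
    after = filler * random.randint(1, max_filler)
    prompt = (
        "There is important information hidden inside a lot of irrelevant "
        "text. Find it and memorize it; you will be asked about it.\n"
        + before
        + f"\nThe pass key is {passkey}. Remember it. "
        f"{passkey} is the pass key.\n"
        + after
        + "\nWhat is the pass key? The pass key is"
    )
    return prompt, passkey

prompt, answer = make_passkey_prompt()
print(len(prompt.split()), answer)   # prompt length in words, expected passkey
```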
Overall, the paper finds that models fine-tuned with positional interpolation achieve better perplexity at longer context windows while seeing only a very minor degradation in performance, and the experiments show that PI can successfully extend the context window by up to 32 times.
This can be seen in the graphic below, where models fine-tuned with PI show progressively lower perplexity as the context window grows, while the perplexity of models fine-tuned with extrapolation increases over longer context windows without even reaching the same context length.
Long Sequence Language Modeling Experiment on LLaMA with RoPE
The experiments show how positional interpolation can effectively extend a model’s context window through minimal fine-tuning, without modifying the model architecture or attention mechanism. Because the original architecture is preserved, positional interpolation is versatile enough to be used across a variety of tasks while still giving models an extended context window.
There are a few things that could have been developed further in the paper.
- There is potential for extrapolation fine-tuning to use regularization so that it remains well-behaved relative to the trained [0, L) range, but the authors did not conduct a comparison of how extrapolation with regularization compares to positional interpolation.
- The authors discuss how PI could be used in retrieval-augmented LLMs, few-shot learning, recurrent transformers, and memory transformers, none of which were included in the experiments covered in the paper. Additional work could not only run experiments in these settings but also evaluate tasks that are more sensitive to positional embeddings, such as question answering.
- Experimenting with how PI performs on models other than LLaMA and with other positional embedding techniques as new ones emerge. The authors focus exclusively on LLaMA and RoPE in their experiments, but further work could examine how PI performs across different models and embedding techniques.
- Extending Context is Hard: https://kaiokendev.github.io/context
- Extending the Context length of Language Models: Understanding Positional Interpolation (Blog Post 1): https://medium.com/@jain.sm/extending-the-context-length-of-language-models-a-deep-dive-into-positional-interpolation-a93140c69f6a
- Position Interpolation: Extending Context Window Sizes in Large Language Models (Blog Post 2): https://medium.com/@jain.sm/position-interpolation-extending-context-window-sizes-in-large-language-models-ef19d0209a9f
- Exploring Ways to Extend Context Length in Transformers: https://muhtasham.github.io/blog/posts/explore-context/
- Extending context size via RoPE scaling (with Reddit Discussion Link): ggml-org/llama.cpp#1965
Chen, S., Wong, S., Chen, L., & Tian, Y. (2023). Extending context window of large language models via positional interpolation. Retrieved from http://arxiv.org/abs/2306.15595


