As large language models (LLMs) continue to advance, the importance of the context length they can handle becomes increasingly apparent. Let’s take a look at how positional encoding has evolved over the years to extend the context-processing capability of LLMs.
Vanilla Positional Encoding
Why does the Transformer need positional encoding? The Transformer contains no recurrence and no convolution. To help the model make use of the order of the sequence, the vanilla Transformer (Vaswani et al., 2017)¹ introduced positional encoding, adopting a simple yet effective approach: using sine and cosine functions to generate the positional encodings. This method allows the model to capture the positional information of words in the sequence without adding extra parameters.
$$PE_{(pos,\,2i)} = \sin\left(pos/10000^{2i/d_{model}}\right)$$
$$PE_{(pos,\,2i+1)} = \cos\left(pos/10000^{2i/d_{model}}\right)$$
where $pos$ is the position, $i$ is the dimension index, and $d_{model}$ is the word embedding dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. The authors chose this function because they hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.
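To see why this linear relation holds, here is a short derivation (standard trigonometry, not spelled out in the paper itself): write $\omega_i = 1/10000^{2i/d_{model}}$ for the frequency of dimension pair $i$. The angle-addition formulas then give
$$
\begin{pmatrix} PE_{(pos+k,\,2i)} \\ PE_{(pos+k,\,2i+1)} \end{pmatrix}
=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} PE_{(pos,\,2i)} \\ PE_{(pos,\,2i+1)} \end{pmatrix}
$$
The rotation matrix depends only on the offset $k$, not on $pos$, so $PE_{pos+k}$ is indeed a position-independent linear function of $PE_{pos}$.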
The authors also compared this scheme with learned position embeddings (Gehring et al., 2017)², and the two versions produced nearly identical results (a small sketch of the learned variant is shown after Figure 1). Nevertheless, the sinusoidal version may allow the model to extrapolate to sequence lengths longer than those encountered during training. To understand the positional encoding more deeply, let’s visualize the position matrix for a given length.
import numpy as np
import matplotlib.pyplot as plt

def get_position_encoding_by_attention_is_all_you_need(seq_len, d, n=10000):
    """Build the sinusoidal position matrix from "Attention Is All You Need"."""
    P = np.zeros((seq_len, d))
    for k in range(seq_len):                 # position index
        for i in np.arange(int(d / 2)):      # dimension-pair index
            denominator = np.power(n, 2 * i / d)
            P[k, 2 * i] = np.sin(k / denominator)      # even dimensions
            P[k, 2 * i + 1] = np.cos(k / denominator)  # odd dimensions
    return P

P = get_position_encoding_by_attention_is_all_you_need(seq_len=200, d=512, n=10000)
im = plt.matshow(P)
plt.gcf().colorbar(im)
plt.show()

Figure 1. Visualization of the position matrix with n=10000, d=512, seq_len=200. (Image source: generated from the code above.)
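For contrast with the sinusoidal scheme, the learned alternative compared in the paper (Gehring et al., 2017)² is simply a trainable lookup table: one vector per position, updated by gradient descent like any other parameter. Below is a minimal, hypothetical PyTorch sketch (the module and parameter names are my own, not from either paper):

import torch
import torch.nn as nn

class LearnedPositionEmbedding(nn.Module):
    """A trainable lookup table: one d_model-dimensional vector per position."""
    def __init__(self, max_len, d_model):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, token_emb):
        # token_emb: (batch, seq_len, d_model)
        seq_len = token_emb.size(1)
        positions = torch.arange(seq_len, device=token_emb.device)
        return token_emb + self.pos_emb(positions)  # broadcasts over the batch dim

# usage sketch
emb = LearnedPositionEmbedding(max_len=512, d_model=512)
x = torch.randn(2, 200, 512)   # a batch of 2 sequences of 200 token embeddings
out = emb(x)                   # learned vectors for positions 0..199 are added

Unlike the sinusoidal version, this table simply has no entries for positions beyond max_len, which is one reason the sinusoidal encoding is expected to extrapolate better.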
Relative Positional Encoding
Another view: Positional Encoding is Nothing but Computational Budget
Although almost everyone regards positional encoding as vital for the Transformer, a team from IBM Research, Facebook CIFAR AI, and others reached a radically different conclusion (Kazemnejad et al., 2023)³: positional encodings are not essential for decoder-only Transformers to generalize to longer sequences.
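To make the "decoder-only without positional encoding" (NoPE) setting concrete, here is a minimal numpy sketch of a single causal self-attention head where nothing position-dependent is added to the inputs; the causal mask is the only thing that distinguishes positions. The function and variable names are illustrative, not taken from the paper:

import numpy as np

def causal_self_attention_nope(X, Wq, Wk, Wv):
    """Single-head causal self-attention with no positional encoding (NoPE).

    X: (seq_len, d_model) token embeddings, with no positional signal added.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (seq_len, seq_len)
    # causal mask: position t may only attend to positions <= t
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V

# usage sketch
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 8, 16, 16
X = rng.normal(size=(seq_len, d_model))
Wq = rng.normal(size=(d_model, d_k))
Wk = rng.normal(size=(d_model, d_k))
Wv = rng.normal(size=(d_model, d_k))
out = causal_self_attention_nope(X, Wq, Wk, Wv)
print(out.shape)   # (8, 16)

Because each position can only see its own prefix, the model can in principle recover positional information implicitly, which is roughly the intuition behind the paper's finding.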
References
- Vaswani et al., “Attention Is All You Need”, NIPS 2017 ↩︎
- Gehring et al., “Convolutional Sequence to Sequence Learning”, ICML 2017 ↩︎
- Kazemnejad et al., “The Impact of Positional Encoding on Length Generalization in Transformers”, NeurIPS 2023 ↩︎