Transformers


3. Positional Encodings

3.1 Introduction

For the model to make use of the order of the sequence, we need to inject some information about the relative or absolute position of the tokens in the sequence. We do this by adding “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks. These encodings have the same dimensionality \(d_{model}\) as the embeddings to make their addition possible.

The figure below shows the part of the transformer where positional encodings are created and added to the embeddings.

embeddings_and_pe.png
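To make the addition concrete, here is a minimal shape-level sketch (with hypothetical sizes, and random placeholder values standing in for the real sinusoidal encodings covered in the next section): the embeddings of a batch of sequences have shape (batch, seq_len, \(d_{model}\)), and an encoding table of shape (seq_len, \(d_{model}\)) is broadcast-added to them.

import numpy as np

# Hypothetical sizes, for illustration only
batch_size, seq_len, d_model = 2, 5, 8

token_embeddings = np.random.randn(batch_size, seq_len, d_model)  # stand-in for learned embeddings
positional_encodings = np.random.randn(seq_len, d_model)          # placeholder values; the real ones are sinusoids (Section 3.2)

# The sum is well-defined because both share the last dimension d_model;
# the encodings broadcast across the batch axis.
encoder_input = token_embeddings + positional_encodings
print(encoder_input.shape)  # (2, 5, 8)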

3.2 Mathematical Foundations & Implementation

The transformer uses sine and cosine functions of different frequencies: \[ \begin{align} PE_{pos, 2i} &= \sin \left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)\\[0.5cm] PE_{pos, 2i+1} &= \cos \left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right) \end{align} \] where \(pos\) is the position of the token in the input sequence and \(i\) indexes the sine-cosine pairs of the embedding vector. As an example, if the sequence is 10 tokens long, then \(pos\) varies between \(0\) and \(9\), while \(i\) varies between \(0\) and \(\frac{d_{model}}{2}-1\) so that the even dimensions \(2i\) and the odd dimensions \(2i+1\) together cover all \(d_{model}\) dimensions of the embedding space. Each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from \(2\pi\) to \(10000\times 2\pi\).
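To see where this geometric progression comes from, note that the sinusoid attached to dimension pair \(i\) is \(\sin\left(pos / 10000^{\frac{2i}{d_{model}}}\right)\), so its wavelength in \(pos\) is

\[ \lambda_i = 2\pi \cdot 10000^{\frac{2i}{d_{model}}}, \qquad \frac{\lambda_{i+1}}{\lambda_i} = 10000^{\frac{2}{d_{model}}} \]

i.e., successive wavelengths grow by a constant factor, starting at \(\lambda_0 = 2\pi\) and approaching \(10000 \times 2\pi\) as \(2i\) approaches \(d_{model}\).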

Here is an example implementation of the above encoding equations when \(d_{model}=4\):

import numpy as np

def create_pos_enc(d=4, max_pos=10):
    """
    Create positional encodings for a given dimension and maximum
    number of positions.
    """
    i_values = np.arange(d // 2)  # We have d/2 sine-cosine pairs

    thetas = 1 / (10000 ** (2 * i_values / d))  # frequencies 1 / 10000^(2i/d)

    positions = np.arange(max_pos + 1)
    PE = np.zeros((max_pos + 1, d))

    for pos in positions:
        PE[pos, 0::2] = np.sin(pos * thetas)  # even dimensions
        PE[pos, 1::2] = np.cos(pos * thetas)  # odd dimensions

    return PE

PE = create_pos_enc()

# Display first three encodings
print("Positional Encodings (first 3 positions):\n", PE[:3])
>>> Positional Encodings (first 3 positions): 
     [[ 0.         1.         0.         1.         ] 
      [ 0.84147098  0.54030231 0.00999983 0.99995   ] 
      [ 0.90929743 -0.41614684 0.01999867 0.99980001]]
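As an aside, the loop over positions above is only for readability; the same table can be built in a single vectorized step. Here is an equivalent sketch (same defaults, our own variant rather than anything from the paper):

def create_pos_enc_vectorized(d=4, max_pos=10):
    """Vectorized variant: build all positions at once via an outer product."""
    i_values = np.arange(d // 2)
    thetas = 1 / (10000 ** (2 * i_values / d))
    angles = np.arange(max_pos + 1)[:, None] * thetas  # shape (max_pos+1, d//2)

    PE = np.zeros((max_pos + 1, d))
    PE[:, 0::2] = np.sin(angles)
    PE[:, 1::2] = np.cos(angles)
    return PE

print(np.allclose(create_pos_enc(), create_pos_enc_vectorized()))  # True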

We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset \(k\), \(PE_{pos+k}\) can be represented as a linear function of \(PE_{pos}\). Attention is all you need, 2017 - Vaswani et al.

Let’s dive deeper to see how the model could learn to attend by relative positions through the linear relationship between \(PE_{pos}\) and \(PE_{pos+k}\). Using the angle-addition trigonometric identities, the components of \(PE_{pos+k}\) can be rewritten in terms of those of \(PE_{pos}\):

  1. For even \(i\): \[ \sin \left(\frac{pos + k}{10000^{\frac{2i}{d}}}\right) = \sin \left(\frac{pos}{10000^{\frac{2i}{d}}} \right) \cos \left(\frac{k}{10000^{\frac{2i}{d}}}\right) + \cos \left(\frac{pos}{10000^{\frac{2i}{d}}}\right) \sin \left(\frac{k}{10000^{\frac{2i}{d}}}\right) \]
  2. For odd \(i\): \[ \cos \left(\frac{pos + k}{10000^{\frac{2i}{d}}}\right) = \cos \left(\frac{pos}{10000^{\frac{2i}{d}}}\right) \cos \left(\frac{k}{10000^{\frac{2i}{d}}}\right) - \sin \left(\frac{pos}{10000^{\frac{2i}{d}}}\right) \sin \left(\frac{k}{10000^{\frac{2i}{d}}}\right) \]

Notice that, for any fixed offset \(k\), these equations express the encoding at position \(pos+k\) as a linear combination of the sine and cosine terms at position \(pos\). To make this explicit, let’s define \(\theta_i\) as the quantity \[ \theta_i = \frac{1}{10000^{\frac{2i}{d_{model}}}} \]

and the following sine and cosine terms in \(k\) are both constants:

\[ \cos \left(\frac{k}{10000^{\frac{2i}{d_{model}}}}\right)=\cos (k\cdot \theta_i) \qquad \sin \left(\frac{k}{10000^{\frac{2i}{d_{model}}}}\right)=\sin (k\cdot \theta_i) \]

And the positional encoding equations now become:

  1. For even \(i\): \[ \sin \left((pos + k) \cdot \theta_i \right) = \sin (pos \cdot \theta_i) \times \cos (k\cdot \theta_i) + \cos (pos \cdot \theta_i) \times \sin (k\cdot \theta_i) \]
  2. For odd \(i\): \[ \cos \left((pos + k) \cdot \theta_i \right) = \cos (pos \cdot \theta_i) \times \cos (k\cdot \theta_i) - \sin (pos \cdot \theta_i) \times \sin (k\cdot \theta_i) \]

Notice in the above pair of equations that when \(i\) is even, \(\cos (pos \cdot \theta_i)\) corresponds to the encoding at the next, odd dimension, i.e., \(PE_{pos, i+1}\). Conversely, when \(i\) is odd, \(\sin (pos \cdot \theta_i)\) corresponds to the encoding at the previous, even dimension, i.e., \(PE_{pos, i-1}\). We can now rewrite the above pair of equations using this observation, i.e., we express \(PE_{pos+k, i}\) in terms of the components of \(PE_{pos}\).

  1. For even \(i\): \[ PE_{pos+k, i} = PE_{pos, i} \times \cos (k\cdot \theta_i) + PE_{pos, i+1} \times \sin (k\cdot \theta_i) \]
  2. For odd \(i\): \[ PE_{pos+k, i} = PE_{pos, i} \times \cos (k\cdot \theta_i) - PE_{pos, i-1} \times \sin (k\cdot \theta_i) \]

Let’s assume we have the positional encoding for position \(0\), i.e., \(PE_0\), and we want to find \(PE_k\) for some fixed offset \(k\). Here is how we can implement this in code:

import numpy as np

def compute_pe(pos, target, d_model=4):
    """
    Calculate the positional encoding of a target position from that
    of a source position.
    """

    k = target - pos  # offset between the target and source positions
    i = np.arange(d_model // 2)

    theta_i = 1 / (10000 ** (2 * i / d_model))

    pe_pos_even = np.sin(pos * theta_i)  # even dimensions (sine terms)
    pe_pos_odd = np.cos(pos * theta_i)   # odd dimensions (cosine terms)

    cos_k = np.cos(k * theta_i)
    sin_k = np.sin(k * theta_i)

    # positional encoding equations for even and odd i's
    pe_target_even = pe_pos_even * cos_k + pe_pos_odd * sin_k
    pe_target_odd = pe_pos_odd * cos_k - pe_pos_even * sin_k

    pe_pos = np.zeros(d_model)
    pe_target = np.zeros(d_model)

    pe_pos[0::2], pe_pos[1::2] = pe_pos_even, pe_pos_odd
    pe_target[0::2], pe_target[1::2] = pe_target_even, pe_target_odd

    return pe_pos, pe_target


# Encoding for position 1 from that of position 0
compute_pe(0, 1)
>>> (array([0., 1., 0., 1.]), 
     array([0.84147098, 0.54030231, 0.00999983, 0.99995 ]))

# Try any combinations like ...
# compute_pe(1, 4)
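As a sanity check (reusing the PE table built earlier with create_pos_enc, which also used \(d_{model}=4\)), both arrays returned by compute_pe should match the corresponding rows of that table:

pe_pos, pe_target = compute_pe(1, 4)
print(np.allclose(pe_pos, PE[1]))     # True
print(np.allclose(pe_target, PE[4]))  # True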

Back to the pair of equations. We can rewrite them in a compact form to yield the following matrix equation: \[ \begin{bmatrix} \sin \left((pos + k) \cdot \theta_i \right) \\ \cos \left((pos + k) \cdot \theta_i \right) \end{bmatrix} = \begin{bmatrix} \cos (k\cdot \theta_i) & \sin (k\cdot \theta_i) \\ -\sin (k\cdot \theta_i) & \cos (k\cdot \theta_i) \end{bmatrix} \cdot \begin{bmatrix} \sin \left(pos \cdot \theta_i \right) \\ \cos \left(pos \cdot \theta_i \right) \end{bmatrix} \]

We denote the transformation matrix as \(M_k\) where \(k\) is some positive integer. That is: \[ M_k = \begin{bmatrix} \cos (k\cdot \theta_i) & \sin (k\cdot \theta_i) \\ -\sin (k\cdot \theta_i) & \cos (k\cdot \theta_i) \end{bmatrix} \]

Here is what the last three equations tell us: given the positional encoding (\(PE_{pos}\)) of some position, we can find the encoding (\(PE_{pos+k}\)) of a position that is some fixed offset \(k\) ahead as a linear transformation of \(PE_{pos}\) by the matrix \(M_k\). More specifically, given the encoding \(PE_{pos}=[e_i: i \in \{0, 1, \cdots, d_{model}-1\}]\), we can find the individual elements of \(PE_{pos+k}\) by transforming each successive sine-cosine pair of \(PE_{pos}\) using \(M_k\).

The following code shows how we can implement this matrix transformation approach:

# Helper Function
def get_transformation_matrix(theta):
    """
    Compute the transformation matrix for a given theta
    """
    M_k =  np.array([
        [np.cos(theta), np.sin(theta)],
        [-np.sin(theta), np.cos(theta)]
    ])
    return M_k  

def linear_trans_pe(d=4, pos=1):
    """
    Compute the positional encoding for a given position by applying the
    matrix transformation M_k to PE[0], the table computed earlier.
    """
    i_values = np.arange(d // 2)  # We have d/2 sine-cosine pairs
    thetas = 1 / (10000 ** (2 * i_values / d)) 

    computed_PE = np.zeros(d)  

    for j, theta in enumerate(thetas):
        M = get_transformation_matrix(pos * theta)
        vec = PE[0, 2*j:2*j+2]  # Get sin and cos values for pos=0
        computed_PE[2*j:2*j+2] = M @ vec  # Apply transformation  

    return computed_PE  

# encoding for position 1
computed_PE_1 = linear_trans_pe()

# Compare with directly computed PE(pos=1)
print("Computed PE(pos=1) from PE(pos=0):", computed_PE_1)
print("Directly Computed PE(pos=1):       ", PE[1])
print("Difference:", computed_PE_1 - PE[1])
>>> Computed PE(pos=1) from PE(pos=0): 
    [0.84147098 0.54030231 0.00999983 0.99995 ] 
    Directly Computed PE(pos=1): 
    [0.84147098 0.54030231 0.00999983 0.99995 ] 
    Difference: [0. 0. 0. 0.]


# encoding for position 10
computed_PE_10 = linear_trans_pe(pos=10)

# Compare with directly computed PE(pos=10)
print("\nComputed PE(pos=10) from PE(pos=0):", computed_PE_10)
print("Directly Computed PE(pos=10):       ", PE[10])
print("Difference:", computed_PE_10 - PE[10])
>>> Computed PE(pos=10) from PE(pos=0): 
    [-0.54402111 -0.83907153 0.09983342 0.99500417] 
    Directly Computed PE(pos=10): 
    [-0.54402111 -0.83907153 0.09983342 0.99500417] 
    Difference: [0. 0. 0. 0.]

We note that when the offset is negative, that is, when we “look \(k\) steps backwards”, we only need to transpose \(M_k\) to obtain the right transformation. That is,

\[ M_{-k} = M_k^T = \begin{bmatrix} \cos (k\cdot \theta_i) & -\sin (k\cdot \theta_i) \\ \sin (k\cdot \theta_i) & \cos (k\cdot \theta_i) \end{bmatrix} \]

The derivation is left to the reader. Hint: apply the same angle-addition identities with the offset \(-k\) and use the fact that cosine is an even function while sine is odd.
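One way to convince yourself numerically (a quick sketch reusing the PE table and the get_transformation_matrix helper from above, not a derivation) is to recover an earlier encoding from a later one by applying the transpose:

def linear_trans_pe_backwards(src_pos, k, d=4):
    """
    Recover PE(src_pos - k) from PE(src_pos) by applying the transpose
    of M_k to each successive sine-cosine pair.
    """
    i_values = np.arange(d // 2)
    thetas = 1 / (10000 ** (2 * i_values / d))

    computed_PE = np.zeros(d)
    for j, theta in enumerate(thetas):
        M = get_transformation_matrix(k * theta)
        vec = PE[src_pos, 2*j:2*j+2]
        computed_PE[2*j:2*j+2] = M.T @ vec  # transpose: look k steps backwards
    return computed_PE

print(np.allclose(linear_trans_pe_backwards(8, 3), PE[5]))  # True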

3.3 Tying Everything Together

Before putting this code into its final training-ready form, we shall briefly discuss the problem of numerical instability arising from computing the sinusoid frequencies using floating point arithmetic. Without getting into too much detail, we shall simply note that computing \(\theta_i\) is more numerically stable when using a logarithmic approach. That is,

\[ \theta_i = \frac{1}{10000^{\frac{2i}{d_{model}}}} = \exp \left(-\frac{2i \cdot \log (10000)}{d_{model}}\right) \]
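A quick numerical check (with an arbitrary example size of \(d_{model}=512\)) confirms that the direct and log-space formulations agree up to floating-point error:

d_model = 512
i = np.arange(d_model // 2)

theta_direct = 1 / (10000 ** (2 * i / d_model))
theta_log = np.exp(-(2 * i) * np.log(10000.0) / d_model)

print(np.allclose(theta_direct, theta_log))  # True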

With that understood, we can now put everything we have learnt so far into a single class. We also note that the .forward method of the following class adds the positional encodings to the learned input embeddings. One more thing: dropout is applied to the sum of the embeddings and the positional encodings.

In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of \(P_{drop}=0.1\). Attention is all you need, 2017 - Vaswani et al.

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Implement the PE function."""

    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * -math.log(10000.0) / d_model
        )

        pe[:, 0::2] = torch.sin(position * div_term)  # even indices
        pe[:, 1::2] = torch.cos(position * div_term)  # odd indices

        pe = pe.unsqueeze(0)
        self.register_buffer("pe", pe)  # stored with the model but not updated during training

    def forward(self, x):
        # Add the positional encodings to the embeddings, then apply dropout
        x = x + self.pe[:, :x.size(1)].requires_grad_(False)
        return self.dropout(x)
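As a quick usage sketch (our own check, not part of the original code), setting dropout to 0 makes the module deterministic, so feeding zero “embeddings” should reproduce the NumPy table PE built earlier for positions 0 to 10:

pe_module = PositionalEncoding(d_model=4, dropout=0.0)
out = pe_module(torch.zeros(1, 11, 4))  # zero "embeddings" for 11 positions

# The result should match the NumPy table built in Section 3.2
print(np.allclose(out[0].numpy(), PE, atol=1e-6))  # True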

Since these positional encodings are sinusoids, it would be beneficial to visualize the results as a graph.

import pandas as pd
import altair as alt

def example_positional():
    d_model = 20
    pe = PositionalEncoding(d_model, 0)
    y = pe(torch.zeros(1, 100, d_model))  

    data = pd.concat([
            pd.DataFrame({"embedding": y[0, :, dim], 
            "dimension": dim, "position": list(range(100)),}) for dim in 
            range(4, 8)
        ])

    return (
        alt.Chart(data)
        .mark_line()
        .properties(width=800)
        .encode(x="position", y="embedding", color="dimension:N")
        .interactive()
    ) 

example_positional()
pe_visualization.png

Recall the following statement from the paper:

Each dimension \(i\) of the positional encoding corresponds to a sinusoid. Attention is all you need, 2017 - Vaswani et al.

Notice in the above graph that this is indeed the case. We have plotted only four of the twenty sinusoids (dimensions 4 to 7) to keep the figure readable. Each position’s encoding can be read off graphically by taking each sinusoid’s value at that position on the x-axis.
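For instance, since the plotted curves are dimensions 4 to 7 of a 20-dimensional encoding, reading a value off the graph at some position is the same as indexing the module’s output directly:

pe = PositionalEncoding(20, 0)
y = pe(torch.zeros(1, 100, 20))
print(y[0, 30, 4:8])  # the four plotted sinusoids evaluated at position 30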

Okay, that’s a ton of information to absorb. Take some time to master this before proceeding to the next tutorial.

3.4 Conclusion

In this tutorial, we explored positional encodings, and understood how the transformer learns to attend by relative positions. Next, we delve into the most important part of the transformer - attention!

Acknowledgements

This series of tutorials has benefited a lot from Harvard NLP’s codebase for The Annotated Transformer.