The transformer is a deep learning architecture introduced by Google researchers in the groundbreaking 2017 paper "Attention Is All You Need". It is built on the multi-head attention mechanism, an evolution of the attention mechanisms proposed by Bahdanau et al. and Luong et al. for recurrent neural network (RNN) based sequence-to-sequence models.
Transformers are widely used to build state-of-the-art (SOTA) sequence transduction models and ushered in the era of foundation models. These models are now pivotal in fields such as natural language processing, computer vision, and time series modeling.
Originally, attention mechanisms were integrated into RNNs and CNNs. Transformers broke with this approach by relying solely on attention, eliminating recurrence and convolutions altogether, hence the paper's title, "Attention Is All You Need."
"Most competitive neural sequence transduction models have an encoder-decoder structure. […] The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder." (Attention Is All You Need, Vaswani et al., 2017)
The figure below shows the model architecture of the transformer.
Here is a high-level description of the main building blocks of the transformer:
Encoder: Composed of a stack of \(N=6\) identical layers, each containing a self-attention sub-layer and a position-wise feed-forward sub-layer. It processes the input sequence and transforms it into a meaningful internal representation. Unlike traditional sequential models, the transformer processes the entire sequence in parallel, allowing it to capture global dependencies effectively.
Decoder: Also composed of a stack of \(N=6\) identical layers. It generates the output sequence based on the encoder’s representations and previously generated outputs. Like the encoder, each layer uses self-attention and feed-forward sub-layers, but it also includes an additional encoder-decoder attention sub-layer that attends to the encoder’s output. The decoder is autoregressive: it generates tokens one at a time while attending to its past predictions.
Attention: Implements mechanisms that allow the model to weigh the importance of different tokens in a sequence when making predictions. Self-attention, specifically, enables the transformer to relate every token to every other token in the same sequence simultaneously. Multi-head attention enhances this by running several attention operations in parallel over different learned projections of the input, improving the model’s ability to capture complex dependencies (a minimal sketch of the underlying scaled dot-product attention follows this list).
Position-wise Feed-Forward Networks: These networks refine token representations after the attention sub-layers by applying two fully connected layers with a non-linear activation in between. Since attention mainly mixes information across tokens through weighted sums, these feed-forward layers add the per-token non-linear transformations needed to capture more abstract features.
Embeddings & Softmax: Embeddings convert discrete tokens into continuous vector representations that the model can process. At the output stage, a linear transformation followed by a softmax function maps the decoder’s output into a probability distribution over the vocabulary, from which the next token is predicted.
Positional Encoding: Since the transformer doesn’t use recurrence or convolutions, it needs a way to understand token order. Positional encodings, added to the embeddings, provide information about the position of each token in the sequence. This can be achieved with sinusoidal functions, which may also allow the model to extrapolate to sequence lengths longer than those seen during training (a minimal sketch follows this list).
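To ground the attention description above: each attention head computes scaled dot-product attention, \(\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^\top / \sqrt{d_k})\,V\). Below is a minimal sketch in PyTorch; the function name, toy tensor shapes, and optional mask argument are illustrative choices, and the full multi-head version (several such heads over learned projections) is built step-by-step in the later tutorials.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    q, k, v: (batch, seq_len, d_k) tensors of queries, keys, and values.
    mask:    optional boolean tensor; positions where mask is False are hidden.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq_len, seq_len)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)              # attention weights per token
    return weights @ v                                   # weighted sum of value vectors

# Toy example: one sequence of 4 tokens with 8-dimensional queries/keys/values.
q = k = v = torch.randn(1, 4, 8)
out = scaled_dot_product_attention(q, k, v)   # -> (1, 4, 8)
```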
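The sinusoidal encodings mentioned in the last item can likewise be written in a few lines. Here is a minimal sketch following the paper’s formulas \(PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})\) and \(PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})\); the function name and toy dimensions are illustrative, and the sketch assumes an even \(d_{model}\).

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )                                                                    # (d_model / 2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

# Positional encodings are simply added to the token embeddings:
embeddings = torch.randn(2, 10, 512)                                  # (batch, seq_len, d_model)
pe = sinusoidal_positional_encoding(max_len=10, d_model=512)
inputs = embeddings + pe.unsqueeze(0)   # broadcast over the batch dimension
```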
The model architecture diagram illustrates how the transformer operates—how data enters the network, undergoes transformations, and produces the output. More than a simple graphical representation, it offers insights into building the transformer using frameworks like PyTorch and Keras.
Each “block” in the diagram can be treated as an object, aligning with the Object-Oriented Programming paradigm. For instance, in PyTorch, the key building blocks discussed earlier can be implemented as classes inheriting from torch.nn.Module. This enables the creation of reusable objects for both the encoder and decoder.
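As a concrete example of this pattern, here is a minimal sketch, not the exact implementation used later in the series, of the position-wise feed-forward network described earlier written as a class inheriting from torch.nn.Module. The class name is illustrative; the default sizes follow the paper’s \(d_{model} = 512\) and \(d_{ff} = 2048\), and placing dropout after the activation is an implementation choice rather than something prescribed by the architecture.

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # expand: d_model -> d_ff
        self.linear2 = nn.Linear(d_ff, d_model)   # project back: d_ff -> d_model
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); nn.Linear acts on the last dimension,
        # so every position is transformed independently with shared weights.
        return self.linear2(self.dropout(torch.relu(self.linear1(x))))

# Usage: the module is a reusable object, just like the other building blocks.
ffn = PositionwiseFeedForward()
out = ffn(torch.randn(2, 10, 512))   # -> (2, 10, 512)
```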
In the upcoming tutorials, we will construct the transformer step-by-step, starting with embeddings and positional encodings, followed by encoder and decoder sub-layers, full encoder and decoder layers, and finally integrating everything into the complete model.
To closely follow the original transformer implementation, each tutorial will include relevant excerpts from the foundational paper. Any deviations from the original design will be clearly highlighted.
Let’s dive in!
This series of tutorials has benefited greatly from Harvard NLP’s codebase for The Annotated Transformer.