Transformers


7. Encoder and Decoder Stacks

7.1 Introduction

We have already built the components of the encoder and decoder stacks and all we need now is to bring them all together in a suitable configuration to make each layer of the encoder and decoder. Once that is done, we shall connect the encoder and the decoder into a single model to complete the architecture.

7.2 Encoder Stack

The encoder is a stack of \(N=6\) identical layers. We shall build a single encoder layer and then make copies of it to complete the stack. Here is a figure showing the layout of each layer in the encoder stack. As we can see, the encoder layer comprises two sub-layers: a multi-head self-attention layer followed by a position-wise feed-forward layer. A residual connection and layer normalisation follow each of them.

encoder.png

Here is how we can implement a single encoder layer.

class EncoderLayer(nn.Module):
    '''
    Encoder is made of self attention and feed forward portions
    ''' 
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)  

    def forward(self, x, mask):
        # sub-layer 1: multi-head self-attention (queries, keys and values all come from x)
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        # sub-layer 2: position-wise feed-forward network
        return self.sublayer[1](x, self.feed_forward)
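
The layer above relies on the clones helper and the SublayerConnection module that we built in earlier parts of this series. For reference, a minimal sketch of each, following the pre-norm residual formulation used in The Annotated Transformer (the earlier definitions may differ slightly in detail), could look like this:

import copy
import torch.nn as nn

def clones(module, N):
    '''Produce N identical (deep-copied) layers in a ModuleList.'''
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

class SublayerConnection(nn.Module):
    '''
    A residual connection combined with layer normalisation. For code
    simplicity, the norm is applied to the input of the sub-layer.
    '''
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # x + dropout(sublayer(norm(x))): normalise, apply the sub-layer,
        # then add the result back to the residual stream
        return x + self.dropout(sublayer(self.norm(x)))

Note that this pre-norm placement (normalise, apply the sub-layer, then add the residual) follows the reference code; the original paper describes the post-norm arrangement \(\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))\).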

Recall that:

The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder. Attention is all you need, 2017 - Vaswani et al.

This tells us that the layers in the encoder stack process their inputs sequentially. The first encoder layer receives the input \(X\) from the embedding and positional encoding layers, processes it and passes its output to the second layer, the second layer to the third, and so on. With this in mind, we can now stack the encoder layers together.

class Encoder(nn.Module):
    '''
    Core encoder is a stack of N layers. In the original paper, N=6
    '''
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = nn.LayerNorm(layer.size)  

    def forward(self, x, mask):
        # process the inputs sequentially
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)
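
As a quick sanity check of the wiring, we can run the encoder with toy stand-ins for the attention and feed-forward sub-layers (the real MultiheadedAttention and PositionwiseFeedForward modules come from earlier sections) and confirm that a (batch, seq_len, \(d_{model}\)) input keeps its shape:

import torch
import torch.nn as nn

d_model = 16

class ToyAttention(nn.Module):
    '''
    Toy stand-in for the attention sub-layer, used only to exercise the
    encoder's wiring; it simply projects the queries and ignores the mask.
    '''
    def __init__(self, d):
        super(ToyAttention, self).__init__()
        self.proj = nn.Linear(d, d)

    def forward(self, query, key, value, mask):
        return self.proj(query)

toy_layer = EncoderLayer(d_model, ToyAttention(d_model),
                         nn.Linear(d_model, d_model), dropout=0.1)
toy_encoder = Encoder(toy_layer, N=2)

x = torch.randn(4, 10, d_model)    # (batch, seq_len, d_model)
mask = torch.ones(4, 1, 10)        # allow attention to every position
print(toy_encoder(x, mask).shape)  # torch.Size([4, 10, 16])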

7.3 Decoder Stack

Like the encoder, the decoder is also a stack of \(N=6\) identical layers. We shall build a single layer and then make copies of it. As seen in the figure below, each decoder layer comprises three sub-layers: a masked self-attention layer, an encoder-decoder attention layer, and a position-wise feed-forward layer. There is a residual connection around each of these, followed by layer normalisation.

decoder.png

There are two key differences between how the attention sub-layers are implemented in the decoder versus in the encoder (the mask that enforces the first of these is sketched just after the quotes):

… self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. Attention is all you need, 2017 - Vaswani et al.

In “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. Attention is all you need, 2017 - Vaswani et al.
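
The first point is enforced with a target mask that blocks attention to positions that come after the current one. A minimal sketch of such a mask, following The Annotated Transformer's subsequent_mask helper (which we also use during inference later in this tutorial), could look like this:

import torch

def subsequent_mask(size):
    '''
    Mask out subsequent positions: entry (i, j) is True only when position i
    is allowed to attend to position j, i.e. when j <= i.
    '''
    attn_shape = (1, size, size)
    mask = torch.triu(torch.ones(attn_shape), diagonal=1).type(torch.uint8)
    return mask == 0

print(subsequent_mask(3))
# tensor([[[ True, False, False],
#          [ True,  True, False],
#          [ True,  True,  True]]])

During inference, this mask grows with the partially generated output so that the decoder only attends to the tokens it has produced so far.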

Like the encoder stack, the decoder processes its inputs sequentially, one layer after the other. Let’s now implement a single decoder layer.

class DecoderLayer(nn.Module):
    '''
    Decoder is made of self attention, source attention, and feed forward
    '''
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)  

    def forward(self, x, memory, src_mask, tgt_mask):
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, 
            tgt_mask)) # self-attention
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, 
            src_mask)) # encoder-decoder attention
        return self.sublayer[2](x, self.feed_forward)

Now, let’s stack these layers together to form the decoder.

class Decoder(nn.Module):
    '''Generic N layer decoder with masking.''' 

    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = nn.LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

Now that we have the encoder and decoder stacks in place, let’s connect them to form the transformer.

7.4 Connecting the Encoder and the Decoder

Before tying the encoder and decoder together, we recognise that the transformer has a linear layer and a softmax layer that, together, transform the decoder’s output into a probability distribution over possible output tokens. The linear layer receives an input of dimensionality \(d_{model}\) and projects it to a vector whose length equals the target vocabulary size. The softmax layer then turns these scores into probabilities that tell us how likely it is that a decoded representation corresponds to each token/word in our target vocabulary.

generator.png

We shall link these two layers together in a single class called Generator, named for its role in “generating” the output probabilities from which the predicted words are drawn. In essence, this class takes the learned representations and transforms them into predictions that can be interpreted as words or symbols.

class Generator(nn.Module):
    '''
    Define standard linear + softmax generation step
    '''
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        # log_softmax (from torch.nn.functional) returns log-probabilities
        # over the target vocabulary
        return log_softmax(self.proj(x), dim=-1)
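
As a quick illustration with hypothetical sizes, the generator maps decoder outputs of shape (batch, seq_len, \(d_{model}\)) to log-probabilities over the vocabulary, from which we can pick the most likely token at each position:

import torch

gen = Generator(d_model=512, vocab=1000)     # illustrative sizes
x = torch.randn(2, 7, 512)                   # (batch, seq_len, d_model)
log_probs = gen(x)                           # shape (2, 7, 1000)
next_word = log_probs[:, -1].argmax(dim=-1)  # greedy pick for the last position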

Now, our transformer has five main components:

- Source embedding: Maps the source tokens to embeddings and adds positional encodings.
- Target embedding: Maps the target tokens to embeddings and adds positional encodings.
- Encoder: Receives the learned input embeddings and encodes them into a new representation.
- Decoder: Receives the target embeddings and the encoder’s output and decodes them to yield a predicted representation in the target language.
- Generator: Takes the decoder’s output (predicted representation) and transforms it into a probability distribution over possible output tokens.

Let’s now implement the full encoder-decoder architecture with these main components and add the methods that perform the encoding and decoding steps.

class EncoderDecoder(nn.Module):
    '''
    A standard Encoder-Decoder architecture. Base for this and many 
    other models
    '''
    def __init__(self, encoder, decoder, src_embed, tgt_embed, 
    generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator  

    def forward(self, src, tgt, src_mask, tgt_mask):
        "Take in and process masked source and target sequences"
        return self.decode(
                self.encode(src, src_mask), src_mask, tgt, tgt_mask
            )

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(
                self.tgt_embed(tgt), memory, src_mask, tgt_mask
            )

7.5 Creating the Model and Running Inferences

Now that we have the full architecture in place, we can instantiate a model and initialise it with random parameters, after which we shall make a forward pass. First, we shall define a function to create the model.

def make_model(src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1):

    c = copy.deepcopy
    attn = MultiheadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)  

    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(
            DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N
        ),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab),
    )
    
    # Initialize parameters with Glorot / fan_avg
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)  

    return model
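
As a quick check, we can build a small model and count its trainable parameters (the exact number depends on the sub-module implementations from earlier sections, so treat it as indicative):

small_model = make_model(src_vocab=11, tgt_vocab=11, N=2)
n_params = sum(p.numel() for p in small_model.parameters() if p.requires_grad)
print(f"Trainable parameters: {n_params:,}")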

Now we make a forward pass to generate a prediction using the model. We encode the source sequence once, then decode greedily: at each step we feed the tokens generated so far, pick the most likely next token from the generator’s output, and append it to the sequence. The output will be random since the model has not yet been trained.

def inference_test():
    test_model = make_model(11, 11, 2)
    test_model.eval()
    src = torch.LongTensor([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
    src_mask = torch.ones(1, 1, 10) 

    memory = test_model.encode(src, src_mask)
    ys = torch.zeros(1, 1).type_as(src.data)  

    for i in range(9):
        out = test_model.decode(
                        memory, src_mask, ys, 
                        subsequent_mask(ys.size(1)).type_as(src.data)
                     )

        prob = test_model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.data[0]
        ys = torch.cat(
               [ys, torch.empty(1, 1).type_as(src.data).fill_(next_word)], 
               dim=1
             )  

    print("Prediction: ", ys)

inference_test()
>>> Prediction: tensor([[0, 2, 9, 1, 1, 1, 1, 1, 1, 1]])

7.6 Conclusion

In this tutorial, we have explored how to build the encoder and decoder stacks and how to connect them to form the transformer. We have also seen how to add a generator at the decoder output to transform the decoded representations into actual words in the target language’s vocabulary. With that, we have studied every detail of the transformer’s architecture. Next, we shall turn our attention to training the transformer so that we can use it for machine translation.

Acknowledgements

This series of tutorials has benefited greatly from Harvard NLP’s codebase for The Annotated Transformer.