A Vanilla Self-Attention Layer

Transformer architectures have become a fairly hot topic in machine learning since the “Attention Is All You Need” paper was published in 2017. Since then, they have been applied to a variety of domains like image generation, music generation and language representation models. If there is a problem involving processing sequences, I bet someone has thrown the transformer model at it.

In this post I’m going to dig into a small piece of the transformer architecture – the self-attention layer. I find that this layer often confuses folks a great deal. I suspect part of the problem is the way it is presented in relation to other attention mechanisms, which often adopt the “query, key, value” terminology to describe their various components. However, self-attention ends up being, in a way, a degenerate case of attention where the query, keys, and values are all practically the same! So, we are going to throw away most of that language for now, leaving only the term “query”, and walk through the exact computations of the self-attention layer. In a follow-up post I’ll likely reintroduce the terminology to describe concepts that build on top of self-attention, like multi-headed attention.

Functionality

A self-attention layer takes as input a sequence of vectors \(\vec{v}_1, \vec{v}_2, ..., \vec{v}_n\) and produces as output a one-to-one corresponding sequence \(\vec{o}_1, \vec{o}_2, ..., \vec{o}_n\).

We will visit each input vector \( \vec{v}_i \) in turn, and compute its output vector \( \vec{o}_i \). When we visit an input vector, we will call it the query vector, and all other input vectors the non-query vectors.

The output for a query vector is computed by taking the weighted sum of all the non-query input vectors, where the weight for each is proportional to the exponentiated scaled dot-product of that vector with the query. The scale for the dot-product is the square root of the dimensionality of the vectors, denoted \( d \) in the equation. The authors of the original paper report that this scaling counteracts the vanishing gradients that arise when dot products between high-dimensional vectors grow large in magnitude.

\[ \begin{gather*} \vec{o}_i = \sum_{j \ne i} w_j \vec{v}_j \\ w_j \propto \exp\left(\frac{\vec{v}_i \cdot \vec{v}_j}{\sqrt{d}}\right) \end{gather*} \]

The dot-product measures the similarity between two vectors. This gives us an important property – vectors in the input sequence that are similar to the query receive high weight, while vectors dissimilar to the query receive low weight. Note that the weights are simply the softmax over the scaled dot-products of the non-query input vectors with the query.
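Written out in full, with the normalization made explicit, the weights for query \( \vec{v}_i \) are:

\[ w_j = \frac{\exp\left(\vec{v}_i \cdot \vec{v}_j / \sqrt{d}\right)}{\sum_{k \ne i} \exp\left(\vec{v}_i \cdot \vec{v}_k / \sqrt{d}\right)} \]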

And…that is it. In summary, to compute the output sequence, we visit each input vector, declare it the query, compute the weights of the non-query vectors, and compute the weighted sum.
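To make the procedure concrete, here is a minimal NumPy sketch of that loop (the function name and the example at the bottom are just for illustration):

```python
import numpy as np

def self_attention(vectors):
    """Naive self-attention as described above: for each query vector,
    weight every *other* input vector by the softmax of its scaled
    dot-product with the query, then sum. `vectors` is an (n, d) array."""
    n, d = vectors.shape
    outputs = np.zeros_like(vectors, dtype=float)
    for i in range(n):                            # visit each input vector as the query
        query = vectors[i]
        scores = vectors @ query / np.sqrt(d)     # scaled dot-products with the query
        scores[i] = -np.inf                       # exclude the query itself (j != i)
        weights = np.exp(scores)
        weights /= weights.sum()                  # softmax over the non-query vectors
        outputs[i] = weights @ vectors            # weighted sum of the non-query vectors
    return outputs

# Example: a sequence of 4 three-dimensional vectors
sequence = np.random.randn(4, 3)
print(self_attention(sequence))
```

Each row of the result is the output \( \vec{o}_i \) for the corresponding input vector.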

Caveats

When computing the output for a query vector, the weights associated with the input vectors are, taken together, called an attention vector. The magnitude of the weight for an input vector measures how salient the information in that vector will be in the output, with saliency defined, through the use of the dot-product, as the degree of similarity between the query and that input vector. This can be counterintuitive to some, since there don't seem to be any learned parameters here. In other words, how does the model learn what to pay attention to? Folks often mention the case of word embeddings – just because two words are semantically similar (close in the embedding space) doesn't mean they should have high attention scores for one another.

I think if the self-attention layer is computed in isolation, without additional context around it, then this concern is valid. However, in most uses of self-attention, the embeddings are typically trainable, which means that the model will learn to craft word embeddings that are useful for the self-attention layer to perform well on the target task. Also, the self-attention layer rarely processes the raw word vectors themselves. Often self-attention layers are a component in a larger multi-headed attention layer, which I cover in a future post.

Another thing worth mentioning is that the way I outlined the self-attention computation is very inefficient and doesn't take advantage of the parallelism available on modern GPUs. This was done to make the concept of self-attention as clear as possible, without obscuring it with optimizations. Typically, what you'd do instead is package the computation as a large matrix multiplication. I cover this in another post as well.
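As a rough preview, and purely as an illustrative sketch, the same result can be expressed with a couple of matrix products, keeping the \( j \ne i \) convention by masking the diagonal of the score matrix:

```python
import numpy as np

def self_attention_batched(vectors):
    """The same computation as the loop version, expressed with matrix
    products. `vectors` is an (n, d) array; the rows of the result match
    the outputs produced by the per-query loop."""
    n, d = vectors.shape
    scores = vectors @ vectors.T / np.sqrt(d)        # all pairwise scaled dot-products
    np.fill_diagonal(scores, -np.inf)                # keep the j != i convention
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ vectors                         # weighted sums, one row per query
```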