Scaled Dot Product Attention

Tags: transformers, deep learning, attention
The foundational intuitions behind scaled dot product attention, the core operation of the transformer architecture.
Published March 18, 2026

Written by Michael Gethers

Introduction

The purpose of this document is to fully articulate the foundational intuitions of scaled dot product attention.

This will be the first in a series of documents I intend to put together about the Attention is All You Need paper. My intent is that this will serve both to solidify my own understanding of the paper’s concepts and to help others build a practical understanding of the hows and whys of the transformer architecture.

The function

The scaled dot product attention function, as shown in the Attention is All You Need paper, is the following: \[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]

Where \(Q\) and \(K\) are matrices of size (batches, sequence_length, d_k), and \(V\) is a matrix of size (batches, sequence_length, d_v). Though one could theoretically design an architecture in which \(d_k\) and \(d_v\) were not equal (there is no mathematical constraint requiring them to be), in practice, and in the paper, \(d_k = d_v\).

Batches are included for computational efficiency during training, but are not fundamental to understanding scaled dot product attention. So for now, we will assume the following simplification:

  • \(Q \in \mathbb{R}^{n \times d_k}\)
  • \(K \in \mathbb{R}^{n \times d_k}\)
  • \(V \in \mathbb{R}^{n \times d_v}\)

\(n\) here is the sequence length. \(d_k\) is determined by the full model dimension \(d_{model}\) and the number of attention heads used; this will be covered in a future document on multi-head attention. For now, know that the paper used \(d_{model} = 512\) and 8 attention heads, so \[d_k = d_v = \frac{d_{model}}{n_{heads}} = \frac{512}{8} = 64\]
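As a quick sanity check, the per-head dimension from the paper’s configuration works out as:

```python
d_model = 512  # model (embedding) dimension from the paper
n_heads = 8    # number of attention heads from the paper

d_k = d_model // n_heads  # dimension of each head's query/key vectors
print(d_k)  # 64
```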

The \(Q\), \(K\), and \(V\) matrices

The \(Q\), \(K\), and \(V\) matrices are produced by multiplying the input embeddings by three weight matrices (\(W^Q\), \(W^K\), and \(W^V\)) that are randomly initialized at the start of training. The architecture of scaled dot product attention, and the transformer architecture more broadly, allows them to take on meaning as the model learns:

  • \(Q\): the query matrix
  • \(K\): the keys matrix
  • \(V\): the values matrix

Remember, each of these matrices has one row for every token in the input sequence, so each row effectively represents the query vector (in \(Q\)), key vector (in \(K\)), and the value vector (in \(V\)) for each respective token.
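As a minimal sketch of where these matrices come from, the following plain-Python toy (the sizes, the helper names, and the use of nested lists are my own illustration, not from the paper) projects a matrix of token embeddings into \(Q\), \(K\), and \(V\), each with one row per token:

```python
import random

random.seed(0)

n, d_model, d_k = 4, 8, 2  # toy sizes: 4 tokens, illustration only

def rand_matrix(rows, cols):
    """Matrix of random values, standing in for embeddings/learned weights."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def matmul(A, B):
    """Plain-Python matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

X = rand_matrix(n, d_model)      # token embeddings, one row per token
W_q = rand_matrix(d_model, d_k)  # learned projection weights (random at init)
W_k = rand_matrix(d_model, d_k)
W_v = rand_matrix(d_model, d_k)

Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
print(len(Q), len(Q[0]))  # n rows (one per token), d_k columns
```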

An analogy

The concepts of queries, keys, and values are a bit abstract at first glance, so I’ve found it useful to think of them through an analogy to a library.

A query would be what someone is looking for when they go to the library: “I’m looking for information about the French Revolution”.

A key might be what’s on the spines of each book in the library: A Tale of Two Cities, To Kill a Mockingbird, Interpreting the French Revolution, Infinite Jest, etc.

A value would be the actual content of those books: the information each book actually contains.

The transformer is designed to allow queries about the French Revolution to determine that To Kill a Mockingbird and Infinite Jest probably provide very little relevant information, A Tale of Two Cities may contain a bit, and that Interpreting the French Revolution probably contains a lot, and then to synthesize the content of those books into a coherent summary of the relevant information.

More explicitly

Our task is different, so while the library analogy is useful to generate a baseline understanding of what these matrices are intended to learn, we can make the task more explicit.

What the scaled dot product attention architecture is actually trying to do is something much more similar to this:

Take a sentence, like:

The creek was cold and the spunky dog wanted to jump over it.

As a human, when we read this sentence, it is clear to us that “cold” is referring to the “creek”, “spunky” is referring to “dog”, and “it” is referring to the “creek”.

This is essentially what the scaled dot product attention function is attempting to learn. What other words/tokens in the sequence should the model pay attention to when trying to understand the meaning of the word “it”?

In this example, the query would be “it”, which has its own vector in the \(Q\) matrix, call it \(q_{\mathrm{it}}\). Every other token in the sequence has its own vector in the \(K\) matrix (the other matrices as well, but for now we’re interested in \(K\)). We would like our model to learn that “it” and “creek” are related, and the way that gets encoded in the model is through a large dot product between \(q_{\mathrm{it}}\) and \(k_{\mathrm{creek}}\), and a small dot product between \(q_{\mathrm{it}}\) and vectors in \(K\) that are irrelevant to understanding “it”.
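A toy illustration of this with made-up 4-dimensional vectors (the numbers are entirely hypothetical, chosen only to show the effect, and the names are my own):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical "learned" vectors: q_it points in a similar direction to
# k_creek, and in a dissimilar direction to k_spunky.
q_it     = [0.9, -0.2, 0.7, 0.1]
k_creek  = [1.0, -0.1, 0.8, 0.0]   # similar direction -> large dot product
k_spunky = [-0.5, 0.9, -0.6, 0.2]  # dissimilar direction -> small dot product

print(dot(q_it, k_creek))   # large, positive
print(dot(q_it, k_spunky))  # small (here, negative)
```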

The \(V\) values matrix is then the final synthesis of this content. When appropriately trained, the softmax function allows us to see that “it” refers to “creek”, and maybe a bit to “cold”, and to very little else in the sequence. This breakdown gets multiplied by the \(V\) matrix to produce the final output of this single scaled dot product attention head.

It is important to note: there is nothing fundamental about these matrices that makes them useful as queries, keys, or values: the weights that produce them are simply randomly initialized. Their usefulness emerges out of the architecture of scaled dot product attention and the training process: every time the model makes an incorrect prediction because “it” and “creek” were not recognized as related tokens, the model gets nudged in a direction that associates them more closely.

Walking through the math

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \] The first matrix multiplication here is between \(Q\) and \(K^T\). Recall that \(Q\) and \(K\) are each \(n \times d_k\) matrices, so the output of this multiplication is an \(n \times n\) matrix, where \(n\) is the sequence length.

What does this \(n \times n\) matrix represent?

\[ C = QK^T \] Let’s think about this operation, and what the resulting matrix, which I’ll call \(C\), represents. When we multiply the \(Q\) and \(K^T\) matrices, we get one row and one column for every token in the sequence. In this matrix \(C\) we have the following entries \(c_{(i,j)}\):

  • \(c_{(1,1)}\): the dot product of the 1st token’s query vector with the 1st token’s own key vector
  • \(c_{(2,1)}\): the dot product of the 2nd token’s query vector with the 1st token’s key vector
  • \(c_{(1,2)}\): the dot product of the 1st token’s query vector with the 2nd token’s key vector
  • in general, \(c_{(i,j)}\): the dot product of the \(i\)th token’s query vector with the \(j\)th token’s key vector

This is essentially a relevance score: how relevant is the \(j\mathrm{th}\) key to the \(i\mathrm{th}\) query? The higher that dot product, the more relevant those tokens are to each other.
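To make the indexing concrete, here is a small plain-Python check (toy sizes and random data are my own) that entry \(c_{(i,j)}\) of \(QK^T\) is exactly the dot product of query row \(i\) with key row \(j\):

```python
import random

random.seed(1)
n, d_k = 3, 4  # toy sizes

Q = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(n)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

K_T = [list(col) for col in zip(*K)]  # transpose of K
C = matmul(Q, K_T)                    # n x n matrix of relevance scores

# c_(i,j) is the dot product of the i-th query row with the j-th key row.
assert abs(C[1][2] - dot(Q[1], K[2])) < 1e-12
print(len(C), len(C[0]))  # n x n
```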

Scaling by \(1/\sqrt{d_k}\)

Scaling by \(1/\sqrt{d_k}\) is the novel tweak this paper made to previously existing attention mechanisms, and it is founded in the simple idea that as \(d_k\) increases, the variance of the dot products in \(C\) increases as well.

Increasing \(d_k\) pushes these dot products out into the tails of the softmax function, where the gradient is very small. Scaling them brings them back toward the center, but the selection of \(1/\sqrt{d_k}\) is not arbitrary.

Let’s try to understand this choice. We operate under the assumption that the entries of our \(Q\) and \(K\) matrices are distributed with mean 0 and variance 1.

This is a reasonable assumption for a few reasons:

  1. Weight initialization: network weights are typically initialized in a way that is designed to produce mean 0 and variance 1 activations throughout the network. So at the start of training, the assumption is approximately true by design.
  2. Layer normalization: we will discuss this in a future document, but many Transformer implementations apply layer normalization before the attention operation (the original paper applied it just after each sub-layer). Layer norm explicitly normalizes its output to have mean 0 and variance 1 across the feature dimension, giving us row-wise normalization.
  3. Linear projections: this will also be discussed later, but if \(x \sim(0, 1)\), and \(W\) has iid mean-0 entries with variance \(1 / d\), then \(W x\) has mean 0 and variance \(\approx 1\).

The \((0,1)\) assumption will be discussed in more depth in future posts. For now, let us assume that our \(Q\) and \(K\) matrices are distributed \((0,1)\).

Let’s first revisit some fundamental statistics identities, starting with the definition of the variance of a random variable \(X\): \[ \begin{align} \operatorname{Var}(X) &= \mathbb{E}[X^2] - \mathbb{E}[X]^2 \\ \implies \mathbb{E}[X^2] &= \operatorname{Var}(X) + \mathbb{E}[X]^2 \end{align} \] And of the variance of the product \(XY\), where \(X\) and \(Y\) are independent: \[ \begin{align} \operatorname{Var}(XY) &= \mathbb{E}[X^2]\mathbb{E}[Y^2] - \mathbb{E}[X]^2\mathbb{E}[Y]^2 \\ \implies \operatorname{Var}(XY) &= \operatorname{Var}(X)\operatorname{Var}(Y) + \operatorname{Var}(X)\mathbb{E}[Y]^2 + \operatorname{Var}(Y)\mathbb{E}[X]^2 \end{align} \] Now, say we have two vectors \(a\) and \(b\), each consisting of \(n\) i.i.d. random variables \(A_i\) and \(B_i\), respectively: \[ \begin{align} a = [A_1,...,A_n]\\ b = [B_1,...,B_n] \end{align} \] Then the dot product of these vectors is: \[ a \cdot b = \sum_{i=1}^{n} A_i B_i \] Because the products \(A_iB_i\) are independent of one another, and variance is additive for independent random variables, the variance of \(a \cdot b\) is: \[ \begin{align} \operatorname{Var}(a \cdot b) &= \operatorname{Var}\left(\sum_{i=1}^{n} A_i B_i\right)\\ &=\operatorname{Var}(A_1B_1) + \operatorname{Var}(A_2B_2)+...+\operatorname{Var}(A_nB_n)\\ &=n\operatorname{Var}(A_iB_i) \end{align} \] That is, the variance of \(a \cdot b\) is simply \(n\) times the variance of \(A_iB_i\). With our assumption that the entries of \(Q\) and \(K\) are distributed \((0,1)\), and with vector length \(d_k\), the product-variance formula above simplifies considerably: all expectation terms vanish since \(\mathbb{E}[A_i] = \mathbb{E}[B_i] = 0\), and \(\operatorname{Var}(A_i) = \operatorname{Var}(B_i) = 1\), so \(\operatorname{Var}(A_iB_i) = 1\). \[ \operatorname{Var}(q \cdot k) = d_k\operatorname{Var}(A_iB_i) = d_k \] Now, we will be putting the entire expression \(QK^T/c\) into our softmax function, where \(c\) is the constant we divide by to scale. We want the variance of this entire expression, including the scalar, to be \(1\).

Recall also that to pull a constant out of a variance, it gets squared: \[ \begin{align} \operatorname{Var}(X) &= \mathbb{E}(X^2) - \mathbb{E}(X)^2\\ \implies \operatorname{Var}(aX) &= \mathbb{E}((aX)^2) - \mathbb{E}(aX)^2\\ &= a^2\mathbb{E}(X^2) - a^2\mathbb{E}(X)^2\\ &= a^2\operatorname{Var}(X) \end{align} \] So, in order to normalize, we need to scale by \(1/\sqrt{d_k}\), as this term is squared when pulled out of the variance: \[ \operatorname{Var}\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{1}{d_k}\operatorname{Var}(q \cdot k) = \frac{d_k}{d_k} = 1 \]
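We can also check this empirically with a plain-Python simulation under the stated \((0,1)\) assumptions (the sample size, seed, and tolerances below are my own choices):

```python
import random

random.seed(42)
d_k = 64
trials = 10_000

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Sample many dot products of independent standard-normal vectors.
raw = []
for _ in range(trials):
    q = [random.gauss(0, 1) for _ in range(d_k)]
    k = [random.gauss(0, 1) for _ in range(d_k)]
    raw.append(dot(q, k))

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(variance(raw))                            # roughly d_k = 64
print(variance([x / d_k ** 0.5 for x in raw]))  # roughly 1 after scaling
```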

Masking

Let’s think about what our task actually is as a concrete example. We have \(QK^T\) which is an \(n \times n\) matrix of relevance scores. We divide by \(\sqrt{d_k}\) to keep the variance under control going into the \(\operatorname{softmax}\).

Remember, we’re building a model that generates text left to right, one token at a time. However, when we train, we feed in a full sequence.

Doing this allows the model to cheat: if we let token 5 attend to token 8 during training, we’re training a model on information it will never actually have, and it will learn to simply copy from the future rather than predict it.

So we need to apply what’s called a causal mask: we need to hide future tokens from the model for training.

We want to mask out the strict upper triangle of our \(n \times n\) matrix, because this upper triangle contains entries like \(C_{(1,5)}\), which is essentially asking “how much should token 1 as query pay attention to token 5 as key?” The answer should be none, because that would be cheating: we only want attention where the query token index is greater than or equal to the key token index.

The way we do this is informed by the function we are about to pass the result to: \(\operatorname{softmax}\). We set this entire upper triangle to \(-\infty\), as \(\operatorname{softmax}\) will zero out these \(-\infty\) entries in the next step.
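A minimal sketch of building such a mask in plain Python (the function name and example scores are my own):

```python
NEG_INF = float("-inf")

def causal_mask_scores(scores):
    """Set strictly-upper-triangular entries (j > i) to -inf so that
    each query can only attend to itself and earlier tokens."""
    n = len(scores)
    return [[scores[i][j] if j <= i else NEG_INF for j in range(n)]
            for i in range(n)]

scores = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0],
          [7.0, 8.0, 9.0]]

# Row i keeps entries j <= i; everything above the diagonal becomes -inf.
for row in causal_mask_scores(scores):
    print(row)
```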

The \(\operatorname{softmax}\)

Recall the softmax function: \[ \sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}} \] This function plays two critical roles:

  1. It normalizes the vector it’s applied to into a probability distribution, so that it sums to 1.
  2. It sharpens the vector: because of the exponential, larger values are amplified and smaller values are suppressed.

Within our softmax function we have \(\frac{QK^T}{\sqrt{d_k}}\). This is still an \(n \times n\) matrix, with the same interpretability as before: how relevant is the \(j\mathrm{th}\) key to the \(i\mathrm{th}\) query? The higher that dot product, the more relevant those tokens are to each other.

What we want here is a row-wise \(\operatorname{softmax}\): for each query \(q_i\), we apply the softmax across that row’s scores against every key \(k_j\), making the output interpretable as “for every token as query, these are the keys we should be paying the most attention to”. Because of the masking, these will be only backward-looking. This allows our queries to do partial matches against multiple keys, and get back a weighted blend of the corresponding values.
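A row-wise softmax sketch in plain Python; subtracting the row maximum before exponentiating is a standard numerical-stability trick (not something the formula itself requires), and it conveniently makes \(-\infty\) entries come out as exactly 0:

```python
import math

def softmax_row(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Masked scores: exp(-inf) = 0, so future tokens receive exactly
# zero attention weight, and the remaining weights sum to 1.
weights = softmax_row([2.0, 1.0, float("-inf")])
print(weights)
print(sum(weights))
```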

The final step, multiplying by \(V\)

The final step is simple: we multiply the \(n \times n\) matrix coming out of our \(\operatorname{softmax}\) function by our \(n \times d_v\) (i.e. \(n \times d_k\)) values matrix \(V\).

This is an intuitive move, if we accept the interpretation of the softmax output and the \(V\) matrix itself. \(V\) represents the actual content of each token: each row is a vector representation that is not optimized simply for increasing the dot product of related tokens, but rather for the actual content of the token itself. When we multiply our \(\operatorname{softmax}\) output (which has told us that “it” is related to “creek” and “cold” but not related to “spunky”) by \(V\), we take the weighted blend from the \(\operatorname{softmax}\) and apply it to our actual token value representations.
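Putting all of the pieces together, here is a self-contained plain-Python sketch of the full function (toy sizes and helper names are my own; a real implementation would use a tensor library):

```python
import math
import random

random.seed(0)

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_row(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=True):
    """softmax(Q K^T / sqrt(d_k)) V, with an optional causal mask."""
    d_k = len(Q[0])
    scores = matmul(Q, transpose(K))  # n x n relevance scores
    scale = math.sqrt(d_k)
    scores = [[s / scale for s in row] for row in scores]
    if causal:  # hide future tokens from each query
        n = len(scores)
        scores = [[scores[i][j] if j <= i else float("-inf")
                   for j in range(n)] for i in range(n)]
    weights = [softmax_row(row) for row in scores]  # row-wise softmax
    return matmul(weights, V)                       # blend the values

n, d_k = 4, 8
rand = lambda r, c: [[random.gauss(0, 1) for _ in range(c)] for _ in range(r)]
Q, K, V = rand(n, d_k), rand(n, d_k), rand(n, d_k)

out = attention(Q, K, V)
print(len(out), len(out[0]))  # n rows, d_v (= d_k here) columns
```

Note that with the causal mask, the first token can only attend to itself, so its output row is exactly its own value vector, which is a handy sanity check.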