The intuition behind the position-wise feed-forward network sublayer, with a PyTorch implementation.
A deep dive into PyTorch memory layout and stride mechanics, motivated by the head-splitting reshape in multi-head attention.
How and why transformers learn multiple distinct attention subspaces simultaneously, with a full PyTorch implementation.
The foundational intuitions behind scaled dot-product attention, the core operation of the transformer architecture.