Writing

The intuition behind the position-wise feed-forward network sublayer, with PyTorch implementation.

A deep dive into PyTorch memory layout and stride mechanics, motivated by the head-splitting reshape in multi-head attention.

How and why transformers learn multiple distinct attention subspaces simultaneously, with a full PyTorch implementation.

The foundational intuitions behind scaled dot product attention, the core operation of the transformer architecture.