ML · Attention Is All You Need

Attention is a differentiable lookup. Covers scaled dot-product attention, multi-head attention, the n² complexity wall, and a minimal PyTorch implementation.
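The "differentiable lookup" view can be sketched directly: a query is scored against every key, the scores are softmax-normalized, and the output is the resulting weighted average of the values. A minimal sketch in plain Python (the helper name and list-based layout are illustrative, not the article's PyTorch version):

```python
import math

def scaled_dot_product_attention(Q, K, V):
    """Sketch of softmax(QK^T / sqrt(d)) V on plain nested lists.

    Q: (n, d) queries, K: (m, d) keys, V: (m, dv) values -> (n, dv).
    """
    d = len(K[0])
    out = []
    for q in Q:
        # similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        # numerically stable softmax turns scores into lookup weights
        mx = max(scores)
        exps = [math.exp(s - mx) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # output row: attention-weighted average of the value rows
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

Because softmax is smooth, the whole lookup is differentiable in Q, K, and V, which is what lets it be trained end to end; this is also where the n² cost comes from, since every query scores every key.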