"Foundations of Large Language Models: Underthehood of the Transformer Architecture" • Invited Talk at San Diego State University (@SDSU) • November 12, 2024
• Relevant Primers:
http://transformer.aman.ai
http://llm.aman.ai
• Overview: The talk covered the foundational principles of Large Language Models (LLMs), focusing on the Transformer architecture and its key components, including embeddings, positional encoding, self- and cross-attention mechanisms, skip connections, token sampling, and the roles of the encoder and decoder, and explained how these innovations enable efficient, context-aware language processing.
• Agenda:
➜ Transformer Overview
Scaled dot-product attention and multi-head mechanisms for parallel processing and contextual understanding.
Handles long-range dependencies and enables parallel computation for efficient training; a short sketch of the attention computation follows below.
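To make the above concrete, here is a minimal NumPy sketch of scaled dot-product attention; the function names, shapes, and random inputs are illustrative assumptions, not material from the talk slides.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # pairwise query-key similarity, scaled
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block disallowed positions (e.g., padding)
    weights = softmax(scores, axis=-1)         # one attention distribution per query
    return weights @ V                         # weighted sum of value vectors

# Toy example: 4 tokens, head dimension 8; multi-head attention runs several such
# heads in parallel on lower-dimensional projections and concatenates the results.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)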
➜ Input Embeddings
Embeddings reduce the dimensionality of input data, projecting words into a lower-dimensional space where similar words are closer.
Enables generalization across words with similar meanings, significantly reducing the model's parameters and required training data.
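A minimal sketch of the embedding lookup described above, assuming illustrative vocabulary and dimension sizes; in a real model the embedding matrix is learned during training rather than random.

import numpy as np

vocab_size, d_model = 10_000, 64                            # illustrative sizes
embedding = np.random.default_rng(0).normal(size=(vocab_size, d_model))

token_ids = np.array([17, 42, 17, 993])                     # hypothetical tokenizer output
vectors = embedding[token_ids]                              # (4, 64): one dense vector per token
# Similar words end up with nearby vectors, instead of orthogonal 10,000-dim one-hots.
print(vectors.shape)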
➜ Positional Encoding
Absolute positional encoding uses sinusoidal functions to encode positions, enabling models to infer token order.
Rotary Positional Embeddings (RoPE) combine absolute and relative positional benefits for long-sequence handling.
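The sinusoidal scheme from the original paper can be written directly from its definition, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); this sketch assumes an even d_model.

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]             # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dims get sine
    pe[:, 1::2] = np.cos(angles)                      # odd dims get cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=64)
# Added to the token embeddings before the first layer: x = token_embeddings + pe[:seq_len]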
➜ Self-Attention
Maps query, key, and value vectors derived from the same sequence to calculate token relationships.
Enables dynamic weighting of token relevance, creating contextualized embeddings in parallel.
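A compact sketch of self-attention, where queries, keys, and values are all projections of the same sequence X; the projection matrices here are random stand-ins for learned parameters.

import numpy as np

def attend(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                # softmax over keys
    return w @ V

d_model = 64
rng = np.random.default_rng(1)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

X = rng.normal(size=(6, d_model))                     # 6 tokens of one sequence
contextual = attend(X @ W_q, X @ W_k, X @ W_v)        # (6, 64) contextualized embeddings,
                                                      # computed for all tokens in parallel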
➜ Cross-Attention
Bridges encoder and decoder stacks by using encoder outputs as keys and values, with decoder queries steering generation.
Essential for tasks like translation, where the target sequence depends on the source sequence.
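Cross-attention differs only in where Q, K, and V come from: queries from the decoder's states, keys and values from the encoder's output. A minimal sketch, with random stand-ins for learned projections and for actual encoder/decoder states.

import numpy as np

def attend(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

d_model = 64
rng = np.random.default_rng(2)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

encoder_out = rng.normal(size=(10, d_model))     # 10 source tokens, already encoded
decoder_states = rng.normal(size=(4, d_model))   # 4 target tokens generated so far

out = attend(decoder_states @ W_q,               # queries steer generation
             encoder_out @ W_k,                  # keys come from the source
             encoder_out @ W_v)                  # values come from the source
# out has shape (4, 64): each target position is informed by the full source sequence.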
➜ Skip/Residual Connections
Prevent vanishing gradients and retain the original input by adding each layer's input back into its output.
Improve gradient flow, avoid forgetting input tokens, and enhance training stability.
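The residual pattern is a one-liner: the sublayer's input is added back to its output. This sketch assumes the post-norm ("Add & Norm") arrangement of the original Transformer; pre-norm variants move the normalization before the sublayer.

import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def add_and_norm(x, sublayer):
    return layer_norm(x + sublayer(x))   # skip connection: input added back to the output

rng = np.random.default_rng(3)
x = rng.normal(size=(4, 64))
W = rng.normal(size=(64, 64)) * 0.05     # stand-in for any sublayer (attention, FFN, ...)
out = add_and_norm(x, lambda h: h @ W)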
➜ Token Sampling
Converts the de-embedded outputs (logits) into probabilities using softmax for token prediction.
Techniques like temperature scaling or top-k sampling refine generation diversity and quality.
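A minimal sketch of temperature and top-k sampling over a vector of vocabulary logits; the cutoff logic and default values are illustrative choices, not a specific library's API.

import numpy as np

def sample_token(logits, temperature=1.0, top_k=None, rng=None):
    rng = rng or np.random.default_rng()
    logits = logits / temperature                            # <1 sharpens, >1 flattens the distribution
    if top_k is not None:
        cutoff = np.sort(logits)[-top_k]                     # k-th largest logit
        logits = np.where(logits >= cutoff, logits, -np.inf) # drop everything below it
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                     # softmax over the surviving tokens
    return rng.choice(len(logits), p=probs)

vocab_logits = np.random.default_rng(4).normal(size=50)      # toy vocabulary of 50 tokens
next_id = sample_token(vocab_logits, temperature=0.8, top_k=10)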
➜ Encoder
Stacks of self-attention and feed-forward layers encode input sequences into fixed-dimensional contextual representations.
Processes input tokens bidirectionally to capture complete contextual relationships.
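A minimal sketch of one encoder layer (self-attention, then a position-wise feed-forward network, each with a residual connection); layer normalization and multi-head splitting are omitted for brevity, and all weights are random stand-ins for learned parameters.

import numpy as np

def attend(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def encoder_layer(x, Wq, Wk, Wv, W1, W2):
    x = x + attend(x @ Wq, x @ Wk, x @ Wv)        # bidirectional self-attention + residual
    x = x + np.maximum(0, x @ W1) @ W2            # ReLU feed-forward + residual
    return x                                      # stacking N such layers forms the encoder

d, d_ff = 64, 256
rng = np.random.default_rng(5)
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.05 for _ in range(3))
W1, W2 = rng.normal(size=(d, d_ff)) * 0.05, rng.normal(size=(d_ff, d)) * 0.05
encoded = encoder_layer(rng.normal(size=(8, d)), Wq, Wk, Wv, W1, W2)   # (8, 64)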
➜ Decoder
Uses causal self-attention to ensure autoregressive token generation.
Integrates cross-attention to incorporate encoder outputs and generate coherent outputs token-by-token.
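The decoder-specific detail is the causal mask: position i may only attend to positions 0..i, which is what makes generation autoregressive. A minimal sketch (Q = K = V = X for brevity; a full decoder layer would add learned projections, cross-attention to the encoder output, and a feed-forward block).

import numpy as np

def causal_self_attention(X):
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    mask = np.tril(np.ones((n, n), dtype=bool))           # lower-triangular: no peeking ahead
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

X = np.random.default_rng(6).normal(size=(5, 64))
out = causal_self_attention(X)    # row i depends only on tokens 0..i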
• Relevant Papers:
➜ Transformer Overview
Attention Is All You Need: https://arxiv.org/abs/1706.03762
➜ Input Embeddings
Efficient Estimation of Word Representations in Vector Space (Word2Vec): https://arxiv.org/abs/1301.3781
GloVe: Global Vectors for Word Representation: https://aclanthology.org/D14-1162/
fastText: Enriching Word Vectors with Subword Information: https://arxiv.org/abs/1607.04606
➜ Positional Encoding
Attention Is All You Need (Original Sinusoidal Positional Encoding): https://arxiv.org/abs/1706.03762
Self-Attention with Relative Position Representations: https://arxiv.org/abs/1803.02155
RoFormer: Enhanced Transformer with Rotary Position Embedding: https://arxiv.org/abs/2104.09864
➜ Self-Attention
Attention Is All You Need: https://arxiv.org/abs/1706.03762
Neural Machine Translation by Jointly Learning to Align and Translate (Additive Attention): https://arxiv.org/abs/1409.0473
➜ Cross-Attention
Attention Is All You Need (Encoder-Decoder Attention): https://arxiv.org/abs/1706.03762
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation: https://aclanthology.org/D14-1179/
➜ Skip/Residual Connections
Deep Residual Learning for Image Recognition (ResNet): https://arxiv.org/abs/1512.03385
Attention Is All You Need (Skip Connections in Transformers): https://arxiv.org/abs/1706.03762
➜ Token Sampling
Categorical Reparameterization with Gumbel-Softmax (Sampling methods): https://arxiv.org/abs/1611.01144
Decoding Strategies for Neural Machine Translation: https://aclanthology.org/W17-4717/
➜ Encoder
Attention Is All You Need (Encoder Architecture): https://arxiv.org/abs/1706.03762
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: https://arxiv.org/abs/1810.04805
➜ Decoder
Attention Is All You Need (Decoder Architecture): https://arxiv.org/abs/1706.03762
Language Models are Few-Shot Learners (Autoregressive Decoding in GPT-3): https://arxiv.org/abs/2005.14165