Once trained, your LLM must serve predictions efficiently. Raw autoregressive generation is slow because it recalculates attention matrices at every step. Optimizing Inference Store the Key ( ) and Value (
Practical examples of this phase can be found in notebook tutorials and hands-on workshops, where you can write a simple tokenizer yourself.
A pretrained LLM is a generalist. To make it useful for specific tasks, you'll need to it. As shown in detailed book chapters, this involves adapting your pretrained model to new, task-specific datasets. Fine-tuning can be divided into the following progressive stages:
A position-wise non-linear mapping that applies linear transformations and activation functions (such as SwiGLU ) to further process token representations. 2. Text Preprocessing and Tokenization build a large language model from scratch pdf
This is surprisingly tedious. The PDF will include a reference implementation that trains a tokenizer on the TinyStories dataset (a corpus of simple English stories for benchmarking small LLMs).
AdamW with decoupled weight decay to prevent overfitting.
Gathering datasets (e.g., Common Crawl, Wikipedia, books). Once trained, your LLM must serve predictions efficiently
Pre-training is the most computationally intensive phase. It relies on the objective: predicting the next token given all previous tokens. Optimization Configurations Optimizer: Use AdamW with decoupled weight decay.
12×layersthe fraction with numerator 1 and denominator the square root of 2 cross layers end-root end-fraction for residual layers to prevent exploding gradients.
Source data from diverse corpora such as Wikipedia, open-source code repositories, academic papers, and filtered web crawls (e.g., Common Crawl). A pretrained LLM is a generalist
Use bfloat16 to drastically reduce memory usage and speed up matrix multiplications while avoiding underflow issues common with float16 .
: Use Root Mean Square Normalization (RMSNorm) placed before the attention and feed-forward blocks (Pre-LN). This stabilizes gradient flow through deep networks.
Convert model weights from 16-bit floating points to lower precision formats like INT8 or INT4 using frameworks like AWQ, GPTQ, or bitsandbytes, allowing models to run on consumer hardware.