Jason's Blog

October 3, 2024 · Last Updated: October 10, 2024

Notes on Stanford Online Course: Building Large Language Models (LLMs)

documentation · 5.2 min to read

Notes on Stanford Online Course: Building LLMs :)

@Author: ZhuoQuanFan 
@Date: 03/10/2024
@Place: Dongguan Library

Preface

To start with, the course is Stanford CS229 I Machine Learning I Building Large Language Models (LLMs), available on YouTube, and you can also obtain the course slides.


Notes

I'm going to record my notes on this course following the structure the lecture gives. The lecture can be separated into five parts: architecture, training algorithm/loss, data, evaluation, and systems.

image


1. Architecture

OMITTED


2. Task & loss

① Basic theories:

Language modeling means defining a probability distribution over sequences of tokens/words:

$$P(x_1,...,x_L)$$

LMs are generative models:

$$x_{1:L} \sim P(x_1,...,x_L)$$

Autoregressive (AR) language models:

$$p(x_1,...,x_L) = p(x_1) \times p(x_2|x_1) \times p(x_3|x_1,x_2) \times ... = \prod_{i} p(x_i|x_{1:i-1})$$

=> We only need a model that can PREDICT the next token given the past context.
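
A quick numeric illustration of this chain-rule factorization (toy probabilities, not taken from the lecture):

```python
import math

# Hypothetical per-token conditionals p(x_i | x_{1:i-1}) for a 3-token
# sequence under some toy language model.
cond_probs = [0.20, 0.05, 0.10]

# Autoregressive factorization: the joint probability is the product of
# the per-token conditionals.
joint = math.prod(cond_probs)          # 0.001

# Equivalent, numerically safer form: sum of log-probabilities.
log_joint = sum(math.log(p) for p in cond_probs)
print(joint, math.exp(log_joint))      # both ≈ 0.001
```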

② Steps of pretraining

  1. tokenize
  2. forward
  3. predict probability of next token
  4. sample
  5. detokenize

image

image
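
A minimal sketch of these five steps using the Hugging Face transformers library, with GPT-2 as a stand-in model (the model choice, prompt, and sampling length are my own assumptions, not from the lecture):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The capital of France is", return_tensors="pt").input_ids  # 1. tokenize
for _ in range(10):
    logits = model(ids).logits[:, -1, :]                # 2. forward pass
    probs = torch.softmax(logits, dim=-1)               # 3. probability of the next token
    next_id = torch.multinomial(probs, num_samples=1)   # 4. sample one token
    ids = torch.cat([ids, next_id], dim=-1)
print(tokenizer.decode(ids[0]))                         # 5. detokenize
```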

a. Tokenizer

WHY: words and characters both work poorly as units; words can't handle typos or rare strings, and characters make sequences far too long.

IDEA: treat tokens as common subsequences, e.g. Byte Pair Encoding (BPE).
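
A toy sketch of the BPE idea: repeatedly merge the most frequent adjacent pair of symbols into a new token. Real tokenizers train on bytes over huge corpora; the tiny corpus and merge count here are just for illustration:

```python
from collections import Counter

words = [list("low"), list("lower"), list("lowest"), list("newest")]

def most_frequent_pair(words):
    # Count occurrences of each adjacent symbol pair across the corpus.
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    # Replace every occurrence of `pair` with a single merged symbol.
    a, b = pair
    out = []
    for w in words:
        merged, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                merged.append(a + b)
                i += 2
            else:
                merged.append(w[i])
                i += 1
        out.append(merged)
    return out

for _ in range(3):                          # learn 3 merges
    pair = most_frequent_pair(words)
    words = merge(words, pair)
    print(pair, words)
```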


3. Evaluation: Perplexity

Idea: use the validation loss:

$$PPL(x_{1:L}) = 2^{\frac{1}{L} \cdot L(x_{1:L})} = \prod_i p(x_i|x_{1:i-1})^{-\frac{1}{L}}$$

Perplexity lies between 1 and |Vocab|: 1 for a perfect model, |Vocab| for a model that guesses uniformly at random.
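
A small numeric check of the perplexity formula (toy per-token probabilities), using the natural-log form, which is equivalent to the 2^(...) form when the loss is measured in bits:

```python
import math

# Toy per-token probabilities p(x_i | x_{1:i-1}) assigned by some model.
probs = [0.25, 0.10, 0.50, 0.05]
L = len(probs)

ppl = math.exp(-sum(math.log(p) for p in probs) / L)
print(ppl)   # ≈ 6.32

# Bounds: a perfect model (p = 1 everywhere) gives PPL = 1;
# a uniform model over a vocabulary of size V gives PPL = V.
```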

MMLU: the most trusted pretraining benchmark. The benchmark dataset provides questions, each with four constrained answer choices. We evaluate an LLM by comparing its answers with the benchmark's gold answers to measure its accuracy.
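
A sketch of this kind of multiple-choice evaluation: extract one letter (A–D) per question from the model's output and compare it with the benchmark's gold answer (the answers below are made up):

```python
gold  = ["A", "C", "B", "D", "A"]            # benchmark answer keys
preds = ["A", "C", "D", "D", "B"]            # letters parsed from LLM outputs

accuracy = sum(g == p for g, p in zip(gold, preds)) / len(gold)
print(f"accuracy = {accuracy:.2f}")          # 0.60
```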

Challenges: sensitivity to prompting and inconsistencies between evaluation setups, train/test contamination, and so on.


4. Data

Because the Internet is dirty and doesn't represent what we actually want, there is a lot of work to do to obtain clean data. For example, here are the basic steps of data collection and processing (a small code sketch of some of these steps follows the list):

  1. Download all of the internet.
  2. Extract text from HTML (challenges: math, boilerplate).
  3. Filter undesirable content (e.g. NSFW, harmful content, PII).
  4. De-duplicate (URL/document/line). E.g. the headers/footers/menus in forums are always the same, so data like that needs to be cleaned up.
  5. Heuristic filtering. Remove low-quality documents (e.g. by # of words, word length, outlier tokens, dirty tokens).
  6. Model-based filtering. Predict whether a page could be referenced by Wikipedia.
  7. Data mix. Classify data into categories (code/books/entertainment) and reweight domains using scaling laws to get high downstream performance.
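
A toy sketch of steps 4 and 5 above: drop lines that repeat across many documents (e.g. forum headers/footers/menus) and then filter documents with simple heuristics; the thresholds are illustrative, not the ones used in real pipelines:

```python
from collections import Counter

def dedupe_lines(docs, max_repeats=3):
    # Drop any line that appears in the corpus `max_repeats` times or more.
    counts = Counter(line for doc in docs for line in doc.splitlines())
    return ["\n".join(l for l in doc.splitlines() if counts[l] < max_repeats)
            for doc in docs]

def passes_heuristics(doc, min_words=5, max_mean_word_len=12):
    # Remove documents that are too short or dominated by outlier "words"
    # (e.g. base64 blobs), judged by mean word length.
    words = doc.split()
    if len(words) < min_words:
        return False
    return sum(len(w) for w in words) / len(words) <= max_mean_word_len

corpus = [
    "Forum menu\nGreat post about transformers, thanks for sharing the details.\nForum menu",
    "Forum menu\nshort spam\nForum menu",
    "Forum menu\nAnother thread\nForum menu",
]
kept = [d for d in dedupe_lines(corpus) if passes_heuristics(d)]
print(kept)   # only the first document survives
```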

• Also: learning-rate annealing on high-quality data, continual pretraining with longer context.

Collecting data well is a huge part of practical LLMs (~the key):

• A lot of research still to be done!
• A lot of secrecy: how do we protect datasets from being leaked?
• Common academic datasets exist, while the largest training sets stay closed:

• Closed: LLaMA 2 (2T tokens), LLaMA 3 (15T tokens), GPT-4 (~13T tokens?)

• Synthetic data?
• Multi-modal data?
• Competitive dynamics
• Copyright liability

5. Scaling laws

With ever larger datasets and models, we want to predict a model's performance under different parameter and data budgets, so that we don't waste extra resources. The precondition for these scaling laws is that, so far, large models have never reached a regime where they overfit, so we can predict model performance from the amount of data and the number of parameters.

image

It has also changed the modern training pipeline. Suppose training lasts a month. The old pipeline tunes hyperparameters on big models, spending about a day per training run, so after 30 days we might have 30 models and pick the best one. With the new pipeline, we instead find scaling recipes by tuning hyperparameters on small models of different sizes, extrapolate to larger sizes using scaling laws step by step, and, most importantly, do so more systematically. We can then spend the remaining ~20 or more days concentrating on training one final model.
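
A minimal sketch of the extrapolation step, using a simplified power-law form loss ≈ a·N^(-alpha) (no irreducible-loss term) fitted in log-log space; the data points are made up, not results from the lecture:

```python
import numpy as np

# (parameter count, validation loss) measured on small models -- toy numbers.
n_params = np.array([1e7, 3e7, 1e8, 3e8])
losses   = np.array([3.9, 3.5, 3.1, 2.8])

# Fit log(loss) = log(a) - alpha * log(N), i.e. loss ≈ a * N^(-alpha).
slope, log_a = np.polyfit(np.log(n_params), np.log(losses), 1)
predict = lambda n: np.exp(log_a) * n ** slope

print(f"alpha ≈ {-slope:.3f}")
print(f"predicted loss at 10B params ≈ {predict(1e10):.2f}")
```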

Post-training

Language modeling by itself is not a mature tool for assisting users: pure language modeling is not what we want; the model also needs to follow user instructions and the designer's intentions. We need human feedback to build a better assistant. Stanford researchers simplified the PPO-based process and named their algorithm DPO.

image

From the figure, we can see that PPO maximizes the likelihood of human preferences by training a reward model and running reinforcement learning against it, rather than labeling all of the data, so that the model learns, say, human preferences. In contrast, DPO turns the reinforcement-learning step into a direct labeling-style objective on preference pairs, which simplifies the algorithm. However, in my view, this is a trade-off: the algorithm gets simpler at the cost of more human labeling effort.
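
A minimal sketch of the DPO objective as described in the DPO paper (Rafailov et al., 2023): push the policy's log-ratio (relative to a frozen reference model) to be higher on the human-preferred answer than on the rejected one. The inputs below are toy sequence log-probabilities, not real model outputs:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Log-ratios of policy vs. reference on chosen and rejected answers.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # -log sigmoid(beta * margin): smaller when the policy prefers the chosen answer.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy batch of summed sequence log-probabilities.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)   # ≈ 0.60
```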