Notes on the Stanford Online Course: Building LLMs :)
@Author: ZhuoQuanFan
@Date: 03/10/2024
@Place: Dongguan Library
Preface
To start with, the source material is the Stanford CS229 lecture "Machine Learning | Building Large Language Models (LLMs)" from YouTube, and the course slides can be obtained as well.
Notes
I will record my notes following the structure of the lecture, which can be separated into five parts: architecture, training algorithm/loss, data, evaluation, and systems.
1. Architecture
OMITTED
2. Task & loss
① Basic theories:
Language modeling defines a probability distribution over sequences of tokens/words:
$$P(x_1,...,x_L)$$
LMs are generative models:
$$x_{1:L} \sim P(x_1,...,x_L)$$
Autoregressive (AR) language models:
$$p(x_1,...,x_L) = p(x_1) \times p(x_2|x_1) \times p(x_3|x_1,x_2) \times \dots = \prod_{i} p(x_i|x_{1:i-1})$$
=> We only need a model that can PREDICT the next token given the past context.
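As a tiny worked example (with made-up probabilities), the joint probability of a three-token sequence factorizes as
$$p(\text{the},\text{cat},\text{sat}) = p(\text{the}) \cdot p(\text{cat}\mid\text{the}) \cdot p(\text{sat}\mid\text{the},\text{cat}) = 0.2 \times 0.1 \times 0.3 = 0.006$$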
② Steps of pretraining
- tokenize
- forward
- predict probability of next token
- sample
- detokenize
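A rough sketch of these five steps at inference time (my own illustration, assuming the Hugging Face `transformers` library and the small `gpt2` checkpoint purely as placeholders):

```python
# Minimal sketch of the five steps; model and checkpoint are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "Large language models are"
input_ids = tokenizer(text, return_tensors="pt").input_ids      # 1. tokenize

with torch.no_grad():
    for _ in range(20):                                          # generate 20 tokens
        logits = model(input_ids).logits                         # 2. forward
        probs = torch.softmax(logits[0, -1], dim=-1)             # 3. next-token distribution
        next_id = torch.multinomial(probs, num_samples=1)        # 4. sample
        input_ids = torch.cat([input_ids, next_id[None]], dim=1)

print(tokenizer.decode(input_ids[0]))                            # 5. detokenize
```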
a. Tokenizer
WHY:
- More general than words: we avoid having to handle words whose meanings change across contexts. A word may carry several meanings, but a token only has its own unique token ID; the meanings only emerge from the connections the model learns between tokens.
- Shorter sequences than with characters
IDEA: tokens as common subsequences, e.g., Byte Pair Encoding (BPE); see the sketch below.
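A toy sketch of the BPE idea (my own illustration; real tokenizers such as GPT-2's operate on bytes and train on far larger corpora):

```python
# Toy Byte Pair Encoding: repeatedly merge the most frequent adjacent pair
# of symbols. The tiny word-frequency "corpus" below is invented.
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, return the most frequent."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# word -> frequency, each word initially split into characters
words = {tuple("lower"): 2, tuple("lowest"): 1, tuple("newer"): 3}
for _ in range(3):                       # three merge rounds
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair, "->", list(words))
```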
3. Evaluation: Perplexity
Idea: exponentiate the average validation loss per token:
$$ \mathrm{PPL}(x_{1:L}) = 2^{\frac{1}{L} \mathcal{L}(x_{1:L})} = \prod_{i} p(x_i \mid x_{1:i-1})^{-\frac{1}{L}} $$
where $\mathcal{L}(x_{1:L})$ is the cross-entropy loss (log base 2) of the sequence.
Perplexity: Between 1 and |Vocab|:
- avg per token (~independent of length)
- Exponentiate => units independent of log base;
- Intuition: the number of tokens that you are hesitating between
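A small sketch of the formula above, using made-up per-token probabilities; the averaged and the product forms agree, and the choice of log base cancels out:

```python
# Perplexity from per-token conditional probabilities p(x_i | x_{1:i-1}).
# The probabilities below are made up purely for illustration.
import math

token_probs = [0.25, 0.1, 0.5, 0.05]           # p(x_i | x_{1:i-1}) for each position
L = len(token_probs)

avg_nll = -sum(math.log(p) for p in token_probs) / L   # average loss per token
ppl = math.exp(avg_nll)                                # exponentiate in the same base

ppl_product = math.prod(p ** (-1.0 / L) for p in token_probs)  # product form
print(ppl, ppl_product)                                # both ≈ 6.32
```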
- Benchmarks: Holistic Evaluation of Language Models (HELM), the Hugging Face Open LLM Leaderboard, and so on.
MMLU: the most trusted pretraining benchmark. The dataset provides questions with constrained answers (four choices each). We evaluate by comparing the answers given by the LLM against the benchmark's reference answers and reporting the LLM's accuracy (a scoring sketch follows below).
Challenges: sensitivity to prompting and inconsistencies, train/test contamination, and so on.
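A minimal sketch of MMLU-style scoring, assuming we already have the model's letter choices; the questions and predictions below are invented placeholders:

```python
# MMLU-style scoring: each question has four choices (A-D); accuracy is the
# fraction of questions where the model's letter matches the reference answer.
examples = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": "B"},
    {"question": "Capital of France?", "choices": ["Paris", "Rome", "Oslo", "Bern"], "answer": "A"},
]

def score(predictions, examples):
    """predictions: list of letters ('A'..'D') produced by the LLM."""
    correct = sum(p == ex["answer"] for p, ex in zip(predictions, examples))
    return correct / len(examples)

print(score(["B", "A"], examples))   # 1.0
```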
4. Data
The Internet is dirty and does not represent what we actually want the model to learn, so a lot of work is needed to obtain clean data. For example, here are the basic steps of data collection and processing:
- Download all of the Internet.
- Text extraction from HTML (challenges: math, boilerplate).
- Filter undesirable content (e.g. NSFW, harmful content, PII).
- De-duplicate (URL/document/line). E.g., the headers/footers/menus in forums are always the same, so we need to clean up data like that.
- Heuristic filtering. Remove low-quality documents (e.g., based on # words, word length, outlier tokens, dirty tokens); see the sketch after this list.
- Model-based filtering. Predict whether a page could be referenced by Wikipedia.
- Data mix. Classify data categories (code/books/entertainment). Reweight domains using scaling laws to get high downstream performance.
- Also: learning-rate annealing on high-quality data, continual pretraining with longer context.
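A sketch of the heuristic filtering step mentioned above; the thresholds are arbitrary placeholders, not values from the lecture:

```python
# Heuristic document filtering: drop documents that are too short/long,
# have outlier word lengths, or are mostly non-alphabetic ("dirty") text.
def keep_document(text: str) -> bool:
    words = text.split()
    if not (50 <= len(words) <= 100_000):          # too short or too long
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    if not (3 <= mean_len <= 10):                  # outlier word lengths
        return False
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    if alpha_ratio < 0.6:                          # mostly symbols / dirty tokens
        return False
    return True

docs = ["too short", "the quick brown fox jumps over the lazy dog " * 10]
print([keep_document(d) for d in docs])            # [False, True]
```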
- Collecting good data is a huge part of practical LLM work (~the key). There is a lot of research still to be done, and a lot of secrecy: how do labs keep their data from being leaked?
- Common academic datasets exist; the data behind closed models is only roughly known: LLaMA 2 (2T tokens), LLaMA 3 (15T tokens), GPT-4 (~13T tokens?).
- Other open questions: synthetic data? multi-modal data? competitive dynamics, copyright liability.
5. Scaling laws
With ever larger datasets and models, we want to predict model performance under different parameter and data budgets, so that we do not waste extra resources. The precondition of these scaling laws is that, so far, large models have never reached the overfitting regime, so we can predict model performance from the amount of data and the number of parameters.
Scaling laws have also changed the modern training pipeline. Suppose the compute budget is one month. The old pipeline tuned hyperparameters directly on big models, spending about a day per training run; after 30 days we would hopefully have 30 models and pick the best one. The new pipeline instead finds scaling recipes: tune hyperparameters on small models of different sizes, extrapolate to larger sizes using scaling laws (step by step, and more importantly, in a more principled way), and then spend the remaining ~20 or more days training one final large model. A sketch of such an extrapolation follows.
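A simplified sketch of fitting a power law loss ≈ a·N^(-b) from small models and extrapolating to a larger size; the (size, loss) points are invented, and real scaling-law fits use more elaborate functional forms:

```python
# Fit a power-law scaling curve on small models and extrapolate to 1B params.
import numpy as np

sizes  = np.array([1e7, 3e7, 1e8, 3e8])        # parameter counts of small models
losses = np.array([4.2, 3.8, 3.4, 3.05])       # measured validation losses (made up)

# a power law is linear in log-log space: log(loss) = log(a) + slope * log(N)
slope, log_a = np.polyfit(np.log(sizes), np.log(losses), 1)
a = np.exp(log_a)

predict = lambda n: a * n ** slope             # slope is negative: loss falls with size
print(predict(1e9))                            # extrapolated loss for a 1B-parameter model
```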
Post-training
Plain language modeling is not a mature tool for assisting users: next-token prediction by itself is not what we want; the model also needs to follow user instructions and the designer's intent. We need human feedback to build a better assistant. Stanford researchers simplified the PPO-based process with an algorithm they named DPO (Direct Preference Optimization).
From the figure in the slides, PPO maximizes the likelihood of human preferences by training a reward model and running reinforcement learning against it, rather than labeling all the data, so that the model learns to, say, match human preferences. In contrast, DPO turns the reinforcement-learning step into a direct labeling/classification-style objective, which simplifies the algorithm. In my view, though, this is a trade-off: the algorithm becomes simpler at the cost of more human labeling effort.
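For reference, a sketch of the DPO objective for a single preference pair, assuming we already have the summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model (variable names are my own placeholders):

```python
# DPO loss for one preference pair:
# -log sigmoid(beta * [(log pi(y_w) - log ref(y_w)) - (log pi(y_l) - log ref(y_l))])
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio))

# toy numbers: the policy already slightly prefers the chosen response
loss = dpo_loss(torch.tensor(-12.0), torch.tensor(-15.0),
                torch.tensor(-13.0), torch.tensor(-14.5))
print(loss)   # minimizing this pushes the policy further toward the chosen answer
```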