Model optimizations to improve application performance
- Distillation: uses a larger model (the teacher) to train a smaller model (the student). The teacher's weights are frozen; both models generate completions for the same prompts, and the difference between the two output distributions is the distillation loss. The student adjusts its weights (the final prediction layer, or hidden layers as well) to minimize this loss. You then use the smaller student model for inference to lower your storage and compute budget. (See the loss sketch after this list.)
- Quantization: post-training quantization transforms a model's weights to a lower-precision representation, such as 16-bit floating point or 8-bit integer. This reduces the memory footprint of the model. (See the int8 sketch after this list.)
- Pruning: removes redundant model parameters that contribute little to the model's performance. (See the pruning sketch after this list.)
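A minimal sketch of a distillation loss in PyTorch, assuming both models expose raw logits for the same inputs. The temperature `T` and weighting `alpha` are illustrative hyperparameters, not values from these notes.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: the frozen teacher's probabilities, softened by temperature T.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # Distillation loss: divergence between teacher and student distributions
    # (scaled by T^2, the standard correction from Hinton et al., 2015).
    kd = F.kl_div(log_student, soft_teacher, reduction="batchmean") * (T * T)
    # Student loss: ordinary cross-entropy against the hard labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```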
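A toy sketch of symmetric 8-bit post-training quantization of a single weight tensor. Real toolchains (e.g., PyTorch's quantization utilities) also handle activations and calibration; this only shows the core weight mapping.

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Map the largest-magnitude weight to +/-127: w ~= scale * q.
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

w = torch.randn(256, 256)            # fp32: 4 bytes per weight
q, scale = quantize_int8(w)          # int8: 1 byte per weight (4x smaller)
w_hat = q.to(torch.float32) * scale  # dequantize for comparison
print("max reconstruction error:", (w - w_hat).abs().max().item())
```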
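A small sketch of magnitude-based pruning with PyTorch's built-in utilities; the layer shape and the 30% sparsity target are illustrative.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(128, 64)
# Zero out the 30% of weights with the smallest L1 magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.3)
print("sparsity:", (layer.weight == 0).float().mean().item())  # ~0.30
# Bake the zeros into the weight tensor and drop the pruning mask.
prune.remove(layer, "weight")
```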
Cheat Sheet
RAG (Retrieval-Augmented Generation): retrieve documents relevant to the query from an external data source and add them to the prompt, so the model can ground its answer in that context.
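A minimal RAG sketch with a toy in-memory document store and a naive keyword-overlap retriever; a real system would use embeddings and a vector database, and `rag_prompt` would be sent to the LLM.

```python
# Toy "document store"; a real system would index these in a vector database.
DOCS = [
    "The Eiffel Tower is 330 metres tall.",
    "Python was created by Guido van Rossum.",
    "RAG augments prompts with retrieved documents.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Naive keyword-overlap scoring stands in for embedding similarity.
    scores = [(sum(w in d.lower() for w in query.lower().split()), d) for d in DOCS]
    return [d for score, d in sorted(scores, reverse=True)[:k] if score > 0]

def rag_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(rag_prompt("How tall is the Eiffel Tower?"))
```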
Chain-of-thought prompting: include one or more worked examples that spell out intermediate reasoning steps, so the model reasons step by step before giving its answer.
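An illustrative one-shot chain-of-thought prompt (the worked example is the classic one from Wei et al., 2022); the model is expected to continue with its own reasoning steps.

```python
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many apples do they have?
A:"""
# Expected continuation: step-by-step reasoning such as
# "23 - 20 = 3. 3 + 6 = 9. The answer is 9."
```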
Program-Aided Language Model (PAL)
- LLM + code interpreter: the LLM writes a program (e.g., Python) that performs the calculation, and the interpreter executes it; this works around LLMs being unreliable at arithmetic.
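A bare-bones PAL-style sketch. `llm` is a hypothetical callable that completes the prompt; `exec` is used for brevity and would need sandboxing in any real deployment.

```python
PAL_PROMPT = """Answer by writing Python that stores the result in `answer`.

Q: A bakery sells 124 loaves a day at $2.50 each. What is its revenue over 30 days?
# Python:
loaves_per_day = 124
price = 2.50
answer = loaves_per_day * price * 30

Q: {question}
# Python:
"""

def pal_answer(llm, question: str):
    code = llm(PAL_PROMPT.format(question=question))  # the model writes the program
    scope: dict = {}
    exec(code, scope)  # the interpreter, not the LLM, does the arithmetic
    return scope["answer"]
```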
Orchestrator: manages the flow of information between the LLM, external applications, and external data sources, e.g., LangChain.
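A bare-bones sketch of the orchestration loop, independent of any framework (LangChain is one implementation). The `TOOL:name:arg` convention and the tool registry are hypothetical.

```python
def orchestrate(llm, tools: dict, user_input: str, max_turns: int = 5) -> str:
    # Route the model's tool requests to external apps/databases and feed
    # the results back until the model produces a final answer.
    prompt = user_input
    for _ in range(max_turns):
        reply = llm(prompt)
        if reply.startswith("TOOL:"):    # e.g. "TOOL:database:SELECT ..."
            _, name, arg = reply.split(":", 2)
            result = tools[name](arg)    # call the external tool
            prompt += f"\n[{name} returned: {result}]"
        else:
            return reply                 # final answer for the user
    return reply
```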
ReAct: a prompting framework that synergizes reasoning and acting in LLMs; each step of the prompt has three parts:
- Thought: reason about the current situation
- Action: an external task the model can carry out, chosen from an allowed set of actions: search, lookup, finish
- Observation: the result of the action, fed back to the model to inform its next thought (the prompt typically includes a few worked examples of this loop)
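A hypothetical ReAct loop over the allowed actions, with stub tools standing in for real search/lookup; the `Action: name[arg]` format mirrors the paper's, but the parsing and stubs here are illustrative.

```python
import re

def react_loop(llm, question: str, max_steps: int = 5) -> str:
    # Stub tools stand in for a real search engine / document lookup.
    tools = {
        "search": lambda q: f"(top search result for {q!r})",
        "lookup": lambda q: f"(matching snippet for {q!r})",
    }
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)  # model emits "Thought: ...\nAction: name[arg]"
        transcript += step + "\n"
        m = re.search(r"Action: (\w+)\[(.*?)\]", step)
        if not m:
            continue
        name, arg = m.group(1), m.group(2)
        if name == "finish":
            return arg                  # finish[answer] ends the episode
        observation = tools[name](arg)  # carry out search / lookup
        transcript += f"Observation: {observation}\n"
    return "no answer within the step budget"
```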