LLMs for working data scientists and machine learning practitioners

Large Language Models (LLMs) are changing the landscape of data science. Whether this represents a paradigm shift in the way information is produced, aggregated and consumed, or is just another passing fad, the jury is still out. In any case, AI/LLMs as a tool, and an increasingly indispensable one, are becoming the accepted norm, especially among programmers and data scientists.

This collection of notes is basically a brain dump for me - it captures what I have learned while trying to make sense of this exploding field and keep myself sane.

History
Closed Foundation Models
Open Foundation Models
Datasets
Frameworks and Ecosystems
FAQs
My Cookbook

History

These are the important milestones leading up to the current state of LLMs.

Transformer:
- The OG paper: Attention Is All You Need
- A detailed anatomy of the transformer: Transformer from scratch
- A Python implementation of the original transformer: The Annotated Transformer
The Zoo of LLMs
- An overview and history of LLMs: A nice review paper of LLMs.
- List of transformers: A GitHub repo for all the transformer-based models, not just LLMs.
- A catalog of transformer models: This started as a blog post, later they converted it into a nice paper.
GPT-style decoder-only models
- GPT 1.0: Improving Language Understanding by Generative Pre-Training
- GPT 2.0: Language Models are Unsupervised Multitask Learners
- GPT 3.0: Language Models are Few-Shot Learners
- GPT 4: GPT-4 Technical Report
- Karpathy's nanoGPT implementation
BERT-style encoder-only models
- BERT: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
T5-style encoder-decoder models
- T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- T0: Multitask Prompted Training Enables Zero-Shot Task Generalization
"Making it bigger" - scaling and emergent properties
- Scaling Law: Scaling Laws for Neural Language Models
- Switch Transformers: Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- Chinchilla: An empirical analysis of compute-optimal large language model training
- Emergent Abilities: Emergent Abilities of Large Language Models
Instruction finetuning and alignment
- InstructGPT: Training language models to follow instructions with human feedback
- Flan-T5/PaLM: Scaling Instruction-Finetuned Language Models
- FLAN: Finetuned Language Models are Zero-Shot Learners
The birth of ChatGPT (Nov 2022): the Cambrian explosion started here
The race to catch up with ChatGPT in the open source community
- LLaMA: LLaMA: Open and Efficient Foundation Language Models
- Llama 2: Llama 2: Open Foundation and Fine-Tuned Chat Models

Prerequisites

The field is moving so fast that you absolutely need a hacker's mentality.

Python/Numpy/Pandas: basic skills needed to code up something quickly. Fortunately, with tools like Copilot/ChatGPT/Replit, it's quite easy to get up to speed in this department, especially if you are a programmer to begin with. For instance, I came from a C++/R/Haskell background and made the switch from R to Python quite smoothly.
PyTorch: Most transformer models are in torch. Invest some time to get comfortable with both the tensor library and the building blocks for neural networks. Read a lot of library code to build a sound foundation.
Git and GitHub: You will clone tons of repos to experiment, so invest some time in building your own commands to get things done quickly.
Huggingface: this is a must now. Not just the transformers library itself, but also peft, accelerate, etc. You will spend tons of time with HF.
Linux, bash, and command-line tools: Get a Mac and get comfortable with command-line tools. Trust me, it is worth your time.
GPUs: You can get away with CPUs for inference (GGML is really coming up fast), but you will have to use GPUs for training models. You can build your own box with an RTX 3090 (or a 4090 if you have a few extra bucks), or rent online from one of the smaller providers such as vast.ai or runpod, or from Azure/AWS if you are not paying the bills out of your own pocket. Stick with A100s if you have the budget - everything just works with A100s.
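To make the GPU sizing question a bit more concrete, here is a back-of-the-envelope sketch of how much memory the weights alone take at different precisions. The function name and the dtype table are my own illustrative choices, not from any library, and the numbers ignore activations, the KV cache, gradients, and optimizer state, so treat them as a lower bound.

```python
# Rough memory needed just to hold model weights.
# Assumption: bytes per parameter for each precision; real runtimes add overhead.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params_billion: float, dtype: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes) for a model of the given size."""
    return n_params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for size in (7, 13, 30, 65):  # LLaMA model sizes mentioned in these notes
        row = ", ".join(
            f"{dtype}: {weight_memory_gb(size, dtype):.1f} GB" for dtype in BYTES_PER_PARAM
        )
        print(f"{size}B -> {row}")
```

By this estimate a 7B model in fp16 needs about 14 GB for weights, which fits on a 24 GB RTX 3090, while a 13B model in fp16 (about 26 GB) does not fit without quantization. Training needs considerably more than these figures.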

Closed Foundation Models

OpenAI: Build stuff with OpenAI to get your feet wet, and read their cookbooks, they are really good.
Google: Have not really tried the Bard model yet.
Anthropic: On my list, never really played with Claude/Claude 2.

Open Foundation Models

LLaMA and finetuned variants

FB's release of LLaMA set off a wave of fine-tuned variants of the LLaMA 7/13/30/65B models, with some fun wordplay on names from the llama family. See the Huggingface Leaderboard for Open LLMs for some of the notable models.

Now with the release of Llama 2, the real competition and fun starts!

Alpaca: This is the first model coming out of Stanford. It was trained on instructions generated with OpenAI's text-davinci-003.
Vicuna: LMSYS is behind this project. The model was finetuned with ShareGPT data, a crowd-sourced dataset of ChatGPT conversations. It also comes with a fast inference engine, built on a GPU-optimized engine called vLLM. They also have other open source models such as FastChat-T5.
WizardLM: This is from Microsoft Research. It is based on Evol-Instruct, a method that evolves instructions in a tree-like fashion.
WizardVicuna: a combo of Wizard and Vicuna.
Open Assistant: dataset, RLHF fine tuning, etc.
QLoRA: This is a big deal for people with consumer-grade GPUs like the RTX series: you can fine-tune a sizeable model on a single GPU. The accompanying Guanaco models were trained with the Open Assistant dataset.
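A quick bit of arithmetic shows why LoRA-style finetuning (which QLoRA builds on, with the frozen base weights kept in 4-bit) is so much lighter than full finetuning. For a frozen weight matrix of shape d_out x d_in, LoRA trains two small matrices of shapes d_out x r and r x d_in instead. The function names and the rank of 8 below are my own assumptions for illustration, not values taken from the QLoRA paper.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for a LoRA adapter: B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    hidden = 4096  # hidden size of LLaMA 7B
    rank = 8       # hypothetical adapter rank, chosen only for illustration
    full = full_finetune_params(hidden, hidden)
    lora = lora_trainable_params(hidden, hidden, rank)
    print(f"full: {full:,} params, LoRA: {lora:,} params ({lora / full:.2%} of full)")
```

For one 4096 x 4096 projection this comes to 65,536 trainable parameters instead of about 16.8 million, which is why the optimizer state and gradients fit on a single consumer GPU.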

LLaMA alternatives (Updated on 7/20, now they are less attractive with the release of Llama 2 .... )

Datasets

Alpaca and variants:

Vicuna: Assistant clone of OAI ChatGPT (lmsys.org/fastchat/vllm/vicuna)

Vicuna_unfiltered

Wizard: DFS and BFS evolution of instructions

Open Assistant

Open Assistant Dataset

Guanaco

Guanaco Dataset

Orca: Why not give system prompts?

GPT4all:

ShareGPT:

ShareGPT52K (it's 90k now)

Dolly:

databricks-dolly-15k

Others:

Frameworks and Ecosystems

Huggingface: transformers/peft/accelerate/bitsandbytes
PyTorch Lightning: A more general high-level framework on top of PyTorch. Think of it as the Keras for PyTorch. It also has two repos: one is an open source implementation of GPT, and the other is a finetuning framework for open LLMs.
GGML/Llama.cpp: A lot of attention here; this project will probably pave the road for LLMs without GPUs.
GPT4all: Started as a thin wrapper of GGML, but it has diverged since.
GPTQ/AutoGPTQ: An alternative to INT8 quantization (bitsandbytes).
FlexGen: New kid on the block, have not really looked into it.
LangChain: It's hot now, a very convenient package for interfacing with LLMs and vector stores. Here are the key concepts and abstractions in LangChain:
- A chain is just an LLM + a prompt template.
- An agent is made of an LLM chain and tools.
- An agent is usually wrapped in an Agent Executor, which itself is a type of chain.
- The key ingredient of an agent is the ability to plan, which literally is a method defined for each type of agent.
My take on LC is that if you want to spin up something quickly, it's a great tool to get you started. But once you've moved beyond building toy projects, you will probably need to build your own pipeline, or even your own abstractions. Treat LC as a huge cookbook and pick and choose whatever you need, especially the prompts. But keep in mind the field is moving lightning fast, and a lot of prompts might not be necessary now, especially with strong models like GPT-4.
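The LangChain concepts listed above can be sketched in a few lines of plain Python. To be clear, this is a toy model of the ideas and not LangChain's actual API: every class, the prompt template, the "action: argument" reply format, and the fake LLM are hypothetical names I made up so the example runs without any model or network access.

```python
from typing import Callable, Dict, Tuple

# Hypothetical stand-ins: an "LLM" and a "tool" are just functions from text to text.
LLM = Callable[[str], str]
Tool = Callable[[str], str]

TEMPLATE = "Tools: {tools}\nQuestion: {question}\nSteps so far:\n{scratchpad}"


class Chain:
    """A chain is an LLM plus a prompt template."""

    def __init__(self, llm: LLM, template: str) -> None:
        self.llm = llm
        self.template = template

    def run(self, **variables: str) -> str:
        return self.llm(self.template.format(**variables))


class Agent:
    """An agent is an LLM chain plus tools; plan() picks the next action."""

    def __init__(self, chain: Chain, tools: Dict[str, Tool]) -> None:
        self.chain = chain
        self.tools = tools

    def plan(self, question: str, scratchpad: str) -> Tuple[str, str]:
        reply = self.chain.run(
            tools=", ".join(self.tools), question=question, scratchpad=scratchpad
        )
        action, _, argument = reply.partition(":")  # assumed "action: argument" format
        return action.strip(), argument.strip()


class AgentExecutor:
    """Runs the plan -> act -> observe loop; exposes run() like a chain does."""

    def __init__(self, agent: Agent, max_steps: int = 5) -> None:
        self.agent = agent
        self.max_steps = max_steps

    def run(self, question: str) -> str:
        scratchpad = ""
        for _ in range(self.max_steps):
            action, argument = self.agent.plan(question, scratchpad)
            if action == "FINISH":
                return argument
            observation = self.agent.tools[action](argument)
            scratchpad += f"{action}({argument}) -> {observation}\n"
        return "stopped: step limit reached"


def calculator(expression: str) -> str:
    """Toy tool that only understands 'a*b' with integer operands."""
    left, right = expression.split("*")
    return str(int(left) * int(right))


def fake_llm(prompt: str) -> str:
    """Scripted stand-in for a model: call the calculator once, then finish."""
    if "->" not in prompt:
        return "calculator: 6*7"
    last_observation = prompt.strip().rsplit("-> ", 1)[1]
    return f"FINISH: {last_observation}"


def demo() -> str:
    agent = Agent(Chain(fake_llm, TEMPLATE), {"calculator": calculator})
    return AgentExecutor(agent).run("What is 6*7?")


if __name__ == "__main__":
    print(demo())
```

Swapping fake_llm for a function that calls a real model is the only change needed to make the loop do something useful, which is roughly what I mean by building your own pipeline once you outgrow the framework.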
LlamaIndex: It has some overlap with LangChain - it's a data framework. Same as LC, an awesome tool to get you started quickly.
- A document is split into chunks, or nodes.
- Chunks are wrapped into indices - those are the building blocks.
- Index + retrieval mode => retrievers
- retrievers + synthesizing methods + post processing => query engines
- query engines + memory => chat engines
- query engines + json/pydantic descriptions => tools
- tools + LLM => agents
I like the LlamaIndex codebase better - cleaner, better documented.
DSP: Coming out of the Stanford NLP research group. A nice programming model for working with LLMs: demonstrate, search and predict. Probably not as mature as LangChain and LlamaIndex, but worth checking out.
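The first few steps of the LlamaIndex pipeline above (document -> nodes -> index -> retriever -> query engine) can be sketched without the library. This is a toy illustration of the layering, not LlamaIndex's real API: the class names, the sentence-based splitter, the keyword-overlap scoring, and the fake synthesizer are all assumptions of mine, and chat engines, tools, and agents are left out.

```python
import re
from typing import Callable, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercased word set, used for naive keyword matching."""
    return set(re.findall(r"\w+", text.lower()))


def split_into_nodes(document: str) -> List[str]:
    """Split a document into chunks (nodes); here simply one node per sentence."""
    return [sentence.strip() for sentence in document.split(".") if sentence.strip()]


class Index:
    """An index wraps the nodes; a real one would also store embeddings."""

    def __init__(self, nodes: List[str]) -> None:
        self.nodes = list(nodes)


class Retriever:
    """Index + retrieval mode: rank nodes by keyword overlap with the query."""

    def __init__(self, index: Index, top_k: int = 1) -> None:
        self.index = index
        self.top_k = top_k

    def retrieve(self, query: str) -> List[str]:
        query_tokens = tokenize(query)
        ranked = sorted(
            self.index.nodes,
            key=lambda node: len(query_tokens & tokenize(node)),
            reverse=True,
        )
        return ranked[: self.top_k]


class QueryEngine:
    """Retriever + synthesizing method."""

    def __init__(self, retriever: Retriever, synthesize: Callable[[str, List[str]], str]) -> None:
        self.retriever = retriever
        self.synthesize = synthesize

    def query(self, question: str) -> str:
        return self.synthesize(question, self.retriever.retrieve(question))


def fake_synthesize(question: str, context: List[str]) -> str:
    """Stand-in for an LLM call: just shows what would be sent to the model."""
    return f"Q: {question} | context: {' / '.join(context)}"


def demo() -> str:
    document = (
        "Alpaca came out of Stanford. "
        "Vicuna was finetuned with ShareGPT data. "
        "WizardLM is based on Evol-Instruct."
    )
    engine = QueryEngine(Retriever(Index(split_into_nodes(document))), fake_synthesize)
    return engine.query("What data was Vicuna finetuned with?")


if __name__ == "__main__":
    print(demo())
```

Each layer only knows about the one below it, which is the part of the design I find easy to follow in the real codebase as well.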
Text generation web UI: Nice playground for experimenting with various LLMs.
LocalAI: a drop-in replacement for OpenAI.
Axolotl: Nice repo for finetuning.
FastChat: Train/Eval/Deployment pipeline.
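Since INT8 quantization comes up in the list above (bitsandbytes, with GPTQ as an alternative), here is a minimal sketch of the basic idea using symmetric absmax quantization. This is the simplest textbook scheme, written in plain Python for illustration; it is not the algorithm bitsandbytes or GPTQ actually implement, both of which add further refinements, and the function names are my own.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # avoid dividing by zero
    return [round(w / scale) for w in weights], scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers and the stored scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.0, 1.0]
    quantized, scale = quantize_int8(weights)
    restored = dequantize_int8(quantized, scale)
    print("quantized:", quantized)
    print("max round-trip error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

Each weight now takes one byte plus a shared scale per group, at the cost of a rounding error of at most half a quantization step.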

Stay Current

Twitter: Follow people with real signal. I will post my Twitter list of AI folks later.
If you have extra time, look at Reddit boards like LocalLLaMA.
If you really have time, go to the Discord servers of things you are into.

From Karpathy:

FAQs

Why should I care about LLMs?
What are the common use cases for LLMs?
I am new to the field, where should I get started?
Where do those weird names like Llama/alpaca/vicuna/guanaco come from?

stewarthu / llm-notes
