llm-analysis

Latency and Memory Analysis of Transformer Models for Training and Inference

Overview

Many formulas and equations for calculating the training or inference latency and memory of Large Language Models (LLMs) or Transformers are floating around in papers, blogs, etc. Rather than doing the math on paper or typing it into Excel sheets, let's automate the boring stuff with llm-analysis ⚙️!

Given the specified model, GPU, data type, and parallelism configurations, llm-analysis estimates the latency and memory usage of LLMs for training or inference. With llm-analysis, one can easily try out different training/inference setups theoretically, and better understand the system performance for different scenarios.

llm-analysis helps answer questions such as:

  • what batch size, data type, and parallelism scheme to use to get a feasible (not getting OOM) and optimal (maximizing throughput under a latency constraint) setup for training or inference
  • how long training or inference takes with a given setup, and what it costs (GPU-hours)
  • how latency and memory change with a different model, GPU type, number of GPUs, data type for weights and activations, or parallelism configuration (suggesting the performance benefit of modeling changes, hardware improvements, quantization, parallelism, etc.)

Examples (updating)

Check the example use cases. With llm-analysis, you can do such analysis in minutes 🚀!

Quick Start

  • To install llm-analysis from PyPI:

    pip install llm-analysis
  • To install the latest development build:

    pip install --upgrade git+https://github.com/cli99/llm-analysis.git@main
  • To install from source, clone the repo and run pip install . or poetry install (install Poetry with pip install poetry).

Using the LLMAnalysis class

To integrate llm-analysis in your code, use the LLMAnalysis class. Refer to the LLMAnalysis documentation for details.

LLMAnalysis is constructed with flops and memory efficiency numbers and the following configuration classes:

  • ModelConfig covers model information, i.e. max sequence length, number of transformer layers, number of attention heads, hidden dimension, vocabulary size
  • GPUConfig covers GPU compute and memory specifications
  • DtypeConfig covers the number of bits used for the model weight, activation, and embedding
  • ParallelismConfig covers Tensor Parallelism (tp), Pipeline Parallelism (pp), Sequence Parallelism (sp), and Data Parallelism (dp).

Then LLMAnalysis can be queried with different arguments through the training and inference methods.
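To make the role of these configuration objects more concrete, here is a minimal sketch of the kind of estimate they feed into. The class and function names (ModelSketch, ParallelismSketch, approx_num_params, weight_memory_per_gpu_gib) are hypothetical and are not the library's API; the parameter count uses the common 12 * hidden_dim^2 per-layer approximation for GPT-style decoders, and weights are assumed to be sharded evenly across tp and pp.

```python
from dataclasses import dataclass


@dataclass
class ModelSketch:
    """Hypothetical stand-in for the model fields listed above."""
    num_layers: int
    n_head: int
    hidden_dim: int
    vocab_size: int
    max_seq_len: int


@dataclass
class ParallelismSketch:
    """Hypothetical stand-in holding only the tp and pp degrees."""
    tp_size: int = 1
    pp_size: int = 1


def approx_num_params(model: ModelSketch) -> int:
    # Approximation: each layer holds about 12 * hidden_dim^2 weights
    # (attention projections plus a 4x-wide MLP); add the token embedding.
    per_layer = 12 * model.hidden_dim ** 2
    return model.num_layers * per_layer + model.vocab_size * model.hidden_dim


def weight_memory_per_gpu_gib(model: ModelSketch, weight_bits: int,
                              parallelism: ParallelismSketch) -> float:
    # Assumption: weights are split evenly over tp_size * pp_size GPUs.
    total_bytes = approx_num_params(model) * weight_bits / 8
    shards = parallelism.tp_size * parallelism.pp_size
    return total_bytes / shards / 2**30


# GPT-3-sized shapes, used only as an illustration.
model = ModelSketch(num_layers=96, n_head=96, hidden_dim=12288,
                    vocab_size=50257, max_seq_len=2048)
parallelism = ParallelismSketch(tp_size=8, pp_size=8)

print(f"{approx_num_params(model) / 1e9:.1f}B parameters")
print(f"{weight_memory_per_gpu_gib(model, 16, parallelism):.2f} GiB of weights per GPU")
```

With these shapes the sketch gives roughly 175B parameters and about 5 GiB of 16-bit weights per GPU. Activations, optimizer states, and the KV cache are not counted here.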

Using the Entry Point Functions for Command Line

llm-analysis provides two entry functions, train and infer, for ease of use through the command line interface. Run

python -m llm_analysis.analysis train --help

or

python -m llm_analysis.analysis infer --help

to check the options or read the linked doc. Refer to the examples to see how they are used.

train and infer use the pre-defined name-to-configuration mappings (model_configs, gpu_configs, dtype_configs) and other user-input arguments to construct the LLMAnalysis and do the query.

The pre-defined mappings are populated at runtime from the model, GPU, and data type configuration JSON files under model_configs, gpu_configs, and dtype_configs. To add a new model, GPU, or data type to the mapping for query, just add a JSON description file to the corresponding folder.
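The folder-to-mapping idea can be sketched as follows. This is an illustration only: load_configs is a hypothetical helper, and the JSON field names in the example (name, mem_per_GPU_in_GB, peak_fp16_TFLOPS) are assumptions, so check the existing files in the repo for the real schema before adding your own.

```python
import json
import tempfile
from pathlib import Path


def load_configs(folder: Path) -> dict:
    """Map each JSON file's "name" field to its parsed contents."""
    mapping = {}
    for path in sorted(folder.glob("*.json")):
        with path.open() as f:
            config = json.load(f)
        mapping[config["name"]] = config
    return mapping


# Hypothetical GPU description; the field names are assumptions.
example = {"name": "my-gpu-80gb", "mem_per_GPU_in_GB": 80, "peak_fp16_TFLOPS": 312}

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Dropping a new JSON file into the folder is all it takes to extend the mapping.
    (folder / "my-gpu-80gb.json").write_text(json.dumps(example))
    gpu_configs = load_configs(folder)

print(gpu_configs["my-gpu-80gb"]["mem_per_GPU_in_GB"])  # prints 80
```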

llm-analysis also supports retrieving ModelConfig from Hugging Face with the model name (thus no need to add the model configuration to the model_configs folder). E.g. use EleutherAI/gpt-neox-20b as model_name when calling the train or infer entry functions.

A list of handy commands is provided to query against the pre-defined mappings as well as Hugging Face, or to dump configurations. Run python -m llm_analysis.config --help for details.

Some examples:

python -m llm_analysis.config get_model_config_by_name EleutherAI/gpt-neox-20b

gets the ModelConfig from the populated mapping by name; if it is not found, llm-analysis tries to get it from Hugging Face.

Note that retrieving LLaMA models requires at least transformers-4.28.1. Either update to a later version of the transformers library, or use the predefined ModelConfig for LLaMA models (/ in model names is replaced with _).

python -m llm_analysis.config list_gpu_configs

lists the names of all predefined GPU configurations; you can then query with

python -m llm_analysis.config get_gpu_config_by_name a100-sxm-80gb

to show the corresponding GPUConfig.

How to Set FLOPS and Memory Efficiency

Setting flops and memory efficiency to 1 (the default) gives the lower bound of training or inference latency, as it assumes peak hardware performance (which is never achieved in practice). A close-to-reality flops or memory efficiency can be found by benchmarking and profiling with the input dimensions used in the model (providing such scripts is on the TODO list).

If one has to make assumptions: for flops efficiency, the literature reports up to 0.5 for large-scale model training and up to 0.7 for inference; 0.9 can be an aggressive target for memory efficiency.
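To see how much the efficiency setting matters, here is a rough sketch of a compute-bound training estimate. It is not the library's implementation: training_seconds is a hypothetical helper, the 6 * parameters * tokens rule is a widely used approximation for forward plus backward FLOPs (no activation recomputation), and the workload (20B parameters, 100B tokens, 64 GPUs) is made up for illustration.

```python
A100_PEAK_FP16_FLOPS = 312e12  # A100 dense FP16 tensor-core peak, in FLOP/s


def training_seconds(num_params, num_tokens, num_gpus,
                     peak_flops_per_gpu, flops_efficiency=1.0):
    """Compute-bound training time under the 6 * params * tokens rule."""
    total_flops = 6 * num_params * num_tokens
    # Achieved throughput is the hardware peak scaled by the efficiency.
    achieved_flops_per_s = num_gpus * peak_flops_per_gpu * flops_efficiency
    return total_flops / achieved_flops_per_s


# Hypothetical workload: 20B parameters, 100B tokens, 64 GPUs.
for efficiency in (1.0, 0.5):
    seconds = training_seconds(20e9, 100e9, 64, A100_PEAK_FP16_FLOPS, efficiency)
    gpu_hours = seconds / 3600 * 64
    print(f"flops_efficiency={efficiency}: {gpu_hours:,.0f} GPU-hours")
```

Under these assumptions the lower bound (efficiency 1) is about 10,700 GPU-hours, and a flops efficiency of 0.5 doubles it to about 21,400 GPU-hours. The same scaling applies to memory-bound steps: divide the bytes moved by peak bandwidth times memory efficiency.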

Current Scope (expanding) and Limitations

  • tp, pp, and sp assume using Megatron-LM for training and FasterTransformer for inference, and dp assumes using DeepSpeed ZeRO
  • the parallelism strategy used in llm-analysis follows Megatron-Turing NLG 530B where tp is used within a node; then pp is used if the model is still too large to fit in GPU memory; then extra GPUs are used for dp
  • both full and selective activation recomputation are supported, as described in Reducing Activation Recomputation in Large Transformer Models
  • tp communication is calculated assuming ring allreduce; pp and dp communications across nodes are ignored for now
  • data types are expressed as a number of bits; only 16-, 8-, and 4-bit data types are modeled for now
  • pre-training and fine-tuning are modeled the same (controlled by total_num_tokens passed to the train entry function), thus only full (all model parameters) fine-tuning is supported for now
  • training assumes using the Adam optimizer
  • training time only counts forward and backward for now
  • inference assumes perfect overlapping of compute and memory operations
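Three of the assumptions above (ring allreduce for tp communication, the Adam optimizer, and perfect compute/memory overlap in inference) have simple textbook forms, sketched below. The function names are hypothetical and the formulas are the standard ones, not copied from the llm-analysis source: a bandwidth-only ring allreduce cost, the mixed-precision Adam byte count used in the ZeRO paper, and a max() for perfect overlap.

```python
def ring_allreduce_seconds(message_bytes, num_gpus, bandwidth_bytes_per_s):
    # Ring allreduce is a reduce-scatter followed by an all-gather; each GPU
    # sends and receives 2 * (n - 1) / n of the message. Link latency is ignored.
    if num_gpus == 1:
        return 0.0
    return 2 * (num_gpus - 1) / num_gpus * message_bytes / bandwidth_bytes_per_s


def adam_mixed_precision_bytes_per_param():
    # Mixed-precision Adam accounting from the ZeRO paper:
    # fp16 weight (2) + fp16 gradient (2) + fp32 master weight (4)
    # + fp32 momentum (4) + fp32 variance (4).
    return 2 + 2 + 4 + 4 + 4


def overlapped_latency_seconds(compute_seconds, memory_seconds):
    # Perfect overlap: the slower of the two fully hides the faster one.
    return max(compute_seconds, memory_seconds)


# Illustrative numbers only: a 1 GB message across 8 GPUs at 100 GB/s.
print(ring_allreduce_seconds(1e9, 8, 100e9))
print(adam_mixed_precision_bytes_per_param())
print(overlapped_latency_seconds(0.002, 0.005))
```

The max() form is why the inference estimate is optimistic: without perfect overlap, the real latency lies somewhere between the maximum and the sum of the two terms.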

Check the TODOs below for what's next and stay tuned 📻!

TODOs (stay tuned 📻)

The following features/improvements are on the roadmap. Stay tuned, and any contributions are welcome! 😄

  • Add more model case studies
  • Add scripts to benchmark and profile the latency, FLOPS and memory efficiency in real workloads
  • Improve the analytical model for inference
  • Support efficient fine-tuning methods such as LoRA or Adapters
  • Support other optimizers in training analysis
  • Add pp (across nodes) and dp (across and within a node) communications analysis
  • Add configuration/optimization advising
  • An interactive web UI with data visualization
  • Support other model architectures beyond encoders and decoders
  • Support CPU offloading (weight, kv cache, etc.) analysis in training and inference
  • Support sparse model inference
  • Support CPU for inference analysis
  • ...

Citation

If you use llm-analysis in your work, please cite:

Cheng Li. (2023). LLM-Analysis: Latency and Memory Analysis of Transformer Models for Training and Inference. GitHub repository, https://github.com/cli99/llm-analysis.

or

@misc{llm-analysis-chengli,
  author = {Cheng Li},
  title = {LLM-Analysis: Latency and Memory Analysis of Transformer Models for Training and Inference},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/cli99/llm-analysis}},
}

Contributing

Contributions and suggestions are welcome.

llm-analysis uses pre-commit to ensure code formatting is consistent. For pull requests with code contributions, please install pre-commit (pip install pre-commit) as well as the hooks it uses (pre-commit install in the repo), and format the code (this runs automatically before each git commit) before submitting the PR.

Useful Links

  1. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
  2. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
  3. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
  4. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
  5. Reducing Activation Recomputation in Large Transformer Models
  6. Training Compute-Optimal Large Language Models
  7. Efficiently Scaling Transformer Inference
  8. Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases
  9. A Comprehensive Study on Post-Training Quantization for Large Language Models
