
higgsfield-ai / higgsfield


A fault-tolerant, highly scalable GPU orchestration and machine learning framework designed for training models with billions to trillions of parameters

License: Apache License 2.0

Jupyter Notebook 82.28% Python 16.23% Dockerfile 0.02% Jinja 1.47%
cluster-management deep-learning distributed llama llama2 llm machine-learning mlops pytorch

higgsfield's Introduction

higgsfield - multi node training without crying

Higgsfield is an open-source, fault-tolerant, highly scalable GPU orchestration and machine learning framework designed for training models with billions to trillions of parameters, such as Large Language Models (LLMs).


[Figure: architecture]

Higgsfield serves as a GPU workload manager and machine learning framework with five primary functions:

  1. Allocating exclusive and non-exclusive access to compute resources (nodes) to users for their training tasks.
  2. Supporting the DeepSpeed ZeRO-3 API and PyTorch's Fully Sharded Data Parallel (FSDP) API, enabling efficient sharding for trillion-parameter models.
  3. Offering a framework for initiating, executing, and monitoring the training of large neural networks on allocated nodes.
  4. Managing resource contention by maintaining a queue for running experiments.
  5. Facilitating continuous integration of machine learning development through seamless integration with GitHub and GitHub Actions.

Higgsfield streamlines the process of training massive models and empowers developers with a versatile and robust toolset.
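
As a rough illustration of item 4 above, resource contention can be handled with a first-in, first-out queue that starts an experiment only when enough nodes are free. The sketch below is not Higgsfield's scheduler; `ExperimentQueue`, `submit`, and `release` are hypothetical names chosen for this example.

```python
from collections import deque


class ExperimentQueue:
    """Toy FIFO scheduler: hypothetical, not Higgsfield's actual implementation."""

    def __init__(self, total_nodes):
        self.free_nodes = total_nodes
        self.waiting = deque()  # (name, nodes_needed) in submission order
        self.running = {}       # name -> number of nodes held

    def submit(self, name, nodes_needed):
        """Queue an experiment, then start whatever fits. Returns started names."""
        self.waiting.append((name, nodes_needed))
        return self._schedule()

    def release(self, name):
        """Mark an experiment as finished and hand its nodes to waiting work."""
        self.free_nodes += self.running.pop(name)
        return self._schedule()

    def _schedule(self):
        """Start experiments from the head of the queue while nodes are available."""
        started = []
        # Strict FIFO: stop at the first experiment that does not fit.
        while self.waiting and self.waiting[0][1] <= self.free_nodes:
            name, nodes_needed = self.waiting.popleft()
            self.free_nodes -= nodes_needed
            self.running[name] = nodes_needed
            started.append(name)
        return started
```

With four nodes, submitting an experiment that needs three starts it immediately, while a later two-node experiment waits until `release` frees the nodes. Strict FIFO keeps ordering predictable, but a small job can sit idle behind a large one. A real scheduler would also have to reject requests larger than the cluster.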

Install

$ pip install higgsfield==0.0.3

Train example

That's all you have to do in order to train LLaMa in a distributed setting:

from higgsfield.llama import Llama70b
from higgsfield.loaders import LlamaLoader
from higgsfield.experiment import experiment

import torch.optim as optim
from alpaca import get_alpaca_data

# Register this function as the "alpaca" experiment.
@experiment("alpaca")
def train(params):
    # Llama 70B with ZeRO stage 3 sharding in bfloat16 precision.
    model = Llama70b(zero_stage=3, fast_attn=False, precision="bf16")

    optimizer = optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.0)

    dataset = get_alpaca_data(split="train")
    train_loader = LlamaLoader(dataset, max_words=2048)

    # Standard PyTorch training loop.
    for batch in train_loader:
        optimizer.zero_grad()
        loss = model(batch)
        loss.backward()
        optimizer.step()

    model.push_to_hub('alpaca-70b')

How is it all done?

  1. We install all the required tools on your server (Docker, your project's deploy keys, the higgsfield binary).
  2. Then we generate deploy and run workflows for your experiments.
  3. As soon as your code is pushed to GitHub, it is automatically deployed to your nodes.
  4. Then you access your experiments' run UI through GitHub, which launches the experiments and saves the checkpoints.

Design

We follow the standard PyTorch workflow. Thus you can incorporate anything besides what we provide, such as DeepSpeed or Accelerate, or just implement your custom PyTorch sharding from scratch.

Environment hell

No more juggling different versions of PyTorch, NVIDIA drivers, and data processing libraries. You can easily orchestrate experiments and their environments, and document and track the specific versions and configurations of all dependencies to ensure reproducibility.
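
To make the idea of tracking dependency versions concrete, here is a minimal sketch that records the interpreter, platform, and installed package versions using only the Python standard library. It is independent of Higgsfield; `snapshot_environment` is a hypothetical helper name, and the package list is up to you.

```python
import platform
import sys
from importlib import metadata


def snapshot_environment(packages):
    """Return a dict recording the interpreter, OS, and versions of the given packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }
```

Calling `snapshot_environment(["torch", "numpy"])` and saving the result next to each checkpoint (for example with `json.dump`) gives a simple record of what a run was trained with.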

Config hell

No need to define 600 arguments for your experiment. No more YAML witchcraft. You can use whatever you want, whenever you want. We just introduce a simple interface to define your experiments. We have even taken it further: now you only need to design the way to interact.
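
One way to picture a decorator-based interface like the `@experiment("alpaca")` call in the train example is a small registry that maps experiment names to plain Python functions. The sketch below only illustrates that pattern. Higgsfield's real decorator may work differently, and `_EXPERIMENTS` and `run_experiment` are hypothetical names.

```python
# Hypothetical registry; Higgsfield's real @experiment decorator may work differently.
_EXPERIMENTS = {}


def experiment(name):
    """Register a training function under `name` so a launcher can find it later."""
    def decorator(func):
        if name in _EXPERIMENTS:
            raise ValueError(f"experiment {name!r} is already registered")
        _EXPERIMENTS[name] = func
        return func
    return decorator


def run_experiment(name, **params):
    """Look up a registered experiment and call it with a plain dict of parameters."""
    try:
        func = _EXPERIMENTS[name]
    except KeyError:
        known = ", ".join(sorted(_EXPERIMENTS)) or "none"
        raise KeyError(f"unknown experiment {name!r}; registered: {known}") from None
    return func(params)


@experiment("toy")
def train(params):
    # Parameters arrive as an ordinary dict, so no YAML schema is needed.
    return {"lr": params.get("lr", 1e-5), "epochs": params.get("epochs", 1)}
```

Here `run_experiment("toy", lr=0.1)` calls `train` with `{"lr": 0.1}`. The experiment definition stays ordinary Python, so defaults and validation live in the function itself.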

Compatibility

We need you to have nodes with:

  • Ubuntu
  • SSH access
  • Non-root user with sudo privileges (passwordless sudo is required)
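
The requirements above can be checked on a node before setup. The following is a sketch under stated assumptions, not part of Higgsfield: it treats a node as Ubuntu when `/etc/os-release` contains `ID=ubuntu`, and it treats sudo as passwordless when `sudo -n true` exits successfully. The function names are hypothetical.

```python
import subprocess


def parse_os_release(text):
    """Parse the KEY=value lines of an /etc/os-release file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip("\"'")  # values may be quoted
    return fields


def is_ubuntu(os_release_text):
    """True if the os-release content identifies the distribution as Ubuntu."""
    return parse_os_release(os_release_text).get("ID") == "ubuntu"


def has_passwordless_sudo():
    """True if `sudo -n true` succeeds, i.e. sudo works without prompting."""
    try:
        result = subprocess.run(["sudo", "-n", "true"], capture_output=True)
    except FileNotFoundError:
        return False  # sudo is not installed
    return result.returncode == 0
```

On a node, you would read `/etc/os-release`, pass its contents to `is_ubuntu`, and call `has_passwordless_sudo()` as the non-root user that will run the training.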

Clouds we have tested on:

  • Azure
  • LambdaLabs
  • FluidStack

Feel free to open an issue if you have any problems with other clouds.

Getting started

See the quick start guide for how to set up your nodes and start training.

API for common tasks in Large Language Model training.

Platform      | Purpose                                                           | Estimated Response Time | Support Level
GitHub Issues | Bug reports, feature requests, install issues, usage issues, etc. | < 1 day                 | Higgsfield Team
Twitter       | For staying up-to-date on new features.                           | Daily                   | Higgsfield Team
Website       | Discussion, news.                                                 | < 2 days                | Higgsfield Team

higgsfield's People

Contributors

arpanetus, dependabot[bot], higgsfield, zxcjhg


higgsfield's Issues

Some issues when trying to use it in real life :)

Hi, I tried to use your product but ran into a lot of small issues and found some missing functionality.

  1. Please don't assume that the git address looks something like this:


    I tried to use it with our internal GitHub, which differs from public GitHub, and got an error.

  2. There are a lot of cases where your error messages are useless, as in the first example.
    I tried to use higgsfield manually and got a lot of messages like 'something is not a string'.
    A quick debug helped me find that I had forgotten a command line parameter or passed a wrong one. This could be improved.

  3. Llama and Hugging Face:

    tokenizer=LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf"),

    When I import the Llama loader, it automatically tries to access HF without my permission. Overall, trying to access something on the internet without explicit calls is a big red flag from a security point of view. In my case I've already downloaded everything and don't need to connect to HF at all.

  4. Would be nice to see more examples:

  • a very simple, manually implemented architecture that supports DeepSpeed/ZeRO distributed training.
  • an example that shows how to run everything manually without GitHub and HF access.
  • the ability to run your code on a single machine, with a single GPU and with multiple GPUs.
    Otherwise, how do you expect people to debug their code?
    I wanted to run a simple example without setting up my machines and using GitHub and found it impossible, which is a big problem in my opinion.

Overall great job and nice implementation, but it could be much more user-friendly.
Thanks!
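
Regarding the first point of this issue, the exact address format that failed is not shown, so the following is only a sketch of how a git remote could be parsed without assuming a `github.com` host. It accepts HTTPS, `ssh://`, and scp-like (`user@host:path`) remotes. `parse_git_remote` is a hypothetical helper, not Higgsfield code, and the hostnames in the usage note are made up.

```python
import re
from urllib.parse import urlparse

# scp-like syntax used by git over SSH: [user@]host:path
_SCP_LIKE = re.compile(r"^(?:[^@/\s]+@)?(?P<host>[^:/\s]+):(?P<path>[^\s]+)$")


def parse_git_remote(url):
    """Return (host, owner, repo) for HTTPS, ssh://, or scp-like git remotes."""
    url = url.strip()
    if "://" in url:
        parsed = urlparse(url)
        host, path = parsed.hostname, parsed.path
    else:
        match = _SCP_LIKE.match(url)
        if match is None:
            raise ValueError(f"unrecognized git remote: {url!r}")
        host, path = match.group("host"), match.group("path")
    path = path.strip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    # Everything before the last slash is the owner (may contain nested groups).
    owner, _, repo = path.rpartition("/")
    if not host or not owner or not repo:
        raise ValueError(f"git remote {url!r} is missing a host, owner, or repository")
    return host, owner, repo
```

For example, `parse_git_remote("git@git.example.internal:team/project.git")` returns `("git.example.internal", "team", "project")`. Raising a `ValueError` that names the offending remote also speaks to the second point, since the message tells the user which input was wrong.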
