
persia's Introduction



WARNING: THIS PROJECT IS CURRENTLY IN MAINTENANCE MODE, DUE TO COMPANY REORGANIZATION.

PERSIA (Parallel rEcommendation tRaining System with hybrId Acceleration) is developed by the AI platform team at Kuaishou Technology in collaboration with ETH. It is a PyTorch-based system (the first public one, to the best of our knowledge) for training large-scale deep learning recommendation models on commodity hardware. It is capable of training recommendation models with up to 100 trillion parameters, which, to the best of our knowledge, is the largest model size in recommendation systems so far. Empirical studies on public datasets indicate PERSIA's significant advantage over several other existing recommendation training systems [1]. Its efficiency and robustness have also been validated by multiple applications with over 100 million daily active users at Kuaishou.

Disclaimer: The program is usable and has served several important businesses. However, the official English documentation and tutorials are still under heavy construction, so they are a bit raw for now. We encourage adventurers to try out PERSIA and contribute!

Discussion

Feel free to join our Telegram Group for discussion!

References

  1. Xiangru Lian, Binhang Yuan, Xuefeng Zhu, Yulong Wang, Yongjun He, Honghuan Wu, Lei Sun, Haodong Lyu, Chengjun Liu, Xing Dong, Yiqiao Liao, Mingnan Luo, Congfei Zhang, Jingru Xie, Haonan Li, Lei Chen, Renjie Huang, Jianying Lin, Chengchun Shu, Xuezhong Qiu, Zhishan Liu, Dongying Kong, Lei Yuan, Hai Yu, Sen Yang, Ce Zhang, & Ji Liu. (2021). Persia: A Hybrid System Scaling Deep Learning Based Recommenders up to 100 Trillion Parameters.

  2. Ji Liu & Ce Zhang. (2021). Distributed Learning Systems with First-order Methods.

License

This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.

persia's People

Contributors

dependabot[bot] · github-actions[bot] · jliu87 · karoka · nobles5e · snowpeakz · williamstar


persia's Issues

Generate e2e test from examples

Generate the e2e tests from the PERSIA/examples repo. This could reduce code redundancy, since most of the e2e test code is similar to the examples code (see the sketch below).
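
A minimal sketch of one way to do this, assuming each example ships a train.py entry point and accepts a hypothetical --smoke-test flag for a short, bounded run; neither assumption is confirmed by this page:

import pathlib
import subprocess

import pytest

# Collect every example entry point on disk; one e2e test per example.
EXAMPLES = sorted(pathlib.Path("examples").glob("*/train.py"))

@pytest.mark.parametrize("script", EXAMPLES, ids=str)
def test_example_runs(script):
    # --smoke-test is a hypothetical flag limiting the run to a few batches
    subprocess.run(["python", str(script), "--smoke-test"], check=True)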

feat: embedding model definition

Add an EmbeddingModel class to manage the current embedding tensors. It is a torch.nn.Module instantiated from embedding_config.yml.

Planned features:

  • The EmbeddingModel can be initialized from embedding_config.yml.
  • Support more aggregation operations on attention_embedding_tensor, such as mean and max.
  • Allow the embedding initialization method to be defined in embedding_config.yml.
  • Abstract attention_embedding and raw_embedding into an Embedding Python class, which makes the EmbeddingModel more interpretable to users, especially in the forward and backward phases (see the sketch after this list).
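
A minimal sketch of what such a class could look like, assuming an embedding_config.yml whose slots_config entries carry name, vocab_size, and dim fields; these names are assumptions for illustration, not PERSIA's actual internals:

import torch
import yaml

class EmbeddingModel(torch.nn.Module):
    """Illustrative only: builds one pooled embedding table per configured slot."""

    def __init__(self, config_path):
        super().__init__()
        with open(config_path) as f:
            config = yaml.safe_load(f)
        # one mean-pooled embedding table per slot in the config
        self.embeddings = torch.nn.ModuleDict({
            slot["name"]: torch.nn.EmbeddingBag(slot["vocab_size"], slot["dim"], mode="mean")
            for slot in config["slots_config"]
        })

    def forward(self, batch):
        # batch maps slot name -> 2D LongTensor of ids (one row of ids per sample)
        return {name: emb(batch[name]) for name, emb in self.embeddings.items()}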

Missing .env file?

Hi, I am using PERSIA for a test on my GPU server and ran into some errors.
I have tried three ways to run it.

  1. Directly running the Docker image with docker-compose -f docker-compose.yml up
    fails with ERROR: Couldn't find env file: /xx/xx/persia/examples/docker-compose/.env

  2. Using the make command, /xx/xx/persia/examples/docker-compose$ make run
    fails with open /xx/xx/persia/examples/docker-compose/.env: no such file or directory

  3. I even tried to run train.py manually with ENABLE_CUDA=1 python train.py.
    I set rank, device_id, and world_size manually, and torch.cuda.is_available() returns True.
    It fails with:

    2021-11-27 02:48:45,834 INFO [train.py:35] test dataset size is 128
    Traceback (most recent call last):
      File "train.py", line 116, in <module>
        with TrainCtx(
      File "/opt/conda/lib/python3.8/site-packages/persia/ctx.py", line 619, in __init__
        super(TrainCtx, self).__init__(PreprocessMode.TRAIN, *args, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/persia/ctx.py", line 224, in __init__
        super(EmbeddingCtx, self).__init__(*args, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/persia/ctx.py", line 109, in __init__
        self.common_context = PersiaCommonContext(
    OverflowError: can't convert negative int to unsigned

    It's quite weird; the OverflowError suggests a negative value (presumably one of rank, device_id, or world_size) reached a parameter that expects an unsigned integer.

Do you have any suggestions, or are there documents I'm missing?
Thank you!
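
For reference, a hypothetical .env sketch in the dotenv format docker-compose expects. Only the variable names PERSIA_EMBEDDING_CONFIG and PERSIA_GLOBAL_CONFIG are grounded in the compose file quoted later on this page; the paths are placeholders, not official defaults:

# hypothetical .env -- placeholder values, not official defaults
PERSIA_EMBEDDING_CONFIG=/workspace/config/embedding_config.yml
PERSIA_GLOBAL_CONFIG=/workspace/config/global_config.yml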

API documentation

PS features require admission rules

Each feature is created with a certain probability, which is accumulated according to the feature's positive- and negative-sample probabilities: a common slot uses the positive-sample probability create_clk_prob and the negative-sample probability create_nonclk_prob, while special slots (those listed in select_prob_slots) are calculated according to the create_clk_prob and create_nonclk_prob entries in select_prob_slot. A sketch of this admission check follows.
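
A minimal sketch of such a probabilistic admission check, assuming creation is decided per sample using the probabilities named above; the config layout here is invented for illustration:

import random

def should_create_feature(slot, is_positive, cfg):
    # Decide whether a feature id not yet in the table gets an entry created.
    if slot in cfg.select_prob_slots:
        # special slots carry their own per-slot probabilities
        probs = cfg.select_prob_slot[slot]
    else:
        probs = cfg.common_probs
    p = probs.create_clk_prob if is_positive else probs.create_nonclk_prob
    return random.random() < p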

tracking issue: initial version

Documentation: @karoka


Bug:

  • persia-core PyForward and PyBackward thread handlers exit with exceptions
  • insert embeddings in the offline inference phase

Feature:


Experiments:

  • Single card + force enable communication
  • Change figure titles

Release


organize tests with pytest and pass CI

  • buildkite system test
    • add shutdown API to exit all services normally
    • use buildkite step to start training and services, use predefined port to communicate between components
    • trainer determines whether the results are correct
    • pass CI
  • pytest

Launching with honcho hangs at `SingleMachine training context init done`

Is there any solution to this? Is it caused by my configuration or something else?
One difference in my setup: I don't have sudo permission, so I couldn't put nats-server under /usr/bin; instead I unzipped the release build under /home and changed the nats-server launch path in the Procfile. Could this step be the cause?

PS needs to evict features according to business policy

  1. feature_score
    Each feature has a feature_score, calculated as
    feature_score = clk_coeff * pos_ins_num + nonclk_coeff * neg_ins_num
    On push, feature_score is always accumulated.
  2. time_decay
    end_day (the sample time) triggers time_decay; on time_decay, the feature_score is attenuated by CVM_plugin.decay_ratio.
  3. shrink_table
    shrink_table is triggered after time_decay, or when feature_num exceeds max_features during training.
    A feature is deleted if any one of these conditions is met:
    ● score < _delete_threshold
    ● value.unseen_days > _delete_after_unseen_days
    ● _select_prob_slot_set.get(value.slot) == 1 && score < _select_delete_threshold (id/group-id type features use a raised delete threshold)
    ● _photoid_slot_set.get(value.slot) == 1 && value.unseen_days > 2 (combined photo_id features are deleted after two unseen days)
    If feature_num is still greater than max_features after deletion, features are deleted in ascending order of feature_score until feature_num < max_features; see the sketch below.
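
A minimal sketch of this shrink_table pass, assuming each table entry exposes score, unseen_days, and slot fields and that the thresholds live on a config object; all names are illustrative, not PERSIA's actual internals:

def shrink_table(table, cfg):
    # table: dict mapping feature id -> entry with .score, .unseen_days, .slot
    def should_delete(value):
        if value.score < cfg.delete_threshold:
            return True
        if value.unseen_days > cfg.delete_after_unseen_days:
            return True
        # select-prob slots use a raised deletion threshold
        if value.slot in cfg.select_prob_slots and value.score < cfg.select_delete_threshold:
            return True
        # combined photo_id features expire after two unseen days
        if value.slot in cfg.photoid_slots and value.unseen_days > 2:
            return True
        return False

    table = {k: v for k, v in table.items() if not should_delete(v)}
    if len(table) > cfg.max_features:
        # still over capacity: evict the lowest-scoring entries first
        survivors = sorted(table.items(), key=lambda kv: kv[1].score, reverse=True)
        table = dict(survivors[:cfg.max_features])
    return table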

Where is EmbeddingWorkerNatsServicePublisher?

It seems that EmbeddingWorkerNatsServicePublisher is quite important for maintaining communication among the data context, the embedding servers/workers, and the NN workers, but I couldn't find the file where this class lives. Could you please tell me where the class is defined and implemented?

feat: accelerate server model dump and load

We need every server to dump and load its own model in parallel by default.

If the number of shards changes, then use a load service to load the files and dynamically insert the entries into the servers, as sketched below.
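
A minimal sketch of that re-sharding load path, under the assumptions that feature ids are integers, each checkpoint file holds a pickled {feature_id: embedding} dict, and each server exposes an insert method; the file format and the API are assumptions, not PERSIA's actual ones:

import pickle

def read_entries(path):
    # assumption: each checkpoint file is a pickled {feature_id: embedding} dict
    with open(path, "rb") as f:
        yield from pickle.load(f).items()

def reload_with_resharding(checkpoint_files, servers):
    # route every saved entry to its new shard, whatever the old shard count was
    num_shards = len(servers)
    for path in checkpoint_files:
        for feature_id, embedding in read_entries(path):
            servers[feature_id % num_shards].insert(feature_id, embedding)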

The throughput is extremely low

I adopted PERSIA to implement a DLRM model and ran it on the Criteo Kaggle dataset.
I set the batch size to 1024; below is the content of docker-compose.yml:

version: "3.2"
services:
  persia_nats_service:
    image: nats:latest
    deploy:
      replicas: 1

  data_loader1:
    env_file:
      - .docker.env
    depends_on:
      - nn_worker
      - embedding_worker
      - persia_nats_service
    image: persia-dlrm
    command: persia-launcher data-loader --replica-index 0 --replica-size 2
    volumes:
      - type: bind
        source: .
        target: /workspace
      - type: bind
        source: ../criteo_kaggle
        target: /workspace/criteo_kaggle
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure

  data_loader2:
    env_file:
      - .docker.env
    depends_on:
      - nn_worker
      - embedding_worker
      - persia_nats_service
    image: persia-dlrm
    command: persia-launcher data-loader --replica-index 1 --replica-size 2
    volumes:
      - type: bind
        source: .
        target: /workspace
      - type: bind
        source: ../criteo_kaggle
        target: /workspace/criteo_kaggle
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure

  nn_worker:
    env_file:
      - .docker.env
    environment:
      NCCL_SOCKET_IFNAME: eth0
      CUBLAS_WORKSPACE_CONFIG: :4096:8
    image: persia-dlrm
    command: persia-launcher nn-worker --nproc-per-node 1 --nnodes 1 --node-rank 0
    volumes:
      - type: bind
        source: .
        target: /workspace
      - type: bind
        source: ../criteo_kaggle
        target: /workspace/criteo_kaggle
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure

  embedding_worker1:
    env_file:
      - .docker.env
    depends_on:
      - embedding_parameter_server
    image: persia-dlrm
    command: >
      bash -c "persia-launcher embedding-worker --embedding-config $$PERSIA_EMBEDDING_CONFIG
      --global-config $$PERSIA_GLOBAL_CONFIG --replica-index 0 --replica-size 2"
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
    volumes:
      - type: bind
        source: .
        target: /workspace
      - type: bind
        source: ../criteo_kaggle
        target: /workspace/criteo_kaggle

  embedding_worker2:
    env_file:
      - .docker.env
    depends_on:
      - embedding_parameter_server
    image: persia-dlrm
    command: >
      bash -c "persia-launcher embedding-worker --embedding-config $$PERSIA_EMBEDDING_CONFIG
      --global-config $$PERSIA_GLOBAL_CONFIG --replica-index 1 --replica-size 2"
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
    volumes:
      - type: bind
        source: .
        target: /workspace
      - type: bind
        source: ../criteo_kaggle
        target: /workspace/criteo_kaggle

  embedding_parameter_server1:
    env_file:
      - .docker.env
    image: persia-dlrm
    command: >
      bash -c "persia-launcher embedding-parameter-server --embedding-config $$PERSIA_EMBEDDING_CONFIG
      --global-config $$PERSIA_GLOBAL_CONFIG --replica-index 0 --replica-size 2"
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
    volumes:
      - type: bind
        source: .
        target: /workspace
      - type: bind
        source: ../criteo_kaggle
        target: /workspace/criteo_kaggle

  embedding_parameter_server2:
    env_file:
      - .docker.env
    image: persia-dlrm
    command: >
      bash -c "persia-launcher embedding-parameter-server --embedding-config $$PERSIA_EMBEDDING_CONFIG
      --global-config $$PERSIA_GLOBAL_CONFIG --replica-index 1 --replica-size 2"
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
    volumes:
      - type: bind
        source: .
        target: /workspace
      - type: bind
        source: ../criteo_kaggle
        target: /workspace/criteo_kaggle

Here's a screenshot of the running process (image omitted; the progress bar reads about 30 it/s).

As you can see, the throughput is about 30 it/s. At a batch size of 1024, that is roughly 30,000 samples per second, only about half of the result reported in your paper.
I also noticed that the logger kept warning that the local forwarded queue is empty, and these processes didn't use any GPU memory.
Is there a problem with my settings, or do you have any suggestions on how to improve the throughput?
