
persia's Introduction



WARNING: THIS PROJECT IS CURRENTLY IN MAINTENANCE MODE, DUE TO COMPANY REORGANIZATION.

PERSIA (Parallel rEcommendation tRaining System with hybrId Acceleration) is developed by the AI platform team at Kuaishou Technology in collaboration with ETH. It is a PyTorch-based system (the first public one, to the best of our knowledge) for training large-scale deep learning recommendation models on commodity hardware. It is capable of training recommendation models with up to 100 trillion parameters, which, to the best of our knowledge, is the largest model size in recommendation systems so far. Empirical studies on public datasets indicate PERSIA's significant advantage over several other existing recommendation training systems [1]. Its efficiency and robustness have also been validated by multiple applications with over 100 million daily active users at Kuaishou.

Disclaimer: The program is usable and has served several important businesses. However, the official English documentation and tutorials are still under heavy construction, so they are a bit raw for now. We encourage adventurers to try out PERSIA and contribute!

Discussion

Feel free to join our Telegram Group for discussion!

References

  1. Xiangru Lian, Binhang Yuan, Xuefeng Zhu, Yulong Wang, Yongjun He, Honghuan Wu, Lei Sun, Haodong Lyu, Chengjun Liu, Xing Dong, Yiqiao Liao, Mingnan Luo, Congfei Zhang, Jingru Xie, Haonan Li, Lei Chen, Renjie Huang, Jianying Lin, Chengchun Shu, Xuezhong Qiu, Zhishan Liu, Dongying Kong, Lei Yuan, Hai Yu, Sen Yang, Ce Zhang, & Ji Liu. (2021). Persia: A Hybrid System Scaling Deep Learning Based Recommenders up to 100 Trillion Parameters.

  2. Ji Liu & Ce Zhang. (2021). Distributed Learning Systems with First-order Methods.

License

This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.

persia's People

Contributors

dependabot[bot] · github-actions[bot] · jliu87 · karoka · nobles5e · snowpeakz · williamstar


persia's Issues

Generate e2e test from examples

Generate the e2e tests from the PERSIA/examples repo. This could reduce code redundancy, since most of the e2e test code is similar to the examples code (see the sketch below).
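
A minimal sketch of one way to do this, assuming each example ships a train.py entry point and accepts a hypothetical --smoke-test flag for a short, bounded run; neither assumption is confirmed by this page:

import pathlib
import subprocess

import pytest

# Collect every example entry point on disk; one e2e test per example.
EXAMPLES = sorted(pathlib.Path("examples").glob("*/train.py"))

@pytest.mark.parametrize("script", EXAMPLES, ids=str)
def test_example_runs(script):
    # --smoke-test is a hypothetical flag limiting the run to a few batches
    subprocess.run(["python", str(script), "--smoke-test"], check=True)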

feat: embedding model definition

Add an EmbeddingModel class to manage the current embedding tensors. It is a torch.nn.Module instantiated from embedding_config.yml.

Planned features:

  • The EmbeddingModel can be initialized from embedding_config.yml.
  • Support more aggregation operations on attention_embedding_tensor, such as mean and max.
  • Allow the embedding initialization method to be defined in embedding_config.yml.
  • Abstract attention_embedding and raw_embedding into an Embedding Python class, which makes the EmbeddingModel more interpretable to users, especially in the forward and backward phases (see the sketch after this list).
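
A minimal sketch of what such a class could look like, assuming an embedding_config.yml whose slots_config entries carry name, vocab_size, and dim fields; these names are assumptions for illustration, not PERSIA's actual internals:

import torch
import yaml

class EmbeddingModel(torch.nn.Module):
    """Illustrative only: builds one pooled embedding table per configured slot."""

    def __init__(self, config_path):
        super().__init__()
        with open(config_path) as f:
            config = yaml.safe_load(f)
        # one mean-pooled embedding table per slot in the config
        self.embeddings = torch.nn.ModuleDict({
            slot["name"]: torch.nn.EmbeddingBag(slot["vocab_size"], slot["dim"], mode="mean")
            for slot in config["slots_config"]
        })

    def forward(self, batch):
        # batch maps slot name -> 2D LongTensor of ids (one row of ids per sample)
        return {name: emb(batch[name]) for name, emb in self.embeddings.items()}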

Missing .env file?

Hi, I am using PERSIA for a test on my GPU server and ran into some errors.
I have tried three ways to run it.

  1. Directly running the Docker image with docker-compose -f docker-compose.yml up
    fails with ERROR: Couldn't find env file: /xx/xx/persia/examples/docker-compose/.env

  2. Using the make command, /xx/xx/persia/examples/docker-compose$ make run
    fails with open /xx/xx/persia/examples/docker-compose/.env: no such file or directory

  3. I even tried to run train.py manually with ENABLE_CUDA=1 python train.py.
    I set rank, device_id, and world_size manually, and torch.cuda.is_available() returns True.
    It fails with:

    2021-11-27 02:48:45,834 INFO [train.py:35] test dataset size is 128
    Traceback (most recent call last):
      File "train.py", line 116, in <module>
        with TrainCtx(
      File "/opt/conda/lib/python3.8/site-packages/persia/ctx.py", line 619, in __init__
        super(TrainCtx, self).__init__(PreprocessMode.TRAIN, *args, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/persia/ctx.py", line 224, in __init__
        super(EmbeddingCtx, self).__init__(*args, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/persia/ctx.py", line 109, in __init__
        self.common_context = PersiaCommonContext(
    OverflowError: can't convert negative int to unsigned

    It's quite weird; the OverflowError suggests a negative value (presumably one of rank, device_id, or world_size) reached a parameter that expects an unsigned integer.

Do you have any suggestions, or are there documents I'm missing?
Thank you!
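
For reference, a hypothetical .env sketch in the dotenv format docker-compose expects. Only the variable names PERSIA_EMBEDDING_CONFIG and PERSIA_GLOBAL_CONFIG are grounded in the compose file quoted later on this page; the paths are placeholders, not official defaults:

# hypothetical .env -- placeholder values, not official defaults
PERSIA_EMBEDDING_CONFIG=/workspace/config/embedding_config.yml
PERSIA_GLOBAL_CONFIG=/workspace/config/global_config.yml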

API documentation

PS features require admission rules

Each feature is created with a certain probability, which is accumulated according to the feature's positive- and negative-sample probabilities: a common slot uses the positive-sample probability create_clk_prob and the negative-sample probability create_nonclk_prob, while special slots (those listed in select_prob_slots) are calculated according to the create_clk_prob and create_nonclk_prob entries in select_prob_slot. A sketch of this admission check follows.
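
A minimal sketch of such a probabilistic admission check, assuming creation is decided per sample using the probabilities named above; the config layout here is invented for illustration:

import random

def should_create_feature(slot, is_positive, cfg):
    # Decide whether a feature id not yet in the table gets an entry created.
    if slot in cfg.select_prob_slots:
        # special slots carry their own per-slot probabilities
        probs = cfg.select_prob_slot[slot]
    else:
        probs = cfg.common_probs
    p = probs.create_clk_prob if is_positive else probs.create_nonclk_prob
    return random.random() < p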

tracking issue: initial version

Documentation: @karoka


Bug:

  • persia-core PyForward and PyBackward thread handlers exit with exceptions
  • insert embeddings in the offline inference phase

Feature:


Experiments:

  • Single card + force enable communication
  • Change figure titles

Release


organize tests with pytest and pass CI

  • buildkite system test
    • add shutdown API to exit all services normally
    • use buildkite step to start training and services, use predefined port to communicate between components
    • trainer determines whether the results are correct
    • pass CI
  • pytest

Launching with honcho hangs at `SingleMachine training context init done`

Is there any solution to this? Is it caused by my configuration or something else?
One difference in my setup: I don't have sudo permission, so I couldn't put nats-server under /usr/bin; instead I unzipped the release build under /home and changed the nats-server launch path in the Procfile. Could this step be the cause?

PS needs to evict features according to business policy

  1. feature_score
    Each feature has a feature_score, calculated as
    feature_score = clk_coeff * pos_ins_num + nonclk_coeff * neg_ins_num
    On push, feature_score is always accumulated.
  2. time_decay
    end_day (the sample time) triggers time_decay; on time_decay, the feature_score is attenuated by CVM_plugin.decay_ratio.
  3. shrink_table
    shrink_table is triggered after time_decay, or when feature_num exceeds max_features during training.
    A feature is deleted if any one of these conditions is met:
    ● score < _delete_threshold
    ● value.unseen_days > _delete_after_unseen_days
    ● _select_prob_slot_set.get(value.slot) == 1 && score < _select_delete_threshold (id/group-id type features use a raised delete threshold)
    ● _photoid_slot_set.get(value.slot) == 1 && value.unseen_days > 2 (combined photo_id features are deleted after two unseen days)
    If feature_num is still greater than max_features after deletion, features are deleted in ascending order of feature_score until feature_num < max_features; see the sketch below.
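
A minimal sketch of this shrink_table pass, assuming each table entry exposes score, unseen_days, and slot fields and that the thresholds live on a config object; all names are illustrative, not PERSIA's actual internals:

def shrink_table(table, cfg):
    # table: dict mapping feature id -> entry with .score, .unseen_days, .slot
    def should_delete(value):
        if value.score < cfg.delete_threshold:
            return True
        if value.unseen_days > cfg.delete_after_unseen_days:
            return True
        # select-prob slots use a raised deletion threshold
        if value.slot in cfg.select_prob_slots and value.score < cfg.select_delete_threshold:
            return True
        # combined photo_id features expire after two unseen days
        if value.slot in cfg.photoid_slots and value.unseen_days > 2:
            return True
        return False

    table = {k: v for k, v in table.items() if not should_delete(v)}
    if len(table) > cfg.max_features:
        # still over capacity: evict the lowest-scoring entries first
        survivors = sorted(table.items(), key=lambda kv: kv[1].score, reverse=True)
        table = dict(survivors[:cfg.max_features])
    return table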

Where is EmbeddingWorkerNatsServicePublisher?

It seems that EmbeddingWorkerNatsServicePublisher is quite important for maintaining communication among the data context, the embedding servers/workers, and the NN workers, but I couldn't find the file where this class lives. Could you please tell me where the class is defined and implemented?

feat: accelerate server model dump and load

We need every server to dump and load its own model in parallel by default.

If the number of shards changes, then use a load service to load the files and dynamically insert the entries into the servers, as sketched below.
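
A minimal sketch of that re-sharding load path, under the assumptions that feature ids are integers, each checkpoint file holds a pickled {feature_id: embedding} dict, and each server exposes an insert method; the file format and the API are assumptions, not PERSIA's actual ones:

import pickle

def read_entries(path):
    # assumption: each checkpoint file is a pickled {feature_id: embedding} dict
    with open(path, "rb") as f:
        yield from pickle.load(f).items()

def reload_with_resharding(checkpoint_files, servers):
    # route every saved entry to its new shard, whatever the old shard count was
    num_shards = len(servers)
    for path in checkpoint_files:
        for feature_id, embedding in read_entries(path):
            servers[feature_id % num_shards].insert(feature_id, embedding)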

The throughput is extremely low

I adopted PERSIA to implement a DLRM model and ran it on the Criteo Kaggle dataset.
I set the batch size to 1024; below is the content of docker-compose.yml:

version: "3.2"
services:
  persia_nats_service:
    image: nats:latest
    deploy:
      replicas: 1

  data_loader1:
    env_file:
      - .docker.env
    depends_on:
      - nn_worker
      - embedding_worker
      - persia_nats_service
    image: persia-dlrm
    command: persia-launcher data-loader --replica-index 0 --replica-size 2
    volumes:
      - type: bind
        source: .
        target: /workspace
      - type: bind
        source: ../criteo_kaggle
        target: /workspace/criteo_kaggle
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure

  data_loader2:
    env_file:
      - .docker.env
    depends_on:
      - nn_worker
      - embedding_worker
      - persia_nats_service
    image: persia-dlrm
    command: persia-launcher data-loader --replica-index 1 --replica-size 2
    volumes:
      - type: bind
        source: .
        target: /workspace
      - type: bind
        source: ../criteo_kaggle
        target: /workspace/criteo_kaggle
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure

  nn_worker:
    env_file:
      - .docker.env
    environment:
      NCCL_SOCKET_IFNAME: eth0
      CUBLAS_WORKSPACE_CONFIG: :4096:8
    image: persia-dlrm
    command: persia-launcher nn-worker --nproc-per-node 1 --nnodes 1 --node-rank 0
    volumes:
      - type: bind
        source: .
        target: /workspace
      - type: bind
        source: ../criteo_kaggle
        target: /workspace/criteo_kaggle
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure

  embedding_worker1:
    env_file:
      - .docker.env
    depends_on:
      - embedding_parameter_server
    image: persia-dlrm
    command: >
      bash -c "persia-launcher embedding-worker --embedding-config $$PERSIA_EMBEDDING_CONFIG
      --global-config $$PERSIA_GLOBAL_CONFIG --replica-index 0 --replica-size 2"
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
    volumes:
      - type: bind
        source: .
        target: /workspace
      - type: bind
        source: ../criteo_kaggle
        target: /workspace/criteo_kaggle

  embedding_worker2:
    env_file:
      - .docker.env
    depends_on:
      - embedding_parameter_server
    image: persia-dlrm
    command: >
      bash -c "persia-launcher embedding-worker --embedding-config $$PERSIA_EMBEDDING_CONFIG
      --global-config $$PERSIA_GLOBAL_CONFIG --replica-index 1 --replica-size 2"
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
    volumes:
      - type: bind
        source: .
        target: /workspace
      - type: bind
        source: ../criteo_kaggle
        target: /workspace/criteo_kaggle

  embedding_parameter_server1:
    env_file:
      - .docker.env
    image: persia-dlrm
    command: >
      bash -c "persia-launcher embedding-parameter-server --embedding-config $$PERSIA_EMBEDDING_CONFIG
      --global-config $$PERSIA_GLOBAL_CONFIG --replica-index 0 --replica-size 2"
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
    volumes:
      - type: bind
        source: .
        target: /workspace
      - type: bind
        source: ../criteo_kaggle
        target: /workspace/criteo_kaggle

  embedding_parameter_server2:
    env_file:
      - .docker.env
    image: persia-dlrm
    command: >
      bash -c "persia-launcher embedding-parameter-server --embedding-config $$PERSIA_EMBEDDING_CONFIG
      --global-config $$PERSIA_GLOBAL_CONFIG --replica-index 1 --replica-size 2"
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
    volumes:
      - type: bind
        source: .
        target: /workspace
      - type: bind
        source: ../criteo_kaggle
        target: /workspace/criteo_kaggle

Here's a screenshot of the running process (image omitted; the progress bar reads about 30 it/s).

As you can see, the throughput is about 30 it/s. At a batch size of 1024, that is roughly 30,000 samples per second, only about half of the result reported in your paper.
I also noticed that the logger kept warning that the local forwarded queue is empty, and these processes didn't use any GPU memory.
Is there a problem with my settings, or do you have any suggestions on how to improve the throughput?
