
monolith's Introduction

Monolith

What is it?

Monolith is a deep learning framework for large-scale recommendation modeling. It introduces two features that are crucial for advanced recommendation systems:

  • collisionless embedding tables guarantee a unique representation for each distinct ID feature
  • real-time training captures the latest hotspots and helps users discover new interests rapidly

Monolith is built on top of TensorFlow and supports batch and real-time training and serving.
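
To illustrate the first feature, here is a minimal sketch contrasting a fixed-size hashed table, where distinct IDs can share a row, with a table keyed by the raw ID, where they cannot. This is a toy illustration in plain Python and not Monolith's implementation; the class names HashedTable and CollisionlessTable are hypothetical.

```python
import random


class HashedTable:
    """Fixed-size table: the row index is id % num_rows, so distinct IDs can collide."""

    def __init__(self, num_rows, dim, seed=0):
        rng = random.Random(seed)
        self.num_rows = num_rows
        self.rows = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                     for _ in range(num_rows)]

    def lookup(self, feature_id):
        return self.rows[feature_id % self.num_rows]


class CollisionlessTable:
    """Table keyed by the raw ID: each distinct ID gets its own row on first use."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.rows = {}

    def lookup(self, feature_id):
        if feature_id not in self.rows:
            self.rows[feature_id] = [self.rng.uniform(-0.1, 0.1)
                                     for _ in range(self.dim)]
        return self.rows[feature_id]


hashed = HashedTable(num_rows=4, dim=2)
collisionless = CollisionlessTable(dim=2)

# IDs 3 and 7 land on the same row in the hashed table (3 % 4 == 7 % 4).
shares_row_hashed = hashed.lookup(3) is hashed.lookup(7)
# In the table keyed by raw ID they get separate rows.
shares_row_collisionless = collisionless.lookup(3) is collisionless.lookup(7)
print(shares_row_hashed, shares_row_collisionless)
```

When two IDs share a row, gradient updates for one also move the other, which is the interference a collisionless table avoids.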

Discussion Group

Join us at Discord

https://discord.gg/QYTDeKxGMX

Quick start

Build from source

Currently, compilation is supported only on Linux.

First, download and install Bazel 3.1.0:

wget https://github.com/bazelbuild/bazel/releases/download/3.1.0/bazel-3.1.0-installer-linux-x86_64.sh && \
  chmod +x bazel-3.1.0-installer-linux-x86_64.sh && \
  ./bazel-3.1.0-installer-linux-x86_64.sh && \
  rm bazel-3.1.0-installer-linux-x86_64.sh

Then, prepare a Python environment:

pip install -U --user pip numpy wheel packaging requests opt_einsum
pip install -U --user keras_preprocessing --no-deps

Finally, you can build any target in Monolith. For example:

bazel run //monolith/native_training:demo --output_filter=IGNORE_LOGS
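
Before running the build, it can help to confirm that the Bazel on your PATH is the 3.1.0 release named above. The following is a small sketch that assumes `bazel --version` prints a line such as "bazel 3.1.0"; the helper names are hypothetical.

```python
import re
import shutil
import subprocess

REQUIRED_BAZEL = (3, 1, 0)  # version named in the build instructions


def parse_bazel_version(text):
    """Extract (major, minor, patch) from text such as 'bazel 3.1.0'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def installed_bazel_version():
    """Return the installed Bazel version, or None if bazel is not on PATH."""
    if shutil.which("bazel") is None:
        return None
    result = subprocess.run(["bazel", "--version"],
                            capture_output=True, text=True)
    return parse_bazel_version(result.stdout)


if __name__ == "__main__":
    found = installed_bazel_version()
    if found is None:
        print("bazel not found on PATH")
    elif found != REQUIRED_BAZEL:
        print(f"found bazel {found}, instructions expect {REQUIRED_BAZEL}")
    else:
        print("bazel version matches the instructions")
```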

Demo and tutorials

There is a tutorial in markdown/demo on how to run distributed async training, and there are a few guides on how to use the MonolithModel API here.

monolith's People

Contributors

hanzhi713, zhangpiu, zlqiszlqbd

monolith's Issues

??

What kind of high-end encryption is this? Or is this just the "Universe Factory" (a nickname for ByteDance) amusing itself?

can monolith load data from hdfs?

We are deploying Monolith in our environment.
We manage our data with PySpark, so our data input is usually a PySpark DataFrame.
In the demos, Monolith loads data from tdfs or Kafka.
Does Monolith support loading data from a PySpark DataFrame or an HDFS directory?
Or do we have to dump files from PySpark to local memory so that Monolith can load them?

AttributeError: module 'tensorflow.tools.docs.doc_controls' has no attribute 'inheritable_header'

I got an error when running the demo command bazel run //markdown/demo:demo_local_runner -- --training_type=batch.

Here is the error:

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/c5b32d279cfc333125f37cd2f6c40738/execroot/__main__/bazel-out/k8-opt/bin/markdown/demo/demo_local_runner.runfiles/__main__/markdown/demo/demo_model.py", line 24, in <module>
    from kafka_receiver import decode_example, to_ragged
  File "/home/arditto.trianggada/project/monolith/markdown/demo/kafka_receiver.py", line 16, in <module>
    from monolith.native_training.data.datasets import create_plain_kafka_dataset
  File "/root/.cache/bazel/_bazel_root/c5b32d279cfc333125f37cd2f6c40738/execroot/__main__/bazel-out/k8-opt/bin/markdown/demo/demo_local_runner.runfiles/__main__/monolith/native_training/data/datasets.py", line 46, in <module>
    from monolith.native_training.hooks import ckpt_hooks
  File "/root/.cache/bazel/_bazel_root/c5b32d279cfc333125f37cd2f6c40738/execroot/__main__/bazel-out/k8-opt/bin/markdown/demo/demo_local_runner.runfiles/__main__/monolith/native_training/hooks/ckpt_hooks.py", line 26, in <module>
    from monolith.native_training import barrier_ops
  File "/root/.cache/bazel/_bazel_root/c5b32d279cfc333125f37cd2f6c40738/execroot/__main__/bazel-out/k8-opt/bin/markdown/demo/demo_local_runner.runfiles/__main__/monolith/native_training/barrier_ops.py", line 143, in <module>
    class BarrierHook(tf.estimator.SessionRunHook):
  File "/root/.cache/bazel/_bazel_root/c5b32d279cfc333125f37cd2f6c40738/execroot/__main__/bazel-out/k8-opt/bin/markdown/demo/demo_local_runner.runfiles/org_tensorflow/tensorflow/python/util/lazy_loader.py", line 62, in __getattr__
    module = self._load()
  File "/root/.cache/bazel/_bazel_root/c5b32d279cfc333125f37cd2f6c40738/execroot/__main__/bazel-out/k8-opt/bin/markdown/demo/demo_local_runner.runfiles/org_tensorflow/tensorflow/python/util/lazy_loader.py", line 45, in _load
    module = importlib.import_module(self.__name__)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/__init__.py", line 8, in <module>
    from tensorflow_estimator._api.v1 import estimator
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/_api/v1/estimator/__init__.py", line 8, in <module>
    from tensorflow_estimator._api.v1.estimator import experimental
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/_api/v1/estimator/experimental/__init__.py", line 8, in <module>
    from tensorflow_estimator.python.estimator.canned.dnn import dnn_logit_fn_builder
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/canned/dnn.py", line 27, in <module>
    from tensorflow_estimator.python.estimator import estimator
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 70, in <module>
    @doc_controls.inheritable_header("""\
AttributeError: module 'tensorflow.tools.docs.doc_controls' has no attribute 'inheritable_header'

FYI my setup:

  • Ubuntu 20.04
  • Python 3.8.10
  • GCC 9.4.0
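
In the traceback above, tensorflow_estimator is imported from /usr/local/lib/python3.8/dist-packages while tensorflow itself comes from the Bazel runfiles (org_tensorflow). One possible reading, which is an assumption and not a confirmed diagnosis, is that the two packages come from mismatched versions. A generic sketch for checking where a module is loaded from and whether it exposes a given attribute follows; the function name describe_attribute is hypothetical.

```python
import importlib


def describe_attribute(module_name, attribute):
    """Report where a module was loaded from and whether it has an attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        return {"module": module_name, "error": str(exc)}
    return {
        "module": module_name,
        "file": getattr(module, "__file__", None),
        "has_attribute": hasattr(module, attribute),
    }


if __name__ == "__main__":
    # Module and attribute names are taken from the traceback above.
    print(describe_attribute("tensorflow.tools.docs.doc_controls",
                             "inheritable_header"))
```

The result is only meaningful when the check runs with the same interpreter and module search path as the failing demo.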

What is the version of python and g++

Dear Author,

I'm running into errors when test-building with this command: bazel run //monolith/native_training:demo --output_filter=IGNORE_LOGS. The error message is shown below. I suspect it is related to a difference in Python and C/C++ versions. Could you please share the Python and g++ versions that you are currently using? Many thanks!

/external/com_github_grpc_grpc/src/python/grpcio/grpc/_cython/BUILD.bazel:5:1: C++ compilation of rule '@com_github_grpc_grpc//src/python/grpcio/grpc/_cython:cygrpc.so' failed (Exit 1)
bazel-out/k8-opt/bin/external/com_github_grpc_grpc/src/python/grpcio/grpc/_cython/cygrpc.cpp: In function 'PyObject* __pyx_f_4grpc_7_cython_6cygrpc__initialize()':
bazel-out/k8-opt/bin/external/com_github_grpc_grpc/src/python/grpcio/grpc/_cython/cygrpc.cpp:81271:29: warning: 'void PyEval_InitThreads()' is deprecated [-Wdeprecated-declarations]
81271 |   (void)(PyEval_InitThreads());
      |                             ^
In file included from bazel-out/k8-opt/bin/external/local_config_python/python_include/Python.h:145,
                 from bazel-out/k8-opt/bin/external/com_github_grpc_grpc/src/python/grpcio/grpc/_cython/cygrpc.cpp:4:
bazel-out/k8-opt/bin/external/local_config_python/python_include/ceval.h:130:37: note: declared here
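
For reports like this one, it helps to attach the interpreter and compiler versions. Note that PyEval_InitThreads was deprecated in Python 3.9, so the warning in the log hints, without proving, that the build picked up headers from Python 3.9 or newer. Below is a small sketch that collects the relevant versions; the helper names are hypothetical and it assumes the tools accept a --version flag.

```python
import platform
import shutil
import subprocess


def tool_version(command):
    """Return the first line of `<command> --version`, or None if it is missing."""
    if shutil.which(command) is None:
        return None
    result = subprocess.run([command, "--version"],
                            capture_output=True, text=True)
    lines = (result.stdout or result.stderr).splitlines()
    return lines[0] if lines else None


def build_environment():
    """Collect the versions that usually matter for a C++/Python build."""
    return {
        "python": platform.python_version(),
        "g++": tool_version("g++"),
        "gcc": tool_version("gcc"),
    }


if __name__ == "__main__":
    for name, version in build_environment().items():
        print(f"{name}: {version}")
```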

Question about embedding table lookup

I'm new to this repository and am trying to understand the demos. The embedding table lookup confuses me. The demo says "Note that we do not use features directly to obtain the sparse ids here, as it is handled internally through self.lookup_embedding_slice", but how does the model's lookup_embedding_slice obtain the sparse ids? Thanks.

Monolith Docker Image Unavailable

Hi Team,

Currently, the third demo, which runs Monolith in the cloud, cannot be run without a working Docker image. Could you share a working Docker image so that I can try the real distributed training mode? Thank you!

Environment installation issues when running //monolith/native_training:demo

I ran "bazel run //monolith/native_training:demo --output_filter=IGNORE_LOGS" and got the following error.
Can someone help with this? Thanks.

error: Command "clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /Users/cy/opt/anaconda3/envs/python310/include -fPIC -O2 -isystem /Users/cy/opt/anaconda3/envs/python310/include -g0 -DNPY_INTERNAL_BUILD=1 -DHAVE_NPY_CONFIG_H=1 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1 -D_LARGEFILE64_SOURCE=1 -DNO_ATLAS_INFO=3 -DHAVE_CBLAS -Ibuild/src.macosx-10.9-x86_64-3.10/numpy/core/src/umath -Ibuild/src.macosx-10.9-x86_64-3.10/numpy/core/src/npymath -Ibuild/src.macosx-10.9-x86_64-3.10/numpy/core/src/common -Inumpy/core/include -Ibuild/src.macosx-10.9-x86_64-3.10/numpy/core/include/numpy -Inumpy/core/src/common -Inumpy/core/src -Inumpy/core -Inumpy/core/src/npymath -Inumpy/core/src/multiarray -Inumpy/core/src/umath -Inumpy/core/src/npysort -I/Users/cy/PycharmProjects/chatGLM/tensorflow/monolith/my-test-env/include -I/Users/cy/opt/anaconda3/envs/python310/include/python3.10 -Ibuild/src.macosx-10.9-x86_64-3.10/numpy/core/src/common -Ibuild/src.macosx-10.9-x86_64-3.10/numpy/core/src/npymath -c build/src.macosx-10.9-x86_64-3.10/numpy/core/src/multiarray/scalartypes.c -o build/temp.macosx-10.9-x86_64-3.10/build/src.macosx-10.9-x86_64-3.10/numpy/core/src/multiarray/scalartypes.o -MMD -MF build/temp.macosx-10.9-x86_64-3.10/build/src.macosx-10.9-x86_64-3.10/numpy/core/src/multiarray/scalartypes.o.d -msse3 -I/System/Library/Frameworks/vecLib.framework/Headers" failed with exit status 1

ERROR: Failed building wheel for numpy
ERROR: Command errored out with exit status 1:
command: /Users/cy/PycharmProjects/chatGLM/tensorflow/monolith/my-test-env/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/dk/p__18bnx7c96lgzrbn7fjh2h0000gn/T/pip-wheel-8gof1mee/numpy/setup.py'"'"'; file='"'"'/private/var/folders/dk/p__18bnx7c96lgzrbn7fjh2h0000gn/T/pip-wheel-8gof1mee/numpy/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' clean --all
cwd: /private/var/folders/dk/p__18bnx7c96lgzrbn7fjh2h0000gn/T/pip-wheel-8gof1mee/numpy

ERROR: Failed cleaning build dir for numpy
ERROR: Failed to build one or more wheels
Traceback (most recent call last):
  File "/Users/cy/opt/anaconda3/envs/python310/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/cy/opt/anaconda3/envs/python310/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/private/var/tmp/_bazel_cy/7227b87b59980dee872adc59136222e4/external/rules_python/python/pip_install/extract_wheels/__main__.py", line 5, in <module>
    main()
  File "/private/var/tmp/_bazel_cy/7227b87b59980dee872adc59136222e4/external/rules_python/python/pip_install/extract_wheels/__init__.py", line 87, in main
    subprocess.run(pip_args, check=True)
  File "/Users/cy/opt/anaconda3/envs/python310/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/Users/cy/PycharmProjects/chatGLM/tensorflow/monolith/my-test-env/bin/python3', '-m', 'pip', 'wheel', '-r', '/Users/cy/PycharmProjects/chatGLM/tensorflow/monolith/third_party/pip_deps/requirements.txt']' returned non-zero exit status 1.
)
INFO: Elapsed time: 158.542s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)
FAILED: Build did NOT complete successfully (0 packages loaded)

Demo 1 GRPC related error

Hi team,

I was able to build successfully with the command bazel run //markdown/demo:demo_local_runner -- --training_type=batch, but I ran into a gRPC-related error like the following. Have you seen this before? Any idea how to solve it?

INFO:tensorflow:loss = 1.1790854, step = 1952
I1108 21:10:20.894386 140460748154688 basic_session_run_hooks.py:262] loss = 1.1790854, step = 1952
INFO:tensorflow:loss = 1.2298307, step = 2152 (18.186 sec)
I1108 21:10:39.080899 140460748154688 basic_session_run_hooks.py:260] loss = 1.2298307, step = 2152 (18.186 sec)
I1108 21:10:47.662103 140675923150656 cpu_training.py:374] MetricsHeartBeat thread stopped
I1108 21:10:47.664155 140675923150656 cpu_training.py:1712] Try to shutdown ps 0
I1108 21:10:47.677361 140269666805568 cpu_training.py:1776] Ps 0 shutdown successfully!
I1108 21:10:47.677551 140675923150656 cpu_training.py:1718] Shutdown ps 0 successfully!
I1108 21:10:47.677928 140675923150656 cpu_training.py:1712] Try to shutdown ps 1
I1108 21:10:47.678347 140269666805568 cpu_training.py:2158] Finished ps 0.
I1108 21:10:47.678776 140269666805568 runner_utils.py:396] exit monolith_discovery!
I1108 21:10:47.684976 140603018831680 cpu_training.py:1776] Ps 1 shutdown successfully!
I1108 21:10:47.685158 140675923150656 cpu_training.py:1718] Shutdown ps 1 successfully!
I1108 21:10:47.685652 140603018831680 cpu_training.py:2158] Finished ps 1.
I1108 21:10:47.686046 140603018831680 runner_utils.py:396] exit monolith_discovery!
I1108 21:10:47.693424 140675923150656 cpu_training.py:2155] Worker End 1699477847.693356, Cost: 30.059291124343872(s)
I1108 21:10:47.693858 140675923150656 cpu_training.py:2158] Finished worker 0.
I1108 21:10:47.694137 140675923150656 runner_utils.py:396] exit monolith_discovery!
2023-11-08 21:10:48.412364: W external/org_tensorflow/tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:514] RecvTensor cancelled for 128048405063079430
2023-11-08 21:10:48.412458: W external/org_tensorflow/tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:514] RecvTensor cancelled for 128048405063079430
2023-11-08 21:10:48.412479: W external/org_tensorflow/tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:514] RecvTensor cancelled for 128048405063079430
2023-11-08 21:10:48.412496: W external/org_tensorflow/tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:514] RecvTensor cancelled for 128048405063079430
2023-11-08 21:10:48.412575: I external/org_tensorflow/tensorflow/core/distributed_runtime/worker.cc:207] Cancellation requested for RunGraph.
2023-11-08 21:10:48.412993: W external/org_tensorflow/tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:514] RecvTensor cancelled for 128048405063079430
INFO:tensorflow:An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: From /job:ps/replica:0/task:1:
Socket closed
Additional GRPC error information from remote target /job:ps/replica:0/task:1:
:{"created":"@1699477848.412335053","description":"Error received from peer ipv4:10.128.0.74:34391","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}
I1108 21:10:48.415307 140460748154688 monitored_session.py:1285] An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: From /job:ps/replica:0/task:1:
Socket closed
Additional GRPC error information from remote target /job:ps/replica:0/task:1:
:{"created":"@1699477848.412335053","description":"Error received from peer ipv4:10.128.0.74:34391","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}
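
The JSON payload after "Additional GRPC error information" can be parsed to see which peer and status code are involved. In the standard gRPC status codes, 14 is UNAVAILABLE. The peer here, 10.128.0.74:34391, is the address of ps task 1 in the cluster spec under "Related specs", and the log shows "Ps 1 shutdown successfully!" about a second before the error, so the message may simply reflect the parameter server closing at the end of training; that reading is an assumption. The helper below is a hypothetical sketch.

```python
import json

# Subset of the standard gRPC status codes.
GRPC_STATUS_NAMES = {
    0: "OK",
    1: "CANCELLED",
    4: "DEADLINE_EXCEEDED",
    14: "UNAVAILABLE",
}


def summarize_grpc_error(raw):
    """Parse the JSON payload that follows 'Additional GRPC error information'."""
    info = json.loads(raw.lstrip(":"))
    status = info["grpc_status"]
    return {
        "status": status,
        "status_name": GRPC_STATUS_NAMES.get(status, "NOT_IN_THIS_SKETCH"),
        "message": info["grpc_message"],
        # The description ends with the peer address, e.g. "... peer ipv4:host:port".
        "peer": info["description"].rsplit("peer ", 1)[-1],
    }


# Payload copied from the log above.
raw = (':{"created":"@1699477848.412335053",'
       '"description":"Error received from peer ipv4:10.128.0.74:34391",'
       '"file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc",'
       '"file_line":1056,"grpc_message":"Socket closed","grpc_status":14}')
summary = summarize_grpc_error(raw)
print(summary)
```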

Related specs:

share_cluster_devices_in_session: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 200, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({'chief': ['10.128.0.74:33213'], 'ps': ['10.128.0.74:33337', '10.128.0.74:34391'], 'worker': ['10.128.0.74:57669']}), '_task_type': 'worker', '_task_id': 0, '_evaluation_master': '', '_master': 'grpc://10.128.0.74:57669', '_num_ps_replicas': 2, '_num_worker_replicas': 2, '_global_id_in_cluster': 1, '_is_chief': False}
I1108 21:10:19.861951 140460748154688 estimator.py:191] Using config: {'_model_dir': '/tmp/movie_lens_tutorial', '_tf_random_seed': None, '_save_summary_steps': 200, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': device_filters: "/job:ps"
device_filters: "/job:chief"
device_filters: "/job:worker/task:0"
gpu_options {
  allow_growth: true
}
allow_soft_placement: true
graph_options {
  rewrite_options {
    disable_meta_optimizer: true
  }
}
operation_timeout_in_ms: -1
cluster_def {
  job {
    name: "chief"
    tasks {
      key: 0
      value: "10.128.0.74:33213"
    }
  }
  job {
    name: "ps"
    tasks {
      key: 0
      value: "10.128.0.74:33337"
    }
    tasks {
      key: 1
      value: "10.128.0.74:34391"
    }
  }
  job {
    name: "worker"
    tasks {
      key: 0
      value: "10.128.0.74:57669"
    }
  }
}

Demo is currently broken

I ran bazel run //markdown/demo:demo_local_runner -- --training_type=batch, and the output suggests that some code is missing: AttributeError: module 'monolith.native_training.env_utils' has no attribute 'generate_psm_from_uuid'

Traceback (most recent call last):
  File "/home/green/.cache/bazel/_bazel_green/bf4782e691ac8318220629c47f43c1eb/execroot/__main__/bazel-out/k8-opt/bin/markdown/demo/demo_local_runner.runfiles/__main__/markdown/demo/demo_model.py", line 128, in <module>
    app.run(main)
  File "/home/green/.cache/bazel/_bazel_green/bf4782e691ac8318220629c47f43c1eb/execroot/__main__/bazel-out/k8-opt/bin/markdown/demo/demo_local_runner.runfiles/absl_py/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/green/.cache/bazel/_bazel_green/bf4782e691ac8318220629c47f43c1eb/execroot/__main__/bazel-out/k8-opt/bin/markdown/demo/demo_local_runner.runfiles/absl_py/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/home/green/.cache/bazel/_bazel_green/bf4782e691ac8318220629c47f43c1eb/execroot/__main__/bazel-out/k8-opt/bin/markdown/demo/demo_local_runner.runfiles/__main__/markdown/demo/demo_model.py", line 124, in main
    estimator.train(max_steps=1000000)
  File "/home/green/.cache/bazel/_bazel_green/bf4782e691ac8318220629c47f43c1eb/execroot/__main__/bazel-out/k8-opt/bin/markdown/demo/demo_local_runner.runfiles/__main__/monolith/native_training/estimator.py", line 417, in train
    with monolith_discovery(self._runner_conf) as discovery:
  File "/usr/local/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/green/.cache/bazel/_bazel_green/bf4782e691ac8318220629c47f43c1eb/execroot/__main__/bazel-out/k8-opt/bin/markdown/demo/demo_local_runner.runfiles/__main__/monolith/native_training/runner_utils.py", line 371, in monolith_discovery
    raise e
  File "/home/green/.cache/bazel/_bazel_green/bf4782e691ac8318220629c47f43c1eb/execroot/__main__/bazel-out/k8-opt/bin/markdown/demo/demo_local_runner.runfiles/__main__/monolith/native_training/runner_utils.py", line 365, in monolith_discovery
    psm = env_utils.generate_psm_from_uuid(runner_conf.uuid)
AttributeError: module 'monolith.native_training.env_utils' has no attribute 'generate_psm_from_uuid'
