mlcommons / cm4mlops

A collection of reusable and cross-platform automation recipes (CM scripts) with a human-friendly interface and minimal dependencies to make it easier to build, run, benchmark and optimize AI, ML and other applications and systems across diverse and continuously changing models, data sets, software and hardware (cloud/edge)

Home Page: https://access.cknowledge.org

License: Apache License 2.0

Languages: Python 72.30%, Shell 9.24%, C++ 8.16%, Dockerfile 4.52%, C 3.32%, Batchfile 2.11%, HCL 0.15%, Cuda 0.14%, Java 0.06%

Topics: ai-systems, artificial-intelligence, automation, cross-platform, devops, machine-learning, mlops, mlperf, reusable, workflow

cm4mlops's People

Contributors

ailurus1, alered01, anandhu-eng, arjunsuresh, ctuning-admin, davegreasley, dsavenko, ens-lg4, gfursin, hanwenzhu, himanshu-dutta, interestinglsy, jdesfossez, makaveli10, maximallnyi, morphine00, nacc, nathanw-mlc, nijoj, psyhtest, raduetsya, sennikovandrey, slahiruk, xintin

cm4mlops's Issues

Help students run MLPerf inference at the Student Cluster Competition'24

We were asked to help students run the MLPerf inference benchmark at the Student Cluster Competition'24 and to automate their submission and grading via the MLCommons CM automation framework.

The current plan is to use the MLPerf inference Stable Diffusion benchmark with Stability AI’s Stable Diffusion XL model (2.6 billion parameters) and the COCO data set. This popular model is used to create compelling images from a text-based prompt.

We must check the following:

  • Check current CM workflows to run reference MLPerf SD benchmark
  • Check CM workflows to run optimized MLPerf SD benchmark v4.0
    • Intel
    • Nvidia
  • Check if support for AMD GPUs can be provided
  • Check how to support multi-node inference
  • Prepare tutorial about MLPerf, loadgen, this benchmark and CM
  • Check MLCommons Croissant format for the dataset?
  • Automate submission and grading
    • Need to agree how to report accuracy
    • We may train a smaller model to analyze produced images
    • Create live scoreboard (W&B?)

Improving CM script automation and CM scripts

This ticket aggregates tasks to improve CM script automation and CM scripts based on user feedback. It is updated gradually, and tasks are resolved as our bandwidth and engineering resources allow:

Automation

  • Add --silent / -s mode to avoid printing CM workflow execution
  • Add cm_min_version check for automation and scripts
  • Check current dumping of all versions of all dependencies (commit)
  • Check readme generation with all deps (commit)
    • Add --version when available (commit)
  • Prepare tmp_run_final_script.sh and stop, so that the user can run it manually without CM if needed
  • Generate sample Docker container during --repro
  • Add cfg to load default env, state and keys for all CM scripts
    • cm set cfg default --key.cm-script.silent --key.cm-script.env.CM_SUDO="no" ...
    • Add silent mode as default
    • Support add/update cfg keys ...
  • Add cm status to show the version, the paths to current repositories and whether the version is up to date (useful for virtual environments)
  • Add to meta where to report errors when a script is not in the default repository (for example, for MLPerf inference, report to https://github.com/mlcommons/inference)
  • Add proper version detection for cuDNN
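The cm_min_version item above could be prototyped roughly as follows. This is a minimal sketch, not the CM implementation: the function names `parse_version` and `meets_min_version` are hypothetical, and it assumes dotted numeric versions such as 2.3.1, ignoring any non-numeric suffix.

```python
def parse_version(text):
    """Turn '2.3.1' into (2, 3, 1). Hypothetical helper, not part of CM.

    Anything after the leading digits of a component (e.g. 'rc1') is ignored.
    """
    parts = []
    for chunk in text.strip().split("."):
        digits = ""
        for ch in chunk:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def meets_min_version(current, minimum):
    """Return True if `current` is at least `minimum`."""
    cur, need = parse_version(current), parse_version(minimum)
    # Pad the shorter tuple with zeros so that '2.0' equals '2.0.0'.
    width = max(len(cur), len(need))
    cur += (0,) * (width - len(cur))
    need += (0,) * (width - len(need))
    return cur >= need


if __name__ == "__main__":
    # Example values only; a real check would read them from the tool and the script meta.
    print(meets_min_version("2.3.1", "2.3"))
```

Comparing integer tuples rather than strings avoids the usual pitfall where "2.10" sorts before "2.9".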

Documentation

  • Check individual documentation and input description for all main CM scripts
  • Add individual tests for all main CM scripts

Tutorials

We need to explain how to extend CM scripts, add new ones, unify inputs, etc.

  • Prepare tutorial about CM basics and CM scripts
  • Prepare tutorial about CM automation for basic inference
  • Prepare tutorial about CM automation for MLPerf loadgen
  • Prepare tutorial about CM automation for MLPerf inference
  • Prepare tutorial about CM automation for ABTF
  • Prepare tutorial about CM automation for SCC'24

Longer term

  • Use logging instead of print
  • Refactor and simplify CM script automation: it was heavily prototyped and is stable, but the implementation can be considerably improved and simplified
  • Refactor docker container generation and execution - the prototype implementation is very complex and can be dramatically simplified
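As an illustration of the "use logging instead of print" item, and of how it could combine with the proposed --silent mode, here is a minimal sketch built on Python's standard `logging` module. The function name `get_logger` and the logger name `cm-script-sketch` are assumptions made for this example, not existing CM code.

```python
import logging


def get_logger(silent=False, name="cm-script-sketch"):
    """Return a logger that hides informational output in silent mode.

    Hypothetical helper: `silent` stands in for a --silent / -s flag.
    """
    logger = logging.getLogger(name)
    # In silent mode only warnings and errors get through.
    logger.setLevel(logging.WARNING if silent else logging.INFO)
    if not logger.handlers:
        # Attach a handler once, so repeated calls do not duplicate output.
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = get_logger(silent=True)
    log.info("resolving dependencies")      # suppressed in silent mode
    log.warning("dependency version is old")  # still shown
```

With this approach, a silent mode becomes a change of log level rather than a set of conditionals around every print call.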

could not identify license file for third_party/opentelemetry-cpp/tools/vcpkg/ports/hungarian

running bdist_wheel
Traceback (most recent call last):
 File "/home/ebay/CM/repos/local/cache/2bfde55101034352/pytorch/setup.py", line 1288, in <module>
   main()
 File "/home/ebay/CM/repos/local/cache/2bfde55101034352/pytorch/setup.py", line 1239, in main
   setup(
 File "/home/ebay/projects/devops/mlperftest/cm/lib/python3.10/site-packages/setuptools/__init__.py", line 153, in setup
   return distutils.core.setup(**attrs)
 File "/usr/lib/python3.10/distutils/core.py", line 148, in setup
   dist.run_commands()
 File "/usr/lib/python3.10/distutils/dist.py", line 966, in run_commands
   self.run_command(cmd)
 File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
   cmd_obj.run()
 File "/home/ebay/CM/repos/local/cache/2bfde55101034352/pytorch/setup.py", line 744, in run
   with concat_license_files(include_files=True):
 File "/home/ebay/CM/repos/local/cache/2bfde55101034352/pytorch/setup.py", line 722, in __enter__
   create_bundled(os.path.relpath(third_party_path), f1,
 File "/home/ebay/CM/repos/local/cache/2bfde55101034352/pytorch/third_party/build_bundled.py", line 42, in create_bundled
   collected = collect_license(d)
 File "/home/ebay/CM/repos/local/cache/2bfde55101034352/pytorch/third_party/build_bundled.py", line 20, in collect_license
   raise ValueError('could not identify license file '
ValueError: could not identify license file for third_party/opentelemetry-cpp/tools/vcpkg/ports/hungarian

CM error: Portable CM script failed (name = install-pytorch-from-src, return code = 256)

cm-run-script-input.json
cm-run-script-info.json

Adding profiling and performance analysis during benchmarking

We need to continue improving universal benchmarking and optimization capabilities in CM for different OS and hardware targets:

  • For compiled code (C/C++ ...) we improve the following CM scripts

  • For Python:

    • Create a CM script with a Python package to collect various profiling info (memory utilization, etc.), particularly to analyze ML/AI models (requested by ABTF).
    • Collect function-level profiling
  • Add support for universal performance analysis to CM experiment:

    • Aggregate profiling from multiple runs and perform stat analysis (variation, min/max, phases, etc)
    • Visualize experiments
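The Python items above (collecting memory and timing information, then aggregating several runs) could be prototyped along these lines. This is a minimal sketch using only the standard library (`tracemalloc`, `time`, `statistics`); the function names `profile_once` and `aggregate` are hypothetical and do not come from CM.

```python
import statistics
import time
import tracemalloc


def profile_once(func, *args, **kwargs):
    """Run `func` once and record wall-clock time and peak Python memory."""
    tracemalloc.start()
    start = time.perf_counter()
    func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # returns (current, peak) in bytes
    tracemalloc.stop()
    return {"seconds": elapsed, "peak_bytes": peak}


def aggregate(samples):
    """Summarize a list of per-run dictionaries: min, max, mean and stdev per key."""
    summary = {}
    for key in samples[0].keys():
        values = [sample[key] for sample in samples]
        summary[key] = {
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            # The sample standard deviation needs at least two runs.
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
    return summary


if __name__ == "__main__":
    # Placeholder workload; a real run would call the model or benchmark here.
    runs = [profile_once(sorted, list(range(100000, 0, -1))) for _ in range(5)]
    for metric, stats in aggregate(runs).items():
        print(metric, stats)
```

Note that `tracemalloc` only sees allocations made through Python's allocator, so memory held by native libraries would need a different tool.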

Improving universal build and run scripts to support cross-platform compilation

We should extend our universal compile and benchmark scripts to support cross-platform compilation and execution (requested by several MLCommons workgroups).

As a test example, we can compile and run a program for Android or some SSH-based platform.

We should add a get-target-device CM script to define target platform capabilities (env, config and tools), similar to how I did it in the original CK framework, and add it to the build/run CM script.
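To make the idea of target platform capabilities more concrete, here is a minimal sketch of a record that such a script might produce. Everything in it is an assumption for illustration: the `TargetDevice` class, its fields and the `describe_host` helper are hypothetical and are not taken from CM or CK.

```python
import platform
import shutil
from dataclasses import dataclass, field


@dataclass
class TargetDevice:
    """Hypothetical description of a target platform (env, config and tools)."""
    name: str
    os_name: str
    arch: str
    remote: bool = False                       # True for Android or SSH-based targets
    env: dict = field(default_factory=dict)    # environment variables for build/run
    tools: dict = field(default_factory=dict)  # tool name -> path on the host


def describe_host(tool_names=("gcc", "clang", "adb", "ssh")):
    """Describe the local machine and record which of the given tools are on PATH."""
    found = {}
    for tool in tool_names:
        path = shutil.which(tool)
        if path:
            found[tool] = path
    return TargetDevice(
        name="host",
        os_name=platform.system(),
        arch=platform.machine(),
        tools=found,
    )


if __name__ == "__main__":
    print(describe_host())
    # A remote target would be filled in by hand or by probing the device.
    print(TargetDevice(name="example-phone", os_name="Android", arch="arm64", remote=True))
```

A build/run script could then branch on `remote` to decide whether to execute locally or to copy the binary to the device first.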

Check/provide flag to skip sudo / system installations

We had feedback from some MLPerf users that they do not have sudo access, while most of their system dependencies are already installed. In such cases, we should have a flag and an env var to skip all sudo/system installations.
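One possible shape for such a check is sketched below. It is only an assumption about how the flag and env var could interact: the CM_SUDO variable name is borrowed from the cfg example in the automation ticket above, while the `skip_flag` parameter and both function names are hypothetical.

```python
import os


def sudo_allowed(env=None, skip_flag=False):
    """Decide whether sudo/system installations may run.

    Assumption: CM_SUDO set to "no" (or a similar negative value) disables them,
    and a hypothetical command-line flag (`skip_flag`) has the same effect.
    """
    env = os.environ if env is None else env
    if skip_flag:
        return False
    value = str(env.get("CM_SUDO", "")).strip().lower()
    return value not in ("no", "off", "false", "0")


def plan_system_install(command, env=None, skip_flag=False):
    """Return the command prefixed with sudo, or None if the step should be skipped."""
    if not sudo_allowed(env, skip_flag):
        return None
    return ["sudo"] + list(command)


if __name__ == "__main__":
    # The package name is a placeholder; nothing is executed here.
    step = plan_system_install(["apt-get", "install", "-y", "example-package"],
                               env={"CM_SUDO": "no"})
    print("skipped" if step is None else " ".join(step))
```

Returning None instead of raising lets a workflow continue when the dependency is already present on the system.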
