mlcommons / cm4mlops

A collection of reusable and cross-platform automation recipes (CM scripts) with a human-friendly interface and minimal dependencies to make it easier to build, run, benchmark and optimize AI, ML and other applications and systems across diverse and continuously changing models, data sets, software and hardware (cloud/edge)

Home Page: https://access.cknowledge.org

License: Apache License 2.0

Python 72.30% Batchfile 2.11% Shell 9.24% C++ 8.16% C 3.32% Dockerfile 4.52% Cuda 0.14% Java 0.06% HCL 0.15%

ai-systems artificial-intelligence automation cross-platform devops machine-learning mlops mlperf reusable workflow

cm4mlops's Issues

Help students run MLPerf inference at the Student Cluster Competition'24

We were asked to help students run the MLPerf inference benchmark at the Student Cluster Competition'24 and to automate their submission and grading via the MLCommons CM automation framework.

The current plan is to use the MLPerf inference Stable Diffusion benchmark with Stability AI's Stable Diffusion XL model (2.6 billion parameters) and the COCO data set. This popular model generates images from a text-based prompt.

We must check the following:

Improving CM script automation and CM scripts

Aggregating tasks to improve CM script automation and CM scripts based on user feedback (this ticket is being gradually updated and tasks resolved based on our bandwidth and engineering resources):

Automation

Documentation

Check individual documentation and input description for all main CM scripts
Add individual tests for all main CM scripts

Tutorials

We need to explain how to extend CM scripts, add new ones, unify inputs, etc.

Prepare tutorial about CM basics and CM scripts
Prepare tutorial about CM automation for basic inference
Prepare tutorial about CM automation for MLPerf loadgen
Prepare tutorial about CM automation for MLPerf inference
Prepare tutorial about CM automation for ABTF
Prepare tutorial about CM automation for SCC'24

Longer term

Use logging instead of print
Refactor and simplify CM script automation: it was heavily prototyped and is stable, but the implementation can be considerably improved and simplified
Refactor docker container generation and execution: the prototype implementation is very complex and can be dramatically simplified
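
As an illustration of the "use logging instead of print" task, here is a minimal sketch built on Python's standard `logging` module. The logger name `cm-script-sketch`, the helper `get_logger` and its `silent` parameter are hypothetical names chosen for this example; they are not part of the existing CM code base.

```python
import logging


def get_logger(name="cm-script-sketch", silent=False):
    """Return a configured logger (hypothetical helper, not a CM API).

    When silent is True, only warnings and errors are emitted, so
    informational messages that used to be print() calls disappear.
    """
    logger = logging.getLogger(name)
    if not logger.handlers:
        # Attach a handler only once, even if the helper is called repeatedly.
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(logging.WARNING if silent else logging.INFO)
    return logger


logger = get_logger()
# Before: print("Detected OS: " + os_name)
logger.info("Detected OS: %s", "ubuntu")
logger.warning("Dependency not found in cache, it will be installed")
```

Routing messages through a logger lets verbosity be controlled in one place instead of editing individual print statements.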

could not identify license file for third_party/opentelemetry-cpp/tools/vcpkg/ports/hungarian

running bdist_wheel
Traceback (most recent call last):
 File "/home/ebay/CM/repos/local/cache/2bfde55101034352/pytorch/setup.py", line 1288, in <module>
   main()
 File "/home/ebay/CM/repos/local/cache/2bfde55101034352/pytorch/setup.py", line 1239, in main
   setup(
 File "/home/ebay/projects/devops/mlperftest/cm/lib/python3.10/site-packages/setuptools/__init__.py", line 153, in setup
   return distutils.core.setup(**attrs)
 File "/usr/lib/python3.10/distutils/core.py", line 148, in setup
   dist.run_commands()
 File "/usr/lib/python3.10/distutils/dist.py", line 966, in run_commands
   self.run_command(cmd)
 File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
   cmd_obj.run()
 File "/home/ebay/CM/repos/local/cache/2bfde55101034352/pytorch/setup.py", line 744, in run
   with concat_license_files(include_files=True):
 File "/home/ebay/CM/repos/local/cache/2bfde55101034352/pytorch/setup.py", line 722, in __enter__
   create_bundled(os.path.relpath(third_party_path), f1,
 File "/home/ebay/CM/repos/local/cache/2bfde55101034352/pytorch/third_party/build_bundled.py", line 42, in create_bundled
   collected = collect_license(d)
 File "/home/ebay/CM/repos/local/cache/2bfde55101034352/pytorch/third_party/build_bundled.py", line 20, in collect_license
   raise ValueError('could not identify license file '
ValueError: could not identify license file for third_party/opentelemetry-cpp/tools/vcpkg/ports/hungarian

CM error: Portable CM script failed (name = install-pytorch-from-src, return code = 256)

cm-run-script-input.json
cm-run-script-info.json

Create dummy cache placeholder that can be updated (to deal with large datasets)

Update all tests in GitHub workflows for mlcommons@cm4mlops

Add a check for min CM version when loading scripts

We may rely on some extra CM functionality in scripts, so we need to check the minimum CM version when loading a script and suggest upgrading CM if it is too old.
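
One possible shape for such a check is sketched below. It assumes a script could declare its requirement as a plain dotted version string; the function names are hypothetical, and the returned dictionary with a `return` code and an `error` message is only meant to resemble the style of result CM functions produce.

```python
def parse_version(text):
    """Turn '2.3.1' into (2, 3, 1); trailing non-numeric suffixes are ignored."""
    parts = []
    for chunk in str(text).strip().split("."):
        digits = ""
        for ch in chunk:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def is_older(installed, required):
    """True if the installed version is strictly older than the required one."""
    a, b = parse_version(installed), parse_version(required)
    width = max(len(a), len(b))
    # Pad with zeros so that '2.3' and '2.3.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


def check_min_version(installed, required):
    """Hypothetical check to run while loading a script."""
    if not required:
        return {"return": 0}
    if is_older(installed, required):
        return {
            "return": 1,
            "error": f"this script requires CM >= {required} "
                     f"but {installed} is installed; please upgrade CM",
        }
    return {"return": 0}


print(check_min_version("2.2.9", "2.3"))
```

The sketch uses its own small parser to avoid extra dependencies; a real implementation might prefer an established version-parsing library.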

Add "install-docker" automation recipe to install it via SUDO on Ubuntu, Debian and Red Hat

The procedure to install the latest docker (CPU & CUDA) is somewhat tricky and differs across operating systems. I suggest creating a script that aggregates all this knowledge in one place. It will simplify using CM workflows with MLPerf and other benchmarks via Docker.

Adding profiling and performance analysis during benchmarking

We need to continue improving universal benchmarking and optimization capabilities in CM for different OS and hardware targets:

For compiled code (C/C++ ...), we need to improve the following CM scripts:
- CM scripts:
- TBD:
  - support for gprof/oprofile/hardware counters
  - universal support to pin threads (numactl)
  - expose internal profiling info from ML frameworks and run-times if/when available (onnx, TFLite ...)
- Sample apps:

For Python:
- Create a CM script with a Python package to collect various profiling info (memory utilization, etc.), particularly to analyze ML/AI models (requested by ABTF).
- Collect function-level profiling.

Add support for universal performance analysis to CM experiment:
- Aggregate profiling from multiple runs and perform statistical analysis (variation, min/max, phases, etc.)
- Visualize experiments
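
To make the Python part of this plan more concrete, the sketch below measures wall-clock time and peak Python memory over several runs and aggregates them into min/max/mean/standard deviation. It uses only the standard library (`time`, `tracemalloc`, `statistics`); the function names and the toy `workload` are hypothetical and do not correspond to an existing CM script.

```python
import statistics
import time
import tracemalloc


def profile_once(func, *args, **kwargs):
    """Run func once and record elapsed seconds and peak traced memory.

    Note: tracemalloc adds overhead, so the timings are inflated compared
    to an uninstrumented run.
    """
    tracemalloc.start()
    start = time.perf_counter()
    func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"seconds": elapsed, "peak_bytes": peak}


def summarize(values):
    """Basic statistics over one metric collected from multiple runs."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values) if len(values) > 1 else 0.0
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "stdev": stdev,
        # Coefficient of variation: spread relative to the mean.
        "variation": stdev / mean if mean else 0.0,
    }


def profile_runs(func, repetitions=5):
    """Repeat the measurement and aggregate each metric separately."""
    runs = [profile_once(func) for _ in range(repetitions)]
    return {key: summarize([run[key] for run in runs])
            for key in ("seconds", "peak_bytes")}


def workload():
    # Toy stand-in for a model inference call.
    return sum(i * i for i in range(100_000))


report = profile_runs(workload, repetitions=3)
for metric, stats in report.items():
    print(metric, stats)
```

Function-level profiling could be added in the same spirit with the standard `cProfile` module, and the aggregated dictionary is a natural input for later visualization.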

Add --silent / -s mode to suppress all extra CM info

Improving universal build and run scripts to support cross-platform compilation

We should extend our universal compile and benchmark scripts to support cross-platform compilation and execution (requested by several MLCommons workgroups):

As a test example, we can compile and run a program for Android or some SSH-based platform.

We should add a get-target-device CM script to define the target platform capabilities (env, config and tools), similar to how I did it in the original CK framework, and add it to the build/run CM script.

Check/provide flag to skip sudo / system installations

We had feedback from some MLPerf users that they do not have sudo access, while most of their system dependencies are already installed. In such cases, we should provide a flag and an env var to skip all sudo/system installations.
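
A minimal sketch of how such a switch could be evaluated is shown below. The input key `skip_system_deps`, the environment variable `CM_SKIP_SYS_DEPS` and both function names are hypothetical placeholders for this example, not existing CM options; the `apt-get` command only illustrates the kind of system installation that would be skipped.

```python
import os

TRUE_VALUES = {"1", "yes", "true", "on"}


def should_skip_system_deps(script_input, env=None):
    """Return True if either the flag or the env var asks to skip sudo steps."""
    env = os.environ if env is None else env
    flag = script_input.get("skip_system_deps", False)
    if isinstance(flag, str):
        flag = flag.strip().lower() in TRUE_VALUES
    env_flag = str(env.get("CM_SKIP_SYS_DEPS", "")).strip().lower() in TRUE_VALUES
    return bool(flag) or env_flag


def plan_install_command(packages, script_input, env=None):
    """Build the system install command, or None when it should be skipped."""
    if should_skip_system_deps(script_input, env):
        return None
    return ["sudo", "apt-get", "install", "-y", *packages]


# The flag is set, so no sudo command is planned.
print(plan_install_command(["libssl-dev"], {"skip_system_deps": True}, env={}))
# Nothing is set, so the usual command is planned.
print(plan_install_command(["libssl-dev"], {}, env={}))
```

Checking both sources in one helper keeps the behavior consistent whether the user passes a command-line flag or exports a variable in a container or CI environment.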

Add CM support for private configuration for rclone

We had to remove configs with private keys from the CM get-rclone script.

We need to add proper support to handle such private keys. We can use the new CM cfg automation or create a "get-rclone-config" script.

TBD: should brainstorm more.

Add git history to this repo from mlcommons@ck

Hi @arjunsuresh. Let's sync on how to do that. Thanks!
