zarr-developers / perfcapture

Capture the performance of a computer system whilst running a set of benchmark workloads.

License: MIT License
Need to standardise on words like "recipes", what's a single "run", etc. Maybe a run should be a single execution of `Workload.run()` against a specific `Dataset`.
In today's Benchmarking meeting, we all agreed that it'd be great to get a simple benchmarking solution implemented ASAP, so we can get on with the fun work of trying to speed things up! To start with, we just need something simple that'll allow us to run benchmark workloads on our machines, figure out how close the current code is to the theoretical IO performance of the hardware, and compare performance between git commits.
Here's an early attempt to specify a very simple framework for defining workloads, along with a quick example of what's required to implement a simple dataset and workload.

To implement a new workload, you'd write a class which inherits from `Workload` and override `init_dataset` and `run`. To implement a new dataset, you'd write a class which inherits from `Dataset` and override `prepare`.
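To make that concrete, here's a minimal sketch of what those subclasses might look like. The base-class shape follows the description above, but the concrete subclasses, their names, and all method signatures are hypothetical, not perfcapture's actual API:

```python
# Illustrative sketch only: base classes mimic the proposed API;
# the concrete subclasses and signatures are hypothetical.

class Dataset:
    """A dataset that workloads run against."""
    def prepare(self) -> None:
        """Create the dataset (override in subclasses)."""
        raise NotImplementedError


class Workload:
    """A benchmark workload."""
    def init_dataset(self) -> Dataset:
        raise NotImplementedError

    def run(self, dataset: Dataset) -> None:
        raise NotImplementedError


class RandomIntsDataset(Dataset):
    """Hypothetical dataset; in reality this might write a Zarr array to disk."""
    def prepare(self) -> None:
        self.data = list(range(1_000))


class SumAllValues(Workload):
    """Hypothetical workload: read the whole dataset and sum it."""
    def init_dataset(self) -> Dataset:
        return RandomIntsDataset()

    def run(self, dataset: RandomIntsDataset) -> None:
        self.total = sum(dataset.data)
```

The framework would then discover these subclasses, call `prepare` once per dataset, and time each `run`.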
Does this look OK? Is there anything missing?
The (very simple) framework would then take care of discovering & running the workloads, whilst also recording relevant metrics to disk as JSON, and printing a summary.
The framework isn't ready yet! But the idea is that the MVP will expose a very simple CLI that allows you to run all benchmarks, or a specific benchmark. It'll automatically record metadata about the system and the git commit. And it'll make it easy to compare performance between different git commits.
The idea is to make it super-easy for anyone to submit a new workload via a PR. And easy for us to share "human-readable" metrics ("My new PR speeds up Zarr-Python on workload x by 10x on my machine! Hurray!") and share machine-readable metrics (JSON).
I'd imagine moving this code to zarr-developers if folks agree that the approach is OK.
(See zarr-developers/zarr-benchmark#1 for a discussion of why I'm currently leaning towards the idea of implementing our own benchmarking tool. But feel free to make the case for using an existing tool! I'm just not sure that any existing tool would allow us to measure IO performance.)
Implement a `get_results() -> pd.DataFrame` method where the counter name is the column, and the run_id is the row. Remove `__str__` from each `PerfCounter`. `run_workloads()` should return a dict where the values are these results DataFrames.
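A minimal sketch of the proposed `get_results()` shape, assuming counters are accumulated per run (the `PerfCounterManager` class and `record` method are invented for illustration; only `get_results` and the row/column layout come from the proposal above):

```python
import pandas as pd


class PerfCounterManager:
    """Hypothetical accumulator for per-run counter values."""

    def __init__(self) -> None:
        # {run_id: {counter_name: value}}
        self._records: dict[str, dict[str, float]] = {}

    def record(self, run_id: str, counter_name: str, value: float) -> None:
        self._records.setdefault(run_id, {})[counter_name] = value

    def get_results(self) -> pd.DataFrame:
        # Counter names become columns; run_id becomes the row index.
        df = pd.DataFrame.from_dict(self._records, orient="index")
        df.index.name = "run_id"
        return df


counters = PerfCounterManager()
counters.record("run_0", "runtime_secs", 1.2)
counters.record("run_0", "bytes_read", 4096)
counters.record("run_1", "runtime_secs", 1.1)
df = counters.get_results()
```

Runs that didn't record a given counter would simply get `NaN` in that column.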
I'm not entirely happy with the name `perfcapture`. `perfcapture` felt right when I was thinking of capturing a timeseries of performance metrics for each workload (e.g. sampling every 100 ms). But now that we're just capturing total metrics at the end of each workload run, `perfcapture` doesn't feel quite right. It's not a terrible name, but it doesn't quite sit right with me. Something more like `iobench` feels better, except there are already multiple projects called `iobench`!
`pip install` installs the CLI, so it's available from all paths.
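The standard way to get that behaviour in Python packaging is a console-script entry point, something like the following in `pyproject.toml` (the exact script name and module path here are assumptions):

```toml
# Illustrative; the actual entry-point module path is an assumption.
[project.scripts]
perfcapture = "perfcapture.cli:main"
```

After installation, the `perfcapture` command is placed on the user's `PATH` regardless of the working directory.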
In today's meeting for "Zarr Performance & Benchmarking (Europe-friendly time)", @joshmoore described the Zarr Implementations project: The Zarr Implementations project collects "data in zarr / n5 format written by different implementations" and tests for compatibility. Zarr Implementations is related to - although distinct from - benchmarking. Specifically: it might be nice to benchmark Zarr Implementations.
With `perfcapture`'s current API, it should be possible to call Zarr Implementations from `perfcapture.Workload.run` (probably using one `perfcapture.Workload` class per Zarr implementation).

But it might also be nice to harmonize the API, such that both `perfcapture` and Zarr Implementations use the same YAML structures to define the workloads.
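Purely as a strawman, a shared YAML structure might look something like this (every field name below is invented; no such schema exists yet in either project):

```yaml
# Hypothetical shared workload definition - invented for illustration.
workloads:
  - name: read_all_chunks
    dataset:
      format: zarr
      chunk_shape: [1000, 1000]
    implementations:
      - zarr-python
      - tensorstore
```

Each project would then interpret the same definition for its own purpose: compatibility testing in Zarr Implementations, timing in `perfcapture`.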
Related:
xref: #15
The plan is to implement a benchmarking tool which automatically runs a suite of "Zarr workloads" across a range of compute platforms, storage media, chunk sizes, and Zarr implementations.
What would we like to measure for each workload?
Existing benchmarking tools only measure the runtime of each workload. That doesn't feel sufficient for Zarr because one of our main questions during benchmarking is whether the Zarr implementation is able to saturate the IO subsystem, and how much CPU and RAM is required to saturate the IO.
I'd propose that it'd be great to measure these parameters each time each workload is run:
(Each run would also capture a bunch of metadata about the environment such as the compute environment, storage media, chunk sizes, Zarr implementation name and version, etc.)
I had previously gotten over-excited and started thinking about capturing a full "trace" during the execution of each workload, e.g. capturing a timeseries of the IO utilization every 100 milliseconds. This might be useful, but it makes the benchmarking code rather more complex, and maybe doesn't tell us much more than the "totals per workload" tell us. And some benchmark workloads might run for less than 100 ms. And psutil's documentation states that some of its counters aren't reliable when polled more frequently than 10 times a second.
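To make the "totals per workload" idea concrete, here's a minimal sketch using psutil. The choice of counters is an assumption on my part, and real code would also need to handle platforms where `psutil.disk_io_counters()` returns `None`:

```python
import time

import psutil


def run_with_counters(workload_fn) -> dict[str, float]:
    """Capture simple totals for one workload run: wall-clock runtime,
    process CPU time, and system-wide disk IO deltas."""
    proc = psutil.Process()
    cpu_before = proc.cpu_times()
    io_before = psutil.disk_io_counters()  # may be None on some platforms
    t0 = time.perf_counter()

    workload_fn()

    runtime = time.perf_counter() - t0
    cpu_after = proc.cpu_times()
    io_after = psutil.disk_io_counters()
    return {
        "runtime_secs": runtime,
        "cpu_user_secs": cpu_after.user - cpu_before.user,
        "cpu_system_secs": cpu_after.system - cpu_before.system,
        "disk_read_bytes": io_after.read_bytes - io_before.read_bytes,
        "disk_write_bytes": io_after.write_bytes - io_before.write_bytes,
    }
```

Because these are before/after deltas rather than polled samples, the 10-Hz polling caveat in psutil's documentation doesn't apply, and sub-100 ms workloads are still measured correctly (modulo counter granularity).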
What do you folks think? Do we need to record a full "trace" during each workload? Or is it sufficient to just capture totals per workload? Are there any changes you'd make to the list of parameters I proposed above?
xref: #6 (comment)