googlecloudplatform / ml-testing-accelerators

Testing framework for Deep Learning models (TensorFlow and PyTorch) on Google Cloud hardware accelerators (TPU and GPU).

License: Apache License 2.0
Different tabs have different-sized boxes representing pass/fail status, since the plot height/width scales with the number of tests and the number of days of data. Try to find some way to scale the plots so that box sizes stay consistent across tabs.
I think we should continue to show any currently out-of-bounds metrics first on the page.
Aside from that, some kind of sorting would be nice for all the in-bounds metrics.
One idea is to sort by the stddev of each metric over the last N runs; this would give us a sense of which metrics are becoming increasingly unstable.
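A minimal sketch of that sorting idea, assuming each metric's recent history is available as a list of floats; the names `metric_history` and `N_RUNS` are illustrative, not part of the dashboard code:

```python
import statistics

N_RUNS = 20  # assumed window size; tune to taste


def sort_by_recent_stddev(metric_history):
    """Order in-bounds metrics so the noisiest ones come first.

    metric_history: dict mapping metric name -> list of recent values,
    ordered oldest to newest.
    """
    def recent_stddev(values):
        window = values[-N_RUNS:]
        # A single data point has no spread; treat it as perfectly stable.
        return statistics.pstdev(window) if len(window) > 1 else 0.0

    return sorted(metric_history.items(),
                  key=lambda kv: recent_stddev(kv[1]),
                  reverse=True)


# Example:
# sort_by_recent_stddev({"accuracy": [0.91, 0.92, 0.90],
#                        "loss": [1.2, 0.4, 2.0]})
# -> loss (high spread) sorts ahead of accuracy (low spread)
```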
FR: The current dashboard seems to display every test stored in the metrics table. We should have the option to toggle hiding inactive tests, or hide them by default and give an option to reveal them (this was a pretty nice feature of our legacy dashboard).
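One possible shape for that toggle, assuming each test row carries a timestamp of its most recent run; the `last_run` field and the 30-day cutoff are assumptions for illustration:

```python
import datetime

INACTIVE_AFTER_DAYS = 30  # assumed cutoff for "inactive"


def visible_tests(tests, show_inactive=False):
    """Filter out tests with no recent runs unless the user opts in.

    tests: iterable of dicts, each with a 'last_run' datetime field.
    """
    if show_inactive:
        return list(tests)
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=INACTIVE_AFTER_DAYS)
    return [t for t in tests if t["last_run"] >= cutoff]
```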
As more and more tests are added, we'll be scrolling a lot when looking at the dashboard, and the top of the screen looks like: https://screenshot.googleplex.com/TvKeW5ivhuh
It would be cool if we could:
Right now, we're having to track the email alerts and cut b/'s separately. While email alerts are nice, we could also file Buganizer bugs in parallel to track non-trivial failures, and noise would be reduced if trivial failures auto-closed on the next successful run.
It would be great if we could tag failures with causes such as infrastructure flakiness, model regressions, or erroneous test configs. It may also be useful to link to bugs or GitHub issues. This would be useful both for record-keeping (e.g. tracking flakiness over time) and for communicating with teammates when an error has been triaged.
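A rough sketch of what such a triage annotation could look like if stored alongside each run; all field and type names here are hypothetical, not an existing schema:

```python
import dataclasses
import enum


class FailureCause(enum.Enum):
    INFRA_FLAKE = "infrastructure flakiness"
    MODEL_REGRESSION = "model regression"
    BAD_TEST_CONFIG = "erroneous test config"


@dataclasses.dataclass
class TriageNote:
    """Annotation attached to a failed run for record-keeping and hand-off."""
    test_name: str
    run_id: str
    cause: FailureCause
    bug_url: str = ""   # Buganizer or GitHub issue link, if any
    notes: str = ""     # free-form comment from whoever triaged it
```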
The current command in the dashboard to download logs requires copy-pasting into a terminal with gcloud installed.
We should instead be able to hit the entries.list HTTP endpoint: https://cloud.google.com/logging/docs/reference/v2/rest/v2/entries/list
This should give the same result as gcloud logging read: https://cloud.google.com/logging/docs/reference/tools/gcloud-logging#listing-logs
With that, we should be able to download the logs directly from the dashboard UI.
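For reference, a sketch of the equivalent call from Python using the Cloud Logging client library (which wraps the same entries.list endpoint). The project name, pod name, and filter string below are placeholders; the real dashboard would build its own filter for the specific run:

```python
from google.cloud import logging as cloud_logging


def download_logs(project, pod_name):
    """Fetch log entries for one test run, roughly equivalent to
    `gcloud logging read` with the same filter."""
    client = cloud_logging.Client(project=project)
    # Example filter only; adjust to however the runs are actually labeled.
    log_filter = (f'resource.type="k8s_container" AND '
                  f'resource.labels.pod_name="{pod_name}"')
    return [entry.payload for entry in
            client.list_entries(filter_=log_filter,
                                order_by=cloud_logging.ASCENDING)]
```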
Right now you have to click the "x" button at the top right of the modal to close it. I'd like to be able to click anywhere outside the modal to close it.
Small comment: it seems like the test names are not searchable on the dashboard UI. Search would be useful, especially once we have many tests per page.
Idea: one-shot runs are usually looked at and analyzed manually, so automated restarts may not be required in that case.
I'd like to:
When I click a cell and then close the popup, colors get stuck like this: https://screenshot.googleplex.com/KgG3vdkxuYz
If I click on an empty pixel, the colors come back to normal.
Example: https://xl-ml-test.appspot.com/metrics?test_name=pt-nightly-resnet50-func-v3-8
The newest entry is on the left and the oldest is on the right, which feels counterintuitive to me. I feel it would be better if time ran from left to right (oldest on the left, newest on the right). WDYT?
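The fix would presumably just be to sort the points by timestamp ascending before plotting; a tiny sketch, assuming each point is a (timestamp, value) pair:

```python
def chronological(points):
    """Order (timestamp, value) pairs oldest-to-newest so the plot
    reads left to right in time."""
    return sorted(points, key=lambda p: p[0])
```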
Right now, I'm having to first go to the dashboard URL and then click on the pytorch-nightly tab. If I could pin that tab's direct URL to my browser, that would be awesomesauce.
@thisisalbertliang developed a UNet3D PyTorch/XLA codebase that can run on both TPU and GPU.
To enhance our TPU test suite, we hope to add a convergence test for this UNet3D model on TPU v3-8.
This is an uninteresting failure mode, and these failures are mostly noise.
Happened twice this week. Example.
Instead of counting stable instances in the IG and sleeping, I think it would be better to use gcloud compute instance-groups managed wait-until-stable $IG --zone $ZONE. WDYT?
Occasionally we might want to check hyperparameters when comparing between stacks. This would require the model itself to dump its parameters (e.g. classifier_trainer in TF), but it would be great for interpretability.
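A sketch of the kind of dump that would make this comparison possible, assuming the model exposes its hyperparameters as a flat dict; the helper name and output path are illustrative:

```python
import json


def dump_hparams(hparams, output_path="hparams.json"):
    """Write the run's hyperparameters to a JSON file so that two stacks
    (e.g. TF vs. PyTorch/XLA) can be diffed field-by-field later."""
    with open(output_path, "w") as f:
        json.dump(hparams, f, indent=2, sort_keys=True)


# Example:
# dump_hparams({"learning_rate": 0.1, "batch_size": 1024, "optimizer": "lars"})
```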
Failed runs just say "failure" at the moment. If they showed a one-line message from the traceback, that could be useful.
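A sketch of extracting that one-liner, assuming the run's log text is available as a single string (the function name and truncation length are assumptions):

```python
def summarize_failure(log_text, max_len=120):
    """Return the last non-empty log line (typically the final line of a
    Python traceback, e.g. 'RuntimeError: ...') truncated for display."""
    lines = [l.strip() for l in log_text.splitlines() if l.strip()]
    if not lines:
        return "failure (no logs captured)"
    last = lines[-1]
    return last if len(last) <= max_len else last[:max_len - 3] + "..."
```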