googlecloudplatform / ml-testing-accelerators

Testing framework for Deep Learning models (TensorFlow and PyTorch) on Google Cloud hardware accelerators (TPU and GPU).

License: Apache License 2.0
Different tabs have different-sized boxes representing pass/fail status, since the plot height/width scales with the number of tests and the number of days of data. Try to find some way to scale the plots so that box sizes stay consistent across tabs.
I think we should continue to show any currently out-of-bounds metrics first on the page.
Aside from that, some kind of sorting would be nice for all the in-bounds metrics.
One idea is to sort by the stddev of each metric over the last N runs; this would give us a sense of which metrics are becoming increasingly unstable.
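A minimal sketch of that sorting idea, assuming each metric's recent history is available as a list of floats; the names `metric_history` and `N_RUNS` are illustrative, not part of the dashboard code:

```python
import statistics

N_RUNS = 20  # assumed window size; tune to taste


def sort_by_recent_stddev(metric_history):
    """Order in-bounds metrics so the noisiest ones come first.

    metric_history: dict mapping metric name -> list of recent values,
    ordered oldest to newest.
    """
    def recent_stddev(values):
        window = values[-N_RUNS:]
        # A single data point has no spread; treat it as perfectly stable.
        return statistics.pstdev(window) if len(window) > 1 else 0.0

    return sorted(metric_history.items(),
                  key=lambda kv: recent_stddev(kv[1]),
                  reverse=True)


# Example:
# sort_by_recent_stddev({"accuracy": [0.91, 0.92, 0.90],
#                        "loss": [1.2, 0.4, 2.0]})
# -> loss (high spread) sorts ahead of accuracy (low spread)
```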
FR: The current dashboard seems to display every test stored in the metrics table. We should have the option to toggle hiding inactive tests, or hide them by default and give an option to reveal them (this was a pretty nice feature of our legacy dashboard).
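One possible shape for that toggle, assuming each test row carries a timestamp of its most recent run; the `last_run` field and the 30-day cutoff are assumptions for illustration:

```python
import datetime

INACTIVE_AFTER_DAYS = 30  # assumed cutoff for "inactive"


def visible_tests(tests, show_inactive=False):
    """Filter out tests with no recent runs unless the user opts in.

    tests: iterable of dicts, each with a 'last_run' datetime field.
    """
    if show_inactive:
        return list(tests)
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=INACTIVE_AFTER_DAYS)
    return [t for t in tests if t["last_run"] >= cutoff]
```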
As more and more tests are added, we'll be scrolling a lot when looking at the dashboard, and the top of the screen looks like: https://screenshot.googleplex.com/TvKeW5ivhuh
It would be cool if we could:
Right now, we're having to track the email alerts and cut b/'s separately. While email alerts are nice, we could also file Buganizer bugs in parallel to track non-trivial failures, and noise would be reduced if trivial failures auto-closed on the next successful run.
It would be great if we could tag failures with causes such as infrastructure flakiness, model regressions, or erroneous test configs. It may also be useful to link to bugs or GitHub issues. This would be useful both for record-keeping (e.g. tracking flakiness over time) and for communicating with teammates when an error has been triaged.
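A rough sketch of what such a triage annotation could look like if stored alongside each run; all field and type names here are hypothetical, not an existing schema:

```python
import dataclasses
import enum


class FailureCause(enum.Enum):
    INFRA_FLAKE = "infrastructure flakiness"
    MODEL_REGRESSION = "model regression"
    BAD_TEST_CONFIG = "erroneous test config"


@dataclasses.dataclass
class TriageNote:
    """Annotation attached to a failed run for record-keeping and hand-off."""
    test_name: str
    run_id: str
    cause: FailureCause
    bug_url: str = ""   # Buganizer or GitHub issue link, if any
    notes: str = ""     # free-form comment from whoever triaged it
```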
The current command in the dashboard to download logs requires copy-pasting into a terminal with gcloud installed.
We should instead be able to hit the entries.list HTTP endpoint: https://cloud.google.com/logging/docs/reference/v2/rest/v2/entries/list
This should give the same result as gcloud logging read: https://cloud.google.com/logging/docs/reference/tools/gcloud-logging#listing-logs
With that, we should be able to download the logs directly from the dashboard UI.
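For reference, a sketch of the equivalent call from Python using the Cloud Logging client library (which wraps the same entries.list endpoint). The project name, pod name, and filter string below are placeholders; the real dashboard would build its own filter for the specific run:

```python
from google.cloud import logging as cloud_logging


def download_logs(project, pod_name):
    """Fetch log entries for one test run, roughly equivalent to
    `gcloud logging read` with the same filter."""
    client = cloud_logging.Client(project=project)
    # Example filter only; adjust to however the runs are actually labeled.
    log_filter = (f'resource.type="k8s_container" AND '
                  f'resource.labels.pod_name="{pod_name}"')
    return [entry.payload for entry in
            client.list_entries(filter_=log_filter,
                                order_by=cloud_logging.ASCENDING)]
```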
Right now you have to click the "x" button at the top right of the modal to close it. I'd like to be able to click anywhere outside the modal to close it.
Small comment: it seems like the test names are not searchable on the dashboard UI. Search would be useful, especially once we have many tests per page.
Idea: one-shot runs are usually looked at and analyzed manually, so automated restarts may not be required in that case.
I'd like to:
When I click a cell and then close the popup, colors get stuck like this: https://screenshot.googleplex.com/KgG3vdkxuYz
If I click on an empty pixel, the colors come back to normal.
Example: https://xl-ml-test.appspot.com/metrics?test_name=pt-nightly-resnet50-func-v3-8
The newest entry is on the left and the oldest is on the right, which feels counterintuitive to me. I feel it would be better if time ran from left to right (oldest on the left, newest on the right). WDYT?
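The fix would presumably just be to sort the points by timestamp ascending before plotting; a tiny sketch, assuming each point is a (timestamp, value) pair:

```python
def chronological(points):
    """Order (timestamp, value) pairs oldest-to-newest so the plot
    reads left to right in time."""
    return sorted(points, key=lambda p: p[0])
```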
Right now, I'm having to first go to the dashboard URL and then click on the pytorch-nightly tab. If I could pin that tab's direct URL to my browser, that would be awesomesauce.
@thisisalbertliang developed a UNet3D PyTorch/XLA codebase that can run on both TPU and GPU.
To enhance our TPU test suite, we hope to add a convergence test for this UNet3D model on TPU v3-8.
This is an uninteresting failure mode, and these failures are mostly noise.
Happened twice this week. Example.
Instead of counting stable instances in the IG and sleeping, I think it would be better to use gcloud compute instance-groups managed wait-until-stable $IG --zone $ZONE. WDYT?
Occasionally we might want to check hyperparameters when comparing between stacks. This would require the model itself to dump its parameters (e.g. classifier_trainer in TF), but it would be great for interpretability.
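A sketch of the kind of dump that would make this comparison possible, assuming the model exposes its hyperparameters as a flat dict; the helper name and output path are illustrative:

```python
import json


def dump_hparams(hparams, output_path="hparams.json"):
    """Write the run's hyperparameters to a JSON file so that two stacks
    (e.g. TF vs. PyTorch/XLA) can be diffed field-by-field later."""
    with open(output_path, "w") as f:
        json.dump(hparams, f, indent=2, sort_keys=True)


# Example:
# dump_hparams({"learning_rate": 0.1, "batch_size": 1024, "optimizer": "lars"})
```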
Failed runs just say "failure" at the moment. If they showed a one-line message from the traceback, that could be useful.
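A sketch of extracting that one-liner, assuming the run's log text is available as a single string (the function name and truncation length are assumptions):

```python
def summarize_failure(log_text, max_len=120):
    """Return the last non-empty log line (typically the final line of a
    Python traceback, e.g. 'RuntimeError: ...') truncated for display."""
    lines = [l.strip() for l in log_text.splitlines() if l.strip()]
    if not lines:
        return "failure (no logs captured)"
    last = lines[-1]
    return last if len(last) <= max_len else last[:max_len - 3] + "..."
```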