mlcommons / inference_policies
Issues related to MLPerf™ Inference policies, including rules and suggested changes
Home Page: https://mlcommons.org/en/groups/inference/
License: Apache License 2.0
Proposal:
Imaging: 50ms
GNMT: 100ms
We propose these targets stand unless a counterproposal is received by 7/10.
@guschmue asks, "Say I have a runtime that takes fp16 but internally uses 4 bit – can one submit this in closed?"
"We’d load the weights as fp16 but internally activations and weights would be 4 bit. We’d use calibration using the published list."
Proposal:
99% of FP32 accuracy.
We propose these targets stand unless a counterproposal is received by 7/10.
For "Performance mode", I am trying to understand whether there are specific expectations for the value that the "total_sample_count" parameter in the QSL should be set to. Can this be a random value? Please help clarify.
For "Accuracy mode", I am assuming this will be equal to the entire validation set.
We need to determine the QSL size. Currently, 1000 and 500 are being debated.
Please see link here.
https://github.com/mlperf/inference_policies/blob/master/inference_rules.adoc#41-benchmarks
mobilenet = 66ms
ssd-resnet34 = 50 ms
This seems wrong. Should these two be swapped?
The quantization rules need to be explained more clearly:
"The quantization method must be publicly described at a level where it could be reproduced. To be considered principled, the description of the quantization method must be much much smaller than the non-zero weights it produce."
It is not clear to me (or to others) what it means for the description of the method to be "much much smaller" ...
"Weight quantization algorithm that are similar in size to the non-zero weights they produce."
Hmm... huh?
Just want to know whether per-row data padding to match the hardware's alignment requirement is untimed or not.
ResNet returns a tuple of tensors representing classes and probabilities. Are both necessary?
We should add: mixing of any whitelisted precisions is allowed. This is missing from the current doc.
William from Qualcomm asked:
"Will MLPerf have a submission auditing rule on running the LoadGen version command to report the signature of source files used in the current build of the app? Since the current LoadGen is implemented in Python, and Python does not run on Android, we integrated LoadGen with our framework and removed these source files, as we don't use them. Will MLPerf be mandating that the LoadGen version command function during result submission auditing?"
I don't think we have the right standard for exposing quantization. Much as I would be interested to see how everyone else is doing it, what we should really care about is that it's a function of the input model and the calibration set, and that it generalizes across a wide range of models.
That's less strict than publishing it to the level of detail where it can be reproduced.
We should explicitly clarify that the same output should be passed to QuerySampleComplete in both accuracy and performance modes, even though performance mode does not check the content of the output.
An alternative proposal:
ImageNet: first 500 images and class labels.
COCO: first 500 images and annotations.
The current calibration sets are also 500 images each but randomly chosen (?).
I commented on an email discussion:
For calibration, what's the current fp_trained_accuracy for the planned set of inference benchmarks?
It seems to me that a formulation in terms of %correct might not be the right way to frame the error margin. The difference between 50% correct and 49% correct is probably negligible, but the difference between 99% and 98% correct is a 2x increase in errors. None of this might matter if all of the current inference benchmarks are at 70% correct in float, but if anything is above 90%, then we are allowing very wide error margins for the high-accuracy benchmarks that might not represent what people want to deploy.
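The asymmetry above is easy to see numerically. This is just an illustration of the arithmetic, not part of any proposed rule:

```python
# A 1-point accuracy drop means very different relative increases in
# error rate depending on where the baseline sits.
def error_increase(fp32_acc, submitted_acc):
    """Ratio of the submitted error rate to the fp32 reference error rate."""
    return (1.0 - submitted_acc) / (1.0 - fp32_acc)

# At 50% fp32 accuracy, dropping to 49% barely changes the error rate:
print(error_increase(0.50, 0.49))  # ~1.02
# At 99% fp32 accuracy, dropping to 98% doubles the errors:
print(error_increase(0.99, 0.98))  # ~2.0
```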
So, recommendations:
duplicating:
mlcommons/training_policies#217
WG: no one objects so far.
Hi all,
In our case, we commonly use and provide a web-API-based compiler to internal developers and customers, for security reasons.
Likewise, we hope to compile models using the web-API-based compiler during MLPerf Inference submissions too (obviously, auditors can use the compiler via the web API in our inference submission system).
Is it acceptable for our compilation process to work like this during inference submissions?
Some inference systems may produce outputs that are arbitrarily padded, transposed, or reshaped relative to the reference implementation. In real systems, downstream code can frequently be adapted to read inputs in unusual layouts. Should adapters be allowed for connecting the reference accuracy checking code?
Currently it is set to a mAP of 0.23.
Right now a submission for any model should have:
In the rules, it states:
Note: For v0.5, the same code must be run for both the accuracy and performance LoadGen modes.
I suggest we change it to:
Note: For v0.5, the same code must be run for both the accuracy and performance LoadGen mode with single stream.
@briandersn how would 'SubmissionRun' work in loadgen with the scheme above?
For the offline scenario, can a run that is much longer than the minimum number of queries and the time duration be submitted?
Can we give driver a hint to preload the image data to somewhere closer to chip during LoadSamplesToRam?
In the current terms of use:
https://github.com/mlperf/policies/blob/master/TERMS%20OF%20USE.md
You may cite either official results obtained from the MLPerf results page or unofficial results measured independently. If you cite an unofficial result you must clearly specify that the result is “Unverified” in text and clearly state “Result not verified by MLPerf” in a footnote. The result must comply with the letter and spirit of the relevant MLPerf rules. For example:
SmartAI Corp announced an estimated score of 0.3 on the MLPerf v0.5 Training Closed Division - Image Classification benchmark using a cluster of 20 SmartChips running MLFramework v4.1 [1].
[1] Result not verified by MLPerf. MLPerf name and logo are trademarks. See www.mlperf.org for more information.
--> Actually, 'Unverified' is not in the example text; it only has 'estimated'.
Following WG's discussion on 14/Aug/2019, I suggest we document the request and decision here.
We would like the symmetrically quantized model for MobileNet-SSD that Habana trained to be added to the list of allowed models...
Motivation: currently MLPerf allows people to use the code and claim performance freely as long as they label it 'unverified'. Link: https://github.com/mlperf/policies/blob/master/TERMS%20OF%20USE.md
However, it might be really hard to verify any inference claims given the ‘black box’ nature of some inference engines.
A strawperson proposal from Intel with ‘unofficial MLPerf results’
There are a small number of queries in the GNMT sample set.
In order to avoid all samples being implicitly cached, can we increase the GNMT source queries by having N copies of the samples such that they take up 512MB?
In order to avoid changing the loadgen API for v0.5, we could increase the size of the library itself and allow the performance size to be larger than the accuracy size in the QSL.
cc: @tjablin, @nvmbreughe
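The sizing step above can be sketched as follows. The per-set byte count used in the example is a made-up placeholder, not the actual GNMT sample-set size:

```python
# Hypothetical sketch: how many copies of the GNMT sample set are needed
# so that the QSL working set reaches the 512 MB target and samples are
# not all implicitly cached.
TARGET_BYTES = 512 * 1024 * 1024

def copies_needed(sample_set_bytes, target_bytes=TARGET_BYTES):
    """Smallest N such that N copies of the sample set reach the target size."""
    return -(-target_bytes // sample_set_bytes)  # ceiling division

# e.g. if the GNMT source sentences occupied ~4 MiB in total (placeholder):
print(copies_needed(4 * 1024 * 1024))  # 128
```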
As rules section 7.1 indicates:
May convert data among all the whitelisted numerical formats
and since int8 is a whitelisted format, float-to-int8 conversion is allowed in pre-processing.
We used calibration list 1 to profile the input tensor value range after pre-processing; the min/max range is (-123.68, 151.06). To fit the input tensor into int8, a scale of 0.84 must be applied; otherwise, out-of-range values will be clamped and precision will be affected.
Just want to confirm that the 0.84 scaling of the OpenCV pre-processed values can be done in the pre-processing step, which is not timed in performance mode.
Is the server scenario meant to be 1) a testing mode, or 2) a representation of a real server use case?
If 1), I suggest we set a few latency targets and don't force everyone to submit to the same latency targets. We also need to clarify in the rules that the server scenario is a testing mode, not targeting a real server use case.
If 2), I suggest we set latency targets that represent actual data center use cases.
Thanks.
Assigned to @tjablin
SSD 1200 tensorflow checkpoint: https://zenodo.org/record/3247091#.XQ0S4pJKhGE is inconsistent with the reference model: https://github.com/mlperf/inference/blob/master/v0.5/classification_and_detection/python/models/ssd_r34.py#L295
Batchnorm is added after each convolution in additional layers in the checkpoint, for example, ssd1200/additional_layers/conv8/conv8_bn1/beta is found in the checkpoint.
But in the reference model, there is no batchnorm added after convolution: https://github.com/mlperf/inference/blob/master/v0.5/classification_and_detection/python/models/ssd_r34.py#L295
The weight provenance is not recorded for all the benchmarks, and this is something the README files should include. We need to have the benchmark owners do this.
ReLU6 / ReLU8 is an allowed equivalence. We request this be extended to ReLU N for any N, including FLT_MAX.
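ReLU6, ReLU8, and the requested ReLU N are all the same clamp with a different upper bound, and FLT_MAX recovers plain ReLU. A minimal sketch:

```python
# ReLU N: clamp the activation to [0, n]. ReLU6 and ReLU8 are the
# n = 6 and n = 8 special cases; n = FLT_MAX is ordinary ReLU.
def relu_n(x, n):
    return min(max(x, 0.0), n)

print(relu_n(7.5, 6))   # 6: ReLU6 clamps the value
print(relu_n(7.5, 8))   # 7.5: ReLU8 passes it through
print(relu_n(-2.0, 6))  # 0.0: negative inputs are zeroed as usual
```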
Currently the rule says:
2.3. System and framework must be available
If you are measuring the performance of a publicly available and widely-used system or framework, you must use publicly available and widely-used versions of the system or framework.
If you are measuring the performance of an experimental framework or system, you must make the system and framework you use available upon demand for replication.
--
In training, we allowed PRs to frameworks. How shall we reflect this in the rules?
What about closed-source code and binaries? Do we require them to be on an official site? What qualifies as an official site?
I believe that reducing the beam width of GNMT is not an approximation for the purposes of MLPerf Inference. Does everyone agree?
Following WG's discussion on 14/Aug/2019, I suggest we document the request and decision here.
We would like, potentially, to provide [an additional] model we trained for MobileNet INT8, fine-tuned with the canonical image processing flow, as we think that might recover some of the lost accuracy.
For the cloud scenario, batching logic wants to predict the marginal latency cost to currently queued queries of waiting for the next inbound request. For the CV benchmarks this can be calculated from just the Poisson rate parameter, but for GNMT the distribution of sequence lengths is also needed.
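The point about the Poisson rate parameter can be made concrete: inter-arrival times of a Poisson process are exponential, so by memorylessness the expected wait for the next inbound request is simply 1/λ, regardless of how long the batcher has already waited. A sketch of that calculation (for GNMT the per-query service time also depends on sequence length, which this does not capture):

```python
# Under a Poisson arrival process with rate lam (queries/sec), the
# expected time until the next arrival is 1/lam, independent of elapsed
# time. This is the only arrival-side quantity the CV benchmarks need
# to estimate the marginal latency cost of waiting for one more query.
def expected_wait_for_next_query(lam):
    """Mean time (seconds) until the next arrival of a rate-lam process."""
    return 1.0 / lam

print(expected_wait_for_next_query(100.0))  # 0.01 s at 100 QPS
```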
Hi Results WG, we have a question: we are running an NMT inference benchmark with some engineers from another company. Can we mention their names (like XXX from XXX) in the results publication?
7.1 has fp32 <-> int8 preprocessing conversion as untimed. We would like to add fp16 also.
I thought we agreed to disallow pruning, but the rules don't reflect this. Can we please update the rules to either allow or disallow pruning in the Closed division?
I'd like to clarify that the LoadGen language bindings are not part of the LoadGen and may be modified for MLPerf Inference 0.5 Closed division.
Please add to the rule @tjablin
Must upstream to the MLPerf github
Is mapping the words in a sentence sample to embedding indices an acceptable form of preprocessing when loading text samples to RAM? There are many whitelisted int variants, and conversions to whitelisted data types are permitted :)
"Is it possible to change operations only if mathematical semantics are preserved?
We are targeting the closed division of the inference submission.
For example:
In the provided SSD model, concatenation is applied before the NMS operation.
During the NMS operation, detections with low scores are filtered out.
We can achieve better latency if we filter out detections before concatenation.
Fusion of operations is allowed according to the rules.
I am curious whether this is allowed or not for the closed division."
The WG decided this reordering is allowed, given that the two orderings are mathematically equivalent.
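The equivalence is easy to check on a toy example: because the score test is applied per detection, filtering before concatenation yields the same set as concatenating first and filtering afterwards. A minimal sketch with a hypothetical threshold:

```python
# Toy check of the reordering: per-detection score filtering commutes
# with concatenation of the per-head detection lists.
THRESHOLD = 0.05  # hypothetical score cutoff

def filter_then_concat(per_head_detections):
    out = []
    for dets in per_head_detections:
        out.extend(d for d in dets if d["score"] >= THRESHOLD)
    return out

def concat_then_filter(per_head_detections):
    merged = [d for dets in per_head_detections for d in dets]
    return [d for d in merged if d["score"] >= THRESHOLD]

heads = [[{"score": 0.9}, {"score": 0.01}], [{"score": 0.3}]]
assert filter_then_concat(heads) == concat_then_filter(heads)
```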