mlcommons / inference_policies
Issues related to MLPerf™ Inference policies, including rules and suggested changes
Home Page: https://mlcommons.org/en/groups/inference/
License: Apache License 2.0
Proposal:
Imaging: 50ms
GNMT: 100ms
We propose these targets stand unless a counterproposal is received by 7/10.
@guschmue asks, "Say I have a runtime that takes fp16 but internally uses 4 bit – can one submit this in closed?"
"We’d load the weights as fp16 but internally activations and weights would be 4 bit. We’d use calibration using the published list."
Proposal:
99% of FP32 accuracy.
We propose these targets stand unless a counterproposal is received by 7/10.
For "Performance mode", I am trying to understand whether there are specific expectations for the value that the "total_sample_count" parameter in the QSL should be set to. Can this be a random value? Please help clarify.
For "Accuracy mode", I am assuming this will be equal to the entire validation set.
We need to determine the QSL size. Currently, 1000 and 500 are being debated.
Please see link here.
https://github.com/mlperf/inference_policies/blob/master/inference_rules.adoc#41-benchmarks
mobilenet = 66ms
ssd-resnet34 = 50 ms
This seems wrong. Should these two be swapped?
The quantization rules need to be explained more clearly:
"The quantization method must be publicly described at a level where it could be reproduced. To be considered principled, the description of the quantization method must be much much smaller than the non-zero weights it produce."
It is not clear to me (or to others) what it means for the description of the method to be "much much smaller" ...
"Weight quantization algorithm that are similar in size to the non-zero weights they produce."
Hmm... huh?
Just want to know whether per-row data padding to match the hardware's alignment requirement is untimed or not.
ResNet returns a tuple of tensors representing classes and probabilities. Are both necessary?
We should add: mixing of any whitelisted precisions is allowed. This is missing from the current doc.
William from Qualcomm asked:
"Will MLPerf have a submission auditing rule on running the LoadGen version command to report the signature of source files used in the current build of the app? Since the current LoadGen is implemented in Python, and Python does not run on Android, we integrated LoadGen with our framework and removed these source files, as we don't use them. Will MLPerf be mandating that the LoadGen version command function during result submission auditing?"
I don't think we have the right standard for exposing quantization. Much as I would be interested to see how everyone else is doing it, what we should really care about is that it's a function of the input model and the calibration set, and that it generalizes across a wide range of models.
That's less strict than publishing it to the level of detail where it can be reproduced.
We should explicitly clarify that the same output should be passed to QuerySampleComplete in both accuracy and performance modes, even though performance mode does not check the content of the output.
An alternative proposal:
ImageNet: first 500 images and class labels.
COCO: first 500 images and annotations.
The current calibration sets are also 500 images each but randomly chosen (?).
I commented on an email discussion:
For calibration, what's the current fp_trained_accuracy for the planned set of inference benchmarks?
It seems to me that a formulation in terms of %correct might not be the right way to frame the error margin. The difference between 50% correct and 49% correct is probably negligible, but the difference between 99% and 98% correct is a 2x increase in errors. None of this might matter if all of the current inference benchmarks are at 70% correct in float, but if anything is above 90%, then we are allowing very wide error margins for the high-accuracy benchmarks that might not represent what people want to deploy.
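The asymmetry above is easy to see numerically. This is just an illustration of the arithmetic, not part of any proposed rule:

```python
# A 1-point accuracy drop means very different relative increases in
# error rate depending on where the baseline sits.
def error_increase(fp32_acc, submitted_acc):
    """Ratio of the submitted error rate to the fp32 reference error rate."""
    return (1.0 - submitted_acc) / (1.0 - fp32_acc)

# At 50% fp32 accuracy, dropping to 49% barely changes the error rate:
print(error_increase(0.50, 0.49))  # ~1.02
# At 99% fp32 accuracy, dropping to 98% doubles the errors:
print(error_increase(0.99, 0.98))  # ~2.0
```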
So, recommendations:
duplicating:
mlcommons/training_policies#217
WG: no one objects so far.
Hi all,
In our case, we commonly use and provide a web-API-based compiler to internal developers and customers, for security reasons.
Likewise, we hope to compile models using the web-API-based compiler during MLPerf Inference submissions too (obviously, auditors can use the compiler via the web API in our inference submission system).
Is it acceptable for our compilation process to work like this during inference submissions?
Some inference systems may produce outputs that are arbitrarily padded, transposed, or reshaped relative to the reference implementation. In real systems, downstream code can frequently be adapted to read inputs in unusual layouts. Should adapters be allowed for connecting the reference accuracy checking code?
Currently it is set to a mAP of 0.23.
Right now a submission for any model should have:
In the rules, it states:
Note: For v0.5, the same code must be run for both the accuracy and performance LoadGen modes.
I suggest we change it to:
Note: For v0.5, the same code must be run for both the accuracy and performance LoadGen mode with single stream.
@briandersn how would 'SubmissionRun' work in loadgen with the scheme above?
For the offline scenario, can a run that is much longer than the minimum number of queries and the time duration be submitted?
Can we give driver a hint to preload the image data to somewhere closer to chip during LoadSamplesToRam?
In the current terms of use:
https://github.com/mlperf/policies/blob/master/TERMS%20OF%20USE.md
You may cite either official results obtained from the MLPerf results page or unofficial results measured independently. If you cite an unofficial result you must clearly specify that the result is “Unverified” in text and clearly state “Result not verified by MLPerf” in a footnote. The result must comply with the letter and spirit of the relevant MLPerf rules. For example:
SmartAI Corp announced an estimated score of 0.3 on the MLPerf v0.5 Training Closed Division - Image Classification benchmark using a cluster of 20 SmartChips running MLFramework v4.1 [1].
[1] Result not verified by MLPerf. MLPerf name and logo are trademarks. See www.mlperf.org for more information.
--> Actually, 'Unverified' is not in the example text; it only has 'estimated'.
Following WG's discussion on 14/Aug/2019, I suggest we document the request and decision here.
We would like the symmetrically quantized model for MobileNet-SSD that Habana trained to be added to the list of allowed models...
Motivation: currently MLPerf allows people to use the code and claim performance freely as long as they label it 'unverified'. Link: https://github.com/mlperf/policies/blob/master/TERMS%20OF%20USE.md
However, it might be really hard to verify any inference claims given the ‘black box’ nature of some inference engines.
A strawperson proposal from Intel with ‘unofficial MLPerf results’
There are a small number of queries in the GNMT sample set.
In order to avoid all samples being implicitly cached, can we increase the GNMT source queries by having N copies of the samples such that they take up 512MB?
In order to avoid changing the loadgen API for v0.5, we could increase the size of the library itself and allow the performance size to be larger than the accuracy size in the QSL.
cc: @tjablin, @nvmbreughe
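The sizing step above can be sketched as follows. The per-set byte count used in the example is a made-up placeholder, not the actual GNMT sample-set size:

```python
# Hypothetical sketch: how many copies of the GNMT sample set are needed
# so that the QSL working set reaches the 512 MB target and samples are
# not all implicitly cached.
TARGET_BYTES = 512 * 1024 * 1024

def copies_needed(sample_set_bytes, target_bytes=TARGET_BYTES):
    """Smallest N such that N copies of the sample set reach the target size."""
    return -(-target_bytes // sample_set_bytes)  # ceiling division

# e.g. if the GNMT source sentences occupied ~4 MiB in total (placeholder):
print(copies_needed(4 * 1024 * 1024))  # 128
```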
As rules section 7.1 indicates:
May convert data among all the whitelisted numerical formats
and since int8 is a whitelisted format, float-to-int8 conversion is allowed in pre-processing.
We used calibration list 1 to profile the input tensor value range after pre-processing; the min/max range is (-123.68, 151.06). To fit the input tensor into int8, a scale of 0.84 must be applied; otherwise, out-of-range values will be clamped and precision will be affected.
Just want to confirm that the 0.84 scaling of the OpenCV pre-processed values can be done in the pre-processing step, which is not timed in performance mode.
Is the server scenario meant to be 1) a testing mode, or 2) a representation of a real server use case?
If 1), I suggest we set a few latency targets and don't force everyone to submit to the same latency targets. We also need to clarify in the rules that the server scenario is a testing mode, not targeting a real server use case.
If 2), I suggest we set latency targets that represent actual data center use cases.
Thanks.
Assigned to @tjablin
SSD 1200 tensorflow checkpoint: https://zenodo.org/record/3247091#.XQ0S4pJKhGE is inconsistent with the reference model: https://github.com/mlperf/inference/blob/master/v0.5/classification_and_detection/python/models/ssd_r34.py#L295
Batchnorm is added after each convolution in additional layers in the checkpoint, for example, ssd1200/additional_layers/conv8/conv8_bn1/beta is found in the checkpoint.
But in the reference model, there is no batchnorm added after convolution: https://github.com/mlperf/inference/blob/master/v0.5/classification_and_detection/python/models/ssd_r34.py#L295
The weight provenance is not recorded for all the benchmarks, and this is something the README files should include. We need to have the benchmark owners do this.
ReLU6 / ReLU8 is an allowed equivalence. We request this be extended to ReLU N for any N, including FLT_MAX.
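ReLU6, ReLU8, and the requested ReLU N are all the same clamp with a different upper bound, and FLT_MAX recovers plain ReLU. A minimal sketch:

```python
# ReLU N: clamp the activation to [0, n]. ReLU6 and ReLU8 are the
# n = 6 and n = 8 special cases; n = FLT_MAX is ordinary ReLU.
def relu_n(x, n):
    return min(max(x, 0.0), n)

print(relu_n(7.5, 6))   # 6: ReLU6 clamps the value
print(relu_n(7.5, 8))   # 7.5: ReLU8 passes it through
print(relu_n(-2.0, 6))  # 0.0: negative inputs are zeroed as usual
```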
Currently the rule says:
2.3. System and framework must be available
If you are measuring the performance of a publicly available and widely-used system or framework, you must use publicly available and widely-used versions of the system or framework.
If you are measuring the performance of an experimental framework or system, you must make the system and framework you use available upon demand for replication.
--
In training, we allowed PRs to frameworks. How shall we reflect this in the rules?
What about closed-source code and binaries? Do we require them to be on an official site? What qualifies as an official site?
I believe that reducing the beam width of GNMT is not an approximation for the purposes of MLPerf Inference. Does everyone agree?
Following WG's discussion on 14/Aug/2019, I suggest we document the request and decision here.
We would like, potentially, to provide [an additional] model we trained for MobileNet INT8, fine-tuned with the canonical image processing flow, as we think that might recover some of the lost accuracy.
For the cloud scenario, batching logic wants to predict the marginal latency cost to currently queued queries of waiting for the next inbound request. For the CV benchmarks this can be calculated from just the Poisson rate parameter, but for GNMT the distribution of sequence lengths is also needed.
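The point about the Poisson rate parameter can be made concrete: inter-arrival times of a Poisson process are exponential, so by memorylessness the expected wait for the next inbound request is simply 1/λ, regardless of how long the batcher has already waited. A sketch of that calculation (for GNMT the per-query service time also depends on sequence length, which this does not capture):

```python
# Under a Poisson arrival process with rate lam (queries/sec), the
# expected time until the next arrival is 1/lam, independent of elapsed
# time. This is the only arrival-side quantity the CV benchmarks need
# to estimate the marginal latency cost of waiting for one more query.
def expected_wait_for_next_query(lam):
    """Mean time (seconds) until the next arrival of a rate-lam process."""
    return 1.0 / lam

print(expected_wait_for_next_query(100.0))  # 0.01 s at 100 QPS
```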
Hi Results WG, we have a question: we are running an NMT inference benchmark with some engineers from another company. Can we mention their names (like XXX from XXX) in the results publication?
7.1 has fp32 <-> int8 preprocessing conversion as untimed. We would like to add fp16 also.
I thought we agreed to disallow pruning, but the rules don't reflect this. Can we please update the rules to either allow or disallow pruning in the Closed division?
I'd like to clarify that the LoadGen language bindings are not part of the LoadGen and may be modified for MLPerf Inference 0.5 Closed division.
Please add to the rule @tjablin
Must upstream to the MLPerf github
Is mapping the words in a sentence sample to embedding indices an acceptable form of preprocessing when loading text samples to RAM? There are many whitelisted int variants, and conversions to whitelisted data types are permitted :)
"Is it possible to change operations only if mathematical semantics are preserved?
We are targeting the closed division of the inference submission.
For example:
In the provided SSD model, concatenation is applied before the NMS operation.
During the NMS operation, detections with low scores are filtered out.
We can achieve better latency if we filter out detections before concatenation.
Fusion of operations is allowed according to the rules.
I am curious whether this is allowed or not for the closed division."
The WG decided this reordering is allowed, given that the two orderings are mathematically equivalent.
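The equivalence is easy to check on a toy example: because the score test is applied per detection, filtering before concatenation yields the same set as concatenating first and filtering afterwards. A minimal sketch with a hypothetical threshold:

```python
# Toy check of the reordering: per-detection score filtering commutes
# with concatenation of the per-head detection lists.
THRESHOLD = 0.05  # hypothetical score cutoff

def filter_then_concat(per_head_detections):
    out = []
    for dets in per_head_detections:
        out.extend(d for d in dets if d["score"] >= THRESHOLD)
    return out

def concat_then_filter(per_head_detections):
    merged = [d for dets in per_head_detections for d in dets]
    return [d for d in merged if d["score"] >= THRESHOLD]

heads = [[{"score": 0.9}, {"score": 0.01}], [{"score": 0.3}]]
assert filter_then_concat(heads) == concat_then_filter(heads)
```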