
inference_policies's People

Contributors

ahmadki, anirban-ghosh, arjunsuresh, bitfort, christ1ne, ckstanton, dilipsequeira, galv, guschmue, itayhubara, liorkhe, morphine00, mrasquinha-g, mrmhodak, nathanw-mlc, nv-ananjappa, nv-jinhosuh, nvitramble, nvpohanh, nvyihengz, nvzhihanj, petermattson, pgmpablo157321, profvjreddi, psyhtest, rnaidu02, s-idgunji, thekanter, tjablin, yyetim

inference_policies's Issues

Finalize latency targets

Proposal:
Imaging: 50ms
GNMT: 100ms
We propose these targets stand unless a counterproposal is raised by 7/10.

quantization rules questions

@guschmue asks, "Say I have a runtime that takes fp16 but internally uses 4 bit – can one submit this in closed?"

"We’d load the weights as fp16 but internally activations and weights would be 4 bit. We’d use calibration using the published list."

Finalize quality targets

Proposal:
99% of FP32 accuracy.
We propose these targets stand unless a counterproposal is raised by 7/10.

Clarify quantization rule description

The quantization rules need to be explained more clearly:

"The quantization method must be publicly described at a level where it could be reproduced. To be considered principled, the description of the quantization method must be much much smaller than the non-zero weights it produce."

It is not clear to others (or to me) what it means for the description of the method to be much much smaller ...

"Weight quantization algorithm that are similar in size to the non-zero weights they produce."

Hmm... huh?

Submission auditing rules for LoadGen

William from Qualcomm asked:

"Will MLperf have submission auditing rule on running the loadgen version command to report the signature of source files used in the current build of the app? Since current loadgen is implemented on python and python does not run on android, we integrated loadgen with our framework and we removed these source files as we don’t use them. Will Mlperf be mandating the loadgen version command to function during result submission auditing?"

Wrong quantization rules

I don't think we have the right standard for exposing quantization. Much as I would be interested to see how everyone else is doing it, what we should really care about is that the quantization is a function of the input model and the calibration set, and that it generalizes across a wide range of models.

That's less strict than publishing it to the level of detail where it can be reproduced.

clarification on accuracy vs performance mode

We should state explicitly that the same output must be passed to QuerySampleComplete in both accuracy and performance mode, even though performance mode does not check the contents of the output.
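A minimal sketch of what this looks like at the SUT boundary, assuming the v0.5 LoadGen Python bindings (the module name mlperf_loadgen and the run_inference helper are assumptions for illustration, not part of the rules):

```python
import array

import mlperf_loadgen  # assumed module name for the LoadGen Python bindings


def issue_queries(query_samples):
    # SUT callback invoked by LoadGen in both accuracy and performance mode.
    responses = []
    buffers = []  # keep output buffers alive until LoadGen has copied them
    for sample in query_samples:
        output_bytes = run_inference(sample.index)  # hypothetical inference helper
        data = array.array("B", output_bytes)
        buffers.append(data)
        addr, length = data.buffer_info()
        # The same output buffer is reported regardless of mode; LoadGen only
        # inspects its contents when running in accuracy mode.
        responses.append(mlperf_loadgen.QuerySampleResponse(sample.id, addr, length))
    mlperf_loadgen.QuerySamplesComplete(responses)
```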

alternative calibration set proposal

An alternative proposal:
ImageNet: first 500 images and class labels.
COCO: first 500 images and annotations.

The current calibration sets are also 500 images each but randomly chosen (?).

Error margin on accuracy or inaccuracy?

I commented on an email discussion:

For calibration, what's the current fp_trained_accuracy for the planned set of inference benchmarks?

It seems to me that formulation in terms of the %correct might not be the right way to frame the error margin. The difference between 50% correct and 49% correct is probably negligible, but the difference between 99% and 98% correct is 2x more errors. None of which might matter if all of the current inference benchmarks are at 70% correct in float, but if anything is above 90%, then we're allowing very wide error margins for the high-accuracy benchmarks that might not represent what people want to deploy.

So, recommendations:

  1. Can we please state what current accuracies people think apply for floating-point inference today? If we're currently 75% accurate / 25% inaccurate, the distinction I'm raising is probably moot. But if we're closer to 99% accurate, we should be careful not to widen our accuracy margins by ridiculous factors because of how the math works (see the small worked example after this list).
  2. Please discuss: do people have opinions about %increase in errors versus %decrease in correct_percent?
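To make the arithmetic concrete, a small illustration with made-up accuracies (the numbers are placeholders, not the actual benchmark figures):

```python
def error_inflation(fp32_accuracy, allowed_fraction=0.99):
    # How much the error rate may grow under a "99% of FP32 accuracy" rule.
    min_accuracy = fp32_accuracy * allowed_fraction
    return (1.0 - min_accuracy) / (1.0 - fp32_accuracy)


# With placeholder accuracies: a 75%-accurate model may only grow its errors
# by ~3%, while a 99%-accurate model may nearly double them.
print(error_inflation(0.75))  # ~1.03
print(error_inflation(0.99))  # ~1.99
```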

Is it acceptable to use web-API-based compilation?

Hi all,

In our case, we commonly use and provide a web-API-based compiler for internal development and for customers, for security reasons.
Likewise, we would like to compile models with this web-API-based compiler during MLPerf inference submissions (obviously, auditors could use the compiler via the web API in our inference submission system).
Is it acceptable for our compilation process to work this way during inference submissions?

Padding/Transposing/Reshaping in Accuracy Mode

Some inference systems may produce outputs that are arbitrarily padded, transposed, or reshaped relative to the reference implementation. In real systems, downstream code can frequently be adapted to read inputs in unusual layouts. Should adapters be allowed for connecting the reference accuracy checking code?
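As a hypothetical illustration of the kind of adapter in question, assuming an SUT that emits a padded NHWC output while the reference accuracy script expects an unpadded NCHW tensor:

```python
import numpy as np


def adapt_output(raw_output, valid_shape):
    # Hypothetical adapter: strip the padding and restore the reference layout
    # before handing the result to the unmodified accuracy-checking code.
    n, c, h, w = valid_shape
    nhwc = raw_output[:n, :h, :w, :c]                          # drop padding
    return np.ascontiguousarray(nhwc.transpose(0, 3, 1, 2))    # NHWC -> NCHW
```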

clarify accuracy run is needed for every performance run.

Right now a submission for any model should have:

  1. accuracy run in the single-stream scenario (mandatory)
  2. performance run in the single-stream scenario (optional)
  3. performance run in the multi-stream scenario (optional)
  4. performance run in the server scenario (optional)
  5. performance run in the offline scenario (optional)

At least one of 2-5 above should be submitted.

In the rules, it states:
Note: For v0.5, the same code must be run for both the accuracy and performance LoadGen modes.

I suggest we change it to:
Note: For v0.5, the same code must be run for both the accuracy and performance LoadGen modes in the single-stream scenario.

@briandersn how would 'SubmissionRun' work in loadgen with the scheme above?

Long Offline Runs

For the offline scenario, can a run that is much longer than the minimum number of queries and the minimum time duration be submitted?
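For context, a run's length is governed by LoadGen's TestSettings; a sketch of a deliberately long offline configuration, with field and enum names assumed from the v0.5 Python bindings and an arbitrary ten-minute duration:

```python
import mlperf_loadgen  # assumed module name for the LoadGen Python bindings

settings = mlperf_loadgen.TestSettings()
settings.scenario = mlperf_loadgen.TestScenario.Offline
settings.mode = mlperf_loadgen.TestMode.PerformanceOnly
# Illustrative value only: ten minutes, assumed to be far above the minimum
# duration, i.e. a deliberately long offline run.
settings.min_duration_ms = 10 * 60 * 1000
```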

image preloading

Can we give the driver a hint to preload the image data somewhere closer to the chip during LoadSamplesToRam?
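A sketch of where such a hint would live, assuming the v0.5 LoadGen Python bindings (the module name, sample counts, and the copy_to_device/load_image helpers are all assumptions):

```python
import mlperf_loadgen  # assumed module name for the LoadGen Python bindings

TOTAL_SAMPLES = 50000   # placeholder dataset size
PERF_SAMPLES = 1024     # placeholder performance sample count
device_cache = {}


def load_samples_to_ram(sample_indices):
    # The LoadSamplesToRam hook is the natural place for such a hint: copy the
    # decoded images into device-local memory before the timed portion starts.
    for idx in sample_indices:
        device_cache[idx] = copy_to_device(load_image(idx))  # hypothetical helpers


def unload_samples_from_ram(sample_indices):
    for idx in sample_indices:
        device_cache.pop(idx, None)


qsl = mlperf_loadgen.ConstructQSL(
    TOTAL_SAMPLES, PERF_SAMPLES, load_samples_to_ram, unload_samples_from_ram)
```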

terms of use typo?

In the current terms of use:
https://github.com/mlperf/policies/blob/master/TERMS%20OF%20USE.md

You may cite either official results obtained from the MLPerf results page or unofficial results measured independently. If you cite an unofficial result you must clearly specify that the result is “Unverified” in text and clearly state “Result not verified by MLPerf” in a footnote. The result must comply with the letter and spirit of the relevant MLPerf rules. For example:

SmartAI Corp announced an estimated score of 0.3 on the MLPerf v0.5 Training Closed Division - Image Classification benchmark using a cluster of 20 SmartChips running MLFramework v4.1 [1].

[1] Result not verified by MLPerf. MLPerf name and logo are trademarks. See www.mlperf.org for more information.

--> Actually, “Unverified” is not in the example text; it only says “estimated”.

unofficial MLPerf results

Motivation: currently MLPerf allows people to use the code and claim performance freely as long as they label it ‘unverified’. Link: https://github.com/mlperf/policies/blob/master/TERMS%20OF%20USE.md
However, it might be really hard to verify any inference claims given the ‘black box’ nature of some inference engines.

A strawperson proposal from Intel for ‘unofficial MLPerf results’:

  • need a PR to MLPerf results repo to show the code
  • OR: run the same code with minor modifications based on a past submission

Increase GNMT query library size to avoid implicit caching.

There are a small number of queries in the GNMT sample set.

In order to avoid all samples being implicitly cached, can we increase the GNMT source queries by having N copies of the samples such that they take up 512MB?

In order to avoid changing the loadgen API for v0.5, we could increase the size of the library itself and allow the performance size to be larger than the accuracy size in the QSL.

cc: @tjablin, @nvmbreughe
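A back-of-the-envelope sketch of the proposed padding; the sample-set size here is a placeholder, not the real GNMT figure:

```python
TARGET_BYTES = 512 * 1024 * 1024       # proposed 512 MB working-set target
sample_set_bytes = 40 * 1024 * 1024    # placeholder size of the GNMT sample set

# Number of copies N of the sample set needed to reach the target size.
n_copies = -(-TARGET_BYTES // sample_set_bytes)   # ceiling division
print(n_copies)  # 13 with this placeholder size
```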

clarification for format conversion in pre-processing

As rules section 7.1 indicates:
May convert data among all the whitelisted numerical formats

and since int8 is a whitelisted format, float-to-int8 conversion is allowed in pre-processing.

We used calibration list 1 to profile the input tensor value range after pre-processing; the min/max range is (-123.68, 151.06). To fit the input tensor into int8, a scale of 0.84 must therefore be applied; otherwise, out-of-range values will be clamped and precision will be affected.

We just want to clarify that this scaling by 0.84 of the OpenCV pre-processed values can be done as part of pre-processing, which is not timed in performance mode.
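For reference, the 0.84 scale follows directly from the profiled range and the int8 range; a quick check of the arithmetic:

```python
# Profiled post-pre-processing range reported above.
min_val, max_val = -123.68, 151.06

# int8 covers [-128, 127]; use the largest scale that keeps both extremes
# representable so that no value is clamped.
scale = min(127.0 / max_val, 128.0 / abs(min_val))
print(round(scale, 2))  # 0.84
```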

goal of server scenario

Is the server scenario for

  1. all devices including phones?
  2. actual hardware targeting datacenter servers?

If (1), I suggest we set a few latency targets and not force everyone to submit to the same latency target. We also need to clarify in the rules that the server scenario is a testing mode, not a real server use case.

If (2), I suggest we set latency targets that represent actual datacenter use cases.

Thanks.

SSD tensorflow checkpoint inconsistent with the reference model

SSD 1200 tensorflow checkpoint: https://zenodo.org/record/3247091#.XQ0S4pJKhGE is inconsistent with the reference model: https://github.com/mlperf/inference/blob/master/v0.5/classification_and_detection/python/models/ssd_r34.py#L295

In the checkpoint, batchnorm is added after each convolution in the additional layers; for example, ssd1200/additional_layers/conv8/conv8_bn1/beta is present in the checkpoint.

But in the reference model, there is no batchnorm after these convolutions: https://github.com/mlperf/inference/blob/master/v0.5/classification_and_detection/python/models/ssd_r34.py#L295

Weight provenance

Weight provenance is not recorded for all of the benchmarks; this is something the README files should include. We need to have the benchmark owners do this.

ReLU in model equivalence

ReLU6 / ReLU8 is an allowed equivalence. We request this be extended to ReLU N for any N, including FLT_MAX.
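For clarity, the family of activations being requested is just a clamped ReLU (a minimal sketch; FLT_MAX recovers an ordinary ReLU):

```python
def relu_n(x, n):
    # Clamped ReLU: relu_n(x, 6) is ReLU6, relu_n(x, 8) is ReLU8, and
    # n = FLT_MAX degenerates to an ordinary (unbounded) ReLU.
    return min(max(x, 0.0), n)
```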

clarify the SW availability

Currently the rule says:
2.3. System and framework must be available
If you are measuring the performance of a publicly available and widely-used system or framework, you must use publicly available and widely-used versions of the system or framework.

If you are measuring the performance of an experimental framework or system, you must make the system and framework you use available upon demand for replication.

--
In training, we allowed PRs to frameworks. How should we reflect this in the rules?
What about closed-source code and binaries? Do we require them to be hosted on an official site? What counts as an official site?

GNMT Beam Width

I believe that reducing the beam width of GNMT is not an approximation for the purposes of MLPerf Inference. Does everyone agree?

Add new reference quantized MobileNet model

Following the WG's discussion on 14/Aug/2019, I suggest we document the request and decision here.

@DilipSequeira:

“We would like potentially to provide [an additional] model we trained for MobileNet INT8, fine-tuned with the canonical image-processing flow, as we think that might recover some of the lost accuracy.”

Add GNMT sentence length distribution to the repository

For the cloud scenario, batching logic wants to predict the marginal latency cost to currently queued queries of waiting for the next inbound request. For the CV benchmarks this can be calculated from just the Poisson rate parameter, but for GNMT the distribution of sequence lengths is also needed.
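One plausible back-of-the-envelope model of that estimate, purely illustrative (the cost model and all names are assumptions, not anything the WG has specified):

```python
def marginal_wait_cost_ms(arrival_rate_qps, queued_queries, per_token_ms, lengths):
    # Each already-queued query pays the expected inter-arrival gap (which the
    # Poisson rate alone gives) plus the expected decode time of the incoming
    # sentence, which is where the sequence-length distribution is needed.
    expected_gap_ms = 1000.0 / arrival_rate_qps
    expected_len = sum(lengths) / len(lengths)
    return queued_queries * (expected_gap_ms + per_token_ms * expected_len)
```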

Question about team members

Hi Results WG, we have a question: we are doing the NMT inference benchmark with some engineers from another company. Can we mention their names (like XXX from XXX) in the results publication?

fp16 preprocessing

Section 7.1 lists fp32 <-> int8 pre-processing conversion as untimed. We would like fp16 to be added as well.

Pruning?

I thought we agreed to disallow pruning, but the rules don't reflect this. Can we please update the rules to either allow or disallow pruning in the Closed division?

LoadGen Language Bindings

I'd like to clarify that the LoadGen language bindings are not part of the LoadGen and may be modified for MLPerf Inference 0.5 Closed division.

clarification on preprocessing text for GNMT

Is mapping the words in a sentence sample to embedding indices an acceptable form of preprocessing when loading text samples to RAM? There are many whitelisted int variants, and conversions to whitelisted data types are permitted :)
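For concreteness, the conversion being asked about is just a vocabulary lookup; a minimal sketch with a placeholder vocabulary:

```python
vocab = {"<unk>": 0, "hello": 1, "world": 2}   # placeholder vocabulary


def to_indices(sentence):
    # Map each token to its integer embedding index; this is the whitelisted
    # int conversion in question, done while loading samples to RAM.
    return [vocab.get(token, vocab["<unk>"]) for token in sentence.split()]


print(to_indices("hello world"))  # [1, 2]
```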

order of operations in SSD

“Is it possible to change operations only if mathematical semantics are preserved? We are targeting the Closed division of the inference submission. For example, in the provided SSD model, concatenation is applied before the NMS operation. During NMS, detections with low scores are filtered out. We can achieve better latency if we filter out detections before concatenation. Fusion of operations is allowed according to the rules; I am curious whether this reordering is allowed or not for the Closed division.”

The WG decided this reordering is allowed, given that the two orderings are mathematically equivalent.
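A simplified sketch (with placeholder tensors and threshold) of why the two orderings feed identical candidates into NMS:

```python
import numpy as np

score_threshold = 0.05                                # placeholder threshold
heads = [np.random.rand(100, 5) for _ in range(6)]    # placeholder per-head [x1, y1, x2, y2, score] rows

# Reference ordering: concatenate all heads, then drop low-score detections.
ref = np.concatenate(heads)
ref = ref[ref[:, 4] >= score_threshold]

# Proposed ordering: drop low-score detections per head, then concatenate.
opt = np.concatenate([h[h[:, 4] >= score_threshold] for h in heads])

assert np.array_equal(ref, opt)   # the same detections reach NMS either way
```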
