
Comments (4)

Roger-Shepherd commented on August 30, 2024

Well LaurelinTheGold, you've raised a can of worms here. I thought the situation was unclear but explicable; however, it looks like we have ended up with a discrepancy between theory (documentation) and practice. Anyhow, this is what I think is going on; I may be wrong. But whether I'm right or wrong, your point that we have a documentation problem remains valid.

When I got Embench working on an Apple Mac I also found the documentation about CPU_MHZ and normalised times confusing. The documentation is not helped by reporting speed in ms: ms is a unit of time, and speed is 1/time. Fortunately the code says what actually happens and, eventually, I understood it (I think).

Before building on your comments, I want to address the number of iterations performed by each program in the suite. This is determined by the product of two numbers, LOCAL_SCALE_FACTOR and CPU_MHZ. LOCAL_SCALE_FACTOR is defined for each program and its purpose is to make the execution time of the program on a nominal 1 MHz processor around 4 s. (4 s is chosen because it is long enough to make time measurements reliable, and short enough that the whole benchmark suite can be run in a reasonable time.) CPU_MHZ is defined per target (processor) and its purpose is to scale the number of iterations so that a fast processor still takes around 4 s to run the program. [Detail: the number of iterations is a compile-time constant. This was to avoid problems with embedded systems where it would be difficult to pass a parameter at run time.]
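As a minimal sketch of this arithmetic (Python, with invented values; LOCAL_SCALE_FACTOR and CPU_MHZ are the real macro names, but the numbers are made up):

    # LOCAL_SCALE_FACTOR is tuned per benchmark so a nominal 1 MHz
    # processor takes ~4 s; CPU_MHZ is set per target.
    LOCAL_SCALE_FACTOR = 178      # per benchmark (invented value)
    CPU_MHZ = 16                  # per target, e.g. a 16 MHz Cortex M4

    # In the C sources this product is a compile-time constant, so no
    # run-time parameter needs to be passed on the embedded target.
    iterations = LOCAL_SCALE_FACTOR * CPU_MHZ
    print(iterations)             # 2848: 16x the work, still ~4 s at 16 MHz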

Moving to your comments:

  1. Yes. The assumption is that, once the caches are warmed, execution time is linear in the number of iterations. There is a proposal on the table for Embench V2.0 that we run each program N times and 2N times, and use the difference as the (nominal) time for N iterations, avoiding a separate warming phase.
  2. Strictly, the time is determined by the routines start_trigger and stop_trigger, interpreted by decode_results in the Python file indicated by `--target-module`. For the Mac I use the OS function `clock_gettime`, which gives me real time, independent of frequency. For systems which count cycles, the relationship `time = cycles/frequency` is used.
  3. The documentation says this but it never happens.
  4. True
  5. In fact the non-normalized times are used (which yields the same result):
     rel_data[bench] = baseline[bench] / raw_data[bench]  # line 296 of benchmark_speed.py
     The resulting score says how many times faster the platform being measured is than the reference.
  6. Yes.
  7. This is where I get confused.... The user guide in doc/README.md says:

"The reference CPU is an Arm Cortex M4 processor .... The reference platform is a ST Microelectronics STM32F4 Discovery Board ... using its default clock speed of 16MHz".

"The benchmark value is the geometric mean of the relative speeds. A larger value means a faster platform. The range gives an indication of how much variability there is in this performance."

"In addition the geometric mean may then be divided by the value used for CPU_MHZ, to yield an Embench score per MHz.""

  7. (continued) So by definition the reference platform's benchmark value must be 1.0, and the Embench score per MHz must be 1.0/16, which is 0.0625 (a sketch of this arithmetic follows this list). However, the reported results in embench/embench-iot-results include a couple of 16 MHz M4s which report speeds of 14.4 and 16.0, and speeds per MHz of 0.9 and 0.93 respectively. These are wrong by the definition above, and look like they are using a 1 MHz processor as the reference, i.e. as you describe.
  8. I think the definition in the documentation (quoted above) says something different, but you are describing what actually seems to be done.
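To make the arithmetic behind point 7 concrete, here is a small sketch (Python; benchmark names invented, but "geometric mean of the relative speeds" is the documented definition):

    from math import prod

    CPU_MHZ = 16   # the reference M4's clock

    # Measured against itself, the reference platform's relative speeds
    # are all baseline/raw = 1.0 ...
    rel = {"aha-mont64": 1.0, "crc32": 1.0, "nettle-aes": 1.0}
    benchmark_value = prod(rel.values()) ** (1 / len(rel))   # 1.0 by definition

    # ... so the documented score per MHz for the reference must be 1/16.
    score_per_mhz = benchmark_value / CPU_MHZ                # 0.0625, not ~0.9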

Clearly there is a problem with the documentation. People (maybe me, maybe others) are misinterpreting it, to the extent that I think wrong results are being published. Regarding your suggestions:

  1. "emphasize the difference between the real time and the normalized time" I agree. The current state of affairs is not clear. (see my final comment also)
  2. "clarify that the workload is scaled by cpu clock" I agree.
  3. "clarify that 4 implies the baseline speed score will be 16" I think this is a change, not a clarification (but maybe I've misunderstood the wording)
  4. "clarify that 8 implies 6 so while 5 is true, the speed score is not what benchmark_speed.py outputs" I disagree. The score output is (I think) what the documentation defines.
  5. "clarify that 8 implies we do not need to worry about normalizing the time at all since we can just multiply the speed output by the cpu clock". Except I think this is wrong; it is what people are doing, but it seems to contradict the documentation.

I suspect we have to change the documentation to match usage, although from a quick look at a couple of papers which use Embench it seems people quote figures relative to their own baseline, so perhaps we can keep the definitions we have and correct our published results. (Or maybe I've got things wrong here.)

Personally, I think we should define an Embench MIP which is the speed of a nominal 1 MIP processor running Embench (anyone want to port Embench to a vintage VAX 11/780?). This can be just a fudge factor to the reference score: if people think a 16 MHz M4 is a worthy 1 MIP processor, then the reference platform would be a 16 Embench MIP platform.

Finally, for V2, I think we should have a scheme where there is a per-benchmark normalisation factor NF which works in place of CPU_MHZ to scale the execution time to about 4 s; that is, the number of iterations would be LOCAL_SCALE_FACTOR * NF and the reported time would be actual time / NF. This allows differences in the performance characteristics of processors to be accommodated.
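A sketch of how that might look (all names and numbers below are invented; nothing in the current scripts implements this):

    # Proposed V2 scheme: a per-benchmark normalisation factor NF replaces
    # CPU_MHZ in the iteration count, and the reported time is rescaled by it.
    LOCAL_SCALE_FACTOR = 178          # per benchmark, as today (invented)
    NF = 40                           # per-benchmark normalisation factor (invented)

    iterations = LOCAL_SCALE_FACTOR * NF    # chosen so the run takes ~4 s
    actual_time_s = 3.95                    # measured elapsed time (invented)
    reported_time_s = actual_time_s / NF    # comparable across targets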


LaurelinTheGold commented on August 30, 2024

Hi Roger, most of what you are saying makes sense, but I will disagree on point 5, that normalized time ratios yield the same results as unnormalized time ratios.

If the chip has a clock of C and the baseline benchmark takes N cycles, then the scaled cycle count is CN and the time taken to run is CN/C = N. The normalized time to run would be N/C (C being unitless). If we set the baseline normalized time B = N_0/C_0, the ratio of the normalized times is B/(N/C) = BC/N = N_0C/(NC_0). The ratio of the real times is N_0/N = scorepermhz. Even if the baseline clock is set to 1 MHz, the normalized time still gives scorepermhz * C, where C is the clock speed of the chip being tested.
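A quick numeric check (all numbers invented, with C_0 = 1 as in the apparent 1 MHz reference):

    # Baseline: clock C0 [MHz] and raw time N0; device under test: C and N.
    C0, N0 = 1, 4000
    C,  N  = 16, 2000

    raw_ratio  = N0 / N                 # 2.0  -> ratio of real (raw) times
    norm_ratio = (N0 / C0) / (N / C)    # 32.0 -> ratio of normalized times

    # The normalized ratio is the raw ratio multiplied by the test clock C,
    # so the two conventions do not yield the same result.
    assert norm_ratio == raw_ratio * C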

Is there a better way of finding papers that use Embench other than searching Google Scholar for "embench"?

I am not experienced enough in benchmarking lore to worry about V2 yet.

Thanks for the reply!


Roger-Shepherd commented on August 30, 2024

LaurelinTheGold,

"... I will disagree on point 5 that normalized time ratios yields the same results as unnormalized time ratios.

If the chip has a clock of C and the baseline benchmark takes N cycles, then the scaled cycles is CN and the time taken to run is CN/C=N. The normalized time to run would be N/C (C being unitless). If we set the baseline normalized time B=N_0/C_0, the ratio of the normalized times is B/(N/C)=BC/N=N_0C/(NC_0). The ratio of the real times is N/N_0=scorepermhz. Even if the baseline clock is set to 1MHz, the normalized time still gives scorepermhz*C where C is the clock speed of the chip being tested.

You are right.

Quickly thinking about my responses to 6, 7, and 8:

  6. We are in agreement about 6 and I think we are correct.

  7. I need to work through this one.

  8. The problem is that the baseline (reference platform) is 16 MHz.

You being right about 5 means I don't understand how the reporting is working! Line 549 of embench-iot/doc/README.md says "These computations are carried out by the benchmark scripts". I can't see that the benchmark scripts in embench/embench-iot do this, and from a quick look at embench/embench-iot-results I can't see a solution there either. In particular, I can't see how the normalised reference results are produced. If the reference platform is run using --baseline-output, the non-normalised results are output. There is a comment (line 243 in embench-iot-results/embres/data.py) which says "# Speed data in file is per MHz" (i.e. normalised), but the only way I can see that being true is if the results have been edited; the results produced by benchmark_speed.py aren't normalised.
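For what it's worth, the editing step that would have had to happen is just a division by CPU_MHZ. A hypothetical sketch (not code from either repository; names and values invented):

    # Hypothetical post-processing that would make stored baseline times
    # "per MHz", reconciling the data.py comment with benchmark_speed.py's
    # non-normalised output. Nothing in the scripts appears to do this.
    CPU_MHZ = 16                                        # reference M4 clock
    raw_baseline = {"aha-mont64": 4004, "crc32": 4010}  # ms, invented values
    per_mhz_baseline = {b: t / CPU_MHZ for b, t in raw_baseline.items()}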

"Is there a better way of finding papers that use Embench other than searching Google Scholar for "embench"?"

Not that I know of.


hirooih commented on August 30, 2024

@LaurelinTheGold, and @Roger-Shepherd,

I was also confused by the description in the "Computing a benchmark value for speed" section. Your discussion above helped me to understand it.

Here is my summary.


  • baseline: score of Cortex M4

size

  • absolute score
• size of text segments [bytes]
    • smaller is better
  • relative score
    • rel_data[bench] = raw_totals[bench] / baseline[bench]
    • smaller is better

speed

CPIter[bench] := cycles per iteration of each benchmark
LOCAL_SCALE_FACTOR[bench] := 4,000,000 / (CPIter[bench] of Cortex M4)

  • absolute score
    • elapsed time of a loop of (LOCAL_SCALE_FACTOR[bench] * CPU_MHZ) iterations [ms]
    • smaller is better
    • == (total number of cycles of a benchmark) / CPU_MHZ / 1000
      == (CPIter[bench] * LOCAL_SCALE_FACTOR[bench] * CPU_MHZ) / CPU_MHZ / 1000
      == CPIter[bench] * LOCAL_SCALE_FACTOR[bench] / 1000
      == raw_data[bench] (=~ 4000 for Cortex M4 at -O2)
  • relative score (speed per MHz)
    • rel_data[bench] = baseline[bench] / raw_data[bench]
    • larger is better
  • speed: (relative score) * CPU_MHZ
    • larger is better
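Putting that summary into runnable form (a sketch with invented times; rel_data and raw_data mirror the names used in benchmark_speed.py):

    from math import prod

    CPU_MHZ = 80   # device under test (invented)
    baseline = {"aha-mont64": 4004, "crc32": 4010}   # reference raw times [ms] (invented)
    raw_data = {"aha-mont64": 1001, "crc32": 802}    # measured raw times [ms] (invented)

    # Relative score per benchmark: larger is better, and it is already a
    # per-MHz figure because the workload itself scales with CPU_MHZ.
    rel_data = {b: baseline[b] / raw_data[b] for b in raw_data}

    speed_per_mhz = prod(rel_data.values()) ** (1 / len(rel_data))  # geometric mean
    speed = speed_per_mhz * CPU_MHZ                                 # larger is faster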

The current last two paragraphs read:

The benchmark value is the geometric mean of the relative speeds. A larger
value means a faster platform. The range gives an indication of how much
variability there is in this performance.

In addition the geometric mean may then be divided by the value used for
CPU_MHZ, to yield an Embench score per MHz. This is an indication of the
efficiency of the platform in carrying out computation.

How about changing them as follows?

The geometric mean yields an Embench speed score per MHz. This is an indication
of the efficiency of the platform in carrying out computation. A larger
value means a more efficient platform. The range gives an indication of how much
variability there is in this performance.

In addition, the geometric mean is then multiplied by the value used for
CPU_MHZ. A larger value means a faster platform.

If you agree with me, shall I send a PR?

