
Comments (13)

tcal-x commented on May 16, 2024

I am able to get correct behavior if I make these changes:

  • in gateware/sequencing.py, change Signal(11) to Signal(12) in three places
  • in gateware/mnv2_cfu.py, change FILTER_DATA_MEM_DEPTH from 512 to 2048

@AvG, I'll let you make the actual fix; I'm not sure my edits are the best solution.
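Roughly, the two changes amount to something like this (a sketch only, using the nMigen Signal API that gateware/sequencing.py is built on; the names below follow this thread rather than the actual module structure):

    # Sketch only, not the actual gateware fix.
    from nmigen import Signal

    FILTER_DATA_MEM_DEPTH = 2048   # was 512 in gateware/mnv2_cfu.py

    # Was Signal(11): 11 bits hold 0..2047, so a count of 2048 wraps to 0.
    f_count_max = Signal(12)       # 12 bits hold 0..4095, so 2048 fits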


tcal-x commented on May 16, 2024

I looked through the assembly and found a sequence known to cause problems: a conditional branch immediately followed by a CFU instruction. So I added some no-ops in the source code in front of the CFU instruction, and verified that these translated into assembly (they did). But the hang still occurred, so this sequence does not seem to be the cause.


tcal-x commented on May 16, 2024

[waveform screenshot: waves_hang]


tcal-x commented on May 16, 2024

The hang occurs in the pixel loop, 100+ pixels into the computation.

Each iteration, there are

  • 4 x CFU_STORE_INPUT_VALUE
  • 1 x CFU_MACC_RUN
  • 8 x CFU_GET_OUTPUT

In all earlier pixels, including the first full iteration shown in the image above, all CFU instructions respond immediately, with rsp_valid asserted right away. However, in the last two iterations, things seem to slow down. Starting at the 3rd of the 8 CFU_GET_OUTPUT instructions, there is a delay of about 10 cycles between cmd_valid being asserted and the corresponding rsp_valid assertion.

In the final pixel, the first CFU_STORE_INPUT_VALUE is sent, but no reply ever occurs.


tcal-x commented on May 16, 2024

More findings: the "level" of the output queue FIFO seems corrupted from the start of the waveform capture. Its starting value is 0x1f8, out of a maximum of 0x1ff. So apparently it either underflowed at some earlier point or was left almost full. (I started the waveform capture on the second full inference, since this is where the hang typically occurred.)

So for the first 100+ pixels, the output data comes out immediately with no delay, since the output queue contains data. When I observed things "slow down", the output queue had finally emptied and was waiting for new items to be produced. That timing is probably what we should expect for the entire run.

The actual error likely occurred before the waveform capture started.
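As a purely illustrative sanity check of the underflow theory (this assumes a 512-deep output queue whose level register simply wraps modulo the depth; it is not taken from the gateware):

    # Illustrative only: model a level counter that wraps on underflow.
    QUEUE_DEPTH = 512                      # assumed depth; max level seen is 0x1ff
    level = 0                              # queue starts empty
    for _ in range(8):                     # one pixel's worth of CFU_GET_OUTPUT
        level = (level - 1) % QUEUE_DEPTH  # pop with no empty check
    print(hex(level))                      # 0x1f8, the starting value in the capture

Eight unguarded pops from an empty queue would land exactly on the 0x1f8 observed, which is at least consistent with an underflow before the capture started.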


tcal-x commented on May 16, 2024

Ok, I had missed some key data and misdiagnosed the situation here. I had noticed that adding a printf inside the pixel loop caused the hang to disappear. BUT, the results were still incorrect. So now I think it IS a logic/software error that is causing the CFU to get into a bad state when running the pdti8 model, presumably because of tensor shapes that occur only in the pdti8 model.

To verify, I programmed the board to ensure the CFU was in its initial state. I ran the mnv2 golden tests twice, and got correct results. Then I ran the pdti8 "zeros input" test. Then I ran the mnv2 golden test again; this time incorrect results were reported.

@AvG, I suppose we could argue that we have specialized the CFU to run only mnv2 correctly, that behavior on other models is undefined, and that the system does not even check whether unsupported tensor shapes occur. Certainly in a production setting where code size is limited, we may want to eliminate the non-specialized version of a kernel if it is never called. But I'm not sure that's appropriate here in CFU-Playground.


alanvgreen commented on May 16, 2024


alanvgreen commented on May 16, 2024

Looking at a simulator trace, I'm seeing two invocations of CFU_MACC_RUN per conv2d pixel instead of one.

[simulator trace screenshot]

According to the code, the order in each pixel iteration is:

  1. LoadInputs (ins_set)
  2. Run Calc (start_run)
  3. Get from output queue (oq_get)

    for (int p = 0; p < num_pixels; p++) {
      // Load twice on first loop, no load on last loop and once every other
      // time.
      if (p == 0) {
        LoadInputValues(input_ptr, input_depth_words);
      }
      if (p != num_pixels - 1) {
        LoadInputValues(input_ptr, input_depth_words);
      }

      CFU_MACC_RUN();
      UnloadOutputValues(output_ptr, batch_size / 4);
      output_ptr += (output_depth - batch_size) / 4;
    }

In summary, it seems that speculative execution and stateful CFUs don't mix.

Could we maybe try a completely unpipelined VexRiscv?


tcal-x commented on May 16, 2024

That is a very puzzling trace. I don't see any way an extra CFU_MACC_RUN() could be executed between the unloading and the next loading.

There are two possible scenarios that we know of where a CFU instruction is executed when it shouldn't be:

  • A conditional branch is taken to hop around an immediately following CFU instruction, but because the branch is not evaluated soon enough, the following instruction is sent to the CFU incorrectly. However, I have versions of VexRiscv that eliminate this possibility, and I still see the hang. Also, this is not exactly the scenario here, since the CFU_MACC_RUN() is not conditional.

  • A CFU instruction immediately follows an instruction that causes a data cache miss. In this case, the CPU squashes and re-executes the instruction, but as above, we can't actually squash a CFU instruction once it has been sent to the CFU. So it is executed twice in rapid succession when it should have been executed just once. But that is not the case here either.

Of course we might have multiple bugs here. Let me merge the VexRiscv with earlyBranch=true to eliminate one of the issues.


tcal-x commented on May 16, 2024

[annotated waveform capture: wave_annot]

This is a waveform I captured from the first pdti8 inference I ran. I captured waves only during conv_2d. The signals pointed out seem to get wonky after the 6th conv_2d layer.


alanvgreen commented on May 16, 2024

Are you also seeing the additional starts in the first conv2d?


tcal-x commented on May 16, 2024

I don't see the extra starts now.

I did notice that f_count_max is zero where I noticed things getting strange (6th conv2d). The count being loaded is 2048, which doesn't fit in the [10:0] representation.
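As a quick sanity check in plain Python (illustrative only, mirroring the Signal(11) to Signal(12) widening mentioned at the top of this thread):

    # 2048 does not fit in an 11-bit [10:0] field, so the loaded count reads as zero.
    count = 2048
    print(count & ((1 << 11) - 1))   # 0    -- what an 11-bit f_count_max stores
    print(count & ((1 << 12) - 1))   # 2048 -- fits after widening to 12 bits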


tcal-x commented on May 16, 2024

Verified the fix works on the board.

