
Comments (13)

tcal-x commented on May 16, 2024

I am able to get correct behavior if I make these changes:

  • in gateware/sequencing.py, change Signal(11) to Signal(12) in three places
  • in gateware/mnv2_cfu.py, change FILTER_DATA_MEM_DEPTH from 512 to 2048

@AvG, I'll let you make the actual fix; I'm not sure my edits are the best solution.
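Roughly, the two changes amount to something like this (a sketch only, using the nMigen Signal API that gateware/sequencing.py is built on; the names below follow this thread rather than the actual module structure):

    # Sketch only, not the actual gateware fix.
    from nmigen import Signal

    FILTER_DATA_MEM_DEPTH = 2048   # was 512 in gateware/mnv2_cfu.py

    # Was Signal(11): 11 bits hold 0..2047, so a count of 2048 wraps to 0.
    f_count_max = Signal(12)       # 12 bits hold 0..4095, so 2048 fits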


tcal-x commented on May 16, 2024

I looked through the assembly and found a sequence known to cause problems: a conditional branch immediately followed by a CFU instruction. So I added some no-ops in the source code in front of the CFU instruction, and verified that these translated into assembly (they did). But the hang still occurred, so this sequence does not seem to be the cause.


tcal-x commented on May 16, 2024

[waveform screenshot: waves_hang]


tcal-x commented on May 16, 2024

The hang occurs in the pixel loop, 100+ pixels into the computation.

Each iteration, there are

  • 4 x CFU_STORE_INPUT_VALUE
  • 1 x CFU_MACC_RUN
  • 8 x CFU_GET_OUTPUT

In all earlier pixels, including the first full iteration shown in the image above, all CFU instructions respond immediately, with rsp_valid asserted right away. However, in the last two iterations, things seem to slow down. Starting at the 3rd of the 8 CFU_GET_OUTPUT instructions, there is a delay of about 10 cycles between cmd_valid being asserted and the corresponding rsp_valid assertion.

In the final pixel, the first CFU_STORE_INPUT_VALUE is sent, but no reply ever occurs.


tcal-x commented on May 16, 2024

More findings: the "level" of the output queue FIFO seems corrupted from the start of the waveform capture. Its starting value is 0x1f8, out of a maximum of 0x1ff. So apparently it either underflowed at some earlier point or was left almost full. (I started the waveform capture on the second full inference, since this is where the hang typically occurred.)

So for the first 100+ pixels, the output data comes out immediately with no delay, since the output queue contains data. When I observed things "slow down", the output queue had finally emptied and was waiting for new items to be produced. That timing is probably what we should expect for the entire run.

The actual error likely occurred before the waveform capture started.
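As a purely illustrative sanity check of the underflow theory (this assumes a 512-deep output queue whose level register simply wraps modulo the depth; it is not taken from the gateware):

    # Illustrative only: model a level counter that wraps on underflow.
    QUEUE_DEPTH = 512                      # assumed depth; max level seen is 0x1ff
    level = 0                              # queue starts empty
    for _ in range(8):                     # one pixel's worth of CFU_GET_OUTPUT
        level = (level - 1) % QUEUE_DEPTH  # pop with no empty check
    print(hex(level))                      # 0x1f8, the starting value in the capture

Eight unguarded pops from an empty queue would land exactly on the 0x1f8 observed, which is at least consistent with an underflow before the capture started.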


tcal-x commented on May 16, 2024

Ok, I had missed some key data and misdiagnosed the situation here. I had noticed that adding a printf inside the pixel loop caused the hang to disappear. BUT, the results were still incorrect. So now I think it IS a logic/software error that is causing the CFU to get into a bad state when running the pdti8 model, presumably because of tensor shapes that occur only in the pdti8 model.

To verify, I programmed the board to ensure the CFU was in its initial state. I ran the mnv2 golden tests twice, and got correct results. Then I ran the pdti8 "zeros input" test. Then I ran the mnv2 golden test again; this time incorrect results were reported.

@AvG, I suppose we could argue that we have specialized the CFU to run only mnv2 correctly, that behavior on other models is undefined, and that the system does not even check whether unsupported tensor shapes occur. Certainly in a production setting where code size is limited, we may want to eliminate the non-specialized version of a kernel if it is never called. But I'm not sure that's appropriate here in CFU-Playground.


alanvgreen commented on May 16, 2024


alanvgreen commented on May 16, 2024

Looking at a simulator trace, I'm seeing two invocations of CFU_MACC_RUN per conv2d pixel instead of one.

[simulator trace screenshot]

According to the code, the order in each pixel iteration is:

  1. LoadInputs (ins_set)
  2. Run Calc (start_run)
  3. Get from output queue (oq_get)

    for (int p = 0; p < num_pixels; p++) {
      // Load twice on first loop, no load on last loop and once every other
      // time.
      if (p == 0) {
        LoadInputValues(input_ptr, input_depth_words);
      }
      if (p != num_pixels - 1) {
        LoadInputValues(input_ptr, input_depth_words);
      }

      CFU_MACC_RUN();
      UnloadOutputValues(output_ptr, batch_size / 4);
      output_ptr += (output_depth - batch_size) / 4;
    }

In summary, it seems that speculative execution and stateful CFUs don't mix.

Could we maybe try a completely unpipelined VexRiscv?


tcal-x commented on May 16, 2024

That is a very puzzling trace. I don't see any way an extra CFU_MACC_RUN() could be executed between the unloading and the next loading.

There are two possible scenarios that we know of where a CFU instruction is executed when it shouldn't be:

  • A conditional branch is taken to hop around an immediately following CFU instruction, but because the branch is not evaluated soon enough, the following instruction is sent to the CFU incorrectly. However, I have versions of VexRiscv that eliminate this possibility, and I still see the hang. Also, this is not exactly the scenario here, since the CFU_MACC_RUN() is not conditional.

  • A CFU instruction immediately follows an instruction that causes a data cache miss. In this case, the CPU squashes and re-executes the instruction, but as above, we can't actually squash a CFU instruction once it has been sent to the CFU. So it is executed twice in rapid succession when it should have been executed just once. But that is not the case here either.

Of course we might have multiple bugs here. Let me merge the VexRiscv with earlyBranch=true to eliminate one of the issues.


tcal-x commented on May 16, 2024

[annotated waveform capture: wave_annot]

This is a waveform I captured from the first pdti8 inference I ran. I captured waves only during conv_2d. The signals pointed out seem to get wonky after the 6th conv_2d layer.


alanvgreen commented on May 16, 2024

Are you also seeing the additional starts in the first conv2d?


tcal-x commented on May 16, 2024

I don't see the extra starts now.

I did notice that f_count_max is zero where I noticed things getting strange (6th conv2d). The count being loaded is 2048, which doesn't fit in the [10:0] representation.
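As a quick sanity check in plain Python (illustrative only, mirroring the Signal(11) to Signal(12) widening mentioned at the top of this thread):

    # 2048 does not fit in an 11-bit [10:0] field, so the loaded count reads as zero.
    count = 2048
    print(count & ((1 << 11) - 1))   # 0    -- what an 11-bit f_count_max stores
    print(count & ((1 << 12) - 1))   # 2048 -- fits after widening to 12 bits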


tcal-x commented on May 16, 2024

Verified the fix works on the board.

