Comments (6)

emmericp commented on May 28, 2024

Yes, I've unfortunately seen this exact behavior before.
It only seems to happen on older ixgbe NICs at almost 100% load.
IIRC the rx packets sometimes simply don't get timestamped. Try checking the number of received but not timestamped packets in your debug code; it should be 0 with a loopback cable and proper filters.

It doesn't happen on an X550 or i40e NICs, so it looks like a hardware issue and we can't do anything :(

Let me know if you find anything useful or a work-around. Software timestamping should work.

from libmoon.

atheurer commented on May 28, 2024

BTW, I tried this on i40e, and I still lose some packets: 5.8% at 19.84 Mpps. I will try a SW timestamping approach to see if it is truly losing these packets.

emmericp commented on May 28, 2024

Interesting result, I only tested it on an X710 10 Gbit i40e NIC at 14.88 Mpps (since that was the only one that's currently in a loop-back config in our lab).

emmericp commented on May 28, 2024

I just talked to Franck Baudin at the DPDK Summit about this issue and he stressed the importance of this for you guys in the OPNFV project.
I think I have a work-around by changing the way the statistics are reported.

I unfortunately don't have access to any directly connected ixgbe or XL710 NICs at the moment. I'll set up a test system with a direct connection between two ixgbe ports in my lab on Monday.

I've just tested an X710 (i40e 10 GbE NIC) and this NIC works fine:

[Device: id=4] TX: 14.88 Mpps, 7619 Mbit/s (10000 Mbit/s with framing)
[Device: id=5] RX: 14.88 Mpps, 7619 Mbit/s (10000 Mbit/s with framing)
[Device: id=4] TX: 14.88 Mpps, 7619 Mbit/s (10000 Mbit/s with framing)
[INFO]  Sent 59997 packets, received 59997
Samples: 59997, Average: 533.1 ns, StdDev: 12.8 ns, Quartiles: 526.0/532.0/538.0 ns
[Device: id=5] RX: 14.88 (StdDev 0.00) Mpps, 7619 (StdDev 0) Mbit/s (10000 Mbit/s with framing), total 152856921 packets with 9783802896 bytes (incl. CRC)
[Device: id=4] RX: 0.00 (StdDev 0.00) Mpps, 0 (StdDev 0) Mbit/s (0 Mbit/s with framing), total 0 packets with 0 bytes (incl. CRC)
[Device: id=5] TX: 0.00 (StdDev 0.00) Mpps, 0 (StdDev 0) Mbit/s (0 Mbit/s with framing), total 0 packets with 0 bytes (incl. CRC)
[Device: id=4] TX: 14.88 (StdDev 0.00) Mpps, 7619 (StdDev 0) Mbit/s (10000 Mbit/s with framing), total 152856921 packets with 9783802896 bytes (incl. CRC)
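
The rates in this output can be cross-checked with a little arithmetic. The sketch below assumes 64-byte frames (including CRC) and the standard 20 bytes of per-frame wire overhead (preamble, start-of-frame delimiter, inter-frame gap); the frame size is an assumption, it is not stated in the log itself, and the function names are only illustrative.

```python
# Per-frame overhead on the wire that is not part of the frame itself:
# 7 B preamble + 1 B start-of-frame delimiter + 12 B inter-frame gap.
WIRE_OVERHEAD_BYTES = 20


def line_rate_pps(link_bps: float, frame_bytes: int) -> float:
    """Maximum frames per second for a given frame size (incl. CRC)."""
    return link_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8)


def throughput_mbit(pps: float, frame_bytes: int) -> float:
    """Throughput in Mbit/s counting frame bytes only (no preamble/gap)."""
    return pps * frame_bytes * 8 / 1e6


# Assumed: 10 GbE link, 64-byte frames.
pps = line_rate_pps(10e9, 64)
print(f"{pps / 1e6:.2f} Mpps")                    # 14.88 Mpps
print(f"{throughput_mbit(pps, 64):.0f} Mbit/s")   # 7619 Mbit/s
```

Under these assumptions, 14.88 Mpps and 7619 Mbit/s are exactly what a fully loaded 10 GbE link should show, which is consistent with the "10000 Mbit/s with framing" figure in the log.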

emmericp commented on May 28, 2024

Okay, I've set up a few loopback connections and found the following:

  • it only happens when the NIC is fully loaded in terms of packets/s
  • filtering the packets (e.g. incorrect dst mac and disabling promisc) "solves" this
  • there is also a very rare case where a timestamp is not taken (but the packet is received correctly) without any overload scenario. This happens every 50k to 100k packets or so.
  • NICs affected: XL710 (40 GbE), 82599 (10 GbE)
  • NICs not affected: X710 (10 GbE), X540/X550 (10 GbE 10GBASE-T)
  • there are two types of losses: packets not received at all and packets not timestamped
  • the device rx and tx counters always match, i.e. MoonGen counts all packets correctly even if they are not timestamped or received properly

Specific to 82599 NICs:

  • it both loses packets and fails to timestamp packets that are received, for example
  • [INFO] Sent 2638 packets, received 2139 packets, got 2086 timestamps
  • the worst case is minimum-sized packets

Specific to XL710 NICs:

  • it only loses packets, but timestamps all packets that are received correctly, for example
  • [INFO] Sent 22803 packets, received 22716 packets, got 22716 timestamps
  • the worst case is again minimum-sized packets, which the NIC doesn't handle very well at all anyway
  • it is very likely that the NIC is simply better at sending packets than at receiving them -- this NIC is full of such weird hardware limits...
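
To make the two kinds of loss in the example lines above easier to compare, here is a small sketch that splits an "[INFO] Sent ..." line into packets lost on the wire and packets received without a timestamp. The parser, the class name, and its fields are hypothetical helpers written for this illustration; they are not part of MoonGen or libmoon.

```python
import re
from dataclasses import dataclass

# Matches the summary line format quoted in the lists above.
INFO_RE = re.compile(
    r"Sent (\d+) packets, received (\d+) packets, got (\d+) timestamps"
)


@dataclass
class TimestampRun:
    sent: int
    received: int
    timestamped: int

    @property
    def lost(self) -> int:
        """Packets that were sent but never received."""
        return self.sent - self.received

    @property
    def untimestamped(self) -> int:
        """Packets that were received but carry no timestamp."""
        return self.received - self.timestamped

    @property
    def sample_yield(self) -> float:
        """Fraction of sent packets that produced a latency sample."""
        return self.timestamped / self.sent


def parse_info(line: str) -> TimestampRun:
    match = INFO_RE.search(line)
    if match is None:
        raise ValueError(f"not a timestamp summary line: {line!r}")
    return TimestampRun(*map(int, match.groups()))


runs = {
    "82599": "[INFO] Sent 2638 packets, received 2139 packets, got 2086 timestamps",
    "XL710": "[INFO] Sent 22803 packets, received 22716 packets, got 22716 timestamps",
}
for nic, line in runs.items():
    run = parse_info(line)
    print(f"{nic}: lost={run.lost}, untimestamped={run.untimestamped}, "
          f"yield={run.sample_yield:.1%}")
```

For the two lines quoted above this gives 499 lost and 53 untimestamped packets on the 82599, versus 87 lost and 0 untimestamped on the XL710.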

To conclude:

So I believe that this is not a big problem; it merely reduces the sample rate for timestamps at high packet loads. Use the device Rx/Tx counters to report throughput and packet loss, and ignore non-timestamped packets: this simply means that the sample rate will be lower under full load. Certainly not a good thing, but it doesn't look like we can do better with this hardware.

Maybe it's also possible to install an explicit drop filter (like the commented out :setPromisc(false) call in the example script) for non-timestamped packets. I'm however not sure if that works and if the counters still work (they don't with promisc = false, but I think with an fdir filter they should).

BTW: timestamping at full load is not a useful scenario in many cases. For example, if you are forwarding between two ports with the same speed, then buffers might fill up due to short interruptions on the DuT and it's not possible for the DuT to "catch up" since the packets are coming in at the same rate that they can be sent out. This will be visible as an increasing latency over time for no obvious reason.

atheurer commented on May 28, 2024

Thanks for all the testing and information. Initially this problem was quite severe around 10 Mpps (losing every single latency packet), but that was on a much older version of MoonGen/DPDK. More recent versions were significantly better, only seeing a small percentage of loss. I'll run your test script on the latest code just to be sure I am seeing the same thing.

I agree that time-stamping at full load might not be useful, if the DUT cannot sustain 0 packet loss. However, we tune the DUT quite extensively to obtain 0-packet loss, and typically test this for 2 hours, and sometimes 12 hours or more. Technically this is not full load, because we need a DUT to process packets at a slightly higher rate than it is receiving, so that when there is some preemption, and buffer use increases, the buffer can later be "drained" before the next preemption happens. But, at this maximum, sustained, no-loss rate, we really do want to have a good characterization of latency.
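
The "drain" argument can be put into numbers with a rough sketch. It models the DUT as a single queue with a constant arrival rate and a constant service rate, which is a simplification, and every number in the usage example is made up for illustration rather than measured.

```python
import math


def backlog_after_stall(arrival_pps: float, stall_s: float) -> float:
    """Packets queued while the DUT is preempted for stall_s seconds."""
    return arrival_pps * stall_s


def drain_time_s(backlog: float, arrival_pps: float, service_pps: float) -> float:
    """Time to empty the backlog while packets keep arriving.

    Only the headroom (service rate minus arrival rate) works the queue
    down. With no headroom the backlog never drains.
    """
    headroom = service_pps - arrival_pps
    if headroom <= 0:
        return math.inf
    return backlog / headroom


# Hypothetical numbers: 100 us preemption, DUT can process 14.88 Mpps.
stall = 100e-6
for arrival in (14.0e6, 14.88e6):
    backlog = backlog_after_stall(arrival, stall)
    t = drain_time_s(backlog, arrival, 14.88e6)
    print(f"arrival {arrival / 1e6:.2f} Mpps: backlog {backlog:.0f} packets, "
          f"drain time {t * 1e3:.2f} ms")
```

With some headroom the backlog from a stall drains in finite time; with the arrival rate equal to the service rate the drain time is infinite, so every stall adds latency that is never recovered.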
