Comments (31)
manual latency compensation examples in #530 (comment)
---
DRTIO enables increased flexibility at the cost of an incremental increase in complexity, in the form of implementation-dependent latency compensation. As @jordens pointed out, the ARTIQ plan is for an elegant mechanism to handle this issue.
There are many features of DRTIO/Sinara that have new user-dependent latencies. Some examples:
- DRTIO over microTCA AMC backplane, per-slot latency
- DRTIO over variable-length optical fiber (SFP)
- SPI, TTL over variable-length VHDCI cables to VHDCI Carrier (and 3U peripherals)
Latency compensation is already a common feature in v2.x ARTIQ experiment code used in the lab (e.g. for DDSs). Labs using Sinara will incur an experiment code rewrite to migrate from v2.x to v3.0. It is a good economy for end users to also adopt a new latency compensation scheme at this juncture.
It looks like latency compensation is no longer a 3.0 milestone. I'd like to see it included in 3.0. Thoughts @dhslichter, @r-srinivas, @dleibrandt, @hartytp, @cjbe?
---
Nope. There is no funding for it plus we need to get 3.0 out soon, and things like #685 are already a major difficulty.
---
I don't want to delay 3.0 any further with this, and as @sbourdeauducq points out there are many more issues to be addressed with higher priority for users. While figuring out the latencies is an important issue to address in the fullness of time, it is also something that can be addressed simply, with the current ARTIQ, by end users -- you just wrap pulse/on/off methods with something that looks at the channel being operated on and adds/subtracts a deterministic amount looked up from a calibration dataset. If your pulse durations/starts are being calculated on the fly by the core device, this will slow things a little, but if they are known at compile time the addition should be done in compilation and there is no performance hit.
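A minimal, hardware-free sketch of that wrapping idea (all names here — `latencies_mu`, `compensated_events` — are hypothetical, not ARTIQ API): the wrapper shifts each event earlier by a per-channel latency looked up from a calibration table before it is emitted.

```python
# Hypothetical calibration data: per-channel latencies in machine units,
# as might be loaded from a calibration dataset.
latencies_mu = {"ttl0": 25, "dds0": 1200}

def compensated_events(program, latencies_mu):
    """program: list of (channel, timestamp_mu) pairs.
    Returns the timestamps actually emitted, shifted earlier by each
    channel's calibrated latency (channels without an entry are unshifted)."""
    return [(ch, t - latencies_mu.get(ch, 0)) for ch, t in program]

events = compensated_events([("ttl0", 1000), ("dds0", 2000)], latencies_mu)
# ttl0 emitted at 975, dds0 at 800: both actions land at their nominal times
```

If the latencies are known at compile time, the subtraction folds into the constant timeline arithmetic and costs nothing at runtime.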
---
That's the scheme we would advocate as well. It would just be nice to have examples and some skeleton code that shows how it can be implemented.
---
> you just wrap pulse/on/off methods with something that looks at the channel being operated on and adds/subtracts a deterministic amount looked up from a calibration dataset
As I understand it, the latency for each instance of each PHY should be constant. So for the moment, the approach you're recommending is a per-PHY wrapper that accounts for latency, right? However, even for coredevice.ttl.TTLOut this is excessively annoying, since so many high-level calls need to be wrapped:
- set_o()
- sync()
- on()
- off()
- pulse_mu()
- pulse()
End users fixing this by wrapping the ARTIQ API results in a terrible mess in experiment files. Most PHYs are far more complex than TTLOut.
Here's what a fix to coredevice.ttl.TTLOut might look like. Modified lines are marked with $.
```python
$ def __init__(self, dmgr, channel, core_device="core", latency=0):
      self.core = dmgr.get(core_device)
      self.channel = channel
$     self.latency_mu = seconds_to_mu(latency)

  @kernel
  def set_o(self, o):
$     t0 = now_mu()
$     at_mu(t0 - self.latency_mu)
      rtio_output(now_mu(), self.channel, 0, 1 if o else 0)
      self.o_previous_timestamp = now_mu()
$     at_mu(t0)

  @kernel
  def sync(self):
      """Busy-wait until all programmed level switches have been
      effected."""
$     while self.core.get_rtio_counter_mu() - self.latency_mu < self.o_previous_timestamp:
          pass

  @kernel
  def on(self):
      """Set the output to a logic high state at the current position
      of the time cursor.

      The time cursor is not modified by this function."""
      self.set_o(True)

  @kernel
  def off(self):
      """Set the output to a logic low state at the current position
      of the time cursor.

      The time cursor is not modified by this function."""
      self.set_o(False)

  @kernel
  def pulse_mu(self, duration):
      """Pulse the output high for the specified duration
      (in machine units).

      The time cursor is advanced by the specified duration."""
      self.on()
      delay_mu(duration)
      self.off()

  @kernel
  def pulse(self, duration):
      """Pulse the output high for the specified duration
      (in seconds).

      The time cursor is advanced by the specified duration."""
      self.on()
      delay(duration)
      self.off()
```
Relatively simple, right? Or am I missing something?
---
Why does the end user wrapping this result in a "terrible mess" in experiment files? Just create a subclass TTLOutLC(TTLOut) and implement the modified versions of the methods as above. Then just use TTLOutLC instead of TTLOut in your experiment code. Seems incredibly clean and simple to me. Now, if you set latencies to be wacky numbers you may get some tricky corner cases (such as the above-mentioned slow shutter being programmed in the past), but that happens no matter how you choose to approach the problem; it's unavoidable.
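As an illustration of the subclass pattern (a self-contained stand-in with made-up plumbing so it runs without hardware, not the real artiq.coredevice.ttl.TTLOut): only set_o needs overriding, since on/off/pulse all funnel through it.

```python
class TTLOut:
    """Toy stand-in for the real driver: records (timestamp_mu, level)
    events and tracks a software time cursor."""
    def __init__(self):
        self.emitted = []
        self.cursor_mu = 0

    def set_o(self, o):
        self.emitted.append((self.cursor_mu, o))

    def on(self):
        self.set_o(True)

    def off(self):
        self.set_o(False)

    def pulse_mu(self, duration_mu):
        self.on()
        self.cursor_mu += duration_mu
        self.off()

class TTLOutLC(TTLOut):
    """Latency-compensated variant: every level switch is emitted earlier
    by the channel's calibrated latency, then the cursor is restored."""
    def __init__(self, latency_mu=0):
        super().__init__()
        self.latency_mu = latency_mu

    def set_o(self, o):
        self.cursor_mu -= self.latency_mu   # emit early by the latency
        super().set_o(o)
        self.cursor_mu += self.latency_mu   # restore the time cursor
```

For example, a pulse_mu(10) at cursor 100 on a channel with latency 5 emits edges at 95 and 105, so the physical pulse lands at 100-110.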
If you want to perform latency correction on SPI, you do the same thing, make a wrapper class in which some of the timing is modified. If you want to do this for SAWG, again you make a wrapper class and modify the timestamps of spline knots/phase/freq/amplitude changes/etc.
---
So rather than modifying six lines of code in the ARTIQ code base, each lab is now creating a new class, overriding 7 members of TTLOut, and debugging it. For the TTL PHYs alone, repeat for:
- TTLInOut (25 methods!)
- TTLClockGen
Then for all the other PHYs... DMA, I2C, PDQ, sawg, spi, dds??
---
So submit the above patch to the ARTIQ code base. TTLInOut and TTLClockGen should be similar.
---
@jordens asked
> It would just be nice to have examples and some skeleton code that shows how it can be implemented.
I don't have the bandwidth to write and test a holistic solution to this. At this stage, discussion of implementation and pitfalls is where we're at.
Rather than the new constructor argument above, it may well be cleaner to add a latency argument for PHYs in device_db.py. Or support both.
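A hedged sketch of what the device_db.py route might look like (the latency key is hypothetical; the current TTLOut constructor does not accept it, and the channel number is invented):

```python
device_db = {
    "ttl_shutter": {
        "type": "local",
        "module": "artiq.coredevice.ttl",
        "class": "TTLOut",
        # hypothetical per-channel latency argument, in seconds
        "arguments": {"channel": 4, "latency": 11e-9},
    },
}
```

This keeps the calibration in one place per installation instead of scattering it across experiment files.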
@jordens Please add label for-contract and add to 4.0 milestone.
---
@jbqubit this is not how we operate. Happy to make it for-contract. But that is mutually exclusive with adding it to a milestone.
The problem with your example is that it is not generic enough. The falling and rising edges of input, output, and oe could well all be different. Accounting for all that makes this latency compensation scheme inappropriate for the generic ARTIQ TTL channels.
---
Let's discuss a first-order solution that addresses gross latency first: 1) PHY-to-PHY latency (e.g. TTL to SAWG, currently a couple of microseconds), 2) cable length.
Agreed that there are higher-order corrections that may depend on e.g. what's connected to a TTL.
---
How about the following:
- Gateware adds an offset to each input and output channel.
- Offset resolution is a fine RTIO cycle.
- Offsets default to the interface.delay specified in the RTIO channel (defined in gateware files).
- Offsets can be read and written by kernels. Reads/writes of those offsets are not real-time. Typically, the startup kernel would set all those offsets before doing anything. In the startup kernel, the read method is particularly useful to obtain interface.delay and add some value to it.
- Negative offsets are not supported.
- RTIO drivers (TTLOut, SAWG, ...) provide wrapper methods to access the offsets of their corresponding channels.
This won't support complicated latency compensation schemes (such as different offsets for rising and falling edges), but it is doable in gateware with minimal penalty in overall latency and without decreasing throughput.
Ping @hartytp
---
I'm not really the right person to ask about that. Try @cjbe or @klickverbot, or the NIST/ARL guys...
---
I am assuming from this that the offset for a given channel is added to the timestamp sent from the kernel, in other words if you send a timestamp of 0 and the offset is 25, then it goes into the FIFO to be emitted at timestamp 25. Is this correct? If this is the case, negative offsets are really the "useful" ones, in that what you want to do is to be able to send an output (or assign the original time of an input) to a time before the current time. That is, if you want to trigger something where there is a high latency, you want to send the trigger pulse a lot earlier than the time when you actually need the thing to have triggered by. Likewise, if you are counting some input event, the true time of that event must have been sometime at an earlier timestamp than the one it was assigned upon reaching the FPGA.
If I have the signs backwards here, and the positive offset moves the final timestamp earlier, then everything seems fine. I don't think that complicated latencies (e.g. different latencies for rising and falling edges) needs to be handled here, someone can write a driver to account for this at the software level.
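To make the sign convention being discussed concrete (a trivial model of the described behavior, not gateware code):

```python
def corrected_timestamp(kernel_ts_mu, offset_mu):
    # the gateware adds the per-channel offset to the kernel timestamp
    # before the event enters the FIFO
    return kernel_ts_mu + offset_mu

# a positive offset pushes the event later:
late = corrected_timestamp(0, 25)        # enters the FIFO at 25
# to fire a high-latency device early enough, the offset must be negative:
early = corrected_timestamp(1000, -200)  # enters the FIFO at 800
```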
---
> I am assuming from this that the offset for a given channel is added to the timestamp sent from the kernel, in other words if you send a timestamp of 0 and the offset is 25, then it goes into the FIFO to be emitted at timestamp 25. Is this correct?
Yes.
---
@sbourdeauducq adding positive offsets to output timestamps is not particularly useful IMHO. As I pointed out above, the useful case is to place events in the past. I understand that this is problematic from many standpoints with RTIO. The solution is to have everything default to a certain amount of positive offset, and then you reduce the positive offset for the higher-latency channels. This ends up just increasing the latency for everything, which is undesirable if you want to do e.g. feedback.
Is interface.delay a negative number, as described above?
I guess my feeling is that unless we have a clean architecture for implementing negative offsets, the idea of gateware latency compensation is not really very useful. @dtcallcock @r-srinivas @jbqubit thoughts?
---
Negative is doable.
---
What are the issues/tradeoffs with negative offsets? Does this impact things badly when combined with the SRTIO design being produced, because of all the timeline rewinding it could potentially create?
---
SRTIO makes things a bit trickier because the FIFO switch decision depends on the final (corrected) timestamp - that's a lot of combinatorial logic to spread in the couple cycles that a RTIO write currently takes. But I have found a way that should work, unless Xilinx silicon is even slower than I imagine and doesn't meet timing.
---
There are a few assumptions about slack uniformity across channels. One issue with negative offsets (executing earlier than the time cursor) is that it breaks those assumptions (e.g. setting the cursor to X after wall clock as we currently do will not guarantee positive slack anymore). The replacement code also has a virtual deadline ahead of the wall clock that would be broken.
---
I was worried about some of these issues @jordens. Given all this, I think that negative offsets (the useful ones) are still problematic, while positive ones can "solve" the latency calibration problem at the cost of adding substantial latency to everything. I think unless there is a great hue and cry to have these features, this is something that people should be doing in their code rather than something that happens in gateware. If people do end up wanting this in gateware, then it should be disabled by default, because it will probably cause a lot of hard-to-track-down RTIO timing errors if you don't know exactly what you're doing.
---
I agree with @dhslichter:
- negative offsets are what's useful
- complicated latencies (e.g. different latencies for rising and falling edges) are not needed
@dhslichter notes:
"If you are counting some input event, the true time of that event must have been sometime at an earlier timestamp than the one it was assigned upon reaching the FPGA." @sbourdeauducq does your approach permit negative time stamps on input events too? I'm thinking of TTL inputs and ADCs.
As a matter of semantics I like .latency over .delay, as it puts emphasis on what's happening in the lab vs. how ARTIQ compensates (by adding a delay). From a physics perspective, this functionality permits the user to define a point in space that acts as a timing reference plane (TRP). .latency accounts for the duration between an RTIO timestamp and when the resulting action reaches the TRP. Commonly the TRP will be the location of a trapped ion.
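As a toy illustration of such a latency budget (all numbers invented; ~5 ns/m is the usual rule of thumb for signal propagation in cable), a channel's .latency to the TRP is just the sum of the contributions along the path:

```python
# Invented example numbers for one output channel's path to the TRP
gateware_ns = 40          # FPGA/PHY pipeline delay
cable_ns = 3.0 * 5.0      # 3 m of coax at ~5 ns/m
aom_ns = 150.0            # e.g. acoustic transit time in an AOM
latency_ns = gateware_ns + cable_ns + aom_ns  # total .latency for the channel
```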
How should timing errors be handled? Here's an example.
```python
def build(self):
    self.in0.latency = 0*ns
    self.out1.latency = 99*ns
    self.out2.latency = 2.2*us
    self.my_cond_latency_t = 1.1*us  # time for uP to do comparison

@kernel
def run(self):
    self.in0.gate_rising(500*ns)
    if self.in0.count() > 20:
        #delay( max(self.out2.latency, self.out1.latency) + self.my_cond_latency_t )
        with parallel:
            self.out1.pulse(500*ns)
            self.out2.pulse(500*ns)
```
I expect that an error is generated by the above code since there is insufficient slack. Removing the comment would fix it.
Per-channel RTIO .latency can be static at compile time. This permits the compiler to pre-compute delay() in the preceding example. It also simplifies postmortem timing analysis, since a single set of latencies would apply for any given experiment data file. The latencies need to be recorded in the .hdf5 file.
How should artiq_coreanalyzer behave? The current implementation records RTIO timestamps for each RTIO channel. This is helpful for low-level debugging, but due to per-channel delays the core_analyzer events are not time-ordered with respect to a TRP. The addition of a per-channel RTIO .latency presents an opportunity to address this. A flag could instruct core_analyzer to emit a .vcd that accounts for known latencies.
---
Currently, prerecorded core_dma sequences can span multiple experiments. If RTIO .latency could be set per-experiment, things would get complicated. AFAICT it would be cleaner if RTIO latencies were settable only by the startup kernel (and if the startup kernel erased recorded DMA events). Does anybody see a problem with that constraint?
---
There is no problem with negative offsets, either on inputs or outputs, other than what I explained above and the increased slack requirements for outputs.
With the minimal amount of code modification, coreanalyzer would have compensated timestamps on inputs and uncompensated timestamps on outputs.
---
> With the minimal amount of code modification, coreanalyzer would have compensated timestamps on inputs and uncompensated timestamps on outputs.
An option for compensated coreanalyzer timestamps on both inputs and outputs is desirable AFAICT. How hard is this to do for outputs?
@sbourdeauducq Will RTIO .latency be static at compile time? Can the compiler optimize?
Do you need more information to fully specify and quote this feature?
---
Negative-compensated timestamps on inputs is pretty straightforward I think; the trick is negative-compensated timestamps on outputs, and what this does to the FIFO structure proposed for SRTIO. In the current structure, with one FIFO per RTIO channel, it's pretty straightforward because each event in a given FIFO will have the same latency, but with this round-robin configuration things could get ugly fast.
I think the core analyzer should be configured to be able to output the values with the latency compensation removed or in place, because you might want to look at it with different versions.
All in all, this feature worries me a bit because it has the potential to hide a lot of subtle timing bugs in ways that are not very visible to the user. We need to be careful that things are implemented in such a way that one can easily change/reset the latency values, that it is very apparent to the user in the documentation (or potentially even with the use of special "lc" methods) when latency compensation is being applied.
---
> one can easily change/reset the latency values
Since the use case is compensation with respect to a laboratory-fixed TRP, I don't see a need to easily set/reset the latency values. It would suffice if latencies were set at startup (by the startup kernel).
---
Per-experiment latency control is a calibration/debugging feature. If you put some wack-ass latency values in, it may make things hard to debug, and you want to reset them to zero (for example) for debugging purposes. Likewise, you might change a cable in your lab setup and it would be nice to be able to recalibrate this without restarting everything. You might have an ion trapped and want to keep running your "ion monitor" equivalent while debugging some other TTL pulses, so you don't want to have to reboot the FPGA each time. Definitely you don't want to have to re-flash the FPGA!
I think the messiness/persistence issues you allude to with DMA are inevitable facts of life when you start introducing something as ugly as this kind of automated latency compensation. The latency compensation is automatically going to be fragile, and will rely on delays of the sorts @jbqubit had commented out in the code above.
---
A related question/thought, tied to some of the issues in #778: it would be nice to do as little timeline rewinding as possible in the kernels. Is there a clean way that one could, at compile time, change the order in which pulse timestamps are emitted from the kernel in order to reduce rewinds? For example:
```python
@kernel
def run(self):
    self.ttl1.pulse(3*us)
    self.ttl2.pulse(5*us)
    delay(-10*us)
    self.ttl3.pulse(2*us)
```
It would be nice if the compiler could recognize that ttl3 is being pulsed on before either of the others, and change the ordering such that you get instead:
```python
@kernel
def run(self):
    self.ttl3.pulse(2*us)
    self.ttl1.pulse(3*us)
    self.ttl2.pulse(5*us)
```
If you could do this, it would mean you are not sending out-of-order timestamps from the kernel out to the FIFOs. Obviously this is impossible to implement if the durations of some of the pulses are only calculated at runtime, but if the timeline can be known explicitly at compile time, at least for contiguous local sections (which is true for most of our experiments), then doing this kind of reordering where possible would alleviate some of the pressure from timeline rewinds clogging up SRTIO FIFOs.
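As a sketch of what such a compile-time pass could do (pure Python, relative times in µs; the first snippet above places ttl1 at 0, ttl2 at 3, and, after the -10 µs rewind, ttl3 at -2):

```python
def reorder(pulses):
    """pulses: list of (channel, start, duration) with statically known
    start times; emitting them sorted by start avoids sending
    out-of-order timestamps to the FIFOs."""
    return sorted(pulses, key=lambda p: p[1])

# timeline of the first snippet, with statically computed start times
timeline = [("ttl1", 0, 3), ("ttl2", 3, 5), ("ttl3", -2, 2)]
reordered = reorder(timeline)
# ttl3 first, then ttl1, then ttl2 -- the ordering of the second snippet
```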
If there are latency compensations running for several channels, you are inevitably going to have timelines with rewinds. For example, if you pulse any two channels with non-equal latencies (and I am assuming this is going to be far and away the most common case, if the position of an ion in a trap is used as the time reference plane) in a parallel block, you will have a timeline rewind/out-of-order timestamp that depends on which operation is listed first in the parallel block -- very subtle! If running lots of such pulses then clogs all your SRTIO FIFOs based on the current arbitration scheme and you get underflow errors, this is going to be a really nasty thing for many users (especially beginning users) to debug, not to mention the fact that it will be more taxing on someone writing experimental code (keeping track of relative latencies for all the channels) even if they are aware of this issue. If you really want to implement this kind of latency correction, getting help from the compiler in preventing the sorts of non-user-obvious rewinds mentioned above is probably going to be important.
---
> An option for compensated coreanalyzer timestamps on both inputs and outputs is desirable AFAICT. How hard is this to do for outputs?
Not super hard, just needs a bit of boring glue code to pass the compensation values from the gateware into the analyzer when the trace is collected.
> @sbourdeauducq Will RTIO .latency be static at compile time? Can the compiler optimize?
It's all in gateware and the compiler doesn't see it.
@dhslichter this is off-topic here, but the limit on timeline rewinds from SED (SRTIO) simply comes from the requirement that all timestamps in a given FIFO must be strictly monotonic. Say you execute your first kernel in a loop with the delay changed to -8 µs, and say you have 3 FIFOs; the FIFO numbers and timestamps (in µs for simplicity, and only listing the first event in each pulse) would be as follows:
| Command | Timestamp | FIFO# |
|---|---|---|
| `self.ttl1.pulse(3*us)` | 0 | 0 |
| `self.ttl2.pulse(5*us)` | 3 | 0 |
| `delay(-8*us)` | | |
| `self.ttl3.pulse(2*us)` | 0 | 1 |
| `self.ttl1.pulse(3*us)` | 2 | 1 |
| `self.ttl2.pulse(5*us)` | 5 | 1 |
| `delay(-8*us)` | | |
| `self.ttl3.pulse(2*us)` | 2 | 2 |
| `self.ttl1.pulse(3*us)` | 4 | 2 |
| `self.ttl2.pulse(5*us)` | 7 | 2 |
| `delay(-8*us)` | | |
| `self.ttl3.pulse(2*us)` | 4 | 0 |
It's OK to put the last event (with timestamp 4) into FIFO 0, because the last timestamp written to it was 3, and this loop can be repeated indefinitely.
It would not work, and would produce a sequence error, if there were only 2 FIFOs, because it would attempt to put self.ttl3.pulse(2*us) with timestamp 2 into FIFO 0, and 2 ≤ 3.
With your original code and another delay after the last pulse (so that the timeline advances and it is equivalent to your second code snippet) it's like this:
| Command | Timestamp | FIFO# |
|---|---|---|
| `self.ttl1.pulse(3*us)` | 10 | 0 |
| `self.ttl2.pulse(5*us)` | 13 | 0 |
| `delay(-10*us)` | | |
| `self.ttl3.pulse(2*us)` | 8 | 1 |
| `delay(10*us)` | | |
| `self.ttl1.pulse(3*us)` | 20 | 1 |
| `self.ttl2.pulse(5*us)` | 23 | 1 |
| `delay(-10*us)` | | |
| `self.ttl3.pulse(2*us)` | 18 | 0 |
| `delay(10*us)` | | |
| `self.ttl1.pulse(3*us)` | 30 | 0 |
| `self.ttl2.pulse(5*us)` | 33 | 0 |
| `delay(-10*us)` | | |
| `self.ttl3.pulse(2*us)` | 28 | 1 |
| `delay(10*us)` | | |
Only 2 FIFOs are needed to repeat this loop indefinitely.
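The selection rule above can be modeled in a few lines (a toy model of the behavior described, not the actual SED gateware): stay on the current FIFO while timestamps strictly increase, and on a rewind move round-robin to the next FIFO, which must also remain strictly monotonic.

```python
def assign_fifos(timestamps, n_fifos):
    """Return the FIFO chosen for each timestamp, or raise on a
    sequence error. Toy model of the rule described above."""
    last = [None] * n_fifos  # last timestamp written to each FIFO
    current = 0
    chosen = []
    for t in timestamps:
        if last[current] is not None and t <= last[current]:
            # timeline rewind: switch round-robin to the next FIFO
            current = (current + 1) % n_fifos
            if last[current] is not None and t <= last[current]:
                raise RuntimeError("sequence error at timestamp %d" % t)
        last[current] = t
        chosen.append(current)
    return chosen

# first table: 3 FIFOs suffice for the repeating pattern
first = assign_fifos([0, 3, 0, 2, 5, 2, 4, 7, 4], 3)
# second table: 2 FIFOs suffice
second = assign_fifos([10, 13, 8, 20, 23, 18, 30, 33, 28], 2)
```

Running the first pattern with only 2 FIFOs raises the sequence error at timestamp 2 discussed above (2 ≤ 3).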
Please discuss this further in #778 or on the mailing list.
As for the compiler reordering things, there is `with interleave`, which is like `with parallel`. But it is currently underdeveloped, buggy, and not documented/exposed.
---