Comments (31)
manual latency compensation examples in #530 (comment)
---
DRTIO enables increased flexibility at the cost of an incremental increase in complexity, in the form of implementation-dependent latency compensation. As @jordens pointed out, the ARTIQ plan is for an elegant mechanism to handle this issue.
There are many features of DRTIO/Sinara that have new user-dependent latencies. Some examples:
- DRTIO over microTCA AMC backplane, per-slot latency
- DRTIO over variable-length optical fiber (SFP)
- SPI, TTL over variable-length VHDCI cables to VHDCI Carrier (and 3U peripherals)
Latency compensation is already a common feature in v2.x ARTIQ experiment code used in the lab (e.g. for DDSs). Labs using Sinara will incur an experiment code rewrite to migrate from v2.x to v3.0. It is a good economy for end users to also adopt a new latency compensation scheme at this juncture.
It looks like latency compensation is no longer a 3.0 milestone. I'd like to see it included in 3.0. Thoughts @dhslichter, @r-srinivas, @dleibrandt, @hartytp, @cjbe?
---
Nope. There is no funding for it plus we need to get 3.0 out soon, and things like #685 are already a major difficulty.
---
I don't want to delay 3.0 any further with this, and as @sbourdeauducq points out there are many more issues to be addressed with higher priority for users. While figuring out the latencies is an important issue to address in the fullness of time, it is also something that can be addressed simply, with the current ARTIQ, by end users -- you just wrap pulse/on/off methods with something that looks at the channel being operated on and adds/subtracts a deterministic amount looked up from a calibration dataset. If your pulse durations/starts are being calculated on the fly by the core device, this will slow things a little, but if they are known at compile time the addition should be done in compilation and there is no performance hit.
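A minimal, hardware-free sketch of that wrapping idea (all names here — `latencies_mu`, `compensated_events` — are hypothetical, not ARTIQ API): the wrapper shifts each event earlier by a per-channel latency looked up from a calibration table before it is emitted.

```python
# Hypothetical calibration data: per-channel latencies in machine units,
# as might be loaded from a calibration dataset.
latencies_mu = {"ttl0": 25, "dds0": 1200}

def compensated_events(program, latencies_mu):
    """program: list of (channel, timestamp_mu) pairs.
    Returns the timestamps actually emitted, shifted earlier by each
    channel's calibrated latency (channels without an entry are unshifted)."""
    return [(ch, t - latencies_mu.get(ch, 0)) for ch, t in program]

events = compensated_events([("ttl0", 1000), ("dds0", 2000)], latencies_mu)
# ttl0 emitted at 975, dds0 at 800: both actions land at their nominal times
```

If the latencies are known at compile time, the subtraction folds into the constant timeline arithmetic and costs nothing at runtime.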
---
That's the scheme we would advocate as well. It would just be nice to have examples and some skeleton code that shows how it can be implemented.
---
> you just wrap pulse/on/off methods with something that looks at the channel being operated on and adds/subtracts a deterministic amount looked up from a calibration dataset
As I understand it, the latency for each instance of each PHY should be constant. So for the moment, the approach you're recommending is a per-PHY wrapper that accounts for latency, right? However, even for coredevice.ttl.TTLOut this is excessively annoying, since so many high-level calls need to be wrapped:
- set_o()
- sync()
- on()
- off()
- pulse_mu()
- pulse()
End users fixing this by wrapping the ARTIQ API results in a terrible mess in experiment files. Most PHYs are far more complex than TTLOut.
Here's what a fix to coredevice.ttl.TTLOut might look like. Modified lines are marked with $.
```python
$ def __init__(self, dmgr, channel, core_device="core", latency=0):
      self.core = dmgr.get(core_device)
      self.channel = channel
$     self.latency_mu = seconds_to_mu(latency)

  @kernel
  def set_o(self, o):
$     t0 = now_mu()
$     at_mu(t0 - self.latency_mu)
      rtio_output(now_mu(), self.channel, 0, 1 if o else 0)
      self.o_previous_timestamp = now_mu()
$     at_mu(t0)

  @kernel
  def sync(self):
      """Busy-wait until all programmed level switches have been
      effected."""
$     while self.core.get_rtio_counter_mu() - self.latency_mu < self.o_previous_timestamp:
          pass

  @kernel
  def on(self):
      """Set the output to a logic high state at the current position
      of the time cursor.

      The time cursor is not modified by this function."""
      self.set_o(True)

  @kernel
  def off(self):
      """Set the output to a logic low state at the current position
      of the time cursor.

      The time cursor is not modified by this function."""
      self.set_o(False)

  @kernel
  def pulse_mu(self, duration):
      """Pulse the output high for the specified duration
      (in machine units).

      The time cursor is advanced by the specified duration."""
      self.on()
      delay_mu(duration)
      self.off()

  @kernel
  def pulse(self, duration):
      """Pulse the output high for the specified duration
      (in seconds).

      The time cursor is advanced by the specified duration."""
      self.on()
      delay(duration)
      self.off()
```
Relatively simple, right? Or am I missing something?
---
Why does the end user wrapping this result in a "terrible mess" in experiment files? Just create a subclass TTLOutLC(TTLOut) and implement the modified versions of the methods as above. Then just use TTLOutLC instead of TTLOut in your experiment code. Seems incredibly clean and simple to me. Now, if you set latencies to be wacky numbers you may get some tricky corner cases (such as the above-mentioned slow shutter being programmed in the past), but that happens no matter how you choose to approach the problem; it's unavoidable.
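As an illustration of the subclass pattern (a self-contained stand-in with made-up plumbing so it runs without hardware, not the real artiq.coredevice.ttl.TTLOut): only set_o needs overriding, since on/off/pulse all funnel through it.

```python
class TTLOut:
    """Toy stand-in for the real driver: records (timestamp_mu, level)
    events and tracks a software time cursor."""
    def __init__(self):
        self.emitted = []
        self.cursor_mu = 0

    def set_o(self, o):
        self.emitted.append((self.cursor_mu, o))

    def on(self):
        self.set_o(True)

    def off(self):
        self.set_o(False)

    def pulse_mu(self, duration_mu):
        self.on()
        self.cursor_mu += duration_mu
        self.off()

class TTLOutLC(TTLOut):
    """Latency-compensated variant: every level switch is emitted earlier
    by the channel's calibrated latency, then the cursor is restored."""
    def __init__(self, latency_mu=0):
        super().__init__()
        self.latency_mu = latency_mu

    def set_o(self, o):
        self.cursor_mu -= self.latency_mu   # emit early by the latency
        super().set_o(o)
        self.cursor_mu += self.latency_mu   # restore the time cursor
```

For example, a pulse_mu(10) at cursor 100 on a channel with latency 5 emits edges at 95 and 105, so the physical pulse lands at 100-110.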
If you want to perform latency correction on SPI, you do the same thing, make a wrapper class in which some of the timing is modified. If you want to do this for SAWG, again you make a wrapper class and modify the timestamps of spline knots/phase/freq/amplitude changes/etc.
---
So rather than modifying six lines of code in the ARTIQ code base, each lab is now creating a new class, overriding 7 members of TTLOut, and debugging it. For the TTL PHYs alone, repeat for:
- TTLInOut (25 methods!)
- TTLClockGen
Then for all the other PHYs... DMA, I2C, PDQ, sawg, spi, dds??
---
So submit the above patch to the ARTIQ code base. TTLInOut and TTLClockGen should be similar.
---
@jordens asked
> It would just be nice to have examples and some skeleton code that shows how it can be implemented.
I don't have the bandwidth to write and test a holistic solution to this. At this stage, discussion of implementation and pitfalls is where we're at.
Rather than the new constructor argument above, it may well be cleaner to add a latency argument for PHYs in device_db.py. Or support both.
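A hedged sketch of what the device_db.py route might look like (the latency key is hypothetical; the current TTLOut constructor does not accept it, and the channel number is invented):

```python
device_db = {
    "ttl_shutter": {
        "type": "local",
        "module": "artiq.coredevice.ttl",
        "class": "TTLOut",
        # hypothetical per-channel latency argument, in seconds
        "arguments": {"channel": 4, "latency": 11e-9},
    },
}
```

This keeps the calibration in one place per installation instead of scattering it across experiment files.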
@jordens Please add label for-contract and add to 4.0 milestone.
---
@jbqubit this is not how we operate. Happy to make it for-contract. But that is mutually exclusive with adding it to a milestone.
The problem with your example is that it is not generic enough. The falling and rising edges of input, output, and oe could well all be different. Accounting for all that makes this latency compensation scheme inappropriate for the generic ARTIQ TTL channels.
---
Let's discuss a first-order solution that addresses gross latency first: 1) PHY-to-PHY latency (e.g. TTL to SAWG, currently a couple of microseconds), 2) cable length.
Agreed that there are higher-order corrections that may depend on e.g. what's connected to a TTL.
---
How about the following:
- Gateware adds an offset to each input and output channel.
- Offset resolution is a fine RTIO cycle.
- Offsets default to the interface.delay specified in the RTIO channel (defined in gateware files).
- Offsets can be read and written by kernels. Reads/writes of those offsets are not real-time. Typically, the startup kernel would set all those offsets before doing anything. In the startup kernel, the read method is particularly useful to obtain interface.delay and add some value to it.
- Negative offsets are not supported.
- RTIO drivers (TTLOut, SAWG, ...) provide wrapper methods to access the offsets of their corresponding channels.
This won't support complicated latency compensation schemes (such as different offsets for rising and falling edges), but it is doable in gateware with minimal penalty in overall latency and without decreasing throughput.
Ping @hartytp
---
I'm not really the right person to ask about that. Try @cjbe or @klickverbot, or the NIST/ARL guys...
---
I am assuming from this that the offset for a given channel is added to the timestamp sent from the kernel, in other words if you send a timestamp of 0 and the offset is 25, then it goes into the FIFO to be emitted at timestamp 25. Is this correct? If this is the case, negative offsets are really the "useful" ones, in that what you want to do is to be able to send an output (or assign the original time of an input) to a time before the current time. That is, if you want to trigger something where there is a high latency, you want to send the trigger pulse a lot earlier than the time when you actually need the thing to have triggered by. Likewise, if you are counting some input event, the true time of that event must have been sometime at an earlier timestamp than the one it was assigned upon reaching the FPGA.
If I have the signs backwards here, and the positive offset moves the final timestamp earlier, then everything seems fine. I don't think that complicated latencies (e.g. different latencies for rising and falling edges) needs to be handled here, someone can write a driver to account for this at the software level.
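To make the sign convention being discussed concrete (a trivial model of the described behavior, not gateware code):

```python
def corrected_timestamp(kernel_ts_mu, offset_mu):
    # the gateware adds the per-channel offset to the kernel timestamp
    # before the event enters the FIFO
    return kernel_ts_mu + offset_mu

# a positive offset pushes the event later:
late = corrected_timestamp(0, 25)        # enters the FIFO at 25
# to fire a high-latency device early enough, the offset must be negative:
early = corrected_timestamp(1000, -200)  # enters the FIFO at 800
```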
---
> I am assuming from this that the offset for a given channel is added to the timestamp sent from the kernel, in other words if you send a timestamp of 0 and the offset is 25, then it goes into the FIFO to be emitted at timestamp 25. Is this correct?
Yes.
---
@sbourdeauducq adding positive offsets to output timestamps is not particularly useful IMHO. As I pointed out above, the useful case is to place events in the past. I understand that this is problematic from many standpoints with RTIO. The solution is to have everything default to a certain amount of positive offset, and then you reduce the positive offset for the higher-latency channels. This ends up just increasing the latency for everything, which is undesirable if you want to do e.g. feedback.
Is interface.delay a negative number, as described above?
I guess my feeling is that unless we have a clean architecture for implementing negative offsets, the idea of gateware latency compensation is not really very useful. @dtcallcock @r-srinivas @jbqubit thoughts?
---
Negative is doable.
---
What are the issues/tradeoffs with negative offsets? Does this impact things badly when combined with the SRTIO design being produced, because of all the timeline rewinding it could potentially create?
---
SRTIO makes things a bit trickier because the FIFO switch decision depends on the final (corrected) timestamp - that's a lot of combinatorial logic to spread in the couple cycles that a RTIO write currently takes. But I have found a way that should work, unless Xilinx silicon is even slower than I imagine and doesn't meet timing.
---
There are a few assumptions about slack uniformity across channels. One issue with negative offsets (executing earlier than the time cursor) is that it breaks those assumptions (e.g. setting the cursor to X after wall clock as we currently do will not guarantee positive slack anymore). The replacement code also has a virtual deadline ahead of the wall clock that would be broken.
---
I was worried about some of these issues @jordens. Given all this, I think that negative offsets (the useful ones) are still problematic, while positive ones can "solve" the latency calibration problem at the cost of adding substantial latency to everything. I think unless there is a great hue and cry to have these features, this is something that people should be doing in their code rather than something that happens in gateware. If people do end up wanting this in gateware, then it should be disabled by default, because it will probably cause a lot of hard-to-track-down RTIO timing errors if you don't know exactly what you're doing.
---
I agree with @dhslichter:
- negative offsets are what's useful
- complicated latencies (e.g. different latencies for rising and falling edges) are not needed
@dhslichter notes:
"If you are counting some input event, the true time of that event must have been sometime at an earlier timestamp than the one it was assigned upon reaching the FPGA." @sbourdeauducq does your approach permit negative time stamps on input events too? I'm thinking of TTL inputs and ADCs.
As a matter of semantics I like .latency over .delay, as it puts emphasis on what's happening in the lab vs. how ARTIQ compensates (by adding a delay). From a physics perspective, this functionality permits the user to define a point in space that acts as a timing reference plane (TRP). .latency accounts for the duration between an RTIO timestamp and when the resulting action reaches the TRP. Commonly the TRP will be the location of a trapped ion.
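As a toy illustration of such a latency budget (all numbers invented; ~5 ns/m is the usual rule of thumb for signal propagation in cable), a channel's .latency to the TRP is just the sum of the contributions along the path:

```python
# Invented example numbers for one output channel's path to the TRP
gateware_ns = 40          # FPGA/PHY pipeline delay
cable_ns = 3.0 * 5.0      # 3 m of coax at ~5 ns/m
aom_ns = 150.0            # e.g. acoustic transit time in an AOM
latency_ns = gateware_ns + cable_ns + aom_ns  # total .latency for the channel
```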
How should timing errors be handled? Here's an example.
```python
def build(self):
    self.in0.latency = 0*ns
    self.out1.latency = 99*ns
    self.out2.latency = 2.2*us
    self.my_cond_latency_t = 1.1*us  # time for uP to do comparison

@kernel
def run(self):
    self.in0.gate_rising(500*ns)
    if self.in0.count() > 20:
        #delay( max(self.out2.latency, self.out1.latency) + self.my_cond_latency_t )
        with parallel:
            self.out1.pulse(500*ns)
            self.out2.pulse(500*ns)
```
I expect that an error is generated by the above code since there is insufficient slack. Removing the comment would fix it.
Per-channel RTIO .latency can be static at compile time. This permits the compiler to pre-compute delay() in the preceding example. It also simplifies postmortem timing analysis, since a single set of latencies would apply for any given experiment data file. The latencies need to be recorded in the .hdf5 file.
How should artiq_coreanalyzer behave? The current implementation records RTIO timestamps for each RTIO channel. This is helpful for low-level debugging, but due to per-channel delays the core_analyzer events are not time-ordered with respect to a TRP. The addition of a per-channel RTIO .latency presents an opportunity to address this. A flag could instruct core_analyzer to emit a .vcd that accounts for known latencies.
---
Currently, prerecorded core_dma sequences can span multiple experiments. If RTIO .latency could be set per-experiment, things would get complicated. AFAICT it would be cleaner if RTIO latencies were settable only by the startup kernel (and if the startup kernel erased recorded DMA events). Does anybody see a problem with that constraint?
---
There is no problem with negative offsets, either on inputs or outputs, other than what I explained above and the increased slack requirements for outputs.
With the minimal amount of code modification, coreanalyzer would have compensated timestamps on inputs and uncompensated timestamps on outputs.
---
> With the minimal amount of code modification, coreanalyzer would have compensated timestamps on inputs and uncompensated timestamps on outputs.
An option for compensated coreanalyzer timestamps on both inputs and outputs is desirable AFAICT. How hard is this to do for outputs?
@sbourdeauducq Will RTIO .latency be static at compile time? Can the compiler optimize?
Do you need more information to fully specify and quote this feature?
---
Negative-compensated timestamps on inputs is pretty straightforward I think; the trick is negative-compensated timestamps on outputs, and what this does to the FIFO structure proposed for SRTIO. In the current structure, with one FIFO per RTIO channel, it's pretty straightforward because each event in a given FIFO will have the same latency, but with this round-robin configuration things could get ugly fast.
I think the core analyzer should be configured to be able to output the values with the latency compensation removed or in place, because you might want to look at it with different versions.
All in all, this feature worries me a bit because it has the potential to hide a lot of subtle timing bugs in ways that are not very visible to the user. We need to be careful that things are implemented in such a way that one can easily change/reset the latency values, that it is very apparent to the user in the documentation (or potentially even with the use of special "lc" methods) when latency compensation is being applied.
---
> one can easily change/reset the latency values
Since the use case is compensation with respect to a laboratory-fixed TRP, I don't see a need to easily set/reset the latency values. It would suffice if latencies were set at startup (by the startup kernel).
---
Per-experiment latency control is a calibration/debugging feature. If you put some wack-ass latency values in, it may make things hard to debug, and you want to reset them to zero (for example) for debugging purposes. Likewise, you might change a cable in your lab setup and it would be nice to be able to recalibrate this without restarting everything. You might have an ion trapped and want to keep running your "ion monitor" equivalent while debugging some other TTL pulses, so you don't want to have to reboot the FPGA each time. Definitely you don't want to have to re-flash the FPGA!
I think the messiness/persistence issues you allude to with DMA are inevitable facts of life when you start introducing something as ugly as this kind of automated latency compensation. The latency compensation is automatically going to be fragile, and will rely on delays of the sorts @jbqubit had commented out in the code above.
---
A related question/thought, tied to some of the issues in #778: it would be nice to do as little timeline rewinding as possible in the kernels. Is there a clean way that one could, at compile time, change the order in which pulse timestamps are emitted from the kernel in order to reduce rewinds? For example:
```python
@kernel
def run(self):
    self.ttl1.pulse(3*us)
    self.ttl2.pulse(5*us)
    delay(-10*us)
    self.ttl3.pulse(2*us)
```
It would be nice if the compiler could recognize that ttl3 is being pulsed on before either of the others, and change the ordering such that you get instead:
```python
@kernel
def run(self):
    self.ttl3.pulse(2*us)
    self.ttl1.pulse(3*us)
    self.ttl2.pulse(5*us)
```
If you could do this, it would mean you are not sending out-of-order timestamps from the kernel out to the FIFOs. Obviously this is impossible to implement if the durations of some of the pulses are only calculated at runtime, but if the timeline can be known explicitly at compile time, at least for contiguous local sections (which is true for most of our experiments), then doing this kind of reordering where possible would alleviate some of the pressure from timeline rewinds clogging up SRTIO FIFOs.
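As a sketch of what such a compile-time pass could do (pure Python, relative times in µs; the first snippet above places ttl1 at 0, ttl2 at 3, and, after the -10 µs rewind, ttl3 at -2):

```python
def reorder(pulses):
    """pulses: list of (channel, start, duration) with statically known
    start times; emitting them sorted by start avoids sending
    out-of-order timestamps to the FIFOs."""
    return sorted(pulses, key=lambda p: p[1])

# timeline of the first snippet, with statically computed start times
timeline = [("ttl1", 0, 3), ("ttl2", 3, 5), ("ttl3", -2, 2)]
reordered = reorder(timeline)
# ttl3 first, then ttl1, then ttl2 -- the ordering of the second snippet
```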
If there are latency compensations running for several channels, you are inevitably going to have timelines with rewinds. For example, if you pulse any two channels with non-equal latencies (and I am assuming this is going to be far and away the most common case, if the position of an ion in a trap is used as the time reference plane) in a parallel block, you will have a timeline rewind/out-of-order timestamp that depends on which operation is listed first in the parallel block -- very subtle! If running lots of such pulses then clogs all your SRTIO FIFOs based on the current arbitration scheme and you get underflow errors, this is going to be a really nasty thing for many users (especially beginning users) to debug, not to mention the fact that it will be more taxing on someone writing experimental code (keeping track of relative latencies for all the channels) even if they are aware of this issue. If you really want to implement this kind of latency correction, getting help from the compiler in preventing the sorts of non-user-obvious rewinds mentioned above is probably going to be important.
---
> An option for compensated coreanalyzer timestamps on both inputs and outputs is desirable AFAICT. How hard is this to do for outputs?
Not super hard, just needs a bit of boring glue code to pass the compensation values from the gateware into the analyzer when the trace is collected.
> @sbourdeauducq Will RTIO .latency be static at compile time? Can the compiler optimize?
It's all in gateware and the compiler doesn't see it.
@dhslichter this is off-topic here, but the limit on timeline rewinds from SED (SRTIO) simply comes from the requirement that all timestamps in a given FIFO must be strictly monotonic. Say you execute your first kernel in a loop with the delay changed to -8 µs, and say you have 3 FIFOs; the FIFO numbers and timestamps (in µs for simplicity, and only listing the first event in each pulse) would be as follows:
| Command | Timestamp | FIFO# |
|---|---|---|
| `self.ttl1.pulse(3*us)` | 0 | 0 |
| `self.ttl2.pulse(5*us)` | 3 | 0 |
| `delay(-8*us)` | | |
| `self.ttl3.pulse(2*us)` | 0 | 1 |
| `self.ttl1.pulse(3*us)` | 2 | 1 |
| `self.ttl2.pulse(5*us)` | 5 | 1 |
| `delay(-8*us)` | | |
| `self.ttl3.pulse(2*us)` | 2 | 2 |
| `self.ttl1.pulse(3*us)` | 4 | 2 |
| `self.ttl2.pulse(5*us)` | 7 | 2 |
| `delay(-8*us)` | | |
| `self.ttl3.pulse(2*us)` | 4 | 0 |
It's OK to put the last event (with timestamp 4) into FIFO 0, because the last timestamp written to it was 3, and this loop can be repeated indefinitely.
It would not work, and would produce a sequence error, if there were only 2 FIFOs, because it would attempt to put self.ttl3.pulse(2*us) with timestamp 2 into FIFO 0, and 2 ≤ 3.
With your original code and another delay after the last pulse (so that the timeline advances and it is equivalent to your second code snippet) it's like this:
| Command | Timestamp | FIFO# |
|---|---|---|
| `self.ttl1.pulse(3*us)` | 10 | 0 |
| `self.ttl2.pulse(5*us)` | 13 | 0 |
| `delay(-10*us)` | | |
| `self.ttl3.pulse(2*us)` | 8 | 1 |
| `delay(10*us)` | | |
| `self.ttl1.pulse(3*us)` | 20 | 1 |
| `self.ttl2.pulse(5*us)` | 23 | 1 |
| `delay(-10*us)` | | |
| `self.ttl3.pulse(2*us)` | 18 | 0 |
| `delay(10*us)` | | |
| `self.ttl1.pulse(3*us)` | 30 | 0 |
| `self.ttl2.pulse(5*us)` | 33 | 0 |
| `delay(-10*us)` | | |
| `self.ttl3.pulse(2*us)` | 28 | 1 |
| `delay(10*us)` | | |
Only 2 FIFOs are needed to repeat this loop indefinitely.
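The selection rule above can be modeled in a few lines (a toy model of the behavior described, not the actual SED gateware): stay on the current FIFO while timestamps strictly increase, and on a rewind move round-robin to the next FIFO, which must also remain strictly monotonic.

```python
def assign_fifos(timestamps, n_fifos):
    """Return the FIFO chosen for each timestamp, or raise on a
    sequence error. Toy model of the rule described above."""
    last = [None] * n_fifos  # last timestamp written to each FIFO
    current = 0
    chosen = []
    for t in timestamps:
        if last[current] is not None and t <= last[current]:
            # timeline rewind: switch round-robin to the next FIFO
            current = (current + 1) % n_fifos
            if last[current] is not None and t <= last[current]:
                raise RuntimeError("sequence error at timestamp %d" % t)
        last[current] = t
        chosen.append(current)
    return chosen

# first table: 3 FIFOs suffice for the repeating pattern
first = assign_fifos([0, 3, 0, 2, 5, 2, 4, 7, 4], 3)
# second table: 2 FIFOs suffice
second = assign_fifos([10, 13, 8, 20, 23, 18, 30, 33, 28], 2)
```

Running the first pattern with only 2 FIFOs raises the sequence error at timestamp 2 discussed above (2 ≤ 3).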
Please discuss this further in #778 or on the mailing list.
As for the compiler reordering things, there is `with interleave`, which is like `with parallel`. But it is currently underdeveloped, buggy, and not documented/exposed.
---