symbiflow / nextpnr Goto Github PK

This project forked from yosyshq/nextpnr

nextpnr portable FPGA place and route tool

License: ISC License

CMake 2.40% C++ 82.78% C 4.59% Verilog 6.16% Python 2.78% Shell 0.30% Makefile 0.02% Jupyter Notebook 0.07% Tcl 0.10% Pawn 0.73% Cap'n Proto 0.03% VHDL 0.02%

nextpnr's Issues

Need site routing test framework

One of the more fragile but critical pieces of logic in the FPGA interchange nextpnr arch is the site routing logic. For clarity, this is the collection of code that implements the isBelLocationValid part of the nextpnr Arch API. This implementation must be fast (amortized over the entire placement step) as well as precise and accurate. That combination means it should be well tested, so that as complexity increases or speed improvements are made, there is a way to verify that it is still correct.

The suggested site routing test framework would consist of 3 parts:

A way to define a logical netlist that is very specific. This part could be as simple as a verilog file that requires no elaboration or synthesis.
A way to specify test cases. Each test case would consist of:

A batch of cell placement directives (e.g. place this cell at this BEL)
A set of BELs and whether they are valid

A way to evaluate test cases and gather performance data (both memory and wallclock times).

Example:

Netlist:

module testcase(input [5:0] lut_1_in, input [4:0] lut_2_in, output lut_1_out, output lut_2_out);

LUT6 #(.INIT(64'hFFFFFFFFFFFFFFFF)) lut_1 (
 .I0(lut_1_in[0]),
 .I1(lut_1_in[1]),
 .I2(lut_1_in[2]),
 .I3(lut_1_in[3]),
 .I4(lut_1_in[4]),
 .I5(lut_1_in[5]),
 .O(lut_1_out)
);


LUT5 #(.INIT(32'h0)) lut_2 (
 .I0(lut_2_in[0]),
 .I1(lut_2_in[1]),
 .I2(lut_2_in[2]),
 .I3(lut_2_in[3]),
 .I4(lut_2_in[4]),
 .O(lut_2_out)
);

endmodule

Test case:

test_case:
  - place:
      # Place cell `lut_2` at BEL `SLICE_X1Y8.SLICEL/A6LUT`
      lut_2: SLICE_X1Y8.SLICEL/A6LUT
  - test:
      # Make sure this placement is accepted
      SLICE_X1Y8.SLICEL/A6LUT: true
  - place:
      lut_1: SLICE_X1Y8.SLICEL/B6LUT
  - test:
      # Make sure this placement is accepted
      SLICE_X1Y8.SLICEL/A6LUT: true
      SLICE_X1Y8.SLICEL/B6LUT: true
  - place:
      lut_1: SLICE_X1Y8.SLICEL/A6LUT
      lut_2: SLICE_X1Y8.SLICEL/A5LUT
  - test:
      # The site is now invalid because there are too many signals into the A6/A5LUT
      SLICE_X1Y8.SLICEL/A6LUT: false
      SLICE_X1Y8.SLICEL/A5LUT: false
  - unplace:
      - lut_2
  - test:
      # By removing lut_2, the site is valid again
      SLICE_X1Y8.SLICEL/A6LUT: true
      SLICE_X1Y8.SLICEL/A5LUT: true

Invocation might look like:

python3 create_logical_netlist_from_verilog.py circuit.v circuit.netlist
nextpnr-fpga_interchange --chipdb xxx.bin --netlist circuit.netlist --run run_placement_test.py

Alternate designs are welcome and accepted.
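As a rough illustration of the third part (evaluating test cases and gathering timing data), here is a minimal Python sketch. It does not use any nextpnr API: `run_test_case`, `toy_is_valid`, `CELL_INPUTS` and `STEPS` are hypothetical names, the test case is written as Python data rather than parsed from YAML, and the validity rule is a toy stand-in for isBelLocationValid (it only rejects an occupied 6LUT/5LUT pair whose combined input count exceeds 5), not the real 7-series site routing rules.

```python
import time

# Toy input counts for the two cells in the example netlist (LUT6 and LUT5).
CELL_INPUTS = {"lut_1": 6, "lut_2": 5}

SITE = "SLICE_X1Y8.SLICEL"

# The example test case, written as Python data instead of YAML.
STEPS = [
    {"place": {"lut_2": f"{SITE}/A6LUT"}},
    {"test": {f"{SITE}/A6LUT": True}},
    {"place": {"lut_1": f"{SITE}/B6LUT"}},
    {"test": {f"{SITE}/A6LUT": True, f"{SITE}/B6LUT": True}},
    {"place": {"lut_1": f"{SITE}/A6LUT", "lut_2": f"{SITE}/A5LUT"}},
    {"test": {f"{SITE}/A6LUT": False, f"{SITE}/A5LUT": False}},
    {"unplace": ["lut_2"]},
    {"test": {f"{SITE}/A6LUT": True, f"{SITE}/A5LUT": True}},
]


def toy_is_valid(placement, bel):
    """Toy stand-in for isBelLocationValid (assumption, not the real rule)."""
    stem = bel[:-4]  # strip the trailing "6LUT" / "5LUT"
    pair = {stem + "6LUT", stem + "5LUT"}
    occupants = [cell for cell, loc in placement.items() if loc in pair]
    if len(occupants) < 2:
        return True
    return sum(CELL_INPUTS[cell] for cell in occupants) <= 5


def run_test_case(steps, is_valid):
    """Apply place/unplace/test steps; return (failures, per-check timings)."""
    placement = {}  # cell name -> BEL name
    failures = []
    timings = []
    for index, step in enumerate(steps):
        ((action, body),) = step.items()
        if action == "place":
            placement.update(body)  # re-placing a cell moves it
        elif action == "unplace":
            for cell in body:
                placement.pop(cell, None)
        elif action == "test":
            for bel, expected in body.items():
                start = time.perf_counter()
                actual = is_valid(placement, bel)
                timings.append(time.perf_counter() - start)
                if actual != expected:
                    failures.append((index, bel, expected, actual))
        else:
            raise ValueError(f"unknown step type: {action}")
    return failures, timings
```

A real runner would pass a callback that queries the arch instead of `toy_is_valid`, and would also record memory usage; both are left out here.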

Implement primitive macros

Current status

Primitive macros are unimplemented, and therefore designs that use primitives that have macros cannot be placed and routed using the FPGA interchange implementation.

Description

The FPGA interchange DeviceResources format describes primitive macros for any library primitive that becomes multiple elements during place and route. The most commonly encountered 7-series primitives like this are differential IO (e.g. OBUFTDS) and LUT-RAMs (e.g. RAM64X1D, RAM128X1D).

Implementing primitive macros has three major parts:

Apply cell type exception map
Expand the macro
Place the design respecting the macro

Apply cell type exception map

The first part is the cell type rename, based on https://github.com/SymbiFlow/fpga-interchange-schema/blob/f537a57fd7af091aabb776cb888f103b2b29349b/interchange/DeviceResources.capnp#L84 . This is required for cell types that have a name collision between the pre-macro expansion step and the post-macro expansion step.

Example:

The primitive OBUFTDS macro expansion in verilog looks like:

module OBUFTDS_DUAL_BUF(input I, input T, output O, output OB);

OBUFTDS P(.I(I), .T(T), .O(O), .OB(OB));
wire I_B;
INV INV(.I(I), .O(I_B));
OBUFTDS N(.I(I_B), .T(T), .O(OB));

endmodule

In this case, the primitive OBUFTDS has a macro whose body instantiates cells that are themselves of type OBUFTDS. To avoid the namespace collision, the exceptionMap in the DeviceResources renames the primitive type to OBUFTDS_DUAL_BUF.
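A minimal Python sketch of the rename step, assuming the exception map has already been read into a plain dict (the `EXCEPTION_MAP` contents come from the example above; `rename_cell_types` and the cell names are hypothetical). The rename is applied once to the cells of the incoming netlist, before any expansion, so the OBUFTDS cells inside the macro body are not renamed again.

```python
# Exception map from the example: pre-expansion type -> macro type.
EXCEPTION_MAP = {"OBUFTDS": "OBUFTDS_DUAL_BUF"}


def rename_cell_types(cells, exception_map):
    """Return a copy of {cell name: cell type} with colliding types renamed.

    Types that are not in the exception map are left untouched.
    """
    return {
        name: exception_map.get(cell_type, cell_type)
        for name, cell_type in cells.items()
    }
```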

Expand the macro

In the DeviceResources primLibs field are two libraries. One of those libraries contains cell primitives with no macros. These cells need no further expansion. Examples would be FDRE, IBUF, OBUF. These cells have no interior contents.

The other set of library cells have interior contents and represent the macros to be expanded. Macros can have a depth greater than 1. The expansion should continue until all elements of the macros are non-macro primitive cells. Several things to note:

The physical netlist nets are named for their driver. When applying a macro to a primitive that has outputs, the new net name is the name of the driver pin from within the macro, not the net name from before applying the macro.
The names of the contents of macros are important and must be preserved in the placement directives in the physical netlist.

Example for cell names:

Above, there was an example of the OBUFTDS macro. In that expansion, one primitive cell became 3 cells. If the original cell was named "example_obuftds", then the cell placements would be named "example_obuftds/P", "example_obuftds/INV" and "example_obuftds/N".
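The recursive expansion and the hierarchical cell naming can be sketched in Python as follows. This is only an illustration: `MACROS` is a hypothetical in-memory table holding just the OBUFTDS_DUAL_BUF contents from the example (keyed by the post-rename type, so the inner OBUFTDS cells are leaves), and `expand_cell` is a made-up helper, not part of any FPGA interchange library.

```python
# Macro type -> list of (child instance name, child cell type).
# Contents taken from the OBUFTDS_DUAL_BUF example; anything else would
# come from the macro library in the device database.
MACROS = {
    "OBUFTDS_DUAL_BUF": [("P", "OBUFTDS"), ("INV", "INV"), ("N", "OBUFTDS")],
}


def expand_cell(name, cell_type, macros):
    """Expand one cell until only non-macro primitives remain.

    Returns a list of (hierarchical cell name, primitive type). Child names
    are joined to the parent name with '/', so macro-internal names are
    preserved for the placement directives.
    """
    if cell_type not in macros:
        return [(name, cell_type)]  # leaf primitive, nothing to expand
    expanded = []
    for child_name, child_type in macros[cell_type]:
        expanded.extend(expand_cell(f"{name}/{child_name}", child_type, macros))
    return expanded
```

Because the function recurses on every child, macros with a depth greater than 1 are flattened in the same pass.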

Place the design respecting the macro

Because a macro primitive expands into multiple cells there is the question of how to place the macro legally. In some cases, for each site there is only 1 legal placement.

As an example, the OBUFTDS_DUAL_BUF macro has exactly one legal placement per IOB pair. As a concrete example (part xc7a35tcpg236-1), if the P element of the macro is placed at IOB_X0Y14.IOB33M/OUTBUF BEL, then the INV element must be placed at IOB_X0Y13.IOB33S/O_ININV BEL and the N element must be placed at IOB_X0Y13.IOB33S/OUTBUF BEL.

In other cases, each site will have a handful of legal placements. The RAM64X1D has two legal placements in 7-series: the DP cell can be placed at C6LUT or A6LUT, with the SP cell at D6LUT or B6LUT, respectively, within the same site.
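One way to picture this is a per-macro table of legal BEL assignments that the placer enumerates for a candidate site. The Python sketch below assumes such a table exists; `RAM64X1D_OPTIONS` encodes only the two arrangements described above, while `legal_macro_placements`, the cell name and the site name used in the usage example are hypothetical.

```python
# The two legal RAM64X1D arrangements described above (macro child -> BEL).
RAM64X1D_OPTIONS = [
    {"DP": "C6LUT", "SP": "D6LUT"},
    {"DP": "A6LUT", "SP": "B6LUT"},
]


def legal_macro_placements(cell_name, site, options, occupied):
    """List the arrangements of a macro that fit in `site`.

    `occupied` is the set of BEL names already in use. Each result maps the
    hierarchical child cell name to the BEL it would be placed at.
    """
    results = []
    for option in options:
        bels = {
            f"{cell_name}/{child}": f"{site}/{bel}"
            for child, bel in option.items()
        }
        if not any(bel in occupied for bel in bels.values()):
            results.append(bels)
    return results
```

For example, with `D6LUT` already occupied in a site named `SLICE_X0Y0.SLICEM` (an illustrative name), only the A6LUT/B6LUT arrangement is returned.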

Test SRL's and SRL chains

Current status

7-series SRL primitives are untested, but should work in the current FPGA interchange implementation.

Work to be done

Test designs should be added to exercise SRL placement. It is expected that single SRL elements should work right away. Chained SRL primitives are more complicated. If the SRL chain length is less than 4, then the SRLs have exactly 1 legal placement, and probably need cell constraints (similar to #262). However, if the SRL chain length is longer than 4, then the chain has to be broken into chains of length 4 in some way. There are many legal chain arrangements, and using the cell constraint system will force the chain split into a particular configuration. This may be okay.
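As a sketch of the simplest splitting policy, the Python snippet below cuts a long chain greedily into consecutive segments of at most 4 cells. `split_srl_chain` is a hypothetical helper; this is only one of the many legal arrangements mentioned above, and it says nothing about how the segments are connected or constrained afterwards.

```python
def split_srl_chain(cells, max_len=4):
    """Greedily split an ordered SRL chain into segments of at most max_len.

    `cells` is the chain in shift order. Every segment except possibly the
    last has exactly max_len cells.
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    return [cells[i:i + max_len] for i in range(0, len(cells), max_len)]
```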

XDC parser enhancements

The current FPGA interchange XDC parser doesn't support the set of features required. Many of the XDC commands implemented in https://github.com/SymbiFlow/yosys-symbiflow-plugins/ need to be added to the FPGA interchange XDC parser.

Specific examples:

[get_ports] should return all ports
None of the clock commands are implemented.

The Linux LiteX XDC file is a good driving example.

Fabric placement constraints

Current Status

In 7-series and UltraScale+, some cells have placement rules that the current FPGA interchange does not express. Following these rules is important for good clock placement. In some cases, it will be required to create a valid design. The current FPGA interchange implementation doesn't detect or implement proper placement in these cases.

Examples

7-series CCIO IO pins have a dedicated path from the IO pin to the BUFG/BUFGCTRL if the BUFG is placed in the same half of the fabric.
7-series PLL/MMCMs have dedicated paths from other PLL/MMCMs in the CMTs above and below their location
7-series PLL/MMCMs have dedicated paths from clock buffers in the same CMT (BUFR/BUFH) or neighboring CMT (BUFMR) or fabric half (e.g. BUFG)
7-series IDELAYCTRL have implicit connections to IDELAY2 elements within the same IO bank

These types of constraints are enforced in the symbiflow-arch-defs script https://github.com/SymbiFlow/symbiflow-arch-defs/blob/master/xc/common/utils/prjxray_create_place_constraints.py

Discussion

Some of these rules could be discovered by examining the routing fabric, but it may be easier to constrain these explicitly. In the case of the IDELAYCTRL, these rules have to be supplied, because the routing fabric lacks the connection altogether.
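An explicitly supplied rule such as the IDELAYCTRL one could be checked with something as small as the Python sketch below. It is an illustration only: the input is assumed to be a list of (cell type, IO bank) pairs for already placed cells, the bank numbers in the usage are made up, and `banks_missing_idelayctrl` is a hypothetical helper rather than anything in nextpnr or symbiflow-arch-defs.

```python
def banks_missing_idelayctrl(placed_cells):
    """Return the IO banks that contain an IDELAY2 but no IDELAYCTRL.

    `placed_cells` is a list of (cell_type, io_bank) pairs.
    """
    idelay_banks = {bank for cell_type, bank in placed_cells
                    if cell_type == "IDELAY2"}
    ctrl_banks = {bank for cell_type, bank in placed_cells
                  if cell_type == "IDELAYCTRL"}
    return sorted(idelay_banks - ctrl_banks)
```

A placer could use a check like this either to reject a placement or to decide where an IDELAYCTRL has to go.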

FPGA interchange development road map

Project page: https://github.com/orgs/SymbiFlow/projects/22

The implementation of the FPGA interchange as of YosysHQ@692d7dc and chipsalliance/python-fpga-interchange@b4331ef can place and route 7-series designs using single ended IO, FFs, LUTs and BRAMs for the most part. This issue covers the remaining work in different directions that the FPGA interchange development can take from here. I've broadly broken future work into 3 areas.

Improve testing, robustness and speed

There is a lack of testing around some key portions of the FPGA interchange implementation. There are also some key performance issues that prevent the implementation from being particularly usable.

Need to add Read the Docs build and initial documentation structure (chipsalliance/fpga-interchange-schema#16)
Investigate lookahead improvements (#260)
Investigate BEL validity checking improvements (#261)
Need site routing test framework (#234)
Improve YAML anchor names (chipsalliance/python-fpga-interchange#30)

Get FPGA interchange 7-series nextpnr implementation to feature parity with symbiflow-arch-defs

Many important features present in symbiflow-arch-defs are missing from the current FPGA interchange flow. Significant work remains to reach feature parity.

Improve clock/PLL/MMCM placement (#263)
Need a FPGA interchange to FASM generator (chipsalliance/python-fpga-interchange#27)
- Add kokoro runner for nextpnr to run diff FASM tests
CARRY4/XORCY/MUXCY support (#262)
LUT-RAM / Differential IO support (#264)
Test SRL's and SRL chains (#265)
Timing driven place and route (chipsalliance/fpga-interchange-schema#15)
Start testing more circuits (some may be blocked by CARRY4/LUT-RAM implementations)

Get FPGA interchange nextpnr implementation to bigger fabrics and circuits

The current 7-series and other initial target architectures are limited to the 50k - 200k LUT range. To test with larger designs, UltraScale+ (or other) fabrics will need to be supported. This category of work covers what is required to begin exploring larger and larger fabrics, with the end goal of operating on the largest fabrics from Xilinx (VU19P) and other vendors.

UltraScale+ bring-up (Ultra96 / UltraScale+ MPSoC ZU3EG) (Put link to issue here!)
Improve node storage (Put link to issue here!)

FPGA interchange lookahead improvements

The current lookahead in the FPGA interchange implementation is disabled by default because it is slow to compute and requires significant memory, both when being computed and when being used. This issue covers topics around how the lookahead might be improved and goals in that direction.

Current status

The current lookahead is data-driven, and in theory should be robust to various architectures, and work both on timing and non-timing driven situations. It is derived from the extended map lookahead (https://github.com/verilog-to-routing/vtr-verilog-to-routing/blob/master/vpr/src/route/router_lookahead_extended_map.h) developed as part of the VPR flow for https://github.com/symbiflow/symbiflow-arch-defs/ .

Current problems

The time to compute the lookahead is fairly high for the A35T (6 min on a 56 core machine). In theory, computation time should not increase with larger fabrics, but this has not been verified.
Memory consumption is fairly high, both during the computation of the lookahead (~15 GiB) and when the lookahead is being used (~6 GiB).

Potential solutions

To lower the computation time when computing the lookahead, consider shrinking the search space, either geometrically or by limiting the depth of the expansion. Doing so will need to be traded off against router performance.
Profile lookahead computation and see if there are further optimizations that are possible
Explore ways to shrink the number and size of CostMaps (which consume the majority of the disk and RAM)
Currently the lookahead has no duplicate detection for wire pairs with similar cost maps. Potentially detect (or supply in the chipdb) wire similarity information. There is a trade off to requiring arches to supply this information, so prefer solutions that are driven by the routing graph alone, rather than architecture specific input.
Explore parametric equations to represent CostMaps. It is likely that many/all of the cost maps could be reduced to a set of parametric equations and/or table lookups with less resolution. See if level of detail techniques can be applied.
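To make the duplicate detection idea concrete, here is a small Python sketch that treats two cost maps as the same when every entry rounds to the same multiple of a tolerance, and stores only one copy per group. The data layout (a dict from wire pair to a tuple-of-tuples grid), the tolerance value and the name `dedupe_cost_maps` are all assumptions for illustration; the real CostMap structures in nextpnr are different.

```python
def dedupe_cost_maps(cost_maps, tolerance):
    """Share storage between cost maps that are equal after quantization.

    cost_maps: {wire_pair: grid}, where grid is a tuple of rows of floats.
    Returns (unique_grids, assignment) where assignment maps each wire pair
    to an index into unique_grids.
    """
    unique_grids = []
    index_of_key = {}  # quantized grid -> index into unique_grids
    assignment = {}
    for wire_pair, grid in cost_maps.items():
        # Quantize every entry to an integer number of tolerance steps.
        key = tuple(tuple(round(v / tolerance) for v in row) for row in grid)
        if key not in index_of_key:
            index_of_key[key] = len(unique_grids)
            unique_grids.append(grid)  # keep the first grid seen as the copy
        assignment[wire_pair] = index_of_key[key]
    return unique_grids, assignment
```

This trades a bounded loss of resolution for fewer stored maps, and it uses only the cost data itself rather than architecture specific input.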

[interchange] Site pin conflicts in RAM test

Run make test-fpga_interchange-ram_basys3-dcp in nextpnr and open the DCP in Vivado. There are routing issues (nets in yellow rather than green), and report_route_status reports the following:

  GLOBAL_LOGIC0
    Unrouted Nodes (not associated with pins or ports):
      BRAM_L_X6Y30/BRAM_FIFO18_WEBWE4
      BRAM_L_X6Y30/BRAM_FIFO18_WEBWE5
      BRAM_L_X6Y30/BRAM_FIFO18_WEBWE6
      BRAM_L_X6Y30/BRAM_FIFO18_WEBWE7
  GLOBAL_LOGIC1
    Unrouted Nodes (not associated with pins or ports):
      BRAM_L_X6Y30/BRAM_FIFO18_REGCLKARDRCLK
      BRAM_L_X6Y30/BRAM_FIFO18_REGCLKB
    Conflicts with Site Pins: BRAM_L_X6Y30/BRAM_FIFO18_REGCLKB (RAMB18_X0Y12/REGCLKB)
  ram.rdclk
    Conflicts with Site Pins: BRAM_L_X6Y30/BRAM_FIFO18_REGCLKB (RAMB18_X0Y12/REGCLKB)

An illustration of some of the routing problems:

It appears that the cell-to-BEL pin mapping isn't correct.

Design files:
ram_basys3.zip

cc @acomodi - are the examples in nextpnr still supposed to work, or in practice are they deprecated and I should be using fpga-interchange-tests instead?

Investigate FPGA interchange BEL validity checking improvements

The current FPGA interchange BEL validity checks are slow and have some potential robustness issues. The robustness issues will hopefully be flushed out once #234 is stood up and test cases are added. This issue primarily covers the performance issues in the current BEL validity checks.

Current status

The BEL validity checks (isBelLocationValid) implement the following checks:

Constraint testing (here)
Fabric specific connection wiring (in DedicatedInterconnect)
Site routing (in SiteRouter)
LUT equation checking (in LutMapper)

This covers most of the validity checks required for doing P&R on 7-series fabrics, but further testing is required to ensure that the current set of implementations is complete enough to implement the 7-series placement rules (see #234).

Current problems

The performance of the BEL validity checks is fairly poor. For example, Murax HeAP time is currently:

Info: HeAP Placer Time: 87.79s
Info:   of which solving equations: 1.62s
Info:   of which spreading cells: 0.57s
Info:   of which strict legalisation: 83.49s

Approximately 40% of the 83 seconds spent in strict legalisation is solely in the LUT equation handling. There is also significant wasted time in the cell BEL pin mapping logic that should be fixable.

Potential solutions

The current cell BEL pin cache check doesn't properly detect when a cell is placed on a different BEL within a site but otherwise has the same pin mapping. See here. (cell->cell_mapping != mapping) is basically a cache hit detection, but the mapping index is BEL specific within a site, rather than mapping specific. As a result, time is being spent re-mapping cell to BEL pins even though the mapping will be identical.
The LUT equation logic is currently taking 40% of the strict legalization time. This seems too high. Consider rewriting after testing from #234 is in place. Expectation is that a better chip-db pre-compute or caching should enable LUT equation logic code to be faster and simpler.
Profile more placements of circuits (e.g. run with --no-route) and see if other parts of the strict legalization check can be improved.
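The first point can be illustrated with a toy cache that is keyed on the contents of the pin mapping instead of on a BEL specific index. This Python sketch is not the nextpnr data structure: `PinMappingCache`, its fields and the pin names in the usage are hypothetical, and it only shows why a content based key avoids the redundant re-mapping.

```python
class PinMappingCache:
    """Remember the last pin mapping applied to each cell, by content."""

    def __init__(self):
        self.remap_count = 0
        self._current = {}  # cell name -> frozenset of (cell pin, BEL pin)

    def on_place(self, cell_name, pin_mapping):
        """Return True if the cell's pins had to be re-mapped.

        Two BELs with identical {cell pin: BEL pin} dicts produce the same
        key, so moving a cell between them is a cache hit.
        """
        key = frozenset(pin_mapping.items())
        if self._current.get(cell_name) == key:
            return False  # same mapping as before, nothing to redo
        self._current[cell_name] = key
        self.remap_count += 1
        return True
```

For instance, placing a LUT first on one BEL and then on another BEL that maps `I0` and `O` to the same BEL pin names only triggers one re-map.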

Need to stand up MSAN / ASAN / TSAN test runs

Currently nextpnr has support for building with MSAN / ASAN / TSAN, but MSAN immediately fails with the following error:

==1506074==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x55ac3a in std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::_M_check_len(unsigned long, char const*) const /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:1758:6
    #1 0x574d94 in void std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::_M_realloc_insert<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&>(__gnu_cxx::__normal_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/vector.tcc:436:2
    #2 0x7f54f5cb0ae6 in boost::program_options::option_description::set_names(char const*) (/usr/lib/x86_64-linux-gnu/libboost_program_options.so.1.74.0+0x2eae6)
    #3 0x7f54f5cb1301 in boost::program_options::option_description::option_description(char const*, boost::program_options::value_semantic const*, char const*) (/usr/lib/x86_64-linux-gnu/libboost_program_options.so.1.74.0+0x2f301)
    #4 0x7f54f5cb1454 in boost::program_options::options_description_easy_init::operator()(char const*, char const*) (/usr/lib/x86_64-linux-gnu/libboost_program_options.so.1.74.0+0x2f454)
    #5 0x529390 in nextpnr_fpga_interchange::CommandHandler::getGeneralOptions() cat_x/nextpnr/common/command.cc:103:5

SUMMARY: MemorySanitizer: use-of-uninitialized-value /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:1758:6 in std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::_M_check_len(unsigned long, char const*) const
Exiting

This appears to be a problem in boost::program_options, but it could also be a usage problem.

To catch these kinds of errors, there should be a CI that builds nextpnr for its various arches and makes sure that it passes some of the smoke tests. The current Cirrus CI is too slow to support this in a reasonable amount of time, but a GH Actions flow should be able to do this in a reasonable amount of time.

@gatecat FYI

Test and implement CARRY4/XORCY/MUXCY support

Current status

The current FPGA interchange implementation never sets relative placement constraints. This means that carry chain structures will have a hard time satisfying the DedicatedInterconnect checks and will waste a lot of time in placement strict legalization. nextpnr's relative constraint system should be usable for this case, with at least 1 exception.

Potential solutions

During packing, chains of cells that have dedicated interconnect need to be detected and those chain structures identified. In the simple case, once the chain is found, setting the cell placement constraint will be enough, see https://github.com/YosysHQ/nextpnr/blob/692d7dc26ddf21e2d38dd16aecef652ab4c0d5e3/common/nextpnr_types.h#L168-L175

There is an exception case that is not solvable with the current system from nextpnr, and that has to do with 7-series carry chains that extend across CMTs. It is worth noting that the symbiflow-arch-defs implementation has the same deficiency, so ignoring this case is potentially fine. The specific case in question is what happens if a carry chain extends across the BRKH_CLB tile type that appears between CMTs. In this case the constr_y will be 2 instead of 1. To solve this would likely require either a placement constraint system change or an Arch API change.
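The exception can be seen with a few lines of Python. The sketch assumes a single column of tile rows in which some rows are break rows (standing in for BRKH_CLB) that cannot hold a carry element; the row numbers and the helpers `chain_rows` and `relative_steps` are made up for illustration and are not tied to nextpnr's constraint fields.

```python
def chain_rows(length, root_row, break_rows):
    """Rows occupied by a vertical carry chain that skips break rows."""
    rows = []
    row = root_row
    for _ in range(length):
        while row in break_rows:
            row += 1  # a break row cannot hold a chain element
        rows.append(row)
        row += 1
    return rows


def relative_steps(rows):
    """Row offset from each chain element to the next one."""
    return [nxt - cur for cur, nxt in zip(rows, rows[1:])]
```

With a 4 element chain rooted at row 48 and a break at row 50, the rows are 48, 49, 51, 52 and the steps are 1, 2, 1. A fixed offset of 1 per element therefore cannot describe the chain unless the root position is known in advance.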

FPGA Interchange Global Clock Routing

Prerequisites:

wire types, as being discussed in chipsalliance/fpga-interchange-schema#31
specification for more complex rules, chipsalliance/fpga-interchange-schema#34

Global clock routing is likely to be based on a BFS, at least to start with, primarily from the sink back to the root but following certain wire type constraints and aiming to maximise shared routing along the way. This approach has worked well for previous devices - see Nexus example:

https://github.com/YosysHQ/nextpnr/blob/3fd1ee7757356660c7f440705553d345837eaed5/nexus/global.cc#L45-L137
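As a sketch of the idea (not the Nexus code linked above), the following Python function runs a breadth-first search from a sink wire back towards the clock root, only stepping onto wires whose type is allowed. The graph representation (`uphill` as a dict from a wire to the wires that can drive it), the wire type names and `route_to_root` are all assumptions; sharing of routing between sinks is not modelled.

```python
from collections import deque


def route_to_root(sink, root, uphill, wire_type, allowed_types):
    """BFS from sink back to root over wires of allowed types.

    Returns the path as a list ordered root first and sink last, or None if
    the root cannot be reached under the wire type constraint.
    """
    parent = {sink: None}  # wire -> next wire on the way to the sink
    queue = deque([sink])
    while queue:
        wire = queue.popleft()
        if wire == root:
            path = []
            while wire is not None:
                path.append(wire)
                wire = parent[wire]
            return path
        for prev in uphill.get(wire, ()):
            if prev in parent:
                continue  # already visited
            if prev != root and wire_type.get(prev) not in allowed_types:
                continue  # e.g. keep the clock off general interconnect
            parent[prev] = wire
            queue.append(prev)
    return None
```

Restricting `allowed_types` to clock wire types is what keeps the search on dedicated resources; widening the set is the fallback when no such path exists.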

Similarly, such an approach could also be used to determine placements for global buffers that maximise use of dedicated resources automatically, similar to https://github.com/YosysHQ/nextpnr/blob/3fd1ee7757356660c7f440705553d345837eaed5/nexus/pack.cc#L674-L797. For cases where this doesn't work, manual associations as described in #263 will be needed.

Finally, UltraScale clock routing will need some logic to find the clock root and route to/from it.

[Interchange] Clock routed through general interconnect

In some designs (e.g. ram-test), the clock net crosses a clock region and gets into the general interconnect.

In the image above, the highlighted signal is the clock net that enters the general interconnect through a CLB site-thru.

Despite this route being accepted, it should not occur unless strictly needed.

I think this situation can be fixed with the following:

Global clock routing
Support timing-driven routing
