fastmachinelearning / hls4ml

Machine learning on FPGAs using HLS

Home Page: https://fastmachinelearning.org/hls4ml

License: Apache License 2.0

Python 42.10% C++ 54.07% C 0.10% Tcl 1.06% Shell 0.88% Makefile 0.02% SystemVerilog 1.16% Verilog 0.60%
fpga hls intel-hls keras machine-learning neural-network onnx python pytorch vivado vivado-hls

hls4ml's People

Contributors

adrianalan, benjaminkreis, bo3z, calad0i, d-gol, delonshen, dependabot[bot], drankincms, duchstf, ejk43, hamzajaved780, janfschulte, jmduarte, jmitrevs, jngadiub, jochist, joshlerner, keb-l, landay7, laurilaatu, maksgraczyk, ngpaladi, nhanvtran, nicologhielmetti, pitmonticone, pre-commit-ci[bot], thesps, vloncar, yiiyama, zhenbinwu


hls4ml's Issues

multi pumping

Starting a new issue so we can discuss multi pumping

Following up on @zhenbinwu's presentation today, I tried running our 1-hidden-layer example with the multipliers using LUT-based cores instead of DSP-based cores (i.e., #pragma HLS RESOURCE variable=my_var core=Mul_LUT).
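For reference, a minimal sketch of how such a pragma can be attached to the product of a multiplication (the function and variable names here are illustrative, not the hls4ml code itself):

    #include "ap_fixed.h"

    // Minimal sketch: force the multiplier core to LUTs instead of DSPs.
    // ap_fixed<36,16> is the full-width result of an ap_fixed<18,8> x ap_fixed<18,8> product.
    ap_fixed<36, 16> mult_lut(ap_fixed<18, 8> a, ap_fixed<18, 8> b) {
        #pragma HLS PIPELINE II=1
        ap_fixed<36, 16> prod = a * b;
        #pragma HLS RESOURCE variable=prod core=Mul_LUT
        return prod;
    }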

Using Vivado 2018.2 and default hls4ml options from the head, this is what I see:

                  Latency   DSP   FF     LUT
DSP multipliers   12        305   9798   11819
LUT multipliers   11        0     9127   137371

I think this is more like what we expected to see -- that the DSPs go to zero, and we use way more LUTs.

The number of LUTs per multiplication roughly makes sense, I think. If you take the excess LUTs and divide by the number of DSPs that were used, you get ~411 LUTs per multiplication. If you create an 18-bit x 18-bit LUT-based multiplier IP core in regular Vivado, you get 365 or 401 LUTs per multiplication depending on the optimization option (non-constant coefficient).

It will be interesting to find out how this multi pumping code uses so few!

multiple layers running in parallel on same input

@sergojin has a use-case for multiple dense layers running in parallel on the same input and producing multiple outputs.

Two example keras models are here:
https://github.com/hls-fpga-machine-learning/hls4ml/tree/multiple_layers/keras-to-hls/fromSergo

And a working HLS project made by hand from the ".5" model is here:
https://github.com/hls-fpga-machine-learning/hls4ml/tree/multiple_layers/keras-to-hls/my-hls-test-modified
The final layers run in parallel on the output of the previous layer, and their output is merged to form the result.

What is needed is the hls4ml translation part. Right now we assume the output of each layer is the input to only one layer, with the order taken from the order in the JSON file. @sergojin and @nhanvtran found that we can use the inbound_nodes field of the JSON to map the layers to each other.

Synthesis fails with io_serial

After changing IOType to io_serial in keras-config.yml, the resulting code does not synthesize. Only the 1-layer model with sigmoid activation succeeds, and that requires #81 to succeed.

Another problem is that models with Conv layers produce code with the wrong pragmas, e.g., #pragma HLS STREAM variable=logits1 complete depth=1. This fails with the error:

ERROR: [HLS 200-70] pragma 'STREAM variable=logits1 complete depth=1' has unknown option 'complete'
ERROR: [HLS 200-70] '#pragma HLS STREAM variable=logits1 complete depth=1' is not a valid pragma.

However, this is trivially solved by removing the offending keyword. I can make a PR with the change in hls-writer.py, but this does not solve all the problems of synthesis hanging.
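For reference, with the offending keyword dropped the generated pragma would read:

    #pragma HLS STREAM variable=logits1 depth=1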

The errors are different for different models: some take forever to finish (like 3layer), some fail with a memory-related error (sorry, I didn't keep the logs), and some (like the KERAS_conv1d_small model) fail with an error like:

WARNING: [XFORM 203-124] Array  'conv_layer2_in.V' (firmware/serial_test.cpp:74): The entries are not accessed in sequential order.
WARNING: [XFORM 203-124] Array  'conv_layer3_in.V' (firmware/serial_test.cpp:87): The entries are not accessed in sequential order.
WARNING: [XFORM 203-124] Array  'logits5.V' (firmware/serial_test.cpp:105): The entries are not accessed in sequential order.
ERROR: [XFORM 203-123] Cannot stream  'data.V' (firmware/serial_test.cpp:42): The entries are not accessed in sequential order.
ERROR: [HLS 200-70] Pre-synthesis failed.

store_in_bram functionality

So far we have 2 working modes -- one for LHC trigger (low reuse factor, 1-6, weights in the fabric) and one for "naive" serial mode (see PR #45).

One interesting mode is a very large reuse factor with the weights stored in BRAMs. This is particularly useful for really big networks and SDAccel-like use cases.

However, playing around with the current code, I found that it's not so trivial to store weights in BRAMs and run with #pragma HLS PIPELINE. HLS always wants to partition the weight array.

@ejk43 you have any ideas here? I thought we could remove the PIPELINE and go back to the days of DATAFLOW pragma...
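For concreteness, a minimal sketch of the combination being discussed (sizes and names are illustrative, not the hls4ml code): a weight array requested in BRAM via a RESOURCE pragma inside a function-level PIPELINE, which is exactly where HLS starts insisting on partitioning the array instead.

    #include "ap_fixed.h"

    // Sketch only: request BRAM storage for the weights. With the function-level
    // PIPELINE below (which unrolls the loops), HLS instead wants to completely
    // partition this array -- the conflict described in this issue.
    void dense_bram(ap_fixed<18, 8> data[16], ap_fixed<18, 8> res[16]) {
        #pragma HLS PIPELINE
        static ap_fixed<18, 8> weights[16 * 16];
        #pragma HLS RESOURCE variable=weights core=RAM_2P_BRAM

        for (int o = 0; o < 16; o++) {
            ap_fixed<18, 8> acc = 0;
            for (int i = 0; i < 16; i++) {
                acc += data[i] * weights[i * 16 + o];
            }
            res[o] = acc;
        }
    }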

Compiler error with large amount of weights

This issue was originally reported by Rishraj. He has a NN with layer 1 dimensions 784x512, layer 2 512x512, layer 3 512x512, and layer 4 512x10, but the code fails during CSIM compilation:

INFO: [SIM 211-2] *************** CSIM start ***************
INFO: [SIM 211-4] CSIM will launch GCC as the compiler.
Compiling ../../../../myproject_test.cpp in debug mode
Compiling ../../../../firmware/myproject.cpp in debug mode
gcc: internal compiler error: Segmentation fault (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See http://gcc.gnu.org/bugs.html for instructions.
make: *** [obj/myproject.o] Error 4
ERROR: [SIM 211-100] 'csim_design' failed: compilation error(s).
INFO: [SIM 211-3] *************** CSIM finish ***************

This was tracked down to the weight initialization. For our dense 200 network, we have only 164,200 weights and CSIM passes, while the above network has 930,796 weights.

From a series of tests, Vivado can compile 262,144 weights and 262,144 * 2 weights, but fails with 663,168 weights. The memory usage is ~10% on correlator2.fnal.gov. This seems to be another HLS compiler issue.

Training with reduced precision

It would be easy to reduce the precision of the inputs only and still represent them with high precision in Keras or TF. However, it would be interesting if we could train reduced precision weights, too.

@violatingcp pointed out some precision options
https://www.tensorflow.org/versions/r0.12/api_docs/python/framework/tensor_types
keras-team/keras#2019
but we haven't yet found a way to do this with arbitrary precision.

We may also find something interesting in the binarized network implementation: https://gitlab.com/kunglab/ddnn

Syn failed for 3Layer with sublayer

The current master branch of hls4ml can't synthesize the 3-layer model.

ERROR: [XFORM 203-103] Array 'mult.V' (/data/benwu/HLS4ML_2018/hls4ml/nnet_utils/nnet_layer.h:56): partitioned elements number (2048) has exeeded the threshold (1024), which may cause long run-time.
ERROR: [HLS 200-70] Pre-synthesis failed.
command 'ap_source' returned error code
    while executing
"source [lindex $::argv 1] "
    ("uplevel" body line 1)
    invoked from within
"uplevel \#0 { source [lindex $::argv 1] } "

INFO: [Common 17-206] Exiting vivado_hls at Fri Jul 13 12:00:25 2018...

The error can be traced to https://github.com/hls-fpga-machine-learning/hls4ml/pull/62/files#diff-b37c065f136460b015788b96b5c25102L52

Interface for data type selection/tuning?

How would a user specify data types for the generated network?

A few possible scenarios come to mind:

  • The converter uses a default width that is "okay" but could be improved (for example: 32 bits wide, which I believe is the current default)
  • Bit widths are specified in a configuration file (float, 32 bits, 18 bits, 4 bits, whatever)
  • Developer tunes bit widths manually after initially validating performance with larger types
  • The python converter evaluates the network and asserts the best selection of integer/fractional bit widths

Or some combination of the above. Any other ideas?
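Whatever interface we pick, in the generated C++ this ultimately boils down to a handful of ap_fixed typedefs, along the lines of the sketch below (names are purely illustrative, not the actual generated code):

    #include "ap_fixed.h"

    // Purely illustrative typedefs -- not the actual generated parameter names.
    typedef ap_fixed<32, 8> input_t;   // wide "safe" default for initial validation
    typedef ap_fixed<18, 8> layer1_t;  // per-layer width tuned after checking accuracy
    typedef ap_fixed<16, 6> result_t;  // narrower output precision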

Understanding HLS multipliers

@ejk43 I was hoping to do some simple tests with a single multiplication in HLS to get a more quantitative understanding of the latency, II, and DSP usage. We know the trends qualitatively, such as more DSPs and latency for higher precision, but I'd love to know the numbers.

HLS must have some freedom in determining these things. First of all, when you create a multiplier IP in normal Vivado, you have choices for the number of pipeline stages, a.k.a. latency, even for a fixed precision.

I also see this in some basic tests:

In a trivial HLS project with one multiplication of ap_fixed<18,8> numbers with PIPELINE=1, I got a multiplier with a latency of 1:

    test_mult_mul_mulbkb_U1 : component test_mult_mul_mulbkb
    generic map (
        ID => 1,
        NUM_STAGE => 1,
        din0_WIDTH => 18,
        din1_WIDTH => 18,
        dout_WIDTH => 28)
    port map (
        din0 => b_V,
        din1 => a_V,
        dout => p_Val2_s_fu_67_p2);

However, in our 1-hidden-layer example, also with ap_fixed<18,8>, I see that the number of stages is 3:

    myproject_mul_11nfYi_U7 : component myproject_mul_11nfYi
    generic map (
	ID => 1,
        NUM_STAGE => 3,
        din0_WIDTH => 11,
        din1_WIDTH => 18,
        dout_WIDTH => 28)
    port map (
        clk => ap_clk,
        reset => ap_rst,
        din0 => grp_fu_743_p0,
        din1 => grp_fu_743_p1,
        ce => grp_fu_743_ce,
        dout => grp_fu_743_p2);

So my question is, how does HLS make these choices?

Do you know if there are some basic rules for this, or does it really depend case-by-case on the whole project routing, in which case I have no hope of mapping this out with a single multiplier project?
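For reference, the trivial single-multiplication project mentioned above would look roughly like this (a sketch of the kind of test, not the exact code used):

    #include "ap_fixed.h"

    // Trivial test: one ap_fixed<18,8> multiplication with PIPELINE, to see what
    // latency (NUM_STAGE) HLS chooses for the generated multiplier core.
    ap_fixed<36, 16> test_mult(ap_fixed<18, 8> a, ap_fixed<18, 8> b) {
        #pragma HLS PIPELINE II=1
        return a * b;
    }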

RNN/LSTM in HLS Library

Fill out the plausible types of network layers:

  • Convolutional
  • Recurrent

What else am I missing?

Next version of CNN not hitting pipeline target

Hi folks,

I made some big changes to the CNN code so that it:

  1. Allows for multiple filters
  2. Sums over channels within a filter
  3. Configurable stride
  4. Configurable padding

The csim results match within a few percent for the one example I've tried, but unfortunately HLS can't hit the interval=1 target and gets 2 instead. I haven't yet been able to figure out exactly what's causing it, but perhaps some relevant output is this:
INFO: [SCHED 204-61] Pipelining function 'conv_1d.0.0.0.0'.
WARNING: [SCHED 204-69] Unable to schedule 'store' operation (/home/kreis/conv/HLS4ML/nnet_utils/nnet_conv.h:133) of variable 'acc[0][2].V', /home/kreis/conv/HLS4ML/nnet_utils/nnet_conv.h:122 on array 'res_0_V' due to limited memory ports.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 2, Depth: 6.
WARNING: [SCHED 204-21] Estimated clock period (4.89ns) exceeds the target (target clock period: 5ns, clock uncertainty: 0.625ns, effective delay budget: 4.38ns).
WARNING: [SCHED 204-21] The critical path consists of the following: 'mul' operation ('p_Val2_257_2', /home/kreis/conv/HLS4ML/nnet_utils/nnet_conv.h:98) (4.89 ns)

I tried paring down the computations (fewer multiplications, lower precision, changing multiplications to additions, less accumulation, etc.) and could only get to interval=1 in some pretty specific circumstances that I don't think provide any great insight.

I'm still investigating, but I thought I'd create this issue in case anyone already has some ideas.

The main branch I'm developing on is this one

And I have a second one where I tried separating the accumulator loop within the filter and over channels into two here

The example project is updated with a 1D CNN I trained in Keras.

Zero initializing the arrays

While fiddling with @jngadiub's binary network I have stumbled upon a trivial change which seems to reduce resource usage and latency significantly.

In compute_layer_nobias, the array acc is initialized to zero in a loop. Using the array initialization syntax {0} instead:

typename CONFIG_T::accum_t acc[CONFIG_T::n_out] = {0};

and commenting out/removing the ResetAccum loop reduces the resource usage and latency significantly. This is still valid C++ code: it passes csim with the same results, synthesizes, and even cosim passes. The question is why it uses significantly fewer resources.
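For reference, a simplified sketch of the loop-based reset being removed (not verbatim from nnet_layer.h, which may differ slightly):

    // Explicit reset loop (the ResetAccum loop referred to above)
    typename CONFIG_T::accum_t acc[CONFIG_T::n_out];
    ResetAccum: for (int iacc = 0; iacc < CONFIG_T::n_out; iacc++) {
        #pragma HLS UNROLL
        acc[iacc] = 0;
    }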

Here are the synthesis reports without and with array initialization (report screenshots attached in the original issue):

Loop initialization: [loop_init report]

Array initialization: [array_init report]

Looks too good to be true :-)

The same idea can be applied elsewhere, e.g., in the softmax activation. It doesn't change the latency there, but it reduces resource usage.

Memory problems when synthesizing Conv1D models

@ejk43 We're trying to synthesize some larger Conv1D models in Vivado HLS 2017.2 and we're wondering if the problems we're seeing are just due to the memory available on our computer.

When we synthesize a very small Conv1D model, it works. Here's a keras-config.yml

KerasJson: example-keras-model-files/KERAS_conv1d_small.json
KerasH5:   example-keras-model-files/KERAS_conv1d_small_weights.h5
OutputDir: my-hls-test
ProjectName: myproject
XilinxPart:  xcku115-flvf1924-2-i
ClockPeriod: 5

IOType: io_parallel # options: io_serial/io_parallel
ReuseFactor: 1
DefaultPrecision: ap_fixed<16,6>

and the commands to build it:

python keras-to-hls.py -c keras-config.yml
cd my-hls-test
vivado_hls -f build_prj.tcl

but when we run a larger model, e.g.

KerasJson: example-keras-model-files/KERAS_conv1d.json
KerasH5:   example-keras-model-files/KERAS_conv1d_weights.h5
OutputDir: my-hls-test

We get the following error during csynth:

ERROR: [XFORM 203-504] Stop unrolling loop 'ConvOut' (/home/jduarte1/hls-fpga-machine-learning/nnet_utils/nnet_conv.h:79) in function 'nnet::conv_1d<ap_fixed<16, 6, (ap_q_mode)5, (ap_o_mode)3, 0>, ap_fixed<16, 6, (ap_q_mode)5, (ap_o_mode)3, 0>, config2>' because it may cause large runtime and excessive memory usage due to increase in code size. Please avoid unrolling the loop or form sub-functions for code in the loop body.
ERROR: [HLS 200-70] Pre-synthesis failed.

Is this just a memory problem? Or can we solve this by changing our HLS code? What kind of computer are you using to synthesize and compile firmware?

Thanks!

Automated testing/continuous integration?

Just tossing an idea out here, but would there be any interest in setting up a small automated test framework to "protect" some of the intended behavior of the resource reuse API?

I'm starting to get afraid that something I edit may accidentally break existing features. Thoughts? Does anyone have experience with Jenkins or Travis CI? I believe they both provide free access for open source projects.

Network compression

After low weights are removed from a network, how do we implement the mechanism for skipping those in the HLS translation and RTL?

parallel mode and BRAMs

getting past "sublayer" with block partitioning and putting BRAMs in parallel mode

See discussion in issue #46

Implement Resource-Reuse API for Fully-Connected Layer

I'd like to take a stab at ironing out what I'll tentatively call the "resource reuse API"-- ie, how the user of the compute_layer function will manipulate the resource usage of the core...

To that end: Can we come up with a few use cases we'd like to capture in this API? (the scope of the resource-usage problem is large enough that I'd prefer to start by agreeing on a few common use cases and design the function to cover these scenarios)

A few suggestions:

  1. Totally unrolled, fully parallel layer
    • This is your general use case-- all features are consumed and operated on in parallel
    • Usage: Set fully_parallel option to true in the struct
  2. Partially unrolled with 1-4 cycles of Initiation Interval (II)
    • Also useful for your application. Notionally, for an II=4, this should cut the multipliers by 4x and consume 1/4 of the features per clock cycle.
    • Usage: Set fully_parallel to false. Set target_initiation_interval to 4? Or do we want to use the "roll_factor"?
  3. "Serial" operation with more unrolling
    • I'd classify this scenario as any situation where data is consumed serially (even if this is say 64-128 bits per clock cycle, which is plausibly common for lots of applications due to DMAs and other serial data transfer)
    • Usage: Set fully_parallel to false. Maybe specify this operation by how many features are consumed per input cycle?

Does this all make sense so far? What am I missing?
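To make the options above concrete, here is a rough sketch of what such a configuration struct could look like (field names mirror the proposals above; this is not an agreed API):

    // Rough sketch only -- names taken from the use cases above, not a final API.
    struct layer_config {
        static const bool     fully_parallel = false;          // case 1 vs. cases 2/3
        static const unsigned target_initiation_interval = 4;  // case 2: roll by II (or a "roll_factor")
        static const unsigned features_per_cycle = 64;         // case 3: serial consumption rate
    };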

Synthesis failed with large input database

I got the failure message below when trying to synthesize a project with ap_fixed<36,4> and a reuse factor of 3. The only difference from the previous trial is that I included a large input text file as the input events for the test bench. I am guessing the large data file is causing this.

INFO: [RTGEN 206-100] Generating core module 'myproject_mul_36sudo': 1 instance(s).
INFO: [RTGEN 206-100] Generating core module 'myproject_mul_36svdy': 3 instance(s).
INFO: [RTGEN 206-100] Generating core module 'myproject_mul_36swdI': 1 instance(s).
INFO: [RTGEN 206-100] Finished creating RTL model for 'compute_layer_0_0_0_s'.
INFO: [HLS 200-111] Elapsed time: 2.74 seconds; current allocated memory: 554.319 MB.

ERROR: unknown exception in database saving.
Synthesis failed.
while executing
"source [lindex $::argv 1] "
("uplevel" body line 1)
invoked from within
"uplevel #0 { source [lindex $::argv 1] } "

INFO: [HLS 200-112] Total elapsed time: 83 seconds; peak allocated memory: 554.319 MB.

Making pragmas configurable

In one project, we made a configurable pipeline that you can set in the tcl script:
https://github.com/p2l1pfp/GlobalCorrelator_HLS/blob/dev/run_hls_fullpfalgo_mp7.tcl#L7
https://github.com/p2l1pfp/GlobalCorrelator_HLS/blob/3edff5f79aa840b8a6ddb7f7eedea14f6b894197/firmware/simple_fullpfalgo.cpp#L430
but this only lets you set it once. This won't work for us if we want to use the same compute_layer function multiple times with different settings (e.g., different unroll factors) each time.

Instead, @nhanvtran had the idea of using preprocessor directives, as so:

#define my_unroll 1
compute layer...
#undef my_unroll

#define my_unroll 2
compute layer...
#undef my_unroll

I also tried passing a C++ object to a pragma, and surprisingly (to me anyway), it seemed to work. I just did

int test = 1;
#pragma HLS pipeline II=test

and got the same results when setting test to a value as when setting the pipeline directly.

Assuming giving a C++ object to a pragma really works, we could add arguments to the nnet_utils functions. If not, we could go with the preprocessor directives.

Last thing to note is we will need a place for the user to define what they want!

Softmax layer latency

Using the branch nt/resource-reuse-api I checked what the latency and resource usage are for the 3-layer model with two ReuseFactor test cases (below). In both cases the softmax layer takes 34 clocks, and I was going to check the code to see if this is expected.

ReuseFactor: 1

+ Latency (clock cycles): 
    * Summary: 
    +-----+-----+-----+-----+----------+
    |  Latency  |  Interval | Pipeline |
    | min | max | min | max |   Type   |
    +-----+-----+-----+-----+----------+
    |   59|   59|    1|    1| dataflow |
    +-----+-----+-----+-----+----------+

    + Detail: 
        * Instance: 
        +----------------------------------------+-----------------------+-----+-----+-----+-----+----------+
        |                                        |                       |  Latency  |  Interval | Pipeline |
        |                Instance                |         Module        | min | max | min | max |   Type   |
        +----------------------------------------+-----------------------+-----+-----+-----+-----+----------+
        |grp_compute_layer_0_0_0_2_fu_440        |compute_layer_0_0_0_2  |    5|    5|    1|    1| function |
        |grp_compute_layer_0_0_0_1_fu_508        |compute_layer_0_0_0_1  |    4|    4|    1|    1| function |
        |grp_compute_layer_0_0_0_3_fu_539        |compute_layer_0_0_0_3  |    4|    4|    1|    1| function |
        |grp_softmax_fu_575                      |softmax                |   34|   34|    1|    1| function |
        |grp_compute_layer_0_0_0_s_fu_587        |compute_layer_0_0_0_s  |    3|    3|    1|    1| function |
        |call_ret2_relu_2_fu_623                 |relu_2                 |    0|    0|    1|    1| function |
        |call_ret4_relu_1_fu_691                 |relu_1                 |    0|    0|    1|    1| function |
        |call_ret_relu_fu_727                    |relu                   |    0|    0|    1|    1| function |
        |StgValue_114_myproject_entry3_fu_763    |myproject_entry3       |    0|    0|    0|    0|   none   |
        |StgValue_115_myproject_entry490_fu_848  |myproject_entry490     |    0|    0|    0|    0|   none   |
        |StgValue_572_Block_proc_fu_906          |Block_proc             |    0|    0|    0|    0|   none   |
        +----------------------------------------+-----------------------+-----+-----+-----+-----+----------+

ReuseFactor: 4

+ Latency (clock cycles): 
    * Summary: 
    +-----+-----+-----+-----+----------+
    |  Latency  |  Interval | Pipeline |
    | min | max | min | max |   Type   |
    +-----+-----+-----+-----+----------+
    |   69|   69|    4|    4| dataflow |
    +-----+-----+-----+-----+----------+

    + Detail: 
        * Instance: 
        +----------------------------------------+-----------------------+-----+-----+-----+-----+----------+
        |                                        |                       |  Latency  |  Interval | Pipeline |
        |                Instance                |         Module        | min | max | min | max |   Type   |
        +----------------------------------------+-----------------------+-----+-----+-----+-----+----------+
        |grp_compute_layer_0_0_0_2_fu_440        |compute_layer_0_0_0_2  |    6|    6|    3|    3| function |
        |grp_compute_layer_0_0_0_3_fu_508        |compute_layer_0_0_0_3  |    6|    6|    3|    3| function |
        |grp_compute_layer_0_0_0_s_fu_539        |compute_layer_0_0_0_s  |    7|    7|    4|    4| function |
        |grp_softmax_fu_575                      |softmax                |   34|   34|    1|    1| function |
        |grp_compute_layer_0_0_0_1_fu_587        |compute_layer_0_0_0_1  |    7|    7|    4|    4| function |
        |call_ret2_relu_fu_623                   |relu                   |    0|    0|    1|    1| function |
        |call_ret4_relu_2_fu_691                 |relu_2                 |    0|    0|    1|    1| function |
        |call_ret_relu_1_fu_727                  |relu_1                 |    0|    0|    1|    1| function |
        |StgValue_125_myproject_entry3_fu_763    |myproject_entry3       |    0|    0|    0|    0|   none   |
        |StgValue_126_myproject_entry505_fu_848  |myproject_entry505     |    0|    0|    0|    0|   none   |
        |StgValue_593_Block_proc_fu_906          |Block_proc             |    0|    0|    0|    0|   none   |
        +----------------------------------------+-----------------------+-----+-----+-----+-----+----------+

compression in serial mode

How do we handle sparse matrices in serial mode?

Also, @ejk43 had an idea to compress weights to powers of 2 as another way to save resources.
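As a minimal illustration of the power-of-2 idea (a sketch, not anything in the repo): if a weight is constrained to be ±2^k, the multiplication reduces to a shift.

    #include "ap_int.h"

    // Sketch: multiply by a power-of-two weight using a shift instead of a DSP.
    // "k" is the stored exponent and "neg" the sign; names are illustrative.
    ap_int<32> po2_mult(ap_int<16> x, ap_uint<4> k, bool neg) {
        ap_int<32> shifted = ap_int<32>(x) << k;
        return neg ? ap_int<32>(-shifted) : shifted;
    }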

Compilation failure when result_t and data_t of compute_layer point to same data type

In this branch I've added a model from Sergo that's failing to compile. If you do the normal python keras-to-hls.py -c keras-config.yml you'll pick up the model and see the error when you try to build the project.

The error is

INFO: [SIM 211-2] *************** CSIM start ***************
INFO: [SIM 211-4] CSIM will launch GCC as the compiler.
   Compiling ../../../../myproject_test.cpp in debug mode
   Compiling ../../../../firmware/myproject.cpp in debug mode
In file included from ../../../../firmware/parameters.h:7:0,
                 from ../../../../firmware/myproject.cpp:21:
/home/kreis/muon/hls4ml/nnet_utils/nnet_layer.h: In function ‘void nnet::compute_layer(data_T*, res_T*, typename CONFIG_T::weight_t (*)[CONFIG_T:: n_out], typename CONFIG_T::bias_t*) [with data_T = ap_fixed<18, 8>, res_T = ap_fixed<18, 8>, CONFIG_T = config4, typename CONFIG_T::weight_t = ap_fixed<18, 8>, typename CONFIG_T::bias_t = ap_fixed<18, 8>]’:
../../../../firmware/myproject.cpp:82:77:   instantiated from here
/home/kreis/muon/hls4ml/nnet_utils/nnet_layer.h:100:13: error: invalid use of incomplete type ‘class ap_fixed<18, 8>’
/data/xilinx/Vivado_HLS/2017.2/include/ap_int.h:318:7: error: declaration of ‘class ap_fixed<18, 8>’
/home/kreis/muon/hls4ml/nnet_utils/nnet_layer.h:100: confused by earlier errors, bailing out
make: *** [obj/myproject.o] Error 1
ERROR: [SIM 211-100] 'csim_design' failed: compilation error(s).
INFO: [SIM 211-3] *************** CSIM finish ***************
4
    while executing
"source [lindex $::argv 1] "
    ("uplevel" body line 1)
    invoked from within
"uplevel \#0 { source [lindex $::argv 1] } "

I've definitely seen this one before, but I can't remember the previous causes.

In any case, what's special about this model is that there is no activation on the final layer, and it seems that compute_layer does not like it when the data_t and result_t typedefs both point to the same type. If I change result_t to ap_fixed<19,8>, it works. It also works if I add an activation after, which is why we haven't seen this before.

(and it must only matter for the last layer with output res??)

Use ARRAY_RESHAPE directives at the top level myproject.cpp?

Hey guys,

Just ran into something interesting about the top-level interfaces... By default it seems that the ARRAY_PARTITION directive instantiates a separate port for each partitioned element. So, if you have an array with N elements, you'll have N separate data ports, which would get rather large for the number of inputs you'll probably need.

I think if we replace the ARRAY_PARTITION directive with ARRAY_RESHAPE, we get a similar impact as partitioning the array-- and the array elements will also be concatenated into a single larger array element. Here's my updated directives:

    #pragma HLS ARRAY_RESHAPE variable=data complete dim=0
    #pragma HLS ARRAY_RESHAPE variable=res complete dim=0
    #pragma HLS INTERFACE ap_hs port=data,res
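For comparison, the directives being replaced would presumably be the ARRAY_PARTITION form on the same variables (inferred from the description above, not copied from the code):

    #pragma HLS ARRAY_PARTITION variable=data complete dim=0
    #pragma HLS ARRAY_PARTITION variable=res complete dim=0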

Here's the port output for Nhan's conv1d example: (notice data port is 768 bits wide, result port is 192 bits wide)

* Summary: 
+-----------------------+-----+-----+------------+----------------+--------------+
|       RTL Ports       | Dir | Bits|  Protocol  |  Source Object |    C Type    |
+-----------------------+-----+-----+------------+----------------+--------------+
|ap_clk                 |  in |    1| ap_ctrl_hs |    myproject   | return value |
|ap_rst                 |  in |    1| ap_ctrl_hs |    myproject   | return value |
|ap_start               |  in |    1| ap_ctrl_hs |    myproject   | return value |
|ap_done                | out |    1| ap_ctrl_hs |    myproject   | return value |
|ap_idle                | out |    1| ap_ctrl_hs |    myproject   | return value |
|ap_ready               | out |    1| ap_ctrl_hs |    myproject   | return value |
|data_V_ap_vld          |  in |    1|    ap_hs   |     data_V     |    pointer   |
|data_V                 |  in |  768|    ap_hs   |     data_V     |    pointer   |
|data_V_ap_ack          | out |    1|    ap_hs   |     data_V     |    pointer   |
|res_V_ap_ack           |  in |    1|    ap_hs   |      res_V     |    pointer   |
|res_V                  | out |  192|    ap_hs   |      res_V     |    pointer   |
|res_V_ap_vld           | out |    1|    ap_hs   |      res_V     |    pointer   |
|const_size_in          | out |   16|   ap_vld   |  const_size_in |    pointer   |
|const_size_in_ap_vld   | out |    1|   ap_vld   |  const_size_in |    pointer   |
|const_size_out         | out |   16|   ap_vld   | const_size_out |    pointer   |
|const_size_out_ap_vld  | out |    1|   ap_vld   | const_size_out |    pointer   |
+-----------------------+-----+-----+------------+----------------+--------------+

I'd be interested to see if this breaks any of the other examples, but I'd suggest it's probably a good idea to swap over for the top-level function (maybe for the subfunctions too?)

I threw a test together based on Nhan's conv1d branch here: https://github.com/hls-fpga-machine-learning/HLS4ML/tree/ejk/conv1d-array-reshape

Clarify which keras layers/activations we support in documentation

We should list which ones we currently support and which ones we plan to support in the documentation.

Below is a list of all Keras layers, not including abstract layers and aliases.

Activation
ActivityRegularization
Add
AlphaDropout
AtrousConv1D
AtrousConv2D
Average
AveragePooling1D
AveragePooling2D
AveragePooling3D
BatchNormalization
Bidirectional
Concatenate
Conv1D
Conv2D
Conv2DTranspose
Conv3D
Conv3DTranspose
ConvLSTM2D
ConvRecurrent2D
Cropping1D
Cropping2D
Cropping3D
CuDNNGRU
CuDNNLSTM
Dense
Dot
Dropout
ELU
Embedding
Flatten
GRU
GRUCell
GaussianDropout
GaussianNoise
GlobalAveragePooling1D
GlobalAveragePooling2D
GlobalAveragePooling3D
GlobalMaxPooling1D
GlobalMaxPooling2D
GlobalMaxPooling3D
Highway
Input
LSTM
LSTMCell
Lambda
LeakyReLU
LocallyConnected1D
LocallyConnected2D
Masking
MaxPooling1D
MaxPooling2D
MaxPooling3D
Maximum
MaxoutDense
Merge
Minimum
Multiply
PReLU
Permute
RepeatVector
Reshape
SeparableConv1D
SeparableConv2D
SimpleRNN
SimpleRNNCell
Softmax
SpatialDropout1D
SpatialDropout2D
SpatialDropout3D
StackedRNNCells
Subtract
ThresholdedReLU
TimeDistributed
UpSampling1D
UpSampling2D
UpSampling3D
ZeroPadding1D
ZeroPadding2D
ZeroPadding3D

Pruning conv1d

I have a branch going to allow the conv1d to take advantage of pruning in training. @jmduarte and I have been discussing this, and we think there isn't a simple formula for reducing the multiplier limit based on the number of weights equaling zero. It gets pretty complicated due to the presence of padding.

So the approach here is to just do the loops once in advance and count the number of multiplications by nonzero weights, then divide that by the reuse factor to get the multiplier limit.

Perhaps the simplest would be to do these first loops for counting in separate code (e.g. our python), but I think it would be nice for the nnet_utils to have this feature. So I've added the count here. Unfortunately it looks to me like I haven't managed to fully decouple it from the firmware part. I see a slightly different resource usage if I use this function versus just passing the limit from the outside world. If anyone sees the problem, let me know!
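A rough sketch of the counting idea (illustrative and simplified, not the branch's actual code):

    #include "ap_fixed.h"

    // Sketch only: count multiplications that involve a nonzero weight, then
    // divide by the reuse factor (rounding up) to get the multiplier limit.
    unsigned nonzero_mult_limit(const ap_fixed<18, 8> weights[], unsigned n_weights,
                                unsigned reuse_factor) {
        unsigned nonzero = 0;
        for (unsigned i = 0; i < n_weights; i++) {
            if (weights[i] != 0) nonzero++;
        }
        return (nonzero + reuse_factor - 1) / reuse_factor;
    }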

Some test results:
Model: KERAS_conv1d_small
Precision: ap_fixed<18,8>
HLS Version: v2017.2

Default weights:

Reuse   BRAM   DSP   FF      LUT     Lat   II
1       13     547   47149   28161   24    1
2       13     398   47188   30769   26    2
3       13     266   47504   33031   30    3

Random 20% weights set to zero:

Reuse   BRAM   DSP   FF      LUT     Lat   II
1       13     455   41276   23950   24    1
2       13     303   37926   24188   26    2
3       13     206   38074   26472   30    3

Random 50% weights set to zero:

Reuse   BRAM   DSP   FF      LUT     Lat   II
1       13     279   24769   15877   22    1
2       13     200   27442   16876   25    2
3       13     99    17592   11563   25    3

I'm setting the mult limit, so I think the test here is to see whether the non-DSP numbers look problematic. I don't see the FF or LUT numbers explode, which would suggest we were doing the multiplications in logic. Latency looks alright. II is as expected.

So if people like this approach, I can make a PR. It would be nice to first figure out why the multiplication count seems to consume some firmware resources, though.

softmax activation outputs value greater than 1

Hi,

First off, good job on the entire hls4ml package! I'm using it in a project of mine and it's great.

I believe there's an issue with the softmax activation.
The first row is the input, the second row is the output (screenshot attached in the original issue):

I expected the output to be less than 1 in all positions other than the second, which is 1 as expected.
Note that 1.33 isn't feasible for softmax by definition.

Reproduce:

  • All variables are ap_fixed<32,8>
  • The configuration struct is:
#define M_in 4
typedef ap_fixed<32,8> result_t;
...
struct dec_softmax_config2 : nnet::activ_config {
    static const unsigned n_in = M_in;
    static const unsigned table_size = 1024;
    static const unsigned io_type = nnet::io_parallel;
};
  • The call to softmax is
result_t logits2[M_in];
#pragma HLS ARRAY_PARTITION variable=logits2 complete dim=0
result_t logits3[M_in];
#pragma HLS ARRAY_PARTITION variable=logits3 complete dim=0
nnet::softmax<result_t, result_t, dec_softmax_config2>(logits2, logits3);

Maybe I'm using it incorrectly?
Let me know if you need more info.

Thanks.
