fastmachinelearning / hls4ml

Machine learning on FPGAs using HLS

Home Page: https://fastmachinelearning.org/hls4ml

License: Apache License 2.0

Python 42.10% C++ 54.07% C 0.10% Tcl 1.06% Shell 0.88% Makefile 0.02% SystemVerilog 1.16% Verilog 0.60%
fpga hls intel-hls keras machine-learning neural-network onnx python pytorch vivado vivado-hls

hls4ml's People

Contributors

adrianalan, benjaminkreis, bo3z, calad0i, d-gol, delonshen, dependabot[bot], drankincms, duchstf, ejk43, hamzajaved780, janfschulte, jmduarte, jmitrevs, jngadiub, jochist, joshlerner, keb-l, landay7, laurilaatu, maksgraczyk, ngpaladi, nhanvtran, nicologhielmetti, pitmonticone, pre-commit-ci[bot], thesps, vloncar, yiiyama, zhenbinwu


hls4ml's Issues

multi pumping

Starting a new issue so we can discuss multi pumping

Following up on @zhenbinwu's presentation today, I tried running our 1-hidden-layer example with the multipliers using LUT-based cores instead of DSP-based cores (i.e., #pragma HLS RESOURCE variable=my_var core=Mul_LUT).
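For reference, a minimal sketch of how such a pragma can be attached to the product of a multiplication (the function and variable names here are illustrative, not the hls4ml code itself):

    #include "ap_fixed.h"

    // Minimal sketch: force the multiplier core to LUTs instead of DSPs.
    // ap_fixed<36,16> is the full-width result of an ap_fixed<18,8> x ap_fixed<18,8> product.
    ap_fixed<36, 16> mult_lut(ap_fixed<18, 8> a, ap_fixed<18, 8> b) {
        #pragma HLS PIPELINE II=1
        ap_fixed<36, 16> prod = a * b;
        #pragma HLS RESOURCE variable=prod core=Mul_LUT
        return prod;
    }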

Using Vivado 2018.2 and default hls4ml options from the head, this is what I see:

                  Latency   DSP   FF     LUT
DSP multipliers   12        305   9798   11819
LUT multipliers   11        0     9127   137371

I think this is more like what we expected to see -- that the DSPs go to zero, and we use way more LUTs.

The number of LUTs per multiplication roughly makes sense, I think. If you take the excess LUTs and divide by the number of DSPs that were used, you get ~411 LUTs per multiplication. If you create an 18-bit x 18-bit LUT-based multiplier IP core in regular Vivado, you get 365 or 401 LUTs per multiplication depending on the optimization option (non-constant coefficient).

It will be interesting to find out how this multi pumping code uses so few!

multiple layers running in parallel on same input

@sergojin has a use-case for multiple dense layers running in parallel on the same input and producing multiple outputs.

Two example keras models are here:
https://github.com/hls-fpga-machine-learning/hls4ml/tree/multiple_layers/keras-to-hls/fromSergo

And a working HLS project made by hand from the ".5" model is here:
https://github.com/hls-fpga-machine-learning/hls4ml/tree/multiple_layers/keras-to-hls/my-hls-test-modified
The final layers run in parallel on the output of the previous layer, and their output is merged to form the result.

What is needed is the hls4ml translation part. Right now we assume the output of each layer is the input to only one layer, with the order taken from the order in the JSON file. @sergojin and @nhanvtran found that we can use the inbound_nodes field of the JSON to map the layers to each other.

Synthesis fails with io_serial

After changing IOType to io_serial in keras-config.yml, the resulting code does not synthesize. Only the 1-layer model with sigmoid activation succeeds, and that requires #81 to succeed.

Another problem is that models with Conv layers produce code with the wrong pragmas, e.g., #pragma HLS STREAM variable=logits1 complete depth=1. This fails with the error:

ERROR: [HLS 200-70] pragma 'STREAM variable=logits1 complete depth=1' has unknown option 'complete'
ERROR: [HLS 200-70] '#pragma HLS STREAM variable=logits1 complete depth=1' is not a valid pragma.

However, this is trivially solved by removing the offending keyword. I can make a PR with the change in hls-writer.py, but this does not solve all the problems of synthesis hanging.
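For reference, with the offending keyword dropped the generated pragma would read:

    #pragma HLS STREAM variable=logits1 depth=1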

The errors are different for different models: some take forever to finish (like 3layer), some fail with a memory-related error (sorry, I didn't keep the logs), and some (like the KERAS_conv1d_small model) fail with an error like:

WARNING: [XFORM 203-124] Array  'conv_layer2_in.V' (firmware/serial_test.cpp:74): The entries are not accessed in sequential order.
WARNING: [XFORM 203-124] Array  'conv_layer3_in.V' (firmware/serial_test.cpp:87): The entries are not accessed in sequential order.
WARNING: [XFORM 203-124] Array  'logits5.V' (firmware/serial_test.cpp:105): The entries are not accessed in sequential order.
ERROR: [XFORM 203-123] Cannot stream  'data.V' (firmware/serial_test.cpp:42): The entries are not accessed in sequential order.
ERROR: [HLS 200-70] Pre-synthesis failed.

store_in_bram functionality

So far we have 2 working modes -- one for LHC trigger (low reuse factor, 1-6, weights in the fabric) and one for "naive" serial mode (see PR #45).

One interesting mode is a very large reuse factor with the weights stored in BRAMs. This is particularly useful for really big networks and SDAccel-like use cases.

However, playing around with the current code, I found that it's not so trivial to store weights in BRAMs and run with #pragma HLS PIPELINE. HLS always wants to partition the weight array.

@ejk43 you have any ideas here? I thought we could remove the PIPELINE and go back to the days of DATAFLOW pragma...
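For concreteness, a minimal sketch of the combination being discussed (sizes and names are illustrative, not the hls4ml code): a weight array requested in BRAM via a RESOURCE pragma inside a function-level PIPELINE, which is exactly where HLS starts insisting on partitioning the array instead.

    #include "ap_fixed.h"

    // Sketch only: request BRAM storage for the weights. With the function-level
    // PIPELINE below (which unrolls the loops), HLS instead wants to completely
    // partition this array -- the conflict described in this issue.
    void dense_bram(ap_fixed<18, 8> data[16], ap_fixed<18, 8> res[16]) {
        #pragma HLS PIPELINE
        static ap_fixed<18, 8> weights[16 * 16];
        #pragma HLS RESOURCE variable=weights core=RAM_2P_BRAM

        for (int o = 0; o < 16; o++) {
            ap_fixed<18, 8> acc = 0;
            for (int i = 0; i < 16; i++) {
                acc += data[i] * weights[i * 16 + o];
            }
            res[o] = acc;
        }
    }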

Compiler error with large amount of weights

This issue was originally reported by Rishraj. He has a NN with layer 1 dimensions 784x512, layer 2 512x512, layer 3 512x512, and layer 4 512x10, but the code fails during CSIM compilation:

INFO: [SIM 211-2] *************** CSIM start ***************
INFO: [SIM 211-4] CSIM will launch GCC as the compiler.
Compiling ../../../../myproject_test.cpp in debug mode
Compiling ../../../../firmware/myproject.cpp in debug mode
gcc: internal compiler error: Segmentation fault (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See http://gcc.gnu.org/bugs.html for instructions.
make: *** [obj/myproject.o] Error 4
ERROR: [SIM 211-100] 'csim_design' failed: compilation error(s).
INFO: [SIM 211-3] *************** CSIM finish ***************

This was tracked down to the weight initialization. For our dense 200 network, we have only 164,200 weights and CSIM passes, while the above network has 930,796 weights.

From a series of tests, Vivado can compile 262,144 weights and 262,144 * 2 weights, but fails with 663,168 weights. The memory usage is ~10% on correlator2.fnal.gov. This seems to be another HLS compiler issue.

Training with reduced precision

It would be easy to reduce the precision of the inputs only and still represent them with high precision in Keras or TF. However, it would be interesting if we could train reduced precision weights, too.

@violatingcp pointed out some precision options
https://www.tensorflow.org/versions/r0.12/api_docs/python/framework/tensor_types
keras-team/keras#2019
but we haven't yet found a way to do this with arbitrary precision.

We may also find something interesting in the binarized network implementation: https://gitlab.com/kunglab/ddnn

Syn failed for 3Layer with sublayer

The current master branch of hls4ml can't synthesize the 3-layer model.

ERROR: [XFORM 203-103] Array 'mult.V' (/data/benwu/HLS4ML_2018/hls4ml/nnet_utils/nnet_layer.h:56): partitioned elements number (2048) has exeeded the threshold (1024), which may cause long run-time.
ERROR: [HLS 200-70] Pre-synthesis failed.
command 'ap_source' returned error code
    while executing
"source [lindex $::argv 1] "
    ("uplevel" body line 1)
    invoked from within
"uplevel \#0 { source [lindex $::argv 1] } "

INFO: [Common 17-206] Exiting vivado_hls at Fri Jul 13 12:00:25 2018...

The error can be traced to https://github.com/hls-fpga-machine-learning/hls4ml/pull/62/files#diff-b37c065f136460b015788b96b5c25102L52

Interface for data type selection/tuning?

How would a user specify data types for the generated network?

A few possible scenarios come to mind:

  • The converter uses a default width that is "okay" but could be improved (for example: 32 bits wide, which I believe is the current default)
  • Bit widths are specified in a configuration file (float, 32 bits, 18 bits, 4 bits, whatever)
  • Developer tunes bit widths manually after initially validating performance with larger types
  • The python converter evaluates the network and asserts the best selection of integer/fractional bit widths

Or some combination of the above. Any other ideas?
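Whatever interface we pick, in the generated C++ this ultimately boils down to a handful of ap_fixed typedefs, along the lines of the sketch below (names are purely illustrative, not the actual generated code):

    #include "ap_fixed.h"

    // Purely illustrative typedefs -- not the actual generated parameter names.
    typedef ap_fixed<32, 8> input_t;   // wide "safe" default for initial validation
    typedef ap_fixed<18, 8> layer1_t;  // per-layer width tuned after checking accuracy
    typedef ap_fixed<16, 6> result_t;  // narrower output precision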

Understanding HLS multipliers

@ejk43 I was hoping to do some simple tests with a single multiplication in HLS to get a more quantitative understanding of the latency, II, and DSP usage. We know the trends qualitatively, such as more DSPs and latency for higher precision, but I'd love to know the numbers.

HLS must have some freedom in determining these things. First of all, when you create a multiplier IP in normal Vivado, you have choices for the number of pipeline stages, a.k.a. latency, even for a fixed precision.

I also see this in some basic tests:

In a trivial HLS project with one multiplication of ap_fixed<18,8> numbers with PIPELINE=1, I got a multiplier with a latency of 1:

    test_mult_mul_mulbkb_U1 : component test_mult_mul_mulbkb
    generic map (
        ID => 1,
        NUM_STAGE => 1,
        din0_WIDTH => 18,
        din1_WIDTH => 18,
        dout_WIDTH => 28)
    port map (
        din0 => b_V,
        din1 => a_V,
        dout => p_Val2_s_fu_67_p2);

However, in our 1-hidden-layer example, also with ap_fixed<18,8>, I see that the number of stages is 3:

    myproject_mul_11nfYi_U7 : component myproject_mul_11nfYi
    generic map (
	ID => 1,
        NUM_STAGE => 3,
        din0_WIDTH => 11,
        din1_WIDTH => 18,
        dout_WIDTH => 28)
    port map (
        clk => ap_clk,
        reset => ap_rst,
        din0 => grp_fu_743_p0,
        din1 => grp_fu_743_p1,
        ce => grp_fu_743_ce,
        dout => grp_fu_743_p2);

So my question is, how does HLS make these choices?

Do you know if there are some basic rules for this, or does it really depend case-by-case on the whole project routing, in which case I have no hope of mapping this out with a single multiplier project?
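For reference, the trivial single-multiplication project mentioned above would look roughly like this (a sketch of the kind of test, not the exact code used):

    #include "ap_fixed.h"

    // Trivial test: one ap_fixed<18,8> multiplication with PIPELINE, to see what
    // latency (NUM_STAGE) HLS chooses for the generated multiplier core.
    ap_fixed<36, 16> test_mult(ap_fixed<18, 8> a, ap_fixed<18, 8> b) {
        #pragma HLS PIPELINE II=1
        return a * b;
    }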

RNN/LSTM in HLS Library

Fill out the plausible types of network layers:

  • Convolutional
  • Recurrent

What else am I missing?

Next version of CNN not hitting pipeline target

Hi folks,

I made some big changes to the CNN code so that it:

  1. Allows for multiple filters
  2. Sums over channels within a filter
  3. Configurable stride
  4. Configurable padding

The csim results match within a few percent for the one example I've tried, but unfortunately HLS can't hit the interval=1 target and gets 2 instead. I haven't yet been able to figure out exactly what's causing it, but perhaps some relevant output is this:
INFO: [SCHED 204-61] Pipelining function 'conv_1d.0.0.0.0'.
WARNING: [SCHED 204-69] Unable to schedule 'store' operation (/home/kreis/conv/HLS4ML/nnet_utils/nnet_conv.h:133) of variable 'acc[0][2].V', /home/kreis/conv/HLS4ML/nnet_utils/nnet_conv.h:122 on array 'res_0_V' due to limited memory ports.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 2, Depth: 6.
WARNING: [SCHED 204-21] Estimated clock period (4.89ns) exceeds the target (target clock period: 5ns, clock uncertainty: 0.625ns, effective delay budget: 4.38ns).
WARNING: [SCHED 204-21] The critical path consists of the following: 'mul' operation ('p_Val2_257_2', /home/kreis/conv/HLS4ML/nnet_utils/nnet_conv.h:98) (4.89 ns)

I tried paring down the computations (fewer multiplications, lower precision, changing multiplications to additions, less accumulation, etc.) and could only get to interval=1 in some pretty specific circumstances that I don't think provide any great insight.

I'm still investigating, but I thought I'd create this issue in case anyone already has some ideas.

The main branch I'm developing on is this one

And I have a second one where I tried separating the accumulator loop within the filter and over channels into two here

The example project is updated with a 1D CNN I trained in Keras.

Zero initializing the arrays

While fiddling with @jngadiub's binary network I have stumbled upon a trivial change which seems to reduce resource usage and latency significantly.

In compute_layer_nobias, the array acc is initialized to zero in a loop. Using the array initialization syntax {0} instead:

typename CONFIG_T::accum_t acc[CONFIG_T::n_out] = {0};

and commenting out/removing the ResetAccum loop reduces the resource usage and latency significantly. This is still valid C++ code: it passes csim with the same results, synthesizes, and even cosim passes. The question is why it uses significantly fewer resources.
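For reference, a simplified sketch of the loop-based reset being removed (not verbatim from nnet_layer.h, which may differ slightly):

    // Explicit reset loop (the ResetAccum loop referred to above)
    typename CONFIG_T::accum_t acc[CONFIG_T::n_out];
    ResetAccum: for (int iacc = 0; iacc < CONFIG_T::n_out; iacc++) {
        #pragma HLS UNROLL
        acc[iacc] = 0;
    }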

Here are the synthesis reports without and with array initialization (report screenshots attached in the original issue):

Loop initialization: [loop_init report]

Array initialization: [array_init report]

Looks too good to be true :-)

The same idea can be applied elsewhere, e.g., in the softmax activation. It doesn't change the latency there, but it reduces resource usage.

Memory problems when synthesizing Conv1D models

@ejk43 We're trying to synthesize some larger Conv1D models in Vivado HLS 2017.2 and we're wondering if the problems we're seeing are just due to the memory available on our computer.

When we synthesize a very small Conv1D model, it works. Here's a keras-config.yml

KerasJson: example-keras-model-files/KERAS_conv1d_small.json
KerasH5:   example-keras-model-files/KERAS_conv1d_small_weights.h5
OutputDir: my-hls-test
ProjectName: myproject
XilinxPart:  xcku115-flvf1924-2-i
ClockPeriod: 5

IOType: io_parallel # options: io_serial/io_parallel
ReuseFactor: 1
DefaultPrecision: ap_fixed<16,6>

and the commands to build it:

python keras-to-hls.py -c keras-config.yml
cd my-hls-test
vivado_hls -f build_prj.tcl

but when we run a larger model, e.g.

KerasJson: example-keras-model-files/KERAS_conv1d.json
KerasH5:   example-keras-model-files/KERAS_conv1d_weights.h5
OutputDir: my-hls-test

We get the following error during csynth:

ERROR: [XFORM 203-504] Stop unrolling loop 'ConvOut' (/home/jduarte1/hls-fpga-machine-learning/nnet_utils/nnet_conv.h:79) in function 'nnet::conv_1d<ap_fixed<16, 6, (ap_q_mode)5, (ap_o_mode)3, 0>, ap_fixed<16, 6, (ap_q_mode)5, (ap_o_mode)3, 0>, config2>' because it may cause large runtime and excessive memory usage due to increase in code size. Please avoid unrolling the loop or form sub-functions for code in the loop body.
ERROR: [HLS 200-70] Pre-synthesis failed.

Is this just a memory problem? Or can we solve this by changing our HLS code? What kind of computer are you using to synthesize and compile firmware?

Thanks!

Automated testing/continuous integration?

Just tossing an idea out here, but would there be any interest in setting up a small automated test framework to "protect" some of the intended behavior of the resource reuse API?

I'm starting to get afraid that something I edit may accidentally break existing features. Thoughts? Does anyone have experience with Jenkins or Travis CI? I believe they both provide free access for open source projects.

Network compression

After low weights are removed from a network, how do we implement the mechanism for skipping those in the HLS translation and RTL?

parallel mode and BRAMs

getting past "sublayer" with block partitioning and putting BRAMs in parallel mode

See discussion in issue #46

Implement Resource-Reuse API for Fully-Connected Layer

I'd like to take a stab at ironing out what I'll tentatively call the "resource reuse API"-- ie, how the user of the compute_layer function will manipulate the resource usage of the core...

To that end: Can we come up with a few use cases we'd like to capture in this API? (the scope of the resource-usage problem is large enough that I'd prefer to start by agreeing on a few common use cases and design the function to cover these scenarios)

A few suggestions:

  1. Totally unrolled, fully parallel layer
    • This is your general use case-- all features are consumed and operated on in parallel
    • Usage: Set fully_parallel option to true in the struct
  2. Partially unrolled with 1-4 cycles of Initiation Interval (II)
    • Also useful for your application. Notionally, for an II=4, this should cut the multipliers by 4x and consume 1/4 of the features per clock cycle.
    • Usage: Set fully_parallel to false. Set target_initiation_interval to 4? Or do we want to use the "roll_factor"?
  3. "Serial" operation with more unrolling
    • I'd classify this scenario as any situation where data is consumed serially (even if this is say 64-128 bits per clock cycle, which is plausibly common for lots of applications due to DMAs and other serial data transfer)
    • Usage: Set fully_parallel to false. Maybe specify this operation by how many features are consumed per input cycle?

Does this all make sense so far? What am I missing?
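To make the options above concrete, here is a rough sketch of what such a configuration struct could look like (field names mirror the proposals above; this is not an agreed API):

    // Rough sketch only -- names taken from the use cases above, not a final API.
    struct layer_config {
        static const bool     fully_parallel = false;          // case 1 vs. cases 2/3
        static const unsigned target_initiation_interval = 4;  // case 2: roll by II (or a "roll_factor")
        static const unsigned features_per_cycle = 64;         // case 3: serial consumption rate
    };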

Synthesis failed with large input database

I got the failure message below when trying to synthesize a project with ap_fixed<36,4> and a reuse factor of 3. The only difference from the previous trial is that I included a large input text file as the input events for the test bench. I am guessing the large data file is causing this.

INFO: [RTGEN 206-100] Generating core module 'myproject_mul_36sudo': 1 instance(s).
INFO: [RTGEN 206-100] Generating core module 'myproject_mul_36svdy': 3 instance(s).
INFO: [RTGEN 206-100] Generating core module 'myproject_mul_36swdI': 1 instance(s).
INFO: [RTGEN 206-100] Finished creating RTL model for 'compute_layer_0_0_0_s'.
INFO: [HLS 200-111] Elapsed time: 2.74 seconds; current allocated memory: 554.319 MB.

ERROR: unknown exception in database saving.
Synthesis failed.
while executing
"source [lindex $::argv 1] "
("uplevel" body line 1)
invoked from within
"uplevel #0 { source [lindex $::argv 1] } "

INFO: [HLS 200-112] Total elapsed time: 83 seconds; peak allocated memory: 554.319 MB.

Making pragmas configurable

In one project, we made a configurable pipeline that you can set in the tcl script:
https://github.com/p2l1pfp/GlobalCorrelator_HLS/blob/dev/run_hls_fullpfalgo_mp7.tcl#L7
https://github.com/p2l1pfp/GlobalCorrelator_HLS/blob/3edff5f79aa840b8a6ddb7f7eedea14f6b894197/firmware/simple_fullpfalgo.cpp#L430
but this only lets you set it once. This won't work for us if we want to use the same compute_layer function multiple times with different settings (e.g., different unroll factors) each time.

Instead, @nhanvtran had the idea of using preprocessor directives, as so:

#define my_unroll 1
compute layer...
#undef my_unroll

#define my_unroll 2
compute layer...
#undef my_unroll

I also tried passing a C++ object to a pragma, and surprisingly (to me anyway), it seemed to work. I just did

int test = 1;
#pragma HLS pipeline II=test

and got the same results when setting test to a value as when setting the pipeline directly.

Assuming giving a C++ object to a pragma really works, we could add arguments to the nnet_utils functions. If not, we could go with the preprocessor directives.

Last thing to note is we will need a place for the user to define what they want!

Softmax layer latency

Using the branch nt/resource-reuse-api I checked what the latency and resource usage are for the 3-layer model with two ReuseFactor test cases (below). In both cases the softmax layer takes 34 clocks, and I was going to check the code to see if this is expected.

ReuseFactor: 1

+ Latency (clock cycles): 
    * Summary: 
    +-----+-----+-----+-----+----------+
    |  Latency  |  Interval | Pipeline |
    | min | max | min | max |   Type   |
    +-----+-----+-----+-----+----------+
    |   59|   59|    1|    1| dataflow |
    +-----+-----+-----+-----+----------+

    + Detail: 
        * Instance: 
        +----------------------------------------+-----------------------+-----+-----+-----+-----+----------+
        |                                        |                       |  Latency  |  Interval | Pipeline |
        |                Instance                |         Module        | min | max | min | max |   Type   |
        +----------------------------------------+-----------------------+-----+-----+-----+-----+----------+
        |grp_compute_layer_0_0_0_2_fu_440        |compute_layer_0_0_0_2  |    5|    5|    1|    1| function |
        |grp_compute_layer_0_0_0_1_fu_508        |compute_layer_0_0_0_1  |    4|    4|    1|    1| function |
        |grp_compute_layer_0_0_0_3_fu_539        |compute_layer_0_0_0_3  |    4|    4|    1|    1| function |
        |grp_softmax_fu_575                      |softmax                |   34|   34|    1|    1| function |
        |grp_compute_layer_0_0_0_s_fu_587        |compute_layer_0_0_0_s  |    3|    3|    1|    1| function |
        |call_ret2_relu_2_fu_623                 |relu_2                 |    0|    0|    1|    1| function |
        |call_ret4_relu_1_fu_691                 |relu_1                 |    0|    0|    1|    1| function |
        |call_ret_relu_fu_727                    |relu                   |    0|    0|    1|    1| function |
        |StgValue_114_myproject_entry3_fu_763    |myproject_entry3       |    0|    0|    0|    0|   none   |
        |StgValue_115_myproject_entry490_fu_848  |myproject_entry490     |    0|    0|    0|    0|   none   |
        |StgValue_572_Block_proc_fu_906          |Block_proc             |    0|    0|    0|    0|   none   |
        +----------------------------------------+-----------------------+-----+-----+-----+-----+----------+

ReuseFactor: 4

+ Latency (clock cycles): 
    * Summary: 
    +-----+-----+-----+-----+----------+
    |  Latency  |  Interval | Pipeline |
    | min | max | min | max |   Type   |
    +-----+-----+-----+-----+----------+
    |   69|   69|    4|    4| dataflow |
    +-----+-----+-----+-----+----------+

    + Detail: 
        * Instance: 
        +----------------------------------------+-----------------------+-----+-----+-----+-----+----------+
        |                                        |                       |  Latency  |  Interval | Pipeline |
        |                Instance                |         Module        | min | max | min | max |   Type   |
        +----------------------------------------+-----------------------+-----+-----+-----+-----+----------+
        |grp_compute_layer_0_0_0_2_fu_440        |compute_layer_0_0_0_2  |    6|    6|    3|    3| function |
        |grp_compute_layer_0_0_0_3_fu_508        |compute_layer_0_0_0_3  |    6|    6|    3|    3| function |
        |grp_compute_layer_0_0_0_s_fu_539        |compute_layer_0_0_0_s  |    7|    7|    4|    4| function |
        |grp_softmax_fu_575                      |softmax                |   34|   34|    1|    1| function |
        |grp_compute_layer_0_0_0_1_fu_587        |compute_layer_0_0_0_1  |    7|    7|    4|    4| function |
        |call_ret2_relu_fu_623                   |relu                   |    0|    0|    1|    1| function |
        |call_ret4_relu_2_fu_691                 |relu_2                 |    0|    0|    1|    1| function |
        |call_ret_relu_1_fu_727                  |relu_1                 |    0|    0|    1|    1| function |
        |StgValue_125_myproject_entry3_fu_763    |myproject_entry3       |    0|    0|    0|    0|   none   |
        |StgValue_126_myproject_entry505_fu_848  |myproject_entry505     |    0|    0|    0|    0|   none   |
        |StgValue_593_Block_proc_fu_906          |Block_proc             |    0|    0|    0|    0|   none   |
        +----------------------------------------+-----------------------+-----+-----+-----+-----+----------+

compression in serial mode

How do we handle sparse matrices in serial mode?

Also, @ejk43 had an idea to compress weights to powers of 2 as another way to save resources.
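As a minimal illustration of the power-of-2 idea (a sketch, not anything in the repo): if a weight is constrained to be ±2^k, the multiplication reduces to a shift.

    #include "ap_int.h"

    // Sketch: multiply by a power-of-two weight using a shift instead of a DSP.
    // "k" is the stored exponent and "neg" the sign; names are illustrative.
    ap_int<32> po2_mult(ap_int<16> x, ap_uint<4> k, bool neg) {
        ap_int<32> shifted = ap_int<32>(x) << k;
        return neg ? ap_int<32>(-shifted) : shifted;
    }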

Compilation failure when result_t and data_t of compute_layer point to same data type

In this branch I've added a model from Sergo that's failing to compile. If you do the normal python keras-to-hls.py -c keras-config.yml you'll pick up the model and see the error when you try to build the project.

The error is

INFO: [SIM 211-2] *************** CSIM start ***************
INFO: [SIM 211-4] CSIM will launch GCC as the compiler.
   Compiling ../../../../myproject_test.cpp in debug mode
   Compiling ../../../../firmware/myproject.cpp in debug mode
In file included from ../../../../firmware/parameters.h:7:0,
                 from ../../../../firmware/myproject.cpp:21:
/home/kreis/muon/hls4ml/nnet_utils/nnet_layer.h: In function ‘void nnet::compute_layer(data_T*, res_T*, typename CONFIG_T::weight_t (*)[CONFIG_T:: n_out], typename CONFIG_T::bias_t*) [with data_T = ap_fixed<18, 8>, res_T = ap_fixed<18, 8>, CONFIG_T = config4, typename CONFIG_T::weight_t = ap_fixed<18, 8>, typename CONFIG_T::bias_t = ap_fixed<18, 8>]’:
../../../../firmware/myproject.cpp:82:77:   instantiated from here
/home/kreis/muon/hls4ml/nnet_utils/nnet_layer.h:100:13: error: invalid use of incomplete type ‘class ap_fixed<18, 8>’
/data/xilinx/Vivado_HLS/2017.2/include/ap_int.h:318:7: error: declaration of ‘class ap_fixed<18, 8>’
/home/kreis/muon/hls4ml/nnet_utils/nnet_layer.h:100: confused by earlier errors, bailing out
make: *** [obj/myproject.o] Error 1
ERROR: [SIM 211-100] 'csim_design' failed: compilation error(s).
INFO: [SIM 211-3] *************** CSIM finish ***************
4
    while executing
"source [lindex $::argv 1] "
    ("uplevel" body line 1)
    invoked from within
"uplevel \#0 { source [lindex $::argv 1] } "

I've definitely seen this one before, but I can't remember the previous causes.

In any case, what's special about this model is that there is no activation on the final layer, and it seems that compute_layer does not like it when the data_t and result_t typedefs both point to the same type. If I change result_t to ap_fixed<19,8>, it works. It also works if I add an activation after, which is why we haven't seen this before.

(and it must only matter for the last layer with output res??)

Use ARRAY_RESHAPE directives at the top level myproject.cpp?

Hey guys,

Just ran into something interesting about the top-level interfaces... By default it seems that the ARRAY_PARTITION directive instantiates a separate port for each partitioned element. So, if you have an array with N elements, you'll have N separate data ports, which would get rather large for the number of inputs you'll probably need.

I think if we replace the ARRAY_PARTITION directive with ARRAY_RESHAPE, we get a similar impact as partitioning the array-- and the array elements will also be concatenated into a single larger array element. Here's my updated directives:

    #pragma HLS ARRAY_RESHAPE variable=data complete dim=0
    #pragma HLS ARRAY_RESHAPE variable=res complete dim=0
    #pragma HLS INTERFACE ap_hs port=data,res
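For comparison, the directives being replaced would presumably be the ARRAY_PARTITION form on the same variables (inferred from the description above, not copied from the code):

    #pragma HLS ARRAY_PARTITION variable=data complete dim=0
    #pragma HLS ARRAY_PARTITION variable=res complete dim=0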

Here's the port output for Nhan's conv1d example: (notice data port is 768 bits wide, result port is 192 bits wide)

* Summary: 
+-----------------------+-----+-----+------------+----------------+--------------+
|       RTL Ports       | Dir | Bits|  Protocol  |  Source Object |    C Type    |
+-----------------------+-----+-----+------------+----------------+--------------+
|ap_clk                 |  in |    1| ap_ctrl_hs |    myproject   | return value |
|ap_rst                 |  in |    1| ap_ctrl_hs |    myproject   | return value |
|ap_start               |  in |    1| ap_ctrl_hs |    myproject   | return value |
|ap_done                | out |    1| ap_ctrl_hs |    myproject   | return value |
|ap_idle                | out |    1| ap_ctrl_hs |    myproject   | return value |
|ap_ready               | out |    1| ap_ctrl_hs |    myproject   | return value |
|data_V_ap_vld          |  in |    1|    ap_hs   |     data_V     |    pointer   |
|data_V                 |  in |  768|    ap_hs   |     data_V     |    pointer   |
|data_V_ap_ack          | out |    1|    ap_hs   |     data_V     |    pointer   |
|res_V_ap_ack           |  in |    1|    ap_hs   |      res_V     |    pointer   |
|res_V                  | out |  192|    ap_hs   |      res_V     |    pointer   |
|res_V_ap_vld           | out |    1|    ap_hs   |      res_V     |    pointer   |
|const_size_in          | out |   16|   ap_vld   |  const_size_in |    pointer   |
|const_size_in_ap_vld   | out |    1|   ap_vld   |  const_size_in |    pointer   |
|const_size_out         | out |   16|   ap_vld   | const_size_out |    pointer   |
|const_size_out_ap_vld  | out |    1|   ap_vld   | const_size_out |    pointer   |
+-----------------------+-----+-----+------------+----------------+--------------+

I'd be interested to see if this breaks any of the other examples, but I'd suggest it's probably a good idea to swap over for the top-level function (maybe for the subfunctions too?)

I threw a test together based on Nhan's conv1d branch here: https://github.com/hls-fpga-machine-learning/HLS4ML/tree/ejk/conv1d-array-reshape

Clarify which keras layers/activations we support in documentation

We should list which ones we currently support and which ones we plan to support in the documentation.

Below is a list of all Keras layers, not including abstract layers and aliases.

Activation
ActivityRegularization
Add
AlphaDropout
AtrousConv1D
AtrousConv2D
Average
AveragePooling1D
AveragePooling2D
AveragePooling3D
BatchNormalization
Bidirectional
Concatenate
Conv1D
Conv2D
Conv2DTranspose
Conv3D
Conv3DTranspose
ConvLSTM2D
ConvRecurrent2D
Cropping1D
Cropping2D
Cropping3D
CuDNNGRU
CuDNNLSTM
Dense
Dot
Dropout
ELU
Embedding
Flatten
GRU
GRUCell
GaussianDropout
GaussianNoise
GlobalAveragePooling1D
GlobalAveragePooling2D
GlobalAveragePooling3D
GlobalMaxPooling1D
GlobalMaxPooling2D
GlobalMaxPooling3D
Highway
Input
LSTM
LSTMCell
Lambda
LeakyReLU
LocallyConnected1D
LocallyConnected2D
Masking
MaxPooling1D
MaxPooling2D
MaxPooling3D
Maximum
MaxoutDense
Merge
Minimum
Multiply
PReLU
Permute
RepeatVector
Reshape
SeparableConv1D
SeparableConv2D
SimpleRNN
SimpleRNNCell
Softmax
SpatialDropout1D
SpatialDropout2D
SpatialDropout3D
StackedRNNCells
Subtract
ThresholdedReLU
TimeDistributed
UpSampling1D
UpSampling2D
UpSampling3D
ZeroPadding1D
ZeroPadding2D
ZeroPadding3D

Pruning conv1d

I have a branch going to allow the conv1d to take advantage of pruning in training. @jmduarte and I have been discussing this, and we think there isn't a simple formula for reducing the multiplier limit based on the number of weights equaling zero. It gets pretty complicated due to the presence of padding.

So the approach here is to just do the loops once in advance and count the number of multiplications by nonzero weights, then divide that by the reuse factor to get the multiplier limit.

Perhaps the simplest would be to do these first loops for counting in separate code (e.g. our python), but I think it would be nice for the nnet_utils to have this feature. So I've added the count here. Unfortunately it looks to me like I haven't managed to fully decouple it from the firmware part. I see a slightly different resource usage if I use this function versus just passing the limit from the outside world. If anyone sees the problem, let me know!
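A rough sketch of the counting idea (illustrative and simplified, not the branch's actual code):

    #include "ap_fixed.h"

    // Sketch only: count multiplications that involve a nonzero weight, then
    // divide by the reuse factor (rounding up) to get the multiplier limit.
    unsigned nonzero_mult_limit(const ap_fixed<18, 8> weights[], unsigned n_weights,
                                unsigned reuse_factor) {
        unsigned nonzero = 0;
        for (unsigned i = 0; i < n_weights; i++) {
            if (weights[i] != 0) nonzero++;
        }
        return (nonzero + reuse_factor - 1) / reuse_factor;
    }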

Some test results:
Model: KERAS_conv1d_small
Precision: ap_fixed<18,8>
HLS Version: v2017.2

Default weights:

Reuse   BRAM   DSP   FF      LUT     Lat   II
1       13     547   47149   28161   24    1
2       13     398   47188   30769   26    2
3       13     266   47504   33031   30    3

Random 20% weights set to zero:

Reuse   BRAM   DSP   FF      LUT     Lat   II
1       13     455   41276   23950   24    1
2       13     303   37926   24188   26    2
3       13     206   38074   26472   30    3

Random 50% weights set to zero:

Reuse   BRAM   DSP   FF      LUT     Lat   II
1       13     279   24769   15877   22    1
2       13     200   27442   16876   25    2
3       13     99    17592   11563   25    3

I'm setting the mult limit, so I think the test here is to see whether the non-DSP numbers look problematic. I don't see the FF or LUT numbers explode, which would suggest we were doing the multiplications in logic. Latency looks alright. II is as expected.

So if people like this approach, I can make a PR. It would be nice to first figure out why the multiplication count seems to consume some firmware resources, though.

softmax activation outputs value greater than 1

Hi,

First off, good job on the entire hls4ml package! I'm using it in a project of mine and it's great.

I believe there's an issue with the softmax activation.
The first row is the input, the second row is the output (screenshot attached in the original issue):

I expected the output to be less than 1 in all positions other than the second, which is 1 as expected.
Note that 1.33 isn't feasible for softmax by definition.

Reproduce:

  • All variables are ap_fixed<32,8>
  • The configuration struct is:
#define M_in 4
typedef ap_fixed<32,8> result_t;
...
struct dec_softmax_config2 : nnet::activ_config {
    static const unsigned n_in = M_in;
    static const unsigned table_size = 1024;
    static const unsigned io_type = nnet::io_parallel;
};
  • The call to softmax is
result_t logits2[M_in];
#pragma HLS ARRAY_PARTITION variable=logits2 complete dim=0
result_t logits3[M_in];
#pragma HLS ARRAY_PARTITION variable=logits3 complete dim=0
nnet::softmax<result_t, result_t, dec_softmax_config2>(logits2, logits3);

Maybe I'm using it incorrectly?
Let me know if you need more info.

Thanks.
