This is the simulation environment for the following paper,
O. Leitersdorf, D. Leitersdorf, J. Gal, M. Dahan, R. Ronen, and S. Kvatinsky, “AritPIM: High-Throughput In-Memory Arithmetic,” 2022.
The goal of the simulator is to verify the correctness of the algorithms proposed in the paper, to measure the performance of the algorithms (latency, area, energy), and to serve as an open-source library that may be utilized in other works.
The simulator logically models a row of a PIM array, and provides an interface for the proposed algorithm to perform basic logic gates on the row. The correctness of the operations is verified externally, and the performance is measured.
The simulation environment is implemented using `numpy` to enable fast execution of many samples in parallel. Therefore, the project requires the following libraries:
- python3
- numpy
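Assuming a standard `pip`-based environment, a minimal setup might look like:

```shell
# Install the only external dependency (assumes Python 3 is already available)
pip install numpy
```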
The repository is organized into the following directories:
- `util`: Contains general-purpose utility functions, e.g., for converting between binary and floating-point numbers.
- `serial`: Contains the simulator, AritPIM implementation, and tester for bit-serial algorithms.
- `parallel`: Contains the simulator, AritPIM implementation, and tester for bit-parallel algorithms.
- `output`: Contains the final outputs of the AritPIM functions for 32-bit numbers.
- `results`: Contains the final performance of the AritPIM functions for a wide variety of parameters.
- `gpu`: Contains the code used for the GPU benchmark evaluation.
We detail here the structure of the correctness simulations. The project is divided into separate simulations for the bit-serial and bit-parallel algorithms. Each includes a class representing the simulation interface, a class representing the AritPIM algorithms, and a tester class. The tester class consists of tests for each of the proposed algorithms, with each test performing the following:
- Instantiating a simulator for that test. The simulator models 1024 crossbars with 1024 columns and 1024 rows each, stored as a single 1024 x (1024 * 1024) matrix (as the operations are identical regardless, this does not affect correctness). Effectively, the simulator models 1024 * 1024 executions of the arithmetic function in parallel.
- Generating random input samples for each row, and writing them to the simulator.
- Executing the AritPIM algorithm on the given simulator. The algorithm is only allowed to request that the simulator perform logic operations (e.g., the algorithm cannot read/write from/to the memory).
- Reading the output generated by the algorithm for each row.
- Asserting that this output is identical to the expected result. For floating-point numbers, this is the result generated by `numpy` (which adheres to the IEEE 754 standard); note that we require an exact match and not an approximate match.
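As a rough illustration of this flow (the class and variable names here are hypothetical, not the repository's actual API), the per-test steps can be sketched in plain `numpy`, with the arithmetic stand-in being a simple ripple-carry adder applied to all rows in parallel:

```python
import numpy as np

# Hypothetical sketch of the test flow: a single boolean matrix standing in
# for the crossbars, random inputs written per row, and an exact comparison
# against numpy. Sizes are tiny stand-ins for the 1024 x (1024 * 1024) matrix.
rows, cols, n = 8, 64, 16
memory = np.zeros((rows, cols), dtype=bool)

# 1. Generate random n-bit inputs and write them to the simulated row bits.
rng = np.random.default_rng(0)
a = rng.integers(0, 2 ** n, size=rows)
b = rng.integers(0, 2 ** n, size=rows)
for bit in range(n):
    memory[:, bit] = (a >> bit) & 1
    memory[:, n + bit] = (b >> bit) & 1

# 2. Execute the arithmetic function on the stored bits (here a direct model
#    of a ripple-carry adder, acting on all rows simultaneously).
carry = np.zeros(rows, dtype=bool)
for bit in range(n):
    x, y = memory[:, bit], memory[:, n + bit]
    memory[:, 2 * n + bit] = x ^ y ^ carry
    carry = (x & y) | (carry & (x ^ y))

# 3. Read the output bits back and assert an exact match with numpy's result.
out = np.zeros(rows, dtype=np.int64)
for bit in range(n):
    out |= memory[:, 2 * n + bit].astype(np.int64) << bit
assert np.array_equal(out, (a + b) % (2 ** n))
```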
Notice that the simulator models NOT/NOR logic functionality for the crossbars, such that:
- Writing to the crossbars using initialization operations (INIT0 or INIT1) is only supported for a single index at a time. While some previous works perform such initializations in bulk [1], the periphery and control for supporting arbitrary initialization was never studied [2], and thus we assume the conservative approach. We refer to this conservative approach as 1-INIT.
- Performing a logic gate without initializing the output first results in the output cell storing the AND of the previous value stored in the cell with the result of the logic gate. We refer to such operations as X-MAGIC [3].
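These two conventions can be modeled directly. In the sketch below (hypothetical helper names, not the repository's API), an initialization writes exactly one cell, and an un-initialized NOR follows the X-MAGIC rule: the output cell retains the AND of its previous value with the gate result.

```python
import numpy as np

# Minimal model of a memory row under the NOT/NOR, 1-INIT, X-MAGIC semantics.
row = np.zeros(8, dtype=bool)

def init(row, j, value):
    """1-INIT: an initialization operation writes exactly one cell."""
    row[j] = value

def nor(row, in1, in2, out):
    """X-MAGIC NOR: the output cell stores the AND of its previous value
    with the NOR of the inputs (no implicit initialization)."""
    row[out] &= ~(row[in1] | row[in2])

# Example: computing NOT(a) as NOR(a, a) into a cell initialized to 1.
init(row, 0, True)    # a = 1
init(row, 2, True)    # the output cell must first be initialized to 1
nor(row, 0, 0, 2)     # row[2] = 1 AND NOR(1, 1) = 0
assert row[2] == False

# Without re-initialization, a second gate ANDs into the stale output value.
init(row, 1, False)   # b = 0
nor(row, 1, 1, 2)     # NOR(0, 0) = 1, but row[2] is 0, so 0 AND 1 = 0
assert row[2] == False
```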
We compare here the AritPIM algorithms to previous works and to a GPU. The performance of the proposed algorithms is derived from the correctness simulations through performance counters in the simulator class (e.g., counting the number of logic instructions received from the algorithm and the number of cells accessed). For this comparison, we assume the NOT/NOR family of logic gates as this provides a fair comparison to the previous works (which also included NOT/NOR implementations). The tables below include a throughput (TOPS) and energy-efficiency (TOPS/W) comparison for all the proposed algorithms on 32-bit numbers.
For the comparison to previous PIM works, we compare the cycle counts, energy (gate count), and area (cell count). We modify the results of the previous works to correspond to the conservative 1-INIT assumption for a fair comparison. We find:
Approach | Function | AritPIM Latency (Cycles) | AritPIM Energy (Gates) | AritPIM Area (Cells) | Previous Work | Previous Latency (Cycles) | Previous Energy (Gates) | Previous Area (Cells)
---|---|---|---|---|---|---|---|---
Bit-Serial | Fixed Add | 577 | 577 | 101 | Talati et al. [4] | 577 | 577 | 101
Bit-Serial | Fixed Subtract | 641 | 641 | 102 | Talati et al. [4] | 641 | 641 | 102
Bit-Serial | Fixed Multiply | 18123 | 18123 | 187 | Haj-Ali et al. [5] | 20095 | 20095 | 256
Bit-Serial | Fixed Divide | 28423 | 28423 | 170 | Li et al. [6] | 145472 | 145472 | 608
Bit-Serial | Floating Add (unsigned) | 2306 | 2306 | 135 | N/A | | |
Bit-Serial | Floating Add (signed) | 3997 | 3997 | 142 | N/A | | |
Bit-Serial | Floating Multiply | 11586 | 11586 | 172 | N/A | | |
Bit-Serial | Floating Divide | 19909 | 19909 | 139 | N/A | | |
Bit-Parallel | Fixed Add | 95 | 1359 | 256 | N/A | | |
Bit-Parallel | Fixed Subtract | 98 | 1424 | 288 | N/A | | |
Bit-Parallel | Fixed Multiply | 1251 | 25039 | 352 | MultPIM [7] | 1604 | 38016 | 352
Bit-Parallel | Fixed Divide | 4291 | 62338 | 448 | N/A | | |
Bit-Parallel | Floating Add (unsigned) | 817 | 5822 | 403 | N/A | | |
Bit-Parallel | Floating Add (signed) | 1359 | 10186 | 480 | N/A | | |
Bit-Parallel | Floating Multiply | 1407 | 16887 | 448 | N/A | | |
Bit-Parallel | Floating Divide | 3963 | 44530 | 544 | N/A | | |
Notice that we do not compare to FloatPIM [8] as FloatPIM requires additional periphery that does not adhere to the abstract PIM model; see the paper for more details. Also, note that addition requires 18N + 1 cycles as we assume 1-INIT with a 9-NOR implementation of a full adder.
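This cycle count follows from the 1-INIT assumption: each of the N bit positions costs 9 NOR gate cycles plus 9 single-cell initialization cycles for the gate outputs, and one additional initialization accounts for the +1 (a plausible reading; see the paper for the exact schedule). A quick check against the bit-serial Fixed Add row above:

```python
# Bit-serial fixed-point addition latency under 1-INIT with a 9-NOR full
# adder: 9 gates + 9 initializations per bit, plus one extra initialization.
def serial_add_cycles(n):
    return 18 * n + 1

assert serial_add_cycles(32) == 577  # matches the bit-serial Fixed Add row
```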
For the comparison to GPU, we consider a benchmark of performing vectored operations of dimension n = 64M, where the inputs start in the memory and the outputs are stored in the memory. To extrapolate AritPIM performance for this benchmark, we consider parameters from the RACER [2] architecture. To evaluate the GPU performance, we use the PyTorch library to initialize such vectors in the GPU memory and perform the vectored operations, while also measuring performance; the code is available in the `gpu` folder. Overall, we find:
Approach | Function | AritPIM TOPS | AritPIM TOPS/W | GPU Latency (sec) | GPU Power (W) | GPU TOPS | GPU TOPS/W | TOPS Improvement | TOPS/W Improvement
---|---|---|---|---|---|---|---|---|---
Bit-Serial | Fixed Add | 34.89 | 0.271 | 0.001974 | 149 | 0.034 | 0.00023 | 1026x | 1187x
Bit-Serial | Fixed Subtract | 31.41 | 0.244 | 0.001974 | 148 | 0.034 | 0.00023 | 924x | 1061x
Bit-Serial | Fixed Multiply | 1.11 | 0.009 | 0.001974 | 151 | 0.034 | 0.00023 | 33x | 38x
Bit-Serial | Fixed Divide | 0.71 | 0.005 | 0.001975 | 159 | 0.034 | 0.00021 | 21x | 26x
Bit-Serial | Floating Add (unsigned) | 8.73 | 0.068 | 0.001973 | 150 | 0.034 | 0.00023 | 257x | 299x
Bit-Serial | Floating Add (signed) | 5.04 | 0.039 | 0.001973 | 151 | 0.034 | 0.00023 | 148x | 174x
Bit-Serial | Floating Multiply | 1.74 | 0.013 | 0.001973 | 149 | 0.034 | 0.00023 | 51x | 59x
Bit-Serial | Floating Divide | 1.01 | 0.008 | 0.001978 | 201 | 0.034 | 0.00017 | 30x | 46x
Bit-Parallel | Fixed Add | 211.92 | 0.115 | 0.001974 | 149 | 0.034 | 0.00023 | 6234x | 504x
Bit-Parallel | Fixed Subtract | 205.44 | 0.110 | 0.001974 | 148 | 0.034 | 0.00023 | 6043x | 478x
Bit-Parallel | Fixed Multiply | 16.09 | 0.006 | 0.001974 | 151 | 0.034 | 0.00023 | 473x | 28x
Bit-Parallel | Fixed Divide | 4.69 | 0.003 | 0.001975 | 159 | 0.034 | 0.00021 | 138x | 12x
Bit-Parallel | Floating Add (unsigned) | 24.64 | 0.027 | 0.001973 | 150 | 0.034 | 0.00023 | 724x | 118x
Bit-Parallel | Floating Add (signed) | 14.81 | 0.015 | 0.001973 | 151 | 0.034 | 0.00023 | 436x | 68x
Bit-Parallel | Floating Multiply | 14.31 | 0.009 | 0.001973 | 149 | 0.034 | 0.00023 | 421x | 41x
Bit-Parallel | Floating Divide | 5.08 | 0.004 | 0.001978 | 201 | 0.034 | 0.00017 | 150x | 21x
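The GPU TOPS and TOPS/W columns follow directly from the measured latency and power: with n = 64M element-wise operations per run, TOPS = n / (latency * 10^12) and TOPS/W = TOPS / power. A quick sanity check against the Fixed Add row (using the values from the table above):

```python
# Recomputing the GPU throughput columns from the measured latency and power
# (n = 64M = 64 * 2^20 element-wise operations per benchmark run).
n = 64 * 2 ** 20
latency, power = 0.001974, 149  # GPU Fixed Add row above

tops = n / latency / 1e12
tops_per_watt = tops / power

assert round(tops, 3) == 0.034             # matches the "GPU TOPS" column
assert round(tops_per_watt, 5) == 0.00023  # matches the "GPU TOPS/W" column
```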
We seek to promote the adoption of the AritPIM algorithms in other PIM works. Therefore, we suggest three methods for utilizing the AritPIM algorithms:
- Adopt the code for an AritPIM algorithm in the simulator of the larger application. This enables executing the AritPIM algorithms for arbitrary parameters, and with relatively simple code. See the "system tests" in the testers for examples.
- Use the final compiled gate lists from the AritPIM algorithms, provided in the `output` folder. Such outputs are specific to certain parameters; the published versions correspond to 32-bit numbers, yet any version can be produced by changing the parameters in the tester classes.
- Use the final results for the latency, energy, and area of the AritPIM algorithms that are provided in the `results` folder.
[1] R. Ben-Hur et al., "SIMPLER MAGIC: Synthesis and Mapping of In-Memory Logic Executed in a Single Row to Improve Throughput," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 39, no. 10, pp. 2434-2447, Oct. 2020.
[2] M. S. Q. Truong et al., "RACER: Bit-Pipelined Processing Using Resistive Memory," 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2021, pp. 100-116.
[3] N. Peled, R. Ben-Hur, R. Ronen and S. Kvatinsky, "X-MAGIC: Enhancing PIM Using Input Overwriting Capabilities," 2020 IFIP/IEEE 28th International Conference on Very Large Scale Integration (VLSI-SOC), 2020, pp. 64-69.
[4] N. Talati, S. Gupta, P. Mane and S. Kvatinsky, "Logic Design Within Memristive Memories Using Memristor-Aided loGIC (MAGIC)," in IEEE Transactions on Nanotechnology, vol. 15, no. 4, pp. 635-650, July 2016.
[5] A. Haj-Ali, R. Ben-Hur, N. Wald and S. Kvatinsky, "Efficient Algorithms for In-Memory Fixed Point Multiplication Using MAGIC," 2018 IEEE International Symposium on Circuits and Systems (ISCAS), 2018, pp. 1-5.
[6] R. Li, S. Song, Q. Wu and L. K. John, "Accelerating Force-directed Graph Layout with Processing-in-Memory Architecture," 2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC), 2020, pp. 271-282.
[7] O. Leitersdorf, R. Ronen and S. Kvatinsky, "MultPIM: Fast Stateful Multiplication for Processing-in-Memory," in IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 69, no. 3, pp. 1647-1651, March 2022.
[8] M. Imani, S. Gupta, Y. Kim and T. Rosing, "FloatPIM: In-Memory Acceleration of Deep Neural Network Training with High Precision," 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), 2019, pp. 802-815.