This is the simulation environment for the following paper,
O. Leitersdorf, D. Leitersdorf, J. Gal, M. Dahan, R. Ronen, and S. Kvatinsky, “AritPIM: High-Throughput In-Memory Arithmetic,” 2022.
The goal of the simulator is to verify the correctness of the algorithms proposed in the paper, to measure the performance of the algorithms (latency, area, energy), and to serve as an open-source library that may be utilized in other works.
The simulator logically models a row of a PIM array, and provides an interface for the proposed algorithm to perform basic logic gates on the row. The correctness of the operations is verified externally, and the performance is measured.
The simulation environment is implemented using `numpy` to enable fast execution of many samples in parallel. Therefore, the project requires the following libraries:
- python3
- numpy
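Assuming a standard `pip`-based environment, a minimal setup might look like:

```shell
# Install the only external dependency (assumes Python 3 is already available)
pip install numpy
```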
The repository is organized into the following directories:
- `util`: Contains general-purpose utility functions, e.g., for converting between binary and floating-point numbers.
- `serial`: Contains the simulator, AritPIM implementation, and tester for bit-serial algorithms.
- `parallel`: Contains the simulator, AritPIM implementation, and tester for bit-parallel algorithms.
- `output`: Contains the final outputs of the AritPIM functions for 32-bit numbers.
- `results`: Contains the final performance of the AritPIM functions for a wide variety of parameters.
- `gpu`: Contains the code used for the GPU benchmark evaluation.
We detail here the structure of the correctness simulations. The project is divided into separate simulations for the bit-serial and bit-parallel algorithms. Each includes a class representing the simulation interface, a class representing the AritPIM algorithms, and a tester class. The tester class consists of tests for each of the proposed algorithms, with each test performing the following:
- Instantiating a simulator for that test. The simulator models 1024 crossbars with 1024 columns and 1024 rows each, stored as a single 1024 x (1024 * 1024) matrix (as the operations are identical regardless, this does not affect correctness). Effectively, the simulator models 1024 * 1024 executions of the arithmetic function in parallel.
- Generating random input samples for each row, and writing them to the simulator.
- Executing the AritPIM algorithm on the given simulator. The algorithm is only allowed to request that the simulator perform logic operations (e.g., the algorithm cannot read/write from/to the memory).
- Reading the output generated by the algorithm for each row.
- Asserting that this output is identical to the expected result. For floating-point numbers, this is the result generated by `numpy` (which adheres to the IEEE 754 standard); note that we require an exact match and not an approximate match.
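As a rough illustration of this flow (the class and variable names here are hypothetical, not the repository's actual API), the per-test steps can be sketched in plain `numpy`, with the arithmetic stand-in being a simple ripple-carry adder applied to all rows in parallel:

```python
import numpy as np

# Hypothetical sketch of the test flow: a single boolean matrix standing in
# for the crossbars, random inputs written per row, and an exact comparison
# against numpy. Sizes are tiny stand-ins for the 1024 x (1024 * 1024) matrix.
rows, cols, n = 8, 64, 16
memory = np.zeros((rows, cols), dtype=bool)

# 1. Generate random n-bit inputs and write them to the simulated row bits.
rng = np.random.default_rng(0)
a = rng.integers(0, 2 ** n, size=rows)
b = rng.integers(0, 2 ** n, size=rows)
for bit in range(n):
    memory[:, bit] = (a >> bit) & 1
    memory[:, n + bit] = (b >> bit) & 1

# 2. Execute the arithmetic function on the stored bits (here a direct model
#    of a ripple-carry adder, acting on all rows simultaneously).
carry = np.zeros(rows, dtype=bool)
for bit in range(n):
    x, y = memory[:, bit], memory[:, n + bit]
    memory[:, 2 * n + bit] = x ^ y ^ carry
    carry = (x & y) | (carry & (x ^ y))

# 3. Read the output bits back and assert an exact match with numpy's result.
out = np.zeros(rows, dtype=np.int64)
for bit in range(n):
    out |= memory[:, 2 * n + bit].astype(np.int64) << bit
assert np.array_equal(out, (a + b) % (2 ** n))
```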
Notice that the simulator models NOT/NOR logic functionality for the crossbars, such that:
- Writing to the crossbars using initialization operations (INIT0 or INIT1) is only supported for a single index at a time. While some previous works perform such initializations in bulk [1], the periphery and control for supporting arbitrary initialization was never studied [2], and thus we assume the conservative approach. We refer to this conservative approach as 1-INIT.
- Performing a logic gate without initializing the output first results in the output cell storing the AND of the previous value stored in the cell with the result of the logic gate. We refer to such operations as X-MAGIC [3].
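These two conventions can be modeled directly. In the sketch below (hypothetical helper names, not the repository's API), an initialization writes exactly one cell, and an un-initialized NOR follows the X-MAGIC rule: the output cell retains the AND of its previous value with the gate result.

```python
import numpy as np

# Minimal model of a memory row under the NOT/NOR, 1-INIT, X-MAGIC semantics.
row = np.zeros(8, dtype=bool)

def init(row, j, value):
    """1-INIT: an initialization operation writes exactly one cell."""
    row[j] = value

def nor(row, in1, in2, out):
    """X-MAGIC NOR: the output cell stores the AND of its previous value
    with the NOR of the inputs (no implicit initialization)."""
    row[out] &= ~(row[in1] | row[in2])

# Example: computing NOT(a) as NOR(a, a) into a cell initialized to 1.
init(row, 0, True)    # a = 1
init(row, 2, True)    # the output cell must first be initialized to 1
nor(row, 0, 0, 2)     # row[2] = 1 AND NOR(1, 1) = 0
assert row[2] == False

# Without re-initialization, a second gate ANDs into the stale output value.
init(row, 1, False)   # b = 0
nor(row, 1, 1, 2)     # NOR(0, 0) = 1, but row[2] is 0, so 0 AND 1 = 0
assert row[2] == False
```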
We compare here the AritPIM algorithms to previous works and to a GPU. The performance of the proposed algorithms is derived from the correctness simulations through performance counters in the simulator class (e.g., counting the number of logic instructions received from the algorithm and the number of cells accessed). For this comparison, we assume the NOT/NOR family of logic gates as this provides a fair comparison to the previous works (which also included NOT/NOR implementations). The tables below include a throughput (TOPS) and energy-efficiency (TOPS/W) comparison for all the proposed algorithms on 32-bit numbers.
For the comparison to previous PIM works, we compare the cycle counts, energy (gate count), and area (cell count). We modify the results of the previous works to correspond to the conservative 1-INIT assumption for a fair comparison. We find:
Approach | Function | AritPIM Latency (Cycles) | AritPIM Energy (Gates) | AritPIM Area (Cells) | Previous Work | Previous Latency (Cycles) | Previous Energy (Gates) | Previous Area (Cells)
---|---|---|---|---|---|---|---|---
Bit-Serial | Fixed Add | 577 | 577 | 101 | Talati et al. [4] | 577 | 577 | 101
Bit-Serial | Fixed Subtract | 641 | 641 | 102 | Talati et al. [4] | 641 | 641 | 102
Bit-Serial | Fixed Multiply | 18123 | 18123 | 187 | Haj-Ali et al. [5] | 20095 | 20095 | 256
Bit-Serial | Fixed Divide | 28423 | 28423 | 170 | Li et al. [6] | 145472 | 145472 | 608
Bit-Serial | Floating Add (unsigned) | 2306 | 2306 | 135 | N/A | | |
Bit-Serial | Floating Add (signed) | 3997 | 3997 | 142 | N/A | | |
Bit-Serial | Floating Multiply | 11586 | 11586 | 172 | N/A | | |
Bit-Serial | Floating Divide | 19909 | 19909 | 139 | N/A | | |
Bit-Parallel | Fixed Add | 95 | 1359 | 256 | N/A | | |
Bit-Parallel | Fixed Subtract | 98 | 1424 | 288 | N/A | | |
Bit-Parallel | Fixed Multiply | 1251 | 25039 | 352 | MultPIM [7] | 1604 | 38016 | 352
Bit-Parallel | Fixed Divide | 4291 | 62338 | 448 | N/A | | |
Bit-Parallel | Floating Add (unsigned) | 817 | 5822 | 403 | N/A | | |
Bit-Parallel | Floating Add (signed) | 1359 | 10186 | 480 | N/A | | |
Bit-Parallel | Floating Multiply | 1407 | 16887 | 448 | N/A | | |
Bit-Parallel | Floating Divide | 3963 | 44530 | 544 | N/A | | |
Notice that we do not compare to FloatPIM [8] as FloatPIM requires additional periphery that does not adhere to the abstract PIM model; see the paper for more details. Also, note that addition requires 18N + 1 cycles as we assume 1-INIT with a 9-NOR implementation of a full adder.
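This cycle count follows from the 1-INIT assumption: each of the N bit positions costs 9 NOR gate cycles plus 9 single-cell initialization cycles for the gate outputs, and one additional initialization accounts for the +1 (a plausible reading; see the paper for the exact schedule). A quick check against the bit-serial Fixed Add row above:

```python
# Bit-serial fixed-point addition latency under 1-INIT with a 9-NOR full
# adder: 9 gates + 9 initializations per bit, plus one extra initialization.
def serial_add_cycles(n):
    return 18 * n + 1

assert serial_add_cycles(32) == 577  # matches the bit-serial Fixed Add row
```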
For the comparison to GPU, we consider a benchmark of performing vectored operations of dimension n = 64M, where the inputs start in the memory and the outputs are stored in the memory. To extrapolate AritPIM performance for this benchmark, we consider parameters from the RACER [2] architecture. To evaluate the GPU performance, we use the PyTorch library to initialize such vectors in the GPU memory and perform the vectored operations, while also measuring performance; the code is available in the `gpu` folder. Overall, we find:
Approach | Function | AritPIM TOPS | AritPIM TOPS/W | GPU Latency (sec) | GPU Power (W) | GPU TOPS | GPU TOPS/W | TOPS Improvement | TOPS/W Improvement
---|---|---|---|---|---|---|---|---|---
Bit-Serial | Fixed Add | 34.89 | 0.271 | 0.001974 | 149 | 0.034 | 0.00023 | 1026x | 1187x
Bit-Serial | Fixed Subtract | 31.41 | 0.244 | 0.001974 | 148 | 0.034 | 0.00023 | 924x | 1061x
Bit-Serial | Fixed Multiply | 1.11 | 0.009 | 0.001974 | 151 | 0.034 | 0.00023 | 33x | 38x
Bit-Serial | Fixed Divide | 0.71 | 0.005 | 0.001975 | 159 | 0.034 | 0.00021 | 21x | 26x
Bit-Serial | Floating Add (unsigned) | 8.73 | 0.068 | 0.001973 | 150 | 0.034 | 0.00023 | 257x | 299x
Bit-Serial | Floating Add (signed) | 5.04 | 0.039 | 0.001973 | 151 | 0.034 | 0.00023 | 148x | 174x
Bit-Serial | Floating Multiply | 1.74 | 0.013 | 0.001973 | 149 | 0.034 | 0.00023 | 51x | 59x
Bit-Serial | Floating Divide | 1.01 | 0.008 | 0.001978 | 201 | 0.034 | 0.00017 | 30x | 46x
Bit-Parallel | Fixed Add | 211.92 | 0.115 | 0.001974 | 149 | 0.034 | 0.00023 | 6234x | 504x
Bit-Parallel | Fixed Subtract | 205.44 | 0.110 | 0.001974 | 148 | 0.034 | 0.00023 | 6043x | 478x
Bit-Parallel | Fixed Multiply | 16.09 | 0.006 | 0.001974 | 151 | 0.034 | 0.00023 | 473x | 28x
Bit-Parallel | Fixed Divide | 4.69 | 0.003 | 0.001975 | 159 | 0.034 | 0.00021 | 138x | 12x
Bit-Parallel | Floating Add (unsigned) | 24.64 | 0.027 | 0.001973 | 150 | 0.034 | 0.00023 | 724x | 118x
Bit-Parallel | Floating Add (signed) | 14.81 | 0.015 | 0.001973 | 151 | 0.034 | 0.00023 | 436x | 68x
Bit-Parallel | Floating Multiply | 14.31 | 0.009 | 0.001973 | 149 | 0.034 | 0.00023 | 421x | 41x
Bit-Parallel | Floating Divide | 5.08 | 0.004 | 0.001978 | 201 | 0.034 | 0.00017 | 150x | 21x
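The GPU TOPS and TOPS/W columns follow directly from the measured latency and power: with n = 64M element-wise operations per run, TOPS = n / (latency * 10^12) and TOPS/W = TOPS / power. A quick sanity check against the Fixed Add row (using the values from the table above):

```python
# Recomputing the GPU throughput columns from the measured latency and power
# (n = 64M = 64 * 2^20 element-wise operations per benchmark run).
n = 64 * 2 ** 20
latency, power = 0.001974, 149  # GPU Fixed Add row above

tops = n / latency / 1e12
tops_per_watt = tops / power

assert round(tops, 3) == 0.034             # matches the "GPU TOPS" column
assert round(tops_per_watt, 5) == 0.00023  # matches the "GPU TOPS/W" column
```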
We seek to promote the adoption of the AritPIM algorithms in other PIM works. Therefore, we suggest three methods for utilizing the AritPIM algorithms:
- Adopt the code for an AritPIM algorithm in the simulator of the larger application. This enables executing the AritPIM algorithms for arbitrary parameters, and with relatively simple code. See the "system tests" in the testers for examples.
- Use the final compiled gate lists from the AritPIM algorithms, provided in the `output` folder. Such outputs are specific to certain parameters; the published versions correspond to 32-bit numbers, yet any version can be produced by changing the parameters in the tester classes.
- Use the final results for the latency, energy, and area of the AritPIM algorithms that are provided in the `results` folder.
[1] R. Ben-Hur et al., "SIMPLER MAGIC: Synthesis and Mapping of In-Memory Logic Executed in a Single Row to Improve Throughput," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 39, no. 10, pp. 2434-2447, Oct. 2020.
[2] M. S. Q. Truong et al., "RACER: Bit-Pipelined Processing Using Resistive Memory," 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2021, pp. 100-116.
[3] N. Peled, R. Ben-Hur, R. Ronen and S. Kvatinsky, "X-MAGIC: Enhancing PIM Using Input Overwriting Capabilities," 2020 IFIP/IEEE 28th International Conference on Very Large Scale Integration (VLSI-SOC), 2020, pp. 64-69.
[4] N. Talati, S. Gupta, P. Mane and S. Kvatinsky, "Logic Design Within Memristive Memories Using Memristor-Aided loGIC (MAGIC)," in IEEE Transactions on Nanotechnology, vol. 15, no. 4, pp. 635-650, July 2016.
[5] A. Haj-Ali, R. Ben-Hur, N. Wald and S. Kvatinsky, "Efficient Algorithms for In-Memory Fixed Point Multiplication Using MAGIC," 2018 IEEE International Symposium on Circuits and Systems (ISCAS), 2018, pp. 1-5.
[6] R. Li, S. Song, Q. Wu and L. K. John, "Accelerating Force-directed Graph Layout with Processing-in-Memory Architecture," 2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC), 2020, pp. 271-282.
[7] O. Leitersdorf, R. Ronen and S. Kvatinsky, "MultPIM: Fast Stateful Multiplication for Processing-in-Memory," in IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 69, no. 3, pp. 1647-1651, March 2022.
[8] M. Imani, S. Gupta, Y. Kim and T. Rosing, "FloatPIM: In-Memory Acceleration of Deep Neural Network Training with High Precision," 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), 2019, pp. 802-815.