Giter VIP home page Giter VIP logo

aritpim's Introduction

AritPIM: High-Throughput In-Memory Arithmetic

Overview

This is the simulation environment for the following paper,

O. Leitersdorf, D. Leitersdorf, J. Gal, M. Dahan, R. Ronen, and S. Kvatinsky, “AritPIM: High-Throughput In-Memory Arithmetic,” 2022.

The goal of the simulator is to verify the correctness of the algorithms proposed in the paper, to mesaure the performance of the algorithms (latency, area, energy), and to serve as an open-source library that may be utilized in other works.

The simulator logically models a row of a PIM array, and provides an interface for the proposed algorithm to perform basic logic gates on the row. The correctness of the operations is verified externally, and the performance is measured.

User Information

Dependencies

The simulation environment is implemented via numpy to enable fast execution of many samples in parallel. Therefore, the project requires the following libraries:

  1. python3
  2. numpy

Organization

The repository is organized into the following directories:

  • util: Contains general-purpose utility functions, e.g., for converting between binary and floating-point numbers.
  • serial: Contains the simulator, AritPIM implementation, and tester for bit-serial algorithms.
  • parallel: Contains the simulator, AritPIM implementation, and tester for bit-parallel algorithms.
  • output: Contains the final outputs of the AritPIM functions for 32-bit numbers.
  • results: Contains the final performance of the AritPIM functions for a wide variety of parameters.
  • gpu: Contains the code used for the GPU benchmark evaluation.

Correctness

We detail here the structure of the correctness simulations. The project is divided into separate simulations for the bit-serial and bit-parallel algorithms. Each include a class representing the simulation interface, a class representing the AritPIM algorithms, and a tester class. The tester class consists of tests for each of the proposed algorithms, with each test performing the following:

  1. Instantiating a simulator for that test. The simulator models 1024 crossbars with 1024 columns and 1024 rows each, stored as a single 1024 x (1024 * 1024) matrix (as the operations are identical regardless, so this does not affect correctness). Effectively, the simulator models 1024 * 1024 executions of the arithmetic function in parallel.
  2. Generating random input samples for each row, and writing them to the simulator.
  3. Executing the AritPIM algorithm on the given simulator. The algorithm is only allowed to request that the simulator perform logic operations (e.g., the algorithm cannot read/write from/to the memory).
  4. Reading the output generated by the algorithm for each row.
  5. Asserting that this output is identical to the expected result. For floating-point numbers, this is the result generated by numpy (which adheres to the IEEE 754 standard); note that we require an exact match and not an approximate match.

Notice that the simulator simulates NOT/NOR logic functionality for the crossbars, such that:

  • Writing to the crossbars using initialization operations (INIT0 or INIT1) is only supported for a single index at a time. While some previous works perform such initializations in bulk [1], the periphery and control for supporting arbitrary initialization was never studied [2], and thus we assume the conservative approach. We refer to this conservative approach as 1-INIT.
  • Performing a logic gate without initializing the output first results in the output cell storing the AND of the previous value stored in the cell with the result of the logic gate. We refer to such operations as X-MAGIC [3].

Comparison

We compare here the AritPIM algorithms to other works and GPU. The performance of the proposed algorithms is derived from the correctness simulations through performance counters in the simulator class (e.g., counting the number of logic instructions received from the algorithm, and the number of cells accessed). For this comparison, we assume the NOT/NOR family of logic gates as this provides a fair comparison to the previous works (which also included a NOT/NOR implementation). The below tables include a throughput (TOPS) and energy comparison (TOPS/W) for all the proposed algorithms on 32-bit numbers.

For the comparison to previous PIM works, we compare the cycle counts, energy (gate count), and area (cell count). We modify the results of the previous works to correspond to the conservative 1-INIT assumption for a fair comparison. We find:

Approach Function AritPIM Latency (Cycles) AritPIM Energy (Gates) AritPIM Area (cells) Previous Work Previous Latency (Cycles) Previous Energy (Gates) Previous Area (Cells)
Bit-Serial Fixed Add 577 577 101 Talati et al. [4] 577 577 101
Fixed Subtract 641 641 102 Talati et al. [4] 641 641 102
Fixed Multiply 18123 18123 187 Haj-Ali et al. [5] 20095 20095 256
Fixed Divide 28423 28423 170 Li et al. [6] 145472 145472 608
Floating Add (unsigned) 2306 2306 135 N/A
Floating Add (signed) 3997 3997 142 N/A
Floating Multiply 11586 11586 172 N/A
Floating Divide 19909 19909 139 N/A
Bit-Parallel Fixed Add 95 1359 256 N/A
Fixed Subtract 98 1424 288 N/A
Fixed Multiply 1251 25039 352 MultPIM [7] 1604 38016 352
Fixed Divide 4291 62338 448 N/A
Floating Add (unsigned) 817 5822 403 N/A
Floating Add (signed) 1359 10186 480 N/A
Floating Multiply 1407 16887 448 N/A
Floating Divide 3963 44530 544 N/A

Notice that we do not compare to FloatPIM [8] as FloatPIM requires additional periphery that does not adhere to the abstract PIM model. See the paper for more details. Also, note that addition requires 18N + 1 cycles as we assume 1-INIT with a 9-NOR implementation of a full adder.

For the comparison to GPU, we consider a benchmark of performing vectored operations of dimension n=64M, where the inputs start in the memory and the outputs are stored in the memory. To extrapolate AritPIM performance for this benchmark, we consider parameters from the RACER [2] architecture. To evaluate the GPU performance, we use the PyTorch library to initialize such vectors in the GPU memory and perform the vectored operations, while also measuring performance; the code is available in the gpu folder. Overall, we find:

Approach Function AritPIM TOPS AritPIM TOPS/W GPU Latency (sec) GPU Power (W) GPU TOPS GPU TOPS/W TOPS Improvement TOPS/W Improvement
Bit-Serial Fixed Add 34.89 0.271 0.001974 149 0.034 0.00023 1026x 1187x
Fixed Subtract 31.41 0.244 0.001974 148 0.034 0.00023 924x 1061x
Fixed Multiply 1.11 0.009 0.001974 151 0.034 0.00023 33x 38x
Fixed Divide 0.71 0.005 0.001975 159 0.034 0.00021 21x 26x
Floating Add (unsigned) 8.73 0.068 0.001973 150 0.034 0.00023 257x 299x
Floating Add (signed) 5.04 0.039 0.001973 151 0.034 0.00023 148x 174x
Floating Multiply 1.74 0.013 0.001973 149 0.034 0.00023 51x 59x
Floating Divide 1.01 0.008 0.001978 201 0.034 0.00017 30x 46x
Bit-Parallel Fixed Add 211.92 0.115 0.001974 149 0.034 0.00023 6234x 504x
Fixed Subtract 205.44 0.110 0.001974 148 0.034 0.00023 6043x 478x
Fixed Multiply 16.09 0.006 0.001974 151 0.034 0.00023 473x 28x
Fixed Divide 4.69 0.003 0.001975 159 0.034 0.00021 138x 12x
Floating Add (unsigned) 24.64 0.027 0.001973 150 0.034 0.00023 724x 118x
Floating Add (signed) 14.81 0.015 0.001973 151 0.034 0.00023 436x 68x
Floating Multiply 14.31 0.009 0.001973 149 0.034 0.00023 421x 41x
Floating Divide 5.08 0.004 0.001978 201 0.034 0.00017 150x 21x

Adoption

We seek to promote the adoption of the AritPIM algorithms in other PIM works. Therefore, we suggest three methods for utilizing the AritPIM algorithms:

  1. Adopt the code for an AritPIM algorithm in the simulator of the larger application. This enables executing the AritPIM algorithms for arbitrary parameters, and with relatively-simple code. See the "system tests" in the testers for examples.
  2. Use the final compiled gate lists from the AritPIM algorithm, provided in the output folder. Such outputs are specific to certain parameters; the published versions correspond to 32-bit numbers, yet any version can be produced by changing the parameters in the tester classes.
  3. Use the final results for the latency, energy and area of the AritPIM algorithms that is provided in the results folder.

References

[1] R. Ben-Hur et al., "SIMPLER MAGIC: Synthesis and Mapping of In-Memory Logic Executed in a Single Row to Improve Throughput," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 39, no. 10, pp. 2434-2447, Oct. 2020.

[2] Minh S. Q. Truong et al., "RACER: Bit-Pipelined Processing Using Resistive Memory," 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2021, pp. 100-116.

[3] N. Peled, R. Ben-Hur, R. Ronen and S. Kvatinsky, "X-MAGIC: Enhancing PIM Using Input Overwriting Capabilities," 2020 IFIP/IEEE 28th International Conference on Very Large Scale Integration (VLSI-SOC), 2020, pp. 64-69.

[4] N. Talati, S. Gupta, P. Mane and S. Kvatinsky, "Logic Design Within Memristive Memories Using Memristor-Aided loGIC (MAGIC)," in IEEE Transactions on Nanotechnology, vol. 15, no. 4, pp. 635-650, July 2016.

[5] A. Haj-Ali, R. Ben-Hur, N. Wald and S. Kvatinsky, "Efficient Algorithms for In-Memory Fixed Point Multiplication Using MAGIC," 2018 IEEE International Symposium on Circuits and Systems (ISCAS), 2018, pp. 1-5.

[6] R. Li, S. Song, Q. Wu and L. K. John, "Accelerating Force-directed Graph Layout with Processing-in-Memory Architecture," 2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC), 2020, pp. 271-282.

[7] O. Leitersdorf, R. Ronen and S. Kvatinsky, "MultPIM: Fast Stateful Multiplication for Processing-in-Memory," in IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 69, no. 3, pp. 1647-1651, March 2022.

aritpim's People

Contributors

oleitersdorf avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.