thesis's Issues

Visualization Techniques

  • Bar Charts - M
  • Tables - LaTeX, M
  • GUI Widgets - M
  • Simple Plots - M
  • Scatter Plot - M, P, S
  • Box Plot - M, P, S
  • Violin Plot - M, P, S
  • Joint Plot - S
  • Facet Grid - S

Matplotlib

Pandas

Seaborn

Vega-Lite

Visualizations in the field (SIMD, Join Estimators etc.)

  • Read a few papers that use SIMD and similar optimizations to get inspired regarding structure and visualization

Python

Benchmark AGMS Optimized

Note the speed-up from SIMD, memory optimization, or anything else done to improve the throughput of the algorithm.

SIMD Implementation Notes

Environment setup

  • Install GSL development library for tsimd

Practical research

Dummy tests for learning vectorization optimization

Enforcing automatic compiler SIMD optimization:
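As a minimal sketch (assuming GCC or Clang), a small data-parallel loop like the one below can serve as a dummy test for checking whether the compiler auto-vectorizes it; the flags in the comment are the usual ones and the file name is only illustrative.

// Hypothetical dummy test, e.g. saved as vec_test.cpp.
// Typical invocation: g++ -O3 -march=native -fopt-info-vec vec_test.cpp
// (-fopt-info-vec-missed additionally reports loops that could not be vectorized).
#include <cstdio>

int main() {
    const int n = 1 << 20;
    static float a[1 << 20], b[1 << 20], c[1 << 20];

    for (int i = 0; i < n; ++i) {   // simple, dependency-free loops are
        a[i] = 1.0f;                // good candidates for auto-vectorization
        b[i] = 2.0f;
    }
    for (int i = 0; i < n; ++i)
        c[i] = a[i] * b[i] + 0.5f;

    std::printf("%f\n", c[n - 1]);  // use the result so it is not optimized away
    return 0;
}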

Initial research SIMD Libraries

AVX-512

blank

Credit

blank

Other

Questions For Martin

To clear up before benchmarking any throughput

  • 1. In the update function there is a multiplication by 1 that we discussed at the very beginning; you mentioned it could be useful if data arrives in packets, or something similar. Is it required to leave it in the code, or should I remove it, since it introduces an unnecessary multiplication in my use case and is not going to be part of the benchmarking?
  • 2. I have multiple files that are not shared on the GitHub repo - the large input data files (~10 GB) and the profiling reports from VTune (~20 GB). Do I have to leave sources for these in the thesis, and if yes, how?

Writing

  • 3. In your opinion, what could be a good outline of the problem statement for my thesis? (Do I talk about the need for more throughput as data streams keep growing, do I talk about how vectorization works and what needs to be considered before implementing it (e.g. compute bounds), or do I talk about something else?)

  • 4. Do I provide pseudo-code for the join/self_join functions for both algorithms in the description section of those algorithms, or just for the update functions?

  • 5. In which part does it make the most sense to discuss the baseline benchmark results (with graphs etc.)? I was initially planning to discuss them after I introduce the test conditions under Approach, e.g. after talking about implementation and tools, but now I'm thinking all results might have to go under Evaluation. Please let me know.

Thesis Structure

Create an initial outline plan and post it here.
It will be updated and added to this ticket as time goes on.

Benchmark Fast-AGMS baseline

This ticket is to log and document the process of benchmarking, gathering of data, and interpretation of that data for the baseline case of the Fast-AGMS algorithm.

Runtime

VTune

Memory Tuning Fast-AGMS

General Sources on Memory Tuning

  • High Throughput Heavy Hitter Aggregation for Modern SIMD Processors

Benchmarks and optimization

The goals are as follows:

  • #17 #18 Benchmark both algorithms’ baseline speed
  • #19 #20 Analyze algorithms for hotspots
  • #21 #22 Determine existence of memory bounds
  • #23 #24 Determine existence of compute bounds
  • #6 Adapt to use SIMD instructions where suitable
  • #25 #26 Consider memory-tuning if feasible
  • #27 #28 Benchmark improved algorithms

Open-Source AGMS and Fast-AGMS

  • Research project's code
  • Setup initial project
  • Lock any random data to seeded data for use during benchmarks
  • Run implementation, understand parameters

Research AGMS

Put here all the related information for this algorithm. Papers, experiments, videos, etc.

Papers

  • (Pseudo-code) Tracking join and self-join sizes in limited storage
  • (EH3 definition) Pseudo-Random Number Generation for Sketch-Based Estimations
  • (section Randomized Stream Sketching) Sketching Streams Through the Net - Distributed Approximate Query Tracking
  • (original proposal) The Space Complexity of Approximating the Frequency Moments

Algorithm

  • Join/Self-Join estimation
  • Update function (EH3) (see the rough sketch below)
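A rough, illustrative sketch of how the two pieces fit together, assuming a rows_no x bins_no counter array; xi stands in for a proper 4-wise independent +/-1 generator such as EH3, and all names are hypothetical rather than the thesis implementation:

// Illustrative AGMS sketch: the update touches every counter, and the
// self-join estimate averages squared counters within a row and takes
// the median across rows.
#include <algorithm>
#include <cstdint>
#include <vector>

struct AGMS {
    int rows_no, bins_no;
    std::vector<double> counters;   // rows_no * bins_no independent estimators

    AGMS(int rows, int bins)
        : rows_no(rows), bins_no(bins), counters(rows * bins, 0.0) {}

    // Placeholder +/-1 generator; a real implementation would use EH3
    // seeded per (row, bin) and is NOT equivalent to this hash mix.
    int xi(int row, int bin, uint32_t key) const {
        uint64_t h = (uint64_t)key * 0x9E3779B97F4A7C15ull
                     ^ ((uint64_t)row << 32) ^ (uint64_t)bin;
        h ^= h >> 33; h *= 0xFF51AFD7ED558CCDull; h ^= h >> 33;
        return (h & 1) ? 1 : -1;
    }

    // Every counter is updated for each incoming value.
    void update(uint32_t key, double freq) {
        for (int r = 0; r < rows_no; ++r)
            for (int b = 0; b < bins_no; ++b)
                counters[r * bins_no + b] += freq * xi(r, b, key);
    }

    // Self-join size estimate: average of squared counters per row,
    // median over rows.
    double self_join_size() {
        std::vector<double> row_avg(rows_no, 0.0);
        for (int r = 0; r < rows_no; ++r) {
            for (int b = 0; b < bins_no; ++b) {
                double c = counters[r * bins_no + b];
                row_avg[r] += c * c;
            }
            row_avg[r] /= bins_no;
        }
        std::nth_element(row_avg.begin(), row_avg.begin() + rows_no / 2, row_avg.end());
        return row_avg[rows_no / 2];
    }
};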

Extra theory

Credit

  • Sketches project source code

Writing TODO

  • revise Problem Statement (comments in email)

  • 3 side notes in Feedback 2 regarding Background

Vectorization and AVX Research

  • From https://lemire.me/blog/2018/04/19/by-how-much-does-avx-512-slow-down-your-cpu-a-first-experiment/
    "Hello,
    I did a few vector AVX512 benchmarks. They are mostly arithmetic vector operations. I found that the peak of floating point multiplications is doubled from AVX2 (I configured bios for not throttling down). That’s the case when the vectors with samples are smaller than the cache pages, otherwise there is memory bottleneck. So, both AVX512 and AVX2 have same flops. I guess for intensive computations which require little memory and lots of operations AVX512 provides a better performance. Also, I’ve seen that different intel architectures have different performance in shuffling operations which are a really big bottleneck.
    In any case, benchmarking instructions sets is pretty complex and highly application dependant."

Benchmark Fast-AGMS Optimized

Note the speed-up from SIMD, memory optimization, or anything else done to improve the throughput of the algorithm.

  • Removal of possibilities for conditional branching
  • Removal of heavy functions like modulo (both points are sketched below)
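A minimal illustration of both points, with hypothetical names; the mask trick is only valid when bins_no is a power of two:

#include <cstdint>

// Modulo replaced by a bit mask (same result only when bins_no is a power of two).
inline uint32_t bucket_mod(uint32_t hash, uint32_t bins_no)  { return hash % bins_no; }
inline uint32_t bucket_mask(uint32_t hash, uint32_t bins_no) { return hash & (bins_no - 1); }

// Conditional +/-1 update replaced by branch-free arithmetic.
inline void update_branchy(double &counter, double freq, bool xi_bit) {
    if (xi_bit) counter += freq; else counter -= freq;
}
inline void update_branchless(double &counter, double freq, bool xi_bit) {
    counter += freq * (2 * static_cast<int>(xi_bit) - 1);   // xi_bit=1 -> +freq, 0 -> -freq
}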

Questions to Martin

  • 1. How rapidly does streaming data arrive in real use cases? Does it make sense to load all data into RAM first, or just read it sequentially from a file pointer (might be too slow)?
  • 2. About data generators - I don't see any seeding options in the data generators provided by the sketches source code, so I need to use a different one. Do you have any generator in mind, or can I just create the data using Python and load it (see question 1) before I run the sketches? Since the data itself won't have to change from run to run, this could also work?
  • 3. Would you prefer a specific data distribution for the tests? Normal, Zipfian, etc?
  • 4. Would the distribution impact the throughput at all? Should I test for that?
  • 5. Which aspects of the algorithms am I benchmarking? Only the sketch update, or also the join size and self-join size?
  • 6. Is there a way to interpret the resulting self-join sizes to make sure the algorithms are working as they should? I have seeded the I1 and I2 variables to make sure they are consistent between runs, and the data vectors are also constant, but I don't know how to work with the end result other than comparing the number between the baseline and optimized versions later on. (See the sketch after this list.)
  • 6-extra. I found the Fast-AGMS algorithm to be better-to-significantly-better in approximating the size - AGMS seemed a lot more sensitive to the right ratio between vector size and input data size (it worked best for zipf but other distributions like discrete normal were off by a factor of 4-5 from the target result). I want to discuss this with you to cancel out any suspicion of bugs in the implementation. TODO: compare a couple of results to make sure the numbers are okay.
  • 7. Sorry if this question isn't clear enough: I have noticed that AGMS's throughput is sensitive to the number of independent variables, since the whole vector gets updated for each new value (in the code I have bins_no and rows_no, and all bins_no * rows_no counters get updated for each new value), whereas Fast-AGMS isn't. Should I include a diagram showing AGMS's comparative disadvantage in that regard, as part of a discussion subsection for example?
  • 8. Since the source code doesn't provide an H3-family hash function (there are only 2-wise CW2 and 4-wise CW4 b-valued random variables according to the author, which I don't fully understand so far), I remember you mentioned they are quite common. I am having a hard time googling for an implementation - or was the plan that I implement it myself? All I need is a boolean matrix and the function from the paper you emailed me (https://www.cise.ufl.edu/~adobra/Publications/tods-2007-06.pdf), right?
  • 9. The resulting value after hashing with H3 can be of any length I choose (currently 32 bit). You mentioned that modulo operations for such functions slow down the algorithm, so how do I make sure the resulting value only goes up to buckets_no? The other bucket hash functions in the project use a modulo at the end for the return value, and currently the H3 implementation does as well.
  • 10. Discuss in person: for example, it's not possible to draw a normal distribution of size 100k with a range between the inputs greater than 100k, right? But AGMS will not play nice with sets greater than 1 million inputs, for example (due to rows_no and buckets_no being > 100 each to make it more accurate), so I can't do much in that case. I can however do what you suggested for the uniform distribution, but it is still worth discussing in more detail when we meet.
  • 11. The SIMD libraries I found support the basic operators such as +, -, &, ^ etc. But what about math functions like ceil(), log2() etc.? Same issue when I have to pass the values of a vector of size n to n simultaneous update functions - what does that look like? Do I rewrite the functions to take the SIMD vectors instead of single variables?
  • 12. Do I have to do a presentation? If yes, roughly when? Details?
  • 13. Does it make sense to show a snippet of the alternative H3 from this ticket, as to provide the reader with a second option and also make a reference to the other work (if present in any work)?
  • 14. Any sense to have performance numbers in a diagram for AVX-2, as a side note?
  • 15. Do you have any specific profiling settings you need me to run via VTune? The ones I plan to use initially are: Advanced Hotspots, Microarchitecture Exploration, Memory Access
  • 16. Note down the different comparison approaches Martin said to discuss directly.
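Regarding question 6, one possible sanity check is to compare the estimates against the exact self-join size, which for a single stream is the second frequency moment F2 = sum over distinct values v of f_v^2. A small sketch of that check, with illustrative names:

#include <cstdint>
#include <unordered_map>
#include <vector>

// Exact self-join size of a stream: sum of squared frequencies.
double exact_self_join(const std::vector<uint32_t> &data) {
    std::unordered_map<uint32_t, uint64_t> freq;
    for (uint32_t v : data) ++freq[v];

    double f2 = 0.0;
    for (const auto &kv : freq)
        f2 += static_cast<double>(kv.second) * static_cast<double>(kv.second);
    return f2;   // sketch estimates on seeded data can be compared against this
}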

In-Depth CPU analysis with VTune Amplifier

Get guides and tutorials on how to do the different types of analysis using VTune.

Build Specifics

  • Create a fully optimized build (a sample command follows below)
  • With -O2 compiler optimization
  • Without debug mode (-g), otherwise the results get less accurate
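A minimal example of such a build invocation, assuming g++ and a hypothetical source file name:

g++ -O2 -march=native -o sketch_bench main.cpp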

Analysis Aspects

  • Module breakdown
  • (parallel execution) Spin time : one thread busy-waits for another to release a shared resource.
  • (parallel execution) Overhead time (thread transition time) : the time it takes for one thread to retire and another to start using the CPU.
  • Function CPI

Analysis Types

  • -collect advanced-hotspots
  • -collect general-exploration
  • -collect memory-access

Baseline Implementation

This milestone concerns understanding how the provided open source C++ code for AGMS and Fast-AGMS works, how to use it, and how to work with the additionally provided data generators. The goals are as follows:

  • #13 Research and test open-source implementations of AGMS and Fast-AGMS
  • #14 Make use of data generators to prepare test data

1. Questions/Discussion for Martin

A list of questions I need to ask Martin the next time we meet.

Thesis Layout

  • Section 2.2 "Related Work", does something like that make sense to include in my thesis?
  • If yes, what exactly? I see other papers combine this with multi-threaded optimization, sometimes with MIMD, and maybe I can include things like OpenMP etc.?
    Answer: Papers that have tested out count-min or AGMS/FAGMS in some setting are good examples.
    I remember Martin saying I could stick to some more general information if I don't have enough direct examples, but should also try to keep it brief.
  • Where do I discuss alternative sketches? - In the Background section
  • Could Section 2.3 be about background of AGMS and Fast-AGMS?

Input Data

  • Understand data generation for this thesis (what type of data?) -> 32-bit integers (any)
  • How do I simulate data streaming for the benchmarks? Feed the algorithm sequentially from a data file? -> Just use the approach from the sample where you generate data and run through it.
  • Should I use data sampling for the algorithms? -> No

Libraries and Implementation

  • tsimd doesn't support unsigned ints, need to EITHER combine it with the default library to achieve vectorization OR use the additional libsimdpp
  • (1) "... where m_v is the number of members with value v" : What is m in our case? The compare_sketches sample file assigns a static 1.0 as m.
  • Do I need to vectorize the update functions as well (EH3 implementation), or just whatever makes sense after analysis?

Misc

  • Get server access to learn using VTune remotely
  • LaTeX - Where did he gather/use the "List of Tables/Figures/Listings" ? Self-generated, check the main file.

Helpful papers that give insight into what I did in this thesis can be briefly mentioned.

Note for related work: Fast AGMS is AGMS + Count-Min -> check papers that worked on Count-Min to maybe include some of those insights in the related work.

Memory Tuning AGMS

General Sources on Memory Tuning

  • High Throughput Heavy Hitter Aggregation for Modern SIMD Processors

Research Fast-AGMS

Put here all the related information for this algorithm. Papers, experiments, videos, etc.

Papers

  • Sketching streams through the net: distributed approximate query tracking

Random Variables

  • EH3 for update (Pseudo-Random Number Generation for Sketch-Based Estimations)
  • Source code provides CW2 for the hash buckets (rows in count-min sketch), which is also 3-wise independent. But I could also implement H3 (Efficient Hardware Hashing Functions for High Performance Computers); see the sketch below.
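A rough sketch of an H3-style hash in the spirit of the questions above, assuming 32-bit keys; all names are illustrative, and mapping to buckets is left to a power-of-two mask instead of a modulo:

#include <array>
#include <cstdint>
#include <random>

// H3 family: one random word per key bit; the set bits of the key select
// the words that are XORed together to form the hash.
struct H3Hash {
    std::array<uint32_t, 32> q{};   // random "boolean matrix", one row per key bit

    explicit H3Hash(uint64_t seed) {
        std::mt19937_64 gen(seed);
        for (auto &row : q) row = static_cast<uint32_t>(gen());
    }

    uint32_t operator()(uint32_t key) const {
        uint32_t h = 0;
        for (int i = 0; i < 32; ++i)
            if (key & (1u << i)) h ^= q[i];
        return h;   // e.g. bucket = h & (buckets_no - 1) for power-of-two buckets_no
    }
};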

TODO

VTune

  • Note down a plan on how to run initial VTune Hotspot Analysis
  • Implement dynamic values for static variables, test if that works
  • Comment out all prints
  • Put a 10-second sleep between the two update loops (see the snippet below)
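For the last point, a minimal sketch of the 10-second pause that separates the two loops on the VTune timeline (loop bodies are placeholders):

#include <chrono>
#include <thread>

void run_both_benchmarks() {
    // ... first update loop (e.g. AGMS) ...
    std::this_thread::sleep_for(std::chrono::seconds(10));   // visible gap in the timeline
    // ... second update loop (e.g. Fast-AGMS) ...
}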

SIMD

Benchmarks

  • Make dummy data
  • Implement sample diagram with dummy data
  • Make sure that each run uses random values for variables that need random values
  • Implement saving metrics in a file for each test run (see the sketch after this list)
  • Decide on test case data (R x B sizes)
  • Compile different test cases
  • Book critical timeslot on server
  • Organize test folder on server, create a separate folder for this test
  • Run tests x-times each and collect data in the output file
  • Transfer data to diagram
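A hypothetical sketch of one such run: time the update loop, derive the throughput, and append one CSV row per run (all names and the CSV columns are illustrative):

#include <chrono>
#include <cstdint>
#include <fstream>
#include <vector>

void benchmark_run(const std::vector<uint32_t> &data, const char *csv_path) {
    auto start = std::chrono::steady_clock::now();
    for (uint32_t key : data) {
        // sketch.update(key, 1.0);   // update call of the algorithm under test
        (void)key;
    }
    auto stop = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(stop - start).count();
    double tuples_per_sec = static_cast<double>(data.size()) / secs;

    std::ofstream out(csv_path, std::ios::app);   // one row appended per run
    out << data.size() << "," << secs << "," << tuples_per_sec << "\n";
}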

Visualization Ideas

  • Roofline Model for Bottleneck representation of both algorithms
  • A bunch of interesting statistical visualizations in the "Statistical Analysis of Sketch Estimators" paper
  • Show a snippet of critical code, then show the resulting VTune profile, discuss the problem+optimization, then show snippets of the result
  • A multi-diagram showing the CPU Time distribution per different sample distribution / buckets size
  • A quick diagram showing the throughput depending on the sample used
  • A graph displaying throughput scaling depending on sample size (and distribution?) per algorithm, and then another one after optimizations
  • A simple graph showing only improvement in percent for AGMS-Stock vs AGMS-Optimized and F-AGMS-Stock vs F-AGMS-Optimized (averaged across all distributions and all RowXBucket sizes)

Credit

  • Random (linear) projection "Random projection in dimensionality reduction: Applications to image and text data" ; "Random Projection, Margins, Kernels, and Feature-Selection"
  • Sketches project source code
  • tsimd Library (need to contact somewhat soon)
  • libsimdpp Library
  • Martin

Data Generation

  • Understand data generation for this thesis (what type of data?) -> (unsigned) integers should do
  • Understand how the provided open-source data generators work
  • Find seeded data generators (using some distribution, like Zipf/normal) to generate consistent data (see the sketch after this list)
  • Make use of data generators to prepare Zipfian-distributed test data
  • Make use of data generators to prepare uniformly distributed test data, as a worst-case scenario
  • Generate data with a discrete Normal Distribution pattern.
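A minimal sketch of seeded, reproducible data generation with the standard library; a Zipfian generator would still have to be added separately, and the distribution parameters are illustrative:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Fixed seed -> identical data on every run.
std::vector<uint32_t> make_normal_data(std::size_t n, uint64_t seed) {
    std::mt19937_64 gen(seed);
    std::normal_distribution<double> dist(50000.0, 10000.0);

    std::vector<uint32_t> data(n);
    for (auto &v : data)
        v = static_cast<uint32_t>(std::lround(std::max(0.0, dist(gen))));
    return data;
}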

Benchmark AGMS baseline

This ticket is to log and document the process of benchmarking, gathering of data, and interpretation of that data for the baseline case of the AGMS algorithm.

Runtime

VTune

VTune Amplifier’s CLI on Local and Remote Machines

PATH to installed version is:

Remote

/opt/intel/vtune_amplifier_2018.0.2.525261/bin64/

Local

/opt/intel/vtune_amplifier_2019.3.0.590814/bin64/

The executables used to start it are either:

  • ./amplxe-gui
  • ./amplxe-cl

Sample CLI runs:

Hotspot Analysis

/opt/intel/system_studio_2019/vtune_amplifier_2019.3.0.590814/bin64/amplxe-cl -collect hotspots -knob sampling-mode=hw -finalization-mode=full -app-working-dir /home/morty/sketch_profiling/ -- /home/morty/sketch_profiling/FILENAME

Microarchitecture Exploration Analysis

/opt/intel/system_studio_2019/vtune_amplifier_2019.3.0.590814/bin64/amplxe-cl -collect uarch-exploration -knob collect-memory-bandwidth=true -target-duration-type=veryshort -finalization-mode=full -app-working-dir /home/morty/sketch_profiling/ -- /home/morty/sketch_profiling/FILENAME

File Management

Copy file/folder from local to remote

rsync -chavzP --stats /home/meggamorty/CLionProjects/thesis/optimization/cmake-build-debug/optimization [email protected]:/home/morty/FOLDER

Copy file/folder from remote to local

rsync -chavzP --stats [email protected]:/home/morty/sketch_profiling/COPYDIR /home/meggamorty/vtune-reports/PASTEDIR

Notes on warnings

Literature and Programming Research

The first milestone mainly consists of research on the theory and technology that I'm going to be using for this project. The plan currently consists of the following goals:

  • #10 #11 Research algorithms in detail
  • #3 Study visualization techniques for informative and simple paper diagrams
  • #4 Study programming language (C++)
  • Test the supplied remote PC (remote control via SSH, packages, etc.)
  • #6 Understand SIMD and AVX-512 instructions for C-like languages
  • #9 Understand technical aspects of CPU analysis with VTune Amplifier
  • #8 Test out VTune Amplifier’s CLI on local and remote machines
  • #30 Gather sources and information useful for the Related Work subsection

Related Work

Count-Min

Folder name Related_Work

  • Count-Min
  • count-min sketch as closely related predecessor to FastAGMS (1, 2)
  • count-min and sketches being used in ML applications (3, )
  • count-min and other proposals are also used for their efficient statistical approximations in natural language processing (NLP) solutions (4, 5)

AGMS

  • AGMS - couldn't find anything useful at the moment

FastAGMS

Folder Name Papers_using_FastAGMS

  • FastAGMS - only two so far
  • Sketching streams through the net: Distributed approximate query tracking

SIMD

Folder Name Papers_using_SIMD

  • SIMD
  • papers (1, 2, 3, 4)
  • (5) Significantly less power consumption for the same throughput on streaming update ops for count-min sketch, on a low-power system

SIMD + Memory Tuning

  • High Throughput Heavy Hitter Aggregation for Modern SIMD Processors
