thesis's Issues

Visualization Techniques

  • Bar Charts - M
  • Tables - LaTeX, M
  • GUI Widgets - M
  • Simple Plots - M
  • Scatter Plot - M, P, S
  • Box Plot - M, P, S
  • Violin Plot - M, P, S
  • Joint Plot - S
  • Facet Grid - S

Matplotlib

Pandas

Seaborn

Vega-Lite

Visualizations in the field (SIMD, Join Estimators etc.)

  • Read a few papers that use SIMD and similar optimizations to get inspired regarding structure and visualization

Python

Benchmark AGMS Optimized

Note the speed-up from SIMD, memory optimization, or anything else done to improve the throughput of the algorithm.

SIMD Implementation Notes

Environment setup

  • Install GSL development library for tsimd

Practical research

Dummy tests for learning vectorization optimization

Enforcing automatic compiler SIMD optimization:
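As a minimal sketch (assuming GCC or Clang), a small data-parallel loop like the one below can serve as a dummy test for checking whether the compiler auto-vectorizes it; the flags in the comment are the usual ones and the file name is only illustrative.

// Hypothetical dummy test, e.g. saved as vec_test.cpp.
// Typical invocation: g++ -O3 -march=native -fopt-info-vec vec_test.cpp
// (-fopt-info-vec-missed additionally reports loops that could not be vectorized).
#include <cstdio>

int main() {
    const int n = 1 << 20;
    static float a[1 << 20], b[1 << 20], c[1 << 20];

    for (int i = 0; i < n; ++i) {   // simple, dependency-free loops are
        a[i] = 1.0f;                // good candidates for auto-vectorization
        b[i] = 2.0f;
    }
    for (int i = 0; i < n; ++i)
        c[i] = a[i] * b[i] + 0.5f;

    std::printf("%f\n", c[n - 1]);  // use the result so it is not optimized away
    return 0;
}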

Initial research SIMD Libraries

AVX-512

blank

Credit

blank

Other

Questions For Martin

To clear up before benchmarking any throughput

  • 1. In the update function there is a multiplication by 1 that we discussed at the very beginning; you mentioned it could be useful if data arrives in packets, or something similar. Is it required to leave it in the code, or should I remove it, since it introduces an unnecessary multiplication in my use case and is not going to be part of the benchmarking?
  • 2. I have multiple files that are not shared on the GitHub repo - the large input data files (~10 GB) and the profiling reports from VTune (~20 GB). Do I have to leave sources for these in the thesis, and if yes, how?

Writing

  • 3. In your opinion, what could be a good outline of the problem statement for my thesis? (Do I talk about the need for more throughput as data streams keep growing, do I talk about how vectorization works and what needs to be considered before implementing it (e.g. compute bounds), or do I talk about something else?)

  • 4. Do I provide pseudo-code for the join/self_join functions for both algorithms in the description section of those algorithms, or just for the update functions?

  • 5. In which part does it make the most sense to discuss the baseline benchmark results (with graphs etc.)? I was initially planning to discuss them after I introduce the test conditions under Approach, e.g. after talking about implementation and tools, but now I'm thinking all results might have to go under Evaluation. Please let me know.

Thesis Structure

Create an initial outline plan and post it here.
It will be updated and added to this ticket as time goes on.

Benchmark Fast-AGMS baseline

This ticket is to log and document the process of benchmarking, gathering of data, and interpretation of that data for the baseline case of the Fast-AGMS algorithm.

Runtime

VTune

Memory Tuning Fast-AGMS

General Sources on Memory Tuning

  • High Throughput Heavy Hitter Aggregation for Modern SIMD Processors

Benchmarks and optimization

The goals are as follows:

  • #17 #18 Benchmark both algorithms’ baseline speed
  • #19 #20 Analyze algorithms for hotspots
  • #21 #22 Determine existence of memory bounds
  • #23 #24 Determine existence of compute bounds
  • #6 Adapt to use SIMD instructions where suitable
  • #25 #26 Consider memory-tuning if feasible
  • #27 #28 Benchmark improved algorithms

Open-Source AGMS and Fast-AGMS

  • Research project's code
  • Setup initial project
  • Lock any random data to seeded data for use during benchmarks
  • Run implementation, understand parameters

Research AGMS

Put here all the related information for this algorithm. Papers, experiments, videos, etc.

Papers

  • (Pseudo-code) Tracking join and self-join sizes in limited storage
  • (EH3 definition) Pseudo-Random Number Generation for Sketch-Based Estimations
  • (section Randomized Stream Sketching) Sketching Streams Through the Net - Distributed Approximate Query Tracking
  • (original proposal) The Space Complexity of Approximating the Frequency Moments

Algorithm

  • Join/Self-Join estimation
  • Update function (EH3) (see the rough sketch below)
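A rough, illustrative sketch of how the two pieces fit together, assuming a rows_no x bins_no counter array; xi stands in for a proper 4-wise independent +/-1 generator such as EH3, and all names are hypothetical rather than the thesis implementation:

// Illustrative AGMS sketch: the update touches every counter, and the
// self-join estimate averages squared counters within a row and takes
// the median across rows.
#include <algorithm>
#include <cstdint>
#include <vector>

struct AGMS {
    int rows_no, bins_no;
    std::vector<double> counters;   // rows_no * bins_no independent estimators

    AGMS(int rows, int bins)
        : rows_no(rows), bins_no(bins), counters(rows * bins, 0.0) {}

    // Placeholder +/-1 generator; a real implementation would use EH3
    // seeded per (row, bin) and is NOT equivalent to this hash mix.
    int xi(int row, int bin, uint32_t key) const {
        uint64_t h = (uint64_t)key * 0x9E3779B97F4A7C15ull
                     ^ ((uint64_t)row << 32) ^ (uint64_t)bin;
        h ^= h >> 33; h *= 0xFF51AFD7ED558CCDull; h ^= h >> 33;
        return (h & 1) ? 1 : -1;
    }

    // Every counter is updated for each incoming value.
    void update(uint32_t key, double freq) {
        for (int r = 0; r < rows_no; ++r)
            for (int b = 0; b < bins_no; ++b)
                counters[r * bins_no + b] += freq * xi(r, b, key);
    }

    // Self-join size estimate: average of squared counters per row,
    // median over rows.
    double self_join_size() {
        std::vector<double> row_avg(rows_no, 0.0);
        for (int r = 0; r < rows_no; ++r) {
            for (int b = 0; b < bins_no; ++b) {
                double c = counters[r * bins_no + b];
                row_avg[r] += c * c;
            }
            row_avg[r] /= bins_no;
        }
        std::nth_element(row_avg.begin(), row_avg.begin() + rows_no / 2, row_avg.end());
        return row_avg[rows_no / 2];
    }
};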

Extra theory

Credit

  • Sketches project source code

Writing TODO

  • revise Problem Statement (comments in email)

  • 3 side notes in Feedback 2 regarding Background

Vectorization and AVX Research

  • From https://lemire.me/blog/2018/04/19/by-how-much-does-avx-512-slow-down-your-cpu-a-first-experiment/
    "Hello,
    I did a few vector AVX512 benchmarks. They are mostly arithmetic vector operations. I found that the peak of floating point multiplications is doubled from AVX2 (I configured bios for not throttling down). That’s the case when the vectors with samples are smaller than the cache pages, otherwise there is memory bottleneck. So, both AVX512 and AVX2 have same flops. I guess for intensive computations which require little memory and lots of operations AVX512 provides a better performance. Also, I’ve seen that different intel architectures have different performance in shuffling operations which are a really big bottleneck.
    In any case, benchmarking instructions sets is pretty complex and highly application dependant."

Benchmark Fast-AGMS Optimized

Note the speed-up from SIMD, memory optimization, or anything else done to improve the throughput of the algorithm.

  • Removal of possibilities for conditional branching
  • Removal of heavy functions like modulo (both points are sketched below)
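A minimal illustration of both points, with hypothetical names; the mask trick is only valid when bins_no is a power of two:

#include <cstdint>

// Modulo replaced by a bit mask (same result only when bins_no is a power of two).
inline uint32_t bucket_mod(uint32_t hash, uint32_t bins_no)  { return hash % bins_no; }
inline uint32_t bucket_mask(uint32_t hash, uint32_t bins_no) { return hash & (bins_no - 1); }

// Conditional +/-1 update replaced by branch-free arithmetic.
inline void update_branchy(double &counter, double freq, bool xi_bit) {
    if (xi_bit) counter += freq; else counter -= freq;
}
inline void update_branchless(double &counter, double freq, bool xi_bit) {
    counter += freq * (2 * static_cast<int>(xi_bit) - 1);   // xi_bit=1 -> +freq, 0 -> -freq
}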

Questions to Martin

  • 1. How rapidly does streaming data arrive in real use cases? Does it make sense to load all data into RAM first, or just read it sequentially from a file pointer (might be too slow)?
  • 2. About data generators - I don't see any seeding options in the data generators provided by the sketches source code, so I need to use a different one. Do you have any generator in mind, or can I just create the data using Python and load it (see question 1) before I run the sketches? Since the data itself won't have to change from run to run, this could also work?
  • 3. Would you prefer a specific data distribution for the tests? Normal, Zipfian, etc?
  • 4. Would the distribution impact the throughput at all? Should I test for that?
  • 5. Which aspects of the algorithms am I benchmarking? Only the sketch update, or also the join size and self-join size?
  • 6. Is there a way to interpret the resulting self-join sizes to make sure the algorithms are working as they should? I have seeded the I1 and I2 variables to make sure they are consistent between runs, and the data vectors are also constant, but I don't know how to work with the end result other than comparing the number between the baseline and optimized versions later on. (See the sketch after this list.)
  • 6-extra. I found the Fast-AGMS algorithm to be better-to-significantly-better in approximating the size - AGMS seemed a lot more sensitive to the right ratio between vector size and input data size (it worked best for zipf but other distributions like discrete normal were off by a factor of 4-5 from the target result). I want to discuss this with you to cancel out any suspicion of bugs in the implementation. TODO: compare a couple of results to make sure the numbers are okay.
  • 7. Sorry if this question isn't clear enough: I have noticed that AGMS's throughput is sensitive to the number of independent variables, since the whole vector gets updated for each new value (in the code I have bins_no and rows_no, and all bins_no * rows_no counters get updated for each new value), whereas Fast-AGMS isn't. Should I include a diagram showing AGMS's comparative disadvantage in that regard, as part of a discussion subsection for example?
  • 8. Since the source code doesn't provide an H3-family hash function (there are only 2-wise CW2 and 4-wise CW4 b-valued random variables according to the author, which I don't fully understand so far), I remember you mentioned they are quite common. I am having a hard time googling for an implementation - or was the plan that I implement it myself? All I need is a boolean matrix and the function from the paper you emailed me (https://www.cise.ufl.edu/~adobra/Publications/tods-2007-06.pdf), right?
  • 9. The resulting value after hashing with H3 can be of any length I choose (currently 32 bit). You mentioned that modulo operations for such functions slow down the algorithm, so how do I make sure the resulting value only goes up to buckets_no? The other bucket hash functions in the project use a modulo at the end for the return value, and currently the H3 implementation does as well.
  • 10. Discuss in person: for example, it's not possible to draw a normal distribution of size 100k with a range between the inputs greater than 100k, right? But AGMS will not play nice with sets greater than 1 million inputs, for example (due to rows_no and buckets_no being > 100 each to make it more accurate), so I can't do much in that case. I can however do what you suggested for the uniform distribution, but it is still worth discussing in more detail when we meet.
  • 11. The SIMD libraries I found support the basic operators such as +, -, &, ^ etc. But what about math functions like ceil(), log2() etc.? Same issue when I have to pass the values of a vector of size n to n simultaneous update functions - what does that look like? Do I rewrite the functions to take the SIMD vectors instead of single variables?
  • 12. Do I have to do a presentation? If yes, roughly when? Details?
  • 13. Does it make sense to show a snippet of the alternative H3 from this ticket, as to provide the reader with a second option and also make a reference to the other work (if present in any work)?
  • 14. Any sense to have performance numbers in a diagram for AVX-2, as a side note?
  • 15. Do you have any specific profiling settings you need me to run via VTune? The ones I plan to use initially are: Advanced Hotspots, Microarchitecture Exploration, Memory Access
  • 16. Note down the different comparison approaches Martin said to discuss directly.
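Regarding question 6, one possible sanity check is to compare the estimates against the exact self-join size, which for a single stream is the second frequency moment F2 = sum over distinct values v of f_v^2. A small sketch of that check, with illustrative names:

#include <cstdint>
#include <unordered_map>
#include <vector>

// Exact self-join size of a stream: sum of squared frequencies.
double exact_self_join(const std::vector<uint32_t> &data) {
    std::unordered_map<uint32_t, uint64_t> freq;
    for (uint32_t v : data) ++freq[v];

    double f2 = 0.0;
    for (const auto &kv : freq)
        f2 += static_cast<double>(kv.second) * static_cast<double>(kv.second);
    return f2;   // sketch estimates on seeded data can be compared against this
}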

In-Depth CPU analysis with VTune Amplifier

Get guides and tutorials on how to do the different types of analysis using VTune.

Build Specifics

  • Create a fully optimized build (a sample command follows below)
  • With -O2 compiler optimization
  • Without debug mode (-g), otherwise the results get less accurate
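A minimal example of such a build invocation, assuming g++ and a hypothetical source file name:

g++ -O2 -march=native -o sketch_bench main.cpp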

Analysis Aspects

  • Module breakdown
  • (parallel execution) Spin time : one thread busy-waits for another to release a shared resource.
  • (parallel execution) Overhead time (thread transition time) : the time it takes for one thread to retire and another to start using the CPU.
  • Function CPI

Analysis Types

  • -collect advanced-hotspots
  • -collect general-exploration
  • -collect memory-access

Baseline Implementation

This milestone concerns understanding how the provided open source C++ code for AGMS and Fast-AGMS works, how to use it, and how to work with the additionally provided data generators. The goals are as follows:

  • #13 Research and test open-source implementations of AGMS and Fast-AGMS
  • #14 Make use of data generators to prepare test data

1. Questions/Discussion for Martin

A list of questions I need to ask Martin the next time we meet.

Thesis Layout

  • Section 2.2 "Related Work", does something like that make sense to include in my thesis?
  • If yes, what exactly? I see other papers combine this with multi-threaded optimization, sometimes with MIMD, and maybe I can include things like OpenMP etc.?
    Answer: Papers that have tested out count-min or AGMS/FAGMS in some setting are good examples.
    I remember Martin saying I could stick to some more general information if I don't have enough direct examples, but should also try to keep it brief.
  • Where do I discuss alternative sketches? - In the Background section
  • Could Section 2.3 be about background of AGMS and Fast-AGMS?

Input Data

  • Understand data generation for this thesis (what type of data?) -> 32-bit integers (any)
  • How do I simulate data streaming for the benchmarks? Feed the algorithm sequentially from a data file? -> Just use the approach from the sample where you generate data and run through it.
  • Should I use data sampling for the algorithms? -> No

Libraries and Implementation

  • tsimd doesn't support unsigned ints, need to EITHER combine it with the default library to achieve vectorization OR use the additional libsimdpp
  • (1) "... where m_v is the number of members with value v" : What is m in our case? The compare_sketches sample file assigns a static 1.0 as m.
  • Do I need to vectorize the update functions as well (EH3 implementation), or just whatever makes sense after analysis?

Misc

  • Get server access to learn using VTune remotely
  • LaTeX - Where did he gather/use the "List of Tables/Figures/Listings" ? Self-generated, check the main file.

Helpful papers that give insight into what I did in this thesis can be briefly mentioned.

Note for related work: Fast AGMS is AGMS + Count-Min -> check papers that worked on Count-Min to maybe include some of those insights in the related work.

Memory Tuning AGMS

General Sources on Memory Tuning

  • High Throughput Heavy Hitter Aggregation for Modern SIMD Processors

Research Fast-AGMS

Put here all the related information for this algorithm. Papers, experiments, videos, etc.

Papers

  • Sketching streams through the net: distributed approximate query tracking

Random Variables

  • EH3 for update (Pseudo-Random Number Generation for Sketch-Based Estimations)
  • Source code provides CW2 for the hash buckets (rows in count-min sketch), which is also 3-wise independent. But I could also implement H3 (Efficient Hardware Hashing Functions for High Performance Computers); see the sketch below.
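A rough sketch of an H3-style hash in the spirit of the questions above, assuming 32-bit keys; all names are illustrative, and mapping to buckets is left to a power-of-two mask instead of a modulo:

#include <array>
#include <cstdint>
#include <random>

// H3 family: one random word per key bit; the set bits of the key select
// the words that are XORed together to form the hash.
struct H3Hash {
    std::array<uint32_t, 32> q{};   // random "boolean matrix", one row per key bit

    explicit H3Hash(uint64_t seed) {
        std::mt19937_64 gen(seed);
        for (auto &row : q) row = static_cast<uint32_t>(gen());
    }

    uint32_t operator()(uint32_t key) const {
        uint32_t h = 0;
        for (int i = 0; i < 32; ++i)
            if (key & (1u << i)) h ^= q[i];
        return h;   // e.g. bucket = h & (buckets_no - 1) for power-of-two buckets_no
    }
};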

TODO

VTune

  • Note down a plan on how to run initial VTune Hotspot Analysis
  • Implement dynamic values for static variables, test if that works
  • Comment out all prints
  • Put a 10-second sleep between the two update loops (see the snippet below)
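For the last point, a minimal sketch of the 10-second pause that separates the two loops on the VTune timeline (loop bodies are placeholders):

#include <chrono>
#include <thread>

void run_both_benchmarks() {
    // ... first update loop (e.g. AGMS) ...
    std::this_thread::sleep_for(std::chrono::seconds(10));   // visible gap in the timeline
    // ... second update loop (e.g. Fast-AGMS) ...
}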

SIMD

Benchmarks

  • Make dummy data
  • Implement sample diagram with dummy data
  • Make sure that each run uses random values for variables that need random values
  • Implement saving metrics in a file for each test run (see the sketch after this list)
  • Decide on test case data (R x B sizes)
  • Compile different test cases
  • Book critical timeslot on server
  • Organize test folder on server, create a separate folder for this test
  • Run tests x-times each and collect data in the output file
  • Transfer data to diagram
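A hypothetical sketch of one such run: time the update loop, derive the throughput, and append one CSV row per run (all names and the CSV columns are illustrative):

#include <chrono>
#include <cstdint>
#include <fstream>
#include <vector>

void benchmark_run(const std::vector<uint32_t> &data, const char *csv_path) {
    auto start = std::chrono::steady_clock::now();
    for (uint32_t key : data) {
        // sketch.update(key, 1.0);   // update call of the algorithm under test
        (void)key;
    }
    auto stop = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(stop - start).count();
    double tuples_per_sec = static_cast<double>(data.size()) / secs;

    std::ofstream out(csv_path, std::ios::app);   // one row appended per run
    out << data.size() << "," << secs << "," << tuples_per_sec << "\n";
}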

Visualization Ideas

  • Roofline Model for Bottleneck representation of both algorithms
  • A bunch of interesting statistical visualizations in the "Statistical Analysis of Sketch Estimators" paper
  • Show a snippet of critical code, then show the resulting VTune profile, discuss the problem+optimization, then show snippets of the result
  • A multi-diagram showing the CPU Time distribution per different sample distribution / buckets size
  • A quick diagram showing the throughput depending on the sample used
  • A graph displaying throughput scaling depending on sample size (and distribution?) per algorithm, and then another one after optimizations
  • A simple graph showing only improvement in percent for AGMS-Stock vs AGMS-Optimized and F-AGMS-Stock vs F-AGMS-Optimized (averaged across all distributions and all RowXBucket sizes)

Credit

  • Random (linear) projection "Random projection in dimensionality reduction: Applications to image and text data" ; "Random Projection, Margins, Kernels, and Feature-Selection"
  • Sketches project source code
  • tsimd Library (need to contact somewhat soon)
  • libsimdpp Library
  • Martin

Data Generation

  • Understand data generation for this thesis (what type of data?) -> (unsigned) integers should do
  • Understand how the provided open-source data generators work
  • Find seeded data generators (using some distribution, like Zipf/normal) to generate consistent data (see the sketch after this list)
  • Make use of data generators to prepare Zipfian-distributed test data
  • Make use of data generators to prepare uniformly distributed test data, as a worst-case scenario
  • Generate data with a discrete Normal Distribution pattern.
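A minimal sketch of seeded, reproducible data generation with the standard library; a Zipfian generator would still have to be added separately, and the distribution parameters are illustrative:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Fixed seed -> identical data on every run.
std::vector<uint32_t> make_normal_data(std::size_t n, uint64_t seed) {
    std::mt19937_64 gen(seed);
    std::normal_distribution<double> dist(50000.0, 10000.0);

    std::vector<uint32_t> data(n);
    for (auto &v : data)
        v = static_cast<uint32_t>(std::lround(std::max(0.0, dist(gen))));
    return data;
}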

Benchmark AGMS baseline

This ticket is to log and document the process of benchmarking, gathering of data, and interpretation of that data for the baseline case of the AGMS algorithm.

Runtime

VTune

VTune Amplifier’s CLI on Local and Remote Machines

PATH to installed version is:

Remote

/opt/intel/vtune_amplifier_2018.0.2.525261/bin64/

Local

/opt/intel/vtune_amplifier_2019.3.0.590814/bin64/

The executables used to start it are either:

  • ./amplxe-gui
  • ./amplxe-cl

Sample CLI runs:

Hotspot Analysis

/opt/intel/system_studio_2019/vtune_amplifier_2019.3.0.590814/bin64/amplxe-cl -collect hotspots -knob sampling-mode=hw -finalization-mode=full -app-working-dir /home/morty/sketch_profiling/ -- /home/morty/sketch_profiling/FILENAME

Microarchitecture Exploration Analysis

/opt/intel/system_studio_2019/vtune_amplifier_2019.3.0.590814/bin64/amplxe-cl -collect uarch-exploration -knob collect-memory-bandwidth=true -target-duration-type=veryshort -finalization-mode=full -app-working-dir /home/morty/sketch_profiling/ -- /home/morty/sketch_profiling/FILENAME

File Management

Copy file/folder from local to remote

rsync -chavzP --stats /home/meggamorty/CLionProjects/thesis/optimization/cmake-build-debug/optimization [email protected]:/home/morty/FOLDER

Copy file/folder from remote to local

rsync -chavzP --stats [email protected]:/home/morty/sketch_profiling/COPYDIR /home/meggamorty/vtune-reports/PASTEDIR

Notes on warnings

Literature and Programming Research

The first milestone mainly consists of research on the theory and technology that I'm going to be using for this project. The plan currently consists of the following goals:

  • #10 #11 Research algorithms in detail
  • #3 Study visualization techniques for informative and simple paper diagrams
  • #4 Study programming language (C++)
  • Test the supplied remote PC (remote control via SSH, packages, etc.)
  • #6 Understand SIMD and AVX-512 instructions for C-like languages
  • #9 Understand technical aspects of CPU analysis with VTune Amplifier
  • #8 Test out VTune Amplifier’s CLI on local and remote machines
  • #30 Gather sources and information useful for the Related Work subsection

Related Work

Count-Min

Folder name Related_Work

  • Count-Min
  • count-min sketch as closely related predecessor to FastAGMS (1, 2)
  • count-min and sketches being used in ML applications (3, )
  • count-min and other proposals are also used for their efficient statistical approximations in natural language processing (NLP) solutions (4, 5)

AGMS

  • AGMS - couldn't find anything useful at the moment

FastAGMS

Folder Name Papers_using_FastAGMS

  • FastAGMS - only two so far
  • Sketching streams through the net: Distributed approximate query tracking

SIMD

Folder Name Papers_using_SIMD

  • SIMD
  • papers (1, 2, 3, 4)
  • (5) Significantly less power consumption for the same throughput on streaming update ops for count-min sketch, on a low-power system

SIMD + Memory Tuning

  • High Throughput Heavy Hitter Aggregation for Modern SIMD Processors
