Giter VIP home page Giter VIP logo

Comments (11)

lemire avatar lemire commented on August 21, 2024

Thanks!

from positional-popcount.

mklarqvist avatar mklarqvist commented on August 21, 2024

Thanks @aqrit. I will add it.

from positional-popcount.

aqrit avatar aqrit commented on August 21, 2024

Ugh! If it were unrolled 2x... the second uint64_t could be processed for ~half-price.
4-bit/12-bit counters instead of 3-bit/12-bit counters.

from positional-popcount.

mklarqvist avatar mklarqvist commented on August 21, 2024

Ha! Great timing. I am literally computing the performance profile to paste here as we speak.

from positional-popcount.

mklarqvist avatar mklarqvist commented on August 21, 2024

@aqrit Unrolling seems to reduce performance. Haswell machine:

n = 1000000 
iterations = 10 
array size: 1.907 MB
avx256popcnt                            	cycles per 16-bit word:  0.316; ref cycles per 16-bit word: 0.259 
pospopcnt_u16                           	cycles per 16-bit word:  0.344; ref cycles per 16-bit word: 0.275 
pospopcnt_u16_scalar_naive              	cycles per 16-bit word:  3.936; ref cycles per 16-bit word: 3.463 
pospopcnt_u16_scalar_naive_nosimd       	cycles per 16-bit word:  18.055; ref cycles per 16-bit word: 14.282 
pospopcnt_u16_scalar_partition          	cycles per 16-bit word:  3.364; ref cycles per 16-bit word: 2.643 
pospopcnt_u16_scalar_hist1x4            	cycles per 16-bit word:  3.124; ref cycles per 16-bit word: 2.456 
pospopcnt_u16_scalar_umul128            	cycles per 16-bit word:  2.415; ref cycles per 16-bit word: 1.898 
pospopcnt_u16_scalar_umul128_unroll2    	cycles per 16-bit word:  2.969; ref cycles per 16-bit word: 2.333 
pospopcnt_u16_sse_single                	cycles per 16-bit word:  4.308; ref cycles per 16-bit word: 3.385 
pospopcnt_u16_sse_mula                  	cycles per 16-bit word:  2.135; ref cycles per 16-bit word: 1.740 
pospopcnt_u16_sse_mula_unroll4          	cycles per 16-bit word:  1.736; ref cycles per 16-bit word: 1.428 
pospopcnt_u16_sse_mula_unroll8          	cycles per 16-bit word:  1.535; ref cycles per 16-bit word: 1.251 
pospopcnt_u16_sse_mula_unroll16         	cycles per 16-bit word:  1.536; ref cycles per 16-bit word: 1.208 
pospopcnt_u16_sse2_sad                  	cycles per 16-bit word:  1.384; ref cycles per 16-bit word: 1.116 
pospopcnt_u16_sse2_csa                  	cycles per 16-bit word:  0.428; ref cycles per 16-bit word: 0.357 
pospopcnt_u16_avx2_popcnt               	cycles per 16-bit word:  3.133; ref cycles per 16-bit word: 2.469 
pospopcnt_u16_avx2                      	cycles per 16-bit word:  4.074; ref cycles per 16-bit word: 3.225 
pospopcnt_u16_avx2_naive_counter        	cycles per 16-bit word:  4.002; ref cycles per 16-bit word: 3.144 
pospopcnt_u16_avx2_single               	cycles per 16-bit word:  3.704; ref cycles per 16-bit word: 2.920 
pospopcnt_u16_avx2_lemire               	cycles per 16-bit word:  2.193; ref cycles per 16-bit word: 1.733 
pospopcnt_u16_avx2_lemire2              	cycles per 16-bit word:  1.502; ref cycles per 16-bit word: 1.191 
pospopcnt_u16_avx2_mula                 	cycles per 16-bit word:  1.426; ref cycles per 16-bit word: 1.135 
pospopcnt_u16_avx2_mula_unroll4         	cycles per 16-bit word:  0.960; ref cycles per 16-bit word: 0.772 
pospopcnt_u16_avx2_mula_unroll8         	cycles per 16-bit word:  0.816; ref cycles per 16-bit word: 0.641 
pospopcnt_u16_avx2_mula_unroll16        	cycles per 16-bit word:  0.844; ref cycles per 16-bit word: 0.663 
pospopcnt_u16_avx2_mula3                	cycles per 16-bit word:  0.517; ref cycles per 16-bit word: 0.440 
pospopcnt_u16_avx2_csa                  	cycles per 16-bit word:  0.329; ref cycles per 16-bit word: 0.275 

from positional-popcount.

aqrit avatar aqrit commented on August 21, 2024

unroll + rewrite (which I will do)

the 2nd u64 would cost:
1 load
4 and
5 add
1 mul
edit: 1 shr

from positional-popcount.

aqrit avatar aqrit commented on August 21, 2024

updated the gist w/unrolled version
https://gist.github.com/aqrit/c729815b0165c139d0bac642ab7ee104

from positional-popcount.

aqrit avatar aqrit commented on August 21, 2024

Same trick could be applied to sse2_sad ... I've got 12 extra psadbw instructions in there.
/facepalm

from positional-popcount.

aqrit avatar aqrit commented on August 21, 2024

We can add a third u64 into pospopcnt_u16_scalar_umul128 on the cheap.
(sorry for spamming)

from positional-popcount.

mklarqvist avatar mklarqvist commented on August 21, 2024

Thanks @aqrit . It is indeed slightly faster.

pospopcnt_u16_scalar_umul128            	instructions per cycle 3.53, cycles per 16-bit word:  2.414, instructions per 16-bit word 8.528 
min:  2413517 cycles,  8527767 instructions, 	     248 branch mis.,     8206 cache ref.,        0 cache mis.
avg: 2426353.7 cycles, 8527767.2 instructions, 	   248.8 branch mis.,  11234.4 cache ref.,      1.0 cache mis.

pospopcnt_u16_scalar_umul128_unroll2    	instructions per cycle 2.94, cycles per 16-bit word:  2.389, instructions per 16-bit word 7.027 
min:  2389269 cycles,  7026782 instructions, 	     248 branch mis.,    10193 cache ref.,        0 cache mis.
avg: 2400136.8 cycles, 7026782.2 instructions, 	   249.3 branch mis.,  11906.7 cache ref.,      0.8 cache mis.

from positional-popcount.

mklarqvist avatar mklarqvist commented on August 21, 2024

Now added.

from positional-popcount.

Related Issues (4)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.