
Comments (92)

jerch commented on May 30, 2024

@ismail-yilmaz Minor note on things I've found worth looking into but did not get down to yet:

  • The sixel bit deconstruction with branching is actually much more expensive than always writing something somewhere. With more clever index "routing", like writing empty bits to dummy pixels, I can gain a ~30% speedup in put_single. This somewhat surprised me; I would have thought that the memory accesses to dummy pixels would outweigh the branching + bit testing by far. But it seems the branching creates a heavy load here.
  • Imho the conditional checks whether the byte at hand is a sixel (e.g. in range 63 to 126) can be made much faster with clever bit shifting. I only profiled that one so far; it adds up to quite some runtime if the sixel data contains only a few compression directives.

Both ideas are somewhat branching related, and I think that is promising ground for further optimizations.
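A minimal sketch of the first idea, assuming a canvas laid out so that index 0 is a sacrificial dummy pixel (canvas, jump_offsets and put_single are illustrative names here, not the actual decoder's):

```cpp
// Illustrative branchless bit routing: pixel data starts at canvas[1],
// so canvas[0] can absorb all writes for empty bits.
static int canvas[1 + 6 * 8];                              // [0] is the dummy pixel
static const int jump_offsets[6] = {1, 9, 17, 25, 33, 41}; // 1 + row * width, width = 8

inline void put_single(int sixel, int pos, int color) {
    for (int i = 0; i < 6; ++i) {
        int bit = (sixel >> i) & 1;
        // bit == 0 -> index 0 (dummy), bit == 1 -> the real pixel
        canvas[(pos + jump_offsets[i]) * bit] = color;
    }
}
```

Every iteration performs exactly one store, so the branch predictor never gets involved; the cost is one multiply per bit plus the wasted writes to the dummy cell.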

Edit: Btw my decoder function kinda invites the use of computed gotos; sadly they are not natively supported in wasm yet. If you want to squeeze out 10 - 20% more perf, this might be worth a look as well.

from upp-components.

jerch commented on May 30, 2024

@ismail-yilmaz Thx a bunch, I'll check these things out.

About critical sections - I kinda have a hard time profiling this in wasm, I simply don't know of any tools that can do it within wasm. So I went the faulty route of deactivating code paths and comparing the runtimes, which ofc carries the risk of creating bigger dead code branches that get eliminated by the compiler. Still, my findings in wasm are these:

  • only inner state handling (deactivated sixel, cursor, params and color processing): 150 - 200 MB/s
  • all minus sixel processing (bit decomposition + pixel save): 95 - 180 MB/s
  • all minus params and color processing: 140 - 190 MB/s
  • all: 93 - 165 MB/s

Note that the big spread in the runtimes is a result of one test image with 4096 colors and random pixels. This image has almost no compression, thus very high throughput numbers. Normal images are grouped around the lower number with only small variations (< +20 MB/s).

Interpretation of the numbers:

  • Sixel processing and pixel save account for ~40% of the runtime. This is actually lower than I expected.
  • Params and color processing accounts for ~7% of the runtime. No surprise here; it is within my expectations, with a slightly worse runtime-to-data ratio, thus slower in processing than the sixel path. This could be optimized further, but the gains are unlikely to surpass 5% of overall runtime.
  • Most runtime, ~50%, is spent on the inner state handling. WTH - the inner state handling only switches branches based on ps.state. I have to admit that the branches are quite scattered/nested with that weird jump helper; still, my expectations were in the 15 - 30% range, not 50%.

So I'm gonna try to find a simpler state model, as it promises great savings. My original JS implementation uses a DFA switching on every input byte. While this is fairly equal in JS, it actually performs worse in wasm. The current wasm state model is a direct transfer of the DFA states into switch and if-else conditions with less switching per byte but deeper nested side paths (the jump helper results from preserving the DFA states, but might not be needed everywhere).
A first dig into the branching direction shows that the switch statements might be problematic. They get compiled to br_table instructions in wasm, which don't seem to get the jump table optimization in the final machine code created by the wasm runtime (needs more investigation - I only did a quick test with different numbers of cases, which revealed an O(n) dependency, bummer).
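For reference, a byte-driven DFA with a flat transition table avoids nested switch/if chains on the hot path entirely - one table lookup per byte. This is only a sketch; the states and transitions below are illustrative, not the actual decoder's:

```cpp
// Hypothetical flat-table DFA: next state = one array lookup per byte.
enum State : unsigned char { ST_GROUND, ST_COLOR, ST_COMPRESSION, NUM_STATES };

static unsigned char transitions[NUM_STATES][256];

void init_transitions() {
    for (int s = 0; s < NUM_STATES; ++s)
        for (int c = 0; c < 256; ++c)
            transitions[s][c] = ST_GROUND;           // default: back to ground
    for (int s = 0; s < NUM_STATES; ++s) {
        transitions[s]['#'] = ST_COLOR;              // color introducer
        transitions[s]['!'] = ST_COMPRESSION;        // repeat introducer
    }
    for (int c = '0'; c <= '9'; ++c) {
        transitions[ST_COLOR][c] = ST_COLOR;         // digits keep the state
        transitions[ST_COMPRESSION][c] = ST_COMPRESSION;
    }
}

unsigned char run(const unsigned char *data, int len) {
    unsigned char state = ST_GROUND;
    for (int i = 0; i < len; ++i)
        state = transitions[state][data[i]];         // no branching on state
    return state;
}
```

Whether this beats the nested-switch variant in wasm depends on how the runtime lowers the array indexing, but it sidesteps br_table completely.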

Edit: Fixed the mixed numbers above.


jerch commented on May 30, 2024

Ah, I totally forgot to mention that I already compiled it just two days ago, but did not have time to test it, sorry. However, if you have an additional script to test sixel-bench with your wasm, that would be very nice. :)

I can prepare a script for the sixel-bench test, sure thing.

If you pulled the whole node-sixel repo, you can run its benchmarks with:

cd node-sixel
git checkout faster_decode
# edit wasm/build.sh to point EMSCRIPTEN_PATH to your emsdk_env.sh
# then run
npm install  # will abort, just ignore it
npm run build-wasm
npm install
# benchmark
node_modules/.bin/xterm-benchmark ./lib/index.benchmark.js

Have not tested that on a clean checkout yet, so bear with me if it's not working out of the box. Also my newer optimizations are not yet checked in, so you will see slightly lower numbers.

Edit: Fixed the checkout/build commands.


jerch commented on May 30, 2024

Yes, the static nature is much easier to deal with, but that's all I need for the wasm instance. The level to get multiple instances here is the wasm container itself, thus one would just spawn multiple wasm decoders if needed.

Ofc with the pixels on the heap you cannot get the pseudo cache locality from the chunk bytes, as they could be allocated elsewhere (and the chunk bytes prolly live elsewhere as well).


jerch commented on May 30, 2024

Oh I see, well that's interesting. What JS engine do you use for that? If you want to go that route we prolly should share more general ideas about wasm and JS interactions.

I did some more testing before getting down to a wasm sixel decoder implementation, mainly around nasty tasks in xterm.js like base64 and utf8 decoding. Wasm beats all of my JS implementations by far (at least 2 times faster), and those are already among the fastest in JS land (all tested on V8). Furthermore - calling into wasm creates almost no overhead if done right. Currently I think that wasm can be used to replace performance critical JS sections, even if called over and over on a hot path.
But note that I have only limited experience with other JS/wasm engines (did only surface tests with Firefox/SpiderMonkey and WebKit/JavaScriptCore - both are a bit slower than V8 from what I've seen so far).


ismail-yilmaz commented on May 30, 2024

I've just identified and fixed a bottleneck in my sequence parser. It is both amazing and embarrassing:

VTInStream(data.sixel):                  0.58s (200.00 MiB/s)
SixelRenderer(data.sixel):               2.10s (56.00 MiB/s) // Formerly ~42 MiB...

lol.

Sixel viewer now displays the animation with ~48-49 (oscillating) MiB/s throughput.

I was using a string to collect the DCS payload, as it was convenient. All I did was change it to a vector with a bit of (<= 64k) reserved memory.
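The fix described above boils down to something like this sketch (names are illustrative; the real code uses U++ containers rather than std::vector):

```cpp
#include <vector>

// Collect the DCS payload in a byte vector with pre-reserved capacity,
// so the hot collection loop never reallocates (unlike naive string appends).
std::vector<unsigned char> dcs;

void begin_dcs() {
    dcs.clear();
    dcs.reserve(64 * 1024);        // <= 64k reserved up front
}

inline void collect(unsigned char byte) {
    dcs.push_back(byte);           // no reallocation below 64k
}
```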


jerch commented on May 30, 2024

A possible explanation why "write always" is that much faster (did some Intel arch reading, but am still far from understanding the details):

The sixel write actions in put_single to 6 different mem cells are non-dependent, thus can be reordered by the CPU. (Btw that's not quite true for the [0] case; maybe more room for optimization - not sure if the reordering will recognize that always the same value gets written.) With that in mind, a modern x86 can process multiple micro-ops in parallel at the pipeline level, thus "parallelizes" those writes to some degree. With the conditional checks in place (not "write always"), that parallelism gets wasted on speculative branching with a bad retired-instruction ratio in the end.

I don't know how smart the CPU is with the index calculation, whether it will short-circuit k * (a + b + *c) for k=0 early without bothering to load the right side. If I get it right, there is some kind of race determining k and (a + b + *c) in parallel, so I think it will at least try to load *c from mem, even if k=0 comes in early. But that's all speculation for me atm (could do some micro benchmarks to find out, but meh). Nonetheless there might still be some perf hidden here to save a few cycles, as it is the hottest of the hot paths.


jerch commented on May 30, 2024

@ismail-yilmaz

Guess I am done with optimization for now. With the latest changes I get 37-39 MB/s in xterm.js synchronously (no webworker involved). The decoder dropped to ~40% of the total runtime in the browser, thus I am already in a region where only significant improvements will show further impact. And since the final solution will be web worker driven, it might get over 40 MB/s, depending on the impact of the main thread to worker copy overhead.

I am pretty happy with these results, the wasm decoder is now 2.5 - 3 times faster than the original JS implementation.


ismail-yilmaz commented on May 30, 2024

By the way, precalculating the Y offsets works fine here. It yields a small but real gain. So I am going to borrow it from you and use it (with ack, ofc).
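For context, the borrowed idea is roughly this (jump_offsets and the function name are assumptions, matching the names used elsewhere in this thread):

```cpp
// Precompute the 6 row offsets once per sixel line, so each per-pixel
// write only adds a table entry instead of multiplying y * width.
static int jump_offsets[6];

void update_jump_offsets(int y_offset, int width) {
    for (int i = 0; i < 6; ++i)
        jump_offsets[i] = (y_offset + i) * width;
}
```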


jerch commented on May 30, 2024

@ismail-yilmaz Played around with the simd idea above and got something working. It still has bugs in color transfers, but the performance is already quite good for a first hacky impl:

      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 5.83 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 103.60 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 3.51 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 91.83 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 5.71 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 115.01 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 92.40 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 167.82 MB/s

Here is the code for the non-repeated paints:

void simd_paint() {
  if (!reg_agg) return;

  __m128i colorMask = _mm_set_epi32(ps.color, ps.color, ps.color, ps.color);
  __m128i ones = _mm_set_epi32(1, 1, 1, 1);
  __m128i sixels = _mm_load_si128((__m128i *) reg);

  int p = l_cur + ps.offset;

  for (int i = 0; i < 6; ++i) {
    __m128i singles = _mm_and_si128(sixels, ones);
    __m128i bitmask = _mm_cmpeq_epi32(ones, singles);
    __m128i updated = _mm_and_si128(colorMask, bitmask);

    __m128i old = _mm_load_si128((__m128i *) &ps.canvas[p + ps.jump_offsets[i]]);
    __m128i final = _mm_or_si128(old, updated);

    _mm_store_si128((__m128i *) &ps.canvas[p + ps.jump_offsets[i]], final);

    sixels = _mm_srai_epi32(sixels, 1);
  }

  _mm_store_si128((__m128i *) reg, _mm_setzero_si128());
  reg_agg = 0;
}

reg is an int[4] array, where sixels get loaded from the main chunk loop, indexed by cursor % 4. Whenever the write is below or above that 4 pixel area, simd_paint gets called; also for color changes and CR/LF. The single byte loading into reg and the pixel range handling is still buggy, and is now the biggest runtime penalty (while the paint calls' runtime dropped by almost 50%).


jerch commented on May 30, 2024

In that variant there is still room for further optimization - e.g. the loading of old from the canvas can be done less often from far memory, if pulled into cache on the 4-pixel range change (when l_cur changes).
However, as written above the whole paint dropped from ~40% runtime of my other decoder to ~25%, while the byte chewing of the state loop increased by almost the same amount (it is now at 70%). Imho the only way to further save runtime is to do the input slicing with simd as well. But that's not that easy, and since sixel data is so highly fragmented, I am not sure there is anything to gain (batches will often trigger the slow path subchunk handling).

Edit: Btw - tried the batched sixel painting above also with an advance of 8 pixels - not with avx2 (since I cannot use it), but with 2 interim reg sixel registers calling the paint twice. Runs worse; guess it triggers too many nonsense canvas updates. With avx2 it would not be that bad (since you can do it in one go), def. worth a shot for you (but will only work on newer machines like haswell+).


jerch commented on May 30, 2024

More findings about sixel parsing - my basic mixed sixel painter now looks like this:

inline void simd_paint(__m128i sixels, int cur) {
  __m128i colors = _mm_set1_epi32(ps.color);
  __m128i ones = _mm_set1_epi32(1);

  int p = cur * 4 + ps.offset;

  for (int i = 0; i < 6; ++i) {
    __m128i singles = _mm_and_si128(sixels, ones);
    __m128i bitmask = _mm_cmpeq_epi32(ones, singles);
    __m128i updated = _mm_and_si128(colors, bitmask);

    __m128i prev = _mm_load_si128((__m128i *) &ps.canvas[p + ps.jump_offsets[i]]);
    __m128i keep = _mm_andnot_si128(bitmask, prev);

    __m128i final = _mm_or_si128(keep, updated);
    _mm_store_si128((__m128i *) &ps.canvas[p + ps.jump_offsets[i]], final);

    sixels = _mm_srai_epi32(sixels, 1);
  }
}

It is meant to be fed with 4 consecutive sixels (minus 63) and the current 4-pixel aligned cursor position (128bit progression). The color blending with the previous color is fixed with an & ~bitmask (_mm_andnot_si128).

On caller level I found 2 ways suitable to digest the mixed sixels:

  • additional "register array" holding vector components
    This was my first naive approach. It does an additional copy step, which is subpar, but turned out to be very flexible. It can literally be hooked on top of any byte-by-byte parser impl as follows:
    // defined somewhere (initialized on decode entry)
    int reg[4] __attribute__((aligned(16)));
    int cursor4;
    int has_data;
    ...
    // on sixel write
    reg[cursor & 3] = sixel;
    has_data |= sixel;
    ...
    // on cursor position change
    if (cursor4 != cursor / 4) {
      if (has_data) {
        simd_paint(_mm_setr_epi32(reg[0], reg[1], reg[2], reg[3]), cursor4);
        // or with _mm_load_si128((__m128i *) reg) (faster on native)
        has_data = 0;
        reg[0] = 0; reg[1] = 0; reg[2] = 0; reg[3] = 0;
      }
      cursor4 = cursor / 4;
    }
    This scattered sixel collecting - only painting when cursor4 changes - reduces the paint calls by far. On the other hand the array resetting is rather expensive (can be optimized with int64 or int128 nulling). has_data is a needed optimization due to sixel's way of moving the cursor around with empty sixels; without it, nullish dummy paint calls would be triggered over and over.
  • faster local processing in sixel state
    Next I tried to speed up the sixel processing for multiple sixels. The fastest isolated version I've found is a direct load into SIMD from aligned memory:
    __m128i zero = _mm_set1_epi8(0);
    ...
    __m128i chunk = _mm_load_si128((__m128i *) chunk_p);
    __m128i spread1 = _mm_unpacklo_epi8(chunk, zero);
    __m128i sixels = _mm_unpacklo_epi16(spread1, zero);
    ...
    simd_paint(sixels, cursor4);
    With shifting one can access the other bytes in chunk likewise. But this fastest solution only holds its promise if the chunk data is properly aligned and you have already ruled/filtered out other bytes. Unless taking expensive pre-steps, that's not a given for "natural" chunk data; in fact we have sixel bytes scattered all over the place behind color and compression directives at various positions. I started to investigate top level SIMD parsing with fragment slicing as described above, but that's still work in progress (currently 30% slower with painting applied).
    A better interim solution for the existing parsers might be to benefit from the sixel byte state: we already know that the current byte describes a sixel, and the next one has a high chance to be a sixel byte as well, without taking chunk alignment into account. Here the fastest version I've found is a shift-or one:
    // pretest: check if code fits the sixel range
    sixels = _mm_srli_si128(sixels, 4);
    if (code) {
      sixels = _mm_or_si128(sixels, _mm_set_epi32(code, 0, 0, 0));
      has_data |= code;
    }
    This can be used for any sixels, but needs shift corrections at the beginning and end of the sixel bytes to move sixels in half done 4-pixel ranges into the right spot (we still have to respect the pixel alignment).

The second variant is in general ~20% faster (prolly due to omitting the extra memory roundtrip through the register array), but degrades for images with highly scattered sixel bytes. The reason is obvious - if you always do the shift variant but only one sixel byte at a time comes in, you have to do the shift correction and dummy paints over and over. Imho a combination of both will help here in the local sixel byte state:

  • Do the first unaligned sixels with array register (automatically deals with the "one sixel a time" issue).
  • Once you hit the cursor mod 4 border, switch to shift variant eagerly, hoping more sixels to read.
  • When leaving, either do a shift correction + paint, or save the sixels back to the register. Also measured this one - for cursor % 4 == 1 the register variant is faster, for 2 and 3 the instant paint is faster.

About paint performance:
Applying the register array variant to my fastest byte-by-byte parser gives a ~20% boost (110 - 120 MB/s) and the shift variant ~30% for most images (120 - 130 MB/s). The paint runtime dropped from ~40% down to <20% of total runtime, thus more than halved (did not count the instructions nor cycles for simd_paint vs. put_single; guess SIMD is >2 times lower). I did not yet try an optimized combination of both, but don't expect much higher values from it, just more sustainable numbers across different image data. There still might be perf gems hidden in the sixel-to-pixel paint path (esp. lowering the amount of dummy paints); currently I hope to settle around a stable 130 MB/s on my machine with the byte-by-byte parser loop (all with wasm, native is already ~15% faster).

Things not yet covered:

  • SIMD paint for compressed sixels - not done yet (expected: 5-10% boost judging from your numbers)
  • SIMD number parsing for compression and colors - is that worth considering?
  • SIMD top level slicing - promising interim results as described above, but a long way to go to cover all edge cases in a speedy manner (expect this one to be >>200 MB/s once all is done)

Cheers 😸

Edit:
Oh sweet, just found out that wasm directly supports _mm_insert_epi32, although it is sse4.1. This makes both variants above obsolete; we can simply carry around a __m128i variable to stack things up and paint on cursor change. Will speed things up further. - That's actually not better; the dev guide even warns about these intrinsics showing poor performance...


jerch commented on May 30, 2024

Some callgrind profiling data (shortened):

file: test1_clean.sixel
sixels:	380283
bytes:	619290

--------------------------------------------------------------------------------
        Ir        Dr        Dw  I1mr D1mr   D1mw ILmr  DLmr   DLmw 
--------------------------------------------------------------------------------

inline put_single:
32,406,294 3,793,752 2,740,265 1,988 29,0 151,80 1,88 7,867 68,950  PROGRAM TOTALS
28,843,839 3,216,772 2,262,066   85 4,806 78,876   85     3     16  ???:decode

inline simd_paint:
31,131,809 4,417,437 2,439,052 1,980 113,0 90,21 1,87 7,868 68,950  PROGRAM TOTALS
27,622,662 3,802,861 1,922,498   79 89,072 16,28   79     4     15  ???:decode

noinline put_single:
33,145,259 4,627,920 3,670,545 1,97 29,08 151,48 1,87 7,867 68,950  PROGRAM TOTALS
15,212,388 2,151,142 1,292,548   69 4,836 15,803   69     3     16  ???:decode.part
14,370,416 1,899,798 1,899,798    4     0 62,724    3     .      .  ???:put_single(int, int)

noinline simd_paint (with prepare_paint helper):
32,095,566 4,907,242 2,565,930 1,97 113,59 90,17 1,874 7,86 68,950  PROGRAM TOTALS
16,726,671 1,835,401 1,093,683   64  6,272 15,24   64     3     15  ???:decode
11,261,790 2,377,489   875,917    6 83,365     0    6     1      .  ???:prepare_paint(int, int)
   597,958    79,776    79,776    4      0   954    4     .      .  ???:put_single(int, int)

noinline simd_paint (optimization - avoid dummy paints):
29,182,602 4,320,719 2,380,693 1,97 102,42 92,29 1,87 7,868 68,949  PROGRAM TOTALS
15,910,154 1,856,080 1,020,775   22  5,775    36   22     3     15  ???:decode
 8,271,960 1,711,440   665,560    6 72,695     0    6     .      .  ???:prepare_paint(int, int)
   798,301    58,846    98,027   39      6 15,94   39     .      .  ???:put(int, int, int)
   597,958    79,776    79,776    4      0  2,83    4     .      .  ???:put_single(int, int)
  • Instructions/sixel for single paints dropped from ~38 to ~22. One SIMD paint now calculates to ~88 instructions, which is pretty close to my rough estimate from the code (84 instructions); means we have literally no dummy paints anymore. 22 still sounds pretty high per sixel byte; well, I currently see no way to make that any faster.
  • Instructions/byte for non-sixel bytes is really high at ~90, guess we found the slowpoke 😸


jerch commented on May 30, 2024

@ismail-yilmaz I kinda give up on the high level SIMD tokenization. I tried 3 different approaches now (with the last being the fastest, but still 25% slower than my fastest byte-by-byte loop). It turns out that the high fragmentation of sixel data needs many state changes for one 16 byte slice (one register load), which creates lots of single byte extractions (which are cumbersome with SIMD) and alignment corrections on number and sixel bytes (the only ones that are consecutive to some degree). But again those consecutive bytes cannot be processed directly in SIMD, as they need special preparations:

  • sixel bytes must be padded to 32bit and aligned to the 4-pixel progression
  • number bytes must be padded to 16bit to use _mm_madd_epi16; different digit counts need different code paths
    The number conversion with madd is indeed much faster for 3+-digit numbers (3 times faster), but falls short for 1-digit numbers (4 times slower). The additional need for branching over the number of digits introduces another nonsense penalty, esp. in wasm (that digit switch made the wasm build 30% slower). Ofc chances are high that I overlooked something here, yet sixel data often contains 1- and 2-digit numbers, thus I think there is not much to be gained here.
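For illustration, the madd-based conversion for the fixed 4-digit case can be sketched like this (parse4 is a hypothetical helper, not code from the decoder; it assumes buf points at exactly 4 valid ASCII digits):

```cpp
#include <emmintrin.h>

// Convert exactly 4 ASCII digits to an int with one _mm_madd_epi16.
inline int parse4(const char *buf) {
    // digits as 16-bit lanes: [d0, d1, d2, d3, 0, 0, 0, 0]
    __m128i v = _mm_set_epi16(0, 0, 0, 0,
                              buf[3] - '0', buf[2] - '0',
                              buf[1] - '0', buf[0] - '0');
    // pairwise multiply-add: lane0 = d0*10 + d1, lane1 = d2*10 + d3
    __m128i w = _mm_madd_epi16(v, _mm_set_epi16(0, 0, 0, 0, 1, 10, 1, 10));
    int hi = _mm_cvtsi128_si32(w);       // d0*10 + d1
    int lo = _mm_extract_epi16(w, 2);    // d2*10 + d3, fits in 16 bits
    return hi * 100 + lo;
}
```

The branching over digit counts then happens outside this helper, which is exactly the penalty described above.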

Nonetheless here is my last and fastest top level SIMD attempt, in case you want to play with it or have a better idea how to approach the fragmented data:

typedef union Vec128 {
  __m128i  vector;
  long long i64[2];
  int i32[4];
  short i16[8];
  char byte[16];
} Vec128;

inline void handle_unclear(char code) {
  switch (code) {  // <---- that's the speed showstopper...
    case '!':
      ps.state = ST_COMPRESSION;
      break;
    case '#':
      ps.state = ST_COLOR;
      break;
    case '$':
      ps.cursor = 0;
      break;
    case '-':
      ps.y_offset += 6;
      ps.offset = ps.y_offset * ps.width + 16;
      ps.cursor = 0;
      break;
    case ';':
      break;
  }
}

static int sixels_or_numbers_counter = 0;
inline void handle_sixels_or_numbers(Vec128 reg, int start, int end, int number_bits, int sixel_bits) {
  // to not get removed by optimization
  sixels_or_numbers_counter++;

  // follow-up code removed for less cluttering
}

// like BLSR, but not available in wasm
inline int clear_bits_below(int x) {
  return x & (x - 1);
}

inline __m128i mask_sixels(const __m128i input) {
  const __m128i tmp = _mm_add_epi8(input, _mm_set1_epi8(65));
  return _mm_cmplt_epi8(tmp, _mm_set1_epi8(-64));
}

inline __m128i mask_numbers(const __m128i input) {
  const __m128i tmp = _mm_add_epi8(input, _mm_set1_epi8(80));
  return _mm_cmplt_epi8(tmp, _mm_set1_epi8(-118));
}

void decode____(int length) {
  Vec128 reg;

  int l = length / 16 * 16;
  int rem = length - l;
  if (rem) {
    for (int k = l + rem; k < l + 16; ++k) ps.chunk[k] = 0;
    l += 16;
  }
  for (int i = 0; i < l; i += 16) {
    reg.vector = _mm_lddqu_si128((__m128i *) &ps.chunk[i]);

    // strip high bit
    reg.vector = _mm_and_si128(reg.vector, _mm_set1_epi8(0x7F));

    // identify sixel & numbers
    int sixels = _mm_movemask_epi8(mask_sixels(reg.vector));
    int numbers = _mm_movemask_epi8(mask_numbers(reg.vector));

    // identify unclear bytes
    int unclear = 0xFFFF & ~(sixels | numbers);

    int pos = 0;
    while (unclear) {
      int adv = __builtin_ctz(unclear);
      if (pos < adv) {
        handle_sixels_or_numbers(reg, pos, adv, numbers, sixels);
      }
      handle_unclear(reg.byte[adv]);
      pos = adv + 1;
      unclear = clear_bits_below(unclear);
    }
    if (pos < 16) {
      handle_sixels_or_numbers(reg, pos, 16, numbers, sixels);
    }
  }
}

This top level SIMD loop needs a lot fewer instructions per 16-byte slice than my first version, and groups slices into unclear and sixels_or_numbers. The __builtin_ctz is problematic for machines without native TZCNT (< haswell I think); typically it gets triggered around 2 - 6 times for every 16 bytes from my data analysis. Everything else is a piece of cake in that main loop (throughput ~3 GB/s).
The real perf showstopper is the switch statement in handle_unclear (throughput drops down to 700 MB/s). At first I wondered if the byte extraction in reg.byte[adv] is the problem, but replacing the switch with a simple add on a static variable restores most of the speed. It seems that the CPU doesn't like the branching here. Maybe you have a good idea how to prevent that perf decrease; I didn't find one so far, the code paths are way too different to think about a branch-less version.

When applying the real data parsing (color/compression/sixels) down the code paths, things get ugly due to the extra work needed for padding and alignment (not shown above). So far the only ground that shows real speed improvement with SIMD is the sixel-to-pixel path.

I'm gonna stop those top level SIMD attempts for now, unless you have a fundamentally better idea how to approach the data. In the meantime I'm gonna try to micro optimize my byte-by-byte loop (repeated SIMD painting is still missing). The last trick I have in mind is to reduce the dummy paints[*] to almost 0, at least across compression-sixels-compression progressions (not really useful across color changes, as they almost always contain cursor movements with CR). Looking at the real asm output, simd_paint is only 72 instructions, which means that my calculated 88 instructions still contain 20% dummy calls. Getting rid of those should give another 3-5% speedup. That's not much, but would give me 130-160 MB/s for most images I have for testing here, which is still good compared to 92-95 MB/s without SIMD. After that change I am out of ideas. 😸


[*] You might have wondered in my last posts about "dummy paints". I call any paint a "dummy paint" if the 4-pixel range is underfull, i.e. sixels are missing, which creates additional "nonsense" paints with simd_paint. Example:

$!5abc!3def creates the following:

  • repeated paint a 5 times --> 2 4-pixel aligned paints: [a, a, a, a], [a, 0, 0, 0]
  • paint bc --> 1 4-pixel aligned paint: [0, b, c, 0]
  • repeated paint d 3 times --> 2 4-pixel aligned paints: [0, 0, 0, d], [d, d, 0, 0]
  • paint ef --> 1 4-pixel aligned paint: [0, 0, e, f]

In total simd_paint gets called 6 times, but in the end only writes [a, a, a, a], [a, b, c, d], [d, d, e, f] (3 calls would do), thus we have 3 dummy paints.


ismail-yilmaz commented on May 30, 2024

@jerch ,

Ah, that's just a synthetic test loop to display the difference. The actual parser is based on Upp::Stream and uses Stream::Get() and Stream::Peek(), which return -1 on eof or stream error (meaning 0 is considered a valid byte).

Besides, that loop can be used to parse the parameters in my use-cases, because our parser first collects and then splits the parsed valid parameter chunks into a vector of Strings. Strings are null terminated, so at that point it is guaranteed to have no '\0' in the string except for the null terminator.


jerch commented on May 30, 2024

@ismail-yilmaz Some updates from my side. Have not pushed the code yet, as the changes and integration still take time, because of the level 1 vs. level 2 handling and the question whether to truncate at raster dimensions.

While trying to find a working API to cover all these details I made an interesting observation that might also help you get better cache utilization:

My wasm implementation started as level 2, truncating only, as it promised way greater optimizations. In the end we are only interested in the final pixel array, thus it seemed obvious to do that as one big blob of memory to avoid copy costs (even static, to save pointer indirections).
Now while trying to solve the level 1 / non-truncating issue I started playing around with a sixel-line decoder (6 pixels in height), where after finishing the line the pixels need to be copied over before the next line can be processed. This involves one additional copy action for the whole pixel data in the end. To my surprise this is slightly faster in wasm than the big blob approach (~1-3%; I think in native as well, but did only a few checks).

Imho this is a pretty big cache side effect overcompensating for an additional copy step (prolly due to the big blob going into far memory regions with bad cache locality for big images). Still have to understand the details; also I am not settled yet whether to keep the level 1/2 distinction, or to go with just one general purpose decoder in the end (that would be way easier to maintain).

If you want to try something similar, I did the following:

  • set the line length to 32768 pixels at max (should be some time before we get to 64k)
  • spread the 6 pixel rows of a sixel line at those offsets (needs 12 * 65536 bytes in wasm, i.e. 12 memory pages, ~0.75 MB)
  • whenever a sixel line is finished (or at end of image), copy the written pixels elsewhere
  • flush the line memory with fill_color
  • repeat with the next sixel line

At first sight it is not really obvious why this might lead to better cache utilization, as the memory usage is quite spread across those 0.75 MB. Imho the following happens: really every memory interaction (sixel-to-pixel write, off-copy, flush) touches these 6 memory areas, prolly keeping them hot in the cache. Only for very wide images will thrashing occur, when the cursor gets too far to the right. Just a theory atm; if it is true, there should be a significant throughput drop at some line width depending on the cache size.
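The steps above can be sketched as follows (names, the plain int pixel format, and the copy-out target are assumptions):

```cpp
#include <cstring>

enum { MAX_LINE = 32768 };

// 6 pixel rows for the current sixel line, ~0.75 MB total, stays cache-hot.
static int line_buf[6][MAX_LINE];

// Copy the finished sixel line into the final pixel array and flush
// the line buffer with the background color for the next line.
void finish_line(int *pixels, int width, int y_offset, int fill_color) {
    for (int i = 0; i < 6; ++i) {
        std::memcpy(&pixels[(y_offset + i) * width], line_buf[i],
                    width * sizeof(int));
        for (int x = 0; x < width; ++x)
            line_buf[i][x] = fill_color;
    }
}
```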


jerch commented on May 30, 2024

@ismail-yilmaz Wow, sounds promising - and ofc I am really curious about these improvements on the asm level and whether they also apply to wasm to some degree or could be ported back. Its pure stack machine design makes some optimizations behave worse (even pointer arithmetic tends to run slower, from my findings).


jerch commented on May 30, 2024

@ismail-yilmaz I implemented the "write always something" in put_single by offsetting the canvas pixels and routing empty bits to canvas[0]. Boosted the raw decoding throughput from ~85 MB/s to 95-100 MB/s for my test images (~15% boost). Still not sure why it shows that much of a speedup; could be a cache side effect - wild theory: I have the data chunk for reading right before that in memory, maybe it always gets loaded into cache due to this "accidental" locality.


ismail-yilmaz commented on May 30, 2024

@jerch ,

implemented the "write always something"

I apologize for my late reply (really busy now; I'll comment on the things you've written tonight - GMT+3). But I think I finally understand why you get such high results (which is very impressive by the way, congrats!). When I do the same thing, the performance of my decoder drops to ~25 MiB/s (from 42). I now know why (at least, I know what the main culprit is). I'll write the whole thing tonight (GMT+3) and share some findings, with a suggestion on your decoder's main loop.

Best regards,
İsmail


jerch avatar jerch commented on May 30, 2024

I apologize for my late reply ...

Well, I wouldn't call 30 min "late" 😸

Yes I am really interested in your asm findings and whether I can backport some ideas to wasm as well. Wasm itself is pretty reduced in op codes and its stack mechanics, so some things might not work out there.

When I do the same thing, the performance of my decoder drops to ~25 MiB/s (from 42).

Oh, that's weird. Note that those numbers are raw decoder speed; the whole turnover in xterm.js with the sixel-bench test is just at 36 MB/s (it gets slowed down by the JS handling before and after). Furthermore my test machine is an i7-2760QM at 2.40 GHz with 1333 MHz (0.8 ns) double-bank memory. The CPU is kinda old, but I remember seeing big differences between i3/i5 types and i7, esp. due to their different cache settings. If we want comparable metrics across machines, maybe we should switch to cycles instead.


ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

@jerch

A disclaimer: my experience with, and knowledge of, WebAssembly is pretty limited at the moment. (Not so with asm/C/C++, though.)

I will be using sixel-bench as my main reference here. The machine I use for testing is an old-enough Athlon FX 6100 (six cores) @ 3.3 GHz, 8 MB cache, dual-channel memory @ 1866 MHz.

My findings suggest that there are three critical sections that impact the sixel rendering performance. Let's go step by step, shall we?

1. Parameter collection

This was the first interesting part. The fact that higher color palette settings (>= 256) and sixel compression rely on a supplied parameter list makes this section a good candidate for optimization. Throughout my tests and benchmarking I came to the conclusion that for effective CPU cache utilization it is better to collect consecutive parameters in a single, separate loop at once (i.e. via a dedicated inline function, for better instruction and data cache utilization).
To this end I've created three versions of the ReadParams() function, each with a slight modification, and ran them all 100 times on a 10 MB file that contains random, valid sixel parameters ranging from 0 to 999999.

The code of the said three functions can be inspected here. (Note that I use Upp::Stream instead of std::stringstream in the actual code, which is faster, and its Peek() and Get() methods are trivially inlineable.)

Here are the results (with -O3 optimization):

// TIMING     - readparams_plain    :  2.97 s  - 29.71 ms ( 2.97 s  / 100 ), min: 29.00 ms, max: 32.00 ms, nesting: 0 - 100
// TIMING     - readparams_faster   :  1.94 s  - 19.37 ms ( 1.94 s  / 100 ), min: 19.00 ms, max: 27.00 ms, nesting: 0 - 100
// TIMING     - readparams_unrolled :  1.72 s  - 17.25 ms ( 1.73 s  / 100 ), min: 17.00 ms, max: 18.00 ms, nesting: 0 - 100

// THROUGHPUT - readparams_plain    :  3.40 MiB/s
// THROUGHPUT - readparams_faster   :  5.20 MiB/s
// THROUGHPUT - readparams_unrolled :  5.80 MiB/s

While the difference between the faster and unrolled versions can be considered negligible, it is still there. It also shows that conditional branching may not always be the worst path to take. Which brings me to the question: does this also apply to wasm? Did you already consider this approach for your decoder.cpp? Because you seem to be handling the parameters at the main loop level.
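For reference, the dedicated-loop idea could look roughly like this (a simplified stand-in for ReadParams(): the Stream with Peek()/Get() is replaced by plain pointers, and clamping/overflow handling is omitted):

```cpp
#include <cstdint>

// Collect consecutive ';'-separated decimal parameters in one tight
// loop, so digit accumulation stays in hot code instead of being
// spread over the main state machine. Simplified illustration only.
int read_params(const uint8_t* p, const uint8_t* end,
                int* params, int max_params) {
  int count = 0;
  while (p < end && count < max_params) {
    int value = 0;
    while (p < end && unsigned(*p - '0') < 10)  // digit check, one compare
      value = value * 10 + (*p++ - '0');
    params[count++] = value;
    if (p == end || *p != ';')
      break;     // first non-parameter byte ends the run
    ++p;         // skip the ';' separator
  }
  return count;
}
```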

A side note: a code line from decoder.cpp written in two different ways (clang targeting WebAssembly, -O3):

else if (code > 62 && code != 127) // original version.
        local.get       3
        i32.const       63
        i32.lt_u
        br_if           0
# %bb.32:               
        local.get       3
        i32.const       127
        i32.eq  
        br_if           0


else if (unsigned(code - 63) < 64) // My altered version
        local.get       3
        i32.const       -63
        i32.add 
        local.tee       2
        i32.const       63
        i32.gt_u
        br_if           0

Now, I don't know which one would be faster, as I don't have in-depth knowledge of wasm, so please feel free to correct me at every level. But the fact is this code used for put_pixel is on a very perf-sensitive path, and the altered version is at least shorter and involves less branching (I don't know how many cycles each instruction takes on wasm), so it may have an impact on performance - albeit a small one.
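The altered version relies on a general trick: for any range starting at lo, `unsigned(x - lo) < span` folds two comparisons into one, because values below lo wrap around to huge unsigned numbers. A minimal illustration:

```cpp
// Single-comparison range check: code is a sixel data byte iff it
// lies in [63, 126]. Underflow for code < 63 wraps to a large
// unsigned value, so both out-of-range sides fail the one test.
inline bool is_sixel_byte(int code) {
  return unsigned(code - 63) < 64;  // equivalent to 63 <= code <= 126
}
```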

To be continued...


ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

@jerch ,

I will hopefully write the second and third parts this weekend, and try to clear up some points with the numbers and profiling.
But let me say that I found the culprit in my decoder, and changing a single thing skyrockets my decoder's throughput to 61 MiB/s (on sixel-bench). But it is costly, and not really ideal for my decoder's use cases.

I'm going to try to find a simpler state model, as it promises great savings.

This is my main loop, by the way:

			for(;;) {
				int c = stream.Get();
				switch(c) {
				case 0x21:
					GetRepeatCount();
					break;
				case 0x22:
					GetRasterInfo();
					break;
				case 0x23:
					SetPalette();
					break;
				case 0x24:
					Return();
					break;
				case 0x2D:
					LineFeed();
					break;
				case -1: // eof or err
					goto Finalize;
				default:
					if(check_range(c, 0x3F, 0x7E))
						PaintSixel(buffer, c - 0x3F);
					break;
				}
			}

Each function is inlined, and the only other two inner loops are in ReadParams() (inlined inside SetPalette, GetRepeatCount, GetRasterInfo) and PaintSixel(). Also, on sixel-bench, ReadParams gets called 17575647 times (total), while PaintSixel is called 69508135 times (total, for both compressed and single sixels). And these numbers do not even account for the inner loops of these functions! That's why I think they are very perf-sensitive.
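check_range() is not shown in the snippet; a hypothetical implementation using the same single-comparison idea, generalized to arbitrary bounds, would be:

```cpp
// Hypothetical check_range(c, lo, hi): true when lo <= c <= hi.
// The subtraction makes c < lo underflow to a large unsigned value,
// so one unsigned comparison covers both bounds at once.
inline bool check_range(int c, int lo, int hi) {
  return unsigned(c - lo) <= unsigned(hi - lo);
}
```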


jerch avatar jerch commented on May 30, 2024

@ismail-yilmaz 61 MiB/s for the full deal with sequence parsing, sixel decoding and painting in the terminal? Wow, that's far beyond what I can achieve (xterm.js input processing alone is slower, lol).

May I ask you to run the benchmarks of my lib? I have the feeling that our machines are quite different in throughput numbers; sadly I have no clue how to measure wasm execution in terms of CPU cycles, which would be a better number for comparison.

If you are up to testing my wasm decoder, I can give you the compiled version, so you won't need to install the emscripten SDK.


ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

@jerch

61 MiB/s for the full deal with sequence parsing, sixel decoding and painting in the terminal? Wow, that's far beyond what I can achieve (xterm.js input processing alone is slower, lol).

Unfortunately no. It is sequence parsing + sixel decoding.

  • If I display the full animation in the sixel viewer, I get ~56 MiB/s (GUI-processing penalty).
  • If I display the full animation in the terminal via pty -> ~31 MiB/s (where mlterm gets 23 MiB/s).
  • If I display the full animation in the terminal, by directly feeding it, -> ~40 MiB/s.

The drop has several reasons. But the most important one is that I have to stick with the GUI calls and rules provided by our framework to ensure backend compatibility, as our vte is a widget, not an app. But the raw throughput (with only the image display disabled) for sixel-bench in our vte is around 44 MiB/s. So it is pretty much optimized for its environment.

But this "extremely fast" rendering is not feasible for me, because I need to use a static buffer with a fixed length to get it. You can see what I'm getting at... :))

It appears that the 42-44 MiB/s range I've recently achieved is the limit for my decoder, barring a fundamental change in my assumptions and decoder design.

If you are up to test my wasm decoder I can give you the compiled version, thus you wont need to install the emscripten SDK.

Ah, I totally forgot to mention that I already compiled it just two days ago, but did not have time to test it, sorry. However, if you have an additional script to test sixel-bench with your wasm decoder, that would be very nice. :)

I will run the existing tests and compare the results with mine. (I'll share the results within the next week.)

I am really interested in optimizing sixel rendering now, and I'd really like to see xterm.js with top-notch sixel support. Besides, your decoder.cpp already gets a lot of things right. In fact I was planning to build an experimental renderer around it to see how well it fares against mine in the same environment, but I am very busy, so this will take some time...


ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

I can prepare a script for the sixel-bench test, sure thing.

I'd be grateful, thanks!

I will benchmark it this Sunday.


jerch avatar jerch commented on May 30, 2024

Damn switch statement - with a flat switch statement over byte values I get worse numbers (80 - 150 MB/s), but reshaped into sorted if-else cascades I get slightly higher numbers (100 - 180 MB/s). Lol, not a good sign if the conditional cascade actually performs better; normally the switch statement should outperform it by far with proper jump-table optimization for multiple cases, grrrrr. So the neckbreaker for the current complicated state model seems to be the extensive usage of switch.

Will see if I can get a simpler branching model from the cascading thingy (it still misses several edge-case transitions).


jerch avatar jerch commented on May 30, 2024

This is the best compromise I could find with reduced state handling:

inline void maybe_color() {
  if (ps.state == ST_COLOR) {
    if (ps.p_length == 1) {
      ps.color = ps.palette[ps.params[0] % ps.palette_length];
    } else if (ps.p_length == 5) {
      if (ps.params[1] < 3
        && (ps.params[1] == 1 ? ps.params[2] <= 360 : ps.params[2] <= 100)
        && ps.params[3] <= 100
        && ps.params[4] <= 100) {
        switch (ps.params[1]) {
          case 2:  // RGB
            ps.palette[ps.params[0] % ps.palette_length] = ps.color = normalize_rgb(
              ps.params[2], ps.params[3], ps.params[4]);
            break;
          case 1:  // HLS
            ps.palette[ps.params[0] % ps.palette_length] = ps.color = normalize_hls(
              ps.params[2], ps.params[3], ps.params[4]);
            break;
          case 0:  // illegal, only apply color switch
            ps.color = ps.palette[ps.params[0] % ps.palette_length];
        }
      }
    }
  }
}

void decode(int length) {
  if (ps.not_aborted && ps.y_offset < ps.height) {
    for (int i = 0; i < length; ++i) {
      int code = ps.chunk[i] & 0x7F;
      if (62 < code && code < 127) {
        switch (ps.state) {
          case ST_COMPRESSION:
            put(code - 63, ps.color, ps.params[0]);
            ps.cursor += ps.params[0];
            ps.state = ST_DATA;
            break;
          case ST_COLOR:
            maybe_color();
            ps.state = ST_DATA;
            // fall through: also draw the sixel that terminated the color sequence
          default:
            put_single(code - 63, ps.color);
            ps.cursor++;
        }
      } else if (47 < code && code < 58) {
        params_add_digit(code - 48);
      } else switch (code) {
        case 59:
          params_add_param();
          break;
        case 33:
          maybe_color();
          params_reset();
          ps.state = ST_COMPRESSION;
          break;
        case 35:
          maybe_color();
          params_reset();
          ps.state = ST_COLOR;
          break;
        case 36:
          ps.cursor = 0;
          break;
        case 45:
          ps.y_offset += 6;
          ps.offset = ps.y_offset * ps.width + 8;
          ps.cursor = 0;
          break;
        case 34:
          maybe_color();
          params_reset();
          ps.state = ST_ATTR;
          break;
      }
    }
  }
}

While if-cascades are slightly faster in wasm, they kinda make the code a lot uglier. Furthermore, I don't want to optimize for shortcomings of wasm runtimes (only tested with nodeJS so far); if they have a hard time opting for real jump tables, they'd better get that fixed imho. Throughput numbers are between 92 - 180 MB/s, binary size dropped from ~6KB to ~3.8KB.

maybe_color still needs some cleanup (just copied it over), and the reduced state handling needs extensive testing.

Well, it seems I cannot find a significantly faster version in wasm even with the reduced state model, and I don't think the states can be further reduced without introducing faulty handling. I did not yet try your suggested optimizations above.


ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

@jerch

In the meantime, some preliminary benchmarks, including wasm:

(Sixel tests, on my development decoder - the one that gets ~42 MiB/s on sixel-bench)

Clean decoding, 20 rounds.
SixelRenderer_Clean(sem_clean.six):                 1.40ms   (64.00 MB/s)
SixelRenderer_Clean(boticelli_clean.six):           1.80ms   (29.00 MB/s)
SixelRenderer_Clean(screen_clean.six):              2.50ms   (21.00 MB/s)
SixelRenderer_Clean(test2_clean.sixel):             4.50ms   (70.00 MB/s)
SixelRenderer_Clean(sampsa1_clean.sixel):           30.00ms  (66.00 MB/s)
SixelRenderer_Clean(test1_clean.sixel):             18.00ms  (33.00 MB/s)
SixelRenderer_Clean(rose16_clean.six):              0.70ms   (42.00 MB/s)
SixelRenderer_Clean(fullhd_12bit_noise_clean.six):  120.00ms (131.40 MB/s)
SixelRenderer_Clean(oriole_clean.six):              1.10ms   (9.90 MB/s)
SixelRenderer_Clean(cat2_clean.six):                0.70ms   (7.40 MB/s)
SixelRenderer_Clean(zx81_clean.six):                0.45ms   (1.10 MB/s)
SixelRenderer_Clean(leonardo_clean.six):            1.10ms   (13.00 MB/s)
SixelRenderer_Clean(michael_clean.six):             1.20ms   (17.00 MB/s)
SixelRenderer_Clean(trontank_clean.six):            1.10ms   (15.00 MB/s)
SixelRenderer_Clean(sampsa_reencoded_clean.six):    20.00ms  (31.00 MB/s)
SixelRenderer_Clean(time_clean.six):                3.70ms   (42.00 MB/s)
SixelRenderer_Clean(gnuplot_clean.six):             0.95ms   (11.00 MB/s)
SixelRenderer_Clean(biplane_clean.six):             1.50ms   (24.00 MB/s)
SixelRenderer_Clean(demo05_clean.six):              1.00ms   (34.00 MB/s)
SixelRenderer_Clean(testhlong_clean.six):           0.45ms   (0.07 MB/s)
SixelRenderer_Clean(chess_clean.six):               1.40ms   (17.00 MB/s)
SixelRenderer_Clean(oriole2_clean.six):             1.70ms   (19.00 MB/s)
SixelRenderer_Clean(space_clean.six):               0.85ms   (9.70 MB/s)


"Fringe" decoder, 61 MiB/s version (static RGBA buffer with a fixed size of 1920x1080. Yeah, kinda insane. But this can happen when we make compilers really happy. And making compilers happy can make life miserable for a developer. Hence this decoder is not really usable.)

Clean decoding, 20 rounds.
SixelRenderer_Clean(sem_clean.six):                 0.90ms  (99.97 MB/s)
SixelRenderer_Clean(boticelli_clean.six):           0.70ms  (73.60 MB/s)
SixelRenderer_Clean(screen_clean.six):              0.70ms  (72.50 MB/s)
SixelRenderer_Clean(test2_clean.sixel):             2.20ms  (143.40 MB/s)
SixelRenderer_Clean(sampsa1_clean.sixel):           12.00ms (168.40 MB/s)
SixelRenderer_Clean(test1_clean.sixel):             3.50ms  (168.70 MB/s)
SixelRenderer_Clean(rose16_clean.six):              0.55ms  (52.91 MB/s)
SixelRenderer_Clean(fullhd_12bit_noise_clean.six):  74.00ms (210.30 MB/s)
SixelRenderer_Clean(oriole_clean.six):              0.50ms  (21.73 MB/s)
SixelRenderer_Clean(cat2_clean.six):                0.45ms  (11.53 MB/s)
SixelRenderer_Clean(zx81_clean.six):                0.45ms  (1.15 MB/s)
SixelRenderer_Clean(leonardo_clean.six):            0.50ms  (28.75 MB/s)
SixelRenderer_Clean(michael_clean.six):             0.55ms  (35.07 MB/s)
SixelRenderer_Clean(trontank_clean.six):            0.50ms  (32.82 MB/s)
SixelRenderer_Clean(sampsa_reencoded_clean.six):    3.40ms  (189.30 MB/s)
SixelRenderer_Clean(time_clean.six):                1.30ms  (118.20 MB/s)
SixelRenderer_Clean(gnuplot_clean.six):             0.45ms  (22.98 MB/s)
SixelRenderer_Clean(biplane_clean.six):             0.60ms  (59.01 MB/s)
SixelRenderer_Clean(demo05_clean.six):              0.65ms  (51.93 MB/s)
SixelRenderer_Clean(testhlong_clean.six):           0.45ms  (0.07 MB/s)
SixelRenderer_Clean(chess_clean.six):               0.55ms  (41.01 MB/s)
SixelRenderer_Clean(oriole2_clean.six):             0.65ms  (48.84 MB/s)
SixelRenderer_Clean(space_clean.six):               0.45ms  (18.35 MB/s)

And wasm on my base testing machine which I mentioned in my earlier messages:

   Context "./lib/index.benchmark.js"
      Context "testimage"
         Context "pixel transfer"
            Case "toPixelData - with fillColor" : 20 runs - average runtime: 2.11 ms
            Case "toPixelData - without fillColor" : 20 runs - average runtime: 1.20 ms
         Context "decode (DefaultDecoder)"
            Case "decode" : 20 runs - average runtime: 4.91 ms
            Case "decodeString" : 20 runs - average runtime: 5.32 ms
            Case "decode + pixel transfer" : 20 runs - average runtime: 3.86 ms
         Context "decode (WasmDecoder)"
            Case "decode" : 20 runs - average runtime: 1.84 ms
            Case "decodeString" : 20 runs - average runtime: 2.26 ms
         Context "encode"
            Case "sixelEncode" : 20 runs - average runtime: 27.89 ms
      Context "decode - testfiles (DefaultDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 19.23 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 32.78 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 7.75 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 40.76 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 18.61 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 34.66 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 234.74 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 66.26 MB/s
      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 10.17 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 58.24 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 4.36 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 72.63 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 11.45 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 56.30 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 124.64 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 124.38 MB/s

and perf on sixel-bench

       Output:
       SixelRenderer(data.sixel):          2.90s (42.00 MB/s)
       SixelRenderer(data.sixel):          2.80s (42.00 MB/s)
       SixelRenderer(data.sixel):          2.80s (42.00 MB/s)
       SixelRenderer(data.sixel):          2.80s (42.00 MB/s)

       Performance counter stats for './SixelBench' (4 runs):
     

      3.214,11 msec task-clock:u              #    0,998 CPUs utilized            ( +-  0,23% )
             0      context-switches:u        #    0,000 K/sec                  
             0      cpu-migrations:u          #    0,000 K/sec                  
         4.940      page-faults:u             #    0,002 M/sec                    ( +-  0,03% )
11.763.759.200      cycles:u                  #    3,660 GHz                      ( +-  0,10% )  (83,12%)
   837.240.854      stalled-cycles-frontend:u #    7,12% frontend cycles idle     ( +-  0,26% )  (83,36%)
 2.512.438.090      stalled-cycles-backend:u  #   21,36% backend cycles idle      ( +-  1,25% )  (33,52%)
14.418.053.773      instructions:u            #    1,23  insn per cycle         
                                              #    0,17  stalled cycles per insn  ( +-  0,53% )  (50,13%)
 3.225.163.249      branches:u                # 1003,440 M/sec                    ( +-  0,17% )  (66,73%)
   112.190.910      branch-misses:u           #    3,48% of all branches          ( +-  0,35% )  (83,26%)

       # Table of individual measurements:
       3,24364 (+0,02208) #
       3,22007 (-0,00148) #
       3,20740 (-0,01416) #
       3,21511 (-0,00644) #

       # Final result:
       3,22156 +- 0,00781 seconds time elapsed  ( +-  0,24% )


I'll continue to explore the optimization strategies in part 2, with some more findings, tomorrow. Have a nice weekend.


ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

@jerch ,

Another treat: a modified version of your wasm decoder (I adapted the logic of mine to yours; it seems slightly faster, but contains less checking). You can compare it with the above benchmarks. (Warning: the patch code is ugly - by C++ standards, at least...)

Results:

 Context "./lib/index.benchmark.js"
      Context "testimage"
         Context "pixel transfer"
            Case "toPixelData - with fillColor" : 20 runs - average runtime: 2.20 ms
            Case "toPixelData - without fillColor" : 20 runs - average runtime: 1.29 ms
         Context "decode (DefaultDecoder)"
            Case "decode" : 20 runs - average runtime: 5.01 ms
            Case "decodeString" : 20 runs - average runtime: 5.43 ms
            Case "decode + pixel transfer" : 20 runs - average runtime: 3.77 ms
         Context "decode (WasmDecoder)"
            Case "decode" : 20 runs - average runtime: 1.66 ms
            Case "decodeString" : 20 runs - average runtime: 2.17 ms
         Context "encode"
            Case "sixelEncode" : 20 runs - average runtime: 27.77 ms
      Context "decode - testfiles (DefaultDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 19.07 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 33.13 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 7.75 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 40.76 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 18.69 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 34.51 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 236.52 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 65.68 MB/s
      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 10.34 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 57.31 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 4.33 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 73.11 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 11.73 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 54.91 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 115.23 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 134.54 MB/s

Please find attached the modified code..
decoder.zip


ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

@jerch

I've made some corrections to my patch, and as a result it seems a bit faster than the previous one:

  Context "./lib/index.benchmark.js"
      Context "testimage"
         Context "pixel transfer"
            Case "toPixelData - with fillColor" : 20 runs - average runtime: 2.16 ms
            Case "toPixelData - without fillColor" : 20 runs - average runtime: 1.22 ms
         Context "decode (DefaultDecoder)"
            Case "decode" : 20 runs - average runtime: 4.77 ms
            Case "decodeString" : 20 runs - average runtime: 5.59 ms
            Case "decode + pixel transfer" : 20 runs - average runtime: 3.84 ms
         Context "decode (WasmDecoder)"
            Case "decode" : 20 runs - average runtime: 1.59 ms
            Case "decodeString" : 20 runs - average runtime: 2.08 ms
         Context "encode"
            Case "sixelEncode" : 20 runs - average runtime: 27.83 ms
      Context "decode - testfiles (DefaultDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 19.10 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 33.12 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 7.65 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 41.38 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 18.68 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 34.54 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 250.48 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 62.10 MB/s
      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 10.38 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 57.28 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 4.30 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 73.60 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 11.83 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 54.51 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 111.23 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 139.39 MB/s

The code can be easily modified and unrolled, which might let you gain even more speed and test the assembly behavior for your final decoder.

decoder_corrected.zip


jerch avatar jerch commented on May 30, 2024

@ismail-yilmaz Hmm weird, with your last code I get slightly worse runtimes than with my latest optimizations (still have to clean up and push that). Btw in line 187 you do this:

		for(int n = 0; n < 6; ++n) {
			if(c & (1 << n))
				ps.canvas[p + ps.jump_offsets[n]] = ps.color;
		}

Replaced with:

		for(int n = 0; n < 6; ++n) {
				ps.canvas[((c >> n) & 1) * (p + ps.jump_offsets[n])] = ps.color;
		}

I get much higher throughput. It is again the "write always something" trick - can you check whether this gives you higher numbers as well? (Because you said above that this is worse for you, I am not sure if it misuses some CPU cache behavior.)

Btw my numbers are consistently higher than yours, so your 42-ish number is more like 65 on my machine. Maybe CPU and bus speed differences explain most of that; I am not sure if there are bigger differences in cache loading times between AMD and Intel, and I think the different pipeline lengths will even out in the end (well, I no longer have an informed opinion in that field - I lost interest in CPU differences around the Pentium 4, which was known for its insanely long pipelines).


ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

@jerch ,

Sure (not much difference here, interesting...):

  Context "./lib/index.benchmark.js"
      Context "testimage"
         Context "pixel transfer"
            Case "toPixelData - with fillColor" : 20 runs - average runtime: 2.07 ms
            Case "toPixelData - without fillColor" : 20 runs - average runtime: 1.19 ms
         Context "decode (DefaultDecoder)"
            Case "decode" : 20 runs - average runtime: 4.62 ms
            Case "decodeString" : 20 runs - average runtime: 5.38 ms
            Case "decode + pixel transfer" : 20 runs - average runtime: 3.74 ms
         Context "decode (WasmDecoder)"
            Case "decode" : 20 runs - average runtime: 1.58 ms
            Case "decodeString" : 20 runs - average runtime: 2.04 ms
         Context "encode"
            Case "sixelEncode" : 20 runs - average runtime: 27.98 ms
      Context "decode - testfiles (DefaultDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 19.20 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 32.96 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 7.88 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 40.08 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 18.51 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 34.83 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 243.80 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 63.77 MB/s
      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 10.55 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 56.14 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 4.35 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 72.80 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 12.24 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 52.60 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 111.89 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 138.63 MB/s

(Because you said above that this is worse for you, so I am not sure, if it misuses some CPU cache settings).

Well, for one, it is because you are using a static integer array with a fixed size. Compilers usually vectorize the hell out of those (on x86). I allocate the memory on the heap (resizeable) and use an aligned RGBA structure. Integer operations can use registers, while setting an RGBA buffer (even if the lengths are the same) is usually done via memcpy. That's why the performance of my renderer takes a hit if I "always write".
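The layout difference being discussed can be sketched like this (illustrative types, not the framework's actual RGBA):

```cpp
#include <cstdint>
#include <cstring>

// An aligned 4-byte RGBA struct vs. a packed uint32 pixel. Both have
// the same footprint, but integer stores go straight through
// registers, while struct assignments are sometimes lowered to
// memcpy-like code. Types are illustrative, not the framework's.
struct RGBA { uint8_t r, g, b, a; };
static_assert(sizeof(RGBA) == sizeof(uint32_t), "same footprint");

// Well-defined type pun from the struct to its integer representation.
inline uint32_t pack(RGBA c) {
  uint32_t v;
  std::memcpy(&v, &c, sizeof v);
  return v;
}
```

Whether the compiler emits a single register store or a byte-wise copy for the struct variant depends on target and optimization level, which would explain the diverging results.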

Here's a synthetic benchmark stressing the difference between RGBA/integer buffers and static vs. dynamic allocation:

OP         - 500 rounds of random color fill.
BUFFERSIZE - 1920 * 1080,
THROUGHPUT - Buffer<RGBA>  : 56.00 MiB/s
THROUGHPUT - Buffer<int>   : 58.00 MiB/s
TIMING     - Buffer<int>   : 17.06 s  - 34.12 ms (17.06 s  / 500 ), min: 33.00 ms, max: 42.00 ms, nesting: 0 - 500
TIMING     - Buffer<RGBA>  : 17.69 s  - 35.38 ms (17.69 s  / 500 ), min: 34.00 ms, max: 49.00 ms, nesting: 0 - 500


OP         - 100000 rounds of random fill.
BUFFERSIZE - 1920 * 1080,
THROUGHPUT - RGBA   buffer[]  : 5700.00 MiB/s
THROUGHPUT - uint32 buffer[]  : 6200.00 MiB/s
TIMING     - RGBA   buffer[]  :  7.72 ms - 77.20 ns (99.00 ms / 100000 ),  min:  0.00 ns, max:  1.00 ms, nesting: 0 - 100000
TIMING     - uint32 buffer[]  :  9.72 ms - 97.20 ns (101.00 ms / 100000 ), min:  0.00 ns, max:  1.00 ms, nesting: 0 - 100000

if there are bigger differences in cache loading times between AMD and Intel

Possibly, because it appears that the Athlon FX 6100 sits somewhere between an i5 and an i7, closer to the i5.


ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

@jerch
Oops, I made a mistake. Here are the real results with "always write":

  Context "./lib/index.benchmark.js"
      Context "testimage"
         Context "pixel transfer"
            Case "toPixelData - with fillColor" : 20 runs - average runtime: 2.13 ms
            Case "toPixelData - without fillColor" : 20 runs - average runtime: 1.23 ms
         Context "decode (DefaultDecoder)"
            Case "decode" : 20 runs - average runtime: 4.96 ms
            Case "decodeString" : 20 runs - average runtime: 5.56 ms
            Case "decode + pixel transfer" : 20 runs - average runtime: 3.87 ms
         Context "decode (WasmDecoder)"
            Case "decode" : 20 runs - average runtime: 1.51 ms
            Case "decodeString" : 20 runs - average runtime: 2.08 ms
         Context "encode"
            Case "sixelEncode" : 20 runs - average runtime: 28.15 ms
      Context "decode - testfiles (DefaultDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 18.95 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 33.44 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 7.76 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 40.72 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 18.73 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 34.40 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 235.48 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 65.95 MB/s
      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 8.88 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 66.80 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 3.80 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 83.18 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 9.37 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 68.81 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 107.65 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 144.03 MB/s

from upp-components.

jerch avatar jerch commented on May 30, 2024

@ismail-yilmaz Pushed my latest optimizations, which is actually the fastest with all proper state transitions. That's the one where I get >90 MB/s for all tests of the benchmark, and ~170 MB/s for the 12bit noise image (which should not be taken too seriously, as it is "very opinionated" in its data). The reduced state model is also contained as decode_ (so you can flip and test as well).

Edit: Ah I see, your last numbers are more like mine only showing a small gap now.

from upp-components.

ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

Yes the static nature is much easier to deal with, but that's all I need for the wasm instance. The level to get multiple instances here is the wasm container itself, thus one would just spawn multiple wasm decoders if needed.

Yeah, that's, unfortunately, not really affordable for me. That's why I'd better explore other options...

from upp-components.

jerch avatar jerch commented on May 30, 2024

I did not like the static memory in the first place, because it creates tons of friction for the higher level JS integration. But emscripten currently does not allow "memory.grow" for direct wasm builds (it's simply not yet implemented lol), thus removing the allocators and going fully static was the easiest way for me to overcome the shortcomings. And since it shows better performance, I am not really mad about it and will reshape the JS integration to suit that model.

Edit: But can't you do something alike - like allocating a bigger area once and doing the memory semantics on your own within that area? I remember doing that for a game AI once, where I misused a bigger portion of stack memory with alloca, resulting in a real big performance jump during the negamax descent. Ofc the stack is not suitable here, but maybe some own byte sink on the heap...

from upp-components.

ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

@jerch

I'll update the node-sixel shortly and report the results later today.

Edit: But can't you do something alike - like allocating a bigger area once and doing the memory semantics on your own within that area? I remember doing that for a game AI once, where I misused a bigger portion of stack memory with alloca, resulting in a real big performance jump during the negamax descent. Ofc the stack is not suitable here, but maybe some own byte sink on the heap...

Of course. I already did some "cool/clumsy" tricks that "just work" and failed miserably at others, but they were not feasible in general, because the real problem is that our vte is a widget and it can, and should continue to, work on a variety of devices and backends, ranging from limited hobby hardware and Raspberry Pi to high-perf desktops, on SDL/linuxfb and even in web browsers. That's another reason why I find your work on wasm and your opinions important and very productive for me. You see, I am also the co-author and maintainer of our HTML5 backend which allows apps that use our framework to run inside web browsers (canvas + websockets). It is called Turtle. As its name suggests, it uses javascript :)) Thus I am also exploring the possibilities that webassembly can offer us. After all, our HTML backend boils down to image and input handling, which I believe can be satisfyingly optimized using webassembly.

from upp-components.

ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

@jerch

Oh I see, well that's interesting. What JS engine do you use for that? If you want to go that route we prolly should share more general ideas about wasm and JS interactions.

I did some more testing before getting down to a wasm sixel decoder implementation, mainly around nasty tasks in xterm.js like base64 and utf8 decoding. Wasm beats all of my JS implementations by far (at least 2 times faster), and those are already among the fastest in JS land (all tested on V8). Furthermore - calling into wasm creates almost no overhead, if done right. Currently I think that wasm can be used to replace performance critical JS sections, even if called over and over on a hot path.
But note that I have only limited experience with other JS/wasm engines (did only surface tests with Firefox/SpiderMonkey and WebKit/JavaScriptCore - both are a bit slower than V8 from what I've seen so far).

Well, turtle is pretty lightweight as it does not try to recreate a gui with JS. What it does is simply redirect the app window's output (and an associated background) to a web browser via a simple base64 encoded binary protocol. It gets decoded by javascript and displayed and key + mouse input are tracked. This all happens in a simple loop so it can work on all major browsers. So all it does is some image processing and blitting partial or complete images from a server application to a client web browser capable of canvas and web sockets. That's why I think it can be easily optimized using webassembly (I may be dead wrong of course.) To give you a clear idea of what I am talking about, here is an example. (Note that this is two years ago and neither TerminalCtrl nor turtle are as slow as this now.)

But then again, I don't really want to pollute our sixel and wasm optimization discussion further with other stuff. Later, when I implement this, I'd love to discuss it and learn more about webassembly from your experience with it.

from upp-components.

ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

Ok, I downloaded the latest changes and here is the comparison:

decode() - always write

  Context "./lib/index.benchmark.js"
      Context "testimage"
         Context "pixel transfer"
            Case "toPixelData - with fillColor" : 20 runs - average runtime: 2.09 ms
            Case "toPixelData - without fillColor" : 20 runs - average runtime: 1.19 ms
         Context "decode (DefaultDecoder)"
            Case "decode" : 20 runs - average runtime: 4.79 ms
            Case "decodeString" : 20 runs - average runtime: 5.44 ms
            Case "decode + pixel transfer" : 20 runs - average runtime: 3.84 ms
         Context "decode (WasmDecoder)"
            Case "decode" : 20 runs - average runtime: 1.49 ms
            Case "decodeString" : 20 runs - average runtime: 1.98 ms
         Context "encode"
            Case "sixelEncode" : 20 runs - average runtime: 27.80 ms
      Context "decode - testfiles (DefaultDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 18.83 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 33.63 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 7.64 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 41.38 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 18.61 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 34.64 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 243.20 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 63.87 MB/s
      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 8.69 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 68.13 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 3.83 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 82.61 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 9.43 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 68.33 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 121.10 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 128.01 MB/s

decode_() - always write:

 Context "./lib/index.benchmark.js"
      Context "testimage"
         Context "pixel transfer"
            Case "toPixelData - with fillColor" : 20 runs - average runtime: 2.24 ms
            Case "toPixelData - without fillColor" : 20 runs - average runtime: 1.40 ms
         Context "decode (DefaultDecoder)"
            Case "decode" : 20 runs - average runtime: 5.14 ms
            Case "decodeString" : 20 runs - average runtime: 5.55 ms
            Case "decode + pixel transfer" : 20 runs - average runtime: 4.23 ms
         Context "decode (WasmDecoder)"
            Case "decode" : 20 runs - average runtime: 1.35 ms
            Case "decodeString" : 20 runs - average runtime: 1.97 ms
         Context "encode"
            Case "sixelEncode" : 20 runs - average runtime: 28.42 ms
      Context "decode - testfiles (DefaultDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 19.35 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 32.53 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 7.80 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 40.54 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 19.33 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 33.34 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 229.72 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 67.67 MB/s
      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 8.70 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 68.13 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 3.85 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 82.20 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 8.85 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 72.84 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 136.91 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 113.34 MB/s

my patch (always write)

 Context "./lib/index.benchmark.js"
      Context "testimage"
         Context "pixel transfer"
            Case "toPixelData - with fillColor" : 20 runs - average runtime: 2.14 ms
            Case "toPixelData - without fillColor" : 20 runs - average runtime: 1.17 ms
         Context "decode (DefaultDecoder)"
            Case "decode" : 20 runs - average runtime: 4.90 ms
            Case "decodeString" : 20 runs - average runtime: 5.50 ms
            Case "decode + pixel transfer" : 20 runs - average runtime: 3.86 ms
         Context "decode (WasmDecoder)"
            Case "decode" : 20 runs - average runtime: 1.17 ms
            Case "decodeString" : 20 runs - average runtime: 1.69 ms
         Context "encode"
            Case "sixelEncode" : 20 runs - average runtime: 27.93 ms
      Context "decode - testfiles (DefaultDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 18.97 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 33.33 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 7.48 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 42.28 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 18.59 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 34.67 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 220.23 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 70.46 MB/s
      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 8.41 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 70.54 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 3.64 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 87.14 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 8.95 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 71.97 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 101.49 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 152.77 MB/s

from upp-components.

jerch avatar jerch commented on May 30, 2024

kk you got me, gonna try your patches with the "always write" patch merged then. Seems to be the fastest for you, if I am not mistaken? Still wondering why I see much more of an advantage for "always write" than you do though...

from upp-components.

ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

@jerch

kk you got me, gonna try your patches with the "always write" patch merged then. Seems to be the fastest for you, if I am not mistaken? Still wondering why I see much more of an advantage for "always write" than you do though...

Don't bother, I had just synced the repo, compiled the decoder and ran it (while you changed the canvas start address, and I missed updating my code). So it is somewhat of a false alarm: I checked & fixed the canvas address. Now my patch's performance is somewhere in between your decoders', and slightly faster than decode (the original).

from upp-components.

ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

The major and possibly only problem with the so-called "write-always" method - which works exceptionally well on static, fixed-size or small canvases - is that it does not seem to scale well to larger images that require dynamic, variable-size buffers. Writing to buffer[0] becomes much more expensive than branching if that address is not in the cache line or any level of the CPU caches. I think this can be worked around by introducing alternative PaintSixel methods for larger and smaller buffers + calculating/checking the absolute cursor position by powers of the cache line length. This makes things a little more complicated though.

from upp-components.

jerch avatar jerch commented on May 30, 2024

Hmm sounds reasonable. Would a local dummy pixel help here? Something like sliceA, sliceB, ... - where every slice has its own dummy area at the beginning?
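That slice idea might look like this as a layout sketch (hypothetical names and a fixed width, not actual node-sixel code): each band gets its own dummy slot at index 0, so the misses of the always-write trick land right next to the band's pixels instead of at a faraway canvas[0]:

```c
#include <stdint.h>

#define WIDTH 1920  /* assumed canvas width for the sketch */

/* One sixel band: a leading dummy slot plus 6 pixel rows.
 * Unset bits write to base[0], i.e. the band-local dummy slot. */
typedef struct {
    uint32_t dummy;          /* index 0: trash target for unset bits */
    uint32_t px[WIDTH * 6];  /* row r of the band starts at px[r * WIDTH] */
} SixelBand;

static inline void put_single(SixelBand *s, int x, int code, uint32_t color) {
    uint32_t *base = &s->dummy;  /* base[0] is the dummy, base[1] is px[0] */
    int p = x + 1;               /* +1 skips the dummy slot */
    base[code & 1  ? p             : 0] = color;
    base[code & 2  ? p + WIDTH     : 0] = color;
    base[code & 4  ? p + WIDTH * 2 : 0] = color;
    base[code & 8  ? p + WIDTH * 3 : 0] = color;
    base[code & 16 ? p + WIDTH * 4 : 0] = color;
    base[code & 32 ? p + WIDTH * 5 : 0] = color;
}
```

Whether a band-local dummy actually beats a single global one would still need profiling, of course.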

from upp-components.

ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

@jerch

Thinking again, my initial reasoning might be faulty and I don't want to mislead you. Now, I haven't tested this yet, but currently we're moving along the y axis when drawing a strip of a single sixel. So when buffer(0, 0) is hit, it will probably be in the hot cache for the next round, as we are already in a loop (probably the reason why always-write is faster in the first place). OTOH, writing to the beginning of the current line (buffer(0, y + n)), for example, would probably be useful had we been moving along the x axis. Needs some testing and more thinking...

from upp-components.

jerch avatar jerch commented on May 30, 2024

Idk if we should optimize for cache specialties in the end, but it sounds like it is worth to be tried.

Idk at which occasions you have to resize the canvas by what extent (and where the cut points are); from the sixel writing perspective, imho the smallest contiguous mem area would be a sixel line, thus 6 * width pixels. Such a sixel line could hold dummy pixels at [0] (plus some alignment), thus we would get better cache locality at least for the first real pixels (for very long lines the cache offset might be too big again, but the static mem has the same issue). Last but not least, lines with offsets like that need an additional final compose step to get the full canvas, thus the whole thing needs to be profiled to see whether the cache benefit outweighs the compose workload. Tbh I doubt that, still it sounds worth trying, as contiguous copying is quite cheap. (Maybe you don't even have to compose it beforehand, but just push line pointers forward to compose it during drawing.)

Well, just a wild idea.

Edit: In the original JS decoder I don't write to the final image canvas positions directly, but to sixel bands in x direction like [a1 ... a6, b1 ... b6]. While this is very fast during streaming, the final compose step is really ugly and expensive (and one reason for my reduced level 2 wasm decoder).

from upp-components.

jerch avatar jerch commented on May 30, 2024

You prolly already know that - here are some interesting tests about cache effects on runtime: http://igoro.com/archive/gallery-of-processor-cache-effects/

from upp-components.

jerch avatar jerch commented on May 30, 2024

This brings me to another idea - an attempt to create a cache friendly encoder + decoder with "local chunks":

Encoder:

  • slice a sixel line into 16 * 6 pixel chunks (--> 6 cache lines), or maybe bigger cache line multiples
  • do all pixel color addressing within one chunk (thus all color registers for that chunk with LFs and sixel offset data)
  • repeat until sixel line is done

Decoder:

  • alloc local canvas chunk (maybe even on stack)
  • apply all color directives and LFs
  • memcpy 6 * chunklength to canvas position
  • increase canvas position
  • repeat with all chunks

This model introduces additional LF roundtrips with sixel offset data, but they are rather cheap. The memcpy action will be expensive, since it is the only action into non-cached mem, but it's always a contiguous copy of chunklength, thus not that bad.

Big downside - the decoder would only benefit if the images are encoded that way. Data size will slightly increase due to additional LF and sixel offset data. Whether that's actually faster remains uncertain until actually tried. Big plus - such a model can easily be run concurrently (could even be done on the GPU).
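The decoder half of that model could be sketched roughly like this (hypothetical names; the directive-application step is elided and only the chunk flush is shown, assuming a row-major uint32 canvas):

```c
#include <stdint.h>
#include <string.h>

#define CHUNK_W  16             /* 16 px * 4 B = one 64-byte cache line */
#define CHUNK_PX (CHUNK_W * 6)  /* one chunk covers 6 cache lines */

/* Flush a fully decoded local chunk (which stayed in L1 while the color
 * directives were applied) to its place on the canvas: one contiguous
 * copy per pixel row of the sixel band. */
void flush_chunk(uint32_t *canvas, int canvas_width, int chunk_x, int band_y,
                 const uint32_t local[CHUNK_PX]) {
    for (int row = 0; row < 6; ++row) {
        uint32_t *dst = canvas
            + ((size_t)band_y * 6 + row) * canvas_width + chunk_x;
        memcpy(dst, local + row * CHUNK_W, CHUNK_W * sizeof(uint32_t));
    }
}
```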

(If I am bored I gonna try such a version, but more for academic purpose...)

from upp-components.

ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

You prolly already know that - here are some interesting tests about cache effects on runtime: http://igoro.com/archive/gallery-of-processor-cache-effects/

Thanks, a nice summary indeed. The example #7 is an interesting situation. IMO, however, a generic, cache-friendly decoder would be enough in practice.

This brings me to another idea - an attempt to create a cache friendly encoder + decoder with "local chunks":

This. I really like the idea. In fact I had attempted to write a similar MT decoder - not really as cache-friendly as your suggested enc/dec pair though - some months ago. How it operated was as follows:

  1. Split the sixel lines beforehand and calculate the total number of canvas lines in pixels.
  2. Create a vector of RGBA structures/strips (initial length: 1024) from the given number of sixel lines.
  3. Feed the vector of line buffers into the sixel painter.
  4. Merge the strips into a final image buffer.

It was slightly faster (~2-3 MiB/s) in single-threaded mode and a bit faster in multithreaded mode. But I just discarded the idea, for, at the time, it did not seem to me a real improvement. This was due to the design of the initial version of my sixel renderer. It was meant to "just work" and not optimized at all (was giving ~20 MiB/s at most). But now, with all the speed gain I achieved thanks to our conversations and some of your ideas, I may try this again as a side project to see the results.
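Step 4 of that scheme is the cheap part: with full-width strips the merge is just contiguous copies into the final buffer (a sketch with hypothetical names):

```c
#include <stdint.h>
#include <string.h>

/* Concatenate decoded full-width RGBA strips into the final image.
 * Destinations never overlap, so the strips could also be copied (or
 * even decoded) concurrently. */
void merge_strips(uint32_t *image, int width,
                  uint32_t **strips, const int *strip_rows, int count) {
    size_t offset = 0;
    for (int i = 0; i < count; ++i) {
        size_t px = (size_t)width * strip_rows[i];
        memcpy(image + offset, strips[i], px * sizeof(uint32_t));
        offset += px;
    }
}
```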

from upp-components.

jerch avatar jerch commented on May 30, 2024

Woops - I mixed LF and CR above in the model description. Ofc I meant CR to get the next color into the chunk.

from upp-components.

ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

Ok, that makes better sense now.

from upp-components.

jerch avatar jerch commented on May 30, 2024

And btw - the memcpy action could be marked as non-cacheable, as the cache friendly sixel format would guarantee to be final on a pixel (not needed anymore for decoding). Would avoid nonsense cache refreshs accidentally dropping other still needed parts.

from upp-components.

jerch avatar jerch commented on May 30, 2024

@ismail-yilmaz Did a quick impl of the cache friendly sixel format encoder as described above. Well the result is not that promising. With a chunklength of 256 * 6 pixels it changes as follows:

  • data size increased by 30% (!)
  • wasm decoder takes 7% longer now (cache optimized decoder is not yet done)

The 30% size increase is surprisingly high (might still be buggy) and results from the CR resets with offset handling at the chunk borders, which is needed to still form valid sixel data. While the decoder runs faster in terms of throughput (100 MB/s vs. 122 MB/s), we have no gain yet. Not sure, if a cache optimized decoder can make up for more than the additional 7% runtime introduced by the bigger data size. Atm the penalty on normal decoders alone makes this approach subpar.

Btw I made the chunklength configurable, so it's easier to test at different cache utilizations (a width of 256 pixels is at 96 cache lines).

from upp-components.

jerch avatar jerch commented on May 30, 2024

Some interim results of a first cache optimized decoder impl:

  • initial code changes with local canvas chunk: 10 times slower 😅
  • +simd: 21% slower
  • +__int128 for memcpy: 17% slower

Currently a big performance dent comes from a simple modulo calc I have to do for every pixel addressing, but that will be gone with a proper full rewrite. Furthermore there is still a logic twist hidden somewhere that forces me to do a load from the canvas back into the local chunk, which should not be needed and creates a rather big penalty. With both fixed I expect that decoder to be much faster than the current fastest one.

Remarkable about the findings above is the fact that the memcpy action normally is really expensive, but gets hidden in the other CPU load with simd + __int128. Not sure yet why __int128 gave another 4% advantage; I thought memcpy would get vectorized anyway with simd enabled. Maybe it's a shortcoming of the wasm runtime, needs some investigation.

from upp-components.

jerch avatar jerch commented on May 30, 2024

Some more minor optimizations at the pixel index calc give me now these results with my default decoder:

      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 6.15 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 97.74 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 3.20 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 99.42 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 6.23 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 103.96 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 87.73 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 176.77 MB/s

Changes:

  • unrolled bits (saves one shift per bit)
  • reintroduced branching in the index calc for put_single to skip the sum for k=0, this is slightly faster than multiplying by zero
  • ps.width * offset is faster in put_single than loading the precalc'ed value from jump_offsets, for put it is the opposite
  • for --> while in put
  • overall gain ~5%

Code

// in put_single
    int p = ps.cursor + ps.offset;
    ps.canvas[code & 1  ? p                : 0] = color;
    ps.canvas[code & 2  ? p + ps.width     : 0] = color;
    ps.canvas[code & 4  ? p + ps.width * 2 : 0] = color;
    ps.canvas[code & 8  ? p + ps.width * 3 : 0] = color;
    ps.canvas[code & 16 ? p + ps.width * 4 : 0] = color;
    ps.canvas[code & 32 ? p + ps.width * 5 : 0] = color;

// in put
    int p = ps.cursor + ps.offset;
    int pp;
    int r;
    if (code & 1)  { pp = p + ps.jump_offsets[0]; r = n; while (r--) ps.canvas[pp++] = color; }
    if (code & 2)  { pp = p + ps.jump_offsets[1]; r = n; while (r--) ps.canvas[pp++] = color; }
    if (code & 4)  { pp = p + ps.jump_offsets[2]; r = n; while (r--) ps.canvas[pp++] = color; }
    if (code & 8)  { pp = p + ps.jump_offsets[3]; r = n; while (r--) ps.canvas[pp++] = color; }
    if (code & 16) { pp = p + ps.jump_offsets[4]; r = n; while (r--) ps.canvas[pp++] = color; }
    if (code & 32) { pp = p + ps.jump_offsets[5]; r = n; while (r--) ps.canvas[pp++] = color; }

The put code looks like there is some gain possible with pointer arithmetics, but that's actually not true for wasm. Imho a native build would benefit from doing it this way (have not checked it):

    int *p = ps.canvas + ps.cursor + ps.offset;
    int *pp;
    int r;
    if (code & 1)  { pp = p + ps.jump_offsets[0]; r = n; while (r--) *pp++ = color; }
    ...

from upp-components.

ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

@jerch

Sorry I couldn't reply earlier - very busy week. A bunch to digest here. I'll read them through tonight.

from upp-components.

ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

First try: (single paint)

		*(buffer + ((c & 1)  ? coords[0] + cursor.x : 0)) = ink;
		*(buffer + ((c & 2)  ? coords[1] + cursor.x : 0)) = ink;
		*(buffer + ((c & 4)  ? coords[2] + cursor.x : 0)) = ink;
		*(buffer + ((c & 8)  ? coords[3] + cursor.x : 0)) = ink;
		*(buffer + ((c & 16) ? coords[4] + cursor.x : 0)) = ink;
		*(buffer + ((c & 32) ? coords[5] + cursor.x : 0)) = ink;


result:

SixelRenderer(data.sixel):			2.40s (49.00 MB/s) // drops down from 56 MiB/s (on sixel-bench) :/

WebAssembly seems to defy some, if not all, of my expectations.

Edit: code fixed.

from upp-components.

jerch avatar jerch commented on May 30, 2024

Yeah wasm produces some surprising results, esp. for index vs. pointer arithmetics. Normally I would expect no differences anymore for those in a native build, while 20 years ago the pointer stuff would almost always win. In wasm the index addressing is almost always faster. This might be the compiler not yet optimizing for the fastest version under wasm, or wasm itself with its stack limitations (it cannot pull values from registers, and has instead to push/pop arguments before using any instruction).

from upp-components.

jerch avatar jerch commented on May 30, 2024

Btw I think the heavy state handling / chewing-on-bytes penalty (> 50% since last changes) could be somewhat lifted with explicit SIMD usage. I have not investigated in that direction and would not expect too much of a benefit. It really depends on how well the data can be batched. At least for sequential sixel bytes in a line and color register definitions (param parsing) I see a chance here (besides, sixel is very highly fragmented, thus simd with too broad register loads might even degrade performance). Might be worth a shot for you. While I currently have simd enabled in wasm, I have to disable it again due to missing support in Safari and older nodejs versions (works only in nodejs 16+).

from upp-components.

ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

Btw I think the heavy state handling / chewing-on-bytes penalty (> 50% since last changes) could be somewhat lifted with explicit SIMD usage. I have not investigated in that direction and would not expect too much of a benefit. It really depends on how well the data can be batched. At least for sequential sixel bytes in a line and color register definitions (param parsing) I see a chance here (besides, sixel is very highly fragmented, thus simd with too broad register loads might even degrade performance). Might be worth a shot for you. While I currently have simd enabled in wasm, I have to disable it again due to missing support in Safari and older nodejs versions (works only in nodejs 16+).

Yeah that can help. In fact I have tried SIMD at least for the compressed paint. We have a low-level function called memcpy_t which uses very fast SIMD-based 8/16/32/64/128 mem-copy functions depending on the count of elements to be copied. The whole code was as follows:

for(int n = 0; n < 6; ++n) {
     if(c & (1 << n)) {
       RGBA *p = buffer + coords[n] + cursor.x;
        memcpy_t(p, &ink, repeat); 
     }
}

And it did give me some meaningful performance (~9 gain on average across the tests that use heavy compression). But for some reason, it leaves some artifacts on the images (the usual suspect is mem alignment/padding, which I did not bother checking at the time, as it was a test out of curiosity). It needs further investigation, so I deferred it as a TODO for later. Currently I am reconsidering a complete refactor of the lexical component of my sequence parser, which turned out to be a major culprit for the performance degradation (as I mentioned earlier).

from upp-components.

jerch avatar jerch commented on May 30, 2024

Yeah thats another ground for simd, certainly generated graphics/plots could benefit from that, since they usually have a high amount of compressed sixels with high repeats.

Do you mean with ~9 gain like 9 times faster? And yes, the artefacts are prolly from mis-alignments. You can fix that with some clever switch jumping (similar to Duff's device). 😸
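For reference, the usual alignment-safe shape of such a fill (a sketch assuming an x86 target with SSE2 intrinsics; memset32 here is a standalone illustration, not U++'s memcpy_t): a scalar head until the destination is 16-byte aligned, wide aligned stores in the middle, and a scalar tail.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* 32-bit fill that tolerates any 4-byte-aligned destination: scalar
 * head until dst hits 16-byte alignment, aligned 4-pixel stores in the
 * middle, scalar tail. Getting the head/tail bounds wrong is exactly
 * the kind of bug that shows up as stray pixels at run borders. */
void memset32(uint32_t *dst, uint32_t color, size_t n) {
    while (n && ((uintptr_t)dst & 15)) { *dst++ = color; --n; }
    __m128i v = _mm_set1_epi32((int)color);
    for (; n >= 4; n -= 4, dst += 4)
        _mm_store_si128((__m128i *)dst, v);
    while (n--) *dst++ = color;
}
```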

from upp-components.

ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

Do you mean with ~9 gain like 9 times faster?

Lol. I just wish! I forgot to add the % sign. :D

from upp-components.

ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

Benchmarks with and without SIMD-based memset32 in compressed sixel paint (no artefacts):

20 rounds, clean, with SIMD
SixelRenderer_Clean(sem_clean.six):                 1.50ms (60.40 MB/s)
SixelRenderer_Clean(boticelli_clean.six):           1.60ms (32.56 MB/s)
SixelRenderer_Clean(screen_clean.six):              2.30ms (22.42 MB/s)
SixelRenderer_Clean(test2_clean.sixel):             3.80ms (83.91 MB/s)
SixelRenderer_Clean(sampsa1_clean.sixel):           21.00ms (92.19 MB/s)
SixelRenderer_Clean(test1_clean.sixel):             10.00ms (57.41 MB/s)
SixelRenderer_Clean(rose16_clean.six):              0.71ms (41.17 MB/s)
SixelRenderer_Clean(fullhd_12bit_noise_clean.six):  100.00ms (150.40 MB/s)
SixelRenderer_Clean(oriole_clean.six):              0.99ms (11.00 MB/s)
SixelRenderer_Clean(cat2_clean.six):                0.68ms (7.63 MB/s)
SixelRenderer_Clean(zx81_clean.six):                0.44ms (1.16 MB/s)
SixelRenderer_Clean(leonardo_clean.six):            0.99ms (14.59 MB/s)
SixelRenderer_Clean(michael_clean.six):             0.92ms (21.00 MB/s)
SixelRenderer_Clean(trontank_clean.six):            0.98ms (16.75 MB/s)
SixelRenderer_Clean(sampsa_reencoded_clean.six):    11.00ms (57.34 MB/s)
SixelRenderer_Clean(time_clean.six):                3.30ms (47.17 MB/s)
SixelRenderer_Clean(gnuplot_clean.six):             0.95ms (10.85 MB/s)
SixelRenderer_Clean(biplane_clean.six):             1.30ms (27.99 MB/s)
SixelRenderer_Clean(demo05_clean.six):              0.87ms (39.00 MB/s)
SixelRenderer_Clean(testhlong_clean.six):           0.43ms (0.08 MB/s)
SixelRenderer_Clean(chess_clean.six):               1.20ms (18.85 MB/s)
SixelRenderer_Clean(oriole2_clean.six):             1.60ms (20.08 MB/s)
SixelRenderer_Clean(space_clean.six):               0.84ms (9.89 MB/s)

SixelRenderer(data.sixel):                          2.10s (58.00 MB/s)

20 rounds, clean, without SIMD
SixelRenderer_Clean(sem_clean.six):                 1.70ms (53.21 MB/s)
SixelRenderer_Clean(boticelli_clean.six):           1.60ms (32.14 MB/s)
SixelRenderer_Clean(screen_clean.six):              2.60ms (19.57 MB/s)
SixelRenderer_Clean(test2_clean.sixel):             4.00ms (78.57 MB/s)
SixelRenderer_Clean(sampsa1_clean.sixel):           24.00ms (82.70 MB/s)
SixelRenderer_Clean(test1_clean.sixel):             11.00ms (55.11 MB/s)
SixelRenderer_Clean(rose16_clean.six):              0.77ms (37.90 MB/s)
SixelRenderer_Clean(fullhd_12bit_noise_clean.six):  110.00ms (137.40 MB/s)
SixelRenderer_Clean(oriole_clean.six):              1.30ms (8.38 MB/s)
SixelRenderer_Clean(cat2_clean.six):                0.71ms (7.32 MB/s)
SixelRenderer_Clean(zx81_clean.six):                0.45ms (1.15 MB/s)
SixelRenderer_Clean(leonardo_clean.six):            1.30ms (11.30 MB/s)
SixelRenderer_Clean(michael_clean.six):             1.10ms (17.41 MB/s)
SixelRenderer_Clean(trontank_clean.six):            1.00ms (15.64 MB/s)
SixelRenderer_Clean(sampsa_reencoded_clean.six):    11.00ms (56.55 MB/s)
SixelRenderer_Clean(time_clean.six):                3.40ms (45.69 MB/s)
SixelRenderer_Clean(gnuplot_clean.six):             0.96ms (10.76 MB/s)
SixelRenderer_Clean(biplane_clean.six):             1.30ms (26.52 MB/s)
SixelRenderer_Clean(demo05_clean.six):              0.91ms (37.27 MB/s)
SixelRenderer_Clean(testhlong_clean.six):           0.43ms (0.08 MB/s)
SixelRenderer_Clean(chess_clean.six):               1.30ms (16.94 MB/s)
SixelRenderer_Clean(oriole2_clean.six):             1.90ms (16.87 MB/s)
SixelRenderer_Clean(space_clean.six):               0.87ms (9.47 MB/s)

SixelRenderer(data.sixel):                          2.10s (57.00 MB/s)

from upp-components.

jerch avatar jerch commented on May 30, 2024

I would keep that; it is not overwhelming, but a nice boost for contiguous color handling.

But I think that can be driven a step further: by slicing the data at CR ($) and register select (#nnn), basically only compression directives and sixel bytes are left. This can be used to approach the pixel writing always in contiguous order by loading the sixels into SIMD and doing the bit shift there, schematically:

  • on register select: create colorMask as [color, color, color, color] (4x32)
  • repeat for local sixels:
    • load 4 sixels into 4x32 simd
    • repeat 6 times:
      • create 4x32 bitMask from lsb components --> [0xFFFFFF, 0, ...]
      • write: *p = *p | (colorMask & bitMask); p += width/4;
      • rshift sixel vector components

Well, something like that. It would reduce the instructions per pixel by quite some amount (well, the needed back loading from *p is nasty and might actually perform worse, idk). Also note that the compression could be subsumed here, as it is just a special case of the general handling (still might be faster on a separate code path, as the masking per bit could be done as a prepare step in colorMask directly).

Edit:
Some runtime estimates of that - I'd expect it to be ~2 times faster, with a big range of uncertainty introduced by alignment needs (you somehow have to get prolly misaligned sixels in there fast) and the fact that this will do more dummy writes, where the linear handling might have skipped them. With __m256 you could even do 8 pixels at once, not sure if you want to test the "bigger is better" route; wasm-simd only supports 128 bit as far as I am aware.
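For reference, the shift-or scheme above reduces to this scalar loop (a minimal sketch assuming a 32bpp row-major canvas, with the lsb of each sixel byte as top row); the SIMD variant processes the four columns of each inner step in one register:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference of the vertical shift-or paint: four sixel columns,
   six canvas rows. `canvas` is a hypothetical 32bpp row-major buffer,
   `x` the current cursor column, bit 0 of a sixel byte is the top row. */
void paint_scalar(uint32_t *canvas, int width, int x,
                  const uint8_t sixels[4], uint32_t color) {
  for (int bit = 0; bit < 6; ++bit) {
    for (int col = 0; col < 4; ++col) {
      if ((sixels[col] >> bit) & 1)
        canvas[bit * width + x + col] |= color;
    }
  }
}
```

The names are made up for illustration; the real decoder additionally has to blend with register colors and handle the cursor advance.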


ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

@ismail-yilmaz Played around with the simd idea above and got something working. It still has bugs in color transfers, but the performance is already quite good for a first hacky impl:

      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 5.83 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 103.60 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 3.51 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 91.83 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 5.71 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 115.01 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 92.40 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 167.82 MB/s

Here the code for the non repeated paints:

void simd_paint() {
  if (!reg_agg) return;

  __m128i colorMask = _mm_set_epi32(ps.color, ps.color, ps.color, ps.color);
  __m128i ones = _mm_set_epi32(1, 1, 1, 1);
  __m128i sixels = _mm_load_si128((__m128i *) reg);

  int p = l_cur + ps.offset;

  for (int i = 0; i < 6; ++i) {
    __m128i singles = _mm_and_si128(sixels, ones);
    __m128i bitmask = _mm_cmpeq_epi32(ones, singles);
    __m128i updated = _mm_and_si128(colorMask, bitmask);

    __m128i old = _mm_load_si128((__m128i *) &ps.canvas[p + ps.jump_offsets[i]]);
    __m128i final = _mm_or_si128(old, updated);

    _mm_store_si128((__m128i *) &ps.canvas[p + ps.jump_offsets[i]], final);

    sixels = _mm_srai_epi32(sixels, 1);
  }

  _mm_store_si128((__m128i *) reg, _mm_setzero_si128());
  reg_agg = 0;
}

reg is an int[4] array, where sixels get loaded from the main chunk loop based on cursor % 4. Whenever the write is below or above that 4 pixel area, simd_paint gets called, also for color changes and CR/LF. The single byte loading into reg and the pixel range handling is still buggy, and now the biggest runtime penalty (while the paint call runtime dropped by almost 50%).

@jerch
Hmm, a clever trick indeed. I get the picture now.
I'll try this tomorrow. (or later today, whatever).


ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

With avx2 it would not be that bad (since you can do it in one go), def. worth a shot for you (but will only work with newer machines like haswell+).

The thing is, we officially have a nice API for up to 128 bit SIMD in U++, which simplifies things drastically (operator overloading, bit manipulation, loading/storing, broadcasting, etc.). Not to mention it encapsulates ARM NEON too. This will allow me to keep compatibility with different backends, which I've mentioned earlier. So I think it will be sufficient for me if I can get a meaningful gain with the approach you've suggested last night. The only problem I have to solve with that approach now is to efficiently read 4 sixel bytes in a row without changing the existing loop much. Besides, I am almost satisfied with the results I already have: last night, another round of refactoring (wiping out my mortal laziness from the codebase) boosted the sequence parser to 320 MiB/s on sixel-bench, which gave me a steady 62 MiB/s, a MiB higher than the test code in which I used a static, fixed size buffer. (I wonder what I'll get with that static buffer. Well, I'll test that too...) Seems this time I am pretty close to the limit of the hardware...


jerch avatar jerch commented on May 30, 2024

Hmm, that's interesting, found a big difference in native vs wasm build:

// faster in native (gcc)
__m128i sixels = _mm_load_si128((__m128i *) reg);
// faster in wasm (clang)
__m128i sixels = _mm_set_epi32(reg[3], reg[2], reg[1], reg[0]);

Seems wasm does not like pointers at all? Sadly idk how to peek into the final asm created by the wasm runtime...


ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

Hmm, that's interesting, found a big difference in native vs wasm build:

// faster in native
__m128i sixels = _mm_load_si128((__m128i *) reg);
// faster in wasm
__m128i sixels = _mm_set_epi32(reg[3], reg[2], reg[1], reg[0]);

Seems wasm does not like pointers at all? Sadly idk how to peek into the final asm created by the wasm runtime...

The official page writes:

_mm_load_si128 🟡 wasm_v128_load. VM must guess type. Unaligned load on x86 CPUs.

🟡 there is some information missing (e.g. type or alignment information) for a Wasm VM to be guaranteed to be able to reconstruct the intended x86 SSE opcode. This might cause a penalty depending on the target CPU hardware family, especially on older CPU generations.

It kinda reads "Good luck!" to me. :)


jerch avatar jerch commented on May 30, 2024

Oh thx for finding this one, yeah the unaligned load is a lot slower (had to get rid of it with __attribute__((aligned(16))) all over the place, which boosted native, but not wasm). Now I know why...
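The native-side fix boils down to forcing the storage itself onto a 16-byte boundary (sketch; the struct and field names are made up, C11 alignas is equivalent to the GCC attribute used above):

```c
#include <assert.h>
#include <stdalign.h>   /* C11 alignas/alignof */
#include <stdint.h>

/* Hypothetical 4-sixel batch: 16-byte alignment on the storage makes the
   aligned load (_mm_load_si128 -> movdqa) legal, instead of falling back
   to the slower unaligned movdqu path. */
typedef struct {
  alignas(16) int32_t reg[4];
  int32_t agg;
} SixelBatch;
```

Stack or static instances of such a struct are then always safe to feed into the aligned load/store intrinsics.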


jerch avatar jerch commented on May 30, 2024

@ismail-yilmaz Making baby steps towards SIMD parsing. Got the easy part working - slicing data at single state changing bytes (CR, LF, color/attribs/compression introducer). This high level lexing runs at 2GB/s native, and 700 MB/s in wasm (heavy usage of _mm_load_si128, no clue how to avoid that). It is ~8 times faster in native than the byte-by-byte loop, and still ~4 times faster under wasm.

The real challenge now is not to waste too many cycles on actual parsing of the subchunk data. 😸
Basically I see there two options:

  • go with minimal copying - will run alot into unaligned SIMD instructions
  • move SIMD digestable bytes into aligned memory before processing them

Normally I am a big fan of less data moving, but the penalty of unaligned SIMD access seems rather high, so an additional copy step might help if many SIMD instructions follow. Also not sure if my SIMD-fu is good enough to get number/color parsing done with it, might just drop to the old code here (which is no biggy, as those are low on the runtime meter). Still the sixel data bytes will need special care, and better get moved over to SIMD paint in aligned batches.

Regarding parsing numbers, this might help: http://0x80.pl/articles/simd-parsing-int-sequences.html


ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

This high level lexing runs at 2GB/s native, and 700 MB/s in wasm (heavy usage of _mm_load_si128, no clue how to avoid that). It is ~8 times faster in native than the byte-by-byte loop, and still ~4 times faster under wasm.

Those numbers are impressive and sounds very promising!

Still the sixel data bytes will need special care, and better get moved over to SIMD paint in aligned batches.

Yeah this is what I have in my mind too. For the next round, my strategy will be to focus on batching single sixels with color information (map them), without modifying the original loop much, then flush them at once using SIMD where count % 4 == 0. I'd like to see the difference for better or worse.
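That batching step could be sketched like this (scalar sketch with hypothetical names; the memcpy into flushed_bits merely stands in for the actual SIMD paint):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical batching of single sixels: collect up to four
   (sixel, color) pairs, flush as one 4-wide paint once the window
   is full. `flushed_bits` records what the SIMD paint would receive. */
static uint8_t  batch_bits[4];
static uint32_t batch_color[4];
static int      batch_count;
static int      flushes;
static uint8_t  flushed_bits[4];

static void flush_batch(void) {
  if (!batch_count) return;
  memcpy(flushed_bits, batch_bits, sizeof flushed_bits);  /* simd_paint stand-in */
  ++flushes;
  batch_count = 0;
}

static void put_sixel(uint8_t bits, uint32_t color) {
  batch_bits[batch_count]  = bits;
  batch_color[batch_count] = color;
  if (++batch_count == 4)   /* count % 4 == 0: hand over to SIMD */
    flush_batch();
}
```

A real decoder would additionally call flush_batch on CR/LF, color changes and cursor jumps, as described earlier in the thread.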


jerch avatar jerch commented on May 30, 2024

Did some thinking/modelling of parsing with SIMD. My idea is currently the following:

input          ['#n!nnsss#nnss#nns#nss...']
load as epi8   ['#n!nnsss#nnss#nn']
slice singles  color:['n'] compression:['nnsss'] color:['nnss'] color:['nn']

These are "typed" fragments containing:

  • color /color definition (+ sixel bytes)
  • compression (+ sixel bytes)
  • pure sixel bytes (as continuation of previous vector)

The first fragment might be continuing from the previous vector, thus needs some sort of state carry. In the same sense the last fragment might not be finished yet. Not sure yet how to deal with that efficiently.

All other fragments are "final", thus have proper outer state boundaries and can be processed atomically. The main problem I currently have here is to efficiently identify whether a fragment has follow-up sixel bytes or not. In SIMD that could be done as follows:

__m128i data = _mm_loadu_si128((const __m128i *) fragment);
__m128i lesser_63 = _mm_cmplt_epi8(data, _mm_set1_epi8(63));
__m128i lesser_127 = _mm_cmplt_epi8(data, _mm_set1_epi8(127));
__m128i sixels_marked = _mm_andnot_si128(lesser_63, lesser_127);

While this works, it creates a rather big penalty, and is not yet helpful in the later sixel processing.
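As a side note, the scalar version of the same 63..126 membership test collapses to a single unsigned compare (the same trick as the dword(c - '0') digit checks used elsewhere in the thread); a sketch:

```c
#include <assert.h>

/* Branchless check for a sixel data byte (codes 63..126):
   the subtraction wraps around for anything below 63, so one
   unsigned compare replaces the two-sided range test. */
static inline int is_sixel_byte(unsigned char c) {
  return (unsigned)(c - 63) < 64u;
}
```

This keeps the hot byte-by-byte loop free of a second compare and branch per byte.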

And there is another problem with that fixed fragment idea - the biggest atomic directive the sixel format can have is a non-disturbed color definition like:

#nnnn;2;xxx;xxx;xxx

This can only be processed in 128 bit SIMD in one go if we limit the register numbering to one digit, bummer. Means at least for this one it is not possible without looping again.
Furthermore sixel is a very relaxed format, thus any data can be disturbed by NOOP chars. Means we either need a pre-cleanup step, or have to do some sanitizing on the fly. Both are rather expensive. Also not sure yet how to deal with that.


jerch avatar jerch commented on May 30, 2024

Modified my top-level parser a bit, which makes it slower (at 605 MB/s native), but now also precalculates the sixel byte offsets:

void decode(int length) {
  __m128i LF = _mm_set1_epi8(45);
  __m128i CR = _mm_set1_epi8(36);
  __m128i COLOR = _mm_set1_epi8(35);
  __m128i COMP = _mm_set1_epi8(33);

  int l = length / 16 * 16;
  int rem = length - l;
  for (int i = 0; i < l; i += 16) {
    __m128i part = _mm_load_si128((__m128i *) &ps.chunk[i]); // this is actually very slow in wasm!
    int start = i;
    int end = i;

    // test for singles: CR, LF, COMPRESSION and COLOR introducer
    __m128i testLF = _mm_cmpeq_epi8(part, LF);
    __m128i testCR = _mm_cmpeq_epi8(part, CR);
    __m128i testCRLF = _mm_or_si128(testLF, testCR);

    __m128i testCOLOR = _mm_cmpeq_epi8(part, COLOR);
    __m128i testCOMP = _mm_cmpeq_epi8(part, COMP);
    __m128i testCOLORCOMP = _mm_or_si128(testCOLOR, testCOMP);

    __m128i testSINGLE = _mm_or_si128(testCRLF, testCOLORCOMP);
    int hasSINGLE = _mm_movemask_epi8(testSINGLE);

    // identify sixels
    __m128i lesser_63 = _mm_cmplt_epi8(part, _mm_set1_epi8(63));
    __m128i lesser_127 = _mm_cmplt_epi8(part, _mm_set1_epi8(127));
    __m128i sixels_marked = _mm_andnot_si128(lesser_63, lesser_127);
    int sixelPos = _mm_movemask_epi8(sixels_marked);
    
    while (hasSINGLE) {
      // get LSB: __builtin_ctz (better with _tzcnt_u32, but not supported in wasm)
      int adv = __builtin_ctz(hasSINGLE);
      end = i + adv;
      if (end - start) parse_fragment(start, end, i + (sixelPos ? __builtin_ctz(sixelPos) : 16));
      handle_single(i + adv, ps.chunk[i + adv]);
      start = end + 1;
      hasSINGLE &= ~(1 << adv);
      sixelPos &= ~((1 << adv) - 1);
    }
    end = i + 16;
    if (end - start) parse_fragment(start, end, (sixelPos ? __builtin_ctz(sixelPos) : 16) + i);
  }

  // TODO: rem handling...
}

handle_single does the top level state switching for LF, CR, compression and color. This is pretty expensive due to compression being handled here as well (throughput dropped from 2 GB/s to 800 MB/s by that), might need to handle that separately, not sure yet. (There is one thing that would support this idea: simd_paint does not need to flush output at compression borders, if the 4-pixel area is not yet full. Currently I cannot spot that from the fragment borders alone, leading to superfluous paint calls.)

Now parse_fragment gets its own start and end offsets, and the start index of the next sixel data (which might be in fragment range or above). With this precalc'ed sixel byte offsets I hope I can do faster aligned sixel-pixel writes. No clue yet if the first part of a fragment (color/compression numbers) should be done in SIMD as well, prolly yes.

Edit: Fixed lousy do-while loop in code.


jerch avatar jerch commented on May 30, 2024

@ismail-yilmaz The simd_paint code above is incomplete, it needs an additional ~bitmask operation on old before mixing, otherwise the OR'ing will blend with a previously set color (eg. from fillColor).


ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

@jerch Just to be clear: I can't comment or give feedback much these days because of my job right now.

I am reading your findings and really appreciate your hard work. Thus I think we should later gather these findings for a recommendation guide. Also maybe, just maybe, we can later use these findings to create a reference SIMD optimized encoder and decoder.

(Btw. Last night I have started to implement the parser in SIMD but now it has to wait for the next week until I can be free to finish & test it and give you some feedback. )


jerch avatar jerch commented on May 30, 2024

Ah no worries, no need to feel pushed or anything, to me this is mostly optimization for fun. If we get somewhere down to a ref decoder/encoder - awesome, but if not - I don't care much either.
Furthermore SIMD introduces machine dependencies, and I'm not even sure how that wasm thingy will do on an ARM (whether they abstracted/shimmed things away).


jerch avatar jerch commented on May 30, 2024

One more idea to further lower half-filled paint calls: by stacking the colors in a ringbuffer[4], we could lower paint calls to real 4-pixel progression (including CRLF jumps). Not sure if this will show a real benefit (or even run worse because of the ringbuffer overhead); as far as I know most encoder libs do a CR reset on a color change, not mixing colors from the current line cursor position onwards. Well, would need some investigation on typical sixel encoding schemes...


jerch avatar jerch commented on May 30, 2024

Started to wonder about a simple thing like the RGB color conversion, thus went on and tried SIMD versions of it:

int normalize_rgb(int r, int g, int b) {
  return 0xFF000000 | ((b * 255 + 99) / 100) << 16 | ((g * 255 + 99) / 100) << 8 | ((r * 255 + 99) / 100);
}

int normalize_rgb_simd_int(int r, int g, int b) {
  // algo: ((x * 255 + 99) * 0xA3D7 + 0x8000) >> 22
  __m128i reg = _mm_set_epi32(r, g, b, 100);
  reg = _mm_mullo_epi32(reg, _mm_set1_epi32(255));
  reg = _mm_add_epi32(reg, _mm_set1_epi32(99));
  reg = _mm_mullo_epi32(reg, _mm_set1_epi32(0xA3D7));
  reg = _mm_add_epi32(reg, _mm_set1_epi32(0x8000));
  reg = _mm_srli_epi32(reg, 22);
  __m128i result = _mm_shuffle_epi8(reg, _mm_set_epi8(
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x00, 0x04, 0x08, 0x0C
  ));
  return _mm_cvtsi128_si32(result);
}

int normalize_rgb_simd_float(int r, int g, int b) {
  __m128 reg = _mm_set_ps(r, g, b, 100);
  reg = _mm_mul_ps(reg, _mm_set1_ps(2.55f));
  reg = _mm_round_ps(reg, _MM_FROUND_TO_NEAREST_INT);
  __m128i result = _mm_cvtps_epi32(reg);
  result = _mm_shuffle_epi8(result, _mm_set_epi8(
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x00, 0x04, 0x08, 0x0C
  ));
  return _mm_cvtsi128_si32(result);
}

and here is the asm for those (gcc 11) with cycle counts in front (taken from the Intel optimization guide):

  normalize_rgb(int, int, int):
1         mov     r8d, edx
1         mov     edx, esi
1         sal     edx, 8
1         sub     edx, esi
1         add     edx, 99
1         movsx   rax, edx
1         sar     edx, 31
3         imul    rax, rax, 1374389535
1         sar     rax, 37
1         sub     eax, edx
1         mov     edx, edi
1         sal     edx, 8
1         sal     eax, 8
1         sub     edx, edi
1         add     edx, 99
1         movsx   rcx, edx
1         sar     edx, 31
3         imul    rcx, rcx, 1374389535
1         sar     rcx, 37
1         sub     ecx, edx
1         or      eax, ecx
1         mov     ecx, r8d
1         sal     ecx, 8
1         sub     ecx, r8d
1         add     ecx, 99
1         movsx   rdx, ecx
1         sar     ecx, 31
3         imul    rdx, rdx, 1374389535
1         sar     rdx, 37
1         sub     edx, ecx
1         sal     edx, 16
1         or      eax, edx
1         or      eax, -16777216
39        ret

 normalize_rgb_simd_int(int, int, int):
1         mov     eax, 100
1         movd    xmm0, esi
1         movd    xmm1, eax
2         pinsrd  xmm0, edi, 1
2         pinsrd  xmm1, edx, 1
1         punpcklqdq      xmm1, xmm0
1         movdqa  xmm0, xmm1
1         pslld   xmm0, 8
1         psubd   xmm0, xmm1
1         paddd   xmm0, XMMWORD PTR .LC0[rip]
10        pmulld  xmm0, XMMWORD PTR .LC1[rip]
1         paddd   xmm0, XMMWORD PTR .LC2[rip]
1         psrld   xmm0, 22
1         pshufb  xmm0, XMMWORD PTR .LC3[rip]
1         movd    eax, xmm0
26        ret

 normalize_rgb_simd_float(int, int, int):
1         pxor    xmm2, xmm2
1         pxor    xmm1, xmm1
1         pxor    xmm3, xmm3
1         movss   xmm0, DWORD PTR .LC4[rip]
5         cvtsi2ss        xmm2, edx
5         cvtsi2ss        xmm1, esi
5         cvtsi2ss        xmm3, edi
1         unpcklps        xmm0, xmm2
1         unpcklps        xmm1, xmm3
1         movlhps xmm0, xmm1
5         mulps   xmm0, XMMWORD PTR .LC5[rip]
6         roundps xmm0, xmm0, 0
3         cvtps2dq        xmm0, xmm0
1         pshufb  xmm0, XMMWORD PTR .LC3[rip]
1         movd    eax, xmm0
38        ret

Performance: the normal version wins, simd_int is 2% slower, simd_float is 5% slower. This is somewhat weird when only looking at the cycle counts, but makes sense when taking parallel µ-ops into account. This will change if the channel values are not provided from normal registers but are already in SIMD (the alignment and preparation steps are quite expensive currently). Comparing simd_int to simd_float, the int variant for the "inner algo" is one cycle faster (15 vs. 16 cycles), even with the nasty pmulld.

Edit: the float version drops down to 20 cycles if already fed with floats (and currently wins the race in the decoder). Still think the int variant will be the winner in the end, as the values are not meant to be floats. I really wonder if I could get rid of that pmulld somehow...
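The magic constant pair can be checked exhaustively against the true division; a quick verification sketch in plain C, covering every n = x * 255 + 99 the decoder can produce (x in 0..100, so n <= 25599) and all smaller values too:

```c
#include <assert.h>
#include <stdint.h>

/* Fixed-point replacement for n / 100 as used in normalize_rgb_simd_int:
   0xA3D7 = round(2^22 / 100), and +0x8000 compensates the truncation
   error of the rounded-down reciprocal. Exact for all n in 0..25599. */
static int div100_magic(uint32_t n) {
  return (int)((n * 0xA3D7u + 0x8000u) >> 22);
}
```

Since the intermediate product stays below 2^31 for this input range, the same arithmetic is safe in 32-bit SIMD lanes.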


jerch avatar jerch commented on May 30, 2024

Did some more local changes to my byte-to-byte loop (mostly introducing inner loops with less conditions):

      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 4.85 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 127.80 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 2.75 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 116.20 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 4.41 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 147.54 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 84.26 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 184.07 MB/s

This gets me to 39-40 MB/s synchronous in xterm.js. I think I'm gonna stop with optimizations here, as the numbers in xterm.js start to look funny (one run of sixel-bench):
[screenshot: xterm.js profile of a sixel-bench run]

(The utf8 conversion does no more than a uint8 --> uint32 padding expansion (xterm.js works with utf32 internally), and the sequence parser already has a fast inner loop with only 3 conditions to check for DCS data. I hate JS for being so slow on such simple tasks. Would not have happened with wasm.)


ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

@jerch ,

This gets me to 39-40 MB/s synchronous in xterm.js.

That sounds like a very good number for a JS sixel decoder (or so I assume, because the only one I know in the wild is yours).

I'll take your findings and try moving them forward on a test setup, starting this Monday, as I will have a window to work on something fun (we'll have a national holiday season this whole week). Will post my findings on x86_64. So, stay tuned. :)


jerch avatar jerch commented on May 30, 2024

Latest numbers:

      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 4.68 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 134.27 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 2.58 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 124.60 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 4.79 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 142.69 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 77.71 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 199.62 MB/s

While I missed my >130 MB/s goal with "test2_clean.sixel" (not even sure why), I got a nice boost on the fullHD noise image, which is now at 12 fps. That's still very low; on the other hand the image is really hard to chew on with its 4096 colors and the single pixel addressing (cursor kinda moving from 0 to target position for tons of 1-bit sixels). I have no test data for normal fullHD stuff, but I think it is capable of staying >30 fps for less degenerated data.


jerch avatar jerch commented on May 30, 2024

@ismail-yilmaz Not sure how well you know callgrind, but I have an interesting case which I don't quite understand; maybe you have an idea or can explain what's going on.

Playing around with the byte-by-byte loop I found this loop schematic to be the fastest:

int decode(int length) {
  for (int i = 0; i < length; ++i) {
    int code = ps.chunk[i] & 0x7F;
    switch (ps.state) {
      case ST_DATA:
        while (62 < code && code < 127) {
          ...  // process sixel bytes
          if (++i >= length) break;  // <--------
          code = ps.chunk[i] & 0x7F;
        }
        break;
      case ST_COMPRESSION:
        while (47 < code && code < 58) {
          ...  // process number bytes
          if (++i >= length) break;  // <--------
          code = ps.chunk[i] & 0x7F;
        }
        break;
    }
  }
}

It basically sub-loops at all positions, where multiple occurrences are possible skipping the outer loop and switch statement. Ofc now I need those inner break conditions on length (marked with <-----).

Now looking into callgrind I see many instructions being burnt on those length-break checks. Thus I went further and restructured the break conditions as outer fall-through:

int decode(int length) {
  ps.chunk[length] = 0xFF;  // new fall-through break condition in data
  for (int i = 0; i < length; ++i) {
    int code = ps.chunk[i] & 0x7F;
    switch (ps.state) {
      case ST_DATA:
        while (62 < code && code < 127) {
          ...  // process sixel bytes
          code = ps.chunk[++i] & 0x7F;
        }
        break;
      case ST_COMPRESSION:
        while (47 < code && code < 58) {
          ...  // process number bytes
          code = ps.chunk[++i] & 0x7F;
        }
        break;
    }
  }  // 0x7F in code will reach here as NOOP breaking the for loop eventually on next i < length check
}

This works great and reduces counted instructions for test1_clean.sixel by almost 4%! But runtime did not change at all for native, and in wasm it is even ~2% slower. Any clue what's going on? Is the instruction counter flawed in callgrind? Or is this some sort of pipeline effect, where those excess instructions are hidden as µ-ops not altering the whole pipeline?


ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

@jerch

Or is this some sort of a pipeline effect, where those excess instructions are hidden as uop not altering whole pipeline?

I haven't checked it yet (I will), but it is likely a pipeline effect (on x86, at least):

Remember this? That is exactly how I process parameters (and also optimized my sequence parser) on x86_64. In almost all scenarios the while loop works faster than the for loop for such operations that are likely to loop more than once.

E.g. two styles of reading parameter chunks (size: 95.37 MB):

For loop:

int GetNum1(const String& s)
{
	int n = 0;
	for(const int& c : s) {
		if(dword(c - '0') > 9)
			break;
		 n = n * 10 + (c - '0');
	}
	
	return n;
}

While loop

int GetNum2(const String& s)
{
	const char *p = ~s;
	int c = 0, n = 0;
	while(*p && dword((c = *p++) - '0') < 10)
		n = n * 10 + (c - '0');

	return n;
}

Timing (20 rounds, O3):

TIMING GetNum2 (while):  2.75 s  - 137.50 ms ( 2.75 s  / 20 ), min: 135.00 ms, max: 141.00 ms, nesting: 0 - 20
TIMING GetNum1 (for)    :  2.94 s  - 147.20 ms ( 2.94 s  / 20 ), min: 144.00 ms, max: 151.00 ms, nesting: 0 - 20


ismail-yilmaz avatar ismail-yilmaz commented on May 30, 2024

The problem with this approach, in my experience at least, is that it performs worse with single sixel paints (branch misses tend to be high, and winding/unwinding the stack for the loop at that "hot" point has a greater cost than its potential gains).


jerch avatar jerch commented on May 30, 2024

The problem with this approach, in my experience at least, is that it performs worse with single sixel paints (branch misses tend to be high and winding/unwinding the stack for the loop at that point has a greater cost than its gain).

Yes, that's true, but the image triggers that case only 157 times. Overall the sub loops are much faster in wasm (not so much in native). Ofc the stack unwinding back from the inner switch case might be much more expensive here (br_table moves quite some stuff on the stack in wasm). Well, I don't have something like callgrind for wasm, thus can only profile things indirectly.

Nonetheless will try to reshape the outer thing into a while loop as well. It is a bit annoying that I cannot use pointer stuff that easily in wasm, no clue why I get such a bad penalty for using those.


jerch avatar jerch commented on May 30, 2024

@ismail-yilmaz Note there is an issue with your while loop above - your loop condition tests for *p not being zero. Imho you cannot assume null termination here. Although NULL in terminal data is a NOOP for all states I know, it does not mark end of data by any means. And the parser from vt100.net would pass NULL along, thus it has to be handled by the subparser (sixel decoder here).

(Some background on this: NULL was used by some very early terminals as "time padding bytes" (give the device some time to keep up), thus it is theoretically possible to see cascades of NULLs in old data.)


jerch avatar jerch commented on May 30, 2024

Another optimization idea I had while looking at callgrind metrics: currently simd_paint puts colors for bits into final canvas target cells, which creates a rather high cache miss rate for the load/store ops (~7%; the reason is the massive cell offset created by the 6 pixel progression in y-direction and the loading/blending with previous colors). This could be avoided by doing a local construction with contiguous writes:

input sixel: [a, b, c, d]
local canvas line writes: [[a1, b1, c1, d1], ..., [a6, b6, c6, d6]]

and a final copy step on LF:

memcpy: *[a1-d1] --> canvas_offset_1
...
memcpy: *[a6-d6] --> canvas_offset_6

This has several advantages in terms of cache and SIMD utilisation:

  • load/blend/write dance during construction has much better cache locality due to contiguous mem access
  • writes are guaranteed to be 4 * 32bit aligned (good for simd load/store)
  • final copy step is still cache-unfriendly, but can be done in steps with high cache utilization (e.g. in 16 pixel progression)
  • final copy step is a "write-through", no loading/blending needed anymore

Ofc this has a major drawback: the additional copy step itself, which might just be more expensive than the savings achieved from better cache utilization. Remains uncertain until actually tried.
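The first variant, sketched as code (hypothetical names, 32bpp canvas, BAND_WIDTH chosen arbitrarily): writes during the band go into six contiguous row buffers, and the LF flush degrades to plain write-through copies:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BAND_WIDTH 4096  /* hypothetical max line width */

/* Six contiguous row buffers for the current sixel band. Pixel writes
   during decoding stay cache-local; the flush on LF is a pure
   write-through, no load/blend with previous canvas content needed. */
static uint32_t band[6][BAND_WIDTH];

static void flush_band(uint32_t *canvas, int canvas_width, int y_top, int used) {
  for (int row = 0; row < 6; ++row)
    memcpy(&canvas[(size_t)(y_top + row) * canvas_width], band[row],
           (size_t)used * sizeof(uint32_t));
  memset(band, 0, sizeof band);   /* band is reused after the LF */
}
```

The flush could additionally be chunked (e.g. 16 pixels per row at a time) to keep the destination lines warm in cache, as described above.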

Another similar idea would be to spread sixel bits into SIMD registers (horizontal spreading), instead of the current shift-or looping (vertical spreading):

input sixel: [a, b, c, d]
local canvas line writes: [[a1, a2, a3, a4], [a5, a6, b1, b2], ..., [d3, d4, d5, d6]]

Well, I don't expect that to be any faster, for 3 reasons:

  • bit spreading is quite cumbersome with SIMD (needs several shuffle ops)
  • 6 bits of sixel don't align well with 4 * 32bit in registers - needs branching/looping over [a1-a4], [a5-a6, b1-b2], [b3-b6] writes with an expensive construction step for the interleaved 2nd case
  • final copy step gets really awkward and prolly really expensive (has to extract/shuffle bit positions into contiguous order)

Not sure if I will try any of these ideas, they need quite some canvas construction refactoring, before they would work at all...


jerch avatar jerch commented on May 30, 2024

Some more microoptimizations here and there:

      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 4.16 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 149.45 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 2.42 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 132.83 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 4.05 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 161.45 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 67.41 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 230.24 MB/s

Also tested against 5 fullHD screenshots; they all run at 40-50 FPS, and the degraded noise example is now at 15 FPS.

Furthermore I partially tested the first idea of my last post - well, it is not worth the trouble. A cache optimized load/store in simd_paint gives only a 3% boost, but gets more than eaten by the final canvas construction/copy, which suffers from the same cache problem (less often, but still taking >5%).

Will push the code once I cleaned it up a bit.


jerch commented on May 30, 2024

Extended the benchmark script with more image relevant metrics to get some more explanation of bottlenecks:

      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 50 runs - average runtime: 3.68 ms
         Case "test1_clean.sixel" : 50 runs - average throughput: 164.39 MB/s
          --> image throughput { FPS: '271.50', PPS: '250.21 M', pixelWrite: '1.00 GB/s' }
         Case "test2_clean.sixel" : 50 runs - average runtime: 2.15 ms
         Case "test2_clean.sixel" : 50 runs - average throughput: 148.45 MB/s
          --> image throughput { FPS: '465.29', PPS: '104.46 M', pixelWrite: '417.86 MB/s' }
         Case "sampsa_reencoded_clean.six" : 50 runs - average runtime: 3.60 ms
         Case "sampsa_reencoded_clean.six" : 50 runs - average throughput: 181.07 MB/s
          --> image throughput { FPS: '277.93', PPS: '305.00 M', pixelWrite: '1.22 GB/s' }
         Case "FullHD 12bit noise" : 50 runs - average runtime: 64.36 ms
         Case "FullHD 12bit noise" : 50 runs - average throughput: 240.98 MB/s
          --> image throughput { FPS: '15.54', PPS: '32.22 M', pixelWrite: '128.87 MB/s' }
         Case "640x480 9bit tiles" : 50 runs - average runtime: 0.67 ms
         Case "640x480 9bit tiles" : 50 runs - average throughput: 151.24 MB/s
          --> image throughput { FPS: '1494.92', PPS: '459.24 M', pixelWrite: '1.84 GB/s' }
         Case "FullHD 1" : 50 runs - average runtime: 20.36 ms
         Case "FullHD 1" : 50 runs - average throughput: 166.72 MB/s
          --> image throughput { FPS: '49.11', PPS: '101.84 M', pixelWrite: '407.36 MB/s' }
         Case "FullHD 2" : 50 runs - average runtime: 22.69 ms
         Case "FullHD 2" : 50 runs - average throughput: 167.93 MB/s
          --> image throughput { FPS: '44.07', PPS: '91.38 M', pixelWrite: '365.50 MB/s' }

FPS is frames per second (here full images decoded per second), PPS is pixels per second, and pixelWrite the effective pixel write memory throughput.
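For reference, all three metrics derive directly from the measured runtime and the image dimensions (a small sketch, assuming RGBA8888 output, i.e. 4 bytes per pixel, and decimal MB):

```python
def image_metrics(runtime_ms, width, height, bytes_per_pixel=4):
    """Derive FPS, PPS and effective pixel write throughput (bytes/s)."""
    fps = 1000.0 / runtime_ms
    pps = fps * width * height
    pixel_write = pps * bytes_per_pixel
    return fps, pps, pixel_write

# "FullHD 12bit noise": 64.36 ms average runtime at 1920x1080
fps, pps, pixel_write = image_metrics(64.36, 1920, 1080)
# -> ~15.54 FPS, ~32.22 M PPS, ~128.87 MB/s
```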

To make sense of these numbers, we need to know some facts about the images itself:

  • real world images:
    • test1: palette 256, RGB encoded, no transparency, dimensions in 4^n, encoder node-sixel
    • test2: palette 256, RGB encoded, no transparency, dimensions not in 4^n, encoder node-sixel
    • sampsa: palette 256, RGB encoded (monochrome), no transparency, dimensions in 4^n, encoder node-sixel
    • FullHD 1 & 2: palette 256, RGB encoded, no transparency, dimensions in 4^n, encoder img2sixel
  • constructed images:
    • worst case - FullHD 12bit noise: palette 4096, RGB encoded, no transparency, dimensions in 4^n, degraded with tons of CR repositioning and single sixel bits
    • best case - 640x480 9bit tiles: palette 512, RGB encoded, no transparency, dimensions in 4^n, compressed sixels with high bit density (10x10 color tiles)

Interpretation:

  • Real world images that did not see special care seem to settle around ~100M PPS or 400 MB/s effective pixel write throughput (that is "test2", "FullHD 1", "FullHD 2"). Still sounds rather lowish to me.
  • "test1" runs much better than the other real images, maybe some sort of over-optimization (it is my standard profiling image).
  • No clue yet why "sampsa" runs that much better (the only difference is it being monochrome, but the RGB converter does not special-case that).
  • "test2" dimensions are not 4^n aligned, which leads to unaligned SIMD access. While I suspected this to be a major bottleneck, it only accounts for -5% speed on "test2" (tested with an aligned version of that image).
  • The images all show the same cache behavior (4-5% miss rate); the results are not affected by different cache behavior for different image sizes (tested up to 1920x1080). This was further confirmed by tests with pixel writes against a fixed zero offset - the throughput only changes within ±3%. I expect a big perf dent for very "wide" images, where 6 * RGBA8888 * image_width will overflow the L3 cache size / invalidate often (not tested). But since most CPUs have at least 2MB cache these days, this will not happen for normally shaped images. (Ofc there is another cache drain step at the L1/L2 border, which will be triggered much earlier, somewhere around 600 - 1300px width; did not yet test for that.) To make it short - the cache-unfriendly sixel format is not a problem on the L2/L3 vs. DRAM level.
  • pixelWrite varies highly across the images, while the input throughputs are pretty close. This indicates that the input loop and the preparation are the main limiting factor, not the pixel writes (we already knew that). It also reveals that, while "worst case" has the highest absolute input throughput, it really sucks in pixel writes. Not sure yet if the degraded CR + empty sixel jumps can be further optimized. This image also has the worst sixel bit density with mainly 1 bit per sixel byte. The 4-pixel SIMD write cannot help here much either, as a color change is very likely for every sixel, which leads to more "dummy paints" with excess 4-pixel load/store cycles - theoretically up to 4 times more often for single-color single-pixel addressing (have not analysed the data; at least the PPS numbers suggest a degradation of around 3x).
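The excess load/store effect from the last point can be modeled with a toy counter (a hypothetical sketch, under the assumption that every color change forces a flush of the current 4-pixel lane):

```python
def store_cycles(colors, lane_width=4):
    """Count 4-pixel load/store cycles for a stream of per-pixel colors,
    assuming each color change flushes the current partially filled lane."""
    cycles = 0
    lane_fill = 0
    prev = None
    for c in colors:
        if prev is not None and c != prev and lane_fill:
            cycles += 1          # flush partial lane ("dummy paint")
            lane_fill = 0
        lane_fill += 1
        if lane_fill == lane_width:
            cycles += 1          # full lane written
            lane_fill = 0
        prev = c
    if lane_fill:
        cycles += 1              # final flush
    return cycles

print(store_cycles([7] * 16))     # long single-color run: 4 cycles
print(store_cycles([1, 2] * 8))   # color change per pixel: 16 cycles (4x)
```

Under this model, per-pixel color changes cost the theoretical 4x more lane cycles than long single-color runs, in line with the ~3x degradation the PPS numbers suggest.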

Summary:
These new numbers do not help as much as I had hoped for the decoder alone; they mostly confirm what we already knew - the input consumption being slow, especially for low sixel bit density. But they might come in handy for an encoder to try different strategies, and maybe speed up the decoder indirectly by outputting more decoder-friendly data.


ismail-yilmaz commented on May 30, 2024

Hello @jerch ,

Thanks for the update! I am reading your posts, passively. As you prolly noted, I've taken a short break after a long and very exhausting season. I'll be back and try to implement + continue testing + reporting back as usual from Aug 29 on.


jerch commented on May 30, 2024

Trying to get a hold of the cache effects - with a very simple autogenerated image across various widths I see the following behavior in my benchmark:

      Context "Cache"
         Case "16" : 100 runs - average throughput: 49.26 MB/s
         Case "64" : 100 runs - average throughput: 95.27 MB/s
         Case "128" : 100 runs - average throughput: 116.08 MB/s
         Case "256" : 100 runs - average throughput: 111.16 MB/s
         Case "512" : 100 runs - average throughput: 106.59 MB/s
         Case "1024" : 100 runs - average throughput: 106.55 MB/s
         Case "4096" : 100 runs - average throughput: 93.51 MB/s
         Case "16384" : 100 runs - average throughput: 90.70 MB/s

It seems the throughput peaks somewhere around 128px width and shows a bigger drop between 1024 and 4096. Probing these areas in more detail reveals that the throughput peaks between 120-180px with 125 MB/s max, and that around 1200-1400px the throughput suddenly drops from ~107 to ~95 MB/s.

The drop around 1300px is pretty close to my L1 cache size of 32kB, which strongly suggests that the cache behavior changes here. I have no good explanation for the lower peak yet; it simply might be a summation effect of overall buffer/cache utilization. Furthermore my tests are partially skewed by GC actions, which might show up non-deterministically (might be the reason for the rather big ranges across the runs). Trying to test cache behavior from JS is def. not a good idea 😅.

(Btw very small images <64px width are much slower in throughput; I guess the function call overhead for off-copying pixels gets significant here.)
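The ~1300px threshold matches a simple back-of-the-envelope estimate: the hot working set of one sixel band is 6 RGBA canvas lines, which outgrows a 32kB L1 data cache at 1365px width (assumed cache size; the real threshold also depends on what else occupies L1):

```python
L1_BYTES = 32 * 1024      # assumed L1 data cache size
BYTES_PER_PIXEL = 4       # RGBA8888
BAND_LINES = 6            # one sixel band spans 6 canvas lines

def band_footprint(width):
    """Bytes touched by the 6-line working set of one sixel band."""
    return BAND_LINES * width * BYTES_PER_PIXEL

max_l1_width = L1_BYTES // (BAND_LINES * BYTES_PER_PIXEL)
print(max_l1_width)  # -> 1365
```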


jerch commented on May 30, 2024

The new line-based decoder opened the field for further optimizations, which kinda makes the SIMD attempts obsolete under wasm.

  • line-based w/o SIMD:
        Context "decode - testfiles (WasmDecoder)"
           Case "test1_clean.sixel" : 50 runs - average throughput: 158.07 MB/s
           Case "test2_clean.sixel" : 50 runs - average throughput: 157.47 MB/s
           Case "sampsa_reencoded_clean.six" : 50 runs - average throughput: 156.20 MB/s
           Case "FullHD 12bit noise" : 50 runs - average throughput: 325.19 MB/s
           Case "640x480 9bit tiles" : 50 runs - average throughput: 146.03 MB/s
           Case "FullHD 1" : 50 runs - average throughput: 172.81 MB/s
           Case "FullHD 2" : 50 runs - average throughput: 180.82 MB/s
    
  • big blob with SIMD:
        Context "decode - testfiles (WasmDecoder-simd)"
           Case "test1_clean.sixel" : 50 runs - average throughput: 164.39 MB/s
           Case "test2_clean.sixel" : 50 runs - average throughput: 148.45 MB/s
           Case "sampsa_reencoded_clean.six" : 50 runs - average throughput: 181.07 MB/s
           Case "FullHD 12bit noise" : 50 runs - average throughput: 240.98 MB/s
           Case "640x480 9bit tiles" : 50 runs - average throughput: 151.24 MB/s
           Case "FullHD 1" : 50 runs - average throughput: 166.72 MB/s
           Case "FullHD 2" : 50 runs - average throughput: 167.93 MB/s
    

It seems SIMD is not yet that usable and well optimized in wasm, thus I will stick with the generic version for now, which also covers more browsers. Note that for x86-native the sixel SIMD decoder runs "test1_clean.sixel" at 220 MB/s (~40% faster).

offtopic:
Kinda made an even worse observation with my base64 lib, where under wasm SIMD actually runs at half the speed of the generic version, while under x86-native it is 5x faster. Well, the base64 decoder in SIMD is very stack-var demanding with cumbersome access needs, which hurts a lot more on a stack machine like wasm, while a register machine can carry those values along in registers.


jerch commented on May 30, 2024

@ismail-yilmaz FYI - pushed my latest wasm code to the PR: jerch/node-sixel#20.

