Is there an inherent reason for that?
The reason is that I don't use x86 (32-bit) or MSVC day-to-day, so I was conservative. We could possibly change it to
#if defined(_M_X64) || defined(_M_IX86)
although some care might be needed if other code (PNG-related or otherwise) depends on the 64-bit-ness that WUFFS_BASE__CPU_ARCH__X86_64 currently guarantees.
Out of curiosity, and if profiling is not too much work for you, do you know which Wuffs functions get much faster once you define WUFFS_BASE__CPU_ARCH__X86_64 for your 32-bit MSVC-compiled program?
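A minimal sketch of the broader check, with the 64-bit question split out into its own macro so that nothing has to infer pointer width from the architecture macro. The EXAMPLE_* names and example_arch_name are hypothetical, not Wuffs's own. _M_X64 and _M_IX86 are MSVC's predefined macros; __x86_64__ and __i386__ are the GCC/Clang counterparts.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical macro names for illustration; these are not Wuffs's macros. */
#if defined(_M_X64) || defined(__x86_64__)
#define EXAMPLE_ARCH_X86_64 1
#else
#define EXAMPLE_ARCH_X86_64 0
#endif

#if defined(_M_IX86) || defined(__i386__)
#define EXAMPLE_ARCH_X86_32 1
#else
#define EXAMPLE_ARCH_X86_32 0
#endif

/* The broadened check: any x86 target, 32-bit or 64-bit. */
#define EXAMPLE_ARCH_X86_FAMILY (EXAMPLE_ARCH_X86_64 || EXAMPLE_ARCH_X86_32)

/* Code that relied on "x86 SIMD enabled" implying a 64-bit target should
   test the pointer width on its own instead. */
#if UINTPTR_MAX == 0xFFFFFFFFFFFFFFFFu
#define EXAMPLE_POINTERS_ARE_64_BIT 1
#else
#define EXAMPLE_POINTERS_ARE_64_BIT 0
#endif

/* Returns a short label for the compilation target. */
static const char *example_arch_name(void) {
  /* Sanity check: the 32-bit and 64-bit macros should never both be set. */
  assert(!(EXAMPLE_ARCH_X86_64 && EXAMPLE_ARCH_X86_32));
  if (EXAMPLE_ARCH_X86_64) {
    return "x86_64";
  }
  if (EXAMPLE_ARCH_X86_32) {
    return "x86 (32-bit)";
  }
  return "other";
}
```

Under this split, SIMD code paths would be guarded by EXAMPLE_ARCH_X86_FAMILY, while anything that truly needs 64-bit pointers checks EXAMPLE_POINTERS_ARE_64_BIT explicitly.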
Sure, the main differences for me are:
- wuffs_png__decoder__filter_4_distance_4_x86_sse42 being called instead of wuffs_png__decoder__filter_4_distance_4_fallback
- wuffs_png__decoder__filter_4_distance_3_x86_sse42 being called instead of wuffs_png__decoder__filter_4_distance_3_fallback
- wuffs_adler32__hasher__up_x86_sse42 being called instead of wuffs_adler32__hasher__up__choosy_default
- wuffs_crc32__ieee_hasher__up_x86_avx2 being called instead of wuffs_crc32__ieee_hasher__up__choosy_default
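The profile names (for example wuffs_adler32__hasher__up calling wuffs_adler32__hasher__up__choosy_default) suggest a runtime-dispatch pattern: a function pointer starts at a portable default and is switched to a SIMD variant once CPU support is confirmed. The sketch below shows that shape with Adler-32, under stated assumptions: all names are hypothetical; cpu_has_sse42 uses MSVC's __cpuid (SSE4.2 is reported in CPUID leaf 1, ECX bit 20) or GCC/Clang's __builtin_cpu_supports; and the "fast" variant is a scalar stand-in that defers the modulo (zlib's NMAX = 5552 trick), not real SSE4.2 code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#endif

/* Runtime CPU check. SSE4.2 is reported in CPUID leaf 1, ECX bit 20. */
static int cpu_has_sse42(void) {
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
  int info[4];
  __cpuid(info, 1);
  return (info[2] >> 20) & 1;
#elif (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  return __builtin_cpu_supports("sse4.2");
#else
  return 0;
#endif
}

/* Portable default: one modulo per byte. A fresh checksum starts at 1. */
static uint32_t adler32_fallback(uint32_t adler, const uint8_t *p, size_t n) {
  assert(p != NULL || n == 0);
  uint32_t a = adler & 0xFFFF, b = adler >> 16;
  for (size_t i = 0; i < n; i++) {
    a = (a + p[i]) % 65521;
    b = (b + a) % 65521;
  }
  return (b << 16) | a;
}

/* Stand-in for an accelerated variant (NOT real SIMD): defers the modulo
   for up to 5552 bytes, the longest run for which b cannot overflow. */
static uint32_t adler32_deferred(uint32_t adler, const uint8_t *p, size_t n) {
  assert(p != NULL || n == 0);
  uint32_t a = adler & 0xFFFF, b = adler >> 16;
  while (n > 0) {
    size_t chunk = n < 5552 ? n : 5552;
    for (size_t i = 0; i < chunk; i++) {
      a += p[i];
      b += a;
    }
    p += chunk;
    n -= chunk;
    a %= 65521;
    b %= 65521;
  }
  return (b << 16) | a;
}

typedef uint32_t (*adler32_fn)(uint32_t adler, const uint8_t *p, size_t n);

/* Starts at the portable default, so calls made before adler32_choose()
   still return correct results. */
static adler32_fn adler32_up = adler32_fallback;

/* Switches to the faster variant only if the CPU qualifies. */
static void adler32_choose(void) {
  if (cpu_has_sse42()) {
    adler32_up = adler32_deferred;
  }
}
```

Call adler32_choose() once at startup; afterwards adler32_up(1, data, len) computes the checksum with whichever variant was selected. Both variants must produce identical results, which is what makes the swap safe.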
The filter_4_distance_4 function takes the majority of the time for me, so making it faster is the main win. Perhaps I should have mentioned that I need RGBA output; I suppose that determines which filter function is necessary.
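For context on why this one function dominates: in PNG numbering, filter type 4 is Paeth, and the "distance" in these names appears to be the filter's bytes-per-pixel stride, which comes from the encoded image (4 for 8-bit RGBA, 3 for 8-bit RGB). Below is a minimal scalar sketch of Paeth unfiltering written from the PNG specification, not from Wuffs's code; paeth_predictor and unfilter_paeth are hypothetical names. Each byte's left neighbor is the byte decoded one pixel earlier, so the loop carries a dependency from pixel to pixel, and a SIMD version can mainly parallelize across the channels within one pixel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Paeth predictor from the PNG specification: pick whichever of left (a),
   up (b) or upper-left (c) is closest to a + b - c; ties favor a, then b. */
static uint8_t paeth_predictor(uint8_t a, uint8_t b, uint8_t c) {
  int p = (int)a + (int)b - (int)c;
  int pa = abs(p - (int)a);
  int pb = abs(p - (int)b);
  int pc = abs(p - (int)c);
  if (pa <= pb && pa <= pc) {
    return a;
  }
  if (pb <= pc) {
    return b;
  }
  return c;
}

/* Reverses filter type 4 in place on one row of len bytes.
   prev is the previous reconstructed row, or NULL for the first row
   (treated as all zeros). distance is bytes per pixel, at least 1. */
static void unfilter_paeth(uint8_t *curr, const uint8_t *prev, size_t len,
                           size_t distance) {
  assert(distance >= 1);
  for (size_t i = 0; i < len; i++) {
    uint8_t a = (i >= distance) ? curr[i - distance] : 0;         /* left */
    uint8_t b = prev ? prev[i] : 0;                               /* up */
    uint8_t c = (prev && i >= distance) ? prev[i - distance] : 0; /* upper-left */
    /* Addition wraps modulo 256, as the specification requires. */
    curr[i] = (uint8_t)(curr[i] + paeth_predictor(a, b, c));
  }
}
```

With distance = 4, curr[i - 4] must already be reconstructed before byte i can be, which is the serial chain the scalar fallback walks one byte at a time.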
Here is an example profile (with only Wuffs functions listed), before adding || defined(_M_IX86):
| Function Name | Total CPU [unit, %] | Self CPU [unit, %] |
|---|---|---|
| wuffs_png__decoder__decode_frame | 31681 (75,78 %) | 1 (0,00 %) |
| wuffs_png__decoder__filter_and_swizzle | 19626 (46,95 %) | 0 (0,00 %) |
| wuffs_png__decoder__filter_and_swizzle__choosy_default | 19619 (46,93 %) | 30 (0,07 %) |
| wuffs_png__decoder__filter_4 | 18324 (43,83 %) | 9 (0,02 %) |
| wuffs_png__decoder__filter_4_distance_4_fallback | 16389 (39,20 %) | 16387 (39,20 %) |
| wuffs_png__decoder__decode_pass | 12053 (28,83 %) | 1 (0,00 %) |
| wuffs_zlib__decoder__transform_io | 11598 (27,74 %) | 3 (0,01 %) |
| wuffs_deflate__decoder__transform_io | 10147 (24,27 %) | 0 (0,00 %) |
| wuffs_deflate__decoder__decode_blocks | 10109 (24,18 %) | 2 (0,00 %) |
| wuffs_deflate__decoder__decode_huffman_fast32 | 9793 (23,43 %) | 9515 (22,76 %) |
| wuffs_png__decoder__filter_4_distance_3_fallback | 1928 (4,61 %) | 1928 (4,61 %) |
| wuffs_adler32__hasher__update_u32 | 1447 (3,46 %) | 1 (0,00 %) |
| wuffs_adler32__hasher__up | 1446 (3,46 %) | 0 (0,00 %) |
| wuffs_adler32__hasher__up__choosy_default | 1446 (3,46 %) | 1446 (3,46 %) |
| wuffs_base__pixel_swizzler__swizzle_interleaved_from_slice | 1263 (3,02 %) | 5 (0,01 %) |
| wuffs_crc32__ieee_hasher__update_u32 | 566 (1,35 %) | 4 (0,01 %) |
| wuffs_crc32__ieee_hasher__up | 562 (1,34 %) | 0 (0,00 %) |
| wuffs_crc32__ieee_hasher__up__choosy_default | 562 (1,34 %) | 562 (1,34 %) |
| wuffs_base__pixel_swizzler__bgrw__bgr | 510 (1,22 %) | 451 (1,08 %) |
| wuffs_deflate__decoder__init_dynamic_huffman | 306 (0,73 %) | 86 (0,21 %) |
| wuffs_deflate__decoder__init_huff | 257 (0,61 %) | 257 (0,61 %) |
| wuffs_deflate__decoder__init_fixed_huffman | 40 (0,10 %) | 2 (0,00 %) |
And after adding || defined(_M_IX86):
| Function Name | Total CPU [unit, %] | Self CPU [unit, %] |
|---|---|---|
| wuffs_png__decoder__decode_frame | 16978 (63,48 %) | 0 (0,00 %) |
| wuffs_png__decoder__decode_pass | 10508 (39,29 %) | 1 (0,00 %) |
| wuffs_zlib__decoder__transform_io | 10437 (39,02 %) | 1 (0,00 %) |
| wuffs_deflate__decoder__transform_io | 10087 (37,71 %) | 1 (0,00 %) |
| wuffs_deflate__decoder__decode_blocks | 10036 (37,52 %) | 2 (0,01 %) |
| wuffs_deflate__decoder__decode_huffman_fast32 | 9712 (36,31 %) | 9389 (35,10 %) |
| wuffs_png__decoder__filter_and_swizzle | 6468 (24,18 %) | 0 (0,00 %) |
| wuffs_png__decoder__filter_and_swizzle__choosy_default | 6466 (24,18 %) | 23 (0,09 %) |
| wuffs_png__decoder__filter_4 | 5203 (19,45 %) | 7 (0,03 %) |
| wuffs_png__decoder__filter_4_distance_4_x86_sse42 | 4232 (15,82 %) | 4231 (15,82 %) |
| wuffs_base__pixel_swizzler__swizzle_interleaved_from_slice | 1237 (4,62 %) | 8 (0,03 %) |
| wuffs_png__decoder__filter_4_distance_3_x86_sse42 | 964 (3,60 %) | 964 (3,60 %) |
| wuffs_base__pixel_swizzler__bgrw__bgr | 515 (1,93 %) | 442 (1,65 %) |
| wuffs_adler32__hasher__up | 349 (1,30 %) | 0 (0,00 %) |
| wuffs_adler32__hasher__update_u32 | 349 (1,30 %) | 0 (0,00 %) |
| wuffs_adler32__hasher__up_x86_sse42 | 349 (1,30 %) | 349 (1,30 %) |
| wuffs_deflate__decoder__init_dynamic_huffman | 316 (1,18 %) | 78 (0,29 %) |
| wuffs_deflate__decoder__init_huff | 283 (1,06 %) | 283 (1,06 %) |
| wuffs_crc32__ieee_hasher__update_u32 | 92 (0,34 %) | 7 (0,03 %) |
| wuffs_crc32__ieee_hasher__up | 85 (0,32 %) | 0 (0,00 %) |
| wuffs_crc32__ieee_hasher__up_x86_avx2 | 85 (0,32 %) | 85 (0,32 %) |
| wuffs_deflate__decoder__init_fixed_huffman | 50 (0,19 %) | 1 (0,00 %) |
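To put numbers on the comparison, the sketch below divides each function's before and after Self CPU sample counts from the two profiles. It assumes both runs decoded the same input at the same sampling rate, which the profiles do not state. speedup_per_mille and the row labels are hypothetical; the ratio is a truncated integer per-mille value (2000 means 2.000x) so the arithmetic is exact.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Self CPU samples copied from the two profiles above. */
struct profile_row {
  const char *name;
  uint32_t before;
  uint32_t after;
};

static const struct profile_row rows[] = {
    {"filter_4_distance_4 (fallback -> x86_sse42)", 16387, 4231},
    {"filter_4_distance_3 (fallback -> x86_sse42)", 1928, 964},
    {"adler32 up (choosy_default -> x86_sse42)", 1446, 349},
    {"crc32 ieee up (choosy_default -> x86_avx2)", 562, 85},
    {"deflate decode_huffman_fast32 (unchanged)", 9515, 9389},
};

/* Speedup as a truncated per-mille ratio: 2000 means 2.000x faster. */
static uint32_t speedup_per_mille(uint32_t before, uint32_t after) {
  assert(after > 0);
  return (uint32_t)(((uint64_t)before * 1000u) / after);
}

/* Prints one line per row, e.g. "... 2.000x". */
static void print_speedups(void) {
  for (size_t i = 0; i < sizeof(rows) / sizeof(rows[0]); i++) {
    uint32_t s = speedup_per_mille(rows[i].before, rows[i].after);
    printf("%-46s %u.%03ux\n", rows[i].name, (unsigned)(s / 1000),
           (unsigned)(s % 1000));
  }
}
```

With these numbers the ratios come out at 3.873x, 2.000x, 4.143x, 6.611x and 1.013x, the last being the non-SIMD Huffman decoder acting as a control. Applied to decode_frame's Total CPU (31681 vs 16978), the same function gives 1866, i.e. about 1.87x for the whole frame decode.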
It's fixed in release/c/wuffs-unsupported-snapshot.c. It'll be fixed in release/c/wuffs-v0.3.c whenever the next beta (version 0.3.0-beta.10) gets minted.
In case anyone else from the future is trying something similar, I presume you are configuring MSVC with /arch:AVX.
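A small sketch for checking, at compile time, which instruction-set baseline a build was configured with. It relies only on predefined macros: MSVC defines __AVX__ under /arch:AVX or higher and __AVX2__ under /arch:AVX2 or higher, and on 32-bit x86 sets _M_IX86_FP to 2 for /arch:SSE2 and above; GCC and Clang define __AVX__ and __AVX2__ for -mavx and -mavx2. It does not say what Wuffs itself requires, and the EXAMPLE_* names and example_simd_baseline are hypothetical.

```c
#include <assert.h>

/* 0/1 flags derived from predefined macros (hypothetical names). */
#if defined(__AVX2__)
#define EXAMPLE_BUILT_WITH_AVX2 1
#else
#define EXAMPLE_BUILT_WITH_AVX2 0
#endif

#if defined(__AVX__)
#define EXAMPLE_BUILT_WITH_AVX 1
#else
#define EXAMPLE_BUILT_WITH_AVX 0
#endif

/* x86_64 always has SSE2; 32-bit MSVC reports it through _M_IX86_FP. */
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#define EXAMPLE_BUILT_WITH_SSE2 1
#else
#define EXAMPLE_BUILT_WITH_SSE2 0
#endif

/* Returns the highest of these baselines enabled for this translation unit. */
static const char *example_simd_baseline(void) {
  /* Sanity check: an AVX2 build should also advertise AVX. */
  assert(!EXAMPLE_BUILT_WITH_AVX2 || EXAMPLE_BUILT_WITH_AVX);
  if (EXAMPLE_BUILT_WITH_AVX2) return "AVX2";
  if (EXAMPLE_BUILT_WITH_AVX) return "AVX";
  if (EXAMPLE_BUILT_WITH_SSE2) return "SSE2";
  return "none of the checked baselines";
}
```

Built with /arch:AVX this should report "AVX", and with /arch:AVX2 it should report "AVX2".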
Works great, thank you very much.
> In case anyone else from the future is trying something similar, I presume you are configuring MSVC with /arch:AVX.
Yes indeed.