richgel999 / bc7enc_rdo
State of the art RDO BC1-7 GPU texture encoders
License: Other
Is it possible for this to make dds files containing mipmaps? The quality of the resulting dds from this encoder is much better than other encoders I've tried, but the lack of mipmaps prevents me from actually using the results.
The issue seems to be that the RDO code doesn't account for rdo_bc_params::m_bc45_channel0 or rdo_bc_params::m_bc45_channel1. The workaround is easy: I just permute the channels in my source image, but I wanted to report here in case other people run into this unawares.
Thank you very much for making an open source RDO compressor by the way!
The current compilation in CMakeLists.txt passes avx and avx2 as targets to ISPC. These default to avx1-i32x8 and avx2-i32x8 respectively. However, when timing conversion of a very large input image, performance was significantly better when using avx1-i32x4 and avx2-i32x4. On a laptop with an i7-8750H, avx1-i32x8 was almost 2x slower than avx1-i32x4, while avx2-i32x8 was ~20% slower than avx2-i32x4. On a desktop with a Ryzen 2700X, i32x8 was almost 2x slower than i32x4 for both avx1 and avx2.
A couple other interesting things of note:
In other words, at least with the hardware I have available to me, switching to i32x4 for any AVX targets universally gives a significant performance boost. In my own project (which includes bc7enc_rdo as well as ISPCTextureCompressor for BC6H support) I switched to avx2-i32x4 for the AVX2 target and dropped the AVX1 target because it is not different enough from SSE4 to justify the extra compile time and binary size.
Also worth noting that I saw a similar performance degradation when testing on an M1 Mac using neon-i32x8 vs. neon-i32x4, which is what prompted me to check the difference on x86.
Will give more details.
If a BC4 block has only two values, and the search radius is greater than zero, the second value is always interpolated, and since the BC4/5 interpolation is done with more than 8 bits in hardware (see findings here), the resulting value is slightly off.
If you start with the code here:
Line 2801 in e6990bc
An example being: a block with values 126 and 127 and the default search radius of 3 will achieve a best_err of zero with endpoints 127 and 123 (with six interpolated values, MODE8 in the code). When interpolated as 8-bit, and with selectors of 0 and 2, this does correctly result in values of 126 and 127. But since the hardware is using 14- or 16-bit interpolation, the resulting values are 126.43 and 127.0.
This is small, agreed, but when mixed with solid blocks of 126 we get block artefacts as the encoder flips between solid and multi-value blocks, breaking down to blocks of 126.0 and 126.43.
This came about because we have a normal map exported from 3D Coat with essentially a dimple noise texture on it. AFAIK it was worked on at 4k then exported at 2k, by which time the dimple texture ends up being a few dots. Here's the result isolated with obvious block errors:
Ignoring there's clearly something amiss with the normal map, it does nicely highlight this issue. If this were BC3 alpha it wouldn't be affected, it's only BC4/5 with the extended interpolation bits.
The easy fix (which I've already tried) is to clamp the search radius to zero when there are only two values. I think, though, a better fix may be to interpolate with more range in bc4_block::get_block_values() so that it works for more than two values (which I'll try next).
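The numbers in the example above can be reproduced with a small sketch. The helper names below are hypothetical and are not the library's bc4_block API. The "exact" path uses plain floating-point division as a stand-in for the higher-precision hardware interpolation, which is an assumption: real hardware uses fixed-point arithmetic with more than 8 bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration only (not bc7enc_rdo code).
// BC4 8-value mode (endpoint e0 > e1): selector 0 is e0, selector 1 is e1,
// and selectors 2..7 blend the endpoints with weights in sevenths.

// 8-bit interpolation with truncating integer division.
inline uint8_t bc4_interp_8bit(uint32_t e0, uint32_t e1, uint32_t selector)
{
    assert(e0 <= 255 && e1 <= 255 && selector <= 7);
    if (selector == 0) return static_cast<uint8_t>(e0);
    if (selector == 1) return static_cast<uint8_t>(e1);
    const uint32_t w0 = 8 - selector; // weight of e0: 6 for selector 2, 1 for selector 7
    const uint32_t w1 = selector - 1; // weight of e1: 1 for selector 2, 6 for selector 7
    return static_cast<uint8_t>((w0 * e0 + w1 * e1) / 7);
}

// Same blend without rounding to 8 bits (assumed model of the hardware result).
inline double bc4_interp_exact(uint32_t e0, uint32_t e1, uint32_t selector)
{
    assert(e0 <= 255 && e1 <= 255 && selector <= 7);
    if (selector == 0) return static_cast<double>(e0);
    if (selector == 1) return static_cast<double>(e1);
    const double w0 = 8.0 - selector;
    const double w1 = selector - 1.0;
    return (w0 * e0 + w1 * e1) / 7.0;
}
```

With endpoints 127 and 123, selector 2 gives (6 * 127 + 123) / 7 = 885 / 7, which truncates to 126 in the 8-bit path but is roughly 126.43 in the exact path, matching the mismatch described above.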
Hi,
When compiling and running bc7e.ispc on a computer with an Apple M1 CPU, I consistently hit this assert.
bc7e.ispc:2890:2: Assertion failed: *pCur_ofs <= 128
Compiling with assertions disabled produces wrong/corrupted compressed images.
I tried compiling on an Intel machine with SSE and/or AVX and it works as expected, so my guess is the problem is only the neon target.
This is how I compiled bc7e.ispc for the M1:
./ispc -g bc7e.ispc -o bc7e.obj -h bc7e_ispc.h --target=neon --arch=aarch64
This is a callstack of when the assert hits:
set_block_bits___vy_3C_unT_3E_vyuvyuun_3C_vyu_3E_ (bc7enc_rdo-master/bc7e.ispc:2890)
encode_bc7_block___vy_3C_unv_3E_un_3C_s_5B__c_vybc7_optimization_results_5D__3E_ (bc7enc_rdo-master/bc7e.ispc:3086)
handle_alpha_block___vy_3C_unv_3E_un_3C_s_5B__c_vycolor_quad_i_5D__3E_un_3C_s_5B__c_unbc7e_compress_block_params_5D__3E_un_3C_s_5B_uncolor_cell_compressor_params_5D__3E_vyuvyu (bc7enc_rdo-master/bc7e.ispc:3906)
bc7e_compress_blocks (bc7enc_rdo-master/bc7e.ispc:4741)
main (bc7enc_rdo-master/test.c:79)
And in this case the value of pCur_ofs was (82, 82, 130, 82)
I'm not familiar with either ISPC or texture compression, so I would appreciate any help or insight on how to debug this.
Do any of the algorithms here require me to license some patent?
I was wondering if it was possible to keep the same API as ispc for the cpu edition.
It's mixed currently.
This is related to this issue that still remains on BC7enc. I updated to BC7enc_rdo. This affects BC7 encoded textures.
richgel999/bc7enc#3
When pbit = 0 is chosen in the following call, the reconstructed alpha can only reach 254. This doesn't match the fully opaque textures that we are processing. We don't have this issue with the ETC/ASTC encoder. For now, we'll work around it at the shader level and assume 254/255.0 is opaque when BC7enc is used.
I tried forcing use of pbit = 1 for opaque textures. This results in correct fully opaque textures, but on a red-black checkerboard, the pbit of 1 results in 255,1,1,255 as the final color, when it should be 255,0,0,255. The transparent mode that is chosen seems to be incapable of reproducing the original texture accurately. The pbit limits the reproduced RGBA components: an even pbit results in even values, and an odd pbit results in odd values. So bc7enc probably needs to support one of the opaque modes, where the color bits don't affect the alpha.
static uint64_t find_optimal_solution(uint32_t mode, vec4F xl, vec4F xh, const color_cell_compressor_params *pParams, color_cell_compressor_results *pResults)
{
....
for (int p = pParams->m_has_alpha ? 0 : 1; p < 2; p++)
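The 254 and 255,1,1,255 results are consistent with an endpoint layout of 7 stored bits per component plus one p-bit shared by all four components, as in BC7 mode 6. The sketch below assumes that layout; the function name is hypothetical and is not taken from bc7enc.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper for illustration (not bc7enc code).
// Assumes a mode 6 style endpoint: a 7-bit quantized component with the
// shared p-bit appended as the least significant bit of the 8-bit result.
inline uint8_t expand_7bit_with_pbit(uint32_t q7, uint32_t pbit)
{
    assert(q7 < 128 && pbit < 2);
    return static_cast<uint8_t>((q7 << 1) | pbit);
}
```

Under this assumption, the largest endpoint value with pbit = 0 is expand_7bit_with_pbit(127, 0) = 254, and the smallest with pbit = 1 is expand_7bit_with_pbit(0, 1) = 1. Because the p-bit is shared across components, an endpoint cannot hold both alpha = 255 and a color component of 0.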
I'll update this more when I've tested on other platforms. Running on Debian I'm seeing this:
Total encoding time: 32.043268 secs
Total processing time: 32.212762 secs
But the actual run was much, much faster. If I time the run (with time) I see:
real 0m2.503s
user 0m57.651s
sys 0m4.223s
The 2.5 seconds of wallclock time feels correct (this is a 144-core machine).
Update: if I change the clock() calls to the required jumble of std::chrono incantations I get:
Total encoding time: 0.400000 secs
Total processing time: 0.417000 secs
This was measured in milliseconds, so on a many-core or fast machine it misses the nuances. I could do a PR for this and switch to microseconds if this is of interest? Something like this:
https://github.com/richgel999/bc7enc_rdo/compare/master...cwoffenden:bc7enc_rdo:mt-timer?expand=1
(It's what I'm using to time the BC4/5 changes I made)
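A minimal sketch of the difference, assuming Linux behaviour where clock() reports processor time summed over all threads of the process. The helper names are hypothetical and this is not the code from the linked branch.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <functional>

// Hypothetical helper: elapsed wall-clock seconds for `work`, measured with a
// monotonic clock. This is the number a user perceives as run time.
inline double wall_seconds(const std::function<void()>& work)
{
    assert(work); // an empty std::function cannot be called
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Hypothetical helper: processor time consumed while `work` runs. With many
// worker threads this can be far larger than the wall-clock time.
inline double cpu_seconds(const std::function<void()>& work)
{
    assert(work);
    const std::clock_t start = std::clock();
    work();
    const std::clock_t stop = std::clock();
    return static_cast<double>(stop - start) / CLOCKS_PER_SEC;
}
```

Wrapping the encode call in wall_seconds rather than cpu_seconds would report a figure comparable to the "real" line from time, while cpu_seconds is closer to "user" plus "sys".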
On Arch Linux, using bc7enc, it works fine on BC3 but I'm having trouble with BC1. With a PNG input, with a rectangular part on top and bottom being transparent, only the bottom part becomes transparent, while the top becomes black. Attempting to use another image as alpha mask doesn't help either.
When compressing a texture in debug mode, I encountered an assert. The following color block was being compressed:
23, 25, 46, 255
22, 24, 45, 255
22, 25, 44, 255
22, 25, 44, 255
23, 25, 46, 255
22, 24, 45, 255
22, 25, 44, 255
22, 25, 44, 255
22, 24, 45, 255
22, 24, 45, 255
22, 25, 44, 255
22, 25, 44, 255
22, 24, 45, 255
22, 24, 45, 255
22, 25, 44, 255
22, 25, 44, 255
It was processing mode 6, and the assert was in set_block_bits() (val set to 67, asserting since it doesn't fit in 3 bits)
The parameters I used were:
m_max_partitions: BC7ENC_MAX_PARTITIONS
m_uber_level: 1
m_try_least_squares: true
m_mode17_partition_estimation_filterbank: false
Our workflow only supports .ktx and I can't find any tool that can do DDS -> KTX conversion, so it'd be great if bc7enc could output KTX directly.
Hi,
I am trying to use this to load and save from a filename
Loading:
BC7 dds input file from hard drive (with a filename) - output to RGBA raw uncompressed mem stream.
Example: Loadbc7("c:\myfile.dds");
Saving:
RGBA raw uncompressed mem stream to BC7 dds output file (a filename - height and width need specifying to the output file)
Savebc7("c:\myfile.dds", 512, 1024);
Do you have any working examples of this?
Thanks for open sourcing this great tool! We at Respawn plan to start using it soon for an internal tool making some of the textures in Apex Legends.
I was compressing using bc7e, decompressing using bc7decomp.cpp, and dumping the results as a PNG. It looked great, except some pixels that should have been orange were green instead. I disabled optimizations on bc7decomp.cpp so I could step through the code and see how it was encoded, and to my surprise, the colors it showed in the debugger were right! I let it finish, and the green pixels were in fact the proper orange.
I usually suspect uninitialized data when optimized code behaves differently, but when I looked into it in the debugger and read the generated disassembly, the compiler actually had a bug. This shocked me; compiler bugs are very rare! I've only encountered 2 other compiler bugs in 20 years of professional software development.
In unpack_bc7_mode4_5, there's a nested loop to initialize "endpoints[e][c]" on lines 393-397. With optimizations enabled, it looks like VC++ 2017 swapped the order of the inner and outer loops, then unrolled the loop over "c" (which was the outer loop but is now the inner loop). This changed where the bits read from "color_read_bits" ended up. Once the end points were decoded wrongly, everything else was wrong after that.
Here's the relevant disassembly, with comments added to help see what's going on:
00007FFE8DC34360 40 0F B6 C7 movzx eax,dil // rdi has "color_read_bits"
00007FFE8DC34364 48 8D 52 04 lea rdx,[rdx+4] // rdx is "end_points", with an offset
{
for (uint32_t e = 0; e < ENDPOINTS; e++)
{
endpoints[e][c] = static_cast<uint8_t>(color_read_bits & ENDPOINT_MASK);
00007FFE8DC34368 41 22 C6 and al,r14b // r14b is 0x7F for mode 5, 0x1F for mode 4
color_read_bits >>= ENDPOINT_BITS;
00007FFE8DC3436B 48 D3 EF shr rdi,cl // cl is 7 for mode 5, and 5 for mode 4
00007FFE8DC3436E 88 42 FA mov byte ptr [rdx-6],al // this stores endpoints[e][0]
00007FFE8DC34371 40 0F B6 C7 movzx eax,dil
00007FFE8DC34375 41 22 C6 and al,r14b
00007FFE8DC34378 48 D3 EF shr rdi,cl
00007FFE8DC3437B 88 42 FB mov byte ptr [rdx-5],al // this stores endpoints[e][1]
00007FFE8DC3437E 40 0F B6 C7 movzx eax,dil
00007FFE8DC34382 41 22 C6 and al,r14b
00007FFE8DC34385 48 D3 EF shr rdi,cl
00007FFE8DC34388 88 42 FC mov byte ptr [rdx-4],al // this stores endpoints[e][2]
00007FFE8DC3438B 49 83 E8 01 sub r8,1 // r8 counts down from 2 to 0, so it is "2 - e"
00007FFE8DC3438F 75 CF jne bc7decomp::unpack_bc7_mode4_5+100h (07FFE8DC34360h)
Each of 2 iterations picks off N bits at a time (N = 5 or 7) and stores them in end_points[e][0..2]. In shorthand with everything unrolled, it writes the picked-off bits in order 00 01 02 10 11 12. However, the C++ code says the order should be 00 10 01 11 02 12. Reordering which bits went to which array entry caused the improper decompression.
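To see why the loop order matters, here is a small self-contained sketch. The type and function names are hypothetical and this is not the bc7decomp.cpp code; it only extracts six fixed-width fields from one integer in the two orders described above.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// endpoints[e][c]: 2 endpoints, 3 color channels (hypothetical stand-in type).
using Endpoints = std::array<std::array<uint8_t, 3>, 2>;

// Order written in the C++ source: channel loop outside, endpoint loop inside.
// Fields are consumed as 00 10 01 11 02 12.
inline Endpoints unpack_channel_outer(uint64_t bits, uint32_t endpoint_bits)
{
    assert(endpoint_bits >= 1 && endpoint_bits <= 8);
    const uint64_t mask = (1ull << endpoint_bits) - 1;
    Endpoints out{};
    for (uint32_t c = 0; c < 3; c++)
    {
        for (uint32_t e = 0; e < 2; e++)
        {
            out[e][c] = static_cast<uint8_t>(bits & mask);
            bits >>= endpoint_bits;
        }
    }
    return out;
}

// Order seen in the optimized disassembly: endpoint loop outside, channel
// loop inside. Fields are consumed as 00 01 02 10 11 12.
inline Endpoints unpack_endpoint_outer(uint64_t bits, uint32_t endpoint_bits)
{
    assert(endpoint_bits >= 1 && endpoint_bits <= 8);
    const uint64_t mask = (1ull << endpoint_bits) - 1;
    Endpoints out{};
    for (uint32_t e = 0; e < 2; e++)
    {
        for (uint32_t c = 0; c < 3; c++)
        {
            out[e][c] = static_cast<uint8_t>(bits & mask);
            bits >>= endpoint_bits;
        }
    }
    return out;
}
```

If the six 5-bit fields hold the values 1 through 6 from the lowest bits upward, the first function produces endpoints {1,3,5} and {2,4,6}, while the second produces {1,2,3} and {4,5,6}. Only the first and last fields land in the same slot, which agrees with the comment in the workaround.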
My workaround was to manually unroll the inner loop, so that the optimizer wouldn't switch the inner/outer loops. Once I did that, the green pixels were orange in optimized builds as well. There may be other, better workarounds.
// VC++ 2017 with full optimizations reversed the orders of the loops, not realizing it changed which
// bits got extracted from 'color_read_bits' for all but the first and last iteration.
static_assert(ENDPOINTS == 2); // This is a manual unrolling of a loop over ENDPOINTS
endpoints[0][c] = static_cast<uint8_t>(color_read_bits & ENDPOINT_MASK);
color_read_bits >>= ENDPOINT_BITS;
endpoints[1][c] = static_cast<uint8_t>(color_read_bits & ENDPOINT_MASK);
color_read_bits >>= ENDPOINT_BITS;
I was curious why I didn't notice this bug with the other decompressor modes. It turns out that the functions unpack_bc7_mode0_2 and unpack_bc7_mode1_3_7 have the extra line "uint64_t channel_read_chunk = channel_read_chunks[c];" at the top of the outer loop before the start of the inner loop, so the compiler can't swap the order of those loops. The last function is unpack_bc7_mode6, which doesn't loop at all, since there are only 2 endpoints to decode.
Sorry for not making this a pull request, but it seemed like GitHub wanted me to clone the whole repository and let it diff the changes to find out what I changed. I just wanted to suggest a new version for a single file, but I didn't see any way to do that in their interface.
I tried to compile the executable by running the .bat file and then executing this in the console: cmake --build .
But the compiled executable works only on my machine. On other machines it asks for msvcp140d.dll, which looks like a debug library or something. Could somebody help me with the compile process?