richgel999 / bc7enc_rdo

State of the art RDO BC1-7 GPU texture encoders

License: Other

CMake 0.26% C++ 52.21% C 47.53% Batchfile 0.01%

bc7enc_rdo's Issues

No mipmap support

Is it possible for this to make dds files containing mipmaps? The quality of the resulting dds from this encoder is much better than other encoders I've tried, but the lack of mipmaps prevents me from actually using the results.

BC5: Using channels other than 0 and 1 (via rdo_bc_params::m_bc45_channelX) with RDO generates corrupted image

The issue seems to be that the RDO code doesn't account for rdo_bc_params::m_bc45_channel0 or rdo_bc_params::m_bc45_channel1. The workaround is easy: I just permute the channels in my source image, but I wanted to report here in case other people run into this unawares.
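The workaround described above can be sketched roughly as follows. This assumes an interleaved 8-bit RGBA source buffer; `permute_for_bc5` is a hypothetical helper written for illustration and is not part of bc7enc_rdo.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of bc7enc_rdo): copies two chosen source
// channels of an interleaved RGBA8 image into channels 0 and 1, so the
// encoder can read them from the slots the RDO path appears to assume.
std::vector<uint8_t> permute_for_bc5(const std::vector<uint8_t>& rgba,
                                     unsigned chan0, unsigned chan1)
{
    assert(chan0 < 4 && chan1 < 4);
    assert(rgba.size() % 4 == 0);

    std::vector<uint8_t> out(rgba);
    for (std::size_t i = 0; i < rgba.size(); i += 4)
    {
        // Read from the untouched input so the two copies cannot clobber
        // each other when chan0/chan1 overlap with slots 0/1.
        out[i + 0] = rgba[i + chan0];
        out[i + 1] = rgba[i + chan1];
        // Channels 2 and 3 are left as they were.
    }
    return out;
}
```

After permuting, the image would be encoded with rdo_bc_params::m_bc45_channel0 = 0 and rdo_bc_params::m_bc45_channel1 = 1.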

Thank you very much for making an open source RDO compressor by the way!

AVX and AVX2 targets perform better with i32x4 width

The current compilation in CMakeLists.txt passes avx and avx2 as targets to ISPC. These default to avx1-i32x8 and avx2-i32x8 respectively. However, when timing conversion of a very large input image performance was significantly better when using avx1-i32x4 and avx2-i32x4. On a laptop with an i7-8750H, avx1-i32x8 was almost 2x slower than avx1-i32x4, while avx2-i32x8 was ~20% slower than avx2-i32x4. On a desktop with a Ryzen 2700X, i32x8 was almost 2x slower than i32x4 for both avx1 and avx2.

A couple other interesting things of note:

On the i7, SSE4 and AVX1 performed identically.
On the Ryzen system, AVX2 was actually ~10% slower than AVX1, and AVX1 was very slightly (~2-3%) faster than SSE4. Some research seems to suggest that newer Ryzen models likely fare better with AVX2.

In other words, at least with the hardware I have available to me, switching to i32x4 for any AVX target universally gives a significant performance boost. In my own project (which includes bc7enc_rdo as well as ISPCTextureCompressor for BC6H support) I switched to avx2-i32x4 for the AVX2 target and dropped the AVX1 target, since it was not different enough from SSE4 to justify the extra compile time and binary size.

It's also worth noting that I saw a similar performance degradation when testing on an M1 Mac using neon-i32x8 vs. neon-i32x4, which is what prompted me to check the difference on x86.

Use define statements to disable zlib and png

Will give more details.

BC4/5 blocks with two values per channel are slightly off

If a BC4 block has only two values, and the search radius is greater than zero, then the second value is always interpolated, and since the BC4/5 interpolation is done with more than 8-bits in hardware (see findings here) the resulting value is slightly off.

If you start with the code here:

bc7enc_rdo/rgbcx.cpp

Line 2801 in e6990bc

for (int lo_delta = -(int)search_rad; lo_delta <= (int)search_rad; lo_delta++)

An example being: a block with values 126 and 127 and the default search radius of 3 will achieve a best_err of zero with endpoints 127 and 123 (with six interpolated values, MODE8 in the code). When interpolated as 8-bit, and with selectors of 0 and 2, this does correctly result in values of 126 and 127. But... since the hardware is using 14- or 16-bit interpolation then the resulting values are 126.43 and 127.0.
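The arithmetic in this example can be checked with a small sketch. It assumes the standard BC4 eight-value layout (selector 0 is the first endpoint, selector 1 the second, selectors 2 to 7 interpolated in sevenths) and truncating integer division for the 8-bit path. Both function names are hypothetical, and the floating-point version only stands in for the higher-precision hardware interpolation.

```cpp
#include <cassert>
#include <cstdint>

// 8-bit style decode: weights in sevenths, truncating integer division.
uint32_t bc4_interp_8bit(uint32_t e0, uint32_t e1, uint32_t sel)
{
    assert(e0 > e1 && sel < 8);  // eight-value mode requires e0 > e1
    if (sel == 0) return e0;
    if (sel == 1) return e1;
    const uint32_t w1 = sel - 1;  // weight of e1, from 1 to 6
    return ((7 - w1) * e0 + w1 * e1) / 7;
}

// Same weights without rounding to 8 bits. This approximates what
// hardware with extra interpolation precision would return.
double bc4_interp_exact(uint32_t e0, uint32_t e1, uint32_t sel)
{
    assert(e0 > e1 && sel < 8);
    if (sel == 0) return static_cast<double>(e0);
    if (sel == 1) return static_cast<double>(e1);
    const double w1 = static_cast<double>(sel - 1);
    return ((7.0 - w1) * e0 + w1 * e1) / 7.0;
}
```

With endpoints 127 and 123, selector 2 gives (6 * 127 + 123) / 7 = 885 / 7, which truncates to 126 in the integer path but is about 126.43 without truncation.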

This is small, agreed, but when mixed with solid blocks of 126 we get block artefacts as the encoder flips between solid and multi-value blocks, breaking down to blocks of 126.0 and 126.43.

This came about because we have a normal map exported from 3D Coat with essentially a dimple noise texture on it. AFAIK it was worked on at 4k then exported at 2k, by which time the dimple texture ended up being a few dots. Here's the result isolated with obvious block errors:

Ignoring there's clearly something amiss with the normal map, it does nicely highlight this issue. If this were BC3 alpha it wouldn't be affected, it's only BC4/5 with the extended interpolation bits.

The easy fix (which I've already tried) is to clamp the search radius to zero when there are only two values. I think, though, a better fix may be to interpolate with more range in bc4_block::get_block_values() so that it works for more than two values (which I'll try next).
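A minimal sketch of the "easy fix", assuming the encoder can inspect the block's source values before searching. `effective_search_radius` is a hypothetical name, and the real change would live inside the rgbcx encoder rather than in a separate function.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns 0 when the block holds exactly two distinct
// values, so the endpoints stay on the source values instead of one of
// them being reproduced through interpolation.
uint32_t effective_search_radius(const uint8_t* pixels, uint32_t count,
                                 uint32_t requested_rad)
{
    assert(pixels != nullptr);

    std::bitset<256> seen;
    for (uint32_t i = 0; i < count; i++)
        seen.set(pixels[i]);

    return (seen.count() == 2) ? 0u : requested_rad;
}
```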

Hitting an assert on Apple M1

Hi,

When compiling and running bc7e.ispc on a computer with an Apple M1 CPU, I consistently hit this assert:
bc7e.ispc:2890:2: Assertion failed: *pCur_ofs <= 128

Compiling with assertions disabled produces wrong/corrupted compressed images.

I tried compiling on an Intel machine with SSE and/or AVX and it works as expected, so my guess is that the problem is limited to the NEON target.

This is how I compiled bc7e.ispc for the M1:
./ispc -g bc7e.ispc -o bc7e.obj -h bc7e_ispc.h --target=neon --arch=aarch64

This is a callstack of when the assert hits:

set_block_bits___vy_3C_unT_3E_vyuvyuun_3C_vyu_3E_ (bc7enc_rdo-master/bc7e.ispc:2890)
encode_bc7_block___vy_3C_unv_3E_un_3C_s_5B__c_vybc7_optimization_results_5D__3E_ (bc7enc_rdo-master/bc7e.ispc:3086)
handle_alpha_block___vy_3C_unv_3E_un_3C_s_5B__c_vycolor_quad_i_5D__3E_un_3C_s_5B__c_unbc7e_compress_block_params_5D__3E_un_3C_s_5B_uncolor_cell_compressor_params_5D__3E_vyuvyu (bc7enc_rdo-master/bc7e.ispc:3906)
bc7e_compress_blocks (bc7enc_rdo-master/bc7e.ispc:4741)
main (bc7enc_rdo-master/test.c:79)

And in this case the value of pCur_ofs was (82, 82, 130, 82)

I’m not familiar with either ISPC or texture compression, so I would appreciate any help or insight on how to debug this.

Patents?

Do any of the algorithms here require me to license some patent?

Possible to have an unified api for both ISPC and CPU

I was wondering if it was possible to keep the same API as ispc for the cpu edition.

It's mixed currently.

Constant alpha = 255 and/or color is not reproduced using BC7 transparency modes

This is related to the issue linked below, which still remains in bc7enc. I updated to bc7enc_rdo, and BC7-encoded textures are still affected.
richgel999/bc7enc#3

When pbit = 0 is chosen in the following call, the reconstructed alpha can only reach 254. This doesn't match the fully opaque textures that we are processing. We don't have this issue with the ETC/ASTC encoders. For now, we'll work around it at the shader level and assume 254/255.0 is opaque when bc7enc is used.

I tried forcing pbit = 1 for opaque textures. This results in correct fully opaque textures, but on a red-black checkerboard a pbit of 1 produces 255,1,1,255 as the final color when it should be 255,0,0,255. The transparent mode that is chosen seems to be incapable of reproducing the original texture accurately. The pbit limits the reproduced RGBA components: an even pbit results in even values, and an odd pbit results in odd values. So bc7enc probably needs to support one of the opaque modes, where the color bits don't affect the alpha.

static uint64_t find_optimal_solution(uint32_t mode, vec4F xl, vec4F xh, const color_cell_compressor_params *pParams, color_cell_compressor_results *pResults)
{
....
	  for (int p = pParams->m_has_alpha ? 0 : 1; p < 2; p++)
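The even/odd constraint can be illustrated with a short sketch. The issue does not say which mode was selected, so this assumes a layout with 7 stored bits per component plus a p-bit that becomes the least significant bit and is shared by every component of the endpoint (as in BC7 mode 6). `expand_7bit_with_pbit` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: 7 stored bits per component, with the endpoint's p-bit
// appended as the least significant bit of every component.
uint8_t expand_7bit_with_pbit(uint8_t stored7, uint8_t pbit)
{
    assert(stored7 < 128 && pbit < 2);
    return static_cast<uint8_t>((stored7 << 1) | pbit);
}
```

Under this assumption an endpoint with pbit = 0 tops out at 254, and an endpoint with pbit = 1 cannot go below 1, so a single endpoint cannot hold both alpha = 255 and a color component of 0.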

Encoding time incorrect on multicore machine

I'll update this more when I've tested on other platforms. Running on Debian I'm seeing this:

Total encoding time: 32.043268 secs
Total processing time: 32.212762 secs

But the actual run was much, much faster. If I time the run (with time) I see:

real	0m2.503s
user	0m57.651s
sys	0m4.223s

The 2.5 seconds of wallclock time feels correct (this is a 144-core machine).

Update: if I change the clock() calls to the required jumble of std::chrono incantations I get:

Total encoding time: 0.400000 secs
Total processing time: 0.417000 secs

This was measured in milliseconds, so on a many-core or fast machine it misses the nuances. I could do a PR for this and switch to microseconds if this is of interest? Something like this:

https://github.com/richgel999/bc7enc_rdo/compare/master...cwoffenden:bc7enc_rdo:mt-timer?expand=1

(It's what I'm using to time the BC4/5 changes I made)
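The difference between the two timers can be sketched as follows. On POSIX systems clock() reports processor time consumed by the whole process, which adds up across threads, whereas std::chrono::steady_clock measures elapsed wall-clock time. `wall_timer` and `cpu_secs_since` are hypothetical names and are not taken from the linked branch.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// Wall-clock timer based on a monotonic clock.
class wall_timer
{
    std::chrono::steady_clock::time_point m_start;

public:
    wall_timer() : m_start(std::chrono::steady_clock::now()) {}

    // Elapsed wall-clock time in seconds, with sub-millisecond resolution.
    double elapsed_secs() const
    {
        const auto delta = std::chrono::steady_clock::now() - m_start;
        return std::chrono::duration<double>(delta).count();
    }
};

// For comparison: CPU time since `start`, as obtained from std::clock().
// With many worker threads this can be far larger than the wall time.
double cpu_secs_since(std::clock_t start)
{
    assert(start != static_cast<std::clock_t>(-1));  // clock() failure value
    return static_cast<double>(std::clock() - start) / CLOCKS_PER_SEC;
}
```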

PNG transparency issues with BC1

On Arch Linux, using bc7enc, it works fine on BC3 but I'm having trouble with BC1. With a PNG input, with a rectangular part on top and bottom being transparent, only the bottom part becomes transparent, while the top becomes black. Attempting to use another image as alpha mask doesn't help either.

Assert in bc7enc mode 6

When compressing a texture in debug mode, I encountered an assert. The following color block was being compressed:

23, 25, 46, 255
22, 24, 45, 255
22, 25, 44, 255
22, 25, 44, 255
23, 25, 46, 255
22, 24, 45, 255
22, 25, 44, 255
22, 25, 44, 255
22, 24, 45, 255
22, 24, 45, 255
22, 25, 44, 255
22, 25, 44, 255
22, 24, 45, 255
22, 24, 45, 255
22, 25, 44, 255
22, 25, 44, 255

It was processing mode 6, and the assert was in set_block_bits() (val was set to 67, which triggers the assert since it doesn't fit in 3 bits).

The parameters I used were:

linear weights
m_max_partitions: BC7ENC_MAX_PARTITIONS
m_uber_level: 1
m_try_least_squares: true
m_mode17_partition_estimation_filterbank: false
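The failing condition amounts to a value that is too wide for the bit field being written. A minimal sketch of such a range check, with `fits_in_bits` as a hypothetical stand-in for the assert inside set_block_bits():

```cpp
#include <cassert>
#include <cstdint>

// Returns true when `val` can be stored in a field of `num_bits` bits.
bool fits_in_bits(uint32_t val, uint32_t num_bits)
{
    assert(num_bits >= 1 && num_bits <= 32);
    // Shifting a 32-bit value by 32 is undefined, so handle that case first.
    return num_bits == 32 || val < (1u << num_bits);
}
```

A 3-bit field holds 0 to 7, so 67 is rejected. In BC7 mode 6 the indices are 4 bits wide and the anchor index is stored with one bit fewer, so a 3-bit write there is probably the anchor selector, which would suggest the selector itself was already out of range before it was packed.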

Save to .KTX

Our workflow only supports .ktx, and I can't find any tool that can do DDS -> KTX conversion, so it would be great if bc7enc could output KTX directly.

No documentation - no idea where to start

Hi,

I am trying to use this to load and save from a filename

Loading:
BC7 dds input file from hard drive (with a filename) - output to RGBA raw uncompressed mem stream.

Example: Loadbc7("c:\myfile.dds");

Saving:
RGBA raw uncompressed mem stream to BC7 dds output file (a filename - height and width need specifying to the output file)
Savebc7("c:\myfile.dds", 512, 1024);

Do you have any working examples of this?

VC 2017 compiler bug and workaround with mode 4/5 blocks in bc7decomp.cpp

Thanks for open sourcing this great tool! We at Respawn plan to start using it soon for an internal tool making some of the textures in Apex Legends.

I was compressing using bc7e, decompressing using bc7decomp.cpp, and dumping the results as a PNG. It looked great, except some pixels that should have been orange were green instead. I disabled optimizations on bc7decomp.cpp so I could step through the code and see how it was encoded, and to my surprise, the colors it showed in the debugger were right! I let it finish, and the green pixels were in fact the proper orange.

I usually suspect uninitialized data when optimized code behaves differently, but when I looked into it in the debugger and read the generated disassembly, the compiler actually had a bug. This shocked me; compiler bugs are very rare! I've only encountered 2 other compiler bugs in 20 years of professional software development.

In unpack_bc7_mode4_5, there's a nested loop to initialize "endpoints[e][c]" on lines 393-397. With optimizations enabled, it looks like VC++ 2017 swapped the order of the inner and outer loops, then unrolled the loop over "c" (which was the outer loop but is now the inner loop). This changed where the bits read from "color_read_bits" ended up. Once the end points were decoded wrongly, everything else was wrong after that.

Here's the relevant disassembly, with comments added to help see what's going on:

00007FFE8DC34360 40 0F B6 C7          movzx       eax,dil         // rdi has "color_read_bits"
00007FFE8DC34364 48 8D 52 04          lea         rdx,[rdx+4]     // rdx is "end_points", with an offset
	{
		for (uint32_t e = 0; e < ENDPOINTS; e++)
		{
			endpoints[e][c] = static_cast<uint8_t>(color_read_bits & ENDPOINT_MASK);
00007FFE8DC34368 41 22 C6             and         al,r14b              // r14b is 0x7F for mode 5, 0x1F for mode 4
			color_read_bits >>= ENDPOINT_BITS;
00007FFE8DC3436B 48 D3 EF             shr         rdi,cl               // cl is 7 for mode 5, and 5 for mode 4
00007FFE8DC3436E 88 42 FA             mov         byte ptr [rdx-6],al  // this stores endpoints[e][0]
00007FFE8DC34371 40 0F B6 C7          movzx       eax,dil
00007FFE8DC34375 41 22 C6             and         al,r14b  
00007FFE8DC34378 48 D3 EF             shr         rdi,cl  
00007FFE8DC3437B 88 42 FB             mov         byte ptr [rdx-5],al  // this stores endpoints[e][1]
00007FFE8DC3437E 40 0F B6 C7          movzx       eax,dil  
00007FFE8DC34382 41 22 C6             and         al,r14b  
00007FFE8DC34385 48 D3 EF             shr         rdi,cl  
00007FFE8DC34388 88 42 FC             mov         byte ptr [rdx-4],al  // this stores endpoints[e][2]
00007FFE8DC3438B 49 83 E8 01          sub         r8,1                 // r8 counts down from 2 to 0, so it is "2 - e"
00007FFE8DC3438F 75 CF                jne         bc7decomp::unpack_bc7_mode4_5+100h (07FFE8DC34360h)

Each of 2 iterations picks off N bits at a time (N = 5 or 7) and stores them in end_points[e][0..2]. In shorthand with everything unrolled, it writes the picked-off bits in order 00 01 02 10 11 12. However, the C++ code says the order should be 00 10 01 11 02 12. Reordering which bits went to which array entry caused the improper decompression.
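The effect of swapping the loops can be reproduced in isolation. The sketch below is an illustration rather than the bc7decomp.cpp code: it unpacks six fields from one bit stream in both orders, and both function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using endpoints_t = std::array<std::array<uint8_t, 3>, 2>;  // [e][c]

// Order written in the C++ source: channel outer, endpoint inner.
// Fields land in the order 00 10 01 11 02 12.
endpoints_t unpack_channel_major(uint64_t bits, uint32_t endpoint_bits)
{
    assert(endpoint_bits >= 1 && endpoint_bits <= 8);
    const uint64_t mask = (1ull << endpoint_bits) - 1;
    endpoints_t ep{};
    for (uint32_t c = 0; c < 3; c++)
        for (uint32_t e = 0; e < 2; e++)
        {
            ep[e][c] = static_cast<uint8_t>(bits & mask);
            bits >>= endpoint_bits;
        }
    return ep;
}

// Order seen in the optimized build: endpoint outer, channel inner.
// Fields land in the order 00 01 02 10 11 12.
endpoints_t unpack_endpoint_major(uint64_t bits, uint32_t endpoint_bits)
{
    assert(endpoint_bits >= 1 && endpoint_bits <= 8);
    const uint64_t mask = (1ull << endpoint_bits) - 1;
    endpoints_t ep{};
    for (uint32_t e = 0; e < 2; e++)
        for (uint32_t c = 0; c < 3; c++)
        {
            ep[e][c] = static_cast<uint8_t>(bits & mask);
            bits >>= endpoint_bits;
        }
    return ep;
}
```

Because every iteration consumes the next field from the same shifting value, the two orders agree only on the first and last field.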

My workaround was to manually unroll the inner loop, so that the optimizer wouldn't switch the inner/outer loops. Once I did that, the green pixels were orange in optimized builds as well. There may be other, better workarounds.

		// VC++ 2017 with full optimizations reversed the orders of the loops, not realizing it changed which
		// bits got extracted from 'color_read_bits' for all but the first and last iteration.
		static_assert(ENDPOINTS == 2, "manual unrolling assumes 2 endpoints");	// This is a manual unrolling of a loop over ENDPOINTS
		endpoints[0][c] = static_cast<uint8_t>(color_read_bits & ENDPOINT_MASK);
		color_read_bits >>= ENDPOINT_BITS;
		endpoints[1][c] = static_cast<uint8_t>(color_read_bits & ENDPOINT_MASK);
		color_read_bits >>= ENDPOINT_BITS;

I was curious why I didn't notice this bug with the other decompressor modes. It turns out that the functions unpack_bc7_mode0_2 and unpack_bc7_mode1_3_7 have the extra line "uint64_t channel_read_chunk = channel_read_chunks[c];" at the top of the outer loop before the start of the inner loop, so the compiler can't swap the order of those loops. The last function is unpack_bc7_mode6, which doesn't loop at all, since there are only 2 endpoints to decode.

Sorry for not making this a pull request, but it seemed like GitHub wanted me to clone the whole repository and let it diff the changes to find out what I changed. I just wanted to suggest a new version for a single file, but I didn't see any way to do that in their interface.

MSVCP140D.dll missing

I am trying to compile the executable by running the .bat file and then executing this in the console: cmake --build .
But the compiled executable only works on my machine. On other machines it asks for msvcp140d.dll, which looks like a debug library or something similar. Could somebody help me with the build process?
