schnaader / precomp-cpp
Precomp, C++ version - further compress already compressed files
Home Page: http://schnaader.info/precomp.php
License: Apache License 2.0
Instructions and/or makefiles for other common compilers are needed, e.g. Visual Studio.
The speed up recommendation ("You can speed up Precomp for THIS FILE...") should output a compression type switch (-t-...) for stream types where 0/X streams were recompressed. For example, consider the following output:
JPG streams: 0/3460
PNG streams: 0/367
In this case, -t-nj should be recommended.
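A minimal sketch of how such a recommendation could be derived from the per-type counters. The function name and the counter representation are illustrative, not Precomp's actual data structures; only the letter-per-type mapping ('j' = JPG, 'n' = PNG, ...) follows the existing -t switch syntax:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: collect the type letters of all stream types where
// 0 of X streams could be recompressed and emit a combined "-t-" disable
// switch (empty string if no type qualifies).
std::string recommend_type_switch(
    const std::vector<std::pair<char, std::pair<int, int>>>& stats) {
    std::string letters;
    for (const auto& s : stats) {
        int recompressed = s.second.first, total = s.second.second;
        if (total > 0 && recompressed == 0) letters += s.first;  // 0/X case
    }
    return letters.empty() ? "" : "-t-" + letters;
}
```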
The newest packJPG (2.5k) can be found here:
https://github.com/packjpg/packJPG
The newest packMP3 can be found here:
http://packjpg.encode.ru/?page_id=19
DNG compresses either lossily (JPEG DCT) or losslessly; the lossless compression is done by
LJ92 (Lossless JPEG 1992, DPCM+Huffman), which was developed 24 years ago.
Recompressing this could give the opportunity to use modern compression algorithms.
stdin/stdout for "> pcf" is probably not needed much, but for repacking "< pcf" it could be very useful, since after Precomp there will be a compressor that supports it, and stdin/stdout speeds up the process a lot.
The Base64 code uses a line length array, but this array was not added to the recursion_push()/_pop() routines, so it is overwritten in recursion, resulting in wrong line lengths when restoring (if the line lengths differed).
There's been an update on XZ Utils at the end of 2016. Changes do not look like they'd fix or improve something in the code Precomp uses, but it might be good updating anyway.
JPG detection is done 1 byte at a time (fread(in, 1, 1, fin)). If there are a lot of FF D8 FF (SOI + FF) sequences in the file, this slows down Precomp. More importantly, it also slows down JPG processing for corrupt or invalid JPGs because the search potentially continues until the end of the file.
A better way would be to use in_buf here until we cross its borders and allocate new search buffers afterwards.
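A minimal sketch of the buffered approach, assuming the marker search operates on an in-memory buffer instead of per-byte fread calls. The function name and buffer handling are illustrative, not Precomp's actual code:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical sketch: scan a buffer for the JPG SOI marker FF D8 FF
// instead of reading one byte at a time with fread(in, 1, 1, fin).
// Returns the offset of the first match, or -1 if none is found.
long find_jpg_soi(const unsigned char* buf, size_t len) {
    static const unsigned char soi[3] = {0xFF, 0xD8, 0xFF};
    for (size_t i = 0; i + 3 <= len; ++i) {
        if (memcmp(buf + i, soi, 3) == 0) return (long)i;
    }
    return -1;
}
```

A caller would refill the buffer (or, as proposed, allocate a new search buffer) only when the scan crosses its border, avoiding one I/O call per byte.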
There are several types of data inside SWF that are compressed using ZLIB according to the specification:
These are detected in intense mode, but it would be useful to extend the parser to detect them in normal mode, too.
latest precomp 0.4.6 crashes on this file every time:
In the file 'precomp.cpp', line 296 (use_mp3 = switches.use_mp3;) uses a parameter that is not defined in the Switches class.
An example file provided by Gonzalo shows a slowdown for files where packMP3 gives a "synching failure" error message. In the example file, this happens every 522 bytes until position 4552728, where packMP3 is finally successful on a small part of the file (852636 bytes):
(0.04%) Possible MP3 found at position 2197, length 5403167
packMP3 error: synching failure (frame #762 at 0x6131A)
No matches
(0.05%) Possible MP3 found at position 2719, length 5402645
packMP3 error: synching failure (frame #761 at 0x61110)
No matches
(0.06%) Possible MP3 found at position 3242, length 5402122
packMP3 error: synching failure (frame #760 at 0x60F05)
No matches
[...]
(84.23%) Possible MP3 found at position 4552728, length 852636
Best match: 852636 bytes, recompressed to 744546 bytes
New size: 5297324 instead of 5405364
Done.
Time: 50 minute(s), 31 second(s)
Recompressed streams: 1/8712
MP3 streams: 1/8712
Passing big parts of the file thousands of times to packMP3 slows down the process to 50 minutes for a 5 MB file, which should be avoided.
Some MP3 formats are not supported by packMP3, see its readme:
Please note that MP3 may stand for three different audio file types:
MPEG-1 Audio Layer III, MPEG-2 Audio Layer III and MPEG-2.5 Audio Layer
III. Only the first type is supported by packMP3. The file types may not
be distinguished by their extension (which would be '.mp3' for each of
them) but by their sample rates when playing in audio player software.
Only MPEG-1 Audio Layer III supports sample rates of 32kHz and above.
If packMP3 is called with such a file, it gives an error message, e.g. "file is MPEG-2 LAYER III, not supported".
Precomp detects these unsupported types itself, but doesn't give any information in verbose mode. It should, to avoid confusion.
The Makefile for Linux was changed with commit 0c3d924. I couldn't check yet whether it still works; this has to be done.
It's possible to recompress files compressed by MScompress.
This would be a cool feature in Precomp as MScompress is still used in some installers.
PAQ8pxd_v18 is capable of LZSS recompression. Program and sources are available here:
https://encode.ru/attachment.php?attachmentid=4536&d=1468876571
There's also an open-source implementation of MScompress available here:
https://github.com/coderforlife/ms-compress
In addition to improving the ratio for a final recompression pass, precomp by nature also improves deduplication ratios across a dataset since the content may coexist in both uncompressed and compressed forms. So there's a benefit to deduplication ratio (across a long range data stream/set) as well as compression.
That said, I'd like to propose that precomp support a bijective transform on an input (stream) which is applied even to an uncompressed archive - whether that's raw format, a tarfile, cpio/ditto, a ZFS send stream, zip -0, or an excerpt from an uncompressed vmdk using a few popular filesystem formats.
This is proposed because it would further precomp's ability to normalise data and prepare it for efficient deduplication and compression. Variable-block deduplication processing can be done by tools like ddar (great since negligible memory needed), srep (more efficient than lrzip), and pcompress, as an example. In principle, I could set the (average) block size for deduplication small enough so that more blocks are considered equivalent and thus deduplicated. But that's more than what's needed. To do that completely might involve multiplying the resource requirements by a factor of ten. Analogously, if each of those tools chose a large enough blocksize, then the periodic presence of excess bytes in the stream (constituting the internal format) wouldn't cause as big of a problem. But a large blocksize is not always chosen, so this wouldn't be sufficient either.
What if the tar/cpio format, ZFS stream, or VMDK/filesystem stream was reversibly altered in such a way that a deduplication pass on the resultant hybrid data stream would consider these multiple representations as nearly equivalent? Suppose for example that I have a tarfile and I've also unpacked the raw files. If I create a cpio file of the whole tree, including all of that, I now have multiple representations of the same data. I'd like to not have to store this twice. It's easier to notice when everything is in the same subdirectory, but suppose there are a few times data is replicated. Also, ideally this idea ought to apply recursively. A cpio file containing similar tarfiles should be just considered as a nested structure referencing some identical files.
The more that data is expressed in its most native format and the extent to which a canonical representation can be derived, the better the deduplication and compression. I think that most formats mentioned could be translated between each other as part of a stream processing with bounded memory. An imperfect transformation that encodes the side information would also suffice as long as the transform is reversible. In theory something like xdelta3 with just the right settings would help; maybe there's a way to generalise the idea without considering the particulars of a multitude of archive formats.
When using -n to convert a PCF file to its bZip2 compressed version or vice versa, e.g. precomp -n test.pcf, the new size of the file is not shown (in contrast to other operations like precomp test.pdf). It should be.
In packMP3, compress_mp3, there are various checks for frame header consistency (checksums, stereo int, emphasis, mixed blocks, private bit...). All of them throw errors and stop processing because packMP3 wouldn't be able to restore such a file identically.
Similar to issue #29, this slows down Precomp because it will pass all frames to packMP3 and after failure, will continue with the next frame.
Precomp has to check for these inconsistencies itself to speed up processing here.
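A minimal sketch of such a pre-check, covering only the basic header fields (sync word, MPEG-1 version, Layer III, valid bitrate/samplerate indices). The function name is hypothetical, and the real set of consistency checks in packMP3 is larger than what is shown here:

```cpp
#include <cassert>

// Hypothetical sketch: reject MP3 frame headers that packMP3 would refuse
// anyway, before handing the stream over. Bit layout follows the standard
// 32-bit MPEG audio frame header.
bool mp3_frame_header_ok(const unsigned char h[4]) {
    if (h[0] != 0xFF || (h[1] & 0xE0) != 0xE0) return false; // 11-bit sync word
    int version    = (h[1] >> 3) & 0x03;  // 3 = MPEG-1, the only type packMP3 supports
    int layer      = (h[1] >> 1) & 0x03;  // 1 = Layer III
    int bitrate    = (h[2] >> 4) & 0x0F;  // 15 = invalid, 0 = free format
    int samplerate = (h[2] >> 2) & 0x03;  // 3 = reserved
    if (version != 3 || layer != 1) return false;
    if (bitrate == 0x0F || bitrate == 0) return false;
    if (samplerate == 3) return false;
    return true;
}
```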
At the moment, everything is done using temporary files, so even very small streams are processed on disk instead of memory which slows down Precomp a lot. On the other hand, it allows handling of very large streams that are several GB in size.
Proposal: There should be a parameter to set a max. memory size, everything up to that size will be processed in memory and only if more memory is needed, streams are processed on disk using temporary files.
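The proposal could be sketched as a stream wrapper that buffers in memory up to the configured limit and spills to a temporary file only beyond that. Class and member names here are illustrative, not Precomp's actual API:

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// Hypothetical sketch: keep stream data in memory up to mem_limit bytes;
// on the first write that would exceed the limit, move everything to a
// temporary file and continue on disk.
class HybridStream {
public:
    explicit HybridStream(size_t mem_limit) : limit(mem_limit), tmp(nullptr) {}
    ~HybridStream() { if (tmp) fclose(tmp); }

    void write(const unsigned char* data, size_t len) {
        if (!tmp && mem.size() + len <= limit) {
            mem.insert(mem.end(), data, data + len);   // still fits in memory
        } else {
            if (!tmp) {                                // first spill: flush buffer to disk
                tmp = tmpfile();
                fwrite(mem.data(), 1, mem.size(), tmp);
                mem.clear();
            }
            fwrite(data, 1, len, tmp);
        }
    }

    bool on_disk() const { return tmp != nullptr; }

private:
    size_t limit;
    std::vector<unsigned char> mem;
    FILE* tmp;
};
```

Small streams never touch the disk, while multi-GB streams degrade gracefully to the current temporary-file behaviour.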
./precomp.exe -v -cn 2box4.zip
Precomp v0.4.3 - ALPHA version - USE FOR TESTING ONLY
Free for non-commercial use - Copyright 2006-2012 by Christian Schneider
Input file: 2box4.zip
Output file: 2box4.pcf
Using packjpg25.dll for JPG recompression.
--> packJPG library v2.5a (12/12/2011) by Matthias Stirner / Se <--
More about PackJPG here: http://www.elektronik.htw-aalen.de/packjpg
ZIP header detected
(0.00%) ZIP header detected at position 0
compressed size: 50222
uncompressed size: 108368
file name length: 9
extra field length: 0
(0.08%) Possible zLib-Stream in ZIP found at position 0, windowbits = 15
Can be decompressed to 108368 bytes
No matches
New size: 50364 instead of 50338
Done.
Time: 362 millisecond(s)
Recompressed streams: 0/1
ZIP streams: 0/1
None of the given compression and memory levels could be used.
There will be no gain compressing the output file.
Can be decompressed to 108368 bytes
No matches
That's obviously wrong. Can be and will not be. :-)
If this issue has already been resolved then I'm sorry.
Multithreading would speed up the whole process a lot, the most time consuming routine (bruteforcing the 81 different zLib modes) should be fully parallelizable, but refactoring is needed to make it threadsafe.
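A minimal sketch of how the brute force could be parallelized, assuming the per-mode attempt is made thread-safe first. The 81 combinations are the 9 compression levels x 9 memory levels of zLib; try_recompress stands in for the real routine and is a placeholder here:

```cpp
#include <cassert>
#include <future>
#include <vector>

// Hypothetical sketch: the 81 zLib mode combinations (compression level
// 1..9 x memory level 1..9) are independent of each other, so each can be
// tried in its own task. Returns the number of modes that matched.
template <typename TryFn>
int parallel_zlib_bruteforce(TryFn try_recompress) {
    std::vector<std::future<bool>> jobs;
    for (int level = 1; level <= 9; ++level)
        for (int memlevel = 1; memlevel <= 9; ++memlevel)
            jobs.push_back(std::async(std::launch::async,
                                      try_recompress, level, memlevel));
    int matches = 0;
    for (auto& j : jobs)
        if (j.get()) ++matches;  // wait for all workers, count successes
    return matches;
}
```

In practice a thread pool bounded by the core count would be preferable to spawning 81 tasks at once, but the structure is the same.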
Detect and recompress MP3 streams using PackMP3.
For some images from cameras that include thumbnails, JPG detection fails. In the attached example file, there are two streams: a 4608x3456 image and a 1440x1080 thumbnail.
PackJPG only detects the first stream and handles the second one as "garbage following EOI". Precomp detects both streams, but the length of the first one is incorrect (only 11 KB).
There are several types of video/audio containers like avi, mpeg, mp4, ogm, webm, flv, mov, mkv, ts, 3gp that can contain MP3 streams. These will often be interleaved so that the detection (5 consecutive frames) won't succeed. Also, they don't have to be in the right order.
"Demuxing" the MP3 streams would fix this, for some containers like AVI this seems to be very easy, for others, it could be more complex.
wavetlan.com seems to be a good place for testfiles (section "Video Formats")
DDS files that use DXT compression can be also recompressed.
This would be a nice feature of precomp.
There's a page describing decompression:
http://matejtomcik.com/Public/KnowHow/DXTDecompression/
Especially for the built-in bZip2 compression that uses blocks of 100..900 KB, it would be good to sort similar content together.
A simple way would be sorting by stream type (PNG, PDF, Base64...), more sophisticated ways would analyze the data, e.g. using a histogram.
The Windows makefiles use "-march=pentiumpro -m32" for the 32-bit version and "-march=x86-64 -m64" for the 64-bit version.
At the moment, the *nix makefiles don't use any of these options, so they will compile a 32-bit version on 32-bit systems, a 64-bit version on 64-bit systems.
Good explanation of filters, cascades and a list of available filters: http://blog.didierstevens.com/2008/05/19/pdf-stream-objects/
FlateDecode is already implemented
There's a slowdown on the message "big value pairs out of bound". Sounds similar to issues #29, #34 and #35, but this time, it's deep in the packMP3 encoding code instead of parsing code, so we can't do this on the Precomp side.
A possible approach would be to skip future streams that look like this specific stream (sum of position and length is the same).
Hi,
I wanted to push the precomp application to the macOS homebrew manager (https://brew.sh) and opened a pull request there. There are some changes though that would have to be made and that can only be fixed upstream in this repository. You can look at the pull request here: Homebrew/homebrew-core#15562
First of all, the Unix Makefile would need an install target (that should be easy to do, it just needs to install the built precomp binary to a suitable location). The install location should be configurable via a variable in the Makefile, so the brew formula can override it and install the application to the correct folder automatically.
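A minimal sketch of such an install target, following common Makefile conventions (the variable names are conventional, not taken from Precomp's actual Makefile):

```makefile
# Conventional install target with an overridable prefix; a Homebrew
# formula could then run: make install PREFIX=<cellar path>
PREFIX ?= /usr/local
BINDIR ?= $(PREFIX)/bin

install: precomp
	mkdir -p $(DESTDIR)$(BINDIR)
	install -m 755 precomp $(DESTDIR)$(BINDIR)/precomp
```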
Secondly, the main Makefile uses the -s
flag which is not available on OS X Yosemite and thus, precomp would not be supported on all platforms that are still served by brew (El Capitan and Sierra ignore this flag, but Yosemite does not compile with it at all). Is this flag really needed? According to GCC, it is supposed to "Remove all symbol table and relocation information from the executable."
The other two requested changes from the pull request have to be made by me in the formula, so there is nothing you would need to change.
Description of chunk format:
Sony uses the .ARW format for their raw photographs.
For many years those files were hardly compressible but now they can be decoded losslessly.
The sources can be found here
https://github.com/FLIF-hub/FLIF/tree/master/raw-tools/sony_arw
Could this format please be supported by Precomp?
Precompressing this file --
precomp_segfault.zip
with command line --
-cn -oprecomp_tar
segfaults precomp on Linux 64.
I tried to get the backtrace, but the debugging symbols were not there, neither in the binary version nor in the one that I built from git. I tried changing the CFLAGS and added -ggdb, but the debugging symbols still got stripped. Changing the optimization levels to -O0 for all libraries solved the problem, but of course it's a bad solution.
There are still false MP3 positives, e.g in this TeamViewer Debian package:
(0.00%) Possible bZip2-Stream found at position 2084, compression level = 9
Can be decompressed to 192122880 bytes
Identical recompressed bytes: 53084090
Identical decompressed bytes: 192122880 of 192122880
Real identical bytes: 53084090
Best match: 53084090 bytes, decompressed to 192122880 bytes
Recursion start - new recursion depth 1
(52.25%) Possible MP3 found at position 100378755, length 2400
packMP3 error: region size out of bounds (frame #0 at 0x0)
No matches
[...]
Recompressed streams: 1/140
MP3 streams: 0/139
bZip2 streams: 1/1
Doing the region size check before passing the data to packMP3 would solve this.
Compressed SWF files contain a version byte. At the moment, it is checked if this version byte is between 0 and 10:
if ((swf_version > 0) && (swf_version < 10)) {
Unfortunately, the highest possible SWF version is 32 as of Mar 10, 2016 (list at Adobe, StackOverflow question) and it's very likely to increase further.
As the 3 preceding bytes (CWS) don't leave too much room for false positives, the best solution is to just ignore the version flag.
At the moment, the file extension is replaced with ".pcf".
It would be nice to have an option to preserve the file extension, e.g. processing a file called "file.tar" would result in a new file called "file.tar.pcf" instead of "file.pcf".
The default should still be replacing the extension.
Currently precomp only supports compressing using bzip2 which is not too bad. However most people probably turn compression off and use another tool to manually apply stronger compression. XZ (http://tukaani.org/xz/) is already widely used in the *nix world and would be an excellent choice for a strong, yet not exotic compression method.
Even more interesting: Importing liblzma as contrib might also make it possible to allow for recompression of lzma streams. FreeBSD as well as a lot of Linux distributions use xz for their packages. Those often contain gzipped files like manpages and stuff like that. So if precomp could decompress lzma streams, streams with weak compression could be found and dealt with using recursion. A lot of open source projects offer source code in xz compressed tarballs, as well, and precompressing e.g. a project's source iso would greatly benefit from xz support.
I've run precomp 0.4.6 on the attached file, which is compressed with gzip, but it seems that Precomp does not recognize the embedded Base64 streams, which would be easy to support. Also refer to my post at http://encode.ru/threads/2036-Leanify?p=49016&viewfull=1#post49016 for examples of XML files generated from SWF files. If such a tool (SWF converter to XML and back) is made, then better SWF support is gained.
Acacia_High.zip
PreComp 0.46 (64 bit), Win 10 x64.
Files in dir after crash (file starting with D5803 - source):
~temp00000000.dat (0 B)
~temp000000000.dat (16382 B)
~temp000000001.dat (1735 MB)
~temp000000001.dat_ (2241 MB)
D5803 23.5.A.1.291 Customized RU 1289-0445 R3D (by skapunkcsd90@4pda).ftf (1616 MB)
precomp.exe (1515 KB)
Command line:
precomp.exe -cn -intense -brute "D5803 23.5.A.1.291 Customized RU 1289-0445 R3D (by skapunkcsd90@4pda).ftf"
(and with compression, -cl, - same thing)
94.04% -
ERROR 3: There is not enough space on disk
Even on another PC and OS - same position (94.04%) and same error 3. Other similar FTF files >1 GB in size - same issue (different positions). But a 700 MB file works fine.
Path to folder - C:\PreComp. Disk C is NTFS, 652 GB free.
3D modelling file format used by [Rhinoceros 3D](https://en.wikipedia.org/wiki/Rhinoceros_3D).
There is a C++/.NET SDK to read and write 3DM files called openNURBS: https://www.rhino3d.com/opennurbs
There's a bug in the PNG multi routines that leads to an integer underflow and incorrect restoration of the attached file (original file 9 KB, incorrect restored file 4 GB). The problem here seems to be the combination of two IDAT chunks and the second one being very short (only 3 image data bytes).
At the moment, you can set a list of ignore positions using the '-i' parameter. However, when in recursion, positions are reset to 0, so a given position like 1234 can occur several times at different recursion levels, which is unwanted behaviour - the user wants to ignore a certain stream and nothing more.
Proposal: To prevent this, the stream position output in verbose mode should output positions in the format xxxx_yyyy_zzzz... - every recursion level adds an underscore to the position, so e.g. a stream starting at position 1234 that contains a stream starting at position 567 would lead to the position 1234_567. The ignore list parameter parsing and usage in the detection would have to be changed accordingly.
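Building such a position string from the stack of per-level positions is straightforward; this sketch uses a plain vector as the recursion stack (the function name and representation are illustrative, not Precomp's internals):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch of the proposed format: each recursion level appends
// its local stream position with an underscore separator, so a stream at
// position 567 inside a stream at position 1234 becomes "1234_567".
std::string recursion_position(const std::vector<long long>& positions) {
    std::string result;
    for (size_t i = 0; i < positions.size(); ++i) {
        if (i > 0) result += '_';
        result += std::to_string(positions[i]);
    }
    return result;
}
```

The ignore list lookup would then compare these strings instead of plain offsets, making each entry unambiguous across recursion levels.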
When using Precomp for self extracting archives, the following would be useful to reduce their size:
This could be done using modular concepts, e.g. using #ifdefs
or treating recompression methods as plug-ins.
Recompressed streams: 90/90
PDF streams: 88/88
PDF image streams (8-bit): 0/1
JPG streams: 1/1
In the output above, 88 + 0 + 1 = 89, and 89 != 90, so the count of successfully recompressed streams is wrong.
There are newer versions of PackJPG, bZip2 and zLib. All of them have bugfixes and minor improvements, bZip2 even fixes a potential security vulnerability.
When using PAQ compressors on files processed with the "pdfbmp" switch, images inside PDFs are wrapped with a dummy BMP header so that the compressor can switch to the image model and compress the image both faster and better.
This should also be implemented for GIF image data.
For PNG image data, research is needed. It might also help, but the image data is filtered here (see here), so it might be necessary to unfilter it.
The switch should be renamed to "wrapbmp" after this.
Image data is present in many of the streams Precomp supports, e.g. PDF, PNG and GIF. On this data, specialized image compression algorithms can be used to get better compression ratio than with the general purpose compression methods the -c option offers (bzip/lzma).
At the moment, the -pdfbmp option does something similar by converting image data detected in PDF documents to BMP format so higher compression ratios can be achieved when processing the resulting files with compressors from the PAQ family later.
Compressing the image data using the FLIF library would offer a way to get higher compression ratios on image data using only Precomp. This would also be useful for people that want to convert existing image files to FLIF without losing the original file.
At the moment, additional characters after valid switches are ignored, which can lead to unwanted behaviour. For example, a user could call precomp -cn-v file (no space before -v) and be surprised that verbose mode is off.
This should lead to an error ("superfluous characters in switch -cn-v") instead of ignoring the characters.
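A minimal sketch of strict switch parsing: a known switch must match exactly, so trailing characters like "-cn-v" are rejected instead of silently ignored. The function name is hypothetical, and the accepted letters (n/b/l) merely illustrate -cn / -cb / -cl:

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: parse a -c<method> switch strictly. Any argument
// longer than the switch itself ("-cn-v") fails instead of being treated
// as a valid "-cn" with ignored extra characters.
bool parse_compression_switch(const std::string& arg, char& method) {
    if (arg.size() != 3 || arg.compare(0, 2, "-c") != 0) return false;
    method = arg[2];
    return method == 'n' || method == 'b' || method == 'l';
}
```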