schnaader / precomp-cpp
Precomp, C++ version - further compress already compressed files
Home Page: http://schnaader.info/precomp.php
License: Apache License 2.0
Instructions and/or makefiles for other common compilers are needed, e.g. Visual Studio.
The speed up recommendation ("You can speed up Precomp for THIS FILE...") should output a compression type switch (-t-...) for stream types where 0/X streams were recompressed. For example, consider the following output:
JPG streams: 0/3460
PNG streams: 0/367
In this case, -t-nj should be recommended.
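A minimal sketch of how such a recommendation could be derived from the per-type counters. The function name and the counter representation are illustrative, not Precomp's actual data structures; only the letter-per-type mapping ('j' = JPG, 'n' = PNG, ...) follows the existing -t switch syntax:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: collect the type letters of all stream types where
// 0 of X streams could be recompressed and emit a combined "-t-" disable
// switch (empty string if no type qualifies).
std::string recommend_type_switch(
    const std::vector<std::pair<char, std::pair<int, int>>>& stats) {
    std::string letters;
    for (const auto& s : stats) {
        int recompressed = s.second.first, total = s.second.second;
        if (total > 0 && recompressed == 0) letters += s.first;  // 0/X case
    }
    return letters.empty() ? "" : "-t-" + letters;
}
```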
The newest packJPG (2.5k) can be found here:
https://github.com/packjpg/packJPG
The newest packMP3 can be found here:
http://packjpg.encode.ru/?page_id=19
DNG compresses either lossily (JPEG DCT) or losslessly; the lossless compression is done by
LJ92 (Lossless JPEG 1992, DPCM+Huffman), which was developed 24 years ago.
Recompressing this could give the opportunity to use modern compression algorithms.
stdin/stdout for "> pcf" is probably not needed much, but for repacking "< pcf" it could be very useful, since after Precomp there will be a compressor that supports it, and stdin/stdout speeds up the process a lot.
The Base64 code uses a line length array, but this array was not added to the recursion_push()/_pop() routines, so it is overwritten in recursion, resulting in wrong line lengths when restoring (if the line lengths differed).
There's been an update on XZ Utils at the end of 2016. Changes do not look like they'd fix or improve something in the code Precomp uses, but it might be good updating anyway.
JPG detection is done 1 byte at a time (fread(in, 1, 1, fin)). If there are a lot of FF D8 FF (SOI + FF) sequences in the file, this slows down Precomp. More importantly, it also slows down JPG processing for corrupt or invalid JPGs because the search potentially continues until the end of the file.
A better way would be to use in_buf here until we cross its borders and allocate new search buffers afterwards.
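A minimal sketch of the buffered approach, assuming the marker search operates on an in-memory buffer instead of per-byte fread calls. The function name and buffer handling are illustrative, not Precomp's actual code:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical sketch: scan a buffer for the JPG SOI marker FF D8 FF
// instead of reading one byte at a time with fread(in, 1, 1, fin).
// Returns the offset of the first match, or -1 if none is found.
long find_jpg_soi(const unsigned char* buf, size_t len) {
    static const unsigned char soi[3] = {0xFF, 0xD8, 0xFF};
    for (size_t i = 0; i + 3 <= len; ++i) {
        if (memcmp(buf + i, soi, 3) == 0) return (long)i;
    }
    return -1;
}
```

A caller would refill the buffer (or, as proposed, allocate a new search buffer) only when the scan crosses its border, avoiding one I/O call per byte.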
There are several types of data inside SWF that are compressed using ZLIB according to the specification:
These are detected in intense mode, but it would be useful to extend the parser to detect them in normal mode, too.
latest precomp 0.4.6 crashes on this file every time:
In the file 'precomp.cpp', line 296 (use_mp3 = switches.use_mp3;) uses a parameter that is not defined in the Switches class.
An example file provided by Gonzalo shows a slowdown for files where packMP3 gives a "synching failure" error message. In the example file, this happens every 522 bytes until position 4552728, where packMP3 is finally successful on a small part of the file (852636 bytes):
(0.04%) Possible MP3 found at position 2197, length 5403167
packMP3 error: synching failure (frame #762 at 0x6131A)
No matches
(0.05%) Possible MP3 found at position 2719, length 5402645
packMP3 error: synching failure (frame #761 at 0x61110)
No matches
(0.06%) Possible MP3 found at position 3242, length 5402122
packMP3 error: synching failure (frame #760 at 0x60F05)
No matches
[...]
(84.23%) Possible MP3 found at position 4552728, length 852636
Best match: 852636 bytes, recompressed to 744546 bytes
New size: 5297324 instead of 5405364
Done.
Time: 50 minute(s), 31 second(s)
Recompressed streams: 1/8712
MP3 streams: 1/8712
Passing big parts of the file thousands of times to packMP3 slows down the process to 50 minutes for a 5 MB file, which should be avoided.
Some MP3 formats are not supported by packMP3, see its readme:
Please note that MP3 may stand for three different audio file types:
MPEG-1 Audio Layer III, MPEG-2 Audio Layer III and MPEG-2.5 Audio Layer
III. Only the first type is supported by packMP3. The file types may not
be distinguished by their extension (which would be '.mp3' for each of
them) but by their sample rates when playing in audio player software.
Only MPEG-1 Audio Layer III supports sample rates of 32kHz and above.
If packMP3 is called with such a file, it gives an error message, e.g. "file is MPEG-2 LAYER III, not supported".
Precomp detects these unsupported types itself, but doesn't give any information in verbose mode. It should, to avoid confusion.
The Makefile for Linux was changed with commit 0c3d924. I couldn't check yet whether it still works; this has to be done.
It's possible to recompress files compressed by MScompress.
This would be a cool feature in Precomp as MScompress is still used in some installers.
PAQ8pxd_v18 is capable of LZSS recompression. Program and sources are available here:
https://encode.ru/attachment.php?attachmentid=4536&d=1468876571
There's also an open-source implementation of MScompress available here:
https://github.com/coderforlife/ms-compress
In addition to improving the ratio for a final recompression pass, precomp by nature also improves deduplication ratios across a dataset since the content may coexist in both uncompressed and compressed forms. So there's a benefit to deduplication ratio (across a long range data stream/set) as well as compression.
That said, I'd like to propose that precomp support a bijective transform on an input (stream) which is applied even to an uncompressed archive - whether that's raw format, a tarfile, cpio/ditto, a ZFS send stream, zip -0, or an excerpt from an uncompressed vmdk using a few popular filesystem formats.
This is proposed because it would further precomp's ability to normalise data and prepare it for efficient deduplication and compression. Variable-block deduplication processing can be done by tools like ddar (great since negligible memory needed), srep (more efficient than lrzip), and pcompress, as an example. In principle, I could set the (average) block size for deduplication small enough so that more blocks are considered equivalent and thus deduplicated. But that's more than what's needed. To do that completely might involve multiplying the resource requirements by a factor of ten. Analogously, if each of those tools chose a large enough blocksize, then the periodic presence of excess bytes in the stream (constituting the internal format) wouldn't cause as big of a problem. But a large blocksize is not always chosen, so this wouldn't be sufficient either.
What if the tar/cpio format, ZFS stream, or VMDK/filesystem stream was reversibly altered in such a way that a deduplication pass on the resultant hybrid data stream would consider these multiple representations as nearly equivalent? Suppose for example that I have a tarfile and I've also unpacked the raw files. If I create a cpio file of the whole tree, including all of that, I now have multiple representations of the same data. I'd like to not have to store this twice. It's easier to notice when everything is in the same subdirectory, but suppose there are a few times data is replicated. Also, ideally this idea ought to apply recursively. A cpio file containing similar tarfiles should be just considered as a nested structure referencing some identical files.
The more that data is expressed in its most native format and the extent to which a canonical representation can be derived, the better the deduplication and compression. I think that most formats mentioned could be translated between each other as part of a stream processing with bounded memory. An imperfect transformation that encodes the side information would also suffice as long as the transform is reversible. In theory something like xdelta3 with just the right settings would help; maybe there's a way to generalise the idea without considering the particulars of a multitude of archive formats.
When using -n to convert a PCF file to its bZip2 compressed version or vice versa, e.g. precomp -n test.pcf, the new size of the file is not shown (in contrast to other operations like precomp test.pdf). It should be.
In packMP3, compress_mp3, there are various checks for frame header consistency (checksums, stereo int, emphasis, mixed blocks, private bit...). All of them throw errors and stop processing because packMP3 wouldn't be able to restore such a file identically.
Similar to issue #29, this slows down Precomp because it will pass all frames to packMP3 and after failure, will continue with the next frame.
Precomp has to check for these inconsistencies itself to speed up processing here.
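A minimal sketch of such a pre-check, covering only the basic header fields (sync word, MPEG-1 version, Layer III, valid bitrate/samplerate indices). The function name is hypothetical, and the real set of consistency checks in packMP3 is larger than what is shown here:

```cpp
#include <cassert>

// Hypothetical sketch: reject MP3 frame headers that packMP3 would refuse
// anyway, before handing the stream over. Bit layout follows the standard
// 32-bit MPEG audio frame header.
bool mp3_frame_header_ok(const unsigned char h[4]) {
    if (h[0] != 0xFF || (h[1] & 0xE0) != 0xE0) return false; // 11-bit sync word
    int version    = (h[1] >> 3) & 0x03;  // 3 = MPEG-1, the only type packMP3 supports
    int layer      = (h[1] >> 1) & 0x03;  // 1 = Layer III
    int bitrate    = (h[2] >> 4) & 0x0F;  // 15 = invalid, 0 = free format
    int samplerate = (h[2] >> 2) & 0x03;  // 3 = reserved
    if (version != 3 || layer != 1) return false;
    if (bitrate == 0x0F || bitrate == 0) return false;
    if (samplerate == 3) return false;
    return true;
}
```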
At the moment, everything is done using temporary files, so even very small streams are processed on disk instead of memory which slows down Precomp a lot. On the other hand, it allows handling of very large streams that are several GB in size.
Proposal: There should be a parameter to set a max. memory size, everything up to that size will be processed in memory and only if more memory is needed, streams are processed on disk using temporary files.
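The proposal could be sketched as a stream wrapper that buffers in memory up to the configured limit and spills to a temporary file only beyond that. Class and member names here are illustrative, not Precomp's actual API:

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// Hypothetical sketch: keep stream data in memory up to mem_limit bytes;
// on the first write that would exceed the limit, move everything to a
// temporary file and continue on disk.
class HybridStream {
public:
    explicit HybridStream(size_t mem_limit) : limit(mem_limit), tmp(nullptr) {}
    ~HybridStream() { if (tmp) fclose(tmp); }

    void write(const unsigned char* data, size_t len) {
        if (!tmp && mem.size() + len <= limit) {
            mem.insert(mem.end(), data, data + len);   // still fits in memory
        } else {
            if (!tmp) {                                // first spill: flush buffer to disk
                tmp = tmpfile();
                fwrite(mem.data(), 1, mem.size(), tmp);
                mem.clear();
            }
            fwrite(data, 1, len, tmp);
        }
    }

    bool on_disk() const { return tmp != nullptr; }

private:
    size_t limit;
    std::vector<unsigned char> mem;
    FILE* tmp;
};
```

Small streams never touch the disk, while multi-GB streams degrade gracefully to the current temporary-file behaviour.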
./precomp.exe -v -cn 2box4.zip
Precomp v0.4.3 - ALPHA version - USE FOR TESTING ONLY
Free for non-commercial use - Copyright 2006-2012 by Christian Schneider
Input file: 2box4.zip
Output file: 2box4.pcf
Using packjpg25.dll for JPG recompression.
--> packJPG library v2.5a (12/12/2011) by Matthias Stirner / Se <--
More about PackJPG here: http://www.elektronik.htw-aalen.de/packjpg
ZIP header detected
(0.00%) ZIP header detected at position 0
compressed size: 50222
uncompressed size: 108368
file name length: 9
extra field length: 0
(0.08%) Possible zLib-Stream in ZIP found at position 0, windowbits = 15
Can be decompressed to 108368 bytes
No matches
New size: 50364 instead of 50338
Done.
Time: 362 millisecond(s)
Recompressed streams: 0/1
ZIP streams: 0/1
None of the given compression and memory levels could be used.
There will be no gain compressing the output file.
Can be decompressed to 108368 bytes
No matches
That's obviously wrong. Can be and will not be. :-)
If this issue has already been resolved then I'm sorry.
Multithreading would speed up the whole process a lot, the most time consuming routine (bruteforcing the 81 different zLib modes) should be fully parallelizable, but refactoring is needed to make it threadsafe.
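A minimal sketch of how the brute force could be parallelized, assuming the per-mode attempt is made thread-safe first. The 81 combinations are the 9 compression levels x 9 memory levels of zLib; try_recompress stands in for the real routine and is a placeholder here:

```cpp
#include <cassert>
#include <future>
#include <vector>

// Hypothetical sketch: the 81 zLib mode combinations (compression level
// 1..9 x memory level 1..9) are independent of each other, so each can be
// tried in its own task. Returns the number of modes that matched.
template <typename TryFn>
int parallel_zlib_bruteforce(TryFn try_recompress) {
    std::vector<std::future<bool>> jobs;
    for (int level = 1; level <= 9; ++level)
        for (int memlevel = 1; memlevel <= 9; ++memlevel)
            jobs.push_back(std::async(std::launch::async,
                                      try_recompress, level, memlevel));
    int matches = 0;
    for (auto& j : jobs)
        if (j.get()) ++matches;  // wait for all workers, count successes
    return matches;
}
```

In practice a thread pool bounded by the core count would be preferable to spawning 81 tasks at once, but the structure is the same.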
Detect and recompress MP3 streams using PackMP3.
For some images from cameras that include thumbnails, JPG detection fails. In the attached example file, there are two streams: a 4608x3456 image and a 1440x1080 thumbnail.
PackJPG only detects the first stream and handles the second one as "garbage following EOI". Precomp detects both streams, but the length of the first one is incorrect (only 11 KB).
There are several types of video/audio containers like avi, mpeg, mp4, ogm, webm, flv, mov, mkv, ts, 3gp that can contain MP3 streams. These will often be interleaved so that the detection (5 consecutive frames) won't succeed. Also, they don't have to be in the right order.
"Demuxing" the MP3 streams would fix this, for some containers like AVI this seems to be very easy, for others, it could be more complex.
wavetlan.com seems to be a good place for testfiles (section "Video Formats")
DDS files that use DXT compression can be also recompressed.
This would be a nice feature of precomp.
There's a page describing decompression:
http://matejtomcik.com/Public/KnowHow/DXTDecompression/
Especially for the built-in bZip2 compression that uses blocks of 100..900 KB, it would be good to sort similar content together.
A simple way would be sorting by stream type (PNG, PDF, Base64...), more sophisticated ways would analyze the data, e.g. using a histogram.
The Windows makefiles use "-march=pentiumpro -m32" for the 32-bit version and "-march=x86-64 -m64" for the 64-bit version.
At the moment, the *nix makefiles don't use any of these options, so they will compile a 32-bit version on 32-bit systems, a 64-bit version on 64-bit systems.
Good explanation of filters, cascades and a list of available filters: http://blog.didierstevens.com/2008/05/19/pdf-stream-objects/
FlateDecode is already implemented
There's a slowdown on the message "big value pairs out of bound". Sounds similar to issues #29, #34 and #35, but this time, it's deep in the packMP3 encoding code instead of parsing code, so we can't do this on the Precomp side.
A possible approach would be to skip future streams that look like this specific stream (sum of position and length is the same).
Hi,
I wanted to push the precomp application to the macOS homebrew manager (https://brew.sh) and opened a pull request there. There are some changes though that would have to be made and that can only be fixed upstream in this repository. You can look at the pull request here: Homebrew/homebrew-core#15562
First of all, the Unix Makefile would need an install target (that should be easy to do, it just needs to install the built precomp binary to a suitable location). The install location should be configurable via a variable in the Makefile, so the brew formula can override it and install the application to the correct folder automatically.
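A minimal sketch of such an install target, following common Makefile conventions (the variable names are conventional, not taken from Precomp's actual Makefile):

```makefile
# Conventional install target with an overridable prefix; a Homebrew
# formula could then run: make install PREFIX=<cellar path>
PREFIX ?= /usr/local
BINDIR ?= $(PREFIX)/bin

install: precomp
	mkdir -p $(DESTDIR)$(BINDIR)
	install -m 755 precomp $(DESTDIR)$(BINDIR)/precomp
```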
Secondly, the main Makefile uses the -s
flag which is not available on OS X Yosemite and thus, precomp would not be supported on all platforms that are still served by brew (El Capitan and Sierra ignore this flag, but Yosemite does not compile with it at all). Is this flag really needed? According to GCC, it is supposed to "Remove all symbol table and relocation information from the executable."
The other two requested changes from the pull request have to be made by me in the formula, so there is nothing you would need to change.
Description of chunk format:
Sony uses the .ARW format for their raw photographs.
For many years those files were hardly compressible but now they can be decoded losslessly.
The sources can be found here
https://github.com/FLIF-hub/FLIF/tree/master/raw-tools/sony_arw
Could this format please be supported by Precomp?
Precompressing this file --
precomp_segfault.zip
with command line --
-cn -oprecomp_tar
segfaults precomp on Linux 64.
I tried to get the backtrace, but the debugging symbols were not there, neither in the binary version nor in the one that I built from git. I tried changing the CFLAGS and added -ggdb, but the debugging symbols still got stripped. Changing the optimization levels to -O0 for all libraries solved the problem, but of course it's a bad solution.
There are still false MP3 positives, e.g in this TeamViewer Debian package:
(0.00%) Possible bZip2-Stream found at position 2084, compression level = 9
Can be decompressed to 192122880 bytes
Identical recompressed bytes: 53084090
Identical decompressed bytes: 192122880 of 192122880
Real identical bytes: 53084090
Best match: 53084090 bytes, decompressed to 192122880 bytes
Recursion start - new recursion depth 1
(52.25%) Possible MP3 found at position 100378755, length 2400
packMP3 error: region size out of bounds (frame #0 at 0x0)
No matches
[...]
Recompressed streams: 1/140
MP3 streams: 0/139
bZip2 streams: 1/1
Doing the region size check before passing the data to packMP3 would solve this.
Compressed SWF files contain a version byte. At the moment, it is checked if this version byte is between 0 and 10:
if ((swf_version > 0) && (swf_version < 10)) {
Unfortunately, the highest possible SWF version is 32 as of Mar 10, 2016 (list at Adobe, StackOverflow question) and it's very likely to increase further.
As the 3 preceding bytes (CWS) don't leave too much room for false positives, the best solution is to just ignore the version flag.
At the moment, the file extension is replaced with ".pcf".
It would be nice to have an option to preserve the file extension, e.g. processing a file called "file.tar" would result in a new file called "file.tar.pcf" instead of "file.pcf".
The default should still be replacing the extension.
Currently precomp only supports compressing using bzip2 which is not too bad. However most people probably turn compression off and use another tool to manually apply stronger compression. XZ (http://tukaani.org/xz/) is already widely used in the *nix world and would be an excellent choice for a strong, yet not exotic compression method.
Even more interesting: Importing liblzma as contrib might also make it possible to allow for recompression of lzma streams. FreeBSD as well as a lot of Linux distributions use xz for their packages. Those often contain gzipped files like manpages and stuff like that. So if precomp could decompress lzma streams, streams with weak compression could be found and dealt with using recursion. A lot of open source projects offer source code in xz compressed tarballs, as well, and precompressing e.g. a project's source iso would greatly benefit from xz support.
I've run precomp 0.4.6 on the attached file, which is compressed with gzip, but it seems that Precomp does not recognize the embedded Base64 streams, which would be easy to support. Also refer to my post at http://encode.ru/threads/2036-Leanify?p=49016&viewfull=1#post49016 for examples of XML files generated from SWF files. If such a tool (SWF converter to XML and back) is made, then better SWF support is gained.
Acacia_High.zip
PreComp 0.46 (64 bit), Win 10 x64.
Files in dir after crash (file starting with D5803 - source):
~temp00000000.dat (0 B)
~temp000000000.dat (16382 B)
~temp000000001.dat (1735 MB)
~temp000000001.dat_ (2241 MB)
D5803 23.5.A.1.291 Customized RU 1289-0445 R3D (by skapunkcsd90@4pda).ftf (1616 MB)
precomp.exe (1515 KB)
Command line:
precomp.exe -cn -intense -brute "D5803 23.5.A.1.291 Customized RU 1289-0445 R3D (by skapunkcsd90@4pda).ftf"
(and with compression, -cl, - same thing)
94.04% -
ERROR 3: There is not enough space on disk
Even on another PC and OS - same position (94.04%) and same error 3. Other similar FTF files >1 GB in size - same issue (different positions). But a 700 MB file works fine.
Path to folder - C:\PreComp. Disk C is NTFS, 652 GB free.
3D modelling file format used by [Rhinoceros 3D](https://en.wikipedia.org/wiki/Rhinoceros_3D).
There is a C++/.NET SDK to read and write 3DM files called openNURBS: https://www.rhino3d.com/opennurbs
There's a bug in the PNG multi routines that leads to an integer underflow and incorrect restoration of the attached file (original file 9 KB, incorrect restored file 4 GB). The problem here seems to be the combination of two IDAT chunks and the second one being very short (only 3 image data bytes).
At the moment, you can set a list of ignore positions using the '-i' parameter. However, when in recursion, positions are reset to 0, so a given position like 1234 can occur several times at different recursion levels, which is unwanted behaviour - the user wants to ignore a certain stream and nothing more.
Proposal: To prevent this, the stream position output in verbose mode should output positions in the format xxxx_yyyy_zzzz... - every recursion level adds an underscore to the position, so e.g. a stream starting at position 1234 that contains a stream starting at position 567 would lead to the position 1234_567. The ignore list parameter parsing and usage in the detection would have to be changed accordingly.
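Building such a position string from the stack of per-level positions is straightforward; this sketch uses a plain vector as the recursion stack (the function name and representation are illustrative, not Precomp's internals):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch of the proposed format: each recursion level appends
// its local stream position with an underscore separator, so a stream at
// position 567 inside a stream at position 1234 becomes "1234_567".
std::string recursion_position(const std::vector<long long>& positions) {
    std::string result;
    for (size_t i = 0; i < positions.size(); ++i) {
        if (i > 0) result += '_';
        result += std::to_string(positions[i]);
    }
    return result;
}
```

The ignore list lookup would then compare these strings instead of plain offsets, making each entry unambiguous across recursion levels.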
When using Precomp for self extracting archives, the following would be useful to reduce their size:
This could be done using modular concepts, e.g. using #ifdefs
or treating recompression methods as plug-ins.
Recompressed streams: 90/90
PDF streams: 88/88
PDF image streams (8-bit): 0/1
JPG streams: 1/1
In the output above, 88 + 0 + 1 = 89, and 89 != 90, so the count of successfully recompressed streams is wrong.
There are newer versions of PackJPG, bZip2 and zLib. All of them have bugfixes and minor improvements, bZip2 even fixes a potential security vulnerability.
When using PAQ compressors on files processed with the "pdfbmp" switch, images inside PDFs are wrapped with a dummy BMP header so that the compressor can switch to the image model and compress the image both faster and better.
This should also be implemented for GIF image data.
For PNG image data, research is needed. It might also help, but the image data is filtered here (see here), so it might be necessary to unfilter it.
The switch should be renamed to "wrapbmp" after this.
Image data is present in many of the streams Precomp supports, e.g. PDF, PNG and GIF. On this data, specialized image compression algorithms can be used to get better compression ratio than with the general purpose compression methods the -c option offers (bzip/lzma).
At the moment, the -pdfbmp option does something similar by converting image data detected in PDF documents to BMP format so higher compression ratios can be achieved when processing the resulting files with compressors from the PAQ family later.
Compressing the image data using the FLIF library would offer a way to get higher compression ratios on image data using only Precomp. This would also be useful for people that want to convert existing image files to FLIF without losing the original file.
At the moment, additional characters after valid switches are ignored, which can lead to unwanted behaviour. For example, a user could call precomp -cn-v file (no space before -v) and be surprised that verbose mode is off.
This should lead to an error ("superfluous characters in switch -cn-v") instead of ignoring the characters.
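A minimal sketch of strict switch parsing: a known switch must match exactly, so trailing characters like "-cn-v" are rejected instead of silently ignored. The function name is hypothetical, and the accepted letters (n/b/l) merely illustrate -cn / -cb / -cl:

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: parse a -c<method> switch strictly. Any argument
// longer than the switch itself ("-cn-v") fails instead of being treated
// as a valid "-cn" with ignored extra characters.
bool parse_compression_switch(const std::string& arg, char& method) {
    if (arg.size() != 3 || arg.compare(0, 2, "-c") != 0) return false;
    method = arg[2];
    return method == 'n' || method == 'b' || method == 'l';
}
```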