Comments (6)
Hi!
Calling BitArchiveReader::items takes good 6 seconds while official 7zFM.exe opens it in no time. Ideas?
Part of the reason is that items()
is a vector of whole item objects (a.k.a. BitArchiveItemInfo
).
This means that each BitArchiveItemInfo
object stores all the properties about that item that bit7z could get via 7-Zip; these properties are stored in a map BitProperty
(the property) -> BitPropVariant
(the value of the property).
Unfortunately, 7-Zip doesn't provide a method to know which item properties are present or not (as far as I know). So what bit7z does is to try to get all of them (~95 iirc); if it fails, the property is not present, otherwise it is stored in the map.
Now, if you do all this for ~350000 items, it can have a significant overhead, obviously.
Bit7z does this because it doesn't know how the user wants to use the vector returned by items()
.
Specifically, the user might want to use the content of the vector after the BitArchiveReader
went out of scope, and the archive was closed. So bit7z has to store every information about the items.
However, there's an alternative to items()
: you can directly iterate over a BitArchiveReader
. This way, you pretty much avoid the overhead of reading and storing all the properties of all the items.
BitArchiveReader
has the begin()
/cbegin()
/end()
/cend()
methods that allow, for example, to iterate over it with a for each loop.
An iterator of BitArchiveReader
points to a BitArchiveItemOffset
object, which is basically just an index of an item within the archive.
BitArchiveItemOffset
has the same exact interface of BitArchiveItemInfo
, but doesn't store any property, it only stores the index of the item. When you ask for something like the path()
of the item, bit7z will directly ask 7-Zip for its value.
For example:
BitArchiveReader reader{ lib, "qt-everywhere-src-6.7.0.zip", BitFormat::Zip };
const auto items = reader.items();
for ( const auto& item : items ) {
std::cout << item.index() << ") " << item.path() << '\n';
}
becomes
BitArchiveReader reader{ lib, "qt-everywhere-src-6.7.0.zip", BitFormat::Zip };
for ( const auto& item : reader ) {
std::cout << item.index() << ") " << item.path() << '\n';
}
Sorry for the wall-of-text, I just wanted to give some insight on why items()
works like that and might have a significant overhead in some cases like yours.
from bit7z.
PS.
for ( uint32_t i = 0; i < itemsCount(); ++i )I don't think it's a good idea to call itemsCount over 350.000 times (for this particular archive).
const uint32_t numberOfItems = itemsCount();would be better, though it doesn't help anyways in the matter.
Nice catch!
I think the compiler might be able to optimise it out of the loop condition, but I'm not sure if it can in this particular case.
I'll fix it in the next minor release!
Thanks for pointing it out!
from bit7z.
I've done some quick benchmarks (without any rigorous statistical analysis):
// Optimizing std::cout on MSVC
setvbuf(stdout, nullptr, _IOLBF, 4096);
std::ios::sync_with_stdio(false);
// Printing the content of the archive
BitArchiveReader reader{ lib, "qt-everywhere-src-6.7.0.zip", BitFormat::Zip };
const auto start_items = std::chrono::high_resolution_clock::now();
const auto items = reader.items();
for ( const auto& item : items ) {
std::cout << item.path() << '\n';
}
const auto end_items = std::chrono::high_resolution_clock::now();
const auto start_it = std::chrono::high_resolution_clock::now();
for ( const auto& item : reader ) {
std::cout << item.path() << '\n';
}
const auto end_it = std::chrono::high_resolution_clock::now();
std::cout << std::endl;
std::cout << "Time taken\n";
std::cout << " using items(): " << std::chrono::duration_cast<std::chrono::seconds>(end_items - start_items) << '\n';
std::cout << " using iterators: " << std::chrono::duration_cast<std::chrono::seconds>(end_it - start_it) << '\n';
Results (release build, MSVC 2022):
Time taken
using items(): 7s
using iterators: 3s
Obviously, the more information you print, the slower the loop, in any case.
For example, if you want to imitate the l
subcommand of 7z.exe
void print_item(const BitArchiveItem& item) {
const auto date = item.itemProperty(BitProperty::MTime);
const auto attr = item.isDir() ? "D...." : ".....";
std::cout << date.getTimePoint() << ' ' <<
attr << ' ' <<
std::setw(8) << item.size() << ' ' <<
std::setw(8) << item.packSize() << ' ' <<
item.path() << '\n';
}
// use print_item(item); in the for loops
The results are:
Time taken
using items(): 11s
using iterators: 6s
In general, the version using iterators is always faster than the one using the items()
function.
In my tests, iterators were consistently ~4s faster.
The 7z.exe
is significantly slower. However, the comparison is a bit unfair, as it does more things, like parsing the CLI arguments, open the archive (in my program I didn't include this), etc.
>cmake -E time 7z l qt-everywhere-src-6.7.0.zip
<7z output>
Elapsed time (seconds): 67.6411
My guess is that 7zFMT.exe
is even faster simply because it doesn't have to immediately display all the paths.
For example, if we non-recursively print only the items inside the root folder:
Time taken
using items(): 4s
using iterators: 0s
from bit7z.
thanks a lot for such a detailed explanation, it works comparable now!
from bit7z.
You're welcome!
from bit7z.
Just for reference, besides calling itemsCount()
only once, I found out that the items()
method could be further optimized.
The biggest improvement was achieved by calling std::vector::reserve
, which is possible since we already know the number of items that the vector will contain.
Another (smaller) improvement was achieved by moving the BitArchiveItemInfo
objects that are pushed into the vector instead of copying them.
From my tests, after all the changes, itemsCount()
is about 25% faster on average than before.
For example, with the same test program I used for this thread to emulate the 7z.exe
program, items()
is now only ~2s slower than the iterator version:
Time taken
using items(): 8s
using iterators: 6s
The changes are available in the latest v4.0.7.
from bit7z.
Related Issues (20)
- [Bug]: UpdateMode::Update not working as expected, throwing exception HOT 3
- [Feature Request]: Creation or extraction archive file in one line HOT 1
- [Feature Request]: Support for compressed packages in four formats: ace, img, uue, and war HOT 2
- [Feature Request]: Extracting files whose names contain forbidden characters HOT 1
- Do I need to build 7zip myself? HOT 6
- [Bug]:
- [Bug]: Multiple Definition Linker Error for IID_IUnknown HOT 10
- [Bug]: BIT7Z_DISABLE_USE_STD_FILESYSTEM not working HOT 4
- [Bug]: Compile error when bit7z.lib is introduced into the project. HOT 11
- [Feature Request]: Is "tstring_to_path" forgotten here? HOT 1
- [Bug]: HOT 3
- [Feature Request]: Compresses files open for writing by another applications HOT 1
- [Bug]: CreateObject 函数调用异常
- [Bug]: core dump with BitArchiveWriter HOT 9
- [Bug]: Some zip file names are garbled HOT 5
- [Bug]: HOT 3
- [Feature Request]: Link static 7z lib HOT 1
- [Bug]: Inconsistent encoding of string parameter HOT 1
- [Bug]: "Failed to open the output file: The system cannot find the path specified" error when extracting 7z file HOT 3
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from bit7z.