Comments (14)

bodgit commented on May 26, 2024

That does present challenges, and I understand the implementation differences. You'd probably have to create a map of file name to "byte ranges" to know which file it's reading from at any given time. That would allow me to identify which archive file is corrupted, perhaps. Probably a lot of work for you.

I would probably have to add an optional interface that an io.ReaderAt could implement that allows you to query what the underlying file(s) were for a given byte offset and then surface that information in a read error. It would likely tie in with #38 to provide a hint if there was any encryption involved that could point to an incorrect password.

Having a list of archive names is probably the easier task. I only rely on that when the extraction finishes without errors. I'm currently more interested in "a full file list after successful extraction" than I am being able to identify the bad file.

Given I have to open every volume up front, a Volumes() method that just returns the full list would likely be correct in any case so I'll add that.

I see @ulikunitz asked you a question on the PR you opened. Let me know if you need anything from me there to help answer that.

Yes, if you still have your forked copy of the xz library, could you add a fmt.Println or similar that dumps out the r.h.dictCap value from the archive header (where it was previously erroring if it was less than MinDictCap) when reading those archives. I'm curious if it's entirely unset, or just set to a value smaller than 4096.

from sevenzip.

davidnewhall commented on May 26, 2024

I also have two sets of 7z files that this library will not extract. They're 10-15 years old. One set (4 files) is about 750MB. The other (3 files) is about 450MB. Are you interested in having these files to test with? Both produce this error:

os.Open: 1 error occurred:
	* lzma: dictionary capacity too small

bodgit commented on May 26, 2024

So it just returns either the sole archive filename, or the list of .001, .002, etc. filenames when the archive is split? That doesn't seem unreasonable to add.

I also have two sets of 7z files that this library will not extract. They're 10-15 years old. One set (4 files) is about 750MB. The other (3 files) is about 450MB. Are you interested in having these files to test with? Both produce this error:

os.Open: 1 error occurred:
	* lzma: dictionary capacity too small

This error comes from the underlying LZMA library I'm using. The LZMA stream in the .7z archive should specify the size of the dictionary but maybe these old archives don't for some reason. This is the code doing the check:

https://github.com/ulikunitz/xz/blob/0b7c695d23f84aa7e968bbcaa1980847683d909a/lzma/reader.go#L72-L78

This code looks a bit suspect to me as I should be able to override the dictionary capacity, but the code checks the size in the stream before allowing the config to override it. I've forked the library and tweaked the order of the statements and pointed a branch to that fork. Can you try the dictcap branch of the sevenzip library and see if that works any better? If so, I'll file a PR with the xz library. Hopefully that will save passing 450-750 MB archives about 😉

davidnewhall commented on May 26, 2024

Trying it shortly!
EDIT:
first go..

$ go get -u github.com/bodgit/sevenzip@dictcap
go: downloading github.com/bodgit/sevenzip v1.3.1-0.20221211005010-415a639b09a0
go: github.com/bodgit/sevenzip@dictcap: github.com/bodgit/sevenzip@v1.3.1-0.20221211005010-415a639b09a0: verifying module: github.com/bodgit/sevenzip@v1.3.1-0.20221211005010-415a639b09a0: reading https://sum.golang.org/lookup/github.com/bodgit/sevenzip@v1.3.1-0.20221211005010-415a639b09a0: 404 Not Found
	server response: not found: fetch timed out

^^ There's an issue (closed yesterday) on the Go GitHub repo for this; apparently it's not fixed. This worked:

GOPRIVATE=github.com/bodgit/sevenzip GOPROXY=direct go get -u github.com/bodgit/sevenzip@415a639b09a0342b3e014614f973273f3ebcd539

davidnewhall commented on May 26, 2024

Both archives still produce the same error with that branch. EDIT: I suspect adding that replace in go.mod isn't working its way through my working tree.

davidnewhall commented on May 26, 2024

I cloned your xz repo, pulled the dictcap branch, and replace'd it in my local go.mod. I even made sure your change was there, and added a word to the error message to make sure I had that change when the error reoccurred. Glad I did too because I typo'd the replace the first time around, and would not have caught it had I not changed the error.

Anywho.. it works now. That change allowed me to extract those archives. 🤞 you get upstream to accept it

bodgit commented on May 26, 2024

ulikunitz/xz#52 raised for fixing the unreadable archives.

bodgit commented on May 26, 2024

Now that I think about it, it would be best if the Volumes() method provided point-in-time data, so when we get an error, we could look at the last slice item returned to identify the corrupted archive. In other words, the method should only return files that have been processed, and the one currently processing (if it's not done). I'm taking this idea from here: https://github.com/nwaples/rardecode/blob/e2fa07408d4b19ae0500efbcc6983863c95f821e/reader.go#L359-L364

The 7-zip code has no idea that it's dealing with an archive split into multiple volumes. The OpenReaderWithPassword() function just opens all of the volumes, creates an io.ReaderAt implementation that spans the volumes and treats them as one big file, and then passes it off to the rest of the code. That's the only interface the code needs.

I'm also under the impression that that particular RAR library treats archives in a linear manner; it starts at the beginning of the archive and processes each volume in order. Whereas my package allows you to access any file randomly in the archive so you can bounce around reading from any volume, plus all of the archive metadata is usually found at the end of the archive, so you need to have opened all of the volumes for the initial seek to hit the right location.

davidnewhall commented on May 26, 2024

That does present challenges, and I understand the implementation differences. You'd probably have to create a map of file name to "byte ranges" to know which file it's reading from at any given time. That would allow me to identify which archive file is corrupted, perhaps. Probably a lot of work for you.

Having a list of archive names is probably the easier task. I only rely on that when the extraction finishes without errors. I'm currently more interested in "a full file list after successful extraction" than I am being able to identify the bad file.

I see @ulikunitz asked you a question on the PR you opened. Let me know if you need anything from me there to help answer that.

Thanks for all your help!

davidnewhall commented on May 26, 2024

Sorry just saw this. Will do that shortly!
EDIT:
First archive (4 * 200MB) printed several numbers:

dictCap 2048
dictCap 67108864
dictCap 1048576
dictCap 1048576
dictCap 67108864

Next archive (3 * 200MB):

dictCap 1024
dictCap 67108864
dictCap 1048576
dictCap 1048576
dictCap 67108864

And a third archive (4 * 200MB):

dictCap 2048
dictCap 67108864
dictCap 1048576
dictCap 1048576
dictCap 67108864

davidnewhall commented on May 26, 2024

Release v0.5.11 of the xz library has solved the problem. All 3 archives extract after updating that module.

bodgit commented on May 26, 2024

First archive (4 * 200MB) printed several numbers:

dictCap 2048
dictCap 67108864
dictCap 1048576
dictCap 1048576
dictCap 67108864

The first stream that will be opened in each archive is the compressed archive metadata which will tend to be quite small. So it seems it's set, just to a value smaller than 4096, which apparently is invalid.

Release v0.5.11 of the xz library has solved the problem. All 3 archives extract after updating that module.

That would have been my next request. I'll merge the dependabot PR that bumps the xz package.

bodgit commented on May 26, 2024

I've pushed v1.4.0 that has a Volumes() method and also uses v0.5.11 of the xz library.

Thanks for the feature request and bug report!

davidnewhall commented on May 26, 2024

You rock man, appreciate the quick turnaround and A+ support!
