Giter VIP home page Giter VIP logo

Comments (10)

galtgendo avatar galtgendo commented on July 17, 2024

While it would likely take a tailored archive, it looks like unix_path_seperators in DBCS locale can potentially return an incorrect result too.

from libmspack.

galtgendo avatar galtgendo commented on July 17, 2024

a sort-of-working hack
(I'm yet again reminded how horribly stupid this interface is)

Anyway, this is sort of works for me...really bad code though, also, in case of a partially completed conversion likely broken wrt. path separators.

from libmspack.

galtgendo avatar galtgendo commented on July 17, 2024

updated hack
...a slightly less broken hack - separators should be now OK in case of partial conv

from libmspack.

kyz avatar kyz commented on July 17, 2024

Thanks for reminding me of this. I actually have a well developed patch from Kevin Tardif for setting the (non-UTF8) filename encoding, but so far I haven't applied it for want of testing.

I'll look into applying it and testing it.

from libmspack.

galtgendo avatar galtgendo commented on July 17, 2024

Looking forward to that patch.
Just wonder how will you go about unix_path_seperators, cause heuristics: either they work or they get things even more broken.

from libmspack.

kyz avatar kyz commented on July 17, 2024

If an encoding is specified, the CAB filenames that are not UTF8 will be converted from their given encoding to UTF-8 using iconv(), and the UTF8 flag will be set on them.

At that point, unix_path_separators() and create_output_name() will proceed as they would with UTF8 characters. Slash and backslash are single-byte characters in UTF8.

from libmspack.

kyz avatar kyz commented on July 17, 2024

I've committed several changes to cabextract:

  • cabextract will try and pick a UTF-8 locale for LC_TYPE at startup
  • cabextract supports a --encoding option to specify the encoding of non-UTF8 filenames

Could you try out these changes?

Iif you have any sample cabinet files with non-UTF8, non-ASCII, non-ISO-8859-1 filenames, I'd appreciate a copy. I haven't found any CAB creating software that generates these type of filenames.

If you know of any software that does create these, I'd also like to know about it too.

from libmspack.

galtgendo avatar galtgendo commented on July 17, 2024
+   for (i = 0; i < (sizeof(locales)/sizeof(*locales)); i++) {
+      if (setlocale(LC_CTYPE, locales[i])) break;
+   }

This makes me a bit twitchy: as it's for purpose of tolower/towlower, I'm nearly sure there's at least one locale where that's incorrect...then again, same could be said about the current code. I've been shown once just why that monster called ICU is sometimes necessary ... I still groan about it.

Also, while you were already doing it in the code with utf8 names, I consider broken filenames a better solution than ones filled with replacement char, as the former at least gives you a minimal chance of recovering the original name (also, a lower chance of a name clash - pretty much only if the archive is corrupted).

Other than that, the code looks like something that should work, given that you're the author of libmspack too.

from libmspack.

kyz avatar kyz commented on July 17, 2024

I think these are matters of opinion; the reason the replacement character is used is because 1) it's not useful to promise to convert to UTF8 and then generate invalid UTF8 (if you want the original filenames, just don't use --encoding) and 2) it helps pinpoint what went wrong, Filenames like br�k, br��k, b�r�k are much easier to debug than brk, brk, brk. It also removes the possibility of a bug where you end up with a blank filename.

All of the locales attempted with setlocale() are chosen for their UTF8 character type, only LC_CTYPE is set, cabextract doesn't use any other locale-dependent functions. It might not always be successful, because the user's system might not have any of those locales, but in that situation a POSIX compliant setlocale() will still leave the program in the "C" locale, so the situation didn't get any worse.

With glibc, all UTF8 locales have the same character types, except tr_TR which lowercases I to ı instead of i. I think trying to support that would introduce more problems than not supporting it. People usually work with cab files created by someone else, somewhere else, so adapting to their local settings is more likely to break things.

Talking of cabinet files, I still have still seen NO cabinet files in the wild that are non-ASCII and non-UTF8. If you have any, I would greatly appreciate them.

So far this feature is entirely theoretical. The Microsoft cab creation tools always encode non-ASCII filenames as UTF8, so I'm not sure if there are any cab files that would need this new --encoding option. I would love to be proven wrong.

from libmspack.

kyz avatar kyz commented on July 17, 2024

Thanks for sending some cabinet files over email. cabextract 1.7 was just released with the --encoding option to override local and UTF8 encodings, and I added a test suite with some synthetic cab files using non-UTF8 non-ASCII filename encodings, to ensure the feature works. Thanks for your input!

from libmspack.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.