nexb / extractcode

A mostly universal file extraction library and CLI tool to extract almost any archive in a reasonably safe way on Linux, macOS and Windows.

Home Page: https://www.aboutcode.org/

Python 84.80% Shell 2.48% Roff 2.01% HTML 6.12% Perl 1.90% C 1.07% Makefile 0.41% Batchfile 1.21%

7zip archive bzip2 cab cpio decompression extract extractor gzip iso9660 libarchive lzma tar xz zip zstd

extractcode's Issues

Add support for JMOD and JIMAGE archives

JMODs (Java modules) are zip files with a modified magic number:
instead of starting with the zip signature 50 4B 03 04, they start with 4A 4D (i.e. "JM"), then 01 00, and then the zip header 50 4B 03 04.
See also
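A minimal sketch of how the header described above could be detected. The function names are hypothetical and this is not extractcode's own detection code; it only checks the byte sequence given in this issue.

```python
JMOD_MAGIC = b"JM\x01\x00"  # 4A 4D 01 00, as described in this issue
ZIP_MAGIC = b"PK\x03\x04"   # 50 4B 03 04, the regular zip local header


def is_jmod(header: bytes) -> bool:
    """Return True if the first 8 bytes look like a JMOD header."""
    return header[:4] == JMOD_MAGIC and header[4:8] == ZIP_MAGIC


def zip_offset(header: bytes) -> int:
    """Return the offset where the zip data starts: 4 for a JMOD, 0 otherwise."""
    return len(JMOD_MAGIC) if is_jmod(header) else 0
```

With such a check, an extractor could in principle skip the first four bytes and hand the rest to a regular zip reader. Whether that is sufficient for every JMOD file is an assumption that would need testing.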

Check uncompressed size before extract entries of archive

Some archives contain very large files, e.g. https://github.com/gcc-mirror/gcc/releases/tag/releases%2Fgcc-9.4.0 (with testdata), which includes tar files, two of which are 60 GB each. Extractcode extracts them by default.

Is it possible to add a size limit for such files, like an ignore option?
Or maybe to set a limit on the maximum uncompressed size of the whole archive?

extractcode's behaviour and error message output on damaged archives

When running extractcode on gcc-4.9 (download here: https://packages.debian.org/jessie/all/gcc-4.9-source/download), extractcode fails with these error messages:

nakami@debian:~/Downloads/scancode-toolkit-developNEW$ ./extractcode samples/gcc-4.9-source_4.9.2-10_all.deb 
Extracting archives...
  [####################################]                                             
ERROR extracting: writer-big.tar: 'Truncated tar archive'
ERROR extracting: test-trailing-junk.zip: 'Invalid central directory signature'
ERROR extracting: issue6550.gz: 'Error -3 while decompressing: too many length or distance symbols'
Extracting done.

With the --verbose flag, the error messages look like this:

nakami@debian:~/Downloads/scancode-toolkit-developNEW$ ./extractcode --verbose samples/gcc-4.9-source_4.9.2-10_all.deb 
Extracting archives...
Extracting: gcc-4.9-source_4.9.2-10_all.deb
[...]
Extracting: changelog.Debian.gz

ERROR extracting: writer-big.tar: 'Truncated tar archive'
ERROR extracting: test-trailing-junk.zip: 'Invalid central directory signature'
ERROR extracting: issue6550.gz: 'Error -3 while decompressing: too many length or distance symbols'
ERROR extracting: writer-big.tar: 'Truncated tar archive'
ERROR extracting: test-trailing-junk.zip: 'Invalid central directory signature'
ERROR extracting: issue6550.gz: 'Error -3 while decompressing: too many length or distance symbols'
ERROR extracting: writer-big.tar: 'Truncated tar archive'
ERROR extracting: test-trailing-junk.zip: 'Invalid central directory signature'
ERROR extracting: issue6550.gz: 'Error -3 while decompressing: too many length or distance symbols'
ERROR extracting: writer-big.tar: 'Truncated tar archive'
ERROR extracting: test-trailing-junk.zip: 'Invalid central directory signature'
ERROR extracting: issue6550.gz: 'Error -3 while decompressing: too many length or distance symbols'
Extracting done.

Are you aware of this problem?
It seems that either the nesting depth of archives within the initial archive, the depth that those nested archives have, or both depths summed up is the problem here.

Extract specific paths from an archive

It would be useful (and a prep for nexB/scancode-toolkit#14) to be able to extract a single path or a list of paths from a given archive rather than everything every time.
In particular, this would allow smarter extraction of only specific files (such as metadata from package archives) when needed, and would speed up some scans (and use less disk).
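To illustrate the idea, here is a small sketch of selective extraction with the standard library zipfile module. The extract_paths function and its glob-based selection are assumptions made for this example, not an existing extractcode API.

```python
import fnmatch
import zipfile


def extract_paths(archive, patterns, target_dir):
    """
    Extract only the members whose names match one of the glob patterns.
    Return the list of extracted member names.
    """
    extracted = []
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if any(fnmatch.fnmatchcase(name, p) for p in patterns):
                zf.extract(name, path=target_dir)
                extracted.append(name)
    return extracted
```

For example, calling extract_paths(wheel_path, ["*/METADATA"], out_dir) would write only the metadata file to disk and leave every other member untouched.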

Failed to extract windows AR lib

In https://files.pythonhosted.org/packages/55/e0/ccbf260a6545460d1da255810320d4f0cc8aee7fe4127eec07cfa6297228/PySide6_Essentials-6.3.0-cp36-abi3-win_amd64.whl

ERROR extracting: /tmp/pyside/PySide6_Essentials-6.3.0-cp36-abi3-win_amd64.whl-extract/PySide6/pyside6qml.abi3.lib: /tmp/pyside/PySide6_Essentials-6.3.0-cp36-abi3-win_amd64.whl-extract/PySide6/pyside6qml.abi3.lib
/tmp/pyside/PySide6_Essentials-6.3.0-cp36-abi3-win_amd64.whl-extract/PySide6/pyside6qml.abi3.lib
Open ERROR: Can not open the file as [Ar] archive


ERRORS:
Is not archive
ERROR extracting: /tmp/pyside/PySide6_Essentials-6.3.0-cp36-abi3-win_amd64.whl-extract/PySide6/pyside6.abi3.lib: /tmp/pyside/PySide6_Essentials-6.3.0-cp36-abi3-win_amd64.whl-extract/PySide6/pyside6.abi3.lib
/tmp/pyside/PySide6_Essentials-6.3.0-cp36-abi3-win_amd64.whl-extract/PySide6/pyside6.abi3.lib
Open ERROR: Can not open the file as [Ar] archive


ERRORS:
Is not archive
Extracting done.

Cannot extract some files

See nexB/scancode.io#827 (comment)

Ensure we are handling common tar bombs and zip bombs correctly

See:

Add support for smarter MSFT cab extraction

At the moment the extractcode lib extracts CAB files all right, but does not understand the underlying structure of the files. You end up with a pile of files named after some hash or UUID rather than the real file names they would have once installed.

The format is more or less documented here: https://msdn.microsoft.com/en-us/library/bb417343.aspx
https://github.com/n3k/PyCAB seems to implement some code to handle this in pure Python
Wine has an implementation of cabarc, which likely runs only under Wine.
cabextract is a portable, standalone extractor: http://cabextract.org.uk/

Support Python 3.9

We should enable Python 3.9 testing on CI.

Extract phar archives

These are used with PHP, for instance with composer.

Add support for zstandard archives

These are not super common but they are supported by the latest libarchive

Spaces in paths are replaced with underscore

I have extracted a Windows Docker image using extractcode. I noticed that the Program Files directory in one of the layers had the space replaced with an underscore (Program_Files vs Program Files). I expect that the path of the files/directory would not be modified when extracted.

extractcode should extract lzip archives

See http://ftp.gnu.org/gnu/gmp/gmp-5.1.3.tar.lz
These are not very frequent and in most cases there are gz or bz2 alternatives.
To consider at some point

Failure to extract a zip with trailing garbage with libarchive

This bug is tracked upstream at libarchive/libarchive#545
The failing test is:
https://github.com/nexB/scancode-toolkit/blob/develop/tests/extractcode/test_archive.py#L579

Cannot extract Lz4 file

Cannot extract lz4 file... Found in nexB/scancode.io#827 (comment)

$ cat foobar
dadsasd

$ lz4 foobar

$ extractcode --all-formats foobar.lz4 
Extracting archives...
[####################] 4                       
WARNING extracting: foobar.lz4: 'dadsasd': 
Missing type keyword in mtree specification
Extracting done.

$ ll foobar.lz4-extract/
total 8
drwxrwxr-x  2 pombreda pombreda 4096 Aug  1 15:52 ./
drwxrwxr-x 16 pombreda pombreda 4096 Aug  1 15:52 ../
-rw-rw-r--  1 pombreda pombreda    0 Jan  1  1970 dadsasd

The file is empty!

Consider patool test suite for extractcode

See https://github.com/wummel/patool/tree/master/tests/data for more test cases.

Deal properly with large UPX-compressed binaries

Scanning UPX-compressed executables does not make sense unless they could be unpacked first.
See https://en.wikipedia.org/wiki/UPX

For instance these PostgreSQL installers take a large amount of resources and time to scan.
And there is little to squeeze out of the raw binaries.

One is a large statically-linked ELF for Linux: http://get.enterprisedb.com/postgresql/postgresql-9.4.1-1-linux-x64.run
The other a Windows exe from http://get.enterprisedb.com/postgresql/postgresql-9.4.1-1-windows-x64.exe

They are not really archives but executables, which is why they are still scanned for now.
We will need to figure out a way to avoid issues when dealing with these large binaries that cannot yield much when scanned.
Both are compressed with UPX, which makes their binaries completely opaque short of decompressing them, assuming they use a standard UPX compressor.

Extractcode replaces `:` in file names with `_`

This replacement causes an issue with how Debian system package resources are found and associated in the scancode.io Docker pipeline. Some Debian .list files have : in their names to separate the architecture from the package name, e.g. libc6:amd64.list. However, extractcode extracts this file as libc6_amd64.list. The code run in the Docker pipeline looks for the original, unmodified name (libc6:amd64.list), so the pipeline does not find the declared resources of a package, because the .list file was extracted as libc6_amd64.list.

ExtractCode (from ScanCode TK) fails to extract .pkg and .exe files

There are some .pkg files that 7zip is able to extract while extractcode fails to do so.

Note that I have already run with the --all-formats option.

Improve doc for extractcode --ignore option

The extractcode doc at https://scancode-toolkit.readthedocs.io/en/stable/tutorials/how_to_extract_archives.html does not mention the "--ignore" option at all. It is quite an important option for avoiding wasted time on unnecessary files and for preventing extractcode from falling over when it encounters an invalid or corrupt archive file that isn't required.

When documenting this flag, it would be helpful to explain the interaction between extractcode --ignore and the scancode parameter of the same name. Specifically, having just spent several hours adding debug statements to the source code to understand why my extractcode --ignore globs weren't working, the piece of info that would really help is that the extractcode ignores do NOT apply to paths within the archives (e.g. my-archive.tar/tests/foo is extracted even if I use extractcode --ignore=*/tests/*) but only to the decision about which archives to unpack.
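The distinction can be sketched as follows. This only illustrates the behaviour described in this report with hypothetical helper names; it is not extractcode's actual matching code, and its real glob semantics may differ.

```python
import fnmatch


def is_ignored(path, patterns):
    """Return True if the path matches any of the ignore glob patterns."""
    return any(fnmatch.fnmatchcase(path, p) for p in patterns)


def archives_to_extract(archive_paths, ignore_patterns):
    """
    Filter on each archive's own path only. Members inside a selected archive
    are never checked against the patterns, so they are all extracted.
    """
    return [a for a in archive_paths if not is_ignored(a, ignore_patterns)]
```

Here archives_to_extract(["src/my-archive.tar", "src/tests/data.zip"], ["*/tests/*"]) keeps only src/my-archive.tar. A member such as my-archive.tar/tests/foo would match the pattern, but it is never consulted because the decision is made per archive.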

(Aside: I was wondering about creating an additional FR for applying extractcode ignores to individual files. It could be a LOT faster if extractcode didn't waste time writing to-be-ignored files such as /tests/ to disk only for them to be ignored later by scancode. If you think that's a good idea we could create an issue for that too.)

Add support for lpkg file extraction

A .lpkg file is a compressed file format that contains .jar files to be deployed to Liferay DXP. See also https://help.liferay.com/hc/en-us/articles/360018159991-Overriding-lpkg-files .

Sparse file in archive triggers "OSError: [Errno 28] No space left on device: '...' " during scan

Description

moby-20.10.5.zip

After scanning the folder 'moby-20.10.5' I ran into the following error:

(from docker logs)

There should be enough space left on the device.

How To Reproduce

Scan the folder 'moby-20.10.5' with the following function.

Note that max_depth=0, so there are no limitations.

System configuration

For bug reports, it really helps us to know:

What OS are you running on? (Windows/MacOS/Linux)
Windows/docker, as well as Linux/docker
What version of scancode-toolkit was used to generate the scan file?
21.8.4
What installation method was used to install/run scancode? (pip/source download/other)
pip with Python 3.6

Ungraceful handling of libarchive missing symbol

When extractcode-libarchive isn't present on openSUSE, we see the following. A more graceful error indicating what to do would be helpful.

[   35s] src/extractcode/archive.py:29: in <module>
[   35s]     from extractcode import libarchive2
[   35s] <frozen importlib._bootstrap>:991: in _find_and_load
[   35s]     ???
[   35s] <frozen importlib._bootstrap>:975: in _find_and_load_unlocked
[   35s]     ???
[   35s] <frozen importlib._bootstrap>:671: in _load_unlocked
[   35s]     ???
[   35s] /usr/lib/python3.8/site-packages/_pytest/assertion/rewrite.py:168: in exec_module
[   35s]     exec(co, module.__dict__)
[   35s] src/extractcode/libarchive2.py:635: in <module>
[   35s]     archive_reader = libarchive.archive_read_new
[   35s] /usr/lib/python3.8/ctypes/__init__.py:386: in __getattr__
[   35s]     func = self.__getitem__(name)
[   35s] /usr/lib/python3.8/ctypes/__init__.py:391: in __getitem__
[   35s]     func = self._FuncPtr((name_or_ordinal, self))
[   35s] E   AttributeError: /usr/bin/python3.8: undefined symbol: archive_read_new
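One possible shape for a more graceful failure is sketched below. The require_symbol helper and its message wording are hypothetical; the only facts taken from this report are that a missing symbol surfaces as an AttributeError from ctypes and that extractcode-libarchive is the package expected to provide the library.

```python
def require_symbol(lib, name,
                   hint="install the extractcode-libarchive package or a system libarchive"):
    """Look up a function on a ctypes library and fail with an actionable message."""
    try:
        return getattr(lib, name)
    except AttributeError as e:
        # ctypes raises AttributeError when the symbol is not exported by the loaded library.
        raise ImportError(
            f"libarchive symbol {name!r} not found: the library may be missing "
            f"or not loaded correctly; {hint}."
        ) from e
```

A call such as archive_reader = require_symbol(libarchive, "archive_read_new") would then replace the bare attribute access shown in the traceback.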

Problem with recursive extraction

Consider https://github.com/apache/tika/raw/1.28.5/tika-parsers/src/test/resources/test-documents/droste.zip : this zip contains itself as a nested zip, which leads to never-ending recursive extraction. See:
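A sketch of two common guards against this kind of recursion: a depth limit, and a set of content hashes so that an archive whose bytes were already visited is not entered again. The function below walks nested zips in memory with the standard library only; its name and parameters are hypothetical, and it has not been run against droste.zip itself.

```python
import hashlib
import io
import zipfile


def walk_nested_zips(data, max_depth=5, _depth=0, _seen=None):
    """
    Yield (depth, member name) for every member, descending into nested zips.
    Stop at max_depth and skip any archive whose exact bytes were already seen.
    """
    seen = _seen if _seen is not None else set()
    digest = hashlib.sha256(data).hexdigest()
    if digest in seen or _depth > max_depth:
        return
    seen.add(digest)
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            yield _depth, info.filename
            payload = zf.read(info)
            if zipfile.is_zipfile(io.BytesIO(payload)):
                yield from walk_nested_zips(payload, max_depth, _depth + 1, seen)
```

If the nested copy is byte-for-byte identical to its container, the hash check alone ends the recursion at the first level. The depth limit covers archives that differ slightly at each level.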

Extract VM images

We should be able to extract VMDK, VDI and similar qcow images, as well as ext2, ext3 and ext4 (and ideally some squashfs too?)

extractcode fails on Fedora 24

On Fedora 24, I get this:

$ extractcode test/requests-2.11.1.tar.gz                        
Extracting archives...
  [------------------------------------]
Traceback (most recent call last):
  File "scancode-toolkit/bin/extractcode", line 9, in <module>
    load_entry_point('scancode-toolkit', 'console_scripts', 'extractcode')()
  File "scancode-toolkit/lib/python2.7/site-packages/click/core.py", line 664, in __call__
    return self.main(*args, **kwargs)
  File "scancode-toolkit/src/scancode/utils.py", line 64, in main
    standalone_mode=standalone_mode, **extra)
  File "scancode-toolkit/lib/python2.7/site-packages/click/core.py", line 644, in main
    rv = self.invoke(ctx)
  File "scancode-toolkit/lib/python2.7/site-packages/click/core.py", line 837, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "scancode-toolkit/lib/python2.7/site-packages/click/core.py", line 464, in invoke
    return callback(*args, **kwargs)
  File "scancode-toolkit/src/scancode/extract_cli.py", line 156, in extractcode
    for xev in extraction_events:
  File "scancode-toolkit/lib/python2.7/site-packages/click/_termui_impl.py", line 240, in next
    rv = next(self.iter)
  File "scancode-toolkit/src/scancode/api.py", line 43, in extract_archives
    from extractcode.extract import extract
  File "scancode-toolkit/src/extractcode/extract.py", line 37, in <module>
    from extractcode import archive
  File "scancode-toolkit/src/extractcode/archive.py", line 47, in <module>
    from extractcode import libarchive2
  File "scancode-toolkit/src/extractcode/libarchive2.py", line 91, in <module>
    libarchive = load_lib()
  File "scancode-toolkit/src/extractcode/libarchive2.py", line 79, in load_lib
    lib = ctypes.CDLL(libarchive)
  File "/usr/lib64/python2.7/ctypes/__init__.py", line 357, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libbz2.so.1.0: cannot open shared object file: No such file or directory

It starts working correctly if I run this:

sudo ln -s /usr/lib64/libbz2.so.1 /usr/lib64/libbz2.so.1.0

Some files are wrongly recognized as archives.

A text file with this content followed by an LF is wrongly reported as zip file by the latest libmagic 5.22
80de10a8b9f13365de8cc4bbf8efec5e /etc/rsyslog.d/50-default.conf
This triggers some extraction error.

`./configure --dev` fails with: Could not find a version that satisfies the requirement typecode[full]>=30.0.0

Description

./configure --dev fails with

ERROR: Could not find a version that satisfies the requirement typecode[full]>=30.0.0; extra == "full" (from extractcode[full,patch,testing]) (from versions: 20.9, 20.9.28, 20.9.29, 20.10, 20.10.7, 20.10.12, 20.10.20, 21.1.8.1, 21.1.9.1, 21.1.21, 21.2.24, 21.5.31, 21.6.1, 30.0.0)
ERROR: No matching distribution found for typecode[full]>=30.0.0; extra == "full"

on a fresh checkout of 1c64a75.

Extractcode FileNotFoundError if using replace-originals option

If extractcode tries to extract an archive, gets some errors and afterwards tries to clean up, it fails with a FileNotFoundError exception.
For example, again with the gcc issue6550.gz archive.
Here is the output without --replace-originals:

λ extractcode --verbose D:\test-extractcode\test.zip
Extracting archives...
Extracting: test.zip
Extracting: test.zip
Extracting: zweite_ebene.zip
Extracting: issue6550.gz
ERROR extracting: D:/test-extractcode/test.zip-extract/zweite_ebene.zip-extract/issue6550.gz: Error -3 while decompressing data: too many length or distance symbols
Extracting done.

With --replace-originals:

λ extractcode --verbose --replace-originals D:\test-extractcode\test.zip
Extracting archives...
Extracting: test.zip
Extracting: test.zip
Extracting: zweite_ebene.zip
Extracting: issue6550.gz
Extracting: issue6550.gz
Traceback (most recent call last):
  File "C:\WS\tools\Python36\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\WS\tools\Python36\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\WS\tools\scancode-toolkit-21.3.31\Scripts\extractcode.exe\__main__.py", line 7, in <module>
  File "c:\ws\tools\scancode-toolkit-21.3.31\lib\site-packages\click\core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "c:\ws\tools\scancode-toolkit-21.3.31\lib\site-packages\commoncode\cliutils.py", line 87, in main
    **extra,
  File "c:\ws\tools\scancode-toolkit-21.3.31\lib\site-packages\click\core.py", line 697, in main
    rv = self.invoke(ctx)
  File "c:\ws\tools\scancode-toolkit-21.3.31\lib\site-packages\click\core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\ws\tools\scancode-toolkit-21.3.31\lib\site-packages\click\core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "c:\ws\tools\scancode-toolkit-21.3.31\lib\site-packages\click\decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "c:\ws\tools\scancode-toolkit-21.3.31\lib\site-packages\extractcode\cli.py", line 184, in extractcode
    for xev in extraction_events:
  File "c:\ws\tools\scancode-toolkit-21.3.31\lib\site-packages\click\_termui_impl.py", line 259, in next
    rv = next(self.iter)
  File "c:\ws\tools\scancode-toolkit-21.3.31\lib\site-packages\extractcode\api.py", line 42, in extract_archives
    ignore_pattern=ignore_pattern
  File "c:\ws\tools\scancode-toolkit-21.3.31\lib\site-packages\extractcode\extract.py", line 142, in extract
    fileutils.copytree(target, source)
  File "c:\ws\tools\scancode-toolkit-21.3.31\lib\site-packages\commoncode\fileutils.py", line 403, in copytree
    names = os.listdir(src)
FileNotFoundError: [WinError 3] Das System kann den angegebenen Pfad nicht finden: 'D:\\test-extractcode\\test.zip-extract\\zweite_ebene.zip-extract\\issue6550.gz-extract'

This can be fixed in the extract method of extract.py by skipping the file move when the event has errors.
I mean here, for example:

    for event in extract_events:
        yield event
        if replace_originals:
            processed_events_append(event)
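The suggested change could look roughly like the sketch below. The Event tuple and the events_to_replace function are hypothetical stand-ins written for this example; the real event type and loop in extract.py may differ.

```python
from collections import namedtuple

# Hypothetical event shape for illustration; the real event type may differ.
Event = namedtuple("Event", "source target errors")


def events_to_replace(extract_events, replace_originals=True):
    """Collect only the error-free events as candidates for replacing originals."""
    processed = []
    for event in extract_events:
        # Skip events with errors: their target directory may never have been created.
        if replace_originals and not event.errors:
            processed.append(event)
    return processed
```

With this filter, a failed archive such as issue6550.gz would be left in place instead of triggering a copy from a target directory that does not exist.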

Not usable with commoncode 21.1.14

Description

When using new commoncode release 21.1.14 calling extractcode fails.
This was not the case with commoncode release 20.10.20.

Error message from extractcode

Traceback (most recent call last):
  File "/app/venv-extractcode/bin/extractcode", line 5, in <module>
    from extractcode.cli import extractcode
  File "/app/venv-extractcode/lib/python3.6/site-packages/extractcode/__init__.py", line 39, in <module>
    from commoncode.fileutils import fsencode
ImportError: cannot import name 'fsencode'
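A defensive import along these lines would avoid the hard failure. This is only a sketch of one possible workaround, not the fix that was applied upstream, and it assumes that the standard library os.fsencode is an acceptable substitute on Python 3.

```python
try:
    # Older commoncode releases exposed this helper.
    from commoncode.fileutils import fsencode
except ImportError:
    # Assumption: on Python 3 the standard library equivalent is sufficient.
    from os import fsencode
```

Since ModuleNotFoundError is a subclass of ImportError, the fallback also applies when commoncode is not installed at all.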

How To Reproduce

# Workdir is /app
pip install virtualenv \
    && virtualenv -p /usr/local/bin/python3.6 venv-extractcode \
    && . venv-extractcode/bin/activate \
    && pip install extractcode

virtualenv -p /usr/local/bin/python3.6 venv-extractcode
. venv-extractcode/bin/activate
    
/app/venv-extractcode/bin/extractcode

Error also occurs with python 3.9.1.

System configuration

Running on Linux in python:3.6 docker container.
Release 20.10 of extractcode.
Installed with pip.

Various tests failures on Python 3.12.0rc1

Environment:

Python 3.12.0rc1
Fedora Rawhide
libmagic 5.45
libarchive 3.7.1
p7zip 16.02

Summary

FAILED tests/test_archive.py::TestGetExtractorTest::test_get_extractor_qcow2
FAILED tests/test_archive.py::TestRar::test_extract_rar_with_trailing_data - ...
FAILED tests/test_extractcode_cli.py::test_extractcode_command_can_take_an_empty_directory
FAILED tests/test_extractcode_cli.py::test_extractcode_command_does_extract_verbose
FAILED tests/test_extractcode_cli.py::test_extractcode_command_always_shows_something_if_not_using_a_tty_verbose_or_not
FAILED tests/test_extractcode_cli.py::test_extractcode_command_works_with_relative_paths
FAILED tests/test_extractcode_cli.py::test_extractcode_command_works_with_relative_paths_verbose
FAILED tests/test_extractcode_cli.py::test_usage_and_help_return_a_correct_script_name_on_all_platforms
FAILED tests/test_extractcode_cli.py::test_extractcode_command_can_extract_archive_with_unicode_names_verbose
FAILED tests/test_extractcode_cli.py::test_extractcode_command_can_extract_archive_with_unicode_names
FAILED tests/test_extractcode_cli.py::test_extractcode_command_can_extract_shallow
FAILED tests/test_extractcode_cli.py::test_extractcode_command_can_ignore - A...
FAILED tests/test_extractcode_cli.py::test_extractcode_command_does_not_crash_with_replace_originals_and_corrupted_archives
FAILED tests/test_extractcode_cli.py::test_extractcode_command_can_extract_nuget

Details:

=================================== FAILURES ===================================
________________ TestGetExtractorTest.test_get_extractor_qcow2 _________________

self = <test_archive.TestGetExtractorTest testMethod=test_get_extractor_qcow2>

    def test_get_extractor_qcow2(self):
        test_file = self.extract_test_tar('vmimage/foobar.qcow2.tar.gz')
        test_file = str(Path(test_file) / 'foobar.qcow2')
    
        expected = []
        self.check_get_extractors(test_file, expected, kinds=extractcode.default_kinds)
    
        expected = [archive.extract_vm_image]
>       self.check_get_extractors(test_file, expected, kinds=())

tests/test_archive.py:217: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <test_archive.TestGetExtractorTest testMethod=test_get_extractor_qcow2>
test_file = '/tmp/scancode-tk-tests -uejk8_7m/u6e9wzs8/foobar.qcow2.tar.gz/foobar.qcow2'
expected = [<function extract at 0x7f702e883c40>], kinds = ()

    def check_get_extractors(self, test_file, expected, kinds=()):
        from extractcode import archive
    
        test_loc = self.get_test_loc(test_file)
        if kinds:
            extractors = archive.get_extractors(test_loc, kinds)
        else:
            extractors = archive.get_extractors(test_loc)
    
        fe = fileutils.file_extension(test_loc).lower()
        em = ', '.join(e.__module__ + '.' + e.__name__ for e in extractors)
    
        msg = ('%(expected)r == %(extractors)r for %(test_file)s\n'
               'with fe:%(fe)r, em:%(em)s' % locals())
>       assert expected == extractors, msg
E       AssertionError: [<function extract at 0x7f702e883c40>] == [] for /tmp/scancode-tk-tests -uejk8_7m/u6e9wzs8/foobar.qcow2.tar.gz/foobar.qcow2
E         with fe:'.qcow2', em:
E       assert [<function ex...7f702e883c40>] == []
E         Left contains one more item: <function extract at 0x7f702e883c40>
E         Use -v to get more diff

tests/extractcode_assert_utils.py:166: AssertionError
_________________ TestRar.test_extract_rar_with_trailing_data __________________

self = <test_archive.TestRar testMethod=test_extract_rar_with_trailing_data>

    def test_extract_rar_with_trailing_data(self):
        test_file = self.get_test_loc('archive/rar/rar_trailing.rar')
        test_dir = self.get_temp_dir()
        expected = Exception('Unknown error')
>       self.assertRaisesInstance(expected, archive.extract_rar, test_file, test_dir)

tests/test_archive.py:1693: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <test_archive.TestRar testMethod=test_extract_rar_with_trailing_data>
excInstance = Exception('Unknown error')
callableObj = <function extract at 0x7f702e881620>
args = ('/builddir/build/BUILD/extractcode-31.0.0/tests/data/archive/rar/rar_trailing.rar', '/tmp/scancode-tk-tests -uejk8_7m/h4a951m9')
kwargs = {}, excClass = <class 'Exception'>, excName = 'Exception'

    def assertRaisesInstance(self, excInstance, callableObj, *args, **kwargs):
        """
        This assertion accepts an instance instead of a class for refined
        exception testing.
        """
        kwargs = kwargs or {}
        excClass = excInstance.__class__
        try:
            callableObj(*args, **kwargs)
        except excClass as e:
            assert str(e).startswith(str(excInstance))
        else:
            if hasattr(excClass, '__name__'):
                excName = excClass.__name__
            else:
                excName = str(excClass)
>           raise self.failureException('%s not raised' % excName)
E           AssertionError: Exception not raised

tests/extractcode_assert_utils.py:184: AssertionError
_____________ test_extractcode_command_can_take_an_empty_directory _____________

    def test_extractcode_command_can_take_an_empty_directory():
        test_dir = test_env.get_temp_dir()
>       result = run_extract([test_dir], expected_rc=0)

tests/test_extractcode_cli.py:64: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

options = ['/tmp/scancode-tk-tests -uejk8_7m/fna7ftnv'], expected_rc = 0
cwd = None

    def run_extract(options, expected_rc=None, cwd=None):
        """
        Run extractcode as a plain subprocess. Return rc, stdout, stderr.
        """
        bin_dir = 'Scripts' if on_windows else 'bin'
        # note: this assumes that we are using a standard directory layout as set
        # with the configure script
        cmd_loc = os.path.join(project_root, 'venv', bin_dir, 'extractcode')
>       assert os.path.exists(cmd_loc + ('.exe' if on_windows else ''))
E       AssertionError: assert False
E        +  where False = <function exists at 0x7f7030703920>(('/builddir/build/BUILD/extractcode-31.0.0/venv/bin/extractcode' + ''))
E        +    where <function exists at 0x7f7030703920> = <module 'posixpath' (frozen)>.exists
E        +      where <module 'posixpath' (frozen)> = os.path

tests/test_extractcode_cli.py:38: AssertionError
________________ test_extractcode_command_does_extract_verbose _________________

    def test_extractcode_command_does_extract_verbose():
        test_dir = test_env.get_test_loc('cli/extract', copy=True)
>       result = run_extract(['--verbose', test_dir], expected_rc=1)

tests/test_extractcode_cli.py:72: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

options = ['--verbose', '/tmp/scancode-tk-tests -uejk8_7m/bbuw3oy4/extract']
expected_rc = 1, cwd = None

    def run_extract(options, expected_rc=None, cwd=None):
        """
        Run extractcode as a plain subprocess. Return rc, stdout, stderr.
        """
        bin_dir = 'Scripts' if on_windows else 'bin'
        # note: this assumes that we are using a standard directory layout as set
        # with the configure script
        cmd_loc = os.path.join(project_root, 'venv', bin_dir, 'extractcode')
>       assert os.path.exists(cmd_loc + ('.exe' if on_windows else ''))
E       AssertionError: assert False
E        +  where False = <function exists at 0x7f7030703920>(('/builddir/build/BUILD/extractcode-31.0.0/venv/bin/extractcode' + ''))
E        +    where <function exists at 0x7f7030703920> = <module 'posixpath' (frozen)>.exists
E        +      where <module 'posixpath' (frozen)> = os.path

tests/test_extractcode_cli.py:38: AssertionError
_ test_extractcode_command_always_shows_something_if_not_using_a_tty_verbose_or_not _

    def test_extractcode_command_always_shows_something_if_not_using_a_tty_verbose_or_not():
        test_dir = test_env.get_test_loc('cli/extract/some.tar.gz', copy=True)
    
>       result = run_extract(options=['--verbose', test_dir], expected_rc=0)

tests/test_extractcode_cli.py:91: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

options = ['--verbose', '/tmp/scancode-tk-tests -uejk8_7m/36o7wnm7/some.tar.gz']
expected_rc = 0, cwd = None

    def run_extract(options, expected_rc=None, cwd=None):
        """
        Run extractcode as a plain subprocess. Return rc, stdout, stderr.
        """
        bin_dir = 'Scripts' if on_windows else 'bin'
        # note: this assumes that we are using a standard directory layout as set
        # with the configure script
        cmd_loc = os.path.join(project_root, 'venv', bin_dir, 'extractcode')
>       assert os.path.exists(cmd_loc + ('.exe' if on_windows else ''))
E       AssertionError: assert False
E        +  where False = <function exists at 0x7f7030703920>(('/builddir/build/BUILD/extractcode-31.0.0/venv/bin/extractcode' + ''))
E        +    where <function exists at 0x7f7030703920> = <module 'posixpath' (frozen)>.exists
E        +      where <module 'posixpath' (frozen)> = os.path

tests/test_extractcode_cli.py:38: AssertionError
______________ test_extractcode_command_works_with_relative_paths ______________

    def test_extractcode_command_works_with_relative_paths():
        # The setup is complex because we want to have a relative dir to the base
        # dir where we run tests from, i.e. the git checkout  dir To use relative
        # paths, we use our tmp dir at the root of the code tree
        from os.path import join
        from  commoncode import fileutils
        import extractcode
        import tempfile
        import shutil
    
        try:
            test_file = test_env.get_test_loc('cli/extract_relative_path/basic.zip')
    
            project_tmp = join(project_root, 'tmp')
            fileutils.create_dir(project_tmp)
            temp_rel = tempfile.mkdtemp(dir=project_tmp)
            assert os.path.exists(temp_rel)
    
            relative_dir = temp_rel.replace(project_root, '').strip('\\/')
            shutil.copy(test_file, temp_rel)
    
            test_src_file = join(relative_dir, 'basic.zip')
            test_tgt_dir = join(project_root, test_src_file) + extractcode.EXTRACT_SUFFIX
>           result = run_extract([test_src_file], expected_rc=0, cwd=project_root)

tests/test_extractcode_cli.py:124: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

options = ['tmp/tmpiu9t_zt7/basic.zip'], expected_rc = 0
cwd = '/builddir/build/BUILD/extractcode-31.0.0'

    def run_extract(options, expected_rc=None, cwd=None):
        """
        Run extractcode as a plain subprocess. Return rc, stdout, stderr.
        """
        bin_dir = 'Scripts' if on_windows else 'bin'
        # note: this assumes that we are using a standard directory layout as set
        # with the configure script
        cmd_loc = os.path.join(project_root, 'venv', bin_dir, 'extractcode')
>       assert os.path.exists(cmd_loc + ('.exe' if on_windows else ''))
E       AssertionError: assert False
E        +  where False = <function exists at 0x7f7030703920>(('/builddir/build/BUILD/extractcode-31.0.0/venv/bin/extractcode' + ''))
E        +    where <function exists at 0x7f7030703920> = <module 'posixpath' (frozen)>.exists
E        +      where <module 'posixpath' (frozen)> = os.path

tests/test_extractcode_cli.py:38: AssertionError
__________ test_extractcode_command_works_with_relative_paths_verbose __________

    def test_extractcode_command_works_with_relative_paths_verbose():
        # The setup is a tad complex because we want to have a relative dir
        # to the base dir where we run tests from, i.e. the git checkout dir
        # To use relative paths, we use our tmp dir at the root of the code tree
        from os.path import join
        from  commoncode import fileutils
        import tempfile
        import shutil
    
        try:
            project_tmp = join(project_root, 'tmp')
            fileutils.create_dir(project_tmp)
            test_src_dir = tempfile.mkdtemp(dir=project_tmp).replace(project_root, '').strip('\\/')
            test_file = test_env.get_test_loc('cli/extract_relative_path/basic.zip')
            shutil.copy(test_file, test_src_dir)
            test_src_file = join(test_src_dir, 'basic.zip')
    
>           result = run_extract(['--verbose', test_src_file] , expected_rc=0)

tests/test_extractcode_cli.py:158: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

options = ['--verbose', 'tmp/tmpc_z4ga7z/basic.zip'], expected_rc = 0
cwd = None

    def run_extract(options, expected_rc=None, cwd=None):
        """
        Run extractcode as a plain subprocess. Return rc, stdout, stderr.
        """
        bin_dir = 'Scripts' if on_windows else 'bin'
        # note: this assumes that we are using a standard directory layout as set
        # with the configure script
        cmd_loc = os.path.join(project_root, 'venv', bin_dir, 'extractcode')
>       assert os.path.exists(cmd_loc + ('.exe' if on_windows else ''))
E       AssertionError: assert False
E        +  where False = <function exists at 0x7f7030703920>(('/builddir/build/BUILD/extractcode-31.0.0/venv/bin/extractcode' + ''))
E        +    where <function exists at 0x7f7030703920> = <module 'posixpath' (frozen)>.exists
E        +      where <module 'posixpath' (frozen)> = os.path

tests/test_extractcode_cli.py:38: AssertionError
______ test_usage_and_help_return_a_correct_script_name_on_all_platforms _______

    def test_usage_and_help_return_a_correct_script_name_on_all_platforms():
        options = ['--help']
    
>       result = run_extract(options , expected_rc=0)

tests/test_extractcode_cli.py:177: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

options = ['--help'], expected_rc = 0, cwd = None

    def run_extract(options, expected_rc=None, cwd=None):
        """
        Run extractcode as a plain subprocess. Return rc, stdout, stderr.
        """
        bin_dir = 'Scripts' if on_windows else 'bin'
        # note: this assumes that we are using a standard directory layout as set
        # with the configure script
        cmd_loc = os.path.join(project_root, 'venv', bin_dir, 'extractcode')
>       assert os.path.exists(cmd_loc + ('.exe' if on_windows else ''))
E       AssertionError: assert False
E        +  where False = <function exists at 0x7f7030703920>(('/builddir/build/BUILD/extractcode-31.0.0/venv/bin/extractcode' + ''))
E        +    where <function exists at 0x7f7030703920> = <module 'posixpath' (frozen)>.exists
E        +      where <module 'posixpath' (frozen)> = os.path

tests/test_extractcode_cli.py:38: AssertionError
___ test_extractcode_command_can_extract_archive_with_unicode_names_verbose ____

    def test_extractcode_command_can_extract_archive_with_unicode_names_verbose():
        test_dir = test_env.get_test_loc('cli/unicodearch', copy=True)
>       result = run_extract(['--verbose', test_dir] , expected_rc=0)

tests/test_extractcode_cli.py:195: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

options = ['--verbose', '/tmp/scancode-tk-tests -uejk8_7m/2rcvcl3l/unicodearch']
expected_rc = 0, cwd = None

    def run_extract(options, expected_rc=None, cwd=None):
        """
        Run extractcode as a plain subprocess. Return rc, stdout, stderr.
        """
        bin_dir = 'Scripts' if on_windows else 'bin'
        # note: this assumes that we are using a standard directory layout as set
        # with the configure script
        cmd_loc = os.path.join(project_root, 'venv', bin_dir, 'extractcode')
>       assert os.path.exists(cmd_loc + ('.exe' if on_windows else ''))
E       AssertionError: assert False
E        +  where False = <function exists at 0x7f7030703920>(('/builddir/build/BUILD/extractcode-31.0.0/venv/bin/extractcode' + ''))
E        +    where <function exists at 0x7f7030703920> = <module 'posixpath' (frozen)>.exists
E        +      where <module 'posixpath' (frozen)> = os.path

tests/test_extractcode_cli.py:38: AssertionError
_______ test_extractcode_command_can_extract_archive_with_unicode_names ________

    def test_extractcode_command_can_extract_archive_with_unicode_names():
        test_dir = test_env.get_test_loc('cli/unicodearch', copy=True)
>       run_extract([test_dir] , expected_rc=0)

tests/test_extractcode_cli.py:213: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

options = ['/tmp/scancode-tk-tests -uejk8_7m/3a9hxexv/unicodearch']
expected_rc = 0, cwd = None

    def run_extract(options, expected_rc=None, cwd=None):
        """
        Run extractcode as a plain subprocess. Return rc, stdout, stderr.
        """
        bin_dir = 'Scripts' if on_windows else 'bin'
        # note: this assumes that we are using a standard directory layout as set
        # with the configure script
        cmd_loc = os.path.join(project_root, 'venv', bin_dir, 'extractcode')
>       assert os.path.exists(cmd_loc + ('.exe' if on_windows else ''))
E       AssertionError: assert False
E        +  where False = <function exists at 0x7f7030703920>(('/builddir/build/BUILD/extractcode-31.0.0/venv/bin/extractcode' + ''))
E        +    where <function exists at 0x7f7030703920> = <module 'posixpath' (frozen)>.exists
E        +      where <module 'posixpath' (frozen)> = os.path

tests/test_extractcode_cli.py:38: AssertionError
_________________ test_extractcode_command_can_extract_shallow _________________

    def test_extractcode_command_can_extract_shallow():
        test_dir = test_env.get_test_loc('cli/extract_shallow', copy=True)
>       run_extract(['--shallow', test_dir] , expected_rc=0)

tests/test_extractcode_cli.py:230: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

options = ['--shallow', '/tmp/scancode-tk-tests -uejk8_7m/n1bkdpux/extract_shallow']
expected_rc = 0, cwd = None

    def run_extract(options, expected_rc=None, cwd=None):
        """
        Run extractcode as a plain subprocess. Return rc, stdout, stderr.
        """
        bin_dir = 'Scripts' if on_windows else 'bin'
        # note: this assumes that we are using a standard directory layout as set
        # with the configure script
        cmd_loc = os.path.join(project_root, 'venv', bin_dir, 'extractcode')
>       assert os.path.exists(cmd_loc + ('.exe' if on_windows else ''))
E       AssertionError: assert False
E        +  where False = <function exists at 0x7f7030703920>(('/builddir/build/BUILD/extractcode-31.0.0/venv/bin/extractcode' + ''))
E        +    where <function exists at 0x7f7030703920> = <module 'posixpath' (frozen)>.exists
E        +      where <module 'posixpath' (frozen)> = os.path

tests/test_extractcode_cli.py:38: AssertionError
_____________________ test_extractcode_command_can_ignore ______________________

    def test_extractcode_command_can_ignore():
        test_dir = test_env.get_test_loc('cli/extract_ignore', copy=True)
>       run_extract(['--ignore', '*.tar', test_dir] , expected_rc=0)

tests/test_extractcode_cli.py:248: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

options = ['--ignore', '*.tar', '/tmp/scancode-tk-tests -uejk8_7m/bs_94w2l/extract_ignore']
expected_rc = 0, cwd = None

    def run_extract(options, expected_rc=None, cwd=None):
        """
        Run extractcode as a plain subprocess. Return rc, stdout, stderr.
        """
        bin_dir = 'Scripts' if on_windows else 'bin'
        # note: this assumes that we are using a standard directory layout as set
        # with the configure script
        cmd_loc = os.path.join(project_root, 'venv', bin_dir, 'extractcode')
>       assert os.path.exists(cmd_loc + ('.exe' if on_windows else ''))
E       AssertionError: assert False
E        +  where False = <function exists at 0x7f7030703920>(('/builddir/build/BUILD/extractcode-31.0.0/venv/bin/extractcode' + ''))
E        +    where <function exists at 0x7f7030703920> = <module 'posixpath' (frozen)>.exists
E        +      where <module 'posixpath' (frozen)> = os.path

tests/test_extractcode_cli.py:38: AssertionError
_ test_extractcode_command_does_not_crash_with_replace_originals_and_corrupted_archives _

    def test_extractcode_command_does_not_crash_with_replace_originals_and_corrupted_archives():
        test_dir = test_env.get_test_loc('cli/replace-originals', copy=True)
>       result = run_extract(['--replace-originals', '--verbose', test_dir] , expected_rc=1)

tests/test_extractcode_cli.py:266: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

options = ['--replace-originals', '--verbose', '/tmp/scancode-tk-tests -uejk8_7m/ldztqpv3/replace-originals']
expected_rc = 1, cwd = None

    def run_extract(options, expected_rc=None, cwd=None):
        """
        Run extractcode as a plain subprocess. Return rc, stdout, stderr.
        """
        bin_dir = 'Scripts' if on_windows else 'bin'
        # note: this assumes that we are using a standard directory layout as set
        # with the configure script
        cmd_loc = os.path.join(project_root, 'venv', bin_dir, 'extractcode')
>       assert os.path.exists(cmd_loc + ('.exe' if on_windows else ''))
E       AssertionError: assert False
E        +  where False = <function exists at 0x7f7030703920>(('/builddir/build/BUILD/extractcode-31.0.0/venv/bin/extractcode' + ''))
E        +    where <function exists at 0x7f7030703920> = <module 'posixpath' (frozen)>.exists
E        +      where <module 'posixpath' (frozen)> = os.path

tests/test_extractcode_cli.py:38: AssertionError
__________________ test_extractcode_command_can_extract_nuget __________________

    @pytest.mark.skipif(on_windows, reason='FIXME: this test fails on Windows until we have support for long file names.')
    def test_extractcode_command_can_extract_nuget():
        test_dir = test_env.get_test_loc('cli/extract_nuget', copy=True)
>       result = run_extract(['--verbose', test_dir])

tests/test_extractcode_cli.py:283: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

options = ['--verbose', '/tmp/scancode-tk-tests -uejk8_7m/zcx7geku/extract_nuget']
expected_rc = None, cwd = None

    def run_extract(options, expected_rc=None, cwd=None):
        """
        Run extractcode as a plain subprocess. Return rc, stdout, stderr.
        """
        bin_dir = 'Scripts' if on_windows else 'bin'
        # note: this assumes that we are using a standard directory layout as set
        # with the configure script
        cmd_loc = os.path.join(project_root, 'venv', bin_dir, 'extractcode')
>       assert os.path.exists(cmd_loc + ('.exe' if on_windows else ''))
E       AssertionError: assert False
E        +  where False = <function exists at 0x7f7030703920>(('/builddir/build/BUILD/extractcode-31.0.0/venv/bin/extractcode' + ''))
E        +    where <function exists at 0x7f7030703920> = <module 'posixpath' (frozen)>.exists
E        +      where <module 'posixpath' (frozen)> = os.path

tests/test_extractcode_cli.py:38: AssertionError
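
All of the failures above hit the same assertion in `run_extract`: the helper only looks for the script under `<project_root>/venv/bin`, and that directory does not exist in a build root such as `/builddir/build/BUILD/extractcode-31.0.0`. Below is a minimal sketch of a more tolerant lookup. `find_extractcode_command` is a hypothetical helper, not part of the test suite, and falling back to the `PATH` is an assumption about what a packager would want.

```python
import os
import shutil
import sys


def find_extractcode_command(project_root, name='extractcode'):
    """
    Return the path to the ``name`` script, or None if it cannot be found.
    Look in the configure-style venv first, then fall back to the PATH.
    """
    on_windows = sys.platform.startswith('win')
    bin_dir = 'Scripts' if on_windows else 'bin'
    script = name + ('.exe' if on_windows else '')

    # layout created by the configure script
    venv_cmd = os.path.join(project_root, 'venv', bin_dir, script)
    if os.path.exists(venv_cmd):
        return venv_cmd

    # fallback (assumption): a script installed system-wide or in the
    # currently active environment
    return shutil.which(name)
```

With such a helper, the tests could skip instead of failing when the function returns None.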

extractcode error: File name address is longer than 255 bytes

D:/apptest/nqi/nia-npa-workflow-sl_master_20200218180521_5860907_B72.n0063.tar.gz-extract/nia-npa-workflow-sl/nia-npa-workflow.tar.gz-extract/nia-npa-workflow-1.0-SNAPSHOT.jar: [(u'd
\scancode-toolkit-3.0.0\tmp\scancode-tk-3.0.0-aerdzt\scancode-extract-c__kic\com\codahale\metrics\InstrumentedScheduledExecutorService$InstrumentedCallable.class', u'D:\apptest\nqi\
workflow-sl_master_20200218180521_5860907_B72.n0063.tar.gz-extract\nia-npa-workflow-sl\nia-npa-workflow.tar.gz-extract\nia-npa-workflow-1.0-SNAPSHOT.jar-extract\com\codahale\metrics\dScheduledExecutorService$InstrumentedCallable.class', u"[Errno 2] No such file or directory:
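
The log above is truncated, so the exact cause is not visible. If the failure is the classic Windows path-length limit, one common workaround is the extended-length `\\?\` prefix for absolute paths. The sketch below is Python 3 and uses `ntpath` so it behaves the same on any platform. `to_extended_length_path` is a hypothetical helper and is not something extractcode is known to provide.

```python
import ntpath

# classic Windows MAX_PATH limit for a full path
MAX_PATH = 260


def to_extended_length_path(path):
    r"""
    Return the absolute Windows ``path`` with the extended-length prefix
    ``\\?\`` added when it is long enough to need one. Shorter paths are
    only normalized.
    """
    if path.startswith('\\\\?\\'):
        # already in extended-length form
        return path
    path = ntpath.normpath(path)
    if len(path) < MAX_PATH:
        return path
    if path.startswith('\\\\'):
        # UNC path: \\server\share\... becomes \\?\UNC\server\share\...
        return '\\\\?\\UNC\\' + path[2:]
    return '\\\\?\\' + path
```

Note that a single path segment longer than 255 characters is still rejected by most file systems, with or without the prefix.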

Trouble getting tests running

Hello,

I can't seem to get the tests for extractcode and typecode to run.

The issue seems to come down to this part of README.rst:

To install this package with its full capability (where the binaries for
7zip and libarchive are installed), use the full extra option::

    pip install extractcode[full]

If you want to use the version of binaries (possibly) provided by your operating
system, use the minimal option::

    pip install extractcode

In this case, you will need to provide a working and compatible libarchive and
7zip installed and configured in one of these ways such that ExtractCode can
find them:

a typecode-libarchive and typecode-7z plugin: See the standard ones at
https://github.com/nexB/scancode-plugins/tree/main/builtins
These can either bundle a libarchive library, a 7z executable or expose a
system-installed libraries.
It does so by providing plugin entry points as scancode_location_provider
for extractcode_libarchive that should point to a LocationProviderPlugin
subclass with a get_locations() method that must return a mapping with
this key:

'extractcode.libarchive.dll': the absolute path to a libarchive shared object/DLL

See for example:

https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_libarchive-linux/setup.py#L40

https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_libarchive-linux/src/extractcode_libarchive/__init__.py#L17

And in the same way, the scancode_location_provider for extractcode_7zip
should point to a LocationProviderPlugin subclass with a get_locations()
method that must return a mapping with this key:

'extractcode.sevenzip.exe': the absolute path to a 7zip executable

See for example:

https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_7z-linux/setup.py#L40

https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_7z-linux/src/extractcode_7z/__init__.py#L18

use environment variables to point to installed binaries:

EXTRACTCODE_LIBARCHIVE_PATH: the absolute path to a libarchive DLL

EXTRACTCODE_7Z_PATH: the absolute path to a 7zip executable

a system-installed libarchive and 7zip executable available in the system PATH.

So I am on a distro with libarchive-3.7.1 and p7zip-16.02 installed. Obviously I don't want to bundle with the full option.

I set up:

export EXTRACTCODE_7Z_PATH=%{_bindir}
export EXTRACTCODE_LIBARCHIVE_PATH_ENVVAR=%{_libdir}
%pytest
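
For comparison, the README excerpt above names the variables `EXTRACTCODE_LIBARCHIVE_PATH` and `EXTRACTCODE_7Z_PATH`, and describes each as the absolute path to a file (a shared library and an executable). The snippet above exports a differently named variable and directory macros. Below is a minimal sketch of a pre-flight check along those lines. `check_binary_env` is a hypothetical helper, not part of extractcode.

```python
import os

# variable names as given in the README excerpt
EXPECTED_VARS = ('EXTRACTCODE_LIBARCHIVE_PATH', 'EXTRACTCODE_7Z_PATH')


def check_binary_env(environ=os.environ):
    """
    Return a list of human-readable problems found with the two
    environment variables. An empty list means both look usable.
    """
    problems = []
    for var in EXPECTED_VARS:
        value = environ.get(var)
        if not value:
            problems.append(f'{var} is not set')
        elif not os.path.isabs(value):
            problems.append(f'{var} is not an absolute path: {value}')
        elif not os.path.isfile(value):
            problems.append(f'{var} does not point to a file: {value}')
    return problems
```

Running this before `%pytest` would flag both a misspelled variable name and a value that is a directory. The actual file names to point at depend on the distro.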

It seems that libarchive is detected:

=============================== warnings summary ===============================
../../../../usr/lib/python3.12/site-packages/typecode/magic2.py:195
  /usr/lib/python3.12/site-packages/typecode/magic2.py:195: UserWarning: System libmagic found in typical location is used. Install instead a typecode-libmagic plugin for best support.
    warnings.warn(

src/extractcode/libarchive2.py:107: 12 warnings
  /builddir/build/BUILD/extractcode-31.0.0/src/extractcode/libarchive2.py:107: UserWarning: Using "libarchive" library found in a system location. Install instead a extractcode-libarchive plugin for best support.
    warnings.warn(

(same with libmagic)

However nothing works:

==================================== ERRORS ====================================
_________________ ERROR collecting src/extractcode/archive.py __________________
src/extractcode/archive.py:29: in <module>
    from extractcode import libarchive2
<frozen importlib._bootstrap>:1266: in _find_and_load
    ???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:841: in _load_unlocked
    ???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
    exec(co, module.__dict__)
src/extractcode/libarchive2.py:635: in <module>
    archive_reader = libarchive.archive_read_new
/usr/lib64/python3.12/ctypes/__init__.py:392: in __getattr__
    func = self.__getitem__(name)
/usr/lib64/python3.12/ctypes/__init__.py:397: in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
E   AttributeError: /usr/bin/python3: undefined symbol: archive_read_new
_________________ ERROR collecting src/extractcode/extract.py __________________
src/extractcode/extract.py:23: in <module>
    import extractcode.archive
<frozen importlib._bootstrap>:1266: in _find_and_load
    ???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:841: in _load_unlocked
    ???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
    exec(co, module.__dict__)
src/extractcode/archive.py:29: in <module>
    from extractcode import libarchive2
<frozen importlib._bootstrap>:1266: in _find_and_load
    ???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:841: in _load_unlocked
    ???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
    exec(co, module.__dict__)
src/extractcode/libarchive2.py:635: in <module>
    archive_reader = libarchive.archive_read_new
/usr/lib64/python3.12/ctypes/__init__.py:392: in __getattr__
    func = self.__getitem__(name)
/usr/lib64/python3.12/ctypes/__init__.py:397: in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
E   AttributeError: /usr/bin/python3: undefined symbol: archive_read_new
_______________ ERROR collecting src/extractcode/libarchive2.py ________________
src/extractcode/libarchive2.py:635: in <module>
    archive_reader = libarchive.archive_read_new
/usr/lib64/python3.12/ctypes/__init__.py:392: in __getattr__
    func = self.__getitem__(name)
/usr/lib64/python3.12/ctypes/__init__.py:397: in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
E   AttributeError: /usr/bin/python3: undefined symbol: archive_read_new
____________________ ERROR collecting tests/test_archive.py ____________________
tests/test_archive.py:29: in <module>
    from extractcode import archive
<frozen importlib._bootstrap>:1266: in _find_and_load
    ???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:841: in _load_unlocked
    ???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
    exec(co, module.__dict__)
src/extractcode/archive.py:29: in <module>
    from extractcode import libarchive2
<frozen importlib._bootstrap>:1266: in _find_and_load
    ???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:841: in _load_unlocked
    ???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
    exec(co, module.__dict__)
src/extractcode/libarchive2.py:635: in <module>
    archive_reader = libarchive.archive_read_new
/usr/lib64/python3.12/ctypes/__init__.py:392: in __getattr__
    func = self.__getitem__(name)
/usr/lib64/python3.12/ctypes/__init__.py:397: in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
E   AttributeError: /usr/bin/python3: undefined symbol: archive_read_new
____________________ ERROR collecting tests/test_extract.py ____________________
tests/test_extract.py:23: in <module>
    from extractcode import extract
<frozen importlib._bootstrap>:1266: in _find_and_load
    ???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:841: in _load_unlocked
    ???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
    exec(co, module.__dict__)
src/extractcode/extract.py:23: in <module>
    import extractcode.archive
<frozen importlib._bootstrap>:1266: in _find_and_load
    ???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:841: in _load_unlocked
    ???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
    exec(co, module.__dict__)
src/extractcode/archive.py:29: in <module>
    from extractcode import libarchive2
<frozen importlib._bootstrap>:1266: in _find_and_load
    ???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:841: in _load_unlocked
    ???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
    exec(co, module.__dict__)
src/extractcode/libarchive2.py:635: in <module>
    archive_reader = libarchive.archive_read_new
/usr/lib64/python3.12/ctypes/__init__.py:392: in __getattr__
    func = self.__getitem__(name)
/usr/lib64/python3.12/ctypes/__init__.py:397: in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
E   AttributeError: /usr/bin/python3: undefined symbol: archive_read_new
________________ ERROR collecting tests/test_extractcode_api.py ________________
tests/test_extractcode_api.py:16: in <module>
    from extractcode import extract
<frozen importlib._bootstrap>:1266: in _find_and_load
    ???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:841: in _load_unlocked
    ???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
    exec(co, module.__dict__)
src/extractcode/extract.py:23: in <module>
    import extractcode.archive
<frozen importlib._bootstrap>:1266: in _find_and_load
    ???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:841: in _load_unlocked
    ???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
    exec(co, module.__dict__)
src/extractcode/archive.py:29: in <module>
    from extractcode import libarchive2
<frozen importlib._bootstrap>:1266: in _find_and_load
    ???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:841: in _load_unlocked
    ???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
    exec(co, module.__dict__)
src/extractcode/libarchive2.py:635: in <module>
    archive_reader = libarchive.archive_read_new
/usr/lib64/python3.12/ctypes/__init__.py:392: in __getattr__
    func = self.__getitem__(name)
/usr/lib64/python3.12/ctypes/__init__.py:397: in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
E   AttributeError: /usr/bin/python3: undefined symbol: archive_read_new

archive_read_new is not found, so I believe libarchive is not actually being located.
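
One hedged reading of the message: the `AttributeError` names `/usr/bin/python3` rather than a libarchive shared object. That is what `ctypes` reports when the handle belongs to the interpreter itself, as happens with `ctypes.CDLL(None)` on Linux, and it would suggest that no library path reached the loader. Below is a small sketch to check a candidate library independently of extractcode. `has_symbol` and `find_libarchive` are hypothetical helpers, and `find_library('archive')` relies on the system linker configuration.

```python
import ctypes
import ctypes.util


def has_symbol(lib, name):
    """Return True if the loaded library ``lib`` exports ``name``."""
    # ctypes raises AttributeError for a missing symbol, which hasattr
    # turns into False
    return hasattr(lib, name)


def find_libarchive():
    """
    Return a loaded libarchive exposing archive_read_new, or None if no
    such library can be located and loaded.
    """
    libname = ctypes.util.find_library('archive')
    if not libname:
        return None
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        return None
    return lib if has_symbol(lib, 'archive_read_new') else None
```

If `find_libarchive()` returns a library here while extractcode still fails, the problem is more likely in how the path is passed to extractcode than in the installed libarchive.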

So I may not have understood the README.rst correctly. Are all the parts following "In this case, you will need to provide a working and compatible libarchive and 7zip installed and configured in one of these ways such that ExtractCode can find them:" mandatory, or do I have to choose one of the options?

Are the plugins mandatory?

Then, if I distribute extractcode, does the user have to set EXTRACTCODE_7Z_PATH and EXTRACTCODE_LIBARCHIVE_PATH each time? Is there a way to avoid that without bundling?

Also extract links from archives

extractcode does not extract links, by design. This is proving limiting in some cases, in particular when extracting package archives that contain links, where the missing links can lead to partial conclusions about the origin or license of the package.
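
As an illustration of a middle ground, links could at least be reported without being materialised on disk. The sketch below uses the standard-library `tarfile` module rather than extractcode's own extraction code, so it only shows the idea. `list_links` is a hypothetical name.

```python
import tarfile


def list_links(fileobj):
    """
    Return (name, kind, target) tuples for every symbolic or hard link
    in the tar archive read from ``fileobj``. Nothing is extracted.
    """
    links = []
    with tarfile.open(fileobj=fileobj, mode='r:*') as tar:
        for member in tar:
            if member.issym():
                links.append((member.name, 'symlink', member.linkname))
            elif member.islnk():
                links.append((member.name, 'hardlink', member.linkname))
    return links
```

A scanner could then record, for example, that a `COPYING` entry is a link to `LICENSE`, even if the link itself is never created.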

Error while extracting patch file

❯ extractcode --all-formats libmediainfo-0.7.43.diff
Extracting archives...
[####################] 4
ERROR extracting: ./libmediainfo-0.7.43.diff: sequence item 0: expected str instance, bytes found
Extracting done.
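
The message matches what Python 3 raises when `str.join` is handed `bytes` items. That would be consistent with patch lines being read in binary mode and then joined as text, but this is an inference from the message, not from the extractcode source. Below is a minimal sketch of the failure and one way to normalise the lines first. `join_lines` is a hypothetical helper.

```python
def join_lines(lines, encoding='utf-8'):
    """
    Join ``lines`` into one str, decoding any bytes items first.
    Undecodable bytes are replaced rather than raising.
    """
    return ''.join(
        line.decode(encoding, errors='replace') if isinstance(line, bytes) else line
        for line in lines
    )


# the failure mode from the report: joining bytes items with a str separator
try:
    ''.join([b'--- a/file\n', b'+++ b/file\n'])
except TypeError as e:
    message = str(e)
```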

`Handler` can probably benefit from being an abstract class constructed by a metaclass, not a `namedtuple`

That way you can specify the interface explicitly and perhaps reuse some code.

A metaclass can handle automatic registration into the registry.

It may also make sense to implement this way the extractors that can be written purely in Python or with Python bindings only.

This would avoid a possible security issue with calling a subprocess and may also allow progress reporting (e.g. https://github.com/prebuilder/fetchers.py/blob/master/fetchers/unpackers/archives/tar.py#L24).
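
A rough sketch of what this could look like. All names here (`HandlerMeta`, `ZipHandler`, the `name` and `extensions` attributes, the `extract` signature) are illustrative assumptions rather than extractcode's actual interface, and the zip extraction body is only a placeholder.

```python
import zipfile
from abc import ABCMeta, abstractmethod


class HandlerMeta(ABCMeta):
    """Metaclass that records every concrete Handler subclass."""

    registry = []

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        # __abstractmethods__ is empty only for fully implemented classes,
        # so the abstract base itself is never registered
        if not cls.__abstractmethods__:
            HandlerMeta.registry.append(cls)


class Handler(metaclass=HandlerMeta):
    """Interface that every archive handler implements."""

    name = None
    extensions = ()

    @abstractmethod
    def extract(self, location, target_dir):
        """Extract the archive at ``location`` into ``target_dir``."""


class ZipHandler(Handler):
    name = 'Zip'
    extensions = ('.zip',)

    def extract(self, location, target_dir):
        # placeholder: pure-Python extraction, no subprocess involved
        with zipfile.ZipFile(location) as archive:
            archive.extractall(target_dir)
```

Defining `ZipHandler` is enough to register it; `HandlerMeta.registry` can then be iterated to pick a handler by extension.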

Drop support for extracting patches

Extracting a patch as if it were an archive is seldom used and rarely useful.
We should drop this feature.

extractcode errors out on

Hi,
I'm running into a problem with certain .lz4 and also .jar files. Example (lz4):

$:~/SCAN_IMAGES/release-1.13.zip-extract$ ~/scancode-toolkit/extractcode ./release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4
Extracting archives...
[####################] 4
ERROR extracting: /home/joe/SCAN_IMAGES/release-1.13.zip-extract/release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4: Unrecognized archive format
Extracting done.

But the file has substance and can be decompressed using the lz4 utility:

$:~/SCAN_IMAGES/release-1.13.zip-extract$ ls -al ./release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4
-rw-r--r-- 1 joe users 17315708 Apr 29  2023 ./release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4

$:~/SCAN_IMAGES/release-1.13.zip-extract$ lz4 -t ./release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4
./release/deploy_art : decoded 45545571 bytes
$:~/SCAN_IMAGES/release-1.13.zip-extract$ lz4 --list ./release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4
    Frames           Type Block  Compressed  Uncompressed     Ratio   Filename
         1       LZ4Frame   B4D      16.51M             -         -   deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4
$:~/SCAN_IMAGES/release-1.13.zip-extract$ lz4 -dv ./release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4
*** LZ4 command line interface 64-bits v1.9.3, by Yann Collet ***
Decoding file ./release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages
./release/deploy_art : decoded 45545571 bytes

Following is what the file header looks like:

$:~/SCAN_IMAGES/release-1.13.zip-extract$ hexdump ./release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4 | head
0000000 2204 184d 4040 cdc0 0078 f200 5003 6361
0000010 616b 6567 203a 6130 0a64 6f53 7275 0c63
0000020 f600 2008 3028 302e 322e 2e33 2d31 2935
0000030 560a 7265 6973 6e6f 203a 0015 7cf5 622b
0000040 0a31 6e49 7473 6c61 656c 2d64 6953 657a
0000050 203a 3032 3632 0a38 614d 6e69 6174 6e69
0000060 7265 203a 6544 6962 6e61 4720 6d61 7365
0000070 5420 6165 206d 703c 676b 672d 6d61 7365
0000080 642d 7665 6c65 6c40 7369 7374 612e 696c
0000090 746f 2e68 6564 6962 6e61 6f2e 6772 0a3e

The magic bytes are correct; please refer to https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md
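
For reference, the magic check can be reproduced in a few lines of Python. This is a minimal sketch, not extractcode code, and the function name `has_lz4_frame_magic` is made up for illustration. The LZ4 frame magic number is 0x184D2204 stored little-endian, i.e. the bytes `04 22 4d 18` on disk. Since `hexdump` prints 16-bit words in host byte order by default, these bytes show up as `2204 184d` on a little-endian machine, which matches the dump above.

```python
import struct

# LZ4 frame magic number 0x184D2204, stored little-endian on disk.
LZ4_FRAME_MAGIC = struct.pack("<I", 0x184D2204)


def has_lz4_frame_magic(location):
    """Return True if the file at `location` starts with the LZ4 frame magic."""
    with open(location, "rb") as f:
        return f.read(4) == LZ4_FRAME_MAGIC
```

A check like this only confirms the frame header; it says nothing about why a given extractor backend rejects the file.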

Why can lz4 decode it properly but extractcode cannot?

Regards,
Matthias

Wrong link in README.rst

The link contains a typo and points to a nonexistent page:

- homepage_url: https://github.com/nexB/extractode

It is missing a "c" and should be:

- homepage_url: https://github.com/nexB/extractcode
