stefan6419846 / license_tools

Collection of tools for working with Open Source licenses.

Home Page: https://license-tools.readthedocs.io/
License: Apache License 2.0
At the moment, the RPM file modes are wrong due to an upstream bug: https://github.com/srossross/rpmfile/issues/48. Once this is fixed, we should be able to provide a meaningful mapping for the file modes based upon the stat module as well.
Nested archives are not handled at the moment. This is required for .src.rpm files, for example, which usually ship an additional archive containing the actual source code.
get_package_versions currently prints all cases where a package is skipped due to its repository URL. We should use logging instead to allow further customization with log levels and redirection.
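A minimal sketch of what the logging-based variant could look like; the logger name and the report_skipped helper are assumptions for illustration, not the project's actual API:

```python
import logging

logger = logging.getLogger("license_tools")


def report_skipped(package):
    # Instead of `print('Skipping', package)`, emit a log record that callers
    # can filter by level or redirect to their own handlers.
    logger.info("Skipping %s due to its repository URL", package)
```

Library consumers could then silence these messages with `logging.getLogger("license_tools").setLevel(logging.WARNING)` or route them to a file handler.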
We already analyze the metadata of font and RPM files and could extend this to Python packages as well. Some basic implementations are available from https://github.com/stefan6419846/pip-licenses-lib.
Some things which could be considered:
At the moment, run_on_directory uses TemporaryDirectoryWithFixedName to handle nested archives by unpacking them and deleting them after the analysis. This does not allow using this method to permanently store the extracted files for further analysis.
There are two possible approaches to fix this:
- Extend run_on_directory to receive a dedicated parameter which avoids deleting the unpacked archive directory after the analysis.

OTF files are currently being skipped, although they are already supported.
At the moment, run_on_downloaded_archive_file and run_on_downloaded_package_file download the files directly. As this functionality could be useful for other cases as well, the corresponding download code should be moved to the download utils (for generic archive downloads) and the pip tools (for pip-specific package downloads without URL).
At the moment, LDD analysis is restricted to shared objects. It should be extended to all types of Linux binaries.
The get_files_from_directory functionality is generic enough to be moved to the path utils instead to allow further usage if required.
At the moment, our tests are only running on Ubuntu 22.04. This does not allow for the full suite of RPM-based tooling/testing, as required for #25. Therefore, a corresponding RPM-based distribution, probably OpenSUSE Leap due to already using it for development, should be added to GitHub Actions.
Docs: https://docs.github.com/en/actions/using-jobs/running-jobs-in-a-container, containers are at https://hub.docker.com/r/opensuse/leap/tags
It seems like we need at least OpenSUSE Leap 15.2 for https://github.blog/changelog/2024-03-07-github-actions-all-actions-will-run-on-node20-instead-of-node16-by-default/, but as the currently supported minimum version is 15.5 anyway, this should be no big deal.
Prepare for pip-licenses-lib>=0.3.0 (not yet released), which uses dataclasses as containers instead of dictionaries. This might require some custom support for license_files, as dataclasses.asdict only covers fields, but not custom properties. After the initial preparations, pip-licenses-lib==0.3.0 can be released safely.
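The asdict limitation can be illustrated with a small self-contained example; the PackageInfo class and its license_files property are made up for illustration and do not reflect the actual pip-licenses-lib API:

```python
import dataclasses


@dataclasses.dataclass
class PackageInfo:
    # Hypothetical container mimicking the planned dataclass-based API.
    name: str
    version: str

    @property
    def license_files(self):
        # Computed property: not a field, so asdict() will not include it.
        return [f"{self.name}-LICENSE"]


info = PackageInfo(name="example", version="1.0")
print(dataclasses.asdict(info))  # license_files is missing from the result
```

Any code that serializes these containers therefore needs to handle such properties explicitly.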
Expose a simple API to download Rust/Cargo crates for a Cargo.lock file for further analysis. Maybe add a way to retrieve the repository URL and the license of the respective version as well (see the package section of the Cargo.toml file or the corresponding API).
Basic draft (not yet respecting the API limits/requirements from https://crates.io/data-access):
import hashlib
import sys
from pathlib import Path

import requests
import tomli


def get_package_versions(lock_file):
    # Yield (name, version, checksum) for each crates.io package in the lock file.
    with open(lock_file, mode='rb') as fd:
        data = tomli.load(fd)
    for package in data['package']:
        if package.get('source') != 'registry+https://github.com/rust-lang/crates.io-index':
            print('Skipping', package)
            continue
        yield package['name'], package['version'], package['checksum']


def download_from_lock_file(lock_file, target_directory):
    session = requests.Session()
    target_directory = Path(target_directory)
    for name, version, checksum in get_package_versions(lock_file):
        url = f'https://crates.io/api/v1/crates/{name}/{version}/download'
        response = session.get(url)
        if response.status_code != 200:
            print(response)
            continue
        target_directory.joinpath(f'{name}_{version}.crate').write_bytes(response.content)
        # Verify the download against the checksum recorded in the lock file.
        digest = hashlib.sha256(response.content).hexdigest()
        assert checksum == digest, url


def main():
    lock_file = sys.argv[1]
    target_directory = sys.argv[2]
    download_from_lock_file(lock_file, target_directory)


if __name__ == '__main__':
    main()
When the linking analysis reports the shared objects, there should be an option to resolve these to the owning OS package if they are not package-local. The output should be the package and the installed version, maybe even a source download link if it can be determined.
Possible implementation: https://unix.stackexchange.com/questions/158041/
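A minimal sketch along the lines of the linked answer, using the RPM database; the function name is an assumption, and a dpkg-based system would use `dpkg -S` instead:

```python
import subprocess


def resolve_owning_package(shared_object_path):
    # Hypothetical sketch: ask the RPM database which package owns the file.
    try:
        result = subprocess.run(
            ["rpm", "-qf", shared_object_path],
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        # The `rpm` binary itself is not available on this system.
        return None
    if result.returncode != 0:
        # The file is not owned by any installed package.
        return None
    # `rpm -qf` prints the owning package name including its version.
    return result.stdout.strip()
```

Deriving a source download link from the package name would require an additional, distribution-specific lookup.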
check_shared_objects currently uses sys.stderr to report shared objects using symlinks. We should use logging instead to allow further customization with log levels and redirection.
At the moment, cleaning up unpacked archives that are part of a directory does not work correctly when they have not been downloaded on the fly. The reason is that the corresponding logic mostly has temporary downloads in mind (unpacking the archive into a subdirectory of the directory the archive is part of), which tends to be my primary use case:
license_tools/license_tools/retrieval.py
Line 259 in 95b0a10
Example:
from license_tools import retrieval

file_results = retrieval.run(
    directory="/home/user/directory_with_archive",
    retrieve_copyrights=True,
    retrieve_emails=True,
    retrieve_file_info=False,
    retrieve_urls=True,
    retrieve_ldd_data=True,
    retrieve_font_data=True,
)
This will unpack file.jar and generate a corresponding /home/user/directory_with_archive/file_jar directory, which should be removed on exit to not modify the file system state from before the run.
This has some side effects as well when running analysis on the directory multiple times:
The application might emit info messages. For this reason, allow the user to configure the desired log level with a CLI parameter.
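A minimal sketch of such a CLI parameter; the flag name --log-level and the helper names are assumptions for illustration:

```python
import argparse
import logging


def build_parser():
    parser = argparse.ArgumentParser()
    # Hypothetical flag; the choices map directly to the standard logging levels.
    parser.add_argument(
        "--log-level",
        default="WARNING",
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
    )
    return parser


def configure_logging(arguments):
    # Translate the string value into the numeric logging level.
    logging.basicConfig(level=getattr(logging, arguments.log_level))
```

With a WARNING default, the info messages stay hidden unless the user opts in via `--log-level INFO`.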
Images might have additional metadata which should be retrieved if a specific flag is active.
Possible metadata formats to evaluate/consider:
At the moment, run_on_file will print the Cargo.toml metadata in all cases. This complicates library usage, thus there should be a corresponding flag for it instead.
Currently, we just forward the output of ldd for ELF binaries from a subprocess and print it to stdout directly. This probably can be improved.
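One improvement could be to parse the ldd output into structured data instead of forwarding it verbatim; a sketch, assuming the common `soname => path (address)` line format:

```python
import re

# Matches lines like "\tlibc.so.6 => /lib64/libc.so.6 (0x00007f2a1c000000)".
_LDD_LINE = re.compile(r"^\s*(\S+)\s*=>\s*(\S+)\s+\(0x[0-9a-f]+\)")


def parse_ldd_output(text):
    # Turn the raw `ldd` text into a {soname: resolved path} mapping.
    # Lines without "=>" (e.g. the vDSO) are skipped.
    dependencies = {}
    for line in text.splitlines():
        match = _LDD_LINE.match(line)
        if match:
            dependencies[match.group(1)] = match.group(2)
    return dependencies
```

Structured results would also make it easier to decide later which dependencies are package-local and which come from the OS.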
Due to already using the ScanCode toolkit, I evaluated the elf-inspector
tool, but this currently lacks important features as it does not provide the full dependency paths (to decide whether they originate from the OS or whether they are shipped with the package itself) and somehow omits the libc dependency in some cases: nexB/elf-inspector#4.
The latest release, scancode_toolkit==32.1.0, introduced some incompatible changes which require corresponding adjustments in our code as well.
The direct failures are all related to the same source code line:
======================================================================
ERROR: test_rpm (test_retrieval.RunOnPackageArchiveFileTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/license_tools/license_tools/tests/test_retrieval.py", line 441, in test_rpm
    self._check_call(
  File "/home/runner/work/license_tools/license_tools/tests/test_retrieval.py", line 414, in _check_call
    result = list(
  File "/home/runner/work/license_tools/license_tools/license_tools/retrieval.py", line 292, in run_on_package_archive_file
    archive_results = _run_on_archive_file(path=archive_path, short_path=archive_path.name, default_to_none=True)
  File "/home/runner/work/license_tools/license_tools/license_tools/retrieval.py", line 131, in _run_on_archive_file
    rpm_results = PackageResults.from_rpm(path)
  File "/home/runner/work/license_tools/license_tools/license_tools/tools/scancode_tools.py", line 335, in from_rpm
    return cls(**data)
  File "<string>", line 40, in __init__
  File "/home/runner/work/license_tools/license_tools/license_tools/tools/scancode_tools.py", line 318, in __post_init__
    self.license_detections = [
  File "/home/runner/work/license_tools/license_tools/license_tools/tools/scancode_tools.py", line 319, in <listcomp>
    LicenseDetection(**x) if not isinstance(x, LicenseDetection) else x for x in self.license_detections  # type: ignore[arg-type]
TypeError: __init__() got an unexpected keyword argument 'license_expression_spdx'
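One possible defensive approach, sketched here as an assumption rather than the fix the project will necessarily adopt, is to drop keys the local dataclass definition does not declare before instantiating it:

```python
import dataclasses


def from_dict_ignoring_unknown(cls, data):
    # Drop keys the current dataclass definition does not declare,
    # e.g. new fields introduced by a ScanCode upgrade such as
    # `license_expression_spdx`.
    known_fields = {field.name for field in dataclasses.fields(cls)}
    return cls(**{key: value for key, value in data.items() if key in known_fields})


@dataclasses.dataclass
class LicenseDetection:
    # Simplified stand-in for the real class in scancode_tools.
    license_expression: str


detection = from_dict_ignoring_unknown(
    LicenseDetection,
    {"license_expression": "apache-2.0", "license_expression_spdx": "Apache-2.0"},
)
```

The alternative is to add the new fields to our own dataclasses so the additional information is preserved instead of being silently discarded.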
At the moment, extracting (source) RPM files requires manual attention. This should probably be automated.
For the best experience, this should probably be implemented after #6.
Add a switch for the --package option to download either a binary wheel or the source distribution. This is especially useful for native code to have a look at the source code copyrights as well.
At the moment, the FileResults class uses the NOT_REQUESTED object to indicate information that has not been requested. This seems to cause issues when combined with _get_dummy_file_results, as the NOT_REQUESTED object seems to have different values.
To make this more consistent with the remaining optional fields as well, rewrite this to use `| None = None` instead of `| object = NOT_REQUESTED`.
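A minimal sketch of the proposed style; the Copyrights and FileResults classes below are simplified stand-ins for the real containers:

```python
from __future__ import annotations

import dataclasses


@dataclasses.dataclass
class Copyrights:
    # Simplified stand-in for the real result container.
    copyrights: list


@dataclasses.dataclass
class FileResults:
    short_path: str
    # Proposed style: a plain Optional field instead of a shared sentinel,
    # i.e. `Copyrights | None = None` rather than `Copyrights | object = NOT_REQUESTED`.
    copyrights: Copyrights | None = None


result = FileResults(short_path="archive.jar")
if result.copyrights is not None:  # unambiguous check, works across instances
    print(result.copyrights.copyrights)
```

An `is not None` check cannot break on identity mismatches the way comparing against a sentinel object can.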
Example reproducing this issue:
from license_tools import retrieval

file_results = retrieval.run(
    directory="directory_containing_archive_file",
    retrieve_copyrights=True,
    retrieve_emails=True,
    retrieve_file_info=False,
    retrieve_urls=True,
    retrieve_ldd_data=True,
    retrieve_font_data=True,
)

for file_result in file_results:
    print(file_result.short_path)
    if file_result.copyrights != retrieval.NOT_REQUESTED and (file_result.copyrights.copyrights or file_result.copyrights.holders or file_result.copyrights.authors):
        print(file_result.copyrights)
    if file_result.emails != retrieval.NOT_REQUESTED and file_result.emails.emails:
        print(file_result.emails.emails)
    if file_result.urls != retrieval.NOT_REQUESTED and file_result.urls.urls:
        print(file_result.urls.urls)
The conditions will fail for the archive file:
AttributeError: 'object' object has no attribute 'copyrights'