license_tools's People

Contributors

dependabot[bot], stefan6419846

Forkers

pombredanne

license_tools's Issues

Improve RPM file modes

At the moment, the RPM file modes are wrong due to an upstream bug: https://github.com/srossross/rpmfile/issues/48. Once that is fixed, we should be able to provide a meaningful mapping for the file modes based upon the stat module as well.
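A minimal sketch of what such a mapping could look like, assuming the raw integer mode from the RPM header is available; the helper name is illustrative:

```python
import stat


def describe_mode(raw_mode: int) -> str:
    # stat.filemode renders a raw mode integer in `ls -l` style,
    # e.g. 0o100644 becomes '-rw-r--r--'.
    return stat.filemode(raw_mode)


print(describe_mode(0o100644))  # -rw-r--r--
print(describe_mode(0o40755))   # drwxr-xr-x
```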

Extract nested archives

Nested archives are not handled at the moment. Handling them is required for .src.rpm files, for example, which usually ship an additional archive containing the actual source code.
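An illustrative sketch of recursive extraction, using only the standard library; the real license_tools integration would also need to cover formats (such as .rpm itself) that shutil cannot unpack, and the suffix list and naming scheme are assumptions:

```python
import shutil
from pathlib import Path

# Only a small set of suffixes shutil can handle out of the box.
NESTED_SUFFIXES = {'.zip', '.tar', '.tgz'}


def extract_recursively(archive: Path, target: Path) -> None:
    # Unpack the archive, then scan the result for further archives
    # and unpack each of them next to itself.
    shutil.unpack_archive(str(archive), str(target))
    for path in sorted(target.rglob('*')):
        if path.is_file() and path.suffix in NESTED_SUFFIXES:
            extract_recursively(path, path.with_name(path.name + '_unpacked'))
```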

Add metadata handling for pip packages

We already analyze the metadata of font and RPM files and could extend this to Python packages as well. Some basic implementations are available from https://github.com/stefan6419846/pip-licenses-lib

Some things which could be considered:

  • Retrieve name, version, author/maintainer, homepage, licenses, license file, dependencies
  • Basic verification tasks
    • At least author or maintainer is set
    • Homepage is set and a valid URL
    • At least one license is declared, and it is not a full license text
    • At least one license file is available
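A sketch of such checks using only the standard library's importlib.metadata (pip-licenses-lib would provide richer accessors); the function name and the exact set of checks are assumptions:

```python
from importlib.metadata import distribution
from urllib.parse import urlparse


def basic_checks(package_name: str) -> dict:
    # Core metadata fields of an installed distribution are exposed as
    # an email.message.Message-like mapping.
    meta = distribution(package_name).metadata
    homepage = meta.get('Home-page') or ''
    return {
        'name': meta.get('Name'),
        'version': meta.get('Version'),
        'has_author_or_maintainer': bool(meta.get('Author') or meta.get('Maintainer')),
        'homepage_is_url': urlparse(homepage).scheme in {'http', 'https'},
        'declares_license': bool(meta.get('License') or meta.get_all('Classifier')),
    }
```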

Reconsider/improve run_on_directory and TemporaryDirectoryWithFixedName

At the moment, run_on_directory uses TemporaryDirectoryWithFixedName to handle nested archives by unpacking them and deleting them after the analysis. This does not allow using this method to permanently store the extracted files for further analysis.

There are two possible approaches to fix this:

  • Allow run_on_directory to receive a dedicated parameter which avoids deleting the unpacked archive directory after the analysis.
  • Replace the current handler with a dedicated method which extracts all archives recursively and starts one big analysis afterwards, instead of scheduling separate analysis jobs for each extracted sub-archive.
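The first approach could be sketched as a context manager with an opt-out delete flag; the names here are illustrative, not the actual license_tools API:

```python
import shutil
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def fixed_name_directory(path: Path, delete: bool = True):
    # Create the directory with its fixed name up front; fail loudly on
    # naming conflicts instead of silently reusing an existing directory.
    path.mkdir(parents=True, exist_ok=False)
    try:
        yield path
    finally:
        if delete:
            shutil.rmtree(path, ignore_errors=True)
```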

Move network functionality from retrieval to download utils and pip tools

At the moment, run_on_downloaded_archive_file and run_on_downloaded_package_file download the files directly. As this functionality could be useful for other cases as well, the corresponding download code should be moved to the download utils (for generic archive downloads) and the pip tools (for pip-specific package downloads without URL).

LDD analysis for binary files

At the moment, LDD analysis is restricted to shared objects. It should be extended to all types of Linux binaries.
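One possible way to select candidate files, instead of matching on the .so suffix, is to check for the ELF magic number; this is a sketch, not the detection the library actually uses:

```python
from pathlib import Path

# Every ELF binary (executable or shared object) starts with these
# four magic bytes.
ELF_MAGIC = b'\x7fELF'


def is_elf(path: Path) -> bool:
    with open(path, 'rb') as fd:
        return fd.read(4) == ELF_MAGIC
```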

Test on RPM-based distribution

At the moment, our tests are only running on Ubuntu 22.04. This does not allow for the full suite of RPM-based tooling/testing, as required for #25. Therefore, a corresponding RPM-based distribution, probably openSUSE Leap as it is already used for development, should be added to GitHub Actions.

Docs: https://docs.github.com/en/actions/using-jobs/running-jobs-in-a-container, containers are at https://hub.docker.com/r/opensuse/leap/tags

It seems like we need at least openSUSE Leap 15.2 due to https://github.blog/changelog/2024-03-07-github-actions-all-actions-will-run-on-node20-instead-of-node16-by-default/, but as the currently supported minimum version is 15.5 anyway, this should be no big deal.

Prepare for pip-licenses-lib 0.3.0

Prepare for pip-licenses-lib>=0.3.0 (not yet released) which uses dataclasses as containers instead of dictionaries.

This might require some custom support for license_files, as dataclasses.asdict only covers fields, but not custom properties.

After the initial preparations, pip-licenses-lib==0.3.0 can be released safely.
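The asdict limitation can be demonstrated with a reduced stand-in for the container class (the class and property names below are illustrative):

```python
from dataclasses import dataclass, asdict


@dataclass
class Package:
    name: str

    @property
    def license_files(self):
        # Computed properties are invisible to dataclasses.asdict,
        # which only serializes declared fields.
        return ['LICENSE']


print(asdict(Package(name='demo')))  # {'name': 'demo'} -- no 'license_files'
```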

Expose API for downloading Crates for a Cargo.lock file

Expose a simple API to download Rust/Cargo crates for a Cargo.lock file for further analysis. Maybe add a way for getting the repository URL and the version license as well (see package section of Cargo.toml file or corresponding API).

Basic draft (not yet respecting the API limits/requirements from https://crates.io/data-access):

import hashlib
import sys
from pathlib import Path

import requests

try:
    import tomllib  # Standard library since Python 3.11.
except ImportError:
    import tomli as tomllib


CRATES_IO_SOURCE = 'registry+https://github.com/rust-lang/crates.io-index'


def get_package_versions(lock_file):
    # Yield (name, version, checksum) for each crates.io package in the lock file.
    with open(lock_file, mode='rb') as fd:
        data = tomllib.load(fd)
    for package in data['package']:
        if package.get('source') != CRATES_IO_SOURCE:
            # Skip path/git dependencies and the workspace members themselves.
            print('Skipping', package)
            continue
        yield package['name'], package['version'], package['checksum']


def download_from_lock_file(lock_file, target_directory):
    session = requests.Session()
    target_directory = Path(target_directory)

    for name, version, checksum in get_package_versions(lock_file):
        url = f'https://crates.io/api/v1/crates/{name}/{version}/download'
        response = session.get(url)
        if response.status_code != 200:
            print(response)
            continue
        target_directory.joinpath(f'{name}_{version}.crate').write_bytes(response.content)
        # Verify the download against the checksum recorded in Cargo.lock.
        digest = hashlib.sha256(response.content).hexdigest()
        assert checksum == digest, url


def main():
    lock_file = sys.argv[1]
    target_directory = sys.argv[2]
    download_from_lock_file(lock_file, target_directory)


if __name__ == '__main__':
    main()

Cleanup unpacked archives when not running in temporary directory

At the moment, cleaning up unpacked archives that are part of an existing directory does not work correctly when the archive has not been downloaded on the fly. The reason is that the corresponding logic mostly has temporary downloads in mind (unpacking the archive into a subdirectory of the directory the archive resides in), which tends to be my primary use case:

subdirectory = path.parent / f'{name}_{"_".join(path.suffixes).replace(".", "")}'

Example:

from license_tools import retrieval

file_results = retrieval.run(
    directory="/home/user/directory_with_archive",
    retrieve_copyrights=True,
    retrieve_emails=True,
    retrieve_file_info=False,
    retrieve_urls=True,
    retrieve_ldd_data=True,
    retrieve_font_data=True,
)

This will unpack file.jar and generate a corresponding /home/user/directory_with_archive/file_jar directory, which should be removed on exit so the run does not modify the file system state.

This has some side effects as well when running analysis on the directory multiple times:

  • The first time, the archive file will be unpacked and not deleted, only considering the archive content once.
  • For further runs, the archive file will be unpacked into a randomly named directory and deleted afterwards, considering the archive content twice in total (once for the existing directory name causing the naming conflict, once for the new directory name).

CLI: Configure log level

The application might emit info messages. For this reason, allow the user to configure the desired log level with a CLI parameter.
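A sketch of such a parameter using argparse; the flag name and choices are assumptions, not the actual license_tools CLI:

```python
import argparse
import logging

parser = argparse.ArgumentParser()
parser.add_argument(
    '--log-level',
    default='WARNING',
    choices=['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'],
    help='Set the application log level.',
)

# Simulated invocation; a real CLI would call parser.parse_args() without arguments.
args = parser.parse_args(['--log-level', 'INFO'])
# Map the level name onto the corresponding logging constant.
logging.basicConfig(level=getattr(logging, args.log_level))
```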

Make Cargo.toml metadata printing optional

At the moment, run_on_file will print the Cargo.toml metadata in all cases. This complicates library usage, thus there should be a corresponding flag for it instead.

Further process `ldd` output

Currently, we just forward the output of ldd for ELF binaries from a subprocess and print it to stdout directly. This probably can be improved.

As we already use the ScanCode toolkit, I evaluated the elf-inspector tool, but it currently lacks important features: it does not provide the full dependency paths (needed to decide whether they originate from the OS or are shipped with the package itself) and it omits the libc dependency in some cases: nexB/elf-inspector#4.
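As a starting point, the raw output could be parsed into structured records; this sketch assumes glibc's ldd line format ("name => path (address)" or "path (address)") and is not part of the current code base:

```python
import re

# Matches 'libc.so.6 => /lib/... (0x...)' as well as 'linux-vdso.so.1 (0x...)'.
LDD_LINE = re.compile(
    r'^\s*(?P<name>\S+)(?: => (?P<path>\S+))?(?: \((?P<address>0x[0-9a-f]+)\))?$'
)


def parse_ldd_output(output: str) -> list:
    entries = []
    for line in output.splitlines():
        match = LDD_LINE.match(line)
        if match and match.group('name'):
            entries.append(match.groupdict())
    return entries


sample = (
    '\tlinux-vdso.so.1 (0x00007ffd2e5fe000)\n'
    '\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1a2c000000)\n'
)
```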

Add support for scancode_toolkit==32.1.0

The latest release, scancode_toolkit==32.1.0, introduced some incompatible changes which require corresponding adjustments in our code as well.

The direct failures are all related to the same source code line:

======================================================================
ERROR: test_rpm (test_retrieval.RunOnPackageArchiveFileTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/license_tools/license_tools/tests/test_retrieval.py", line 441, in test_rpm
    self._check_call(
  File "/home/runner/work/license_tools/license_tools/tests/test_retrieval.py", line 414, in _check_call
    result = list(
  File "/home/runner/work/license_tools/license_tools/license_tools/retrieval.py", line 292, in run_on_package_archive_file
    archive_results = _run_on_archive_file(path=archive_path, short_path=archive_path.name, default_to_none=True)
  File "/home/runner/work/license_tools/license_tools/license_tools/retrieval.py", line 131, in _run_on_archive_file
    rpm_results = PackageResults.from_rpm(path)
  File "/home/runner/work/license_tools/license_tools/license_tools/tools/scancode_tools.py", line 335, in from_rpm
    return cls(**data)
  File "<string>", line 40, in __init__
  File "/home/runner/work/license_tools/license_tools/license_tools/tools/scancode_tools.py", line 318, in __post_init__
    self.license_detections = [
  File "/home/runner/work/license_tools/license_tools/license_tools/tools/scancode_tools.py", line 319, in <listcomp>
    LicenseDetection(**x) if not isinstance(x, LicenseDetection) else x for x in self.license_detections  # type: ignore[arg-type]
TypeError: __init__() got an unexpected keyword argument 'license_expression_spdx'
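One hypothetical mitigation (not necessarily the fix adopted here) is to drop keys the dataclass does not declare before instantiating it, so that new upstream fields such as license_expression_spdx no longer break __init__; the class below is a heavily reduced stand-in for the real one:

```python
from dataclasses import dataclass, fields


@dataclass
class LicenseDetection:
    # Reduced illustration of the real class.
    license_expression: str


def init_ignoring_unknown(cls, data):
    # Keep only keys that correspond to declared dataclass fields.
    known = {field.name for field in fields(cls)}
    return cls(**{key: value for key, value in data.items() if key in known})


detection = init_ignoring_unknown(
    LicenseDetection,
    {'license_expression': 'mit', 'license_expression_spdx': 'MIT'},
)
```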

Extract RPM files

At the moment, extracting (source) RPM files requires manual attention. This should probably be automated.

For the best experience, this should probably be implemented after #6.

Replace NOT_REQUESTED by None

At the moment, the FileResults class uses the NOT_REQUESTED object to indicate information that has not been requested. This seems to cause issues when combined with _get_dummy_file_results, as the NOT_REQUESTED sentinel can end up as different object instances which do not compare equal.

To make this more consistent with the remaining optional fields as well, rewrite this to use | None = None instead of | object = NOT_REQUESTED.

Example reproducing this issue:

from license_tools import retrieval

file_results = retrieval.run(
    directory="directory_containing_archive_file",
    retrieve_copyrights=True,
    retrieve_emails=True,
    retrieve_file_info=False,
    retrieve_urls=True,
    retrieve_ldd_data=True,
    retrieve_font_data=True,
)

for file_result in file_results:
    print(file_result.short_path)
    if file_result.copyrights != retrieval.NOT_REQUESTED and (file_result.copyrights.copyrights or file_result.copyrights.holders or file_result.copyrights.authors):
        print(file_result.copyrights)
    if file_result.emails != retrieval.NOT_REQUESTED and file_result.emails.emails:
        print(file_result.emails.emails)
    if file_result.urls != retrieval.NOT_REQUESTED and file_result.urls.urls:
        print(file_result.urls.urls)

The conditions will fail for the archive file:

AttributeError: 'object' object has no attribute 'copyrights'
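The underlying pitfall can be shown in isolation, assuming the sentinel is a plain object() as the error message suggests: such sentinels compare only by identity, so a sentinel re-created elsewhere (e.g. by a second import of the module) no longer matches, while None is always a single unambiguous value.

```python
NOT_REQUESTED_A = object()
NOT_REQUESTED_B = object()  # e.g. created by a second import of the module

# Distinct object() instances are never equal to each other.
print(NOT_REQUESTED_A == NOT_REQUESTED_B)  # False

# None has no such problem: there is exactly one None.
value = None
print(value is None)  # True
```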
