stefan6419846 / license_tools

Collection of tools for working with Open Source licenses.

Home Page: https://license-tools.readthedocs.io/
License: Apache License 2.0
At the moment, the RPM file modes are wrong due to an upstream bug: https://github.com/srossross/rpmfile/issues/48. Once this is fixed, we should be able to provide a meaningful mapping for the file modes based upon the stat module as well.
Nested archives are not handled at the moment. This is required for .src.rpm files, for example, which usually ship an additional archive containing the actual source code.
get_package_versions currently prints all cases where a package is skipped due to its repository URL. We should use logging instead to allow further customization with log levels and redirection.
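A minimal sketch of what the logging-based variant could look like; the logger name and the report_skipped helper are assumptions for illustration, not the project's actual API:

```python
import logging

logger = logging.getLogger("license_tools")


def report_skipped(package):
    # Instead of `print('Skipping', package)`, emit a log record that callers
    # can filter by level or redirect to their own handlers.
    logger.info("Skipping %s due to its repository URL", package)
```

Library consumers could then silence these messages with `logging.getLogger("license_tools").setLevel(logging.WARNING)` or route them to a file handler.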
We already analyze the metadata of font and RPM files and could extend this to Python packages as well. Some basic implementations are available from https://github.com/stefan6419846/pip-licenses-lib.
Some things which could be considered:
At the moment, run_on_directory uses TemporaryDirectoryWithFixedName to handle nested archives by unpacking them and deleting them after the analysis. This does not allow using this method to permanently store the extracted files for further analysis.
There are two possible approaches to fix this:
- Extend run_on_directory to receive a dedicated parameter which avoids deleting the unpacked archive directory after the analysis.

OTF files are currently being skipped, although they are already supported.
At the moment, run_on_downloaded_archive_file and run_on_downloaded_package_file download the files directly. As this functionality could be useful for other cases as well, the corresponding download code should be moved to the download utils (for generic archive downloads) and the pip tools (for pip-specific package downloads without URL).
At the moment, LDD analysis is restricted to shared objects. It should be extended to all types of Linux binaries.
The get_files_from_directory functionality is generic enough to be moved to the path utils instead to allow further usage if required.
At the moment, our tests are only running on Ubuntu 22.04. This does not allow for the full suite of RPM-based tooling/testing, as required for #25. Therefore, a corresponding RPM-based distribution, probably OpenSUSE Leap due to already using it for development, should be added to GitHub Actions.
Docs: https://docs.github.com/en/actions/using-jobs/running-jobs-in-a-container, containers are at https://hub.docker.com/r/opensuse/leap/tags
It seems like we need at least OpenSUSE Leap 15.2 for https://github.blog/changelog/2024-03-07-github-actions-all-actions-will-run-on-node20-instead-of-node16-by-default/, but as the currently supported minimum version is 15.5 anyway, this should be no big deal.
Prepare for pip-licenses-lib>=0.3.0 (not yet released), which uses dataclasses as containers instead of dictionaries. This might require some custom support for license_files, as dataclasses.asdict only covers fields, but not custom properties. After the initial preparations, pip-licenses-lib==0.3.0 can be released safely.
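The asdict limitation can be illustrated with a small self-contained example; the PackageInfo class and its license_files property are made up for illustration and do not reflect the actual pip-licenses-lib API:

```python
import dataclasses


@dataclasses.dataclass
class PackageInfo:
    # Hypothetical container mimicking the planned dataclass-based API.
    name: str
    version: str

    @property
    def license_files(self):
        # Computed property: not a field, so asdict() will not include it.
        return [f"{self.name}-LICENSE"]


info = PackageInfo(name="example", version="1.0")
print(dataclasses.asdict(info))  # license_files is missing from the result
```

Any code that serializes these containers therefore needs to handle such properties explicitly.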
Expose a simple API to download Rust/Cargo crates for a Cargo.lock file for further analysis. Maybe add a way to retrieve the repository URL and the license of the respective version as well (see the package section of the Cargo.toml file or the corresponding API).
Basic draft (not yet respecting the API limits/requirements from https://crates.io/data-access):
import hashlib
import sys
from pathlib import Path

import requests
import tomli


def get_package_versions(lock_file):
    # Yield (name, version, checksum) for each crates.io package in the lock file.
    with open(lock_file, mode='rb') as fd:
        data = tomli.load(fd)
    for package in data['package']:
        if package.get('source') != 'registry+https://github.com/rust-lang/crates.io-index':
            print('Skipping', package)
            continue
        yield package['name'], package['version'], package['checksum']


def download_from_lock_file(lock_file, target_directory):
    session = requests.Session()
    target_directory = Path(target_directory)
    for name, version, checksum in get_package_versions(lock_file):
        url = f'https://crates.io/api/v1/crates/{name}/{version}/download'
        response = session.get(url)
        if response.status_code != 200:
            print(response)
            continue
        target_directory.joinpath(f'{name}_{version}.crate').write_bytes(response.content)
        # Verify the download against the checksum recorded in the lock file.
        digest = hashlib.sha256(response.content).hexdigest()
        assert checksum == digest, url


def main():
    lock_file = sys.argv[1]
    target_directory = sys.argv[2]
    download_from_lock_file(lock_file, target_directory)


if __name__ == '__main__':
    main()
When the linking analysis reports the shared objects, there should be an option to resolve these to the owning OS package if they are not package-local. The output should be the package and the installed version, maybe even a source download link if it can be determined.
Possible implementation: https://unix.stackexchange.com/questions/158041/
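A minimal sketch along the lines of the linked answer, using the RPM database; the function name is an assumption, and a dpkg-based system would use `dpkg -S` instead:

```python
import subprocess


def resolve_owning_package(shared_object_path):
    # Hypothetical sketch: ask the RPM database which package owns the file.
    try:
        result = subprocess.run(
            ["rpm", "-qf", shared_object_path],
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        # The `rpm` binary itself is not available on this system.
        return None
    if result.returncode != 0:
        # The file is not owned by any installed package.
        return None
    # `rpm -qf` prints the owning package name including its version.
    return result.stdout.strip()
```

Deriving a source download link from the package name would require an additional, distribution-specific lookup.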
check_shared_objects currently uses sys.stderr to report shared objects using symlinks. We should use logging instead to allow further customization with log levels and redirection.
At the moment, cleaning up unpacked archives that are part of a directory does not work correctly when they have not been downloaded on the fly. The reason is that the corresponding logic mostly has temporary downloads in mind (unpacking the archive into a subdirectory of the directory the archive is part of), which tends to be my primary use case:
license_tools/license_tools/retrieval.py
Line 259 in 95b0a10
Example:
from license_tools import retrieval

file_results = retrieval.run(
    directory="/home/user/directory_with_archive",
    retrieve_copyrights=True,
    retrieve_emails=True,
    retrieve_file_info=False,
    retrieve_urls=True,
    retrieve_ldd_data=True,
    retrieve_font_data=True,
)
This will unpack file.jar and generate a corresponding /home/user/directory_with_archive/file_jar directory, which should be removed on exit to not modify the file system state from before the run.
This has some side effects as well when running analysis on the directory multiple times:
The application might emit info messages. For this reason, allow the user to configure the desired log level with a CLI parameter.
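A minimal sketch of such a CLI parameter; the flag name --log-level and the helper names are assumptions for illustration:

```python
import argparse
import logging


def build_parser():
    parser = argparse.ArgumentParser()
    # Hypothetical flag; the choices map directly to the standard logging levels.
    parser.add_argument(
        "--log-level",
        default="WARNING",
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
    )
    return parser


def configure_logging(arguments):
    # Translate the string value into the numeric logging level.
    logging.basicConfig(level=getattr(logging, arguments.log_level))
```

With a WARNING default, the info messages stay hidden unless the user opts in via `--log-level INFO`.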
Images might have additional metadata which should be retrieved if a specific flag is active.
Possible metadata formats to evaluate/consider:
At the moment, run_on_file will print the Cargo.toml metadata in all cases. This complicates library usage, thus there should be a corresponding flag for it instead.
Currently, we just forward the output of ldd for ELF binaries from a subprocess and print it to stdout directly. This probably can be improved.
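One improvement could be to parse the ldd output into structured data instead of forwarding it verbatim; a sketch, assuming the common `soname => path (address)` line format:

```python
import re

# Matches lines like "\tlibc.so.6 => /lib64/libc.so.6 (0x00007f2a1c000000)".
_LDD_LINE = re.compile(r"^\s*(\S+)\s*=>\s*(\S+)\s+\(0x[0-9a-f]+\)")


def parse_ldd_output(text):
    # Turn the raw `ldd` text into a {soname: resolved path} mapping.
    # Lines without "=>" (e.g. the vDSO) are skipped.
    dependencies = {}
    for line in text.splitlines():
        match = _LDD_LINE.match(line)
        if match:
            dependencies[match.group(1)] = match.group(2)
    return dependencies
```

Structured results would also make it easier to decide later which dependencies are package-local and which come from the OS.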
Due to already using the ScanCode toolkit, I evaluated the elf-inspector
tool, but this currently lacks important features as it does not provide the full dependency paths (to decide whether they originate from the OS or whether they are shipped with the package itself) and somehow omits the libc dependency in some cases: nexB/elf-inspector#4.
The latest release, scancode_toolkit==32.1.0, introduced some incompatible changes which require corresponding adjustments in our code as well.
The direct failures are all related to the same source code line:
======================================================================
ERROR: test_rpm (test_retrieval.RunOnPackageArchiveFileTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/license_tools/license_tools/tests/test_retrieval.py", line 441, in test_rpm
    self._check_call(
  File "/home/runner/work/license_tools/license_tools/tests/test_retrieval.py", line 414, in _check_call
    result = list(
  File "/home/runner/work/license_tools/license_tools/license_tools/retrieval.py", line 292, in run_on_package_archive_file
    archive_results = _run_on_archive_file(path=archive_path, short_path=archive_path.name, default_to_none=True)
  File "/home/runner/work/license_tools/license_tools/license_tools/retrieval.py", line 131, in _run_on_archive_file
    rpm_results = PackageResults.from_rpm(path)
  File "/home/runner/work/license_tools/license_tools/license_tools/tools/scancode_tools.py", line 335, in from_rpm
    return cls(**data)
  File "<string>", line 40, in __init__
  File "/home/runner/work/license_tools/license_tools/license_tools/tools/scancode_tools.py", line 318, in __post_init__
    self.license_detections = [
  File "/home/runner/work/license_tools/license_tools/license_tools/tools/scancode_tools.py", line 319, in <listcomp>
    LicenseDetection(**x) if not isinstance(x, LicenseDetection) else x for x in self.license_detections  # type: ignore[arg-type]
TypeError: __init__() got an unexpected keyword argument 'license_expression_spdx'
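One possible defensive approach, sketched here as an assumption rather than the fix the project will necessarily adopt, is to drop keys the local dataclass definition does not declare before instantiating it:

```python
import dataclasses


def from_dict_ignoring_unknown(cls, data):
    # Drop keys the current dataclass definition does not declare,
    # e.g. new fields introduced by a ScanCode upgrade such as
    # `license_expression_spdx`.
    known_fields = {field.name for field in dataclasses.fields(cls)}
    return cls(**{key: value for key, value in data.items() if key in known_fields})


@dataclasses.dataclass
class LicenseDetection:
    # Simplified stand-in for the real class in scancode_tools.
    license_expression: str


detection = from_dict_ignoring_unknown(
    LicenseDetection,
    {"license_expression": "apache-2.0", "license_expression_spdx": "Apache-2.0"},
)
```

The alternative is to add the new fields to our own dataclasses so the additional information is preserved instead of being silently discarded.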
At the moment, extracting (source) RPM files requires manual attention. This should probably be automated.
For the best experience, this should probably be implemented after #6.
Add a switch for the --package option to download either a binary wheel or the source distribution. This is especially useful for native code to have a look at the source code copyrights as well.
At the moment, the FileResults class uses the NOT_REQUESTED object to indicate information that has not been requested. This seems to cause issues when combined with _get_dummy_file_results, as the NOT_REQUESTED object seems to have different values.
To make this more consistent with the remaining optional fields as well, rewrite this to use `| None = None` instead of `| object = NOT_REQUESTED`.
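A minimal sketch of the proposed style; the Copyrights and FileResults classes below are simplified stand-ins for the real containers:

```python
from __future__ import annotations

import dataclasses


@dataclasses.dataclass
class Copyrights:
    # Simplified stand-in for the real result container.
    copyrights: list


@dataclasses.dataclass
class FileResults:
    short_path: str
    # Proposed style: a plain Optional field instead of a shared sentinel,
    # i.e. `Copyrights | None = None` rather than `Copyrights | object = NOT_REQUESTED`.
    copyrights: Copyrights | None = None


result = FileResults(short_path="archive.jar")
if result.copyrights is not None:  # unambiguous check, works across instances
    print(result.copyrights.copyrights)
```

An `is not None` check cannot break on identity mismatches the way comparing against a sentinel object can.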
Example reproducing this issue:
from license_tools import retrieval

file_results = retrieval.run(
    directory="directory_containing_archive_file",
    retrieve_copyrights=True,
    retrieve_emails=True,
    retrieve_file_info=False,
    retrieve_urls=True,
    retrieve_ldd_data=True,
    retrieve_font_data=True,
)

for file_result in file_results:
    print(file_result.short_path)
    if file_result.copyrights != retrieval.NOT_REQUESTED and (file_result.copyrights.copyrights or file_result.copyrights.holders or file_result.copyrights.authors):
        print(file_result.copyrights)
    if file_result.emails != retrieval.NOT_REQUESTED and file_result.emails.emails:
        print(file_result.emails.emails)
    if file_result.urls != retrieval.NOT_REQUESTED and file_result.urls.urls:
        print(file_result.urls.urls)
The conditions will fail for the archive file:
AttributeError: 'object' object has no attribute 'copyrights'