
gimie's Introduction

gimie


Gimie (GIt Meta Information Extractor) is a python library and command line tool to extract structured metadata from git repositories.

Context

Scientific code repositories contain valuable metadata which can be used to enrich existing catalogues, platforms or databases. This tool aims to easily extract structured metadata from generic git repositories. It can extract metadata from the Git provider (GitHub or GitLab) or from the git index itself.


Using Gimie: easy peasy, it's a 3 step process.

1: Installation

To install the stable version on PyPI:

pip install gimie

To install the dev version from github:

pip install git+https://github.com/sdsc-ordes/gimie.git@main#egg=gimie

Gimie is also available as a docker container hosted on the Github container registry:

docker pull ghcr.io/sdsc-ordes/gimie:latest

# The access token can be provided as an environment variable
docker run -e GITHUB_TOKEN=$GITHUB_TOKEN ghcr.io/sdsc-ordes/gimie:latest gimie data <repo>

2: Set your credentials

In order to access the github api, you need to provide a github token with the read:org scope.

A. Create access tokens

New to access tokens? Or don't know how to get your Github / Gitlab token ?

Have no fear, see here for Github tokens and here for Gitlab tokens. (Note: tokens are as precious as passwords! Treat them as such.)

B. Set your access tokens via the Terminal

Gimie will use your access tokens to gather information for you. If you want info about a Github repo, Gimie needs your Github token; if you want info about a Gitlab Project then Gimie needs your Gitlab token.

Add your tokens one by one in your terminal. Your Github token:

export GITHUB_TOKEN=

and/or your Gitlab token:

export GITLAB_TOKEN=

3: GIMIE info! Run Gimie

As a command line tool

gimie data https://github.com/numpy/numpy

(want a Gitlab project instead? Just replace the URL in the command line)

As a python library

from gimie.project import Project
proj = Project("https://github.com/numpy/numpy")

# To retrieve the rdflib.Graph object
g = proj.extract()

# To retrieve the serialized graph
g_in_ttl = g.serialize(format='ttl')
print(g_in_ttl)

For more advanced use see the documentation.

Outputs

The default output is Turtle, a textual syntax for the RDF data model. We follow the schema recommended by codemeta. Supported formats are turtle, json-ld and n-triples (specify the --format argument in your call, e.g. gimie data https://github.com/numpy/numpy --format 'ttl').

With no further arguments, Gimie will print results in the terminal. Want to save Gimie output to a file? Add your file path at the end: gimie data https://github.com/numpy/numpy > path_to_output/gimie_output.ttl


Contributing

All contributions are welcome. New functions and classes should have associated tests and docstrings following the numpy style guide.

The code formatting standard we use is black, with --line-length=79 to follow PEP8 recommendations. We use pytest as our testing framework. This project uses pyproject.toml to define package information, requirements and tooling configuration.

For development:

activate a conda or virtual environment with Python 3.8 or higher

git clone https://github.com/sdsc-ordes/gimie && cd gimie
make install

run tests:

make test

run checks:

make check

for easier use of the Github/Gitlab APIs, place your access tokens in the .env file (and don't worry, the .gitignore will keep them out of the repository when you push to GitHub):

cp .env.dist .env

build documentation:

make doc

Releases and Publishing on PyPI

Releases are done via Github releases.

  • A release will trigger a github workflow to publish the package on PyPI
  • Make sure to update to a new version in pyproject.toml before making the release
  • It is possible to test publishing on PyPI Test by running a manual workflow: go to github actions and run the workflow 'Publish on Pypi Test'


gimie's Issues

graphql error on some large repos

When calling gimie on numpy/numpy, the license is missing from the output.

It occasionally crashes with:

gimie data --exclude-parser license https://github.com/numpy/numpy 
gimie/gimie/extractors/github.py:239 in _repo_data    │
│                                                                                                  │
│   236 │   │   response = send_graphql_query(GH_API, repo_query, data, self._headers)             │
│   237 │   │                                                                                      │
│   238 │   │   if "errors" in response:                                                           │
│ ❱ 239 │   │   │   raise ValueError(response["errors"])                                           │
│   240 │   │                                                                                      │
│   241 │   │   return response["data"]["repository"]   

ValueError: [{'message': 'Something went wrong while executing your query. Please include [...] when reporting this issue.'}]

Incorrect value for codeRepository

When running gimie on a GitHub repository, schema:codeRepository is a local path instead of the URL. It is also incorrectly capitalized (CodeRepository instead of codeRepository).

This happens with gimie 0.5.0

[Gimie] Programming Language detection

Objective:

Write a function that identifies the programming language(s) used in a git repository based on file extensions (see the sketch after the requirements below).

Requirements

  • Describe the programming language extraction and mapping
  • [ ] Identify schema.org mapping
  • [ ] Write PoC / Example
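
A minimal sketch of what such extension-based detection could look like (the extension table and function name are illustrative, not gimie's actual implementation):

from collections import Counter
from pathlib import Path
from typing import List

# Illustrative mapping of file extensions to language names
EXTENSION_LANGUAGES = {".py": "Python", ".r": "R", ".jl": "Julia", ".c": "C", ".rs": "Rust"}

def detect_languages(repo_path: str) -> List[str]:
    """Return languages found in a repository, most frequent first."""
    counts = Counter()
    for path in Path(repo_path).rglob("*"):
        language = EXTENSION_LANGUAGES.get(path.suffix.lower())
        if path.is_file() and language:
            counts[language] += 1
    return [language for language, _ in counts.most_common()]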

[gimie] Use Github GraphQL API

Extracting metadata from Github's REST API requires multiple requests (at least one per contributor). This results in unacceptably long wait times for large repositories. Github provides a GraphQL endpoint, which exposes largely the same data as the REST endpoint.

Using GraphQL has 2 main advantages:

  • We can retrieve only the desired fields
  • A single nested query can be used by specifying the complete desired response model

Objective: Fix speed issues by replacing Github REST API calls with a single GraphQL query.

Requirements:

  • Prepare query using the explorer
    • draft query here (query time for renku: 1.04s instead of 38.7s with REST)
  • rewrite the GithubExtractor._request() method to use the query
  • Adjust _get...() helper methods to extract nested fields from the response
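
For illustration, a single GraphQL request could look roughly like this (the query fields and variable names are a sketch, not gimie's actual query):

import os
import requests

GH_API = "https://api.github.com/graphql"
query = """
query repo($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    name
    description
    createdAt
    licenseInfo { spdxId }
  }
}
"""
headers = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}
variables = {"owner": "numpy", "name": "numpy"}
response = requests.post(GH_API, json={"query": query, "variables": variables}, headers=headers).json()
if "errors" in response:
    raise ValueError(response["errors"])
print(response["data"]["repository"])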

[Gimie] Docker container for gimie executable

To facilitate the installation of gimie, we thought it might be a good idea to have it containerized. Here is an example of the desired execution:

$ docker run gimie --version
gimie 0.2.0

For this, we need to add a Dockerfile to the repo and publish the gimie image to a Docker registry.

Acceptance criteria:

  • Create gimie Dockerfile
  • Publish on a Docker registry
  • Add Docker registry publication to the CI/CD process

Fix docker push CI

the docker build and push CI workflow fails with:

#24 2.746 OSError: libgomp.so.1: cannot open shared object file: No such file or directory

Objective: Fix CI

Requirements:

  • Investigate issue in logs
  • Update Dockerfile as required
  • Trigger image push

[gimie] Handle repository versioning

Currently, gimie sets the latest release of a repository as the version. This is not the correct way to handle versioning, as breaking changes can happen between the last release and HEAD. We need to allow users to refer to specific releases (tags).

The desired behaviour is as follows:

  • gimie data <repo-url> -> empty version field (refers to HEAD)
  • gimie data <tag-url> -> set version field to tag (fixed version)

Objective: Record repository release only when specified by user.

[gimie] github properties beyond codemetapy

Git providers (Gitlab, Github, Codeberg, ...) provide additional useful information with no corresponding codemeta property. Namely:

  • Where the repository is forked from
  • How many forks the repository has
  • How many stars the repository has

Would it make sense to add such properties, for example using schema.org terms? One example that comes to mind would be to use schema:isBasedOn for forks:

<downstream-repo> schema:isBasedOn <upstream-repo>

For stars, maybe schema:InteractionStatistic ? But that seems a bit convoluted.

Would love to hear your suggestions @rmfranken

[Gimie] Generic git metadata

Extract relevant information contained in the .git folder. This information can be retrieved using packages such as pydriller.

Objective: Given a URL, leverage an existing library to extract relevant metadata embedded in the git repository.

Requirements

  • Retrieve list of authors / contributors
    • Should the repo creator be identified separately?
  • Retrieve creation date
  • Retrieve release dates (e.g. dates of git tags)
  • [ ] Identify schema.org/ mappings for the fields
  • [ ] PoC to extract fields
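
A rough example of pydriller-based extraction (a sketch, not gimie's implementation; the repository URL is just an example):

from pydriller import Repository

authors = set()
creation_date = None
# Traverse commits of a local or remote repository, oldest first
for commit in Repository("https://github.com/sdsc-ordes/gimie").traverse_commits():
    authors.add(commit.author.name)
    if creation_date is None:
        creation_date = commit.author_date  # date of the first commit
print(creation_date, sorted(authors))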

Extract license when unavailable

Currently, gimie only retrieves the license from the GitHub API.
The GitLab API does not provide the license at all, and the GitHub API's license detection is limited and will return NOASSERTION when:

  • the license is in a different file (e.g. LICENSE.md)
  • multiple licenses are found in the repo
  • the license is slightly different (e.g. added copyright notice)
  • the license is not one of the most popular licenses

For cases when gimie fails to retrieve a license from the Git provider, it would be preferable to extract it locally. The process could look as follows:

graph TD;
    query[Send API request to git provider] -->check[license in response?];
    check -->|yes| return[Success];
    check -->|no| clone[Clone repository];
    clone --> extract[Locally extract license];
    extract --> add[Add to graph];
    add --> return

This would add considerable overhead, but only in cases where the license cannot be determined from the provider. This approach could also be applied to other attributes that may be missing from the API but present in the repo.

Implement license detection for GitExtractor

#68 added support for explicit license detection (via scancode) from GitHub and GitLab repositories. We should implement the same feature in the (local) GitExtractor so that it also works with other git providers.

Objective: Support for license detection in GitExtractor

Requirements:

  • Implement list_files() in GitExtractor
  • Implement _get_license() in GitExtractor
  • Update calamus schema to support multiple licenses in GitExtractor
  • Add relevant test cases

provide generic file object

Certain metadata fields require inspecting the contents of files within the repository.
To write generic functions that can analyze the repository contents, we need a standard file-like interface.

The interface must:

  • Support remote and local resources
  • Behave like a file
  • Keep the filename information
  • Forward headers when getting files from remote resources
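
A possible shape for such an interface (class and attribute names are illustrative, not an agreed design):

from abc import ABC, abstractmethod
from io import BytesIO
from typing import Dict, Optional

import requests

class Resource(ABC):
    """A named, file-like handle usable for both local and remote content."""

    def __init__(self, name: str):
        self.name = name  # keep the filename information

    @abstractmethod
    def open(self) -> BytesIO:
        """Return a binary file-like object with the resource contents."""

class LocalResource(Resource):
    def __init__(self, name: str, path: str):
        super().__init__(name)
        self.path = path

    def open(self) -> BytesIO:
        with open(self.path, "rb") as handle:
            return BytesIO(handle.read())

class RemoteResource(Resource):
    def __init__(self, name: str, url: str, headers: Optional[Dict[str, str]] = None):
        super().__init__(name)
        self.url = url
        self.headers = headers or {}  # forward e.g. auth headers to the provider

    def open(self) -> BytesIO:
        return BytesIO(requests.get(self.url, headers=self.headers).content)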

[Gimie] HTML metadata

Important metadata can be embedded in the repository webpage's HTML code. Existing standards (opengraph, meta tags, schema.microdata, RDFa) can help us access this information in a standardized way using libraries such as extruct.

Objective: Extract relevant repository metadata from HTML page

Requirements:

  • Investigate relevant fields available on major git providers pages
  • Implement the WebMetadata class accordingly
  • [ ] Identify schema.org mappings
  • [ ] Write PoC / example
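
A quick sketch of what extruct-based extraction could look like (URL and chosen syntaxes are just for illustration):

import extruct
import requests

url = "https://github.com/sdsc-ordes/gimie"
html = requests.get(url).text
# Extract embedded metadata (JSON-LD, microdata, OpenGraph, ...) from the page
metadata = extruct.extract(html, base_url=url, syntaxes=["json-ld", "microdata", "opengraph"])
print(metadata["opengraph"])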

Rework Extractor interface

The Extractor interface currently takes a path: str as input. This is vague and not very flexible. In particular, this will not work easily with custom gitlab instances (whose URL may extend beyond the TLD).
We should use more specific inputs, namely:

  • instance_url: str: The base URL to the git provider instance (e.g., gitlab.com, renkulab.io/gitlab)
  • project_path: str: The path to the project in the git instance group/subgroup/project
  • local_path: Optional[str]: The local path where the project was cloned (if it was cloned)
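
For illustration, the reworked inputs could be grouped as follows (a sketch using the field names proposed above):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Extractor:
    instance_url: str                  # e.g. "https://gitlab.com" or "https://renkulab.io/gitlab"
    project_path: str                  # e.g. "group/subgroup/project"
    local_path: Optional[str] = None   # where the project was cloned, if it was cloned

    @property
    def project_url(self) -> str:
        return f"{self.instance_url.rstrip('/')}/{self.project_path}"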

[Gimie] Discuss program structure

The architecture of Gimie may need to be simplified or redesigned to improve maintainability. The codebase being relatively small, now would be a good time to refactor. Below is the current architecture. The ProjectGraph placeholder class is not shown in the diagram, but serves as an example of how the "user facing" class should be serialized.

Objective: Discuss issues with the current structure and propose an improved model.

classDiagram
class Repo {
  << Could implement file management (clone, locate license, ...) >>
    path: str
    files_meta: FilesMetadata
    git_meta: GitMetadata
    license_meta: LicenseMetadata
    get_files_meta(path) -> FilesMetadata
    get_git_meta(path) -> GitMetadata
    get_license_meta(path) -> LicenseMetadata
}
class GitMetadata {
    path: str
    authors: Tuple[str]
    creation_date: datetime
    creator: str
    releases: Tuple[Release]

}
class Release {
    date: datetime
    tag: str
    commit_hash: str
}
class LicenseMetadata {
    paths: Tuple[str]
    get_licenses(min_score: int) -> List[str]

}
class FilesMetadata {
   << Could be dropped >>
    project_path: str
    locate_licenses(project_path) -> List[str]

}
GitMetadata --* Release
Repo --* GitMetadata
Repo --* LicenseMetadata
Repo --* FilesMetadata

Add Parser concept

Currently, we have an Extractor class, whose job is to extract all metadata about a repository.
With #70 and #68, Extractor has the ability to list_files() present in a repository and access their contents.

Ideally, the responsibility of an extractor should stop there. It should not be responsible for extracting metadata from file contents.

The proposal here is to have a separate object responsible for it: Parser. A Parser would take a file as input and extract specific RDF triples from it. The Repo's RDF graph could then be enriched using the Parser graphs.

graph TD;
    repo[Repository URL]-->ext{Extractor};
    ext --> meta[Metadata];
    ext --> files[Files];
    meta --> repograph{Repository};
    repograph --> repo_rdf[Repo RDF];
    files --> parser{Parser};
    parser --> spec_rdf[Specific RDF];
    repo_rdf --> union((Union));
    spec_rdf --> union;
    union --> enhanced[Enhanced RDF];

Parsers could be added for pyproject.toml, setup.py, licenses, Cargo.toml, R's DESCRIPTION, package.json, etc...
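
A minimal sketch of what the Parser interface could look like (names are hypothetical, not an agreed design):

from abc import ABC, abstractmethod

from rdflib import Graph

class Parser(ABC):
    """Extracts specific RDF triples from the contents of a single file."""

    @abstractmethod
    def parse(self, data: bytes) -> Graph:
        """Return a graph of triples extracted from the file contents."""

class LicenseParser(Parser):
    def parse(self, data: bytes) -> Graph:
        graph = Graph()
        # ... match the license text and add a schema:license triple ...
        return graph

# The repository graph could then be enriched with each parser's output:
# repo_graph += LicenseParser().parse(license_file_contents)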

Python package parser

Python package definitions provide detailed metadata such as supported python versions, operating systems, intended audience and more. This metadata can be extracted locally from the package file (setup.py, setup.cfg or pyproject.toml).

Note: Depends on #97

Objective: Add parser for local python package metadata.

Requirements:

  • New gimie.parsers.PythonParser follows the gimie.parsers.Parser interface
  • is added to gimie.parsers.PARSERS
  • Tests verify that parser works as expected.
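
For illustration, the parsing step for pyproject.toml could look roughly like this (a sketch using tomllib from Python 3.11; on older versions the tomli package offers the same API):

import tomllib  # use "import tomli as tomllib" on Python < 3.11

def parse_pyproject(data: bytes) -> dict:
    """Pull a few metadata fields out of a pyproject.toml file."""
    pyproject = tomllib.loads(data.decode())
    project = pyproject.get("project", {})
    return {
        "name": project.get("name"),
        "requires_python": project.get("requires-python"),
        "keywords": project.get("keywords", []),
        "classifiers": project.get("classifiers", []),
    }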

Resources:

[gimie] Add command-line option --instance

In some cases, it is not obvious where the instance name ends in the URL (e.g. renkulab.io/gitlab/group/project).
We should give a command line option to specify it manually:

Objective: Let users disambiguate the separation between project namespace and git instance.

  • Add cli option with a clear name to specify instance url
  • split url and pass to extractor
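
Assuming a typer-style CLI, the option could be wired in roughly like this (a sketch; parameter names and the echo output are illustrative):

from typing import Optional

import typer

app = typer.Typer()

@app.command()
def data(
    url: str,
    instance: Optional[str] = typer.Option(
        None, help="Base URL of the git provider instance, e.g. https://renkulab.io/gitlab"
    ),
):
    if instance:
        # Everything after the instance URL is treated as the project path
        project_path = url[len(instance):].strip("/")
        typer.echo(f"instance={instance} project={project_path}")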

Add schema:isBasedOn as property for Forks

In order to capture which repository a repository has been forked from, we can use schema:isBasedOn to indicate the relationship between the 2 repositories.
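
For example, with rdflib (repository URLs are placeholders):

from rdflib import Graph, Namespace, URIRef

SDO = Namespace("http://schema.org/")
g = Graph()
g.add((
    URIRef("https://github.com/someuser/fork"),      # downstream repository
    SDO.isBasedOn,
    URIRef("https://github.com/someorg/upstream"),   # repository it was forked from
))
print(g.serialize(format="ttl"))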

GitLab extractor broken for user-owned projects

When a project is owned by a user, the GitLab GraphQL API returns an empty array for projectMembers when used with a PAT (Personal Access Token). It works normally when running the same query in the GraphiQL explorer.

This causes gimie to crash when running on a user-owned project (e.g. https://gitlab.com/edouardklein/falsisign).

Requirements:

  • Add test case with user-owned repo
  • Prevent crash when projectMembers is empty
  • Fallback on extracting author from project path in such cases.

Related GitLab issue: https://gitlab.com/gitlab-org/gitlab/-/issues/255177

Run on all user repositories

In the context of SDSC Internal Knowledge graph, it would be useful to let gimie run automatically on all repositories associated with a user. We could add that feature in gimie itself.

Objective: Allow easily retrieving the graph of all repositories for a given user.

Requirements:

  • gimie user <username> works
  • Feature available from Python API
  • Can run on either repositories "owned" by user, or where user has "contributed" (i.e. has commits in the codebase).

Version naming scheme

The current release of gimie is named v0.2.0, but the docker image tags in #31 are named based on the version in pyproject.toml, which is now 0.2.0.
Would it make sense to drop the v from releases or should we prepend the v to docker tags?

Note: On PyPI the package should probably be named 0.2.0 as v0.2.0 is not semver compliant

Optimize gimie container size

Currently the gimie container for x86 stands at ~815MB divided as follows (docker history command):

Layer                                           size
RUN /bin/sh -c useradd -ms /bin/bash gimie_u…   332kB     
COPY .docker/entrypoint.sh /entrypoint.sh # …   215B      
COPY /app /app # buildkit                       582MB     
RUN /bin/sh -c set -eux;   savedAptMark="$(a…   12.2MB    
RUN /bin/sh -c set -eux;  for src in idle3 p…   32B       
RUN /bin/sh -c set -eux;   savedAptMark="$(a…   29.6MB    
ENV GPG_KEY=A035C8C19219BA821ECEA86B64E628F8…   0B
RUN /bin/sh -c set -eux;  apt-get update;  a…   3.12MB
/bin/sh -c #(nop) ADD file:cb13581b8e7a9de43…   80.6MB   

Ideas on how to reduce size:

  • Copy only gimie folder and accompanying python venv instead of the whole /app folder (check if other dependencies are needed), see example
  • Re-order layers to have Poetry install dependencies first, and then copy gimie. This improves Docker layer caching (see this post, section 4)
  • Remove dev dependencies from installation since they're not necessary to execute gimie (poetry install --without dev arg)

Other ideas are welcome :)

[Gimie] License detection

Software licenses are a crucial part of code repositories, as they define whether and how the code can be reused.
Licenses are generally provided in the form of a text file, and sometimes as a header in source files. Automatically detecting the presence and type of an OSI-approved license in repositories would be very useful.

Objective: Automatically identify and classify the license in a given repository.

Requirements

  • Define where to look for a license (LICENSE file, COPYING ?, file headers, ...)
  • Assess the usability and usefulness of different license matchers
  • [ ] Identify schema.org mapping
  • [ ] Write PoC / Example

make list_files recursive

This is rare, but when license(s) are hosted in a folder that is not the root directory of the repository, Gimie currently does not pick them up.
The fix should include changing the list_files function to look inside "license-like" folders instead of only the root dir.
The rest of the script should remain pretty much untouched.
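
A possible fix, sketched with pathlib (the set of "license-like" folder names is illustrative):

from pathlib import Path
from typing import List

LICENSE_DIRS = {"licenses", "license", "legal"}  # folders worth descending into

def list_license_candidates(repo_root: str) -> List[Path]:
    root = Path(repo_root)
    candidates = [p for p in root.iterdir() if p.is_file()]
    for sub in root.iterdir():
        if sub.is_dir() and sub.name.lower() in LICENSE_DIRS:
            candidates += [p for p in sub.iterdir() if p.is_file()]
    return candidates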

Use CFF file as a source for DOI and authors+ORCID

As CFF (Citation File Format) is a best practice recommended by fair-software.eu, we should aim to use the metadata captured in a CFF file to further enhance Gimie output, for instance by extracting the DOI (into schema:identifier).

Note: Depends on #97

Objective: Add parser for CFF files to extract DOI

Requirements:

  • New gimie.parsers.CffParser follows the gimie.parsers.Parser interface
  • is added to gimie.parsers.PARSERS
  • Tests verify that parser works as expected.

Next steps:

  • Discuss whether other properties should be extracted (e.g. authors, into properties such as m4di:orcidId, schema:author)
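
Since CITATION.cff files are YAML, a minimal sketch of the extraction could look like this (field handling is illustrative, not the agreed design):

import yaml

def parse_cff(data: bytes) -> dict:
    """Extract the DOI and author ORCIDs from a CITATION.cff file."""
    cff = yaml.safe_load(data)
    doi = cff.get("doi")
    if doi is None:
        # Some CFF files list the DOI under "identifiers" instead
        doi = next(
            (i["value"] for i in cff.get("identifiers", []) if i.get("type") == "doi"),
            None,
        )
    orcids = [a["orcid"] for a in cff.get("authors", []) if "orcid" in a]
    return {"doi": doi, "orcids": orcids}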

write contribution guide

We should have clear guidelines for potential contributors. According to standard practice, we need to add:

  • a CONTRIBUTING.md file describing what contributions are welcome
    • It should make it easier for users to interact with the repo (e.g. submitting issues)
  • A development guide page in the docs with technical information for developers
    • It should make it easier for contributors (internal and external) to setup the development environment and contribute to the project.

Objective: Add clear development and contribution guides to docs

Requirements:

  • Write Development guide
  • Write CONTRIBUTING.md
  • Both pages are indexed in the docs
  • CONTRIBUTING.md links to dev guide

Resources:

Reduce extractor complexity

The extractor interface is becoming too complex, partly because it bears two unrelated responsibilities:

  • Extracting data from the git provider
  • Serializing to RDF and mapping data to ontologies

We could delegate RDF-related matters to a different object and use composition to connect it to the extractor.

For example:

class Extractor:
  path: str

  def list_files(self) -> list[Resource]:
    ...
  def extract(self) -> Repository:
    ...

class Repository:
  def to_graph(self) -> rdflib.Graph:
    ...
  def serialize(self, format: str) -> str:
    ...

class RepositorySchema(Repository):
  # Mapping of attributes to RDF
  ...

This would:

  1. Simplify the definition of extractors
  2. Make testing easier
  3. Improve reusability
  4. Remove the need for defining a schema for each extractor, only defining it once in RepoGraph (name TBD)

Note: extract() should probably return the repo graph instead of saving it into the instance

[Gimie] Setup CI/CD infrastructure

To support development of gimie, existing tests should be executed automatically using CI/CD to prevent merging broken code.

Objective: Setup automatic testing and builds of gimie triggered by github actions.

Requirements:

  • Github action to run tests on commits and PR: #16
  • Github action to build / publish the package to PyPI on tagged commits: #21

[gimie] retrieve contributors from GitHub GraphQL API

Unlike Github's REST API, the GraphQL API does not have a contributors field. Instead it has mentionableUsers. In the current implementation on #33, we use this field.

For repositories owned by organizations, mentionableUsers includes both organization members and contributors, so we need to use an alternative solution.

One solution would be to use the commit list.

Objective: Get the correct list of contributors using Github's GraphQL API. Maybe via the commit list.

Requirements:

  • Retrieve contributors, whether the repo is owned by a user or organization
  • Handle pagination
  • Include organization- and user-owned repos in tests

Resources:

Example query:

{
  viewer {
    login
  }
  repository(name: "gimie", owner: "SDSC-ORD") {
    defaultBranchRef {
      target {
        ... on Commit {
          id
          author {
            date
            user {
              id
            }
          }
          history(first: 100) {
            edges {
              node {
                id
                author {
                  name
                  email
                  user {
                    login
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

sample output

"node": {
  "id": "C_kwDOIksyxdoAKDNiMWU3ZWNiNDg1ZjMxOTA1Y2M2NTNjNWRhOTY0MTMxOGUxNTliNmU",
    "author": {
      "name": "Martin Fontanet",
      "email": "[email protected]",
      "user": {
        "login": "martinfontanet"
      }
    }
}
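
Handling pagination would mean requesting pageInfo and looping with an after cursor. A rough sketch (query trimmed for brevity; owner/name and token handling are illustrative):

import os
import requests

GH_API = "https://api.github.com/graphql"
headers = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}
query = """
query commits($owner: String!, $name: String!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    defaultBranchRef {
      target {
        ... on Commit {
          history(first: 100, after: $cursor) {
            pageInfo { endCursor hasNextPage }
            nodes { author { name email user { login } } }
          }
        }
      }
    }
  }
}
"""
contributors, cursor = set(), None
while True:
    variables = {"owner": "SDSC-ORD", "name": "gimie", "cursor": cursor}
    response = requests.post(GH_API, json={"query": query, "variables": variables}, headers=headers).json()
    history = response["data"]["repository"]["defaultBranchRef"]["target"]["history"]
    for node in history["nodes"]:
        if node["author"] and node["author"]["user"]:
            contributors.add(node["author"]["user"]["login"])
    if not history["pageInfo"]["hasNextPage"]:
        break
    cursor = history["pageInfo"]["endCursor"]
print(sorted(contributors))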

Optimize Github Actions

Based on recommendations from @cmdoret and @rmfranken. Feel free to add anything I missed.

docker_publish.yml

Time optimization:

Code redundancy:

  • Add env variable to compute whether to push the image or not and add it as parameter in the action

Example:

env:
  REGISTRY: ghcr.io
  MAIN: ${{ github.ref == 'refs/heads/main' }}
    [...]
      - name: Build Docker image
        uses: docker/[email protected]
        with:
          context: .
          platforms: linux/amd64,linux/arm64
          file: .docker/Dockerfile
          push: ${{ env.MAIN }}

sphynx_docs.yml

Code redundancy:

Example:

name: docs
on:
  push:
    branches: [main]
  pull_request:
    paths:
      - 'docs/**'
  
permissions:
    contents: write
jobs:
  docs-build:
    runs-on: ubuntu-latest
    steps:
      # https://github.com/actions/checkout
      - uses: actions/checkout@v4
      
      # https://github.com/actions/setup-python
      - uses: actions/setup-python@v4
      
      # https://github.com/snok/install-poetry
      - name: Install Poetry
        uses: snok/install-poetry@v1

      - name: Install dependencies
        run: |
          poetry install --with doc

      - name: Sphinx build
        run: |
          make doc

      - name: Archive docs artifacts
        uses: actions/upload-artifact@v3
        with:
          name: sphynx_docs
          path: docs/**

  docs-push:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      # https://github.com/actions/checkout
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v3
        with:
          name: sphynx_docs

      # https://github.com/peaceiris/actions-gh-pages
      - name: Deploy
        uses: peaceiris/actions-gh-pages@v3
        # if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/docs-website' }}
        with:
          publish_branch: gh-pages
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: docs/_build/
          force_orphan: true

[gimie] Disallow local paths

Gimie being focused on making metadata FAIR, we decided to require URLs as inputs and disallow local folders.

Objective: Terminate with clear error message if only a local path is provided.

Requirements:

  • Project() with local path should fail
  • Local extractors (e.g. GitExtractor) should always require a URL (in addition to local path).

Add code coverage report

Add a README badge with codecov report via coveralls:

  • coveralls integration
  • update test dependencies to include coverage tools
  • Update CI config (add coverage upload step)
  • Add badge in readme

[gimie] Implement missing github API fields

A GithubExtractor is already implemented in gimie, but it does not extract all relevant fields provided by the API.
Namely, the following fields remain to be implemented:

Requirements:

Below are proposed mappings. The notation is "github_variable_name" → namespace:property

On schema:SoftwareSourceCode:

  • "language" → schema:programmingLanguage
  • "html_url" → schema:codeRepository
  • "releases_url" > [0]["name"] → schema:softwareVersion
  • "releases_url" > [0]["body"] → schema:releaseNotes
  • "topics" → schema:keywords
  • Bonus: "stargazer_count" → ❓

On schema:Person:

  • "login" → schema:identifier or ❓:githubUsername
  • "name" → schema:name and drop

On schema:Organization:

  • "name" → schema:legalName
  • "login" → schema:name
  • "avatar_url" → schema:logo
  • "description" → schema:description

License identifier not matched correctly

Scancode's license matcher does not return an SPDX identifier by default. We currently convert the "license expression" it finds by normalizing the string and then looking up that match in the big static scancode license dictionary file. The problem is that the found match does not always have a key in that dictionary.

Example:
The license text belonging to the SPDX identifier https://spdx.org/licenses/BSD-3-Clause.html is identified by the scancode API as "bsd-new". See the scancode output below:

[{'license_expression': 'bsd-new', 'matches': [{'score': 99.53, 'start_line': 3, 'end_line': 8, 'matched_length': 210, 'match_coverage': 100.0, 'matcher': '2-aho', 'license_expression': 'bsd-new', 'rule_identifier': 'bsd-new_31.RULE', 'rule_relevance': 100, 'rule_url': 'https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/bsd-new_31.RULE', ... , 'identifier': 'bsd_new-b02c5829-769b-deef-d8c8-55549a5900e9'}]

"bsd-new", even when normalized, does not appear in the scancode dictionary of licenses. This is not a super common thing, but it's not very rare either, see the large number of inconsistencies here .

Found by running gimie data https://github.com/MouseLand/cellpose/

Licenses not being picked up correctly

gimie data 'https://github.com/facebookresearch/co-tracker' --format 'json-ld'
returns, among other triples:
"http://schema.org/license": [
{
"@id": "https://spdx.org/licenses/NOASSERTION"
}
I'm not aware of such a license, nor is SPDX. In any case, if it believes there is no license, I would not expect a triple at all... Maybe we can put in an exception? Right now there is an if data["licenseInfo"] is not None: check, but I guess that doesn't help if GitHub returns some sort of "NOASSERTION".
Not sure why our license grabber is having a hard time with this one; the license.md file clearly states Attribution-NonCommercial 4.0 International at the top of the page 😕

Writing gimie output to file instead of shell not consistent

Using

PS C:\Users\franken\PycharmProjects\rdf_tools> gimie data https://github.com/numpy/numpy > file.ttl
Traceback (most recent call last):

  File "<frozen runpy>", line 198, in _run_module_as_main

  File "<frozen runpy>", line 88, in _run_code

  File "C:\Users\franken\AppData\Roaming\Python\Python311\Scripts\gimie.exe\__main__.py", line 7, in <module>
    sys.exit(app())
             ^^^^^

  File "C:\Users\franken\AppData\Roaming\Python\Python311\site-packages\gimie\cli.py", line 63, in data
    print(proj.serialize(format=format))

  File "C:\Python311\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

UnicodeEncodeError: 'charmap' codec can't encode characters in position 10433-10434: character maps to <undefined>



returns a Unicode encoding error (when I probe the numpy github repo). However, if I put the gimie repo as the repository to scrape, it works no problem. I don't understand why the difference between the two repos could cause a character encoding error.

Move code out of __init__ files

Currently, __init__ files are used to avoid depth in module loading and hence contain entire class definitions and other components. Keeping these under init files reduces the readability of the code and could be improved.

A PR needs to work on balancing depth and readability of code, with the end goal of avoiding chained imports and removing code from init files.

License (SPDX) maintenance strategy

We use two data files for license matching, generated by the script generate_tfidf.py

Depending on the update rate of SPDX, we could set up a GitHub Action to automate generation or, if the release rate is once per year, stick to a manual update. Either way, we should set a reminder to stay up to date.

[gimie] Docs website

As the gimie API is getting more complex, a documentation website could become useful.
We should use a framework such as Sphinx or MkDocs.
We already use numpy-formatted docstrings and doctests, which can be used by either framework to auto-generate html content.

Objective: Setup documentation website

Requirements:

  • Configure docs framework with required extensions
  • Setup hosting and deployment (most likely GitHub pages or readthedocs)
  • Configure index, apidoc and welcome page
  • Changelog

[gimie] Fix local Git extraction

The input to gimie is a URL to a Git repository.

  • In cases where gimie supports the API of the provider where it is hosted, it should call the right extractor.
  • In other cases (e.g. Bitbucket, gitea, codeberg, ...) it should still extract information by cloning the repository

Objective: Allow local git metadata extraction when provider is not compatible.

Requirements:

  • Fix local GitExtractor
  • Autodetect provider in Project
  • Fallback to GitExtractor

Implement license matcher

scancode-toolkit imposes speed and platform limitations. As we only use the library for license matching, it is hard to justify imposing these limitations on gimie.

We could probably implement a license matcher using a rule-based, distance or ML method.

Suggested approach: (truncated) TF-IDF based classification
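
A sketch of that approach with scikit-learn (the corpus layout, threshold and function name are hypothetical):

from typing import Optional

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Mapping of SPDX identifier -> (truncated) license text, built by the downloader script
license_texts = {"MIT": "...", "Apache-2.0": "...", "BSD-3-Clause": "..."}

vectorizer = TfidfVectorizer()
corpus_matrix = vectorizer.fit_transform(license_texts.values())

def match_license(text: str, min_score: float = 0.8) -> Optional[str]:
    """Return the SPDX id of the closest license, or None if below the threshold."""
    scores = cosine_similarity(vectorizer.transform([text]), corpus_matrix)[0]
    best = scores.argmax()
    return list(license_texts)[best] if scores[best] >= min_score else None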

Requirements:

  • Write license corpus downloader script
  • Implement vectorizer script
  • Find optimal parameters for vectorizer (truncation etc.)
  • Serialize vectorizer + matrix in repo
  • Update gimie.sources.common.license.get_license_url() to use vectorizer instead of scancode
  • Drop scancode dependency

Next (optional):

  • if needed, look for additional ways to reduce memory / storage footprint (dimension red. / compression)
    • not needed, size is tolerable
  • ci-job to periodically refresh vectorizer with new / updated licenses
  • more sophisticated method like BM25 if accuracy is an issue

credits: Thanks @Panaetius for the suggestion :)
