sdsc-ordes / gimie
Extract linked metadata from repositories
Home Page: https://sdsc-ordes.github.io/gimie/
License: Apache License 2.0
In the context of the SDSC Internal Knowledge Graph, it would be useful to let gimie run automatically on all repositories associated with a user. We could add that feature in gimie itself.
Objective: Make it easy to retrieve a graph of all repositories for a given user.
Requirements:
gimie user <username> works
Git providers (GitLab, GitHub, Codeberg, ...) expose additional useful information, such as forks and stars, that has no corresponding codemeta property.
Would it make sense to add such properties, for example using schema.org terms? One example that comes to mind would be to use schema:isBasedOn for forks:
<downstream-repo> schema:isBasedOn <upstream-repo>
For stars, maybe schema:interactionStatistic? But that seems a bit convoluted.
Would love to hear your suggestions @rmfranken
To capture which repository a given repository was forked from, we can use schema:isBasedOn to indicate the relationship between the two repositories.
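A tiny illustration of the proposed modelling: emitting an N-Triples statement that links a fork to its upstream repository via schema:isBasedOn. The fork URL below is hypothetical; the helper name is purely illustrative.

```python
def fork_triple(downstream: str, upstream: str) -> str:
    """Build an N-Triples statement linking a fork to its upstream repo."""
    return (f"<{downstream}> "
            f"<http://schema.org/isBasedOn> "
            f"<{upstream}> .")

print(fork_triple(
    "https://github.com/someuser/gimie",   # hypothetical fork
    "https://github.com/sdsc-ordes/gimie",
))
```

In gimie this would more likely be added to an rdflib graph; the raw triple just makes the direction of the relation explicit (downstream is based on upstream).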
Certain metadata fields require inspecting the contents of files within the repository.
To write generic functions that can analyze the repository contents, we need a standard file-like interface.
The interface must:
The extractor interface is becoming too complex, partly because it bears two unrelated responsibilities:
We could delegate RDF-related matters to a different object and use composition to connect it to the extractor.
For example:
class Extractor:
    path: str

    def list_files(self) -> list[Resource]:
        ...

    def extract(self) -> Repository:
        ...


class Repository:
    def to_graph(self) -> rdflib.Graph:
        ...

    def serialize(self, format: str) -> str:
        ...


class RepositorySchema(Repository):
    # Mapping of attributes to RDF
    ...
This would:
Note: extract() should probably return the repo graph instead of saving it into the instance
Using
PS C:\Users\franken\PycharmProjects\rdf_tools> gimie data https://github.com/numpy/numpy > file.ttl
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "C:\Users\franken\AppData\Roaming\Python\Python311\Scripts\gimie.exe\__main__.py", line 7, in <module>
sys.exit(app())
^^^^^
File "C:\Users\franken\AppData\Roaming\Python\Python311\site-packages\gimie\cli.py", line 63, in data
print(proj.serialize(format=format))
File "C:\Python311\Lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode characters in position 10433-10434: character maps to <undefined>
This raises a UnicodeEncodeError (when I probe the numpy GitHub repository). However, if I point gimie at its own repository instead, it works with no problem. I don't understand why the difference between the two repos could cause a character-encoding error.
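The likely culprit is Windows' legacy console encoding: Python picked cp1252 for stdout, which cannot represent some characters present in numpy's metadata (contributor names, symbols), while gimie's own metadata happens to fit in cp1252. A minimal sketch of the failure and a workaround (assuming Python 3.7+ for `reconfigure`):

```python
import sys

# "★" (U+2605) has no mapping in cp1252, mirroring the crash above.
text = "numpy \u2605 contributors"

try:
    text.encode("cp1252")
except UnicodeEncodeError as err:
    print("cp1252 cannot encode:", err.reason)

# UTF-8 can represent any Unicode string. Forcing it on stdout avoids the
# crash; setting PYTHONIOENCODING=utf-8 (or PYTHONUTF8=1) before running
# gimie has the same effect without touching the code:
# sys.stdout.reconfigure(encoding="utf-8")
assert text.encode("utf-8").decode("utf-8") == text
```

This would also explain why redirecting to `file.ttl` doesn't help: the redirected stream still inherits the console codec on Windows.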
Scancode's license matcher does not return an SPDX identifier by default. We currently convert the "license expression" it finds by normalizing the string and then looking up that match in the big static scancode license dictionary file. The problem is that the found match does not always have a key in that dictionary.
Example:
The license text corresponding to the SPDX identifier https://spdx.org/licenses/BSD-3-Clause.html is detected by the scancode API as "bsd-new". See the scancode output below:
[{'license_expression': 'bsd-new', 'matches': [{'score': 99.53, 'start_line': 3, 'end_line': 8, 'matched_length': 210, 'match_coverage': 100.0, 'matcher': '2-aho', 'license_expression': 'bsd-new', 'rule_identifier': 'bsd-new_31.RULE', 'rule_relevance': 100, 'rule_url': 'https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/bsd-new_31.RULE', ... , 'identifier': 'bsd_new-b02c5829-769b-deef-d8c8-55549a5900e9'}]
"bsd-new", even when normalized, does not appear in the scancode dictionary of licenses. This is not super common, but it is not very rare either; see the large number of inconsistencies here.
Found by running gimie data https://github.com/MouseLand/cellpose/
scancode-toolkit imposes speed and platform limitations. As we only use the library for license matching, it is hard to justify imposing these limitations on gimie.
We could probably implement a license matcher using a rule-based, distance or ML method.
Suggested approach: (truncated) TF-IDF based classification
Requirements:
gimie.sources.common.license.get_license_url() uses the vectorizer instead of scancode
Next (optional):
Credits: Thanks @Panaetius for the suggestion :)
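A stdlib-only sketch of the TF-IDF idea, assuming the reference corpus is the set of SPDX license texts. A real implementation would more plausibly use scikit-learn's TfidfVectorizer; the class and texts below are toy stand-ins.

```python
import math
from collections import Counter

class TfidfLicenseMatcher:
    """Toy TF-IDF + cosine-similarity license matcher (illustrative only)."""

    def __init__(self, licenses: dict[str, str]):
        self.names = list(licenses)
        docs = [licenses[name].lower().split() for name in self.names]
        self.n = len(docs)
        # Document frequency of each term across the reference corpus.
        self.df = Counter(term for doc in docs for term in set(doc))
        self.vectors = [self._vectorize(doc) for doc in docs]

    def _vectorize(self, tokens: list[str]) -> dict[str, float]:
        tf = Counter(tokens)
        # Smoothed idf so terms shared by all licenses keep a small weight.
        return {
            term: (count / len(tokens))
            * (1.0 + math.log((1 + self.n) / (1 + self.df[term])))
            for term, count in tf.items()
        }

    @staticmethod
    def _cosine(a: dict[str, float], b: dict[str, float]) -> float:
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = (math.sqrt(sum(w * w for w in a.values()))
                * math.sqrt(sum(w * w for w in b.values())))
        return dot / norm if norm else 0.0

    def match(self, text: str) -> str:
        """Return the name of the closest reference license."""
        query = self._vectorize(text.lower().split())
        scores = [self._cosine(query, vec) for vec in self.vectors]
        return self.names[scores.index(max(scores))]

matcher = TfidfLicenseMatcher({
    "MIT": "permission is hereby granted free of charge to any person",
    "BSD-3-Clause": "redistribution and use in source and binary forms "
                    "with or without modification",
})
print(matcher.match("Redistribution and use in source and binary forms are permitted"))
```

A production version would also need a minimum-similarity threshold to report "no license detected" rather than the nearest neighbour.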
Gimie should be able to get the publishing date. This is a gimie feature that needs to be developed.
Currently, __init__ files are used to avoid depth in module loading and hence contain entire class definitions and other components. Keeping these in __init__ files reduces the readability of the code and could be improved.
A PR should balance depth and readability of the code, with the end goal of avoiding chained imports and removing code from __init__ files.
Currently the gimie container for x86 stands at ~815MB, divided as follows (docker history command):
Layer size
RUN /bin/sh -c useradd -ms /bin/bash gimie_u… 332kB
COPY .docker/entrypoint.sh /entrypoint.sh # … 215B
COPY /app /app # buildkit 582MB
RUN /bin/sh -c set -eux; savedAptMark="$(a… 12.2MB
RUN /bin/sh -c set -eux; for src in idle3 p… 32B
RUN /bin/sh -c set -eux; savedAptMark="$(a… 29.6MB
ENV GPG_KEY=A035C8C19219BA821ECEA86B64E628F8… 0B
RUN /bin/sh -c set -eux; apt-get update; a… 3.12MB
/bin/sh -c #(nop) ADD file:cb13581b8e7a9de43… 80.6MB
Ideas on how to reduce size:
Copy only the gimie folder and accompanying python venv instead of the whole /app folder (check if other dependencies are needed), see example
Install dependencies in a separate layer before copying gimie. This improves Docker layer caching (see this post, section 4)
Exclude dev dependencies from installation since they are not necessary to execute gimie (poetry install --without dev)
Other ideas are welcome :)
Software licenses are a crucial part of code repositories, as they define whether and how the code can be reused.
Licenses are generally provided in the form of a text file, and sometimes as a header in source files. Automatically detecting the presence and type of an OSI-approved license in repositories would be very useful.
Objective: Automatically identify and classify the license in a given repository.
Requirements
Currently, we have an Extractor class, whose job is to extract all metadata about a repository.
With #70 and #68, Extractor has the ability to list files present in a repository (list_files()) and access their contents.
Ideally, the responsibility of an extractor should stop there: it should not be responsible for extracting metadata from file contents.
The proposal here is to have a separate object responsible for that: a Parser. A Parser would take a file as input and extract specific RDF triples from it. The repository's RDF graph could then be enriched using the Parser graphs.
graph TD;
repo[Repository URL]-->ext{Extractor};
ext --> meta[Metadata];
ext --> files[Files];
meta --> repograph{Repository};
repograph --> repo_rdf[Repo RDF];
files --> parser{Parser};
parser --> spec_rdf[Specific RDF];
repo_rdf --> union((Union));
spec_rdf --> union;
union --> enhanced[Enhanced RDF];
Parsers could be added for pyproject.toml, setup.py, licenses, Cargo.toml, R's DESCRIPTION, package.json, etc...
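One possible shape for the proposed interface (a hypothetical sketch, not gimie's actual API): a parser receives file content and returns triples that the repository graph can absorb.

```python
from abc import ABC, abstractmethod

Triple = tuple[str, str, str]

class Parser(ABC):
    """A parser turns one file's content into RDF triples."""

    @abstractmethod
    def parse(self, content: bytes) -> set[Triple]:
        ...

class LicenseParser(Parser):
    """Toy example: recognise an MIT license header."""

    def parse(self, content: bytes) -> set[Triple]:
        if b"MIT License" in content:
            return {("<repo>", "schema:license",
                     "https://spdx.org/licenses/MIT")}
        return set()

triples = LicenseParser().parse(b"MIT License\nPermission is hereby granted...")
print(triples)
```

The union step in the diagram then reduces to merging each parser's triple set into the repository graph.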
Gimie being focused on making metadata FAIR, we decided to require URLs as inputs and disallow local folders.
Objective: Terminate with clear error message if only a local path is provided.
Requirements:
Calling Project() with a local path should fail
Extractors (including GitExtractor) should always require a URL (in addition to the local path)
Currently, gimie only retrieves the license from the GitHub API.
The GitLab API does not provide the license at all, and the GitHub API's license detection is lacking and will return NOASSERTION when:
For cases when gimie fails to retrieve a license from the Git provider, it would be preferable to extract it locally. The process could look as follows:
graph TD;
query[Send API request to git provider] -->check[license in response?];
check -->|yes| return[Success];
check -->|no| clone[Clone repository];
clone --> extract[Locally extract license];
extract --> add[Add to graph];
add --> return
This would add considerable overhead only in cases when license cannot be determined from the provider. This approach could also be applied to other attributes that may be missing from the API but present in the repo.
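The flow above can be sketched as a simple two-step lookup. The helper functions are hypothetical stand-ins (injected as callables so the control flow itself can be exercised); they do not exist in gimie under these names.

```python
from typing import Callable, Optional

def get_license(
    query_provider: Callable[[str], Optional[str]],
    clone_and_extract: Callable[[str], Optional[str]],
    url: str,
) -> Optional[str]:
    # 1. Ask the git provider's API first (cheap, no clone).
    license_id = query_provider(url)
    if license_id is not None:
        return license_id
    # 2. Fall back to cloning and extracting locally (expensive).
    return clone_and_extract(url)

# Provider knows the license: no clone needed.
print(get_license(lambda u: "MIT", lambda u: "never called", "x"))
# Provider returns nothing: local extraction kicks in.
print(get_license(lambda u: None, lambda u: "BSD-3-Clause", "x"))
```

The same shape generalizes to any attribute that the API may omit but the repository itself contains.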
Objective:
Write a function that helps identify the programming languages used in a git repository based on file extensions.
Requirements
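A minimal sketch of what extension-based detection could look like. The mapping and function name are illustrative, not gimie's actual API, and a real table would cover far more extensions.

```python
from collections import Counter
from pathlib import PurePosixPath

# Illustrative subset of an extension-to-language table.
EXTENSION_LANGUAGES = {
    ".py": "Python",
    ".rs": "Rust",
    ".js": "JavaScript",
    ".r": "R",
    ".jl": "Julia",
}

def detect_languages(paths: list[str]) -> list[str]:
    """Return languages found in a file listing, most common first."""
    counts = Counter(
        EXTENSION_LANGUAGES[ext]
        for p in paths
        if (ext := PurePosixPath(p).suffix.lower()) in EXTENSION_LANGUAGES
    )
    return [lang for lang, _ in counts.most_common()]

print(detect_languages(["src/main.py", "src/util.py", "bench.rs"]))
# ['Python', 'Rust']
```

Ranking by file count gives a cheap proxy for the repository's main language, which maps naturally onto schema:programmingLanguage.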
Important metadata can be embedded in the repository webpage's HTML code. Existing standards (opengraph, meta tags, microdata, RDFa) can help us access this information in a standardized way using libraries such as extruct.
Objective: Extract relevant repository metadata from HTML page
Requirements:
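The idea in miniature, using only the standard library to pull opengraph-style meta tags out of a page. In practice a library like extruct handles all the formats (opengraph, microdata, JSON-LD, RDFa) far more robustly; this sketch just shows the kind of data available.

```python
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collect <meta property=... content=...> and <meta name=...> pairs."""

    def __init__(self):
        super().__init__()
        self.meta: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            key = a.get("property") or a.get("name")
            if key and "content" in a:
                self.meta[key] = a["content"]

html = ('<html><head>'
        '<meta property="og:title" content="numpy/numpy">'
        '<meta name="description" content="The fundamental package">'
        '</head></html>')
parser = MetaTagParser()
parser.feed(html)
print(parser.meta["og:title"])  # numpy/numpy
```

Fields like og:title and description map straightforwardly onto schema:name and schema:description.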
Based on recommendations from @cmdoret and @rmfranken. Feel free to add anything I missed.
docker_publish.yml
Time optimization:
Code redundancy:
Use an env variable to compute whether to push the image or not, and pass it as a parameter to the action. Example:
env:
  REGISTRY: ghcr.io
  MAIN: ${{ github.ref == 'refs/heads/main' }}
[...]
- name: Build Docker image
  uses: docker/[email protected]
  with:
    context: .
    platforms: linux/amd64,linux/arm64
    file: .docker/Dockerfile
    push: ${{ env.MAIN }}
sphynx_docs.yml
Code redundancy:
Example:
name: docs
on:
  push:
    branches: [main]
  pull_request:
    paths:
      - 'docs/**'
permissions:
  contents: write
jobs:
  docs-build:
    runs-on: ubuntu-latest
    steps:
      # https://github.com/actions/checkout
      - uses: actions/checkout@v4
      # https://github.com/actions/setup-python
      - uses: actions/setup-python@v4
      # https://github.com/snok/install-poetry
      - name: Install Poetry
        uses: snok/install-poetry@v1
      - name: Install dependencies
        run: |
          poetry install --with doc
      - name: Sphinx build
        run: |
          make doc
      - name: Archive docs artifacts
        uses: actions/upload-artifact@v3
        with:
          name: sphynx_docs
          path: docs/**
  docs-push:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      # https://github.com/actions/checkout
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v3
        with:
          name: sphynx_docs
      # https://github.com/peaceiris/actions-gh-pages
      - name: Deploy
        uses: peaceiris/actions-gh-pages@v3
        # if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/docs-website' }}
        with:
          publish_branch: gh-pages
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: docs/_build/
          force_orphan: true
When calling gimie on numpy/numpy, the license is missing from the output.
It occasionally crashes with:
gimie data --exclude-parser license https://github.com/numpy/numpy
gimie/gimie/extractors/github.py:239 in _repo_data

  236 │     response = send_graphql_query(GH_API, repo_query, data, self._headers)
  237 │
  238 │     if "errors" in response:
❱ 239 │         raise ValueError(response["errors"])
  240 │
  241 │     return response["data"]["repository"]
ValueError: [{'message': 'Something went wrong while executing your query. Please include [...] when reporting this issue.'}]
The architecture of Gimie may need to be simplified or redesigned to improve maintainability. The codebase being relatively small, now would be a good time to refactor. Below is the current architecture. The ProjectGraph placeholder class is not shown in the diagram, but serves as an example of how the "user facing" class should be serialized.
Objective: Discuss issues with the current structure and propose an improved model.
classDiagram
  class Repo {
    << Could implement file management (clone, locate license, ...) >>
    path: str
    files_meta: FilesMetadata
    git_meta: GitMetadata
    license_meta: LicenseMetadata
    get_files_meta(path) -> FilesMetadata
    get_git_meta(path) -> GitMetadata
    get_license_meta(path) -> LicenseMetadata
  }
  class GitMetadata {
    path: str
    authors: Tuple[str]
    creation_date: datetime
    creator: str
    releases: Tuple[Release]
  }
  class Release {
    date: datetime
    tag: str
    commit_hash: str
  }
  class LicenseMetadata {
    paths: Tuple[str]
    get_licenses(min_score: int) -> List[str]
  }
  class FilesMetadata {
    << Could be dropped >>
    project_path: str
    locate_licenses(project_path) -> List[str]
  }
  GitMetadata --* Release
  Repo --* GitMetadata
  Repo --* LicenseMetadata
  Repo --* FilesMetadata
gimie data 'https://github.com/facebookresearch/co-tracker' --format 'json-ld'
returns, among other triples:
"http://schema.org/license": [
  {
    "@id": "https://spdx.org/licenses/NOASSERTION"
  }
]
I'm not aware of such a license, and neither is SPDX. In any case, if it believes there is no license, I would not expect a triple at all... Maybe we can put in an exception? Right now there is an if data["licenseInfo"] is not None: check, but I guess that doesn't help if GitHub returns some sort of "NOASSERTION".
Not sure why our license grabber is having a hard time with this one; the license.md file clearly states Attribution-NonCommercial 4.0 International at the top of the page 😕
To facilitate the installation of gimie, we thought it might be a good idea to have it containerized. Here is an example of the desired execution:
$ docker run gimie --version
gimie 0.2.0
For this, we need to add a Dockerfile to the repo and publish the gimie image to a Docker registry.
Acceptance criteria:
Extract relevant information contained in the .git folder. This information can be retrieved using packages such as pydriller.
Objective: Given a URL, leverage an existing library to extract relevant metadata embedded in the git metadata.
Requirements
Use the @cached_property decorator for the attributes of GitMetadata (see the functools documentation)
The input to gimie is a URL to a Git repository.
Objective: Allow local git metadata extraction when provider is not compatible.
Requirements:
To support development of gimie, existing tests should be executed automatically using CI/CD to prevent merging broken code.
Objective: Setup automatic testing and builds of gimie triggered by github actions.
Requirements:
The current release of gimie is named v0.2.0, but the docker image tags in #31 are named based on the version in pyproject.toml, which is now 0.2.0.
Would it make sense to drop the v from releases, or should we prepend the v to docker tags?
Note: On PyPI the package should probably be named 0.2.0, as v0.2.0 is not semver compliant.
When running gimie on a GitHub repository, schema:codeRepository is a local path instead of the URL. It is also incorrectly capitalized (CodeRepository instead of codeRepository).
This happens with gimie 0.5.0
This is rare, but when licenses are hosted in a folder other than the root directory of the repository, gimie currently does not pick them up.
The fix should include changing the list-files function to look inside "license-like" folders instead of only the root dir.
The rest of the script should remain pretty much untouched.
Add a README badge with codecov report via coveralls:
Write the code and config to create a CLI, test it and make it a console entrypoint.
Resource: https://www.pluralsight.com/tech-blog/python-cli-utilities-with-poetry-and-typer/
Objective: Basic CLI available
Requirements:
Write the command in cli.py and make it a typer command
Declare the console entrypoint in pyproject.toml
Test the CLI with CLIRunner
When a project is owned by a user, the GitLab GraphQL API returns an empty array for projectMembers
when used with a PAT (Personal Access Token). It works normally when running the same query in the GraphiQL explorer.
This causes gimie to crash when running on a user-owned project (e.g. https://gitlab.com/edouardklein/falsisign).
Requirements:
Related GitLab issue: https://gitlab.com/gitlab-org/gitlab/-/issues/255177
#68 added support for explicit license detection (via scancode) from GitHub and GitLab repositories. We should implement that feature in the (local) GitExtractor to benefit from it with other git providers.
Objective: Support for license detection in GitExtractor
Requirements:
list_files()
in GitExtractor
_get_license()
in GitExtractor
GitExtractor
We should have clear guidelines for potential contributors. According to standard practice, we need to add:
CONTRIBUTING.md
file describing what contributions are welcome
Objective: Add clear development and contribution guides to docs
Requirements:
CONTRIBUTING.md links to the dev guide
Resources:
In some cases, it is not obvious where the instance name ends in the URL (e.g. renkulab.io/gitlab/group/project).
We should give a command line option to specify it manually:
Objective: Let users disambiguate the separation between project namespace and git instance.
A GithubExtractor is already implemented in gimie, but it does not extract all relevant fields provided by the API.
Namely, the following fields remain to be implemented:
Requirements:
Below are proposed mappings. The notation is "github_variable_name" → namespace:property

On schema:SoftwareSourceCode:
"language" → schema:programmingLanguage
"html_url" → schema:codeRepository
"releases_url"[0]["name"] → schema:softwareVersion
"releases_url"[0]["body"] → schema:releaseNotes
"topics" → schema:keywords
"stargazer_count" → ❓

On schema:Person:
"login" → schema:identifier or ❓:githubUsername
"name" → schema:name and drop

On schema:Organization:
"name" → schema:legalName
"login" → schema:name
"avatar_url" → schema:logo
"description" → schema:description
Currently, gimie sets the latest release of a repository as the version. This is not the correct way to handle versioning, as breaking changes can happen between the last release and HEAD. We need to allow users to refer to specific releases (tags).
The desired behaviour is as follows:
gimie data <repo-url> -> empty version field (refers to HEAD)
gimie data <tag-url> -> set version field to the tag (fixed version)
Objective: Record the repository release only when specified by the user.
The docker build and push CI workflow fails with:
#24 2.746 OSError: libgomp.so.1: cannot open shared object file: No such file or directory
Objective: Fix CI
Requirements:
As CFF (Citation File Format) is a best practice recommended by fair-software.eu, we should aim to use the metadata captured in a CFF file to further enhance gimie output, for instance by extracting the DOI (into schema:identifier).
Note: Depends on #97
Objective: Add a parser for CFF files to extract the DOI
Requirements:
gimie.parsers.CffParser follows the gimie.parsers.Parser interface
CffParser is registered in gimie.parsers.PARSERS
Next steps:
The Extractor interface currently takes a path: str as input. This is vague and not very flexible. In particular, it will not work easily with custom gitlab instances (whose URL may extend beyond the TLD).
We should use more specific inputs, namely:
instance_url: str: The base URL of the git provider instance (e.g. gitlab.com, renkulab.io/gitlab)
project_path: str: The path to the project in the git instance, e.g. group/subgroup/project
local_path: Optional[str]: The local path where the project was cloned (if it was cloned)
We use two data files for license matching, generated by the script generate_tfidf.py.
Depending on how often SPDX releases updates, we could add a GitHub Action to automate regeneration; if releases only happen about once a year, we can stick to manual updates. Either way, we should set up a reminder to stay up to date.
Python package definitions provide detailed metadata such as supported python versions, operating systems, intended audience and more. This metadata can be extracted locally from the package file (setup.py, setup.cfg or pyproject.toml).
Note: Depends on #97
Objective: Add parser for local python package metadata.
Requirements:
gimie.parsers.PythonParser follows the gimie.parsers.Parser interface
PythonParser is registered in gimie.parsers.PARSERS
Resources:
Extracting metadata from Github's REST API requires multiple requests (at least one per contributor). This results in unacceptably long wait times for large repositories. Github provides a GraphQL endpoint, which exposes largely the same data as the REST endpoint.
Using GraphQL has 2 main advantages:
Objective: Fix speed issues by replacing Github REST API calls with a single GraphQL query.
Requirements:
As the gimie API is getting more complex, a documentation website could become useful.
We should use a framework such as Sphinx or MkDocs.
We already use numpy-formatted docstrings and doctests, which can be used by either framework to auto-generate html content.
Objective: Setup documentation website
Requirements:
Extract metadata from the GitLab API:
Objective: Extract metadata from the GitLab API, similar to the already implemented GitHub API extractor
see: https://github.com/SDSC-ORD/ORDES/issues/163
[ ] gitlab api extractor is integrated into gimie
Unlike Github's REST API, the GraphQL API does not have a contributors field. Instead it has mentionableUsers, which is what the current implementation in #33 uses.
For repositories owned by organizations, mentionableUsers includes both organization members and contributors, so we need an alternative solution.
One solution would be to use the commit list.
Objective: Get the correct list of contributors using Github's GraphQL API. Maybe via the commit list.
Requirements:
Resources:
Example query:
{
  viewer {
    login
  }
  repository(name: "gimie", owner: "SDSC-ORD") {
    defaultBranchRef {
      target {
        ... on Commit {
          id
          author {
            date
            user {
              id
            }
          }
          history(first: 100) {
            edges {
              node {
                id
                author {
                  name
                  email
                  user {
                    login
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
Sample output:
"node": {
  "id": "C_kwDOIksyxdoAKDNiMWU3ZWNiNDg1ZjMxOTA1Y2M2NTNjNWRhOTY0MTMxOGUxNTliNmU",
  "author": {
    "name": "Martin Fontanet",
    "email": "[email protected]",
    "user": {
      "login": "martinfontanet"
    }
  }
}
GitLab allows nesting groups into subgroups. Currently, we assume a single level of depth; this needs to be fixed.
Objective: Support nested gitlab groups
Requirements:
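The core of the fix is to stop assuming `group/project` and treat everything before the final path segment as the (possibly nested) namespace. A sketch with a hypothetical helper name:

```python
def split_project_path(path: str) -> tuple[str, str]:
    """Split a GitLab project path with arbitrarily nested subgroups
    into (namespace, project), e.g. group/sub1/sub2/project."""
    *groups, project = path.strip("/").split("/")
    return "/".join(groups), project

print(split_project_path("group/subgroup/project"))
# ('group/subgroup', 'project')
```

Single-level paths keep working unchanged, since the namespace is then just the one group.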