
publiccode-crawler's Introduction

publiccode.yml crawler for the software catalog of Developers Italia


Description

Developers Italia provides a catalog of Free and Open Source software aimed at Public Administrations.

publiccode-crawler retrieves the publiccode.yml files from the repositories of publishers found in the Developers Italia API.

Setup and deployment processes

publiccode-crawler can either be run manually on the target machine or deployed as a Docker container.

Manually configure and build

  1. Rename config.toml.example to config.toml and set the variables

    NOTE: The application also supports environment variables as a substitute for the config.toml file. Remember: environment variables take precedence over the values in the configuration file.

  2. Build the binary with go build

Docker

You can build the Docker image using

docker build .

or use the image published to DockerHub:

docker run -it italia/publiccode-crawler

Commands

publiccode-crawler crawl

Gets the list of publishers from https://api.developers.italia.it/v1/publishers and starts to crawl their repositories.

publiccode-crawler crawl publishers*.yml

Gets the list of publishers in publishers*.yml and starts to crawl their repositories.

publiccode-crawler crawl-software <software> <publisher>

Crawls only the specified software. It takes the software API URL and its publisher id as parameters.

Ex. publiccode-crawler crawl-software https://api.developers.italia.it/v1/software/a2ea59b0-87cd-4419-b93f-00bed8a7b859 edb66b3d-3e36-4b69-aba9-b7c4661b3fdd

Other commands

  • crawler download-publishers downloads organizations and repositories from the onboarding portal repository and saves them to a publishers YAML file.

See also

Authors

Developers Italia is a project by AgID and the Italian Digital Team, which developed the crawler and maintains this repository.

publiccode-crawler's People

Contributors

alessio-fioravanti, alranel, bfabio, cgagliardipaa, davide-zerbetto, davidegiarolo, dependabot[bot], ghtmtt, giuseppecozza, gmereu, ilghera, libremente, lorello, lucaprete, lussoluca, madbob, mandreoli86, marco-fioriti, martinomaggio, mattmattv, mfortini, micheletiledesk, netbender, nicolazivago, r3vit, ruphy, scolettads, sebbalex, silviorelli, tensor5


publiccode-crawler's Issues

Define onboarding whitelist merging strategy

The crawler's whitelist is a combination of at least 2 YAML files: the one present here https://github.com/italia/developers-italia-backend/blob/master/crawler/whitelist/publishers.yml.example and the one filled in by the onboarding procedure.
An option might be to parse everything contained in the whitelist folder, so a symlink should be enough. However, this approach would require changing the structure of the YAML file produced by the onboarding.
The backend's YAML looks like this:

- id: "Comune di Bagnacavallo"
  codice-iPA: "c_a547"
  orgs:
    - "https://github.com/gith002"
  repos:
    - "https://github.com/gith002/foobar"

while the onboarding YAML looks like this:

registrati:
  - timestamp: "2019-05-27T09:45:00.770Z"
    ipa: "c_a123"
    url: "https://github.com/undefined"
    pec: "[email protected]"

Probably the best approach is to parse the onboarding whitelist, extract the information needed, and then populate the crawler's whitelist. What do you think? @sebbalex @alranel
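
A minimal sketch of that transformation, assuming gopkg.in/yaml.v2 and the field names shown in the two examples above (the input file name and the fallback for the missing human-readable id are hypothetical):

// Hypothetical sketch: read the onboarding whitelist and emit entries in the
// crawler's whitelist format (field names taken from the two examples above).
package main

import (
	"fmt"
	"os"

	yaml "gopkg.in/yaml.v2"
)

type onboarding struct {
	Registrati []struct {
		Timestamp string `yaml:"timestamp"`
		IPA       string `yaml:"ipa"`
		URL       string `yaml:"url"`
		PEC       string `yaml:"pec"`
	} `yaml:"registrati"`
}

type publisher struct {
	ID        string   `yaml:"id"`
	CodiceIPA string   `yaml:"codice-iPA"`
	Orgs      []string `yaml:"orgs,omitempty"`
	Repos     []string `yaml:"repos,omitempty"`
}

func main() {
	data, err := os.ReadFile("registrati.yml") // hypothetical input file name
	if err != nil {
		panic(err)
	}

	var in onboarding
	if err := yaml.Unmarshal(data, &in); err != nil {
		panic(err)
	}

	out := make([]publisher, 0, len(in.Registrati))
	for _, r := range in.Registrati {
		out = append(out, publisher{
			// The onboarding file has no human-readable name, so fall back to the IPA code.
			ID:        r.IPA,
			CodiceIPA: r.IPA,
			Orgs:      []string{r.URL},
		})
	}

	res, err := yaml.Marshal(out)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(res)) // ready to be merged into the crawler's whitelist
}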

Improve CLI

We need CLI commands for:

  • removing contents from the catalog
  • dry-run crawling (i.e. without saving to Elasticsearch)

Permission problems with Elasticsearch

In the production environment I had managed to get Elasticsearch working with the changes described in #56. However, after running make crawl it no longer works.

The error in the Elasticsearch log is:

No index-level perm match for User [name=frontend, roles=[search], requestedTenant=null] [IndexType [index=1533915490, type=*]] [Action [[indices:data/read/search]]] [RolesChecked [search]]

I believe it may be related to this line in the crawler log:

time="2018-08-10T15:38:14Z" level=error msg="Error updating Elastic Alias: elastic: Error 404 (Not Found): no such index [type=index_not_found_exception]"

Pipeline for handling on prem gitlab

When dealing with an "on-premises" GitLab instance it is necessary to slightly modify the pipeline.
As such, it is necessary to handle the case as follows:

  • query the GitLab API (/projects) to get the list of all the projects in the instance. For example: https://riuso.comune.salerno.it/api/v4/projects/
  • parse the returned JSON and populate a list with the same structure as the usual list we have in the other cases.

This way, we don't have to loop over the orgs/users of the GitLab instance or look for repos containing publiccode.yml files; we can process every repo directly and individually.
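
A minimal sketch of those two steps, assuming the public GitLab v4 fields http_url_to_repo and path_with_namespace (pagination and error handling are simplified):

// Hedged sketch: list every project of an on-premises GitLab instance through
// the public v4 API and turn each one into a repository entry.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Only the fields we need from GET /api/v4/projects.
type gitlabProject struct {
	ID                int    `json:"id"`
	PathWithNamespace string `json:"path_with_namespace"`
	HTTPURLToRepo     string `json:"http_url_to_repo"`
}

func main() {
	// Pagination (page/per_page) is omitted for brevity.
	resp, err := http.Get("https://riuso.comune.salerno.it/api/v4/projects?per_page=100")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var projects []gitlabProject
	if err := json.NewDecoder(resp.Body).Decode(&projects); err != nil {
		panic(err)
	}

	// Every project is treated as a single repository, regardless of the group
	// or user that owns it.
	for _, p := range projects {
		fmt.Println(p.PathWithNamespace, p.HTTPURLToRepo)
	}
}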

developers.italia.it does not seem to be updating some of the published software

Hello! I don't know if this is the right place to report this, but I noticed that our project does not seem to be correctly updated by developers.italia.it.

In fact, I noticed that the data for GlobaLeaks stopped being updated last February: https://developers.italia.it/it/software/globaleaks-globaleaks-f22648.html

Our publiccode.yml descriptor is loaded here: https://github.com/globaleaks/GlobaLeaks/blob/main/publiccode.yml

I'm reporting this so that you can check for a possible software bug or a change in the API.

Thank you for your work and support!

\cc @alranel

Refactor (and rename) the `one` command

When running the crawler on a single repo (using one), and the target repo is on Bitbucket, the remote repo is not cloned correctly.
The error reported is:

ERRO[0002] [art-uniroma2/vocbench3] error while cloning: cannot clone a repository without git URL

Crawler download-whitelist does not aggregate orgs under the same IPA

In case of multiple entries in the onboarding repo list with the same codiceIPA but different org URLs, the crawler should aggregate all orgs under a single record named after the codiceIPA.
When launching the crawler in download-whitelist mode it does not aggregate them and we only get the first one.
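
A minimal sketch of the aggregation step, with hypothetical struct and function names mirroring the whitelist format above:

// Hedged sketch: collapse whitelist entries sharing the same codiceIPA into one
// record, concatenating their org and repo URLs (names are illustrative).
package main

import "fmt"

type publisher struct {
	ID        string
	CodiceIPA string
	Orgs      []string
	Repos     []string
}

func mergeByIPA(entries []publisher) []publisher {
	byIPA := map[string]*publisher{}
	var order []string // keep the order of first appearance

	for _, e := range entries {
		if existing, ok := byIPA[e.CodiceIPA]; ok {
			existing.Orgs = append(existing.Orgs, e.Orgs...)
			existing.Repos = append(existing.Repos, e.Repos...)
			continue
		}
		entry := e
		byIPA[e.CodiceIPA] = &entry
		order = append(order, e.CodiceIPA)
	}

	merged := make([]publisher, 0, len(order))
	for _, ipa := range order {
		merged = append(merged, *byIPA[ipa])
	}
	return merged
}

func main() {
	in := []publisher{
		{ID: "c_a547", CodiceIPA: "c_a547", Orgs: []string{"https://github.com/org-one"}},
		{ID: "c_a547", CodiceIPA: "c_a547", Orgs: []string{"https://github.com/org-two"}},
	}
	fmt.Printf("%+v\n", mergeByIPA(in)) // one record with both orgs
}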

Gitlab supports subgroups but crawler cannot get their projects

GitLab has support for subgroups and a dedicated API, detailed here. As of now the crawler doesn't support subgroups, and when a project is inside one it won't be crawled.
Currently the only workaround is running the crawler in one mode, giving the project (repository) URL as an argument.
Question: do we really need to support this feature?
cc @libremente

Publiccode's codiceIPA case

As discussed in #91, IPA codes are case-insensitive and we need to enforce this rule by storing all codes in lowercase.
To avoid duplicates and mismatches, the publiccode.it.riuso.codiceIPA field in the publiccode files created by the crawler must also use the same case, see here.

Improve logging of bad publiccodes

As of now, the bad_publiccodes.lst file is populated with the raw URLs to the failed publiccodes:

https://raw.githubusercontent.com/italia/18app/master/publiccode.yml

We should also log the errors and the repository URL (so that it can be easily re-tested with bin/crawl one <URL>).
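
A minimal sketch of what the extended log line could look like (function name, file layout and the error value are illustrative):

// Hedged sketch: record the repository URL and the validation error next to the
// raw publiccode.yml URL (function and file names are illustrative).
package main

import (
	"fmt"
	"os"
)

func logBadPubliccode(repoURL, fileRawURL string, validationErr error) error {
	f, err := os.OpenFile("bad_publiccodes.lst", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		return err
	}
	defer f.Close()

	// One line per failure: repo URL first, so it can be fed back to `bin/crawl one <URL>`.
	_, err = fmt.Fprintf(f, "%s %s %v\n", repoURL, fileRawURL, validationErr)
	return err
}

func main() {
	err := logBadPubliccode(
		"https://github.com/italia/18app",
		"https://raw.githubusercontent.com/italia/18app/master/publiccode.yml",
		fmt.Errorf("hypothetical validation error"),
	)
	if err != nil {
		panic(err)
	}
}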

Once in a while the crawler doesn't expand the URLs in logos

When the path in logo is relative, the crawler is supposed to expand it to the full URL and export the normalized file, but sometimes it doesn't do it:

today:

_site/en/software/p_tn-provinciaautonomatrento-pitre.html
  *  internal image screenshots/pitre_logo.png does not exist (line 632)

August 16th:

- ./_site/en/software/p_ve-cittametropolitanavenezia-desk-kitriuso-sicla.html
  *  internal image logo_header2.fw_.png does not exist (line 632)
- ./_site/it/software/p_ve-cittametropolitanavenezia-desk-kitriuso-sicla.html
  *  internal image logo_header2.fw_.png does not exist (line 632)
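
A minimal sketch of the expected expansion, assuming the repository's raw base URL is known (names and URLs are illustrative, not the crawler's actual implementation):

// Hedged sketch: expand a relative logo/screenshot path against the repository's
// raw base URL; absolute URLs are left untouched.
package main

import (
	"fmt"
	"net/url"
	"strings"
)

func absoluteMediaURL(rawBaseURL, path string) (string, error) {
	if strings.HasPrefix(path, "http://") || strings.HasPrefix(path, "https://") {
		return path, nil // already absolute
	}
	base, err := url.Parse(rawBaseURL)
	if err != nil {
		return "", err
	}
	rel, err := url.Parse(path)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(rel).String(), nil
}

func main() {
	u, err := absoluteMediaURL(
		"https://raw.githubusercontent.com/example-org/example-repo/master/",
		"screenshots/pitre_logo.png",
	)
	fmt.Println(u, err)
	// https://raw.githubusercontent.com/example-org/example-repo/master/screenshots/pitre_logo.png
}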

ElasticSearch breaking changes

ES7 brings several updates, with many enhancements, new features and performance improvements. It also deprecates some old features and parameters; here is a list.

Running the crawler with the one command we get:

ERRO[0006] json: cannot unmarshal object into Go struct field SearchHits.total of type int64
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x161bce1]

goroutine 1 [running]:
github.com/italia/developers-italia-backend/crawler/jekyll.AmministrazioniYML(0xc000b7c450, 0x27, 0xc000ba0000, 0x0, 0x0)
	/Users/sebbalex/workspace/developers-italia-backend/crawler/jekyll/amministrazioni.go:67 +0x741
github.com/italia/developers-italia-backend/crawler/jekyll.GenerateJekyllYML(0xc000ba0000, 0x7ffeefbff790, 0x32)
	/Users/sebbalex/workspace/developers-italia-backend/crawler/jekyll/main.go:22 +0x12e

This is due to one of the breaking changes of the major release.
We should test accurately against this version and find out whether there are other problems.
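
For reference, the unmarshalling error above comes from hits.total, which in ES7 changed from a plain integer to an object. A minimal sketch of the difference (hand-rolled structs for illustration only; the real fix is moving to an ES7-compatible client release):

// Hedged sketch of the ES6 -> ES7 change behind the error above.
package main

import (
	"encoding/json"
	"fmt"
)

// ES6 response shape: "total" was an int64.
type hitsV6 struct {
	Total int64 `json:"total"`
}

// ES7 response shape: "total" is an object with value and relation.
type hitsV7 struct {
	Total struct {
		Value    int64  `json:"value"`
		Relation string `json:"relation"`
	} `json:"total"`
}

func main() {
	es7 := []byte(`{"total": {"value": 42, "relation": "eq"}}`)

	var old hitsV6
	fmt.Println(json.Unmarshal(es7, &old)) // "cannot unmarshal object into Go struct field ... of type int64"

	var cur hitsV7
	fmt.Println(json.Unmarshal(es7, &cur), cur.Total.Value) // <nil> 42
}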

Add more info in amministrazioni.yml

According to Generate pages for individual public entities on Developers Italia we need more info in the amministrazioni.yml file, which is currently produced by this.

I suggest adding the new fields in different steps:

- ipa: PCM
  title: Presidenza del consiglio dei ministri
  publicAuthority:
    name: Presidenza del Consiglio di ministri
    styledName: <span class="font-weight-light">Presidenza del </span> Consiglio

Later, with more info provided by the onboarding:

- ipa: unical
  title: Università della Calabria
  publicAuthority:
    name: UNIVERSITA' DELLA CALABRIA
    emblem: http://www.comune.bagnacavallo.ra.it/var/comune_bagnacavallo/storage/images/banner/banner-header-footer/stemma-header/1880-7-ita-IT/Stemma-Header.gif
    website: http://www.comune.bagnacavallo.ra.it

Note:
styledName is optional of course

Activity calculation parameters should not be hardcoded

Right now, the number of days used to calculate the activity (vitality index) is hardcoded in the function call. This is not optimal and hard to customize. I'd suggest moving it to the conf file (or maybe a simple package const?).
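
A minimal sketch of the configurable variant (field name and default value are illustrative, not taken from the actual config.toml):

// Hedged sketch: read the activity window from the configuration rather than
// hardcoding it at the call site.
package main

import "fmt"

type Config struct {
	ActivityDays int `toml:"activity_days"` // illustrative key
}

// daysToCalculateActivity falls back to a default when the key is missing.
func daysToCalculateActivity(c Config) int {
	if c.ActivityDays <= 0 {
		return 60 // illustrative default
	}
	return c.ActivityDays
}

func main() {
	fmt.Println(daysToCalculateActivity(Config{}))                  // 60 (default)
	fmt.Println(daysToCalculateActivity(Config{ActivityDays: 90})) // 90 (from config)
}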

Support GitHub user accounts in addition to orgs

Some public entities created a GitHub user account instead of an organization (eg. https://github.com/IstitutoCentraleCatalogoUnicoBiblio) and the crawler does not work with them:

ERRO[0000] error reading https://api.github.com/orgs/IstitutoCentraleCatalogoUnicoBiblio/repos repository list: not found. NextUrl: https://api.github.com/orgs/IstitutoCentraleCatalogoUnicoBiblio/repos

We should try with /users/:user/repos when /orgs/:org/repos returns 404 in order to be tolerant.
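
A minimal sketch of the fallback, with illustrative names (authentication and pagination omitted):

// Hedged sketch: try the organization endpoint first and fall back to the user
// endpoint when GitHub answers 404.
package main

import (
	"fmt"
	"net/http"
)

func repoListURL(name string) (string, error) {
	orgURL := fmt.Sprintf("https://api.github.com/orgs/%s/repos", name)

	resp, err := http.Get(orgURL)
	if err != nil {
		return "", err
	}
	resp.Body.Close()

	if resp.StatusCode == http.StatusNotFound {
		// Not an organization: assume it is a plain user account.
		return fmt.Sprintf("https://api.github.com/users/%s/repos", name), nil
	}
	return orgURL, nil
}

func main() {
	fmt.Println(repoListURL("IstitutoCentraleCatalogoUnicoBiblio"))
}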

Design workflow to remove entry from catalog

Entries may be removed from the catalog for several reasons.
However, right now there is no easy way to completely purge an entry from the catalog.
I believe we should design a workflow to achieve this cleanly.

The blacklist should be handled like the whitelists

We should handle the blacklist like we handle the whitelists: with a command line argument instead of a configuration entry in _config.toml.

eg:

crawler crawl allow.yml --blacklist reject.yml
crawler one $REPOURL allow.yml --blacklist reject.yml

Crawler excludes user type repos

Right now, the crawler is designed to support 2 kinds of feeds:

  1. org URL (such as github.com/org-name)
  2. Single repository URL

As such, it is not possible to provide a list of user names containing repos to crawl, since the API call used retrieves only the list of repositories belonging to an org, not to a user (GET /orgs/:org/repos and not GET /users/:username/repos).

I believe this choice limits the scope of the crawler, since it may not always be feasible to obtain a URL pointing to an org, and the cost of migrating every already-prepared repository from a user to an org may be non-zero for some stakeholders.

As such, I believe it could be easier to check the type of the account and modify that API call accordingly.

This check can be as easy as parsing the JSON response of GET /users/:username which contains the type:

GET /users/libremente
{
  "type": "User"
}

GET /users/italia
{
  "type": "Organization"
}
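
A minimal sketch of that check, with illustrative names (authentication omitted):

// Hedged sketch: pick the right repository-listing endpoint based on the "type"
// field of GET /users/:username, as suggested above.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type githubAccount struct {
	Type string `json:"type"` // "User" or "Organization"
}

func repoListEndpoint(name string) (string, error) {
	resp, err := http.Get("https://api.github.com/users/" + name)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var account githubAccount
	if err := json.NewDecoder(resp.Body).Decode(&account); err != nil {
		return "", err
	}

	if account.Type == "Organization" {
		return "https://api.github.com/orgs/" + name + "/repos", nil
	}
	return "https://api.github.com/users/" + name + "/repos", nil
}

func main() {
	fmt.Println(repoListEndpoint("italia"))     // orgs endpoint
	fmt.Println(repoListEndpoint("libremente")) // users endpoint
}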

What do you think? @sebbalex @alranel

Failing GETs produce empty repo URL in final catalog page

When there's an error in the pcvalidate's GETs this results in an empty href in the frontend.
Question: is there a way to handle such a situation (is this an exception? probably not...) in a more graceful way?
I believe this opens up a broader question.
The point is that we have to decide how to handle failures in the internal processes.
Possibilities:

  • every failure means a complete failure. The final page will not be rendered. The info inserted into ES and into softwares.yml up to this point has to be rolled back, leaving the instance clean.
  • a failure does not mean a complete failure. The final page will be rendered anyway, most probably with some elements missing. We have an inconsistent state.

I'd try to avoid the inconsistency here as much as possible.
The problem is of course not with new insertions but with updates to the publiccode.yml files, since the info is already present in ES and we don't want to override it if something fails during the process. Ideas @sebbalex? Thanks

Remove entry from catalog

Sometimes it is necessary to remove an entry from the catalog, see this policy for more info.
To accomplish such a goal in the current architecture it is necessary to follow different paths:

  1. remove the entry from the whitelist - accomplished via the blacklist mentioned in #78
  2. remove the entry from Elasticsearch
  3. remove the entry from the _site directory - accomplished automatically, since the crawler runs from scratch every night, so this is solved via 2.

It would be nice to have a command-line option for the crawler to automatically perform the operation described in 2.
@sebbalex

Make logs more verbose

I believe that we need to make the logs printed in bad_publiccodes.lst more verbose. Right now the logger just appends the URL to the end of the file; however, I believe that inserting a timestamp in front of the URL may help the troubleshooting process (this could potentially also help the issueopener job?).
These are the lines:
https://github.com/italia/developers-italia-backend/blob/f40477d59ad3d181ff59e852083db102cef29aae/crawler/crawler/saveToFile.go#L54-L71

I'd suggest something like this:

 _, err = f.WriteString(time.Now().Format(time.RFC3339) + " - " + fileRawURL + "\r\n")

@sebbalex

Sometimes we get an exit status 128 from git pull

This only happens when the repository is already cloned and the crawler tries to update it:

time="2020-09-24T08:32:25Z" level=error msg="[UniversitaDellaCalabria/django-delivery] error while cloning: cannot git pull the repository: exit status 128"

Bitbucket raw URL is outside of repository

When dealing with a raw URL from Bitbucket, pcvalidate reports a validation failure.
See this repo as an example:

description/it/screenshots: Absolute URL (https://bitbucket.org/_darti/ssd-arti/raw/ee661d623ebc73edfdc459e280f9fdd0e4c8a5f4/docs/img/assettoIstituzioniScolastiche.jpg) is outside the repository (https://bitbucket.org/_darti/ssd-arti/raw/master/)

This does not happen with GitHub repos.

Knowage entry in Developers Italia software catalogue is not updated

We modified the publiccode.yml in the Knowage GitHub repository, but the entry in the Developers Italia software catalogue is not updated, even though validation with the public editor has always been successful.

In particular, these changes (i.e. the last 3 commits) were not applied:
KnowageLabs/Knowage-Server@90fe6d0#diff-d808908edaf9b9a0101c6225f9d55907
KnowageLabs/Knowage-Server@da72d0e#diff-d808908edaf9b9a0101c6225f9d55907
KnowageLabs/Knowage-Server@61e02f4#diff-d808908edaf9b9a0101c6225f9d55907

The only thing we noticed is that a Knowage screenshot is displayed on the search results page when searching for "Knowage", which is fine.

Was Knowage added to some kind of "bad-publiccode" list?
