
publiccode-crawler's Introduction

publiccode.yml crawler for the software catalog of Developers Italia


Description

Developers Italia provides a catalog of Free and Open Source software aimed at Public Administrations.

publiccode-crawler retrieves the publiccode.yml files from the repositories of publishers found in the Developers Italia API.

Setup and deployment processes

publiccode-crawler can either be run manually on the target machine or deployed as a Docker container.

Manually configure and build

  1. Rename config.toml.example to config.toml and set the variables

    NOTE: The application also supports environment variables as a substitute for the config.toml file. Remember: environment variables take precedence over the values in the configuration file.

  2. Build the binary with go build

Docker

You can build the Docker image using

docker build .

or use the image published to DockerHub:

docker run -it italia/publiccode-crawler

Commands

publiccode-crawler crawl

Gets the list of publishers from https://api.developers.italia.it/v1/publishers and starts to crawl their repositories.

publiccode-crawler crawl publishers*.yml

Gets the list of publishers in publishers*.yml and starts to crawl their repositories.

publiccode-crawler crawl-software <software> <publisher>

Crawls only the specified software. It takes the software API URL and its publisher id as parameters.

Ex. publiccode-crawler crawl-software https://api.developers.italia.it/v1/software/a2ea59b0-87cd-4419-b93f-00bed8a7b859 edb66b3d-3e36-4b69-aba9-b7c4661b3fdd

Other commands

  • crawler download-publishers downloads organizations and repositories from the onboarding portal repository and saves them to a publishers YAML file.

See also

Authors

Developers Italia is a project by AgID and the Italian Digital Team, which developed the crawler and maintains this repository.

publiccode-crawler's People

Contributors

alessio-fioravanti, alranel, bfabio, cgagliardipaa, davide-zerbetto, davidegiarolo, dependabot[bot], ghtmtt, giuseppecozza, gmereu, ilghera, libremente, lorello, lucaprete, lussoluca, madbob, mandreoli86, marco-fioriti, martinomaggio, mattmattv, mfortini, micheletiledesk, netbender, nicolazivago, r3vit, ruphy, scolettads, sebbalex, silviorelli, tensor5


publiccode-crawler's Issues

Define onboarding whitelist merging strategy

The crawler's whitelist is a combination of at least 2 YAML files: the one present here https://github.com/italia/developers-italia-backend/blob/master/crawler/whitelist/publishers.yml.example and the one filled in by the onboarding procedure.
An option might be to parse everything contained in the whitelist folder, so a symlink should be enough. However, this approach would require changing the structure of the YAML file produced by the onboarding.
The backend's YAML looks like this:

- id: "Comune di Bagnacavallo"
  codice-iPA: "c_a547"
  orgs:
    - "https://github.com/gith002"
  repos:
    - "https://github.com/gith002/foobar"

while the onboarding YAML looks like this:

registrati:
  - timestamp: "2019-05-27T09:45:00.770Z"
    ipa: "c_a123"
    url: "https://github.com/undefined"
    pec: "[email protected]"

Probably the best approach is to parse the onboarding whitelist, extract the information needed, and then populate the crawler's whitelist. What do you think? @sebbalex @alranel
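
A minimal sketch of that transformation, assuming gopkg.in/yaml.v2 and the field names shown in the two examples above (the input file name and the fallback for the missing human-readable id are hypothetical):

// Hypothetical sketch: read the onboarding whitelist and emit entries in the
// crawler's whitelist format (field names taken from the two examples above).
package main

import (
	"fmt"
	"os"

	yaml "gopkg.in/yaml.v2"
)

type onboarding struct {
	Registrati []struct {
		Timestamp string `yaml:"timestamp"`
		IPA       string `yaml:"ipa"`
		URL       string `yaml:"url"`
		PEC       string `yaml:"pec"`
	} `yaml:"registrati"`
}

type publisher struct {
	ID        string   `yaml:"id"`
	CodiceIPA string   `yaml:"codice-iPA"`
	Orgs      []string `yaml:"orgs,omitempty"`
	Repos     []string `yaml:"repos,omitempty"`
}

func main() {
	data, err := os.ReadFile("registrati.yml") // hypothetical input file name
	if err != nil {
		panic(err)
	}

	var in onboarding
	if err := yaml.Unmarshal(data, &in); err != nil {
		panic(err)
	}

	out := make([]publisher, 0, len(in.Registrati))
	for _, r := range in.Registrati {
		out = append(out, publisher{
			// The onboarding file has no human-readable name, so fall back to the IPA code.
			ID:        r.IPA,
			CodiceIPA: r.IPA,
			Orgs:      []string{r.URL},
		})
	}

	res, err := yaml.Marshal(out)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(res)) // ready to be merged into the crawler's whitelist
}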

Improve CLI

We need CLI commands for:

  • removing contents from the catalog
  • dry-run crawling (i.e. without saving to Elasticsearch)

Permission problems with Elasticsearch

In the production environment I had managed to get Elasticsearch working with the changes described in #56. However, after running make crawl it no longer works.

The error in the Elasticsearch log is:

No index-level perm match for User [name=frontend, roles=[search], requestedTenant=null] [IndexType [index=1533915490, type=*]] [Action [[indices:data/read/search]]] [RolesChecked [search]]

I believe it may be related to this line in the crawler log:

time="2018-08-10T15:38:14Z" level=error msg="Error updating Elastic Alias: elastic: Error 404 (Not Found): no such index [type=index_not_found_exception]"

Pipeline for handling on prem gitlab

When dealing with an "on-premises" GitLab instance it is necessary to slightly modify the pipeline.
As such, it is necessary to handle the case as follows:

  • query the GitLab API (/projects) to get the list of all the projects in the instance. For example: https://riuso.comune.salerno.it/api/v4/projects/
  • parse the returned JSON and populate a list with the same structure as the usual list we have in the other cases.

This way, we don't have to loop over the orgs/users of the GitLab instance or look for repos containing publiccode.yml files; we can process every repo directly and individually.
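
A minimal sketch of those two steps, assuming the public GitLab v4 fields http_url_to_repo and path_with_namespace (pagination and error handling are simplified):

// Hedged sketch: list every project of an on-premises GitLab instance through
// the public v4 API and turn each one into a repository entry.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Only the fields we need from GET /api/v4/projects.
type gitlabProject struct {
	ID                int    `json:"id"`
	PathWithNamespace string `json:"path_with_namespace"`
	HTTPURLToRepo     string `json:"http_url_to_repo"`
}

func main() {
	// Pagination (page/per_page) is omitted for brevity.
	resp, err := http.Get("https://riuso.comune.salerno.it/api/v4/projects?per_page=100")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var projects []gitlabProject
	if err := json.NewDecoder(resp.Body).Decode(&projects); err != nil {
		panic(err)
	}

	// Every project is treated as a single repository, regardless of the group
	// or user that owns it.
	for _, p := range projects {
		fmt.Println(p.PathWithNamespace, p.HTTPURLToRepo)
	}
}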

developers.italia.it does not seem to be updating some of the published software

Hello! I don't know if this is the right place to report this, but I noticed that our project does not seem to be correctly updated by developers.italia.it.

In fact, I noticed that the data for GlobaLeaks stopped being updated last February: https://developers.italia.it/it/software/globaleaks-globaleaks-f22648.html

Our publiccode.yml descriptor is loaded here: https://github.com/globaleaks/GlobaLeaks/blob/main/publiccode.yml

I'm reporting this so that you can check for a possible software bug or a change in the API.

Thank you for your work and support!

\cc @alranel

Refactor (and rename) the `one` command

When running the crawler on a single repo (using one), and the target repo is on Bitbucket, the remote repo is not cloned correctly.
The error reported is:

ERRO[0002] [art-uniroma2/vocbench3] error while cloning: cannot clone a repository without git URL

Crawler download-whitelist does not aggregate orgs under the same IPA

In case of multiple entries in the onboarding repo list with the same codiceIPA but different org URLs, the crawler should aggregate all orgs under a single record named after the codiceIPA.
When launching the crawler in download-whitelist mode it does not aggregate them and we only get the first one.
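
A minimal sketch of the aggregation step, with hypothetical struct and function names mirroring the whitelist format above:

// Hedged sketch: collapse whitelist entries sharing the same codiceIPA into one
// record, concatenating their org and repo URLs (names are illustrative).
package main

import "fmt"

type publisher struct {
	ID        string
	CodiceIPA string
	Orgs      []string
	Repos     []string
}

func mergeByIPA(entries []publisher) []publisher {
	byIPA := map[string]*publisher{}
	var order []string // keep the order of first appearance

	for _, e := range entries {
		if existing, ok := byIPA[e.CodiceIPA]; ok {
			existing.Orgs = append(existing.Orgs, e.Orgs...)
			existing.Repos = append(existing.Repos, e.Repos...)
			continue
		}
		entry := e
		byIPA[e.CodiceIPA] = &entry
		order = append(order, e.CodiceIPA)
	}

	merged := make([]publisher, 0, len(order))
	for _, ipa := range order {
		merged = append(merged, *byIPA[ipa])
	}
	return merged
}

func main() {
	in := []publisher{
		{ID: "c_a547", CodiceIPA: "c_a547", Orgs: []string{"https://github.com/org-one"}},
		{ID: "c_a547", CodiceIPA: "c_a547", Orgs: []string{"https://github.com/org-two"}},
	}
	fmt.Printf("%+v\n", mergeByIPA(in)) // one record with both orgs
}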

Gitlab supports subgroups but crawler cannot get their projects

GitLab has support for subgroups and a dedicated API, detailed here. As of now the crawler doesn't support subgroups, and when a project is inside one it won't be crawled.
Currently the only workaround is running the crawler in one mode, giving the project (repository) URL as an argument.
Question: do we really need to support this feature?
cc @libremente

Publiccode's codiceIPA case

As discussed in #91, IPA codes are case-insensitive and we need to enforce this rule by storing all codes in lowercase.
To avoid duplicates and mismatches, the publiccode.it.riuso.codiceIPA field in the publiccode files created by the crawler must also use the same case, see here.

Improve logging of bad publiccodes

As of now, the bad_publiccodes.lst file is populated with the raw URLs to the failed publiccodes:

https://raw.githubusercontent.com/italia/18app/master/publiccode.yml

We should also log the errors and the repository URL (so that it can be easily re-tested with bin/crawl one <URL>).
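
A minimal sketch of what the extended log line could look like (function name, file layout and the error value are illustrative):

// Hedged sketch: record the repository URL and the validation error next to the
// raw publiccode.yml URL (function and file names are illustrative).
package main

import (
	"fmt"
	"os"
)

func logBadPubliccode(repoURL, fileRawURL string, validationErr error) error {
	f, err := os.OpenFile("bad_publiccodes.lst", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		return err
	}
	defer f.Close()

	// One line per failure: repo URL first, so it can be fed back to `bin/crawl one <URL>`.
	_, err = fmt.Fprintf(f, "%s %s %v\n", repoURL, fileRawURL, validationErr)
	return err
}

func main() {
	err := logBadPubliccode(
		"https://github.com/italia/18app",
		"https://raw.githubusercontent.com/italia/18app/master/publiccode.yml",
		fmt.Errorf("hypothetical validation error"),
	)
	if err != nil {
		panic(err)
	}
}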

Once in a while the crawler doesn't expand the URLs in logos

When the path in logo is relative, the crawler is supposed to expand it to the full URL and export the normalized file, but sometimes it doesn't do it:

today:

_site/en/software/p_tn-provinciaautonomatrento-pitre.html
  *  internal image screenshots/pitre_logo.png does not exist (line 632)

August 16th:

- ./_site/en/software/p_ve-cittametropolitanavenezia-desk-kitriuso-sicla.html
  *  internal image logo_header2.fw_.png does not exist (line 632)
- ./_site/it/software/p_ve-cittametropolitanavenezia-desk-kitriuso-sicla.html
  *  internal image logo_header2.fw_.png does not exist (line 632)
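
A minimal sketch of the expected expansion, assuming the repository's raw base URL is known (names and URLs are illustrative, not the crawler's actual implementation):

// Hedged sketch: expand a relative logo/screenshot path against the repository's
// raw base URL; absolute URLs are left untouched.
package main

import (
	"fmt"
	"net/url"
	"strings"
)

func absoluteMediaURL(rawBaseURL, path string) (string, error) {
	if strings.HasPrefix(path, "http://") || strings.HasPrefix(path, "https://") {
		return path, nil // already absolute
	}
	base, err := url.Parse(rawBaseURL)
	if err != nil {
		return "", err
	}
	rel, err := url.Parse(path)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(rel).String(), nil
}

func main() {
	u, err := absoluteMediaURL(
		"https://raw.githubusercontent.com/example-org/example-repo/master/",
		"screenshots/pitre_logo.png",
	)
	fmt.Println(u, err)
	// https://raw.githubusercontent.com/example-org/example-repo/master/screenshots/pitre_logo.png
}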

ElasticSearch breaking changes

ES7 brings several updates, with many enhancements, new features and performance improvements. It also deprecates some old features and parameters; here is a list.

Running the crawler with the one command we get:

ERRO[0006] json: cannot unmarshal object into Go struct field SearchHits.total of type int64
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x161bce1]

goroutine 1 [running]:
github.com/italia/developers-italia-backend/crawler/jekyll.AmministrazioniYML(0xc000b7c450, 0x27, 0xc000ba0000, 0x0, 0x0)
	/Users/sebbalex/workspace/developers-italia-backend/crawler/jekyll/amministrazioni.go:67 +0x741
github.com/italia/developers-italia-backend/crawler/jekyll.GenerateJekyllYML(0xc000ba0000, 0x7ffeefbff790, 0x32)
	/Users/sebbalex/workspace/developers-italia-backend/crawler/jekyll/main.go:22 +0x12e

This is due to one of the breaking changes of the major release.
We should test accurately against this version and find out whether there are other problems.
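
For reference, the unmarshalling error above comes from hits.total, which in ES7 changed from a plain integer to an object. A minimal sketch of the difference (hand-rolled structs for illustration only; the real fix is moving to an ES7-compatible client release):

// Hedged sketch of the ES6 -> ES7 change behind the error above.
package main

import (
	"encoding/json"
	"fmt"
)

// ES6 response shape: "total" was an int64.
type hitsV6 struct {
	Total int64 `json:"total"`
}

// ES7 response shape: "total" is an object with value and relation.
type hitsV7 struct {
	Total struct {
		Value    int64  `json:"value"`
		Relation string `json:"relation"`
	} `json:"total"`
}

func main() {
	es7 := []byte(`{"total": {"value": 42, "relation": "eq"}}`)

	var old hitsV6
	fmt.Println(json.Unmarshal(es7, &old)) // "cannot unmarshal object into Go struct field ... of type int64"

	var cur hitsV7
	fmt.Println(json.Unmarshal(es7, &cur), cur.Total.Value) // <nil> 42
}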

Add more info in amministrazioni.yml

According to Generate pages for individual public entities on Developers Italia we need more info in the amministrazioni.yml file, which is currently produced by this.

I suggest adding the new fields in different steps:

- ipa: PCM
  title: Presidenza del consiglio dei ministri
  publicAuthority:
    name: Presidenza del Consiglio di ministri
    styledName: <span class="font-weight-light">Presidenza del </span> Consiglio

Later, with more info provided by the onboarding:

- ipa: unical
  title: Università della Calabria
  publicAuthority:
    name: UNIVERSITA' DELLA CALABRIA
    emblem: http://www.comune.bagnacavallo.ra.it/var/comune_bagnacavallo/storage/images/banner/banner-header-footer/stemma-header/1880-7-ita-IT/Stemma-Header.gif
    website: http://www.comune.bagnacavallo.ra.it

Note:
styledName is optional of course

Activity calculation parameters should not be hardcoded

Right now, the number of days used to calculate the activity (vitality index) is hardcoded in the function call. This is not optimal and hard to customize. I'd suggest moving it to the conf file (or maybe a simple package const?).
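
A minimal sketch of the configurable variant (field name and default value are illustrative, not taken from the actual config.toml):

// Hedged sketch: read the activity window from the configuration rather than
// hardcoding it at the call site.
package main

import "fmt"

type Config struct {
	ActivityDays int `toml:"activity_days"` // illustrative key
}

// daysToCalculateActivity falls back to a default when the key is missing.
func daysToCalculateActivity(c Config) int {
	if c.ActivityDays <= 0 {
		return 60 // illustrative default
	}
	return c.ActivityDays
}

func main() {
	fmt.Println(daysToCalculateActivity(Config{}))                  // 60 (default)
	fmt.Println(daysToCalculateActivity(Config{ActivityDays: 90})) // 90 (from config)
}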

Support GitHub user accounts in addition to orgs

Some public entities created a GitHub user account instead of an organization (eg. https://github.com/IstitutoCentraleCatalogoUnicoBiblio) and the crawler does not work with them:

ERRO[0000] error reading https://api.github.com/orgs/IstitutoCentraleCatalogoUnicoBiblio/repos repository list: not found. NextUrl: https://api.github.com/orgs/IstitutoCentraleCatalogoUnicoBiblio/repos

We should try with /users/:user/repos when /orgs/:org/repos returns 404 in order to be tolerant.
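
A minimal sketch of the fallback, with illustrative names (authentication and pagination omitted):

// Hedged sketch: try the organization endpoint first and fall back to the user
// endpoint when GitHub answers 404.
package main

import (
	"fmt"
	"net/http"
)

func repoListURL(name string) (string, error) {
	orgURL := fmt.Sprintf("https://api.github.com/orgs/%s/repos", name)

	resp, err := http.Get(orgURL)
	if err != nil {
		return "", err
	}
	resp.Body.Close()

	if resp.StatusCode == http.StatusNotFound {
		// Not an organization: assume it is a plain user account.
		return fmt.Sprintf("https://api.github.com/users/%s/repos", name), nil
	}
	return orgURL, nil
}

func main() {
	fmt.Println(repoListURL("IstitutoCentraleCatalogoUnicoBiblio"))
}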

Design workflow to remove entry from catalog

Entries may be removed from the catalog for several reasons.
However, right now there is no easy way to completely purge an entry from the catalog.
I believe we should design a workflow to achieve this cleanly.

The blacklist should be handled like the whitelists

We should handle the blacklist like we handle the whitelists: with a command line argument instead of a configuration entry in _config.toml.

eg:

crawler crawl allow.yml --blacklist reject.yml
crawler one $REPOURL allow.yml --blacklist reject.yml

Crawler excludes user type repos

Right now, the crawler is designed to support 2 kinds of feeds:

  1. org URL (such as github.com/org-name)
  2. Single repository URL

As such, it is not possible to provide a list of user names containing repos to crawl, since the API call used retrieves only the list of repositories belonging to an org, not to a user (GET /orgs/:org/repos and not GET /users/:username/repos).

I believe this choice limits the scope of the crawler, since it may not always be feasible to obtain a URL pointing to an org, and the cost of migrating every already-prepared repository from a user to an org may be non-zero for some stakeholders.

As such, I believe it could be easier to check the type of the account and modify that API call accordingly.

This check can be as easy as parsing the JSON response of GET /users/:username which contains the type:

GET /users/libremente
{
  "type": "User"
}

GET /users/italia
{
  "type": "Organization"
}
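
A minimal sketch of that check, with illustrative names (authentication omitted):

// Hedged sketch: pick the right repository-listing endpoint based on the "type"
// field of GET /users/:username, as suggested above.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type githubAccount struct {
	Type string `json:"type"` // "User" or "Organization"
}

func repoListEndpoint(name string) (string, error) {
	resp, err := http.Get("https://api.github.com/users/" + name)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var account githubAccount
	if err := json.NewDecoder(resp.Body).Decode(&account); err != nil {
		return "", err
	}

	if account.Type == "Organization" {
		return "https://api.github.com/orgs/" + name + "/repos", nil
	}
	return "https://api.github.com/users/" + name + "/repos", nil
}

func main() {
	fmt.Println(repoListEndpoint("italia"))     // orgs endpoint
	fmt.Println(repoListEndpoint("libremente")) // users endpoint
}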

What do you think? @sebbalex @alranel

Failing GETs produce empty repo URL in final catalog page

When there's an error in the pcvalidate's GETs this results in an empty href in the frontend.
Question: is there a way to handle such a situation (is this an exception? probably not...) in a more graceful way?
I believe this opens up a broader question.
The point is that we have to decide how to handle failures in the internal processes.
Possibilities:

  • every failure means a complete failure. The final page will not be rendered. The info inserted into ES and into softwares.yml up to this point has to be rolled back, leaving the instance clean.
  • a failure does not mean a complete failure. The final page will be rendered anyway, most probably with some elements missing. We have an inconsistent state.

I'd try to avoid the inconsistency here as much as possible.
The problem is of course not with new insertions but with updates to the publiccode.yml files, since the info is already present in ES and we don't want to override it if something fails during the process. Ideas @sebbalex? Thanks

Remove entry from catalog

Sometimes it is necessary to remove an entry from the catalog, see this policy for more info.
To accomplish such a goal in the current architecture it is necessary to follow different paths:

  1. remove the entry from the whitelist - accomplished via the blacklist mentioned in #78
  2. remove the entry from Elasticsearch
  3. remove the entry from the _site directory - accomplished automatically, since the crawler runs from scratch every night, so this is solved via 2.

It would be nice to have a command-line option for the crawler to automatically perform the operation described in 2.
@sebbalex

Make logs more verbose

I believe that we need to make the logs printed in bad_publiccodes.lst more verbose. Right now the logger just appends the URL to the end of the file; however, I believe that inserting a timestamp in front of the URL may help the troubleshooting process (this could potentially also help the issueopener job?).
These are the lines:
https://github.com/italia/developers-italia-backend/blob/f40477d59ad3d181ff59e852083db102cef29aae/crawler/crawler/saveToFile.go#L54-L71

I'd suggest something like this:

 _, err = f.WriteString(time.Now().Format(time.RFC3339) + " - " + fileRawURL + "\r\n")

@sebbalex

Sometimes we get an exit status 128 from git pull

This only happens when the repository is already cloned and the crawler tries to update it:

time="2020-09-24T08:32:25Z" level=error msg="[UniversitaDellaCalabria/django-delivery] error while cloning: cannot git pull the repository: exit status 128"

Bitbucket raw URL is outside of repository

When dealing with a raw URL from Bitbucket, pcvalidate reports a validation failure.
See this repo as an example:

description/it/screenshots: Absolute URL (https://bitbucket.org/_darti/ssd-arti/raw/ee661d623ebc73edfdc459e280f9fdd0e4c8a5f4/docs/img/assettoIstituzioniScolastiche.jpg) is outside the repository (https://bitbucket.org/_darti/ssd-arti/raw/master/)

This does not happen with GitHub repos.

Knowage entry in Developers Italia software catalogue is not updated

We modified the publiccode.yml in the Knowage GitHub repository, but the entry in the Developers Italia software catalogue is not updated, even though validation with the public editor has always been successful.

In particular, these changes (i.e. the last 3 commits) were not applied:
KnowageLabs/Knowage-Server@90fe6d0#diff-d808908edaf9b9a0101c6225f9d55907
KnowageLabs/Knowage-Server@da72d0e#diff-d808908edaf9b9a0101c6225f9d55907
KnowageLabs/Knowage-Server@61e02f4#diff-d808908edaf9b9a0101c6225f9d55907

The only thing we noticed is that a Knowage screenshot is displayed on the search results page when searching for "Knowage", which is fine.

Was Knowage added to some kind of "bad-publiccode" list?
