
django-geo-spaas-harvesting's People

Contributors

akorosov, aperrin66, azamifard


Forkers

aperrin66

django-geo-spaas-harvesting's Issues

Get all dataset variables in the DDXIngester

Harvesting dataset variables from DDX metadata currently works only for a specific metadata structure and for standard variable names accessible through pythesint. It would be nice to have a more generic mechanism.

To do:

  • extract all variable names from the DDX metadata (in DDXIngester._extract_attributes())
  • process variable names which are unknown to pythesint (add them as wkv? create them on the fly?)

Three new sources of data must be harvested

Cyclic Referring

The crawler should handle circular references.
If a page refers to itself, the crawler ends up in the state shown in the attached screenshot (image omitted), which is solved by using EXCLUDE.

But what about the case where page 1 refers to page 2 and page 2 refers back to page 1?
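A global "already visited" check is enough to break both self-references and mutual references like page 1 <-> page 2. A minimal sketch (the class and method names are hypothetical, not the current crawler code):

import collections
from urllib.parse import urldefrag

class CycleSafeCrawler:
    """Crawler skeleton which never visits the same URL twice"""

    def __init__(self, root_url):
        self._visited = set()                               # all URLs already explored
        self._to_explore = collections.deque([root_url])    # URLs waiting to be explored

    def __iter__(self):
        while self._to_explore:
            url = urldefrag(self._to_explore.popleft())[0]  # normalize: drop '#fragment' parts
            if url in self._visited:
                continue                                    # breaks any cycle, however long
            self._visited.add(url)
            self._to_explore.extend(self._list_links(url))
            yield url

    def _list_links(self, url):
        raise NotImplementedError                           # provider-specific page parsing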

Clean up NansatIngester

Currently, the NansatIngester class code is pretty much copied from the nansat_ingestor app in django-geo-spaas.

  • The _ingest_dataset() method should be split into more easily testable methods.
  • The unit tests should also be completed and at least cover the whole code of the class.

Harvest sea ice concentration from OSISAF

OSISAF provides sea ice concentration data over THREDDS here:
https://thredds.met.no/thredds/osisaf/osisaf_seaiceconc.html

There are aggregated datasets which are of lesser interest (e.g. Aggregated Sea Ice Concentration for the Northern Hemisphere Lambert Projection) and individual daily files in two subfolders:
https://thredds.met.no/thredds/catalog/osisaf/met.no/ice/conc/catalog.html
https://thredds.met.no/thredds/catalog/osisaf/met.no/ice/amsr2_conc/catalog.html

In the subfolders, the files are provided for two hemispheres (nh, sh) and in two projections (polstere, ease).
In principle we only need the nh data in the polstere projection, so it would be nice to be able to select which data to ingest by file name (e.g. ice_conc_nh_polstere*).

The task is to extend harvesting to manage the sources mentioned above.
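A possible way to implement the file-name selection would be a glob-style filter applied to the URLs returned by the crawler. A sketch (the function and argument names are made up for illustration):

import fnmatch
import posixpath
from urllib.parse import urlparse

def filter_urls(urls, include_pattern='ice_conc_nh_polstere*'):
    """Yield only the URLs whose file name matches the glob pattern"""
    for url in urls:
        file_name = posixpath.basename(urlparse(url).path)
        if fnmatch.fnmatch(file_name, include_pattern):
            yield url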

It is important to keep in mind that this might be extended in the future to also harvest metadata from:
https://thredds.met.no/thredds/osisaf/osisaf_seaicetype.html
and
https://thredds.met.no/thredds/osisaf/osisaf_seaicedrift.html

Harvest VIIRS Chlorophyll-a from GSFC

To investigate.

See https://oceancolor.gsfc.nasa.gov/data/download_methods/

Here are examples of search queries:

curl 'https://oceandata.sci.gsfc.nasa.gov/api/file_search/' --data-raw 'search=*L2_SNPP_OC.nc+&sdate=2020-09-01&edate=2020-09-02&dtype=L2&sensor=VIIRS&subID=&subType=1&std_only=1&results_as_file=1&addurl=1&file_search=file_search'

curl 'https://oceandata.sci.gsfc.nasa.gov/api/file_search/' --data-raw 'search=*L2_JPSS1_OC.nc++&sdate=2020-09-01&edate=2020-09-02&dtype=L2&sensor=VIIRSJ1&subID=&subType=1&std_only=1&results_as_file=1&addurl=1&file_search=file_search'
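The first (SNPP) query could also be expressed with requests, assuming the API accepts the same form-encoded parameters as the curl calls above (a sketch only, parameters copied from the curl command):

import requests

response = requests.post(
    'https://oceandata.sci.gsfc.nasa.gov/api/file_search/',
    data={
        'search': '*L2_SNPP_OC.nc',
        'sdate': '2020-09-01',
        'edate': '2020-09-02',
        'dtype': 'L2',
        'sensor': 'VIIRS',
        'subType': '1',
        'std_only': '1',
        'results_as_file': '1',
        'addurl': '1',
        'file_search': 'file_search',
    })
print(response.text)  # list of matching files returned by the API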

Make persistence optional

Add support for a boolean in the configuration file that can enable or disable the persistence mechanism.

"lost synchronisation" with the database when running multiple harvesters in parallel

When running multiple harvesters in parallel, the main process sometimes ends with the following error:

2021-01-25 15:26:15,365 - geospaas_harvesting.daemon - ERROR - An unexpected error occurred
Traceback (most recent call last):
  File "/venv/lib/python3.7/site-packages/django/db/backends/utils.py", line 86, in _execute
    return self.cursor.execute(sql, params)
psycopg2.OperationalError: lost synchronization with server: got message type "

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/venv/lib/python3.7/site-packages/geospaas_harvesting/harvest.py", line 208, in launch_harvest
    harvester.harvest()
  File "/venv/lib/python3.7/site-packages/geospaas_harvesting/harvesters.py", line 77, in harvest
    self._ingester.ingest(self._current_crawler)
  File "/venv/lib/python3.7/site-packages/geospaas_harvesting/ingesters.py", line 270, in ingest
    if self._uri_exists(download_url):
  File "/venv/lib/python3.7/site-packages/geospaas_harvesting/ingesters.py", line 78, in _uri_exists
    return DatasetURI.objects.filter(uri=uri).exists()
  File "/venv/lib/python3.7/site-packages/django/db/models/query.py", line 777, in exists
    return self.query.has_results(using=self.db)
  File "/venv/lib/python3.7/site-packages/django/db/models/sql/query.py", line 538, in has_results
    return compiler.has_results()
  File "/venv/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1121, in has_results
    return bool(self.execute_sql(SINGLE))
  File "/venv/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1151, in execute_sql
    cursor.execute(sql, params)
  File "/venv/lib/python3.7/site-packages/django/db/backends/utils.py", line 68, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/venv/lib/python3.7/site-packages/django/db/backends/utils.py", line 77, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/venv/lib/python3.7/site-packages/django/db/backends/utils.py", line 86, in _execute
    return self.cursor.execute(sql, params)
  File "/venv/lib/python3.7/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/venv/lib/python3.7/site-packages/django/db/backends/utils.py", line 86, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.OperationalError: lost synchronization with server: got message type "

This might be caused by the combination of Django and multiprocessing: if database connections are opened before the worker processes are forked, they end up being shared between processes, which can garble the PostgreSQL protocol.
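If that is the cause, a common mitigation is to close the parent's connections right before the harvester processes are started, so that each child opens its own connection (a sketch, not the current harvest.py code; the harvesters variable is assumed):

import multiprocessing
from django import db

def harvest_one(harvester):
    harvester.harvest()

db.connections.close_all()   # each child process will open its own psycopg2 connection
processes = [multiprocessing.Process(target=harvest_one, args=(harvester,))
             for harvester in harvesters]
for process in processes:
    process.start()
for process in processes:
    process.join()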

Python lists should be the last choice among the built-in collection types

@aperrin66
@akorosov
You know Python's data types better than I do, but I want to describe what I have in mind with this issue.
Python lists are only needed when we want to preserve the order of the elements. When we don't need ordering, there is no need to use a list.

Sets should be used when we need to add more elements afterwards.
Tuples should be used when the collection does not need to change afterwards.

For the speed of the code, the order of preference between these data types is as follows (see the toy benchmark after this list):

  1. tuple
  2. set
  3. list
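As one illustration of the speed aspect, membership tests behave very differently between a list and a set (a toy benchmark; absolute numbers will vary with the machine):

import timeit

items_list = list(range(100_000))
items_set = set(items_list)

print(timeit.timeit('99_999 in items_list', globals=globals(), number=1_000))  # O(n) scan
print(timeit.timeit('99_999 in items_set', globals=globals(), number=1_000))   # O(1) hash lookup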

OpenDAP crawler: guarantee crawling order

The resources and URLs to explore are stored in sets, which does not guarantee the order in which resources are returned by the iterator.
Using an ordered collection would be a first step towards being able to resume the crawling in case of interruption.
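One option is a collection which both deduplicates and preserves insertion order, for example a dict (whose keys keep insertion order since Python 3.7). A sketch with hypothetical names:

class OrderedUrlStore:
    """Behaves like an ordered set of URLs"""

    def __init__(self):
        self._urls = {}                # only the key order matters, the values are unused

    def add(self, url):
        self._urls.setdefault(url, None)

    def __contains__(self, url):
        return url in self._urls

    def __iter__(self):
        return iter(self._urls)        # URLs come back in the order they were added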

Cleanly manage configuration keys which require environment variables

Some configuration keys contain the name of an environment variable to be used to retrieve the value (typically for passwords).
These keys are not differentiated from "normal" keys in the configuration, and must be processed differently in the harvester's code.
It would be better to retrieve the value when parsing the configuration file and process all configuration keys the same way.
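A possible approach, sketched below with a made-up 'ENV:' prefix convention (not the current configuration syntax), is to resolve the environment variables once, when the configuration file is parsed:

import os

def resolve_value(value):
    """Replace values of the form 'ENV:VARIABLE_NAME' with the variable's contents"""
    if isinstance(value, str) and value.startswith('ENV:'):
        return os.environ[value[len('ENV:'):]]   # raises KeyError if the variable is unset
    return value

def resolve_config(config):
    return {key: resolve_value(value) for key, value in config.items()}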

Fix test that fails with the new metanorm version

test_ingesters.DDXIngesterTestCase.test_get_normalized_attributes fails because it assumes that the location_geometry attribute is a GEOSGeometry instance.
It must be modified to work with a string.

Make NetCDF 4 ingester

Make an Ingester which can extract data from NetCDF 4 files, either local files or files in a remote repository which supports remote opening of NetCDF files, like OpenDAP.
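A minimal sketch of reading the global attributes with the netCDF4 library, which opens local paths and OpenDAP URLs with the same call (error handling omitted):

import netCDF4

def get_global_attributes(path_or_url):
    """Return the global attributes of a local or remote NetCDF file as a dict"""
    with netCDF4.Dataset(path_or_url) as dataset:
        return {name: dataset.getncattr(name) for name in dataset.ncattrs()}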

Ingesters use too many database connections

Since the check for the existence of the URIs has been moved to the normalizing threads, each of these threads opens a database connection.
This causes too many simultaneous connections to be created, for no noticeable performance gain.
It also causes errors when the database limit for concurrent connections is reached.

Harvesting from FTP resources

We are going to need a crawler and an ingester for FTP repositories.
FTP is really not the most efficient way to get metadata, but sometimes we won't have a choice.

The crawler is going to be simple enough, but the ingester will most likely have to download each file, extract the metadata and remove the file. Maybe for some repositories we can deduce some information from the file name.

Here are some examples of data we are going to need from FTP repositories for one of the projects:

  • GMI L3: ftp://ftp.remss.com/gmi/bmaps_v08.2/ (registration needed)
  • CCI Climatology: ftp://anon-ftp.ceda.ac.uk/neodc/esacci/sst/data/CDR_v2/Climatology/L4/
  • AMSR2 L2: ftp://ftp.gportal.jaxa.jp:/standard/GCOM-W/GCOM-W.AMSR2/L2.SST/3 (registration needed)
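For the crawler part, a minimal sketch with the standard library's ftplib (anonymous login, no recursion into subfolders):

import ftplib

def list_ftp_directory(host, path, user='anonymous', password='anonymous'):
    """Return the entries of one FTP directory"""
    with ftplib.FTP(host) as ftp:
        ftp.login(user, password)
        ftp.cwd(path)
        return ftp.nlst()   # names only; telling files from folders needs extra requests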

Set a deterministic entry_id when harvesting

The value of the entry_id field should be unique and characterize the dataset rather than being generated randomly.
The idea is to be able to find out if the same dataset comes from different URLs.

This needs to be done for each ingester.
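One possible convention, purely as an illustration, is to derive the identifier from a stable property of the dataset such as its file name, so that the same dataset found at different URLs gets the same entry_id:

import uuid

def make_entry_id(file_name):
    """Deterministic identifier: the same file name always gives the same UUID"""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, file_name))

make_entry_id('dataset_file_example.nc')  # stable across runs and across URLs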

Make time range restrictions work with the FTP crawler

The FTPCrawler class inherits from WebDirectoryCrawler, but is not compatible with the functionalities offered by the latter.

This needs to be fixed, refactoring WebDirectoryCrawler if necessary to make it more generic and to make it easier to implement child classes.

Implement some parallelism

It could be done by:

  • using multi-threading for ingestion
  • using multi-processing at the harvester level (one process per harvester)

Of course the number of threads/processes should be configurable.
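A rough sketch of the harvester-level multi-processing, with the number of workers coming from a hypothetical configuration key:

import multiprocessing

def run_harvester(harvester):
    harvester.harvest()

def run_all(harvesters, max_processes=4):   # max_processes would be read from the configuration
    with multiprocessing.Pool(processes=max_processes) as pool:
        pool.map(run_harvester, harvesters)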

Remove the persistence mechanism

The persistence mechanism is triggered when the app stops due to an exception, SIGTERM or SIGINT. It is not triggered in case of an abrupt interruption, for example SIGKILL.

Maybe dump the state at regular intervals?

It would also be good to dump only the useful data, rather than the full harvester objects, to improve compatibility between object versions.

Cleanup the harvest.py script

The harvest.py script is rather messy; maybe implement a Runner class to run harvesters?
It could make it easier to implement multi-processing later.

Stop faster on interruption signals

When SIGTERM or SIGINT are received, ingesters wait for the queue of objects to write to the database to be processed, which can take up to two minutes.
It would be nice to speed this up. Several approaches are possible:

  • reduce the size of the queue.
  • don't wait for the queue to be emptied. The ingester objects are pickle-able, so this is feasible. The main thing to do would be to change the way in which the database threads are stopped.

Limit the number of HTTP requests per provider

Some providers might block or crash if too many requests are sent.
We should implement a way to strictly limit the number of connections to a given provider, probably at the Harvester level.
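For example, a per-provider cap on simultaneous requests could be enforced with a semaphore shared by all the threads talking to that provider (the class and parameter names are hypothetical):

import threading
import requests

class ThrottledProvider:
    """Wraps HTTP access to one provider, limiting concurrent requests"""

    def __init__(self, max_concurrent_requests=5):
        self._semaphore = threading.BoundedSemaphore(max_concurrent_requests)

    def get(self, url, **kwargs):
        with self._semaphore:        # blocks while the provider's limit is reached
            return requests.get(url, **kwargs)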

Add support for updates to a data repository

Rather than going through the whole repository every time the harvester is run, it would be better to have a mechanism which only ingests new datasets.

This could be done by making the crawler return URLs sorted from most recent to oldest, and stopping when an already ingested dataset is found.
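A sketch of that stopping condition, assuming the crawler yields URLs from newest to oldest (the function names are illustrative):

def ingest_new_datasets(crawler, uri_exists, ingest):
    """Stop as soon as an already-ingested dataset is reached"""
    for url in crawler:          # assumed to be sorted from most recent to oldest
        if uri_exists(url):
            break                # everything older has already been ingested
        ingest(url)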
