
django-geo-spaas-harvesting's People

Contributors

akorosov, aperrin66, azamifard


Forkers

aperrin66

django-geo-spaas-harvesting's Issues

Get all dataset variables in the DDXIngester

Harvesting dataset variables from DDX metadata currently works only for a specific metadata structure and for standard variable names accessible through pythesint. It would be nice to have a more generic mechanism.

To do:

  • extract all variable names from the DDX metadata (in DDXIngester._extract_attributes())
  • process variable names which are unknown to pythesint (add them as wkv? create them on the fly?)

Three new sources of data must be harvested

Cyclic Referring

The crawler should handle circular references.
If a page refers to itself, the crawler ends up in the state shown in the attached screenshot (image omitted), which is solved by using EXCLUDE.

But what about the case where page 1 refers to page 2 and page 2 refers back to page 1?
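A global "already visited" check is enough to break both self-references and mutual references like page 1 <-> page 2. A minimal sketch (the class and method names are hypothetical, not the current crawler code):

import collections
from urllib.parse import urldefrag

class CycleSafeCrawler:
    """Crawler skeleton which never visits the same URL twice"""

    def __init__(self, root_url):
        self._visited = set()                               # all URLs already explored
        self._to_explore = collections.deque([root_url])    # URLs waiting to be explored

    def __iter__(self):
        while self._to_explore:
            url = urldefrag(self._to_explore.popleft())[0]  # normalize: drop '#fragment' parts
            if url in self._visited:
                continue                                    # breaks any cycle, however long
            self._visited.add(url)
            self._to_explore.extend(self._list_links(url))
            yield url

    def _list_links(self, url):
        raise NotImplementedError                           # provider-specific page parsing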

Clean up NansatIngester

Currently, the NansatIngester class code is pretty much copied from the nansat_ingestor app in django-geo-spaas.

  • The _ingest_dataset() method should be split into more easily testable methods.
  • The unit tests should also be completed and at least cover the whole code of the class.

Harvest sea ice concentration from OSISAF

OSISAF provides sea ice concentration data over THREDDS here:
https://thredds.met.no/thredds/osisaf/osisaf_seaiceconc.html

There are aggregated datasets which are of lesser interest (e.g. Aggregated Sea Ice Concentration for the Northern Hemisphere Lambert Projection) and individual daily files in two subfolders:
https://thredds.met.no/thredds/catalog/osisaf/met.no/ice/conc/catalog.html
https://thredds.met.no/thredds/catalog/osisaf/met.no/ice/amsr2_conc/catalog.html

In the subfolders, the files are provided for two hemispheres (nh, sh) and in two projections (polstere, ease).
In principle we only need the nh data in the polstere projection, so it would be nice to be able to select which data to ingest by file name (e.g. ice_conc_nh_polstere*).

The task is to extend harvesting to manage the sources mentioned above.
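A possible way to implement the file-name selection would be a glob-style filter applied to the URLs returned by the crawler. A sketch (the function and argument names are made up for illustration):

import fnmatch
import posixpath
from urllib.parse import urlparse

def filter_urls(urls, include_pattern='ice_conc_nh_polstere*'):
    """Yield only the URLs whose file name matches the glob pattern"""
    for url in urls:
        file_name = posixpath.basename(urlparse(url).path)
        if fnmatch.fnmatch(file_name, include_pattern):
            yield url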

It is important to keep in mind that this might be extended in the future to also harvest metadata from:
https://thredds.met.no/thredds/osisaf/osisaf_seaicetype.html
and
https://thredds.met.no/thredds/osisaf/osisaf_seaicedrift.html

Harvest VIIRS Chlorophyll-a from GSFC

To investigate.

See https://oceancolor.gsfc.nasa.gov/data/download_methods/

Here are examples of search queries:

curl 'https://oceandata.sci.gsfc.nasa.gov/api/file_search/' --data-raw 'search=*L2_SNPP_OC.nc+&sdate=2020-09-01&edate=2020-09-02&dtype=L2&sensor=VIIRS&subID=&subType=1&std_only=1&results_as_file=1&addurl=1&file_search=file_search'

curl 'https://oceandata.sci.gsfc.nasa.gov/api/file_search/' --data-raw 'search=*L2_JPSS1_OC.nc++&sdate=2020-09-01&edate=2020-09-02&dtype=L2&sensor=VIIRSJ1&subID=&subType=1&std_only=1&results_as_file=1&addurl=1&file_search=file_search'
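The first (SNPP) query could also be expressed with requests, assuming the API accepts the same form-encoded parameters as the curl calls above (a sketch only, parameters copied from the curl command):

import requests

response = requests.post(
    'https://oceandata.sci.gsfc.nasa.gov/api/file_search/',
    data={
        'search': '*L2_SNPP_OC.nc',
        'sdate': '2020-09-01',
        'edate': '2020-09-02',
        'dtype': 'L2',
        'sensor': 'VIIRS',
        'subType': '1',
        'std_only': '1',
        'results_as_file': '1',
        'addurl': '1',
        'file_search': 'file_search',
    })
print(response.text)  # list of matching files returned by the API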

Make persistence optional

Add support for a boolean in the configuration file that can enable or disable the persistence mechanism.

"lost synchronisation" with the database when running multiple harvesters in parallel

When running multiple harvesters in parallel, the main process sometimes ends with the following error:

2021-01-25 15:26:15,365 - geospaas_harvesting.daemon - ERROR - An unexpected error occurred
Traceback (most recent call last):
  File "/venv/lib/python3.7/site-packages/django/db/backends/utils.py", line 86, in _execute
    return self.cursor.execute(sql, params)
psycopg2.OperationalError: lost synchronization with server: got message type "

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/venv/lib/python3.7/site-packages/geospaas_harvesting/harvest.py", line 208, in launch_harvest
    harvester.harvest()
  File "/venv/lib/python3.7/site-packages/geospaas_harvesting/harvesters.py", line 77, in harvest
    self._ingester.ingest(self._current_crawler)
  File "/venv/lib/python3.7/site-packages/geospaas_harvesting/ingesters.py", line 270, in ingest
    if self._uri_exists(download_url):
  File "/venv/lib/python3.7/site-packages/geospaas_harvesting/ingesters.py", line 78, in _uri_exists
    return DatasetURI.objects.filter(uri=uri).exists()
  File "/venv/lib/python3.7/site-packages/django/db/models/query.py", line 777, in exists
    return self.query.has_results(using=self.db)
  File "/venv/lib/python3.7/site-packages/django/db/models/sql/query.py", line 538, in has_results
    return compiler.has_results()
  File "/venv/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1121, in has_results
    return bool(self.execute_sql(SINGLE))
  File "/venv/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1151, in execute_sql
    cursor.execute(sql, params)
  File "/venv/lib/python3.7/site-packages/django/db/backends/utils.py", line 68, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/venv/lib/python3.7/site-packages/django/db/backends/utils.py", line 77, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/venv/lib/python3.7/site-packages/django/db/backends/utils.py", line 86, in _execute
    return self.cursor.execute(sql, params)
  File "/venv/lib/python3.7/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/venv/lib/python3.7/site-packages/django/db/backends/utils.py", line 86, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.OperationalError: lost synchronization with server: got message type "

This might be caused by the combination of Django and multiprocessing: if database connections are opened before the worker processes are forked, they end up being shared between processes, which can garble the PostgreSQL protocol.
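If that is the cause, a common mitigation is to close the parent's connections right before the harvester processes are started, so that each child opens its own connection (a sketch, not the current harvest.py code; the harvesters variable is assumed):

import multiprocessing
from django import db

def harvest_one(harvester):
    harvester.harvest()

db.connections.close_all()   # each child process will open its own psycopg2 connection
processes = [multiprocessing.Process(target=harvest_one, args=(harvester,))
             for harvester in harvesters]
for process in processes:
    process.start()
for process in processes:
    process.join()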

Python lists should be the last choice among the built-in collection types

@aperrin66
@akorosov
You know Python's data types better than I do, but I want to describe what I have in mind with this issue.
Python lists are only needed when we want to preserve the order of the elements. When we don't need ordering, there is no need to use a list.

Sets should be used when we need to add more elements afterwards.
Tuples should be used when the collection does not need to change afterwards.

For the speed of the code, the order of preference between these data types is as follows (see the toy benchmark after this list):

  1. tuple
  2. set
  3. list
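As one illustration of the speed aspect, membership tests behave very differently between a list and a set (a toy benchmark; absolute numbers will vary with the machine):

import timeit

items_list = list(range(100_000))
items_set = set(items_list)

print(timeit.timeit('99_999 in items_list', globals=globals(), number=1_000))  # O(n) scan
print(timeit.timeit('99_999 in items_set', globals=globals(), number=1_000))   # O(1) hash lookup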

OpenDAP crawler: guarantee crawling order

The resources and URLs to explore are stored in sets, which does not guarantee the order in which resources are returned by the iterator.
Using an ordered collection would be a first step towards being able to resume the crawling in case of interruption.
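One option is a collection which both deduplicates and preserves insertion order, for example a dict (whose keys keep insertion order since Python 3.7). A sketch with hypothetical names:

class OrderedUrlStore:
    """Behaves like an ordered set of URLs"""

    def __init__(self):
        self._urls = {}                # only the key order matters, the values are unused

    def add(self, url):
        self._urls.setdefault(url, None)

    def __contains__(self, url):
        return url in self._urls

    def __iter__(self):
        return iter(self._urls)        # URLs come back in the order they were added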

Cleanly manage configuration keys which require environment variables

Some configuration keys contain the name of an environment variable to be used to retrieve the value (typically for passwords).
These keys are not differentiated from "normal" keys in the configuration, and must be processed differently in the harvester's code.
It would be better to retrieve the value when parsing the configuration file and process all configuration keys the same way.
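A possible approach, sketched below with a made-up 'ENV:' prefix convention (not the current configuration syntax), is to resolve the environment variables once, when the configuration file is parsed:

import os

def resolve_value(value):
    """Replace values of the form 'ENV:VARIABLE_NAME' with the variable's contents"""
    if isinstance(value, str) and value.startswith('ENV:'):
        return os.environ[value[len('ENV:'):]]   # raises KeyError if the variable is unset
    return value

def resolve_config(config):
    return {key: resolve_value(value) for key, value in config.items()}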

Fix test that fails with the new metanorm version

test_ingesters.DDXIngesterTestCase.test_get_normalized_attributes fails because it assumes that the location_geometry attribute is a GEOSGeometry instance.
It must be modified to work with a string.

Make NetCDF 4 ingester

Make an Ingester which can extract data from NetCDF 4 files, either local files or files in a remote repository which supports remote opening of NetCDF files, like OpenDAP.
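A minimal sketch of reading the global attributes with the netCDF4 library, which opens local paths and OpenDAP URLs with the same call (error handling omitted):

import netCDF4

def get_global_attributes(path_or_url):
    """Return the global attributes of a local or remote NetCDF file as a dict"""
    with netCDF4.Dataset(path_or_url) as dataset:
        return {name: dataset.getncattr(name) for name in dataset.ncattrs()}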

Ingesters use too many database connections

Since the check for the existence of the URIs has been moved to the normalizing threads, each of these threads opens a database connection.
This causes too many simultaneous connections to be created, for no noticeable performance gain.
It also causes errors when the database limit for concurrent connections is reached.

Harvesting from FTP resources

We are going to need a crawler and an ingester for FTP repositories.
FTP is really not the most efficient way to get metadata, but sometimes we won't have a choice.

The crawler is going to be simple enough, but the ingester will most likely have to download each file, extract the metadata and remove the file. Maybe for some repositories we can deduce some information from the file name.

Here are some examples of data we are going to need from FTP repositories for one of the projects:

  • GMI L3: ftp://ftp.remss.com/gmi/bmaps_v08.2/ (registration needed)
  • CCI Climatology: ftp://anon-ftp.ceda.ac.uk/neodc/esacci/sst/data/CDR_v2/Climatology/L4/
  • AMSR2 L2: ftp://ftp.gportal.jaxa.jp:/standard/GCOM-W/GCOM-W.AMSR2/L2.SST/3 (registration needed)
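For the crawler part, a minimal sketch with the standard library's ftplib (anonymous login, no recursion into subfolders):

import ftplib

def list_ftp_directory(host, path, user='anonymous', password='anonymous'):
    """Return the entries of one FTP directory"""
    with ftplib.FTP(host) as ftp:
        ftp.login(user, password)
        ftp.cwd(path)
        return ftp.nlst()   # names only; telling files from folders needs extra requests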

Set a deterministic entry_id when harvesting

The value of the entry_id field should be unique and characterize the dataset rather than being generated randomly.
The idea is to be able to find out if the same dataset comes from different URLs.

This needs to be done for each ingester.
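One possible convention, purely as an illustration, is to derive the identifier from a stable property of the dataset such as its file name, so that the same dataset found at different URLs gets the same entry_id:

import uuid

def make_entry_id(file_name):
    """Deterministic identifier: the same file name always gives the same UUID"""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, file_name))

make_entry_id('dataset_file_example.nc')  # stable across runs and across URLs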

Make time range restrictions work with the FTP crawler

The FTPCrawler class inherits from WebDirectoryCrawler, but is not compatible with the functionalities offered by the latter.

This needs to be fixed, refactoring WebDirectoryCrawler if necessary to make it more generic and to make it easier to implement child classes.

Implement some parallelism

It could be done by:

  • using multi-threading for ingestion
  • using multi-processing at the harvester level (one process per harvester)

Of course the number of threads/processes should be configurable.
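A rough sketch of the harvester-level multi-processing, with the number of workers coming from a hypothetical configuration key:

import multiprocessing

def run_harvester(harvester):
    harvester.harvest()

def run_all(harvesters, max_processes=4):   # max_processes would be read from the configuration
    with multiprocessing.Pool(processes=max_processes) as pool:
        pool.map(run_harvester, harvesters)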

Remove the persistence mechanism

The persistence mechanism is triggered when the app stops due to an exception, SIGTERM or SIGINT. It is not triggered in case of an abrupt interruption, for example SIGKILL.

Maybe dump the state at regular intervals?

It would also be good to dump only the useful data, rather than the full harvester objects, to improve compatibility between object versions.

Cleanup the harvest.py script

The harvest.py script is rather messy; maybe implement a Runner class to run harvesters?
It could make it easier to implement multi-processing later.

Stop faster on interruption signals

When SIGTERM or SIGINT are received, ingesters wait for the queue of objects to write to the database to be processed, which can take up to two minutes.
It would be nice to speed this up. Several approaches are possible:

  • reduce the size of the queue.
  • don't wait for the queue to be emptied. The ingester objects are pickle-able, so this is feasible. The main thing to do would be to change the way in which the database threads are stopped.

Limit the number of HTTP requests per provider

Some providers might block or crash if too many requests are sent.
We should implement a way to strictly limit the number of connections to a given provider, probably at the Harvester level.
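For example, a per-provider cap on simultaneous requests could be enforced with a semaphore shared by all the threads talking to that provider (the class and parameter names are hypothetical):

import threading
import requests

class ThrottledProvider:
    """Wraps HTTP access to one provider, limiting concurrent requests"""

    def __init__(self, max_concurrent_requests=5):
        self._semaphore = threading.BoundedSemaphore(max_concurrent_requests)

    def get(self, url, **kwargs):
        with self._semaphore:        # blocks while the provider's limit is reached
            return requests.get(url, **kwargs)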

Add support for updates to a data repository

Rather than going through the whole repository every time the harvester is run, it would be better to have a mechanism which only ingests new datasets.

This could be done by making the crawler return URLs sorted from most recent to oldest, and stopping when an already ingested dataset is found.
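A sketch of that stopping condition, assuming the crawler yields URLs from newest to oldest (the function names are illustrative):

def ingest_new_datasets(crawler, uri_exists, ingest):
    """Stop as soon as an already-ingested dataset is reached"""
    for url in crawler:          # assumed to be sorted from most recent to oldest
        if uri_exists(url):
            break                # everything older has already been ingested
        ingest(url)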
