nansencenter / django-geo-spaas-harvesting
Harvest data into a GeoSPaaS catalog
License: GNU General Public License v3.0
Same as in #16
Harvesting dataset variables from DDX metadata works only for a specific metadata structure and for standard variable names accessible through pythesint. It would be nice to have a more generic mechanism.
To do:
DDXIngester._extract_attributes()
The FTPCrawler initializes an FTP connection when it is instantiated.
If the harvesting process takes too long, the control connection times out.
The connection needs to be re-created when this happens.
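A minimal sketch of the kind of reconnection logic that could be added (the class and method names are hypothetical, not the actual FTPCrawler API):

```python
import ftplib

class ReconnectingFTP:
    """Sketch: re-create the FTP connection when the control connection
    has timed out, then retry the operation once."""

    def __init__(self, host, user='anonymous', passwd=''):
        self.host, self.user, self.passwd = host, user, passwd
        self._connect()

    def _connect(self):
        self.ftp = ftplib.FTP(self.host)
        self.ftp.login(self.user, self.passwd)

    def nlst(self, path):
        try:
            return self.ftp.nlst(path)
        except (ftplib.error_temp, EOFError, ConnectionResetError):
            # the control connection timed out: reconnect and retry once
            self._connect()
            return self.ftp.nlst(path)
```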
Three new sources of data (three new Copernicus Marine products) must be harvested with their own REST APIs and with the help of the XML provided for them:
SEALEVEL_GLO_PHY_L4_NRT_OBSERVATIONS_008_046
MULTIOBS_GLO_PHY_NRT_015_003
GLOBAL_ANALYSIS_FORECAST_PHY_001_024
Harvest the following CMEMS products:
This is in situ data, so the spatial coverage is not easily available. We will probably need to download all the data (225GB) and harvest it locally.
Setting the environment variable does not work if geospaas_harvesting.harvest
has been imported by previous tests.
When starting the harvesting process, the Vocabulary objects in the database are updated using the pythesint local data, but this local data is not refreshed.
We should have an option to enable this.
Instead of using a bool in this line, it would be better to use a Django option for it:
Currently, the NansatIngester class code is pretty much copied from the nansat_ingestor app in django-geo-spaas.
The _ingest_dataset() method should be split into more easily testable methods.
OSI SAF provides sea ice concentration data over THREDDS here:
https://thredds.met.no/thredds/osisaf/osisaf_seaiceconc.html
There are aggregated datasets which are of lesser interest (e.g. Aggregated Sea Ice Concentration for the Northern Hemisphere Lambert Projection) and individual daily files in two subfolders:
https://thredds.met.no/thredds/catalog/osisaf/met.no/ice/conc/catalog.html
https://thredds.met.no/thredds/catalog/osisaf/met.no/ice/amsr2_conc/catalog.html
In the subfolders the files are given for two different hemispheres (nh, sh) and in two different projections (polstere, ease).
In principle we only need data from nh in the polstere projection, so it would be nice if we could select which data to ingest by filename (e.g. ice_conc_nh_polstere*), as in the sketch below.
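A filename filter could be as simple as a shell-style pattern check; a minimal sketch (the pattern value and function name are hypothetical):

```python
from fnmatch import fnmatch

# Hypothetical configuration value: a shell-style pattern used to select datasets
FILENAME_PATTERN = 'ice_conc_nh_polstere*'

def should_ingest(url):
    """Return True if the file name at the end of the URL matches the pattern."""
    return fnmatch(url.rsplit('/', 1)[-1], FILENAME_PATTERN)

# should_ingest('https://thredds.met.no/.../ice_conc_nh_polstere-100_multi_202101011200.nc') -> True
```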
The task is to extend harvesting to manage the sources mentioned above.
It is important to think about extending that in the future to also harvest metadata from:
https://thredds.met.no/thredds/osisaf/osisaf_seaicetype.html
and
https://thredds.met.no/thredds/osisaf/osisaf_seaicedrift.html
To investigate.
See https://oceancolor.gsfc.nasa.gov/data/download_methods/
Here are examples of search queries:
curl 'https://oceandata.sci.gsfc.nasa.gov/api/file_search/' --data-raw 'search=*L2_SNPP_OC.nc+&sdate=2020-09-01&edate=2020-09-02&dtype=L2&sensor=VIIRS&subID=&subType=1&std_only=1&results_as_file=1&addurl=1&file_search=file_search'
curl 'https://oceandata.sci.gsfc.nasa.gov/api/file_search/' --data-raw 'search=*L2_JPSS1_OC.nc++&sdate=2020-09-01&edate=2020-09-02&dtype=L2&sensor=VIIRSJ1&subID=&subType=1&std_only=1&results_as_file=1&addurl=1&file_search=file_search'
Like in the geospaas.nansat_ingestor app, make it possible to ingest datasets manually using Django management commands.
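A minimal sketch of such a command (the command name, and the ingester class and call it makes, are assumptions, not the actual API):

```python
# <app>/management/commands/ingest.py
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = 'Manually ingest the datasets found at the given URLs'

    def add_arguments(self, parser):
        parser.add_argument('urls', nargs='+', help='URLs of the datasets to ingest')

    def handle(self, *args, **options):
        # the ingester class and its ingest() signature are assumptions here
        from geospaas_harvesting.ingesters import Ingester
        Ingester().ingest(options['urls'])
```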
Add support for a boolean in the configuration file that can enable or disable the persistence mechanism.
When running multiple harvesters in parallel, the main process sometimes ends with the following error:
2021-01-25 15:26:15,365 - geospaas_harvesting.daemon - ERROR - An unexpected error occurred
Traceback (most recent call last):
File "/venv/lib/python3.7/site-packages/django/db/backends/utils.py", line 86, in _execute
return self.cursor.execute(sql, params)
psycopg2.OperationalError: lost synchronization with server: got message type "
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/venv/lib/python3.7/site-packages/geospaas_harvesting/harvest.py", line 208, in launch_harvest
harvester.harvest()
File "/venv/lib/python3.7/site-packages/geospaas_harvesting/harvesters.py", line 77, in harvest
self._ingester.ingest(self._current_crawler)
File "/venv/lib/python3.7/site-packages/geospaas_harvesting/ingesters.py", line 270, in ingest
if self._uri_exists(download_url):
File "/venv/lib/python3.7/site-packages/geospaas_harvesting/ingesters.py", line 78, in _uri_exists
return DatasetURI.objects.filter(uri=uri).exists()
File "/venv/lib/python3.7/site-packages/django/db/models/query.py", line 777, in exists
return self.query.has_results(using=self.db)
File "/venv/lib/python3.7/site-packages/django/db/models/sql/query.py", line 538, in has_results
return compiler.has_results()
File "/venv/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1121, in has_results
return bool(self.execute_sql(SINGLE))
File "/venv/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1151, in execute_sql
cursor.execute(sql, params)
File "/venv/lib/python3.7/site-packages/django/db/backends/utils.py", line 68, in execute
return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
File "/venv/lib/python3.7/site-packages/django/db/backends/utils.py", line 77, in _execute_with_wrappers
return executor(sql, params, many, context)
File "/venv/lib/python3.7/site-packages/django/db/backends/utils.py", line 86, in _execute
return self.cursor.execute(sql, params)
File "/venv/lib/python3.7/site-packages/django/db/utils.py", line 90, in __exit__
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/venv/lib/python3.7/site-packages/django/db/backends/utils.py", line 86, in _execute
return self.cursor.execute(sql, params)
django.db.utils.OperationalError: lost synchronization with server: got message type "
This might be caused by the combination of Django and multiprocessing.
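If that is the cause, one common mitigation is to close the inherited database connections before starting the worker processes, so that each one opens its own connection instead of several processes sharing one socket to Postgres. A sketch (not the actual harvest.py code):

```python
import multiprocessing
from django.db import connections

def launch_harvesters(harvesters):
    # close inherited connections so each child process opens its own
    connections.close_all()
    processes = [multiprocessing.Process(target=h.harvest) for h in harvesters]
    for process in processes:
        process.start()
    for process in processes:
        process.join()
```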
Make a simple crawler which explores a folder recursively.
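A minimal sketch of such a crawler (class name hypothetical), yielding every file path under a root directory:

```python
import os

class LocalDirectoryCrawler:
    """Sketch: yield the paths of all files under a root folder, recursively."""

    def __init__(self, root):
        self.root = root

    def __iter__(self):
        for dirpath, _, filenames in os.walk(self.root):
            for filename in filenames:
                yield os.path.join(dirpath, filename)
```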
@aperrin66
@akorosov
Although you know the Python data types better than I do, I need to describe what I have in mind when creating this issue.
Python lists are only needed when we want to preserve the order of the elements; when we don't need order, there is no need to use a list.
Sets should be used when we may add more elements afterwards.
Tuples should be used when the collection does not need to change afterwards.
In terms of execution speed, the order of preference for the data types is as follows:
This should avoid missing datasets or trying to ingest them twice if some are added while the harvesting process is happening.
Some ingester tests need to be modified after nansencenter/metanorm#45, because the contents of the summary have changed.
Harvest the following products:
The resources and URLs to explore are stored in sets, which does not guarantee the order in which resources are returned by the iterator.
Using an ordered collection would be a first step towards being able to resume the crawling in case of interruption.
Some configuration keys contain the name of an environment variable to be used to retrieve the value (typically for passwords).
These keys are not differentiated from "normal" keys in the configuration, and must be processed differently in the harvester's code.
It would be better to retrieve the value when parsing the configuration file and process all configuration keys the same way.
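A sketch of what resolving these keys at parsing time could look like (the '_env' key suffix convention used here is an assumption):

```python
import os

def resolve_env_vars(config, suffix='_env'):
    """Sketch: replace keys like 'password_env: HARVESTER_PASSWORD' with the value
    of the environment variable, so the rest of the code only sees plain keys."""
    resolved = {}
    for key, value in config.items():
        if key.endswith(suffix):
            resolved[key[:-len(suffix)]] = os.environ[value]
        else:
            resolved[key] = value
    return resolved
```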
test_ingesters.DDXIngesterTestCase.test_get_normalized_attributes fails because it assumes that the location_geometry attribute is a GEOSGeometry instance.
It must be modified to work with a string.
Use temporary directories for persistence tests instead of using the value from the harvest.py module.
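A sketch of what this could look like (the patched 'DUMP_PATH' attribute name is an assumption):

```python
import tempfile
import unittest
from unittest import mock

class PersistenceTestCase(unittest.TestCase):
    """Sketch: persistence tests write to a temporary directory instead of
    the path defined in the harvest module (attribute name assumed)."""

    def test_dump(self):
        with tempfile.TemporaryDirectory() as dump_dir:
            with mock.patch('geospaas_harvesting.harvest.DUMP_PATH', dump_dir):
                pass  # run the code under test here; dumps end up in dump_dir
```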
Make an Ingester which can extract data from NetCDF 4 files (either local files, or from a remote repository with support for remote opening of NetCDF files like OpenDAP).
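A minimal sketch using the netCDF4 library, which opens local paths and OPeNDAP URLs alike (function name hypothetical):

```python
import netCDF4

def get_netcdf_attributes(location):
    """Sketch: read global attributes and variable names from a NetCDF file.
    'location' can be a local path or an OPeNDAP URL."""
    dataset = netCDF4.Dataset(location)
    try:
        attributes = {name: dataset.getncattr(name) for name in dataset.ncattrs()}
        variables = list(dataset.variables)
    finally:
        dataset.close()
    return attributes, variables
```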
This seems to solve connection loss problems with Postgres databases.
The time condition in the Copernicus crawler only uses beginposition. It should also use endposition.
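A sketch of a time condition using both attributes, assuming the OpenSearch query syntax of the Copernicus Open Access Hub (the exact query built by the crawler may differ):

```python
# hypothetical time range; both attributes are constrained
start, end = '2021-01-01T00:00:00Z', '2021-01-02T00:00:00Z'
time_condition = f'beginposition:[{start} TO {end}] AND endposition:[{start} TO {end}]'
```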
Since the check for the existence of the URIs has been moved to the normalizing threads, each of these threads opens a database connection.
This causes too many simultaneous connections to be created, for no noticeable performance gain.
It also causes errors when the database limit for concurrent connections is reached.
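One possible mitigation is to close each thread's connection as soon as its check is done, since Django connections are thread-local. A sketch (assuming DatasetURI lives in geospaas.catalog.models):

```python
from django.db import connection

def uri_exists_and_close(uri):
    """Sketch: check whether a URI is already in the catalog, then close this
    thread's database connection instead of keeping it open."""
    from geospaas.catalog.models import DatasetURI
    try:
        return DatasetURI.objects.filter(uri=uri).exists()
    finally:
        connection.close()
```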
There is a temporary solution in the FTP harvester for setting the entry_id of each dataset. It must be removed; the correct place for the solution is inside the metanorm repo. This is related to the issue nansencenter/metanorm#22 (comment)
We are going to need a crawler and ingester for FTP repositories.
FTP is really not the most efficient way to get metadata, but sometimes we won't have a choice.
The crawler is going to be simple enough, but the ingester will most likely have to download each file, extract the metadata and remove the file. Maybe for some repositories we can deduce some information from the file name.
Here are some examples of data we are going to need from FTP repositories for one of the projects:
GMI L3 from ftp://ftp.remss.com/gmi/bmaps_v08.2/ (registration needed)
CCI Climatology from ftp://anon-ftp.ceda.ac.uk/neodc/esacci/sst/data/CDR_v2/Climatology/L4/
AMSR2 L2: ftp://ftp.gportal.jaxa.jp:/standard/GCOM-W/GCOM-W.AMSR2/L2.SST/3 (registration needed)
Sometimes URLs become obsolete. We need a mechanism which regularly checks the ingested URLs and removes them if necessary.
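A sketch of such a cleanup task for HTTP(S) URLs (the model import path is assumed; FTP URLs would need a different check):

```python
import requests

def remove_obsolete_uris():
    """Sketch: check all ingested URLs and delete the ones no longer reachable."""
    from geospaas.catalog.models import DatasetURI
    for dataset_uri in DatasetURI.objects.iterator():
        try:
            response = requests.head(dataset_uri.uri, timeout=30)
        except requests.ConnectionError:
            dataset_uri.delete()
            continue
        if response.status_code == 404:
            dataset_uri.delete()
```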
DODS files are not meant to be fully downloaded, but rather to be opened remotely.
It would be best to use the HTTPServer option of Thredds to reference direct download links.
This will require making a new ThreddsIngester, which will possibly inherit from the DDXIngester.
The value of the entry_id field should be unique and characterize the dataset rather than being generated randomly.
The idea is to be able to find out if the same dataset comes from different URLs.
This needs to be done for each ingester.
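One possible approach, sketched here, is to derive the entry_id from the file name in the URL (function name hypothetical):

```python
import os
from urllib.parse import urlparse

def entry_id_from_url(url):
    """Sketch: derive a stable entry_id from the file name in the URL, so the
    same dataset gets the same id whichever mirror it comes from."""
    return os.path.splitext(os.path.basename(urlparse(url).path))[0]

# entry_id_from_url('https://host/path/ice_conc_nh_polstere-100_multi_202101011200.nc')
# -> 'ice_conc_nh_polstere-100_multi_202101011200'
```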
Harvest the following models:
NOAA data access doc: https://www.ncdc.noaa.gov/data-access
Check the APIs and Thredds access.
The FTPCrawler class inherits from WebDirectoryCrawler, but is not compatible with the functionalities offered by the latter.
This needs to be fixed, if necessary by refactoring WebDirectoryCrawler to make it more generic and to make it easier to implement child classes.
The lists of crawlers, ingesters and harvesters in the README need to be updated.
It could be done by:
Of course the number of threads/processes should be configurable.
The persistence mechanism is triggered when the app stops due to an exception, SIGTERM or SIGINT. It is not triggered in case of brutal interruption, for example SIGKILL.
Maybe dump the state at regular intervals?
It would also be good to dump only useful data and not the full harvester objects, to enhance compatibility between object versions.
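A sketch of dumping the state at regular intervals (function names hypothetical), so that a SIGKILL loses at most one interval of progress:

```python
import threading

def schedule_periodic_dump(dump_function, interval=600):
    """Sketch: call dump_function every `interval` seconds in a daemon timer."""
    def _dump_and_reschedule():
        dump_function()
        schedule_periodic_dump(dump_function, interval)
    timer = threading.Timer(interval, _dump_and_reschedule)
    timer.daemon = True
    timer.start()
    return timer
```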
The harvest.py script is kind of a mess; maybe implement a Runner class to run the harvesters?
It could make it easier to implement multi-processing later.
When SIGTERM or SIGINT are received, ingesters wait for the queue of objects to write to the database to be processed, which can take up to two minutes.
It would be nice to speed this up. Several approaches are possible:
Some providers might block or crash if too many requests are sent.
We should implement a way to strictly limit the number of connections to a given provider, probably at the Harvester level.
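A sketch of limiting connections per provider with a semaphore (the class name and the integration point in the Harvester are assumptions):

```python
import threading

class ProviderConnectionLimiter:
    """Sketch: bound the number of simultaneous requests sent to one provider."""

    def __init__(self, max_connections=5):
        self._semaphore = threading.BoundedSemaphore(max_connections)

    def fetch(self, session, url):
        # at most `max_connections` threads can be inside this block at once
        with self._semaphore:
            return session.get(url)
```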
Rather than going through the whole repository every time the harvester is run, it would be better to have a mechanism which only ingests new datasets.
This could be done by making the crawler return URLs sorted from most recent to oldest, and stopping when an already ingested dataset is found.
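A sketch of the stopping logic, assuming the crawler yields URLs from most recent to oldest and that DatasetURI lives in geospaas.catalog.models:

```python
def new_dataset_urls(crawler):
    """Sketch: yield only URLs which are not yet in the catalog, stopping at
    the first known one since everything older has already been ingested."""
    from geospaas.catalog.models import DatasetURI
    for url in crawler:
        if DatasetURI.objects.filter(uri=url).exists():
            break
        yield url
```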
Modify ingesters.Ingester._ingest_dataset() to read a list from normalized_attributes['parameters'] and add Parameters from this list.
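A sketch of the logic that could be added (the structure of the 'parameters' entries and the Parameter lookup field are assumptions):

```python
from geospaas.vocabularies.models import Parameter

def add_parameters(dataset, normalized_attributes):
    """Sketch: look up each parameter from the normalized attributes and
    attach it to the dataset; assumes pythesint-style dicts with a
    'standard_name' key."""
    for parameter_attributes in normalized_attributes.get('parameters', []):
        parameter = Parameter.objects.filter(
            standard_name=parameter_attributes['standard_name']).first()
        if parameter:
            dataset.parameters.add(parameter)
```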