astrodigital / modis-ingestor
Scripts and other artifacts for MODIS data ingestion into Amazon public hosting.
License: MIT License
Need to add MODIS tile vectors to s3://modis-pds/
These contain the geometry of the tiles and the tile designation in hXXvYY format.
There should be three:
This is preferable to storing the coordinates of every tile, since it's redundant every day.
If LPDAAC is down (Wednesdays), and we happen to be processing data, we need to gracefully exit, but also be able to resume sooner than waiting for the next scheduled time in the crontab.
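One way to exit gracefully while still resuming before the next crontab run is a short retry schedule. The following is only a sketch under assumptions: `SourceUnavailable`, `run_with_retries`, the `ingest` callable and the delay values are all hypothetical names and numbers, not part of the existing ingestor.

```python
import time


class SourceUnavailable(Exception):
    """Hypothetical error raised by the ingest step when LPDAAC cannot be reached."""


def run_with_retries(ingest, retry_delays=(900, 1800, 3600), sleep=time.sleep):
    """Run ingest(); on SourceUnavailable wait and retry, then give up cleanly.

    retry_delays are seconds to wait between attempts (assumed values).
    Returns True if ingest() succeeded, False if every attempt failed.
    """
    for delay in list(retry_delays) + [None]:
        try:
            ingest()
            return True
        except SourceUnavailable:
            if delay is None:
                # Out of retries: exit without an error and let cron run the next cycle.
                return False
            sleep(delay)
```

The `sleep` argument is injectable so the schedule can be exercised in tests without actually waiting.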
Hi there - we use MODIS data on AWS for our workflow.
There is often a delay of several days, sometimes weeks, from USGS to AWS.
What is the expected priority of this ingestion process? Should we be looking to use an alternate source?
First steps in building a prototype. The modis-ingestor is the main repository of code that is run on a daily basis. The ingestor should be able to determine the last data available on S3, query and locate new data since that point (and missing/updated data?), fetch the data, split it into individual datasets, and upload them to S3 following the standard directory structure of:
PRODUCTNAME/TILEID/DATE
where PRODUCTNAME is a product shortname and version ID (e.g., MCD43A4.006) and TILEID can either be a single directory or multiple directories that make up the tile identifier. For example Landsat uses 006/109 (path/row) but MODIS may use h05v07, or 05/07. DATE is either a directory of the date or at least a scene ID that is unique within that TILEID for a specific date.
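As a rough sketch of that layout, the prefix can be derived from a granule name. This assumes the 21/07 style of tile directory that appears in the bucket paths quoted elsewhere in these notes; the regular expression and the `key_prefix` helper are hypothetical, not taken from the ingestor.

```python
import re

# Pattern for names like MCD43A4.A2017004.h21v07.006.2017014061026
# (product, acquisition date as YYYYDDD, tile, version, production timestamp).
GRANULE_RE = re.compile(
    r"^(?P<product>[A-Z0-9]+)\.A(?P<date>\d{7})"
    r"\.h(?P<h>\d{2})v(?P<v>\d{2})"
    r"\.(?P<version>\d{3})\.(?P<production>\d{13})$"
)


def key_prefix(granule_name):
    """Return PRODUCTNAME/TILEID/DATE for a granule name (no file extension)."""
    match = GRANULE_RE.match(granule_name)
    if match is None:
        raise ValueError("unrecognised granule name: %r" % granule_name)
    parts = match.groupdict()
    product_name = "%s.%s" % (parts["product"], parts["version"])
    return "/".join([product_name, parts["h"], parts["v"], parts["date"]])


print(key_prefix("MCD43A4.A2017004.h21v07.006.2017014061026"))
```

Switching to a single h21v07 directory would only change the last line of `key_prefix`.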
AWS Lambda should be sufficient to do all required processing and uploading to S3 within the time limits imposed by Lambda.
When converting from hdf to geotiff, the band numbers are getting mixed up. This code assumes that the first 14 subdatasets are Bands 1-14. But the first 7 are binary quality bands, the next seven contain the reflectance.
(The same results can be seen from gdalinfo.)
This means that all the band numbers in the modis-pds bucket also need to be fixed. For instance, s3://modis-pds/MCD43A4.006/21/07/2017004/MCD43A4.A2017004.h21v07.006.2017014061026_B08.TIF is actually the reflectance for band 1 and should be named B01.TIF
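The subdataset ordering described above can be written down as a small mapping. This is a sketch only: the function names are hypothetical, and since nothing here says how the seven quality layers should be named, the helper refuses to rename them.

```python
def describe_subdataset(index):
    """Map a 1-based subdataset index to (kind, band number).

    Per the description above: subdatasets 1-7 are quality bands,
    subdatasets 8-14 are the reflectance for bands 1-7.
    """
    if 1 <= index <= 7:
        return ("quality", index)
    if 8 <= index <= 14:
        return ("reflectance", index - 7)
    raise ValueError("expected a subdataset index between 1 and 14")


def corrected_reflectance_suffix(current_suffix):
    """Turn a mislabelled suffix such as 'B08' into the corrected 'B01'."""
    kind, band = describe_subdataset(int(current_suffix.lstrip("B")))
    if kind != "reflectance":
        # Naming for the quality layers is not settled here, so do not guess.
        raise ValueError("%s is a quality layer, not reflectance" % current_suffix)
    return "B%02d" % band


print(corrected_reflectance_suffix("B08"))
```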
Be a good web citizen @drewbo
Create an AWS account specifically for MODIS on AWS, "modis-pds"
@drewbo thanks a lot for the presentation.
The product looks amazing.
2 minor issues I'd mention with file names:
The "_" char may cause some problems, e.g. rasterio by default doesn't accept it in file names.
Comes in as sinusoidal, can convert if needed when saving each band
Reach out to interested parties to vet formatting, discuss approach
Replace print statements with logging facility
Improve/expand on command line options, such as allowing for multiple dates, limit on total # of files, optional prefix to upload to S3 (?)
Add in overviews, either separate .ovr files or embedded in the TIF. Since not all use cases would use the overviews, I think it makes more sense to have them in a separate file.
Verify, implement if needed, final GeoTIFF format that includes lossless compression and ability to make windowed reads.
The Python logging module is currently used, but logs should be sent to CloudWatch somehow.
This depends on the final implementation. If the process ends up as a Lambda function this would happen automatically; otherwise, if it stays as a dockerized container on an EC2 instance, a handler will need to be added to send the logs to CloudWatch.
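A minimal sketch of a logging setup that covers both cases, using only the standard library. The `get_logger` helper and the logger name are hypothetical; on EC2 a CloudWatch handler from whichever library is chosen would be passed in through `extra_handlers`, and no specific library is assumed here.

```python
import logging
import sys


def get_logger(name="modis_ingestor", level=logging.INFO, extra_handlers=()):
    """Return a logger that writes to stdout plus any extra handlers.

    Handlers are attached only once per logger name, so calling this
    repeatedly does not duplicate log lines.
    """
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:
        stream = logging.StreamHandler(sys.stdout)
        stream.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(stream)
        for handler in extra_handlers:
            logger.addHandler(handler)
    return logger


log = get_logger()
log.info("fetched %d granules", 3)  # replaces a bare print statement
```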
So, the way CMR works is that when you request data for a date, it returns any product whose date range includes that date. For MODIS data, that means the query result includes data for 8 days before and 7 days after.
Additionally, CMR's date field is always the 1st day of the 16 day window rather than the middle, even though the version 006 MCD43A4 filenames use the 9th day in the naming (and version 005 uses the first)...CMR always uses the first.
So, if you request a date from CMR for MCD43A4.006 without a polygon feature, you get all tiles from every day starting 16 days before your requested date. With 460 tiles per day, this is clearly more than the page size of 1000 we currently use (the maximum is 2000).
Note that the CMR query python code will actually throw out all the other dates that aren't the actual specific date...the problem is that we aren't requesting multiple pages from CMR. We need the code to be able to fetch all the results no matter how many there are.
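With 460 tiles per day over a 16 day window, a single date can return 460 x 16 = 7360 results, so one page of 1000 is never enough. A sketch of the paging loop follows; `fetch_page` is a hypothetical callable standing in for the actual CMR request, and the `page_num` / `page_size` names are assumed to mirror CMR's paging parameters.

```python
def fetch_all(fetch_page, page_size=1000):
    """Collect results from every page of a paged query.

    fetch_page(page_num=..., page_size=...) must return a list of results
    for that page (hypothetical interface). Paging stops at the first page
    that comes back with fewer than page_size results.
    """
    results = []
    page_num = 1
    while True:
        page = fetch_page(page_num=page_num, page_size=page_size)
        results.extend(page)
        if len(page) < page_size:
            break
        page_num += 1
    return results
```

Filtering down to the one requested date can then happen after all pages have been collected, as the existing query code already does for a single page.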
Need a function to delete or remove things from S3. This is most useful for cleaning up after a test push, but may also be a useful management function.
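A possible shape for such a function, assuming a boto3 S3 client is passed in as `s3`. The `delete_prefix` name and the `dry_run` default are hypothetical; it relies on the client's `list_objects_v2` paginator and `delete_objects` call.

```python
def delete_prefix(s3, bucket, prefix, dry_run=True):
    """Delete every object under prefix and return the list of affected keys.

    With dry_run=True (the default) nothing is deleted; the keys that
    would be removed are only listed.
    """
    if not prefix:
        # Guard against wiping the whole bucket by accident.
        raise ValueError("refusing to delete with an empty prefix")
    affected = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys = [obj["Key"] for obj in page.get("Contents", [])]
        if not keys:
            continue
        if not dry_run:
            # Each listing page holds at most 1000 keys, which is also
            # the limit for a single delete_objects request.
            s3.delete_objects(
                Bucket=bucket,
                Delete={"Objects": [{"Key": key} for key in keys]},
            )
        affected.extend(keys)
    return affected


# Usage after a test push (requires boto3 and credentials):
#   import boto3
#   delete_prefix(boto3.client("s3"), "my-test-bucket", "test/", dry_run=False)
```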
This will have links to:
Need to add an index file at the top level of the bucket containing scene IDs, download URLs and a small selection of metadata.
Landsat includes:
entityId,acquisitionDate,cloudCover,processingLevel,path,row,min_lat,min_lon,max_lat,max_lon,download_url
Cloud cover doesn't really apply to MODIS, at least not the MCD43A4 product since it is a composite product. The daily products don't have cloud cover estimates that I recall.
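To make the idea concrete, here is a sketch of writing such an index as CSV. The column set is purely an assumption for illustration (the Landsat header with cloud cover dropped and path/row swapped for the tile's h and v); the actual selection of metadata is still to be decided.

```python
import csv
import io

# Hypothetical columns, loosely adapted from the Landsat index header.
INDEX_COLUMNS = ["sceneId", "acquisitionDate", "h", "v", "download_url"]


def write_index(rows, out):
    """Write index rows (dicts keyed by INDEX_COLUMNS) to a text stream."""
    writer = csv.DictWriter(out, fieldnames=INDEX_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_index(
    [
        {
            "sceneId": "MCD43A4.A2017004.h21v07.006.2017014061026",
            "acquisitionDate": "2017004",
            "h": "21",
            "v": "07",
            "download_url": "s3://modis-pds/MCD43A4.006/21/07/2017004/",
        }
    ],
    buffer,
)
print(buffer.getvalue())
```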
a few things WRT metadata:
It seems that the last update of MODIS was on 2019/08/02:
https://modis-pds.s3.amazonaws.com/?prefix=MCD43A4.006/2019-08
Can you check what is going on and update us on when the issue will be solved?
Best,
Grega
We are attempting to catalog MODIS data sets and have found some issues with the boundary polygons being incorrect. For instance this product: MOD09GQ.A2017112.h23v02.006.2017114034459.hdf found here https://s3-us-west-2.amazonaws.com/modis-pds/MOD09GQ.006/23/02/2017112/MOD09GQ.A2017112.h23v02.006.2017114034459.hdf.xml.
Has a boundary polygon in the XML like this:
94.8565392816915
59.2599860866768
114.567785087656
67.6165109403567
-136.147711707288
56.6146339179703
124.453986719599
59.0449762757696
Which does not seem correct. Note the product.json has the same values.
The tif (band 1) headers look like this:
Corner Coordinates:
Upper Left ( 5559752.598, 7783653.638) (146d11'24.79"E, 70d 0' 0.00"N)
Lower Left ( 5559752.598, 6671703.118) (100d 0' 0.00"E, 60d 0' 0.00"N)
Upper Right ( 6671703.118, 7783653.638) (175d25'41.75"E, 70d 0' 0.00"N)
Lower Right ( 6671703.118, 6671703.118) (120d 0' 0.00"E, 60d 0' 0.00"N)
Center ( 6115727.858, 7227678.378) (130d 8'27.91"E, 65d 0' 0.00"N)
Is there an issue with the data, or am I misinterpreting that boundary polygon in the XML?
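For comparison with the XML values, the longitude/latitude pairs in the gdalinfo output above can be reproduced from the projected corner coordinates with the inverse sinusoidal projection. This sketch assumes the MODIS sinusoidal grid's sphere radius of 6371007.181 m and a central meridian of 0; the function name is hypothetical.

```python
import math

MODIS_SPHERE_RADIUS_M = 6371007.181  # sphere radius of the MODIS sinusoidal grid


def sinusoidal_to_lonlat(x, y, radius=MODIS_SPHERE_RADIUS_M):
    """Convert sinusoidal metres to (longitude, latitude) in degrees.

    Assumes a central meridian of 0 degrees.
    """
    lat = y / radius
    lon = x / (radius * math.cos(lat))
    return math.degrees(lon), math.degrees(lat)


# Projected corner coordinates copied from the gdalinfo output above.
corners = {
    "Upper Left": (5559752.598, 7783653.638),
    "Lower Left": (5559752.598, 6671703.118),
    "Upper Right": (6671703.118, 7783653.638),
    "Lower Right": (6671703.118, 6671703.118),
}

for name, (x, y) in corners.items():
    lon, lat = sinusoidal_to_lonlat(x, y)
    print("%-12s lon=%.4f lat=%.4f" % (name, lon, lat))
```

Because lines of constant latitude are horizontal in this projection while meridians are curved, the tile is a rectangle in projected space but not in longitude/latitude, which is worth keeping in mind when comparing against a four-point polygon.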
Hey all, I'm working on a task to catalog the MODIS data in AWS S3. In the past I've used S3 inventory files to catalog the data. Is there something like this that contains a manifest of the data? Thanks, any advice on best practices for cataloging MODIS would be great!