modis-ingestor's People

Contributors

drewbo, matthewhanson

modis-ingestor's Issues

add MODIS tiles shapefile

Need to add MODIS tile vectors to s3://modis-pds/.
These contain the geometry of each tile and the tile designation in hXXvYY format.

There should be three:

  • A shapefile (zipped) using the sinusoidal grid
  • A cleaned up shapefile (zipped) in 4326 (direct conversion results in some problems)
  • A geojson file of tiles

This is preferable to storing the coordinates of every tile with each scene, since those coordinates would be repeated every day.
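
As a rough sketch of how the tile geometries could be generated once instead of stored per scene, the snippet below derives a tile's sinusoidal bounds from its hXXvYY designation. The sphere radius (6371007.181 m) and the 36-column grid are the usual MODIS sinusoidal grid parameters; the function name `tile_bounds` is hypothetical and not part of the ingestor.

```python
import math
import re

R = 6371007.181  # radius of the MODIS sinusoidal sphere, in metres
TILE_SIZE = 2 * math.pi * R / 36  # 36 tile columns span the full globe


def tile_bounds(tile_id):
    """Return (xmin, ymin, xmax, ymax) in sinusoidal metres for 'hXXvYY'."""
    match = re.fullmatch(r"h(\d{2})v(\d{2})", tile_id)
    if match is None:
        raise ValueError("expected hXXvYY, got %r" % tile_id)
    h, v = int(match.group(1)), int(match.group(2))
    xmin = -math.pi * R + h * TILE_SIZE      # h counts eastward from the west edge
    ymax = math.pi * R / 2 - v * TILE_SIZE   # v counts southward from the north edge
    return (xmin, ymax - TILE_SIZE, xmin + TILE_SIZE, ymax)
```

For h23v02 this gives an upper-left corner of roughly (5559752.6, 7783653.6) metres.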

Handle LP-DAAC being down

If LP-DAAC is down (Wednesdays) while we happen to be processing data, we need to exit gracefully, but also be able to resume sooner than the next scheduled time in the crontab.

Big delay on data ingestion

Hi there - we use MODIS data on AWS for our work flow.
There is often a delay of several days, sometimes weeks, from USGS to AWS.
What is the expected priority of this ingestion process? Should we be looking to use an alternate source?

First steps

First steps in building a prototype. The modis-ingestor is the main repository of code that is run on a daily basis. The ingestor should be able to determine the latest data available on S3, query and locate new data since that point (and missing/updated data?), fetch the data, split it into individual datasets, and upload them to S3 following the standard directory structure of:
PRODUCTNAME/TILEID/DATE

where PRODUCTNAME is a product shortname and version ID (e.g., MCD43A4.006) and TILEID can be either a single directory or multiple directories that make up the tile identifier. For example, Landsat uses 006/109 (path/row), but MODIS may use h05v07 or 05/07. DATE is either a directory named for the date or at least a scene ID that is unique within that TILEID for a specific date.
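
A minimal sketch of how such a prefix could be derived from a scene ID, assuming the two-directory tile layout (05/07) and a YYYYDDD date directory. The function name `s3_prefix` is hypothetical, and the scene ID is assumed to have the five dot-separated parts seen in MODIS filenames.

```python
def s3_prefix(scene_id):
    """Build 'PRODUCT.VERSION/HH/VV/YYYYDDD/' from a MODIS scene ID.

    Assumes an ID such as 'MCD43A4.A2017004.h21v07.006.2017014061026'.
    """
    product, acquired, tile, version, _production = scene_id.split(".")
    date = acquired[1:]           # drop the leading 'A'
    h, v = tile[1:3], tile[4:6]   # 'h21v07' -> '21', '07'
    return "{}.{}/{}/{}/{}/".format(product, version, h, v, date)
```

An ID with a different number of parts raises ValueError from the unpacking, which is probably the safest behaviour for a daily job.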

AWS Lambda should be sufficient to do all required processing and uploading to S3 within the time limits imposed by Lambda.

HDF to Geotiff band numbers

When converting from HDF to GeoTIFF, the band numbers are getting mixed up. The code assumes that the first 14 subdatasets are bands 1-14, but the first 7 are binary quality bands and the next 7 contain the reflectance.

[screenshot, 2017-01-31 at 2:49:25 pm]

(same results can be seen from gdalinfo)

This means that all the band numbers in the modis-pds bucket also need to be fixed. For instance, s3://modis-pds/MCD43A4.006/21/07/2017004/MCD43A4.A2017004.h21v07.006.2017014061026_B08.TIF is actually the reflectance for band 1 and should be named B01.TIF
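
The ordering described above can be written down as a small lookup. This is only a sketch: the function name and the "quality"/"reflectance" labels are hypothetical, and how the quality files should be named in the bucket is not settled here.

```python
def describe_subdataset(index):
    """Map a 1-based subdataset index (1-14) to (kind, band number).

    Subdatasets 1-7 are the quality bands for bands 1-7;
    subdatasets 8-14 are the reflectance for bands 1-7.
    """
    if not 1 <= index <= 14:
        raise ValueError("subdataset index must be between 1 and 14")
    if index <= 7:
        return ("quality", index)
    return ("reflectance", index - 7)
```

Under this mapping the file currently written as B08 corresponds to ("reflectance", 1).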

File names

@drewbo thanks a lot for the presentation.

The product looks amazing.

2 minor issues I'd mention with file names:

  1. The _ character may cause some problems, e.g. rasterio by default doesn't accept it in file names.
  2. There is a little bit of confusion between the folder structure and the list of bands on USGS. Is B00 the QA band? Do the other band numbers match the file names?

Projection

The data comes in as sinusoidal; it can be converted if needed when saving each band.

logging

Replace print statements with logging facility
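
A minimal sketch of what the replacement could look like with the standard library. The logger name "modis_ingestor" and both function names are assumptions, not names taken from the code base.

```python
import logging

logger = logging.getLogger("modis_ingestor")  # logger name is an assumption


def configure_logging(verbose=False):
    """Set up console logging once, at program start."""
    logging.basicConfig(format="%(asctime)s %(levelname)s %(name)s: %(message)s")
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    return logger


def report_upload(filename):
    # Previously something like: print("uploaded", filename)
    logger.info("uploaded %s", filename)
```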

Command line options

Improve/expand on command line options, such as allowing for multiple dates, limit on total # of files, optional prefix to upload to S3 (?)
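
One possible shape for these options, sketched with argparse. The option names (`--limit`, `--prefix`) and the date format are assumptions for illustration only.

```python
import argparse


def parse_args(argv=None):
    """Parse ingestor options; argv defaults to sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Ingest MODIS scenes into S3")
    parser.add_argument("dates", nargs="+",
                        help="one or more dates to process, e.g. 2017-01-01")
    parser.add_argument("--limit", type=int, default=None,
                        help="maximum total number of files to process")
    parser.add_argument("--prefix", default="",
                        help="optional key prefix to upload under in S3")
    return parser.parse_args(argv)
```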

overviews

Add in overviews, either separate .ovr files or embedded in the TIF. Since not all use cases would use the overviews, I think it makes more sense to have them in a separate file.

log to cloudwatch

The Python logging module is currently used, but logs should be sent to CloudWatch somehow.

This depends on the final implementation. If the process ends up as a Lambda function, this would happen automatically; otherwise, if it stays as a Dockerized container on EC2, a handler will need to be added to send the logs to CloudWatch.

CMR and page sizing

The way CMR works is that when you request data for a date, it returns any product whose range includes that date. So for MODIS data, the query result includes data for 8 days before and 7 days after.

Additionally, CMR's date field is always the 1st day of the 16-day window rather than the middle. The version 006 MCD43A4 filenames use the 9th day in the naming (and version 005 uses the first), but CMR always uses the first.
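
The offset between the two conventions can be sketched as below, assuming it behaves exactly as described here for MCD43A4 (version 006 filenames carry the 9th day of the window, version 005 the 1st). The function name is hypothetical.

```python
from datetime import datetime, timedelta


def cmr_start_date(filename_date, version="006"):
    """Convert a filename date ('YYYYDDD') to the first day of its 16-day window."""
    day = datetime.strptime(filename_date, "%Y%j").date()
    # Assumption from the text above: 006 names the 9th day, 005 the 1st.
    offset = 8 if version == "006" else 0
    return day - timedelta(days=offset)
```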

So, if you request a date from CMR for MCD43A4.006 without a polygon feature, you get all tiles from every day starting 16 days before your requested date. With 460 tiles per day, this is clearly more than the page size of 1000 that we have set (the maximum is 2000).

Note that the CMR query Python code will actually throw out all results whose date isn't the specific date requested. The problem is that we aren't requesting multiple pages from CMR. We need the code to be able to fetch all the results no matter how many there are.
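
A sketch of the paging loop, kept independent of the HTTP layer. `fetch_page` is a hypothetical callable that would wrap the actual CMR request (the CMR search API accepts `page_size` and `page_num` parameters) and return that page's results as a list.

```python
def fetch_all(fetch_page, page_size=1000):
    """Collect results from every page.

    fetch_page(page_num, page_size) must return a list of results;
    a page shorter than page_size is taken to be the last one.
    """
    results = []
    page_num = 1
    while True:
        page = fetch_page(page_num, page_size)
        results.extend(page)
        if len(page) < page_size:
            break
        page_num += 1
    return results
```

When the total is an exact multiple of the page size, this makes one extra request that comes back empty, which seems an acceptable cost for not relying on a total-hits header.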

delete from s3

Need a function to delete or remove things from S3. This is most useful for cleaning up after a test push, but may also be a useful management function.
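
A minimal sketch of such a function, assuming a boto3 S3 client is passed in (for example `boto3.client("s3")`). It uses the `list_objects_v2` paginator and `delete_objects`; the function name `delete_prefix` is hypothetical.

```python
def delete_prefix(client, bucket, prefix):
    """Delete every object under `prefix`; return the number of keys deleted.

    `client` is expected to behave like a boto3 S3 client.
    """
    paginator = client.get_paginator("list_objects_v2")
    deleted = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Each listing page holds at most 1000 keys, which is also the
        # most that a single delete_objects call accepts.
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
    return deleted
```

Taking the client as an argument keeps the function easy to exercise against a stub in tests, without touching a real bucket.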

add index

Need to add an index file at the top level of the bucket containing scene IDs, download URLs and a small selection of metadata.

Landsat includes:
entityId,acquisitionDate,cloudCover,processingLevel,path,row,min_lat,min_lon,max_lat,max_lon,download_url

Cloud cover doesn't really apply to MODIS, at least not the MCD43A4 product since it is a composite product. The daily products don't have cloud cover estimates that I recall.
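
A sketch of how a MODIS index could be written as CSV, modelled on the Landsat columns with cloudCover, processingLevel, path and row dropped. The column list (including the `h` and `v` tile columns) is an assumption for illustration, not an agreed format.

```python
import csv
import io

# Hypothetical column set for a MODIS index
INDEX_FIELDS = ["entityId", "acquisitionDate", "h", "v",
                "min_lat", "min_lon", "max_lat", "max_lon", "download_url"]


def build_index(rows):
    """Render an iterable of dicts (keyed by INDEX_FIELDS) as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=INDEX_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```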

Metadata

A few things with regard to metadata:

  • get the XML metadata and save it as is.
  • convert that XML metadata (or alternatively the CMR response, if equivalent) to JSON and save it
  • metadata files should use the scene ID as their base filename (like all the other files)
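
The XML-to-JSON step could be sketched with the standard library as follows. This is a simplified conversion under stated assumptions: XML attributes are ignored, repeated child tags become lists, and the helper names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Recursively convert an element to a str (leaf) or dict (has children)."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # A repeated tag is collected into a list
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


def xml_to_json(xml_text):
    """Return a JSON string for the XML document (attributes are dropped)."""
    root = ET.fromstring(xml_text)
    return json.dumps({root.tag: element_to_dict(root)}, indent=2)


def metadata_key(scene_id, extension):
    """Name a metadata file after its scene ID, e.g. '<sceneid>.json'."""
    return "{}.{}".format(scene_id, extension)
```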

Project milestones

  • Friday Jan 6th - MVP
  • Thursday January 12th - working group call to present format to a few interested parties, see #4.
  • Friday January 27th - Blog posts, finalize landing page
  • Tuesday January 31st - Launch

Problem with boundary polygons

We are attempting to catalog MODIS data sets and have found that some of the boundary polygons appear to be incorrect. For instance, the product MOD09GQ.A2017112.h23v02.006.2017114034459.hdf, whose metadata is at https://s3-us-west-2.amazonaws.com/modis-pds/MOD09GQ.006/23/02/2017112/MOD09GQ.A2017112.h23v02.006.2017114034459.hdf.xml, has a boundary polygon in the XML with these four points (longitude, latitude):

  • 94.8565392816915, 59.2599860866768
  • 114.567785087656, 67.6165109403567
  • -136.147711707288, 56.6146339179703
  • 124.453986719599, 59.0449762757696

This does not seem correct. Note that the product.json has the same values.

The tif (band 1) headers look like this:
Corner Coordinates:
Upper Left ( 5559752.598, 7783653.638) (146d11'24.79"E, 70d 0' 0.00"N)
Lower Left ( 5559752.598, 6671703.118) (100d 0' 0.00"E, 60d 0' 0.00"N)
Upper Right ( 6671703.118, 7783653.638) (175d25'41.75"E, 70d 0' 0.00"N)
Lower Right ( 6671703.118, 6671703.118) (120d 0' 0.00"E, 60d 0' 0.00"N)
Center ( 6115727.858, 7227678.378) (130d 8'27.91"E, 65d 0' 0.00"N)

Is there an issue with the data, or am I misinterpreting the boundary polygon in the XML?
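
One way to cross-check the corner coordinates independently is to invert the sinusoidal projection by hand. The sketch below assumes the standard MODIS sphere radius of 6371007.181 m; the function name is hypothetical.

```python
import math

R = 6371007.181  # radius of the MODIS sinusoidal sphere, in metres


def sinusoidal_to_lonlat(x, y):
    """Convert sinusoidal metres to (longitude, latitude) in degrees."""
    lat = y / R                      # latitude in radians
    lon = x / (R * math.cos(lat))    # longitude in radians
    return math.degrees(lon), math.degrees(lat)
```

For the lower-left corner (5559752.598, 6671703.118) this yields approximately (100.0, 60.0), in agreement with the gdalinfo output.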

Best method to catalog MODIS data

Hey all, I'm working on a task to catalog the MODIS data in AWS S3. In the past I've used S3 inventory files to catalog data. Is there something like this that contains a manifest of the data? Thanks, any advice on best practices for cataloging MODIS would be great!
