Giter VIP home page Giter VIP logo

hydrodatasource's Introduction

hydrodatasource

image image

image

πŸ“œ δΈ­ζ–‡ζ–‡ζ‘£

Although there are many hydrological datasets for various watersheds, a noticeable issue is that many data sources remain unorganized and are not part of public datasets. This includes data that hasn't been organized due to its recency, data not considered by existing datasets, and data that will not be made public. These data sources represent a significant portion of available data. For example, the commonly used CAMELS dataset only includes data up to December 2014, almost ten years ago; GRDC runoff data, while useful, is rarely included in specific datasets. Real-time and near-real-time gridded data such as GFS, GPM, SMAP, etc., are infrequently compiled into datasets, with more emphasis on higher quality data like ERA5Land being used for research. A large portion of hydrological data in China is not public, and thus cannot be used to construct datasets.

To address this, we conceived the hydrodatasource repository, aiming to provide a unified way of organizing these data sources for better utilization in scientific research and production, especially within the context of watersheds. For information on currently available public datasets, please visit: hydrodataset.

To be more specific, the goal of this repository is to provide a unified pathway and method for watershed hydrological data management and usage, making hydrological model calculations, especially those based on artificial intelligence, more convenient.

Regarding the part about data acquisition, since it involves a process with significant manual and semi-automatic intervention, we have placed these contents in a separate repository: HydroDataCompiler. Once it is relatively perfected, we will open source this repository.

How many data sources are there

Considering watersheds as the primary focus of data description, our data sources mainly include:

Primary Category Secondary Category Update Frequency Data Structure Specific Data Source
Baseline Geographic Maps Historical Archive Vector Watershed boundaries, site locations, and other shapefiles
Elevation Data Historical Archive Raster DEM
Attribute Data Historical Archive Tabular HydroATLAS dataset
Meteorological Reanalysis Data Sets Historical Archive, Delayed Dynamic Raster ERA5Land
Remote Sensing Precipitation Historical Archive, Near Real-Time Dynamic Raster GPM
Weather Model Forecasts Historical Archive, Real-Time Rolling Raster GFS
AI Weather Forecasts Real-Time Rolling Raster AIFS
Ground Weather Stations Historical Archive Tabular NOAA weather stations
Ground Rainfall Stations Historical Archive, Real-Time/Delayed Dynamic Tabular Non-public rainfall stations
Hydrology Remote Sensing Soil Moisture Historical Archive, Near Real-Time Dynamic Raster SMAP
Soil Moisture Stations Historical Archive, Real-Time Dynamic Tabular Non-public soil moisture stations
Ground Hydrological Stations Historical Archive Tabular USGS
Ground Hydrological Stations Historical Archive, Real-Time Dynamic Tabular Non-public water level and flow stations
Runoff Data Sets Historical Archive Tabular GRDC

Note: The update frequency primarily refers to the frequency of updates in this repository, not necessarily the actual data source's update frequency.

What are the main features

Before using it, it is essential to understand the main features of this repository, as this will guide its use.

Our goal is to make this tool accessible to users with varying hardware resources. To elaborate on hardware resources: due to the extensive variety and volume of data involved, we have set up a MinIO service. MinIO is an open-source object storage service, which can be conveniently deployed locally or in the cloud; in our case, it's deployed locally. Thus, data is stored on MinIO and accessed via its API. This approach allows effective data management and the development of a unified access interface, simplifying data retrieval. However, it does require specific hardware resources, like disk space and memory. Therefore, we also offer a fully local file interaction mode for a portion of the data, although this mode won't be covered by complete functional testing.

Based on this approach, we handle different types of data differently:

For non-public data, we mainly provide utility functions in the public code to assist users in processing their data, facilitating the use of our open-source models. Of course, developers internally provide data retrieval services for their own data. For public data, we offer code for data download, format conversion, and reading, supporting users in handling data on their local systems. Now, let's expand on these two parts.

For non-public data

The non-public data primarily involves ground station data. We provide tools for data format conversion for these data types. We define a data format that users need to prepare, and the subsequent process involves using these tools directly. In general, we expect users to prepare their data in a specific tabular format, which we will then convert into netCDF format for model reading. As for the exact format to prepare, we provide a data_checker function to verify the data format. Users can use this function to understand the specifics. We will also add a document detailing the specific format, which is yet to be completed.

For public data

The public data mainly consists of those already organized into datasets. We provide code for data download, format conversion, and reading to support users in operating data on their local systems. These datasets include, but are not limited to, CAMELS, GRDC, ERA5Land, etc.

However, as previously mentioned, we do not provide complete test coverage for local files. Our primary testing is conducted on MinIO.

How to use

Installation

We recommend installing the package via pip:

pip install hydrodatasource

Usage

Our agreed data file organization structure at the primary level looks like this:

β”œβ”€β”€ datasets-origin
β”œβ”€β”€ datasets-interim
β”œβ”€β”€ basins-origin
β”œβ”€β”€ basins-interim
β”œβ”€β”€ reservoirs-origin
β”œβ”€β”€ reservoirs-interim
β”œβ”€β”€ grids-origin
β”œβ”€β”€ grids-interim
β”œβ”€β”€ stations-origin
β”œβ”€β”€ stations-interim

Here, datasets-origin contains the datasets, basins-origin contains watershed data, reservoirs-origin stores reservoir data, rivers-origin holds river data, grids-origin includes gridded data, and stations-origin has station data.

Data in the origin folders is raw data, while the interim folders contain data that has undergone preliminary processing. Essentially, the data in origin is the result of initial processing in GitLab's One Thing One Vote project, and interim is where origin data is processed into a specific format based on a particular requirement.

This categorization fully covers the types of data listed in the table.

For non-public station data:

First, users need to prepare their data in a tabular format. To understand the specific format required, execute the following command:

from hydrodatasource import station
station.get_station_format()

Place the files in the stations-origin folder. For the specific parent absolute path, please configure it in the hydro_settings.yml file in your computer's user folder.

hydrodatasource's People

Contributors

forestbat avatar ouyangwenyu avatar liutiaxqabs avatar silencesoup avatar xw5555 avatar zaixiaxiaowu avatar wangjingyi1999 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.