AquaViva

Table of Contents

About
Data
Machine Learning
Visualization
Future Work
Contributing
Built With
License
Contact
Acknowledgements
References

About

AquaViva is a project aimed at addressing one of the most important sustainable development goals and global humanitarian challenges of our time: the lack of access to clean water (SDG 6). To accomplish this, we use machine learning models trained on a range of datasets, including satellite imagery, climatic variables, and geological features, to produce near real-time, high-resolution maps of groundwater level.

We believe that this tool has great potential to help communities mitigate water scarcity, monitor groundwater, and efficiently identify suitable sources of clean water. As such, we are committed to keeping our project open-source and free-to-use, and we welcome any contributors to build off of what we have done. This project is part of NASA's Pale Blue Dot Challenge, which shares our deep commitment to using technology for environmental and social good. We are ecstatic to say that we have been recognized as Best Overall in the challenge!

Created by Team Viva Aqua | Francisco Furey, Adam Zheng, Malena Vildoza, El Hadji Malick DIEYE (AKA Jay) 😊

(back to top)

Data

For our training data, we conducted an extensive literature review of past studies, as well as key concepts such as the water balance equation, to determine which variables would provide a comprehensive set of information for predicting groundwater level. We then collected, cleaned, preprocessed, and integrated the datasets using Python scripts (see scripts/preprocessing) and Jupyter Notebooks (see notebooks/preprocessing).
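
For reference, one common textbook form of the water balance for a catchment over a given time step is written below. This is a general formulation using standard hydrology symbols; treating it as the form considered in the literature review is an assumption.

$$
P = ET + Q + \Delta S
$$

Here $P$ is precipitation, $ET$ is evapotranspiration, $Q$ is runoff (streamflow), and $\Delta S$ is the change in stored water, which includes soil moisture and groundwater. Several of the input variables listed under Features (precipitation, evapotranspiration, streamflow, and soil moisture) correspond directly to these terms.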

Data Collection: We first used IGRAC/GGIS to obtain piezometric (groundwater level) data from 2015-2022 for 36 wells distributed across Gambia. We then gathered corresponding data for our 13 input variables (see Features), sourced from AρρEEARS, ClimateSERV, BGS, and GGIS (see Data Sources). Most of the raw data is available under data/original_data (except for a few files that were too large to upload).
Data Cleaning and Preprocessing: We used Jupyter notebooks (see notebooks/preprocessing) to manage the various data formats (.nc4, .nc, .csv), visualize and analyze the raw data, and fill in missing or erroneous values using nearest neighbor algorithms and linear interpolation. QGIS was also used to process hydrogeological region and topographical data. All processed data is available under data/processed_data.
Data Integration: Using pandas and geopandas, we merged the datasets on date, latitude, and longitude to form our primary dataset, which consists of ~6600 rows (see data/processed_data/wells_data_gambia_for_machine_learning.csv).
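
The gap-filling and merging steps above can be sketched with pandas as follows. This is a minimal illustration, not the code in notebooks/preprocessing: the two small tables, their values, and the coordinates are made up, and only the column names GROUNDWATER_LEVEL and NASA_IMERG_Late come from the tables under Features and Output.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"])

# Hypothetical well readings with a two-day gap (values are illustrative).
wells = pd.DataFrame({
    "date": dates,
    "latitude": [13.45] * 4,
    "longitude": [-16.57] * 4,
    "GROUNDWATER_LEVEL": [10.0, np.nan, np.nan, 13.0],
})

# Hypothetical precipitation samples at the same location and dates.
features = pd.DataFrame({
    "date": dates,
    "latitude": [13.45] * 4,
    "longitude": [-16.57] * 4,
    "NASA_IMERG_Late": [0.0, 2.5, 0.0, 1.2],
})

# Fill the gap by linear interpolation between the neighboring readings.
# method="linear" treats rows as equally spaced, which holds for daily data here.
wells = wells.sort_values("date")
wells["GROUNDWATER_LEVEL"] = wells["GROUNDWATER_LEVEL"].interpolate(method="linear")

# Join targets and features on the shared keys: date, latitude, longitude.
merged = wells.merge(features, on=["date", "latitude", "longitude"], how="inner")
print(merged)
```

An inner join keeps only rows present in both tables, so a well reading without matching feature data is dropped rather than carried forward with empty columns.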

Data Sources

Global Groundwater Information System (GGIS): An interactive portal by IGRAC that compiles data on global groundwater resources. We use it to access groundwater level data as well as data on hydrogeological regions.
British Geological Survey (BGS): This research project by BGS focused on the resilience of African groundwater to climate change. We incorporate their depth to groundwater data, which groups depths into 6 categories (0-7, 7-25, 25-50, 50-100, 100-250, >250 meters). This is significantly coarser in resolution and precision than our target variable, but still potentially useful.
Application for Extracting and Exploring Analysis Ready Samples (AρρEEARS): We used this tool to extract various parameters such as NDVI, MIR, EVI, Elevation, Curvature, Drainage Density, and Slope.
ClimateSERV: A tool by SERVIR, NASA, & USAID that provides climatic and vegetation data. We wrote a custom Python library (climateservAccess) for accessing the ClimateSERV API and used it to gather soil moisture, evapotranspiration, streamflow, and precipitation data.
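
As an illustration of how the six BGS depth categories listed above could be fed to a regression model, the sketch below encodes them as ordered integer codes. This encoding is an assumption made for the example; it is not necessarily how DepthToGroundwater is represented in our processed data, and the sample labels are made up.

```python
import pandas as pd

# Depth ranges in meters, ordered from shallowest to deepest (from the BGS entry above).
DEPTH_CATEGORIES = ["0-7", "7-25", "25-50", "50-100", "100-250", ">250"]

# Hypothetical category labels sampled at four locations.
labels = pd.Series(["7-25", "0-7", ">250", "25-50"])

# An ordered categorical keeps the shallow-to-deep ordering explicit.
depth = pd.Categorical(labels, categories=DEPTH_CATEGORIES, ordered=True)

# Integer codes: 0 = shallowest category, 5 = deepest category.
codes = pd.Series(depth.codes, index=labels.index)
print(codes.tolist())
```

Ordinal codes preserve the ordering of the ranges but not their widths, so a model sees the step from 0-7 to 7-25 as the same size as the step from 100-250 to >250.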

Features

Datatype	Description	Data Source	Resolution
LIS_Soil_Moisture_Combined	Soil Moisture	ClimateSERV/LIS	3 km
LIS_Streamflow	Streamflow	ClimateSERV/LIS	3 km
LIS_ET	Evapotranspiration	ClimateSERV/LIS	3 km
MOD13Q1_061__250m_16_days_EVI	Enhanced Vegetation Index (EVI)	AρρEEARS/MODIS	250 m
MOD13Q1_061__250m_16_days_MIR_reflectance	Mid-Infrared Reflectance	AρρEEARS/MODIS	250 m
MOD13Q1_061__250m_16_days_NDVI	Normalized Difference Vegetation Index (NDVI)	AρρEEARS/MODIS	250 m
NASA_IMERG_Late	Precipitation	ClimateSERV/IMERG	10 km
DepthToGroundwater	Estimated Groundwater Level Range	BGS	5 km
Curvatu_tif2	Curvature	AρρEEARS/NASADEM	30 m
Drainage_density	Drainage Density	AρρEEARS/NASADEM	30 m
Slope_tif2	Slope	AρρEEARS/NASADEM	30 m
Hydrogeo	Hydrogeological Region	IGRAC/GGIS	N/A
NASADEM_HGT	Elevation	AρρEEARS/NASADEM	30 m

Output

Datatype	Description	Data Source
GROUNDWATER_LEVEL	Groundwater Level	IGRAC/GGIS

(back to top)

Machine Learning

All relevant Jupyter Notebooks are located in notebooks/machine_learning.

Model Selection and Training: First, we split our dataset by well ID, so that no well appears in both the training and test sets and the evaluation is not inflated by the model having already seen a test well. We allocated 83% for training and 17% for testing. We trained 6 different regression models using scikit-learn: SVR, AdaBoostRegressor, GradientBoostingRegressor, RandomForestRegressor, SGDRegressor, and LinearSVR. Our computational resources limited our ability to test more computationally intensive models such as neural networks; with access to more powerful machines, exploring these models could yield better results.
Model Evaluation: We assessed performance using Mean Squared Error (MSE), Mean Absolute Error (MAE), and the Coefficient of Determination (R²), achieving our best result (MAE = 2.6 m, R² = 0.42) with LinearSVR.
Model Optimization: We also applied cross-validation and GridSearchCV for hyperparameter tuning, and combined LinearSVR with a Nystroem kernel approximation so that the linear model could capture non-linear relationships.
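
The workflow described above (a split by well ID, a Nystroem plus LinearSVR pipeline, a grid search, and the three metrics) can be sketched with scikit-learn as follows. This is an illustrative sketch, not the code in notebooks/machine_learning: the data is synthetic, and the scaler, parameter grid, number of Nystroem components, and use of GroupKFold inside the search are all assumptions made for the example.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, GroupKFold, GroupShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

rng = np.random.default_rng(0)

# Synthetic stand-in data: 36 hypothetical wells, 20 rows each, 13 features.
n_wells, rows_per_well, n_features = 36, 20, 13
well_ids = np.repeat(np.arange(n_wells), rows_per_well)
X = rng.normal(size=(n_wells * rows_per_well, n_features))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=len(X))

# Hold out whole wells, so that no well contributes rows to both sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.17, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=well_ids))

# Nystroem approximates an RBF kernel feature map; LinearSVR is then fit on it.
pipeline = make_pipeline(
    StandardScaler(),
    Nystroem(kernel="rbf", n_components=50, random_state=0),
    LinearSVR(max_iter=10000, random_state=0),
)

# Small illustrative grid; the cross-validation folds are also grouped by well.
param_grid = {"nystroem__gamma": [0.01, 0.1], "linearsvr__C": [0.1, 1.0]}
search = GridSearchCV(
    pipeline,
    param_grid,
    cv=GroupKFold(n_splits=3),
    scoring="neg_mean_absolute_error",
)
search.fit(X[train_idx], y[train_idx], groups=well_ids[train_idx])

# Evaluate the best pipeline on the held-out wells.
pred = search.predict(X[test_idx])
mae = mean_absolute_error(y[test_idx], pred)
mse = mean_squared_error(y[test_idx], pred)
r2 = r2_score(y[test_idx], pred)
print(search.best_params_, f"MAE={mae:.3f} MSE={mse:.3f} R2={r2:.3f}")
```

Grouping both the train/test split and the cross-validation folds by well keeps hyperparameter selection from being tuned on wells the model has already seen.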

(back to top)

Visualization

Visualization Data: We first defined an area of interest within QGIS and split it into ~2874 points, each representing a 500 m pixel. We then gathered feature data for each of these points (see data/final_dataset), processed and compiled it as before (see notebooks/gambia_dataset), and ran it through the LinearSVR model (see notebooks/gambia_dataset/LinearSVR_final_dataset.ipynb) to get predicted groundwater levels. Note: we used 500 m resolution only because of time constraints; higher resolutions would otherwise have been entirely feasible.
Visualization Creation: Once we had the ML model results, we used IDW (Inverse Distance Weighting) interpolation in QGIS to refine the resolution to about 177 m, and exported the data to a CSV file. We then uploaded the CSV to kepler.gl, put together our interactive visualization, exported it to an HTML file, and customized it to create our Aqua Viva website.
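
For readers unfamiliar with IDW, the idea can be sketched in a few lines of NumPy. We performed this step in QGIS, so the function below is only an illustration of the technique: the function name, the power of 2, and the sample points are assumptions made for the example.

```python
import numpy as np


def idw_interpolate(known_xy, known_values, query_xy, power=2.0, eps=1e-12):
    """Estimate values at query points as a distance-weighted mean of known points.

    Each known point gets weight 1 / distance**power, so nearby points dominate.
    """
    known_xy = np.asarray(known_xy, dtype=float)
    known_values = np.asarray(known_values, dtype=float)
    query_xy = np.asarray(query_xy, dtype=float)

    # Pairwise distances between query and known points, shape (n_query, n_known).
    dists = np.linalg.norm(query_xy[:, None, :] - known_xy[None, :, :], axis=2)

    # Clamp tiny distances so a query on top of a known point does not divide by zero.
    weights = 1.0 / np.maximum(dists, eps) ** power
    return (weights @ known_values) / weights.sum(axis=1)


# Hypothetical predicted groundwater levels at the four corners of a unit square.
corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
levels = [1.0, 2.0, 3.0, 4.0]

# The center is equidistant from all corners, so it gets their plain mean;
# a query placed on a corner essentially reproduces that corner's value.
estimates = idw_interpolate(corners, levels, [(0.5, 0.5), (0.0, 0.0)])
print(estimates)
```

Because IDW is a weighted average, interpolated values always stay within the range of the input values; it smooths between model predictions rather than adding new information.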

(back to top)

Future Work

Given the enormous potential scale of this project, and the fact that we were just 4 people working on it for about a month, much remains to be done:

Model verification. Although our model was trained on the best open-source data we could find, it was still limited (6600 data points across 36 wells). Despite our best efforts and what we believe to be reasonably accurate results, groundwater level is still a very complex variable to predict and this project would benefit greatly from more data to verify/improve our model.
Streamline model usage. Our current process of getting feature data, running it through our model, and visualizing the results is only a rough first pass. An important next step would be to create a tool (perhaps a single Jupyter notebook) that streamlines this process and allows the user to adjust parameters easily.
Time series data. Due to time constraints, we only visualized data for one day (December 1, 2023). Especially once model usage is streamlined, it will be much easier to visualize time series data, which would be very useful for evaluating changes in groundwater level over time.
Near real-time data. It is entirely possible to create a tool that automatically retrieves near real-time data, runs it through our model, and outputs data for visualization. Such a tool could be used for groundwater monitoring.
Expand area of interest. Again, due to time constraints, we narrowed our focus to a smaller (but still high-impact) region of Gambia. With more time, it would be relatively straightforward to create a visualization for all of Gambia. We do not know whether the model can be extrapolated to other regions of the world, but we think it might be successful in regions with a biome similar to Gambia's. More work should be done to verify this.

(back to top)

Contributing

Whether you would like to help with any of the future work outlined above, add your own data/ML models, or have any other ideas/suggestions - all contributions are welcome and encouraged! Simply fork the repo and create a pull request. You can also open an issue with the tag "enhancement". Thanks in advance for your contributions, and feel free to contact us with questions!

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request

(back to top)

Built With

(back to top)

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

Contact

(back to top)

Acknowledgements

Thank you to TANGO (The Association of Non-Governmental Organizations in the Gambia) for their insights and for connecting us with community leaders in the Gambia.
Thank you to Jun Yuan Zhang for his advice regarding groundwater level prediction.

(back to top)

References

(back to top)
