WRI Creating a well-being data layer.

This project contains the code, papers, and deliverables for the DSSG project with the World Resources Institute (WRI): creating a well-being data layer using machine learning, satellite imagery, and ground-truth data.

In the long term, we are building a tool that can be extended to predict the wealth and economic factors of any given area in India. More information on the architecture and implementation is given below.

Project Scope
The Data
Methodology
Results
Development
Project Organization
Bibliography

Project Scope

Conducting economic surveys requires huge resources, so acquiring the same information from publicly available data with open-source technologies opens the possibility of replacing current processes. Satellite images can act as a proxy for existing data collection techniques, such as surveys and censuses, to predict the economic well-being of a region.

The project aims to propose an alternative to Demographic Health Surveys using open-source data such as OpenStreetMap, Sentinel, and night light data.

The Data

Demographic Health Surveys

Demographic Health Surveys collect information on population, health, and nutrition for each state and union territory. They are jointly funded by the United States Agency for International Development (USAID), the United Kingdom Department for International Development (DFID), the Bill and Melinda Gates Foundation (BMGF), and the United Nations. The datasets used in this project were obtained from the dhsprogram website.

The dataset was explored manually as well as through the Pandas Profiling library.

Box and Violin plots were used to make the following observations:

The wealth index followed an almost perfectly normal distribution.
Population density was found to have a positive correlation with the wealth index.
Wealth and electricity usage are correlated (figure 1).
The distribution of roof materials is highly diffused (figure 2).
Richer families prefer flush toilets (figure 3).
The distribution of water source is highly diffused (figure 4).

The figures above visualize the wealth distributions across several categorical features found in the dataset.

All the images are available in the images folder and in the (DSSG/WRI) DHS Analysis.ipynb notebook.
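
The box-plot comparisons listed above come down to per-category quartiles of the wealth index. The following is a minimal sketch of that summary using only the standard library; the records and the toilet-type categories are made up for illustration and are not taken from the DHS dataset.

```python
from collections import defaultdict
from statistics import median, quantiles

# Made-up (wealth_index, toilet_type) records, for illustration only;
# they are NOT taken from the DHS dataset.
records = [
    (1, "none"), (1, "none"), (2, "none"), (2, "pit"), (3, "pit"),
    (3, "pit"), (4, "pit"), (4, "flush"), (5, "flush"), (5, "flush"),
    (4, "flush"), (2, "none"),
]


def summarize_by_category(rows):
    """Return {category: (q1, median, q3)} of the wealth index."""
    groups = defaultdict(list)
    for wealth, category in rows:
        groups[category].append(wealth)
    summary = {}
    for category, values in groups.items():
        q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
        summary[category] = (q1, median(values), q3)
    return summary


summary = summarize_by_category(records)
for category, (q1, med, q3) in sorted(summary.items()):
    print(f"{category:>5}: Q1={q1:.2f} median={med:.2f} Q3={q3:.2f}")
```

A box plot draws exactly these three numbers per category, so a shift in the medians between categories is what reads as "richer families prefer flush toilets" in the figures.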

Open Street Maps Data

OpenStreetMap (OSM) is an open-source project that crowdsources the world map and makes it available free of cost. The data quality is generally seen as reliable, although it varies across the world.

A Python module osm_data_extraction was implemented to extract OSM data given the GADM Level 3 shapefile and a district name. The module uses OSMnx, which interacts with OpenStreetMap's API to get the relevant data for a specific region, and stores the result in a CSV file. An example usage of this module can be found in the notebook araria_district.ipynb.

Due to computing resource constraints, the area of study was restricted to the Araria district of Bihar state.

Night-Time Light Data

Nighttime light (NTL) data can highlight areas of greater economic activity, as these regions tend to be relatively more lit. Image data for this approach was obtained via Google Earth Engine (GEE). GEE provides a quickly accessible collection of images captured across time periods, wavelengths, and satellite systems.

The data is open and free to use for non-commercial use cases. The first approach of this project explored the usefulness of the GEE interface and the monthly NTL images (from the mines dataset [3]).

The second approach used another data stream (NASA Black Marble [4]) to look at the daily variability of the data. Both approaches were useful for understanding the different flavors of NTL data and how these data sources could be utilized in future projects.

A Python module ntl_data_extraction and a command-line app download-nightlights were implemented to download the night light data for a given district and date range. The implementation uses modapsclient, a RESTful client for NASA's MODIS Adaptive Processing System (MODAPS). The Python module also implements a method to convert the HDF5 files (the native format) to GeoTIFF files, after which the daily NTL intensity tiles are available for further processing.

The project area (continental India) is covered by 7 (or 8) tiles of 10x10 degrees, each 2400x2400 cells. To match the temporal window of the project (2013-2017, 2 years around the DHS 2015 census for India), the repository would need to hold more than 1825 NTL data layers (4 MB per HDF5 file, 10 MB per GeoTIFF image). The difference in disk size between HDF5 and GeoTIFF comes from compression and data type: HDF5 files are optimized for storage and contain the data quality flags in addition to the light intensity values.

The team used NASA's VIIRS/NPP Lunar BRDF-Adjusted Nighttime Lights data with a spatial resolution of 500 m.
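
The figures quoted above can be checked with a little arithmetic. The sketch below rests on three assumptions that the text does not state: the window runs inclusively from 1 January 2013 to 31 December 2017, one file is needed per tile per day, and one degree is roughly 111.32 km.

```python
from datetime import date

# Figures quoted in the text
TILE_SIDE_DEG = 10        # each tile spans 10 x 10 degrees
CELLS_PER_SIDE = 2400     # 2400 x 2400 cells per tile
TILES_OVER_INDIA = 7      # the text says 7 (or 8) tiles
MB_PER_NATIVE_FILE = 4
MB_PER_GEOTIFF = 10

# Assumption: inclusive window from 1 Jan 2013 through 31 Dec 2017
start, end = date(2013, 1, 1), date(2017, 12, 31)
days = (end - start).days + 1

# Cell size in degrees, arc-seconds, and approximate metres
cell_deg = TILE_SIDE_DEG / CELLS_PER_SIDE
cell_arcsec = cell_deg * 3600
cell_m = cell_deg * 111_320  # assumption: ~111.32 km per degree

# Assumption: one file per tile per day
files = days * TILES_OVER_INDIA
native_gb = files * MB_PER_NATIVE_FILE / 1000
geotiff_gb = files * MB_PER_GEOTIFF / 1000

print(f"daily layers in window: {days}")
print(f"cell size: {cell_arcsec:.1f} arc-seconds (~{cell_m:.0f} m)")
print(f"files: {files}, native ~{native_gb:.1f} GB, GeoTIFF ~{geotiff_gb:.1f} GB")
```

Under these assumptions the window holds slightly more than 1825 daily layers because 2016 is a leap year, and the 15 arc-second cell is consistent with the nominal 500 m resolution.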

The data was explored, but because of constraints on computational resources and time it was not integrated with the other data sources and hence not used for solution building. We also concluded that future computations should use annual composites of the night light data sets from the mines data repository [3], which would reduce the need for large amounts of computational resources.

The implementation of the osm_data_extraction and ntl_data_extraction modules would be crucial to scaling the data processing pipeline for the rest of India or any other country in the world.

Project Methodology

Data Preparation

Socio-economic indicators are important measures to assess spatial or societal dimensions. Although DHS data collected by governmental and non-governmental organizations are reliable and comprehensive choices to describe societal phenomena, for India they are available only at intervals of several years. In this project, a machine learning-based approach is used to predict the wealth index from OSM data, with DHS data as the ground truth.

Voronoi tessellations appeared in Snow's classic treatise on the 1854 cholera epidemic in London, in which he demonstrated that proximity to a particular well was strongly correlated with deaths due to the disease. They continue to be a very useful tool for studying demographics and territorial systems, and for estimating average rainfall in a region, among other applications. In our case, we use a generalization of Voronoi diagrams called Laguerre-Voronoi tessellations, also known as power diagrams, to tessellate the DHS dataset IAGE71FL.zip. Laguerre-Voronoi diagrams partition the Euclidean plane into polygonal cells defined from a set of circles.
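
As a minimal sketch of the definition, a point belongs to the cell of the circle with the smallest power distance, that is, the squared distance to the circle's center minus the squared radius. The circles, coordinates, and function names below are hypothetical and are not taken from the project's implementation.

```python
from math import inf


def power_distance(point, center, radius):
    """Squared Euclidean distance to the center minus the squared radius."""
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    return dx * dx + dy * dy - radius * radius


def laguerre_cell(point, circles):
    """Return the id of the circle whose power cell contains `point`."""
    best_id, best_power = None, inf
    for circle_id, (center, radius) in circles.items():
        power = power_distance(point, center, radius)
        if power < best_power:
            best_id, best_power = circle_id, power
    return best_id


# Hypothetical circles in planar kilometres: id -> ((x, y), radius)
circles = {
    "A": ((0.0, 0.0), 5.0),
    "B": ((10.0, 0.0), 2.0),
}
# The same centers with zero radii give an ordinary Voronoi diagram
unweighted = {k: (c, 0.0) for k, (c, r) in circles.items()}

query = (6.0, 0.0)
weighted_cell = laguerre_cell(query, circles)
plain_cell = laguerre_cell(query, unweighted)
print(f"power diagram: {weighted_cell}, plain Voronoi: {plain_cell}")
```

The query point is closer to the center of B, yet it falls in the cell of A because A's larger radius pushes the cell boundary outward. This is the effect that distinguishes a power diagram from an ordinary Voronoi diagram.
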
The motivation for using this specific generalization comes from a peculiarity of the DHS data. DHS surveys contain confidential information that could potentially be used to identify an individual. To avoid this, every DHS survey records only the center GPS coordinate of the populated place in a cluster and then displaces it by a random error of at most 5 km in rural areas and at most 2 km in urban areas. All the details of the implementation of this tessellation on the DHS data can be found in the following places:

Combine DHS and OSM Data: In the next step we combine the weighted Voronoi GeoDataFrame specific to a district with the OSM vector data of the same district using the following strategy:

Every Voronoi cell has a unique DHS cluster id.
Any point in the OSM vector data gets the DHS cluster-id depending on the Voronoi cell it belongs to.
Any polygon from the OSM vector data gets the cluster id of the Voronoi cell it is contained in.
Any polyline from the OSM vector data is split into several line segments belonging to different Voronoi cells and these line segments get the DHS cluster-id depending on the Voronoi cell they belong to.

This pipeline was partially implemented in the araria_voronoi.ipynb notebook.
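
The point rule in the strategy above can be sketched without a GIS library by using an even-odd ray-casting test. This is a dependency-free illustration, not the notebook's implementation; the cell polygons, cluster ids, and OSM points are hypothetical.

```python
def point_in_polygon(point, polygon):
    """Even-odd ray casting; polygon is a list of (x, y) vertices."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_cluster(point, cells):
    """Return the DHS cluster id of the cell containing the point, or None."""
    for cluster_id, polygon in cells.items():
        if point_in_polygon(point, polygon):
            return cluster_id
    return None


# Hypothetical Voronoi cells keyed by DHS cluster id
cells = {
    101: [(0, 0), (2, 0), (2, 2), (0, 2)],
    102: [(2, 0), (4, 0), (4, 2), (2, 2)],
}
# Hypothetical OSM point features
osm_points = {"school": (1.0, 1.0), "clinic": (3.0, 0.5), "well": (5.0, 1.0)}

assignments = {name: assign_cluster(p, cells) for name, p in osm_points.items()}
print(assignments)
```

Points outside every cell get None, which makes features that fall outside the tessellated district easy to spot. Polygons and polylines need containment and splitting logic that a geometry library handles more robustly than a hand-written test.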

Combine DHS and NTL Data: Similar to the techniques used to match the OSM data to DHS clusters, a method will have to be developed to aggregate the NTL data to the appropriate DHS cluster. We recommend using the same weighted Voronoi polygons when computing "Zonal Statistics", a spatial operation designed to retrieve key statistics by area (polygons) from raster images.
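
Since this step was not implemented, the following is only a sketch of what a zonal mean could look like. It assumes the Voronoi polygons have already been rasterized into a zone grid with the same shape as the night-light grid; all values, ids, and the nodata marker are made up.

```python
from collections import defaultdict


def zonal_mean(values, zones, nodata=None):
    """Mean raster value per zone id; both grids share the same shape."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            if zone is None or value == nodata:
                continue  # skip cells outside any zone or without data
            totals[zone] += value
            counts[zone] += 1
    return {zone: totals[zone] / counts[zone] for zone in totals}


# Hypothetical 3 x 4 night-light raster (radiance values are made up)
ntl = [
    [1.0, 2.0, 8.0, 9.0],
    [1.0, 3.0, 7.0, -1.0],
    [2.0, 3.0, 6.0, 10.0],
]
# Zone raster: the DHS cluster id of each cell (None = outside all cells)
zones = [
    [101, 101, 102, 102],
    [101, 101, 102, 102],
    [101, 101, 102, None],
]

means = zonal_mean(ntl, zones, nodata=-1.0)
print(means)
```

The same loop extends to sums, maxima, or counts, which are the other statistics commonly reported per zone.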

Evaluation Strategy

We tend to judge a model's generalization error by the gap between its performance on training and test data [5]. For this reason, it is important to partition the dataset in a way that resembles the desired production environment. Because the dataset contains few samples, we performed leave-one-out cross-validation (LOOCV), a configuration of k-fold cross-validation where k is set to the number of examples in the dataset.

LOOCV is computationally expensive because it trains one model per example, but it yields a reliable and nearly unbiased estimate of model performance.
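
The procedure can be sketched in a few lines. The example below uses a stand-in model that always predicts the training mean, so that it runs without external libraries; it is not the project's estimator, and the data is made up.

```python
from statistics import mean


def loocv_mae(features, targets, fit, predict):
    """Leave-one-out CV: train on n-1 samples, test on the held-out one."""
    errors = []
    for i in range(len(targets)):
        train_x = features[:i] + features[i + 1:]
        train_y = targets[:i] + targets[i + 1:]
        model = fit(train_x, train_y)
        errors.append(abs(predict(model, features[i]) - targets[i]))
    return mean(errors)


# Stand-in model (an assumption for illustration): predict the training mean
def fit_mean(train_x, train_y):
    return mean(train_y)


def predict_mean(model, x):
    return model


# Made-up feature rows and scaled wealth labels
X = [[3], [5], [8], [13]]
y = [0.0, 0.25, 0.5, 0.75]

score = loocv_mae(X, y, fit_mean, predict_mean)
print(f"LOOCV mean absolute error: {score:.4f}")
```

Any estimator can be plugged in through the `fit` and `predict` callables. The loop trains as many models as there are samples, which is where the computational cost comes from.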

Label Transformation.

The original wealth index provided by the DHS data is a category from 1 to 5 describing the wealth level of a certain district: {1: Poorest, 2: Poorer, 3: Middle, 4: Richer, 5: Richest}. Although this label might at first look like a multiclass classification target, it is a continuous feature that has been post-processed and binned into categories.

We treat this problem as a regression task whose output then needs to be binned again in the post-processing part of the ML pipeline. Meanwhile, we use the Mean Absolute Error as an intuitive evaluation metric.

We rescaled the label to the range [0, 1] (min-max normalization) so that it sits on a fixed, unit-free scale.
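
A minimal sketch of this label handling follows. Mapping a prediction back to the nearest wealth class is an assumption about the post-processing step, and the model outputs are hypothetical.

```python
def scale_label(wealth_class, low=1, high=5):
    """Min-max scale a wealth class in {1..5} to the range [0, 1]."""
    return (wealth_class - low) / (high - low)


def to_wealth_class(prediction, low=1, high=5):
    """Map a regression output back to the nearest wealth class (assumed rule)."""
    raw = low + prediction * (high - low)
    # int(raw + 0.5) rounds halves up; the clamp keeps the class within 1..5
    return min(high, max(low, int(raw + 0.5)))


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


NAMES = {1: "Poorest", 2: "Poorer", 3: "Middle", 4: "Richer", 5: "Richest"}

true_classes = [1, 2, 3, 5]
y_true = [scale_label(c) for c in true_classes]
y_pred = [0.0625, 0.25, 0.75, 0.9375]  # hypothetical model outputs

predicted_classes = [to_wealth_class(p) for p in y_pred]
mae = mean_absolute_error(y_true, y_pred)
print([NAMES[c] for c in predicted_classes])
print(f"MAE on the scaled label: {mae}")
```

Because the label spans [0, 1], an MAE of 0.25 corresponds to being off by one wealth class on average.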

The accompanying map shows the scaled label for the Araria district, where 0 is poor and 1 is rich. One area has a noticeably higher wealth than the rest.

Explainable Machine learning pipeline

Due to the possible impact of this project on public policy, we advocate for an explainable ML approach [6].

For the modeling part, we ran a set of experiments to determine which machine learning estimator to use. The estimator selected for this part of the project was a decision tree, because of its performance and because the data we are working with at this stage is relatively small. This model also allows us to understand how the ML decisions are made.
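
To illustrate why a tree is easy to inspect, here is a sketch of a depth-one regression tree (a decision stump) whose single rule can be printed and read. It is a simplified stand-in for the project's decision tree, and the feature name and data are made up.

```python
def fit_stump(xs, ys):
    """Find the threshold on one feature that minimizes squared error."""
    best = None
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive feature values
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def describe(stump, feature_name):
    """Render the learned rule as a human-readable sentence."""
    threshold, left_mean, right_mean = stump
    return (f"if {feature_name} <= {threshold:g}: predict {left_mean:.2f}; "
            f"else: predict {right_mean:.2f}")


# Made-up data: OSM road segments per cell vs. scaled wealth label
road_segments = [2, 3, 4, 10, 12, 14]
wealth = [0.0, 0.25, 0.25, 0.75, 0.75, 1.0]

stump = fit_stump(road_segments, wealth)
print(describe(stump, "road_segments"))
```

A full decision tree repeats this split search recursively on each side, so every prediction can still be traced back to a short chain of threshold rules.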

Results

After preprocessing the data the following results were obtained.

In the plot of the difference in wealth distribution, we can see where our model achieves the best results and where it fails. This visualization can help build trust in the model, since metrics alone do not always give users an understanding of a model's performance.

We can note in the heatmap that most of the errors are between [0.1, 0.2]%. This percentage error in the predictions is low and provides preliminary evidence that a model trained on a larger dataset should be able to reach sufficiently high quality.

Conclusions

A predictive model that:

Uses open-source data
Does not require cloud computing resources

Modeling remains explainable and accountable while preserving accuracy

Explainable and Interpretable machine learning
Accountable. It should be possible to trace the logical reasons why a decision was taken.
High Generalization: Simple models tend to have a higher generalization than complex models [1]

Deliverables

The project had the following deliverables:

Project final presentation
A report with an extensive analysis of the methodology followed.
A separate module of the weighted Voronoi implementation in gis-laguerre
A corresponding article on weighted Voronoi for knowledge dissemination in Towards Data Science

Future Work

Future work steps:

Scaling Up: Due to limited computational resources, we worked on only one district of India. The model's performance at state and national levels remains to be evaluated.
Integrate with NTL: One further data integration that should be helpful is the night light data. In theory, this data should improve accuracy in areas where OSM data is scarce.
Temporal Evaluation: As the goal of the project is to prevent what will happen in the future with forests, we need to ensure that the model will generalize as time goes by.

Project Organization

Bibliography

[1] Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead: https://www.nature.com/articles/s42256-019-0048-x
[2] Interpretable Machine Learning: A Guide for Making Black Box Models Explainable: https://christophm.github.io/interpretable-ml-book/
[3] https://eogdata.mines.edu/products/vnl/
[4] https://viirsland.gsfc.nasa.gov/Products/NASA/BlackMarble.html
[5] The Mythos of Model Interpretability: https://arxiv.org/abs/1606.03490
[6] Explainable Machine Learning for Public Policy: https://arxiv.org/pdf/2010.14374.pdf
