Tanzania_water_wells_project

BUSINESS UNDERSTANDING

Overview:

We are working on a project to address Tanzania's clean water problem. Recent World Bank data show that Tanzania has a population of about 60 million. According to Nsemwa (2022), many Tanzanians continue to struggle with insufficient or limited access to clean and safe water. Only 30.6% of Tanzanian households use recommended water treatment methods, and only 22.8% have adequate hand-washing facilities (Ministry of Health report, 2019). Poor sanitation is estimated to cause 432,000 diarrhea-related deaths per year and is a major contributor to several Neglected Tropical Diseases (NTDs) such as intestinal worms, schistosomiasis, and trachoma. Malnutrition is also made worse by poor sanitation (WHO, 2019).

Problem:

Tanzania is the largest country in East Africa, with a population of 59,353,795 according to worldometers.info. About 25 million of these people lack access to clean water, and 40 million lack access to improved sanitation. Water is a basic human need. The Tanzanian Water Ministry aimed to solve this problem by improving clean water sources. Many water wells have already been established, but some of them are non-functional or need repair.

Goal:

Our goal in this project is to build a model that predicts the functionality of water points. With this predictive model, authorities can identify which water points are functional, non-functional, or functional but in need of repair. The model can help the Tanzanian government find wells that are likely to need maintenance and can provide useful information for planning future wells. It can also help the Tanzanian authorities use water sources productively and guide government investment in wells.

Project Metric of Success:

The success metric is an accuracy score of 70-75%.
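As a reminder of what this metric measures, accuracy is the share of water points whose predicted status matches the true status. The sketch below computes it by hand on five made-up labels; the `accuracy` helper and the example labels are illustrative assumptions, not project results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels for five water points (illustrative only).
y_true = ["functional", "non functional", "functional",
          "functional needs repair", "functional"]
y_pred = ["functional", "non functional", "non functional",
          "functional", "functional"]

score = accuracy(y_true, y_pred)   # 3 of 5 predictions are correct
meets_target = score >= 0.70       # lower bound of the 70-75% target
```

Because accuracy counts every class equally per row, it can look healthy even when a rare class is rarely predicted correctly, which is worth keeping in mind for an imbalanced target.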

DATA UNDERSTANDING

Data:

The original data was obtained from the DrivenData 'Pump it Up: Data Mining the Water Table' competition. There are four data sets: the submission format, the training set, the test set, and the training labels set, which contains the status of the wells. Given the training set and its labels, competitors are asked to build a predictive model, apply it to the test set to determine the status of the wells, and submit the results. In this project, we used the training set and the training labels set, which together describe 59,400 water points with 40 features.
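A minimal sketch of how the training set and the training labels set could be combined into one table is shown below. It assumes both files share an `id` column and that the label column is named `status_group`; the tiny in-memory frames stand in for the downloaded CSV files, whose local file names are not given here.

```python
import pandas as pd

# Stand-ins for the two competition files. In practice each frame would come
# from pd.read_csv(...) on the downloaded training values and training labels.
train_values = pd.DataFrame({
    "id": [1, 2, 3],
    "amount_tsh": [0.0, 50.0, 0.0],
    "gps_height": [1390, 0, 686],
})
train_labels = pd.DataFrame({
    "id": [3, 1, 2],
    "status_group": ["non functional", "functional", "functional needs repair"],
})

# Join on the shared key; validate= raises if an id appears more than once
# in either table, which guards against silently duplicated rows.
df = train_values.merge(train_labels, on="id", how="inner", validate="one_to_one")
```

Merging on the key rather than concatenating side by side keeps each label attached to the right water point even if the two files are ordered differently.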

Plan:

Understanding Data

Cleaning and Exploring Data

Preparing Data for Modeling

Finding Binary Model & Baseline

Ternary Target Modeling
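The last two steps of the plan differ only in how the target is framed. The sketch below shows one way to derive both framings from a three-class status column. The column name `status_group` and the choice to fold "functional needs repair" into the positive binary class are assumptions made for illustration; the notebooks may define the binary target differently.

```python
import pandas as pd

# Hypothetical status labels for five water points.
status = pd.Series(
    ["functional", "non functional", "functional needs repair",
     "functional", "non functional"],
    name="status_group",
)

# Binary framing (assumption): 1 = needs attention, 0 = fully functional.
binary_target = (status != "functional").astype(int)

# Ternary framing: one integer code per class, assigned in alphabetical order.
ternary_target = status.astype("category").cat.codes
```

A binary model of this kind gives a simple baseline to beat before moving on to the harder three-class problem.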

Understanding Data:

The data has 59,400 rows and 41 columns. Of those 41 columns, 10 are numeric and the remaining 31 are string columns, stored as the object dtype in pandas. After inspecting the data with the .info method and the descriptive statistics of the numeric columns, we made some critical observations that will help us analyze the data and build an efficient model. After going through the variable descriptions and performing the preliminary data inspection, we discovered that some columns provide the same information as others, which makes them irrelevant to this study. In total, 21 columns will not be used in this investigation and were deemed 'repetitive and not useful'.
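The inspection described above can be sketched as follows. The small frame is a made-up stand-in for the real data, and `is_redundant` is a hypothetical helper that flags two columns as carrying the same information when their values map one to one.

```python
import pandas as pd

# Made-up sample with two numeric columns and three text columns.
df = pd.DataFrame({
    "amount_tsh": [0.0, 25.0, 0.0, 500.0],
    "gps_height": [1390, 1399, 686, 263],
    "quantity": ["enough", "insufficient", "enough", "dry"],
    "quantity_group": ["enough", "insufficient", "enough", "dry"],
    "status_group": ["functional", "non functional",
                     "functional", "non functional"],
})

df.info()  # prints dtypes and non-null counts per column

numeric_cols = df.select_dtypes(include="number").columns.tolist()
text_cols = df.select_dtypes(exclude="number").columns.tolist()
numeric_summary = df[numeric_cols].describe()


def is_redundant(frame, col_a, col_b):
    """True when values of col_a and col_b map one to one in both directions."""
    a_to_b = frame.groupby(col_a)[col_b].nunique().max() == 1
    b_to_a = frame.groupby(col_b)[col_a].nunique().max() == 1
    return bool(a_to_b and b_to_a)


duplicate_info = is_redundant(df, "quantity", "quantity_group")
```

A one-to-one check like this only finds exact relabelings; columns that are coarser groupings of another column still need a manual look at the variable descriptions.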

Data Preparation

Data must be prepared before modelling to improve the model's efficiency and to avoid misleading insights. In this phase of the investigation, the study looks at missing values, duplicated entries, inconsistencies and invalid data. We first dropped the irrelevant columns identified during data understanding, since there is no need to prepare columns that will not be used in the study.
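A minimal sketch of these preparation steps is given below, assuming a small made-up frame. The list of dropped columns, the "unknown" fill value, and the treatment of a construction year of 0 as a placeholder for a missing value are all illustrative assumptions rather than the project's documented choices.

```python
import pandas as pd

# Made-up sample: rows 2 and 3 are exact duplicates, and one funder is missing.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "funder": ["Government Of Tanzania", None, None, "Danida"],
    "quantity": ["enough", "dry", "dry", "enough"],
    "quantity_group": ["enough", "dry", "dry", "enough"],
    "construction_year": [1999, 0, 0, 2008],
})

# 1. Drop columns judged redundant during data understanding (assumed list).
redundant_cols = ["quantity_group"]
clean = df.drop(columns=redundant_cols)

# 2. Remove fully duplicated rows.
clean = clean.drop_duplicates()

# 3. Handle missing and invalid values.
clean["funder"] = clean["funder"].fillna("unknown")
# Assumption: a year of 0 is a placeholder, so mark it as missing.
clean["construction_year"] = clean["construction_year"].mask(
    clean["construction_year"] == 0
)

missing_per_column = clean.isna().sum()
```

Dropping the unused columns first keeps the later missing-value and duplicate checks focused on the columns that actually reach the model.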

Exploratory Data Analysis

In this phase of the investigation, the study looks at trends and patterns, using visualizations and statistics to show the relationships between the variables in the data.

Univariate, bivariate and multivariate analyses were performed on the data.
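The three levels of analysis can be sketched with pandas alone, as below. The six rows are made up, and the column names (`waterpoint_type`, `payment`, `extraction_type_class`, `gps_height`, `status_group`) are assumed to match the competition data.

```python
import pandas as pd

# Made-up sample of six water points.
df = pd.DataFrame({
    "waterpoint_type": ["communal standpipe", "communal standpipe", "hand pump",
                        "communal standpipe", "hand pump", "other"],
    "payment": ["never pay", "never pay", "pay monthly",
                "pay monthly", "never pay", "never pay"],
    "extraction_type_class": ["submersible", "submersible", "handpump",
                              "gravity", "handpump", "other"],
    "gps_height": [10, 30, 900, 1500, 700, 0],
    "status_group": ["non functional", "non functional", "functional",
                     "functional", "functional", "non functional"],
})

# Univariate: how often each waterpoint type occurs.
type_counts = df["waterpoint_type"].value_counts()

# Bivariate: share of each status within every payment category.
status_by_payment = pd.crosstab(
    df["payment"], df["status_group"], normalize="index"
)

# Multivariate: median altitude by extraction type and status.
height_by_type_and_status = (
    df.groupby(["extraction_type_class", "status_group"])["gps_height"].median()
)
```

Normalizing the crosstab by row makes payment categories of very different sizes comparable, which matters when one category dominates the data.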

EDA Conclusion

For extraction type, submersible has the largest amount of water available in the data, despite not being located at the greatest heights. Motorpumps and submersible pumps are generally located in low-altitude areas, possibly because the water table is nearer to the surface there than at higher altitudes. For a handpump to have an amount of water above the median total static head, it needs to be more than 500 meters above sea level.

The observations made through this analysis can be used to provide recommendations to the government on where more wells are needed and on the requirements of the location.

In conclusion, the analysis shows that the entity funding the most waterpoints is the Government of Tanzania, which is also the second most frequent installer of waterpoints, so it makes sense that it is the chosen stakeholder for this investigation.

The analysis also finds that the communal standpipe is the most popular waterpoint type around the country. Most of the wells that are never paid for were found to be non-functional.

Preparing Data for Modeling:

To prepare our data for machine learning, we did some feature engineering, encoding and scaling.
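One common way to combine encoding and scaling is a scikit-learn `ColumnTransformer`, sketched below on a made-up frame. The choice of `OneHotEncoder` and `StandardScaler`, and the two example columns, are assumptions; the notebooks may use different encoders or scalers.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up sample with one numeric and one categorical feature.
X = pd.DataFrame({
    "gps_height": [1390.0, 0.0, 686.0, 263.0],
    "quantity": ["enough", "dry", "enough", "insufficient"],
})

preprocessor = ColumnTransformer(
    transformers=[
        # Scale numeric features to zero mean and unit variance.
        ("num", StandardScaler(), ["gps_height"]),
        # One-hot encode categories; unseen categories become all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["quantity"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

X_prepared = preprocessor.fit_transform(X)
```

Scaling matters in particular for distance-based models such as K-nearest neighbours, where a feature measured in metres would otherwise dominate the 0/1 encoded columns.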

Findings:

Authorities should recheck the wells that they funded.

New techniques must be found to feed dry wells and to repair damaged ones.

More detailed findings can be found in the notebooks with explorations.

Future Improvements:

Feature engineering on categorical columns would be a good way to handle the first challenge.

The imbalanced target problem will be addressed using SMOTE.
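To illustrate the idea behind SMOTE, the sketch below creates synthetic minority-class points by interpolating between a minority sample and one of its nearest minority neighbours. It is a simplified NumPy version written for illustration, not the implementation a project would normally use (the `imbalanced-learn` package provides a full `SMOTE` class); the function name `smote_sketch` and its parameters are hypothetical.

```python
import numpy as np


def smote_sketch(X_minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from minority-class samples.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)

    # Pairwise Euclidean distances; a sample is never its own neighbour.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                      # starting samples
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                               # position on segment
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])


# Hypothetical minority-class samples (e.g. wells that need repair).
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_min, n_new=6)
```

Oversampling of this kind should be applied to the training split only, so that the test set still reflects the real class proportions.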

The best overall model was a K-nearest neighbours classifier with a testing accuracy score of 71%.
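For readers who want to see the shape of such a model, the sketch below fits a K-nearest neighbours classifier and reports a testing accuracy. It runs on synthetic three-class data from `make_classification`, not on the wells data, and `n_neighbors=5` is an assumed setting, so its score says nothing about the 71% reported above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the prepared wells features.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# Hold out a stratified test split so class proportions are preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scale inside the pipeline so the scaler is fitted on training data only.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, model.predict(X_test))
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training, which would otherwise inflate the testing accuracy.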
