By: Asher Lewis
For this project, we are going to try to forecast the average monthly water level of the four main reservoirs of Chennai, India, using time-series data. We consider a model successful if it outperforms the baseline model. The motivation is that in 2019 Chennai experienced a water crisis that left millions of people without water and required many trains and trucks to bring water into the city. If we can forecast the monthly water level of a given reservoir, we can get an idea of how and when the city's reservoirs run out of water. This information could potentially be used later on to predict future water demand. The water level is measured in millions of cubic feet. We score our predictions using the Mean Squared Error (MSE). Water demand forecasting is hard in general, so we set a rather modest goal: our model should score a lower MSE than the baseline model, that is, an MSE closer to zero than the baseline's.
On 19 June 2019, Chennai city officials declared that "Day Zero", the day when almost no water is left, had been reached, as all four main reservoirs supplying water to the city had run dry.

As a first step in this project, we combined our two given data sets and saved the result into a new CSV file for analysis and forecasting.
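The combining step could look roughly like the sketch below. It is an illustration under assumptions, not the notebooks' actual code: it assumes both tables share a `Date` column and already use column names such as `Poondi_water_level` and `Poondi_rain`. The helper names `combine_levels_and_rain` and `monthly_average`, the output file name, and the sample values are all made up for the example.

```python
import io

import pandas as pd


def combine_levels_and_rain(levels: pd.DataFrame, rain: pd.DataFrame) -> pd.DataFrame:
    """Join the daily water-level and rainfall tables on their shared Date column."""
    # validate="one_to_one" raises an error if either table repeats a date.
    merged = levels.merge(rain, on="Date", how="inner", validate="one_to_one")
    return merged.sort_values("Date").reset_index(drop=True)


def monthly_average(daily: pd.DataFrame) -> pd.DataFrame:
    """Average every column per calendar month (rows are labeled by month start)."""
    return daily.set_index("Date").resample("MS").mean()


# Made-up sample rows for illustration only, not real measurements.
levels = pd.read_csv(
    io.StringIO("Date,Poondi_water_level\n2019-01-31,100.0\n2019-02-01,90.0\n2019-02-02,80.0\n"),
    parse_dates=["Date"],
)
rain = pd.read_csv(
    io.StringIO("Date,Poondi_rain\n2019-01-31,0.0\n2019-02-01,2.0\n2019-02-02,4.0\n"),
    parse_dates=["Date"],
)

combined = combine_levels_and_rain(levels, rain)
monthly = monthly_average(combined)
print(monthly)

# Saving the combined table (hypothetical file name):
# combined.to_csv("combined_reservoir_data.csv", index=False)
```

An inner join keeps only the dates present in both tables, which seems a reasonable default when the level and rainfall records are expected to cover the same days.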
The workflow was then broken up into four separate notebooks, with this fifth one serving as the place where they all come together.
In each notebook, we analyzed trends and the nature of both the water level and rain. We then explained some elements of time-series data, such as the potential problem of data not being stationary.
After this, we split our data and modeled. We ran a baseline model on each one of the reservoirs. After that, we ran an ARIMA model on each reservoir. For the ARIMA model, we looked at the residuals and plotted the predictions.
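The split, baseline, and scoring steps can be sketched as follows. This is only a sketch under assumptions: the write-up does not say which baseline was used, so a mean-of-training-data baseline is assumed here, and the function names, the sample series, and the stand-in "model" forecast are hypothetical rather than actual ARIMA output.

```python
from statistics import fmean


def train_test_split_ordered(values, test_size):
    """Split chronologically: the last `test_size` points form the test set."""
    if not 0 < test_size < len(values):
        raise ValueError("test_size must be between 1 and len(values) - 1")
    return values[:-test_size], values[-test_size:]


def mean_baseline_forecast(train, horizon):
    """Assumed baseline: predict the training mean for every future month."""
    return [fmean(train)] * horizon


def mean_squared_error(actual, predicted):
    """Average of squared differences; 0 means a perfect forecast."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return fmean((a - p) ** 2 for a, p in zip(actual, predicted))


# Made-up monthly levels (millions of cubic feet), for illustration only.
monthly_levels = [10.0, 20.0, 30.0, 20.0, 40.0]
train, test = train_test_split_ordered(monthly_levels, test_size=2)

baseline_pred = mean_baseline_forecast(train, horizon=len(test))
baseline_mse = mean_squared_error(test, baseline_pred)

# Stand-in for a fitted model's forecast (hypothetical values, not ARIMA output).
model_pred = [22.0, 38.0]
model_mse = mean_squared_error(test, model_pred)

print(f"baseline MSE = {baseline_mse:.1f}, model MSE = {model_mse:.1f}")
print("model beats baseline:", model_mse < baseline_mse)
```

The split is chronological rather than random because shuffling a time series would let the model see future months during training.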
The workflow for this project has been divided into five notebooks: four notebooks, one for each individual reservoir, and a fifth notebook that summarizes our problems and findings.
- Main notebook
- Notebook for Chembarambakkam
- Notebook for Poondi
- Notebook for Redhills
- Notebook for Cholavaram
Feature | Type | Dataset | Description |
---|---|---|---|
Date | datetime64 | chennai_reservoir_levels.csv | The date in year, month and day |
Poondi_water_level | Float64 | chennai_reservoir_levels.csv | Water level of Poondi lake in Millions of Cubic Feet |
Cholavaram_level | Float64 | chennai_reservoir_levels.csv | Water level of Cholavaram lake in Millions of Cubic Feet |
Redhills_water_level | Float64 | chennai_reservoir_levels.csv | Water level of Redhills lake in Millions of Cubic Feet |
Chembarambakkam_water_level | Float64 | chennai_reservoir_levels.csv | Water level of Chembarambakkam lake in Millions of Cubic Feet |
Cholavaram_rain | Float64 | chennai_reservoir_rainfall.csv | Rainfall for Cholavaram lake in millimeters |
Poondi_rain | Float64 | chennai_reservoir_rainfall.csv | Rainfall for Poondi lake in millimeters |
Redhills_rain | Float64 | chennai_reservoir_rainfall.csv | Rainfall for Redhills lake in millimeters |
Chembarambakkam_rain | Float64 | chennai_reservoir_rainfall.csv | Rainfall for Chembarambakkam lake in millimeters |
Our data comes from Chennai Metro and Sewer and was gathered together on Kaggle. It contains daily data from 2004 to the end of 2019.
All of our models met the goal set in our problem statement: each one scored better than its baseline model, that is, it achieved a lower MSE.
Time-series forecasting is a very difficult task, and I wish I had more time to go in depth and draw out the patterns that are elusive and have to be teased out. The seasons define us, and the trends need to be explored.
There are many things we can do in the future, such as implementing more complex models like SARIMA and VAR. Another thing we could do is run our existing models on differenced data. We could also regularize the data.
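As a rough idea of what differencing would involve, the sketch below applies first-order differencing with pandas to a made-up monthly series that has a steady trend. The series name and values are hypothetical. In an ARIMA(p, d, q) model, the `d` term is the order of this same differencing.

```python
import pandas as pd

# Made-up monthly series with a steady downward trend (illustration only).
levels = pd.Series(
    [100.0, 90.0, 80.0, 70.0, 60.0],
    index=pd.date_range("2019-01-01", periods=5, freq="MS"),
    name="water_level",
)

# First-order differencing: each value minus the previous month's value.
# The first entry has no predecessor, so it is NaN and gets dropped.
differenced = levels.diff().dropna()
print(differenced)
```

Here the trend disappears after differencing because every month-to-month change is the same; real reservoir data would of course be noisier and may also need seasonal differencing.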
It goes without saying that more data is always better. It would be nice to have features such as temperature and exact water usage.
In terms of the data, it was fascinating to see how much of what it shows is man-made, from the reservoirs themselves to the water scarcity problem. I would suggest better methods of collecting water during the monsoon season. I would also suggest keeping a better record of how people use the water. This is truly a crisis that unfortunately awaits most cities unless we take the proper action.