- Imputed missing values in the following categorical columns with the most frequent value of each column: `funder`, `installer`, `subvillage`, `public_meeting`, `scheme_management`, `scheme_name`, `permit`.
- Imputed 0 values in `construction_year` with the earliest value of construction year.
- Imputed the `longitude` and `latitude` values of records whose `longitude` was below 29 with the mean values of `longitude` and `latitude` when grouped by `region_code`. The reason is that, given the GPS coordinates of Tanzania (-6.3728253, 34.8924826), longitude values must be over 29.
- Imputed `date_recorded` values whose years were before `construction_year` with the latest value of `date_recorded`.

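
The coordinate rule above can be sketched without any data-frame library. This is only an illustration under stated assumptions: rows are plain dictionaries, the helper names `fit_region_means` and `impute_coordinates` are hypothetical, and the per-region means are assumed to be computed from plausible rows only (the text does not say whether implausible rows were excluded).

```python
from collections import defaultdict

MIN_VALID_LONGITUDE = 29.0  # threshold stated in the text


def fit_region_means(rows):
    """Mean (longitude, latitude) per region_code.

    Assumption: rows with an implausible longitude are left out of the means.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # [sum_lon, sum_lat, count]
    for row in rows:
        if row["longitude"] >= MIN_VALID_LONGITUDE:
            acc = sums[row["region_code"]]
            acc[0] += row["longitude"]
            acc[1] += row["latitude"]
            acc[2] += 1
    return {code: (lon / n, lat / n) for code, (lon, lat, n) in sums.items()}


def impute_coordinates(rows, region_means):
    """Return copies of rows with implausible coordinates replaced."""
    fixed = []
    for row in rows:
        row = dict(row)  # copy so the input is not mutated
        code = row["region_code"]
        if row["longitude"] < MIN_VALID_LONGITUDE and code in region_means:
            row["longitude"], row["latitude"] = region_means[code]
        fixed.append(row)
    return fixed


# Toy records, invented for illustration only.
train = [
    {"region_code": 11, "longitude": 34.0, "latitude": -9.0},
    {"region_code": 11, "longitude": 36.0, "latitude": -7.0},
    {"region_code": 11, "longitude": 0.0, "latitude": 0.0},
]
region_means = fit_region_means(train)
train_fixed = impute_coordinates(train, region_means)
```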
All imputation values were computed from the training data; the test data were transformed using those same values.

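
To make the train/test split of responsibilities concrete, here is a minimal sketch of most-frequent-value imputation that is fitted on the training rows and then reused unchanged on the test rows. It assumes missing values are represented as `None` and that rows are plain dictionaries. The helper names `fit_modes` and `apply_modes` and the toy values are hypothetical.

```python
from collections import Counter


def fit_modes(rows, columns):
    """Most frequent non-missing value per column, learned from training rows."""
    modes = {}
    for col in columns:
        counts = Counter(row[col] for row in rows if row.get(col) is not None)
        if counts:
            modes[col] = counts.most_common(1)[0][0]
    return modes


def apply_modes(rows, modes):
    """Return copies of rows with missing values filled from the fitted modes."""
    filled = []
    for row in rows:
        row = dict(row)  # copy so the input is not mutated
        for col, mode in modes.items():
            if row.get(col) is None:
                row[col] = mode
        filled.append(row)
    return filled


# Toy records, invented for illustration only.
train = [
    {"funder": "funder_a", "permit": True},
    {"funder": "funder_a", "permit": None},
    {"funder": None, "permit": True},
    {"funder": "funder_b", "permit": False},
]
test = [
    {"funder": None, "permit": None},
    {"funder": "funder_b", "permit": False},
]

modes = fit_modes(train, ["funder", "permit"])  # fitted on training data only
train_filled = apply_modes(train, modes)
test_filled = apply_modes(test, modes)  # test data reuses the training modes
```

Fitting on the training rows alone keeps information from the test set out of the imputation values.
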
- Created a new feature called `age`, representing how old a waterpoint is, by subtracting `construction_year` from the year of `date_recorded`.
- Separated out the year and month from `date_recorded` as `year_recorded` and `month_recorded`, respectively.
- Excluded the columns `id`, `date_recorded`, `recorded_by`, and `num_private` from the training features.
- Applied label (integer) encoding to the categorical features. Encodings were based on the training data; unknown values in the test data were labelled as 0 using a custom function.
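
The date-derived features and the encoding step can be sketched as follows. This is an illustration rather than the custom function mentioned above: it assumes `date_recorded` is an ISO `YYYY-MM-DD` string, and it assumes known categories are numbered from 1 so that 0 is reserved for values unseen in training (the text only states that unknowns map to 0). The helper names and the sample values are hypothetical.

```python
from datetime import date


def add_date_features(row):
    """Return a copy of row with year_recorded, month_recorded and age added."""
    recorded = date.fromisoformat(row["date_recorded"])  # assumes YYYY-MM-DD
    out = dict(row)
    out["year_recorded"] = recorded.year
    out["month_recorded"] = recorded.month
    # Age of the waterpoint at recording time, in years.
    out["age"] = recorded.year - row["construction_year"]
    return out


def fit_label_encoding(train_values):
    """Map each category seen in training to an integer code starting at 1.

    Assumption: 0 is kept free so it can stand for unknown categories.
    """
    return {value: code for code, value in enumerate(sorted(set(train_values)), start=1)}


def encode(value, mapping):
    """Integer code for value, or 0 if it was not seen in training."""
    return mapping.get(value, 0)


# Toy inputs, invented for illustration only.
row = add_date_features({"date_recorded": "2013-03-14", "construction_year": 2000})
mapping = fit_label_encoding(["river", "lake", "river"])
encoded_test = [encode(v, mapping) for v in ["lake", "spring", "river"]]
```

Here `"spring"` does not appear in the training values, so it receives the code 0, while the known categories keep the codes learned from the training data.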