This hub is for a UCLA Math 285J machine learning course project on PM2.5 air pollution in China. It includes research references, data sources, and a list of our code and results.
Air pollution is a pressing issue in Chinese society, since it might be a cause of the recent dramatic increase in lung cancer cases.
## Background reading:
- Fine particulate matter (PM2.5) in China at a city level
- PM2.5 in China: Measurements, sources, visibility and health effects, and mitigation
- Spatiotemporal variations of PM2.5 and PM10 concentrations between 31 Chinese cities and their relationships with SO2, NO2, CO and O3
- Other studies in Chinese: 1, 2
- Evolving the neural network model for forecasting air pollution time series
- Intercomparison of air quality data using principal component analysis, and forecasting of PM10 and PM2.5 concentrations using artificial neural networks
- Machine learning in geosciences and remote sensing
- Weather data on NOAA
- Air pollution data sources: [1](http://aqi.cga.harvard.edu/china/), [2](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/24826)

Plotting the time series at various stations in Beijing shows a clear intraday seasonality: pollution peaks every 8 hours. However, no significant short-term trend was identified. Based on these observations, we make the following assumption:
- The PM2.5 index at a given hour is driven by the weather conditions and pollution levels 8 hours earlier.
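
One way to check an 8-hour periodicity beyond visual inspection is the sample autocorrelation. The sketch below is only illustrative: the series is synthetic (a sine wave with an 8-hour period plus noise, not project data), and the helper name `autocorr` is hypothetical.

```python
import numpy as np


def autocorr(x, lag):
    """Sample autocorrelation of a 1-D series at a positive integer lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))


# Synthetic hourly series with an 8-hour cycle plus noise (assumption, not real data).
rng = np.random.default_rng(0)
hours = np.arange(24 * 30)
pm25 = 80 + 30 * np.sin(2 * np.pi * hours / 8) + rng.normal(0, 5, hours.size)

# Autocorrelation for lags of 1 to 24 hours.
acf = {lag: autocorr(pm25, lag) for lag in range(1, 25)}
for lag in (4, 8):
    print(f"lag {lag:2d} h: autocorrelation = {acf[lag]:.2f}")
```

For a series with an 8-hour cycle, the autocorrelation at lag 8 should be strongly positive, while the half-period lag of 4 should be strongly negative.
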
Let P_i(t) denote the time series of the i-th pollutant, where t is the time in hours, and let W_j(t) denote the time series of the j-th weather condition, such as wind speed, temperature, humidity, or air pressure.
Then,
PM2.5(t) = F(PM2.5(t-8), P_1(t-8), ..., P_n(t-8), W_1(t-8), ..., W_m(t-8))
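
To make the formula concrete, here is a minimal sketch of how the inputs and target of F could be arranged as a supervised learning dataset. The function name `make_lagged_dataset` and the array layout are assumptions for illustration, not the project's actual code.

```python
import numpy as np

LAG = 8  # hours between the inputs and the predicted PM2.5 value


def make_lagged_dataset(pm25, pollutants, weather, lag=LAG):
    """Build (X, y) such that each row of X holds the values at hour t - lag
    and the matching entry of y is PM2.5 at hour t.

    pm25:       shape (T,),   the PM2.5 series
    pollutants: shape (T, n), columns P_1 .. P_n
    weather:    shape (T, m), columns W_1 .. W_m
    """
    pm25 = np.asarray(pm25, dtype=float)
    # Column order follows the formula: PM2.5, then pollutants, then weather.
    features = np.column_stack([pm25, pollutants, weather])
    X = features[:-lag]  # values at hours 0 .. T - lag - 1
    y = pm25[lag:]       # PM2.5 at hours lag .. T - 1
    return X, y


# Tiny synthetic example (random numbers, not real measurements).
rng = np.random.default_rng(0)
T, n, m = 20, 2, 3
pm25 = rng.random(T)
X, y = make_lagged_dataset(pm25, rng.random((T, n)), rng.random((T, m)))
print(X.shape, y.shape)
```

The first `lag` hours have no target of their own and the last `lag` hours have no matching input row, so a series of length T yields T - 8 training pairs.
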
The project aims to learn F using several machine learning methods: linear models (Lasso, Ridge), Random Forest, Extra-Trees, and neural networks.
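
As a rough sketch of how these model families could be compared, the example below fits each one with scikit-learn on synthetic data. The data, the hyperparameters, and the chronological 80/20 split are placeholder assumptions, not the project's settings or results.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lagged features X and the PM2.5 target y.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(scale=0.1, size=500)

# Chronological split: train on the earlier rows, evaluate on the later ones.
split = 400
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Hyperparameters are arbitrary placeholders and would need tuning.
models = {
    "Lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.01)),
    "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Extra-Trees": ExtraTreesRegressor(n_estimators=50, random_state=0),
    "Neural Network": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on the held-out rows
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

Splitting by time rather than at random keeps future observations out of the training set, which is one common choice when evaluating forecasts.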