- Overview
- Dataset Description
- Analysis Process
- Model Performance and Findings
- Conclusion and Recommendations
This project analyzes emissions data with several machine learning models to identify trends and insights. It uses the Carbon Majors Emissions dataset, which records emissions from various entities over several years. The main goal is to evaluate how well different models predict emissions from historical data.
The dataset contains the following key features:
- Year: The year of the emission record.
- Parent Entity: The entity responsible for the emissions.
- Parent Type: The type of entity (e.g., state-owned, private).
- Commodity: The type of commodity (e.g., Oil & NGL, Natural Gas).
- Production Value: The production value for the commodity.
- Production Unit: The unit of measurement for production (e.g., Million bbl/yr, Bcf/yr).
- Total Emissions (MtCO2e): The total emissions in million tonnes of CO2 equivalent.
The dataset is highly detailed and provides a comprehensive view of emissions over time, making it suitable for machine learning analysis.
The analysis process involved the following steps:
- Data Preprocessing: Cleaning and preparing the data for analysis, including handling missing values and normalizing the data.
- Feature Selection: Identifying the most relevant features for predicting emissions.
- Model Training: Training various machine learning models on the dataset, including Linear Regression, Decision Trees, and Random Forests.
- Different Units: The Production Value column mixes units (e.g., Million bbl/yr and Bcf/yr), so its values are not directly comparable. This forced me to drop the records reported in the minority Production Value unit.
- Bias: There were inherent biases in the data due to the way it was collected and processed. These biases can skew the model's predictions and reduce its generalizability.
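The unit filter described above could look roughly like the following. This is a minimal sketch, not the project's actual code: it assumes the data is loaded into a pandas DataFrame, the column names `production_unit` and `production_value` and the helper `keep_majority_unit` are hypothetical, and the toy rows are invented for illustration.

```python
import pandas as pd


def keep_majority_unit(df: pd.DataFrame, unit_col: str = "production_unit") -> pd.DataFrame:
    """Keep only the rows recorded in the most frequent production unit."""
    majority_unit = df[unit_col].value_counts().idxmax()
    return df[df[unit_col] == majority_unit].reset_index(drop=True)


# Toy rows for illustration only (not real records from the dataset).
toy = pd.DataFrame({
    "commodity": ["Oil & NGL", "Oil & NGL", "Natural Gas", "Oil & NGL"],
    "production_value": [120.0, 95.5, 300.0, 80.2],
    "production_unit": ["Million bbl/yr", "Million bbl/yr", "Bcf/yr", "Million bbl/yr"],
})

filtered = keep_majority_unit(toy)
print(filtered["production_unit"].unique())
```

Dropping rows is the simplest option, but it discards every commodity reported in the minority unit. Converting units or training one model per unit would keep those records, at the cost of extra preprocessing.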
- Linear Regression: Provided a baseline performance with moderate accuracy.
- Decision Trees: Showed improved performance but were prone to overfitting.
- Random Forests: Outperformed Decision Trees by averaging multiple trees to reduce overfitting.
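A comparison of this kind can be sketched with scikit-learn as follows. The data here is synthetic and only stands in for the real features, the hyperparameters are assumptions, and the printed scores say nothing about the results on the emissions dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: two features and a noisy, mildly non-linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.5, size=200)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# 5-fold cross-validated R^2 estimates out-of-sample performance,
# which is where an overfitted single tree tends to fall behind.
scores = {}
for name, model in models.items():
    cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2")
    scores[name] = float(cv_r2.mean())
    print(f"{name}: mean CV R^2 = {scores[name]:.3f}")
```

Scoring on held-out folds rather than on the training data is what makes overfitting visible: an unpruned tree can fit its training set almost perfectly while scoring noticeably lower on unseen folds.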
The analysis highlights the importance of using balanced and unbiased datasets for training machine learning models. For future work, the following steps are recommended:
- Data Balancing: Apply techniques such as oversampling the minority class or undersampling the majority class to balance the dataset.
- Bias Mitigation: Address biases in the data collection and processing stages to improve the model's generalizability.
- Continuous Monitoring: Continuously monitor and update the model with new data to maintain its effectiveness.
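As an illustration of the data balancing step, the sketch below oversamples under-represented groups with pandas. Treating Parent Type as the group to balance is an assumption, as are the column names, the helper `oversample_groups`, and the toy values.

```python
import pandas as pd


def oversample_groups(df: pd.DataFrame, group_col: str, random_state: int = 0) -> pd.DataFrame:
    """Top up each group with resampled rows until it matches the largest group."""
    target = df[group_col].value_counts().max()
    parts = []
    for _, group in df.groupby(group_col):
        parts.append(group)
        deficit = target - len(group)
        if deficit > 0:
            # Draw extra rows with replacement from the under-represented group.
            parts.append(group.sample(n=deficit, replace=True, random_state=random_state))
    return pd.concat(parts, ignore_index=True)


# Toy rows for illustration only (not real records from the dataset).
toy = pd.DataFrame({
    "parent_type": ["State-owned"] * 6 + ["Private"] * 2,
    "total_emissions": [5.0, 6.1, 4.8, 7.2, 5.5, 6.0, 1.2, 1.9],
})

balanced = oversample_groups(toy, "parent_type")
print(balanced["parent_type"].value_counts())
```

Oversampling should be applied to the training split only. Duplicated rows that leak into the validation or test split make the scores look better than they are.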
By implementing these strategies, more reliable and valid machine learning models for emissions analysis can be developed.