This project demonstrates XGBoost classification on kmer data. The pipeline preprocesses the kmer data into a manageable format, trains an XGBoost model, and evaluates its performance.
Clone this repository to your local machine and navigate to the project directory. Install the necessary dependencies with:
pip install -r requirements.txt
Edit the parameters in the input.json file, then run:
python main.py input.json
Import kmer_ml_pacakge.visualization to plot the data.
- Reading Chunk Files: The input is split across 2000 chunk files, each containing a portion of the data to be processed.
- Generating a Sparse Matrix: A sparse matrix is generated from the processed chunk files, facilitating efficient storage and computational operations.
- Filtering Data: The data is filtered by label to obtain the dataset relevant for model training. Two designated data types are removed, and the remaining samples are categorized into Human & Animal (HA), Human & Human (HH), and Animal & Animal (AA) groups according to predefined conditions.
- Hyperparameter Tuning: A grid search is conducted for hyperparameter tuning to find the optimal set of parameters for the XGBoost model.
- K-Fold Cross-Validation: K-fold cross-validation assesses the model's performance and checks that it is consistent across different subsets of the data.
- Calculating Confidence Intervals: Confidence intervals for classification metrics such as precision, recall, and F1-score are calculated and stored. Each interval gives a range within which the true metric value is likely to fall, indicating the stability of the model's performance.
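The chunk-reading and sparse-matrix steps above can be sketched as follows. This is a minimal illustration: `load_chunk` and the dense-block chunk format are assumptions, not the project's actual file layout.

```python
import numpy as np
from scipy import sparse

def load_chunk(path):
    # Hypothetical loader: assumes each chunk file holds a dense
    # (samples x kmers) count block. The real format is defined by
    # the project's preprocessing step.
    return np.load(path)

def build_sparse_matrix(chunks):
    """Stack per-chunk kmer count blocks into one CSR matrix.

    Kmer count matrices are mostly zeros, so CSR storage keeps only
    the non-zero entries and supports fast row slicing for training.
    """
    blocks = [sparse.csr_matrix(c) for c in chunks]
    return sparse.vstack(blocks, format="csr")

# Demo with two small synthetic chunks.
chunk_a = np.array([[0, 3, 0, 0], [1, 0, 0, 2]])
chunk_b = np.array([[0, 0, 5, 0]])
X = build_sparse_matrix([chunk_a, chunk_b])
print(X.shape)  # (3, 4)
print(X.nnz)    # 4 stored non-zeros
```

In the real pipeline the loop would run over all 2000 chunk files rather than two in-memory arrays.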
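The grid search and k-fold steps can be combined as in this sketch, which uses scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier`; the parameter grid, scoring metric, and synthetic data are illustrative assumptions, not the project's settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the filtered kmer dataset.
rng = np.random.default_rng(0)
X = rng.random((120, 10))
y = rng.integers(0, 2, 120)

# Illustrative grid; the real grid would cover XGBoost parameters
# such as n_estimators, max_depth, and learning_rate.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.1, 0.3],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid, cv=cv, scoring="f1",
)
search.fit(X, y)
print(search.best_params_)

# Re-assess the tuned model with k-fold cross-validation.
scores = cross_val_score(search.best_estimator_, X, y, cv=cv, scoring="f1")
print(scores.mean(), scores.std())
```

Swapping in `xgboost.XGBClassifier` requires no other changes, since it follows the scikit-learn estimator interface.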
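One common way to compute such confidence intervals is a percentile bootstrap over the test-set predictions. The sketch below assumes binary labels and is not necessarily the exact method used in the script.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def bootstrap_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CIs for precision, recall, and F1."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    stats = []
    for _ in range(n_boot):
        # Resample prediction pairs with replacement.
        idx = rng.integers(0, len(y_true), len(y_true))
        p, r, f, _ = precision_recall_fscore_support(
            y_true[idx], y_pred[idx], average="binary", zero_division=0
        )
        stats.append((p, r, f))
    lo, hi = np.percentile(
        stats, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0
    )
    return {m: (l, h) for m, l, h in zip(["precision", "recall", "f1"], lo, hi)}

# Toy labels and predictions standing in for held-out test results.
y_true = np.array([0, 1, 1, 0, 1, 1, 0, 0, 1, 0] * 5)
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1, 1, 0] * 5)
print(bootstrap_ci(y_true, y_pred))
```

The width of each interval shrinks as the test set grows, which is what makes it a useful stability indicator.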
- Author: Ge Zhou
- Email: [email protected]
This documentation provides a high-level overview of the code structure and the project's primary functionalities. For more detailed information, please refer to the inline comments within the script file.