This repository contains a pipeline for analyzing next-generation sequencing data from cancer and control patients. The pipeline consists of several functions that cover tasks from data preprocessing to neural-network-based analysis. The analysis is memory-intensive, so your computer may not have enough memory to process all the data.
This function utilizes data from the SraRunTable.txt metadata file, which can be obtained through the NCBI SRA Run Selector. It creates a dataset containing information about control and cancer patients, along with their SRA run names.
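As a rough illustration of this step, the sketch below splits run names into cancer and control groups using only the Python standard library. It assumes the run table is comma-separated with a `Run` column, and the `disease_state` column name, the label values, and the accession numbers are hypothetical; check the header of your own SraRunTable.txt, since this is not necessarily how metadata_treat.py does it.

```python
import csv
import io

# Assumed column names; check the header of your own SraRunTable.txt.
RUN_COLUMN = "Run"
GROUP_COLUMN = "disease_state"  # hypothetical column holding cancer/control labels


def split_runs(handle, control_label="control"):
    """Return (cancer_runs, control_runs) from a comma-separated run table."""
    cancer_runs, control_runs = [], []
    for row in csv.DictReader(handle):
        label = row[GROUP_COLUMN].strip().lower()
        if label == control_label:
            control_runs.append(row[RUN_COLUMN])
        else:
            cancer_runs.append(row[RUN_COLUMN])
    return cancer_runs, control_runs


# Tiny in-memory stand-in for SraRunTable.txt (made-up accessions).
example = io.StringIO(
    "Run,disease_state\n"
    "SRR0000001,cancer\n"
    "SRR0000002,control\n"
    "SRR0000003,cancer\n"
)
cancer_runs, control_runs = split_runs(example)
print("cancer:", cancer_runs)
print("control:", control_runs)
```

To use a real file, pass `open("SraRunTable.txt", newline="")` instead of the in-memory example.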
Using the SRA run names from the dataset generated by metadata_treat.py, this function downloads SRA run data, aligns it to the human hg38 genome, and generates BED files. Replace "cancer" with "control" if using the df_control dataset.
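The download, alignment, and BED conversion could be chained roughly as follows. This sketch only builds the command strings and does not run them. It assumes paired-end reads, SRA Toolkit (`prefetch`, `fasterq-dump`) for the download, and `bwa mem` as the aligner with an already indexed `hg38.fa`; the README does not name the download tool or aligner, so those choices and all file paths are placeholders.

```python
import shlex


def build_commands(run_id, reference="hg38.fa", group="cancer"):
    """Return shell command strings for one SRA run (nothing is executed)."""
    out_dir = f"{group}/{run_id}"
    fq1 = f"{out_dir}/{run_id}_1.fastq"
    fq2 = f"{out_dir}/{run_id}_2.fastq"
    bam = f"{out_dir}/{run_id}.sorted.bam"
    bed = f"{out_dir}/{run_id}.bed"
    q = shlex.quote  # protect paths that contain spaces or shell characters
    return [
        # Download the run (assumes SRA Toolkit is installed).
        f"prefetch {q(run_id)}",
        # Convert to FASTQ, one file per read mate.
        f"fasterq-dump {q(run_id)} --split-files -O {q(out_dir)}",
        # Align to the reference (aligner is an assumption) and sort the result.
        f"bwa mem {q(reference)} {q(fq1)} {q(fq2)} | samtools sort -o {q(bam)} -",
        f"samtools index {q(bam)}",
        # Convert alignments to BED intervals.
        f"bedtools bamtobed -i {q(bam)} > {q(bed)}",
    ]


for command in build_commands("SRR0000001", group="control"):
    print(command)
```

Printing the commands first makes it easy to inspect them before handing each one to a shell.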
This function runs tests and quality checks on the BED files to verify that the data are reliable before further analysis.
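The exact checks are not listed here, so the following is only a guess at what a basic BED sanity check might look like: it verifies that each line has at least three tab-separated columns and that the coordinates are non-negative integers with start smaller than end. The function name and the choice of checks are illustrative.

```python
def check_bed_line(line):
    """Return a list of problems found in one BED line (empty if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return ["fewer than 3 tab-separated columns"]
    problems = []
    chrom, start, end = fields[0], fields[1], fields[2]
    if not chrom:
        problems.append("empty chromosome name")
    if not (start.isdigit() and end.isdigit()):
        problems.append("start/end are not non-negative integers")
    elif int(start) >= int(end):
        problems.append("start is not smaller than end")
    return problems


# Made-up example lines: one valid, two broken.
lines = ["chr1\t100\t267\n", "chr1\t500\t400\n", "chr2\t10\n"]
for number, line in enumerate(lines, start=1):
    print(number, check_bed_line(line) or "ok")
```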
Extracts chromosome positioning information from the BED files, which is essential for generating histograms containing fragment distribution data.
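A minimal sketch of this extraction, assuming standard tab-separated BED input where the first three columns are chromosome, start, and end. The function name and the example coordinates are made up for illustration.

```python
import io
from collections import defaultdict


def read_positions(handle):
    """Map chromosome -> list of (start, end) tuples from BED text."""
    positions = defaultdict(list)
    for line in handle:
        # Skip blank lines and optional BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        positions[chrom].append((int(start), int(end)))
    return dict(positions)


# In-memory stand-in for a BED file.
bed_text = io.StringIO(
    "chr1\t100\t267\tread1\n"
    "chr1\t300\t450\tread2\n"
    "chr2\t1000\t1170\tread3\n"
)
positions = read_positions(bed_text)
print(positions)
```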
Uses the chromosome positioning information to create histograms that provide insights into fragment distribution patterns within the dataset.
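One way to build such a histogram is to bin fragment lengths (end minus start). This is an assumption: the README does not say whether the histograms are over fragment length or genomic position, and the bin width and maximum length below are arbitrary.

```python
from collections import Counter


def length_histogram(intervals, bin_width=10, max_length=500):
    """Count fragments per length bin; longer fragments go in the last bin."""
    n_bins = max_length // bin_width
    counts = Counter()
    for start, end in intervals:
        length = end - start
        counts[min(length // bin_width, n_bins - 1)] += 1
    return [counts.get(i, 0) for i in range(n_bins)]


# Made-up (start, end) pairs, e.g. taken from one chromosome of a BED file.
intervals = [(100, 267), (300, 450), (1000, 1170), (0, 900)]
histogram = length_histogram(intervals, bin_width=50, max_length=500)
print(histogram)
```

A fixed number of bins keeps every sample's histogram the same length, which is convenient if the histograms are later fed to a model.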
Implements a small neural network using the histogram_creation data. This neural network aids in data analysis and treatment.
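To make the idea concrete, here is a toy one-hidden-layer classifier written in plain Python. It is not the network used in this repository: the architecture, hyperparameters, and the tiny made-up histograms (label 1 for cancer, 0 for control) are all assumptions chosen to keep the sketch short and dependency-free.

```python
import math
import random


def forward(params, x):
    """Return (hidden activations, predicted probability of class 1)."""
    w1, b1, w2, b2 = params
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    z = sum(w * h for w, h in zip(w2, hidden)) + b2
    return hidden, 1.0 / (1.0 + math.exp(-z))


def mean_loss(params, data):
    """Mean binary cross-entropy over (histogram, label) pairs."""
    total = 0.0
    for x, y in data:
        p = min(max(forward(params, x)[1], 1e-12), 1 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)


def init_params(n_in, n_hidden=4, seed=0):
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    return w1, [0.0] * n_hidden, w2, 0.0


def train(params, data, epochs=300, lr=0.5):
    """Per-sample gradient descent; returns new parameters, input is not modified."""
    w1 = [row[:] for row in params[0]]
    b1, w2, b2 = params[1][:], params[2][:], params[3]
    for _ in range(epochs):
        for x, y in data:
            hidden, p = forward((w1, b1, w2, b2), x)
            d_out = p - y  # gradient w.r.t. the output pre-activation
            for j in range(len(w2)):
                d_hidden = d_out * w2[j] * (1 - hidden[j] ** 2)
                w2[j] -= lr * d_out * hidden[j]
                for i in range(len(x)):
                    w1[j][i] -= lr * d_hidden * x[i]
                b1[j] -= lr * d_hidden
            b2 -= lr * d_out
    return w1, b1, w2, b2


# Made-up normalized histograms with three bins each.
data = [([0.7, 0.2, 0.1], 0), ([0.6, 0.3, 0.1], 0),
        ([0.1, 0.2, 0.7], 1), ([0.1, 0.3, 0.6], 1)]
initial = init_params(n_in=3)
trained = train(initial, data)
print("loss before:", mean_loss(initial, data))
print("loss after:", mean_loss(trained, data))
```

For real datasets a library such as PyTorch or scikit-learn would be a more practical choice; the hand-written version is only meant to show the moving parts.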
Because the analysis is complex and the datasets are large, your computer's memory may be insufficient for some stages of this pipeline. Make sure your system has enough memory and processing capacity before running the analysis.
- Clone the repository to your local machine.
- Create and activate a Conda environment to isolate dependencies for this pipeline:
conda create -n cancer_analysis_env python=<python_version>
conda activate cancer_analysis_env
Replace <python_version> with the desired Python version.
Install the required dependencies using Conda and Bioconda, including:
- FASTQC
- Bedtools
- Samtools
conda install -c bioconda fastqc bedtools samtools
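After installation, a quick check like the one below can confirm that the three tools are reachable from the active environment. It is a small convenience sketch, not part of the pipeline, and it assumes the executables are named `fastqc`, `bedtools`, and `samtools`, as in the Bioconda packages.

```python
import shutil

REQUIRED_TOOLS = ("fastqc", "bedtools", "samtools")


def find_tools(names=REQUIRED_TOOLS):
    """Map each executable name to its path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


missing = [name for name, path in find_tools().items() if path is None]
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools found.")
```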
Ensure you are using a Linux-based system, as the pipeline is designed to work best on this platform.
Run the functions in the order specified above, ensuring that you provide the necessary inputs and configurations.
Monitor memory usage during execution and consider utilizing a system with higher memory capacity if memory-related errors occur.
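One lightweight way to monitor memory from inside a Python step is the standard `resource` module, sketched below. It assumes a Linux system, where `ru_maxrss` is reported in kibibytes; the helper name is illustrative.

```python
import resource


def peak_memory_mb():
    """Peak resident memory of this process in MiB (ru_maxrss is KiB on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


# Call this after a heavy step to see how much memory the process has needed so far.
print(f"Peak memory so far: {peak_memory_mb():.1f} MiB")
```

For external tools such as samtools or bedtools, a system monitor like `top` is the simpler option, since this helper only reports on the Python process itself.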
Contributions to this repository are welcome. If you encounter issues or have ideas for improvements, feel free to open an issue or submit a pull request.