
A workflow to analyse sequence mutations in CRISPR-Cas9 targeted amplicon sequencing data

License: GNU General Public License v3.0


crispr-DART (Downstream Analysis and Reporting Tool)

crispr-DART is a pipeline to process, analyse, and report on CRISPR-Cas9-induced genome editing outcomes from high-throughput sequencing of target regions of interest.

crispr-DART was developed as part of the study "Parallel genetics of regulatory sequences using scalable genome editing in vivo", now published in Cell Reports: Froehlich, J. & Uyar, B. et al., Cell Reports, 2021.

Here is also the news coverage of our story: Scaling up genome editing big in tiny worms.

Pipeline scheme

The pipeline accepts single-end or paired-end Illumina reads and long PacBio reads, from both DNA and RNA samples.

The pipeline consists of the following steps:

1. Quality control (fastqc/multiqc) and improvement (TrimGalore!) of raw reads
2. Mapping the reads to the genome of interest (BBMap)
3. Extracting statistics about the detected insertions and deletions (various R libraries including GenomicAlignments and Rsamtools)
4. Reporting of the editing outcomes in interactive reports organized into a website (rmarkdown::render_site)
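
To give a rough idea of what "extracting statistics about insertions and deletions" means for a single aligned read, here is a minimal Python sketch that counts indel events from a CIGAR string as defined in the SAM format. This is not how crispr-DART does it (the pipeline relies on R libraries such as GenomicAlignments); the function name `count_indels` and the example CIGAR string are made up for illustration.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def count_indels(cigar):
    """Count insertion (I) and deletion (D) events and bases in one CIGAR string."""
    stats = {
        "insertion_events": 0,
        "inserted_bases": 0,
        "deletion_events": 0,
        "deleted_bases": 0,
    }
    for length, op in CIGAR_OP.findall(cigar):
        if op == "I":
            stats["insertion_events"] += 1
            stats["inserted_bases"] += int(length)
        elif op == "D":
            stats["deletion_events"] += 1
            stats["deleted_bases"] += int(length)
    return stats

# Hypothetical read: 10 matches, a 2 bp insertion, 5 matches, a 3 bp deletion, 8 matches.
print(count_indels("10M2I5M3D8M"))
```

Summing such per-read counts over all reads that overlap a target region would give the kind of per-sample indel summary the reports are built on.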

Example HTML report output

The HTML reports produced by the pipeline are automatically organised as a website. An example report website can be browsed here

Example screenshots from the reports

You can find below some example screenshots from the HTML reports:

Installation

Download the source code:

> git clone https://github.com/BIMSBbioinfo/crispr_DART.git

Create a guix profile with dependencies

> mkdir -p $HOME/guix-profiles/crispr_dart
> guix package --manifest=guix.scm --profile=$HOME/guix-profiles/crispr_dart

# activate env
> source ~/guix-profiles/crispr_dart/etc/profile

Test the installation on sample data

> snakemake -s snakefile.py --configfile sample_data/settings.yaml --cores 4 --printshellcmds

How to run the pipeline

Preparing the input files

The pipeline currently requires four different input files.

A sample sheet file, which describes the samples, associated fastq files, the sets of sgRNAs used in the sample and the list of regions of interest.

Please see the example sample sheet file under sample_data/sample_sheet.csv.

A BED file containing the genomic coordinates of all the sgRNAs used in this project.

Please see the example BED file for sgRNA target sites under sample_data/cut_sites.bed

A comparisons table, which is used for comparing pairs of samples in terms of genome editing outcomes.

Please see the example table under sample_data/comparisons.tsv

A settings file, which combines all the information from the other input files and additional configurations for resource requirements of tools.

Please see the example file under sample_data/settings.yaml

The sample_data/fasta folder contains fasta format sequence files that are used as the target genome sequence. The sample_data/reads folder contains sample read files (fastq.gz files from Illumina and PacBio sequenced samples).
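
As an illustration of the sgRNA BED input described above, the following Python sketch parses lines in the standard BED layout (chromosome, 0-based start, exclusive end, optional name). It assumes nothing about crispr-DART beyond the use of BED; the `CutSite` type, the `parse_bed_line` helper, and the example line are hypothetical. Check sample_data/cut_sites.bed for the columns the pipeline actually expects.

```python
from typing import NamedTuple, Optional

class CutSite(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # exclusive (BED convention)
    name: str   # empty if the BED line has no name column

def parse_bed_line(line: str) -> Optional[CutSite]:
    """Parse one BED line; return None for blank, comment, track or browser lines."""
    line = line.strip()
    if not line or line.startswith(("#", "track", "browser")):
        return None
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got: {line!r}")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if start > end:
        raise ValueError(f"start is greater than end: {line!r}")
    name = fields[3] if len(fields) > 3 else ""
    return CutSite(chrom, start, end, name)

# Hypothetical sgRNA target site, for illustration only.
site = parse_bed_line("chrI\t100\t123\tsg1")
print(site, "length:", site.end - site.start)
```

Because BED intervals are half-open, the length of a site is simply `end - start`.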

Running the pipeline

Once the settings.yaml file is configured with paths to all the other required files, the pipeline can be run with snakemake. The command below requests 4 cores:

> snakemake -s snakefile.py --configfile /path/to/settings.yaml --cores 4 --printshellcmds

If you would like to do a dry run, meaning that the list of jobs is created but not executed, add the --dryrun flag:

> snakemake -s snakefile.py --configfile /path/to/settings.yaml --cores 4 --dryrun --printshellcmds
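
If you launch the pipeline from a script, the same command can be assembled programmatically. The sketch below is only a convenience wrapper around the flags shown above (-s, --configfile, --cores, --dryrun, --printshellcmds); the helper name `build_snakemake_cmd` is an assumption and is not part of crispr-DART.

```python
import shlex

def build_snakemake_cmd(configfile, cores=4, dryrun=False, snakefile="snakefile.py"):
    """Return the snakemake invocation as an argument list."""
    cmd = ["snakemake", "-s", snakefile,
           "--configfile", configfile,
           "--cores", str(cores)]
    if dryrun:
        cmd.append("--dryrun")  # list the jobs without executing them
    cmd.append("--printshellcmds")
    return cmd

cmd = build_snakemake_cmd("/path/to/settings.yaml", dryrun=True)
# Print the command in a form that can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in cmd))

# To actually run it from the crispr_DART directory (requires snakemake on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the settings path contains spaces.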

How to cite

See the publication in Cell Reports

Credits

The software has been developed by Bora Uyar from the Akalin Lab with significant conceptual contributions by Jonathan Froehlich from the N.Rajewsky Lab at the Berlin Institute of Medical Systems Biology of the Max-Delbruck-Center for Molecular Medicine.


crispr_dart's Issues

Adding a license?

Awesome work. Would you mind adding a license file so it's clear how to use / re-use / contribute to your work? Many thanks!

Error in test.sh

During the bbmap_indexgenome step in test.sh, I got the error below.
Can you please check?

Finished job 5.
6 of 48 steps (12%) done
ImproperOutputException in line 189 of /data/gpfs/assoc/pgl/data/Dylan/potato_cas9/crispr_DART/snakefile.py:
Outputs of incorrect type (directories when expecting files or vice versa). Output directories must be flagged with directory(). for rule bbmap_indexgenome:
/data/gpfs/assoc/pgl/data/Dylan/potato_cas9/crispr_DART/output_test/bbmap_indexes/chrI_II
  File "/data/gpfs/home/wyim/scratch/bin/miniconda3/envs/crispr_dart/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 581, in handle_job_success
  File "/data/gpfs/home/wyim/scratch/bin/miniconda3/envs/crispr_dart/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 259, in handle_job_success
Removing output files of failed job bbmap_indexgenome since they might be corrupted:
/data/gpfs/assoc/pgl/data/Dylan/potato_cas9/crispr_DART/output_test/bbmap_indexes/chrI_II
Skipped removing non-empty directory /data/gpfs/assoc/pgl/data/Dylan/potato_cas9/crispr_DART/output_test/bbmap_indexes/chrI_II
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /data/gpfs/assoc/pgl/data/Dylan/potato_cas9/crispr_DART/.snakemake/log/2023-06-08T143842.546646.snakemake.log

Thanks.

add aligner options in settings file

Currently, the pipeline uses BBMap for PacBio and Illumina read alignments. I'd like to add other options, such as a Needleman-Wunsch global aligner, and let the user decide which one to use.

alignment trimming for low coverage regions

Add an option to the settings file to trim alignments at the low-coverage 5' and 3' ends of the amplicon, to avoid looking for indels in low-coverage regions. This trimming threshold can be useful in certain situations where the target amplicon is not completely amplified in some samples.

issue with running Crispr_DART pipeline

Hi,
I tried to install the Crispr_DART tool and its dependencies. I also separately downloaded GATK-v3.8-0-ge9d806836 and extracted the zip archive, which contained only the GenomeAnalysisTK.jar file. Next, I set the paths for java and gatk in the settings.yaml file, along with the locations of all input files and the output directory. Finally, when I tried to execute the program, it did not write any result files to the gatk subfolder created inside the parent output folder Crispr_DART. When I checked the log files generated for the gatk step, they showed the following error.

ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/data/ngs/programs/crispr_DART/tool_gatk_3.8/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...

Please help me to resolve the issue and generate complete result files.

Thanks
Nihar