The Co-SELECT pipeline uses doit, a Python-based task automation tool, to automate the repetitive tasks of analyzing sequencing data from multiple rounds (called cycles in many of the scripts) of multiple TF experiments. A basic knowledge of doit is helpful, and doit must be installed.
The top level directory has the following subdirectories:
DNAShapeR
- The DNAShapeR program is slightly modified for our needs. In particular, we need the executable DNAShapeR/src/dnashpe to generate the shape values from the oligo sequences.
src
- This subdirectory contains all the source code of Co-SELECT.
downloads
- This subdirectory should have all the gzipped fastq files for the HT-SELEX experiments downloaded from the ENA website.
data
- This subdirectory contains all intermediate files generated by Co-SELECT. It must have enough space! On our system it takes about 3 TB for the 131 TF experiments that we analyzed.
results
- The results of Co-SELECT are kept here.
The subdirectory src also contains the following essential files:
PRJEB14744.txt
- It gives the details of the sequencing data of the project PRJEB14744 in ENA. We have mostly used the following columns: run_accession, fastq_ftp, and submitted_ftp.
PRJEB14744_nonzero_cycle.csv
- It maps each TF experiment (and barcode/primer if there are multiple experiments for the same TF) to the accession number of the project PRJEB14744 in ENA.
PRJEB14744_zero_cycle.csv
- It maps the initial pool (round 0) of the experiments (multiple experiments may share the same initial pool) to the accession number of the project PRJEB14744 in ENA.
tf_inventory_jolma_ronshamir.csv
- It contains all the information that we could glean from the previous two papers on the dataset.
tf_coremotif.csv
- It gives the core motifs that we used for the experiments. One may change the core motifs and rerun the complete Co-SELECT analysis.
tf_run_coselect.csv
- This gives the list of experiments on which Co-SELECT has to be run.
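As an illustration of how the three columns named above might be read, here is a minimal sketch. It assumes that PRJEB14744.txt is a tab-separated ENA file report with a header row, which the text above does not state. The sample rows, accession numbers, and URLs are placeholders, and read_run_table is a hypothetical helper, not part of Co-SELECT.

```python
import csv
import io

# Hypothetical excerpt in the assumed tab-separated layout.
# The accessions and URLs are placeholders, not real entries.
SAMPLE_REPORT = (
    "run_accession\tfastq_ftp\tsubmitted_ftp\n"
    "ERR0000001\tftp.example.org/fastq/ERR0000001.fastq.gz\t"
    "ftp.example.org/submitted/a.fastq.gz\n"
    "ERR0000002\tftp.example.org/fastq/ERR0000002.fastq.gz\t"
    "ftp.example.org/submitted/b.fastq.gz\n"
)


def read_run_table(handle):
    """Map each run_accession to its (fastq_ftp, submitted_ftp) pair."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["run_accession"]: (row["fastq_ftp"], row["submitted_ftp"])
        for row in reader
    }


runs = read_run_table(io.StringIO(SAMPLE_REPORT))
for accession, (fastq_ftp, submitted_ftp) in sorted(runs.items()):
    print(accession, fastq_ftp, submitted_ftp)
```

To try the same helper on the real file, one could pass it an open file handle instead, for example open("PRJEB14744.txt", newline=""), provided the layout assumption holds.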
The example script dodo_downloads.py can be used to download all the experiments. Note that downloading all the datasets requires 136 GB of disk space.
Note that all the doit task files are kept in the src directory, so change the current working directory to src before running them.
$ cd src
$ doit -f dodo_downloads.py
The download can be made faster by using multiple processes, say n=10, as follows:
$ doit -n 10 -f dodo_downloads.py
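For readers new to doit, the following is a minimal sketch of what a task file of this kind can look like. It is not the content of dodo_downloads.py. The RUNS table, its URLs, and the use of wget are assumptions made for illustration. Each yielded dictionary is an independent sub-task, which is what allows doit to spread the work over several processes with -n.

```python
import os

# Hypothetical run table: accession -> download URL (placeholders only).
RUNS = {
    "ERR0000001": "ftp://ftp.example.org/fastq/ERR0000001.fastq.gz",
    "ERR0000002": "ftp://ftp.example.org/fastq/ERR0000002.fastq.gz",
}
DOWNLOAD_DIR = "../downloads"  # relative to src, where doit is run


def task_download():
    """Yield one doit sub-task per sequencing run."""
    for accession, url in sorted(RUNS.items()):
        target = os.path.join(DOWNLOAD_DIR, os.path.basename(url))
        yield {
            "name": accession,
            # A string action is run by doit as a shell command.
            # wget is an assumption; any downloader would do.
            "actions": ["wget -q -O %s %s" % (target, url)],
            "targets": [target],
            # No file dependencies: treat the task as up to date
            # unless its target file is missing.
            "uptodate": [True],
        }
```

Saved under a hypothetical name such as dodo_sketch.py, this could be run with doit -n 2 -f dodo_sketch.py. Because each sub-task declares its target, a run that is interrupted should only need to fetch the files that are still missing.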
Co-SELECT needs to compute the round 0 probabilities using a simple Markov model. This is done by invoking the following command:
$ doit -n 50 -f dodo_round0.py
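The text does not specify the order of the Markov model or the alphabet it is defined over. Purely as an illustration of the idea, the sketch below fits a first-order Markov chain over nucleotides to a toy round 0 pool and uses it to compute the probability of a sequence. The model order, the pseudocount, the function names, and the toy oligos are all assumptions and may differ from what dodo_round0.py does.

```python
from collections import Counter

ALPHABET = "ACGT"


def fit_first_order(sequences, pseudocount=1.0):
    """Estimate start and transition probabilities of a first-order chain."""
    start_counts = Counter()
    pair_counts = Counter()
    for seq in sequences:
        # Skip empty reads and reads with ambiguous bases such as N.
        if not seq or set(seq) - set(ALPHABET):
            continue
        start_counts[seq[0]] += 1
        for prev, nxt in zip(seq, seq[1:]):
            pair_counts[(prev, nxt)] += 1

    k = len(ALPHABET)
    n_start = sum(start_counts.values()) + pseudocount * k
    start = {b: (start_counts[b] + pseudocount) / n_start for b in ALPHABET}

    trans = {}
    for prev in ALPHABET:
        row_total = sum(pair_counts[(prev, b)] for b in ALPHABET) + pseudocount * k
        for nxt in ALPHABET:
            trans[(prev, nxt)] = (pair_counts[(prev, nxt)] + pseudocount) / row_total
    return start, trans


def sequence_probability(seq, start, trans):
    """Probability of seq under the fitted chain."""
    p = start[seq[0]]
    for prev, nxt in zip(seq, seq[1:]):
        p *= trans[(prev, nxt)]
    return p


# Toy round 0 pool (made-up oligos, for illustration only).
pool = ["ACGTAC", "ACGGAC", "TTGCAA", "ACGTTT"]
start, trans = fit_first_order(pool)
print(sequence_probability("ACGT", start, trans))
```

The pseudocount keeps every transition probability non-zero, so a sequence containing a dinucleotide that never occurs in the initial pool still receives a small positive probability.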
Co-SELECT analysis of the selected TF experiments, configured through the file tf_run_coselect.csv, can be run by invoking the following command:
$ doit -n 50 -f dodo_analyze.py
The comparison of experiment vs control groups in the Co-SELECT analysis and the generation of results are done by invoking the following command:
$ doit -n 50 -f dodo_results.py
The summary results and plots are saved in the ../results directory.
The promiscuity of shapemers in the motif-free oligos can be computed, and the corresponding plot from our paper generated, by invoking the following command:
$ doit -f dodo_promiscuous.py
The lists of highly promiscuous shapemers and the plots are saved in the ../results directory.