leebergstrand / backblast_reciprocal_blast

This repository contains a reciprocal BLAST program for filtering down BLAST results to best bidirectional hits. It also contains a toolkit for finding and visualizing BLAST hits for gene clusters within multiple bacterial genomes.

License: MIT License

Python 62.50% Shell 5.88% R 31.62%


backblast_reciprocal_blast's Issues

Improve documentation about automated phylogeny

Better reporting of removed genomes

If an automatic phylogenetic tree is produced via GToTree, it's possible that some low-quality genome sequences will be filtered out by GToTree during tree construction. These will then be removed from the final heatmap with a warning thrown to the user that they were removed.

We should consider better documenting this behaviour so that users are not confused by disappearing genomes. We could also make the warning more explicit.

No support for eukaryotic genomes

GToTree only works for prokaryotic genomes unless a custom HMM set for eukaryotes is included, which is outside of the design framework of this tool currently. We should either document this or build a way for eukaryotic genomes to be compatible.

Convert CLI interface to python

The current CLI for BackBLAST2 is written in Bash. This causes portability issues between different systems. In particular, identifying the script's source directory is a challenge given that not all systems have a consistent way to identify full directory paths, as mentioned in #57.

Re-writing the CLI in another language like Python could result in substantial stability improvements.

Support for Multiple BLASTp Threads (search.py)

Problem Description

Older versions of BackBlast support BLASTp multithreading via BLASTp's -num_threads parameter. BackBlast2 no longer supports this functionality.

Problem Solution

Add a threads parameter to search.py and pass this value on to the BLASTp subprocess.
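A minimal sketch of what this could look like, assuming search.py uses argparse and builds the blastp call as an argument list (as the traceback in a later issue suggests). The function names `build_blastp_command` and `parse_args` are hypothetical and not taken from search.py. BLAST+ may ignore -num_threads when -subject is used instead of -db, so that is worth verifying before relying on it.

```python
import argparse


def build_blastp_command(query_path, subject_path, e_value, threads=1):
    """Assemble the blastp argument list, forwarding the thread count."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return [
        "blastp",
        "-query", query_path,
        "-subject", subject_path,
        "-evalue", str(e_value),
        "-num_threads", str(threads),  # passed straight through to BLASTp
        "-outfmt", "10 qseqid sseqid pident evalue qcovhsp bitscore",
    ]


def parse_args(argv=None):
    """Hypothetical CLI: only the new --threads option is shown."""
    parser = argparse.ArgumentParser(description="search.py threads sketch")
    parser.add_argument("--threads", type=int, default=1,
                        help="Number of threads to hand to BLASTp")
    return parser.parse_args(argv)
```

The resulting list can be handed to `subprocess.check_output` in the same way the existing command is.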

Look for symlinked reference genomes

Problem description

I ran BackBLAST release v2.0.0-alpha6 on a server running Ubuntu 20.04 LTS.

When a subject genome (in the subject_genome_directory) is a symlink, it is not added to the search pipeline.

Furthermore, if no detectable genome files are in the subject_genome_directory, BackBLAST exits with a very cryptic error message (realpath: missing operand).

Proposed solution

In generate_run_templates.sh, function add_subjects_to_config_file, the find command should include the -L flag to look for symlinks. In addition, a check should be added for whether any subject genome files were detected by find.

Possible complications

In general, generate_run_templates.sh should probably be moved to Python long-term.
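If the script is moved to Python, both fixes could be sketched roughly as follows. This is an assumption-laden sketch: the function name `find_subject_genomes` and the `.faa` extension default are hypothetical, and the real script may search differently.

```python
from pathlib import Path


def find_subject_genomes(subject_dir, extension=".faa"):
    """Return subject genome files in a directory, including symlinked ones."""
    directory = Path(subject_dir)
    # Path.is_file() follows symlinks, so a link to a regular file is included.
    genomes = sorted(
        path for path in directory.iterdir()
        if path.suffix == extension and path.is_file()
    )
    if not genomes:
        # Fail early with a readable message instead of a cryptic downstream error.
        raise FileNotFoundError(
            f"No '*{extension}' subject genome files found in '{directory}'."
        )
    return genomes
```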

Unusual behaviour in BackBLAST.py when run in multiple instances

@LeeBergstrand This is a bit of a mystery to me.

Background

BackBLAST.py runs with 1 thread. In the snakemake workflow, I allow the user to run multiple instances of BackBLAST.py at a time (via the snakemake --jobs flag) to perform analyses in parallel.

Problem description

When --jobs is 1 (i.e., 1 process runs at a time), pipeline results match what I'd expect. For example:

This figure is reproducible over multiple runs.

However, when --jobs is > 1, I get a different result each time. Here are three runs with identical settings run with --jobs 4 (i.e., 4 single-threaded processes can run at once):

Specifically, it appears that some genomes randomly have empty results from BackBLAST.py. BackBLAST.py successfully ran (i.e., the log file looks fine), but the output file is empty.

Any idea what could be behind this issue (or keywords to look up)? Thanks.

Check all input pathways before calling BLASTp (search.py)

Problem Description

If one puts in the wrong FASTA file paths for search.py, then the error is only caught when BLAST fails as a subprocess.

e.g.

Command line argument error: Argument "query". File is not accessible:  `../Reference_Genes/reference_genes_n47.faa'
Traceback (most recent call last):
  File "/home/lee/miniconda3/envs/backblast/share/backblast/scripts/search.py", line 308, in <module>
    main(cli_args)
  File "/home/lee/miniconda3/envs/backblast/share/backblast/scripts/search.py", line 213, in main
    forward_blast_high_scoring_pairs = get_blast_hight_scoring_pairs(query_gene_cluster_path=query_gene_cluster_path,
  File "/home/lee/miniconda3/envs/backblast/share/backblast/scripts/search.py", line 45, in get_blast_hight_scoring_pairs
    return filter_blast_csv(run_blastp(query_gene_cluster_path,
  File "/home/lee/miniconda3/envs/backblast/share/backblast/scripts/search.py", line 62, in run_blastp
    blast_out = subprocess.check_output(
  File "/home/lee/miniconda3/envs/backblast/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/home/lee/miniconda3/envs/backblast/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['blastp', '-query', '../Reference_Genes/reference_genes_n47.faa', '-subject', './deltaproteobacterium_NaphS2.faa', '-evalue', '1e-25', '-soft_masking', 'true', '-seg', 'yes', '-outfmt', '10 qseqid sseqid pident evalue qcovhsp bitscore']' returned non-zero exit status 1.

Problem Solution

Check if the input files are there in Python and error out with a clear message before calling BLASTp.
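A minimal sketch of such a check, assuming it is called at the start of main before any BLAST subprocess is launched. The helper name `check_input_files` and its keyword-argument interface are hypothetical.

```python
import os


def check_input_files(**named_paths):
    """Raise one clear error listing every input path that is not a file."""
    missing = [
        f"{name}: {path}"
        for name, path in named_paths.items()
        if not os.path.isfile(path)
    ]
    if missing:
        raise FileNotFoundError(
            "Input file(s) not found:\n  " + "\n  ".join(missing)
        )
```

It could be called as `check_input_files(query=query_path, subject=subject_path)`, so the message names the argument that was wrong.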

Modified BackBLAST

Improved code comments only, but this may have broken something.
Needs to be tested.

Incorrect variable types

Problem description

When using the current develop branch version of BackBLAST.py (commit ff96e60), I find that sometimes the best hits are filtered out. I noticed this when BLAST'ing query sequences to their own reference sequence as a control. One would expect all queries to return with 100% matches.

This works with default settings. For example, the query (10 genes),

BackBLAST.py --gene_cluster C_ferrooxidans_genes.faa --query_proteome C_ferrooxidans_ORFs.faa --subject_proteome C_ferrooxidans_ORFs.faa

Results in all 10 genes with 100% identity to themselves in the genome.

However, when adding the --min_ident flag, suddenly results start to disappear! All 100% match entries vanish, and the next closest hits are shown.

Proposed solution

The problem seems to resolve when clearly specifying that --min_ident is a number (line ~188, start of the main function):

Old:

    input_e_value_cutoff = args.e_value
    input_min_ident_cutoff = args.min_ident

New:

    input_e_value_cutoff = args.e_value
    input_min_ident_cutoff = float(args.min_ident)

Suddenly, everything seems to work. Is it possible that --min_ident was being treated as a character for some reason?

Things to consider

Should --e_value and other input flags also be clearly specified for their variable type?
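One way to address this for all numeric flags at once is to declare the type in the argument parser, assuming BackBLAST.py uses argparse (the `args.min_ident` access suggests so). argparse hands values over as strings unless `type` is given, and in Python 2 a float-to-string comparison does not raise an error, which could explain silent mis-filtering. The default values below are placeholders, not the script's real defaults.

```python
import argparse


def build_parser():
    """Hypothetical parser: numeric flags are converted at parse time."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--e_value", type=float, default=1e-25)   # placeholder default
    parser.add_argument("--min_ident", type=float, default=25.0)  # placeholder default
    return parser


args = build_parser().parse_args(["--min_ident", "40"])
# args.min_ident is now a float, so numeric comparisons behave as intended.
```

With this in place, the explicit `float(...)` cast in main would no longer be needed.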

@LeeBergstrand I can make a PR for this if helpful.

Jackson

Add CLI flag for already indexed database

Hi, would it be possible to avoid supplying a raw protein FASTA file? I'm thinking of using a huge database (RefSeq, for example), and indexing the database every time it is used wastes time. Could you add an option to give the program an already indexed database?

regards

Documentation needs to be more explicit.

Hi, I tried to run the script but I got a result I didn't expect.
For each query there is more than one hit as a result of reciprocal blast.
I'd expect only one hit for each query.

I ran the script as follows:

makeblastdb -in br.faa -input_type fasta -dbtype prot
makeblastdb -in reg.faa -input_type fasta -dbtype prot
BackBLAST.py br.faa br.faa reg.faa

I don't know if I'm forgetting anything, or if I don't understand how the scripts work.

The files and the output are attached.

I ran:
Python 2.7.10
Protein-Protein BLAST 2.6.0+

Regards.

files_rb.zip

Remove BackBlast name from scripts

We can remove the "backblast" from in front of all these scripts' names.

Minor enhancements for BackBLAST2

Some miscellaneous feature requests or enhancements.

Edge cases for generate_BackBLAST_heatmap.R:

Check that a genome is not eliminated from the heatmap when genes are removed because missing in the gene metadata file. Check tree tips correspond to heatmap y-axis labels immediately before plotting one last time to be defensive.
Check if there are multiple queries with the same ID before collapsing into a wide table. Warn the user and randomly pick one, or error out.
Check if two tree tips have the same name before midpoint rooting, and error out or warn and give unique numerical suffixes

More labour-intensive additions that could be helpful:

Have BackBLAST output FastA files with the sequences of the detected genes. Consider also optionally aligning them or even using them to make unrooted trees.
Add an option to summarize the top x fwd/rev hits of each reciprocal BLAST search for debugging purposes. This could be in a single long table format with a column for whether the BLAST is forward or reverse. Note that the reverse search would only be for the top forward search, though, so maybe it would be more accurate to put reverse BLAST searches in their own table??? Not sure. Consider also flagging when BackBLAST "just barely" misses a gene and warning the user.
Consider adding an option to just run forward BLAST only (some users might find this handy for some reason). However, make sure to warn users that this is generally not advisable.
Add the ability to have multiple query sets and query genomes.

Documentation:

Explain the issue of having multiple gene copies in the query reference genome
Give common commands people might run. For example, using the --until flag to just go up to the BLAST table. (This could be combined with other commands as a workaround for only being able to add one query genome per run, for example.)
Give a manual of the sub-commands of BackBLAST

Unneeded files in BackBLAST

Many files in BackBLAST1 have been deleted in BackBLAST2. Specifically:

Example files in ExampleData
All scripts in SupplementaryScripts
All scripts in Statistics
Some scripts in Visualization

@LeeBergstrand Do you want to save some of these files (especially scripts) for another purpose? If so, you might want to move them over to another repo, for example.

Note that we will likely need to clean up this repo at some point due to the large repo size currently, so some of these files might disappear entirely from the git commit history in this repo. More on this in #47.

Refactor CreateBlankResults.py

Problem description

CreateBlankResults.py takes a text list of empty BLAST tables to create and creates them using query ORF info. This script would fit better in the overall pipeline if it worked on each subject file individually instead.

Proposed solution

Refactor the code and rename it to RepairMissingResults.py. Code would:

Receive a BLAST table (CSV) and query.faa file as input
Check if the BLAST table file is empty
If empty, replace with blank entries and send to the output file
If not empty, just make a copy of the existing table and send it to the output file
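The four steps above could be sketched as follows. This is only an illustration under assumptions: the six-column layout is taken from the -outfmt string seen in another issue's traceback, the "NA" placeholder and the function names are hypothetical, and query IDs are read directly from FASTA headers rather than through Biopython.

```python
import csv
import os
import shutil

BLAST_COLUMNS = 6  # assumed: qseqid, sseqid, pident, evalue, qcovhsp, bitscore


def read_fasta_ids(fasta_path):
    """Return the first whitespace-delimited token of each FASTA header."""
    ids = []
    with open(fasta_path) as handle:
        for line in handle:
            header = line[1:].strip() if line.startswith(">") else ""
            if header:
                ids.append(header.split()[0])
    return ids


def repair_missing_results(blast_table_path, query_fasta_path, output_path,
                           blank="NA"):
    """Copy a non-empty BLAST table; otherwise write one blank row per query.

    Returns True if blank rows were written, False if the table was copied.
    """
    if os.path.getsize(blast_table_path) > 0:
        shutil.copyfile(blast_table_path, output_path)
        return False
    with open(output_path, "w", newline="") as out:
        writer = csv.writer(out)
        for query_id in read_fasta_ids(query_fasta_path):
            writer.writerow([query_id] + [blank] * (BLAST_COLUMNS - 1))
    return True
```

Working on one table at a time like this would let the script run as an ordinary per-subject step in the pipeline.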

Running multiple similar clusters with competition.

When running BackBLAST against multiple clusters inside the same query file, the clusters will compete against each other for hits, since all hits are in the same graph.

Add graph support directly into search.py

Background

Currently, search.py sources code from Graph.py in the same folder.

Problem description

Sourcing Graph.py like this causes some challenges for the use of search.py. Namely, it appears that search.py and Graph.py must be in the same folder for search.py to easily run. Although it is possible to specify relative paths for sourcing code in Python, this seems to not be the recommended course of action when the files are not part of a proper Python package structure.

Proposed solution

Graph.py is short. Could the code just be integrated into search.py? This would also be a good excuse to re-write this tutorial code.

@LeeBergstrand Let me know if you think this is the best course of action.

Error in rule generate_phylogenetic_tree

I followed the instructions from the "develop" branch; however, I ran into the following errors after running the last line of code from the recommended workflow.

BackBLAST run output_dir/config.yaml output_dir

my code: BackBLAST run rbb_pit/config.yaml rbb_pit

I just wanted the csv file, so I used "-t NA" in the setup code to bypass the phylogenetic tree.

BackBLAST setup -t NA query.faa query_genome.faa subject_dir output_dir

my code: BackBLAST setup -t NA mm_queryproteinlist.faa mmproteinrefseq.faa subject_dir rbb_pit

Now having dramas with the heatmap...

iqtree potential cause, any thoughts?

Using thread number in temp_query may not be HPC safe.

@jmtsuji With regards to #36

That should work. However, if Snakemake is used in HPC mode with a job queuing system, it's possible that thread A on machine X might have the same thread number as thread B on machine Y. If these machines share a disk (as would be the case in HPC), then their files could clobber each other. Maybe replace the thread number with something like a UUID? https://docs.python.org/3.7/library/uuid.html
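A rough sketch of the UUID idea, using only the standard library. The helper name `unique_temp_path` and the `temp_query` prefix and `.faa` suffix are illustrative assumptions rather than names from the pipeline.

```python
import os
import uuid


def unique_temp_path(directory, prefix="temp_query", suffix=".faa"):
    """Build a temp file path that does not depend on thread or process IDs."""
    # uuid4 is random, so collisions across machines are vanishingly unlikely.
    return os.path.join(directory, f"{prefix}_{uuid.uuid4().hex}{suffix}")
```

`tempfile.mkstemp` from the standard library is another option if the file should also be created atomically.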

Add sanity check to ensure that the query sequences are contained in the query genome proteins (search.py)

Problem Description

If a user collects their pathway proteins and their query organism's proteins from different sources, for example, UniProt and GenBank, then BackBlast will give blank results because the two files use different accession systems. The query pathway file and query organism proteins have to use the same accessions.

Problem Solution

Scan the query organism proteins for the pathway proteins by accession and display an error message if they are not found.

Replace the usage of the pathway query file with a file containing a list of accessions from the query subject file. Automatically use the pathway accession list file to extract a pathway query file out of the query organism protein file as a temp file.
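The first option (scanning by accession) might look roughly like this. It is a sketch under assumptions: accessions are taken to be the first whitespace-delimited token of each FASTA header, and the function names are hypothetical rather than part of search.py.

```python
def read_fasta_ids(fasta_path):
    """Return the first whitespace-delimited token of each FASTA header."""
    ids = []
    with open(fasta_path) as handle:
        for line in handle:
            header = line[1:].strip() if line.startswith(">") else ""
            if header:
                ids.append(header.split()[0])
    return ids


def find_missing_queries(query_fasta_path, proteome_fasta_path):
    """List query accessions that are absent from the query proteome."""
    proteome_ids = set(read_fasta_ids(proteome_fasta_path))
    return [qid for qid in read_fasta_ids(query_fasta_path)
            if qid not in proteome_ids]
```

If the returned list is non-empty, the script could exit with an error naming the missing accessions instead of producing blank results.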

Best practice for BackBLAST conda install

I'm trying to decide the best way to organize BackBLAST as a conda package.

In my mind, BackBLAST has two major code components:

BackBLAST.sh (command-line interface) and the snakemake workflow
the BackBLAST scripts for individual steps in the snakemake workflow. Recall that the snakemake workflow currently creates conda envs with the dependencies needed to run these scripts.

We can either install these components together or separately. Here are two different scenarios I've thought of:

Make the BackBLAST conda package so that it includes BackBLAST.sh and all of the support scripts. This means that many dependencies (e.g., R) will be installed when installing BackBLAST itself.
Make the BackBLAST conda package so that it includes BackBLAST.sh (and the Snakefile) only. This would be a very simple install. I would then create a second conda package (e.g., BackBLAST-utils) with the support scripts and dependencies; this second conda package would be installed and run during the snakemake workflow.

@LeeBergstrand Any preferences? I lean toward the second option to keep things modular, in case we continue to expand the pipeline and need additional conda envs in the future. (FYI, this is not a high priority issue if you are busy currently.)

fail to run set up

Dear author, I followed the install manual on this website (https://zenodo.org/record/3697265#.XykIRigzbic) and created the backblast env with conda. However, when I tried to use the backblast setup command, it failed with the following error:

I don't know how to solve it. Do you have any suggestions? Thanks!

Strip CSV suffix from heatmap column labels.

Input File Documentation

Usage: BackBLAST.py <queryGeneList.faa> <queryBLASTDB.faa> <subjectBLASTDB.faa>

Query and subject BLAST DB are straight forward, but what is the queryGeneList.faa?
If I have about 100 protein *.faa files in my query, would "queryGeneList.faa" be a concatenated file of these 100 *.fa files?

Thanks, Chris

Originally posted by @tvtv195 in #8 (comment)

Help with BackBlast Please!

Hello

I was wondering if I could please have some help with an error I am getting from BackBlast. I am trying to run the example data provided in the folder using the line

python ~/BackBLAST/BackBLAST.py queryGeneList.faa CP000431.1.faa AL123456.3.faa

but I keep getting the error "_csv.Error: iterator should return strings, not bytes (the file should be opened in text mode)"

Before this error, I first make a blast database using
makeblastdb -in AL123456.3.faa -dbtype prot
makeblastdb -in CP000431.1.faa -dbtype prot

next I concatenate all the protein sequences using cat *.faa > queryGeneList.faa

I installed both Biopython and BLAST and have tried to follow the instructions from this post #8

If you are curious, here is the rest of the error I get

python ~/BackBLAST/BackBLAST.py queryGeneList.faa CP000431.1.faa AL123456.3.faa
Opening AL123456.3.faa...

Forward Blasting to subject proteome...
Traceback (most recent call last):
File "/Winnebago/cacornel/BackBLAST/BackBLAST.py", line 124, in
BLASTForward = filterBLASTCSV(BLASTForward) # Filters BLAST results by percent identity.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Winnebago/cacornel/BackBLAST/BackBLAST.py", line 70, in filterBLASTCSV
for HSP in BLASTreader:
_csv.Error: iterator should return strings, not bytes (the file should be open
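This error is typical of Python 3's csv module being handed bytes. A likely cause, though not confirmed here, is that BLAST output captured from a subprocess arrives as bytes in Python 3 and is passed to the CSV reader undecoded. A minimal sketch of a decode-then-parse step follows; the function name `parse_blast_csv` is hypothetical and is not the script's filterBLASTCSV.

```python
import csv
import io


def parse_blast_csv(raw_output):
    """Parse BLAST CSV output given as either bytes or str into rows."""
    if isinstance(raw_output, bytes):
        # subprocess output is bytes in Python 3; csv needs text.
        raw_output = raw_output.decode("utf-8")
    return list(csv.reader(io.StringIO(raw_output)))
```

Running the original script under Python 2, which it appears to have been written for, may be a simpler workaround.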

Large repo size

The BackBLAST repo is currently a couple hundred MB in size, which is quite large. I suspect this is mostly due to old ExampleData files, which have now been removed in BackBLAST2.

We will need to clean the repo somehow to get the repo size down eventually. We have a couple options:

Make a completely new Github repo for BackBLAST2 (and leave the current master branch more-or-less as-is for BackBLAST)
Delete this repo, clean up the git history locally (e.g., the article you sent earlier), and re-make the BackBLAST repo (possibly renamed) with a simplified git history. The new repo might have to go without the old ExampleData folder.

@LeeBergstrand Thoughts? Best practices?
