Giter VIP home page Giter VIP logo

screen_assembly's Introduction

Screen assemblies

Pipeline that screens for presence of genes of interest (GOI) in bacterial assemblies. Generates multiple CSVs and plots that describe which genes are present and how variable their sequence is. Can use DNA or protein query sequences (GOIs) and DNA contigs/fastas or protein fastas as database (db) to search in.

If you use this script in scientific publications, then please reference Davies et al., 2019., Nature Genetics., http://dx.doi.org/10.1038/s41588-019-0417-8

Getting Started

You need one fasta file with all GOIs as the query and a folder with db contigs/fastas. Db files can only have one '.' in the name (i.e., sample_1.fa NOT sample.1.fa)

Prerequisites

Required

Python 3

Command line blast

ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

Optional

Clustal Omega

http://www.clustal.org/omega/

IQtree

http://www.iqtree.org/doc/Quickstart

Installing

pip3 install --user screen_assembly

Make sure screen_assembly3.py is in you PATH

If tkinter is missing, do 'sudo apt-get install python3-tk' (on Ubuntu)

Check for updates

pip3 install --user screen_assembly

Running the tests

Once screen_assembly3.py is in your PATH type screen_assembly3.py -h . If you have all dependencies then the help menu will display. Otherwise read the erorr and install whichever dependency is missing.

Running the program

Please see the WIKI

Authors

License

This project is licensed under the MIT License - see the LICENSE https://github.com/shimbalama/screen_assembly/blob/master/LICENSE file for details

Acknowledgments

  • Mark Davies lab and Jake for testing

screen_assembly's People

Contributors

shimbalama avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

screen_assembly's Issues

Not making unique seq fasta for AA_seqs_and_ref_nuc local variable 'i' referenced before assignment

HI Liam,

Recently I have had an issue with some query proteins, looking at the output notes for the proteins with an issue I am getting the following message:
Not making unique seq fasta for SpeC_seqs_and_ref_nuc local variable 'i' referenced before assignment

This time I had an issue with 3 out of 23 proteins. I have checked the inputs, which are fine, no odd characters . Also the proteins that have an issue changes across different runs.

Any insights on what is going on and how to fix it?
Please let me know if you need any more information.

Thank you,
Jacqui

run from empty dir

Currently have to run run an empty dir due to loose glob.

File "/home/liam/.local/bin/screen_assembly3.py", line 227, in main
    fin = open(any_file, "rt")

Picks up input dirs if any. Make glob more stringent and not pass anything in flags -i and -q

extra dependency -MUSCLE

Hi, Liking the tool, but have been having a bit of trouble getting the alignment and plotting function working. I'll post separate issues for each thing I have managed to troubleshoot so far.

In the code you call MUSCLE aligner however this isn't listed as a dependency in the wiki. ClustalO is listed and is called following so I'm not sure if MUSCLE has been removed but is still in the code or should be a listed dependency.

in analysis.py lines 766-792

def versions(args):

    '''
    Save wrapped program version to text
    '''

    with open('versions.txt', 'w') as fout:
        fout.write(str(args)+'\n')
        call(['blastn', '-version'], stdout=fout)
        if args.aln:
            call(['muscle', '-version'], stdout=fout)
        #if args.raxml:
        #    call([args.raxml_executable, '-v'], stdout=fout)
        if args.IQtree:
            call(['iqtree', '-version'], stdout=fout)

def clustal(args, query, seq_type):

    if not os.path.exists(query + '_' + seq_type + '_non_redundant.aln'):
        call(['clustalo',
            '-i', query + '_' + seq_type + '_non_redundant.fasta',
            '-o', query + '_' + seq_type + '_non_redundant.aln',
            '--threads', '1'])#seems faster than just giving clustal muli threads

    if args.plots:
        if seq_type == 'aa':
            gap_plot(args, query, seq_type)

some protein sequences break script "Not making unique seq fasta for BCAL1869_seqs_and_ref_nuc local variable 'i' referenced before assignment"

Hi, I have previously run the script with nucleotide queries which has been working fine however I tried to run with some protein queries and it ran for some but for others I get the following error message

"Not making unique seq fasta for BCAL1869_seqs_and_ref_nuc local variable 'i' referenced before assignment"

The issue appears to only effect the sequences that are of concern (the csv and box and carriage plots work for the other queries) and when isolated from the multi fasta into a shorter list the same error occurs. I had a quick dig around and suspect that there might be an indentation/ local variable call issue at line 254 of analysis (if i > 1) causing the function responsible for creating the seqs_and_ref_nuc file to throw up the error but not entirely sure.

I am also not sure what it would be that is causing some sequences to fail but others succeed which I assume is a different issue that may be specific to my dataset. In case you want to see if that issue is reproducible here is the link to the sequences I have been using

https://www.burkholderia.com/downloads/burkholderia/bgd_r_9_1/Burkholderia/complete/fna-complete.tar.gz
(I needed to alter file names to remove # and reduce file name length).

fasta file is also attached in .txt format.
mixed_success_fail.txt

def make_fasta_non_redundant(args, query, seq_type):

    '''
    Boil each multi fasta down to unique seqs for every query - at aa level
    '''
    try:
        redundant_map = collections.defaultdict(list)
        for i, record in enumerate(SeqIO.parse(query + '_seqs_and_ref_' + seq_type + '.fasta','fasta')):
            seq_str = str(record.seq)
            redundant_map[seq_str].append(record.id)


        #write non redundant fasta
        fout= open(query + '_' +seq_type+'_unique_seqs.fasta','w')
        if i > 1:
            for i, seq in enumerate(redundant_map):
                if 'ref' in redundant_map.get(seq):
                    fout.write('>0 ' + ','.join(redundant_map.get(seq))+'\n')
                else:
                    fout.write('>'+ str(i + 1) + ' ' + ','.join(redundant_map.get(seq))+'\n')
                fout.write(seq+'\n')
        fout.close()
    except Exception as e:
        print ('Not making unique seq fasta for',query + '_seqs_and_ref_' + seq_type, e)

iTOL heatmap

Hi Shimbalama,
Could you please change your iTOL heatmap file so that the genes with the most hits are shown first, then the others in descending order? Cheers

name 'iplot' is not defined

Hi,

When I try the plot function (on MacOSX) I get the following error

#code run
screen_assembly3.py -t 1 -l 95 --percent_identity 95 -o -k -q genes.mfa -i contigs/

Plotting...
Traceback (most recent call last):
File "/Users/user/.local/bin/screen_assembly3.py", line 232, in
main()
File "/Users/user/.local/bin/screen_assembly3.py", line 205, in main
pool.map(iplot.plot_vars, tmp)
NameError: name 'iplot' is not defined

looking through the script I couldn't see where the module iplot (or plot which is on line 211) is defined (I'm assuming they are calling a module from matplotlib but wasn't sure). Not sure if there is a library that I haven't loaded but it looks like there needs to be a statement importing something as iplot at the beginning of the script.

        if args.plots:
            print ('Plotting...')
            if args.operon:
                with Pool(processes=int(args.threads)) as pool:
                    tmp = [(args, query, 'nuc') for query in query_seqs]
                    pool.map(iplot.plot_vars, tmp)
            else:
                variant_types(args, assemblies) 
                with Pool(processes=int(args.threads)) as pool:
                    tmp = [(args, query, 'aa') for query in query_seqs]
                    pool.map(plot.plot_vars, tmp)   
            var_pos_csv(args, ana.blast_type)

If it helps to troubleshoot it appears some of the plots are working as I get the box_and_carriage plots.svg outputs but I don't get any of the snp/ gap variation plots. Not a major concern but

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.