daveuu / baga

Bacterial and Archaeal Genome Analyser

License: GNU General Public License v3.0

Python 98.98% TeX 1.02%

baga's Issues

Comparative Analyses problem

Hi! It's me again :) I found that baga can't finish the last part of Comparative Analyses with my data. It shows this:

-- Comparative Analyses --

Provide --genome_name or --genome_length for a scale bar unit of actual substitutions
Plotting to NZ_CP006918.1__Klebsiella_SNPs_rooted.labelled_tree_transfers.svg
rerooting to midpoint
Traceback (most recent call last):
  File "/home/paula/programas/baga/baga_cli.py", line 3060, in <module>
    genome_length = genome_length)
  File "/home/paula/programas/baga/ComparativeAnalysis.py", line 2421, in doPlot
    thisVGT = node_2_VGT[nodename]
KeyError: 'NODE 24'

Probably it happens because I haven't done something properly, but I don't know how to solve it.
Thank you (again)
Kind regards!
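
The KeyError above means the label 'NODE 24' read from the tree is not a key in node_2_VGT. One possible cause, which the traceback alone does not confirm, is that the labels differ only by a space versus an underscore (some Newick readers convert underscores to spaces). A minimal Python sketch of a tolerant lookup, where lookup_node is a hypothetical helper and the dictionary is a stand-in for the real one:

```python
def lookup_node(mapping, label):
    """Return mapping[label], also trying space/underscore variants of label."""
    candidates = [label, label.replace(" ", "_"), label.replace("_", " ")]
    for candidate in candidates:
        if candidate in mapping:
            return mapping[candidate]
    # None of the variants is present: report the label that was asked for.
    raise KeyError(label)


# Hypothetical stand-in for the node_2_VGT dictionary named in the traceback.
node_2_VGT = {"NODE_24": "example value"}
print(lookup_node(node_2_VGT, "NODE 24"))
```

If the lookup still fails with both variants, the node is missing from the mapping and the cause lies elsewhere.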

add link to publication or preprint

Is BAGA published or is there a pre-print, or anything else available to cite?

Prepare reads with just --reads_name

Is it possible to generate the baga.CollectData.Reads-your_read_group_name.baga without having to specify the --subsample_to_cov option? e.g. baga/baga_cli.py PrepareReads --reads_name your_read_group_name

It would be nice to make full use of the reads instead of subsampling. Thanks for your time!

Include multiple completed genomes

It would be desirable to include other completed genomes in the sample set, along with the read sample set, for comparison to the reference genome.

rooted tree needed for infer recombination

Hi
When I run the ComparativeAnalysis --infer_recombination step using the command below, I get an error that the rooted tree I need as input does not exist:

baga_cli.py ComparativeAnalysis --infer_recombination --path_to_MSA $GENOME_NAME"__"$READS_NAME"_SNPs.phy" --path_to_tree $GENOME_NAME"__"$READS_NAME"_SNPs_rooted.phy_phyml_tree"

The issue seems to be that ClonalFrame requires a rooted tree, but the previous step outputs an unrooted tree by default, named $GENOME_NAME"__"$READS_NAME"_SNPs.phy_phyml_tree"

I tried manually rooting the phyml tree on my reference using iTOL, then exporting the resulting Newick file to the baga directory with the appropriate _rooted.phy_phyml_tree extension, but that resulted in an error that OROOT was not expected in the Newick file. Any fix would be appreciated, as the rest of the pipeline is working great.
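
The OROOT complaint suggests the exported file carries a label on the root node, which the reader does not expect. A minimal Python sketch that strips such a label, assuming it sits immediately before the final semicolon (strip_root_label is a hypothetical helper, and the example tree is made up). Whether baga then accepts the file has not been verified here.

```python
import re


def strip_root_label(newick, label="OROOT"):
    """Remove a root-node label (and any root branch length) before the final ';'."""
    pattern = r"\)" + re.escape(label) + r"(:[0-9.eE+-]+)?\s*;\s*$"
    # If the label is absent, the string is returned unchanged (apart from stripping).
    return re.sub(pattern, ");", newick.strip())


# Made-up example tree, not taken from the issue.
example = "(A:0.1,(B:0.2,C:0.3):0.05)OROOT;"
print(strip_root_label(example))  # (A:0.1,(B:0.2,C:0.3):0.05);
```

The tree topology and the other branch lengths are left untouched, so the rooting chosen in iTOL is preserved.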

Picard version conflict

During the execution of AlignReads with the align and deduplicate options, baga was not able to find Picard.
The version installed by --checkgetfor is 2.5.0, but baga tries to execute picard-tools-1.135.

As a workaround, I renamed the Picard directory from external_programs/picard-tools-2.5.0/ to external_programs/picard-tools-1.135/ and it now works.
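
To see which Picard version is actually installed locally before renaming anything, a short Python sketch can list the picard-tools-* folders. The external_programs/picard-tools-<version> layout is taken from the paths above; both function names are hypothetical.

```python
import os


def picard_candidates(names):
    """Keep entries named picard-tools-<version>, sorted by name."""
    return sorted(n for n in names if n.startswith("picard-tools-"))


def find_picard_dirs(external_programs="external_programs"):
    """Return paths of picard-tools-* directories found under external_programs."""
    if not os.path.isdir(external_programs):
        return []
    found = picard_candidates(os.listdir(external_programs))
    return [os.path.join(external_programs, n) for n in found
            if os.path.isdir(os.path.join(external_programs, n))]


# Run from the directory that contains external_programs/.
print(find_picard_dirs())
```

A symbolic link from the expected name to the installed folder would be an alternative to renaming, under the same assumption about the layout.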

bwa dependency checking miscommunication

When trying to align the reads against the reference, I first executed:

baga/baga_cli.py Dependencies --checkgetfor AlignReads

All the dependencies were reported as installed and found correctly.
But when I then executed the AlignReads command:

/home/rrios/baga/baga_cli.py AlignReads -n TX6128 -g TX0082 -a -d

The output was:

Bacterial and Archaeal Genome Analyser:
Novel analyses and wrapped tools pipelined for convenient processing
of genome sequences
Version 0.2 (December 20 2015)
David Williams
david.williams.at.liv.d-dub.org.uk
Work on this software was started at The University of Liverpool, UK
with funding from The Wellcome Trust (093306/Z/10) awarded to:
Dr Steve Paterson (The University of Liverpool, UK)
Dr Craig Winstanley (The University of Liverpool, UK)
Dr Michael A Brockhurst (The University of York, UK)
Copyright (C) 2015 David Williams
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it
There is NO WARRANTY, to the extent permitted by law
09:53:59
09:53:59 |=== Starting baga analysis at 09:53:59 on Fri 23 Sep, 2016 ===|
09:53:59
-- Read Aligning module --
09:53:59 Logger for AlignReads will write to baga-nosample_logs/16-09-23_09-53-59_AlignReads/00_main.log
baga.CollectData.Genome-TX0082.baga
baga.PrepareReads.Reads-TX6128.baga
Loading processed reads group TX6128
Loading genome TX0082
Aligning reads . . .
Writing BWA index files for genome_sequences/TX0082.fna
Could not find the bwa executable at executable at /home/rrios/otros/Efm_fnm_deletion_reads/bagadev/external_programs/bwa/bwa.
You can check if it is installed using:
/home/rrios/baga/baga_cli.py Dependencies --check bwa
You can install it locally using:
/home/rrios/baga/baga_cli.py Dependencies --get bwa

Then I executed:

/home/rrios/baga/baga_cli.py Dependencies --get bwa

That solved the problem, but there is some miscommunication between those calls.
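
One guess, not confirmed by the log, is that the dependency check was satisfied by a bwa found somewhere else while the alignment step looked only at the local external_programs/bwa/bwa path. A minimal Python 3 sketch that reports where an executable is found (locate_executable is a hypothetical helper, not part of baga):

```python
import os
import shutil


def locate_executable(name, local_dir):
    """Report where an executable is found: a local folder, the PATH, or nowhere."""
    local_path = os.path.join(local_dir, name)
    if os.path.isfile(local_path) and os.access(local_path, os.X_OK):
        return ("local", local_path)
    # shutil.which searches the directories listed in the PATH variable.
    system_path = shutil.which(name)
    if system_path is not None:
        return ("system", system_path)
    return ("missing", None)


# Run from the analysis directory that contains external_programs/.
print(locate_executable("bwa", os.path.join("external_programs", "bwa")))
```

A result of "system" would mean bwa is installed, but not at the local path named in the error message.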

Repeats plot interpretation

Hi! I've searched the code for the meaning of the different colours used in the repeats plots, but I can't figure it out on my own (I'm still a beginner in this field...). To interpret my data properly, could you tell me what each colour means? I have regions in blue, purple and pink in some pairs. Thank you.

Typo installing pre-reqs for debian systems

This pipeline is fantastic!

Just a quick note on a typo:

sudo apt-get install build-essentials cython

It should be:

sudo apt-get install build-essential cython

Again, this is great!

How to install the baga module?

How does one install the baga module/library?

"ImportError: No module named baga"

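"No module named baga" usually means Python cannot see the folder that contains the baga package. Judging by the paths in the tracebacks on this page (for example baga/ComparativeAnalysis.py), the cloned repository folder itself appears to act as the package, so its parent folder would need to be on the import path. That is an inference, not something the project documents here. A minimal Python sketch, with add_parent_to_path as a hypothetical helper and ~/baga as a made-up clone location:

```python
import os
import sys


def add_parent_to_path(package_dir):
    """Put the folder that contains package_dir at the front of sys.path."""
    parent = os.path.dirname(os.path.abspath(package_dir))
    if parent not in sys.path:
        sys.path.insert(0, parent)
    return parent


# Hypothetical clone location; replace with wherever the repository lives.
parent = add_parent_to_path(os.path.expanduser("~/baga"))
print(parent)
```

Setting the PYTHONPATH environment variable to the same parent folder has the same effect without editing any script.
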
GATK and java version trouble

Greetings,
I have installed baga and everything has worked well until the indel realignment step, which needs GATK.
I have GATK version 3.6-0-g89b7209, which runs on Java 1.8.

[rrios@BUSRVGENOMICA baga]$ java -jar /usr/bin/GenomeAnalysisTK.jar -version
3.6-0-g89b7209

I executed it like this:

[rrios@BUSRVGENOMICA baga]$ /source_installers/baga/baga_cli.py AlignReads -n TX6128 -g contig_139 --indelrealign --GATK_jar_path /usr/bin/GenomeAnalysisTK.jar
and the output was:
Using system JAVA JRE
GATK v3.3 requires Java v1.7 but your version is 1.8. Please install Java v1.7 to continue or use --JRE_1_7_path to specify Java v1.7 binary to use.

Then I tried with java 1.7:

/source_installers/baga/baga_cli.py AlignReads -n TX6128 -g contig_139 --indelrealign --GATK_jar_path /usr/bin/GenomeAnalysisTK.jar --JRE_1_7_path /source_installers/jre1.7.0_80/bin/java
and the output:
Using provided JAVA JRE at /source_installers/jre1.7.0_80/bin/java
Java 1.7: found!
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/broadinstitute/gatk/engine/CommandLineGATK : Unsupported major.minor version 52.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(Unknown Source)
at java.security.SecureClassLoader.defineClass(Unknown Source)
at java.net.URLClassLoader.defineClass(Unknown Source)
at java.net.URLClassLoader.access$100(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.launcher.LauncherHelper.checkAndLoadMain(Unknown Source)

There was a problem checking your --GATK_jar_path argument. Please report this as a baga bug.

This is the same result I get when I execute GATK on its own with Java 1.7:

/source_installers/jre1.7.0_80/bin/java -jar /usr/bin/GenomeAnalysisTK.jar -version
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/broadinstitute/gatk/engine/CommandLineGATK : Unsupported major.minor version 52.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(Unknown Source)
at java.security.SecureClassLoader.defineClass(Unknown Source)
at java.net.URLClassLoader.defineClass(Unknown Source)
at java.net.URLClassLoader.access$100(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.launcher.LauncherHelper.checkAndLoadMain(Unknown Source)

So, here I am reporting the bug.

For now I'm going to see if I can download a previous version of GATK to work around this.

Thank you
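
"Unsupported major.minor version 52.0" refers to the class-file format: major version 52 is written by Java 8 compilers and 51 by Java 7, so a jar built for Java 8 cannot be loaded by a 1.7 runtime. A small Python sketch of that mapping, which reads the version from the first eight bytes of a .class file (both helper names are hypothetical, and the header bytes are constructed by hand rather than read from the GATK jar):

```python
import struct


def class_file_major(header):
    """Read the major version from the first 8 bytes of a .class file."""
    magic, _minor, major = struct.unpack(">IHH", header[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return major


def minimum_java(major):
    """Map a class-file major version to the Java release that introduced it."""
    return major - 44  # 51 -> Java 7, 52 -> Java 8


# Hand-built header: magic number, minor version 0, major version 0x34 (52).
header = bytes.fromhex("cafebabe00000034")
print(minimum_java(class_file_major(header)))  # 8
```

This matches the two outputs above: the 3.6 jar runs under Java 1.8 but fails under 1.7, while the baga check expects the Java 1.7 pairing of GATK v3.3.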

failure of BAGA filter rearrangements

Dear BAGA developers,

I would be happy to get support with the following problem concerning rearrangement analysis within the BAGA pipeline, installed on December 2nd from git clone http://github.com/daveuu/baga.git (BAGA version v0.2.1, Python version 2.7.5)

Executing the pipeline described in https://github.com/daveuu/baga/blob/master/docs/guide1.md works well until step 3 (alignment of reads). However, in step 4 (rearrangements), the dependency check
baga/baga_cli.py Dependencies --checkgetfor Structure
is still OK, but the check for rearrangements with
baga/baga_cli.py Structure --reads_name myreads --genome_name mygenome --ratio_threshold 0.4 --check
ends without an error message and with the following output:

baga.CollectData.Genome-mygenome.baga
Loading alignments information for: taudien_myreads__mygenome from AlignReads output
indexing alignments/mygenome/mysample___mygenome_dd_si_realn.bam
Using pySAM

The files

*si_realn.bam.bai
*mean_insert_size.baga
*realn_depths.baga

were created but when starting the next step by
baga/baga_cli.py Structure --reads_name myreads --genome_name mygenome --plot
the error message is:

baga.CollectData.Genome-mygenome.baga
Loading alignments information for: myreads__mygenome from AlignReads output
Loading genome mygenome
Could not find: baga.Structure.CheckerInfo- myreads___ mygenome.baga
Traceback (most recent call last):
  File "baga/baga_cli.py", line 1979, in <module>
    checker_info['genome_name'],
NameError: name 'checker_info' is not defined

When repeating the check, the script evidently starts at a later point, since the depth and insert-size files are already available, but it finally ends with an error message:

Collecting coverage for myreads_ reads aligned to mygenome
Found and loaded previously scanned coverage depths from alignments/mygenome/Myreads__mygenome_dd_si_realn_depths.baga which saves time!
Calculating mean insert size for myreads_ reads aligned to mygenome
Loaded insert size from alignments/mygenome/mysample__mygenome_dd_si_realn_mean_insert_size.baga
Previously, no paired, aligned reads found in alignments/mygenome/mysample___mygenome_dd_si_realn.bam
Structure analysis for rearrangements is not possible
Skipping: no aligned pairs found

We suspected the problem lies somewhere around line 135 in Structure.py, so we moved _pysam.index(BAM) up (before the try):

# instantiate Structure Checkers

checkers = {}
for BAM in BAMs:
    indexfile = _os.path.extsep.join([BAM,'bai'])
    # debugging output added by us
    print ("Test: ", indexfile)
    if not(_os.path.exists(indexfile) and _os.path.getsize(indexfile) > 0):
        print('indexing {}'.format(BAM))
        fail = False
        # index call moved up from inside the try block
        _pysam.index(BAM)
    try:
        print('Using pySAM')
        _pysam.index(BAM)
    # (the rest of the try block is as in the original Structure.py)

Then the error message is:

Loading alignments information for: myreads__mygenome from AlignReads output
('Test: ', u'alignments/mygenome/mysample___mygenome_dd_si_realn.bam.bai')
indexing alignments/mygenome/mysample___mygenome_dd_si_realn.bam
Traceback (most recent call last):
  File "baga/baga_cli.py", line 1937, in <module>
    ratio_threshold = args.ratio_threshold)
  File "mypath/baga_analysis/baga/Structure.py", line 143, in checkStructure
    _pysam.index(BAM)
  File "mypath/baga_analysis/local_packages/pysam/__init__.py", line 66, in __call__
    self.dispatch, args, catch_stdout=kwargs.get("catch_stdout", True))
  File "pysam/csamtools.pyx", line 132, in pysam.csamtools._samtools_dispatch (pysam/csamtools.c:3259)
  File "pysam/csamtools.pyx", line 34, in pysam.csamtools._forceCmdlineBytes (pysam/csamtools.c:1380)
  File "pysam/csamtools.pyx", line 22, in pysam.csamtools._forceBytes (pysam/csamtools.c:1237)
TypeError: Expected bytes, got unicode

Do you have any idea what’s going wrong? Thank you very much for your help!
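
The final TypeError says the pysam wrapper received a unicode path where it expects bytes, which fits the u'...' prefix in the "Test:" output. A minimal sketch of converting the path before such a call, written so that it behaves the same under Python 2 and 3 (to_native_bytes is a hypothetical helper). Whether _pysam.index then succeeds on this installation has not been verified.

```python
import sys


def to_native_bytes(path):
    """Encode a text path to bytes; pass bytes through unchanged."""
    if isinstance(path, bytes):
        return path
    # Fall back to UTF-8 if the platform does not report an encoding.
    encoding = sys.getfilesystemencoding() or "utf-8"
    return path.encode(encoding)


# Path copied from the output above.
bam = u"alignments/mygenome/mysample___mygenome_dd_si_realn.bam"
print(repr(to_native_bytes(bam)))
```

Under Python 2, bytes is the same type as str, so the result is the plain byte string that the error message asks for.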

Genome Sequences Download

One last question: when I download the reference genome at home I have no problems, but here at work we have very restricted internet access through a proxy and the download doesn't work. Could you tell me which port the download uses?
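
The port baga uses is not stated anywhere on this page. As general background, Python's urllib honours the http_proxy and https_proxy environment variables by default, and a proxy can also be set explicitly. A minimal Python 3 sketch under the assumption that the download goes through urllib (baga itself targets Python 2.7, where the equivalent module is urllib2). The proxy address is a placeholder and opener_via_proxy is a hypothetical helper.

```python
import urllib.request


def opener_via_proxy(proxy_url):
    """Build a urllib opener that sends HTTP and HTTPS requests via a proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Placeholder proxy address; ask the network administrator for the real one.
opener = opener_via_proxy("http://proxy.example.org:8080")
print([type(h).__name__ for h in opener.handlers])
```

No request is sent by this snippet. It only builds the opener, so it is safe to run behind the firewall.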

Problems with Repeats filter

Hi,
I am having problems running the Repeats filter on some highly repetitive genomes. Reference NC_009488.1 dies early on:

baga_cli.py Repeats -g NC_009488.1 --find
Traceback (most recent call last):
  File "baga/baga_cli.py", line 1832, in <module>
    finder.findRepeats(minimum_percent_identity = args.minimum_percent_identity * 0.01, minimum_repeat_length = args.minimum_repeat_length)
  File "baga/Repeats.py", line 1209, in findRepeats
    self.getHomologousContiguousBlocks()
  File "Repeats.py", line 260, in getHomologousContiguousBlocks
    thisORFhit_nearestWithHits = self.getAdjacentORFsWithHit(thisORFhit, direction = 1, getall = True)
  File "baga/Repeats.py", line 156, in getAdjacentORFsWithHit
    next_hit_ORF = self.ORFs_with_hits_ordered[thisORFn_hits + n * direction]

NC_010793.1 runs for longer then fails with a different error:

baga_cli.py Repeats -g  NC_010793.1 --find
Traceback (most recent call last):
  File "baga/baga_cli.py", line 1832, in <module>
    finder.findRepeats(minimum_percent_identity = args.minimum_percent_identity * 0.01, minimum_repeat_length = args.minimum_repeat_length)
  File "baga/Repeats.py", line 1217, in findRepeats
    self.align_blocks(min_pID = 0.85, max_extensions = 15)
  File "baga/Repeats.py", line 547, in align_blocks
    postORFB_start, postORFB_end = loci_ranges_use_B[(i+1)*2:(i+1)*2+2]
ValueError: need more than 0 values to unpack

Any help would be appreciated - I can run the Repeats module on other genomes, so I think something about the repetitive nature of my references is causing a problem. Thanks.
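
The second traceback is the error Python raises when an empty slice is unpacked into two names, which happens here if (i+1)*2 points past the end of loci_ranges_use_B. A minimal Python sketch of that failure mode and a guarded alternative, using a made-up flat list of ranges and a hypothetical helper pair_at. It illustrates the cause only and is not a patch for Repeats.py.

```python
def pair_at(flat_ranges, index):
    """Return the (start, end) pair at position index, or None if absent."""
    chunk = flat_ranges[index * 2:index * 2 + 2]
    # Slicing past the end gives a short or empty list instead of an error.
    if len(chunk) != 2:
        return None
    return tuple(chunk)


ranges = [100, 200, 300, 400]   # two (start, end) pairs stored flat
print(pair_at(ranges, 1))       # (300, 400)
print(pair_at(ranges, 2))       # None, because there is no third pair
```

Whether skipping the missing pair is the right behaviour for highly repetitive genomes is a question for the author.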

avoid trimming

Hello again!
What would you recommend to change in order to avoid the trimming step and complete the whole process? I thought about it because my reads are already preprocessed and cleaned. I understand that if I just eliminate that step there may be some outputs not generated that are necessary to continue...
Thank you (I'm looking forward to the documentation ;))