Giter VIP home page Giter VIP logo

metadmg-cpp's Introduction

Build Status

===== Program for analysing NGS data.

http://www.popgen.dk/angsd

Installation:

  1. Using a local folder containing htslib
#download htslib
git clone --recurse-submodules https://github.com/samtools/htslib.git;
#download angsd
git clone https://github.com/angsd/angsd.git;

#install htslib
cd htslib
make

#install angsd
cd ../angsd
make HTSSRC=../htslib
  1. Systemwide installation of htslib
git clone https://github.com/angsd/angsd.git;
cd angsd; make HTSSRC=systemwide
  1. Using htslib submodule
git clone https://github.com/angsd/angsd.git;
cd angsd; make

Notes

  • I've switched over to using htslib for parsing single reads (to allow for CRAM reading, while avoid having to write my own CRAM parser). I'm still using my own readpools. Users should therefore also download and install htslib.
  • If you are on a mac computer and the compilation process complains about a missnig crybtolib library then do 'make CRYPTOLIB=""'

Program has a paper

http://www.biomedcentral.com/1471-2105/15/356/abstract

metadmg-cpp's People

Watchers

 avatar  avatar  avatar  avatar  avatar

metadmg-cpp's Issues

getdamage runmode 1 output

Hi @ANGSD
when running the getdamage in runmode 1 it should output the reference/contig/chromosome name instead of taxids.

Referred here

Thanks
Antonio

Could not recreate the sequence for read

Hi guys,

I've run into an issue with the getdamage command, it fails on a sample with segfault and message

Could not recreate the sequence for read: MT_A00706:67:HGMMWDSXX:3:2238:22146:7999

The offending read entry in the respective BAM file is

MT_A00706:67:HGMMWDSXX:3:2238:22146:7999 16 contig02205 2308 45 15M1I77M75D71M18S * 0 0 CGGATCACGCCCTTGTCCTCGTTGCGCACAAAGGTCTTGATGGCCAGGCGGCGCTGCGGCGCGGTGGCGATCACGGACAGGTCGCGCAGCCCTTTGACGGCCTCCTTGTGGCGCACGCCGAAGCGGTGTTCCTCGTCGATGATCAGCAGGCCCAGGTTCTTGAAGCGGGTGTCCGCGCTCAG F:FFF:FFFFFFFFFF:F,FFFF,FFFFFFFF,:,FFFFFF:FFF:F:FF:FFFFFF:FFFFFFF::FFFFFFFF:FFF,FFJJJJJJJ$$J$J$JJ$-JFFFFFFFFFFFFFF,FFFFFF:FFFFFFF,FFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF s1:i:82 s2:i:0 PG:Z:MarkDuplicates RG:Z:34507.BH75-55 NM:i:84 AS:i:133 de:f:0.0606 rl:i:0 cm:i:6 nn:i:0 tp:A:P ms:i:132 MD:Z:12G36C5G32G3^TCCAGCGCCATGCCCATGGTGCGCGGGATGGGCGTGGCCGTCAGCGTGAGCACGTCCACCTCGGCGCGCATGGCC2C1T3T26C35

first time encountering this, I've used it quite a bit now for the screening pipeline without issues

Important!!! Errors when running make on version: e099b28: VERSION:30b6835

npl206@a00710:/willerslev/scratch/mwp/metaDamage_test/NEW_metadamage/metadamage$ make
HTSSRC not defined, assuming systemwide installation -lhts
echo '#define METADAMAGE_VERSION "30b6835"' > version.h
g++ -c -O3 -std=c++11 ngsLCA.cpp
ngsLCA.cpp: In function ‘int2int* bamRefId2tax(bam_hdr_t*, char*, char*)’:
ngsLCA.cpp:113:43: warning: ignoring return value of ‘ssize_t bgzf_write(BGZF*, const void*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
bgzf_write(fp,&valinbam,sizeof(int));
^
ngsLCA.cpp:114:38: warning: ignoring return value of ‘ssize_t bgzf_write(BGZF*, const void*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
bgzf_write(fp,&val,sizeof(int));
^
ngsLCA.cpp:126:31: warning: ignoring return value of ‘ssize_t bgzf_read(BGZF*, void*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
bgzf_read(fp,&val,sizeof(int));
^
g++ -MM -O3 -std=c++11 ngsLCA.cpp >ngsLCA.d
g++ -c -O3 -std=c++11 ngsLCA_cli.cpp
ngsLCA_cli.cpp: In function ‘pars* pars_init()’:
ngsLCA_cli.cpp:12:14: warning: deprecated conversion from string constant to ‘char*’ [-Wwrite-strings]
p->htsfile = "CHL_155_12485.sort.bam";
^
ngsLCA_cli.cpp:13:17: warning: deprecated conversion from string constant to ‘char*’ [-Wwrite-strings]
p->acc2taxfile="nucl_gb.accession2taxid.gz";
^
ngsLCA_cli.cpp:14:16: warning: deprecated conversion from string constant to ‘char*’ [-Wwrite-strings]
p->namesfile = "names.dmp.gz";
^
ngsLCA_cli.cpp:15:15: warning: deprecated conversion from string constant to ‘char*’ [-Wwrite-strings]
p->nodesfile= "nodes.dmp.gz";
^
ngsLCA_cli.cpp:23:14: warning: deprecated conversion from string constant to ‘char*’ [-Wwrite-strings]
p->outnames="outnames";
^
g++ -MM -O3 -std=c++11 ngsLCA_cli.cpp >ngsLCA_cli.d
g++ -c -O3 -std=c++11 profile.cpp
profile.cpp: In member function ‘void damage::write(char*, bam_hdr_t*)’:
profile.cpp:310:17: warning: deprecated conversion from string constant to ‘char*’ [-Wwrite-strings]
char outname="metaout";
^
profile.cpp:337:33: warning: ignoring return value of ‘ssize_t bgzf_write(BGZF
, const void*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
bgzf_write(fp,kstr.s,kstr.l);
^
profile.cpp: In member function ‘void damage::bwrite(char*, bam_hdr_t*)’:
profile.cpp:350:40: warning: ignoring return value of ‘ssize_t bgzf_write(BGZF*, const void*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
bgzf_write(fp,&MAXLENGTH,sizeof(int));
^
profile.cpp:355:42: warning: ignoring return value of ‘ssize_t bgzf_write(BGZF*, const void*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
bgzf_write(fp,&it->first,sizeof(int));
^
profile.cpp:356:50: warning: ignoring return value of ‘ssize_t bgzf_write(BGZF*, const void*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
bgzf_write(fp,&it->second.nreads,sizeof(int));
^
profile.cpp:358:55: warning: ignoring return value of ‘ssize_t bgzf_write(BGZF*, const void*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
bgzf_write(fp,it->second.mm5p[l],sizeof(int)16);
^
profile.cpp:361:55: warning: ignoring return value of ‘ssize_t bgzf_write(BGZF
, const void*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
bgzf_write(fp,it->second.mm3p[l],sizeof(int)16);
^
g++ -MM -O3 -std=c++11 profile.cpp >profile.d
g++ -c -O3 -std=c++11 metadamage.cpp
metadamage.cpp: In function ‘int main_getdamage(int, char
*)’:
metadamage.cpp:40:16: warning: deprecated conversion from string constant to ‘char*’ [-Wwrite-strings]
char *onam = "meta";
^
g++ -MM -O3 -std=c++11 metadamage.cpp >metadamage.d
g++ -c -O3 -std=c++11 shared.cpp
g++ -MM -O3 -std=c++11 shared.cpp >shared.d
g++ -O3 -std=c++11 -o metadamage *.o -lz -llzma -lbz2 -lpthread -lcurl -lhts -lgsl

"path file" for -names, -nodes, -acc2tax etc

is it possible to to make metadamage read a file containing the path for "-names", "-nodes", "-acc2tax" (like one per line) instead of us having to list it on the command? I am thinking that you will more or less use the same paths every time, so if you just have a .txt file ("ncbi.tax") with those paths, you can do:

metadamage merge file.lca file.bdamage.gz -t ncbi.tax -howmany 15 > file.merged.lca
OR
metadamage lca -simscorelow 0.95 -simscorehigh 1.0 -t ncbi.tax -bam file.bam -outnames file -lca_rank genus -minmapq 30 -howmany 15

instead of

metadamage merge file.lca file.bdamage.gz -names /willerslev/users-shared/science-snm-willerslev-npl206/ngsLCA/ngsLCA/ncbi_tax_dump_files/names.dmp.gz -nodes /willerslev/users-shared/science-snm-willerslev-npl206/ngsLCA/ngsLCA/ncbi_tax_dump_files/nodes.dmp.gz -howmany 15 > file.merged.lca
OR
metadamage lca -simscorelow 0.95 -simscorehigh 1.0 -names /willerslev/users-shared/science-snm-willerslev-npl206/ngsLCA/ngsLCA/ncbi_tax_dump_files/names.dmp.gz -nodes /willerslev/users-shared/science-snm-willerslev-npl206/ngsLCA/ngsLCA/ncbi_tax_dump_files/nodes.dmp.gz -acc2tax /willerslev/users-shared/science-snm-willerslev-zsb202/soft/ngslca/accession2taxid/combined_taxid_accssionNO_20200425.gz -bam file.bam -outnames file -lca_rank genus -minmapq 30 -howmany 15

Metadamage getdamage fails for bwa aligned files with " ^ " in the MD tag

I've created bam files using bwa mem and running metadamage getdamage fails giving error msg for multiple specific reads which all contain ^ in the MD tag.

metadamage getdamage -l 15 -p 7 -o test -r 0 26_se.bam
./metadamage refName: (null) minLength: 15 printLength: 7 runmode: 0 outname: test nthreads: 4
Could not recreate the sequence for read: M_K00171:751:HW5LWBBXX:8:1204:22445:37431

samtools view XXX.bam | grep 'M_K00171:751:HW5LWBBXX:8:1204:22445:37431'
M_K00171:751:HW5LWBBXX:8:1204:22445:37431 0 chr1 3368972 60 92M77D94M * 0 0 CTGGCCTCAGGTACCACCTAAGGCCGCGGGCTGTCTGTGCTGTGGCCCCAGTGGCTGAGTAAAAGCTTGGAAAACTCAAAGCAAGGCCCGCAGGTGCTCGGCCAGAGCGCTGCCCACCCTACAGCAATGCACAGTGGCTGTTAAGCTTGGGGGATCCACATCTTCTCCCCTGCCAGTAATGTGGGA AAAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJ&J&J$JJ$JJJJ$JF<JJ<JJA<JJA<JJFJJFJJJJJJJJJJFFFFJFJJFJJAFFFFJFJJJJJFAJJJFF7A<JJJJJJJAFJJJJJJFAFFFAAA NM:i:79 MD:Z:24T0A66^GGAGCTCAGGCCACCTCAAGCCGCAGGAAAGGAGCTCTGCTCCCGCAGATAGAGGTGTGCTTTCTGGCTGGCACACT94 AS:i:94 XS:i:20

This is only a single example of a read with the ^ in the MD tag, as soon as i remove all reads with this characteristic it works fine

The program should have a graceful exit

Hi @ANGSD

when the program doesn't get the proper arguments, it dies with a segmentation fault. It would be nice to have a graceful exit

$ /vol/cloud/geogenetics/opt/bin/metaDMG++ lca -h
#/vol/cloud/geogenetics/opt/bin/metaDMG++ lca -h
        -> Must supply arguments in the form -pattern value
Segmentation fault (core dumped)

getdamage and print_ugly

Hi,

I'm testing out the new getdamage and print_ugly interplay.

Currently, I am running: ./metadamage getdamage --minlength 10 --printlength 35 --threads 8 --runmode 1 --outname subs subs.bam which correctly produces subs.bdamage.gz and subs.res.gz.

When I then run: ./metadamage print_ugly subs.bdamage.gz I get a the correct file subs.bdamage.gz.uglyprint.mismatch.txt.gz and an almost empty subs.bdamage.gz.uglyprint.stat.txt.gz which only contains the header:
#taxid name rank nalign nreads mean_rlen var_rlen mean_gc var_gc

I think we should fill this file with the information from each chromosome, wouldn't that make sense? Of course name and rank would be undefined in this case, but the rest of the variables should be able to be computed as far as I can see?

getdamage -l/--length default

It is not clear for me what is the default option for getdamage -l/--length option. It seems to me that if you don't add the -l argument to the command it will just skip every read, and print all the skipped lines in the stdout.
Maybe just add a default (e.g. 30 bp) and prevent it from spitting too much information on stdout (max 10 lines and some warning)?

Eigen dependency

Hi,

Is Eigen a new, required, dependency? After pulling the latest changes and running make I get:

HTSSRC not defined, assuming systemwide installation -lhts
echo '#define METADAMAGE_VERSION "18f7f72"' > version.h
c++ -c  -O3 -std=c++11   main_pmd.cpp
c++ -MM -O3 -std=c++11   main_pmd.cpp >main_pmd.d
c++ -c  -O3 -std=c++11   metadamage.cpp
In file included from metadamage.cpp:18:
./regression.h:3:10: fatal error: 'eigen3/Eigen/Core' file not found
#include <eigen3/Eigen/Core>
         ^~~~~~~~~~~~~~~~~~~
1 error generated.
make: *** [metadamage.o] Error 1

intallation warnings??

when installing metadamage using:

git clone https://github.com/ANGSD/metadamage.git
cd metadamage
make

I get the following messages, not sure if they have an signifcance?

npl206@a00814:/willerslev/edna/metadamage$ make
HTSSRC not defined, assuming systemwide installation -lhts
echo '#define METADAMAGE_VERSION "6eb5ef8"' > version.h
g++ -c -O3 -std=c++11 ngsLCA.cpp
g++ -MM -O3 -std=c++11 ngsLCA.cpp >ngsLCA.d
g++ -c -O3 -std=c++11 ngsLCA_cli.cpp
ngsLCA_cli.cpp: In function ‘pars* pars_init()’:
ngsLCA_cli.cpp:12:14: warning: deprecated conversion from string constant to ‘char*’ [-Wwrite-strings]
p->htsfile = "CHL_155_12485.sort.bam";
^
ngsLCA_cli.cpp:13:17: warning: deprecated conversion from string constant to ‘char*’ [-Wwrite-strings]
p->acc2taxfile="nucl_gb.accession2taxid.gz";
^
ngsLCA_cli.cpp:14:16: warning: deprecated conversion from string constant to ‘char*’ [-Wwrite-strings]
p->namesfile = "names.dmp.gz";
^
ngsLCA_cli.cpp:15:15: warning: deprecated conversion from string constant to ‘char*’ [-Wwrite-strings]
p->nodesfile= "nodes.dmp.gz";
^
ngsLCA_cli.cpp:23:14: warning: deprecated conversion from string constant to ‘char*’ [-Wwrite-strings]
p->outnames="outnames";
^
g++ -MM -O3 -std=c++11 ngsLCA_cli.cpp >ngsLCA_cli.d
g++ -c -O3 -std=c++11 profile.cpp
profile.cpp: In member function ‘void damage::write(char*, bam_hdr_t*)’:
profile.cpp:310:17: warning: deprecated conversion from string constant to ‘char*’ [-Wwrite-strings]
char outname="metaout";
^
profile.cpp:337:33: warning: ignoring return value of ‘ssize_t bgzf_write(BGZF
, const void*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
bgzf_write(fp,kstr.s,kstr.l);
^
profile.cpp: In member function ‘void damage::bwrite(char*, bam_hdr_t*)’:
profile.cpp:350:40: warning: ignoring return value of ‘ssize_t bgzf_write(BGZF*, const void*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
bgzf_write(fp,&MAXLENGTH,sizeof(int));
^
profile.cpp:355:42: warning: ignoring return value of ‘ssize_t bgzf_write(BGZF*, const void*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
bgzf_write(fp,&it->first,sizeof(int));
^
profile.cpp:356:50: warning: ignoring return value of ‘ssize_t bgzf_write(BGZF*, const void*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
bgzf_write(fp,&it->second.nreads,sizeof(int));
^
profile.cpp:358:55: warning: ignoring return value of ‘ssize_t bgzf_write(BGZF*, const void*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
bgzf_write(fp,it->second.mm5p[l],sizeof(int)16);
^
profile.cpp:361:55: warning: ignoring return value of ‘ssize_t bgzf_write(BGZF
, const void*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
bgzf_write(fp,it->second.mm3p[l],sizeof(int)16);
^
g++ -MM -O3 -std=c++11 profile.cpp >profile.d
g++ -c -O3 -std=c++11 metadamage.cpp
metadamage.cpp: In function ‘int main_getdamage(int, char
*)’:
metadamage.cpp:75:16: warning: deprecated conversion from string constant to ‘char*’ [-Wwrite-strings]
char onam = "meta";
^
g++ -MM -O3 -std=c++11 metadamage.cpp >metadamage.d
g++ -c -O3 -std=c++11 shared.cpp
shared.cpp: In function ‘int2int
bamRefId2tax(bam_hdr_t*, char*, char*, int2int&)’:
shared.cpp:250:43: warning: ignoring return value of ‘ssize_t bgzf_write(BGZF*, const void*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
bgzf_write(fp,&valinbam,sizeof(int));
^
shared.cpp:251:38: warning: ignoring return value of ‘ssize_t bgzf_write(BGZF*, const void*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
bgzf_write(fp,&val,sizeof(int));
^
shared.cpp:263:37: warning: ignoring return value of ‘ssize_t bgzf_read(BGZF*, void*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
bgzf_read(fp,&val,sizeof(int));
^
g++ -MM -O3 -std=c++11 shared.cpp >shared.d
g++ -O3 -std=c++11 -o metadamage *.o -lz -llzma -lbz2 -lpthread -lcurl -lhts -lgsl

metadamage print output and filtering

Issues with metadamage print output right now:

  1. combining -ctga with -names does not work. The output keeps showing taxID instead of the taxon name

  2. -ctga prints both 5' and 3' in the same line with no differentiation. One has to count the columns to find out when the 3' G>A starts. If -p is 5 is ok, but I was testing with -p 15 and it is clumsy. Maybe split in two different lines? Make two columns? Something else?

  3. maybe we can have more options for filtering the print output? I was actually every time piping metadamage print into some "awk '{if ($2>1000 && $12>0.20) print}'" whenever I used it. For example, it would be nice to be able to filter the output by a) number of alignments, b) C>T frequency and c) G>A frequency. And actually my awk solution is no good because it only prints the "matching line", it would be great if metadamage actually included the next relevant lines. For example, if I am filtering for C>T frequency > 0.20, to include also the frequencies of position 2,3,4 etc so we can access the damage pattern.

Metadamage getdamage for paired-end reads with bwa mem

When running metadamage getdamage on paired-end reads obtained from bwa-mem the program doesn't output information to the bdamage.gz file.

./metadamage getdamage -l 15 -p 7 -o test -r 0 XXpe.bam
./metadamage refName: (null) minLength: 15 printLength: 7 runmode: 0 outname: test nthreads: 4
skipping: K00171:751:HW5LWBBXX:8:2101:17858:19390 is paired (can be considered using the -paired flag, this msg is printed 2 times more
skipping: K00171:751:HW5LWBBXX:8:2228:20050:23399 is paired (can be considered using the -paired flag, this msg is printed 1 times more
skipping: K00171:751:HW5LWBBXX:8:2101:17858:19390 is paired (can be considered using the -paired flag, this msg is printed 0 times more
-> Will dump: 'test.res.gz' this contains damage patterns for: 0 items
-> Setting threads to: 4
-> Will dump: 'test.bdamage.gz' this contains damage patterns for: 0 items
-> Setting threads to: 4

The errorr message is only shown for the first three reads in the file with one being the read-pair "K00171:751:HW5LWBBXX:8:2101:17858:19390" and another being " K00171:751:HW5LWBBXX:8:2228:20050:23399"

When trying again with the -paired flag, it still fails.

tempfolder only applies to binary file and for lca

Hi,

I have just tried running the new version as: ./metadamage lca ... -tempfolder tmp/ which only effected the binary .bam.bin file and not the:

  • .bdamage.gz
  • .lca.stat
  • .lca.gz
  • .log

files.

Also, I noticed that is has to be written as -tempfolder tmp/, with a / in the end of the filename (and the folder has to already be present). These issues is not a problem for me, since I'll just fix it in the Python part of the pipeline, however, still wanted to mention it though.

In addition to the above mentions issues for metadamage lca, it also seems that it is not implemented for metadamageprint_ugly yet.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.