tn93's Introduction

TN93

This is a simple program that computes pairwise distances between aligned nucleotide sequences in sequential FASTA format using the Tamura-Nei 93 (TN93) distance.

To build, you need to use cmake. Type

git clone https://github.com/veg/tn93.git
cd tn93
cmake .
make install

Note that you can use

cmake [-DCMAKE_INSTALL_PREFIX=/install/path DEFAULT /usr/local/] ./

to set a different install path.

If the compiler supports OpenMP, the program will be built with multithreaded support.

USAGE

usage: tn93 [-h] [-o OUTPUT] [-t THRESHOLD] [-a AMBIGS] [-l OVERLAP] [-d COUNTS_IN_NAME] [-f FORMAT] [-s SECOND_FASTA] [-b] [-c] [-q] [FASTA]

Try it using the example file in 'data' by typing

tn93 -t 0.05 -o data/test.dst data/test.fas

Output (diagnostics are written to stderr and the histogram is written to stdout, so each can be redirected separately)

Example:

Read 8 sequences of length 1320
Will perform 28 pairwise distance calculations
Progress:     100% (       7 links found,          inf evals/sec)
{
    "Actual comparisons performed" :28,
    "Total comparisons possible" : 28,
    "Links found" : 7,
    "Maximum distance" : 0.0955213,
    "Mean distance" : 0.0644451,
    "Histogram" : [[0.005,0],[0.01,0],[0.015,0],[0.02,0],[0.025,0],[0.03,2],[0.035,1],[0.04,0],[0.045,1],[0.05,3],[0.055,1],[0.06,2],[0.065,2],[0.07,3],[0.075,4],[0.08,3],[0.085,3],[0.09,2],[0.095,0],[0.1,1],[0.105,0],[0.11,0],[0.115,0],[0.12,0],[0.125,0],[0.13,0],[0.135,0],[0.14,0],[0.145,0],[0.15,0],[0.155,0],[0.16,0],[0.165,0],[0.17,0],[0.175,0],[0.18,0],[0.185,0],[0.19,0],[0.195,0],[0.2,0],[0.205,0],[0.21,0],[0.215,0],[0.22,0],[0.225,0],[0.23,0],[0.235,0],[0.24,0],[0.245,0],[0.25,0],[0.255,0],[0.26,0],[0.265,0],[0.27,0],[0.275,0],[0.28,0],[0.285,0],[0.29,0],[0.295,0],[0.3,0],[0.305,0],[0.31,0],[0.315,0],[0.32,0],[0.325,0],[0.33,0],[0.335,0],[0.34,0],[0.345,0],[0.35,0],[0.355,0],[0.36,0],[0.365,0],[0.37,0],[0.375,0],[0.38,0],[0.385,0],[0.39,0],[0.395,0],[0.4,0],[0.405,0],[0.41,0],[0.415,0],[0.42,0],[0.425,0],[0.43,0],[0.435,0],[0.44,0],[0.445,0],[0.45,0],[0.455,0],[0.46,0],[0.465,0],[0.47,0],[0.475,0],[0.48,0],[0.485,0],[0.49,0],[0.495,0],[0.5,0],[0.505,0],[0.51,0],[0.515,0],[0.52,0],[0.525,0],[0.53,0],[0.535,0],[0.54,0],[0.545,0],[0.55,0],[0.555,0],[0.56,0],[0.565,0],[0.57,0],[0.575,0],[0.58,0],[0.585,0],[0.59,0],[0.595,0],[0.6,0],[0.605,0],[0.61,0],[0.615,0],[0.62,0],[0.625,0],[0.63,0],[0.635,0],[0.64,0],[0.645,0],[0.65,0],[0.655,0],[0.66,0],[0.665,0],[0.67,0],[0.675,0],[0.68,0],[0.685,0],[0.69,0],[0.695,0],[0.7,0],[0.705,0],[0.71,0],[0.715,0],[0.72,0],[0.725,0],[0.73,0],[0.735,0],[0.74,0],[0.745,0],[0.75,0],[0.755,0],[0.76,0],[0.765,0],[0.77,0],[0.775,0],[0.78,0],[0.785,0],[0.79,0],[0.795,0],[0.8,0],[0.805,0],[0.81,0],[0.815,0],[0.82,0],[0.825,0],[0.83,0],[0.835,0],[0.84,0],[0.845,0],[0.85,0],[0.855,0],[0.86,0],[0.865,0],[0.87,0],[0.875,0],[0.88,0],[0.885,0],[0.89,0],[0.895,0],[0.9,0],[0.905,0],[0.91,0],[0.915,0],[0.92,0],[0.925,0],[0.93,0],[0.935,0],[0.94,0],[0.945,0],[0.95,0],[0.955,0],[0.96,0],[0.965,0],[0.97,0],[0.975,0],[0.98,0],[0.985,0],[0.99,0],[0.995,0],[1,0]]
}
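Because the summary is printed as JSON, it can be inspected programmatically. The following is a minimal Python sketch, assuming the summary has been captured as text and has the shape shown above. The helper name `nonempty_bins` is illustrative and not part of tn93, and the histogram here is a shortened copy of the example.

```python
import json

# A shortened copy of the summary shown above (most empty histogram bins omitted).
summary_text = """
{
    "Actual comparisons performed" :28,
    "Total comparisons possible" : 28,
    "Links found" : 7,
    "Maximum distance" : 0.0955213,
    "Mean distance" : 0.0644451,
    "Histogram" : [[0.005,0],[0.03,2],[0.035,1],[0.045,1],[0.05,3]]
}
"""

def nonempty_bins(summary):
    """Return the (bin, count) histogram entries whose count is non-zero."""
    return [(edge, count) for edge, count in summary["Histogram"] if count > 0]

summary = json.loads(summary_text)
print(summary["Links found"], "links found")
for edge, count in nonempty_bins(summary):
    print(f"{edge:.3f}\t{count}")
```

In practice the text would come from redirecting stdout to a file (for example by appending `> summary.json` to the command above) and reading that file instead of the inline string.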

DEPENDENCIES

  • gcc >= 5.0.0
  • cmake >= 3.0.0

ARGUMENTS

optional arguments:
  -h, --help               show this help message and exit
  -v, --version            show tn93 version 
  -o OUTPUT                direct the output to a file named OUTPUT (default=stdout)
  -t THRESHOLD             only report (count) distances below this threshold (>=0, default=0.015)
  -a AMBIGS                handle ambiguous nucleotides using one of the following strategies (default=resolve)
                           resolve: resolve ambiguities to minimize distance (e.g. R matches A);
                           average: average ambiguities (e.g. R-A is 0.5 A-A and 0.5 G-A);
                           skip: do not include sites with ambiguous nucleotides in distance calculations;
                           gapmm: a gap ('-') matched to anything other than another gap is like matching an N (4-fold ambig) to it;
                           a string (e.g. RY): any ambiguity in the list is RESOLVED; any ambiguity NOT in the list is averaged (LIST-NOT LIST will also be averaged);
  -g FRACTION              in combination with AMBIGS, works to limit (for resolve and string options to AMBIG)
                           the maximum tolerated FRACTION of ambiguous characters; sequences whose pairwise comparisons
                           include no more than FRACTION [0,1] of sites with resolvable ambiguities will be resolved
                           while all others will be AVERAGED (default=1.0)
  -f FORMAT                controls the format of the output unless -c is set (default=csv)
                           csv: seqname1, seqname2, distance;
                           csvn: 1, 2, distance;
                           hyphy: {{d11,d12,..,d1n}...{dn1,dn2,...,dnn}}, where distances above THRESHOLD are set to 100;
  -l OVERLAP               only process pairs of sequences that overlap over at least OVERLAP nucleotides (an integer >0, default=100):
  -d COUNTS_IN_NAME        if sequence name is of the form 'somethingCOUNTS_IN_NAMEinteger' then treat the integer as a copy number
                           when computing distance histograms (a character, default=':'):
  -s SECOND_FASTA          if specified, read another FASTA file from SECOND_FASTA and perform pairwise comparison BETWEEN the files (default=NULL)
  -b                       bootstrap alignment columns before computing distances (default = false)
                           when -s is supplied, permutes the assignment of sequences to file
  -r                       if -b is specified AND -s is supplied, using -r will bootstrap across sites
                           instead of allocating sequences to 'compartments' randomly
  -c                       only count the pairs below a threshold, do not write out all the pairs 
  -m                       compute inter- and intra-population means suitable for FST calculations
                           only applied when -s is used to provide a second file  
  -u PROBABILITY           subsample sequences with specified probability (a value between 0 and 1, default = 1.0) 
  -0                       report distances between each sequence and itself (as 0); this is useful to ensure every sequence
                           in the input file appears in the output, e.g. for network construction to contrast clustered/unclustered
  -q                       do not report progress updates and other diagnostics to stderr 
  FASTA                    read sequences to compare from this file (default=stdin)
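
The difference between the resolve and average strategies for -a can be illustrated with a small Python sketch. This is not tn93's code: the function names are hypothetical, the IUPAC table is the standard one, and the fallback to averaging when no matching resolution exists (as well as the choice of the first shared base) is an assumption made for illustration.

```python
from itertools import product

# Standard IUPAC nucleotide codes and the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def average_pair(c1, c2):
    """Spread one site evenly over every possible resolution of (c1, c2)."""
    pairs = list(product(IUPAC[c1], IUPAC[c2]))
    weight = 1.0 / len(pairs)
    return {pair: weight for pair in pairs}

def resolve_pair(c1, c2):
    """Pick a matching resolution if one exists (assumed: else average)."""
    shared = sorted(set(IUPAC[c1]) & set(IUPAC[c2]))
    if shared:
        return {(shared[0], shared[0]): 1.0}
    return average_pair(c1, c2)

print(average_pair("R", "A"))  # {('A', 'A'): 0.5, ('G', 'A'): 0.5}
print(resolve_pair("R", "A"))  # {('A', 'A'): 1.0}
```

The two printed results mirror the examples in the help text: averaging R against A contributes half a count to A-A and half to G-A, while resolving treats the site as a match.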

NOTES

All sequences must be aligned and have the same length. Only IUPAC characters are recognized (e.g. no ~). Sequence names can include copy number as in

>seqname:10

':' can be replaced with another character using -d, and sequences that have no explicit copy number are assumed to be a single copy. Copy numbers only affect histogram and mean calculations.
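
As a rough illustration of the naming convention, the Python sketch below splits a name into its base and copy number. The function name is hypothetical, and splitting on the last occurrence of the delimiter is an assumption; tn93's own parsing may differ in edge cases.

```python
def split_copy_number(name, delimiter=":"):
    """Split 'something<delimiter>integer' into (name, copies); default to one copy."""
    base, sep, suffix = name.rpartition(delimiter)
    if sep and suffix.isdigit() and int(suffix) > 0:
        return base, int(suffix)
    return name, 1

print(split_copy_number("seqname:10"))  # ('seqname', 10)
print(split_copy_number("seqname"))     # ('seqname', 1)
print(split_copy_number("a|b|3", "|"))  # ('a|b', 3)
```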

tn93's People

Contributors

spond, stevenweaver

tn93's Issues

error with CXXABI_1.3.8

Other users on our server and I are getting the following error:
tn93: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by tn93)
For me, it worked to add the following to the environment:

LD_LIBRARY_PATH=/usr/local/lib64/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH

However, this does not work for the other users.

strings /usr/local/lib64/libstdc++.so | grep CXXABI
CXXABI_1.3
CXXABI_1.3.1
CXXABI_1.3.2
CXXABI_1.3.3
CXXABI_1.3.4
CXXABI_1.3.5
CXXABI_1.3.6
CXXABI_1.3.7
CXXABI_1.3.8
CXXABI_1.3.9
CXXABI_1.3.10
CXXABI_TM_1
CXXABI_FLOAT128
strings /usr/lib64/libstdc++.so.6 | grep CXXABI
CXXABI_1.3
CXXABI_1.3.1
CXXABI_1.3.2
CXXABI_1.3.3
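
The `strings | grep` check above can also be scripted, which may help when comparing several machines or user environments. This is a minimal Python sketch with hypothetical function names; it only scans the file for embedded version tags and says nothing about which library the dynamic loader actually picks for a given user.

```python
import re

def cxxabi_tags(path):
    """Return the sorted set of CXXABI_* version tags embedded in a library file."""
    with open(path, "rb") as handle:
        data = handle.read()
    found = re.findall(rb"CXXABI_[A-Za-z0-9_.]+", data)
    return sorted({tag.decode("ascii") for tag in found})

def provides(path, tag="CXXABI_1.3.8"):
    """True if the library at `path` contains the requested version tag."""
    return tag in cxxabi_tags(path)
```

For example, `provides("/usr/lib64/libstdc++.so.6")` would be expected to return False on the system described here, since that library only lists tags up to CXXABI_1.3.3.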

Question about results when using SKIP as the match mode

Hello,

My lab currently uses this TN93 implementation both by itself and as part of HIVTrace. We have a need for a pure python implementation, which I've been working on. I'm comparing the output from my implementation to the output from this implementation to ensure I'm producing consistent results. As I've been looking at the results produced when using the SKIP match mode I'm seeing a difference in the values produced when dealing with a sequence that has ambiguities.

I looked deeper into the final matrix of counts, and the ambiguities appear to be counted rather than skipped. I want to make sure I have the behavior correct: should the SKIP mode pass over ambiguous nucleotides the same way it passes over gaps?

I have a small example file to demonstrate what I mean (I had to add a .txt suffix, but it's a standard fasta file). The file has two 240nt sequences in it - the first has no ambiguous nucleotides, and the second has seven ambiguous nucleotides.

When I run tn93 on this file (setting the ambiguity mode to SKIP and setting the reporting threshold value to 1) I get a distance of 0.0190 and these counts:

       A      C      G      T
A   89.5      0      1    0.5
C      0     37    0.5    0.5
G      1    0.5   55.5      0
T      0    0.5      0   53.5

The counts for this sum to 240 - the full length of the sequence.

When I run the python version that I've developed I'm getting a much lower distance (0.0043) and I'm getting these counts:

     A     C     G     T
A   89     0     1     0
C    0    36     0     0
G    1     0    54     0
T    0     0     0    53

This sums to 233, which is the number I expect - the full sequence minus the seven positions where there were ambiguous nucleotides. Should there be fractional counts when SKIP is selected? Since it doesn't try any resolutions I thought the counts would all be integers.

Thanks very much!

short_example_seqs.fasta.txt
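
For reference, the behavior this question expects from SKIP can be sketched in a few lines of Python. This reflects the expectation described above (sites with a gap or ambiguity contribute nothing, so all counts stay integers), not necessarily what tn93 does; the function name and the toy sequences are made up for illustration.

```python
RESOLVED = "ACGT"

def skip_mode_counts(seq1, seq2):
    """Count aligned nucleotide pairs, ignoring any site with a gap or ambiguity."""
    counts = {a: {b: 0 for b in RESOLVED} for a in RESOLVED}
    for c1, c2 in zip(seq1.upper(), seq2.upper()):
        if c1 in RESOLVED and c2 in RESOLVED:
            counts[c1][c2] += 1
    return counts

# Ten aligned sites: one ambiguous (R) and one gapped site are skipped.
counts = skip_mode_counts("ACGTRACGT-", "ACGTAACTTA")
total = sum(sum(row.values()) for row in counts.values())
print(total)  # 8
```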

CMAKE Shortestpath error for shared declaration

My build is getting stuck here and I don't know how to fix it. Any help would be greatly appreciated!

$ make install
-- Configuring done
-- Generating done
-- Build files have been written to: /home/pv197/tn93/tn93
[ 10%] Built target tn93
[ 21%] Built target tn93-cluster
[ 31%] Built target seqcoverage
[ 34%] Building CXX object CMakeFiles/ShortestPathTN93.dir/src/ShortestPathTN93.cpp.o
/home/pv197/tn93/tn93/src/ShortestPathTN93.cpp: In function ‘void relaxDistanceEstimates(long unsigned int, long int, char, long int, double)’:
/home/pv197/tn93/tn93/src/ShortestPathTN93.cpp:92:208: error: ‘left_to_do’ is predetermined ‘shared’ for ‘shared’
#pragma omp parallel for default(none) shared(my_distance_estimate,nodeParents,workingNodes,distanceEstimates, step_penalty, min_overlap, resolutionOption, firstSequenceLength, theSequence, left_to_do)
/home/pv197/tn93/tn93/src/ShortestPathTN93.cpp:92:208: error: ‘theSequence’ is predetermined ‘shared’ for ‘shared’
/home/pv197/tn93/tn93/src/ShortestPathTN93.cpp:92:208: error: ‘firstSequenceLength’ is predetermined ‘shared’ for ‘shared’
/home/pv197/tn93/tn93/src/ShortestPathTN93.cpp:92:208: error: ‘resolutionOption’ is predetermined ‘shared’ for ‘shared’
/home/pv197/tn93/tn93/src/ShortestPathTN93.cpp:92:208: error: ‘min_overlap’ is predetermined ‘shared’ for ‘shared’
/home/pv197/tn93/tn93/src/ShortestPathTN93.cpp:92:208: error: ‘step_penalty’ is predetermined ‘shared’ for ‘shared’
make[2]: *** [CMakeFiles/ShortestPathTN93.dir/src/ShortestPathTN93.cpp.o] Error 1
make[1]: *** [CMakeFiles/ShortestPathTN93.dir/all] Error 2
make: *** [all] Error 2

fasta_diff.cpp Error on Install

Installation failure on MacOS 10.14.4 upon "cmake ." command. Errors and warnings reference fasta_diff.cpp (see output below).

Cheers,
Joel

Tanyas-iMac-Pro:tn93 WertheimLab$ sudo cmake .
-- The C compiler identification is AppleClang 10.0.1.10010046
-- The CXX compiler identification is AppleClang 10.0.1.10010046
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES)
-- Could NOT find OpenMP_CXX (missing: OpenMP_CXX_FLAGS OpenMP_CXX_LIB_NAMES)
-- Could NOT find OpenMP (missing: OpenMP_C_FOUND OpenMP_CXX_FOUND)
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/WertheimLab/tn93
Tanyas-iMac-Pro:tn93 WertheimLab$ sudo make install
Scanning dependencies of target validate_fasta
[ 2%] Building CXX object CMakeFiles/validate_fasta.dir/src/validate_fasta.cpp.o
[ 4%] Building CXX object CMakeFiles/validate_fasta.dir/src/stringBuffer.cc.o
[ 7%] Building CXX object CMakeFiles/validate_fasta.dir/src/tn93_shared.cc.o
[ 9%] Linking CXX executable validate_fasta
[ 9%] Built target validate_fasta
Scanning dependencies of target selectreads
[ 11%] Building CXX object CMakeFiles/selectreads.dir/src/trim_reads.cpp.o
[ 14%] Building CXX object CMakeFiles/selectreads.dir/src/stringBuffer.cc.o
[ 16%] Building CXX object CMakeFiles/selectreads.dir/src/tn93_shared.cc.o
[ 19%] Building CXX object CMakeFiles/selectreads.dir/src/argparse_trim.cpp.o
[ 21%] Linking CXX executable selectreads
[ 21%] Built target selectreads
Scanning dependencies of target ShortestPathTN93
[ 23%] Building CXX object CMakeFiles/ShortestPathTN93.dir/src/ShortestPathTN93.cpp.o
[ 26%] Building CXX object CMakeFiles/ShortestPathTN93.dir/src/stringBuffer.cc.o
[ 28%] Building CXX object CMakeFiles/ShortestPathTN93.dir/src/tn93_shared.cc.o
[ 30%] Linking CXX executable ShortestPathTN93
[ 30%] Built target ShortestPathTN93
Scanning dependencies of target tn93
[ 33%] Building CXX object CMakeFiles/tn93.dir/src/TN93.cpp.o
[ 35%] Building CXX object CMakeFiles/tn93.dir/src/stringBuffer.cc.o
[ 38%] Building CXX object CMakeFiles/tn93.dir/src/tn93_shared.cc.o
[ 40%] Building CXX object CMakeFiles/tn93.dir/src/argparse.cpp.o
[ 42%] Linking CXX executable tn93
[ 42%] Built target tn93
Scanning dependencies of target nucfreqsfasta
[ 45%] Building CXX object CMakeFiles/nucfreqsfasta.dir/src/nuc_freqs_from_fasta.cpp.o
[ 47%] Building CXX object CMakeFiles/nucfreqsfasta.dir/src/stringBuffer.cc.o
[ 50%] Building CXX object CMakeFiles/nucfreqsfasta.dir/src/tn93_shared.cc.o
[ 52%] Linking CXX executable nucfreqsfasta
[ 52%] Built target nucfreqsfasta
Scanning dependencies of target readreduce
[ 54%] Building CXX object CMakeFiles/readreduce.dir/src/read_reducer.cpp.o
[ 57%] Building CXX object CMakeFiles/readreduce.dir/src/stringBuffer.cc.o
[ 59%] Building CXX object CMakeFiles/readreduce.dir/src/tn93_shared.cc.o
[ 61%] Building CXX object CMakeFiles/readreduce.dir/src/argparse_merge.cpp.o
[ 64%] Linking CXX executable readreduce
[ 64%] Built target readreduce
Scanning dependencies of target seqcoverage
[ 66%] Building CXX object CMakeFiles/seqcoverage.dir/src/charfreqs.cpp.o
[ 69%] Building CXX object CMakeFiles/seqcoverage.dir/src/stringBuffer.cc.o
[ 71%] Building CXX object CMakeFiles/seqcoverage.dir/src/tn93_shared.cc.o
[ 73%] Building CXX object CMakeFiles/seqcoverage.dir/src/argparse_cf.cpp.o
[ 76%] Linking CXX executable seqcoverage
[ 76%] Built target seqcoverage
Scanning dependencies of target fasta_diff
[ 78%] Building CXX object CMakeFiles/fasta_diff.dir/src/fasta_diff.cpp.o
/Users/WertheimLab/tn93/src/fasta_diff.cpp:37:9: warning: 'auto' type specifier is a C++11 extension [-Wc++11-extensions]
auto compare_records = [&] (const std::string & id1, const char* seq_value) -> int {
^
/Users/WertheimLab/tn93/src/fasta_diff.cpp:37:32: error: expected expression
auto compare_records = [&] (const std::string & id1, const char* seq_value) -> int {
^
/Users/WertheimLab/tn93/src/fasta_diff.cpp:85:9: warning: 'auto' type specifier is a C++11 extension [-Wc++11-extensions]
auto echo_fasta_sequence = [&] (const char* id, const char* data, FILE * where) -> void {
^
/Users/WertheimLab/tn93/src/fasta_diff.cpp:85:36: error: expected expression
auto echo_fasta_sequence = [&] (const char* id, const char* data, FILE * where) -> void {
^
/Users/WertheimLab/tn93/src/fasta_diff.cpp:105:17: warning: 'auto' type specifier is a C++11 extension [-Wc++11-extensions]
auto it = sequences_to_add.find (master_id);
^
/Users/WertheimLab/tn93/src/fasta_diff.cpp:156:57: warning: range-based for loop is a C++11 extension [-Wc++11-extensions]
for (std::pair<std::string, std::string> it : sequences_to_add) {
^
/Users/WertheimLab/tn93/src/fasta_diff.cpp:171:32: warning: range-based for loop is a C++11 extension [-Wc++11-extensions]
for (std::string s : updated_sequences) {
^
5 warnings and 2 errors generated.
make[2]: *** [CMakeFiles/fasta_diff.dir/src/fasta_diff.cpp.o] Error 1
make[1]: *** [CMakeFiles/fasta_diff.dir/all] Error 2
make: *** [all] Error 2

Handle commas in sequence IDs?

Hey, folks! I recently tried computing pairwise distances from a SARS-CoV-2 MSA that included the reference genome (which has a comma in the FASTA sequence ID), and this breaks the tn93 CSV output. Here's a minimal reproducible example:

>Species X, complete genome
ACGTACGTAC
>Species Y, complete genome
ACGTACGTAA
>Species Z, complete genome
CCGTACGTAC

The tn93 CSV output is as follows:

ID1,ID2,Distance
Species X, complete genome,Species Y, complete genome,0.108711
Species Y, complete genome,Species Z, complete genome,0.239924
Species X, complete genome,Species Z, complete genome,0.108711

As a quick workaround, I'm just removing commas from my input FASTA, but would it be possible to add support for commas in sequence IDs within tn93? Some potential ideas:

  • The simplest solution might be to allow the user to use tabs to delimit columns instead of commas (e.g. as a --tsv-output flag?)
    • Or perhaps let the user specify an arbitrary delimiter? (e.g. as a --delimiter ? flag?)
  • A less simple solution would be to wrap comma-containing IDs in quotes, e.g.:
    • "Species X, complete genome","Species Y, complete genome",0.108711

No pressure/rush, and I totally understand just requiring the user to not have commas in the IDs, but just a proposition 😄
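
To illustrate the quoting idea, here is a short Python sketch using the standard `csv` module, which by default quotes only the fields that contain the delimiter. It shows what quoted output could look like and that it parses back cleanly; it is not a patch to tn93, and the rows are copied from the example above.

```python
import csv
import io

rows = [
    ("Species X, complete genome", "Species Y, complete genome", 0.108711),
    ("Species Y, complete genome", "Species Z, complete genome", 0.239924),
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")  # default QUOTE_MINIMAL quoting
writer.writerow(["ID1", "ID2", "Distance"])
writer.writerows(rows)
print(buffer.getvalue())

# Reading the quoted text back recovers the original three columns.
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
```

Any CSV reader that honors standard quoting should then see exactly three columns per row, so downstream tools would not need the comma-stripping workaround.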

issue installing tn93

Hi, I was trying to install tn93 on my laptop, and I got an error at the "make install" step that I can't solve.
Could you help me?
Here is the output:

(base) epi-c02yk4ktjgh5:tn93 carla$ make install
Consolidate compiler generated dependencies of target tn93
[ 2%] Building CXX object CMakeFiles/tn93.dir/src/tn93_shared.cc.o
/Users/carla/tn93/src/tn93_shared.cc:450:3: warning: 'auto' type specifier is a C++11 extension [-Wc++11-extensions]
auto ambiguityHandler = [&] (unsigned c1, unsigned c2) -> void {
^
/Users/carla/tn93/src/tn93_shared.cc:450:27: error: expected expression
auto ambiguityHandler = [&] (unsigned c1, unsigned c2) -> void {
^
1 warning and 1 error generated.
make[2]: *** [CMakeFiles/tn93.dir/src/tn93_shared.cc.o] Error 1
make[1]: *** [CMakeFiles/tn93.dir/all] Error 2
make: *** [all] Error 2

Error compiling on Windows Subsystem for Linux - Ubuntu 18.04

When I run cmake . in the tn93 directory, I get the following:

-- The C compiler identification is GNU 7.4.0
-- The CXX compiler identification is GNU 7.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- Configuring done
-- Generating done
-- Build files have been written to: /home/niema/Desktop/tn93

Then, when I run make, I get the following:

Scanning dependencies of target nucfreqsfasta
[  2%] Building CXX object CMakeFiles/nucfreqsfasta.dir/src/nuc_freqs_from_fasta.cpp.o
[  4%] Building CXX object CMakeFiles/nucfreqsfasta.dir/src/stringBuffer.cc.o
[  7%] Building CXX object CMakeFiles/nucfreqsfasta.dir/src/tn93_shared.cc.o
[  9%] Linking CXX executable nucfreqsfasta
[  9%] Built target nucfreqsfasta
Scanning dependencies of target readreduce
[ 11%] Building CXX object CMakeFiles/readreduce.dir/src/read_reducer.cpp.o
[ 14%] Building CXX object CMakeFiles/readreduce.dir/src/stringBuffer.cc.o
[ 16%] Building CXX object CMakeFiles/readreduce.dir/src/tn93_shared.cc.o
[ 19%] Building CXX object CMakeFiles/readreduce.dir/src/argparse_merge.cpp.o
[ 21%] Linking CXX executable readreduce
[ 21%] Built target readreduce
make[2]: *** No rule to make target 'CMakeFiles/seqcoverage.dir/depend'.  Stop.
CMakeFiles/Makefile2:173: recipe for target 'CMakeFiles/seqcoverage.dir/all' failed
make[1]: *** [CMakeFiles/seqcoverage.dir/all] Error 2
Makefile:129: recipe for target 'all' failed
make: *** [all] Error 2

Much thanks in advance!

Error compiling on Fedora 30

[apoon42@mimm4750g tn93-1.0.6]$ cmake .
-- The C compiler identification is GNU 9.2.1
-- The CXX compiler identification is GNU 9.2.1
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found OpenMP_C: -fopenmp (found version "4.5") 
-- Found OpenMP_CXX: -fopenmp (found version "4.5") 
-- Found OpenMP: TRUE (found version "4.5")  
-- Configuring done
-- Generating done
-- Build files have been written to: /home/apoon42/src/tn93-1.0.6
[apoon42@mimm4750g tn93-1.0.6]$ make
Scanning dependencies of target nucfreqsfasta
[  2%] Building CXX object CMakeFiles/nucfreqsfasta.dir/src/nuc_freqs_from_fasta.cpp.o
[  5%] Building CXX object CMakeFiles/nucfreqsfasta.dir/src/stringBuffer.cc.o
[  8%] Building CXX object CMakeFiles/nucfreqsfasta.dir/src/tn93_shared.cc.o
[ 10%] Linking CXX executable nucfreqsfasta
[ 10%] Built target nucfreqsfasta
Scanning dependencies of target readreduce
[ 13%] Building CXX object CMakeFiles/readreduce.dir/src/read_reducer.cpp.o
/home/apoon42/src/tn93-1.0.6/src/read_reducer.cpp: In function ‘void handle_a_sequence(StringBuffer&, StringBuffer&, Vector&, Vector&, long int, long int, Vector*)’:
/home/apoon42/src/tn93-1.0.6/src/read_reducer.cpp:137:35: error: ‘firstSequenceLength’ not specified in enclosing ‘parallel’
  137 |                 if (perfect_match (current_sequence.getString(), stringText(current_clusters, sequence_lengths, cluster_index), firstSequenceLength) >= min_overlap) {
      |                     ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/apoon42/src/tn93-1.0.6/src/read_reducer.cpp:129:21: error: enclosing ‘parallel’
  129 |             #pragma omp parallel for default(none) shared(currently_defined_clusters, try_cluster, sequence_lengths, current_sequence, current_clusters)
      |                     ^~~
/home/apoon42/src/tn93-1.0.6/src/read_reducer.cpp:137:153: error: ‘min_overlap’ not specified in enclosing ‘parallel’
  137 | sters, sequence_lengths, cluster_index), firstSequenceLength) >= min_overlap) {
      |                                                                  ^~~~~~~~~~~

/home/apoon42/src/tn93-1.0.6/src/read_reducer.cpp:129:21: error: enclosing ‘parallel’
  129 |             #pragma omp parallel for default(none) shared(currently_defined_clusters, try_cluster, sequence_lengths, current_sequence, current_clusters)
      |                     ^~~
make[2]: *** [CMakeFiles/readreduce.dir/build.make:63: CMakeFiles/readreduce.dir/src/read_reducer.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:110: CMakeFiles/readreduce.dir/all] Error 2
make: *** [Makefile:130: all] Error 2

System info:

[apoon42@mimm4750g tn93-1.0.6]$ gcc --version
gcc (GCC) 9.2.1 20190827 (Red Hat 9.2.1-1)
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[apoon42@mimm4750g tn93-1.0.6]$ cmake --version
cmake version 3.14.5

Create OpenCL implementation of tn93

A good starting point would be to search for parts of the code that have already been parallelized. For example, search for `omp parallel` and investigate whether the use of OpenCL may result in faster computation.

100s in `hyphy` output format

Setting -f to hyphy outputs a matrix comprising mostly 100 entries.
To reproduce, I retrieved NCBI PopSet 1892228972 and downloaded the FASTA file as sequence.fasta:

art@Wernstrom Downloads % mafft sequence.fasta > hiv.mafft.fa
[ omit output ]
art@Wernstrom Downloads % tn93 -o temp.csv hiv.mafft.fa 
Read 602 sequences of length 1094
Will perform 180901 pairwise distance calculations
Progress:     100% (    2210 links found,          inf evals/sec)
{
	"Actual comparisons performed" :180901,
	"Comparisons accounting for copy numbers " :180901,
	"Total comparisons possible" : 180901,
	"Links found" : 2210,
	"Maximum distance" : 0.159834,
	"Sequences" : 602,
	"Mean distance" : 0.0907582,
[ truncate output ]
art@Wernstrom Downloads % head -n5 temp.csv 
ID1,ID2,Distance
MT787751.1 HIV-1 isolate 16854 from Italy nonfunctional pol protein (pol) gene, partial sequence,MT787795.1 HIV-1 isolate 16786 from Italy pol protein (pol) gene, partial cds,0.00367472
MT787751.1 HIV-1 isolate 16854 from Italy nonfunctional pol protein (pol) gene, partial sequence,MT787869.1 HIV-1 isolate 17835 from Italy pol protein (pol) gene, partial cds,0.0139596
MT787751.1 HIV-1 isolate 16854 from Italy nonfunctional pol protein (pol) gene, partial sequence,MT787972.1 HIV-1 isolate 2498686 from Italy pol protein (pol) gene, partial cds,0.0130311
MT787751.1 HIV-1 isolate 16854 from Italy nonfunctional pol protein (pol) gene, partial sequence,MT788268.1 HIV-1 isolate 520925 from Italy pol protein (pol) gene, partial cds,0.012028
art@Wernstrom Downloads % tn93 -o temp.txt -f hyphy hiv.mafft.fa
[ omit output ]
art@Wernstrom Downloads % head -n3 temp.txt
{
{0,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,1
00,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100}
{100,0,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,0.00367472,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,0.0139596,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,0.0130311,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,10
0,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,0.012028,100,100,0.00552784,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100}

See also #16

Invalid character results in wrong error message ("All sequences must have the same length")

I have one sequence (hCoV_19_Norway_1539_2020_EPI_ISL_417487) that tn93 keeps thinking has one fewer character than it actually has (or at least seems to have). I have attached a minimal working example below:

example.txt

I tried to run tn93 as follows:

cat example.aln | tn93 -l 1 -t 1

But I get the following error message:

All sequences must have the same length (29811), but sequence 'hCoV_19_Norway_1539_2020_EPI_ISL_417487' had length 29810

However, I tried checking it in Python (lines[3] is the problematic sequence):

lines = open('example.txt').readlines()

len(lines[1])  # prints 29812 (includes the newline at the end)
lines[1][:10]  # 'CTTCCCAGGT'
lines[1][-10:] # 'AATTTTAGT\n'
set(lines[1])  # {'\n', 'R', 'G', 'A', 'C', 'T', 'M'}

len(lines[3])  # prints 29812 (includes the newline at the end)
lines[3][:10]  # 'CTTCCCAGGT'
lines[3][-10:] # 'AATTTTAGT\n'
set(lines[3])  # {'V', 'S', '\n', 'R', 'G', 'I', 'A', 'C', 'Y', 'T'}

len(lines[5])  # prints 29812 (includes the newline at the end)
lines[5][:10]  # '----------'
lines[5][-10:] # 'AATTTTAGT\n'
set(lines[5])  # {'\n', 'G', 'A', '-', 'C', 'T'}

Excluding the newline character after every line (which is included in the lengths printed by the above code), each sequence has exactly 29811 characters.

The only weird character I see in the problematic sequence is I, which doesn't seem to be a standard IUPAC character. Thoughts?
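One quick way to test this hypothesis is to list every character that falls outside the IUPAC nucleotide alphabet, together with its positions. The following is a minimal sketch: the helper name `non_iupac_positions` is hypothetical, and the allowed set (IUPAC codes plus the `-` gap) is an assumption for illustration, not a statement of what tn93 itself accepts.

```python
# IUPAC nucleotide codes plus the gap character. This allowed set is an
# assumption for illustration; it is not taken from the tn93 source.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN-")


def non_iupac_positions(sequence):
    """Map each unexpected character to the 0-based positions where it occurs."""
    unexpected = {}
    # Drop only the trailing line ending so that positions are not shifted.
    for position, char in enumerate(sequence.rstrip("\r\n").upper()):
        if char not in IUPAC_NUCLEOTIDES:
            unexpected.setdefault(char, []).append(position)
    return unexpected


print(non_iupac_positions("CTTCCCAGGTIAAYS-R\n"))  # {'I': [10]}
```

Calling it on `lines[3]` from the session above would show whether `I` is the only character outside the alphabet and where it sits in the alignment.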

Unexpected behavior introduced in v1.08

Hi,

I noticed that v1.08 unrolls the comparison loops manually here:

tn93/src/tn93_shared.cc

Lines 584 to 605 in 55565df

for (; p + 2 <= span_start; p+=2) {
    unsigned c1 = s1[p], c2 = s2[p];
    unsigned c3 = s1[p+1], c4 = s2[p+1];
    if (__builtin_expect(c1 < 4 && c2 < 4, 1)) {
        integer_counts [c1][c2] ++;
    } else { // not both resolved
        if (c1 == GAP || c2 == GAP) {
            continue;
        }
        ambiguityHandler (c1,c2);
    }
    if (__builtin_expect(c3 < 4 && c4 < 4, 1)) {
        integer_counts2 [c3][c4] ++;
    } else { // not both resolved
        if (c3 == GAP || c4 == GAP) {
            continue;
        }
        ambiguityHandler (c3,c4);
    }
}
The continue at line 592 may cause the comparison at lines 597-604 to be skipped. Is this a new feature or a bug?
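The effect being described can be reproduced with a small model. The sketch below is not tn93 code: it is a simplified Python analogue (function names are hypothetical) that only counts how many positions get processed, comparing a plain loop with a two-at-a-time loop that uses `continue` in its first half.

```python
GAP = "-"


def count_simple(s1, s2):
    """Reference loop: process every position unless either side is a gap."""
    counted = 0
    for c1, c2 in zip(s1, s2):
        if c1 == GAP or c2 == GAP:
            continue
        counted += 1
    return counted


def count_unrolled(s1, s2):
    """Two positions per iteration, with the same `continue` shape as the quoted loop."""
    counted = 0
    n = min(len(s1), len(s2))
    p = 0
    while p + 2 <= n:
        c1, c2 = s1[p], s2[p]
        c3, c4 = s1[p + 1], s2[p + 1]
        p += 2  # a C for-loop still runs its increment after `continue`
        if c1 == GAP or c2 == GAP:
            continue  # jumps to the next iteration, so (c3, c4) is never examined
        counted += 1
        if c3 == GAP or c4 == GAP:
            continue
        counted += 1
    # Handle a possible leftover position when n is odd.
    if p < n and s1[p] != GAP and s2[p] != GAP:
        counted += 1
    return counted


print(count_simple("-ACG", "AACG"))    # 3
print(count_unrolled("-ACG", "AACG"))  # 2
```

In this model a gap at an even offset causes the following position to be dropped, which is the behaviour the question is about. Whether the same thing happens in tn93 depends on details outside the quoted snippet.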

Thanks,
Tianqi

tn93 installation error

Hi @stevenweaver. I have been trying to install the tn93 package to no avail. The cmake command worked fine and all build files were created. However, the make install command generates a number of errors, posted below. I am working on a Linux machine. Is there anything else I need to do to get it working?

Thanks in advance for your help.

Abayomi

Scanning dependencies of target readreduce
[ 2%] Building CXX object CMakeFiles/readreduce.dir/src/read_reducer.cpp.o
[ 5%] Building CXX object CMakeFiles/readreduce.dir/src/stringBuffer.cc.o
[ 8%] Building CXX object CMakeFiles/readreduce.dir/src/tn93_shared.cc.o
[ 10%] Building CXX object CMakeFiles/readreduce.dir/src/argparse_merge.cpp.o
[ 13%] Linking CXX executable readreduce
[ 13%] Built target readreduce
Scanning dependencies of target validate_fasta
[ 16%] Building CXX object CMakeFiles/validate_fasta.dir/src/validate_fasta.cpp.o
[ 18%] Building CXX object CMakeFiles/validate_fasta.dir/src/stringBuffer.cc.o
[ 21%] Building CXX object CMakeFiles/validate_fasta.dir/src/tn93_shared.cc.o
[ 24%] Linking CXX executable validate_fasta
[ 24%] Built target validate_fasta
Scanning dependencies of target seqcoverage
[ 27%] Building CXX object CMakeFiles/seqcoverage.dir/src/charfreqs.cpp.o
[ 29%] Building CXX object CMakeFiles/seqcoverage.dir/src/stringBuffer.cc.o
[ 32%] Building CXX object CMakeFiles/seqcoverage.dir/src/tn93_shared.cc.o
[ 35%] Building CXX object CMakeFiles/seqcoverage.dir/src/argparse_cf.cpp.o
[ 37%] Linking CXX executable seqcoverage
[ 37%] Built target seqcoverage
Scanning dependencies of target tn93-cluster
[ 40%] Building CXX object CMakeFiles/tn93-cluster.dir/src/cluster.cpp.o
/home/abayomi/git/tn93/src/cluster.cpp: In function ‘int main(int, const char**)’:
/home/abayomi/git/tn93/src/cluster.cpp:241:10: error: ‘outer_iterator’ does not name a type
auto outer_iterator = remaining.begin();
^
/home/abayomi/git/tn93/src/cluster.cpp:243:12: error: ‘outer_iterator’ was not declared in this scope
while (outer_iterator != remaining.end()) {
^
/home/abayomi/git/tn93/src/cluster.cpp:253:12: error: ‘inner_iterator’ does not name a type
auto inner_iterator = remaining.begin();
^
/home/abayomi/git/tn93/src/cluster.cpp:255:14: error: ‘inner_iterator’ was not declared in this scope
while (inner_iterator != remaining.end()) {
^
/home/abayomi/git/tn93/src/cluster.cpp:314:10: error: ‘outer_iterator’ does not name a type
auto outer_iterator = remaining.begin();
^
/home/abayomi/git/tn93/src/cluster.cpp:325:12: error: ‘outer_iterator’ was not declared in this scope
while (outer_iterator != remaining.end()) {
^
/home/abayomi/git/tn93/src/cluster.cpp:366:14: error: ‘cluster_iterator’ does not name a type
auto cluster_iterator = join_to.begin();
^
/home/abayomi/git/tn93/src/cluster.cpp:367:32: error: ‘cluster_iterator’ was not declared in this scope
unsigned long first = *cluster_iterator;
^
CMakeFiles/tn93-cluster.dir/build.make:62: recipe for target 'CMakeFiles/tn93-cluster.dir/src/cluster.cpp.o' failed
make[2]: *** [CMakeFiles/tn93-cluster.dir/src/cluster.cpp.o] Error 1
CMakeFiles/Makefile2:178: recipe for target 'CMakeFiles/tn93-cluster.dir/all' failed
make[1]: *** [CMakeFiles/tn93-cluster.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

missing input name

void dump_sequence_fasta (unsigned long index, FILE* output, long firstSequenceLength, double * d = NULL, bool = false, unsigned long from = 0L, unsigned long to = 0L);

The parameter name is_prot is missing in this header file. It should probably be bool is_prot = false instead of bool = false.

Make distance file output more deterministic

Currently, the edges in the CSV output are written in no particular order. We should print the rows sorted by the first column.

To reproduce:
Run tn93 twice, save the output, and notice the differences between files.

We need to have sorted rows in order for some testing environments to work.
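Until the output is sorted by the program itself, a test harness can normalize the file before comparing runs. The following is a minimal sketch under stated assumptions: the function name `sort_distance_csv` is hypothetical, and the layout (one header row, then rows whose first two columns are sequence IDs) is assumed rather than taken from the tn93 documentation.

```python
import csv
import io


def sort_distance_csv(text, has_header=True):
    """Return CSV text with data rows sorted by first, then second column."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    header, body = (rows[:1], rows[1:]) if has_header else ([], rows)
    body.sort(key=lambda row: (row[0], row[1]))
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(header + body)
    return out.getvalue()


run_a = "ID1,ID2,Distance\nB,C,0.01\nA,B,0.02\n"
run_b = "ID1,ID2,Distance\nA,B,0.02\nB,C,0.01\n"
print(sort_distance_csv(run_a) == sort_distance_csv(run_b))  # True
```

Sorting on the second column as a tie-breaker keeps the order stable when one sequence appears in several rows.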

"Illegal Instruction" error on version 1.0.9, installed through conda

Here is the output

$ tn93 sequences.fa 
Read 103 sequences of length 29903
Will perform 5253 pairwise distance calculationsIllegal instruction

Downgrading to 1.0.6 fixes the issue.
Here is the CPU info:

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    1
Core(s) per socket:    24
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
Stepping:              4
CPU MHz:               2294.611
BogoMIPS:              4589.38
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              16896K
NUMA node0 CPU(s):     0-23
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single intel_ppin fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt
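One way to narrow down a report like this is to compare the flags the CPU advertises with the instruction sets a binary might assume. The sketch below only does the set comparison: the helper name `missing_cpu_flags` is hypothetical, and the list of required flags is a made-up example, since the report does not say which instruction triggered the fault.

```python
def missing_cpu_flags(flags_line, required):
    """Return the required flags that are absent from an lscpu-style flags string."""
    available = set(flags_line.split())
    return sorted(set(required) - available)


# A shortened excerpt of the flags reported above.
flags = "sse sse2 ssse3 fma sse4_1 sse4_2 popcnt avx f16c bmi1 avx2 bmi2"

# Hypothetical requirement list, chosen only to illustrate the check.
print(missing_cpu_flags(flags, ["avx2", "fma", "avx512f"]))  # ['avx512f']
```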
