tothuhien / lsd2


A phylogeny dating method using least-squares algorithms and criteria

License: GNU General Public License v2.0

C++ 99.39% Makefile 0.07% CMake 0.36% C 0.18%

dating phylogeny least-squares time-scaling

lsd2's Introduction

LSD2: LEAST-SQUARES METHODS TO ESTIMATE RATES AND DATES FROM PHYLOGENIES

News

lsd2 is now integrated in IQ-tree
For people who prefer R, an R-wrapper of lsd2 can be found here: https://github.com/tothuhien/Rlsd2
Some default settings have changed: temporal constraints are now imposed by default, without option -c. Variances are also used by default, without specifying -v 1. Outgroups are now kept in the tree by default; to remove them, use option -G together with -g.

Compile/install LSD2:

Type make from the folder src; the executable file lsd2 will be created in the same folder. Note that a C++ compiler and standard library supporting ISO C++ 2011 are required to compile the program from source.

Mac/Linux users can install lsd2 via Homebrew as follows (the Homebrew version is not yet updated with the current one on github): brew install brewsci/bio/lsd2

Run LSD2:

If you want to use the interface, type ./lsd2 without parameters in the terminal from the folder containing the executable file. Otherwise, type ./lsd2 options where the list of options can be obtained by ./lsd2 -h.

The input tree file is required and should be specified by option -i.

The input date file is necessary to estimate absolute dates and is specified with option -d. It should contain the dates of most of the tips, and possibly of some internal nodes if known. If some tip dates are missing, the program uses only the subtree containing all tips and nodes with defined dates to estimate the rate; the missing tip dates are then inferred at the end from the estimated rate and dates. Enough dates must be given to avoid an underdetermined problem: if all tips have the same date and no date information is given for internal nodes, absolute dates cannot be inferred. In this case, you can still estimate relative dates using options -a and -z to specify the root date and the tip date.

By default, lsd2 imposes the temporal constraints (date of a node >= date of its ancestors) on every node. Note that LSD2 always assumes an increasing-time order from root to tips, i.e. the date of a node is smaller than that of its children. If your data has the reverse order, the simplest way is to negate the input dates, then negate the output dates to obtain the expected results.
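Negating dates also swaps lower and upper bounds, which is easy to get wrong when constraints are given as ages. The following is a minimal Python sketch of that conversion; it is not part of lsd2, the function names `negate_constraint` and `format_constraint` are hypothetical, and the l/u/b notation is the one described in the Input_date_file section.

```python
def negate_constraint(kind, values):
    """Negate a date constraint; lower and upper bounds swap roles.

    kind is "exact", "l" (lower bound), "u" (upper bound) or "b" (interval).
    """
    if kind == "exact":
        return "exact", (-values[0],)
    if kind == "l":   # x >= a  becomes  -x <= -a
        return "u", (-values[0],)
    if kind == "u":   # x <= a  becomes  -x >= -a
        return "l", (-values[0],)
    if kind == "b":   # a <= x <= b  becomes  -b <= -x <= -a
        low, high = values
        return "b", (-high, -low)
    raise ValueError(f"unknown constraint kind: {kind}")


def format_constraint(kind, values):
    """Render a constraint in the date-file notation (l, u, b)."""
    joined = ",".join(format(v, "g") for v in values)
    return joined if kind == "exact" else f"{kind}({joined})"


# A node that is at least 48.7 time units old has age >= 48.7.
print(format_constraint(*negate_constraint("l", (48.7,))))
# An age between 55 and 65 becomes a date interval on the negative axis.
print(format_constraint(*negate_constraint("b", (55, 65))))
```

In other words, a minimum age corresponds to an upper bound on the (negative) date, and a maximum age to a lower bound.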

The program first collapses all internal branches that are considered uninformative (<= 0.5/seqlength by default, and low support value if specified) and imposes a minimum branch length constraint on the time-scaled tree. This minimum is rounded to a time unit (day, week, year, My, etc.) based on the rounding factor given by option -R. Users should be aware of this and select the right date unit for their data if the default one does not fit. These values can be specified manually via options -l (uninformative branch length threshold), -S (support value threshold), and -u, -U (minimum internal/external branch lengths of the time-scaled tree).

Note that if the input tree contains many null branches, applying a positive minimum branch length to the time-scaled tree could introduce bias. In this case, it is suggested to use -u 0 to allow null branches in the output tree. On the other hand, if the input branch lengths are large, you can increase the minimum branch length. Users are encouraged to try different settings and select one that fits their data.
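To make the default collapsing threshold concrete, here is a small Python sketch that computes 0.5/seqlength and lists the internal branches at or below it. It only illustrates the rule stated above: the function names and the dictionary layout are hypothetical, and the support-value threshold (-S) and the rounding of the minimum branch length (-R) are not modelled.

```python
def default_collapse_threshold(seq_length):
    """Default threshold for uninformative branches: 0.5 / seqlength."""
    return 0.5 / seq_length


def uninformative_branches(internal_branches, seq_length, threshold=None):
    """Return the names of internal branches with length <= threshold.

    internal_branches maps a branch name to its length; threshold plays the
    role of option -l and falls back to the default when not given.
    """
    if threshold is None:
        threshold = default_collapse_threshold(seq_length)
    return [name for name, length in internal_branches.items()
            if length <= threshold]


branches = {"n1": 0.0, "n2": 0.0004, "n3": 0.01}
print(default_collapse_threshold(1000))
print(uninformative_branches(branches, seq_length=1000))
# With threshold 0 (as with -l 0), only null branches are selected.
print(uninformative_branches(branches, seq_length=1000, threshold=0))
```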

Further options can be specified, see ./lsd2 -h for more details.

Input files format

Input_tree_file

Input tree(s) in Newick format are compulsory. A tree can be binary or contain polytomies, with or without support values. The input file must contain one tree per line, for example:

((A:0.12,D:0.12):0.3,(B:0.3,C:0.5):0.4);

((A:0.12,B:0.3):0.7,(C:0.5,D:0.8):0.1);
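Before running the program, it can help to check that a tree file really has one complete tree per line. The following Python sketch is a rough sanity check only, not a Newick parser and not something lsd2 provides; `check_tree_lines` is a hypothetical helper, and quoted labels containing parentheses are not handled.

```python
def check_tree_lines(text):
    """Return (line_number, problem) pairs for lines that look malformed."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are ignored by this check
        depth = 0
        balanced = True
        for char in line:
            if char == "(":
                depth += 1
            elif char == ")":
                depth -= 1
                if depth < 0:  # a ')' appeared before its '('
                    balanced = False
                    break
        if depth != 0:
            balanced = False
        if not balanced:
            problems.append((number, "unbalanced parentheses"))
        elif not line.endswith(";"):
            problems.append((number, "missing final ';'"))
    return problems


trees = """((A:0.12,D:0.12):0.3,(B:0.3,C:0.5):0.4);
((A:0.12,B:0.3):0.7,(C:0.5,D:0.8):0.1);
"""
print(check_tree_lines(trees))
```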

Input_date_file

An input date file is optional. If it's not provided then the program estimates the relative dates by assuming all tips have the same date (1 by default), and the root has date 0 by default.

A valid date is either a real number or a string in the format year-month-day. Suppose that we have the input tree ((A:0.12,D:0.12):0.3,(B:0.3,C:0.5):0.4); then an input date file could be as follows:

5			# number of temporal constraints
A 1999.2		# the date of A is 1999.2
B 2000.1		# the date of B is 2000.1
C l(1990.5)		# the date of C is >= 1990.5 (more recent than 1990.5)
D b(1998.21,2000.5)	# the date of D is between 1998.21 and 2000.5
mrca(A,B,C) u(1980)	# the date of the most recent ancestor of A,B, and C is <= 1980 (older than 1980)
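The notation above can be read as a (lower, upper) pair per node. The Python sketch below parses such a file into that form so the meaning of l, u and b is explicit. It is an illustration and not lsd2's own parser: the names `parse_date_file` and `parse_value` are hypothetical, and stripping everything after `#` is an assumption made so that the annotated example can be used as input.

```python
import re

# Matches l(x), u(x) and b(x,y); anything else is treated as an exact date.
BOUND = re.compile(r"^([lub])\((.+)\)$")


def parse_value(text):
    """Return a float for numeric dates, else the raw string (e.g. 2000-07-12)."""
    try:
        return float(text)
    except ValueError:
        return text


def parse_date_file(text):
    """Parse date-file text into {node: (lower, upper)}.

    None means no bound on that side; an exact date has lower == upper.
    """
    # Drop annotations (from '#' to end of line) and blank lines.
    lines = [line.split("#", 1)[0].strip() for line in text.splitlines()]
    lines = [line for line in lines if line]
    expected = int(lines[0])  # first line: number of temporal constraints
    constraints = {}
    for line in lines[1:]:
        node, spec = line.split()
        match = BOUND.match(spec)
        if match is None:
            value = parse_value(spec)
            constraints[node] = (value, value)
            continue
        kind = match.group(1)
        values = [parse_value(v) for v in match.group(2).split(",")]
        if kind == "l":      # date >= value
            constraints[node] = (values[0], None)
        elif kind == "u":    # date <= value
            constraints[node] = (None, values[0])
        else:                # "b": value1 <= date <= value2
            constraints[node] = (values[0], values[1])
    if len(constraints) != expected:
        raise ValueError(f"expected {expected} constraints, found {len(constraints)}")
    return constraints


EXAMPLE = """5
A 1999.2
B 2000.1
C l(1990.5)             # date of C >= 1990.5
D b(1998.21,2000.5)
mrca(A,B,C) u(1980)     # date of the mrca <= 1980
"""
print(parse_date_file(EXAMPLE))
```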

You can also define labels for internal nodes and use them to specify their dates. For example, if you have the input tree ((A:0.12,D:0.12)n1:0.3,(B:0.3,C:0.5)n2:0.4); then an input date file could be as follows:

4
A 2000-07-12
n1 l(2001-05-11)
C b(2001-04-11,2004-01-15)
n2 u(2003-02-12)

If the date format is detected as year-month-day and some dates are imprecise (missing month or missing day), lsd2 automatically converts each of them into the corresponding interval.

Given rate file

If the rates are known and you want to use them to infer the dates, you can provide them in a file. The file should contain one rate per line, each corresponding to a tree in the Input_tree_file, for example:

0.0068	
0.0052

Outgroup file

2
outgroup1
outgroup2

If there is more than one outgroup, they must be monophyletic in the input trees. The program uses the outgroups to root the tree. As stated in the News section, the outgroups are now kept in the tree by default; to remove them, use option -G together with -g.

Partition file

You can partition the branches of the tree into several subsets when you know that each subset has a different rate.

Suppose that we have a tree ((A:0.12,D:0.12)n1:0.3,((B:0.3,C:0.5)n2:0.4,(E:0.5,(F:0.2,G:0.3)n3:0.33)n4:0.22)n5:0.2)root;

then an example for partition file can be as follows:

group1 1 {n1} {n5 n4}
group2 1 {n3}

Each line defines a rate group, i.e. a list of subtrees whose branches are assumed to share the same substitution rate. A line starts with the name of the group (group1), followed by the prior proportion (1) of the group rate relative to the main rate. This is only a starting value, and the proportion is estimated at the end; giving an appropriate value helps the estimation converge faster. Each subtree is then defined between {}: the first node is the root of the subtree and the following nodes (if any) define its tips. If the first node is a tip label, then the mrca of all listed tips is taken as the root of the subtree. If only the root is given and no tips are defined, the subtree extends down to the tips of the full tree. Hence, {n1} defines the subtree rooted at node n1, and {n5 n4} defines the subtree rooted at n5 that has n4 as one tip and tips of the full tree as its other tips (here B and C). As a consequence, in this example, the branches are partitioned into 3 groups, each with a different rate:

group1: (n1,A), (n1,D), (n5,n4), (n5,n2), (n2,B), (n2,C);
group2: (n3,F), (n3,G);
group0: the remaining branches of the tree (main rate).

Note that if the internal nodes don't have labels, they can be defined by the mrca of at least two tips; for example, n1 is mrca(A,D).
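The subtree rules above can be checked on the example tree with a short Python sketch. It follows the description in this section (descend from the subtree root, stop at the listed tips) and reproduces the branch lists of group1 and group2. It is not lsd2 code: the function names are hypothetical, the topology is written by hand rather than parsed from Newick, and the case where the first node is a tip label is not handled.

```python
import re


def parse_partition_line(line):
    """Split 'name proportion {root tip ...} {...}' into its parts."""
    name, proportion = line.split("{", 1)[0].split()
    subtrees = []
    for body in re.findall(r"\{([^{}]*)\}", line):
        nodes = body.split()
        subtrees.append((nodes[0], nodes[1:]))  # (root, tips cutting the subtree)
    return name, float(proportion), subtrees


def subtree_branches(children, root, tips=()):
    """Collect (parent, child) branches below root, not descending past tips."""
    branches = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            branches.add((node, child))
            if child not in tips:
                stack.append(child)
    return branches


def group_branches(line, children):
    """Return the group name and the union of branches of all its subtrees."""
    name, _, subtrees = parse_partition_line(line)
    branches = set()
    for root, tips in subtrees:
        branches |= subtree_branches(children, root, tips)
    return name, branches


# Topology of the example tree, written by hand as parent -> children.
CHILDREN = {
    "root": ["n1", "n5"],
    "n1": ["A", "D"],
    "n5": ["n2", "n4"],
    "n2": ["B", "C"],
    "n4": ["E", "n3"],
    "n3": ["F", "G"],
}

for line in ("group1 1 {n1} {n5 n4}", "group2 1 {n3}"):
    name, branches = group_branches(line, CHILDREN)
    print(name, sorted(branches))
```

Every branch not collected by any group belongs to group0, the main rate.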

Using variances

Variances are used to penalize long branch lengths. The variance v_i of each branch is proportional to (b_i + b), where b_i is the branch length and b (specified by option -b) is a pseudo-constant added to adjust the dependency of the variances on the branch lengths. This parameter is a positive number; by default it is the maximum of the median branch length and 10/seqlength. It can be adjusted based on how relaxed your input tree is. The smaller b is, the closer the variances are to being linear in the branch lengths, which is more appropriate for a strict-clock tree. The bigger b is, the less the variances depend on the branch lengths, which may be better for a relaxed tree. Option -v sets the variance mode: -v 1 (the default) uses variances; -v 0 disables them; -v 2 runs the program twice, with the second run computing variances from the branch lengths estimated in the first run. Simulations show that -v 2 gives slightly better results than -v 1 on average.
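The default value of b and its effect on the variances can be sketched in a few lines of Python. This is only an illustration of the formula quoted above: the function names are hypothetical, the proportionality constant is left out, and which branches enter the median inside lsd2 is an assumption here (all branches passed in).

```python
from statistics import median


def default_variance_parameter(branch_lengths, seq_length):
    """Default b: the maximum of the median branch length and 10/seqlength."""
    return max(median(branch_lengths), 10.0 / seq_length)


def relative_variances(branch_lengths, b):
    """Variances up to a constant factor: v_i proportional to (b_i + b)."""
    return [length + b for length in branch_lengths]


lengths = [0.001, 0.002, 0.003]
b = default_variance_parameter(lengths, seq_length=1000)
print(b)                               # 10/seqlength dominates the small median
print(relative_variances(lengths, b))  # nearly flat: b dominates b_i
print(relative_variances(lengths, 1e-6))  # nearly linear in b_i
```

With a large b the variances are almost equal across branches, while a small b makes them follow the branch lengths closely, which matches the strict versus relaxed guidance above.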

Some examples of command lines:

If the input tree is rooted:
- You want to estimate rate & dates (by default under temporal constraints and using variances; your sequence length is 1000):
```
./lsd2 -i rootedtree_file -d date_file -s 1000
```
- Similar to the above, but you want to force the root date to 0:
```
./lsd2 -i rootedtree_file -d date_file -a 0 -s 1000
```
- You want to remove outlier nodes with Zscore threshold 3:
```
./lsd2 -i rootedtree_file -d date_file -e 3 -s 1000
```
- You want to collapse only null branches in the input tree (by default all branches <= 0.5/seqlength are collapsed), and impose a minimum of 0.1 (estimated by default) for the branches of the time scaled tree:
```
./lsd2 -i rootedtree_file -d date_file -e 3 -u 0.1 -l 0 -s 1000
```
- Similar to the above, but you don't want to collapse any branch, even null ones; set a negative value for option -l:
```
./lsd2 -i rootedtree_file -d date_file -e 3 -u 0.1 -l -1 -s 1000
```
- Similar to the above, but you allow null external branches in the output tree:
```
./lsd2 -i rootedtree_file -d date_file -e 3 -u 0.1 -U 0 -l 0 -s 1000
```
- You know the tree partition where each part should have a different rate:
```
./lsd2 -i rootedtree_file -d date_file -p partition_file -s 1000
```
- You want to re-estimate the root position locally around the given root:
```
./lsd2 -i rootedtree_file -d date_file -r l -s 1000
```
- You want to calculate confidence intervals from 100 simulated trees, and you'd like to apply a lognormal relaxed clock of standard deviation 0.4 to the simulated branch lengths:
```
./lsd2 -i rootedtree_file -d date_file -r l -f 100 -s 1000 -q 0.4
```
(To calculate confidence intervals, the sequence length is required via option -s. The program generates simulated branch lengths using Poisson distributions whose means equal the estimated branch lengths multiplied by the sequence length. In addition, a lognormal relaxed clock is applied to the branch lengths. This distribution has mean 1 and a standard deviation that users can set with option -q; the default is 0.2, and 0 means a strict clock. The bigger q is, the more relaxed your tree is and the bigger the confidence intervals you should get.)
- You want to calculate confidence intervals from your bootstrap trees:
```
./lsd2 -i rootedtree_file -d date_file -f bootstrap.nwk -s 1000
```
- If all tips have the same date (for example 0), and you know the root date (for example -10):
```
./lsd2 -i tree_file -a -10 -z 0
```
If the input tree is unrooted, you should either specify outgroups or use option -r to estimate the root position.
- If you don't have any outgroup and you want to estimate the root position:
```
./lsd2 -i unrootedtree_file -d date_file -r a
```
- If you have a list of outgroups and want to use them for rooting:
```
./lsd2 -i unrootedtree_file -d date_file -g outgroup_file
```
- If you want to remove the outgroups from the tree:
```
./lsd2 -i unrootedtree_file -d date_file -g outgroup_file -G
```
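The confidence-interval simulation mentioned in the examples above (Poisson counts combined with a lognormal relaxed clock of mean 1 and standard deviation q) can be sketched in Python as follows. This is a sketch under stated assumptions, not lsd2's implementation: how the program combines the two distributions is not specified here, so the sketch multiplies the expected substitution count by the lognormal factor before the Poisson draw, and all function names are hypothetical.

```python
import math
import random


def lognormal_params(sd):
    """(mu, sigma) of the underlying normal for a lognormal with mean 1, std sd."""
    sigma2 = math.log(1.0 + sd * sd)
    return -0.5 * sigma2, math.sqrt(sigma2)


def poisson(rng, lam):
    """Draw a Poisson variate with mean lam."""
    if lam > 500:
        # Assumption: a rounded normal approximation is good enough for large means.
        return max(0, round(rng.gauss(lam, math.sqrt(lam))))
    # Knuth's multiplication method for small and moderate means.
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_branch_lengths(branch_lengths, seq_length, q=0.2, seed=None):
    """Simulate one set of branch lengths (in substitutions per site)."""
    rng = random.Random(seed)
    mu, sigma = lognormal_params(q)
    simulated = []
    for length in branch_lengths:
        # q = 0 means a strict clock: no rate variation between branches.
        factor = rng.lognormvariate(mu, sigma) if q > 0 else 1.0
        substitutions = poisson(rng, length * factor * seq_length)
        simulated.append(substitutions / seq_length)
    return simulated


print(simulate_branch_lengths([0.12, 0.3, 0.5], seq_length=1000, q=0.4, seed=1))
```

A shorter sequence length or a larger q both increase the spread of the simulated branch lengths, which is why they lead to wider confidence intervals.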

Output files:

.result : contains the estimated rates, the root date, possibly the confidence intervals and outlier nodes, and the value of the objective function.

.nexus : trees in nexus format which contain information about the dates of internal nodes, branch lengths, and the confidence intervals (if option -f was used).

.date.nexus : similar to the .nexus trees, but branch lengths are rescaled to time units by dividing by the estimated rate.

.nwk : similar to the .nexus trees but in newick format, so they do not contain confidence interval information.

Citation

If you use this software, please cite: “Fast dating using least-squares criteria and algorithms”, T-H. To, M. Jung, S. Lycett, O. Gascuel, Syst Biol. 2016 Jan;65(1):82-97.


lsd2's Issues

Understanding date notation in date file

Hi @tothuhien,
I'm looking at your examples for date notation in the date file and could use help getting this right for my analysis.

The example on the readme:

5			# number of temporal constraints
A 1999.2		# the date of A is 1999.2
B 2000.1		# the date of B is 2000.1
C l(1990.5)		# the date of C is at least 1990.5
D b(1998.21,2000.5)	# the date of D is between 1998.21 and 2000.5
mrca(A,B,C) u(2000.12)	# the date of the most recent ancestor of A,B, and C is at most 2000.12

I have an ancestral node that is the MRCA of a monophyletic group that is at least 48.7 MYA, but the max age is not known. Reading the example above, I would think that should be represented as l(-48.7), the node is at least -48.7, but could be older (more negative).

In comparing runs of lsd2 where I use l(-48.7) and u(-48.7), I think I have it backwards and I should be using u(-48.7). l(-48.7) results in an estimated node age of -48.7 from the analysis, but when I use u(-48.7) I get an estimated age that is older, -157.

Am I thinking about "at least" and "at most" in your examples backwards?
Thank you!

Compiling error in Windows

Would you like to support LSD2 in Windows?

I'm using clang for Windows to compile LSD. There is a compilation error, see attached screenshot. It complains that the time() function is not declared. I'll try to fix it.

Allow YYYY-MM-DD for tip and root dates

With the new support of YYYY-MM-DD date format, can you please allow -a rootDate and -z tipDate to accept this format?

Also these options currently require integer. Can you relax this assumption, i.e. a real number is OK?

Thanks a lot for your work, Hien!

Error with setting root date

Hello-
I am trying to date the root node with -a b(-65,-55) and get a syntax error. Is it possible to give the root node an interval?

~/bigdata/lsd2/src/lsd2 -i mytree6.tree -f 100 -s 1748014 -a b(-65,-55) -z 0 -u 0 -l -1 -o ~/bigdata/CH3-150/LCOMP_8_outfiles/LSD_OUT/LSD12

Thanks,
Dylan

Handling unresolved trees

Hi Hien,

Could it be possible to add support for trees with polytomies in LSD2? Now when I try to date one I get:

...
Estimating the root position locally around the given root ...
lsd2: malloc.c:2379: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.

If I randomly resolve polytomies by adding zero branches, everything works.

Or at least a more informative error message should be added, so the user knows what to do :)

Thanks,
Anna

Segmentation fault when all the dates are imprecise

Hi Hien,

I have a data set where the only information available for the sampling is year, and I tried to specify the dates (all of them) as intervals, e.g. b(1970.0,1970.9972602739726) for 1970, then to run LSD2 (v1.4.2.2) as

lsd2 -i {input_tree} -d {input_dates} -v 2 -c -s {sequence_length} -f 1000 -o {work_dir} -e 3

What I get is a segmentation fault:

TREE 1
*PROCESSING:
Reading the tree ... 
Calculating the outlier nodes ...
bash: line 2: 13495 Segmentation fault      (core dumped)

I suspect it might be due to the fact that intervals are not taken into account for outlier detection. In any case at least a more informative error message would be helpful.

Thanks,
Anna

date format

Hi,

Could you tell me how to format the file with the dates?
It works with the year but I don't understand how to format the date with months and possibly the day?
Thanks,

Cyril

Confidence interval values

Hi again Hien,

I have another question about confidence intervals. I have been getting some weird values on some of the nodes. See this tree attached for an example. The root node is supposed to be -65 mya old. The next node up from it is: 2.33388e-310, 2.33388e-310. I am guessing it has to do with the small branch lengths? There are other nodes like it on this tree. Is there a parameter I can use to adjust it so these numbers make sense?

I have also been trying this in iqtree2 the newest version (covid). I get similar errors in both.

Thanks so much again!

LSD14.nexus.pdf

Add example using the files in example/

Can you provide a command line using the files in the examples/ folder so I can write a test for the brew and conda packages?

FYI - lsd2 is now in homebrew science

You can add this to your documentation to install via Homebrew:

brew install brewsci/bio/lsd2

is this okay for me

(base) lixingguangtekiMacBook-puro:bin lixingguang$ ./lsd2_mac -i 755.newick.tre -d 755.date -s 8462 -v 2 -f 1000

TREE 1
*PROCESSING:
Reading the tree ...
Collapse 2 (over 753) internal branches having branch length <= 5.90877e-05
(settable via option -l)
Parameter to adjust variances was set to 0.029848 (settable via option -b)
Minimum branch length of time scaled tree (settable via option -u and -U): 0
Dating under temporal constraints mode ...
Re-estimating using variances based on the branch lengths of the first run ...
Computing confidence intervals using sequence length 8462 and a lognormal
relaxed clock with mean 1, standard deviation 0.2 (settable via option -q)
*RESULTS:

Dating results:
rate 0.00261091, tMRCA 1945.75 , objective function 0.910066
Results of the second run with variances based on results of the first run:
rate 0.00263291, tMRCA 1943.85, objective function 0.951829
Results with confidence intervals:
rate 0.00263291 [0.00237043; 0.00297681], tMRCA 1943.85 [1930.17; 1953.32], objective function 0.951829

TOTAL ELAPSED TIME: 58.9098 seconds
(base) lixingguangtekiMacBook-puro:bin lixingguang$

Relaxed Clock Parameters

I'm very excited about the ability to estimate confidence intervals using a relaxed clock! Do you have any suggestions for how to estimate an appropriate standard deviation for the UCLD based on the data?

Thanks!

Add -V or --version flag

% lsd2 --version
lsd2 1.3

and return error code 0 (no error)

-V is ok too

unrooted bifurcating tree dating could output multifurcating trees

I wonder if there is an option to restrict the output to be bifurcating rooted trees. I have tested options like -l 0.0 -u 0.0, but it can still produce multifurcating trees.

Negative branch lengths despite -c option

The command line is:

lsd -i example.phy.timetree.subst -s 1998 -c -o example.phy.timetree -g example.phy.timetree.outgroup -k

I note that without the outgroup, everything seems fine.

Any idea why?
The tree and outgroup files are attached.

example.phy.timetree.subst.txt

example.phy.timetree.outgroup.txt

malloc: Incorrect checksum for freed object ...: probably modified after being freed

I cloned the github repo, compiled and tried LSD2 with a simple example:

~/software/lsd2/src/lsd2 -i example.phy.treefile -r a -s 1998 -c

However, there is a crash:

TREE 1
*PROCESSING:
Reading the tree ... 
Using the median branch length 0.0981519 to adjust variances ...
Minimum branch length of time scaled tree was set to 0/365 = 0
Estimating the root position on all branches using fast method ...
lsd2(11564,0x1153f9dc0) malloc: Incorrect checksum for freed object 0x7fa3cdd06690: probably modified after being freed.
Corrupt value: 0x0
lsd2(11564,0x1153f9dc0) malloc: *** set a breakpoint in malloc_error_break to debug
Abort trap: 6

Interestingly, this does not always occur, only after a few runs.

Moreover, when I tried the homebrew version, everything works fine:

TREE 1
*PROCESSING:
Reading the tree ... 
Estimating the root position on all branches using fast methode ...
*WARNINGS:
- The results correspond to the estimation of relative dates when T[mrca]=0 and T[tips]=1
*RESULTS:
- Dating results:
 rate 0.298264, tMRCA 0, objective function 148.416

TOTAL ELAPSED TIME: 0.006538 seconds

Do you know what happened? The tree file is attached.
example.phy.treefile.txt

Thanks!
Minh

Issue with CI calculation on a datafile with constraints on internal nodes (v2.3)

Hi Hien,

I have tried to redo an old analysis (which worked with lsd2 v1.4.2.2) and ran into a segmentation fault.

Here is the tree: rooted_tree.nwk and the dates file: lsd2.dates, which has some constraints on some internal nodes, looking like:

mrca(CU2281-16,CU443-12)	b(1900,2008.62977767797)

With lsd2 v1.4.2.2, I was running the following command (successfully):

lsd2 -i rooted_tree.nwk -d lsd2.dates -v 2 -c -s 3894 -e 3 -f 100

With lsd v2.3, I can run the dating without CIs with no problem:

lsd2 -i rooted_tree.nwk -d lsd2.dates -v 2 -s 3894 -e 3

However, adding CIs (-f 100):

lsd2 -i rooted_tree.nwk -d lsd2.dates -v 2 -s 3894 -e 3 -f 100

leads to the following error:

TREE 1
*PROCESSING:
Reading the tree ... 
Collapse 23 (over 384) internal branches having branch length <= 0.000128403
 (settable via option -l)
Parameter to adjust variances was set to 0.013743 (settable via option -b)
Calculating the outlier nodes with Zscore threshold 3 (setable via option -e)...
Minimum branch length of time scaled tree (settable via option -u and -U): 0
Dating under temporal constraints mode ...
Re-estimating using variances based on the branch lengths of the first run ...
Computing confidence intervals using sequence length 3894 and a lognormal
 relaxed clock with mean 1, standard deviation 0.2 (settable via option -q)
Segmentation fault (core dumped)

If I remove the constraints from the date file, everything works smoothly.

Cheers,
Anna

Details about confidence intervals

Hi Authors,

Thanks for your great work.
I am using IQ-TREE with default parameters to estimate the dates of the nodes. According to IQ-TREE: the confidence intervals are estimated based on a mixture of Poisson and lognormal distributions for a relaxed clock model.

May I know is it the 95% confidence interval or 99% confidence interval?

Thanks so much for your kind help.

Request to change mrca notation

LSD allows to specify the date of ancestral node via mrca(A,B,C) for the most recent common ancestor of A,B,C. To be honest, this is quite difficult to remember due to the mrca notation.

How about just simply use a list A,B,C?

Together with my suggestion about the rate range format, the example may look like:

5			# number of temporal constraints
A 1999.2		# the date of A is 1999.2
B 2000.1		# the date of B is 2000.1
C 1990.5:NA		# the date of C is at least 1990.5
D 1998.21:2000.5	# the date of D is between 1998.21 and 2000.5
A,B,C NA:2000.12	# the date of the most recent ancestor of A,B, and C is at most 2000.12

What do you think?

Allow the user to choose -s behaviour

Hi Hien,

I have a full-genome alignment of length 3,985,129 and it's a shame to have it replaced by 1000 for the CI estimation, as the CIs I get are unreasonably large for the dates, e.g. 2015.26 [1995.59 - 2016.04], while non-existent for the rates, e.g. 1.47261e-06 [1.47261e-06; 1.47261e-06].

It would be better in my opinion to keep 1000 as a default but use the user-supplied value if it's available.

Cheers,
Anna

Date file with the format yyyy-mm-dd

Dear Tothuhien,

How can I use a date file with the format:

tip_1 2020-01-15
tip_2 2020-02-26
tip_3 2019-12-31
tip_4 2020-06-45

to perform a dated tree?

Missing description

Segmentation fault with -g option

The command line was:

lsd2 -i aln.fa.timetree.subst -s 30285 -c -o aln.fa.timetree -g aln.fa.timetree.outgroup -k -d aln.fa.timetree.date

And screen output:

Removing outgroups of tree 1 ...
Segmentation fault: 11

Without -g option and add -r a, it works fine.

Any idea?

Thanks!

The input files are attached.

aln.fa.timetree.subst.txt
aln.fa.timetree.outgroup.txt
aln.fa.timetree.date.txt

Add -Ofast to compiler flags ?

This adds optimization which makes a big difference

Request for date range specification

A nice feature of LSD is to allow users to specify the range of the dates in case of uncertainty. So you have b(x,y) for date range (x,y); u(x) for (-inf, x) and l(x) for the range (x,+inf). While this is OK, I think this notation of b, u, l is not intuitive and quite difficult to remember.

Why not just use the same math notation I wrote above?

An even easier solution for biologists (who are not mathy) is to use the range notation in R, because many biologists nowadays know R:

x:y for the range (x,y)

You can extend this format, say x: or x:NA if the upper bound is +inf and :y or NA:y if the lower bound is -inf. (NA for not available).

I prefer this format, which is more user-friendly.

What do you think? It is possible to implement this in LSD?

Perhaps you can still allow the b, u, l notation at the same time, for backward compatibility.

Segmentation fault

Dear Hien,
I'm trying to run the latest versions of LSD2 (1.4.2, 1.4.2.1) and getting a segmentation fault:

> lsd2 -i ./rooted_tree.nwk -d ./lsd2.dates -v 2 -c -s 3512 -f 1000 -o ./lsd2_wd -e 3

TREE 1
*PROCESSING:
Reading the tree ... 
Calculating the outlier nodes ...
Segmentation fault (core dumped)

while the previous version (1.4) was working fine on the same data (available here).

Use standard CXX and CXXFLAGS

Current

CC=g++
CPPFLAGS=-std=c++11

Should be CXX and CXXFLAGS

Add functionality to include LSD2 as a library

It would be great being able to call LSD2 within another program, because it's often used as downstream analysis to tree reconstruction. So adding some API to call LSD2 as a library is desirable, which should meet the following:

Converting input/output to C++ stream:
One main issue is the input/output, which right now only works with files. It's better if the program allows to do input/output from memory to speed up the interface. However, this is not easy because LSD2 currently uses FILE data structure of C.

One solution is to convert everything into C++ istream for input and ostream for output. This re-factoring would allow us to pass everything to the library via in-memory structure such as stringstream and don’t have to rely on external disc files.

Ensuring thread-safety:
The API should be thread safe as it might be called on different threads (e.g. to date many trees in parallel). So for example, the functions should not use global or static variables.

Is it possible to do that?

Option -n broken

I stated the problem here, initially I thought it was a problem of the R wrapper.

Any version after 1.7.1 breaks when using -n != 1.
This problem carries over to IQtree and the R version.

Accepting dates in ISO format

Right now the date file for -d option must contain dates as integer.

Is it possible to accept dates in ISO 8601 format, such as "2020-04-17" for 17 April 2020?

This is a common format that people used to store the dates of the sample. It's quite handy, also many users wouldn't know how to convert back and forth to integers, that LSD supports.

Thanks!

Please add install instructions

ie.

need g++ with C++11 support
make -C src
lsd2 binary is in src/ folder

More flexible date format

LSD allows the date format YYYY-MM-DD, which is very useful. thanks!

However, in my data there are some samples dated 2020-01, meaning that we don't have the exact day and only know that they were sampled in Jan 2020. Is it possible for you to accept this format?

I can of course pre-process the date again to the range from 2020-01-01 to 2020-01-31. But that's a bit tedious because different months have different number of days...

So can you allow YYYY-MM, just to specify the year and month of the sample? And internally LSD will convert it into a range and automatically account for uncertainty.

Also I have other samples with just YYYY, i.e. we only know the year of the sample. For example, 2019 really means the range from 2019-01-01 to 2019-12-31.

Is it also possible for accept this format? I know this is ambiguous. But perhaps if you see that at least another taxon has the date YYYY-MM-DD, then a simple YYYY would really mean that. Only when there is no YYYY-MM-DD for any taxa, then YYYY will really mean that number.

Sorry for a lot of requests... I'm just thinking about how to make LSD most user-friendly. Hopefully then people will come to use your software.

Thanks!

Outliers not being removed

Hi Hien,

There seems to be a problem with outlier removal in LSD2 v1.4.2.3:
I'm running it as

lsd2 -i {input_tree} -d {input_dates} -v 2 -c -s {sequence_length} -o {output_name} -f 1000 -e 3

It detects 32 outliers but they are present in the output {output_name}.date.nexus tree.

My tree and dates are here.

Better error message for all unknown dates

Hi Hien,

I have a very corner-case example of a tree with a few tips and a date file that contains some dates for the ids that are not in the tree, and some very large date intervals (all the same) for the ids that are in the tree (arisen due to a naming problem ;) ).
I try to run LSD2 as follows:

lsd2 -i tree.nwk -d dates.tab -v 2 -c  -f 1000 -o tree_lsdated2 -r a -e 3 -s 1000

and obtain a segmentation fault:

TREE 1
*PROCESSING:
Reading the tree ... 
Parameter to adjust variances was set to 0.000684512 (settable via option -b)
Calculating the outlier nodes ...
Segmentation fault (core dumped)

What I would like to obtain instead is something like: "There are not enough temporal constraints provided to date this tree".

The tree and the dates are here.
