Giter VIP home page Giter VIP logo

simulation-ratevariation's Introduction

pipeline to simulate data with rate variation

Code to reproduce the simulation study in: Frankel & Ané (2023) "Summary tests of introgression are highly sensitive to rate variation across lineages".

Preprint available here: https://www.biorxiv.org/content/10.1101/2023.01.26.525396v1

Author information:

goal and model

Goal: simulate sequence data under a network model with ILS and rate variation, and quantify the performance of various methods and assess their robustness to various types of rate variation.

Model of rate variation: there are different types of branch lengths.

  • branch lengths in coalescent units in the species network: those were not varied. If they varied across lineages, then the network could be time inconsistent (2 paths from the root to a given hybrid nodes could have different lengths) and non-ultrametric (2 paths from the root to 2 different tips could have different lengths).
  • branch lengths in substitutions per site on the species network and in gene trees: this type of rate-variation was assessed.
    • they could vary across lineages, making the network (in subst/site) either time-inconsistent and/or non-ultrametric

    • they could also vary across genes: some genes overall slow, others overall fast

    • they could also vary beyond variation across lineages and across genes. for example, the length in substitution/site of a given branch e in a given gene g, from some lineage l in the network, could be this:

      t_e × r_l × r_g × r_lg × r_e

      where:
      t_e is the original length of edge e,
      r_l = rate for lineage l, shared across all genes for edges in this lineage,
      r_g = rate for gene g, shared across all edges for this gene tree
      r_lg = rate specific for edge e in gene l (shared across all individual copies of that gene in the same lineage, until and after they coalesce within this lineage),
      r_e = residual rate variation for edge e in gene g, independent of lineage if evolves in.

code

in scripts:

  • simulatedata.jl to simulate:
    • networks using (via RCall) the R package SiPhyNetwork
      warning: this is not simulating any rate variation in coalescent units.
    • gene trees within the network under the coalescent using PhyloCoalSimulations
    • rate variation in substitutions / site: scale branch lengths in gene trees
    • sequences along these gene trees using seqgen.
  • calcD.jl calculates the ABBA-BABA test and the D3 test. earlier version on this gist to share it publically, after LF shared it with Claudia.
  • outgroup_reroot.jl to find the outgroup of a simulated network. also on this gist
  • infer_networks.jl with helper functions for inferring networks/hybridization. currently contains functions to set up & convert files for HyDe input.
  • fasta2phylip.py file conversion script, called by simulation_pipeline function in infer_networks.jl.
  • simpipeline.jl used for parallelization simulation_pipeline function with multiple parameter sets, and then concatenating/summarizing results with summarize_hyde_dstat_d3 from infer_networks.jl.

dependencies

  • julia packages: see Project.toml
  • seqgen for the simulation of sequences: Seq-Gen v1.3.4 downloaded from github at commit from 2019-08-29.
  • HyDe: python package, using Python 3, see notes/software_install.md

preliminary runs

See choosing_params.ipynb for preliminary runs to look at networks, and to choose parameters to simulate networks using SiPhyNetworks: lambda, mu, nu, the proportions of the 3 hybrid types, and the success hybridization probability curve given the average genetic distance between the 2 taxa.

data processing & visualization

Relevant files are in data_processing_viz:

  1. data_processing.Rmd: combines individual parameter sets' D, D3, HyDe results into alldatasets_dstat.csv, alldatasets_d3.csv, and alldatasets_hyde_filtered.csv, all found in output
  2. resultviz.Rmd: creates the type-1 error and power plots in the paper, gets max, min, mean # subnetworks per category for plots
  3. paper_plots.jl: creates example h1 (h) and h2 (h') plot for the paper

output

folder structure

the output folder has structure:

output
├── alldatasets_d3.csv
├── alldatasets_dstat.csv
├── alldatasets_hyde_filtered.csv
├── 10-taxa_1000-genes_1000-sites_0.0-lindist_0.0-linbygenedist_0.0-genedist_0.0-edgedist_0.03-subrate_1.0-coalmedian
│    └──  scalednets.txt
├── ...
└── 10-taxa_1000-genes_1000-sites_0.7-lindist_0.7-linbygenedist_0.7-genedist_0.7-edgedist_0.03-subrate_1.0-coalmedian
    └── scalednets.txt

... collapses the 14 other folders from the 16 parameter sets ran. There were 16 parameter sets, derived from all of the possible combinations of no rate variation / yes rate variation for r_l (lindist), r_g (genedist), r_lg (linbygenedist), r_e (edgedist). Directories are named accordingly. Rates were pulled from a log-normal distribution with mean 1 and standard deviation 0.0 (no lineage rate variation) or 0.7 (yes lineage rate variation) on the log scale.

In the three csv files, they all share some column structure:

  • H1: number of reticulations in that taxon triplet before iteratively shrinking any 2/3-cycle
  • H2: number of reticulations in that taxon triplet after iteratively shrinking any 2/3-cycle
  • Gamma1: true gamma (proportion of genomic contribution from a parent species to the hybrid species) value of the first hybridization event, if applicable (NA value if not)
  • Gamma2: true gamma value of the second hybridization event, if applicable (NA value if not)
  • Gamma3: true gamma value of the third hybridization event, if applicable (NA value if not)
  • coalmedian: median edge length of branches species network was scaled to (1.0 for all parameter sets)
  • subrate: substitution rate in units substitutions/site/edge (0.03 for all parameter sets)
  • lindist: value for standard deviation on the log scale of log-normal distribution with mean 1, used to pull r_l values. 0.0 = no lineage rate variation, 0.7 = yes lineage rate variation
  • linbygenedist: value for standard deviation on the log scale of log-normal distribution with mean 1, , used to pull r_lg values. 0.0 = no lineage-by-gene rate variation, 0.7 = yes lineage-by-gene rate variation
  • genedist: value for standard deviation on the log scale of log-normal distribution with mean 1, , used to pull r_g values. 0.0 = no gene rate variation, 0.7 = yes gene rate variation
  • edgedist: value for standard deviation on the log scale of log-normal distribution with mean 1, , used to pull r_e values. 0.0 = edge lineage rate variation, 0.7 = yes edge rate variation
  • simid: unique value for each parameter set combination (values range from 1-16)

alldatasets_d3.csv

This is a comma-delimited file containing the results for D3 for every possible ingroup triplet across 100 10-taxon replicate networks for each parameter set. Each row contains a triplet and its values (detailed below). Its size is 15.8MB with 79903 rows and 27 columns.

D3 specific column structure:

  • taxa1: outgroup taxon, not used in these calculations
  • taxa2, taxa3 taxa4: taxa used in calculation
  • Rep: replicate number the triplet originated from
  • TotalSites: number of sites used
  • D13: pairwise genetic distance between t1 (one of the taxa in taxa2, taxa3, taxa4) and t3 (D3_Nonsister below)
  • D23: pairwise genetic distance between t2 (one of the taxa in taxa2, taxa3, taxa4) and t3 (D3_Nonsister below)
  • D3: D3 value calculated from D13 and D23
  • D3_sd: standard deviation of D3 across 5000 bootstrap samples
  • Z: Z-value
  • P: P-value
  • D3_Nonsister: taxon from taxa2, taxa3, taxa4 that is treated as nonsistert3 in calculations
  • taxonset: combination of taxa2, taxa3, and taxa4 values

alldatasets_dstat.csv

This is a comma-delimited file containing the results for the D-statistic for every possible ingroup triplet across 100 10-taxon replicate networks for each parameter set. Each row is a triplet and its values (detailed below). Its size is 12.3MB with 79903 rows and 26 columns.

D specific column structure:

  • taxa1: outgroup taxon
  • taxa2, taxa3, taxa4: ingroup taxa
  • Rep: replicate number the triplet originated from
  • TotalSites: number of sites used
  • ABBAcount: number of ABBA patterns
  • BABAcount: number of BABA patterns
  • BBAAcount: number of BBAA patterns
  • D: D-value
  • Dsd: standard deviation of D across 5000 bootstrap samples
  • Z: Z-value
  • P: P-value

alldatasets_hyde_filtered.csv

This is a comma-delimited file containing the results for HyDe for every possible ingroup triplet across 100 10-taxon replicate networks for each parameter set. We calculated the HyDe statistic for each of the 3 possible choices of the putative hybrid species for a given ingroup triplet, and we only retained the choice with the smallest p-value and applied a Bonferroni correction for multiple testing. Each row is a triplet and its values (detailed below). The file's size is 13.7MB with 79903 rows and 24 columns.

HyDe specific column structure:

  • outgroup: outgroup used for HyDe statistic
  • P1: parent 1 taxon
  • Hybrid: hybrid taxon
  • P2: parent 2 taxon
  • Rep: see above
  • Zscore: Z-score
  • origPvalue: P-value before Bonferroni correction
  • HyDeGamma: gamma value inferred from HyDe
  • GammaDiff: difference between Gamma1 and HyDeGamma, if applicable
  • taxonset: combination of P1, Hybrid, and P2 values, used for Bonferroni correction/filtering
  • Pvalue: P-value after Bonferroni correction

scaled nets

scalednets.txt, found in each individual parameter set's directory, are files containing 100 (1 per line, for each replicate) 10-taxa networks in HybridNetwork format for each parameter set. These networks were simulated using SiPhyNetwork, then scaled all branches to have a median edge length of 1.0, which we used as measuring coalescent units (to later simulate a moderate level of ILS).

simulation-ratevariation's People

Contributors

cecileane avatar laufran avatar

Stargazers

 avatar

Watchers

 avatar

simulation-ratevariation's Issues

version number

Which versions of SiPhyNetwork and of ape are you using? (or which commit for SiPhyNetwork)?

When I tried to run your jupyter notebook, I couldn't use SiPhyNetwork, so I upgraded my version of R and RStudio... which led me to several hours trying to compile packages. I was finally able to re-install SiPhyNetwork. And now your jupyter notebook throws me an error, which traces back to R"write.net(net, tol=1e-12)" .

calcPatternCounts in calcD.jl MethodError

I'm trying the following example that we use in the docstrings and getting a MethodError:

> file = "scripts/examples/tinyexample.fasta"
> calcPatternCounts(file, ["P2","P5","P3","P1"])
MethodError: Cannot `convert` an object of type LongAA to an object of type String
Closest candidates are:
  convert(::Type{String}, ::String) at essentials.jl:210
  convert(::Type{String}, ::WeakRefStrings.WeakRefString) at /Users/laurenfrankel/.julia/packages/WeakRefStrings/AN9bY/src/WeakRefStrings.jl:81
  convert(::Type{String}, ::FilePathsBase.AbstractPath) at /Users/laurenfrankel/.julia/packages/FilePathsBase/qgXdE/src/path.jl:117
  ...

Stacktrace:
 [1] calcPatternCounts(file::String, taxa::Vector{String}, nsites_perlocus::Nothing; ambig::Symbol)
   @ Main ~/Desktop/Ane/simulation-ratevariation/scripts/calcD.jl:86
 [2] calcPatternCounts (repeats 2 times)
   @ ~/Desktop/Ane/simulation-ratevariation/scripts/calcD.jl:78 [inlined]
 [3] top-level scope
   @ In[2]:2
 [4] eval
   @ ./boot.jl:360 [inlined]
 [5] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1116

Seems to be an error in line 86 of calcD.jl: alg[i] = LongDNASeq(convert(String, alg[i])) while trying to Convert. Perhaps due to me upgrading packages in the last week?

I try with other fastas that I've simulated and it seems to work so I'm not sure what the deal is.

> calcPatternCounts("output/fourtaxa_sequence.fasta", ["t1", "t2", "t7", "t4"])
([1000], [28], [25], [25])

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.