remytuyeras / haplodynamics

A python library to develop genomic data simulators

Home Page: http://www.normalesup.org/~tuyeras/node_diss/blg/haplodx.html

License: GNU General Public License v3.0

Python 100.00%
allele-frequencies genomics-data hardy-weinberg-equilibrium linkage-disequilibrium population-genetics population-genomics simulation simulator vcf vcf-files gwas gwas-dataset dna dna-sequences genomics microarray-data

haplodynamics's Introduction

HaploDynamics

A python library to simulate genomic data

Presentation

HaploDynamics (HaploDX) is a Python 3+ library that provides a collection of functions for simulating population-specific genomic data. The package is part of the Genetic Simulator Resources (GSR) catalog, which can be accessed by clicking on the image below.

Catalogued on GSR

Highlights and updates

Five reasons to use this package:

  • An intuitive user interface for writing short, concise Python code that achieves realistic simulations.
  • Speed and efficiency, with a lightweight implementation that allows for fast generation of simulations.
  • Flexibility, with the ability to mix your own models with the framework to create custom simulations.
  • A comprehensive set of arithmetic operations (coming soon) for working with multiple VCF files.
  • Detailed documentation with thorough tutorials and performance analyses to help you get started quickly.

Release v0.4b*:

Installation

Installation via pip

Install the HaploDynamics package by using the following command.

$ pip install HaploDynamics

After this, you can import the modules of the library to your script as follows.

import HaploDynamics.HaploDX as hdx
import HaploDynamics.Framework as fmx

To upgrade the package to the 0.4b1 release (a pre-release, hence the explicit version pin), use the following command.

$ pip install --upgrade HaploDynamics==0.4b1
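
To confirm which version ended up installed, a minimal sketch using the standard library's importlib.metadata is given below. The helper name installed_version is hypothetical and not part of HaploDynamics.

```python
from importlib import metadata


def installed_version(package: str = "HaploDynamics"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string, or None when the package is not installed.
print(installed_version())
```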

Manual installation

HaploDynamics uses the SciPy library for certain calculations. To install SciPy, run the following command, or see SciPy's installation instructions for more options.

$ python -m pip install scipy

You can install the HaploDynamics GitHub package by using the following command in a terminal.

$ git clone https://github.com/remytuyeras/HaploDynamics.git

Then, use the pwd command to get the absolute path leading to the downloaded package.

$ ls
HaploDynamics
$ cd HaploDynamics/
$ pwd
absolute/path/to/HaploDynamics

To import the modules of the library to your script, you can use the following syntax where the path absolute/path/to/HaploDynamics should be replaced with the path obtained earlier.

import sys
sys.path.insert(1,"absolute/path/to/HaploDynamics")
import HaploDynamics.HaploDX as hdx
import HaploDynamics.Framework as fmx
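
If the import fails, the path is the usual culprit. The sketch below checks the directory before adding it to sys.path. It assumes that the cloned repository contains a package folder named HaploDynamics, as the import statements above suggest. The helper add_repo_to_path is hypothetical.

```python
import sys
from pathlib import Path


def add_repo_to_path(repo_dir: str) -> bool:
    """Add `repo_dir` to sys.path if it holds a HaploDynamics package folder.

    Returns True when the path is usable, False otherwise.
    """
    repo = Path(repo_dir).expanduser().resolve()
    # Assumption: the modules live in <repo_dir>/HaploDynamics/.
    if not (repo / "HaploDynamics").is_dir():
        return False
    if str(repo) not in sys.path:
        sys.path.insert(1, str(repo))
    return True
```

A return value of False suggests that the path points one level too high or too low in the directory tree.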

Quickstart

The following script generates a VCF file containing simulated diploid genotypes for a population of 1000 individuals with LD-blocks of length 20kb, 5kb, 20kb, 35kb, 30kb and 15kb.

import HaploDynamics.HaploDX as hdx

simulated_data = hdx.genmatrix([20,5,20,35,30,15],strength=1,population=0.1,Npop=1000)
hdx.create_vcfgz("genomic-data.simulation.v1",*simulated_data)

The argument strength=1 forces a high amount of linkage disequilibrium, and the argument population=0.1 increases the likelihood that the simulated population carries rare mutations (e.g. to simulate a population profile close to African and South-Asian populations).
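
To sanity-check a generated file, the sketch below counts the samples and variant records in a gzip-compressed VCF. It relies only on the standard VCF layout (nine fixed columns before the sample columns) and on Python's gzip module. The function summarize_vcf is hypothetical and not part of HaploDynamics. The exact file name written by create_vcfgz() is not stated here, so the path is left as an argument.

```python
import gzip


def summarize_vcf(path: str):
    """Return (number of samples, number of variant records) for a gzipped VCF."""
    n_samples = 0
    n_records = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                continue  # meta-information lines
            if line.startswith("#CHROM"):
                # Sample names follow the nine fixed columns (CHROM ... FORMAT).
                n_samples = max(len(line.rstrip("\n").split("\t")) - 9, 0)
                continue
            if line.strip():
                n_records += 1
    return n_samples, n_records
```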

More generally, the function genmatrix() takes the following types of parameters:

| Parameter | Type | Values |
|---|---|---|
| blocks | list[int] | List of positive integers, ideally between 1 and 40. |
| strength | float | From -1 (little linkage) to 1 (high linkage) |
| population | float | From 0 (for more rare mutations) to 1 (for less rare mutations) |
| Npop | int | Positive integer specifying the number of individuals in the genomic matrix |
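
The ranges listed for these parameters can be checked before launching a long simulation. The following is a minimal sketch, not part of the library: check_genmatrix_args is a hypothetical helper, and it only enforces the hard constraints (the "ideally between 1 and 40" recommendation is not enforced).

```python
def check_genmatrix_args(blocks, strength, population, Npop):
    """Raise ValueError if an argument falls outside its documented range.

    Returns the sum of the block sizes as a convenience.
    """
    if not blocks or not all(isinstance(b, int) and b > 0 for b in blocks):
        raise ValueError("blocks must be a non-empty list of positive integers")
    if not -1 <= strength <= 1:
        raise ValueError("strength must lie between -1 and 1")
    if not 0 <= population <= 1:
        raise ValueError("population must lie between 0 and 1")
    if not isinstance(Npop, int) or Npop <= 0:
        raise ValueError("Npop must be a positive integer")
    return sum(blocks)


# Arguments of the Quickstart example; the block sizes add up to 125.
total = check_genmatrix_args([20, 5, 20, 35, 30, 15], 1, 0.1, 1000)
```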

The time needed to generate each locus in a VCF file tends to grow linearly with the parameter Npop. On average, a genetic variant takes between 0.3 and 1 second to generate when Npop=100000 (this may vary depending on your machine). The estimated time complexity for an average machine is shown below.
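
As a back-of-the-envelope illustration of this linear scaling, the sketch below extrapolates the quoted range (0.3 to 1 second per variant at Npop=100000) to other population sizes. It assumes strict proportionality to Npop, which is only an approximation, and the helper estimate_generation_time is hypothetical. The number of variants has to be supplied by the user, since it is not derived here from the block sizes.

```python
def estimate_generation_time(n_variants: int, npop: int):
    """Return a rough (low, high) estimate in seconds, proportional to npop."""
    reference_npop = 100_000
    low_per_variant, high_per_variant = 0.3, 1.0  # seconds at reference_npop
    scale = npop / reference_npop
    return (n_variants * low_per_variant * scale,
            n_variants * high_per_variant * scale)


# Rough range for 500 variants and 20000 individuals.
low, high = estimate_generation_time(500, 20_000)
```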

Use cases

The following script shows how to display linkage disequilibrium correlations for the simulated data.

import matplotlib.pyplot as plt
import HaploDynamics.HaploDX as hdx

simulated_data = hdx.genmatrix([20,20,20,20,20,20],strength=1,population=0.1,Npop=1000)
hdx.create_vcfgz("genomic-data.simulation.v1",*simulated_data)

rel, m, _ = hdx.LD_corr_matrix(simulated_data[0])
plt.imshow(hdx.display(rel,m))
plt.show()

A typical output for the previous script should look as follows.
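
For readers unfamiliar with the quantity being displayed, the sketch below computes a Pearson correlation between the genotype dosages of two variants, a common way to quantify linkage disequilibrium. It is an independent illustration in pure Python. It is not taken from LD_corr_matrix(), whose internal computation may differ, and the toy dosage vectors are made up.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length genotype dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic variant carries no correlation signal
    return cov / sqrt(var_x * var_y)


# Toy dosages (0, 1 or 2 copies of the alternate allele) for six individuals.
variant_a = [0, 1, 2, 1, 0, 2]
variant_b = [0, 1, 2, 1, 0, 2]  # identical to variant_a
variant_c = [2, 1, 0, 1, 2, 0]  # mirror image of variant_a

r_ab = pearson_r(variant_a, variant_b)
r_ac = pearson_r(variant_a, variant_c)
```

Squaring the result gives the r² statistic often reported for pairs of variants.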

The following script shows how you can control linkage disequilibrium by using LD-blocks of varying lengths. You can display the graph relating distances between pairs of variants to average correlation scores by using the last output of the function LD_corr_matrix().

import matplotlib.pyplot as plt
import HaploDynamics.HaploDX as hdx

ld_blocks = [5,5,5,10,20,5,5,5,5,5,5,1,1,1,2,2,10,20,40]
strength=1
population=0.1
Npop = 1000
simulated_data = hdx.genmatrix(ld_blocks,strength,population,Npop)
hdx.create_vcfgz("genomic-data.simulation.v1",*simulated_data)

#Correlations
rel, m, dist = hdx.LD_corr_matrix(simulated_data[0])
plt.imshow(hdx.display(rel,m))
plt.show()

#from genetic distances to average correlations
plt.plot([i for i in range(len(dist)-1)],dist[1:])
plt.ylim([0, 1])
plt.show()

Typical outputs for the previous script should look as follows.

[Figures: correlations; genetic distances to average correlations]
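
The idea of relating distance to average correlation can be sketched as follows: given a square correlation matrix, average the absolute correlations of all pairs of variants separated by the same gap. This is an assumption-laden illustration. It measures distance as a difference of row indices rather than a genetic distance, the function average_by_distance is hypothetical, and it is not claimed to reproduce the third output of LD_corr_matrix().

```python
def average_by_distance(corr):
    """Average |corr[i][j]| over all pairs with the same index gap j - i."""
    n = len(corr)
    averages = []
    for gap in range(n):
        values = [abs(corr[i][i + gap]) for i in range(n - gap)]
        averages.append(sum(values) / len(values))
    return averages


# Toy 3x3 correlation matrix (made-up values).
toy = [
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]
profile = average_by_distance(toy)
```

In this sketch, index 0 of the result is the trivial self-correlation, so a plot would normally start at index 1.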

Finally, the following script shows how you can generate large regions of linkage.

import matplotlib.pyplot as plt
import HaploDynamics.HaploDX as hdx

ld_blocks = [1] * 250
strength=1
population=0.1
Npop = 1000
simulated_data = hdx.genmatrix(ld_blocks,strength,population,Npop)
hdx.create_vcfgz("genomic-data.simulation.v1",*simulated_data)

#Correlations
rel, m, dist = hdx.LD_corr_matrix(simulated_data[0])
plt.imshow(hdx.display(rel,m))
plt.show()

#from genetic distances to average correlations
plt.plot([i for i in range(len(dist)-1)],dist[1:])
plt.ylim([0, 1])
plt.show()

Typical outputs for the previous script should look as follows.

[Figures: correlations; genetic distances to average correlations]

To cite this work

Tuyeras, R. (2023). HaploDynamics: A python library to develop genomic data simulators (Version 0.4-beta.1) [Computer software]. DOI


Documentation

haplodynamics's People

Contributors

remytuyeras

haplodynamics's Issues

Improve convolution operation for AFS

For a genetic schema $(b,f,g)$, the stochastic process $f:[0,1] \to [0,1]$ should be convolved with afs_sample() using a boundary test inside the process linkage_disequilibrium():

q = afs(alpha)
# Redraw until q falls within the frequency bounds allowed by the schema
while not (lb_freq(beta, gamma, p, t, shift) <= q <= ub_freq(beta, gamma, p, t, shift)):
    q = afs(alpha)
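
The boundary test above is a form of rejection sampling. A self-contained sketch of the pattern is given below, with a cap on the number of attempts so that an overly narrow interval cannot loop forever. All names are hypothetical: sample_within_bounds is not a library function, the Beta draw is only a stand-in for afs(alpha), and the fixed bounds stand in for lb_freq() and ub_freq().

```python
import random


def sample_within_bounds(draw, lower, upper, max_tries=10_000):
    """Redraw from `draw()` until the value lies in [lower, upper]."""
    for _ in range(max_tries):
        q = draw()
        if lower <= q <= upper:
            return q
    raise RuntimeError("no sample fell inside the bounds; interval may be too narrow")


rng = random.Random(0)  # seeded for reproducibility
# Stand-in for afs(alpha): a Beta(0.5, 2) draw skewed towards rare alleles.
q = sample_within_bounds(lambda: rng.betavariate(0.5, 2.0), 0.05, 0.30)
```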

Create class VCF for handling VCF operations

The class VCF should handle collections of VCF files via pointers (without using much RAM).

Goals

  • merge VCF files (useful for parallel simulations)
  • concatenate VCF files (useful for sequential simulations)
  • truncate VCF files (data extraction)
  • numerate() method to enumerate rows in a regular expression of vcf files

Note

  • metadata and attributes should be handled accordingly
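
As one possible starting point for the concatenation goal, the sketch below streams records from several VCF files into a single output, one line at a time, so that memory use stays small. It is not the proposed VCF class. The function concat_vcf is hypothetical, it assumes all inputs share the same sample columns, and it simply keeps the header of the first file instead of reconciling metadata.

```python
import gzip


def _open_text(path, mode):
    """Open a plain or gzip-compressed text file based on its extension."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode + "t")
    return open(path, mode)


def concat_vcf(paths, out_path):
    """Stream records from `paths` into `out_path`; return the record count."""
    n_records = 0
    with _open_text(out_path, "w") as out:
        for index, path in enumerate(paths):
            with _open_text(path, "r") as handle:
                for line in handle:
                    if line.startswith("#"):
                        if index == 0:
                            out.write(line)  # keep the first header only
                        continue
                    out.write(line)
                    n_records += 1
    return n_records
```

Merging files with different samples, or truncating by region, would need more care (for instance, checking that positions and sample names are compatible).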
