pbwt's Introduction

The pbwt package provides a core implementation and development environment for PBWT (Positional Burrows-Wheeler Transform) methods for storing and computing on genome variation data sets.

More precisely, PBWT supports a run-length compressed representation of aligned haplotype data, on which efficient matching algorithms can be built. PBWT compression is typically much better than generic compression, particularly for large numbers of haplotypes, and the search algorithms are linear in the query size, independent of the reference size.
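To make the idea concrete, here is a minimal Python sketch of the positional prefix ordering described in the paper cited below, followed by run-length encoding of each reordered column. It is an illustration only, not the code or file format used by the pbwt package; the function names `pbwt_columns` and `run_length_encode` are hypothetical, and alleles are assumed to be biallelic 0/1 values.

```python
from itertools import groupby

def pbwt_columns(haplotypes):
    """Yield, for each site k, the alleles at k in positional prefix order.

    haplotypes: list of equal-length lists of 0/1 alleles, one per haplotype.
    Before site k the haplotypes are sorted by their reversed prefixes, so
    haplotypes that match just before k sit next to each other and tend to
    carry the same allele at k, which produces long runs.
    """
    n_sites = len(haplotypes[0]) if haplotypes else 0
    order = list(range(len(haplotypes)))  # before site 0: input order
    for k in range(n_sites):
        yield [haplotypes[i][k] for i in order]
        # Stable partition: allele 0 first, then allele 1, keeping relative order.
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones

def run_length_encode(column):
    """Collapse a column into (allele, run length) pairs."""
    return [(allele, len(list(run))) for allele, run in groupby(column)]

haps = [[0, 1, 0], [1, 1, 0], [0, 1, 0], [1, 0, 1]]
for column in pbwt_columns(haps):
    print(column, run_length_encode(column))
# [0, 1, 0, 1] [(0, 1), (1, 1), (0, 1), (1, 1)]
# [1, 1, 1, 0] [(1, 3), (0, 1)]
# [1, 0, 0, 0] [(1, 1), (0, 3)]
```

Each update is a single pass over the haplotypes, so building all columns takes time proportional to the number of haplotypes times the number of sites.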

A description of the basic data structure and matching algorithms is given in "Efficient haplotype matching and storage using the Positional Burrows-Wheeler Transform (PBWT)", Richard Durbin, Bioinformatics 30:1266-72 (2014).

There are various phasing and imputation methods in the software that are not yet published.

Richard Durbin

May 2013, updated September 2014

Installation instructions

Download htslib from https://github.com/samtools/htslib, and compile it

git clone https://github.com/samtools/htslib
cd htslib
make
cd ..

Download and make pbwt

git clone https://github.com/richarddurbin/pbwt
cd pbwt
make

Brief usage instructions

Typing

pbwt

by itself gives a list of commands with brief descriptions.

A quick synopsis for usage is:

macs 11000 1e6 -t 0.001 -r 0.001 > 11k.macs
pbwt -checkpoint 10000 -readMacs 11k.macs -write macs11k.pbwt -writeSites macs.sites

NB "-checkpoint 10000" writes out files every 10000 sites during the conversion, alternating between checkA.{pbwt,sites} and checkB.{pbwt,sites} files.

pbwt -read macs11k.pbwt -subsample 0 10000 -write macs10k.pbwt
pbwt -read macs11k.pbwt -subsample 10000 1000 -write macs1k.pbwt
pbwt -read macs10k.pbwt -sfs > macs10k.sfs

gives the site frequency spectrum for macs10k
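As a reminder of what a site frequency spectrum is, the sketch below counts, for each site, how many haplotypes carry allele 1, and then tallies how many sites share each count. It is a plain Python illustration under the assumption of biallelic 0/1 data; the function name `site_frequency_spectrum` is hypothetical, and the output format of `pbwt -sfs` is not described here and may differ.

```python
from collections import Counter

def site_frequency_spectrum(haplotypes):
    """Return {number of haplotypes carrying allele 1: number of such sites}.

    haplotypes: list of equal-length lists of 0/1 alleles, one per haplotype.
    """
    n_sites = len(haplotypes[0]) if haplotypes else 0
    spectrum = Counter()
    for k in range(n_sites):
        allele_count = sum(h[k] for h in haplotypes)
        spectrum[allele_count] += 1
    return dict(sorted(spectrum.items()))

haps = [[0, 1, 0], [1, 1, 0], [0, 1, 0], [1, 0, 1]]
print(site_frequency_spectrum(haps))  # {1: 1, 2: 1, 3: 1}
```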

pbwt -read macs1k.pbwt -haps macs1k.haps

writes out the haplotypes stored in macs1k

pbwt -read macs10k.pbwt -matchDynamic macs1k.pbwt > macs1k-10k.max

for each sequence in macs1k, finds maximal matches to anything in macs10k

pbwt -read macs10k.pbwt -maxWithin > macs10k.max

finds maximal matches for each sequence in macs10k to anything else in macs10k
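The following Python sketch spells out what a maximal match within a panel means, using the set-maximal definition from the paper cited above: a match between two haplotypes over an interval of sites that cannot be extended in either direction and is not strictly enclosed by a match to any other haplotype. This is a deliberately naive brute-force version for small examples, not the PBWT algorithm that pbwt uses, and its tuple output is an assumption rather than the format pbwt writes. The function names are hypothetical.

```python
def maximal_runs(a, b):
    """Return half-open intervals [start, end) where a and b agree and
    the agreement cannot be extended on either side."""
    runs, start = [], None
    for k, (x, y) in enumerate(zip(a, b)):
        if x == y:
            if start is None:
                start = k
        elif start is not None:
            runs.append((start, k))
            start = None
    if start is not None:
        runs.append((start, len(a)))
    return runs

def set_maximal_matches_within(haplotypes):
    """Return sorted (i, j, start, end) tuples: haplotype i matches j over
    [start, end) and no other match involving i strictly encloses it."""
    result = []
    for i, hap in enumerate(haplotypes):
        candidates = [(s, e, j)
                      for j, other in enumerate(haplotypes) if j != i
                      for s, e in maximal_runs(hap, other)]
        for s, e, j in candidates:
            enclosed = any(s2 <= s and e <= e2 and (s2, e2) != (s, e)
                           for s2, e2, _ in candidates)
            if not enclosed:
                result.append((i, j, s, e))
    return sorted(result)

haps = [[0, 1, 0], [1, 1, 0], [0, 1, 0], [1, 0, 1]]
for match in set_maximal_matches_within(haps):
    print(match)
# (0, 2, 0, 3)
# (1, 0, 1, 3)
# (1, 2, 1, 3)
# (1, 3, 0, 1)
# (2, 0, 0, 3)
# (3, 1, 0, 1)
```

The brute-force version compares every pair of haplotypes at every site, so its cost grows quadratically with the number of haplotypes. Avoiding that pairwise comparison is the point of the PBWT-based matching.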

To start from real data in a .vcf file rather than a macs simulation, use

pbwt -checkpoint 10000 -readVcfGT data.vcf -writeAll data

Note that -writeAll xxx will write xxx.pbwt, xxx.sites, xxx.samples and any other associated files, and -readAll xxx will correspondingly read xxx.pbwt and any available files based on suffix.

pbwt comfortably handles up to 100,000 haplotypes, and probably up to a million.
