STRUCTURE and ROLLOFF

Reimplementations of STRUCTURE and ROLLOFF for CS 4775 (Computational Genetics and Genomics) at Cornell.

Original STRUCTURE paper: "Inference of Population Structure Using Multilocus Genotype Data" (Pritchard, Stephens, and Donnelly, 2000)

Original ROLLOFF paper: "The History of African Gene Flow into Southern Europeans, Levantines, and Jews" (Moorjani et al., 2011)

Installation

To install in a virtual environment:

python3 -m venv env
source env/bin/activate
pip install -r requirements.txt

Usage

structure.py infers admixture proportions from a dataset of genetic variants in Variant Call Format (VCF) or Eigenstrat (.phgeno) format. It writes its output to an HDF5 file, which visualize.py and rolloff.py then read.

python structure.py [-h] [-k num_populations] [-o output.hdf5]
                    [--profile] [-d drop_frac] [-m num_burn_in_rounds]
                    [-s num_samples] [-c num_rounds_btwn_samples] data_file

Command-line arguments:

  • data_file: the input file, either a VCF file or an Eigenstrat (.phgeno) file
  • -k: the number of populations
  • -o or --out: the HDF5 file to which the output will be written
  • -d or --drop-frac: the fraction of loci to drop
  • -m or --burn-in: the number of burn-in rounds
  • -s or --num-samples: the number of samples to collect
  • -c or --sample-interval: the number of rounds between samples
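The interface above can be sketched with argparse. This is a minimal illustration that mirrors the documented flags, not the parser in structure.py itself: the function names build_parser and sampled_rounds are hypothetical, no defaults are assumed, and the sampling schedule (one sample every sample-interval rounds once burn-in ends) is an assumption about how the three sampling options interact.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the usage line documented above."""
    parser = argparse.ArgumentParser(prog="structure.py")
    parser.add_argument("data_file",
                        help="input VCF or Eigenstrat (.phgeno) file")
    parser.add_argument("-k", type=int, metavar="num_populations",
                        help="number of populations")
    parser.add_argument("-o", "--out", metavar="output.hdf5",
                        help="HDF5 file to which the output is written")
    # --profile appears in the usage line; its behavior is not documented.
    parser.add_argument("--profile", action="store_true")
    parser.add_argument("-d", "--drop-frac", type=float, metavar="drop_frac",
                        help="fraction of loci to drop")
    parser.add_argument("-m", "--burn-in", type=int,
                        metavar="num_burn_in_rounds",
                        help="number of burn-in rounds")
    parser.add_argument("-s", "--num-samples", type=int,
                        metavar="num_samples",
                        help="number of samples to collect")
    parser.add_argument("-c", "--sample-interval", type=int,
                        metavar="num_rounds_btwn_samples",
                        help="number of rounds between samples")
    return parser


def sampled_rounds(burn_in, num_samples, sample_interval):
    """Rounds at which a sample would be kept.

    ASSUMPTION: sampling starts after burn-in and keeps one sample every
    `sample_interval` rounds; the real script may schedule differently.
    """
    return [burn_in + sample_interval * (i + 1) for i in range(num_samples)]


# Parse an example command line without touching sys.argv.
args = build_parser().parse_args(
    ["-k", "3", "-o", "out.hdf5", "-m", "100", "-s", "5", "-c", "10",
     "data.vcf.gz"]
)
rounds = sampled_rounds(args.burn_in, args.num_samples, args.sample_interval)
print(args.k, args.out, rounds)
```

Under this assumed schedule, a run needs burn-in plus (samples × interval) rounds in total, which gives a rough way to budget run time when choosing -m, -s, and -c.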

An Eigenstrat .phgeno file is passed as data_file in the same way as a VCF file.

visualize.py creates charts to display the output from structure.py. It takes one command-line argument, the location of the HDF5 file.

rolloff.py estimates the ROLLOFF statistic for two populations:

python rolloff.py [-h] [--profile] [-m centimorgans] data_file.hdf5

Command-line arguments:

  • data_file.hdf5: the file generated by structure.py
  • -m or --min-bin-size: the minimum bin size to use, in centimorgans
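In the ROLLOFF paper, the statistic is computed as a function of genetic distance d and fitted with an exponential decay of the form A0 * exp(-n * d), where d is in Morgans and n is read as the number of generations since admixture. The sketch below illustrates only that fitting step on synthetic, noise-free values. It is not the code in rolloff.py: the function name fit_exponential_decay is hypothetical, and a log-linear least-squares fit is swapped in here as a simple stand-in for a nonlinear fit.

```python
import math


def fit_exponential_decay(distances_cm, values):
    """Fit values ~= a0 * exp(-n * d) by least squares on log(values).

    `distances_cm` are bin distances in centimorgans; they are converted
    to Morgans so that `n` is in generations. All values must be positive.
    Returns (a0, n).
    """
    xs = [d / 100.0 for d in distances_cm]  # centimorgans -> Morgans
    ys = [math.log(v) for v in values]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic example: 20 bins of 0.5 cM, decay generated with n = 50.
distances_cm = [0.5 * i for i in range(1, 21)]
true_a0, true_n = 0.8, 50.0
values = [true_a0 * math.exp(-true_n * d / 100.0) for d in distances_cm]

a0_hat, n_hat = fit_exponential_decay(distances_cm, values)
print(f"a0 = {a0_hat:.3f}, n = {n_hat:.1f} generations")
```

With real data the binned values are noisy and can be negative at large distances, so a log transform is not always usable; a larger minimum bin size trades resolution along d for less noise per bin.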

Data files

The data files are compressed and stored in the data/ folder. They come from the 1000 Genomes repository, and their use is subject to that repository's terms of use.

  • data/1.1-200000.ALL.chr1_GRCh38.genotypes.20170504.vcf.gz (the 200K dataset) contains variant calls from the first 200,000 positions in chromosome 1.
  • data/1.1-2500000.ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz (the 2.5M dataset) contains variant calls from the first 2.5 million positions in chromosome 1.

Contributors

  • veeara282
