downloadhgdp's Introduction

DownloadHGDP

Framework to download and collate WGS data for several hundred Human Genome Diversity Project samples.

Thousands of GRCh38 BAM files for HGDP samples are publicly available on the European Nucleotide Archive. The framework here downloads alignments in key regions of interest from each of these files. It then merges the data for all ~730 samples into a single sorted and indexed BAM file, where the resulting alignments have proper read groups that directly correspond to HGDP sample identifiers.

Please note that although the current framework works, you must manually delete and re-download any files for which alignment extraction failed.
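The README does not say how to recognize a failed extraction. One plausible sign, assumed here, is a truncated file: a complete BAM ends with a fixed 28-byte BGZF end-of-file marker. The sketch below lists files that lack it. The function name `list_truncated_bams` is hypothetical and not part of this repository, and a file can pass this check while still missing alignments.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): list BAM files that lack
# the 28-byte BGZF end-of-file marker, a common sign of a truncated download.
# Assumption: "failed" files are truncated; other failure modes are not caught.
BGZF_EOF=1f8b08040000000000ff0600424302001b0003000000000000000000

list_truncated_bams() {
    # Usage: list_truncated_bams DIR
    for f in "$1"/*.bam; do
        # If nothing matches, the glob stays literal, so skip it
        [ -e "$f" ] || continue
        # Hex-encode the last 28 bytes of the file as one unbroken string
        tail_hex=$(tail -c 28 "$f" | od -An -v -tx1 | tr -d ' \t\n')
        # Print the file name when the marker is absent
        [ "$tail_hex" = "$BGZF_EOF" ] || printf '%s\n' "$f"
    done
}
```

Assuming the per-sample BAMs are written into `by_sample_bams`, running `list_truncated_bams by_sample_bams` prints the candidates to delete before rerunning the download. If samtools is installed, `samtools quickcheck` performs a similar but more thorough check.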

# Populate regions_of_interest.bed with the regions from which you'd like to extract alignments
# The file should contain three columns: CHROM, START and END.
# It shouldn't have any header lines and CHROM should be a number (1-22), X or Y without a chr prefix

cd by_sample_bams
./download_regions.sh

# Delete files in which some alignments were not properly extracted, then rerun the script
./download_regions.sh
# Repeat these iterations until all files are properly downloaded
cd ../

cd merged_bam
./merge_bams.sh
cd ../

# The final merged BAM will be in merged_bam/merged_alns.sorted.bam
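Because a malformed regions file can go unnoticed, it may help to check `regions_of_interest.bed` against the rules stated in the comments above before downloading. The following is a minimal sketch, not part of the repository. The function name `validate_bed` is hypothetical, and the requirement that START and END are non-negative integers with START less than END is an added assumption based on the usual BED convention.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): check a regions file for
# three columns, no header, and CHROM in 1-22, X or Y without a chr prefix.
# Assumption: START and END must be integers with START < END.
validate_bed() {
    # Usage: validate_bed FILE
    # Prints one message per offending line; returns non-zero if any were found.
    awk '
        NF != 3 {
            bad = 1; print "line " NR ": expected 3 columns"; next
        }
        $1 !~ /^([1-9]|1[0-9]|2[0-2]|X|Y)$/ {
            bad = 1; print "line " NR ": bad CHROM " $1; next
        }
        $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ {
            bad = 1; print "line " NR ": START and END must be integers"; next
        }
        $2 + 0 >= $3 + 0 {
            bad = 1; print "line " NR ": START must be less than END"
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

A typical use would be `validate_bed regions_of_interest.bed && (cd by_sample_bams && ./download_regions.sh)`, so the download only starts when the file passes.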

Repository: tfwillems/downloadhgdp
