Telomere-to-telomere consortium

Introduction

We have sequenced the CHM13hTERT human cell line on the Oxford Nanopore platform to approximately 120x coverage. We have also generated approximately 50x coverage of 10X Genomics data, as well as BioNano DLS and Arima Genomics Hi-C data. PacBio data (both CLR and HiFi) for this cell line was previously generated by the Washington University School of Medicine and the University of Washington, and is available from NCBI SRA.

Human genomic DNA was extracted from the cultured cell line. As the DNA is native, modified bases will be preserved. We followed Josh Quick's ultra-long read (UL) protocol for library preparation and sequencing.

Data reuse and license

All data is released to the public domain (CC0) and we encourage its reuse. While not required, we would appreciate it if you acknowledged the "telomere-to-telomere" (T2T) consortium for the creation of this data, and we encourage you to join us if you would like to help finish the human reference genome. More information about our consortium can be found on the T2T homepage.

Citation:

Miga KH, Koren S, et al. Telomere-to-telomere assembly of a complete human X chromosome. bioRxiv, 2019.

Draft Assembly

The current assembly draft (v0.6) was generated with Canu v1.7.1 from rel1 data up to 2018/11/15, together with the previously released PacBio data. Two gaps on the X plus the centromere were manually resolved. Contigs with low coverage support were split and the assembly was scaffolded with BioNano. The assembly was polished with two rounds of nanopolish and two rounds of arrow. The X was polished using unique markers matched between the assembly and the raw read data; the rest of the genome used traditional polishing. Finally, the assembly was polished with 10X Genomics data. We validated the assembly using independent BACs. The overall QV is Q37 (Q42 in unique regions) and the assembly resolves over 80% of the BACs (280/341).
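
For readers less familiar with Phred-scaled quality values, the QV figures above can be translated into expected per-base error rates. The sketch below assumes the standard Phred definition, QV = -10 log10(error rate). The text does not state how the QV was measured, and the function name `qv_to_error` is illustrative only.

```shell
# Convert a Phred-scaled quality value (QV) to an expected per-base error rate.
# Assumes the standard definition: error = 10^(-QV/10).
qv_to_error() {
  awk -v q="$1" 'BEGIN { printf "%.1e\n", 10 ^ (-q / 10) }'
}

qv_to_error 37   # whole-assembly QV quoted above
qv_to_error 42   # QV in unique regions
```

Under that definition, Q37 corresponds to roughly one error every 5,000 bases and Q42 to roughly one every 16,000.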

The assembly is 2.94 Gbp in size with 503 scaffolds (593 contigs) and a scaffold NG50 of 83 Mbp (contig NG50 of 70 Mbp).
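
As a reminder of what the NG50 figure measures: sequences are sorted from longest to shortest, and NG50 is the length of the sequence at which the running total first reaches half of the estimated genome size (N50 uses the assembly size instead). A minimal sketch, assuming one length per line on standard input. The function name `ng50` and the toy lengths are illustrative, and this is not the consortium's own script.

```shell
# Compute NG50 from sequence lengths (one per line on stdin).
# $1 is the estimated genome size, in the same units as the lengths.
ng50() {
  sort -nr | awk -v g="$1" '
    { sum += $1 }                      # running total, longest first
    sum >= g / 2 { print $1; exit }    # first length reaching half the genome
  '
}

# Toy example: five sequences against a genome size of 200.
printf '%s\n' 10 50 30 20 40 | ng50 200   # prints 30
```

Passing the total assembly size instead of the genome size gives the ordinary N50.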

Outside of the X, this should be considered a draft and likely has mis-assemblies. We will continue to update releases as we validate/fix the assembly. Unpolished Canu assemblies are available below for each data release and may be a more suitable basis for the structural analysis of other chromosomes, but will have a lower consensus accuracy.

Downloads

Sequencing Data

Oxford Nanopore Data

We sequenced a total of 367 Gbp of data (118x coverage). The read N50 is 53 kbp and there are 193 Gbp in reads >50 kbp (62x). The longest full-length mapping read is 1.3 Mbp.

Sequencing data was generated from three lines of CHM13 (NHGRI, UW, UCD), which all originate from the original line established by Urvashi Surti. Only the NHGRI line was karyotyped and confirmed to be stable prior to sequencing.

  • NHGRI line: NHGRI (PI: Phillippy) and University of Nottingham (PI: Loose) contributed approximately 140 flowcells of UL data using Quick's ultra-long protocol; 199 Gbp (64x, 1.4 Gbp/flowcell). The read N50 is 71 kbp and there are 128 Gbp of data in reads >50 kbp (41x).
  • UW line: University of Washington (PI: Eichler) contributed 80 flowcells of UL data using a new UL protocol developed by Glennis Logsdon; 38 Gbp (12x, 0.5 Gbp/flowcell). The read N50 is 130 kbp and there are 30 Gbp of data in reads >50 kbp (10x).
  • UCD line: UC Davis (PI: Dennis) contributed two PromethION cells using a ligation prep; 114 Gbp (37x, 57 Gbp/flowcell). The read N50 is 36 kbp and there are 25 Gbp of data in reads >50 kbp (8x).
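
The coverage figures quoted above are simply sequenced bases divided by genome size. The sketch below reproduces a few of them, assuming a genome size of 3.1 Gbp. That value is inferred from the quoted numbers (for example 367 Gbp at 118x) rather than stated in the text, and the names `GENOME_GBP` and `coverage` are illustrative.

```shell
# Fold coverage = sequenced bases / genome size.
# GENOME_GBP is an assumption inferred from the quoted figures, not stated in the text.
GENOME_GBP=3.1

coverage() {
  awk -v bases="$1" -v genome="$GENOME_GBP" 'BEGIN { printf "%.0fx\n", bases / genome }'
}

coverage 367   # all Oxford Nanopore data
coverage 193   # reads >50 kbp
coverage 199   # NHGRI line
```

With this assumed genome size the three calls give 118x, 62x and 64x, matching the figures in the text.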

rel3 (genomic DNA)

rel3 is the full dataset as of 2019/09/01. All data was re-called using Guppy 3.1.5 with the HAC model. We have provided minimap2 mappings, in CRAM format, both to our current draft assembly and to GRCh38 with decoys.

Downloads

rel2 (genomic DNA)

rel2 is the same data as rel1 but re-called with the latest generation of callers (Guppy flip-flop 2.3.1). We have provided minimap2 mappings, in CRAM format, both to our current draft assembly and to GRCh38 with decoys.

Downloads

rel1 (genomic DNA)

The full dataset as of 2019/01/09. These basecalls were generated on-instrument and use older versions of Guppy (depending on when the flowcell ran on the instrument).

Downloads

fast5 data

The raw fast5 data, without basecalls, is available for completeness. The data is grouped into 226 sets.

Downloads

10X Genomics Data

Raw fastq files

Approximately 50x of data was generated on a NovaSeq instrument. Based on the summary output of Supernova, there are 1.2 billion reads with 41x effective coverage. The mean molecule length is 130 kbp, with an N50 of 864 reads per barcode.

Downloads

BioNano DLS Data

Approximately 430x of data was generated using the Saphyr instrument and the DLE-1 enzyme. There are 15.2 M molecules with an N50 molecule length of 115.9 kbp and a max of 2.3 Mbp (2 M molecules > 150 kbp, N50 218 kbp). The assembly of the molecules is 2.97 Gbp in size with 255 contigs and an NG50 of 59.6 Mbp.

Downloads

  • BNX (md5: 59a7a5583e900e1e5cecb08a34b5b0dc)
  • CMAP (md5: cf1a6fbcf006a26673499b9297664fdb)
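
The md5 values listed above can be used to confirm that a download completed intact. A minimal sketch, assuming `md5sum` from GNU coreutils is available. The function name `verify_md5` and the local filenames in the commented usage lines are hypothetical.

```shell
# Compare a file's MD5 digest with an expected value.
# Uses md5sum (GNU coreutils); on macOS, `md5 -q FILE` is the usual equivalent.
verify_md5() {
  actual=$(md5sum "$1" | cut -d ' ' -f 1)
  if [ "$actual" = "$2" ]; then
    echo "OK: $1"
  else
    echo "MISMATCH: $1 (got $actual, expected $2)" >&2
    return 1
  fi
}

# Hypothetical local filenames; substitute the names of the files you downloaded.
# verify_md5 chm13.bnx  59a7a5583e900e1e5cecb08a34b5b0dc
# verify_md5 chm13.cmap cf1a6fbcf006a26673499b9297664fdb
```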

Hi-C Data

A library was generated using an Arima Genomics kit and sequenced to approximately 40x on an Illumina HiSeq X.

Downloads

Previously generated PacBio data

The PacBio data (both CLR and HiFi) was previously generated and is available from the SRA. The P6-C4 cells used for arrow polishing are listed here.

Notes on downloading files

Files are generously hosted by Amazon Web Services. Although they are available as straightforward HTTP links, download performance is improved by using the Amazon Web Services command-line interface. References should be amended to use the s3:// addressing scheme, i.e. replace https://s3.amazonaws.com/nanopore-human-wgs/ with s3://nanopore-human-wgs/ to download. For example, to download CHM13_prep5_S13_L002_I1_001.fastq.gz to the current working directory, use the following command.

aws s3 --no-sign-request cp s3://nanopore-human-wgs/chm13/10x/CHM13_prep5_S13_L002_I1_001.fastq.gz .

To download the full dataset, use the following command.

aws s3 --no-sign-request sync s3://nanopore-human-wgs/chm13/ .
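
The link rewrite described above can also be scripted when working from a list of HTTP links. A minimal sketch, assuming path-style links in which the bucket name is the first path component. The function name `https_to_s3` is illustrative, and the host name is ignored entirely.

```shell
# Rewrite a path-style HTTPS link as an s3:// URI by dropping the scheme and host.
# Assumes the bucket name is the first path component of the link.
https_to_s3() {
  printf 's3://%s\n' "${1#https://*/}"
}

https_to_s3 "https://s3.amazonaws.com/nanopore-human-wgs/chm13/10x/CHM13_prep5_S13_L002_I1_001.fastq.gz"
# The resulting URI can then be passed to: aws s3 --no-sign-request cp <uri> .
```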

The s3 command can also be used to get information on the dataset, for example reporting the size of every file in human-readable format.

aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://nanopore-human-wgs/chm13/ 

To obtain technology-specific sizes, list the relevant prefix instead.

aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://nanopore-human-wgs/chm13/nanopore/fast5
aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://nanopore-human-wgs/chm13/nanopore/rel2
aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://nanopore-human-wgs/chm13/assemblies

Tuning settings such as max_concurrent_requests, as described in this guide, will further improve download performance.
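
One way to change that setting is through `aws configure set`, which writes to the default profile in the AWS CLI configuration file. The sketch below wraps the call in a helper. The name `tune_s3_transfers` and the value 20 are illustrative assumptions, not a recommendation from the consortium.

```shell
# Raise the AWS CLI's S3 transfer parallelism for the default profile.
# The value passed in is illustrative; choose one suited to your bandwidth and CPU.
tune_s3_transfers() {
  aws configure set default.s3.max_concurrent_requests "$1"
}

# Example (the CLI default is 10 concurrent requests):
# tune_s3_transfers 20
```

The setting persists across invocations, so it only needs to be applied once before running the `cp` or `sync` commands.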

Contact

Please raise issues concerning this dataset on this GitHub repository.

History

* rel1 and 2: 2nd March 2019. Initial release.
* asm v0.6 and canu rel2 assembly: 28th May 2019. Assembly update.
* Hi-C data added: 25th July 2019. Data update.
* asm v0.6 alignments of rel2 added: 30th Aug 2019. Data update.
* rel3: 16th Sept 2019. Data update.

Contributors

skoren, aphillippy
