Rapid-curation-2.0

TPF-less rapid curation of genomes

Requirements

Biopython v1.81

gfastats v1.2.6

pandas

Getting started

Before curating:

  1. Decontaminate the haplotypic assemblies.
  2. Rename the sequences in the decontaminated assemblies so that each name carries its haplotype prefix, e.g. H1.scaffold_1 for hap1 and H2.scaffold_2 for hap2. The post-scripts are designed to accept this H1 and H2 notation.
  3. Concatenate the assemblies into a single fasta; plot a pretext map.
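
The renaming and concatenation steps above can be sketched in a few lines of Python. This is a minimal illustration and not part of the repository's scripts: the function names are hypothetical, and it assumes the haplotype prefix is simply prepended to whatever sequence name is already in each FASTA header.

```python
def prefix_headers(fasta_in, prefix, out_handle):
    """Copy a FASTA file to out_handle, prepending prefix to each sequence name."""
    with open(fasta_in) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                name = line[1:]
                # Leave names alone if they already carry the prefix.
                if not name.startswith(prefix):
                    name = prefix + name
                out_handle.write(f">{name}\n")
            else:
                # Re-adding the newline keeps records separated even if
                # the input file lacks a trailing newline.
                out_handle.write(line + "\n")


def combine_haplotypes(hap1_fasta, hap2_fasta, combined_fasta):
    """Write hap1 (H1. names) followed by hap2 (H2. names) into one FASTA."""
    with open(combined_fasta, "w") as out:
        prefix_headers(hap1_fasta, "H1.", out)
        prefix_headers(hap2_fasta, "H2.", out)
```

With hypothetical file names, a call would look like `combine_haplotypes("hap1.decon.fasta", "hap2.decon.fasta", "haps_combined.fasta")`; the combined file is then what gets mapped for the pretext map.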

Curation:

  1. Curate both haplotypes simultaneously. The presence of both haplotypes can be especially useful for identifying sex and microchromosomes, as well as haplotig duplications (mis-phased sequences).
  2. Tags:
    • Create "Hap_1" and "Hap_2" tags in PretextView. These tags only need to be created once; PretextView will remember them in other curations. In the PretextView menu, click "Meta Data Tags" and type in the two tags.
    • Teasing the haplotypes apart gets a little messy, especially if there are sequences moved between haplotypes (i.e., a scaffold from Hap_1 assigned to a Hap_2 scaffold or vice versa). The unassigned scaffolds can be sorted by the H1 and H2 notations we added prior to mapping. However, we need to use the Hap_1 and Hap_2 tags we just created to sort the chromosomes. For each chromosome, assign the appropriate haplotype tag to the leftmost scaffold.
    • Tag the sex chromosomes as per usual. The current VGP standard is to move the sex chromosomes into Hap_1, so make sure that any sex chromosomes are also tagged with the Hap_1 tag.
    • Tag any unlocalized sequences as "unloc". Place any unloc sequences at the end (rightmost side) of their chromosomal assignment. These unlocs need to be painted with the chromosome they belong to.
  3. Once done, paint all the scaffolds (from both haplotypes) into chromosomes. The homologs will approximately alternate. With everything painted, generate your AGP.

Post-curation:

  1. Run the post-scripts. They are designed to process both haplotypes at once and will separate the haplotype files into two folders, "Hap_1" and "Hap_2".
sh curation_2.0_pipe.sh -f <haplotype combined fasta> -a <PretextView generated agp> 
-h help
-f combined haplotype fasta
-a haplotype agp generated from pretextview

Example:
sh curation_2.0_pipe.sh -f rCycPin1.HiC.haps_combined.fasta -a rCycPin1.HiC.haps_combined.pretext.agp
  2. Run hap2_hap1_ID_mapping.sh; this runs MashMap between your hap1 and hap2 fasta files to identify any homologous pairs that aren't named the same. The output is a MashMap .out file and a tsv. The tsv, parsed from the .out file, contains the current names of the hap2 chromosomes and the names of their homologs in hap1. The parsing can sometimes get confused by repetitive, similar, or small chromosomes (among others), so I recommend plotting your mashmap or visualizing it in JBrowse to ensure the pairs in the tsv are correct. Then, to generate a hap2 fasta with updated names, pass the hap2 fasta and the tsv output to update_mapping.rb. This will modify the names and output a new fasta. These two scripts were authored and kindly shared by Michael Paulini of the GRIT team at the Wellcome Sanger Institute. They are copied here for ease of access, as they make renaming hap2 chromosomes substantially easier.
  3. (Suggested) Generate a pretext map for each haplotype to ensure it was curated as anticipated.
  4. Use chr_submission.py to generate the chr.tsv file that is necessary for NCBI submissions.
  5. SUCCESS!
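
As a complement to plotting the mashmap, the homolog pairs can also be sanity-checked directly from the .out file. The sketch below is not the parsing used by hap2_hap1_ID_mapping.sh. It assumes the hap2 sequences were the MashMap query and hap1 the reference, and that each row is whitespace-delimited with the query name, query start, query end, and target name in columns 1, 3, 4, and 6. The function name and the 0.8 threshold are hypothetical.

```python
from collections import defaultdict


def summarize_pairs(mashmap_lines, min_share=0.8):
    """Return {query: (best_target, share, is_ambiguous)} from MashMap rows.

    share is the fraction of the query's aligned bases that hit its best
    target; pairs below min_share are flagged for manual inspection.
    """
    aligned = defaultdict(lambda: defaultdict(int))
    for line in mashmap_lines:
        fields = line.split()
        if len(fields) < 9:
            continue  # skip blank or malformed rows
        query, target = fields[0], fields[5]
        q_start, q_end = int(fields[2]), int(fields[3])
        # Guard against zero-length rows so the share is always defined.
        aligned[query][target] += max(1, q_end - q_start)

    summary = {}
    for query, per_target in aligned.items():
        total = sum(per_target.values())
        best_target, best_len = max(per_target.items(), key=lambda kv: kv[1])
        share = best_len / total
        summary[query] = (best_target, share, share < min_share)
    return summary
```

Calling `summarize_pairs(open("hap2_vs_hap1.out"))` (file name hypothetical) gives one entry per hap2 sequence. Entries flagged as ambiguous are the ones worth comparing against the tsv before running update_mapping.rb.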

Outputs

ADD THIS

Wishlist/operations to include

  • Generating the chromosome file that is necessary for NCBI submissions. Will need to be able to double check for unloc pieces.
  • Another program for automatically pushing the curated files to VGP S3.
  • Better way to parse multiple tags
  • More flexibility in placement of unlocs
  • Another post-processing script to quick-align and parse the results to adjust the order and orientation of Hap_2 chromosomes to match Hap_1.
  • Script for checking for curation statistics; number of breaks, joins, etc.
  • More flexibility in dealing with sex chromosomes so as to accommodate variable sex chromosome systems (e.g., XY1Y2, etc.)
  • Removing haplotigs. As currently configured, they have to be painted to be removed, and the proximity ligations inserted because of painting aren't removed when the haplotigs are.
  • output haplotigs to fasta

FAQ

  1. Why won't my PretextMap open in PretextView?

Hi-res PretextMaps likely require an HPC to generate, and opening them in PretextView also requires a discrete GPU because it needs 16 GB of RAM (e.g., MacBooks with the M1 chip have this capacity).

  2. Why aren't my unlocalized (unloc) sequences being named correctly?

a. At this time, I have configured the pipeline to process unlocs placed at the end of their respective chromosome assignments. Processing unlocs placed at the beginning of the painted chromosome is more complicated but possible; time permitting, I will go back and modify this in the future. For now, place all unlocs at the right end of their painted chromosome.
b. The unlocs also have to be painted. Double check to make sure they have been painted along with their assigned chromosome.
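
Both conditions can be checked on the AGP before running the post-scripts. The sketch below is an illustration only and is not taken from the pipeline. It assumes that PretextView writes its tags as extra tab-separated columns after the nine standard AGP columns, that painted components carry a "Painted" tag, and that each painted chromosome is one AGP object. The function name is hypothetical.

```python
def check_unlocs(agp_lines):
    """Return (object, component, problem) tuples for unlocs that are
    unpainted or not at the right end of their AGP object."""
    problems = []
    seen_unloc = set()  # objects in which an unloc has already appeared
    for line in agp_lines:
        if line.startswith("#") or not line.strip():
            continue  # header/comment or blank line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[4] != "W":
            continue  # gap rows (N/U) carry no tags
        obj, component = fields[0], fields[5]
        tags = set(fields[9:])  # assumption: tags follow the 9 AGP columns
        if "unloc" in tags:
            seen_unloc.add(obj)
            if "Painted" not in tags:
                problems.append((obj, component, "unloc is not painted"))
        elif obj in seen_unloc:
            # A regular component to the right of an unloc means the
            # unloc is not at the end of its chromosome.
            problems.append((obj, component, "non-unloc component follows an unloc"))
    return problems
```

An empty list from `check_unlocs(open("haps_combined.pretext.agp"))` (file name hypothetical) means every unloc in the AGP is painted and sits after all other components of its object, under the assumptions above.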
