Giter VIP home page Giter VIP logo

Comments (20)

ekg avatar ekg commented on August 22, 2024

@inti YES!!!! vg should in theory be able to all of these things. However, the intron as indel interpretation issue is somewhat complicated. I need to think about that. (perhaps you could elaborate?)

You can start testing by building a graph with a few transcripts for a gene. However, there is a blocking issue: #39.

Ideally, we should align assembled trancripts. This would get us the "indel" corresponding to the intron splicing events. However, it may be more feasible to align just the exons from each transcript, using the transcript name as the alignment name. Then, we'd have to modify the graph to include the introns that are excised. I'm not sure at present which way will work better. Do you have test data to work with?

from vg.

inti avatar inti commented on August 22, 2024

Hi @ekg ,
So what I had in mind was: align to the reference to identify transcript structures, then use vg to refine the alignments. In that way the introns would be known.
One option would be to have a reference + known variation + known transcripts and build the variation graph and annotate the transcripts paths on the graph or at least annotate the exon-exon boundaries and the transcripts they belong to. In that way when aligning RNA-seq data one could exploit those paths explicitly. I can see your suggestion would be doing this as well.
Several aligners, like GSNAP, would take a exon-intro map and used that for the spliced alignments. Would it be easier to use this strategy? Either use this information during mapping or simply take the exon-intro map and edit the graph file to add the corresponding paths?

I think a test could be to takes some mRNA-seq data of the human genome, potentially for one of the 1KG samples, and testing and check whether we can find the known variation. Makes sense to you? I do not have that data at hand but for the sake of trying I can find it and send you small bam files.

from vg.

ekg avatar ekg commented on August 22, 2024

Several aligners, like GSNAP, would take a exon-intro map and used that for the spliced alignments. Would it be easier to use this strategy? Either use this information during mapping or simply take the exon-intro map and edit the graph file to add the corresponding paths?

I really like this approach. It's not completely general and will require more coding to enable (ideally, we just align assembled transcripts to build up the RNA+DNA variation graph), but it will work cleanly and the input is well-curated.

To get both variation and transcripts, all that needs to be done is to modify the 1000G+GRCh37 graph to include the intronic deletions that we see in the RNA-seq data. Then we'd just need to clean up the graph, ensuring partial ordering of ids and such (vg mod -c), and re-index to enable alignment.

Code will need to be written to do the inclusion. It would look like this.

for each transcript:
    make an alignment to represent the transcript
    for each intron:
        find the node(s) at the particular path positions of the start and end of the intron
            (uses vg::Index::get_path)
        append a vg::Mapping to cut the nodes and introduce an edge for the intron skip
            to the alignment
    emit the alignment

use vg mod to include the alignments in the graph

So the basic idea is that we'd use the exon/intron map to build up alignments (vg::Alignment) whose paths are the transcript alignment and whose names are the transcript names. Then we can use an existing tool vg mod -i to add the alignments to the graph.

vg::Index::get_path is here: https://github.com/ekg/vg/blob/master/index.hpp#L188 and https://github.com/ekg/vg/blob/master/index.cpp#L1371-L1389

What kind of format are the exon-intro maps in?

from vg.

inti avatar inti commented on August 22, 2024

Regarding the format of the exon-intro maps. it depends on the software (as always?). GNSAP does it like this, from the README http://research-pub.gene.com/gmap/src/README

On the other hand, to perform known intron-level splicing, you will need
to create a file with the following format:

>NM_004448.ERBB2.intron1 17:35110090..35116769
>NM_004448.ERBB2.intron2 17:35116920..35118100
>NM_004449.ERG.intron1 21:38955452..38878739
>NM_004449.ERG.intron2 21:38878638..38869541

Again, coordinates are 1-based, and specify the exon coordinates
surrounding the intron, with the first coordinate being from the donor
exon and the second one being from the acceptor exon.


There are several ways to help you generate these files.  First, if
you have a GTF file, you can use the included programs gtf_splicesites
and gtf_introns like this:

    cat <gtf file> | gtf_splicesites > foo.splicesites
    cat <gtf file> | gtf_introns > foo.introns

Second, if you retrieve an alignment tracks from UCSC, like this:

    ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz

if you are aligning to genome hg18, or

    ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz

if you are aligning to genome hg19, you can process this track using
the included program psl_splicesites or psl_introns, like this:

    gunzip -c refGene.txt.gz | psl_splicesites -s 1 > foo.splicesites
    gunzip -c refGene.txt.gz | psl_introns -s 1 > foo.introns

Note that alignment tracks in UCSC sometimes have an extra column on
the left.  The "-s" flag allows you to indicate how many columns
should be skipped.

the gtf_splicesites and gtf_introns come bundled with GSNAP and they come in handy to go from GTF/GFF3 annotations to into-exon maps.

This approach could be flexible since people can either rely on known annotations or first map the RNA-seq data to the genome identify likely variation and transcript structures and then re-map/refine the variation/bam files using variation graphs.

Is it possible to modify a variation graph with a VCF file? would it be easier to take a GTF/GFF file and convert it into a mock VCF file to code the introns as just another variant, perhaps with a special label like NM_004448.ERBB2.intron1 (as example above)?

from vg.

ekg avatar ekg commented on August 22, 2024

Is it possible to modify a variation graph with a VCF file? would it be easier to take a GTF/GFF file and convert it into a mock VCF file to code the introns as just another variant, perhaps with a special label like NM_004448.ERBB2.intron1 (as example above)?

It would be possible, but again there needs to be some coding to make this happen. The simplest way is to convert the variants in the file into reads of the novel haplotype represented by the variant, then map these into the existing graph.

One thing that's lacking in vg now is naming mechanisms. Read and sequence names need to be able to be used to annotate paths in the graph. You could actually script out the above solution using the vg command line API if this were the case. So, keep an eye on #39.

from vg.

ekg avatar ekg commented on August 22, 2024

Have you had a chance to test this out further?

from vg.

inti avatar inti commented on August 22, 2024

No yet. I got busy with something else. I am expecting to come back to this within 2-3 weeks. helps?

On Oct 7, 2015, at 12:41, Erik Garrison [email protected] wrote:

Have you had a chance to test this out further?


Reply to this email directly or view it on GitHub #35 (comment).

from vg.

ekg avatar ekg commented on August 22, 2024

With banded/split read mapping and the graph editing stuff stabilizing, I'd say it's the perfect time to come back to this. --- These enable the alignment of long contigs (such as transcript sequences) to the reference, and then the editing of the reference with them. Also, I think I've got GCSA indexing working for the whole human genome in <200G of RAM, which is critical for making a usable high-performance mapper. Fingers crossed.

Ping me if you come back to this and have any questions.

from vg.

adamnovak avatar adamnovak commented on August 22, 2024

We've done a couple proof of concept things with RNA graphs, and I think we're definitely ready for someone to do RNA-seq with vg as an actual project. I'm going to close this.

from vg.

CDieterich avatar CDieterich commented on August 22, 2024

Hi,

I would like to follow-up on this interesting thread.
While I understand that pan-transcriptome and pan-genome indices can be build and used right away,
I was wondering how splice alignment in spirit of HiSAT2 could be performed.
Any feedback is highly appreciated.

from vg.

ekg avatar ekg commented on August 22, 2024

from vg.

CDieterich avatar CDieterich commented on August 22, 2024

Ok, first part sounds great (i.e. incorporating known splice sites).

The second part is even more interesting (i.e. finding novel splice or even back.splice sites)

I am very happy to see your positive reply. Could you pinpoint me towards possible options
for the alignment module ?

from vg.

loganylchen avatar loganylchen commented on August 22, 2024

You can start testing by building a graph with a few transcripts for a gene.

Hi @ekg, I tried to build a graph for a gene with different transcripts. But what I got was separate linear reference for each transcript. Could you help me with that?

vg construct -r gene.fa  > tmp/x.vg

What I've got:
image

Looking forward to your reply.
Best

from vg.

jeizenga avatar jeizenga commented on August 22, 2024

Hi @btrspg, since this issue was closed we've developed methods for constructing variation graphs with splice edges in vg rna. I would recommend checking that out to start. The vg mpmap mapping tool also has some specialized features for mapping RNA-seq data, which is discussed here.

from vg.

loganylchen avatar loganylchen commented on August 22, 2024

@jeizenga Appreciate.

from vg.

loganylchen avatar loganylchen commented on August 22, 2024

Hi, @jeizenga, I still have some questions about vg rna. I used a fake fasta and a fake gtf file to construct the graph, but I don's know if I have already successful done the construction of graph.

I manipulated like this:

vg construct -r fake.fasta > fake.vg
vg rna -n fake.gtf --use-embeded-paths fake.vg > fake_rna.vg 

Then I used sequenceTubeMap to see is there any new path in the graph but what I got was a linear reference.

from vg.

jeizenga avatar jeizenga commented on August 22, 2024

If I recall correctly, sequenceTubeMap is designed primarily for genomic variation graphs with phased haplotypes. There are some pitfalls that might be catching you up on a splice graph. I would recommend vg view and dot instead. It's not as pretty, but it works well for debugging and troubleshooting. Documentation is here:
https://github.com/vgteam/vg/wiki/Visualization

from vg.

jonassibbesen avatar jonassibbesen commented on August 22, 2024

Hi, @btrspg, transcript paths are not added to the graph by default in vg rna. You can add them by using the option -r for reference transcript paths. You can also use -a for adding haplotype-specific transcript paths if you have embedded non-reference paths in your graph or if you supplied at set of haplotypes using -l.

from vg.

adamnovak avatar adamnovak commented on August 22, 2024

@btrspg If you run vg construct with FASTA of sequences, and no VCF, it just creates a graph with one linear node or run of nodes for each sequence. vg construct is meant to build a graph from a FASTA reference and a VCF (perhaps GRCh38's FASTA and the 1000 Genomes VCF, for example); it doesn't align the sequences in the FASTA together and squish the shared parts down to make a graph.

If you have multiple homologous sequences to make a graph from, you want https://github.com/ekg/seqwish or maybe vg msga. Both of those methods will align the sequences against each other.

from vg.

loganylchen avatar loganylchen commented on August 22, 2024

Thank you all @jeizenga @jonassibbesen @adamnovak . In deed, what I need is to use vg to do some RNA-seq work. I have known the reason why I could not see the graph path was that I need to build the index for fake_rna.vg with a specific parameter -L(include alt paths in xg)

vg construct -r fake.fasta > fake.vg
vg rna -n fake.gtf --use-embeded-paths fake.vg > fake_rna.vg
vg index -L -T -G fake_rna.gbwt -g fake_rna.gcsa -x fake_rna.xg

Then I could see the different path from different transcripts in sequenceTubeMap.

I used vg view and dot as @jeizenga mentioned to check the graph, so I knew the vg rna did the right thing. Then I figured out the index influenced the graph visualization.

Thank you all again.

from vg.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.