pcingola / snpeff
License: Other
They create all sorts of problems and solve nothing.
Just remove them.
For this input:
chr1 11017092 C T
The output is:
chr1 11017092 C T . . . EFF=EXON(MODIFIER||||774|C1orf127|protein_coding|CODING|ENST00000520253|7|1|WARNING_TRANSCRIPT_NO_START_CODON+WARNING_REF_DOES_NOT_MATCH_GENOME),EXON(MODIFIER||||823|C1orf127|protein_coding|CODING|ENST00000377004|8|1|WARNING_REF_DOES_NOT_MATCH_GENOME),INTRAGENIC(MODIFIER|||||C1orf127||CODING|||1),INTRON(MODIFIER||||656|C1orf127|protein_coding|CODING|ENST00000377008|7|1),INTRON(MODIFIER||||657|C1orf127|protein_coding|CODING|ENST00000418570|3|1|WARNING_TRANSCRIPT_NO_START_CODON)
Notice that even though 4 transcripts are hit for this gene, an INTRAGENIC result is added.
In Gene.java line 224, hitTranscript is overwritten for each transcript in the gene,
so if the last transcript is not hit, hitTranscript will be false when we exit the loop, and because of that an INTRAGENIC effect is added.
So as far as I understand, maybe:
hitTranscript = tr.seqChangeEffect(seqChange, changeEffects);
has to be changed to
hitTranscript |= tr.seqChangeEffect(seqChange, changeEffects);
but maybe I'm missing something :)
Julien
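The difference between the two assignments can be reproduced in isolation. This is a minimal sketch, not SnpEff's actual code: the class HitTranscriptSketch is hypothetical, and each array element stands in for the value returned by tr.seqChangeEffect(...) for one transcript.

```java
// Hypothetical stand-in for the loop in Gene.java; each array element plays the
// role of the value returned by tr.seqChangeEffect(...) for one transcript.
class HitTranscriptSketch {

    // Plain assignment: only the result for the last transcript survives the loop.
    static boolean lastOnly(boolean[] hits) {
        boolean hitTranscript = false;
        for (boolean hit : hits) hitTranscript = hit;
        return hitTranscript;
    }

    // Compound OR: the flag stays true once any transcript was hit. For booleans,
    // |= does not short-circuit, so the right-hand side is evaluated every time.
    static boolean anyHit(boolean[] hits) {
        boolean hitTranscript = false;
        for (boolean hit : hits) hitTranscript |= hit;
        return hitTranscript;
    }

    public static void main(String[] args) {
        boolean[] hits = { true, true, true, true, false }; // last transcript not hit
        System.out.println("=  gives " + lastOnly(hits));
        System.out.println("|= gives " + anyHit(hits));
    }
}
```

Because |= on booleans is not short-circuiting, every transcript is still evaluated, which matters if seqChangeEffect also records effects as a side effect. Writing hitTranscript = hitTranscript || ... instead would skip the call once the flag is true.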
Finish test cases.
I'm having a problem where some of the records in my VCF files have a single N in the ALT field. SnpEff/SnpSift keeps changing these to A,C,G,T. Is there any way to turn this feature off in the future? Or, for that matter, could there be a feature that makes SnpEff leave the original VCF record unchanged apart from adding an INFO field?
Example: original VCF record, VCF record after SnpEff
1 123456 rs123 NTGTATT N
1 123456 rs123 NTGTATT A,C,G,T
The script is invoked (from SnpEff's directory) using the NCBI's ID as parameter
./scripts/buildNcbiDatabase.pl 'NC_001788.1'
Step 1: NCBI's page is downloaded in order to scrape the UID
curl http://www.ncbi.nlm.nih.gov/nuccore/NC_001788.1 > NC_001788.1
Step 2: Scrape UID
...meta name="ncbi_uidlist" content="5835345"...
Step 3: Download GenBank file
curl "http://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?sendto=on&val=5835345" > NC_001788.1.gbk
Step 4: Add config, build db
echo NC_001788.1.genome : NC_001788.1 >> snpEff.config
Step 5: Build
java -jar snpEff.jar build NC_001788.1
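Steps 2 and 3 above can be sketched in Java as well. This is only an illustration of the scraping logic, not what buildNcbiDatabase.pl does internally: the class NcbiUidSketch and its method names are hypothetical, and the regular expression assumes the meta tag looks exactly like the fragment shown in Step 2.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating Step 2 (scrape UID) and Step 3 (build download URL).
class NcbiUidSketch {

    // Assumes the page contains: meta name="ncbi_uidlist" content="<digits>"
    private static final Pattern UID_META =
            Pattern.compile("name=\"ncbi_uidlist\"\\s+content=\"(\\d+)\"");

    // Returns the UID found in the page, or null if the tag is absent.
    static String extractUid(String html) {
        Matcher m = UID_META.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Builds the GenBank download URL used in Step 3.
    static String genBankUrl(String uid) {
        return "http://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?sendto=on&val=" + uid;
    }

    public static void main(String[] args) {
        String html = "...meta name=\"ncbi_uidlist\" content=\"5835345\"...";
        String uid = extractUid(html);
        System.out.println(uid == null ? "UID not found" : genBankUrl(uid));
    }
}
```

Returning null when the tag is missing lets the caller stop before Step 3 rather than download from a malformed URL.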
Looks like we are choking on mixed variants:
chrX 134555866 . CGT AGCT
chrX 152864607 . GCGTG GCGCTG,GCGCTA
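One way to see why these records are "mixed" is to trim the bases shared by REF and ALT at both ends and look at what remains. The sketch below is an assumption about how such a classification could work, not SnpEff's implementation; the class VariantKindSketch and the Kind names are hypothetical.

```java
// Hypothetical classifier: trims the common prefix and suffix of REF and ALT,
// then classifies the variant by the lengths of what is left.
class VariantKindSketch {

    enum Kind { NONE, SNP, MNP, INS, DEL, MIXED }

    static Kind classify(String ref, String alt) {
        int max = Math.min(ref.length(), alt.length());

        // Length of the common prefix.
        int p = 0;
        while (p < max && ref.charAt(p) == alt.charAt(p)) p++;

        // Length of the common suffix, never overlapping the trimmed prefix.
        int s = 0;
        while (s < max - p
                && ref.charAt(ref.length() - 1 - s) == alt.charAt(alt.length() - 1 - s)) s++;

        int r = ref.length() - p - s; // REF bases left after trimming
        int a = alt.length() - p - s; // ALT bases left after trimming

        if (r == 0 && a == 0) return Kind.NONE;
        if (r == 0) return Kind.INS;
        if (a == 0) return Kind.DEL;
        if (r == a) return r == 1 ? Kind.SNP : Kind.MNP;
        return Kind.MIXED; // bases replaced AND length changed
    }

    public static void main(String[] args) {
        System.out.println("CGT>AGCT     " + classify("CGT", "AGCT"));
        System.out.println("GCGTG>GCGCTG " + classify("GCGTG", "GCGCTG"));
        System.out.println("GCGTG>GCGCTA " + classify("GCGTG", "GCGCTA"));
    }
}
```

Under this scheme the first record and the second ALT of the second record stay mixed after trimming, while GCGTG>GCGCTG reduces to a pure insertion of C, so the two ALTs of one record can need different handling.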
User request:
Can you output the transcript version? For example, transcript ENST00000560659 with its version would be ENST00000560659.3 (based on Ensembl core GRCh37.75). The version number is a must for the tool to be HGVS compliant.
In a circular genome, an exon with lower coordinates can come after an exon with higher coordinates.
This should be reflected in the "transcript sort" algorithm.
E.g.:
(GenBank GQ861354): exon [97398..97628] is before exon [68375..68476]
CDS complement(join(96834..96860,97398..97628,68375..68476))
/gene="rps12"
/trans_splicing
/codon_start=1
/transl_table=11
/product="ribosomal protein S12"
/protein_id="ACY66286.1"
/db_xref="GI:262400797"
/translation="MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTITPKKPNSA
LRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTLDAVG
VKDRQQGRSQYGVKKPK"
TEST!!!
Add protein coding field using protein.fa and cds.fa info
HGVS should be "p.0"
Need better Kozak predictions
http://en.wikipedia.org/wiki/Kozak_consensus_sequence
Hi pcingola,
When an indel in a VCF file overlaps a dbSNP record, I observe that the indel will be annotated with the dbSNP record.
Do you require a 50% reciprocal overlap criterion?
How about multiple-nucleotide polymorphisms? Would you recommend that I decompose MNPs into isolated SNPs before using snpEff, or the other way round?
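For reference, a 50% reciprocal overlap test requires the shared region to cover at least half of each of the two intervals. Whether snpEff applies such a criterion is exactly what is being asked here; the sketch below only shows what the test computes, using a hypothetical class ReciprocalOverlapSketch and closed (inclusive) coordinates as an assumption.

```java
// Hypothetical helper: reciprocal overlap between two closed intervals.
class ReciprocalOverlapSketch {

    // Number of positions shared by [s1, e1] and [s2, e2] (0 if they are disjoint).
    static int overlap(int s1, int e1, int s2, int e2) {
        return Math.max(0, Math.min(e1, e2) - Math.max(s1, s2) + 1);
    }

    // True if the shared region covers at least minFraction of BOTH intervals.
    static boolean reciprocal(int s1, int e1, int s2, int e2, double minFraction) {
        int shared = overlap(s1, e1, s2, e2);
        int len1 = e1 - s1 + 1;
        int len2 = e2 - s2 + 1;
        return shared >= minFraction * len1 && shared >= minFraction * len2;
    }

    public static void main(String[] args) {
        // Two 10-base intervals sharing 5 bases: passes at 50%.
        System.out.println(reciprocal(100, 109, 105, 114, 0.5));
        // A 10-base interval fully inside a 40-base one: fails, since only 25% of the larger is covered.
        System.out.println(reciprocal(100, 109, 100, 139, 0.5));
    }
}
```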
Hi Pablo,
One of our scientists noticed that the effect annotation for BRCA1, when using canonical, is not the transcript variant most often considered to be "standard" (at least according to some!). He'd like to use NM_007294.3, and I found a way to pass this to snpEff through the -onlyTr option. However, I would effectively have to provide all transcript variants for all genes; otherwise all other genes are annotated as being INTRAGENIC.
Is there any way snpEff could select the longest transcript in all other cases except for the transcripts provided in the text file?
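The requested selection rule (use a listed transcript where one exists, fall back to the longest transcript otherwise) could look roughly like this. It is a sketch of the request, not an existing snpEff feature: the class TranscriptChoiceSketch, the gene and transcript names other than NM_007294.3, and all lengths are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical selection rule: per gene, prefer a user-listed transcript;
// otherwise pick the longest one (ties keep the first transcript encountered).
class TranscriptChoiceSketch {

    // lengthsByGene maps gene name -> (transcript ID -> transcript length).
    static Map<String, String> choose(Map<String, Map<String, Integer>> lengthsByGene,
                                      Set<String> preferred) {
        Map<String, String> chosen = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Integer>> gene : lengthsByGene.entrySet()) {
            String best = null;
            int bestLen = -1;
            boolean bestPreferred = false;
            for (Map.Entry<String, Integer> tr : gene.getValue().entrySet()) {
                boolean isPreferred = preferred.contains(tr.getKey());
                // A preferred transcript beats any non-preferred one; otherwise compare lengths.
                boolean better = (isPreferred && !bestPreferred)
                        || (isPreferred == bestPreferred && tr.getValue() > bestLen);
                if (better) {
                    best = tr.getKey();
                    bestLen = tr.getValue();
                    bestPreferred = isPreferred;
                }
            }
            if (best != null) chosen.put(gene.getKey(), best);
        }
        return chosen;
    }

    public static void main(String[] args) {
        // Made-up lengths and placeholder IDs; only NM_007294.3 comes from the request.
        Map<String, Map<String, Integer>> genes = new LinkedHashMap<>();
        genes.put("BRCA1", Map.of("NM_007294.3", 7000, "TR_LONGER", 7500));
        genes.put("GENE_X", Map.of("TR_A", 1200, "TR_B", 3400));
        System.out.println(choose(genes, Set.of("NM_007294.3")));
    }
}
```

With this rule the text file only needs to list the exceptions, and every other gene keeps a single representative transcript instead of being reported as INTRAGENIC.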
Stats do not work in '-t' mode. Multi-threading compatible Stats objects should be implemented.
Use the "everything is an expression" concept to allow for more generic expressions.
Sample VCF:
$ cat zzz.vcf
1 551124 . A G 318.2 PASS AC=12
Sample command line:
java -Xmx4g -jar snpEff.jar -v Zv9.74 zzz.vcf
Error:
java.lang.RuntimeException: No white-space, semi-colons, or equals-signs are permitted in INFO field. Name:"LOF" Value:"(ILDR2 (2 of 2)|ENSDARG00000096600|1|1.00)"
at ca.mcgill.mcb.pcingola.vcf.VcfEntry.addInfo(VcfEntry.java:154)
at ca.mcgill.mcb.pcingola.outputFormatter.VcfOutputFormatter.addInfo(VcfOutputFormatter.java:280)
at ca.mcgill.mcb.pcingola.outputFormatter.VcfOutputFormatter.toString(VcfOutputFormatter.java:391)
at ca.mcgill.mcb.pcingola.outputFormatter.OutputFormatter.endSection(OutputFormatter.java:111)
at ca.mcgill.mcb.pcingola.outputFormatter.VcfOutputFormatter.endSection(VcfOutputFormatter.java:327)
at ca.mcgill.mcb.pcingola.outputFormatter.OutputFormatter.printSection(OutputFormatter.java:144)
at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEffCmdEff.iterateVcf(SnpEffCmdEff.java:346)
at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEffCmdEff.runAnalysis(SnpEffCmdEff.java:791)
at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEffCmdEff.run(SnpEffCmdEff.java:711)
at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEffCmdEff.run(SnpEffCmdEff.java:663)
at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEff.run(SnpEff.java:734)
at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEff.main(SnpEff.java:123)
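The exception is triggered because the gene name "ILDR2 (2 of 2)" contains spaces. One possible workaround is to replace the offending characters before the value is added to the INFO field. The sketch below is an assumption about such a fix, not the change made in SnpEff; the class VcfInfoSanitizerSketch and the choice of underscore as replacement are hypothetical.

```java
// Hypothetical sanitizer for INFO values: replaces each run of the characters
// named in the error message (white-space, semicolons, equals-signs) with '_'.
class VcfInfoSanitizerSketch {

    static String sanitize(String value) {
        return value.replaceAll("[\\s;=]+", "_");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("(ILDR2 (2 of 2)|ENSDARG00000096600|1|1.00)"));
    }
}
```

For the value in the error message this yields (ILDR2_(2_of_2)|ENSDARG00000096600|1|1.00). The replacement is lossy, so the original gene name cannot be recovered from the output.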
From user:
... I am working with the Bursaphelenchus xylophilus genome (nematode). The genome consists of contigs and scaffolds, some of them with the same number. SnpEff mixes up a scaffold with a contig if they have the same number; if not, it works fine... Is there a way to resolve this problem?
Add command line option "-classic" to override (same as overriding sequence ontology).
User request.
Now Hom/Het calculations work only on single sample VCF files.
I should extend this to multiple samples.
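Extending the calculation to several samples mostly means classifying each sample's GT value separately and tallying the results per variant. A minimal sketch, assuming the GT sub-field has already been extracted from each sample column and that calls are diploid; the class GenotypeCountSketch is hypothetical and is not the code used by SnpEff.

```java
// Hypothetical per-variant tally over all samples.
// Result layout: [0] = hom-ref, [1] = het, [2] = hom-alt, [3] = missing or non-diploid.
class GenotypeCountSketch {

    static int[] count(String[] genotypes) {
        int[] counts = new int[4];
        for (String gt : genotypes) {
            String[] alleles = gt.split("[/|]"); // unphased '/' or phased '|'
            if (alleles.length != 2 || alleles[0].equals(".") || alleles[1].equals(".")) {
                counts[3]++; // missing call, or not a diploid genotype
            } else if (!alleles[0].equals(alleles[1])) {
                counts[1]++; // two different alleles: heterozygous
            } else if (alleles[0].equals("0")) {
                counts[0]++; // both alleles are the reference
            } else {
                counts[2]++; // both alleles are the same ALT
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] c = count(new String[] { "0/0", "0/1", "1|1", "./.", "1/2" });
        System.out.println("homRef=" + c[0] + " het=" + c[1] + " homAlt=" + c[2] + " missing=" + c[3]);
    }
}
```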
Just wondering about what appears to be a new annotation, SPLICE_SITE_REGION, which is currently given a LOW impact. I am assuming this is +/- a certain number of base pairs of known splice site donor/acceptor positions. A quick update to the documentation with the provenance and details of the annotation would be appreciated. Thanks!
Some references:
https://www.biostars.org/p/91806/
https://www.biostars.org/p/69222/
https://www.biostars.org/p/86929/
https://www.biostars.org/p/105030/#108482
https://www.biostars.org/p/107744/
https://www.biostars.org/p/74822/
https://www.biostars.org/p/108112/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/
User email:
The program runs well for all 3 of these files, but there is one problem: in the snpEff_summary file for File 3, in the "Number of effects by type and region" section, the variation graph does not show the bar for exons, which have a value of 50.226%. It's not a big deal, and we can generate the graph ourselves, but I would like to know the possible reasons for this.
NOTE: Unfortunately the user doesn't seem keen on providing data to replicate error conditions.
Add a transcript field recording whether we checked it against CDS and PROTEIN sequences.
Then we can emit proper warnings if the sequence does not check.
Check a transcript using CDS and PROTEIN.
These values have to be serialized (saved).
When using a transcript:
i) If both checks are OK, then we are relatively confident that the Reference Genome
annotations are OK.
ii) If it doesn't check: Add a warning.
iii) If no checking was performed: Show an overall warning.
Biological question: ... is it possible to have a transcript that has exons in BOTH positive AND negative strands? E.g. Transcript TR1 has exons Ex1, Ex2, Ex3 and, say Ex1 is in the positive strand while Ex2 and Ex3 are on the negative strand.
...looked at the FlyBase annotation for FBtr0084084 and they are claiming these are an example of trans-splicing from the other strand. I'm not sure I believe it but there is at least one mass-spec protein backing it up.
Papers:
http://www.ncbi.nlm.nih.gov/pubmed/15520256
http://www.ncbi.nlm.nih.gov/pubmed/20615941
By default a genome should be downloaded if it is not available.
Add command line option "-nodownload" to override default behaviour.
The current behaviour is not consistent (and quite confusing):
- Invoking with '-h' shows help for the 'eff' command.
- Invoking without any command line option shows all available commands.
Many people update the software but forget to update the database.
SnpEff should show a simple and clear error message to avoid confusion.
Duplicate VCF fields and headers are now banned.
SnpEff & SnpSift behaviour must be modified accordingly.
"A promoter-level mammalian expression atlas", Nature, 2014-03
The snpEff databases command lists zaire_ebola as an available database, but the URL leads to a 404.
Add command line option "-classic" to override.
// if (outFor.equals("TXT")) outputFormat = OutputFormat.TXT;
It is biologically plausible for transcripts to have exons in both directions (plus and minus strand).
This may be the case reported in Maize genome, transcript AC208892.3_FGT005
which leads to errors when creating the summary (genes) file.
Caused by: java.lang.RuntimeException: Interval error: end before start. Start:216323401, End: 216323400
at ca.mcgill.mcb.pcingola.interval.Interval.<init>(Interval.java:32)
at ca.mcgill.mcb.pcingola.interval.Marker.<init>(Marker.java:31)
at ca.mcgill.mcb.pcingola.interval.Markers.merge(Markers.java:221)
at ca.mcgill.mcb.pcingola.interval.Gene.sizeof(Gene.java:385)
at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
After debugging, this is produced by:
start: 216323401 end:216323400
2:216323452-216323702 'Exon_2_216323452_216323703', rank: 3, frame: 2, sequence: aggagagggaggtggtggtggagcaccagcaggaggagccgaggaggagggccccg
2:216323401-216323400 'Exon_2_216323400_216323401', rank: 2, frame: 2, sequence:
2:216323283-216323328 'Exon_2_216323284_216323329', rank: 1, frame: 0, sequence: atggcggaggaccagaccaaccccagcggcccagccccagcaagcg
2:216323945-216324424 'Exon_2_216323946_216324425', rank: 4, frame: 0, sequence: caagggtttatttcgaggaatcaaattaacaacgatgtagtaacaggtgcacgagg
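The rank 2 exon above has an end one position before its start and an empty sequence, and that is the interval that trips the constructor check. A small sketch of how such exons could be detected before merging, assuming closed coordinates; the class ExonIntervalCheckSketch is hypothetical, and whether the exons should then be dropped, repaired, or reported is left open.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-merge check: finds exons whose end lies before their start.
class ExonIntervalCheckSketch {

    // Each exon is {start, end}; returns the indices of exons with end < start.
    static List<Integer> invalid(int[][] exons) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < exons.length; i++) {
            if (exons[i][1] < exons[i][0]) bad.add(i);
        }
        return bad;
    }

    public static void main(String[] args) {
        // Coordinates taken from the debug output above, in the order listed there.
        int[][] exons = {
            { 216323452, 216323702 },
            { 216323401, 216323400 },
            { 216323283, 216323328 },
            { 216323945, 216324424 },
        };
        System.out.println("Invalid exon indices: " + invalid(exons));
    }
}
```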
Modeled centromeres have sequences, but are "modeled" (they should be annotated like that)
Instead of assuming that all transcripts are non-coding by default, we could set as protein coding if the transcript has a CDS.
There might be problems with this approach. For
instance, in the human genome:
$ zcat genes.gtf.gz | grep -w CDS | cut -f 2 | ~/snpEff/scripts/uniqCount.pl
113 IG_C_gene
64 IG_D_gene
24 IG_J_gene
366 IG_V_gene
21 TR_C_gene
3 TR_D_gene
82 TR_J_gene
296 TR_V_gene
461 non_stop_decay
57770 nonsense_mediated_decay
773 polymorphic_pseudogene
731883 protein_coding
So there are 57K nonsense_mediated_decay transcripts that have CDSs, but are assumed not to be coding. As a workaround, we could add this only in cases
where the biotype is unknown (it's a better guess than assuming they are non-
Add support for VCF ALT tags
From user:
... I was wondering how difficult it would be to get SNPEFF to report the effects of frameshifts according to the "nomenclature for the description of sequence variants?"
For example, right now for a specific mutation at chr9:139811029 where we have a deletion GT -> G, the amino acid change gets annotated with -214. But this causes a frameshift, and should technically be reported as:
V214Gfs*14