PomBase curation
follow up
Original comment by: ValWood
i.e rpl2401-> rpl24 etc as /gene and /synonym
Original comment by: ValWood
This list might be out of date now, but somewhere I have the MySQL query to retrieve the list of complexes with only one annotation
at some point a good consistency check would be to see why
(are OK if the complex is multiple copies of the same subunit but for others should be possible to annotate other subunits....)
GO:0032045 guanyl-nucleotide exchange factor complex 1
GO:0017102 methionyl glutamyl tRNA synthetase complex 1
GO:0005961 glycine dehydrogenase complex (decarboxylating) 1
GO:0000164 protein phosphatase type 1 complex 1
GO:0000798 nuclear cohesin complex 1
GO:0043224 nuclear SCF ubiquitin ligase complex 1
GO:0000113 nucleotide-excision repair factor 4 complex 1
GO:0031464 Cul4A-RING ubiquitin ligase complex 1
GO:0005968 Rab-protein geranylgeranyltransferase complex 1
GO:0005942 phosphoinositide 3-kinase complex 1
GO:0043505 centromere-specific nucleosome 1
GO:0005945 6-phosphofructokinase complex 1
GO:0009349 riboflavin synthase complex 1
GO:0005954 calcium- and calmodulin-dependent protein kinase complex 1
GO:0043614 multi-eIF complex 1
GO:0031201 SNARE complex 1
GO:0031588 AMP-activated protein kinase complex 1
GO:0000941 inner kinetochore of condensed nuclear chromosome 1
GO:0005745 m-AAA complex 1
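A minimal sketch of the same consistency check run against a GAF file rather than the database. This is not the MySQL query mentioned above; the function name and the `complex_ids` argument (the set of GO complex terms to check, supplied by the caller) are assumptions for illustration.

```python
from collections import defaultdict


def single_subunit_complexes(gaf_lines, complex_ids):
    """Return GO complex IDs annotated to exactly one distinct gene product.

    gaf_lines: iterable of tab-separated GAF lines (column 2 is the gene
    product ID, column 5 is the GO ID).
    complex_ids: set of GO IDs to check (assumed to be supplied by the caller).
    """
    genes_by_term = defaultdict(set)
    for line in gaf_lines:
        # Skip blank lines and GAF header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue
        gene_id, go_id = cols[1], cols[4]
        if go_id in complex_ids:
            genes_by_term[go_id].add(gene_id)
    return sorted(t for t, genes in genes_by_term.items() if len(genes) == 1)
```

Complexes made of multiple copies of one subunit will still be reported, so the output needs the same manual review as the list above.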
Original comment by: ValWood
I had a tracker item to standardise the post-translational modification annotation (Which I think I have pretty much done, see
below), so these are probably ready to map to an ontology.
I was intending to do this as my "obo edit" training exercise, so you can assign this issue to me. I might pass it to Antonia and
do something else as I think this is pretty straightforward.
[http://old.genedb.org/genedb/Curation?organism=pombe&action=search&search=modification%2C+acetylated modification, acetylated] (11)
from trac ticket:
https://sourceforge.net/apps/trac/pombase/ticket/2
Original comment by: mah11
These are all fixed except for the GO ID in the with field for some of the IPI mappings.
I will look into converting these to Intact Complex IDs, or creating specific complex binding terms
The others will filter through when we do our next GO update which will probably be in a few weeks
Cheers
val
On 07/03/2011 14:22, Rachael Huntley wrote:
> Hi Val,
>
> Would you be able to help us with some questions on your gene association file, please?
>
> We're currently trying to improve the GO annotation we integrate into UniProtKB from external MODs, by looking at including the MOD identifiers used in the 'with' field of externally-generated annotations; at the moment we ignore any 'with' field data that doesn't use a GO identifier or UniProtKB accession, and so integrate these annotations into our set with an empty 'with' field, which obviously is not ideal. Therefore we would very much like to include S. pombe identifiers that match the following regular expression:
> (GeneDB_Spombe):(SP(\d|\w)+.(\d|\w)+)
>
> Does this look reasonable to you?
>
> In addition, as our database schema does not allow more than one value in the 'with', we are 'unwrapping' lists of identifiers that are separated by a pipe, to generate multiple annotation rows that differ solely by the contents of the 'with' field. We feel that this should be a reasonable way of treating such data for IPI and IMP annotations, for as I understand the pipe usage, it should be interpreted as separating two gene products that have been shown to independently (but from data obtained from the same paper and type of evidence) interact with the annotation object, to support the annotation of the same GO term. However please let me know if your interpretation is different, as we can't find any GO documentation on correct usage of pipes in the GAF format!
>
> Currently, there are a number of exceptions for the use of the with column generated from your file, which I have listed below together with the reasons why they have been rejected. If you feel that any of these are using a check that is too stringent, please do let us know.
>
> Rejected:
> [IC UniProtKB:O42870]
> Reason: The with column for an IC annotation should be filled with a GO ID
>
> Rejected:
> [IEP GeneDB_Spombe:SPAC25G10.03]
> [IEP GO:0016592]
> [IEP PMID:12161753]
> Reason: The with column should not be filled when using IEP
>
> Rejected:
> [IGI GeneDB_Spombe:S000000807]
> Reason: The identifier used is an SGD identifier, not GeneDB
>
> Rejected:
> [IGI SGD:000003904]
> Reason: The identifier is missing an 'S' at the beginning
>
> Rejected:
> [IGI SGD:S00000268]
> [IGI SGD:S00000550]
> [IGI SGD:S00003295]
> Reason: SGD identifiers should be 'S' followed by 9 digits
>
> Rejected:
> [IMP GO:0004660]
> [IMP GO:0005681]
> [IMP GO:0008990]
> [IMP GO:0046557]
> [IMP GO:0047657]
> Reason: A GO ID should not be used in the with column of an IMP annotation
>
> Rejected:
> [IMP PMID:11679064]
> [IMP PMID:12193640]
> [IMP PMID:14623292]
> [IMP PMID:16738311]
> Reason: A PMID should not be used in the with column of an IMP annotation
>
> In addition, we are trying to be quite strict in only importing annotations that apply a MOD identifier, rather than gene symbols. Therefore would you be willing to convert the following 'with' contents in your GO file into UniProtKB accessions?
>
> [IGI UniProtKB:BIN3_HUMAN]
> [IGI UniProtKB:ECC1_HUMAN]
> [IGI UniProtKB:MK01_HUMAN]
> [IGI UniProtKB:PIGW_HUMAN]
> [IGI UniProtKB:PRS6B_HUMAN]
> [IGI UniProtKB:PYRF_ECOLI]
>
> We've also noticed that you have several IPI annotations that have a GO ID for a complex in the 'with' column. We are planning to use IntAct complex IDs to cover this type of information, so we will continue to reject annotations that have a GOID in the with field for IPI. Supplying an IntAct complex ID in the with field instead of the GO ID would be a more accurate representation of the data, since the GO complex terms are not defined particularly well with regard to the composition of the complex in differing species.
> We have been working with IntAct on making protein complex IDs more visible in QuickGO (e.g. see http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0005680#info=2) and they have been very responsive to our requests. If you feel like you wanted to supply IntAct complex IDs in the with field, I am sure IntAct would be more than willing to create any complex IDs that are missing.
>
> Rejected:
> [IPI GO:0000812]
> [IPI GO:0005680]
> [IPI GO:0005681]
> [IPI GO:0005685]
> [IPI GO:0005832]
> [IPI GO:0005884]
> [IPI GO:0008180]
> [IPI GO:0016575]
> [IPI GO:0016592]
> [IPI GO:0031011]
> [IPI GO:0031011|GO:0000812]
> [IPI GO:0031511]
> [IPI GO:0031533]
> [IPI GO:0032221]
> [IPI GO:0033186]
> [IPI GO:0034967]
> [IPI GO:0035267]
> [IPI GO:0070209]
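A rough sketch of the 'with' column checks described in the email above, useful for pre-screening our file before the next update. It is not the GOA code: the function names are made up here, and the S. pombe pattern is the proposed regular expression with two assumed changes (the dot is escaped so it only matches a literal ".", and `(\d|\w)` is written as `\w`, which already covers digits).

```python
import re

# Assumed variant of the proposed pattern: escaped dot, \w instead of (\d|\w).
POMBE_ID = re.compile(r"^GeneDB_Spombe:SP\w+\.\w+$")
# "SGD identifiers should be 'S' followed by 9 digits"
SGD_ID = re.compile(r"^SGD:S\d{9}$")


def unwrap_with(evidence, with_field):
    """Split a pipe-separated 'with' value into one (evidence, value) row each."""
    return [(evidence, value) for value in with_field.split("|") if value]


def check_with(evidence, value):
    """Return a rejection reason for one non-empty 'with' value, or None."""
    prefix = value.split(":", 1)[0]
    if evidence == "IEP":
        return "with column should not be filled when using IEP"
    if evidence == "IC" and prefix != "GO":
        return "with column for IC should be a GO ID"
    if evidence in ("IMP", "IPI") and prefix == "GO":
        return "GO ID should not be used in the with column for " + evidence
    if evidence == "IMP" and prefix == "PMID":
        return "PMID should not be used in the with column for IMP"
    if prefix == "SGD" and not SGD_ID.match(value):
        return "SGD identifiers should be 'S' followed by 9 digits"
    if prefix == "GeneDB_Spombe" and not POMBE_ID.match(value):
        return "not an S. pombe systematic ID"
    return None
```

Running `unwrap_with` first and then `check_with` on each row mirrors the order described in the email: pipes are unwrapped into separate rows, then each row is accepted or rejected on its own.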
Original comment by: ValWood
There are a LARGE number of ISS annotations which are "unsupported"
i.e the with column target does not have the annotation
A few may need removing but I think most probably need a more granular annotation at SGD. In some cases it is possible that the original annotation was removed from SGD (it would be great if there was some alerting for this!)
There is a Goose MySQL query to get the numbers.
Maybe this should link to a wiki page with a link to the query where we periodically check the total?
I don't really intend to do this as a task, but in general the numbers should obviously decrease.
It also might be a nice training exercise for Antonia (or us) to check a few of these and either
i) ask SGD if the annotation should be moved down
or
ii) find other supporting evidence
or
iii) remove if incorrect
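A minimal sketch of the "unsupported" test, assuming the annotations have already been pulled into plain Python structures. This is not the Goose query; the function name and input shapes are assumptions. It only compares exact GO IDs, so it cannot tell "SGD annotation is less granular" apart from "SGD annotation was removed".

```python
def unsupported_iss(iss_annotations, target_annotations):
    """Return ISS annotations whose 'with' target lacks the same GO term.

    iss_annotations: iterable of (gene_id, go_id, with_id) tuples.
    target_annotations: dict mapping with_id -> set of GO IDs on that target.
    """
    return [
        (gene_id, go_id, with_id)
        for gene_id, go_id, with_id in iss_annotations
        # A target missing from the dict counts as having no annotations.
        if go_id not in target_annotations.get(with_id, set())
    ]
```

The length of the returned list is the number to track over time; it should decrease as items are resolved by route i), ii) or iii).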
from trac ticket:
https://sourceforge.net/apps/trac/pombase/ticket/4
Original comment by: mah11
1 create e-mail group?
2 Send interactions to BioGrid
3. update feedback form based on comments
4. Pombelist announce
Original comment by: ValWood
Yes, this is one of our jobs this year – to start to add in sets of ISS annotations from other groups where it doesn’t already exist in our database.
I agree these are important annotations, which have only been excluded from the GOA set because, historically, there has been the concern about possible circular ISS annotations, and also because we have not tried seriously to integrate ‘with’ data from external groups yet – it is going to be a lot of work to sanity check the multiple different values and formats that ‘with’ data from different groups can contain, this field has quite a range of contents, but I don’t think it would be correct to integrate ISS annotations otherwise.
Emily
Valerie Wood wrote:
> Hi Emily,
>
> Did you talk to Dan and Uniprot about including the ISS annotations in the GOA/Uniprot entries (at least from the reference genomes)? This would be useful as it seems that some people use GOA as the annotation source. This means people think these annotations are missing and make function predictions where we already have good manually curated supported predictions and it all gets very messy….
>
> HNY
>
> Val
>
> E Dimmer wrote:
>
>> Hi Val,
>>
>> Thanks – I’ll talk with Dan as to whether it would be possible to integrate MOD IDs.
>>
>> There are ISS annotations in this file, however only those created by the GOA annotation tool. ISS annotations from groups external to GOA have never been included in any of our releases/displays – GOA has only ever integrated manual non-ISS annotations from other groups. GOA decided this to avoid the potential problem of circular ‘ISS’ annotations – however I think we are now beginning to see that other groups create very ISS annotations to us, and we might need to revisit that decision. A range of GO_REFs for the different types of ISS annotations created by curators might help us in this area.
>>
>> Cheers,
>> Emily
Original comment by: ValWood
After global uniprot update
Original comment by: ValWood
annotate identical proteins as "identical"
Original comment by: ValWood
Original comment by: ValWood
Some pseudogenes have changed status and need "warning, previously annotated as pseudo"
and GO annotation.
If a gene has a single frameshift, it probably isn't good to represent it as a pseudogene. Make "valid translations" for these, removing the pseudo label and flagging as
"warning, possible frameshift"
"warning, previously annotated as pseudo"
(they may be problems in the sequenced strain and some are problems with the sequence)
This way people will see them in the gene set if they are trying to do comparisons with other strains, or octosporus, japonicus etc. Adding the pseudo flag makes them less visible by excluding them from the protein set....
Still need to submit some of the Biogrid curation from the community curation pilot
Original comment by: ValWood
1:1 only
protein only
no IEA/ISS?RCA
omit high level terms
omit cell fraction etc
omit S. c specific (like budding)
others?
Original comment by: ValWood
Low complexity regions need systematic IDs
Original comment by: ValWood
Feb 13th
re-evaluated all use and reduced from
242 to 57 annotations
126 to ?? terms
check after update
Original comment by: ValWood
To pick up new features
http://www.genedb.org/genedb/pombe/newgenes.jsp
and gene structure changes
http://www.genedb.org/genedb/pombe/coordChanges.jsp
and recent Broad updates (lots)
FIRST
check NCBI name problem fixed
check dummy PMID fixed
also fix strain of mating type contig
instructions are in sequence update
AFTER
report update to
Uniprot
pombelist
NCBI
Raja pir/ ensembl/biogrid Eleanor gp2protein
from trac ticket:
https://sourceforge.net/apps/trac/pombase/ticket/29
Original comment by: mah11
Need to follow up on handling of notes
On 09/02/2011 19:12, Kim Rutherford wrote:
> On Wednesday 9 February 2011 at 18:20:36, Val Wood wrote:
>
>> There are ~9575 notes on various features. Many non-CDS features have a
>> single note in free text (because controlled_curation is only allowed in
>> GeneDB for CDS)
> The Sanger loader stores /notes in the featureprop table with a
> property type of "comment" (as far as I can tell). There are only 2539
> comment properties in the database, so I have a bit of investigation to
> do.
>
> The breakdown of comments by type is:
>
> count | name
> -------+---------------
> 174 | tRNA
> 13 | rRNA
> 391 | ncRNA
> 361 | repeat_region
> 6 | snoRNA
> 372 | polypeptide
> 1222 | region
>
> The Sanger loader has cleverly not bothered to put any on the gene
> features.
>
>
>> These are mainly "a bit controlled" and can be refined quite easily
>> into some sort of vocabulary.
>> You mentioned that 366 notes are on CDS...I would like to get rid of
>> these if possible so if you can send me a list at some point (no hurry
>> as I won't have time for a few weeks)
> Do you need a list of the notes, or a list of the CDSs that have notes?
>
> Kim.
Val
There are a bunch of notes on
5'UTR
3'UTR
LTR
etc
Original comment by: ValWood
/note
956.
consider removing non cds features as many are ars/repeat/rna related
filter terms:
splice donor
splice branch
mRNA from
confirmed by mRNA
confirmed intron
anticodon
LTR
nominal overlap
this transcription could be
this transcript could be
Homol
gene-free
gene free
duplicated region
region duplicates
has transcript profile
confirmed
longest ORF
previously annotatd as dubious
Intron predicted
SPNG
ABO (EMBL ID)
Tf
TF1
TF2
dg I
dh I
TATA
poly A
wtf
check how many not CDS not, see Tim
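A small sketch of how the filter terms could be applied to a dump of /note values. The function name is hypothetical, only a subset of the terms above is shown, and case-insensitive substring matching is an assumption; short terms such as "Tf", "wtf" or "confirmed" will over-match with this approach and may need word-boundary handling.

```python
# Subset of the filter terms listed above; extend as needed.
FILTER_TERMS = [
    "splice donor", "splice branch", "mRNA from", "confirmed by mRNA",
    "confirmed intron", "anticodon", "nominal overlap", "gene-free",
    "gene free", "duplicated region", "region duplicates", "longest ORF",
    "Intron predicted", "poly A",
]


def split_notes(notes, filter_terms=FILTER_TERMS):
    """Separate notes containing any filter term from the remaining notes."""
    lowered = [term.lower() for term in filter_terms]
    filtered, remaining = [], []
    for note in notes:
        text = note.lower()
        if any(term in text for term in lowered):
            filtered.append(note)
        else:
            remaining.append(note)
    return filtered, remaining
```

The `remaining` list is what would still need looking at by hand once the routine notes are filtered out.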
Original comment by: ValWood
[wrong summary removed 2015-10-22 mah]
all unpublished
all ref genome
Original comment by: ValWood
activation
arrest/maintenance
response
recovery
check new signalling ontology
Original comment by: ValWood
Make fission yeast GO slim for function/ component
Original comment by: ValWood
diff? ascospore wall biogenesis and assembly
based on
https://sourceforge.net/tracker/?func=detail&aid=3092005&group_id=36855&atid=440764
I have a note to
fix direct annotations to spore wall assembly
(should be biogenesis?)
Original comment by: ValWood
Would be nice to add confirmed introns for all Solexa and Broad data
Original comment by: ValWood
We need to host the contig files somewhere where myself Midori and Antonia will be able to edit them in Artemis simultaneously.....
val
from trac ticket:
https://sourceforge.net/apps/trac/pombase/ticket/25
Original comment by: mah11
These are things from my list which may or may not need doing
• Question new term for transcription factor activity (90) (is this the correct term? GO:0003704)
• specific RNA polymerase II transcription factor activity remap to new term ?
• check diff between cellular protein complex assembly and protein complex assembly
https://sourceforge.net/tracker/?func=detail&aid=1891961&group_id=36855&atid=440764
• annotate all sequence specific transcription factors to promoter binding
also need to check that everything from the transcription related SF items is closed, and any other reannotations required from these are done
Original comment by: ValWood
curate horizontal transfer events from Broad paper
Original comment by: ValWood
We need to go through the RNAs and merge the Wilhelm/Broad RNAs where coincident
from trac ticket:
https://sourceforge.net/apps/trac/pombase/ticket/23
Original comment by: mah11
See GeneDB listed
also
/ID="SPAC31G5.09c"
/note="alternative UTRs for this feature are represented by different polyA sites"
Original comment by: ValWood
add product hierarchy here
announce to pombelist
Original comment by: ValWood
PMID: 18257517 allows a bunch of phosphorylation sites to be mapped onto specific residues
Original comment by: ValWood
i) Most noncoding RNAs (rRNA, sno, etc) are annotated with RCA, need updating to ISS
ii) tRNA annotations (specific codons) need checking
from trac ticket:
https://sourceforge.net/apps/trac/pombase/ticket/13
Original comment by: mah11
I occasionally check whether annotations have appeared at the following terms which could be more specific.
These should move at some point to be documented on the wiki,
and form part of the automated alerting which Kim will set up down the line.
For now it is just a case of collecting the terms which should usually be more specific....
These are usually (not always) too general; they were checked today (22 March 2010)
cell cycle -> mitotic cell cycle or regulation of cell cycle
mitotic cell cycle->regulation of mitotic cell cycle
regulation of cell cycle -> regulation of mitotic cell cycle
Also checked and moved today
nuclear mRNA splicing, via spliceosome -> nuclear mRNA cis splicing via spliceosome
Previously checked
metabolic -> cellular metabolic DONE 9th November
DNA replication -> DNA-dependent DNA replication DONE 10 November 2010
all response to xxx stress are annotated to cellular response to xxx stress 10 November 2010
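Until the automated alerting exists, the check could be sketched roughly as below. The mapping is taken from the term pairs above; the function name and the `(gene, term_name)` input shape are assumptions, and only direct annotations to the watched term are flagged.

```python
# Terms that should usually (not always) be replaced by something more specific.
TOO_GENERAL = {
    "cell cycle": ["mitotic cell cycle", "regulation of cell cycle"],
    "mitotic cell cycle": ["regulation of mitotic cell cycle"],
    "regulation of cell cycle": ["regulation of mitotic cell cycle"],
    "nuclear mRNA splicing, via spliceosome": [
        "nuclear mRNA cis splicing via spliceosome"
    ],
    "DNA replication": ["DNA-dependent DNA replication"],
}


def flag_general(annotations, watch=TOO_GENERAL):
    """Return (gene, term, suggestions) for direct annotations to watched terms.

    annotations: iterable of (gene_id, term_name) pairs.
    """
    return [(gene, term, watch[term]) for gene, term in annotations if term in watch]
```

Because the rule is "usually, not always", the output is a review list rather than something to apply automatically.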
Original comment by: ValWood
repeat, include questions about
Ensembl
YOGY
links to others
Community curation
country
Original comment by: ValWood
Add
gene expression, antisense transcript detected
to all genes with antisense transcripts
(can probably do this in Artemis)
from trac ticket:
https://sourceforge.net/apps/trac/pombase/ticket/9
Original comment by: mah11
Follow this up with Kwang and Jacky
Original comment by: ValWood
Need to follow this up at some point. May need to check further with Juan...
Yes, what Xavi did is perfect. He's put together mat2 and mat3 with adjacent regions (which should be present in all h90 strains). In h90 cells, mat1 would switch between having a p cassette or a m cassette. In 972 h- there is a deletion of mat2-p. However, in some h+ strains there is a deletion of mat3-M, but more frequently there are mutations that prevent the switching from P to M, although mat2-P is still there. Therefore, one would have to assemble a different region for each strain. I think the configuration of the different h+ and h- strains is described in the very old pombe book. Anyway, most of the time people don't even know which h+ or h- allele they have.
I think it would help to annotate mat1-m and mat3-m as such in the main contig (rather than calling them matmi_1 and _2 - or at least add a note), so that people know which one they are looking at. Also, it would help to annotate the features of the mating type region that Xavi put in his contig and that are present in 972 (like the IR-L and IR-R regions, or the homology boxes). Finally, would it be possible to add a note somewhere directing people to the mating type contig?
I hope this is useful - let me know if I can help with anything else
Juan
Original comment by: ValWood
• Hideki, gap filling ?
• Sasaki M, Idiris A, Tada A, Kumagai H, Giga-Hama Y, Tohda H.
Yeast. 2008 Sep;25(9):673-9.PMID: 18727152 new telomeric clone
pending
http://www.sanger.ac.uk/Projects/S_pombe/sequence_updates.shtml
http://www.sanger.ac.uk/Projects/S_pombe/sequence_discrepancies.shtml
from trac ticket:
https://sourceforge.net/apps/trac/pombase/ticket/30
Original comment by: mah11
Need to
Original comment by: ValWood
Spreadsheet
http://spreadsheets.google.com/ccckey=pZhlLFuj8ewDe799QTmxzCA&hl=en12:42
eg
http://proto.informatics.jax.org/prototypes/GOgraphEX/PPOD12_Graph/ORTHOMCL1245.html
AUG-FEB
PMID: 18820293 rps2
PMID: 2834104
PMID:14623272 ? done?
Need to add pre August list
Original comment by: ValWood
Get lists of IC pairs and numbers
Fill holes
Original comment by: ValWood
GO TERMS
GO:0070315 G1 to G0 transition involved in cell differentiation
GO:0070317 negative regulation of G0 to G1 transition
regulation of cell quiescence
PAPERS
PMID: 19833516 yanagida G0 genes
PMID:17535257
PMID 19366728 sajiki at oist.jp (Kenich Sajiki)
PMID: 20133687
PMID: 20418666
others?
from trac ticket:
https://sourceforge.net/apps/trac/pombase/ticket/12
Original comment by: mah11
Add High throughput TAG
Original comment by: ValWood
subunit composition <-> X-dimerization activity
Original comment by: ValWood
1. Add new telomeric region to contigs
(are there 2 now?)
2. Do small fixes which are confirmed
http://www.sanger.ac.uk/Projects/S_pombe/sequence_updates.shtml
http://www.sanger.ac.uk/Projects/S_pombe/sequence_discrepancies.shtml
3. Can we fill any gaps using Broad data? Speak to Nick Rhind
To Do
Give community advance notice of contig changes
Make Fixes
Update Stats/ webpage (sequencing status etc)/ download data
EMBL resubmission
Original comment by: ValWood
There is a script which
when supplied with a GO ID, and the current GO database, and the fission yeast/ budding yeast ortholog table will report differences in annotations between the 2 organisms.
Need to check all "high level" terms
Recently done
translation
transport
ribosome biogenesis
vitamin metabolism
Part done
vesicle mediated transport
DNA recombination
DNA repair
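For reference, the comparison the script performs can be sketched as below. This is not the existing script: the function name and inputs are assumptions, and it only looks at direct annotations to the supplied GO ID, whereas a check against the GO database would presumably also count annotations to descendant terms.

```python
def annotation_differences(go_id, pombe_annots, cerevisiae_annots, orthologs):
    """Report ortholog pairs where only one member is annotated to go_id.

    pombe_annots, cerevisiae_annots: dict mapping gene ID -> set of GO IDs.
    orthologs: iterable of (pombe_gene, cerevisiae_gene) pairs.
    """
    differences = []
    for pombe_gene, sc_gene in orthologs:
        in_pombe = go_id in pombe_annots.get(pombe_gene, set())
        in_sc = go_id in cerevisiae_annots.get(sc_gene, set())
        # Only pairs that disagree are of interest.
        if in_pombe != in_sc:
            side = "pombe only" if in_pombe else "cerevisiae only"
            differences.append((pombe_gene, sc_gene, side))
    return differences
```

Running it once per "high level" term gives the lists to work through for the terms still marked as part done.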
from trac ticket:
https://sourceforge.net/apps/trac/pombase/ticket/16
Original comment by: mah11
need to check how often the IGI supported GO annotation is not applied to both genes (ignore cellular protein localization)
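One possible way to count these, as a sketch only. The function name and input shapes are assumptions, and "applied to both genes" is read here as "the gene in the with column carries the same GO term with any evidence code"; terms to skip (such as cellular protein localization) are passed in by the caller as GO IDs.

```python
def igi_not_applied_to_both(igi_annotations, terms_by_gene, ignore=frozenset()):
    """Return IGI annotations whose interacting gene lacks the same GO term.

    igi_annotations: iterable of (gene_id, go_id, with_gene_id) tuples.
    terms_by_gene: dict mapping gene ID -> set of GO IDs annotated to it.
    ignore: GO IDs to leave out of the check.
    """
    return [
        (gene_id, go_id, partner)
        for gene_id, go_id, partner in igi_annotations
        if go_id not in ignore
        and go_id not in terms_by_gene.get(partner, set())
    ]
```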
Original comment by: ValWood
gene expression: split "expression regulated by" into +ve and -ve (some already have qualifiers)
Go through and work out what is needed
Original comment by: ValWood
make sure DB IDs in misc_RNAs and introns are in dbxref
(i.e mRNA from .....)
Original comment by: ValWood
After ref genome meeting, summarized “contributes_to”
Provided a list of OK egs
need to address the others
notes:
should encourage people not to just remove, but to reannotate
examples to discuss
histone acetyl transferase
should only be used when a function is required for an activity, not for a regulation or process
should it always be present for >1 subunit?
warning in logs for incorrect usage
Original comment by: ValWood