curation's People

Contributors

kimrutherford, manulera, valwood

curation's Issues

single annotations to "complex" terms to fix

This list might be out of date now, but somewhere I have the MySQL query to retrieve the list of complexes with only one annotation.
At some point a good consistency check would be to see why each has only one
(this is OK if the complex is multiple copies of the same subunit, but for others it should be possible to annotate other subunits...)

GO:0032045 guanyl-nucleotide exchange factor complex 1
GO:0017102 methionyl glutamyl tRNA synthetase complex 1
GO:0005961 glycine dehydrogenase complex (decarboxylating) 1
GO:0000164 protein phosphatase type 1 complex 1
GO:0000798 nuclear cohesin complex 1
GO:0043224 nuclear SCF ubiquitin ligase complex 1
GO:0000113 nucleotide-excision repair factor 4 complex 1
GO:0031464 Cul4A-RING ubiquitin ligase complex 1
GO:0005968 Rab-protein geranylgeranyltransferase complex 1
GO:0005942 phosphoinositide 3-kinase complex 1
GO:0043505 centromere-specific nucleosome 1
GO:0005945 6-phosphofructokinase complex 1
GO:0009349 riboflavin synthase complex 1
GO:0005954 calcium- and calmodulin-dependent protein kinase complex 1
GO:0043614 multi-eIF complex 1
GO:0031201 SNARE complex 1
GO:0031588 AMP-activated protein kinase complex 1
GO:0000941 inner kinetochore of condensed nuclear chromosome 1
GO:0005745 m-AAA complex 1

Original comment by: ValWood
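
A minimal sketch of the consistency check described above, assuming the annotations are available as (gene, GO ID) pairs and the set of complex term IDs is already known. This is not the MySQL query mentioned in the issue; the function name and the sample gene IDs are hypothetical.

```python
from collections import defaultdict


def singly_annotated_complexes(annotations, complex_terms):
    """Return complex GO IDs that are annotated to exactly one distinct gene.

    annotations: iterable of (gene_id, go_id) pairs (assumed input shape).
    complex_terms: set of GO IDs that are protein complex terms.
    """
    genes_by_term = defaultdict(set)
    for gene_id, go_id in annotations:
        if go_id in complex_terms:
            genes_by_term[go_id].add(gene_id)
    return sorted(t for t, genes in genes_by_term.items() if len(genes) == 1)


# Illustrative data only; the gene IDs are made up.
example = [
    ("geneA", "GO:0000798"),
    ("geneB", "GO:0031201"),
    ("geneC", "GO:0031201"),
    ("geneD", "GO:0005515"),  # not in complex_terms, so ignored
]
print(singly_annotated_complexes(example, {"GO:0000798", "GO:0031201"}))
```

Terms reported this way still need a manual look, since a complex made of multiple copies of one subunit legitimately has a single annotated gene.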

Post translational modifications, move to ontology

I had a tracker item to standardise the post-translational modification annotation (Which I think I have pretty much done, see
below), so these are probably ready to map to an ontology.

I was intending to do this as my "obo edit" training exercise, so you can assign this issue to me. I might pass it to Antonia and
do something else as I think this is pretty straightforward.

[http://old.genedb.org/genedb/Curation?organism=pombe&action=search&search=modification%2C+acetylated modification, acetylated] (11)

from trac ticket:
https://sourceforge.net/apps/trac/pombase/ticket/2

Original comment by: mah11

Fixing Complex IDs in "with" fields

Need to follow up fixing the GO ID in the with field (UniProt would like us to use the IntAct complex ID).
There are some other things in here which could be added to consistency checks...

These are all fixed except for the GO ID in the with field for some of the IPI mappings.
I will look into converting these to IntAct complex IDs, or creating specific complex binding terms.

The others will filter through when we do our next GO update which will probably be in a few weeks

Cheers

val

On 07/03/2011 14:22, Rachael Huntley wrote:
> Hi Val,
>
> Would you be able to help us with some questions on your gene association file, please?
>
> We're currently trying to improve the GO annotation we integrate into UniProtKB from external MODs, by looking at including the MOD identifiers used in the 'with' field of externally-generated annotations; at the moment we ignore any 'with' field data that doesn't use a GO identifier or UniProtKB accession, and so integrate these annotations into our set with an empty 'with' field, which obviously is not ideal. Therefore we would very much like to include S. pombe identifiers that match the following regular expression:
> (GeneDB_Spombe):(SP(\d|\w)+.(\d|\w)+)
>
> Does this look reasonable to you?
>
> In addition, as our database schema does not allow more than one value in the 'with', we are 'unwrapping' lists of identifiers that are separated by a pipe, to generate multiple annotation rows that differ solely by the contents of the 'with' field. We feel that this should be a reasonable way of treating such data for IPI and IMP annotations, for as I understand the pipe usage, it should be interpreted as indicating two gene products that have been shown to independently (but from data obtained from the same paper and type of evidence) interact with the annotation object, to support the annotation of the same GO term. However please let me know if your interpretation is different, as we can't find any GO documentation on correct usage of pipes in the GAF format!
>
> Currently, there are a number of exceptions for the use of the with column generated from your file, which I have listed below together with the reasons why they have been rejected. If you feel that any of these are using a check that is too stringent, please do let us know.
>
> Rejected:
> [IC UniProtKB:O42870]
> Reason: The with column for an IC annotation should be filled with a GO ID
>
> Rejected:
> [IEP GeneDB_Spombe:SPAC25G10.03]
> [IEP GO:0016592]
> [IEP PMID:12161753]
> Reason: The with column should not be filled when using IEP
>
> Rejected:
> [IGI GeneDB_Spombe:S000000807]
> Reason: The identifier used is an SGD identifier, not GeneDB
>
> Rejected:
> [IGI SGD:000003904]
> Reason: The identifier is missing an 'S' at the beginning
>
> Rejected:
> [IGI SGD:S00000268]
> [IGI SGD:S00000550]
> [IGI SGD:S00003295]
> Reason: SGD identifiers should be 'S' followed by 9 digits
>
> Rejected:
> [IMP GO:0004660]
> [IMP GO:0005681]
> [IMP GO:0008990]
> [IMP GO:0046557]
> [IMP GO:0047657]
> Reason: A GO ID should not be used in the with column of an IMP annotation
>
> Rejected:
> [IMP PMID:11679064]
> [IMP PMID:12193640]
> [IMP PMID:14623292]
> [IMP PMID:16738311]
> Reason: A PMID should not be used in the with column of an IMP annotation
>
> In addition, we are trying to be quite strict in only importing annotations that apply a MOD identifier, rather than gene symbols. Therefore would you be willing to convert the following 'with' contents in your GO file into UniProtKB accessions?
>
> [IGI UniProtKB:BIN3_HUMAN]
> [IGI UniProtKB:ECC1_HUMAN]
> [IGI UniProtKB:MK01_HUMAN]
> [IGI UniProtKB:PIGW_HUMAN]
> [IGI UniProtKB:PRS6B_HUMAN]
> [IGI UniProtKB:PYRF_ECOLI]
>
> We've also noticed that you have several IPI annotations that have a GO ID for a complex in the 'with' column. We are planning to use IntAct complex IDs to cover this type of information, so we will continue to reject annotations that have a GOID in the with field for IPI. Supplying an IntAct complex ID in the with field instead of the GO ID would be a more accurate representation of the data, since the GO complex terms are not defined particularly well with regard to the composition of the complex in differing species.
> We have been working with IntAct on making protein complex IDs more visible in QuickGO (e.g. see http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0005680#info=2) and they have been very responsive to our requests. If you feel like you wanted to supply IntAct complex IDs in the with field, I am sure IntAct would be more than willing to create any complex IDs that are missing.
>
> Rejected:
> [IPI GO:0000812]
> [IPI GO:0005680]
> [IPI GO:0005681]
> [IPI GO:0005685]
> [IPI GO:0005832]
> [IPI GO:0005884]
> [IPI GO:0008180]
> [IPI GO:0016575]
> [IPI GO:0016592]
> [IPI GO:0031011]
> [IPI GO:0031011|GO:0000812]
> [IPI GO:0031511]
> [IPI GO:0031533]
> [IPI GO:0032221]
> [IPI GO:0033186]
> [IPI GO:0034967]
> [IPI GO:0035267]
> [IPI GO:0070209]

Original comment by: ValWood
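
Several of the rejection reasons in the e-mail above could be turned into consistency checks. The following is a sketch only, assuming each annotation is reduced to an evidence code and a raw 'with' string. The function names are hypothetical, the rules are just the ones quoted in the e-mail, and the dot in the S. pombe pattern is escaped on the assumption that a literal "." was intended.

```python
import re

# Pattern adapted from the e-mail: (\d|\w)+ is equivalent to \w+.
POMBE_ID = re.compile(r"GeneDB_Spombe:SP\w+\.\w+")
SGD_ID = re.compile(r"SGD:S\d{9}")  # 'S' followed by 9 digits
GO_ID = re.compile(r"GO:\d{7}")


def unwrap_with(with_field):
    """Split a pipe-separated 'with' value into individual identifiers."""
    return [part for part in with_field.split("|") if part]


def check_with(evidence, with_field):
    """Return a list of problems found in one annotation's 'with' column."""
    problems = []
    ids = unwrap_with(with_field)
    if evidence == "IEP" and ids:
        problems.append("with column should be empty for IEP")
    for ident in ids:
        if evidence == "IC" and not GO_ID.fullmatch(ident):
            problems.append(f"{ident}: IC needs a GO ID")
        if evidence in ("IMP", "IPI") and GO_ID.fullmatch(ident):
            problems.append(f"{ident}: GO ID not allowed for {evidence}")
        if evidence == "IMP" and ident.startswith("PMID:"):
            problems.append(f"{ident}: PMID not allowed for IMP")
        if ident.startswith("SGD:") and not SGD_ID.fullmatch(ident):
            problems.append(f"{ident}: SGD IDs are 'S' plus 9 digits")
        if ident.startswith("GeneDB_Spombe:") and not POMBE_ID.fullmatch(ident):
            problems.append(f"{ident}: not a valid S. pombe systematic ID")
    return problems


print(check_with("IPI", "GO:0031011|GO:0000812"))
print(check_with("IGI", "GeneDB_Spombe:S000000807"))
```

Checking each identifier after unwrapping mirrors the row-per-identifier treatment described in the e-mail, so a single bad entry in a piped list is reported on its own.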

check ISS outliers/stale ISS

There are a LARGE number of ISS annotations which are "unsupported",
i.e. the with column target does not have the annotation.
A few may need removing, but I think most probably need a more granular annotation at SGD. In some cases it is possible that the original annotation was removed from SGD (it would be great if there was some alerting for this!).
There is a Goose MySQL query to get the numbers.
Maybe this should link to a wiki page with a link to the query where we periodically check the total?
I don't really intend to do this as a task, but in general the numbers should obviously decrease.
It also might be a nice training exercise for Antonia (or us) to check a few of these and either
i) ask SGD if the annotation should be moved down
or
ii) find other supporting evidence
or
iii) remove if incorrect

from trac ticket:
https://sourceforge.net/apps/trac/pombase/ticket/4

Original comment by: mah11
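
A rough sketch of the "unsupported ISS" check, assuming the ISS annotations and the target organism's annotations have already been loaded into simple Python structures. This is not the Goose MySQL query mentioned above; the function name, input shapes and sample IDs are hypothetical.

```python
def unsupported_iss(iss_annotations, target_annotations):
    """Return ISS annotations whose 'with' target lacks the same GO term.

    iss_annotations: iterable of (gene_id, go_id, with_id) tuples.
    target_annotations: dict mapping target ID -> set of GO IDs annotated
        directly to that target (for example, loaded from an SGD file).
    Simplification: only direct annotations are compared. A fuller check
    would also accept a more specific (descendant) term on the target.
    """
    return [
        (gene, go_id, with_id)
        for gene, go_id, with_id in iss_annotations
        if go_id not in target_annotations.get(with_id, set())
    ]


# Illustrative data only; all identifiers are made up.
iss = [
    ("pombe1", "GO:0000001", "SGD:S000000001"),
    ("pombe2", "GO:0000002", "SGD:S000000002"),
]
sgd = {
    "SGD:S000000001": {"GO:0000001"},
    "SGD:S000000002": {"GO:0000003"},
}
print(unsupported_iss(iss, sgd))
```

Running this periodically and recording the length of the result would give the total that the issue suggests tracking over time.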

community curation tasks

1. Create e-mail group?
2. Send interactions to BioGRID
3. Update feedback form based on comments
4. Pombelist announce

Original comment by: ValWood

get all pombe GO annotations into Uniprot/GOA

Yes, this is one of our jobs this year – to start to add in sets of ISS annotations from other groups where it doesn’t already exist in our database.
I agree these are important annotations, which have only been excluded from the GOA set because, historically, there has been the concern about possible circular ISS annotations, and also because we have not tried seriously to integrate ‘with’ data from external groups yet – it is going to be a lot of work to sanity check the multiple different values and formats that ‘with’ data from different groups can contain; this field has quite a range of contents, but I don’t think it would be correct to integrate ISS annotations otherwise.

Emily

Valerie Wood wrote:

> Hi Emily,
>
> Did you talk to Dan and Uniprot about including the ISS annotations in the GOA/Uniprot entries (at least from the reference genomes)? This would be useful as it seems that some people use GOA as the annotation source. This means people think these annotations are missing and make function predictions where we already have good manually curated supported predictions and it all gets very messy….
>
> HNY
>
> Val
>
> E Dimmer wrote:
>
>> Hi Val,
>>
>> Thanks – I’ll talk with Dan as to whether it would be possible to integrate MOD IDs.
>>
>> There are ISS annotations in this file, however only those created by the GOA annotation tool. ISS annotations from groups external to GOA have never been included in any of our releases/displays – GOA has only ever integrated manual non-ISS annotations from other groups. GOA decided this to avoid the potential problem of circular ‘ISS’ annotations – however I think we are now beginning to see that other groups create very similar ISS annotations to us, and we might need to revisit that decision. A range of GO_REFs for the different types of ISS annotations created by curators might help us in this area.
>>
>> Cheers,
>> Emily

Original comment by: ValWood

identical proteins

annotate identical proteins as "identical"

Original comment by: ValWood

After UPDATE; sort pseudogenes

Original comment by: ValWood

Some pseudogenes have changed status and need "warning, previously annotated as pseudo"
and GO annotation.

If a gene has a single frameshift, it probably isn't good to represent it as a pseudogene. Make "valid translations" for these, removing the pseudo label and flagging as
"warning, possible frameshift"
"warning, previously annotated as pseudo"
(they may be problems in the sequenced strain and some are problems with the sequence)

This way people will see them in the gene set if they are trying to do comparisons with other strains, or octosporus, japonicus etc. Adding the pseudo flag makes them less visible by excluding them from the protein set....

Community curation/ BioGRID

Still need to submit some of the BioGRID curation from the community curation pilot

Original comment by: ValWood

GO: colocalizes_with

Feb 13th
re-evaluated all use and reduced from
242 to 57 annotations
126 to ?? terms
check after update

Original comment by: ValWood

EMBL resubmission

To pick up new features
http://www.genedb.org/genedb/pombe/newgenes.jsp
and gene structure changes
http://www.genedb.org/genedb/pombe/coordChanges.jsp
and recent Broad updates (lots)
FIRST
check NCBI name problem fixed
check dummy PMID fixed
also fix strain of mating type contig
instructions are in sequence update
AFTER
report update to
Uniprot
pombelist
NCBI
Raja pir/ ensembl/biogrid Eleanor gp2protein

from trac ticket:
https://sourceforge.net/apps/trac/pombase/ticket/29

Original comment by: mah11

/note=

Need to follow up on handling of notes

On 09/02/2011 19:12, Kim Rutherford wrote:
> On Wednesday 9 February 2011 at 18:20:36, Val Wood wrote:
>
>> There are ~9575 notes on various features. Many non-CDS features have a
>> single note in free text (because controlled_curation is only allowed in
>> GeneDB for CDS)
> The Sanger loader stores /notes in the featureprop table with a
> property type of "comment" (as far as I can tell). There are only 2539
> comment properties in the database, so I have a bit of investigation to
> do.
>
> The breakdown of comments by type is:
>
> count | name
> -------+---------------
> 174 | tRNA
> 13 | rRNA
> 391 | ncRNA
> 361 | repeat_region
> 6 | snoRNA
> 372 | polypeptide
> 1222 | region
>
> The Sanger loader has cleverly not bothered to put any on the gene
> features.
>
>
>> These are mainly "a bit controlled" and can be refined quite easily
>> into some sort of vocabulary.
>> You mentioned that 366 notes are on CDS...I would like to get rid of
>> these if possible so if you can send me a list at some point (no hurry
>> as I won't have time for a few weeks)
> Do you need a list of the notes, or a list of the CDSs that have notes?
>
> Kim.

Val
There are a bunch of notes on
5'UTR
3'UTR
LTR
etc

Original comment by: ValWood

annotation notes

/note
956.

consider removing non cds features as many are ars/repeat/rna related

filter terms:
splice donor
splice branch
mRNA from
confirmed by mRNA
confirmed intron
anticodon
LTR
nominal overlap
this transcription could be
this transcript could be
Homol
gene-free
gene free
duplicated region
region duplicates
has transcript profile
confirmed
longest ORF
previously annotatd as dubious
Intron predicted
SPNG
ABO (EMBL ID)
Tf
TF1
TF2
dg I
dh I
TATA
poly A
wtf

check how many are not CDS notes, see Tim

Original comment by: ValWood
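
One way the filter terms above could be applied is sketched below, assuming the notes are plain strings and that a case-insensitive substring match is acceptable. The issue does not say how matching should work, so that is an assumption; the function name and sample notes are hypothetical, and only a subset of the listed terms is included.

```python
# A subset of the filter terms listed in the issue.
FILTER_TERMS = [
    "splice donor",
    "splice branch",
    "mRNA from",
    "confirmed intron",
    "anticodon",
    "gene-free",
    "gene free",
    "duplicated region",
    "longest ORF",
]


def split_notes(notes, filter_terms=FILTER_TERMS):
    """Partition notes into (filtered, remaining) lists.

    A note is filtered if it contains any filter term, ignoring case.
    """
    lowered = [term.lower() for term in filter_terms]
    filtered, remaining = [], []
    for note in notes:
        text = note.lower()
        if any(term in text for term in lowered):
            filtered.append(note)
        else:
            remaining.append(note)
    return filtered, remaining


# Illustrative notes only; these are made up.
notes = [
    "confirmed intron",
    "gene-free region",
    "similar to human protein",
]
filtered, remaining = split_notes(notes)
print(len(filtered), len(remaining))
```

Short terms from the list such as "Tf", "LTR" or "wtf" would over-match with plain substring search (for example "LTR" inside "filtration"), so they would probably need word-boundary or case-sensitive matching instead.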

checkpoints

activation
arrest/maintenance
response
recovery

check new signalling ontology

Original comment by: ValWood

After transcription overhaul....

These are things from my list which may or may not need doing

• Question new term for transcription factor activity (90) (is this the correct term? GO:0003704)
• specific RNA polymerase II transcription factor activity remap to new term ?
• check diff between cellular protein complex assembly and protein complex assembly

https://sourceforge.net/tracker/?func=detail&aid=1891961&group_id=36855&atid=440764
• annotate all sequence specific transcription factors to promoter binding

also need to check that everything from the transcription related SF items is closed, and any other reannotations required from these are done

Original comment by: ValWood

horizontal transfer

curate horizontal transfer events from Broad paper

Original comment by: ValWood

annotate alternative transcripts

See GeneDB listed

also
/ID="SPAC31G5.09c"
/note="alternative UTRs for this feature are represented by different polyA sites"

Original comment by: ValWood

unique products

add product hierarchy here
announce to pombelist

Original comment by: ValWood

Some annotation checks required/ terms not for direct annotation

I occasionally check whether annotations have appeared at the following terms, which could be more specific.
These should move at some point to be documented on the wiki,
and form part of the automated alerting which Kim will set up down the line.
For now it is just a case of collecting the terms which should usually be more specific....

These usually (not always) should be more specific; they were checked today (22 March 2010)
cell cycle -> mitotic cell cycle or regulation of cell cycle
mitotic cell cycle->regulation of mitotic cell cycle
regulation of cell cycle -> regulation of mitotic cell cycle

Also checked and moved today
nuclear mRNA splicing, via spliceosome -> nuclear mRNA cis splicing via spliceosome

Previously checked
metabolic -> cellular metabolic DONE 9th November
DNA replication -> DNA-dependent DNA replication DONE 10 November 2010
all response to xxx stress are annotated to cellular response to xxx stress 10 November 2010

Original comment by: ValWood
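
Until the automated alerting exists, the list of terms that are usually too general could be kept in a small lookup and checked against direct annotations. This is a sketch under the assumption that annotations are available as (gene, term name) pairs; the function name and sample genes are hypothetical, and the mapping only repeats term names given in the issue.

```python
# Term -> more specific terms suggested in the issue.
TOO_GENERAL = {
    "cell cycle": ["mitotic cell cycle", "regulation of cell cycle"],
    "mitotic cell cycle": ["regulation of mitotic cell cycle"],
    "regulation of cell cycle": ["regulation of mitotic cell cycle"],
    "DNA replication": ["DNA-dependent DNA replication"],
}


def flag_general_annotations(annotations, too_general=TOO_GENERAL):
    """Return (gene, term, suggestions) for direct annotations to flagged terms.

    annotations: iterable of (gene_id, term_name) pairs (assumed input shape).
    """
    return [
        (gene, term, too_general[term])
        for gene, term in annotations
        if term in too_general
    ]


# Illustrative data only; the gene IDs are made up.
sample = [
    ("gene1", "cell cycle"),
    ("gene2", "regulation of mitotic cell cycle"),
]
for gene, term, suggestions in flag_general_annotations(sample):
    print(gene, term, "->", " or ".join(suggestions))
```

Because the issue says these terms are "usually (not always)" too general, the output is best treated as a review list and not as a set of errors.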

community survey

repeat, include questions about
Ensembl
YOGY
links to others
Community curation
country

Original comment by: ValWood

mating type region

Need to follow this up at some point. May need to check further with Juan...

Yes, what Xavi did is perfect. He's put together mat2 and mat3 with adjacent regions (which should be present in all h90 strains). In h90 cells, mat1 would switch between having a p cassette or a m cassette. In 972 h- there is a deletion of mat2-p. However, in some h+ strains there is a deletion of mat3-M, but more frequently there are mutations that prevent the switching from P to M, although mat2-P is still there. Therefore, one would have to assemble a different region for each strain. I think the configuration of the different h+ and h- strains is described in the very old pombe book. Anyway, most of the time people don't even know which h+ or h- allele they have.

I think it would help to annotate mat1-m and mat3-m as such in the main contig (rather than calling them matmi_1 and _2 - or at least add a note), so that people know which one they are looking at. Also, it would help to annotate the features of the mating type region that Xavi put in his contig and that are present in 972 (like the IR-L and IR-R regions, or the homology boxes). Finally, would it be possible to add a note somewhere directing people to the mating type contig?

I hope this is useful - let me know if I can help with anything else

Juan

Original comment by: ValWood

modification ontology related

Need to

  1. Decide which evidence codes are required IDA +
  2. Write some text for the curation tool
  3. Work out which terms the existing modifications map to (maybe the common ones could be mentioned in the help text)
  4. ?

Original comment by: ValWood

reference genome annotation

Spreadsheet
http://spreadsheets.google.com/ccckey=pZhlLFuj8ewDe799QTmxzCA&hl=en12:42

eg
http://proto.informatics.jax.org/prototypes/GOgraphEX/PPOD12_Graph/ORTHOMCL1245.html

AUG-FEB
PMID: 18820293 rps2
PMID: 2834104
PMID:14623272 ? done?

Need to add pre August list

Original comment by: ValWood

IC pairs

Get lists of IC pairs and numbers
Fill holes

Original comment by: ValWood

GO annotation G0/ stationary phase

GO TERMS
GO:0070315 G1 to G0 transition involved in cell differentiation
GO:0070317 negative regulation of G0 to G1 transition
regulation of cell quiescence
PAPERS
PMID: 19833516 yanagida G0 genes
PMID:17535257
PMID 19366728 sajiki at oist.jp (Kenich Sajiki)
PMID: 20133687
PMID: 20418666
others?

from trac ticket:
https://sourceforge.net/apps/trac/pombase/ticket/12

Original comment by: mah11

Orfeome Data

Add High throughput TAG

Original comment by: ValWood

subunit composition

subunit composition <-> X-dimerization activity

Original comment by: ValWood

document procedure for Sequence and feature updates

1. Add new telomeric region to contigs
(are there 2 now?)

2. Do small fixes which are confirmed
http://www.sanger.ac.uk/Projects/S_pombe/sequence_updates.shtml
http://www.sanger.ac.uk/Projects/S_pombe/sequence_discrepancies.shtml

3. Can we fill any gaps using Broad data? Speak to Nick Rhind

To Do
Give community advance notice of contig changes
Make Fixes
Update Stats/ webpage (sequencing status etc)/ download data
EMBL resubmission

Original comment by: ValWood

QC High level (Slim) GO annotation consistency checking

There is a script which,
when supplied with a GO ID, the current GO database, and the fission yeast/budding yeast ortholog table, will report differences in annotations between the 2 organisms.
Need to check all "high level" terms
Recently done
translation
transport
ribosome biogenesis
vitamin metabolism
Part done
vesicle mediated transport
DNA recombination
DNA repair

from trac ticket:
https://sourceforge.net/apps/trac/pombase/ticket/16

Original comment by: mah11
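
The existing script is not shown in the issue, so the following is only a sketch of the kind of comparison it describes. It assumes the genes annotated to a term (including its descendants) have been precomputed per organism, and that the ortholog table is a list of gene pairs; all names and sample IDs are hypothetical.

```python
def annotation_differences(go_id, pombe_annotated, cerevisiae_annotated, orthologs):
    """Report ortholog pairs where only one partner is annotated to go_id.

    pombe_annotated / cerevisiae_annotated: dict mapping GO ID -> set of gene
        IDs annotated to that term or its descendants (assumed precomputed).
    orthologs: iterable of (pombe_gene, cerevisiae_gene) pairs.
    Returns (pombe_only, cerevisiae_only), each a list of ortholog pairs.
    """
    pombe = pombe_annotated.get(go_id, set())
    cerevisiae = cerevisiae_annotated.get(go_id, set())
    pombe_only, cerevisiae_only = [], []
    for p_gene, c_gene in orthologs:
        if p_gene in pombe and c_gene not in cerevisiae:
            pombe_only.append((p_gene, c_gene))
        elif c_gene in cerevisiae and p_gene not in pombe:
            cerevisiae_only.append((p_gene, c_gene))
    return pombe_only, cerevisiae_only


# Illustrative data only; the gene IDs are made up.
pombe_ann = {"GO:0006412": {"p1", "p2"}}
cerevisiae_ann = {"GO:0006412": {"c1", "c3"}}
pairs = [("p1", "c1"), ("p2", "c2"), ("p3", "c3")]
print(annotation_differences("GO:0006412", pombe_ann, cerevisiae_ann, pairs))
```

Each pair in the output is a candidate for review in one of the two organisms; neither side is assumed to be the correct one.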

IGI consistency

need to check how often the IGI supported GO annotation is not applied to both genes (ignore cellular protein localization)

Original comment by: ValWood
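
A sketch of the IGI check, assuming "applied to both genes" means the interacting gene in the with column carries the same term under any evidence code. That reading is an assumption, as are the function name, the tuple layout and the sample IDs.

```python
def unreciprocated_igi(annotations, ignore_terms=frozenset()):
    """Return IGI annotations whose 'with' gene lacks the same term.

    annotations: iterable of (gene_id, term, evidence, with_gene) tuples.
    ignore_terms: terms to skip, e.g. {"cellular protein localization"}.
    """
    annotations = list(annotations)  # iterated twice below
    annotated = {(gene, term) for gene, term, _evidence, _with in annotations}
    missing = [
        (gene, term, with_gene)
        for gene, term, evidence, with_gene in annotations
        if evidence == "IGI"
        and term not in ignore_terms
        and (with_gene, term) not in annotated
    ]
    return sorted(missing)


# Illustrative data only; gene IDs and terms are made up.
rows = [
    ("geneA", "term1", "IGI", "geneB"),
    ("geneB", "term1", "IGI", "geneA"),
    ("geneC", "term2", "IGI", "geneD"),
    ("geneE", "cellular protein localization", "IGI", "geneF"),
]
print(unreciprocated_igi(rows, {"cellular protein localization"}))
```

The count of returned rows answers the "how often" question in the issue; the rows themselves show which partner gene would need the annotation.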

gene expression

  1. gene expression, split "expression regulated by" into +ve and -ve (some already have qualifiers)

  2. Go through and work out what is needed

Original comment by: ValWood

linking db xrefs

make sure DB IDs in misc_RNAs and introns are in dbxref

(i.e mRNA from .....)

Original comment by: ValWood

GO: contributes to

After ref genome meeting, summarized “contributes_to”

Provided a list of OK egs

need to address the others

notes:
should encourage people not to just remove, but to reannotate
examples to discuss
histone acetyl transferase
should only be used when a function is required for an activity, not for a regulation or process
should it always be present for >1 subunit?

warning in logs for incorrect usage

Original comment by: ValWood
