gmod / chado Goto Github PK

the GMOD database schema

License: Artistic License 2.0

Perl 0.01% Shell 0.01% PLpgSQL 99.98% Dockerfile 0.01%

chado's Introduction

Chado

Chado is a modular schema for handling all kinds of biological data. It is intended to be used as both a primary datastore schema as well as a warehouse-style schema.

Introduction

Chado was originally conceived as the next generation Flybase database, combining the sequence annotation database gadfly with the Harvard and Cambridge databases. We have avoided organism or project specificity in the schema, and we hope it will be of use to other projects.

The modules currently in chado are:

Module	Description
Audit	database audits
Companalysis	data from computational analysis
Contact	people and groups
Controlled Vocabulary (cv)	controlled vocabularies and ontologies
Expression	summarized RNA and protein expresssion
General	identifiers
Genetic	genetic data and genotypes
Library	descriptions of molecular libraries
Mage	microarray data
Map	maps without sequence
Organism	species
Phenotype	phenotypic data
Phylogeny	phylogenetic trees
Publication (pub)	publications and references
Sequence	sequences and sequence features
Stock	specimens and biological collections
WWW	generic classes for web interfaces

For documentation on the various modules, see http://www.gmod.org.

Other modules are possible; the existing modules cover a very large variety of use cases.

Chado has a fairly abstract schema, and ontologies and controlled vocabularies (CVs) are utilised where their use is favourable to relational modeling. In particular, the sequence ontology (SO) is vital to the sequence module.

Some (but not all) of the use cases we have discussed are:

Central dogma genome annotations
Genes that break the central dogma (of which there are many Annotated in fly, including polycistronic transcripts, transplicing, selenocysteine readthroughs, rna editing, ....)
Sequence variation data, including SNPs, transposable element insertions, indels, ... how this relates to phenotypes, how these effect the central dogma....
Non-wildtype data, including representing a wildtype transcriptome and proteome on a non wildtype genome; implicit and explicit central dogma examples for mutant strains
Complex phenotypic data
Ontologies structured as graphs; querying over graph ontologies non-recursively by pre-computing the closure
Sequence ontology
Comparative data
Genetic interactions
Transgene constructs, complex genetic experiments and their results

The core schema is DBMS independent. The SQL table create files can be found in the chado/modules directory. The main Chado developers are currently using PostgreSQL.

Installation

Please read the included chado/INSTALL.Chado.md document for instructions on how to install the Chado schema.

Docker

Chado has been dockerized with Flyway so that you can look around, and test migrations easily. The following instructions are going to show you how to pull the already build images. If you would like to build your own image locally, there are instructins for that at the very bottom.

Using the Chado Docker Image

You can start and work within the Chado container by using the following commands. You will want to replace the <version> with a current version available on Github. You can see the available options on the github package repository.

docker run -it ghcr.io/gmod/chado:<version>

If you don't have the image locally, this will pull it down from the Github package repository first and then start a container for you. PostgreSQL and flyway will be running within the container and Chado will be cloned in the current working directory.

The container will be shut down once you exit the opened bash terminal. If you want the container to remain running then you would use -dit instead of -it to run docker in detached mode.

As a test, you can run the flyway info command after starting the container:

flyway info

You can explore the Chado tables using the psql command in the following way:

psql -U postgres -h localhost -d chado

You will be prompted for a password. The default password is chadotest

Testing a Pull Request

In order to allow you to modify the migrations while running the container, you will mount your current directory inside the container using the following approach. This way you can edit the files locally and they will be automatically updated within the container and available to flyway.

git clone https://github.com/GMOD/Chado chado-pr<number>
cd chado-pr<number>
git checkout <prBranch>
docker run -it -rm --volume=$(pwd):/Chado ghcr.io/gmod/chado:pr-<number>

You can test FlyWay migrations by running these flyway commands:

See what migrations need to be applied:

flyway info

Run the migrations available:

flyway migrate

The migrations have been applied correctly if you do not see any red errors. However, it is good practice to run flyway info again to confirm.

Locally Build the Docker Image

The following commands should be carried out in a cloned version of this repository, in the root directory.

To build the docker image using the default values, run the following command:

docker build --tag gmod/chado:local --file docker/Dockerfile ./

If you would like to build a docker that uses a specific version of PostgreSQL then you can run the following command instead. In this case you will replace <version> with the version of PostgreSQL. We currently support 12, 13, 14, 15, and 16.

docker build --tag gmod/chado:local --file docker/Dockerfile --build-arg PGSQL_VERSION=<version> ./

Chado Support

Please see our website for more information on Chado and the GMOD project:

http://www.gmod.org/

You can send questions to the Chado mailing list:

[email protected]

You can browse the schema CVS repository here:

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/gmod/schema/

Authors

Chris Mungall, David Emmert and the GMOD team

Full list of committers:

a8wright a8wright@224a875b-6a50-0410-9993-82261b5d0d45
Allen Day [email protected]
Ben Faga [email protected]
Bobular [email protected]
Brian O. [email protected]
Brian O'Connor [email protected]
Chris Vandevelde [email protected]
Chun-Huai Cheng [email protected]
cmungall [email protected]
Colin Wiel [email protected]
Cyril Pommier [email protected]
Dave Clements [email protected]
David Emmert [email protected]
Don Gilbert [email protected]
elee [email protected]
Eric Just [email protected]
Eric Rasche [email protected]
Frank Smutniak [email protected]
Hilmar Lapp [email protected]
Jason Stajich [email protected]
Jay Sundaram [email protected]
Jim Hu [email protected]
Josh Goodman [email protected]
Kathleen Falls [email protected]
Ken Youens-Clark [email protected]
Lacey-Anne Sanderson [email protected]
lallsonu lallsonu@224a875b-6a50-0410-9993-82261b5d0d45
Lincoln Stein [email protected]
Malcolm Cook [email protected]
Marc RJ Carlson [email protected]
Mark Gibson [email protected]
Meg Staton [email protected]
Monty Schulman [email protected]
Nathan Liles [email protected]
nm249 [email protected]
nmenda [email protected]
Nomi Harris [email protected]
Peili Zhnag [email protected]
Peter Ruzanov [email protected]
Pinglei Zhou [email protected]
Ram Podicheti [email protected]
Richard D. Hayes [email protected]
Rob Buels [email protected]
Scott Cain [email protected]
Seth Redmond [email protected]
Sheldon McKay [email protected]
Shengqiang Shu [email protected]
Stan Letovsky [email protected]
Stephen Ficklin [email protected]
Tony deCatanzaro [email protected]
Yuri Bendana [email protected]
zheng zha [email protected]

chado's People

Contributors

Stargazers

Watchers

chado's Issues

gmod_load_cvterms.pl does not load synonym types

The current loading script does not populate the type_id column:

foreach my $s ($novel_terms{$k}->get_synonyms() ) {
		    $s=~ s/\s+$//;
		    $s =~ s/\\//g;
		    message("...adding synonym  '$s'  to the database...\n");
		    $new_term->add_synonym($s);  #need to add a type_id to each synonym!
}

A change to something like $new_term->add_synonym($s, { type => 'exact' , autocreate => 1} ); could do the trick, but how to identify the correct types? EXACT, NARROW, BROAD, and RELATED seem standard in .obo files, but there may also be a need for loading custom types.

“Variable” table (v1.4)

This issue imported from the Chado v1.4 requested changes google doc:

For those using Drupal/Tripal, I’d like the same kind of table for variables that would be used to store some global informations like the Chado schema version, or the installed plugins, or some custom comments or whatever. The table would have 2 text columns: name and value.
I’d like to see in that table a first variable being ‘chado_schema_version’ = ‘1.4’ and I would like to use it to store the Chado Controller version installed on the database if one, like ‘chaco_version’=’2.0’ for instance. It would be useful for future updates or compatibility check with other Chado modules. I would also like to use that table to store some parameters like ‘chaco_authentication_mode’ = ‘postgresql’ or ‘tripal’ or… You see what I mean? This table could/should be used by Chado modules in order to store some informations that could not be stored anywhere else in the database.

Stephen says (4/24/2015): Because variables would be application specific I think this type of table should be created by the application (i.e. Tripal). in fact we’ve had some discussions about creating such a table for Tripal.

BigInt for primary keys

http://gmod.827538.n3.nabble.com/PRIMARY-KEYs-should-be-of-type-bigint-CASCADE-td4049183.html
The issue was raised that the 32-bit integer used for primary keys on the feature and other tables may not be large enough to accommodate enough features. The suggestion was to make it a bigint instead. But the issue was raised that this will at least double the size of the indexes an a suggestion was made for partitionaling the database.

analysis.uniquename (v1.4)

http://gmod.827538.n3.nabble.com/Analysis-uniqename-td4049078.html

The suggestion was raised that there needs to be some sort of unique constraint on analysis. To be consistent the idea was proposed for adding a uniquename field to the analysis table
Stephen says (4/24/2015). I agree the current unique constraint on the analysis table is a bit hard to deal with. Also, It makes sense to me that a feature or stock would have a human-readable name that doesn’t need to be unique and a separate unique name, but I don’t think an analysis needs two names. I think the analysis name would be the best thing to make unique (no need for a unique name). But since a unique constraint has never existing on the name folks could have duplicate names and an upgrade that included a unique constraint on the name would break things. I’m not sure how to resolve this… marking for v1.4.

Term/typedef in feature_property.obo

Hi,

I'm working on setting up a new Chado database from scratch with minimum amount of hardcoding. feature_property.obo definitely is a great help for that.

However, I noticed that all entries are in 'typedef' stanzas. From other ontologies (e.g. GO), I learned that 'typedef' stanzas are only used for relationships (and thus I store those in Chado with flag is_relationshiptype=1). This is also how I understood it from the OBO format guide.

Certain entries (such as 'linked_to') certainly qualify as relationships, whereas others (such as 'aminoacid') certainly don't. Would there be a way to reflect this in the stanza type? This would be a great help for me.

Thanks,
Christoph

P.S. one more minor thing: The header contains two 'default-namespace' lines, one of which has the value 'unknown'. I assume we can do without that - might prevent ambiguities, depending on employed obo parser.

Add cvalue_id to all prop tables

Adding a cvalue_id field to the biomaterialprop table, as is done for the natural diversity module. This allows the property values to also utilize a cvterm. The idea is to produce something similar to what was done with several natural diversity modules where values could be either be free text or from a controlled vocabulary.

See the related document outlining the suggestions for the 1.4 Chado release: https://docs.google.com/document/d/1t_Jb4XxUPtgGRSKtznsqvjTjmR4vpFHPQROMQ0OZK9A/edit?usp=sharing

New Indexes for featureloc feature and feature_relationship

This issue imported from the Chado v1.4 requested changes google doc:

Indexes Suggested by Karl O. Pinc

http://gmod.827538.n3.nabble.com/RFC-Chado-relase-v1-3-tp4048655p4048701.html
New indexes on these tables

feature_relationship
featureloc
feature

Index feature_relationship_idx1b

Here is the issue discussion

-- The effectiveness of this will vary based on
-- whether you have more subjects or objects.

create index feature_relationship_idx1b 
  on feature_relationship (object_id, subject_id, type_id); 

create index featureloc_idx1b 
  on featureloc (feature_id, fmin, fmax); 

create index feature_idx1b 
  on feature (feature_id, dbxref_id) 
  where dbxref_id is not null;

Change a field or a custom table?

Hi,
I want to store some data about pairwise comparisons (subject id, object id, alignment length and % match). The table feature_relationship is almost a perfect choice, but its field 'value' is a text, and I need a numeric. A workaround would be to store this field in feature_relationshipprop, but that's one more join (and it is an integer, but we can live with that). What would be a better way out? A custom table?
Thanks in advance.

Stocks with multiple organisms (v1.4)

This issue imported from the Chado v1.4 requested changes google doc:

Lacey Sanderson and Ethy Cannon need to represent germplasm (stock) with more than one contributed organism (e.g. interspecific crosses). Suggestion is to add a stock_organism table but organism_id in stock table is not null, implying a primary contributor.

This is also a problem for feature maps where the landmark feature (e.g. linkage group) is from a single feature that has a single species.

Bob MacCullum says (3/10/2015): I (Bob) thought stock.organism_id was already allowed null - we certainly use it that way (organism_id is always null) and we have no documented SQL changes. However, I think that's a problem with our documentation!! Here's something I emailed to the list, but reading it again I see an unfinished sentence… I think it was supposed to end with "expect undefined behaviour when doing find_or_create on tables with multiple keys including fields with null values".

Stephen says (3/10/2015): Yes, you’re right Bob, the table does allow NULL for an organism_id, perhaps this will resolve the issue, Lacey & Ethy, and allow you to use a stock_organism table?

Sook says (3/18/2015): What about adding a type_id to the stock_organism to specify the relationship between the stock record and the organisms (e.g. derives_from).

organism_analysis linker table

I think there was some request for an organism_analysis linker table in the schema? We plan on creating one at HWG anyway.

-- ================================================
-- TABLE: organism_analysis
-- ================================================

create table organism_analysis (
       organism_analysis_id bigserial not null,
       primary key (organism_analysis_id),
       organism_id bigint not null,
       foreign key (organism_id) references organism (organism_id) on delete cascade INITIALLY DEFERRED,
       analysis_id bigint not null,
       foreign key (analysis_id) references analysis (analysis_id) on delete cascade INITIALLY DEFERRED,
       constraint organism_analysis_c1 unique (organism_id,analysis_id)
);
create index organism_analysis_idx1 on organism_analysis (organism_id);
create index organism_analysis_idx2 on organism_analysis (analysis_id);

COMMENT ON TABLE organism_analysis IS 'Linking table for associating organisms and analyses.';

Suggestions for schema modification.

There are a lot of suggestions I have regarding modifications to the Chado schema. How do I best make those suggestions? Is there an automated build system?

You have a lot of composite natural keys where the first value on the key is also indexed. There is no reason to ever do that.

featureloc_idx1
analysisfeature_idx1
synonym_idx1
dbxref_idx1

Amongst others are good examples of where you can save GBs of ram and disk space by eliminating useless indexes.

This kind of stuff,

the default of ('now'::text)::timestamp(6) with time zone

Is just a slower method of saying DEFAULT now().

boxquery and boxquery are a costumed rolled and much slower method of doing operations on an intrange with a gist index. Problem being PostgreSQL provides a much more featurefull standardized and optimized method with the int4range type.

Storing the md5 as character(32) is also a bad idea, those should be stored as UUID which would save a ton of space in index and disk.

feature_cvterm vs feature_dbxref vs featureprop for feature annotations

Hello,

@mpoelchau and myself have been discussing the behavior of storing GFF files for feature annotations via Tripal. We are considering a gene that perhaps has been annotated with GO terms, KEGG terms, proposed PFAM domains, and Interproscan family annotations.

My understanding of the Chado tables (which i want to emphasize is up for debate) is:

feature_cvterm is for annotating features with all of the cases I described above (GO, KEGG, PFAM) because some decision was made based on computational evidence to associate the feature with that annotation. The feature_cvtemrprop table exists to store evidence codes, qualifiers, etc.
feature_dbxref is for storing references to that record, itself, in anotehr database. So it should only be used to link back to the feature itself on a different site. Gene families its a part of, for example, wouldnt belong here.
featureprop: its hard for me to distinguish when a term annotation is better suited as a featureprop. props can have pubs for evidence but theres no featurepropprop table for evidence codes. Also, the "value" field seldom may not make sense if tagging with an annotation.

I'll add this is the most definitive guidance i found in my search on the chado wiki in the sequence module manual

Detailed annotations, such as associations to Gene Ontology (GO) terms or Cell Ontology terms, can be attached to features using the feature_cvterm linking table. This allows multiple ontology terms to be associated with each feature.
Provenance data can be attached with the feature_cvtermprop and feature_cvterm_dbxref higher-order linking tables. It is up to the curation policy of each individual Chado database instance to decide which kinds of features will be linked using feature_cvterm. Some may link terms to gene features, others to the distinct gene products (processed RNAs and polypeptides) that are linked to the gene features.
Annotations for existing features can also go into the featureprop table using the Chado feature_property ontology (defined in chado/load/etc/feature_property.obo) and the comment or description terms as appropriate. The purpose of the feature property ontology (and the related chado/load/etc/genbank_feature_property.obo file) is to capture terms that are likely to appear in GFF or GenBank sequence files. In theory there is no overlap between these ontologies and the Sequence Ontology.

Insofar as the GFF file holding the annotations:

The gff spec states: Two reserved attributes, Ontology_term and Dbxref, can be used to establish links between a GFF3 feature and a data record contained in another database. Ontology_term is reserved for associations to ontologies, such as the Gene Ontology. Dbxref is used for all other cross references. While there is no firm boundary line between these two concepts, curators tend to treat ontology associations differently and hence ontology terms have been given their own reserved attribute label.

similarly, NCBI calls most things dbxrefs in a much broader definition than the one i use above.

Here's the conflict. KEGG terms, for example, are not ontologies. But when we read the GFF file, we parse Ontology_terms into feature_cvterm, dbxrefs to dbxrefs, and everything else to props. So for the annotations to go into feature_cvterm, they would need to be in the GFF under ontology_terms.

As monica phrased her doubts:

With GO, I get it - a GO term refers to a formal, accessioned description of a gene function (e.g. http://amigo.geneontology.org/amigo/term/GO:0003676). A GO term does not also refer to a protein sequence - you annotate the protein sequence with the GO term. An InterPro accession is an accessioned ‘signature’ (which is a combo of HMMs, profiles, position-specific scoring matrices or regular expressions), which is annotated by curators with free-text descriptions from the literature. (And they can also be associated with a GO term). As such, I view InterPro domain accessions more as entries within a very authoritative database, rather than a controlled vocabulary. Although perhaps the domain name is enough to call it a controlled vocabulary at this point?

The consequence of these decisions is we display featureprops, feature_cvterms, and feature_dbxrefs in different locations and in different ways to end users.

suggested mapping for NCBI databases

I'm creating a Tripal module that imports NCBI XML and creates chado records, as well as linked records.

I think it would be neat to have a recommended, agreed upon mapping of NCBI data into chado. This could go into the documentation/wiki.
ie bioproject -> project
biosample -> biomaterial
assembly -> analysis of type X . (note that some @laceysanderson might argue this should actualy be a project since its multiple analyses?)

Property Table Updates

This issue imported from the Chado v1.4 requested changes google doc:

http://gmod.827538.n3.nabble.com/Property-tables-and-units-td4047268.html#a4047297

How to deal with numerical values? Otherwise can’t naively sort or filter values in a query.
To solve table property issues, I (Valentin Guignon, Bioversity International) propose to the Chado community to debate and vote on the 5 flavors to keep only one of them. Flavors:

replacing existing table by a new table with "typed" properties
Pros: keeps the same schema with minor changes on the 'prop tables
Cons: some old software may "not like" to see new columns and would need to be updated. Values stored in non-text value fields won’t be seen by old softwares (while they would be seen with 2b and 3b).
adding a new table with "typed" properties and keep the old one
Pros: backward compatibility…
Cons: ...but properties will come from 2 different tables and conflicting values could appear.
Double table management.
adding a new table with "typed" properties and remove the old one
Pros: forces people to be aware and use the new way of storing properties
Cons: no backward compatibility
2b) same as 2 with triggers to auto-synchronize the 2 tables
Pros: backward compatibility
Cons: double data storage, working with the old property table may have side effect issues.
3b) same as 3 with a view and rules in order to have virtual old property tables that would behave just like before but store data in the new table
Pros: backward compatibility
Cons: inserting/updating through the view replacing the old property tables may have side effect issues.

Auditing record creation

This issue imported from the Chado v1.4 requested changes google doc:

Requested by Lukas Mueller & seconded by Sook Jung (v1.4)

http://gmod.827538.n3.nabble.com/RFC-Chado-relase-v1-3-tp4048655p4048658.html

Add timestamps for creation time (maybe also update) in the nd_experiment and stock (possibly project) tables?
- Lacey says (4/24/2014): I also support this and have often wondered why features get this support but the rest of us don't ;) Looks like there is some pushback against this for wait till v1.4 until that can be sorted out?
In breeding programs, people would like to know when and how many things have been added.
Comment from 26 v. guignon: That should be the purpose of the Audit module or Chado Controller History module. I don't think we would need those in base tables as it would be used for traceability and the purpose of the audit/history modules are traceability. It would be redundant and require space and triggers to keep those up to date which would impact Chado performances

Relations ontology 404 not found

When I try to run make the ontologies, ro does not get downloaded. This is what I see:

$ make ontologies
./Build ontologies
Available ontologies:
[1] Relationship Ontology
[2] Sequence Ontology
[3] Gene Ontology
[4] Chado Feature Properties
[5] Plant Ontology

Which ontologies would you like to load (Comma delimited)? [0]  1,3,4,5
fetching files for Relationship Ontology
  +http://www.obofoundry.org/ro/ro.obo
    404 Not Found   (http://www.obofoundry.org/ro/ro.obo)
fetching files for Gene Ontology
  +http://www.geneontology.org/ontology/gene_ontology.obo
    ./tmp/go/go.obo is up to date
fetching files for Chado Feature Properties
  +load/etc/feature_property.obo
    loading...done!
fetching files for Plant Ontology
  +http://palea.cgrb.oregonstate.edu/viewsvn/Poc/trunk/ontology/OBO_format/po_anatomy.obo?view=co
    updated
    loading...done!

Core goal: Documentation

I wonder if moving the documentation for Chado into this repo via readthedocs would make documentation for chado more visible. Additionally, I think it would make it easier for contribution of "how to use chado" (i.e. best practices and/or just different methods). For example, the recent issue on mRNA/Gene versioning (tripal/tripal#645) has shown multiple common ways in use -it would be great to document these! While we could put them on the gmod wiki, I worry these docs get lost there.

Reasons for including docs in github:

easy to tag an issue as needing updates to documentation which helps keeps the documentation up to date
makes it easier to find/navigate all the documentation without getting lost in all the other gmod docs
easy to contribute changes ("Edit in gihub" automatically creates a fork and PR)
allows us to curate docs which improves quality (I volunteer to keep up with this although I would Love help)
I've noticed moving Tripal docs to github makes me feel more rewarded for updating docs since people actually see I'm making updates

Downsides:

documentation in multiple places (gmod wiki, readthedocs, schemaspy). This could be mitigated by good external linking making readthedocs the primary source

Core Goal: using modern schema deployment

I was discussing with my group ways to make Chado easier to maintain. I need to preface this by saying I only watched Stephen write the diffs himself, I haven't done it so my understanding is limited. But. It seems like right now when you make changes to Chado, you have to change the SQL for installing that verison from scratch, and then also create diff files to go from any previous version of chado.
It seems like a lot of work.

The laravel php framework uses database migrations for this sort of thing. Changes to the DB structure are made sequentially, so all you need to do to upgrade is run migrations your site hasn't run yet. Instead of two types of files, one of which seems to be a fair amount of work to do, you only have one: the migration file. Each change to the db is a single migration file. You can see our use of migrations for our TreeSnap site here. For exmaple, this migration adds a single table to the db. A future migration might modify a column.

The main disadvantage of migrations is we introduce a dependency. We clearly wouldn't use laravel, since its an arbitrary php framework. But why not a package manager like pip, gems, or composer? Chado would remain platform agnostic, you would introduce a dependency but its likely is already in use on the server.
And, the main benefit: it becomes much easier to contribute to, and maintain, Chado.

Indexes for cvtermsynonym

In Tripal we often search for cvtermsynonyms using the synonym column and the type_id column. We can improve performance if we add an index for both of those columns

Core Goal: Defining Contribution Guidelines and Governance

The issue of governance came up during our groups module meeting today. Meeting notes here.

We are having a similar discussion in Tripal right now.
In that meeting I started the discussion by linking this guide for open source structures.

We thought the community would benefit from some formalized governance by

reducing work overload and reliance on @scottcain
developing faster
being more cohesive in terms of implementation

This thread is for discussion on what a Chado governance structure would look like.

Some questions...

Would it make more sense to organize at the GMOD level?
What do we require of contributions? In terms of documentation, usage guide, commenting, the migration script stuff...
What do we require of someone who will have write access to the repo?

base table table?

Something I was reflecting on with Tripal is we do a lot of weird checks to determine if a chado table is a base table or not-- and even then, we have to hardcode quite a few in.

Has a table of tables ever been proposed? In our case t could just be a table listing the base table names, but it could be expanded to have a type_id for each table. Or, is htere another way to access metadata about a table to get a base table list? When people add custom tables to chado, they could just insert their base table into the table of tables.

Broken link in README.md - "INSTALL.Chado"

The "INSTALL.Chado" link in the Installation section of the main README.md file is broken.

project_phenotype table?

I would like to attach a set of phenotype records to the study (project) that generated them, but don't want to use the nd_experiment table:

project --- nd_experiment_project --- nd_experiment --- nd_experiment_phenotype --- phenotype

This is because it is challenging to maintain data integrity due to both the length of the connecting chain, and the lack of constraints on the nd_experiment table.

CREATE TABLE project_phenotype (
project_phenotype_id BITINT SERIAL NOT NULL,
PRIMARY KEY (project_phenotype_id),
project_id BIGINT NOT NULL,
FOREIGN KEY (project_id) REFERENCES project (project_id) ON DELETE CASCADE INITIALLY DEFERRED,
phenotype_id BIGINT NOT NULL
FOREIGN KEY (phenotype_id) REFERENCES phenotype (phenotype_id) ON DELETE CASCADE INITIALLY DEFERRED,
CONSTRAINT project_phenotype_c1 UNIQUE (project_id, phenotype_id)
);
CREATE INDEX project_phenotype_idx1 ON project_phenotype (project_id);
CREATE INDEX project_phenotype_idx2 ON project_phenotype (project_id);

Annotating a feature with multiple types of cvterms

Hello all,

We are working on a way to add additional information to a feature_cvterm association. We would like to specify an analysis_id for a feature_cvterm record. We would also like to specify whether this cvterm is a relationship term.

What we are trying to create is something to similar to this page:
https://www.arabidopsis.org/servlets/Search?action=search&type=annotation&tair_object_id=2025391&locus_name=AT1G01620

Where the columns relationship type and keyword are cvterms and evidence is an analysis.

Currently chado provides a way for us to associate a publication to a feature but it doesn't offer a way to link analyses. In our case, some annotations don't necessarily have a publication to link to but do have analyses (BLAST and InterProScans for example).

Does chado offer a way to do this?

Thanks!

Companalysis Module

This issue imported from the Chado v1.4 requested changes google doc:

http://gmod.827538.n3.nabble.com/An-alteranate-high-throughput-sequencing-chado-representation-tc4046238.html

Add a ‘type’ field to the analysis table. (v1.4)
- Stephen says (1/14/2015). I like this. It would make it easier to distinguish between different types of analyses. We currently do this by using a property. But perhaps storing directly as a field in analysis would be better.

Updated synonym table?

I am just a DB not a geneticist

It seems as if the table synonym is supposed to link to common names. Here is a row of sample data from input (unrelated to the problem)

5 | O76864    | EG:100G10.4 protein (EG:100G10.4) [] - [O76864_DROME]                                                                                                                                                                                                               |             |             |            1 |     4785000

I tried search synonym.name for all of these but I couldn't find it. The page on Uniprot for O76864 links it to this entry in Flybase Interestingly that has the link back to Uniprot for O76864, but I don't see that in Chado anywhere.

Phenotype Module

This issue imported from the Chado v1.4 requested changes google doc:

http://gmod.827538.n3.nabble.com/proposal-for-changes-to-the-phenotype-module-tc4044535.html#a4044612

Deprecate fields of the phenotype table: observable_id, attr_id, cvalue_id, assay_id
New tables:
- phenotype_cvtermrelationship
Naama Menda says (3/11/2015): Inclined towards option #1 http://gmod.org/wiki/Chado_Post-Composed_Phenotypes#Option_1:_Group_Module_is_an_intrinsic_part_of_the_phenotype_cvterm_table
Need a way to decouple phenotype description from its value, and to allow post-composing of phenotypes , e.g. EQ statements.
Valentin Guignon says (3/11/2015): about names, I suggest phenotype_cvtermrelationship → phenotype_cvterm_composition and rank → composition_order (or keep rank and add a column composition_order and adjust constraints according to that)

Changes to cvterm, dbxref and project tables

This issue imported from the Chado v1.4 requested changes google doc:

Changes suggested by Bob MacCallum, VectorBase
(population biology browser runs off Chado)

cvterm: Email discussions in March 2012 ("Anyone had any luck loading GAZ->Chado cv tables") suggested adding dbxref_id to cvterm's constraint cvterm_c1 - but looking at our schema changes (from Chado trunk) it seems that we have simply dropped that constraint completely. I'm not sure what caused the change of heart, sorry. This is to accommodate GAZ and other ontologies with multiple terms with the same name (e.g. "London").

dbxref: Also GAZ-related I believe: ALTER TABLE dbxref ALTER COLUMN accession TYPE character varying(1024);
Stephen says (04/24/2015). This increase in size for the accession seems reasonable to me.
(For the record we will gradually reduce our reliance on GAZ in favour of geometric approaches in Solr. I'm not suggesting the above changes should go into the trunk - I'm just mentioning them for completeness.)

project: ALTER TABLE project ALTER COLUMN description TYPE character varying(4096);
Stephen says (04/24/2015). I think all ‘description’ and ‘comment’ fields in all tables should be of type text. Not a character varying.
A full (and very short) list of our patches applied to an (old-ish) Chado can be found here.

XSLT transforms take not-insignificant amounts of RAM

This isn't a "serious" issue, but more a feature that I'd like to track and eventually PR/patch.

I've recently refactored the way my prebuilt chado schemas are being built into a way that I can more easily publicise, and others can use. Part of that process involved moving the build to a VM.

The VM has only 1 Gb of RAM, and during the XSLT processing step of the Gene Ontology OBO file, XSLT chews through that gig and crashes. This is because XSLT requires loading the entire document into memory, and that's not always feasible for these large XML documents.

Best practices seem to suggest switching to a streaming transform like XST.

Add date_created, date_updated to “base” tables (v1.4)

sounds self explanatory
The issue description is this which looks unrelated to me

Change 'feature.seqlen' to a bigint to accommodate longer sequences.
Great.

One more thing that come to my mind is to have a datetime column for
most of the central tables for example in the pub table. It simply allows me to
have the state of row with making changes to the core tables or
adding additional linking tables and application logic. The idea is
similar to what ruby on rails framework add to every table once you run
the migration through it(date_created, date_updated).

Changes to the Project and Biomaterial table

This issue imported from the Chado v1.4 requested changes google doc:

Requests by Andrew Farmer

http://gmod.827538.n3.nabble.com/RFC-Chado-relase-v1-3-tp4048655p4048675.html

Add a biomaterial_project table (v1.4)
- Stephen says (4/24/2015). because the biomaterial, protocol, study, treatment are not specific to just MAGE data perhaps they should be moved out of the MAGE module into a new ‘Experiment’ module? The assay, array and element tables would also apply to SNP assays as well, so perhaps MAGE is no longer a good name for this module?
Other Customization to tables:
- project table
  - add dbxref_id => dbxref (supports NCBI BioProject ids, for example) (v1.4)
    - Stephen says (3/9/2015): Perhaps a project_dbxref table should be used instead.
    - Stephen says (4/24/2015): adding a dbxref_id to the project table can be discussed for v1.4
  - add type_id => cvterm (supports classification of projects, initially using cv derived from NCBI's BioProject vocabulary) (v1.4)
  - ALTER COLUMN description TYPE text (BioProject descriptions can be long)
    - Stephen says (3/9/2015): a few other tables have a ‘description’ field that is too short (e.g pub, library). We end up storing longer descriptions in the associated prop table. Should we extend the description fields of all tables to be of type ‘text’?
- biomaterial table
  - add stock_id => stock (allows tracking cultivar, etc. for samples; but this may be unnecessary if subspecies info is now going to be included in organism) (v1.4)
    - Stephen says (3/9/2015): Perhaps a stock_biomaterial table should be used instead.
  - add project_id => project (links samples to primary projects, as is done in NCBI) (v1.4)
    - Stephen says (3/9/2015): This should be handled by the suggested biomaterial_project table?

direct way to link analysis and biomaterial?

as part of mapping NCBI data into chado (see #76 ) I am adding NCBI assemblies as Chado analyses. NCBI assemblies are directly associated with biosamples, which map to chado biomaterials. However, there is no biomaterial_analysis linker table.

MAGE provides a way to link biomaterials and analyses, but it is not pretty. I'd have to create a quantificiation, acquisition, assay, and protocol to link them. I'd basically be bluffing for all of those records: i dont have something equivalent in NCBI to populate them with. I'd much prefer to link directly.

I guess its also possible to link biomaterials to analyses via a shared project, but that wouldnt be specific enough if a project had multiple analyses/biomaterials.

What do you all think? is biomaterial_analysis feasible? Does it go against analysis being a single specific run of a program? If so, should we be loading in ncbi assemblies as a different chado type?

search_path problem with boxrange function

Summary
Recent versions (9.3+) of PostgreSQL that have been patched for CVE-2018-1058 will throw the following error when a vacuum / analyze is attempted on the featureloc table.

Error

vacuumdb: vacuuming of database "prod" failed: ERROR:  function create_point(integer, integer) does not exist
LINE 1: SELECT box (create_point(0, $1), create_point($2,500000000))
                    ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
QUERY:  SELECT box (create_point(0, $1), create_point($2,500000000))
CONTEXT:  SQL function "boxrange" during inlining

Details
This problem is caused by the binloc_boxrange and binloc_boxrange_src indices on the featureloc table. These indices are generated by the boxrange function defined in the default Chado schema. Due to changes made for CVE-2018-1058, when a function is used in an index the search_path within that function has the public schema removed, thus any other user defined functions fail to execute because they are not found in the search_path. In our case, calls to create_point from within the boxrange function fail.

Possible solutions

Hard code the boxrange function to use the schema name e.g. public.create_point(0, $1).

Hard coding the public schema in a function name is not ideal and would break on any installs that did not run in the public schema.

Use the SET SEARCH_PATH FROM CURRENT in the function definition of boxrange.

SET search_path = public;
CREATE OR REPLACE FUNCTION boxrange (bigint, bigint) RETURNS box AS
 'SELECT box (create_point(0, $1), create_point($2,500000000))'
LANGUAGE 'sql' IMMUTABLE SET SEARCH_PATH FROM CURRENT;

This is what the PostgreSQL devs currently recommend. We would need to somehow dynamically set the search_path during build/install time. Talk to @scottcain on how best to do that within the GMOD schema build/install process.

Other functions
We should check for other functions that may be impacted by this change as well.

Additional resources

gmod_load_cvterms.pl does not (yet) support cross cv relationships

Attempting to load related crop ontology terms in separate namespaces (e.g. "YamTrait" and "YamVariable") results in the error "Cross referencing relationships across namespaces is not supported (yet.. ) SKIPPING this relationship!".

So far, the way we get around this error is to load everything in a single namespace, but then we end up with mixed ontologies. Could we update the loading script to support cross referencing? The big advantage is that it would let us load more refined namespaces, which in turn would make it easy to distinguish which terms can have phenotype measurements linked to them, and how terms can be used for trait composing.

Phenotype Module changes to handle EQ statements

http://gmod.org/wiki/Chado_Post-Composed_Phenotypes

Changes requested to phenotype_cvterm for managing relationships
- add a ‘grp_id’ to link the phenotype_cvterm table to the group module
- Add a ‘type_id’
Stephen Ficklin says (3/10/2015): These EQ statements seem so important to storing information about phenotypes, I'm thinking that storing them in the group will obfuscate their meaning. I can understand trying to use the group module to store these because there is no other table structure to do so. Perhaps a dedicated set of tables for storing EQs in the phenotype module is better than trying to use the group module. Another advantage to creating custom tables for EQ statements is it encourages folks to store them in the same way because they'll see the tables and natural ask how to use them.

associating contact roles with a project?

Hello,

I'm mapping out a module that would integrate drupal users with chado contacts. Part of that is I would like to assign users to different roles for different subsets of chado data. The most obvious way to do that to me is to use the chado.project_contact table. However that table only has contact_id and project_id: how can i specify that user X is the admin for a project and user Y has read access, and user Z has write access?

I would think either a) type_id for project_contact or b) project_contactprop. Either approach would involve adding to Chado. there's always c) do it my own way and/or not in Chado. i can see how this isnt strictly biological data and therefore i should have, instead, a public linker table for users and projects where i can specify the role as its own column.

Any thoughts/experience on this?

Storing vocabulary terms borrowed from other CVs?

Ontologies are increasingly borrowing terms from other ontologies and using the IDs from the source ontology, and the current tables of Chado make this difficult to represent. One example is the Plant Trait Ontology (TO). Many of it's terms come from GO, CHEBI and BFO. I'm working on improvements to the Tripal OBO loader to deal with this and I do not know how we should store these.

Consider the case where a GO term is used in the TO. How do I represent that term?

Option 1: Create a new cvterm that duplicates the original GO Term. Problems:

If the term changes in the GO then the two records get out of sync.
This does not accurately reflect the fact that the term belongs to the GO and not the TO. It's just being borrowed.
This option forces me to add a dbxref record, and following the foreign keys all the way to the db will lead me to the TO record in the db table. This has the wrong URL and URLprefix for a GO term.

Option 2: Use the original cvterm from the GO. This seems like the correct way to go.... but:

The relationships for this term may not be the same in the borrowing vocabulary. We cannot represent a new set of relationships that are different from those already created. (e.g. those created by the GO and distinguish those as belonging to the TO).
If we do not physically add a record to the cvterm table than many existing tools that expect to find all terms for a vocabulary in the cvterm table won't find these borrowed terms.
If a borrowed term serves as a root term in the vocabulary how can tools determine that it serves as a root because the cvterm table has a cv_id that points to the owning ontology.

I don't yet have a solution for this but if I can come up with one I'll post it here.

PRIMARY KEYs should be TYPE bigint [CASCADE]

PostgreSQL uses bigint to manage SEQUENCEs (the data structure used to maintain the serial data type. This should be paired with a column of type bigint, which is denoted by the type name bigserial in a CREATE TABLE declaration. Without this, eventually INSERTs will fail with the message ERROR: integer out of range as the serial begins to return ids that are too large for the column type.

This is a change that would have far reaching implications, affecting all columns that are foreign keys.

The following are related to this issue:

ALTER TYPE of feature.seqlen to bigint
ALTER TYPE of featureloc.fmin, featureloc.fmax to bigint

Tables requested by Main Lab

This issue imported from the Chado v1.4 requested changes google doc:

http://gmod.827538.n3.nabble.com/Opening-up-comments-on-changes-to-Chado-td4036098.html#a4039024

All requests except for image tables were worked into the v1.3 release. The following are linker tables for the eimage table. But there may be a different solution for files/images in general.

contact_image (maybe should be image_contact) (v1.4)
feature_image (maybe should be image_feature) (v1.4)
stock_image (maybe should be image_stock) (v1.4)
organism_image (v1.4)

Property Tables

eimageprop

(v1.4) (VG added: for DDL, see request #3)

EImage Table / Module (v1.4)

This issue imported from the Chado v1.4 requested changes google doc:

Should this eimage table be pulled out into a new ‘image’ module with appropriate linking tables?

Scott says: I’m not sure what you’re suggesting: keeping eimage, but creating a series of “eimiage_” linking tables? Probably not a bad idea.

Stephen says (1/14/15): Yep, that works, and perhaps calling that collection of tables the EImage or Image module? It would be nice to rename the eimage table as just image but that’s probably a no-go. Or we can leave the eimage table as part of the expression module as it was intended and create a new Image module with an image table and image_ tables.

Bob MacCullum says (3/10/2015): an Image module would be great

Stephen says (4/24/2015). We are working on a file module for Tripal and will propose a set of tables that can associate any type of file (including images) with any data type in Chado.

Support for display of regions on maps with float coords

Storing coordinates for features on maps with float coordinates is awkward. featurepos+featureposprop to indicate start and stop coordinates can work, but feels unnecessarily convoluted.

Some thoughts and discussion:

Change featureloc.fmin and featureloc.fmax to a float? Float fields are problematic, but so is representing a float as an int and divided by X, for example representing 3.6 cM with 36 and dividing by 10.
Add a type field to the featurepos table to indicate the “type” of the position (e.g. start or end).
Add a new ‘featureinterval’ table
- with these fields:
  - featureinterval_id
  - featuremap_id (map set, to get coordinate units)
  - feature_id (object feature being placed)
  - srcfeature_id (target feature)
  - startpos (double precision)
  - endpos (double precision)
- Stephen Ficklin (1/14/15): In the email it says this would help newbies but duplicates some information already in the featurepos table. Not sure if we should add tables that duplicate info. I got
  slapped around once for suggesting such a thing.
- Scott Cain (1/14/15): As well you should! Perhaps featureinterval should be a view?

1.4 schema documentation

The old wiki documentation was generated via a python script. It is now out of date and we dont have an easy way to update it. we also don't want to lose the comments and changes in the existing wiki.

Change the python script to somehow update?
Use a more brand name documentation generator? That can somehow deal with this?
Deprecate the old documentation and point people to a new docs that is generated via a brand name documentation generator?

Updating MAGE module meta data to support next gen data

Updating table descriptions to generalize mage to include next-gen sequencing experiments and other possible analyses instead of microarray only. This is a suggested topic for the 2018 PAG hackathon.

A related topic is the suggested modifications for the 1.4 Chado release: https://docs.google.com/document/d/1t_Jb4XxUPtgGRSKtznsqvjTjmR4vpFHPQROMQ0OZK9A/edit?usp=sharing

Add keys to "temp" tables.

The tmp_gff_load_cache table is a not-really-temporary table that the GFF3 bulk loader uses to keep track of feature names and IDs. To make database replication easier, it should have a primary key or unique index.

The gff_meta table is also without either a primary key or a unique index.
So, for replication of the DB (even though this table may be "disposable"), it too needs either a primary key or a unique index.

This is migrated from the bug tracker at Sourceforge: http://sourceforge.net/p/gmod/bugs/363/

Suggestion: separate Chado Schema/Perl into two different repositories

I'd like to propose for discussion that we split this Chado repository into two different GitHub projects where the Chado schema is in one and everything else (i.e. Perl code) is in another.

There are several reasons I think this would be beneficial:

Perl is not the preferred language for everyone, it's losing users to Python in the informatics sphere, and there are potentially newer ways to get Chado installed if we explored and implemented them.
Tools that use the Chado schema sometimes never need the Perl code.
Right now, Tripal packages it's own copy of Chado rather than rely on the official Chado release. It would be preferable to pull directly from the Chado repository rather than package Chado with tools.
Changes in the schema are very slow while changes in the Perl code happen more regularly. As there has been some discussion related to Chado schema Governance it seems these two update schedules are different enough that separate repositories would help. i.e. one group of individuals responsible for the schema can worry about updates to that while another group can worry about updates to the Perl code.
The last digit version number confuses people who think it implies schema changes.

Changes to Chado Installation

This issue imported from the Chado v1.4 requested changes google doc:

Changes to Chado Installation that Karl O. Pinc suggested

http://gmod.827538.n3.nabble.com/RFC-Chado-relase-v1-3-tp4048655p4048701.html

Do not delete tables before installing. (If the table already exists the transaction should roll back.)
Do not be so chatty, display only warnings and errors, not informational messages.
I would like to have it be possible to install Chado into it's own schema. The first step for this is to get rid of the multiple schemas that Chado currently uses.
I would like to see Chado be able to be installed modularly. At present this is rather-to-very difficult. (At least it's difficult if you want only Chado. I don't know the process if Tripal is involved.) See email link for ideas on how to do this.
In a follow up email, Siddhartha Basu suggested releasing just the DDL without the software.

group module

From the google dock
The group schema will define groups of things.
Here is the group module discussion.

As per the discussion with Scott Cain, Stephen Ficklin, Lacey Sanderson, a proposal was made to require the following for addition of new modules (including group module)
Properly define use cases that provide enough detail such that anyone can understand how the data will be stored in each table
A full schema DDL must be available.
Documentation for the module must be ready to be published on the GMOD wiki, including at a minimum HTML tables describing the module

Chado CV/DB table challenge

We are having a bit of trouble organizing controlled vocabularies in the CV and DB tables. The problem comes in with the "short name" (e.g. GO, SO, etc...) for the vocabulary which gets stored in the db.name column. The db table also houses the URL and URL prefix. The URL prefix is meant to be used for any dbxref (even if it's not a term) to rebuild a web link to point to the cross reference.

This works great for the Gene Ontology because all of the terms have one short name (GO), so that record get stores in the db table and the three vocabularies for GO (i.e. molecular_function, biological_process, cellular_component) all get stored in in the cv table. This works great! The db.urlprefix can be used to build all of the link-outs because GO uses the same URL for all of them.

This breaks for vocabularies such as EDAM. "EDAM is an ontology of bioinformatics types of data including identifiers, data formats, operations and topics." In this case the entire vocabulary is named 'EDAM' and it has separate vocabularies (like GO) named 'data', 'operation', 'format', etc. The terms in this ontology are not prefixed with 'EDAM' but rather the vocabulary name. So, to make link-outs work, this forced us to reverse how we store them. We put 'EDAM' in the cv table... which is not ideal.

KEGG provides more challenges. We want to store KEGG orthologs and pathway as cvterms. These terms are organized into KEGG's BRITE hierarchy. So, in a way it seems like an ontology. Because orthologs and pathways have different URLs we are forced to store the PATH and KO vocabularies in the db table. But, if we want to preserve the organization of the BRITE heirarchy we have to lump these into the same KEGG BRITE cv record. Same problem as EDAM, except that higher level terms in the BRITE hierarchy don't have corresponding pages so we can't link to them. Some of our term links will be broken.

So, I think we need to adjust how we recommend and how our scripts store vocabularies in Chado. Here's one suggestion

We keep the db.urlprefix field but use that as a "default". This ensures continued backwards compatibility.
Because a database can use multiple URLs for all kinds of data (not just terms) and even for different vocabularies, we create a new dburl table where URLs can be stored. We document a storage format that all Chado client applications can understand if they follow it. For example, Tripal currently uses {db} and {accession} as tokens in the URL string. This allows the URL to be constructed by filling in the tokens and those tokens can appear anywhere in the string. Whereas now, usage implies that the accession is added to the end of the value in the db.urlprefix.
We create a new dbxref_dburl table that allows a dbxref record to specify a different URL to use. So, if one is not specified then the default urlprefix is used.

This approach allows us to have only one true database record, and put the CV's where they go in the cv table. It would allow us to have multiple URLs per database, and indicate a different URL construction string if needed. This would solve our KEGG problem as well because we could leave the db.urlprefix blank and add to the dburl table specific URLs for dbxref records that use them.

Any thoughts?

Changes to feature_stock, featuremap_stock, featureinterval and linkers

This issue imported from the Chado v1.4 requested changes google doc:

Requests by Ethy Cannon / Naama Menda / Steve Cannon

http://gmod.827538.n3.nabble.com/Opening-up-comments-on-changes-to-Chado-tc4036098.html#a4039144

feature_stock with a type_id is used to associate a QTL feature with the parent with the favorable allele. (v1.3)
featuremap_stock with a type_id is used to associate a genetic map with a stock record describing its mapping population.
Do other linker tables have types? If so, should we make this a standard for all linker tables to have a type, but let it be non-required? (v1.4, get some more use cases)

http://gmod.827538.n3.nabble.com/Proposed-changes-to-map-module-tc4031971.html

Change featureloc.fmin and featureloc.fmax to a float. Something to do with featuremap but I don’t understand….
Add a type field to the featurepos table to indicate the “type” of the position (e.g. start or end).
Add a new ‘featureinterval’ table
- with these fields:
  - featureinterval_id
  - featuremap_id (map set, to get coordinate units)
  - feature_id (object feature being placed)
  - srcfeature_id (target feature)
  - startpos (double precision)
  - endpos (double precision)
Stephen Ficklin (1/14/15): In the email it says this would help newbies but duplicates some information already in the featurepos table. Not sure if we should add tables that duplicate info. I got slapped around once for suggesting such a thing.
Scott Cain (1/14/15): As well you should! Perhaps featureinterval should be a view?

gmod / chado Goto Github PK

chado's Introduction

Chado

Introduction

Installation

Docker

Using the Chado Docker Image

Testing a Pull Request

Locally Build the Docker Image

Chado Support

Authors

chado's People

Contributors

Stargazers

Watchers

Forkers

chado's Issues

Recommend Projects

Recommend Topics

Recommend Org