opencb / cellbase
High-performance NoSQL database and RESTful web services to access the most relevant biological data
License: Apache License 2.0
Several functionalities are required.
Filters:
ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
There is a tag="basic" for each gencode-basic transcript. Tasks:
1.- gencode gtf has to be downloaded
2.- The list of gencode-basic transcript ids (ENSTxxx) must be loaded within the GeneParser into a HashSet.
3.- GeneParser will include a new annotationFlag "basic" for all parsed gencode-basic transcripts
4.- getAllConsequenceTypesByVariantList at VariantAnnotationMongoDBAdaptor will check that flag before proceeding to annotate the variant
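Steps 2 and 3 could be sketched roughly as follows. This is only an illustration, not the actual GeneParser code: the class name GencodeBasicTranscripts and its methods are hypothetical, the attribute layout (transcript_id "ENST..."; tag "basic";) is assumed from the GTF format, and stripping the version suffix (".2") is an assumption made so that ids match plain ENSTxxx identifiers.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: collects gencode-basic transcript ids from GTF lines. */
final class GencodeBasicTranscripts {

    // Assumed attribute layout: transcript_id "ENST00000456328.2"; ... tag "basic";
    private static final Pattern TRANSCRIPT_ID =
            Pattern.compile("transcript_id \"([^\"]+)\"");

    /** Returns the ids of all 'transcript' features tagged as basic. */
    static Set<String> load(Iterable<String> gtfLines) {
        Set<String> basicIds = new HashSet<>();
        for (String line : gtfLines) {
            String id = basicTranscriptId(line);
            if (id != null) {
                basicIds.add(id);
            }
        }
        return basicIds;
    }

    /** Returns the transcript id if the line is a basic transcript, else null. */
    static String basicTranscriptId(String line) {
        if (line.startsWith("#")) {
            return null; // header or comment line
        }
        String[] fields = line.split("\t");
        // Column 3 is the feature type, column 9 holds the attributes.
        if (fields.length < 9 || !"transcript".equals(fields[2])) {
            return null;
        }
        String attributes = fields[8];
        if (!attributes.contains("tag \"basic\"")) {
            return null;
        }
        Matcher m = TRANSCRIPT_ID.matcher(attributes);
        if (!m.find()) {
            return null;
        }
        // Assumption: drop the version suffix so ids look like ENSTxxx.
        String id = m.group(1);
        int dot = id.indexOf('.');
        return dot < 0 ? id : id.substring(0, dot);
    }
}
```

The resulting set can then be consulted while parsing each transcript to decide whether the "basic" annotationFlag applies. The lines themselves would come from the downloaded, decompressed gencode GTF file.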
Includes:
Several errors are raised because of the new 'datastore' library integration.
Deploying war files should be avoided. Maven pom.xml files need to be properly configured
A new method is needed to calculate the consequence type from SNV variants. This will be part of the Variant Annotation new functionality.
The behaviour must be as similar to Ensembl VEP as possible.
VariationParser takes too long to generate its output; a new strategy is needed to improve performance.
It would be great to implement a WS with all species information.
This is an example of the response. More information can be added.
{
    "taxonomies": [
        {
            "name": "vertebrates",
            "species": [
                {
                    "text": "Homo Sapiens",
                    "assembly": "GRCh38",
                    "chromosomes": [
                        {
                            "name": "5",
                            "isCircular": 0,
                            "size": 180915260,
                            "end": 180915260,
                            "start": 1,
                            "cytobands": [
                                {
                                    "stain": "acen",
                                    "name": "p11.1",
                                    "end": 17600000,
                                    "start": 16100001
                                }
                            ]
                        }
                    ]
                }
            ]
        },
        {
            "name": "metazoa",
            "species": [
            ]
        }
    ]
}
Some unit tests do not pass because they are outdated or because they are based on local paths (/home/.... ). Fix those tests.
Write clear guidelines for using the query command of the CLI
Create a "load" interface in cellbase-core module. This interface will define the operations to load the data models, created by cellbase-app 'build' command, into a database.
A MongoDB implementation of this interface should be implemented in cellbase-mongodb module.
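One way such a contract might look is sketched below. Every name here (DataLoader, load, createIndexes, InMemoryLoader) is hypothetical and chosen only for illustration; the in-memory class merely stands in for the MongoDB implementation that would live in cellbase-mongodb.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical 'load' contract that would live in cellbase-core. */
interface DataLoader {

    /**
     * Loads serialized data-model records (e.g. one JSON document each)
     * into the named collection and returns how many were loaded.
     */
    int load(String collection, List<String> jsonRecords);

    /** Creates whatever indexes the backend needs for the collection. */
    void createIndexes(String collection);
}

/** In-memory stand-in for a backend-specific implementation. */
final class InMemoryLoader implements DataLoader {

    final Map<String, List<String>> store = new HashMap<>();
    final List<String> indexedCollections = new ArrayList<>();

    @Override
    public int load(String collection, List<String> jsonRecords) {
        List<String> target = store.get(collection);
        if (target == null) {
            target = new ArrayList<>();
            store.put(collection, target);
        }
        target.addAll(jsonRecords);
        return jsonRecords.size();
    }

    @Override
    public void createIndexes(String collection) {
        indexedCollections.add(collection);
    }
}
```

Keeping the interface free of any driver types is the point of the split: cellbase-core stays independent of the database, and each backend module supplies its own implementation.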
There is no need to pass different species here:
https://github.com/opencb/cellbase/blob/develop/cellbase-app/src/main/java/org/opencb/cellbase/app/cli/CliOptionsParser.java#L96
Different species can be executed in different executions. This will make the code a bit simpler without losing any real functionality.
The Variation document must contain population frequencies; these can be obtained from EVA datasets.
The following Exon WS should be implemented (currently not working):
/{version}/{species}/feature/exon/{exonId}/info
/{version}/{species}/feature/exon/{exonId}/region
/{version}/{species}/feature/exon/{exonId}/sequence
/{version}/{species}/feature/exon/{exonId}/transcript
Interesting but not urgent:
/{version}/{species}/feature/exon/{exonId}/aminos
/{version}/{species}/feature/exon/{exonId}/bysnp
should not be there and has been marked as Deprecated.
UniProt data is already integrated in CellBase. Link the functional description of the variants with the variant annotation WS.
There is a mechanism in the Ensembl Perl API to avoid passing a huge registry file. Using it removes the need to maintain this file and makes the CLI simpler, since no parameter is needed for the registry file.
This new option must return the collections installed for one species, together with the indexes created and the number of documents. Other useful info may also be returned.
CellBase must make use of Maven modules to offer greater modularity and reduce the dependencies loaded.
Currently, CellbaseClient can only call the GET WS for variant annotation. Include an option to allow making calls to the POST WS, thereby enabling sending bigger variant batches
Currently the MySQL-Hibernate implementation is found in cellbase-core. To offer a more modular implementation and a plugin-oriented framework, the interfaces (cellbase-core) must be implemented in a different module, so a cellbase-mongodb module must be created for MongoDB.
ClinVar WS are now querying the ClinVar collection. ClinVar is also loaded within the clinical collection. Only one ClinVar copy will remain, the one within the clinical collection, and all queries should point to this one.
Some dependencies, such as Jackson, SQLite or Jersey, are using old versions; these need to be upgraded and tested.
The following Gene WS should be implemented (currently not working):
/{version}/{species}/feature/gene/{geneId}/tfbs
/{version}/{species}/feature/gene/{geneId}/mirna_target
/{version}/{species}/feature/gene/{geneId}/reactome
/{version}/{species}/feature/gene/{geneId}/protein returns the PPIs for the specified gene. We should rename this WS to ppi or protein_interaction.
It would also be interesting to create a proper /{version}/{species}/feature/gene/{geneId}/protein WS returning UniProt information for this gene.
Method 'getAllByIdList' uses a complex aggregation when a much simpler elemMatch could be used. Also, 'supercontigs' are currently returned as well:
http://www.ebi.ac.uk/cellbase/webservices/rest/v3/hsapiens/genomic/chromosome/13/info?of=json
Documentation needs to be significantly improved: building, architecture, REST calls
A new NIO API was introduced in Java 7 for creating directories and other file system actions; this must be used, e.g.:
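The original example is not available here. As a minimal sketch of the Java 7 NIO calls in question, the helper below (OutputDirs.ensureDirectory is a hypothetical name) builds a Path and creates the directory with all missing parents via Files.createDirectories, which does nothing if the directory already exists.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper illustrating java.nio.file usage. */
final class OutputDirs {

    /**
     * Creates the directory (and any missing parents) and returns its Path.
     * Calling it again for an existing directory is a no-op.
     */
    static Path ensureDirectory(String first, String... more) {
        Path dir = Paths.get(first, more);
        try {
            return Files.createDirectories(dir);
        } catch (IOException e) {
            // Wrapped so callers need not declare a checked exception.
            throw new IllegalStateException("Cannot create directory " + dir, e);
        }
    }
}
```

Compared with java.io.File.mkdirs(), which only returns a boolean, the NIO call reports the cause of a failure through the exception.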
The DisGeNET database needs to be downloaded and included:
http://www.disgenet.org/web/DisGeNET/v2.1
A new collection gene_disease_association must be created.
A tutorial for downloading data sources and building the data models is needed:
https://github.com/opencb/cellbase/wiki/Download-and-Build-Data-Models
NoSQL databases offer higher performance and scalability. The document-oriented database MongoDB fits CellBase's needs very well. A new implementation based on MongoDB needs to be done.
New CLI must be implemented using JCommander, the available commands are: download, build, load and query
The new app module will accept different commands such as download, build and query
http://www.ebi.ac.uk/cellbase/webservices/rest/latest/hsa/feature/id/BRCA2/starts_with?of=json
Returns null instead of a QueryResponse json serialized object.
Remove from cellbase adaptors direct uses of the mongoDB drivers. Use datastore functionality instead.
The WS must use the ClinicalMongoDBAdaptor and query the referenceClinVarAssertion.measureSet.measure.measureRelationship.symbol field within the ClinVar record.
PPI from IntAct must be added, data models must be created in biodata-models
cellbase.sh should work when executed from any directory in the system.
New variant annotation functionality can be implemented. It will return all the known information about a variant in CellBase: consequence type #26, conservation, ...
Data models must be added to biodata-models.
Add a RefSeq parser method to GeneParser; the data must be loaded together with the Ensembl gene set.
Transcript HGVS must be calculated and included within the VariantAnnotation object.
The following SNP WS should be implemented (currently not working):
/{version}/{species}/feature/snp/{snpId}/consequence_type
/{version}/{species}/feature/snp/{snpId}/population_frequency
/{version}/{species}/feature/snp/{snpId}/xref
Interesting but not urgent:
/{version}/{species}/feature/snp/{snpId}/sequence
/{version}/{species}/feature/snp/{snpId}/regulatory
List of deprecated WS:
/{version}/{species}/feature/snp/{snpId}/consequence_types
/{version}/{species}/feature/snp/{snpId}/phenotypes
UniProt database needs to be integrated in CellBase
The web service:
http://www.ebi.ac.uk/cellbase/webservices/rest/latest/species?of=json
does not show the species correctly, it returns repeated species in different formats.
In order to have a better documentation Swagger must be integrated and configured
Some schemas should be defined, using JSON Schemas seems the simplest approach
Add Gene Expression Atlas data to the knowledgebase. Implement corresponding code for the:
Some parsers will be reimplemented so that they generate a general data model stored in a JSON object. 'Loaders' will be implemented which will transform the data into an appropriate and efficient format for the specific DBMS (e.g. MongoDB) and load them into the DB. The objective is to obtain a data model which contains all the information regardless of the specific implementation for a given DBMS.
When querying population frequencies like:
http://wwwdev.ebi.ac.uk/cellbase/webservices/rest/v3/hsapiens/genomic/region/3:1166675-1166675/snp
Only frequencies different from '1' are returned. MongoDB has to contain only those.
Download and Installation tutorial needs to be improved with some new sections and it should point to README for building the software:
https://github.com/opencb/cellbase/wiki/Download-and-Installation
biodata-models repository contains all data models in OpenCB. CellBase models should be moved from cellbase-core to biodata-models.