A standardized reference genome resource manager. See the documentation.
refgenie / plantref
Refgenieserver content repository for plant genomes server
Home Page: http://plantref.databio.org
Following up on your comment about using common or short names, aliases, or adapting the names to be more human-friendly, I think we can come up with a list like this, at least for some of the species.
My question is how this would be implemented without losing the information that is currently in the name. Were you thinking of creating a kind of alias for the name, or of adding a tag to the fasta asset?
If we use tags, then we can also use them to group versions of a genome, keeping the "core" part of the id as the genome name and moving the build id into the tag.
An example could be Zea mays. This could be the genome name (Zea_mays_B73, or simply ZeaMays, as B73 is the main reference line), but it has a few entries in the genomes list; this info should go in the tags. E.g., the tags list for the fasta asset would be: B73_AGP_v4_0 and B73_5b_60.
Normally all of these would be different versions. We will make sure that these groupings make sense in terms of versions and that we can identify the latest one, which will also be the default. Branches of a genome, such as different strains, would get a different genome name. E.g., ZeaMays PH207 is a different line from B73, so it would go into a genome named PH207 with a single fasta asset tagged v1_0.
Thus, when adding a new version of a genome (e.g., we are adding ZeaMays v5.0), the upgrade process would mean adding the asset tag 5.0 and also changing the default to it. This way, when someone runs `refgenie pull ZeaMays/fasta` they would get the default, 5.0.
From the point of view of data curation, we need to make sure that all default assets for a genome are compatible.
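To make the proposal concrete, here is a minimal sketch of the tag-and-default idea in plain Python. It does not use refgenie's API; the `Asset` class, its `add_tag` and `resolve` methods, and the `genomes` dictionary are hypothetical names for illustration, and B73_AGP_v4_0 is assumed to be the newer of the two existing B73 builds.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Asset:
    """Hypothetical model of one asset (e.g. fasta) with tagged versions."""
    tags: List[str] = field(default_factory=list)
    default: Optional[str] = None

    def add_tag(self, tag: str, make_default: bool = False) -> None:
        if tag not in self.tags:
            self.tags.append(tag)
        # The first tag becomes the default until another one is promoted.
        if make_default or self.default is None:
            self.default = tag

    def resolve(self, tag: Optional[str] = None) -> str:
        """Map a request to a tag; a request without a tag gets the default."""
        chosen = self.default if tag is None else tag
        if chosen not in self.tags:
            raise KeyError(f"unknown tag: {chosen!r}")
        return chosen


# genome name -> asset name -> Asset (names taken from the examples above)
genomes: Dict[str, Dict[str, Asset]] = {
    "ZeaMays": {"fasta": Asset()},
    "PH207": {"fasta": Asset()},  # a different line gets its own genome name
}
genomes["ZeaMays"]["fasta"].add_tag("B73_5b_60")
genomes["ZeaMays"]["fasta"].add_tag("B73_AGP_v4_0", make_default=True)
genomes["PH207"]["fasta"].add_tag("v1_0")

before_upgrade = genomes["ZeaMays"]["fasta"].resolve()

# Upgrading: add the new tag and move the default to it.
genomes["ZeaMays"]["fasta"].add_tag("5.0", make_default=True)
after_upgrade = genomes["ZeaMays"]["fasta"].resolve()
```

Older builds stay reachable by asking for their tag explicitly, e.g. `genomes["ZeaMays"]["fasta"].resolve("B73_5b_60")`, while an untagged request follows the default.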
Also, a related question about the next assets to include: what is the best way to link assets that do not have a natural parent/child relationship?
If we start including annotations, we may have cases where multiple annotations are associated with the same genome fasta and version. E.g., we can have zea_mays_GTF_annotation_v1.0 and zea_mays_GTF_annotation_v2.0.
I think a possible way to do this is to use the genome fasta asset as an (optional) parent in the GTF annotation build recipe.
It would not strictly be used during the build, but it could be used to check compatibility and to keep track of the association.
Or would it be possible to add an option to set "associated-assets" and then link an annotation to a specific tag of a genome fasta?
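As a rough illustration of the "associated-assets" option, the association could be a simple lookup from an annotation to the fasta tag it was made for. This is only a sketch under assumptions: `AssetRef`, `ASSOCIATIONS`, `annotations_for`, and `is_compatible` are hypothetical names, not part of refgenie, and the fasta tag used for both annotations is made up for the example.

```python
from typing import Dict, List, NamedTuple


class AssetRef(NamedTuple):
    """Hypothetical pointer to one tagged asset of one genome."""
    genome: str
    asset: str
    tag: str


# Hypothetical "associated-assets" table: annotation -> fasta it belongs to.
ASSOCIATIONS: Dict[AssetRef, AssetRef] = {
    AssetRef("ZeaMays", "gtf", "v1.0"): AssetRef("ZeaMays", "fasta", "B73_AGP_v4_0"),
    AssetRef("ZeaMays", "gtf", "v2.0"): AssetRef("ZeaMays", "fasta", "B73_AGP_v4_0"),
}


def annotations_for(fasta: AssetRef) -> List[AssetRef]:
    """List every annotation recorded against the given fasta tag."""
    return sorted(a for a, f in ASSOCIATIONS.items() if f == fasta)


def is_compatible(annotation: AssetRef, fasta: AssetRef) -> bool:
    """True only if the annotation was recorded against exactly this fasta tag."""
    return ASSOCIATIONS.get(annotation) == fasta
```

Because the link is stored separately from the build recipe, several annotations can point at the same fasta tag without the fasta being a build-time parent.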
@ieguinoa -- as discussed at BOSC, here's a new repo underlying a server for plant genomes.
I've put 2 fasta assets in and built these, which are now deployed here:
http://refgenomes.databio.org:84/
(assets from here: https://github.com/usegalaxy-be/reference-data/blob/master/refgenie/refgenie_conf.sh )
Do you want to see if you can figure out how to populate the PEP metadata that describes the assets you want to build? Then I will build them all and deploy to test.
We can change the server and everything else later, but this is a good test.
@ieguinoa can you confirm the sequences in this file?
Musa_acuminata_Genescope-Cirad-fasta | fasta | ftp://ftp.psb.ugent.be/pub/plaza/plaza_public_monocots_03/Genomes/mac.con.gz | files | f2e587473d858fffe64b60899ea1e045 |
---|---|---|---|---|
I have the correct checksum but can't build a fai index for this file. It looks like there is some corruption at the end of the original file.
[E::fai_build_core] Format error, unexpected character at line 7882692
[faidx] Could not build fai index /project/shefflab/deploy/plantref/genomes/data/4591f237dfa34c1b426f6596e2ce5bb5cb92581dc5ea883c/fasta/default/4591f237dfa34c1b426f6596e2ce5bb5cb92581dc5ea883c.fa.fai
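One way to narrow down a failure like this is to scan the file and report the first line that does not look like sequence. The sketch below is not what `samtools faidx` does internally; `first_bad_line` is a hypothetical helper, and the allowed character set (letters plus `*`, `-`, `.`) is an assumption about what counts as a valid sequence line.

```python
import gzip
import string
import zlib

# Assumption: sequence lines contain only letters and a few gap/stop symbols.
ALLOWED = set(string.ascii_letters + "*-.")


def first_bad_line(path):
    """Return (line_number, text) for the first suspicious line, else None.

    Header lines starting with '>' are skipped. A truncated or corrupt gzip
    stream is reported at the line after the last one that could be read.
    """
    opener = gzip.open if path.endswith(".gz") else open
    number = 0
    try:
        with opener(path, "rt", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                text = line.rstrip("\r\n")
                if text.startswith(">"):
                    continue
                if not set(text) <= ALLOWED:
                    return number, text[:60]
    except (EOFError, zlib.error):
        return number + 1, "unreadable gzip stream"
    return None
```

Running it on the downloaded `mac.con.gz` and comparing the reported line number with the one in the faidx error would show whether both point at the same region of the file.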
@ieguinoa TLDR: I found two identical genomes in here. Can you prune this to just one of these or otherwise correct?
These two genomes have different versions and different fasta file paths, with different checksums:
Setaria_italica_JGI_8_3X_chromosome-scale_assembly_release_2_0_annotation_version_2_1-fasta | fasta | ftp://ftp.psb.ugent.be/pub/plaza/plaza_public_monocots_03/Genomes/sit.con.gz | files | 9ede8e22816d388ad63d17f4f22397e2 |
---|---|---|---|---|
Setaria_italica_JGI_v2_2-fasta | fasta | ftp://ftp.psb.ugent.be/pub/plaza/plaza_public_monocots_04_5/Genomes/sit.fasta.gz | files | 650591c5e36fa5ab60ae3e5d1d30555d |
However, the genome identifier we computed for them was identical, so I investigated further. It turns out the unzipped files are actually identical, so maybe the compression differed?
diff Setaria_italica_JGI_v2_2-fasta-fasta Setaria_italica_JGI_8_3X_chromosome-scale_assembly_release_2_0_annotation_version_2_1-fasta-fasta -s
Files Setaria_italica_JGI_v2_2-fasta-fasta and Setaria_italica_JGI_8_3X_chromosome-scale_assembly_release_2_0_annotation_version_2_1-fasta-fasta are identical
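The same check can be done without unzipping to disk by hashing the compressed bytes and the decompressed content separately. This is a minimal sketch; `md5_of` is a hypothetical helper, and it is not a claim about how the genome identifier itself is computed.

```python
import gzip
import hashlib


def md5_of(path, decompress=False, chunk_size=1 << 20):
    """MD5 of a file's raw bytes, or of its gunzipped content if decompress=True."""
    opener = gzip.open if decompress else open
    digest = hashlib.md5()
    with opener(path, "rb") as handle:
        # Read in chunks so large genomes do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if two gzip files decompress to identical bytes."""
    return md5_of(path_a, decompress=True) == md5_of(path_b, decompress=True)
```

If `same_content(a, b)` is true while `md5_of(a)` and `md5_of(b)` differ, the two downloads hold the same sequences and only the gzip container (compression level, header fields) differs.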
@ieguinoa can you confirm the contents of this file?
Thalassiosira_pseudonana_JGI_3_0-fasta | fasta | ftp://ftp.psb.ugent.be/pub/plaza/plaza_pico_03/Genomes/tps.fasta.gz | files | df4f77a45f88acb0dee211639fab2ae2 |
---|---|---|---|---|
It appears to be an empty file.