refgenie / plantref


Refgenieserver content repository for plant genomes server

Home Page: http://plantref.databio.org

Dockerfile 15.78% Python 84.22%

plantref's Introduction

A standardized reference genome resource manager. See the documentation.

plantref's People

Contributors

Watchers

Forkers

ieguinoa

plantref's Issues

Human friendly IDs + next steps

Following up on your comment about using common or short names, aliases, or otherwise more human-friendly names: I think we can come up with a list like this, at least for some of the species.
My question is how this would be implemented without losing the information that is currently in the name. Were you thinking of creating some kind of alias for the name, or of adding a tag to the fasta asset?
If we use tags, we can also use them to group versions of a genome: the "core" part of the ID stays as the genome name and the build ID moves into the tag.
An example is Zea mays. The genome name could be Zea_mays_B73 (or simply ZeaMays, as B73 is the main reference line), but it currently has several entries in the genomes list; that information should go in the tags. E.g. the tag list for the fasta asset would be B73_AGP_v4_0 and B73_5b_60.
Normally all of these would be different versions. We will make sure that the groupings make sense as versions and that we can identify the latest one, which will also be the default. Branches of a genome, such as different strains, would get a different genome name. E.g. ZeaMays PH207 is a different line from B73, so it would go into a genome named PH207 with a single fasta asset tagged v1_0.
Thus, when adding a new version of a genome (e.g. ZeaMays v5.0), the upgrade would mean adding a new asset tag 5.0 and changing the default to it. This way, when someone runs refgenie pull ZeaMays/fasta they would get the default, i.e. 5.0.
From the data curation point of view, we need to make sure that all default assets for a genome are compatible.
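
To make the proposal concrete, here is a minimal Python sketch of the tag scheme: the genome name carries the species or line, build IDs become tags, and one tag is marked as the default. It only illustrates the idea and is not how refgenie stores tags; the `FastaAsset` class and its methods are hypothetical, and the tag `5_0` stands in for the future ZeaMays v5.0 build.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class FastaAsset:
    """Hypothetical model of one genome's fasta asset: build tags plus a default."""
    tags: Set[str] = field(default_factory=set)
    default_tag: Optional[str] = None

    def add_tag(self, tag: str, make_default: bool = False) -> None:
        # The first tag becomes the default; later ones only when asked.
        self.tags.add(tag)
        if make_default or self.default_tag is None:
            self.default_tag = tag

    def resolve(self, tag: Optional[str] = None) -> str:
        # No tag requested: fall back to the curated default.
        chosen = self.default_tag if tag is None else tag
        if chosen not in self.tags:
            raise KeyError(f"unknown tag: {chosen!r}")
        return chosen


# Genome name carries the species/line; build IDs live in the tags.
genomes = {"ZeaMays": FastaAsset(), "PH207": FastaAsset()}
genomes["ZeaMays"].add_tag("B73_5b_60")
genomes["ZeaMays"].add_tag("B73_AGP_v4_0", make_default=True)
genomes["PH207"].add_tag("v1_0")

# Upgrading: add the new build as a tag and move the default to it.
genomes["ZeaMays"].add_tag("5_0", make_default=True)
print(genomes["ZeaMays"].resolve())             # the current default tag
print(genomes["ZeaMays"].resolve("B73_5b_60"))  # older build still reachable
```

In this sketch the default is always set explicitly by the curator rather than inferred from the tag names, since build IDs like these do not sort reliably.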

Also, a related question about the next assets to include: what is the best way to link assets that do not have a natural parent/child relationship?
If we start including annotations, we may have cases where multiple annotations are associated with the same genome fasta and version, e.g. zea_mays_GTF_annotation_v1.0 and zea_mays_GTF_annotation_v2.0.
I think one possible way is to use the genome fasta asset as an (optional) parent in the GTF annotation build recipe.
It would not strictly be used during the build, but it could be used to check compatibility and to keep track of the association.
Or would it be possible to add an option to set "associated-assets" and then link an annotation to a genome fasta at a specific tag?
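
As a rough sketch of what such an "associated-assets" link could look like, the Python below keeps a table from each annotation to the exact fasta tag it belongs to. Everything here is hypothetical: `AssetRef`, the `associated` table, and both helper functions are made up for illustration, and the pairing of the two GTF tags with `B73_AGP_v4_0` is an example, not curated data.

```python
from typing import Dict, List, NamedTuple


class AssetRef(NamedTuple):
    """Hypothetical pointer to one tagged asset of one genome."""
    genome: str
    asset: str
    tag: str


# Hypothetical "associated-assets" table: annotation -> fasta it belongs to.
# The pairing below is illustrative only, not curated data.
associated: Dict[AssetRef, AssetRef] = {
    AssetRef("ZeaMays", "gtf", "v1_0"): AssetRef("ZeaMays", "fasta", "B73_AGP_v4_0"),
    AssetRef("ZeaMays", "gtf", "v2_0"): AssetRef("ZeaMays", "fasta", "B73_AGP_v4_0"),
}


def annotations_for(fasta: AssetRef) -> List[AssetRef]:
    """All annotation assets linked to one specific fasta tag."""
    return sorted(ann for ann, target in associated.items() if target == fasta)


def is_compatible(annotation: AssetRef, fasta: AssetRef) -> bool:
    """True only if the annotation was registered against this exact fasta tag."""
    return associated.get(annotation) == fasta
```

The link is kept outside the build recipe here, so several annotations can point at the same fasta tag without the fasta being a build-time parent.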

structure ready to be populated

@ieguinoa -- as discussed at BOSC, here's a new repo underlying a server for plant genomes.

I've put 2 fasta assets in and built these, which are now deployed here:

http://refgenomes.databio.org:84/

(assets from here: https://github.com/usegalaxy-be/reference-data/blob/master/refgenie/refgenie_conf.sh ) Do you want to see if you can figure out how to populate the PEP metadata that describes the assets you want to build? Then I will build them all and deploy to test.

We can change the server and everything else later, but this is a good test.

corrupt file for Musa_acuminata_Genescope-Cirad

@ieguinoa
can you confirm the sequences in this file?

Musa_acuminata_Genescope-Cirad-fasta	fasta	ftp://ftp.psb.ugent.be/pub/plaza/plaza_public_monocots_03/Genomes/mac.con.gz	files	f2e587473d858fffe64b60899ea1e045

I get the correct checksum but can't build a fai index for this file. It looks like there is some corruption at the end of the original file.

[E::fai_build_core] Format error, unexpected character at line 7882692
[faidx] Could not build fai index /project/shefflab/deploy/plantref/genomes/data/4591f237dfa34c1b426f6596e2ce5bb5cb92581dc5ea883c/fasta/default/4591f237dfa34c1b426f6596e2ce5bb5cb92581dc5ea883c.fa.fai
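
One way to narrow down where a file like this goes wrong is to scan the gzipped fasta line by line. The sketch below is a rough check and does not reproduce samtools' exact rule: it assumes that any byte outside printable, non-space ASCII on a sequence line is suspicious, and it also reports a gzip stream that breaks before its end. The function name `first_bad_line` is made up for this example.

```python
import gzip
import zlib


def first_bad_line(path):
    """Return (line_number, reason) for the first suspicious line, or None."""
    line_no = 0
    try:
        with gzip.open(path, "rb") as handle:
            for line_no, raw in enumerate(handle, start=1):
                line = raw.rstrip(b"\r\n")
                if line.startswith(b">"):
                    continue  # header lines are not checked here
                # Assumption: sequence lines hold only printable, non-space ASCII.
                bad = [byte for byte in line if not 33 <= byte <= 126]
                if bad:
                    return line_no, f"unexpected byte 0x{bad[0]:02x}"
    except (EOFError, gzip.BadGzipFile, zlib.error) as err:
        # Truncated or damaged archive; line_no is the last line read cleanly.
        return line_no, f"gzip stream broke after this line: {err}"
    return None
```

Calling `first_bad_line("mac.con.gz")` on a local copy of the download would return either `None` or the first line number worth inspecting by hand.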

Corrupt fasta file for Thalassiosira_pseudonana_JGI_3_0

Same as #5 -- it appears the fasta file for Thalassiosira_pseudonana_JGI_3_0 is corrupt. @ieguinoa can you double-check this one as well?

Duplicate Arabidopsis_lyrata

Related to #8

I think these two are identical sequences, with different wrapping:

Arabidopsis_lyrata_JGI_v1_0-fasta-fasta
Arabidopsis_lyrata__JGI_v2_1-fasta-fasta

@ieguinoa how do you want to proceed?
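
A quick way to confirm that two fasta files differ only in line wrapping is to hash each record's sequence with the line breaks removed. The sketch below does that in Python; it is not the digest refgenie computes, and the function name `sequence_digests` is made up for this example. Case is left untouched, so differences in soft-masking would still show up.

```python
import hashlib


def sequence_digests(lines):
    """Map record name -> MD5 of its sequence with line wrapping removed."""
    digests = {}
    name, hasher = None, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                digests[name] = hasher.hexdigest()
            # Use the first word of the header as the record name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            hasher = hashlib.md5()
        elif hasher is not None:
            hasher.update(line.encode("ascii"))
    if name is not None:
        digests[name] = hasher.hexdigest()
    return digests


# Same sequence, wrapped at 4 characters versus on one line.
wrapped = [">chr1 example", "ACGT", "ACGT"]
unwrapped = [">chr1", "ACGTACGT"]
print(sequence_digests(wrapped) == sequence_digests(unwrapped))
```

For real files, pass an open text handle for each file, e.g. `sequence_digests(open(path))`, and compare the two dictionaries.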

Duplicated genomes

@ieguinoa TLDR: I found two identical genomes in here. Can you prune this to just one of these or otherwise correct?

These two genomes have different versions and different fasta file paths, with different checksums:

Setaria_italica_JGI_8_3X_chromosome-scale_assembly_release_2_0_annotation_version_2_1-fasta	fasta	ftp://ftp.psb.ugent.be/pub/plaza/plaza_public_monocots_03/Genomes/sit.con.gz	files	9ede8e22816d388ad63d17f4f22397e2
Setaria_italica_JGI_v2_2-fasta	fasta	ftp://ftp.psb.ugent.be/pub/plaza/plaza_public_monocots_04_5/Genomes/sit.fasta.gz	files	650591c5e36fa5ab60ae3e5d1d30555d

However, the genome identifier we computed for them was identical, so I investigated further. It turns out the unzipped files are actually identical, so maybe the compression differed?

diff Setaria_italica_JGI_v2_2-fasta-fasta Setaria_italica_JGI_8_3X_chromosome-scale_assembly_release_2_0_annotation_version_2_1-fasta-fasta -s
Files Setaria_italica_JGI_v2_2-fasta-fasta and Setaria_italica_JGI_8_3X_chromosome-scale_assembly_release_2_0_annotation_version_2_1-fasta-fasta are identical
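
Differing checksums for identical content are expected with gzip: the archive header can store a modification time and the original file name, and the compression level changes the compressed stream. The Python sketch below compares the checksum of the bytes on disk with the checksum of the decompressed content; the function names are made up for this example.

```python
import gzip
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """MD5 of the bytes on disk (what a download checksum covers)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_of_gzip_payload(path, chunk_size=1 << 20):
    """MD5 of the decompressed content, independent of how it was compressed."""
    digest = hashlib.md5()
    with gzip.open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_payload(path_a, path_b):
    """True if two gzip files decompress to identical bytes."""
    return md5_of_gzip_payload(path_a) == md5_of_gzip_payload(path_b)
```

Running `same_payload` on local copies of sit.con.gz and sit.fasta.gz should agree with the diff result, without having to write the unzipped files to disk.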

empty file

@ieguinoa can you confirm the contents of this file?

Thalassiosira_pseudonana_JGI_3_0-fasta	fasta	ftp://ftp.psb.ugent.be/pub/plaza/plaza_pico_03/Genomes/tps.fasta.gz	files	df4f77a45f88acb0dee211639fab2ae2

It appears to be an empty file.
