
plantref's Introduction


Refgenie

A standardized reference genome resource manager. See the documentation.

plantref's People

Contributors

ieguinoa, nsheff, stolarczyk


Forkers

ieguinoa

plantref's Issues

Human friendly IDs + next steps

Following up on your comment about using common/short names or aliases, or adapting the names to be more human friendly, I think we can come up with a list like this, at least for some of the species.
My question is how this would be implemented without losing the information that is currently in the name. Were you thinking of creating a kind of alias for the name, or of adding a tag to the fasta asset?
If we use tags, we can also use them to group versions of a genome, keeping the "core" part of the ID as the genome name and moving the build ID into the tag.
An example could be Zea mays. This could be the genome name (Zea_mays_B73, or simply ZeaMays, as B73 is the main reference line), but it has a few entries in the genomes list; this info should go in the tags. E.g. the tags list for the fasta asset would be: B73_AGP_v4_0 and B73_5b_60.
Normally all these would be different versions. We will make sure that these groupings make sense in terms of versions and that we can identify the latest one, which will also be the default. Branches of a genome, such as different strains, would get a different genome name. E.g. ZeaMays PH207 is a different line from B73, so it would go into a genome named PH207 with a single fasta asset tagged v1_0.
Thus, when adding a new version of a genome (e.g. we are adding ZeaMays v5.0), the upgrade process would mean adding this asset with tag 5.0 and also changing the default to it. This way, when someone runs refgenie pull ZeaMays:fasta they would get the default -> 5.0.
From the data curation point of view, we need to make sure that all default assets for a genome are compatible.
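
To make the proposal above concrete, here is a minimal sketch of the "genome name + tags + default" idea. It is a plain Python model, not refgenie's API: the class GenomeEntry and its methods are hypothetical names made up for illustration, and the tag strings are the ones mentioned in this issue.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class GenomeEntry:
    """Hypothetical model: one genome name (e.g. ZeaMays) with tagged fasta builds."""

    tags: Dict[str, str] = field(default_factory=dict)  # tag -> free-text description
    default_tag: Optional[str] = None

    def add_tag(self, tag: str, description: str, make_default: bool = True) -> None:
        """Register a build under a tag and, by default, point the default at it."""
        self.tags[tag] = description
        if make_default or self.default_tag is None:
            self.default_tag = tag

    def resolve(self, tag: Optional[str] = None) -> str:
        """Return the tag a request resolves to; no tag means the default."""
        chosen = tag if tag is not None else self.default_tag
        if chosen not in self.tags:
            raise KeyError(f"unknown tag: {chosen!r}")
        return chosen


zea = GenomeEntry()
zea.add_tag("B73_5b_60", "older B73 build")
zea.add_tag("B73_AGP_v4_0", "B73 AGP v4.0")
zea.add_tag("5.0", "new release, becomes the default")
print(zea.resolve())             # the default, now the newest tag
print(zea.resolve("B73_5b_60"))  # older builds stay reachable by explicit tag
```

The point of the sketch is that upgrading is a single operation (add a tag, move the default), while the build information that used to live in the long name is kept in the tag.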

Also, a related question about the next assets to include: what is the best way to have linked assets that do not have a natural parent/child relationship?
If we start including annotations, we may have cases where multiple annotations are associated with the same genome fasta and version, e.g. zea_mays_GTF_annotation_v1.0 and zea_mays_GTF_annotation_v2.0.
I think a possible way to do this is to use the genome fasta asset as an (optional) parent in the GTF annotation build recipe.
It is not strictly used during the build, but it can be used to check compatibility and to keep track of the association.
Or would it be possible to add an option to set "associated-assets" and then link an annotation to a genome fasta at a specific tag?
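
As a rough illustration of the "associated-assets" option, the link could be a simple lookup table from an annotation to the fasta build it was made for. Everything below is hypothetical (the AssetRef type, the table, and the gtf tags v1.0 and v2.0); it only shows that the association can be tracked and checked without a build-time parent/child relationship.

```python
from typing import Dict, List, NamedTuple


class AssetRef(NamedTuple):
    """Hypothetical reference to one asset: genome name, asset type, tag."""

    genome: str
    asset: str
    tag: str


B73_V4 = AssetRef("ZeaMays", "fasta", "B73_AGP_v4_0")

# Hypothetical "associated-assets" table: annotation -> fasta it belongs to.
ASSOCIATED: Dict[AssetRef, AssetRef] = {
    AssetRef("ZeaMays", "gtf", "v1.0"): B73_V4,
    AssetRef("ZeaMays", "gtf", "v2.0"): B73_V4,
}


def annotations_for(fasta: AssetRef, table: Dict[AssetRef, AssetRef] = ASSOCIATED) -> List[AssetRef]:
    """List every annotation linked to the given fasta build."""
    return sorted(annotation for annotation, linked in table.items() if linked == fasta)


def is_compatible(annotation: AssetRef, fasta: AssetRef,
                  table: Dict[AssetRef, AssetRef] = ASSOCIATED) -> bool:
    """True only if the annotation was registered against exactly this fasta build."""
    return table.get(annotation) == fasta


print(annotations_for(B73_V4))
print(is_compatible(AssetRef("ZeaMays", "gtf", "v2.0"), B73_V4))
```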

structure ready to be populated

@ieguinoa -- as discussed at BOSC, here's a new repo underlying a server for plant genomes.

I've put 2 fasta assets in and built these, which are now deployed here:

http://refgenomes.databio.org:84/

(assets from here: https://github.com/usegalaxy-be/reference-data/blob/master/refgenie/refgenie_conf.sh)

Do you want to see if you can figure out how to populate the PEP metadata that describes the assets you want to build? Then I will build them all and deploy to test.

We can change the server and everything else later, but this is a good test.

corrupt file for Musa_acuminata_Genescope-Cirad

@ieguinoa
can you confirm the sequences in this file?

Musa_acuminata_Genescope-Cirad-fasta fasta ftp://ftp.psb.ugent.be/pub/plaza/plaza_public_monocots_03/Genomes/mac.con.gz files f2e587473d858fffe64b60899ea1e045

I have the correct checksum but can't build a fai index for this file. It looks like there is some corruption at the end of the original file.

[E::fai_build_core] Format error, unexpected character at line 7882692
[faidx] Could not build fai index /project/shefflab/deploy/plantref/genomes/data/4591f237dfa34c1b426f6596e2ce5bb5cb92581dc5ea883c/fasta/default/4591f237dfa34c1b426f6596e2ce5bb5cb92581dc5ea883c.fa.fai
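
One way to look at the failure reported above is to scan the file for the first sequence line containing a character that is not expected in a FASTA record. This is a minimal sketch, not what samtools does internally: the function name first_bad_line is made up, and the allowed alphabet (ASCII letters plus "*" and "-") is an assumption that may need adjusting.

```python
import gzip
import string

# Assumption: sequence lines hold only ASCII letters, "*" and "-".
ALLOWED = frozenset((string.ascii_letters + "*-").encode("ascii"))


def first_bad_line(path):
    """Return (line_number, first 60 bytes) of the first suspicious sequence line, or None."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as handle:
        for number, raw in enumerate(handle, start=1):
            line = raw.rstrip(b"\r\n")
            if line.startswith(b">"):
                continue  # header lines may contain arbitrary text
            if any(byte not in ALLOWED for byte in line):
                return number, line[:60]
    return None
```

Calling first_bad_line on the downloaded mac.con.gz should show whether the reported line really contains stray bytes, which helps decide whether the upstream file itself is damaged.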

Duplicate Arabidopsis_lyrata

Related to #8

I think these two are identical sequences, with different wrapping:

Arabidopsis_lyrata_JGI_v1_0-fasta-fasta
Arabidopsis_lyrata__JGI_v2_1-fasta-fasta

@ieguinoa how do you want to proceed?
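
To confirm that two FASTA files differ only in line wrapping, one can hash each record's sequence with the line breaks removed and compare the results. The sketch below is an illustration, not the digest refgenie computes: the function names are made up, sequences are upper-cased before hashing (an assumption that ignores soft-masking), and records are matched by the first word of their header.

```python
import hashlib


def sequence_digests(path):
    """Map each record name to an MD5 of its sequence, ignoring line wrapping."""
    digests = {}
    name, hasher = None, None
    with open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    digests[name] = hasher.hexdigest()
                parts = line[1:].split()
                name = parts[0] if parts else ""
                hasher = hashlib.md5()
            elif line and hasher is not None:
                # Assumption: case differences (soft-masking) are not meaningful here.
                hasher.update(line.upper().encode("ascii"))
    if name is not None:
        digests[name] = hasher.hexdigest()
    return digests


def same_sequences(path_a, path_b):
    """True if both files hold the same named sequences, whatever the wrapping."""
    return sequence_digests(path_a) == sequence_digests(path_b)
```

If same_sequences returns True for the two Arabidopsis_lyrata files, they are duplicates in content even though a plain diff or file checksum reports them as different.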

Duplicated genomes

@ieguinoa TLDR: I found two identical genomes in here. Can you prune this to just one of these or otherwise correct?

These two genomes have different versions and different fasta file paths, with different checksums:

Setaria_italica_JGI_8_3X_chromosome-scale_assembly_release_2_0_annotation_version_2_1-fasta fasta ftp://ftp.psb.ugent.be/pub/plaza/plaza_public_monocots_03/Genomes/sit.con.gz files 9ede8e22816d388ad63d17f4f22397e2
Setaria_italica_JGI_v2_2-fasta fasta ftp://ftp.psb.ugent.be/pub/plaza/plaza_public_monocots_04_5/Genomes/sit.fasta.gz files 650591c5e36fa5ab60ae3e5d1d30555d

However, the genome identifier we computed for them was identical, so I investigated further. It turns out the unzipped files are actually identical, so maybe the compression differed?

diff Setaria_italica_JGI_v2_2-fasta-fasta Setaria_italica_JGI_8_3X_chromosome-scale_assembly_release_2_0_annotation_version_2_1-fasta-fasta -s
Files Setaria_italica_JGI_v2_2-fasta-fasta and Setaria_italica_JGI_8_3X_chromosome-scale_assembly_release_2_0_annotation_version_2_1-fasta-fasta are identical
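
The same check can be done directly on the .gz files by comparing checksums before and after decompression. This is a small sketch using only the standard library; the function names are made up for illustration.

```python
import gzip
import hashlib


def md5_of_file(path, decompress=False, chunk_size=1 << 20):
    """MD5 of the raw file bytes, or of the gunzipped content if decompress is True."""
    opener = gzip.open if decompress else open
    digest = hashlib.md5()
    with opener(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_archives(path_a, path_b):
    """Report whether two gzip files match as archives and as decompressed content."""
    return {
        "compressed_equal": md5_of_file(path_a) == md5_of_file(path_b),
        "content_equal": md5_of_file(path_a, True) == md5_of_file(path_b, True),
    }
```

A result with compressed_equal False and content_equal True would match what is described here: gzip output depends on the compression level and on header fields such as the timestamp and original file name, so identical content can yield different archive checksums.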

empty file

@ieguinoa can you confirm the contents of this file?

Thalassiosira_pseudonana_JGI_3_0-fasta fasta ftp://ftp.psb.ugent.be/pub/plaza/plaza_pico_03/Genomes/tps.fasta.gz files df4f77a45f88acb0dee211639fab2ae2

it appears to be an empty file.
