
pynome's People

Contributors

4ctrl-alt-del, biggstd, spficklin

pynome's Issues

JSON file missing in iRODS

Hi.

In transferring iRODS genome data to Colorado State for the NDN project, I noticed the following (see the 'ils' example below):

  1. There is no JSON file with metadata in iRODS. Does that file include Taxonomy IDs? That is very important, and maybe the Taxonomy ID should be in the directory name?
  2. Remove the '.txt' from the '*.Splice_sites.txt' names.

[ffeltus@scidas-scratch1 MG2]$ ils
/scidasZone/sysbio/genomes/Zymoseptoria_tritici/MG2:
Zymoseptoria_tritici-MG2.1.ht2
Zymoseptoria_tritici-MG2.2.ht2
Zymoseptoria_tritici-MG2.3.ht2
Zymoseptoria_tritici-MG2.4.ht2
Zymoseptoria_tritici-MG2.5.ht2
Zymoseptoria_tritici-MG2.6.ht2
Zymoseptoria_tritici-MG2.7.ht2
Zymoseptoria_tritici-MG2.8.ht2
Zymoseptoria_tritici-MG2.fa
Zymoseptoria_tritici-MG2.gff3
Zymoseptoria_tritici-MG2.gtf
Zymoseptoria_tritici-MG2.Splice_sites.txt

Pynome Naming Convention Parsing Issues

Hi. I was working with Colorado State to move the iRODS genome data into an NDN network when I realized there are some substantial naming/parsing issues:

  1. In the naming specification in

It says that genome files will be in this directory format:
[genus]_[species]{_[infraspecific name]}/[assembly_name]/

that contains these files:
[genus]_[species]{_[infraspecific name]}-[assembly_name].{hisat2 extension}
[genus]_[species]{_[infraspecific name]}-[assembly_name].gff3
[genus]_[species]{_[infraspecific name]}-[assembly_name].fasta
[genus]_[species]{_[infraspecific name]}-[assembly_name].gtf
[genus]_[species]{_[infraspecific name]}-[assembly_name].Splice_Sites.txt
[genus]_[species]{_[infraspecific name]}-[assembly_name].meta.json

However, the files are not in an "infraspecific name" subdirectory and the file names are not in the '[genus]_[species]{_[infraspecific name]}-[assembly_name].' format:

C- /scidasZone/sysbio/genomes/Absidia_glauca
C- /scidasZone/sysbio/genomes/Acanthamoeba_castellanii_str_neff
C- /scidasZone/sysbio/genomes/Acidomyces_richmondensis
C- /scidasZone/sysbio/genomes/Acidomyces_richmondensis_bfw
C- /scidasZone/sysbio/genomes/Acremonium_chrysogenum_atcc_11550

BIG ISSUES:

  1. There are no dashes in the file names, only underscores.
  2. What is an 'infraspecific name'?
  3. If there is a NULL level in the schema, can we put in double underscores or a dash-underscore so that parsing a file name always yields the same number of fields across genomes? For example, 'Absidia_glauca' and 'Acremonium_chrysogenum_atcc_11550' give different numbers of levels if you split at underscores, and I am not sure whether 'atcc' and '11550' should be infraspecific names. Why is there no dash before the assembly name? If there is no assembly name, how can there be an assembly subdirectory (which doesn't seem to exist anyway)?
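
To make the fixed-field idea in item 3 concrete, here is a minimal sketch of one possible scheme. It is not the Pynome specification: it assumes the infraspecific slot is always written (left empty when absent, which yields the "underscore-dash" form), that the taxon part never contains a dash, and that the first dash starts the assembly name. `GenomeName`, `build_stem`, `parse_stem` and the assembly name `asm1` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class GenomeName(NamedTuple):
    genus: str
    species: str
    infraspecific: str  # empty string when there is no infraspecific name
    assembly: str


def build_stem(name: GenomeName) -> str:
    """Join the fields so that every stem has the same number of slots."""
    return f"{name.genus}_{name.species}_{name.infraspecific}-{name.assembly}"


def parse_stem(stem: str) -> GenomeName:
    """Split a stem produced by build_stem back into its four fields."""
    # Assumption: the first dash separates the taxon part from the assembly.
    taxon, sep, assembly = stem.partition("-")
    if not sep:
        raise ValueError(f"no dash before assembly name in {stem!r}")
    # Split on the first two underscores only, so an infraspecific name
    # such as 'atcc_11550' stays in one field.
    parts = taxon.split("_", 2)
    if len(parts) != 3:
        raise ValueError(f"expected genus_species_infraspecific in {taxon!r}")
    return GenomeName(parts[0], parts[1], parts[2], assembly)
```

With this layout, `parse_stem("Absidia_glauca_-asm1")` and `parse_stem("Acremonium_chrysogenum_atcc_11550-asm1")` both return four fields. The first simply has an empty infraspecific name.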

Shall we meet to discuss these issues?

Alex

Problems related to gffread

Over 150 genomes failed on the last run of pynome. On inspecting one of them (./2792677/ASM1369444v1-ncbi), there were problems creating the cDNA FASTA file.

  1. The genome assembly FASTA index file with extension .fa.fai did not have the sequence names in the first column. Not sure what caused that.
  2. The cDNA file could not be created. Using the gffread utility on Kamiak I got the following message: "Error: no ID found for GFF record start". It turns out there is a transcript_id = ""; entry in the gene feature of the GTF file. When I manually removed that entry for each gene, the cDNA was built just fine.
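
The manual fix in item 2 could be automated along these lines. This is only a sketch, not existing Pynome code: it assumes standard nine-column, tab-separated GTF lines, and the function name `strip_empty_transcript_id` is hypothetical. The pattern accepts the attribute both with and without an `=` sign, since the exact spelling in the affected files was not checked here.

```python
import re

# Matches an empty transcript_id attribute, e.g. 'transcript_id "";'
_EMPTY_TRANSCRIPT_ID = re.compile(r'transcript_id\s*=?\s*""\s*;\s*')


def strip_empty_transcript_id(line: str) -> str:
    """Drop an empty transcript_id attribute from a GTF 'gene' line.

    Comment lines, malformed lines and non-gene features are returned
    unchanged.
    """
    if line.startswith("#"):
        return line
    ending = "\n" if line.endswith("\n") else ""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] != "gene":
        return line
    cols[8] = _EMPTY_TRANSCRIPT_ID.sub("", cols[8]).rstrip()
    return "\t".join(cols) + ending
```

Applying the function to every line of the GTF file before calling gffread would reproduce the manual edit described above. Whether that also resolves the missing names in the .fa.fai index is a separate question.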

Changes to the NCBI Downloader

We need to make the following changes to the NCBI downloader of Pynome:

  1. Assembly Filters:
    • We should limit assemblies to only those that are RefSeq representative genome assemblies. Representative genome assemblies are labeled as such in the refseq_category column (the 5th column) of the assembly_summary_genbank.txt file.
    • The assembly must have either a GFF3 file or a GTF file (or a gzipped version of those). If it is a GFF3 file, we must convert it to GTF (which I think Pynome already does). If it is a GTF file, it can be used as is.
    • We will not include bacterial or viral genomes. (Pynome already does this)
  2. Directory Naming:
    • When retrieving the name of the species, ignore the infraspecific_name column of the assembly_summary_genbank.txt file. Just use the organism_name column, as it does have the infraspecific name in it. The only thing, it appears, that the infraspecific_name column is holding is a key/value pair specifying the cultivar or the strain. We can just ignore this in the file naming.
    • Make sure any spaces in the name get replaced with underscores.
  3. Additional Metadata (Not Urgent): It would be nice to include the NCBI metadata information for the assembly in the assembly directory. It can be obtained in JSON format using this URL: https://api.ncbi.nlm.nih.gov/datasets/v1alpha/genome/accession/$id where $id is the NCBI ID (first column of the assembly_summary_genbank.txt file).
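
The filter, naming and metadata rules above can be sketched as three small helpers. This is an illustration under assumptions rather than the Pynome implementation: each row is assumed to be a dict keyed by the column names from the header of assembly_summary_genbank.txt (`assembly_accession`, `refseq_category`, `organism_name`), the value `representative genome` is assumed to be how the file labels representative assemblies, and the function names are hypothetical. No network request is made; the URL is only constructed.

```python
from typing import Mapping

# Assumed label for representative assemblies in the refseq_category column.
REPRESENTATIVE = "representative genome"

# URL pattern quoted in the issue text above.
METADATA_URL = (
    "https://api.ncbi.nlm.nih.gov/datasets/v1alpha/genome/accession/{accession}"
)


def is_representative(row: Mapping[str, str]) -> bool:
    """True when the row is labeled as a representative genome assembly."""
    return row.get("refseq_category", "").strip().lower() == REPRESENTATIVE


def directory_name(row: Mapping[str, str]) -> str:
    """Species directory name: organism_name with whitespace as underscores.

    The infraspecific_name column is deliberately ignored.
    """
    return "_".join(row["organism_name"].split())


def metadata_url(row: Mapping[str, str]) -> str:
    """URL of the JSON metadata for the assembly in this row."""
    return METADATA_URL.format(accession=row["assembly_accession"])
```

Splitting on any whitespace and rejoining with underscores also collapses accidental double spaces, which a plain replace of " " with "_" would turn into double underscores.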

NCBI Genomes

We should explore the possibility of adding NCBI genome assemblies:

ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/

Changes for Salmon and Kallisto

It seems Pynome is using the entire genome assembly for Salmon and Kallisto indexing rather than the cDNA sequences. The following should be used instead.

Ensembl

For both Salmon and Kallisto we need to use the cDNA file that Ensembl provides. You can find the cDNA files in an FTP directory similar to the following:

ftp://ftp.ensemblgenomes.org/pub/plants/release-48/fasta/oryza_sativa/cdna/

We need to retrieve the file with the suffix .cdna.all.fa.gz. The file just needs to be uncompressed and used instead of the whole genome FASTA file.
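
As a rough sketch of that step, the helpers below pick the `.cdna.all.fa.gz` entry out of a directory listing and uncompress it with the standard library. How the listing is obtained (FTP or otherwise) is left out, and the function names `pick_cdna_file` and `decompress` are hypothetical, not part of Pynome.

```python
import gzip
import shutil
from pathlib import Path
from typing import Iterable

CDNA_SUFFIX = ".cdna.all.fa.gz"


def pick_cdna_file(names: Iterable[str]) -> str:
    """Return the single file name ending in .cdna.all.fa.gz."""
    matches = [n for n in names if n.endswith(CDNA_SUFFIX)]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one cDNA file, found {len(matches)}")
    return matches[0]


def decompress(src: Path) -> Path:
    """Write the uncompressed copy next to src, dropping the .gz suffix."""
    dst = src.with_suffix("")  # 'x.cdna.all.fa.gz' -> 'x.cdna.all.fa'
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

The resulting `.cdna.all.fa` file is what would be passed to Salmon and Kallisto indexing in place of the whole-genome FASTA file.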

NCBI

If a GTF file exists we need to download it and run the following command to create a cDNA FASTA file:

gffread -w transcripts.fa -g genome.fa transcripts.gtf

Where:

  • transcripts.fa is the name of the output FASTA file that will have the cDNA entries.
  • genome.fa is the name of the whole genome FASTA file.
  • transcripts.gtf is the name of the GTF file.

We want to name the output file transcript.fa to follow our naming convention for all assembly files.
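
A minimal sketch of how the gffread call above might be wrapped in Python, assuming gffread is on the PATH. Only the `-w` and `-g` options from the command above are used; the function names are hypothetical and this is not taken from the Pynome code base.

```python
import subprocess
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def gffread_cdna_command(
    genome_fa: PathLike, gtf: PathLike, out_fa: PathLike
) -> List[str]:
    """Build the argument list for the gffread command shown above."""
    return ["gffread", "-w", str(out_fa), "-g", str(genome_fa), str(gtf)]


def build_cdna(genome_fa: PathLike, gtf: PathLike, out_fa: PathLike) -> None:
    """Run gffread; raises CalledProcessError if it exits with an error."""
    subprocess.run(gffread_cdna_command(genome_fa, gtf, out_fa), check=True)
```

Passing the arguments as a list avoids shell quoting problems with unusual file names, and `check=True` makes a failed cDNA build visible to the caller instead of silently leaving the output file missing.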

If a GTF file is not available but a GFF file is, then I believe Pynome already has code to convert it to GTF, although I think this currently happens after genome indexing. Instead, the conversion of a GFF to GTF should happen prior to indexing, and then that GTF file can be used to create a cDNA FASTA file just as described above.
