systemsgenetics / pynome

Genome collection script for the SciDAS project.

License: GNU General Public License v3.0

Python 97.38% Dockerfile 2.62%

pynome's Issues

JSON file missing in iRODS

Hi.

While transferring iRODS genome data to Colorado State for the NDN project, I noticed the following issues (see the 'ils' example below):

There is no JSON file with metadata in iRODS. Does that file include taxonomy IDs? They are very important and perhaps should be part of the directory name.
The '.txt' extension should be removed from the '*.Splice_sites.txt' file names.

[ffeltus@scidas-scratch1 MG2]$ ils
/scidasZone/sysbio/genomes/Zymoseptoria_tritici/MG2:
Zymoseptoria_tritici-MG2.1.ht2
Zymoseptoria_tritici-MG2.2.ht2
Zymoseptoria_tritici-MG2.3.ht2
Zymoseptoria_tritici-MG2.4.ht2
Zymoseptoria_tritici-MG2.5.ht2
Zymoseptoria_tritici-MG2.6.ht2
Zymoseptoria_tritici-MG2.7.ht2
Zymoseptoria_tritici-MG2.8.ht2
Zymoseptoria_tritici-MG2.fa
Zymoseptoria_tritici-MG2.gff3
Zymoseptoria_tritici-MG2.gtf
Zymoseptoria_tritici-MG2.Splice_sites.txt

Support Kallisto and Salmon indexes

GEMmaker now supports Kallisto and Salmon. Pynome should create those indexes too:

https://gemmaker.readthedocs.io/en/latest/preparing_and_running.html#step-2-acquire-reference-genome-files
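
As a rough sketch of what index creation could look like, the snippet below builds the command lines for both tools. The function names and the index paths are hypothetical and not part of Pynome; the only tool options used are 'kallisto index -i' and 'salmon index -t ... -i'.

```python
import subprocess


def kallisto_index_command(fasta, index_path):
    """Return the argv list for building a Kallisto index from a FASTA file."""
    return ["kallisto", "index", "-i", index_path, fasta]


def salmon_index_command(fasta, index_dir):
    """Return the argv list for building a Salmon index from a FASTA file."""
    return ["salmon", "index", "-t", fasta, "-i", index_dir]


def build_indexes(fasta, kallisto_index, salmon_dir):
    """Run both indexers; raises CalledProcessError if either one fails."""
    # Assumption: both executables are available on PATH.
    subprocess.run(kallisto_index_command(fasta, kallisto_index), check=True)
    subprocess.run(salmon_index_command(fasta, salmon_dir), check=True)
```

Keeping the command construction separate from execution makes it possible to check the arguments without having either tool installed.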

Pynome Naming Convention Parsing Issues

Hi. I was working with Colorado State to move the iRODS genome data into an NDN network when I realized there are some substantial naming and parsing issues.

In the naming specification in

It says that genome files will be in this directory format:
[genus]_[species]{_[infraspecific name]}/[assembly_name]/

that contains these files:
[genus]_[species]{_[infraspecific name]}-[assembly_name].{hisat2 extension}
[genus]_[species]{_[infraspecific name]}-[assembly_name].gff3
[genus]_[species]{_[infraspecific name]}-[assembly_name].fasta
[genus]_[species]{_[infraspecific name]}-[assembly_name].gtf
[genus]_[species]{_[infraspecific name]}-[assembly_name].Splice_Sites.txt
[genus]_[species]{_[infraspecific name]}-[assembly_name].meta.json

However, the files are not in an "infraspecific name" subdirectory and the file names are not in the '[genus]_[species]{_[infraspecific name]}-[assembly_name].' format:

C- /scidasZone/sysbio/genomes/Absidia_glauca
C- /scidasZone/sysbio/genomes/Acanthamoeba_castellanii_str_neff
C- /scidasZone/sysbio/genomes/Acidomyces_richmondensis
C- /scidasZone/sysbio/genomes/Acidomyces_richmondensis_bfw
C- /scidasZone/sysbio/genomes/Acremonium_chrysogenum_atcc_11550

BIG ISSUES:

There are no dashes in the file names, only underscores.
What is an 'infraspecific name'?
If a level in the schema is NULL, can we mark it with a double underscore or a dash-underscore so that parsing a file name always yields the same number of fields for every genome? For example, 'Absidia_glauca' and 'Acremonium_chrysogenum_atcc_11550' give different numbers of fields when split on underscores, and I am not sure whether 'atcc' and '11550' are infraspecific names.
Why is there no dash before the assembly name?
If there is no assembly name, how can there be an assembly subdirectory (which does not seem to exist anyway)?
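
To make the parsing concern concrete, here is a minimal sketch of a parser for the documented '[genus]_[species]{_[infraspecific name]}-[assembly_name].ext' file-name format. The class and function names are hypothetical, and the sketch assumes the organism part contains no dash and the assembly name contains no dot, which real assembly names may violate.

```python
from typing import NamedTuple, Optional


class GenomeName(NamedTuple):
    genus: str
    species: str
    infraspecific: Optional[str]  # None when the optional level is absent
    assembly: str
    extension: str


def parse_genome_filename(filename: str) -> GenomeName:
    """Split '[genus]_[species]{_[infraspecific]}-[assembly].[ext]' into fields."""
    # The first dash separates the organism part from the assembly part.
    organism, sep, rest = filename.partition("-")
    if not sep:
        raise ValueError(f"no dash before the assembly name: {filename!r}")
    # Assumption: the assembly name ends at the first dot.
    assembly, _, extension = rest.partition(".")
    # Split on at most two underscores so a multi-word infraspecific name
    # such as 'atcc_11550' stays in a single field.
    parts = organism.split("_", 2)
    if len(parts) < 2:
        raise ValueError(f"expected at least genus and species: {filename!r}")
    infraspecific = parts[2] if len(parts) == 3 else None
    return GenomeName(parts[0], parts[1], infraspecific, assembly, extension)
```

With a dash in place, the number of fields no longer depends on how many underscores the infraspecific name contains; without the dash, the split between organism and assembly cannot be recovered reliably.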

Shall we meet to discuss these issues?

Alex

iRODS user privileges as iRODS admin

This worked:
'ichmod -rM read susmit /scidasZone/sysbio'

Problems related to gffread

Over 150 genomes failed on the last run of pynome. On inspecting one of them (./2792677/ASM1369444v1-ncbi), there were problems creating the cDNA FASTA file.

The genome assembly FASTA index file with extension .fa.fai did not have the sequence names in the first column. Not sure what caused that.
The cDNA file could not be created. Running the gffread utility on Kamiak gave the message 'Error: no ID found for GFF record start'. It turns out there is a transcript_id = ""; entry in the gene features of the GTF file. When I manually removed that entry from each gene, the cDNA file was built without problems.
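
The manual fix could be automated along these lines. This is only a sketch of the workaround, not code from Pynome: the function name is hypothetical, and the pattern accepts the empty attribute both with and without an '=' sign because the exact form in the failing files is not shown here.

```python
import re

# Matches an empty transcript_id attribute, e.g. transcript_id ""; or
# transcript_id = ""; together with any whitespace in front of it.
EMPTY_TRANSCRIPT_ID = re.compile(r'\s*transcript_id\s*=?\s*"";')


def strip_empty_transcript_id(gtf_line: str) -> str:
    """Remove an empty transcript_id attribute from a GTF 'gene' record."""
    if gtf_line.startswith("#"):
        return gtf_line  # comment lines are passed through unchanged
    fields = gtf_line.rstrip("\n").split("\t")
    # A GTF record has nine tab-separated columns: column 3 is the feature
    # type and column 9 holds the attributes.
    if len(fields) != 9 or fields[2] != "gene":
        return gtf_line
    fields[8] = EMPTY_TRANSCRIPT_ID.sub("", fields[8]).strip()
    newline = "\n" if gtf_line.endswith("\n") else ""
    return "\t".join(fields) + newline
```

Applying the function to every line of the GTF file before calling gffread leaves transcript and exon records untouched and only edits gene records.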

Changes to the NCBI Downloader

We need to make the following changes to the NCBI downloader of Pynome:

Assembly Filters:
- We should limit assemblies to only those that are RefSeq representative genome assemblies. Representative genome assemblies are labeled as such in the 6th column of the assembly_summary_genbank.txt file.
- The assembly must have either a GFF3 file or a GTF file (or a gzipped version of either). A GFF3 file must be converted to GTF (which I think Pynome already does); a GTF file can be used as is.
- We will not include bacterial or viral genomes. (Pynome already does this)
Directory Naming:
- When retrieving the name of the species ignore the infraspecific_name column of the assembly_summary_genbank.txt file. Just use the organism_name column as it does have the infraspecific name in it. The only thing, it appears, that the infraspecific_name column is holding is a key/value pair specifying the cultivar or the strain. We can just ignore this in the file naming.
- Make sure any spaces in the name get replaced with underscores.
Additional Metadata (not urgent):
- It would be nice to include the NCBI metadata for the assembly in the assembly directory. It can be obtained in JSON format from this URL: https://api.ncbi.nlm.nih.gov/datasets/v1alpha/genome/accession/$id where $id is the NCBI ID (the first column of the assembly_summary_genbank.txt file).
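
The assembly filter and the directory naming could be sketched roughly as follows. The function names are hypothetical. The sketch looks columns up by header name instead of position, and it assumes the header names 'assembly_accession', 'refseq_category' and 'organism_name' and the label 'representative genome'; these should be checked against the actual summary file.

```python
def read_summary_rows(lines):
    """Yield one dict per data row of an assembly summary file."""
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # Assumption: the last comment line before the data names the columns.
            header = line.lstrip("#").strip().split("\t")
            continue
        if header is None or not line:
            continue
        yield dict(zip(header, line.split("\t")))


def directory_name(organism_name):
    """Replace every run of whitespace in the organism name with an underscore."""
    return "_".join(organism_name.split())


def select_representative(rows):
    """Yield (accession, directory name) for representative assemblies only."""
    for row in rows:
        # Assumption: this is the label used for representative assemblies.
        if row.get("refseq_category") == "representative genome":
            yield row["assembly_accession"], directory_name(row["organism_name"])
```

The infraspecific_name column is never read, so the directory name comes from organism_name alone, as proposed above.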

NCBI Genomes

We should explore the possibility of adding NCBI genome assemblies:

ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/

Changes for Salmon and Kallisto

It seems Pynome is using the entire genome assembly for Salmon and Kallisto indexing rather than the cDNA sequences. The following should be used instead.

Ensembl

For both Salmon and Kallisto we need to use the cDNA file that Ensembl provides. You can find the cDNA files in an FTP directory similar to the following:

ftp://ftp.ensemblgenomes.org/pub/plants/release-48/fasta/oryza_sativa/cdna/

We need to retrieve the file with the suffix .cdna.all.fa.gz. The file just needs to be uncompressed and used instead of the whole genome FASTA file.
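
A minimal sketch of the selection and decompression step, assuming the directory listing has already been fetched as a list of file names. The function names are hypothetical and the download itself is not shown.

```python
import gzip
import shutil

CDNA_SUFFIX = ".cdna.all.fa.gz"


def pick_cdna_file(names):
    """Return the single file name that ends with the cDNA suffix."""
    matches = [name for name in names if name.endswith(CDNA_SUFFIX)]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one {CDNA_SUFFIX} file, found {len(matches)}")
    return matches[0]


def decompress(src_path, dst_path):
    """Write the uncompressed contents of a gzip file to dst_path."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

The uncompressed file can then be passed to the Salmon and Kallisto indexers in place of the whole-genome FASTA file.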

NCBI

If a GTF file exists we need to download it and run the following command to create a cDNA FASTA file:

gffread -w transcripts.fa -g genome.fa transcripts.gtf

Where:

transcripts.fa is the name of the output FASTA file that will contain the cDNA entries.
genome.fa is the name of the whole-genome FASTA file.
transcripts.gtf is the name of the GTF file.

We want to name the output file transcripts.fa to follow our naming convention for all assembly files.

If a GTF file is not available but a GFF file is, then I believe Pynome already has code to convert it to GTF, although I think this currently happens after genome indexing. Instead, the conversion from GFF to GTF should happen before indexing, so that the resulting GTF file can be used to create a cDNA FASTA file as described above.
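
The gffread step could be wrapped as shown in this sketch. The function names are hypothetical and not taken from Pynome; the command line is the one given above.

```python
import subprocess


def gffread_command(genome_fasta, gtf, output_fasta):
    """Return the argv list: gffread -w <output> -g <genome> <gtf>."""
    return ["gffread", "-w", output_fasta, "-g", genome_fasta, gtf]


def build_cdna_fasta(genome_fasta, gtf, output_fasta):
    """Create the cDNA FASTA file; raises CalledProcessError if gffread fails."""
    # Assumption: the gffread executable is available on PATH.
    subprocess.run(gffread_command(genome_fasta, gtf, output_fasta), check=True)
```

Calling build_cdna_fasta after any GFF-to-GTF conversion and before indexing gives the ordering proposed above.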
