glarue / jgi-query

A simple command-line tool to download data from Joint Genome Institute databases

License: Mozilla Public License 2.0

Python 100.00%

bioinformatics cli genomes genomics python

jgi-query's Issues

Downloading bacterial genomes in bulk with submission IDs

Hi,

I have to download specific bacterial genomes from JGI. Is there a way to download them in bulk through their submission IDs via jgi-query?

Thanks,
Marco

downloading multiple datasets in bulk?

I can't tell whether this is possible using jgi-query or with the JGI API in general. I would like to download all of their bacterial genomes if at all possible but can't find a way to get a list by kingdom.

Can you provide any guidance here?
A

Files on tape can't be downloaded

In many cases, files are stored on tape, and downloading them seems to fail. I am not familiar enough with Python to make a pull request, but one partial workaround could be: if the same file is available both on tape and on hard drive, select the hard-drive copy. Thanks if you have time to check this!
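
The workaround suggested above can be sketched in a few lines of Python. This is not jgi-query's actual code: the `(filename, url)` tuples and the `prefer_disk_copies` helper are hypothetical, and the only detail taken from the logs in the other issues here is that tape-backed downloads go through a `get_tape_file` URL.

```python
def is_tape_url(url):
    """Assumption: tape-backed files are requested via a 'get_tape_file' URL."""
    return "get_tape_file" in url


def prefer_disk_copies(entries):
    """Keep one URL per filename, choosing a non-tape URL when one exists.

    `entries` is a hypothetical list of (filename, url) tuples.
    """
    chosen = {}
    for filename, url in entries:
        current = chosen.get(filename)
        # Take the first URL seen, then replace it only if it is a tape URL
        # and the new candidate is not.
        if current is None or (is_tape_url(current) and not is_tape_url(url)):
            chosen[filename] = url
    return chosen
```

Files that exist only on tape are still returned, so this narrows the problem rather than solving it.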

failed when using jgi-query to download files wanted

Hi Glarue,
I am trying to use the following command to download the files in the project, but I get an error message and a file with the right name but the wrong content:

command:

python jgi-query.py -x get-directory.xml # the xml file is downloaded from the project download page

#https://genome.jgi.doe.gov/portal/pages/dynamicOrganismDownload.jsf?organism=TheHunmicrobiome#

by clicking 'Open Downloads as XML'

following the instructions:

user name and password #fine

file to download

for example

2:2216 # a protein seqs file I want to download

I got this

Total download size of selected files: 693.23 KB
Continue? (y/n): y
Downloading '81031.assembled.faa' using command:
curl http://genome.jgi.doe.gov/EubpyrIsolGenome/download/_JAMO/56f1982d7ded5e7f7b938de5/81031.assembled.faa -b cookies -c cookies -L > 81031.assembled.faa
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 287 100 287 0 0 101 0 0:00:02 0:00:02 --:--:-- 101
0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0
100 9122 0 9122 0 0 2240 0 --:--:-- 0:00:04 --:--:-- 2240
Finished downloading all files.
ERROR: '81031.assembled.faa' appears to be malformed and will be left unmodified.
Keep temporary files ('/home/mpi/pengfei/Hungate1000p/get-directory.xml' and 'cookies')? (y/n): n
Removing temp files and exiting

and the file is wrong: it does not contain protein sequences, see attached

81031.assembled.faa.txt

Would you please help me check which step I am doing wrong?
Thanks

Best,
Pengfei
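
A quick way to confirm what was actually saved is to look at the first non-blank line of the file. The sketch below is only an illustration: the helper name `looks_like_fasta` is made up here, and this is not how jgi-query decides that a file is malformed. It relies on FASTA records starting with `>`, whereas an HTML or XML error page saved in place of the data would start with `<`.

```python
def looks_like_fasta(path):
    """Return True if the first non-blank line of `path` starts with '>'."""
    with open(path, "r", errors="replace") as handle:
        for line in handle:
            stripped = line.strip()
            if stripped:
                return stripped.startswith(">")
    return False  # an empty file is not FASTA either
```

Running it on the downloaded `81031.assembled.faa` would show whether the file holds sequences or some other response.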

Downloading the fungal database

Hello,
I wanted to download the entire fungal database, but the tool is not responding. Do you have any solution to recommend?
It takes a lot of time and in the end it gives me an empty XML file.
Thanks in advance.

ERROR :
#-------------------------------------------------------------------------------------------

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    92    0    92    0     0      0      0 --:--:--  0:10:00 --:--:--    22
Retrieving information from JGI for query 'fungi' using command 'curl 'https://genome.jgi.doe.gov/portal/ext-api/downloads/get-directory?organism=fungi' -L -b cookies > fungi_jgi_index.xml'


Traceback (most recent call last):
  File "/shared/ifbstor1/projects/HE/FungiDB/JGI-db/jgi-query-main/jgi-query.py", line 1151, in <module>
    if not any(v["results"] for v in list(file_list.values())):
AttributeError: 'NoneType' object has no attribute 'values'

#--------------------------------------------------------------------------------------------

FILE XML : fungi_jgi_index.xml

<html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
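
The traceback is consistent with the saved index not being a real index at all: the file holds the server's HTML error page. A minimal sketch of a guard that checks the file before it is used, assuming only the standard library; `load_index` is a hypothetical helper, not part of jgi-query, and what the tool should do on failure (retry, abort) is left open. Note that the 504 page shown above happens to be well-formed markup, so a plain XML parse would not reject it and an explicit HTML check is needed.

```python
import xml.etree.ElementTree as ET


def load_index(path):
    """Parse a downloaded index file, or raise ValueError with a readable hint."""
    with open(path, "rb") as handle:
        data = handle.read()
    if not data.strip():
        raise ValueError(f"{path} is empty")
    # An HTML error page (such as the 504 page above) is not a valid index.
    if data.lstrip().lower().startswith(b"<html"):
        first_line = data.decode("utf-8", errors="replace").strip().splitlines()[0]
        raise ValueError(f"{path} contains an HTML error page: {first_line}")
    try:
        return ET.fromstring(data)
    except ET.ParseError as err:
        raise ValueError(f"{path} is not well-formed XML: {err}") from err
```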

Download error

Hello, when I use python3 ./jgi-query/jgi-query.py --xml get-directory.xml to download files from JGI genome portal,
I get the following download error. Do you know what the issue could be?

Thanks!

Downloading '7393.1.70539.TCGAAG.fastq.gz' using command:
curl -m 120 'https://genome.jgi.doe.gov/portal/ext-api/downloads/get_tape_file?blocking=true&url=/Poptrisequencing_78/download/_JAMO/5254b441067c0136350e4f73/7393.1.70539.TCGAAG.fastq.gz' -b cookies > 7393.1.70539.TCGAAG.fastq.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
Trying '7393.1.70539.TCGAAG.fastq.gz' again due to download error (1/4):

Downloading error: curl: (28)

Hi @glarue
I'm running the script on a remote server to download the *.tar.gz files from bacterial groups (as suggested here: #4).

Every time I run this command (starts in tar.gz not shown):

python3 jgi-query.py tenericutes -r '.tar.gz.' --retry_n 0 -c

I get something like this, on every single genome ID contained in tenericutes:

Downloading '2582580514.tar.gz' using command:
curl -m 120 'https://genome.jgi.doe.gov/portal/ext-api/downloads/get_tape_file?blocking=true&url=/Comgenmetab10417/download/_JAMO/53e5233f0d87856ba82b2ddc/2582580514.tar.gz' -b cookies > 2582580514.tar.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:01:59 --:--:-- 0
curl: (28) Operation timed out after 120000 milliseconds with 0 bytes received

I also ran it with the default retry number (4) and the same thing happens. I hope you can guide me through this.

Many thanks in advance
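
One thing to try when every attempt stops at the 120-second limit is a longer timeout on each retry. The sketch below is an illustration under assumptions, not jgi-query's retry logic: `build_curl_command`, `download_with_retries` and their parameters are hypothetical, and the only curl options used are the ones visible in the log above (`-m`, `-b`) plus `-o` in place of the shell redirect. Whether a longer timeout is enough for tape-backed files is not something this report establishes.

```python
import subprocess


def build_curl_command(url, out_path, timeout_s, cookie_file="cookies"):
    """Build the curl call as an argument list (no shell quoting needed)."""
    return ["curl", "-m", str(timeout_s), url, "-b", cookie_file, "-o", out_path]


def download_with_retries(url, out_path, timeouts=(120, 300, 900), run=subprocess.run):
    """Try each timeout in turn; return True on the first success.

    curl exits with status 28 when the -m limit is reached.
    `run` is injectable so the logic can be exercised without a network.
    """
    for timeout_s in timeouts:
        result = run(build_curl_command(url, out_path, timeout_s))
        if result.returncode == 0:
            return True
        if result.returncode != 28:
            break  # not a timeout, so waiting longer is unlikely to help
    return False
```

The timeout values are arbitrary placeholders and would need tuning against real downloads.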

Error downloading the fungal database.

Hello @glarue

I am launching the script to retrieve all the fungal assembly sequences in fasta format, but it is showing me this error:

`python3 jgi-query.py fungi

Retrieving information from JGI for query 'fungi' using command 'curl 'https://genome.jgi.doe.gov/portal/ext-api/downloads/get-directory?organism=fungi' -L -b cookies > fungi_jgi_index.xml'

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 92 0 92 0 0 0 0 --:--:-- 0:10:00 --:--:-- 28

Traceback (most recent call last):
File "/JGI-db/jgi-query-main/jgi-query.py", line 1151, in <module>
if not any(v["results"] for v in list(file_list.values())):
AttributeError: 'NoneType' object has no attribute 'values'`

Do you have any idea how to download all the fungal genome sequences?

Incorporate into Biopython

Love the idea of this tool--what would you think of making it a package in Biopython so users could just do from Bio import jgi-query, or even making it a single function in Biopython? I don't know anything about how you would do this, but I'm sure the folks over there would be happy to help, and I think it would make it even easier to deploy and use.

Please change to reference current JGI signon server.

You have a reference to creating an account at signon.jgi-psf.org. Please have it use the current server address: signon.jgi.doe.gov

10 Minute time limit

So first off thanks for this tool, it's very useful.

My Problem:
I am trying to create an XML file from a very large database on JGI, and there seems to be a 10-minute runtime limit. Is this something built in, or can it be changed?

Thanks