enasequence / enabrowsertools

A collection of scripts to assist in the retrieval of data from the ENA Browser

License: Apache License 2.0

enaBrowserTools

enaBrowserTools is a set of scripts that interface with the ENA web services to download data from ENA easily, without any knowledge of scripting required. For a Java based command line downloader tool, please see https://github.com/enasequence/ena-ftp-downloader/.

License

Copyright 2017 EMBL - European Bioinformatics Institute Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Installation and Setup

Python installation

Python 3 scripts are available. These can be found in the "python3" folder.

To run these scripts you will need to have Python installed. You can download Python 3 from https://www.python.org/downloads/. If you already have Python installed, the version is printed when you start the python interpreter.

Note that EBI now uses HTTPS servers. This can cause problems when using Python 3 on a Mac because of an often-missed installation step. Run the "Install Certificates.command" script so that your Python 3 installation on the Mac can authenticate correctly against the servers. To do this, run the following from a terminal window, replacing 3.6 with the version of Python 3 that you have installed:

open "/Applications/Python 3.6/Install Certificates.command"

We have had a report from a user that when Python 3 was installed using homebrew, the following code needed to be run instead:

# install_certifi.py
#
# sample script to install or update a set of default Root Certificates
# for the ssl module.  Uses the certificates provided by the certifi package:
#       https://pypi.python.org/pypi/certifi

import os
import os.path
import ssl
import stat
import subprocess
import sys

STAT_0o775 = ( stat.S_IRUSR | stat.S_IWUSR | stat.S_IXUSR
             | stat.S_IRGRP | stat.S_IWGRP | stat.S_IXGRP
             | stat.S_IROTH |                stat.S_IXOTH )

openssl_dir, openssl_cafile = os.path.split(
    ssl.get_default_verify_paths().openssl_cafile)

print(" -- pip install --upgrade certifi")
subprocess.check_call([sys.executable,
    "-E", "-s", "-m", "pip", "install", "--upgrade", "certifi"])

import certifi

# change working directory to the default SSL directory
os.chdir(openssl_dir)
relpath_to_certifi_cafile = os.path.relpath(certifi.where())
print(" -- removing any existing file or link")
try:
    os.remove(openssl_cafile)
except FileNotFoundError:
    pass
print(" -- creating symlink to certifi certificate bundle")
os.symlink(relpath_to_certifi_cafile, openssl_cafile)
print(" -- setting permissions")
os.chmod(openssl_cafile, STAT_0o775)
print(" -- update complete")
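To check where your interpreter looks for root certificates, before or after applying either fix, a minimal sketch using only the standard library could look like the following. The helper name describe_ca_paths is hypothetical and is not part of enaBrowserTools.

```python
import os
import ssl


def describe_ca_paths():
    """Report the default CA file and directory compiled into OpenSSL,
    and whether each path currently exists on disk."""
    paths = ssl.get_default_verify_paths()
    return {
        "cafile": paths.openssl_cafile,
        "cafile_exists": os.path.exists(paths.openssl_cafile),
        "capath": paths.openssl_capath,
        "capath_exists": os.path.exists(paths.openssl_capath),
    }


if __name__ == "__main__":
    for key, value in describe_ca_paths().items():
        print("{0}: {1}".format(key, value))
```

If neither path exists, that is consistent with the certificate problem described above, although it does not prove it.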

Installing and running the scripts

Download the latest release and extract it to the preferred location on your computer. You will now have the enaBrowserTools folder containing the python 3 scripts. If you are using a Unix/Linux or Mac computer, we suggest you add the following aliases to your .bashrc or .bash_profile file, where INSTALLATION_DIR is the location where you have saved the enaBrowserTools.

alias enaDataGet=INSTALLATION_DIR/enaBrowserTools/python3/enaDataGet
alias enaGroupGet=INSTALLATION_DIR/enaBrowserTools/python3/enaGroupGet

This will allow you to run the tools from any location on your computer.

You can install and run these scripts on Windows as you would in Unix/Linux by using Cygwin. If you have python installed on your Windows machine, you can also run the python scripts directly without Cygwin. However, the call is a bit more complicated.

For example, instead of calling enaDataGet

you would need to call python INSTALLATION_DIR/enaBrowserTools/python3/enaDataGet.py

We will look more into the Windows equivalent of aliases to run batch files from the command line and hopefully be able to provide a better solution to Windows users shortly.

Setting up for Aspera

Important: there has been a change to using aspera in version 1.4.

If you wish to use Aspera to download read or analysis files, you will need to use the aspera_settings.ini file. Please save it to a static location on your local computer, and edit the file to include the location of your aspera binary (ASPERA_BIN) and the private key file (ASPERA_PRIVATE_KEY):

[aspera]
ASPERA_BIN = /path/to/ascp
ASPERA_PRIVATE_KEY = /path/to/aspera_dsa.openssh
ASPERA_OPTIONS =
ASPERA_SPEED = 100M
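For illustration, a settings file of this shape can be read with Python's standard configparser module. This is only a sketch: the function name read_aspera_settings is hypothetical, the fallback values for ASPERA_OPTIONS and ASPERA_SPEED are assumptions taken from the example above, and the actual parsing code in utils.py may differ.

```python
import configparser


def read_aspera_settings(path):
    """Parse the [aspera] section of a settings file into a plain dict."""
    config = configparser.ConfigParser()
    # config.read() silently skips unreadable files and returns the list of
    # files it managed to load, so an empty list means nothing was read.
    if not config.read(path):
        raise FileNotFoundError(path)
    section = config["aspera"]
    return {
        "bin": section["ASPERA_BIN"],
        "private_key": section["ASPERA_PRIVATE_KEY"],
        # Assumption: these two keys may be empty or missing.
        "options": section.get("ASPERA_OPTIONS", ""),
        "speed": section.get("ASPERA_SPEED", "100M"),
    }
```

A missing [aspera] section or a missing ASPERA_BIN or ASPERA_PRIVATE_KEY entry raises KeyError here, which makes a malformed file easy to spot.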

There are two command line flags/options available if you wish to use aspera for download. These are:

-a (or --aspera): use this flag if you'd like to download with aspera.
-as (or --aspera-settings): use this option if you'd like to specify the location of your aspera settings file. If you use this option, you don't need to use the --aspera flag.

For example, to download with aspera using an explicitly specified settings file:

enaDataGet -f fastq -as /path/to/aspera_settings.ini ACCESSION

If you don't wish to specify the location of the aspera settings file each time you use the scripts, you have the option to either set an ENA_ASPERA_INIFILE environment variable to save the location:

export ENA_ASPERA_INIFILE="/path/to/aspera_settings.ini"

or you can use the default location for the file, which is the enaBrowserTools directory. Note that if you use this option, you will have to be careful about how you update your scripts so that you don't overwrite your aspera settings file.

Using just the --aspera flag will result in the scripts looking first for the ENA_ASPERA_INIFILE environment variable, and second for the default file location.

enaDataGet -f fastq -a ACCESSION

Regardless of which option you have selected, if the aspera settings file cannot be found or the licence key file declared within your settings file does not exist, the scripts will default to using FTP for the download.
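The lookup order described in this section can be sketched as follows. The function name find_aspera_settings and the default_dir parameter are hypothetical, and the sketch assumes that an explicitly given path that does not exist leads straight to the FTP fallback; the real scripts may behave differently in that case.

```python
import os

ENV_VAR = "ENA_ASPERA_INIFILE"
DEFAULT_NAME = "aspera_settings.ini"


def find_aspera_settings(explicit_path=None, default_dir="."):
    """Return the settings file to use, or None to signal an FTP fallback.

    default_dir stands in for the enaBrowserTools directory.
    """
    # 1. A path given with --aspera-settings wins outright (assumption:
    #    a missing explicit file is not replaced by another candidate).
    if explicit_path:
        return explicit_path if os.path.isfile(explicit_path) else None
    # 2. Environment variable, then 3. the default location.
    candidates = (
        os.environ.get(ENV_VAR),
        os.path.join(default_dir, DEFAULT_NAME),
    )
    for candidate in candidates:
        if candidate and os.path.isfile(candidate):
            return candidate
    return None
```

Returning None rather than raising keeps the caller's decision simple: a None result means "download over FTP".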

Command line

There are two main tools for downloading data from ENA: enaDataGet and enaGroupGet.

enaDataGet

This tool will download all data for a given sequence, assembly, read or analysis accession or WGS set. Usage of this tool is described below. Note that unless a destination directory is provided, the data will be downloaded to the directory from which you run the command. When using an assembly, run, experiment or analysis accession, a subdirectory will be created using that accession as its name.

Accepted WGS set accession formats are:

AAAK03
AAAK03000000
AAAK
AAAK00000000
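As a rough illustration of how these four forms relate, the sketch below reduces each one to a prefix and an optional version. It assumes a four-letter prefix, as in the examples, and assumes that a version of 00 (or no version at all) refers to the unversioned set; neither assumption is taken from the enaBrowserTools code, and parse_wgs_accession is a hypothetical name.

```python
import re

# Four letters, then optionally a two-digit version, then optionally six zeros.
WGS_PATTERN = re.compile(r"^([A-Z]{4})(?:(\d{2})(?:0{6})?)?$")


def parse_wgs_accession(accession):
    """Split a WGS set accession into (prefix, version or None)."""
    match = WGS_PATTERN.match(accession)
    if match is None:
        raise ValueError("Not a recognised WGS set accession: " + accession)
    prefix, version = match.groups()
    # Assumption: "00" and a missing version both mean "no specific version".
    if version in (None, "00"):
        return prefix, None
    return prefix, int(version)
```

Under these assumptions AAAK03 and AAAK03000000 both reduce to the prefix AAAK with version 3, while AAAK and AAAK00000000 reduce to AAAK with no version.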

usage: enaDataGet [-h] [-f {embl,fasta,submitted,fastq,sra}] [-d DEST] [-w]
                  [-m] [-i] [-a] [-as ASPERA_SETTINGS] [-v]
                  accession

Download data for a given accession

positional arguments:
  accession             Sequence, coding, assembly, run, experiment or
                        analysis accession or WGS prefix (LLLLVV) to download

optional arguments:
  -h, --help            show this help message and exit
  -f {embl,fasta,submitted,fastq,sra}, --format {embl,fasta,submitted,fastq,sra}
                        File format required. Format requested must be
                        permitted for data type selected. sequence, assembly
                        and wgs accessions: embl(default) and fasta formats.
                        read group: submitted, fastq and sra formats. analysis
                        group: submitted only.
  -d DEST, --dest DEST  Destination directory (default is current running
                        directory)
  -w, --wgs             Download WGS set for each assembly if available
                        (default is false)
  -e, --extract-wgs     Extract WGS scaffolds for each assembly if available
                        (default is false)
  -exp, --expanded      Expand CON scaffolds when downloading embl format
                        (default is false)
  -m, --meta            Download read or analysis XML in addition to data
                        files (default is false)
  -i, --index           Download CRAM index files with submitted CRAM files,
                        if any (default is false). This flag is ignored for
                        fastq and sra format options.
  -a, --aspera          Use the aspera command line client to download,
                        instead of FTP.
  -as ASPERA_SETTINGS, --aspera-settings ASPERA_SETTINGS
                        Use the provided settings file, will otherwise check
                        for environment variable or default settings file
                        location.
  -v, --version         show program's version number and exit
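If you want to drive enaDataGet from a Python workflow, one possible approach is to build the argument list from the documented flags and hand it to subprocess. The helper names below are hypothetical. The sketch assumes an executable named enaDataGet is on your PATH, since a shell alias is not visible to subprocess.

```python
import subprocess


def build_enadataget_command(accession, fmt=None, dest=None,
                             meta=False, aspera=False):
    """Assemble an enaDataGet argument list using flags from the usage text."""
    command = ["enaDataGet"]
    if fmt:
        command += ["-f", fmt]      # one of embl, fasta, submitted, fastq, sra
    if dest:
        command += ["-d", dest]     # destination directory
    if meta:
        command.append("-m")        # also fetch read or analysis XML
    if aspera:
        command.append("-a")        # use aspera instead of FTP
    command.append(accession)
    return command


def run_enadataget(accession, **options):
    """Run the tool and raise CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_enadataget_command(accession, **options),
                          check=True)
```

For example, build_enadataget_command("ACCESSION", fmt="fastq", aspera=True) yields the same arguments as the command line enaDataGet -f fastq -a ACCESSION.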

enaGroupGet

This tool will allow you to download all data of a particular group (sequence, WGS, assembly, read or analysis) for a given sample or study accession. You can also download all sequence, WGS or assembly data for a given NCBI tax ID. When fetching data for a tax ID, the default is to only search for the specific tax ID, however you can use the subtree option to download the data associated with either the requested taxon or any of its subordinate taxa in the NCBI taxonomy tree.

Downloading read and analysis data by tax ID is currently disabled. We will put a data volume sanity check in place before we enable this functionality.

Usage of this tool is described below. A new directory will be created using the provided accession as the name, and all data will be downloaded here. There will also be a separate subdirectory created for each assembly, run and analysis being fetched. Note that unless a destination directory is provided, this group directory will be created in the directory from which you run the command.

usage: enaGroupGet [-h] [-g {sequence,wgs,assembly,read,analysis}]
                   [-f {embl,fasta,submitted,fastq,sra}] [-d DEST] [-w] [-m]
                   [-i] [-a] [-as ASPERA_SETTINGS] [-t] [-v]
                   accession

Download data for a given study or sample, or (for sequence and assembly) taxon

positional arguments:
  accession             Study or sample accession or NCBI tax ID to fetch data
                        for

optional arguments:
  -h, --help            show this help message and exit
  -g {sequence,wgs,assembly,read,analysis}, --group {sequence,wgs,assembly,read,analysis}
                        Data group to be downloaded for this
                        study/sample/taxon (default is read)
  -f {embl,fasta,submitted,fastq,sra}, --format {embl,fasta,submitted,fastq,sra}
                        File format required. Format requested must be
                        permitted for data group selected. sequence, assembly
                        and wgs groups: embl and fasta formats. read group:
                        submitted, fastq and sra formats. analysis group:
                        submitted only.
  -d DEST, --dest DEST  Destination directory (default is current running
                        directory)
  -w, --wgs             Download WGS set for each assembly if available
                        (default is false)
  -e, --extract-wgs     Extract WGS scaffolds for each assembly if available
                        (default is false)
  -exp, --expanded      Expand CON scaffolds when downloading embl format
                        (default is false)
  -m, --meta            Download read or analysis XML in addition to data
                        files (default is false)
  -i, --index           Download CRAM index files with submitted CRAM files,
                        if any (default is false). This flag is ignored for
                        fastq and sra format options.
  -a, --aspera          Use the aspera command line client to download,
                        instead of FTP.
  -as ASPERA_SETTINGS, --aspera-settings ASPERA_SETTINGS
                        Use the provided settings file, will otherwise check
                        for environment variable or default settings file
                        location.
  -t, --subtree         Include subordinate taxa (taxon subtree) when querying
                        with NCBI tax ID (default is false)
  -v, --version         show program's version number and exit

Tips

From version 1.4, if you download read data without the format option (that is, with the default format), the scripts look for available files in the following order of priority: submitted, sra, fastq.

A word of advice for read formats:

submitted: only read data submitted to ENA have files available as submitted by the user.
sra: this is the NCBI SRA format, and is the format in which all NCBI/DDBJ data is mirrored to ENA.
fastq: not all submitted format files can be converted to FASTQ
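The default priority for read formats described in these tips amounts to picking the first format, in that order, for which files exist. A minimal sketch, with a hypothetical function name and a plain set standing in for whatever the scripts actually use to record available formats:

```python
READ_FORMAT_PRIORITY = ("submitted", "sra", "fastq")


def pick_read_format(available_formats):
    """Return the highest-priority format that has files, or None."""
    for fmt in READ_FORMAT_PRIORITY:
        if fmt in available_formats:
            return fmt
    return None
```

With only sra and fastq files available, this returns sra, so pass -f fastq explicitly if FASTQ is what you need.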

Problems

For any problems, please contact the ENA helpdesk (https://www.ebi.ac.uk/ena/browser/support) with 'enaBrowserTools' in your subject line.

We have had a couple of reports that the R2 FASTQ files are not always successfully downloading for paired runs. We have been unable to replicate this problem and have therefore exposed the error message to you should there be any download failure of read/analysis files via FTP or Aspera. If you get one of these errors, please copy the error into the support form.

enabrowsertools's Issues

Error messages to stderr instead of stdout

This is terrific you've made these available, so thank you for doing that!

It's more of a stylistic request to have error messages go to stderr rather than stdout.

In most scripts you already have the sys module imported, so you could just change the print statements to sys.stderr.write('ERROR: <msg> \n') but you would have to add that import sys line into the utils.py in order to change out the print_error() function: https://github.com/enasequence/enaBrowserTools/blob/master/python3/utils.py#L485-L487
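A sketch of what the requested change might look like is shown below. It is not the actual print_error() from utils.py, whose body is not reproduced here; only the function name is taken from the issue.

```python
import sys


def print_error(message):
    """Write an error message to stderr so that stdout stays clean for data."""
    sys.stderr.write("ERROR: {0}\n".format(message))
```

With this in place, a call such as enaDataGet ACCESSION 2> err.log would capture the error text separately from normal progress output.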

Add --verbose option

I would like to see more output about what is happening
eg. connecting to https://xxxxx to get index etc

ASPERA_LICENSE hard-coded?

I notice also the code refers to ASPERA_LICENSE and os.path.exists(ASPERA_LICENSE).

I don't have this file, I use this ascp installation on the command line:
http://download.asperasoft.com/download/sw/connect/3.6.2/aspera-connect-3.6.2.117442-linux-64.tar.gz

I've upgraded my package to support 3.7.2 now using: https://download.asperasoft.com/download/sw/cli/3.7.2/aspera-cli-3.7.2.354.010c3b8-linux-64-release.sh

It seems to have /home/linuxbrew/.linuxbrew/etc/aspera-license file, so I assume adding ASPERA_LICENSE to the --aspera FILE.INI will work?

Download of paired-end reads fails in python 2.7

Running enaBrowserTools/python/enaDataGet -f fastq SRR2124780 fails with

Downloading file with md5 check:ftp.sra.ebi.ac.uk/vol1/fastq/SRR212/000/SRR2124780/SRR2124780_1.fastq.gz
Downloading file with md5 check:ftp.sra.ebi.ac.uk/vol1/fastq/SRR212/000/SRR2124780/SRR2124780_2.fastq.gz
Error with FTP transfer: [Errno ftp error] 200 Switching to Binary mode.
Downloading file with md5 check:ftp.sra.ebi.ac.uk/vol1/fastq/SRR212/000/SRR2124780/SRR2124780_2.fastq.gz
Error with FTP transfer: [Errno ftp error] 200 Switching to Binary mode.
Failed to download file after two attempts
Completed

likely due to this bug in urllib. Forward reads are downloaded correctly but the reverse reads are not. Running the same command again downloads reverse reads as well.

The error occurs in python 2.7.12 which ships with Ubuntu 17.10, and python 2.7.14 which comes with Fedora 27. The equivalent python 3 command enaBrowserTools/python3/enaDataGet -f fastq SRR2124780 works correctly.

ssl Internal Error

Hello,

First of all thanks for this very useful tool.

I'm getting the following error when running enaDataGet:

Traceback (most recent call last):
  File "/usr/lib/python3.8/urllib/request.py", line 1350, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/usr/lib/python3.8/http/client.py", line 1255, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1301, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1250, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1010, in _send_output
    self.send(msg)
  File "/usr/lib/python3.8/http/client.py", line 950, in send
    self.connect()
  File "/usr/lib/python3.8/http/client.py", line 1424, in connect
    self.sock = self._context.wrap_socket(self.sock,
  File "/usr/lib/python3.8/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/usr/lib/python3.8/ssl.py", line 1040, in _create
    self.do_handshake()
  File "/usr/lib/python3.8/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLError: [SSL] internal error (_ssl.c:1123)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/enaGroupGet.py", line 190, in <module>
    download_group(accession, group, output_format, dest_dir, fetch_wgs, extract_wgs, fetch_meta, fetch_index, aspera, subtree, expanded)
  File "/usr/bin/enaGroupGet.py", line 140, in download_group
    download_data_group(group, accession, output_format, group_dir, fetch_wgs, extract_wgs, fetch_meta, fetch_index, aspera, subtree, expanded)
  File "/usr/bin/enaGroupGet.py", line 88, in download_data_group
    download_report(group, utils.get_group_result(group), accession, temp_file_path, subtree)
  File "/usr/bin/enaGroupGet.py", line 68, in download_report
    response = utils.get_report_from_portal(search_url)
  File "/usr/bin/utils.py", line 476, in get_report_from_portal
    response = urlrequest.urlopen(request, context=gcontext)
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 1393, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/usr/lib/python3.8/urllib/request.py", line 1353, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL] internal error (_ssl.c:1123)>
ERROR: Something unexpected went wrong please try again.
If problem persists, please contact ENA (https://www.ebi.ac.uk/ena/browser/support) for assistance, with the above error details.

I've done a little bit of reading and found the ssl module error is raised by the underlying OpenSsl library (https://docs.python.org/3/library/ssl.html#ssl.SSLError).

Confirming this is an Openssl issue, the machine (ubuntu focal) where the error occurs has this version:

$ apt show openssl
Package: openssl
Version: 1.1.1f-1ubuntu2.1
Priority: important
Section: utils
Origin: Ubuntu
Maintainer: Ubuntu Developers <[email protected]>
Original-Maintainer: Debian OpenSSL Team <[email protected]>
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Installed-Size: 1287 kB
Depends: libc6 (>= 2.15), libssl1.1 (>= 1.1.1)
Suggests: ca-certificates
Homepage: https://www.openssl.org/
Task: minimal
Download-Size: 620 kB
APT-Manual-Installed: no
APT-Sources: http://archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages
Description: Secure Sockets Layer toolkit - cryptographic utility

Whereas on a Fedora machine enaDataGet runs fine, where OpenSSL is the following version:

Name        : openssl
Arch        : x86_64
Epoch       : 1
Version     : 1.0.2k
Release     : 8.el7
Size        : 814 k
Repo        : installed
From repo   : anaconda
Summary     : Utilities from the general purpose cryptography library with TLS implementation
URL         : http://www.openssl.org/
Licence     : OpenSSL
Description : The OpenSSL toolkit provides support for secure communications between
            : machines. OpenSSL includes a certificate management tool and shared
            : libraries which provide various cryptographic algorithms and
            : protocols.

Any idea what is up here?

ERROR: local variable 'ASPERA_PRIVATE_KEY' referenced before assignment

Not sure if i'm doing something silly?

% cat aspera.ini

[aspera]
ASPERA_BIN = ascp
ASPERA_PRIVATE_KEY = /home/linuxbrew/.linuxbrew/etc/asperaweb_id_dsa.openssh
ASPERA_OPTIONS = -QT
ASPERA_SPEED = 300M

% ./enaDataGet -a aspera.ini -f fastq SRR3309615

ERROR: cannot read aspera settings from aspera.ini
local variable 'ASPERA_PRIVATE_KEY' referenced before assignment

ASPERA_BIN detection broken on mac/python3.6.5

I cannot get the enaData/GroupGet apps to recognise my local ascp client.

$ enaDataGet -as ./aspera_settings.ini SRR6682864
Aspera binary ("/Users/nathanjohnson/Applications/Aspera_CLI/bin/ascp") does not exist. Defaulting to FTP transfer

$ ls /Users/nathanjohnson/Applications/Aspera_CLI/bin/ascp
/Users/nathanjohnson/Applications/Aspera_CLI/bin/ascp

Note: I have replaced 'Aspera\ CLI' with 'Aspera_CLI' here.

Digging deeper I find some odd behaviour in the utils.py module:

$ python
Python 3.6.5 (default, Apr 25 2018, 14:23:58)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from python3.utils import set_aspera_variables
>>> set_aspera_variables('aspera_settings.ini')
Aspera binary ("/Users/nathanjohnson/Applications/Aspera_CLI/bin/ascp") does not exist. Defaulting to FTP transfer
False
>>> ASPERA_BIN="/Users/nathanjohnson/Applications/Aspera_CLI/bin/ascp"
>>> if not os.path.exists(ASPERA_BIN):
...     print('Aspera binary ({0}) does not exist. Defaulting to FTP transfer'.format(ASPERA_BIN))
...
>>>

Thanks

Call "python3" instead of python for the python3 scripts

Many of us have py2 and py3 co-installed. In these cases the py2 exe is python and the py3 exe is python3.

You use python for both, which clashes.

sequenceGet.py imports an undefined function

In sequenceGet.py, the function download_unversioned_wgs() calls utils.get_nonversioned_supp_wgs_ftp_url(). However, this function is undefined in utils.py.

Sample ID SAMN* regex bug

The regex on line 106 of utils.py in the python3 code:

sample_pattern_1 = re.compile('^SAM[ND][0-9]{7}$')

doesn't match everything, eg SAMN03649073 because it has 8 digits. It makes enaGroupGet throw an error:

enaGroupGet -f fastq SAMN03649073
ERROR: Invalid accession. Only sample and study/project accessions or NCBI tax ID supported

I don't know what the spec is for these accessions, otherwise I'd put in a pull request. Python2 code also has the same regex.
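To make the mismatch concrete, the sketch below compares the pattern quoted in the issue with one possible relaxation that accepts seven or more digits. The relaxed pattern is an assumption for illustration only and is not the official accession specification; is_sample_accession is a hypothetical helper.

```python
import re

# Pattern as quoted in the issue: exactly seven digits.
original_pattern = re.compile(r"^SAM[ND][0-9]{7}$")

# Assumed relaxation: seven or more digits.
relaxed_pattern = re.compile(r"^SAM[ND][0-9]{7,}$")


def is_sample_accession(accession):
    """Return True if the accession matches the relaxed pattern."""
    return relaxed_pattern.match(accession) is not None
```

SAMN03649073 has eight digits, so it fails the original pattern and passes the relaxed one.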

Make --aspera options an external file

When packaging the tools on a shared system we can't be expected to change the utils.py script every time we upgrade.

Can you please read the settings from a file specified in an environmental variable?
And optionally allow the file to be specified with the --aspera option?

export ENADATAGET_ASPERA_CONFIG=/my/system/path/to/aspera/settings.ini

# or

enaDataGet --aspera /my/system/path/to/aspera/settings.ini

The settings.ini file could be YAML preferably, or JSON or "python" as you have below.

If you wish to use Aspera to download read or analysis files, you will need to edit the utils.py script. You will minimally need to specify the location of ascp and the licence file on your computer. All available aspera settings that can be modified are located at the top of utils.py:

ASPERA_BIN = 'ascp' # ascp binary
ASPERA_PRIVATE_KEY = '/path/to/aspera_dsa.openssh' # ascp private key file
ASPERA_LICENSE = 'aspera-license' # ascp license file
ASPERA_OPTIONS = '' # set any extra aspera options
ASPERA_SPEED = '100M' # set aspera download speed

Aspera option improvement

Thanks for adding the --aspera option, but it is very inconvenient for all the users on a cluster to have to specify it, as you say "You will need to provide a path to this file every time you chose to run aspera."

Can you make the default be a file specified in an ENV variable?

For example

export ENA_ASPERA_COMMANDLINE="-m 300m -i key.ssh -DT" 
#and/or
export ENA_ASPERA_INIFILE="/bio/sw/enaBrowserTools/etc/aspera.ini"

If either of these are set, you use them. But let -a override it.

Some SRA accession numbers don't work.

Hi all,

enaDataGet fails to recognize certain SRA accession numbers. For example, 'SRR10704240' results in an invalid SRA accession number error after using the command enaDataGet -f fastq SRR10704240. However, the experiment accession number, 'SRX7388391', does work with that command.

This is with version 1.5.3.

Cheers!

Add filename for failed or incomplete downloads

Hi team,

I don't know why, but sometimes when I run enaDataGet not all of the files are downloaded. I can usually run the program again and it is fine, however I notice that the error message informs us how much was not downloaded but it doesn't specify which file is incomplete:

Error with FTP transfer: <urlopen error retrieval incomplete: got only 1083805828 out of 3179431281 bytes>

Could a change like this print an error of the exact file which wasn't downloaded properly?:

# Fragment of utils.py: os, sys, urlrequest and check_md5 come from that module.
def get_ftp_file_with_md5_check(ftp_url, dest_dir, md5):
    # Work out the file name before the try block so the handler can report it.
    filename = ftp_url.split('/')[-1]
    dest_file = os.path.join(dest_dir, filename)
    try:
        urlrequest.urlretrieve(ftp_url, dest_file)
        return check_md5(dest_file, md5)
    except Exception as e:
        sys.stderr.write("Error with FTP transfer: {0}\n".format(e))
        sys.stderr.write("Error with FTP transfer occurred for file: {0}\n".format(filename))
        return False

Thank you,
Tim

Download by GCA accession

Is it possible to use this software to download all sequences associated with a GCA accession, for example GCA_000009045.1? This would be very helpful for our work on Rfam.

enaGroupGet doesn't throw an exception when it fails to download one accession

I have been running into trouble in a Nextflow pipeline I wrote. After some investigation, I found out that for some of the samples only one of the fastq files was downloaded, but the process still finished without an error. The relevant part of my bash script is below:

enaDataGet -f fastq $accession || enaGroupGet -f fastq $accession

I am using both because some accessions have only one sub-accession and some have plenty.

File not found errors with aspera option

I'm trying to use the enaBrowserTools scripts to download data via aspera, but I am getting file not found errors.

So, this call:

davey@n80295:~/git/enaBrowserTools/python$ ./enaGroupGet -f fastq -as ../aspera_settings.ini SRP006773
Fetching SRR203489
Downloading file with md5 check:fasp.sra.ebi.ac.uk:/vol1/fastq/SRR203/SRR203489/SRR203489.fastq.gz
Error with Aspera transfer: [Errno 2] No such file or directory

But the file exists on the FTP site:

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR203/SRR203489/SRR203489.fastq.gz

Why are the aspera links non-functional?

Cheers

Rob

enaGroupGet <bioproject-id> 'Completed' but no files fetched or errors

I'm trying the py2 version.

$ ./enaGroupGet PRJNA358417 2> err.log
Fetching SRR5125719
No files of format submitted for SRR5125719
Fetching SRR5125720
No files of format submitted for SRR5125720
Fetching SRR5132331
No files of format submitted for SRR5132331
Fetching SRR5132332
No files of format submitted for SRR5132332
Fetching SRR5132333
No files of format submitted for SRR5132333
Completed
$ echo  $?
0

./PRJNA358417/{SRR5125719,SRR5125720,SRR5132331,SRR5132332,SRR5132333}/ dirs are created, but all are empty.

The stderr was empty but should list the errors encountered, and the exit status shouldn't be 0 if no files were fetched.
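Until the exit status reflects this case, a workflow can guard against the "Completed but empty" outcome by inspecting the output directory after the tool returns. The following is a sketch of such a check, with a hypothetical helper name:

```python
import os


def count_nonempty_files(directory):
    """Count regular files larger than zero bytes anywhere under directory."""
    count = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if os.path.getsize(os.path.join(root, name)) > 0:
                count += 1
    return count
```

A wrapper could then call sys.exit(1) when count_nonempty_files("PRJNA358417") returns 0, which gives the calling pipeline a non-zero status to react to.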

enaDataGet for entity with absent default files gives random binary output

Doing a search for a Run (I think the default download for Run is submitted files), which has no submitted files gives odd output:

$ enaDataGet -as ./aspera_settings.ini SRR6682864
Downloading file with md5 check:ftp.sra.ebi.ac.uk/vol1/srr/SRR668/004/SRR6682864
$ tree SRR6682864
SRR6682864
└── SRR6682864

0 directories, 1 file
$ more SRR6682864/SRR6682864
"SRR6682864/SRR6682864" may be a binary file.  See it anyway?
$ enaDataGet -as ./aspera_settings.ini -f fastq SRR6682864
Downloading file with md5 check:ftp.sra.ebi.ac.uk/vol1/fastq/SRR668/004/SRR6682864/SRR6682864_1.fastq.gz
Downloading file with md5 check:ftp.sra.ebi.ac.uk/vol1/fastq/SRR668/004/SRR6682864/SRR6682864_2.fastq.gz

Thanks

Add tests

Would be good to have some basic tests we could run (ok if internet needed) to check it all installed ok.

This is also desirable for the test block in most packaging systems.

Consider a single py2/3 compatible package

To avoid having separate py2 and py3 version can you write in a way to make it work on both?

See http://python-future.org/compatible_idioms.html

Says "Completed" but folder is empty

It seems to indicate success, but there is no data:

% ls
<empty folder>

% ./enaDataGet  SRR3319297
Completed

% ls 
SRR3319297

% ls  SRR3319297
<empty folder>

Relationship to 'enasearch' python module

I recently came across this new python module and tool by @bebatut

https://github.com/bebatut/enasearch/

It might be useful to work together or use this?

Ability to only check if files are available for download

I have a workflow that uses this tool to fetch fastq files of a list of user provided accessions. Sometimes a certain accession doesn't have any files associated with them. This causes download to fail and workflow to be stuck. It would be great to be able to check if a certain accession has files of specified format available, with a clear output or exit code so we can take them out and proceed with workflow. This can be done via a flag (something like --check-files-only) which outputs True/False or exit with 0/1. I can try to make a PR if this is desirable.

The `dirname $0` trick fails when symlinks are used

#!/bin/bash
dir=`dirname "$0"`
python ${dir}/assemblyGet.py "$@"

If this shell script has been symlinked into /usr/local/bin say, it won't resolve to the proper place to find the python script.

Use realpath "$0" or readlink -e "$0" instead.

Or just make the python script directly runnable?
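If the Python scripts were made directly runnable, the same symlink concern would apply to locating sibling modules. In Python, the usual way to resolve it is os.path.realpath, as in this sketch (script_directory is a hypothetical helper, not something in the repository):

```python
import os


def script_directory(script_path):
    """Return the directory of the real file behind script_path,
    following any symlinks along the way."""
    return os.path.dirname(os.path.realpath(script_path))
```

Inside a script this would typically be called as script_directory(__file__), so a symlink in /usr/local/bin still resolves to the installation directory.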

Possible bug in report download when selecting a group

Hi there,

When using enaGroupGet, the response that comes back selecting the default format vs fastq results in a different URL between the 'submitted_aspera' and 'fastq_aspera' fields, respectively. For example:

./enaGroupGet -f fastq -a PRJEB3008

calls:

utils.download_report_from_portal()

which results in the following correct response:

Fetching ERR125556 b'run_accession\tfastq_aspera\tfastq_md5\n' b'ERR125556\tfasp.sra.ebi.ac.uk:/vol1/fastq/ERR125/ERR125556/ERR125556.fastq.gz\t3568a68dc8e8cdc54c7afa9ae58053cf\n'

However, when calling:

./enaGroupGet -a PRJEB3008

The response, notably the missing colon between fastq_aspera and submitted_aspera, is different:

Fetching ERR125556 b'run_accession\tsubmitted_aspera\tsubmitted_md5\n' b'ERR125556\tfasp.sra.ebi.ac.uk/vol1/ERA128/ERA128463/sff/FZU8VVO01_MID9.sff\tbeb7f366441984e0f5f2ada5e57464a6\n'

Is this a bug? Aspera will fail if I try to use the submitted_aspera link with no colon between the host and the file path.

Cheers,

Rob
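As a client-side workaround for the difference reported in this issue, a link without the colon could be rewritten before it is passed to ascp. This is only a sketch with a hypothetical helper name; whether the missing colon is a bug in the portal response is not established here.

```python
def normalise_aspera_link(link):
    """Insert a colon between host and path if it is missing.

    'host/path' becomes 'host:/path'; 'host:/path' is returned unchanged.
    """
    host, separator, path = link.partition("/")
    if not separator or host.endswith(":"):
        return link
    return "{0}:/{1}".format(host, path)
```

Applied to the submitted_aspera value above, this produces fasp.sra.ebi.ac.uk:/vol1/ERA128/ERA128463/sff/FZU8VVO01_MID9.sff, which matches the shape of the fastq_aspera value.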

Add enaBrowserTools to Bioconda channel

Hi there,
would it be possible for you to contribute an enaBrowserTools recipe to the conda package manager via the bioconda channel https://bioconda.github.io so it can be installed using conda install enaBrowserTools?
Many thanks in advance!

GCA (NCBI) vs EBI naming format

I am pretty used to using the GCA formatting in various projects, as it was easier to sync via FTP at NCBI than via the EBI, but with this application those problems are most likely solved.

However, is there an easy way to go from GCA to EBI/ENA formatting? I am currently playing with the taxon IDs, as this is the general ID they have in common, but it is not unique to single organisms... Also, the identifiers (http://www.ebi.ac.uk/ena/data/view/GCA_000007565.1) do "exist" at EBI but are not allowed via the retrieval application?

Failing to connect to sequence read database/server

I'm using enaDataGet (1.5) and have been for a while. As of yesterday, I cannot access the short read sets. A typical command is:

enaDataGet -f fastq -a ERR536833

I am able to download genome assemblies. I'm not sure why sequence reads is failing. Is there a server issue for these data or am I doing something wrong?

Thanks for your help

James

Error in using enaBrowserTools to download SRR

Hi,
I am running into a problem using enaBrowserTools to download SRR accessions.
The error output is shown below:

No redownload of truncated files

Tool: enaGroupGet
Version: 1.6

Hi,

Recently I've been getting FTP errors (like "421 There are too many connected users, please try later."), which is fine except that when re-running the tool, there is no attempt to finish downloading (or redownload) incomplete files, as the (partial) files exist in the directory.
I'm not sure what the intended behavior is but it would be nice to have an option to redownload files whose MD5 sums don't match.

Many thanks!
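For what it's worth, the check I have in mind is small. A sketch using only the standard library, assuming the expected MD5 is the one already reported alongside each file (the function names are hypothetical):

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """MD5 hex digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path, expected_md5):
    """True if the file is absent or its MD5 differs from the expected one."""
    if not os.path.isfile(path):
        return True
    return md5_of(path) != expected_md5.lower()
```

A truncated file left behind by a dropped FTP connection would fail the comparison and be fetched again, while complete files would be skipped.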

HTTPError after enaGroupGet

Hi!
First of all, thanks for writing this software. For a couple of weeks now, enaGroupGet has been throwing this error regardless of my Python version (I've tried 2.7, 3.6 and 3.7). It's related to the readGet.py script and I don't know how to solve it.
Thanks for having a look,
Lourdes

Traceback (most recent call last):
File "enaGroupGet.py", line 189, in <module>
download_group(accession, group, output_format, dest_dir, fetch_wgs, extract_wgs, fetch_meta, fetch_index, aspera, subtree, expanded)
File "enaGroupGet.py", line 138, in download_group
download_data_group(group, accession, output_format, group_dir, fetch_wgs, extract_wgs, fetch_meta, fetch_index, aspera, subtree, expanded)
File "enaGroupGet.py", line 95, in download_data_group
download_data(group, data_accession, output_format, group_dir, fetch_wgs, extract_wgs, expanded, fetch_meta, fetch_index, aspera)
File "enaGroupGet.py", line 83, in download_data
readGet.download_files(data_accession, output_format, group_dir, fetch_index, fetch_meta, aspera)
File "/WORKING_DIRECTORY/lvelo/enaBrowserTools/enaBrowserTools-1.5.5/python/readGet.py", line 90, in download_files
utils.download_report_from_portal(search_url, temp_file)
File "/WORKING_DIRECTORY/lvelo/enaBrowserTools/enaBrowserTools-1.5.5/python/utils.py", line 452, in download_report_from_portal
response = get_report_from_portal(url)
File "/WORKING_DIRECTORY/lvelo/enaBrowserTools/enaBrowserTools-1.5.5/python/utils.py", line 449, in get_report_from_portal
return urllib2.urlopen(request)
File "/home/lvelo/working_directory/my-envs/ena2.7/lib/python2.7/urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "/home/lvelo/working_directory/my-envs/ena2.7/lib/python2.7/urllib2.py", line 435, in open
response = meth(req, response)
File "/home/lvelo/working_directory/my-envs/ena2.7/lib/python2.7/urllib2.py", line 548, in http_response
'http', request, response, code, msg, hdrs)
File "/home/lvelo/working_directory/my-envs/ena2.7/lib/python2.7/urllib2.py", line 467, in error
result = self._call_chain(*args)
File "/home/lvelo/working_directory/my-envs/ena2.7/lib/python2.7/urllib2.py", line 407, in _call_chain
result = func(*args)
File "/home/lvelo/working_directory/my-envs/ena2.7/lib/python2.7/urllib2.py", line 654, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "/home/lvelo/working_directory/my-envs/ena2.7/lib/python2.7/urllib2.py", line 435, in open
response = meth(req, response)
File "/home/lvelo/working_directory/my-envs/ena2.7/lib/python2.7/urllib2.py", line 548, in http_response
'http', request, response, code, msg, hdrs)
File "/home/lvelo/working_directory/my-envs/ena2.7/lib/python2.7/urllib2.py", line 473, in error
return self._call_chain(*args)
File "/home/lvelo/working_directory/my-envs/ena2.7/lib/python2.7/urllib2.py", line 407, in _call_chain
result = func(*args)
File "/home/lvelo/working_directory/my-envs/ena2.7/lib/python2.7/urllib2.py", line 556, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 400: Bad Request
ERROR: Something unexpected went wrong please try again.
If problem persists, please contact [email protected] for assistance. with the above error details.

Aspera installation hints

Is it possible for you to give some hints on installing Aspera?

I tried to install it on a Linux server during the workshop today and I think they're no longer bundling the SSH key we need to supply.

Suran said he'd have a look as part of a ticket but I think #38 refers to a slightly different issue.

If you do find a way to get this working, it would be really great if you could include some hints in the README or in the wiki. I'd like to start doing lots of downloading next week and it's going to be loads slower over FTP :(

Thank you for a good session today.

Consider merging all tools into one tool

Currently you have

analysisGet     assemblyGet     enaDataGet     enaGroupGet     readGet     sequenceGet

I think it would be much better to have a single executable tool, ena-get (or similar), where the type of object could be specified (or auto-predicted). Or use subcommands like samtools and other modern tools. (Note that argparse supports sub-commands.)

Is this possible?
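To illustrate the argparse route, here is a rough sketch. The program name ena-get comes from the suggestion above; the subcommand names and options are placeholders, not an agreed interface:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="ena-get")
    # One subcommand per data type; `required=True` needs Python 3.7+.
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name in ("analysis", "assembly", "read", "sequence"):
        sub = subparsers.add_parser(name, help=f"download {name} data")
        sub.add_argument("accession")
        sub.add_argument("-f", "--format", default=None)
    return parser
```

With this layout, something like `ena-get read -f fastq ERR536833` would parse into a namespace carrying the command, format and accession, and each existing script could become the handler for one subcommand.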

Make pip installable

Ideally this would be possible:

pip install enaBrowserTools

It's not as hard as it seems: http://marthall.github.io/blog/how-to-package-a-python-app/

Fetching genomes sometimes get stuck

I have been trying the following command (on a Mac):

python3 enaBrowserTools/python3/enaDataGet.py -f embl GCA_000393655.1

and noticed that it gets stuck on unplaced-scaffold. I am not sure how to continue.

no sequences: assembled-molecule
no sequences: unlocalised-scaffold
fetching sequences: unplaced-scaffold

The folder content:

-rw-r--r--  1 me  staff   4.5K May  9 16:51 GCA_000393655.1.xml
-rw-r--r--  1 me  staff    19M May  9 16:51 GCA_000393655.1_sequence_report.txt
-rw-r--r--  1 me  staff     0B May  9 16:51 unplaced-scaffold.dat

Update

On our ubuntu server it obtains the data but very slowly and shows a lot of:

Entry: ASAF01227205 display type is either not supported or entry is not found.

Update 2

The main page of this git project does indeed say to fix the certificates.

However, as I used Homebrew to install python3, I needed to run the following code instead:

# install_certifi.py
#
# sample script to install or update a set of default Root Certificates
# for the ssl module.  Uses the certificates provided by the certifi package:
#       https://pypi.python.org/pypi/certifi

import os
import os.path
import ssl
import stat
import subprocess
import sys

STAT_0o775 = ( stat.S_IRUSR | stat.S_IWUSR | stat.S_IXUSR
             | stat.S_IRGRP | stat.S_IWGRP | stat.S_IXGRP
             | stat.S_IROTH |                stat.S_IXOTH )

openssl_dir, openssl_cafile = os.path.split(
    ssl.get_default_verify_paths().openssl_cafile)

print(" -- pip install --upgrade certifi")
subprocess.check_call([sys.executable,
    "-E", "-s", "-m", "pip", "install", "--upgrade", "certifi"])

import certifi

# change working directory to the default SSL directory
os.chdir(openssl_dir)
relpath_to_certifi_cafile = os.path.relpath(certifi.where())
print(" -- removing any existing file or link")
try:
    os.remove(openssl_cafile)
except FileNotFoundError:
    pass
print(" -- creating symlink to certifi certificate bundle")
os.symlink(relpath_to_certifi_cafile, openssl_cafile)
print(" -- setting permissions")
os.chmod(openssl_cafile, STAT_0o775)
print(" -- update complete")

but it is not yet 100% fixed. Not sure why some genomes do work while others don't.

Genome download which files to use

I am trying to figure out which settings and files to use to have the most complete and correct representation of a genome.

In the code I found the following type of output files:

REPLICON = 'assembled-molecule'
UNLOCALISED = 'unlocalised-scaffold'
UNPLACED = 'unplaced-scaffold'
PATCH = 'patch'

When downloading a genome, for example GCA_000003215.1

enaBrowserTools/python3/enaDataGet -f embl --wgs --extract-wgs --expanded GCA_000003215.1

It generates the following files:

-rw-r--r-- 1 root root 1746946 Oct 20 06:45 ABFD02.dat.gz
-rw-r--r-- 1 root root 5168 Oct 20 06:45 GCA_000003215.1.xml
-rw-r--r-- 1 root root 1242 Oct 20 06:45 GCA_000003215.1_sequence_report.txt
-rw-r--r-- 1 root root 5533183 Oct 20 06:45 assembled-molecule.dat
-rw-r--r-- 1 root root 0 Oct 20 06:45 wgs_scaffolds.dat

In this case I assume the assembled-molecule.dat is the most complete genome file?
It contains 1 chromosome with unknown gap sizes while the gzip file contains the 31 contigs separately.

Or would it be wiser to always use the gzipped file?

Skip existing files

I have to do some additional testing, but I forgot the --meta option, and on re-running it seems to be downloading the already downloaded files again?

Add option to split projects by samples

Unless I missed a bit of functionality somewhere, when downloading a whole read set via a project/study identifier, can there be a --by-sample toggle to split the run files up by sample? e.g.

<study_accession>/<sample_accession>/<run_accession>.

Error with retrieving fastq files with enaGroupGet

Hello,

I am trying to retrieve the fastq files for Project SRP199470 but end up with error messages as follows

(ena) [dmelloc@arc PRJNA544731]$ enaGroupGet -f fastq SRP199470
Traceback (most recent call last):
File "/home/dmelloc/anaconda3/envs/ena/bin/enaGroupGet.py", line 164, in <module>
if not utils.is_available(accession):
File "/home/dmelloc/anaconda3/envs/ena/bin/utils.py", line 242, in is_available
return (not 'entry is not found' in record.text) and (len(record.getchildren()) > 0)
AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'getchildren'

Would really appreciate any assistance to resolve this.

Thank you!
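For context, `Element.getchildren()` was removed from `xml.etree.ElementTree` in Python 3.9, which is why the traceback appears on newer interpreters. A minimal sketch of an equivalent check that works on both old and new versions; `is_available_from_xml` is a hypothetical stand-in that takes the XML text directly rather than fetching it:

```python
import xml.etree.ElementTree as ET


def has_children(record):
    """True if the element has at least one child element."""
    # len() on an Element counts its children and replaces getchildren().
    return len(record) > 0


def is_available_from_xml(xml_text):
    record = ET.fromstring(xml_text)
    text = record.text or ""  # .text is None for elements without text
    return ("entry is not found" not in text) and has_children(record)
```

Until the tool is updated, running it under Python 3.8 or earlier should also avoid the error.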

Confusion in WGS vs Assembly

I am a bit confused by the usage of WGS vs Assembly.

My assumption was that all genomes that are assembled and "annotated" / available in the EMBL format could be retrieved through

./enaGroupGet -g wgs -f embl -d ../../ENA/ --wgs --meta --index --subtree 2

In this case this would be Whole Genome Sequencing under taxon group 2.

But then what does the following do?

./enaGroupGet --meta --index --subtree 2 -f embl -g assembly --wgs

It now downloads GCA_00000 genomes which, as far as I can see, are drafts. Looking into the file, the header contains ABFD02000031, which corresponds to a genome, ABFD02, that is available through the first command.

However it does download some extra files which contains the assembly information. Is that the major difference between -g wgs and -g assembly ? Or will some genomes be available through -g assembly and not by -g wgs?

Feature Request: Continue after failed download

Downloading studies consisting of multiple fastq files with enaGroupGet has failed for me a number of times now. It always finishes a couple of files and then randomly crashes.

Then I have two choices.

1. Just restart enaGroupGet, redownload everything (even the files that were already finished) and hope that it works this time.
2. Manually find out which sample IDs were part of the study, check which ones have already finished, and then use enaDataGet to download the rest one by one.

It would be really great if you could make both enaGroupGet and enaDataGet aware of successfully downloaded files (e.g. by checksum or file size) and by default only redownload files that changed.

Citing

We have used this tool extensively for downloading genomes, and describe it in the materials and methods section of a paper we are currently writing. Is there a paper that could be cited, besides mentioning the tool by name?

URL sanitizing breaks Aspera downloads

When I try to use ascp to download accessions, enaDataGet (version 1.5.3) returns an error:

Downloading file with md5 check:fasp.sra.ebi.ac.uk%3A/vol1/fastq/SRR838/003/SRR8387643/SRR8387643_1.fastq.gz
Creating (scrubbed)/test/SRR8387643/logs
/pollard/home/pbradz/bin/ascp -QT -L (scrubbed)/test/SRR8387643/logs -l 100M -P33001 -i (scrubbed)/.aspera/connect/etc/asperaweb_id_dsa.openssh [email protected]%3A/vol1/fastq/SRR838/003/SRR8387643/SRR8387643_1.fastq.gz ./SRR8387643
ascp: no remote host specified 
Startup failed, exit

On closer inspection it looks like the colon after "fasp.sra.ebi.ac.uk" is getting sanitized to a "%3A" by urlparse.quote(), which then confuses ascp.

If I patch download_file in readGet.py to the following, it works:

def download_file(file_url, dest_dir, md5, aspera):
    # Quote only the path so the ':' between host and path stays literal for ascp.
    server, sep, filename = file_url.partition(":")
    if sep:
        file_url = f'{server}:{urlparse.quote(filename)}'
    else:
        # No ':' separator in the URL: quote the whole string.
        file_url = urlparse.quote(file_url)
    if utils.file_exists(file_url, dest_dir, md5):
        return
    success = attempt_file_download(file_url, dest_dir, md5, aspera)
    if not success:
        success = attempt_file_download(file_url, dest_dir, md5, aspera)
    if not success:
        print('Failed to download file after two attempts')

Edit: Should have also specified that I was using Python 3 here.

Some of the .py scripts are still 'executable'

Need to chmod -x *.py as they aren't directly runnable (no #! line).

FileNotFoundError for temp.txt

I am using Snakemake and enaDataGet to download several fastq files concurrently from a project (namely PRJEB9586) instead of using enaGroupGet. However I get the following error:

FileNotFoundError: [Errno 2] No such file or directory: 'PRJEB9586/temp.txt'

I believe this is the block of code from python3/readGet.py that messes things up:

temp_file = os.path.join(dest_dir, 'temp.txt')
utils.download_report_from_portal(search_url, temp_file)
f = open(temp_file)
lines = f.readlines()
f.close()
os.remove(temp_file)

Because my Snakemake pipeline is running concurrently, there are several processes creating and deleting this temp.txt file, which is causing the error. Can this be fixed, possibly by using Python's tempfile library?
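Something along these lines is what I mean. It is only a sketch: `read_report_lines` is a hypothetical wrapper, and `fetch_report` stands in for `utils.download_report_from_portal`, assumed to take the URL and a destination path as in the block above:

```python
import os
import tempfile


def read_report_lines(fetch_report, search_url, dest_dir):
    """Download a report to a uniquely named temp file and return its lines."""
    os.makedirs(dest_dir, exist_ok=True)
    # mkstemp creates a file with a unique name, so concurrent processes
    # sharing dest_dir never write to or delete each other's report.
    fd, temp_path = tempfile.mkstemp(prefix="report_", suffix=".txt", dir=dest_dir)
    os.close(fd)
    try:
        fetch_report(search_url, temp_path)
        with open(temp_path) as handle:
            return handle.readlines()
    finally:
        os.remove(temp_path)
```

The `finally` clause also cleans up the temp file when the download raises, which the current open/read/remove sequence does not.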

Download WGS set in absence of higher level identifiers in the GCA xml

I'm suggesting this as an enhancement of the enaDataGet tool.

Take GCA_001005085.2 as an example. In cases such as this, the tool will only download the assembly XML file. It would be very helpful if the tool downloaded the WGS set directly, without having to use the -w option as an additional step. A message for logging purposes would be very handy as well.

Downloading contigs?

Could you give an example command for downloading all wgs data (contigs) files for C. jejuni?

Thanks in advance.

Errors accessing fastq files through enaDataGet

Hi,

A number of accessions return errors when attempting to download fastq files with enaDataGet; I've included the issue, error and an example lane ID of these three cases. enaBrowserTools version 1.5.3

1. Lane suppression; when a lane is suppressed the enaDataGet tool cannot access it, however, it still attempts to retrieve the record and results in the error below.

/.singularity.d/actions/exec: 21: exec: enaDataGet: not found
Traceback (most recent call last):
  File "/opt/enaBrowserTools-1.5.5/python3/enaDataGet.py", line 99, in <module>
    readGet.download_files(accession, output_format, dest_dir, fetch_index, fetch_meta, aspera)
  File "/opt/enaBrowserTools-1.5.5/python3/readGet.py", line 121, in download_files
    if utils.is_empty_dir(target_dir):
UnboundLocalError: local variable 'target_dir' referenced before assignment
ERROR: Something unexpected went wrong please try again.
If problem persists, please contact [email protected] for assistance, with the above error details.

A lane with this error is SRR8901447.

2. Missing lane; if a lane accession exists but has no downloadable files on ENA, it returns this error:

mv: cannot stat ‘/path_download/SRR9998315/*’: No such file or directory

An accession showing this is SRR9998315.

3. Unrecognised accession; if the accession is not recognised by enaDataGet, it returns a clear error message stating this. However, the lanes are present on ENA and can be downloaded with wget.

Checking availability of https://www.ebi.ac.uk/ena/browser/api/xml/SRR10049758
ERROR: Invalid accession provided

A lane with this issue is SRR10049758.

It would be appreciated if you could provide descriptive errors for the first and second issues and a solution for the third.

Thanks!

Handle 'withdrawn' items better

I wrote this tutorial last year:
https://github.com/ngs-docs/angus/blob/2017/kraken_species_identification.rst

So I tried to use your tool to download the data used in it.

enaDataGet -f fastq SRR2423672
ERROR: Something unexpected went wrong please try again.
If problem persists, please contact [email protected] for assistance.

Now it turns out the entry used to exist, but was withdrawn:

Run: SRR2423672 has been removed at the submitter's request.
See: http://www.ebi.ac.uk/ena/data/view/SRR2423672

Are you able to catch this situation better and inform the user?