
pysradb's Introduction

A Python package for retrieving metadata from SRA/ENA/GEO


Documentation

https://saketkc.github.io/pysradb

CLI Usage

pysradb supports command line usage. See CLI instructions or quickstart guide.

$ pysradb
 usage: pysradb [-h] [--version] [--citation]
                {metadata,download,search,gse-to-gsm,gse-to-srp,gsm-to-gse,gsm-to-srp,gsm-to-srr,gsm-to-srs,gsm-to-srx,srp-to-gse,srp-to-srr,srp-to-srs,srp-to-srx,srr-to-gsm,srr-to-srp,srr-to-srs,srr-to-srx,srs-to-gsm,srs-to-srx,srx-to-srp,srx-to-srr,srx-to-srs}
                ...

 pysradb: Query NGS metadata and data from NCBI Sequence Read Archive.
 version: 2.0.1
 Citation: 10.12688/f1000research.18676.1

 optional arguments:
   -h, --help            show this help message and exit
   --version             show program's version number and exit
   --citation            how to cite

 subcommands:
   {metadata,download,search,gse-to-gsm,gse-to-srp,gsm-to-gse,gsm-to-srp,gsm-to-srr,gsm-to-srs,gsm-to-srx,srp-to-gse,srp-to-srr,srp-to-srs,srp-to-srx,srr-to-gsm,srr-to-srp,srr-to-srs,srr-to-srx,srs-to-gsm,srs-to-srx,srx-to-srp,srx-to-srr,srx-to-srs}
     metadata            Fetch metadata for SRA project (SRPnnnn)
     download            Download SRA project (SRPnnnn)
     search              Search SRA for matching text
     gse-to-gsm          Get GSM for a GSE
     gse-to-srp          Get SRP for a GSE
     gsm-to-gse          Get GSE for a GSM
     gsm-to-srp          Get SRP for a GSM
     gsm-to-srr          Get SRR for a GSM
     gsm-to-srs          Get SRS for a GSM
     gsm-to-srx          Get SRX for a GSM
     srp-to-gse          Get GSE for a SRP
     srp-to-srr          Get SRR for a SRP
     srp-to-srs          Get SRS for a SRP
     srp-to-srx          Get SRX for a SRP
     srr-to-gsm          Get GSM for a SRR
     srr-to-srp          Get SRP for a SRR
     srr-to-srs          Get SRS for a SRR
     srr-to-srx          Get SRX for a SRR
     srs-to-gsm          Get GSM for a SRS
     srs-to-srx          Get SRX for a SRS
     srx-to-srp          Get SRP for a SRX
     srx-to-srr          Get SRR for a SRX
     srx-to-srs          Get SRS for a SRX

Quickstart

A Google Colaboratory version of the most used commands is available in this Colab Notebook. Note that this requires only an active internet connection (no additional downloads are made).

The following notebooks document all the possible features of `pysradb`:

  1. Python API
  2. Downloading datasets from SRA - command line
  3. Download multiple datasets in parallel - Python API
  4. Converting SRA-to-fastq - command line (requires conda)
  5. Downloading subsets of a project - Python API
  6. Download BAMs
  7. Metadata for multiple SRPs
  8. Multithreaded fastq downloads using Aspera Client
  9. Searching SRA/GEO/ENA

Installation

To install stable version using `pip`:

pip install pysradb

Alternatively, if you use conda:

conda install -c bioconda pysradb

This step will install all the dependencies. If you have an existing environment with a lot of pre-installed packages, conda might be slow. Please consider creating a new environment for pysradb:

conda create -c bioconda -n pysradb python=3.10 pysradb

Dependencies

pandas
requests
tqdm
xmltodict

Installing pysradb in development mode

git clone https://github.com/saketkc/pysradb.git
cd pysradb && pip install -r requirements.txt
pip install -e .

Using pysradb

Obtaining SRA metadata

$ pysradb metadata SRP000941 | head

study_accession experiment_accession experiment_title                                                                                                                 experiment_desc                                                                                                                  organism_taxid  organism_name library_strategy library_source  library_selection sample_accession sample_title instrument                    total_spots total_size    run_accession run_total_spots run_total_bases
SRP000941       SRX056722                                                                         Reference Epigenome: ChIP-Seq Analysis of H3K27ac in hESC H1 Cells                                                               Reference Epigenome: ChIP-Seq Analysis of H3K27ac in hESC H1 Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC    ChIP            SRS184466                              Illumina HiSeq 2000    26900401     531654480   SRR179707     26900401         807012030
SRP000941       SRX027889                                                                            Reference Epigenome: ChIP-Seq Analysis of H2AK5ac in hESC Cells                                                                  Reference Epigenome: ChIP-Seq Analysis of H2AK5ac in hESC Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC    ChIP            SRS116481                      Illumina Genome Analyzer II    37528590     779578968   SRR067978     37528590        1351029240
SRP000941       SRX027888                                                                                     Reference Epigenome: ChIP-Seq Input from hESC H1 Cells                                                                           Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC  RANDOM            SRS116483                      Illumina Genome Analyzer II    13603127    3232309537   SRR067977     13603127         489712572
SRP000941       SRX027887                                                                                     Reference Epigenome: ChIP-Seq Input from hESC H1 Cells                                                                           Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC  RANDOM            SRS116562                      Illumina Genome Analyzer II    22430523     506327844   SRR067976     22430523         807498828
SRP000941       SRX027886                                                                                     Reference Epigenome: ChIP-Seq Input from hESC H1 Cells                                                                           Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC  RANDOM            SRS116560                      Illumina Genome Analyzer II    15342951     301720436   SRR067975     15342951         552346236
SRP000941       SRX027885                                                                                     Reference Epigenome: ChIP-Seq Input from hESC H1 Cells                                                                           Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC  RANDOM            SRS116482                      Illumina Genome Analyzer II    39725232     851429082   SRR067974     39725232        1430108352
SRP000941       SRX027884                                                                                     Reference Epigenome: ChIP-Seq Input from hESC H1 Cells                                                                           Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC  RANDOM            SRS116481                      Illumina Genome Analyzer II    32633277     544478483   SRR067973     32633277        1174797972
SRP000941       SRX027883                                                                                     Reference Epigenome: ChIP-Seq Input from hESC H1 Cells                                                                           Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC  RANDOM            SRS004118                      Illumina Genome Analyzer II    22150965    3262293717   SRR067972      9357767         336879612
SRP000941       SRX027883                                                                                     Reference Epigenome: ChIP-Seq Input from hESC H1 Cells                                                                           Reference Epigenome: ChIP-Seq Input from hESC H1 Cells  9606            Homo sapiens       ChIP-Seq           GENOMIC  RANDOM            SRS004118                      Illumina Genome Analyzer II    22150965    3262293717   SRR067971     12793198         460555128

Obtaining detailed SRA metadata

$ pysradb metadata SRP075720 --detailed | head

study_accession experiment_accession experiment_title                                  experiment_desc                                   organism_taxid  organism_name library_strategy library_source  library_selection sample_accession sample_title instrument           total_spots total_size run_accession run_total_spots run_total_bases
SRP075720       SRX1800476            GSM2177569: Kcng4_2la_H9; Mus musculus; RNA-Seq   GSM2177569: Kcng4_2la_H9; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467643                    Illumina HiSeq 2500  2547148      97658407  SRR3587912    2547148         127357400
SRP075720       SRX1800475            GSM2177568: Kcng4_2la_H8; Mus musculus; RNA-Seq   GSM2177568: Kcng4_2la_H8; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467642                    Illumina HiSeq 2500  2676053     101904264  SRR3587911    2676053         133802650
SRP075720       SRX1800474            GSM2177567: Kcng4_2la_H7; Mus musculus; RNA-Seq   GSM2177567: Kcng4_2la_H7; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467641                    Illumina HiSeq 2500  1603567      61729014  SRR3587910    1603567          80178350
SRP075720       SRX1800473            GSM2177566: Kcng4_2la_H6; Mus musculus; RNA-Seq   GSM2177566: Kcng4_2la_H6; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467640                    Illumina HiSeq 2500  2498920      94977329  SRR3587909    2498920         124946000
SRP075720       SRX1800472            GSM2177565: Kcng4_2la_H5; Mus musculus; RNA-Seq   GSM2177565: Kcng4_2la_H5; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467639                    Illumina HiSeq 2500  2226670      83473957  SRR3587908    2226670         111333500
SRP075720       SRX1800471            GSM2177564: Kcng4_2la_H4; Mus musculus; RNA-Seq   GSM2177564: Kcng4_2la_H4; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467638                    Illumina HiSeq 2500  2269546      87486278  SRR3587907    2269546         113477300
SRP075720       SRX1800470            GSM2177563: Kcng4_2la_H3; Mus musculus; RNA-Seq   GSM2177563: Kcng4_2la_H3; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467636                    Illumina HiSeq 2500  2333284      88669838  SRR3587906    2333284         116664200
SRP075720       SRX1800469            GSM2177562: Kcng4_2la_H2; Mus musculus; RNA-Seq   GSM2177562: Kcng4_2la_H2; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467637                    Illumina HiSeq 2500  2071159      79689296  SRR3587905    2071159         103557950
SRP075720       SRX1800468            GSM2177561: Kcng4_2la_H1; Mus musculus; RNA-Seq   GSM2177561: Kcng4_2la_H1; Mus musculus; RNA-Seq  10090           Mus musculus  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS1467635                    Illumina HiSeq 2500  2321657      89307894  SRR3587904    2321657         116082850

Converting SRP to GSE

$ pysradb srp-to-gse SRP075720

study_accession study_alias
SRP075720       GSE81903

Converting GSM to SRP

$ pysradb gsm-to-srp GSM2177186

experiment_alias study_accession
GSM2177186       SRP075720

Converting GSM to GSE

$ pysradb gsm-to-gse GSM2177186

experiment_alias study_alias
GSM2177186       GSE81903

Converting GSM to SRX

$ pysradb gsm-to-srx GSM2177186

experiment_alias experiment_accession
GSM2177186       SRX1800089

Converting GSM to SRR

$ pysradb gsm-to-srr GSM2177186

experiment_alias run_accession
GSM2177186       SRR3587529

Downloading supplementary files from GEO

$ pysradb download -g GSE161707

Downloading an entire SRA/ENA project (multithreaded)

pysradb makes it easy to download datasets from SRA in parallel. For example, using 8 threads:

$ pysradb download -y -t 8 --out-dir ./pysradb_downloads -p SRP063852

Downloads are organized by SRP/SRX/SRR mimicking the hierarchy of SRA projects.
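
The layout described above can be sketched with `pathlib`. This is only an illustration of the directory hierarchy, not pysradb's own code: the helper name `run_download_path` is hypothetical, and the `.sra` file extension is an assumption. The accessions are taken from the metadata example earlier in this README.

```python
from pathlib import Path


def run_download_path(out_dir, srp, srx, srr, suffix=".sra"):
    """Build <out_dir>/<SRP>/<SRX>/<SRR><suffix> (hypothetical helper)."""
    return Path(out_dir) / srp / srx / f"{srr}{suffix}"


# Project, experiment, and run accessions from the SRP000941 example above.
path = run_download_path("pysradb_downloads", "SRP000941", "SRX056722", "SRR179707")
print(path.as_posix())
```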

Downloading only certain samples of interest

$ pysradb metadata SRP000941 --detailed | grep 'study\|RNA-Seq' | pysradb download

This will download all RNA-seq samples coming from this project.
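
The `grep` pattern keeps the header line (which contains `study`) plus every row that mentions `RNA-Seq`, so the download command still receives column names. A rough Python equivalent of that filtering step is sketched below, assuming the metadata is available as text lines. The function name `keep_header_and_matches` is hypothetical, and the sample rows are shortened placeholders rather than real pysradb output.

```python
def keep_header_and_matches(lines, needle="RNA-Seq", header_key="study"):
    """Keep lines containing either substring, like grep with an alternation."""
    return [line for line in lines if header_key in line or needle in line]


# Placeholder rows: only the shape matters here, not the accession values.
lines = [
    "study_accession experiment_accession library_strategy",
    "SRP_PLACEHOLDER SRX_PLACEHOLDER_1 ChIP-Seq",
    "SRP_PLACEHOLDER SRX_PLACEHOLDER_2 RNA-Seq",
]

for line in keep_header_and_matches(lines):
    print(line)
```

Matching on the whole line is as blunt as the `grep` version: a row whose title happens to contain `RNA-Seq` would also be kept, so filtering on the `library_strategy` column is the stricter option.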

Ultrafast fastq downloads

With aspera-client installed, [pysradb]{.title-ref} can perform ultra fast downloads:

To download all original fastqs with [aspera-client]{.title-ref} installed utilizing 8 threads:

$ pysradb download -t 8 --use_ascp -p SRP002605

Refer to the notebook for (shallow) time benchmarks.

Publication

pysradb: A Python package to query next-generation sequencing metadata and data from NCBI Sequence Read Archive

Presentation slides from BOSC (ISMB-ECCB) 2019: https://f1000research.com/slides/8-1183

Citation

Choudhary, Saket. "pysradb: A Python Package to Query next-Generation Sequencing Metadata and Data from NCBI Sequence Read Archive." F1000Research, vol. 8, F1000 (Faculty of 1000 Ltd), Apr. 2019, p. 532 (https://f1000research.com/articles/8-532/v1)

@article{Choudhary2019,
doi = {10.12688/f1000research.18676.1},
url = {https://doi.org/10.12688/f1000research.18676.1},
year = {2019},
month = apr,
publisher = {F1000 (Faculty of 1000 Ltd)},
volume = {8},
pages = {532},
author = {Saket Choudhary},
title = {pysradb: A {P}ython package to query next-generation sequencing metadata and data from {NCBI} {S}equence {R}ead {A}rchive},
journal = {F1000Research}
}

Zenodo archive: https://zenodo.org/badge/latestdoi/159590788

Zenodo DOI: 10.5281/zenodo.2306881

Questions?

Open an issue or join our Slack Channel.

pysradb's People

Contributors

andrewdavidsmith, bscrow, daasdaham, dependabot[bot], devangthakkar, dibyaaaaax, jggatter, maarten-vd-sande, mphschmitt, mvdbeek, phlya, saketkc, tfehlmann, zhoulytwinyu

pysradb's Issues

Sample exists on GEO, but not on SRA

Description

I have a strange issue, which might not be within the scope of pysradb.

Sample GSM3832552 (actually the whole series) has been submitted to GEO but is not on the SRA. The sequencing runs however are on the SRA. Is there some magic way I can work with those?

>>> import pysradb
>>> print(pysradb.__version__)
0.11.1-dev0
>>> print(pysradb.SRAweb().sra_metadata("GSM3832552", detailed=True))
No results found for GSM3832552

Is piping possible in pysradb

Can I pass the result of one query as input to another query, so that two queries run as a single command?
or
Can I concatenate or merge two queries into a single query and get the output?

Error during batch downloading SRA files using SRAweb()

  • pysradb version: 0.10.4
  • Python version: 3.8.3
  • Operating System: CentOS Linux

Description

This is a follow-up to #46, where I started downloading a batch of SRA files for metadata fetched into a pandas DataFrame. I used the example from the linked notebook. I am running this script as a job on a Sun Grid Engine based cluster, and the script ended with the following error.

Error

    self.retrieve()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/parallel.py", line 921, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
FileNotFoundError: [Errno 2] No such file or directory: '/projects/NCBI_seqdata/pysradb_downloads/SRP251618/SRX8624823/SRR12100406.sra.part'

Discussion from #46

The download method first downloads to a temporary location which in this case is pysradb_downloads/SRP251618/SRX8624823/SRR12100406.sra.part: notice the .part. Downloads are resumable by default. Once a download finishes, the .part extension is removed to mark it complete.

In this case the error you get seems to likely be arising because the parallel module is getting confused if this particular file has already been downloaded (it thinks it hasn't been, but probably its download is already complete).

You should have SRR12100406.sra. Please feel free to open a new issue otherwise.
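
The `.part` convention described in this discussion can be sketched in a few lines. This is a generic illustration of the pattern (write to a temporary name, resume from the current size, rename on completion), not pysradb's actual download code. The names `download_with_part_file` and `fake_fetch` are hypothetical, and the byte source is a stand-in for a real HTTP range request.

```python
import os
import tempfile
from pathlib import Path


def download_with_part_file(final_path, fetch_from):
    """Write to '<final_path>.part', then rename once the transfer completes.

    fetch_from(offset) is assumed to yield the remaining bytes starting at offset.
    Returns 'skipped' if the final file already exists, else 'downloaded'.
    """
    final_path = Path(final_path)
    if final_path.exists():
        return "skipped"  # a finished download is never fetched again
    final_path.parent.mkdir(parents=True, exist_ok=True)
    part_path = final_path.with_name(final_path.name + ".part")
    # Resume from however many bytes an earlier, interrupted attempt left behind.
    offset = part_path.stat().st_size if part_path.exists() else 0
    with open(part_path, "ab") as handle:
        for chunk in fetch_from(offset):
            handle.write(chunk)
    os.replace(part_path, final_path)  # dropping '.part' marks the file complete
    return "downloaded"


payload = b"ACGT" * 4


def fake_fetch(offset):
    # Stand-in for a server that honours a byte offset.
    yield payload[offset:]


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "SRP251618" / "SRX8624823" / "SRR12100406.sra"
    print(download_with_part_file(target, fake_fetch))
    print(download_with_part_file(target, fake_fetch))
```

In this sketch a `FileNotFoundError` on the `.part` file could only occur if something else removed or renamed it between the write and the rename, which is one way two workers handling the same run could collide.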

As you mentioned

The error you get seems to likely be arising because the parallel module is getting confused if this particular file has already been downloaded

I have checked that SRR12100406.sra wasn't created yet. I am not sure how to use parallel efficiently in this case. I have two questions:

  1. If I run the script again, does it check which files are already downloaded and skip them, or resume from where it stopped?
  2. Do you have any advice on using the example mentioned here on a Sun Grid Engine based job queue system?

Thanks,
Zohaib

organism name is not present in the metadata fetched through flags

Hi @saketkc,
Could you please tell me how to fetch the organism name for a given project ID, and how to get the number of runs and experiments for each project?
I tried "pysradb metadata SRP063732 --desc --expand --detailed", but it fetched all attributes except the organism name.

Thanks in advance

keyerror while checking for a key in kwargs

What I Did

from pysradb.sraweb import SRAweb

sc = SRAweb()
sc.gse_to_gsm("GSE34438")  # no keyword arguments
Traceback (most recent call last):
  File "test_sraweb.py", line 151, in <module>
    sc.gse_to_gsm("GSE34438")
  File "/home/dibya/Documents/new_pysradb/pysradb/pysradb/sraweb.py", line 536, in gse_to_gsm
    if kwargs["detailed"] == True:
KeyError: 'detailed'
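
The traceback shows `kwargs["detailed"]` being indexed directly, which fails whenever the caller omits that keyword. A minimal sketch of the difference between indexing and `dict.get` with a default follows. The function names `lookup_strict` and `lookup_tolerant` are hypothetical stand-ins, and this is not necessarily how the issue was fixed in pysradb.

```python
def lookup_strict(gse, **kwargs):
    # Mirrors the failing pattern: raises KeyError if 'detailed' was not passed.
    return kwargs["detailed"] == True


def lookup_tolerant(gse, **kwargs):
    # .get() supplies a default, so omitting the keyword is safe.
    return bool(kwargs.get("detailed", False))


print(lookup_tolerant("GSE34438"))
print(lookup_tolerant("GSE34438", detailed=True))

try:
    lookup_strict("GSE34438")
except KeyError as error:
    print("KeyError:", error)
```

Declaring `detailed=False` as an explicit keyword parameter would be the more conventional design, since it documents the option in the signature.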

SRAdb and SRAweb don't give the same results

Hi,

First I'd like to thank you for this very useful package. I'd love to use SRAweb; unfortunately, there seems to be something wrong with it compared to SRAdb.

Here are my specs,

  • pysradb version: pysradb==0.9.6
  • Python version: 3
  • Operating System: Ubuntu 16.04 LTS

Description

I'm trying to get the metadata from a SRA project ID (e.g.: SRP125768).

What I Did

With local SQL db,

db = SRAdb('SRAmetadb.sqlite')
df1 = db.sra_metadata('SRP125768', detailed=True, expand_sample_attributes=True, sample_attribute=True)

image

W/o local SQL db,

db = SRAweb()
df2 = db.sra_metadata(srp="SRP125768", detailed=True, expand_sample_attributes=True, sample_attribute=True)

image

I haven't checked all the entries but there is definitely something wrong with df2: duplicated rows / missing rows.

I'd be happy to get your feedback and your fix for this :)

More pipe friendly output

Hi Saket,

It appears that the current output of the pysradb command (for example, the one below from the README) is not very friendly to parsing by tools such as awk or cut. For instance, I'd only like to retain a few columns from the output, but usual attempts such as awk -F "\t" .. or cut -f1-5 fail for columns which contain description text. This is a problem only if I use the direct string output from the command and not through the --saveto option.

pysradb metadata --db ./SRAmetadb.sqlite SRP075720 --detailed --expand

Cheers, Vivek
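
To illustrate the request, here is a small sketch of emitting rows with a single tab between fields, so that `cut -f` and `awk -F "\t"` keep multi-word descriptions in one column. It uses only the standard library. `to_tsv` is a hypothetical helper rather than part of pysradb, and the row reuses values from the SRP075720 example in the README.

```python
import csv
import io


def to_tsv(rows, columns):
    """Serialize dict rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        writer.writerow([row.get(column, "") for column in columns])
    return buffer.getvalue()


columns = ["study_accession", "experiment_accession", "experiment_title"]
rows = [
    {
        "study_accession": "SRP075720",
        "experiment_accession": "SRX1800476",
        "experiment_title": "GSM2177569: Kcng4_2la_H9; Mus musculus; RNA-Seq",
    }
]

print(to_tsv(rows, columns), end="")
```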

Access

  • pysradb version: 0.4
  • Python version: 3.7
  • Operating System: Ubuntu 16.04

Description

Is there a way to grab all of the sample ids for a given SRP id? I see that you can return all of the GSM values (which might work for me ultimately), but it's not formatted in a fantastic manner.

Db is overwritten if run without --db command

  • pysradb version: pysradb, version 0.6.0
  • Python version: 3.6.5
  • Operating System: Linux

Description

Running a command without specifying --db overwrote the already downloaded databases in the same directory.

What I Did

pysradb srp-to-srx SRP067701

Running this twice should not download the database again

Let me know if I missed something.

Thanks for the great package, Saket!

piping support

From: https://doi.org/10.5256/f1000research.20450.r47560

One frustrating limitation is that the piping support is not universal throughout the tool. You can pipe into the download command, but not, for example, into the metadata command. Being able to chain operations such as:

pysradb gse-to-srp GSE24355 | pysradb metadata | pysradb download

..or

pysradb search '"oocyte development"' | head | pysradb metadata

..would be really nice and presumably not too hard to support?
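
One way a subcommand could accept piped input is to fall back to standard input when no accession is given on the command line. The sketch below shows that idea in isolation. The function name, the regular expression, and the assumed shape of the upstream output (a header line followed by whitespace-separated columns) are illustrative assumptions, not pysradb's implementation.

```python
import io
import re

# Assumed accession shape: SRP, ERP, or DRP followed by digits.
PROJECT_PATTERN = re.compile(r"\b[SED]RP\d+\b")


def accessions_from_args_or_stdin(cli_args, stdin):
    """Prefer explicit arguments; otherwise scan piped text for project accessions."""
    if cli_args:
        return list(cli_args)
    found = []
    for line in stdin:
        for accession in PROJECT_PATTERN.findall(line):
            if accession not in found:  # keep first-seen order, drop repeats
                found.append(accession)
    return found


# Simulated upstream output; the column order is an assumption.
piped = io.StringIO("study_alias study_accession\nGSE81903    SRP075720\n")
print(accessions_from_args_or_stdin([], piped))
```

Scanning for accession-shaped tokens, rather than reading a fixed column, keeps the sketch independent of which column the upstream command puts the project ID in.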

pysradb search error with specific search term

  • pysradb version: 0.9.0
  • Python version: Python 3.7.3
  • Operating System: Centos

Description

Searching for the term hypoxia worked fine but searching the term hypoxic produced the following error message:

This is most likely a bug, please report it upstream.
sample_attribute: investigation type: metagenome || project name: Landsort Depth 20090415 transect || sequencing method: 454 || collection date: 2009-04-15 || ammonium: 8.7: µM || chlorophyll: 0: µg/L || dissolved oxygen: -1.33: µmol/kg || nitrate: 0.02: µM || nitrogen: 0: µM || environmental package: water || geographic location (latitude): 58.6: DD || geographic location (longitude): 18.2: DD || geographic location (country and/or sea,region): Baltic Sea || environment (biome): 00002150 || environment (feature): 00002150 || environment (material): 00002150 || depth: 400: m || Phosphate:  || Total phosphorous:  || Silicon:
Traceback (most recent call last):
  File "/envs/pysradb/bin/pysradb", line 11, in <module>
    sys.exit(parse_args())
  File "/envs/pysradb/lib/python3.7/site-packages/pysradb/cli.py", line 944, in parse_args
    args.saveto,
  File "/envs/pysradb/lib/python3.7/site-packages/pysradb/cli.py", line 148, in search
    expand_sample_attributes=expand,
  File "/envs/pysradb/lib/python3.7/site-packages/pysradb/sradb.py", line 1044, in search_sra
    acc_is_searchstr=True,
  File "/envs/pysradb/lib/python3.7/site-packages/pysradb/sradb.py", line 316, in sra_metadata
    metadata_df = expand_sample_attribute_columns(metadata_df)
  File "/envs/pysradb/lib/python3.7/site-packages/pysradb/filter_attrs.py", line 75, in expand_sample_attribute_columns
    sample_attribute_keys, _ = _get_sample_attr_keys(sample_attribute)
  File "/envs/pysradb/lib/python3.7/site-packages/pysradb/filter_attrs.py", line 27, in _get_sample_attr_keys
    sample_attribute_dict = dict(split_by_colon)
ValueError: dictionary update sequence element #19 has length 1; 2 is required

What I Did

pysradb  search "hypoxic" --db SRAmetadb.sqlite   --assay --desc   --detailed --expand  --saveto hypoxic_search.txt
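
The `ValueError` in the traceback appears to come from building a dict out of `key: value` pairs when one entry has a key but no value, such as the trailing `Silicon:` in the printed attribute string. A tolerant parser can be sketched with `str.partition`, which always returns three parts. `parse_sample_attribute` is a hypothetical name, and this is not the code in `pysradb/filter_attrs.py`.

```python
def parse_sample_attribute(sample_attribute):
    """Split 'key: value || key: value' text into a dict, allowing empty values."""
    parsed = {}
    for item in sample_attribute.split("||"):
        # partition splits on the first colon only, so values such as
        # '400: m' stay intact and a missing value becomes ''.
        key, _, value = item.partition(":")
        key = key.strip()
        if key:
            parsed[key] = value.strip()
    return parsed


# Tail of the attribute string from the error report above.
text = "depth: 400: m || Phosphate:  || Total phosphorous:  || Silicon:"
print(parse_sample_attribute(text))
```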

directing output in external hard drive

Hi,
I want to redirect the download to an external hard drive since I have little space left in my hard drive.
I am trying
(pysradb) usr@usr-X705UDR:~$ pysradb download --out-dir ./pysradb_downloads -p SRP165962 --db cd /media/usr/LaCie/db/ , but it is not working.
Kindly help.

Regarding Non-available dataset in the package

  • pysradb version: 0.9.0
  • Python version: 3.7
  • Operating System: Ubuntu Linux

Description

I found that a few datasets are not available in the provided SRAmetadb.sqlite file, for example GSE80994, GSE80993, and GSE80992. How can I get all the datasets from this tool? If a dataset is not available, how can I add it to the package?

What I Did

pysradb gse-to-gsm --db SRAmetadb.sqlite --desc --expand GSE80990
/home/singh/miniconda3/envs/dev/lib/python3.7/site-packages/pysradb/basedb.py:108: RuntimeWarning: Found no matching results for query.
  warnings.warn("Found no matching results for query.", RuntimeWarning)

[ENH] Platform/base space in metadata

Is your feature request related to a problem? Please describe.
I ran into a problem with seq2science: we assume that all samples are Illumina sequenced / base space. Someone tried to run it with ABI SOLiD data, and this resulted in some unexpected behaviour (empty alignment, but the pipeline still runs fully). I want to add a check for whether the sample platform / fastq format is supported, but I am not sure if I can retrieve this data with pysradb:

import pysradb

db_sra = pysradb.SRAweb()
metadata = db_sra.sra_metadata(["SRR1649197"], detailed=True)
print(metadata.columns)
Index(['run_accession', 'study_accession', 'experiment_accession',
       'experiment_title', 'experiment_desc', 'organism_taxid ',
       'organism_name', 'library_strategy', 'library_source',
       'library_selection', 'library_layout', 'sample_accession',
       'sample_title', 'instrument', 'total_spots', 'total_size',
       'run_total_spots', 'run_total_bases', 'run_alias', 'sra_url',
       'experiment_alias', 'source_name', 'cell type', 'ena_fastq_http',
       'ena_fastq_http_1', 'ena_fastq_http_2', 'ena_fastq_ftp',
       'ena_fastq_ftp_1', 'ena_fastq_ftp_2'],
      dtype='object')

See this sample for https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR1649197

The instrument is AB 5500 Genetic Analyzer, which is something I could use, but I expect to run into the same problem again when people use a different machine that is still ABI SOLiD (assuming there are multiple machines).

Describe the solution you'd like
The platform reported, or even better, the base format reported (e.g. base space vs color space).

Pysradb metadata does not work for some project accession numbers

pysradb metadata --saveto test1.tsv --detailed SRP098789 and pysradb metadata ERP113893 works well.
However, pysradb metadata --detailed ERP113893 or pysradb metadata --saveto test1.tsv --detailed ERP113893 produces the following IndexError:

image

I am able to replicate this issue on pysradb-0.10.4 (from PyPI) and pysradb-0.10.5-dev0 (from the GitHub pysradb master branch), on both Windows 10 and Ubuntu 18.04.

attributes Instrument,total size of experiment are not present in the metadata fetched through flags

Hi @saketkc,
Could you please tell me how to fetch the "Instrument" attribute under "Library",
and also the total size of the experiment, given a BioProject ID as input?
I tried "pysradb metadata SRP063732 --desc --expand --detailed". It fetched layout, strategy, name, source, and selection under Library, but not "Instrument". It also fetched only bases and spots, not the total size of the experiment.

Thanks in advance.

Support Pubmed IDs for search and metadata

Is your feature request related to a problem? Please describe.
PMIDs are currently not returned for the metadata frame.

Describe the solution you'd like

  • The output dataframe should have additional column for PMID.
  • Search should also support PMIDs (#22)

[BUG] experiment_alias has _1 appended

Describe the bug
Not sure if it's a bug on the GEO/SRA side, but for GSM sample GSM1020644 the experiment_alias has `_1` appended. I have not seen `_1` anywhere on the website.

import pysradb
print(pysradb.__version__)
print(pysradb.SRAweb().sra_metadata("GSM1020644", detailed=True).experiment_alias.values)
0.11.0
['GSM1020644_1']

download fails for SRP125265

Running the following example from readme (different SRP)

pysradb metadata SRP125265 --assay | grep 'study\|RNAseq' | pysradb download

produces this error

Traceback (most recent call last):
  File "/home/dfeldman/.conda/envs/df-pyr/bin/pysradb", line 10, in <module>
    sys.exit(parse_args())
  File "/home/dfeldman/.conda/envs/df-pyr/lib/python3.7/site-packages/pysradb/cli.py", line 1044, in parse_args
    download(args.out_dir, args.db, args.srx, args.srp, args.skip_confirmation)
  File "/home/dfeldman/.conda/envs/df-pyr/lib/python3.7/site-packages/pysradb/cli.py", line 134, in download
    protocol=protocol,
  File "/home/dfeldman/.conda/envs/df-pyr/lib/python3.7/site-packages/pysradb/sradb.py", line 1275, in download
    + ".sra"
  File "/home/dfeldman/.conda/envs/df-pyr/lib/python3.7/site-packages/pandas/core/generic.py", line 5270, in __getattr__
    return object.__getattribute__(self, name)
  File "/home/dfeldman/.conda/envs/df-pyr/lib/python3.7/site-packages/pandas/core/accessor.py", line 187, in __get__
    accessor_obj = self._accessor(obj)
  File "/home/dfeldman/.conda/envs/df-pyr/lib/python3.7/site-packages/pandas/core/strings.py", line 2039, in __init__
    self._inferred_dtype = self._validate(data)
  File "/home/dfeldman/.conda/envs/df-pyr/lib/python3.7/site-packages/pandas/core/strings.py", line 2096, in _validate
    raise AttributeError("Can only use .str accessor with string values!")
AttributeError: Can only use .str accessor with string values!

Possibly this fails because sample_title is blank and whitespace is not being handled correctly in cli.download. The downloaded metadata loads fine with pd.read_fwf.

Python API doc examples seem to use sra_metadata with detailed=True

  • pysradb version: master
  • Python version: 3.7.4
  • Operating System: macOS 10.13.6

Description

All the outputs and commands in the Python API documentation seem to assume sra_metadata() is called with detailed=True, though it is not specified and the default value is False.

Notably, this breaks example 4 where expand_sample_attribute_columns() assumes presence of the sample_attribute column. Running this snippet yields an ambiguous KeyError rather than the intended output.

from pysradb.filter_attrs import expand_sample_attribute_columns
df = db.sra_metadata('SRP017942')
expand_sample_attribute_columns(df).head()

[BUG] ValueError when using SraSearch to query

Describe the bug
Using SraSearch with verbosity>=2 and a large query raises a ValueError when setting self.df["pmid"] = list(uids), because the size of the underlying dataframe seems to vary.

The following error is raised:

ValueError                                Traceback (most recent call last)
<ipython-input-1-d19c33e66199> in <module>
      1 instance = SraSearch(verbosity=3, return_max=max_query_num, query=query, platform='illumina')
----> 2 instance.search()
      3 df_search = instance.get_df()

/pysradb/search.py in search(self)
    774             self._format_result()
    775             if self.verbosity >= 2:
--> 776                 self.df["pmid"] = list(uids)
    777         except requests.exceptions.Timeout:
    778             sys.exit(f"Connection to the server has timed out. Please retry.")

/pandas/core/frame.py in __setitem__(self, key, value)
   3038         else:
   3039             # set column
-> 3040             self._set_item(key, value)
   3041
   3042     def _setitem_slice(self, key: slice, value):

/pandas/core/frame.py in _set_item(self, key, value)
   3114         """
   3115         self._ensure_valid_index(value)
-> 3116         value = self._sanitize_column(key, value)
   3117         NDFrame._set_item(self, key, value)
   3118

/pandas/core/frame.py in _sanitize_column(self, key, value, broadcast)
   3762
   3763             # turn me into an ndarray
-> 3764             value = sanitize_index(value, self.index)
   3765             if not isinstance(value, (np.ndarray, Index)):
   3766                 if isinstance(value, list) and len(value) > 0:

/pandas/core/internals/construction.py in sanitize_index(data, index)
    745     """
    746     if len(data) != len(index):
--> 747         raise ValueError(
    748             "Length of values "
    749             f"({len(data)}) "

ValueError: Length of values (86768) does not match length of index (86721)

Multiple runs yield slightly different error messages:

ValueError: Length of values (86768) does not match length of index (86760)

It seems like the index length is varying for some reason.

To Reproduce
Execute the following code:

from pysradb.search import SraSearch

max_query_num = 1_000_000
query = 'txid2697049[Organism:noexp] AND ("filetype cram"[Properties] OR "filetype bam"[Properties] OR "filetype fastq"[Properties])'

instance = SraSearch(verbosity=2, return_max=max_query_num, query=query, platform='illumina')
instance.search()
df_search = instance.get_df()

Desktop:

  • OS: Linux
  • Python version: 3.8.5
  • pysradb version: 0.11.2-dev0

Error using python API for batch SRAweb search

  • pysradb version: 0.10.4
  • Python version: 3.8.3
  • Operating System: macOS Catalina 10.15.5, using an Anaconda environment and a pip installation of pysradb

Description

I came across pysradb while trying to extract the metadata for a batch of SRA runs (~9K). I tried two different approaches, but each gave a different error. This is likely because of a missing value on SRAweb, but I am not sure how such an error can be ignored so that processing continues.

1st Method

I tried to convert 9K SRA run accessions to SRA study IDs using srr_to_srp and then search approx. 500 accession ids against SRAweb

from pysradb.sraweb import SRAweb

db = SRAweb()
# file.txt has SRA run accession ids. With each ID in new line.
lineList = [line.rstrip('\n') for line in open("file.txt")]
srp = db.srr_to_srp(lineList)
unique_srp = srp.study_accession.unique()
studies_list = unique_srp.tolist()
Metadata = db.sra_metadata(studies_list, detailed=True,)
Metadata.to_csv('Metadata.tsv', sep='\t', index=False)

Error

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-32-d1fb481fd5e3> in <module>
----> 1 Metadata=db.sra_metadata(studies_list, detailed= True)

~/opt/anaconda3/envs/pysradb/lib/python3.8/site-packages/pysradb/sraweb.py in sra_metadata(self, srp, sample_attribute, detailed, expand_sample_attributes, output_read_lengths, **kwargs)
    457                     # detailed_record[key] = value
    458 
--> 459                 pool_record = record["Pool"]["Member"]
    460                 detailed_record["run_accession"] = run_set["@accession"]
    461                 detailed_record["run_alias"] = run_set["@alias"]

KeyError: 'Pool'
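
One way such a record could be skipped instead of aborting the whole batch is to look up the nested keys defensively. This is a minimal sketch, assuming the record is a plain dict as produced by an XML-to-dict parser; pool_members is a hypothetical helper and not part of pysradb.

```python
def pool_members(record):
    """Return the Pool/Member entries of a parsed record as a list.

    An empty list is returned when the record has no Pool element,
    so the caller can skip the record instead of raising KeyError.
    """
    pool = record.get("Pool") or {}
    members = pool.get("Member") or []
    # A single child element is often parsed as a dict, not a one-item list.
    if isinstance(members, dict):
        members = [members]
    return members
```

A caller could then write `for member in pool_members(record): ...` and records without a Pool would simply contribute no rows.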

2nd Method

In this case I tried to run all 9K SRA run accessions directly against SRAweb

from pysradb.sraweb import SRAweb

db = SRAweb()
# file.txt has SRA run accession ids. With each ID in new line.
lineList = [line.rstrip('\n') for line in open("file.txt")]
Metadata = db.sra_metadata(lineList, detailed=True,)
Metadata.to_csv('Metadata.tsv', sep='\t', index=False)

Error

Traceback (most recent call last):
  File "/Users/Zohaib/PycharmProjects/SRA-Metadata/fetchSRAmetadata.py", line 10, in <module>
    Metadata = db.sra_metadata(lineList, detailed=True)
  File "~/opt/anaconda3/envs/pysradb/lib/python3.8/site-packages/pysradb/sraweb.py", line 425, in sra_metadata
    efetch_result = self.get_efetch_response("sra", srp)
  File "~/opt/anaconda3/envs/pysradb/lib/python3.8/site-packages/pysradb/sraweb.py", line 250, in get_efetch_response
    esearch_response = request.json()
  File "~/opt/anaconda3/envs/pysradb/lib/python3.8/site-packages/requests/models.py", line 898, in json
    return complexjson.loads(self.text, **kwargs)
  File "~/opt/anaconda3/envs/pysradb/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "~/opt/anaconda3/envs/pysradb/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "~/opt/anaconda3/envs/pysradb/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Thanks in advance, looking forward to hearing from you.
Zohaib

Problem with pipe from metadata to download

  • pysradb version: 0.10.4
  • Python version: 3.7.7
  • Operating System: linux

Description

I have a problem when piping the output of metadata into download.
The input from metadata does not seem to be parsed correctly by download:
the download command uses the run_total_spots column as if it were
the run_accession column.

What I Did

pysradb metadata SRP026303 --assay | grep 'study\|AUY077' | pysradb download

This error occurs

Traceback (most recent call last):
  File "/home/jeanmichel/miniconda3/envs/alignment/bin/pysradb", line 8, in <module>
    sys.exit(parse_args())
  File "/home/jeanmichel/miniconda3/envs/alignment/lib/python3.7/site-packages/pysradb/cli.py", line 1051, in parse_args
    download(args.out_dir, args.db, args.srx, args.srp, args.skip_confirmation)
  File "/home/jeanmichel/miniconda3/envs/alignment/lib/python3.7/site-packages/pysradb/cli.py", line 141, in download
    protocol=protocol,
  File "/home/jeanmichel/miniconda3/envs/alignment/lib/python3.7/site-packages/pysradb/sradb.py", line 1299, in download
    + ".sra"
  File "/home/jeanmichel/miniconda3/envs/alignment/lib/python3.7/site-packages/pandas/core/generic.py", line 5270, in __getattr__
    return object.__getattribute__(self, name)
  File "/home/jeanmichel/miniconda3/envs/alignment/lib/python3.7/site-packages/pandas/core/accessor.py", line 187, in __get__
    accessor_obj = self._accessor(obj)
  File "/home/jeanmichel/miniconda3/envs/alignment/lib/python3.7/site-packages/pandas/core/strings.py", line 2039, in __init__
    self._inferred_dtype = self._validate(data)
  File "/home/jeanmichel/miniconda3/envs/alignment/lib/python3.7/site-packages/pandas/core/strings.py", line 2096, in _validate
    raise AttributeError("Can only use .str accessor with string values!")
AttributeError: Can only use .str accessor with string values!

I took a look at the parsing, but it does not seem straightforward to fix.

A workaround for this issue could be something like this:

pysradb metadata SRP026303 --assay --saveto tmp.txt && grep 'study\|AUY077' tmp.txt > filtered.txt && pysradb download --input_file filtered.txt

Re-initiate testing for individual sub-commands of the SRAweb module

The initial versions of pysradb relied on SRAmetadb.sqlite for doing the operations. The test suite currently is catered to that mode and has not been updated to reflect SRAweb.

SRAweb is going to be the only supported mode in the future as we want to do away with the dependence on SRAmetadb. This will require a new suite of tests to be written based on the existing tests (since the sub-commands are exactly the same).

Error not thrown when entering wrong db path

  • pysradb version: 0.10.2
  • Python version: 3.6.9
  • Operating System: Ubuntu

Description

If SRAmetadb.sqlite has not already been downloaded to the local machine using $ pysradb metadb, then db = SRAdb('xyz.sqlite') throws the exception metaInfo not found, which is hard-coded on line 158 in sradb.py. Ideally, it should check whether a file with the given name already exists and, if not, throw the error on line 153 in sradb.py: not a valid SRAmetadb.sqlite file. Instead, it creates a new .sqlite file at the given path ('xyz.sqlite' in this case) and runs SELECT * from metaInfo on it. Therefore, it never enters the except block on line 153.

What I Did

from pysradb import SRAdb
db = SRAdb('this_shouldnt_exist')
df = db.sra_metadata('SRP098789')
df.head()

It returns

Traceback (most recent call last):
  File "sample.py", line 2, in <module>
    db = SRAdb('this_shouldnt_exist')
  File "/usr/local/lib/python3.6/dist-packages/pysradb/sradb.py", line 178, in __init__
    _verify_srametadb(sqlite_file)
  File "/usr/local/lib/python3.6/dist-packages/pysradb/sradb.py", line 155, in _verify_srametadb
    metadata = db.query("SELECT * FROM metaInfo")
  File "/usr/local/lib/python3.6/dist-packages/pysradb/basedb.py", line 103, in query
    results = self.cursor.execute(sql_query).fetchall()
sqlite3.OperationalError: no such table: metaInfo

Running ls in the current directory shows that a file with the given name has been created:

saad@DaasDaham:~$ ls
Desktop    examples.desktop  Public             sample.py  this_shouldnt_exist
Documents  Music             pysradb_downloads  snap       Videos
Downloads  Pictures          R                  Templates
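
The check described above could look roughly like the following. It is a sketch only: open_srametadb is a hypothetical function, not the code in sradb.py, and it assumes that a valid SRAmetadb file is one that contains a metaInfo table. The key point is testing for the file before calling sqlite3.connect, which otherwise creates an empty database at a missing path.

```python
import os
import sqlite3


def open_srametadb(path):
    """Open an existing SQLite file and confirm it has a metaInfo table."""
    if not os.path.isfile(path):
        # sqlite3.connect() would silently create an empty database here.
        raise FileNotFoundError(f"{path} does not exist")
    connection = sqlite3.connect(path)
    found = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name = 'metaInfo'"
    ).fetchone()
    if found is None:
        connection.close()
        raise ValueError(f"{path} is not a valid SRAmetadb.sqlite file")
    return connection
```

With this ordering, a mistyped path fails early with a clear message and leaves no stray file behind.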

[BUG]

Describe the bug
ModuleNotFoundError when importing SRAweb

To Reproduce
conda create -c bioconda -n pysradb PYTHON=3 pysradb
conda activate pysradb
python

from pysradb import SRAweb
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pysradb.sraweb'

Desktop (please complete the following information):

  • OS: macOS Mojave version 10.14.6
  • Python version 3.7.9
  • pysradb 0.9.0

[BUG] JsonDecodeError

Describe the bug
My colleague @Rebecza is trying to download a single-cell ATAC-seq dataset and uses pysradb to get some metadata (seq2science), and managed to find a JsonDecodeError 🐛. It's a list of approx. 750 ENA samples. The strange thing is that the JsonDecodeError appears with the full list, but when it is split up into smaller lists it seems to work...

To Reproduce
I put it on colab, not sure if the link is working
https://colab.research.google.com/drive/1bC2WiA63JJnWYZew0pk6iovk537vQzaU?usp=sharing
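
Since smaller lists reportedly work, one workaround is to query in batches and combine the results afterwards. The sketch below is an assumption-laden illustration: chunked and fetch_in_batches are hypothetical helpers, the batch size of 100 is arbitrary, and fetch stands for whatever call retrieves the metadata for one batch.

```python
def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_batches(accessions, fetch, batch_size=100):
    """Call fetch(batch) for every batch and return the list of results."""
    results = []
    for batch in chunked(accessions, batch_size):
        results.append(fetch(batch))
    return results
```

For example, fetch could be `lambda batch: db.sra_metadata(batch, detailed=True)`, with the returned dataframes joined by pd.concat afterwards.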

01.SRAdb-demo.ipynb's output is different from yours.

  • pysradb version: 0.10.2
  • Python version: 3.6.5
  • Operating System: macOS Mojave 10.14.5

Description

I want to reproduce your .ipynb file, pysradb/notebooks/01.SRAdb-demo.
I ran that file in my local environment,
but the output is different.

I cannot understand this problem.
If you have a solution or an idea, please tell me.

Thank you for your attention.

What I Did

df = db.sra_metadata('SRP017942')
df

KeyError when converting taxid to organism name

  • pysradb version: 0.10.1
  • Python version: 3.7.6
  • Operating System: Debian

Description

I am trying to run a general search with the "--detailed" argument, but I get a KeyError when converting the taxid for certain SRPs. An example that reproduces this is: pysradb metadata SRP184142 --detailed.

The error arises here:

lambda taxid: TAXID_TO_NAME[taxid]

When the taxid does not exist in TAXID_TO_NAME, this line fails. I have fixed it in my local version by using lambda taxid: TAXID_TO_NAME.get(taxid, "NA"), but I do not know whether that is the best solution long-term. I do not personally need any data from those "special" organisms, but they break pysradb search when a broad keyword is used. Ideally, I guess they would all be included in TAXID_TO_NAME?

Thank you for providing this package!

KeyError in srr_to_gsm function in sraweb.py

  • pysradb version:
  • Python version: 3
  • Operating System: Linux

Description

Calling the srr_to_gsm function in sraweb throws a KeyError.

What I Did

Command run:

sc = SRAweb()
sc.srr_to_gsm("SRR057515")

Traceback:

Traceback (most recent call last):
  File "test_sraweb.py", line 113, in <module>
    test_srr_to_gsm(sc)
  File "test_sraweb.py", line 109, in test_srr_to_gsm
    df = sraweb_connection.srr_to_gsm("SRR057513")
  File "/home/dibya/Documents/new_pysradb/pysradb/pysradb/sraweb.py", line 640, in srr_to_gsm
    return _order_first(joined_df, ["run_accession", "experiment_alias"])
  File "/home/dibya/Documents/new_pysradb/pysradb/pysradb/sraweb.py", line 22, in _order_first
    return df[columns].drop_duplicates()
  File "/home/dibya/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2806, in __getitem__
    indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
  File "/home/dibya/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 1552, in _get_listlike_indexer
    keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
  File "/home/dibya/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 1645, in _validate_read_indexer
    raise KeyError(f"{not_found} not in index")
KeyError: "['experiment_alias'] not in index"

pysradb download not working

  • pysradb version: 0.9.6
  • Python version: 3.8.0
  • Operating System: CentOS 7.5.1804

Description

I tried to download an SRP project.

What I Did

pysradb download SRP083135

It returns the help message

When I try:

pysradb metadata SRP083135

I get a table with the study_accession, experiment_accession, sample_accession and run_accession.

Issue with piping from pysradb search to pysradb download

  • pysradb version: pysradb 0.10.5-dev0
  • Python version: Python 3.7
  • Operating System: Windows 10 Pro 1909

Description

When I was trying to pipe the output from pysradb search to pysradb download, a FileNotFoundError was raised.

What I Did

pysradb search -m 100 -q ribosome profiling --db ena -v 1 | pysradb download

[ENH] library_layout

First off, thanks for the awesome tool!

Is your feature request related to a problem? Please describe.
In the examples in the docs and notebooks, the returned dataframe has a library_layout column; however, I do not get this column with the newest version and SRAweb.

import pysradb
from pysradb.sraweb import SRAweb

print(pysradb.__version__)
print(SRAweb().sra_metadata("SRP016501", detailed=True).columns)

> 0.11.0
>Index(['study_accession', 'experiment_accession', 'experiment_title',
       'experiment_desc', 'organism_taxid ', 'organism_name',
       'library_strategy', 'library_source', 'library_selection',
       'sample_accession', 'sample_title', 'instrument', 'total_spots',
       'total_size', 'run_accession', 'run_total_spots', 'run_total_bases',
       'run_alias', 'sra_url_alt1', 'sra_url_alt2', 'sra_url',
       'experiment_alias', 'source_name', 'tissue', 'sra_url_alt3', 'strain',
       'ena_fastq_http_1', 'ena_fastq_http_2', 'ena_fastq_ftp_1',
       'ena_fastq_ftp_2'],
      dtype='object')

I scanned the source very briefly but it seems like this info is never used:
https://github.com/saketkc/pysradb/blob/master/pysradb/sraweb.py#L416

Is there a reason it's not there?

The metadata file downloaded directly from the ncbi website is different from the one downloaded using pysradb.

Describe the bug
The metadata file downloaded directly from the ncbi website is different from the one downloaded using pysradb.

To Reproduce
Steps to reproduce the behavior:
pysradb download:
pysradb metadata SRP181607 --detailed --saveto file.csv

manual download:
When I go to the NCBI website https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE125497 I usually follow the link 'SRA Run Selector' at the bottom of the page, which brings me here: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA516634&o=acc_s%3Aa. I then click on the link 'Metadata', which downloads a file with 24 rows instead of the 12 I get from the pysradb package. It seems that each sample has two runs, of which only one is included in the pysradb metadata.

Desktop (please complete the following information):

  • OS: Ubuntu 18.04
  • Python version 3.8.5

Additional context
It's not really context but thanks for the amazing package!
(this issue makes it less reliable for full automation tho)

Empty results error while searching

Mac OS Catalina 10.15.5
python 2.7
pysradb 0.7.0

SRAmetadb.sqlite downloaded and extracted

Sebastians-Air:pysradb sebastian$ pysradb search '"single-cell rna-seq" retina'
/Users/sebastian/Library/Python/2.7/lib/python/site-packages/pysradb/basedb.py:108: RuntimeWarning: Found no matching results for query.
warnings.warn("Found no matching results for query.", RuntimeWarning)
/Users/sebastian/Library/Python/2.7/lib/python/site-packages/pysradb/sradb.py:243: UserWarning: Empty results
warnings.warn("Empty results", UserWarning)

GSM with multiple SRRs points to a single SRR in pysradb

  • pysradb version: pysradb 0.9.6
  • Python version: Python 3.7.4
  • Operating System: Centos

Description

Hi,
My aim is to query a GSM and get its SRRs for download. (I was not aware that I can also download with your tool, but my problem still persists.)

When I query the GSM ID GSM947526, I get the output shown below. But when I check GEO with this link, I see that there are actually 4 SRRs for this sample.

It's not a big problem for me, but it would be easier if you could fix it. I think the problem is in the database.

Thank you for the tool, it saves a ton of time.

Best regards,

Tunc.

(genomics) [tmorova@linuxsrv006 raw-data]$ pysradb metadata GSM947526
experiment_accession experiment_title                                                  experiment_desc                                                   organism_taxid  organism_name library_strategy library_source library_selection sample_accession sample_title study_accession run_accession run_total_spots run_total_bases
 SRX154497            GSM947526: LNCaP Cells H3K27me3 ChIP-seq; Homo sapiens; ChIP-Seq  GSM947526: LNCaP Cells H3K27me3 ChIP-seq; Homo sapiens; ChIP-Seq  9606            Homo sapiens  ChIP-Seq         GENOMIC        ChIP              SRS345682                     SRP013728       SRR513121     13493834        674691700
 SRX154497            GSM947526: LNCaP Cells H3K27me3 ChIP-seq; Homo sapiens; ChIP-Seq  GSM947526: LNCaP Cells H3K27me3 ChIP-seq; Homo sapiens; ChIP-Seq  9606            Homo sapiens  ChIP-Seq         GENOMIC        ChIP              SRS345682                     SRP013728       SRR513121     13493834        674691700
 SRX154497            GSM947526: LNCaP Cells H3K27me3 ChIP-seq; Homo sapiens; ChIP-Seq  GSM947526: LNCaP Cells H3K27me3 ChIP-seq; Homo sapiens; ChIP-Seq  9606            Homo sapiens  ChIP-Seq         GENOMIC        ChIP              SRS345682                     SRP013728       SRR513121     13493834        674691700
 SRX154497            GSM947526: LNCaP Cells H3K27me3 ChIP-seq; Homo sapiens; ChIP-Seq  GSM947526: LNCaP Cells H3K27me3 ChIP-seq; Homo sapiens; ChIP-Seq  9606            Homo sapiens  ChIP-Seq         GENOMIC        ChIP              SRS345682                     SRP013728       SRR513121     13493834        674691700

Python API

What I Did

from pysradb import SRAdb
db = SRAdb('SRAmetadb.sqlite')
df = db.sra_metadata('SRP098789')
df.head()

The link to the Python API is broken. Could you please fix it or explain how I can use this library from inside my code? I tried the above code and got this output.

Traceback (most recent call last):
  File "c:\Users\Bhavay\Desktop\nvbi.py", line 47, in <module>
    db = SRAdb('SRAmetadb.sqlite')
  File "C:\Python\lib\site-packages\pysradb\sradb.py", line 178, in __init__
    _verify_srametadb(sqlite_file)
  File "C:\Python\lib\site-packages\pysradb\sradb.py", line 155, in _verify_srametadb
    metadata = db.query("SELECT * FROM metaInfo")
  File "C:\Python\lib\site-packages\pysradb\basedb.py", line 103, in query
    results = self.cursor.execute(sql_query).fetchall()
sqlite3.OperationalError: no such table: metaInfo

[ENH] Accept API keys

Is your feature request related to a problem? Please describe.
The current API allows 3 requests per second, but with an api_key this is increased to 10 per second.
https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/

This would mean a roughly threefold speedup for the lookup of samples when you add an API key to sraweb.

Describe the solution you'd like
It looks like adding the api key to these lists of arguments would already work: https://github.com/saketkc/pysradb/blob/master/pysradb/sraweb.py#L74-L89

In that case all that needs to be changed is accepting an optional api_key argument to __init__, and changing the sleeps. I would start a PR, but I am not entirely sure which sleep does what, and thus which one needs to be changed.
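
To make the proposal concrete, here is a minimal sketch of the two pieces involved. Both function names are hypothetical and this is not pysradb's code; the only things taken as given are the api_key request parameter and the limits quoted above (3 requests per second without a key, 10 with one).

```python
def eutils_params(db, term, api_key=None):
    """Build request parameters, adding api_key only when one is supplied."""
    params = {"db": db, "term": term, "retmode": "json"}
    if api_key:
        params["api_key"] = api_key
    return params


def min_request_interval(api_key=None):
    """Seconds to wait between requests: 10 per second with a key, else 3."""
    return 1 / 10 if api_key else 1 / 3
```

Deriving the sleep from one helper would keep every pause consistent, instead of adjusting each sleep call separately.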

SRS to GSM support

  • pysradb version: 0.9.0
  • Python version: 3.6
  • Operating System: Linux

Description

Can you include support for SRS to GSM conversions? I'm currently relying on the SRA_Accessions table for such conversions, and it's not extremely well populated.

[BUG] KeyError: 'EXPERIMENT_PACKAGE_SET'

Describe the bug
version 0.11.1-dev0

import pysradb

db = pysradb.SRAweb()
db.sra_metadata("SRP098789", detailed=True)

Traceback (most recent call last):
  File "script", line 4, in <module>
    db.sra_metadata("SRP098789", detailed=True)
  File "/home/maarten/anaconda3/envs/seq2science/lib/python3.7/site-packages/pysradb/sraweb.py", line 456, in sra_metadata
    efetch_result = self.get_efetch_response("sra", srp)
  File "/home/maarten/anaconda3/envs/seq2science/lib/python3.7/site-packages/pysradb/sraweb.py", line 323, in get_efetch_response
    response = xmltodict.parse(request_text)["EXPERIMENT_PACKAGE_SET"][
KeyError: 'EXPERIMENT_PACKAGE_SET'

[BUG] library name not reported

Describe the bug
The entry library_name is not reported in the sra_metadata output.

To Reproduce

db.sra_metadata('SRX2792593', detailed=True).columns
Index(['run_accession', 'study_accession', 'experiment_accession',
       'experiment_title', 'experiment_desc', 'organism_taxid ',
       'organism_name', 'library_strategy', 'library_source',
       'library_selection', 'library_layout', 'sample_accession',
       'sample_title', 'instrument', 'total_spots', 'total_size',
       'run_total_spots', 'run_total_bases', 'run_alias', 'sra_url',
       'experiment_alias', 'isolate', 'age', 'biomaterial_provider', 'sex',
       'tissue', 'cell_line', 'treatment', 'BioSampleModel', 'ena_fastq_http',
       'ena_fastq_http_1', 'ena_fastq_http_2', 'ena_fastq_ftp',
       'ena_fastq_ftp_1', 'ena_fastq_ftp_2'],
      dtype='object')

While in the online view of this experiment it is available: https://www.ncbi.nlm.nih.gov/sra/SRX2792593

Desktop (please complete the following information):

  • OS: Scientific Linux (I think - doing it on our cluster)
  • Python 3.7

Python traceback when sending output to a pipe

  • pysradb version: 0.9.0
  • Python version: 3.5.1
  • Operating System: Linux (CentOS6)

Description

$ pysradb search '"ribosome profiling"' | head

Generated:

study_accession experiment_accession sample_accession run_accession
 DRP003075       DRX019536            DRS026974        DRR021383
 DRP003075       DRX019537            DRS026982        DRR021384
 DRP003075       DRX019538            DRS026979        DRR021385
 DRP003075       DRX019540            DRS026984        DRR021387
 DRP003075       DRX019541            DRS026978        DRR021388
 DRP003075       DRX019543            DRS026980        DRR021390
 DRP003075       DRX019544            DRS026981        DRR021391
 ERP013565       ERX1264364           ERS1016056       ERR1190989
 ERP013565       ERX1264365           ERS1016057       ERR1190990
Traceback (most recent call last):
  File "/bi/apps/python/3.5.1/bin/pysradb", line 10, in <module>
    sys.exit(parse_args())
  File "/bi/apps/python/3.5.1/lib/python3.5/site-packages/pysradb/cli.py", line 944, in parse_args
    args.saveto,
  File "/bi/apps/python/3.5.1/lib/python3.5/site-packages/pysradb/cli.py", line 150, in search
    _print_save_df(df, saveto)
  File "/bi/apps/python/3.5.1/lib/python3.5/site-packages/pysradb/cli.py", line 38, in _print_save_df
    print(df.to_string(index=False, justify="left", col_space=0))
BrokenPipeError: [Errno 32] Broken pipe

It doesn't happen with all searches:

$ pysradb search '"oocyte development"' | head
study_accession experiment_accession sample_accession run_accession
 SRP011546       SRX129998            SRS300732        SRR445719
 SRP011546       SRX129999            SRS300733        SRR445720
 SRP064741       SRX1617410           SRS1326799       SRR3208744
 SRP064741       SRX1617411           SRS1326798       SRR3208745
 SRP064741       SRX1617412           SRS1326797       SRR3208746
 SRP064741       SRX1617413           SRS1326796       SRR3208747
 SRP064741       SRX1617414           SRS1326795       SRR3208748
 SRP064741       SRX1617415           SRS1326794       SRR3208749
 SRP064741       SRX1617416           SRS1326793       SRR3208750

I suspect the program doesn't cope with having to wait to generate output if the pipe buffer is full, or closed?
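
For context, head exits after reading ten lines, and any later write to the closed pipe raises BrokenPipeError in the writer. A common way to handle this is to catch the exception around the write and stop quietly. The following is a sketch with a hypothetical helper, write_table, not the code in cli.py.

```python
import sys


def write_table(text, stream=None):
    """Write text to stream; return False if the reader closed the pipe."""
    stream = sys.stdout if stream is None else stream
    try:
        stream.write(text + "\n")
        # Flush inside the try block so a closed pipe is detected here.
        stream.flush()
    except BrokenPipeError:
        return False
    return True
```

A command-line entry point could exit as soon as write_table returns False. The Python documentation additionally suggests redirecting stdout to os.devnull before exiting, so that the flush at interpreter shutdown does not raise a second error.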

[Feature request] Provide checksums for downloads

ISMB/ECCB 2019 feedback: Provide checksums for verifying SRA downloads. This should also be checked at the download time to verify the download (currently we rely on partial downloads, but the final download is not verified)
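
A verification step of this kind could be sketched as follows, assuming an MD5 digest for each file is available from the archive. The function names are hypothetical, and the choice of MD5 is an assumption rather than something stated in the feedback.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True when the file's digest matches the expected one."""
    return file_md5(path) == expected_md5.strip().lower()
```

Reading in chunks keeps memory use flat even for multi-gigabyte .sra files.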

Python API fails to retrieve metadata in some cases

  • pysradb version: pysradb 0.10.4
  • Python version: Python 3.8.2
  • Operating System: Ubuntu 18.04.4 LTS

Description

I am trying to extract the metadata using the Python API for a number of BioProjects. It works fine for most BioProject accessions, but in some cases detailed=True results in a ValueError.

What I Did

db.sra_metadata('PRJNA389455', expand_sample_attributes=True, detailed=True)

This results in:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-54-a36d224bed27> in <module>
----> 1 db.sra_metadata('PRJNA389455', expand_sample_attributes=True, detailed=True)

~/PhDData/software/miniconda3/envs/dataviz/lib/python3.8/site-packages/pysradb/sraweb.py in sra_metadata(self, srp, sample_attribute, detailed, expand_sample_attributes, output_read_lengths, **kwargs)
    506         metadata_df = metadata_df.drop_duplicates()
    507         metadata_df = metadata_df.replace(r"^\s*$", np.nan, regex=True)
--> 508         ena_results = self.fetch_ena_fastq(srp)
    509         if ena_results.shape[0]:
    510             metadata_df = metadata_df.merge(

~/PhDData/software/miniconda3/envs/dataviz/lib/python3.8/site-packages/pysradb/sraweb.py in fetch_ena_fastq(self, srp)
    149                 srr = srr.split("_")[0]
    150                 if ";" in url1:
--> 151                     url1_1, url1_2 = url1.split(";")
    152                     url1_2 = "http://{}".format(url1_2)
    153                     url2_1, url2_2 = url2.split(";")

ValueError: too many values to unpack (expected 2)
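
The unpacking on line 151 assumes exactly two ';'-separated entries, so the error suggests that this run lists more than two files. A tolerant alternative is to keep the entries as a list of any length. This is a sketch only; split_ena_urls is a hypothetical helper, and prefixing entries with "http://" mirrors what the traceback shows on line 152.

```python
def split_ena_urls(field, scheme="http"):
    """Split a ';'-separated URL field into a list of full URLs."""
    urls = []
    for part in field.split(";"):
        part = part.strip()
        if not part:
            continue
        # Add the scheme only when the entry does not already carry one.
        urls.append(part if "://" in part else f"{scheme}://{part}")
    return urls
```

The caller can then decide how to map one, two, or more URLs onto columns instead of failing on the unexpected case.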

Question: When is ena_url reported?

Description

import pysradb
from pysradb.sraweb import SRAweb

print(pysradb.__version__)
print(SRAweb().sra_metadata("SRP016501", detailed=True).columns)
print(SRAweb().sra_metadata("GSM3141725", detailed=True).columns)

0.11.0
Index(['study_accession', 'experiment_accession', 'experiment_title',
       'experiment_desc', 'organism_taxid ', 'organism_name',
       'library_strategy', 'library_source', 'library_selection',
       'sample_accession', 'sample_title', 'instrument', 'total_spots',
       'total_size', 'run_accession', 'run_total_spots', 'run_total_bases',
       'run_alias', 'sra_url_alt1', 'sra_url_alt2', 'sra_url',
       'experiment_alias', 'source_name', 'tissue', 'sra_url_alt3', 'strain',
       'ena_fastq_http_1', 'ena_fastq_http_2', 'ena_fastq_ftp_1',
       'ena_fastq_ftp_2'],
      dtype='object')
Index(['study_accession', 'experiment_accession', 'experiment_title',
       'experiment_desc', 'organism_taxid ', 'organism_name',
       'library_strategy', 'library_source', 'library_selection',
       'sample_accession', 'sample_title', 'instrument', 'total_spots',
       'total_size', 'run_accession', 'run_total_spots', 'run_total_bases',
       'run_alias', 'sra_url_alt1', 'sra_url_alt2', 'sra_url',
       'experiment_alias', 'source_name', 'age', 'strain'],
      dtype='object')

When can I expect an ena_url column and when not? I understand that not everything is hosted on ena.
