saketkc / pysradb
Package for fetching metadata and downloading data from SRA/ENA/GEO
Home Page: https://saketkc.github.io/pysradb
License: BSD 3-Clause "New" or "Revised" License
Can I give the result of one query as input to another query, so that two queries run as a single query?
or
Can I concatenate or merge two queries into a single query and get the combined output?
Describe the bug
Not sure if it's a bug on the GEO/SRA side, but for GSM sample GSM1020644
the experiment_alias has `_1` appended to it. I have not seen `_1` anywhere on the website.
import pysradb
from pysradb import SRAweb
print(pysradb.__version__)
print(SRAweb().sra_metadata("GSM1020644", detailed=True).experiment_alias.values)
0.11.0
['GSM1020644_1']
pysradb: 0.10.4
Python: 3.8.3
OS: CentOS Linux
This is a follow-up issue from #46, where I started downloading a batch of sra files for the fetched metadata in a pandas
DataFrame. I used the example mentioned here in ipynb. I am running this script as a job on a Sun GridEngine based cluster, and the script ended with this error:
self.retrieve()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/parallel.py", line 921, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
FileNotFoundError: [Errno 2] No such file or directory: '/projects/NCBI_seqdata/pysradb_downloads/SRP251618/SRX8624823/SRR12100406.sra.part'
Discussion from #46
The download method first downloads to a temporary location, which in this case is
pysradb_downloads/SRP251618/SRX8624823/SRR12100406.sra.part
(notice the .part). Downloads are resumable by default. Once a download finishes, the .part
extension is removed to mark it complete. In this case the error likely arises because the parallel module is getting confused if this particular file has already been downloaded (it thinks it hasn't been, but probably its download is already complete).
You should have
SRR12100406.sra
Please feel free to open a new issue otherwise.
As you mentioned,
The error likely arises because the parallel module is getting confused if this particular file has already been downloaded
I have checked that SRR12100406.sra
hadn't been created yet. I am not sure how to use the parallel download efficiently in this case. I have two questions.
Thanks,
Zohaib
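One defensive workaround for the race described above is to filter out runs whose final `.sra` file already exists before handing the list to the parallel downloader. This is a hypothetical helper, not part of pysradb; it assumes the `<out_dir>/<SRP>/<SRX>/<SRR>.sra` layout mentioned in the discussion:

```python
import os

def needs_download(out_dir, srp, srx, srr):
    """Return True if the final .sra file for this run is absent.

    Hypothetical helper: assumes the <out_dir>/<SRP>/<SRX>/<SRR>.sra
    layout used by pysradb's download method.
    """
    final_path = os.path.join(out_dir, srp, srx, srr + ".sra")
    return not os.path.isfile(final_path)
```

Runs for which this returns False can be skipped, so the joblib workers never touch an already-completed download.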
Mac OS Catalina 10.15.5
python 2.7
pysradb 0.7.0
SRAmetadb.sqlite downloaded and extracted
Sebastians-Air:pysradb sebastian$ pysradb search '"single-cell rna-seq" retina'
/Users/sebastian/Library/Python/2.7/lib/python/site-packages/pysradb/basedb.py:108: RuntimeWarning: Found no matching results for query.
warnings.warn("Found no matching results for query.", RuntimeWarning)
/Users/sebastian/Library/Python/2.7/lib/python/site-packages/pysradb/sradb.py:243: UserWarning: Empty results
warnings.warn("Empty results", UserWarning)
Floated as a GSoC 2020 project: https://www.open-bio.org/events/gsoc/gsoc-project-ideas/#pysradb
$ pysradb search '"ribosome profiling"' | head
Generated:
study_accession experiment_accession sample_accession run_accession
DRP003075 DRX019536 DRS026974 DRR021383
DRP003075 DRX019537 DRS026982 DRR021384
DRP003075 DRX019538 DRS026979 DRR021385
DRP003075 DRX019540 DRS026984 DRR021387
DRP003075 DRX019541 DRS026978 DRR021388
DRP003075 DRX019543 DRS026980 DRR021390
DRP003075 DRX019544 DRS026981 DRR021391
ERP013565 ERX1264364 ERS1016056 ERR1190989
ERP013565 ERX1264365 ERS1016057 ERR1190990
Traceback (most recent call last):
File "/bi/apps/python/3.5.1/bin/pysradb", line 10, in <module>
sys.exit(parse_args())
File "/bi/apps/python/3.5.1/lib/python3.5/site-packages/pysradb/cli.py", line 944, in parse_args
args.saveto,
File "/bi/apps/python/3.5.1/lib/python3.5/site-packages/pysradb/cli.py", line 150, in search
_print_save_df(df, saveto)
File "/bi/apps/python/3.5.1/lib/python3.5/site-packages/pysradb/cli.py", line 38, in _print_save_df
print(df.to_string(index=False, justify="left", col_space=0))
BrokenPipeError: [Errno 32] Broken pipe
It doesn't happen with all searches:
$ pysradb search '"oocyte development"' | head
study_accession experiment_accession sample_accession run_accession
SRP011546 SRX129998 SRS300732 SRR445719
SRP011546 SRX129999 SRS300733 SRR445720
SRP064741 SRX1617410 SRS1326799 SRR3208744
SRP064741 SRX1617411 SRS1326798 SRR3208745
SRP064741 SRX1617412 SRS1326797 SRR3208746
SRP064741 SRX1617413 SRS1326796 SRR3208747
SRP064741 SRX1617414 SRS1326795 SRR3208748
SRP064741 SRX1617415 SRS1326794 SRR3208749
SRP064741 SRX1617416 SRS1326793 SRR3208750
I suspect the program doesn't cope with the pipe buffer being full, or with the pipe being closed early (e.g. by head)?
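A common pattern for CLI tools in this situation is to catch BrokenPipeError around the final print and exit quietly. This is a generic sketch of SIGPIPE handling, not pysradb's actual code:

```python
import os
import sys

def print_safely(text):
    """Print to stdout, exiting quietly if the downstream pipe
    (e.g. `| head`) was closed early. Generic sketch, not the
    actual pysradb implementation."""
    try:
        print(text)
        sys.stdout.flush()
    except BrokenPipeError:
        # Point stdout at devnull so the interpreter's shutdown flush
        # doesn't raise a second time, then exit with the conventional
        # status for a SIGPIPE death.
        devnull = os.open(os.devnull, os.O_WRONLY)
        os.dup2(devnull, sys.stdout.fileno())
        sys.exit(141)  # 128 + SIGPIPE
```

With this, `pysradb search ... | head` would stop cleanly instead of dumping a traceback.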
Describe the bug
version 0.11.1-dev0
import pysradb
db = pysradb.SRAweb()
db.sra_metadata("SRP098789", detailed=True)
Traceback (most recent call last):
File "script", line 4, in <module>
db.sra_metadata("SRP098789", detailed=True)
File "/home/maarten/anaconda3/envs/seq2science/lib/python3.7/site-packages/pysradb/sraweb.py", line 456, in sra_metadata
efetch_result = self.get_efetch_response("sra", srp)
File "/home/maarten/anaconda3/envs/seq2science/lib/python3.7/site-packages/pysradb/sraweb.py", line 323, in get_efetch_response
response = xmltodict.parse(request_text)["EXPERIMENT_PACKAGE_SET"][
KeyError: 'EXPERIMENT_PACKAGE_SET'
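A defensive wrapper (hypothetical, not the actual fix) would surface the server's real response instead of a bare KeyError when the expected root element is missing:

```python
def extract_experiment_set(parsed):
    """Given the dict produced by xmltodict.parse on an efetch response,
    return the EXPERIMENT_PACKAGE_SET entry, or raise a descriptive
    error when efetch returned something else (e.g. an error document).
    Hypothetical helper, not pysradb's code."""
    if "EXPERIMENT_PACKAGE_SET" not in parsed:
        raise ValueError(
            "Unexpected efetch response; top-level keys: {}".format(list(parsed))
        )
    return parsed["EXPERIMENT_PACKAGE_SET"]
```

That way the error message shows what the server actually sent back, which makes intermittent efetch failures like the one above much easier to diagnose.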
Is your feature request related to a problem? Please describe.
The current API allows 3 requests per second, but with an api_key this is increased to 10 per second.
https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
This would mean roughly a threefold speedup for sample lookups when an api_key is added to SRAweb.
Describe the solution you'd like
It looks like adding the api key to these lists of arguments would already work: https://github.com/saketkc/pysradb/blob/master/pysradb/sraweb.py#L74-L89
In that case all that needs to be changed is accepting an optional api_key argument in __init__ and adjusting the sleeps. I would start a PR, but I am not entirely sure which sleep does what, and thus which one needs to be changed.
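A minimal sketch of how the optional key might be threaded into the E-utilities query parameters. The `api_key` parameter name matches NCBI's documented E-utilities interface; the surrounding function is hypothetical and the pysradb internals may differ:

```python
def build_eutils_params(db, term, api_key=None):
    """Build E-utilities query parameters, adding NCBI's documented
    api_key parameter when one is supplied (raising the rate limit
    from 3 to 10 requests per second). Sketch only, not pysradb's
    actual request-building code."""
    params = {"db": db, "term": term, "retmode": "json"}
    if api_key is not None:
        params["api_key"] = api_key
    return params
```

The same dict could then be passed to requests.get against the esearch/efetch endpoints, and the inter-request sleep shortened when a key is present.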
Describe the bug
Looking up the layout of sample GSM1013144 reports a layout of @xmlns.
My guess would be that the xml is parsed incorrectly:
https://www.w3schools.com/tags/att_html_xmlns.asp
The SRA entry is in fact single-ended:
https://www.ncbi.nlm.nih.gov/sra?term=GSM1013144
To Reproduce
import pysradb
db = pysradb.SRAweb()
db.sra_metadata("GSM1013144")
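xmltodict maps XML attributes to '@'-prefixed keys, so a namespace attribute on the layout element becomes '@xmlns' in the parsed dict; if the layout is taken as the first key of that dict, the attribute wins over SINGLE/PAIRED. A sketch of the guessed fix (hypothetical helper, not pysradb's actual code):

```python
def layout_from_xmltodict(layout_dict):
    """Return the layout element name (e.g. 'SINGLE' or 'PAIRED') from
    an xmltodict-style dict, skipping '@'-prefixed attribute keys such
    as '@xmlns'. Hypothetical helper illustrating the suspected bug."""
    for key in layout_dict:
        if not key.startswith("@"):
            return key
    return None
```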
Can you include support for SRS to GSM conversions? I'm currently relying on the SRA_Accessions table for such conversions, and it's not extremely well populated.
I've seen you have a function, but is it always used?
Cool tool
pysradb: 0.10.4
Python: 3.8.3
macOS: 10.15.5, using an anaconda environment and a pip installation of pysradb
Came across pysradb while trying to extract metadata for a batch of SRA runs (~9K). I tried two different approaches; however, both gave different errors, likely because of a missing value on SRAweb, but I am not sure how such an error can be ignored so that processing moves forward.
I tried to convert the 9K SRA run accessions to SRA study IDs using srr_to_srp
and then searched the approx. 500 unique study accession ids against SRAweb
from pysradb.sraweb import SRAweb
db = SRAweb()
# file.txt has SRA run accession ids. With each ID in new line.
lineList = [line.rstrip('\n') for line in open("file.txt")]
srp = db.srr_to_srp(lineList)
unique_srp = srp.study_accession.unique()
studies_list = unique_srp.tolist()
Metadata = db.sra_metadata(studies_list, detailed=True,)
Metadata.to_csv('Metadata.tsv', sep='\t', index=False)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-32-d1fb481fd5e3> in <module>
----> 1 Metadata=db.sra_metadata(studies_list, detailed= True)
~/opt/anaconda3/envs/pysradb/lib/python3.8/site-packages/pysradb/sraweb.py in sra_metadata(self, srp, sample_attribute, detailed, expand_sample_attributes, output_read_lengths, **kwargs)
457 # detailed_record[key] = value
458
--> 459 pool_record = record["Pool"]["Member"]
460 detailed_record["run_accession"] = run_set["@accession"]
461 detailed_record["run_alias"] = run_set["@alias"]
KeyError: 'Pool'
In this case I tried to run all 9K SRA run accessions directly against SRAweb
from pysradb.sraweb import SRAweb
db = SRAweb()
# file.txt has SRA run accession ids. With each ID in new line.
lineList = [line.rstrip('\n') for line in open("file.txt")]
Metadata = db.sra_metadata(lineList, detailed=True,)
Metadata.to_csv('Metadata.tsv', sep='\t', index=False)
Traceback (most recent call last):
File "/Users/Zohaib/PycharmProjects/SRA-Metadata/fetchSRAmetadata.py", line 10, in <module>
Metadata = db.sra_metadata(lineList, detailed=True)
File "~/opt/anaconda3/envs/pysradb/lib/python3.8/site-packages/pysradb/sraweb.py", line 425, in sra_metadata
efetch_result = self.get_efetch_response("sra", srp)
File "~/opt/anaconda3/envs/pysradb/lib/python3.8/site-packages/pysradb/sraweb.py", line 250, in get_efetch_response
esearch_response = request.json()
File "~/opt/anaconda3/envs/pysradb/lib/python3.8/site-packages/requests/models.py", line 898, in json
return complexjson.loads(self.text, **kwargs)
File "~/opt/anaconda3/envs/pysradb/lib/python3.8/json/__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "~/opt/anaconda3/envs/pysradb/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "~/opt/anaconda3/envs/pysradb/lib/python3.8/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
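One workaround is to batch the accessions so that each efetch request stays small, concatenating per-batch results and skipping batches that error out. This is a sketch; the batch size of 300 is a guess, not a documented limit:

```python
def batch_accessions(accessions, size=300):
    """Yield successive batches of accessions small enough for one
    metadata request. The batch size is a guess; tune as needed."""
    for start in range(0, len(accessions), size):
        yield accessions[start:start + size]

# Usage sketch (assumes db = SRAweb() and pandas imported as pd):
#   frames = []
#   for batch in batch_accessions(lineList):
#       try:
#           frames.append(db.sra_metadata(batch, detailed=True))
#       except Exception as err:  # skip batches that trip server-side errors
#           print("batch failed:", err)
#   Metadata = pd.concat(frames, ignore_index=True)
```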
Thanks in advance, looking forward to hearing from you.
Zohaib
Is there a way to grab all of the sample ids for a given SRP id? I see that you can return all of the GSM values (which might work for me ultimately), but it's not formatted in a fantastic manner.
Running the following example from readme (different SRP)
pysradb metadata SRP125265 --assay | grep 'study\|RNAseq' | pysradb download
produces this error
Traceback (most recent call last):
File "/home/dfeldman/.conda/envs/df-pyr/bin/pysradb", line 10, in <module>
sys.exit(parse_args())
File "/home/dfeldman/.conda/envs/df-pyr/lib/python3.7/site-packages/pysradb/cli.py", line 1044, in parse_args
download(args.out_dir, args.db, args.srx, args.srp, args.skip_confirmation)
File "/home/dfeldman/.conda/envs/df-pyr/lib/python3.7/site-packages/pysradb/cli.py", line 134, in download
protocol=protocol,
File "/home/dfeldman/.conda/envs/df-pyr/lib/python3.7/site-packages/pysradb/sradb.py", line 1275, in download
+ ".sra"
File "/home/dfeldman/.conda/envs/df-pyr/lib/python3.7/site-packages/pandas/core/generic.py", line 5270, in __getattr__
return object.__getattribute__(self, name)
File "/home/dfeldman/.conda/envs/df-pyr/lib/python3.7/site-packages/pandas/core/accessor.py", line 187, in __get__
accessor_obj = self._accessor(obj)
File "/home/dfeldman/.conda/envs/df-pyr/lib/python3.7/site-packages/pandas/core/strings.py", line 2039, in __init__
self._inferred_dtype = self._validate(data)
File "/home/dfeldman/.conda/envs/df-pyr/lib/python3.7/site-packages/pandas/core/strings.py", line 2096, in _validate
raise AttributeError("Can only use .str accessor with string values!")
AttributeError: Can only use .str accessor with string values!
Possibly this is failing because sample_title is blank and whitespace is not being handled correctly in cli.download. The downloaded metadata loads fine with pd.read_fwf.
SRA seems to have undergone changes in how it keeps the reads with its new sos model.
Newer URLs are retrievable via srapath.
Example: SRP098789/SRR5227308: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?run=SRR5227308
Describe the bug
My colleague @Rebecza is trying to download a single-cell ATAC-seq dataset and uses pysradb to get some metadata (via seq2science), and managed to find a JSONDecodeError 🐛. It's a list of approx. 750 ENA samples; the strange thing is that the JSONDecodeError appears with the full list, but when it is split up into smaller lists it seems to work...
To Reproduce
I put it on colab, not sure if the link is working
https://colab.research.google.com/drive/1bC2WiA63JJnWYZew0pk6iovk537vQzaU?usp=sharing
pysradb metadata --saveto test1.tsv --detailed SRP098789
and pysradb metadata ERP113893
works well.
However, pysradb metadata --detailed ERP113893
or pysradb metadata --saveto test1.tsv --detailed ERP113893
produces the following IndexError:
I am able to replicate this issue on pysradb-0.10.4 (from PyPI) and pysradb-0.10.5-dev0 (from the GitHub pysradb master), on both Windows 10 and Ubuntu 18.04.
From: https://doi.org/10.5256/f1000research.20450.r47560
One frustrating limitation is that piping support is not universal throughout the tool. You can pipe into the download command, but not, for example, into the metadata command. Being able to chain operations such as:
pysradb gse-to-srp GSE24355 | pysradb metadata | pysradb download
..or
pysradb search '"oocyte development"' | head | pysradb metadata
..would be really nice and presumably not too hard to support?
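Supporting this seems mostly a matter of letting each subcommand fall back to stdin when no accessions are given on the command line. A stdlib-only sketch (hypothetical, not pysradb's CLI code):

```python
import sys

def accessions_from_args_or_stdin(args):
    """Return accessions from the argument list, falling back to the
    first whitespace-separated field of each piped stdin line when no
    arguments were given. Hypothetical sketch of how chaining between
    pysradb subcommands could work."""
    if args:
        return list(args)
    if sys.stdin.isatty():  # no pipe attached either
        return []
    return [line.split()[0] for line in sys.stdin if line.strip()]
```

Taking the first field per line is what makes `pysradb search ... | pysradb metadata` work even when the upstream output has extra columns.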
I found a few datasets that are not available in the provided SRAmetadb.sqlite file. For example: GSE80994, GSE80993, GSE80992, etc. How can I get all the datasets from this tool? If a dataset is not available, how can I add it to the package?
pysradb gse-to-gsm --db SRAmetadb.sqlite --desc --expand GSE80990
/home/singh/miniconda3/envs/dev/lib/python3.7/site-packages/pysradb/basedb.py:108: RuntimeWarning: Found no matching results for query.
warnings.warn("Found no matching results for query.", RuntimeWarning)
I have a strange issue, which might not be within the scope of pysradb.
Sample GSM3832552 (actually the whole series) has been submitted to GEO but is not on the SRA. The sequencing runs however are on the SRA. Is there some magic way I can work with those?
>>> import pysradb
>>> print(pysradb.__version__)
0.11.1-dev0
>>> print(pysradb.SRAweb().sra_metadata("GSM3832552", detailed=True))
No results found for GSM3832552
Running a command without specifying --db overwrote the already downloaded database in the same directory.
pysradb srp-to-srx SRP067701
Running this twice should not download the database again
Let me know if I missed something.
Thanks for the great package, Saket!
Describe the bug
The entry library_name is not reported in the sra_metadata output.
To Reproduce
from pysradb import SRAweb
db = SRAweb()
db.sra_metadata('SRX2792593', detailed=True).columns
Index(['run_accession', 'study_accession', 'experiment_accession',
'experiment_title', 'experiment_desc', 'organism_taxid ',
'organism_name', 'library_strategy', 'library_source',
'library_selection', 'library_layout', 'sample_accession',
'sample_title', 'instrument', 'total_spots', 'total_size',
'run_total_spots', 'run_total_bases', 'run_alias', 'sra_url',
'experiment_alias', 'isolate', 'age', 'biomaterial_provider', 'sex',
'tissue', 'cell_line', 'treatment', 'BioSampleModel', 'ena_fastq_http',
'ena_fastq_http_1', 'ena_fastq_http_2', 'ena_fastq_ftp',
'ena_fastq_ftp_1', 'ena_fastq_ftp_2'],
dtype='object')
While in the online view of this experiment it is available: https://www.ncbi.nlm.nih.gov/sra/SRX2792593
Describe the bug
Using SraSearch with verbosity >= 2 and a large query raises a ValueError when setting self.df["pmid"] = list(uids)
(Line 776 in c23d4a7).
The following error is raised:
ValueError Traceback (most recent call last)
<ipython-input-1-d19c33e66199> in <module>
1 instance = SraSearch(verbosity=3, return_max=max_query_num, query=query, platform='illumina')
----> 2 instance.search()
3 df_search = instance.get_df()
/pysradb/search.py in search(self)
774 self._format_result()
775 if self.verbosity >= 2:
--> 776 self.df["pmid"] = list(uids)
777 except requests.exceptions.Timeout:
778 sys.exit(f"Connection to the server has timed out. Please retry.")
/pandas/core/frame.py in __setitem__(self, key, value)
3038 else:
3039 # set column
-> 3040 self._set_item(key, value)
3041
3042 def _setitem_slice(self, key: slice, value):
/pandas/core/frame.py in _set_item(self, key, value)
3114 """
3115 self._ensure_valid_index(value)
-> 3116 value = self._sanitize_column(key, value)
3117 NDFrame._set_item(self, key, value)
3118
/pandas/core/frame.py in _sanitize_column(self, key, value, broadcast)
3762
3763 # turn me into an ndarray
-> 3764 value = sanitize_index(value, self.index)
3765 if not isinstance(value, (np.ndarray, Index)):
3766 if isinstance(value, list) and len(value) > 0:
/pandas/core/internals/construction.py in sanitize_index(data, index)
745 """
746 if len(data) != len(index):
--> 747 raise ValueError(
748 "Length of values "
749 f"({len(data)}) "
ValueError: Length of values (86768) does not match length of index (86721)
Multiple runs yield slightly different error messages:
ValueError: Length of values (86768) does not match length of index (86760)
It seems like the index length is varying for some reason.
To Reproduce
Execute the following code:
from pysradb.search import SraSearch
max_query_num = 1_000_000
query = 'txid2697049[Organism:noexp] AND ("filetype cram"[Properties] OR "filetype bam"[Properties] OR "filetype fastq"[Properties])'
instance = SraSearch(verbosity=2, return_max=max_query_num, query=query, platform='illumina')
instance.search()
df_search = instance.get_df()
Desktop:
OS: Linux
Python: 3.8.5
pysradb: 0.11.2-dev0
Searching for the term hypoxia worked fine, but searching for the term hypoxic produced the following error message:
This is most likely a bug, please report it upstream.
sample_attribute: investigation type: metagenome || project name: Landsort Depth 20090415 transect || sequencing method: 454 || collection date: 2009-04-15 || ammonium: 8.7: µM || chlorophyll: 0: µg/L || dissolved oxygen: -1.33: µmol/kg || nitrate: 0.02: µM || nitrogen: 0: µM || environmental package: water || geographic location (latitude): 58.6: DD || geographic location (longitude): 18.2: DD || geographic location (country and/or sea,region): Baltic Sea || environment (biome): 00002150 || environment (feature): 00002150 || environment (material): 00002150 || depth: 400: m || Phosphate: || Total phosphorous: || Silicon:
Traceback (most recent call last):
File "/envs/pysradb/bin/pysradb", line 11, in <module>
sys.exit(parse_args())
File "/envs/pysradb/lib/python3.7/site-packages/pysradb/cli.py", line 944, in parse_args
args.saveto,
File "/envs/pysradb/lib/python3.7/site-packages/pysradb/cli.py", line 148, in search
expand_sample_attributes=expand,
File "/envs/pysradb/lib/python3.7/site-packages/pysradb/sradb.py", line 1044, in search_sra
acc_is_searchstr=True,
File "/envs/pysradb/lib/python3.7/site-packages/pysradb/sradb.py", line 316, in sra_metadata
metadata_df = expand_sample_attribute_columns(metadata_df)
File "/envs/pysradb/lib/python3.7/site-packages/pysradb/filter_attrs.py", line 75, in expand_sample_attribute_columns
sample_attribute_keys, _ = _get_sample_attr_keys(sample_attribute)
File "/envs/pysradb/lib/python3.7/site-packages/pysradb/filter_attrs.py", line 27, in _get_sample_attr_keys
sample_attribute_dict = dict(split_by_colon)
ValueError: dictionary update sequence element #19 has length 1; 2 is required
pysradb search "hypoxic" --db SRAmetadb.sqlite --assay --desc --detailed --expand --saveto hypoxic_search.txt
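The sample_attribute string above ends with keys that have no value ("Phosphate:", "Total phosphorous:", "Silicon:"), so a naive dict(...) over colon-splits sees a one-element pair, which is exactly the error reported. A defensive parser would skip such elements; this is a sketch, not pysradb's filter_attrs implementation:

```python
def parse_sample_attributes(sample_attribute):
    """Parse a 'key: value || key: value' sample_attribute string into a
    dict, splitting each element at its first colon and silently dropping
    elements with an empty value (e.g. 'Phosphate: ').
    Defensive sketch, not pysradb's actual implementation."""
    attrs = {}
    for element in sample_attribute.split("||"):
        key, _, value = element.partition(":")
        key, value = key.strip(), value.strip()
        if key and value:
            attrs[key] = value
    return attrs
```

Splitting at the first colon also keeps composite values like "400: m" intact.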
All the outputs and commands in the Python API documentation seem to assume sra_metadata() is called with detailed=True, though it is not specified and the default value is False.
Notably, this breaks example 4, where expand_sample_attribute_columns() assumes the presence of the sample_attribute column. Running this snippet yields an ambiguous KeyError rather than the intended output.
from pysradb import SRAweb
from pysradb.filter_attrs import expand_sample_attribute_columns
db = SRAweb()
df = db.sra_metadata('SRP017942')
expand_sample_attribute_columns(df).head()
Is your feature request related to a problem? Please describe.
I run into a problem with seq2science where we assume that all samples are Illumina sequenced / base space. Someone tried to run it with ABI SOLiD and this resulted in some unexpected behaviour (empty alignment, but the pipeline still runs fully). I want to add a check that the sample platform / fastq format is supported, but I am not sure if I can retrieve this data with pysradb:
import pysradb
db_sra = pysradb.SRAweb()
metadata = db_sra.sra_metadata(["SRR1649197"], detailed=True)
print(metadata.columns)
Index(['run_accession', 'study_accession', 'experiment_accession',
'experiment_title', 'experiment_desc', 'organism_taxid ',
'organism_name', 'library_strategy', 'library_source',
'library_selection', 'library_layout', 'sample_accession',
'sample_title', 'instrument', 'total_spots', 'total_size',
'run_total_spots', 'run_total_bases', 'run_alias', 'sra_url',
'experiment_alias', 'source_name', 'cell type', 'ena_fastq_http',
'ena_fastq_http_1', 'ena_fastq_http_2', 'ena_fastq_ftp',
'ena_fastq_ftp_1', 'ena_fastq_ftp_2'],
dtype='object')
See this sample: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR1649197
The instrument is AB 5500 Genetic Analyzer, which is something I could use, but I expect to run into the same problem again when people use a different machine but still ABI SOLiD (assuming there are multiple machines).
Describe the solution you'd like
The platform reported, or even better, the base format reported (e.g. base space vs color space).
import pysradb
from pysradb import SRAweb
print(pysradb.__version__)
print(SRAweb().sra_metadata("SRP016501", detailed=True).columns)
print(SRAweb().sra_metadata("GSM3141725", detailed=True).columns)
0.11.0
Index(['study_accession', 'experiment_accession', 'experiment_title',
'experiment_desc', 'organism_taxid ', 'organism_name',
'library_strategy', 'library_source', 'library_selection',
'sample_accession', 'sample_title', 'instrument', 'total_spots',
'total_size', 'run_accession', 'run_total_spots', 'run_total_bases',
'run_alias', 'sra_url_alt1', 'sra_url_alt2', 'sra_url',
'experiment_alias', 'source_name', 'tissue', 'sra_url_alt3', 'strain',
'ena_fastq_http_1', 'ena_fastq_http_2', 'ena_fastq_ftp_1',
'ena_fastq_ftp_2'],
dtype='object')
Index(['study_accession', 'experiment_accession', 'experiment_title',
'experiment_desc', 'organism_taxid ', 'organism_name',
'library_strategy', 'library_source', 'library_selection',
'sample_accession', 'sample_title', 'instrument', 'total_spots',
'total_size', 'run_accession', 'run_total_spots', 'run_total_bases',
'run_alias', 'sra_url_alt1', 'sra_url_alt2', 'sra_url',
'experiment_alias', 'source_name', 'age', 'strain'],
dtype='object')
When can I expect an ena_url column and when not? I understand that not everything is hosted on ENA.
Hi,
My aim is to query a GSM and get its SRRs for download. (I was not aware that I can also download with your tool, but my problem still persists.)
When I query the following GSM id GSM947526, I get the following output. But when I check GEO via this link, I see that there are actually 4 SRRs for this sample.
It's not a big problem for me, but it would be easier if you could fix it. I think the problem is in the database part.
Thank you for the tool, it saves a ton of time.
Best regards,
Tunc.
(genomics) [tmorova@linuxsrv006 raw-data]$ pysradb metadata GSM947526
experiment_accession experiment_title experiment_desc organism_taxid organism_name library_strategy library_source library_selection sample_accession sample_title study_accession run_accession run_total_spots run_total_bases
SRX154497 GSM947526: LNCaP Cells H3K27me3 ChIP-seq; Homo sapiens; ChIP-Seq GSM947526: LNCaP Cells H3K27me3 ChIP-seq; Homo sapiens; ChIP-Seq 9606 Homo sapiens ChIP-Seq GENOMIC ChIP SRS345682 SRP013728 SRR513121 13493834 674691700
SRX154497 GSM947526: LNCaP Cells H3K27me3 ChIP-seq; Homo sapiens; ChIP-Seq GSM947526: LNCaP Cells H3K27me3 ChIP-seq; Homo sapiens; ChIP-Seq 9606 Homo sapiens ChIP-Seq GENOMIC ChIP SRS345682 SRP013728 SRR513121 13493834 674691700
SRX154497 GSM947526: LNCaP Cells H3K27me3 ChIP-seq; Homo sapiens; ChIP-Seq GSM947526: LNCaP Cells H3K27me3 ChIP-seq; Homo sapiens; ChIP-Seq 9606 Homo sapiens ChIP-Seq GENOMIC ChIP SRS345682 SRP013728 SRR513121 13493834 674691700
SRX154497 GSM947526: LNCaP Cells H3K27me3 ChIP-seq; Homo sapiens; ChIP-Seq GSM947526: LNCaP Cells H3K27me3 ChIP-seq; Homo sapiens; ChIP-Seq 9606 Homo sapiens ChIP-Seq GENOMIC ChIP SRS345682 SRP013728 SRR513121 13493834 674691700
Hi @saketkc,
could you please provide information on how to fetch the attribute "Instrument" under "Library",
and also the total size of an experiment, given a BioProject id as input?
I tried "pysradb metadata SRP063732 --desc --expand --detailed"; it fetched layout, strategy, name, source, and selection under Library, but not "Instrument", and it also fetched only bases and spots, not the total size of the experiment.
Thanks in advance.
Describe the bug
The metadata file downloaded directly from the NCBI website is different from the one downloaded using pysradb.
To Reproduce
Steps to reproduce the behavior:
pysradb download:
pysradb metadata SRP181607 --detailed --saveto file.csv
manual download:
When I go on the NCBI website https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE125497 I usually follow the link 'SRA Run Selector' at the bottom of the page, which brings me here: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA516634&o=acc_s%3Aa. I then click on the link 'Metadata', which downloads a file with 24 rows, instead of the 12 I get from the pysradb package. It seems that each sample has two runs, of which only one is included in the pysradb metadata.
Additional context
It's not really context but thanks for the amazing package!
(this issue makes it less reliable for full automation, though)
from pysradb import SRAdb
db = SRAdb('SRAmetadb.sqlite')
df = db.sra_metadata('SRP098789')
df.head()
The link to the Python API is broken. Could you please fix it, or describe how I can use this library from inside my code? I tried the above code and got this output:
Traceback (most recent call last):
File "c:\Users\Bhavay\Desktop\nvbi.py", line 47, in <module>
db = SRAdb('SRAmetadb.sqlite')
File "C:\Python\lib\site-packages\pysradb\sradb.py", line 178, in __init__
_verify_srametadb(sqlite_file)
File "C:\Python\lib\site-packages\pysradb\sradb.py", line 155, in _verify_srametadb
metadata = db.query("SELECT * FROM metaInfo")
File "C:\Python\lib\site-packages\pysradb\basedb.py", line 103, in query
results = self.cursor.execute(sql_query).fetchall()
sqlite3.OperationalError: no such table: metaInfo
Hi Saket,
It appears that the current output of the pysradb command (one example from the README is given below) is not very friendly to parsing by tools such as awk or cut. For instance, I'd like to retain only a few columns from the output, but the usual attempts such as awk -F "\t" .. or cut -f1-5 fail for columns which contain description text. This is a problem only if I use the direct string output from the command and not the --saveto option.
pysradb metadata --db ./SRAmetadb.sqlite SRP075720 --detailed --expand
Cheers, Vivek
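Since the --saveto file is cleanly tab-delimited, one workaround is to select columns from that file instead of the pretty-printed output. A stdlib-only sketch (the function and its name are illustrative, not part of pysradb):

```python
import csv

def keep_columns(tsv_in, wanted, tsv_out):
    """Copy only the named columns of a tab-separated metadata file,
    avoiding the field-splitting problems awk/cut hit on free-text
    description columns. Illustrative stdlib-only helper."""
    with open(tsv_in, newline="") as src, open(tsv_out, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.DictWriter(
            dst, fieldnames=wanted, delimiter="\t", extrasaction="ignore"
        )
        writer.writeheader()
        for row in reader:
            writer.writerow(row)
```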
Hi,
I want to redirect the download to an external hard drive since I have little space left in my hard drive.
I am trying
(pysradb) usr@usr-X705UDR:~$ pysradb download --out-dir ./pysradb_downloads -p SRP165962 --db cd /media/usr/LaCie/db/ , but it is not working.
Kindly help.
When I was trying to pipe the output from pysradb search to pysradb download, a FileNotFoundError was raised.
pysradb search -m 100 -q ribosome profiling --db ena -v 1 | pysradb download
If SRAmetadb.sqlite is not already downloaded on the local machine using $ pysradb metadb,
then db = SRAdb('xyz.sqlite')
throws an exception metaInfo not found,
which is hard-coded on line 158 in sradb.py. Ideally, it should check whether a file with the entered name already exists, and throw the error on line 153 in sradb.py, not a valid SRAmetadb.sqlite file.
Instead, it creates a new .sqlite file at the given path ('xyz.sqlite' in this case) and tries SELECT * from metaInfo
on it. Therefore, it never enters the except block on line 153.
from pysradb import SRAdb
db = SRAdb('this_shouldnt_exist')
df = db.sra_metadata('SRP098789')
df.head()
It returns
Traceback (most recent call last):
File "sample.py", line 2, in <module>
db = SRAdb('this_shouldnt_exist')
File "/usr/local/lib/python3.6/dist-packages/pysradb/sradb.py", line 178, in __init__
_verify_srametadb(sqlite_file)
File "/usr/local/lib/python3.6/dist-packages/pysradb/sradb.py", line 155, in _verify_srametadb
metadata = db.query("SELECT * FROM metaInfo")
File "/usr/local/lib/python3.6/dist-packages/pysradb/basedb.py", line 103, in query
results = self.cursor.execute(sql_query).fetchall()
sqlite3.OperationalError: no such table: metaInfo
and doing an ls on the current dir shows that a file with the entered name has been created:
saad@DaasDaham:~$ ls
Desktop examples.desktop Public sample.py this_shouldnt_exist
Documents Music pysradb_downloads snap Videos
Downloads Pictures R Templates
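The pre-check suggested above could look like this (a sketch of the proposed fix; the function name is illustrative and this is not current pysradb code):

```python
import os

def verify_sqlite_path(sqlite_file):
    """Fail fast before sqlite3.connect() silently creates an empty
    database file at a non-existent path. Sketch of the fix proposed
    above, not pysradb's current behaviour."""
    if not os.path.isfile(sqlite_file):
        raise FileNotFoundError(
            "{} does not exist; download it first with `pysradb metadb`".format(
                sqlite_file
            )
        )
```

Called at the top of SRAdb.__init__, this would turn the misleading "no such table: metaInfo" into a clear missing-file error.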
from pysradb import SRAweb
sc = SRAweb()
sc.gse_to_gsm("GSE34438")  # no keyword arguments
Traceback (most recent call last):
File "test_sraweb.py", line 151, in <module>
sc.gse_to_gsm("GSE34438")
File "/home/dibya/Documents/new_pysradb/pysradb/pysradb/sraweb.py", line 536, in gse_to_gsm
if kwargs["detailed"] == True:
KeyError: 'detailed'
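The usual fix for this pattern is kwargs.get with a default, so the keyword becomes genuinely optional. A minimal sketch of the idea, not the actual pysradb patch:

```python
def read_detailed_flag(**kwargs):
    """kwargs["detailed"] raises KeyError when the caller omits the
    keyword; kwargs.get("detailed", False) makes it optional with a
    sensible default. Sketch of the fix, not the actual patch."""
    return kwargs.get("detailed", False)
```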
Hi,
First I'd like to thank you for this very useful package. I'd love to use SRAweb;
unfortunately, there seems to be something wrong with it compared to SRAdb.
Here are my specs,
I'm trying to get the metadata from a SRA project ID (e.g.: SRP125768).
With local SQL db,
db = SRAdb('SRAmetadb.sqlite')
df1 = db.sra_metadata('SRP125768', detailed=True, expand_sample_attributes=True, sample_attribute=True)
W/o local SQL db,
db = SRAweb()
df2 = db.sra_metadata(srp="SRP125768", detailed=True, expand_sample_attributes=True, sample_attribute=True)
I haven't checked all the entries, but there is definitely something wrong with df2: duplicated rows / missing rows.
I'd be happy to get your feedback and your fix for this :)
Calling the srr_to_gsm function in SRAweb throws a KeyError.
Command run:
from pysradb import SRAweb
sc = SRAweb()
sc.srr_to_gsm("SRR057515")
Traceback:
Traceback (most recent call last):
File "test_sraweb.py", line 113, in <module>
test_srr_to_gsm(sc)
File "test_sraweb.py", line 109, in test_srr_to_gsm
df = sraweb_connection.srr_to_gsm("SRR057513")
File "/home/dibya/Documents/new_pysradb/pysradb/pysradb/sraweb.py", line 640, in srr_to_gsm
return _order_first(joined_df, ["run_accession", "experiment_alias"])
File "/home/dibya/Documents/new_pysradb/pysradb/pysradb/sraweb.py", line 22, in _order_first
return df[columns].drop_duplicates()
File "/home/dibya/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2806, in __getitem__
indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
File "/home/dibya/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 1552, in _get_listlike_indexer
keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
File "/home/dibya/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 1645, in _validate_read_indexer
raise KeyError(f"{not_found} not in index")
KeyError: "['experiment_alias'] not in index"
The initial versions of pysradb relied on SRAmetadb.sqlite for doing the operations. The test suite is currently tailored to that mode and has not been updated to reflect SRAweb.
SRAweb is going to be the only supported mode in the future, as we want to do away with the dependence on SRAmetadb. This will require a new suite of tests to be written based on the existing tests (since the sub-commands are exactly the same).
macOS Catalina 10.15.5
python 3.8
pysradb 0.10.5-dev0
https://www.ncbi.nlm.nih.gov/sra/?term=(rheumatoid+arthritis)+AND+%22Homo+sapiens%22%5Borgn%3A__txid9606%5D yields 6514 entries, while
pysradb search -q "rheumatoid arthritis" -m 9000 returns only 287 entries.
What did I do wrong?
I am trying to run a general search with the "--detailed" argument, but I get a KeyError when converting the taxid for certain SRPs. An example to reproduce this is: pysradb metadata SRP184142 --detailed
The error arises here:
Line 312 in 3b87e9d
A possible fix is
lambda taxid: TAXID_TO_NAME.get(taxid, "NA")
but I do not know whether that is the best long-term solution. I do not personally need any data from those "special" organisms, but they break pysradb search when a broad keyword is used. Ideally, I guess they would all be included in TAXID_TO_NAME?
Thank you for providing this package!
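The suggested fallback amounts to a plain dict `.get` with a default; a minimal sketch with a stand-in TAXID_TO_NAME dict (the real mapping in pysradb is much larger):

```python
# Stand-in for pysradb's TAXID_TO_NAME lookup table.
TAXID_TO_NAME = {9606: "Homo sapiens", 10090: "Mus musculus"}

def taxid_to_name(taxid):
    # .get with a default never raises KeyError for unknown taxids.
    return TAXID_TO_NAME.get(taxid, "NA")

print(taxid_to_name(9606))   # Homo sapiens
print(taxid_to_name(12345))  # NA  (unknown taxid, no KeyError)
```

This keeps broad searches from crashing at the cost of labelling unrecognized organisms "NA".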
I am trying to extract the metadata using the Python API for a number of BioProjects. It works fine for most BioProject accessions, but in some cases --detailed=True results in a ValueError.
db.sra_metadata('PRJNA389455', expand_sample_attributes=True, detailed=True)
This results in:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-54-a36d224bed27> in <module>
----> 1 db.sra_metadata('PRJNA389455', expand_sample_attributes=True, detailed=True)
~/PhDData/software/miniconda3/envs/dataviz/lib/python3.8/site-packages/pysradb/sraweb.py in sra_metadata(self, srp, sample_attribute, detailed, expand_sample_attributes, output_read_lengths, **kwargs)
506 metadata_df = metadata_df.drop_duplicates()
507 metadata_df = metadata_df.replace(r"^\s*$", np.nan, regex=True)
--> 508 ena_results = self.fetch_ena_fastq(srp)
509 if ena_results.shape[0]:
510 metadata_df = metadata_df.merge(
~/PhDData/software/miniconda3/envs/dataviz/lib/python3.8/site-packages/pysradb/sraweb.py in fetch_ena_fastq(self, srp)
149 srr = srr.split("_")[0]
150 if ";" in url1:
--> 151 url1_1, url1_2 = url1.split(";")
152 url1_2 = "http://{}".format(url1_2)
153 url2_1, url2_2 = url2.split(";")
ValueError: too many values to unpack (expected 2)
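The unpack fails because the ENA URL field can contain more than one semicolon (i.e. more than two FASTQ files per run), which breaks a strict two-way `a, b = url.split(";")`. A sketch of a tolerant split (the example URL field below is made up):

```python
# Hypothetical ENA fastq_ftp field with three semicolon-separated files
# (e.g. _1, _2 plus an extra unpaired file), which a two-way unpack rejects.
url_field = (
    "ftp.sra.ebi.ac.uk/x_1.fastq.gz;"
    "ftp.sra.ebi.ac.uk/x_2.fastq.gz;"
    "ftp.sra.ebi.ac.uk/x_3.fastq.gz"
)

# Split without a fixed arity, then prefix each entry uniformly.
urls = ["http://{}".format(u) for u in url_field.split(";") if u]
print(len(urls))  # 3 -- a strict `a, b = url_field.split(";")` would raise
```

Handling the list of URLs generically (rather than assuming exactly two) avoids the ValueError for such records.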
First off, thanks for the awesome tool!
Is your feature request related to a problem? Please describe.
In the examples in the docs and notebooks, the returned dataframe has a library_layout column; however, I do not get this column with the newest version + SRAweb.
import pysradb
from pysradb import SRAweb

print(pysradb.__version__)
print(SRAweb().sra_metadata("SRP016501", detailed=True).columns)
> 0.11.0
>Index(['study_accession', 'experiment_accession', 'experiment_title',
'experiment_desc', 'organism_taxid ', 'organism_name',
'library_strategy', 'library_source', 'library_selection',
'sample_accession', 'sample_title', 'instrument', 'total_spots',
'total_size', 'run_accession', 'run_total_spots', 'run_total_bases',
'run_alias', 'sra_url_alt1', 'sra_url_alt2', 'sra_url',
'experiment_alias', 'source_name', 'tissue', 'sra_url_alt3', 'strain',
'ena_fastq_http_1', 'ena_fastq_http_2', 'ena_fastq_ftp_1',
'ena_fastq_ftp_2'],
dtype='object')
I scanned the source very briefly, but it seems this info is never used:
https://github.com/saketkc/pysradb/blob/master/pysradb/sraweb.py#L416
Is there a reason it's not there?
I want to reproduce your notebook, pysradb/notebooks/01.SRAdb-demo.
I ran that file in my local environment, but the output is different.
I cannot understand this problem.
If you have a solution or idea, please tell me.
Thank you for your attention.
df = db.sra_metadata('SRP017942')
df
Hi @saketkc,
could you please provide information on how to fetch the organism name given a project ID, and on how to get the count of runs and experiments for each project?
I tried "pysradb metadata SRP063732 --desc --expand --detailed", but it fetched all the attribute information except the organism name.
Thanks in advance.
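Until such a summary is built in, it can be assembled with pandas from the detailed metadata frame; a sketch with a toy frame (column names taken from the output listed elsewhere in this thread; the accession values are made up):

```python
import pandas as pd

# Toy frame mimicking the detailed metadata columns; real data would come
# from SRAweb().sra_metadata("SRP063732", detailed=True).
df = pd.DataFrame({
    "study_accession": ["SRP063732"] * 3,
    "organism_name": ["Homo sapiens"] * 3,
    "experiment_accession": ["SRX1", "SRX1", "SRX2"],
    "run_accession": ["SRR1", "SRR2", "SRR3"],
})

# One row per project: organism plus distinct experiment and run counts.
summary = df.groupby("study_accession").agg(
    organism=("organism_name", "first"),
    n_experiments=("experiment_accession", "nunique"),
    n_runs=("run_accession", "nunique"),
)
print(summary)
```

Whether `organism_name` appears in the frame depends on the pysradb version and the record itself, which is what this issue reports.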
ISMB/ECCB 2019 feedback: provide checksums for verifying SRA downloads. These should also be checked at download time to verify the download (currently we rely on partial downloads, but the final download itself is not verified).
I tried to download an SRP project.
pysradb download SRP083135
It returns the help message
When I try:
pysradb metadata SRP083135
I get a table with the study_accession, experiment_accession, sample_accession and run_accession.
Describe the bug
ModuleNotFoundError when importing SRAweb
To Reproduce
conda create -c bioconda -n pysradb PYTHON=3 pysradb
conda activate pysradb
python
from pysradb import SRAweb
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pysradb.sraweb'
I have a problem when using the pipe from metadata to download.
It seems that the output of metadata is not parsed correctly by download: the download command uses the run_total_spots column as if it were the run_accession column.
pysradb metadata SRP026303 --assay | grep 'study\|AUY077' | pysradb download
This error occurs
Traceback (most recent call last):
File "/home/jeanmichel/miniconda3/envs/alignment/bin/pysradb", line 8, in <module>
sys.exit(parse_args())
File "/home/jeanmichel/miniconda3/envs/alignment/lib/python3.7/site-packages/pysradb/cli.py", line 1051, in parse_args
download(args.out_dir, args.db, args.srx, args.srp, args.skip_confirmation)
File "/home/jeanmichel/miniconda3/envs/alignment/lib/python3.7/site-packages/pysradb/cli.py", line 141, in download
protocol=protocol,
File "/home/jeanmichel/miniconda3/envs/alignment/lib/python3.7/site-packages/pysradb/sradb.py", line 1299, in download
+ ".sra"
File "/home/jeanmichel/miniconda3/envs/alignment/lib/python3.7/site-packages/pandas/core/generic.py", line 5270, in __getattr__
return object.__getattribute__(self, name)
File "/home/jeanmichel/miniconda3/envs/alignment/lib/python3.7/site-packages/pandas/core/accessor.py", line 187, in __get__
accessor_obj = self._accessor(obj)
File "/home/jeanmichel/miniconda3/envs/alignment/lib/python3.7/site-packages/pandas/core/strings.py", line 2039, in __init__
self._inferred_dtype = self._validate(data)
File "/home/jeanmichel/miniconda3/envs/alignment/lib/python3.7/site-packages/pandas/core/strings.py", line 2096, in _validate
raise AttributeError("Can only use .str accessor with string values!")
AttributeError: Can only use .str accessor with string values!
I took a look at the parsing, but it does not seem straightforward to fix.
A workaround for this issue could be something like this (note that grep must write to a different file than the one it reads):
pysradb metadata SRP026303 --assay --saveto tmp.txt && grep 'study\|AUY077' tmp.txt > tmp2.txt && pysradb download --input_file tmp2.txt
Is your feature request related to a problem? Please describe.
PMIDs are currently not returned for the metadata frame.
Describe the solution you'd like