saketkc / pysradb
Package for fetching metadata and downloading data from SRA/ENA/GEO
Home Page: https://saketkc.github.io/pysradb
License: BSD 3-Clause "New" or "Revised" License
Can I give the result of one query as input to another query, so that two queries run as a single query?
or
Can I concatenate or merge two queries into a single query and get the combined output?
Describe the bug
Not sure if it's a bug on the GEO/SRA side, but for GSM sample GSM1020644
the experiment_alias has `_1` appended to it. I have not seen `_1` anywhere on the website.
import pysradb
from pysradb import SRAweb
print(pysradb.__version__)
print(SRAweb().sra_metadata("GSM1020644", detailed=True).experiment_alias.values)
0.11.0
['GSM1020644_1']
pysradb: 0.10.4
Python: 3.8.3
OS: CentOS Linux
This is a follow-up issue from #46, where I started downloading a batch of sra files for the fetched metadata in a pandas
DataFrame. I used the example mentioned here in ipynb. I am running this script as a job on a Sun GridEngine based cluster, and the script ended with this error:
self.retrieve()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/parallel.py", line 921, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
FileNotFoundError: [Errno 2] No such file or directory: '/projects/NCBI_seqdata/pysradb_downloads/SRP251618/SRX8624823/SRR12100406.sra.part'
Discussion from #46
The download method first downloads to a temporary location, which in this case is
pysradb_downloads/SRP251618/SRX8624823/SRR12100406.sra.part
(notice the .part). Downloads are resumable by default. Once a download finishes, the .part
extension is removed to mark it complete. In this case the error likely arises because the parallel module is getting confused if this particular file has already been downloaded (it thinks it hasn't been, but probably its download is already complete).
You should have
SRR12100406.sra
Please feel free to open a new issue otherwise.
As you mentioned,
The error likely arises because the parallel module is getting confused if this particular file has already been downloaded
I have checked that SRR12100406.sra
hadn't been created yet. I am not sure how to use the parallel download efficiently in this case. I have two questions.
Thanks,
Zohaib
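One defensive workaround for the race described above is to filter out runs whose final `.sra` file already exists before handing the list to the parallel downloader. This is a hypothetical helper, not part of pysradb; it assumes the `<out_dir>/<SRP>/<SRX>/<SRR>.sra` layout mentioned in the discussion:

```python
import os

def needs_download(out_dir, srp, srx, srr):
    """Return True if the final .sra file for this run is absent.

    Hypothetical helper: assumes the <out_dir>/<SRP>/<SRX>/<SRR>.sra
    layout used by pysradb's download method.
    """
    final_path = os.path.join(out_dir, srp, srx, srr + ".sra")
    return not os.path.isfile(final_path)
```

Runs for which this returns False can be skipped, so the joblib workers never touch an already-completed download.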
Mac OS Catalina 10.15.5
python 2.7
pysradb 0.7.0
SRAmetadb.sqlite downloaded and extracted
Sebastians-Air:pysradb sebastian$ pysradb search '"single-cell rna-seq" retina'
/Users/sebastian/Library/Python/2.7/lib/python/site-packages/pysradb/basedb.py:108: RuntimeWarning: Found no matching results for query.
warnings.warn("Found no matching results for query.", RuntimeWarning)
/Users/sebastian/Library/Python/2.7/lib/python/site-packages/pysradb/sradb.py:243: UserWarning: Empty results
warnings.warn("Empty results", UserWarning)
Floated as a GSoC 2020 project: https://www.open-bio.org/events/gsoc/gsoc-project-ideas/#pysradb
$ pysradb search '"ribosome profiling"' | head
Generated:
study_accession experiment_accession sample_accession run_accession
DRP003075 DRX019536 DRS026974 DRR021383
DRP003075 DRX019537 DRS026982 DRR021384
DRP003075 DRX019538 DRS026979 DRR021385
DRP003075 DRX019540 DRS026984 DRR021387
DRP003075 DRX019541 DRS026978 DRR021388
DRP003075 DRX019543 DRS026980 DRR021390
DRP003075 DRX019544 DRS026981 DRR021391
ERP013565 ERX1264364 ERS1016056 ERR1190989
ERP013565 ERX1264365 ERS1016057 ERR1190990
Traceback (most recent call last):
File "/bi/apps/python/3.5.1/bin/pysradb", line 10, in <module>
sys.exit(parse_args())
File "/bi/apps/python/3.5.1/lib/python3.5/site-packages/pysradb/cli.py", line 944, in parse_args
args.saveto,
File "/bi/apps/python/3.5.1/lib/python3.5/site-packages/pysradb/cli.py", line 150, in search
_print_save_df(df, saveto)
File "/bi/apps/python/3.5.1/lib/python3.5/site-packages/pysradb/cli.py", line 38, in _print_save_df
print(df.to_string(index=False, justify="left", col_space=0))
BrokenPipeError: [Errno 32] Broken pipe
It doesn't happen with all searches:
$ pysradb search '"oocyte development"' | head
study_accession experiment_accession sample_accession run_accession
SRP011546 SRX129998 SRS300732 SRR445719
SRP011546 SRX129999 SRS300733 SRR445720
SRP064741 SRX1617410 SRS1326799 SRR3208744
SRP064741 SRX1617411 SRS1326798 SRR3208745
SRP064741 SRX1617412 SRS1326797 SRR3208746
SRP064741 SRX1617413 SRS1326796 SRR3208747
SRP064741 SRX1617414 SRS1326795 SRR3208748
SRP064741 SRX1617415 SRS1326794 SRR3208749
SRP064741 SRX1617416 SRS1326793 SRR3208750
I suspect the program doesn't cope with the pipe buffer being full, or with the pipe being closed early (e.g. by head)?
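A common pattern for CLI tools in this situation is to catch BrokenPipeError around the final print and exit quietly. This is a generic sketch of SIGPIPE handling, not pysradb's actual code:

```python
import os
import sys

def print_safely(text):
    """Print to stdout, exiting quietly if the downstream pipe
    (e.g. `| head`) was closed early. Generic sketch, not the
    actual pysradb implementation."""
    try:
        print(text)
        sys.stdout.flush()
    except BrokenPipeError:
        # Point stdout at devnull so the interpreter's shutdown flush
        # doesn't raise a second time, then exit with the conventional
        # status for a SIGPIPE death.
        devnull = os.open(os.devnull, os.O_WRONLY)
        os.dup2(devnull, sys.stdout.fileno())
        sys.exit(141)  # 128 + SIGPIPE
```

With this, `pysradb search ... | head` would stop cleanly instead of dumping a traceback.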
Describe the bug
version 0.11.1-dev0
import pysradb
db = pysradb.SRAweb()
db.sra_metadata("SRP098789", detailed=True)
Traceback (most recent call last):
File "script", line 4, in <module>
db.sra_metadata("SRP098789", detailed=True)
File "/home/maarten/anaconda3/envs/seq2science/lib/python3.7/site-packages/pysradb/sraweb.py", line 456, in sra_metadata
efetch_result = self.get_efetch_response("sra", srp)
File "/home/maarten/anaconda3/envs/seq2science/lib/python3.7/site-packages/pysradb/sraweb.py", line 323, in get_efetch_response
response = xmltodict.parse(request_text)["EXPERIMENT_PACKAGE_SET"][
KeyError: 'EXPERIMENT_PACKAGE_SET'
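A defensive wrapper (hypothetical, not the actual fix) would surface the server's real response instead of a bare KeyError when the expected root element is missing:

```python
def extract_experiment_set(parsed):
    """Given the dict produced by xmltodict.parse on an efetch response,
    return the EXPERIMENT_PACKAGE_SET entry, or raise a descriptive
    error when efetch returned something else (e.g. an error document).
    Hypothetical helper, not pysradb's code."""
    if "EXPERIMENT_PACKAGE_SET" not in parsed:
        raise ValueError(
            "Unexpected efetch response; top-level keys: {}".format(list(parsed))
        )
    return parsed["EXPERIMENT_PACKAGE_SET"]
```

That way the error message shows what the server actually sent back, which makes intermittent efetch failures like the one above much easier to diagnose.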
Is your feature request related to a problem? Please describe.
The current API allows 3 requests per second, but with an api_key this is increased to 10 per second.
https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
This would mean roughly a threefold speedup for sample lookups when an api_key is added to SRAweb.
Describe the solution you'd like
It looks like adding the api key to these lists of arguments would already work: https://github.com/saketkc/pysradb/blob/master/pysradb/sraweb.py#L74-L89
In that case all that needs to be changed is accepting an optional api_key argument in __init__ and adjusting the sleeps. I would start a PR, but I am not entirely sure which sleep does what, and thus which one needs to be changed.
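A minimal sketch of how the optional key might be threaded into the E-utilities query parameters. The `api_key` parameter name matches NCBI's documented E-utilities interface; the surrounding function is hypothetical and the pysradb internals may differ:

```python
def build_eutils_params(db, term, api_key=None):
    """Build E-utilities query parameters, adding NCBI's documented
    api_key parameter when one is supplied (raising the rate limit
    from 3 to 10 requests per second). Sketch only, not pysradb's
    actual request-building code."""
    params = {"db": db, "term": term, "retmode": "json"}
    if api_key is not None:
        params["api_key"] = api_key
    return params
```

The same dict could then be passed to requests.get against the esearch/efetch endpoints, and the inter-request sleep shortened when a key is present.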
Describe the bug
Looking up the layout of sample GSM1013144 reports a layout of @xmlns.
My guess would be that the xml is parsed incorrectly:
https://www.w3schools.com/tags/att_html_xmlns.asp
The SRA entry is in fact single-ended:
https://www.ncbi.nlm.nih.gov/sra?term=GSM1013144
To Reproduce
import pysradb
db = pysradb.SRAweb()
db.sra_metadata("GSM1013144")
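xmltodict maps XML attributes to '@'-prefixed keys, so a namespace attribute on the layout element becomes '@xmlns' in the parsed dict; if the layout is taken as the first key of that dict, the attribute wins over SINGLE/PAIRED. A sketch of the guessed fix (hypothetical helper, not pysradb's actual code):

```python
def layout_from_xmltodict(layout_dict):
    """Return the layout element name (e.g. 'SINGLE' or 'PAIRED') from
    an xmltodict-style dict, skipping '@'-prefixed attribute keys such
    as '@xmlns'. Hypothetical helper illustrating the suspected bug."""
    for key in layout_dict:
        if not key.startswith("@"):
            return key
    return None
```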
Can you include support for SRS to GSM conversions? I'm currently relying on the SRA_Accessions table for such conversions, and it's not extremely well populated.
I've seen you have a function, but is it always used?
Cool tool
pysradb: 0.10.4
Python: 3.8.3
macOS: 10.15.5, using an anaconda environment and a pip installation of pysradb
Came across pysradb while trying to extract metadata for a batch of SRA runs (~9K). I tried two different approaches; however, both gave different errors, likely because of a missing value on SRAweb, but I am not sure how such an error can be ignored so that processing moves forward.
I tried to convert the 9K SRA run accessions to SRA study IDs using srr_to_srp
and then searched the approx. 500 unique study accession ids against SRAweb
from pysradb.sraweb import SRAweb
db = SRAweb()
# file.txt has SRA run accession ids. With each ID in new line.
lineList = [line.rstrip('\n') for line in open("file.txt")]
srp = db.srr_to_srp(lineList)
unique_srp = srp.study_accession.unique()
studies_list = unique_srp.tolist()
Metadata = db.sra_metadata(studies_list, detailed=True,)
Metadata.to_csv('Metadata.tsv', sep='\t', index=False)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-32-d1fb481fd5e3> in <module>
----> 1 Metadata=db.sra_metadata(studies_list, detailed= True)
~/opt/anaconda3/envs/pysradb/lib/python3.8/site-packages/pysradb/sraweb.py in sra_metadata(self, srp, sample_attribute, detailed, expand_sample_attributes, output_read_lengths, **kwargs)
457 # detailed_record[key] = value
458
--> 459 pool_record = record["Pool"]["Member"]
460 detailed_record["run_accession"] = run_set["@accession"]
461 detailed_record["run_alias"] = run_set["@alias"]
KeyError: 'Pool'
In this case I tried to run all 9K SRA run accessions directly against SRAweb
from pysradb.sraweb import SRAweb
db = SRAweb()
# file.txt has SRA run accession ids. With each ID in new line.
lineList = [line.rstrip('\n') for line in open("file.txt")]
Metadata = db.sra_metadata(lineList, detailed=True,)
Metadata.to_csv('Metadata.tsv', sep='\t', index=False)
Traceback (most recent call last):
File "/Users/Zohaib/PycharmProjects/SRA-Metadata/fetchSRAmetadata.py", line 10, in <module>
Metadata = db.sra_metadata(lineList, detailed=True)
File "~/opt/anaconda3/envs/pysradb/lib/python3.8/site-packages/pysradb/sraweb.py", line 425, in sra_metadata
efetch_result = self.get_efetch_response("sra", srp)
File "~/opt/anaconda3/envs/pysradb/lib/python3.8/site-packages/pysradb/sraweb.py", line 250, in get_efetch_response
esearch_response = request.json()
File "~/opt/anaconda3/envs/pysradb/lib/python3.8/site-packages/requests/models.py", line 898, in json
return complexjson.loads(self.text, **kwargs)
File "~/opt/anaconda3/envs/pysradb/lib/python3.8/json/__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "~/opt/anaconda3/envs/pysradb/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "~/opt/anaconda3/envs/pysradb/lib/python3.8/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
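One workaround is to batch the accessions so that each efetch request stays small, concatenating per-batch results and skipping batches that error out. This is a sketch; the batch size of 300 is a guess, not a documented limit:

```python
def batch_accessions(accessions, size=300):
    """Yield successive batches of accessions small enough for one
    metadata request. The batch size is a guess; tune as needed."""
    for start in range(0, len(accessions), size):
        yield accessions[start:start + size]

# Usage sketch (assumes db = SRAweb() and pandas imported as pd):
#   frames = []
#   for batch in batch_accessions(lineList):
#       try:
#           frames.append(db.sra_metadata(batch, detailed=True))
#       except Exception as err:  # skip batches that trip server-side errors
#           print("batch failed:", err)
#   Metadata = pd.concat(frames, ignore_index=True)
```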
Thanks in advance, looking forward to hearing from you.
Zohaib
Is there a way to grab all of the sample ids for a given SRP id? I see that you can return all of the GSM values (which might work for me ultimately), but it's not formatted in a fantastic manner.
Running the following example from readme (different SRP)
pysradb metadata SRP125265 --assay | grep 'study\|RNAseq' | pysradb download
produces this error
Traceback (most recent call last):
File "/home/dfeldman/.conda/envs/df-pyr/bin/pysradb", line 10, in <module>
sys.exit(parse_args())
File "/home/dfeldman/.conda/envs/df-pyr/lib/python3.7/site-packages/pysradb/cli.py", line 1044, in parse_args
download(args.out_dir, args.db, args.srx, args.srp, args.skip_confirmation)
File "/home/dfeldman/.conda/envs/df-pyr/lib/python3.7/site-packages/pysradb/cli.py", line 134, in download
protocol=protocol,
File "/home/dfeldman/.conda/envs/df-pyr/lib/python3.7/site-packages/pysradb/sradb.py", line 1275, in download
+ ".sra"
File "/home/dfeldman/.conda/envs/df-pyr/lib/python3.7/site-packages/pandas/core/generic.py", line 5270, in __getattr__
return object.__getattribute__(self, name)
File "/home/dfeldman/.conda/envs/df-pyr/lib/python3.7/site-packages/pandas/core/accessor.py", line 187, in __get__
accessor_obj = self._accessor(obj)
File "/home/dfeldman/.conda/envs/df-pyr/lib/python3.7/site-packages/pandas/core/strings.py", line 2039, in __init__
self._inferred_dtype = self._validate(data)
File "/home/dfeldman/.conda/envs/df-pyr/lib/python3.7/site-packages/pandas/core/strings.py", line 2096, in _validate
raise AttributeError("Can only use .str accessor with string values!")
AttributeError: Can only use .str accessor with string values!
Possibly this is failing because sample_title is blank and whitespace is not being handled correctly in cli.download. The downloaded metadata loads fine with pd.read_fwf.
SRA seems to have undergone changes in how it keeps the reads with its new sos model.
Newer URLs are retrievable via srapath.
Example: SRP098789/SRR5227308: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?run=SRR5227308
Describe the bug
My colleague @Rebecza is trying to download a single-cell ATAC-seq dataset and uses pysradb to get some metadata (via seq2science), and managed to find a JSONDecodeError 🐛. It's a list of approx. 750 ENA samples; the strange thing is that the JSONDecodeError appears with the full list, but when it is split up into smaller lists it seems to work...
To Reproduce
I put it on colab, not sure if the link is working
https://colab.research.google.com/drive/1bC2WiA63JJnWYZew0pk6iovk537vQzaU?usp=sharing
pysradb metadata --saveto test1.tsv --detailed SRP098789
and pysradb metadata ERP113893
works well.
However, pysradb metadata --detailed ERP113893
or pysradb metadata --saveto test1.tsv --detailed ERP113893
produces the following IndexError:
I am able to replicate this issue on pysradb-0.10.4 (from PyPI) and pysradb-0.10.5-dev0 (from the GitHub pysradb master), on both Windows 10 and Ubuntu 18.04.
From: https://doi.org/10.5256/f1000research.20450.r47560
One frustrating limitation is that piping support is not universal throughout the tool. You can pipe into the download command, but not, for example, into the metadata command. Being able to chain operations such as:
pysradb gse-to-srp GSE24355 | pysradb metadata | pysradb download
..or
pysradb search '"oocyte development"' | head | pysradb metadata
..would be really nice and presumably not too hard to support?
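Supporting this seems mostly a matter of letting each subcommand fall back to stdin when no accessions are given on the command line. A stdlib-only sketch (hypothetical, not pysradb's CLI code):

```python
import sys

def accessions_from_args_or_stdin(args):
    """Return accessions from the argument list, falling back to the
    first whitespace-separated field of each piped stdin line when no
    arguments were given. Hypothetical sketch of how chaining between
    pysradb subcommands could work."""
    if args:
        return list(args)
    if sys.stdin.isatty():  # no pipe attached either
        return []
    return [line.split()[0] for line in sys.stdin if line.strip()]
```

Taking the first field per line is what makes `pysradb search ... | pysradb metadata` work even when the upstream output has extra columns.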
I found a few datasets that are not available in the provided SRAmetadb.sqlite file. For example: GSE80994, GSE80993, GSE80992, etc. How can I get all the datasets from this tool? If a dataset is not available, how can I add it to the package?
pysradb gse-to-gsm --db SRAmetadb.sqlite --desc --expand GSE80990
/home/singh/miniconda3/envs/dev/lib/python3.7/site-packages/pysradb/basedb.py:108: RuntimeWarning: Found no matching results for query.
warnings.warn("Found no matching results for query.", RuntimeWarning)
I have a strange issue, which might not be within the scope of pysradb.
Sample GSM3832552 (actually the whole series) has been submitted to GEO but is not on the SRA. The sequencing runs however are on the SRA. Is there some magic way I can work with those?
>>> import pysradb
>>> print(pysradb.__version__)
0.11.1-dev0
>>> print(pysradb.SRAweb().sra_metadata("GSM3832552", detailed=True))
No results found for GSM3832552
Running a command without specifying --db overwrote the already downloaded database in the same directory.
pysradb srp-to-srx SRP067701
Running this twice should not download the database again
Let me know if I missed something.
Thanks for the great package, Saket!
Describe the bug
The entry library_name is not reported in the sra_metadata output.
To Reproduce
from pysradb import SRAweb
db = SRAweb()
db.sra_metadata('SRX2792593', detailed=True).columns
Index(['run_accession', 'study_accession', 'experiment_accession',
'experiment_title', 'experiment_desc', 'organism_taxid ',
'organism_name', 'library_strategy', 'library_source',
'library_selection', 'library_layout', 'sample_accession',
'sample_title', 'instrument', 'total_spots', 'total_size',
'run_total_spots', 'run_total_bases', 'run_alias', 'sra_url',
'experiment_alias', 'isolate', 'age', 'biomaterial_provider', 'sex',
'tissue', 'cell_line', 'treatment', 'BioSampleModel', 'ena_fastq_http',
'ena_fastq_http_1', 'ena_fastq_http_2', 'ena_fastq_ftp',
'ena_fastq_ftp_1', 'ena_fastq_ftp_2'],
dtype='object')
While in the online view of this experiment it is available: https://www.ncbi.nlm.nih.gov/sra/SRX2792593
Describe the bug
Using SraSearch with verbosity >= 2 and a large query raises a ValueError when setting self.df["pmid"] = list(uids)
(Line 776 in c23d4a7).
The following error is raised:
ValueError Traceback (most recent call last)
<ipython-input-1-d19c33e66199> in <module>
1 instance = SraSearch(verbosity=3, return_max=max_query_num, query=query, platform='illumina')
----> 2 instance.search()
3 df_search = instance.get_df()
/pysradb/search.py in search(self)
774 self._format_result()
775 if self.verbosity >= 2:
--> 776 self.df["pmid"] = list(uids)
777 except requests.exceptions.Timeout:
778 sys.exit(f"Connection to the server has timed out. Please retry.")
/pandas/core/frame.py in __setitem__(self, key, value)
3038 else:
3039 # set column
-> 3040 self._set_item(key, value)
3041
3042 def _setitem_slice(self, key: slice, value):
/pandas/core/frame.py in _set_item(self, key, value)
3114 """
3115 self._ensure_valid_index(value)
-> 3116 value = self._sanitize_column(key, value)
3117 NDFrame._set_item(self, key, value)
3118
/pandas/core/frame.py in _sanitize_column(self, key, value, broadcast)
3762
3763 # turn me into an ndarray
-> 3764 value = sanitize_index(value, self.index)
3765 if not isinstance(value, (np.ndarray, Index)):
3766 if isinstance(value, list) and len(value) > 0:
/pandas/core/internals/construction.py in sanitize_index(data, index)
745 """
746 if len(data) != len(index):
--> 747 raise ValueError(
748 "Length of values "
749 f"({len(data)}) "
ValueError: Length of values (86768) does not match length of index (86721)
Multiple runs yield slightly different error messages:
ValueError: Length of values (86768) does not match length of index (86760)
It seems like the index length is varying for some reason.
To Reproduce
Execute the following code:
from pysradb.search import SraSearch
max_query_num = 1_000_000
query = 'txid2697049[Organism:noexp] AND ("filetype cram"[Properties] OR "filetype bam"[Properties] OR "filetype fastq"[Properties])'
instance = SraSearch(verbosity=2, return_max=max_query_num, query=query, platform='illumina')
instance.search()
df_search = instance.get_df()
Desktop:
OS: Linux
Python: 3.8.5
pysradb: 0.11.2-dev0
Searching for the term hypoxia worked fine, but searching for the term hypoxic produced the following error message:
This is most likely a bug, please report it upstream.
sample_attribute: investigation type: metagenome || project name: Landsort Depth 20090415 transect || sequencing method: 454 || collection date: 2009-04-15 || ammonium: 8.7: µM || chlorophyll: 0: µg/L || dissolved oxygen: -1.33: µmol/kg || nitrate: 0.02: µM || nitrogen: 0: µM || environmental package: water || geographic location (latitude): 58.6: DD || geographic location (longitude): 18.2: DD || geographic location (country and/or sea,region): Baltic Sea || environment (biome): 00002150 || environment (feature): 00002150 || environment (material): 00002150 || depth: 400: m || Phosphate: || Total phosphorous: || Silicon:
Traceback (most recent call last):
File "/envs/pysradb/bin/pysradb", line 11, in <module>
sys.exit(parse_args())
File "/envs/pysradb/lib/python3.7/site-packages/pysradb/cli.py", line 944, in parse_args
args.saveto,
File "/envs/pysradb/lib/python3.7/site-packages/pysradb/cli.py", line 148, in search
expand_sample_attributes=expand,
File "/envs/pysradb/lib/python3.7/site-packages/pysradb/sradb.py", line 1044, in search_sra
acc_is_searchstr=True,
File "/envs/pysradb/lib/python3.7/site-packages/pysradb/sradb.py", line 316, in sra_metadata
metadata_df = expand_sample_attribute_columns(metadata_df)
File "/envs/pysradb/lib/python3.7/site-packages/pysradb/filter_attrs.py", line 75, in expand_sample_attribute_columns
sample_attribute_keys, _ = _get_sample_attr_keys(sample_attribute)
File "/envs/pysradb/lib/python3.7/site-packages/pysradb/filter_attrs.py", line 27, in _get_sample_attr_keys
sample_attribute_dict = dict(split_by_colon)
ValueError: dictionary update sequence element #19 has length 1; 2 is required
pysradb search "hypoxic" --db SRAmetadb.sqlite --assay --desc --detailed --expand --saveto hypoxic_search.txt
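The sample_attribute string above ends with keys that have no value ("Phosphate:", "Total phosphorous:", "Silicon:"), so a naive dict(...) over colon-splits sees a one-element pair, which is exactly the error reported. A defensive parser would skip such elements; this is a sketch, not pysradb's filter_attrs implementation:

```python
def parse_sample_attributes(sample_attribute):
    """Parse a 'key: value || key: value' sample_attribute string into a
    dict, splitting each element at its first colon and silently dropping
    elements with an empty value (e.g. 'Phosphate: ').
    Defensive sketch, not pysradb's actual implementation."""
    attrs = {}
    for element in sample_attribute.split("||"):
        key, _, value = element.partition(":")
        key, value = key.strip(), value.strip()
        if key and value:
            attrs[key] = value
    return attrs
```

Splitting at the first colon also keeps composite values like "400: m" intact.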
All the outputs and commands in the Python API documentation seem to assume sra_metadata() is called with detailed=True, though it is not specified and the default value is False.
Notably, this breaks example 4, where expand_sample_attribute_columns() assumes the presence of the sample_attribute column. Running this snippet yields an ambiguous KeyError rather than the intended output.
from pysradb import SRAweb
from pysradb.filter_attrs import expand_sample_attribute_columns
db = SRAweb()
df = db.sra_metadata('SRP017942')
expand_sample_attribute_columns(df).head()
Is your feature request related to a problem? Please describe.
I run into a problem with seq2science where we assume that all samples are Illumina sequenced / base space. Someone tried to run it with ABI SOLiD and this resulted in some unexpected behaviour (empty alignment, but the pipeline still runs fully). I want to add a check that the sample platform / fastq format is supported, but I am not sure if I can retrieve this data with pysradb:
import pysradb
db_sra = pysradb.SRAweb()
metadata = db_sra.sra_metadata(["SRR1649197"], detailed=True)
print(metadata.columns)
Index(['run_accession', 'study_accession', 'experiment_accession',
'experiment_title', 'experiment_desc', 'organism_taxid ',
'organism_name', 'library_strategy', 'library_source',
'library_selection', 'library_layout', 'sample_accession',
'sample_title', 'instrument', 'total_spots', 'total_size',
'run_total_spots', 'run_total_bases', 'run_alias', 'sra_url',
'experiment_alias', 'source_name', 'cell type', 'ena_fastq_http',
'ena_fastq_http_1', 'ena_fastq_http_2', 'ena_fastq_ftp',
'ena_fastq_ftp_1', 'ena_fastq_ftp_2'],
dtype='object')
See this sample: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR1649197
The instrument is AB 5500 Genetic Analyzer, which is something I could use, but I expect to run into the same problem again when people use a different machine but still ABI SOLiD (assuming there are multiple machines).
Describe the solution you'd like
The platform reported, or even better, the base format reported (e.g. base space vs color space).
import pysradb
from pysradb import SRAweb
print(pysradb.__version__)
print(SRAweb().sra_metadata("SRP016501", detailed=True).columns)
print(SRAweb().sra_metadata("GSM3141725", detailed=True).columns)
0.11.0
Index(['study_accession', 'experiment_accession', 'experiment_title',
'experiment_desc', 'organism_taxid ', 'organism_name',
'library_strategy', 'library_source', 'library_selection',
'sample_accession', 'sample_title', 'instrument', 'total_spots',
'total_size', 'run_accession', 'run_total_spots', 'run_total_bases',
'run_alias', 'sra_url_alt1', 'sra_url_alt2', 'sra_url',
'experiment_alias', 'source_name', 'tissue', 'sra_url_alt3', 'strain',
'ena_fastq_http_1', 'ena_fastq_http_2', 'ena_fastq_ftp_1',
'ena_fastq_ftp_2'],
dtype='object')
Index(['study_accession', 'experiment_accession', 'experiment_title',
'experiment_desc', 'organism_taxid ', 'organism_name',
'library_strategy', 'library_source', 'library_selection',
'sample_accession', 'sample_title', 'instrument', 'total_spots',
'total_size', 'run_accession', 'run_total_spots', 'run_total_bases',
'run_alias', 'sra_url_alt1', 'sra_url_alt2', 'sra_url',
'experiment_alias', 'source_name', 'age', 'strain'],
dtype='object')
When can I expect an ena_url column and when not? I understand that not everything is hosted on ENA.
Hi,
My aim is to query a GSM and get its SRRs for download. (I was not aware that I can also download with your tool, but my problem still persists.)
When I query the following GSM id GSM947526, I get the following output. But when I check GEO via this link, I see that there are actually 4 SRRs for this sample.
It's not a big problem for me, but it would be easier if you could fix it. I think the problem is in the database part.
Thank you for the tool, it saves a ton of time.
Best regards,
Tunc.
(genomics) [tmorova@linuxsrv006 raw-data]$ pysradb metadata GSM947526
experiment_accession experiment_title experiment_desc organism_taxid organism_name library_strategy library_source library_selection sample_accession sample_title study_accession run_accession run_total_spots run_total_bases
SRX154497 GSM947526: LNCaP Cells H3K27me3 ChIP-seq; Homo sapiens; ChIP-Seq GSM947526: LNCaP Cells H3K27me3 ChIP-seq; Homo sapiens; ChIP-Seq 9606 Homo sapiens ChIP-Seq GENOMIC ChIP SRS345682 SRP013728 SRR513121 13493834 674691700
SRX154497 GSM947526: LNCaP Cells H3K27me3 ChIP-seq; Homo sapiens; ChIP-Seq GSM947526: LNCaP Cells H3K27me3 ChIP-seq; Homo sapiens; ChIP-Seq 9606 Homo sapiens ChIP-Seq GENOMIC ChIP SRS345682 SRP013728 SRR513121 13493834 674691700
SRX154497 GSM947526: LNCaP Cells H3K27me3 ChIP-seq; Homo sapiens; ChIP-Seq GSM947526: LNCaP Cells H3K27me3 ChIP-seq; Homo sapiens; ChIP-Seq 9606 Homo sapiens ChIP-Seq GENOMIC ChIP SRS345682 SRP013728 SRR513121 13493834 674691700
SRX154497 GSM947526: LNCaP Cells H3K27me3 ChIP-seq; Homo sapiens; ChIP-Seq GSM947526: LNCaP Cells H3K27me3 ChIP-seq; Homo sapiens; ChIP-Seq 9606 Homo sapiens ChIP-Seq GENOMIC ChIP SRS345682 SRP013728 SRR513121 13493834 674691700
Hi @saketkc,
could you please provide information on how to fetch the attribute "Instrument" under "Library",
and also the total size of an experiment, given a BioProject id as input?
I tried "pysradb metadata SRP063732 --desc --expand --detailed"; it fetched layout, strategy, name, source, and selection under Library, but not "Instrument", and it also fetched only bases and spots, not the total size of the experiment.
Thanks in advance.
Describe the bug
The metadata file downloaded directly from the NCBI website is different from the one downloaded using pysradb.
To Reproduce
Steps to reproduce the behavior:
pysradb download:
pysradb metadata SRP181607 --detailed --saveto file.csv
manual download:
When I go on the NCBI website https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE125497 I usually follow the link 'SRA Run Selector' at the bottom of the page, which brings me here: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA516634&o=acc_s%3Aa. I then click on the link 'Metadata', which downloads a file with 24 rows, instead of the 12 I get from the pysradb package. It seems that each sample has two runs, of which only one is included in the pysradb metadata.
Additional context
It's not really context but thanks for the amazing package!
(this issue makes it less reliable for full automation, though)
from pysradb import SRAdb
db = SRAdb('SRAmetadb.sqlite')
df = db.sra_metadata('SRP098789')
df.head()
The link to the Python API is broken. Could you please fix it, or describe how I can use this library from inside my code? I tried the above code and got this output:
Traceback (most recent call last):
File "c:\Users\Bhavay\Desktop\nvbi.py", line 47, in <module>
db = SRAdb('SRAmetadb.sqlite')
File "C:\Python\lib\site-packages\pysradb\sradb.py", line 178, in __init__
_verify_srametadb(sqlite_file)
File "C:\Python\lib\site-packages\pysradb\sradb.py", line 155, in _verify_srametadb
metadata = db.query("SELECT * FROM metaInfo")
File "C:\Python\lib\site-packages\pysradb\basedb.py", line 103, in query
results = self.cursor.execute(sql_query).fetchall()
sqlite3.OperationalError: no such table: metaInfo
Hi Saket,
It appears that the current output of the pysradb command (one example from the README is given below) is not very friendly to parsing by tools such as awk or cut. For instance, I'd like to retain only a few columns from the output, but the usual attempts such as awk -F "\t" .. or cut -f1-5 fail for columns which contain description text. This is a problem only if I use the direct string output from the command and not the --saveto option.
pysradb metadata --db ./SRAmetadb.sqlite SRP075720 --detailed --expand
Cheers, Vivek
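Since the --saveto file is cleanly tab-delimited, one workaround is to select columns from that file instead of the pretty-printed output. A stdlib-only sketch (the function and its name are illustrative, not part of pysradb):

```python
import csv

def keep_columns(tsv_in, wanted, tsv_out):
    """Copy only the named columns of a tab-separated metadata file,
    avoiding the field-splitting problems awk/cut hit on free-text
    description columns. Illustrative stdlib-only helper."""
    with open(tsv_in, newline="") as src, open(tsv_out, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.DictWriter(
            dst, fieldnames=wanted, delimiter="\t", extrasaction="ignore"
        )
        writer.writeheader()
        for row in reader:
            writer.writerow(row)
```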
Hi,
I want to redirect the download to an external hard drive since I have little space left in my hard drive.
I am trying
(pysradb) usr@usr-X705UDR:~$ pysradb download --out-dir ./pysradb_downloads -p SRP165962 --db cd /media/usr/LaCie/db/ , but it is not working.
Kindly help.
When I was trying to pipe the output from pysradb search to pysradb download, a FileNotFoundError was raised.
pysradb search -m 100 -q ribosome profiling --db ena -v 1 | pysradb download
If SRAmetadb.sqlite is not already downloaded on the local machine using $ pysradb metadb,
then db = SRAdb('xyz.sqlite')
throws an exception metaInfo not found,
which is hard-coded on line 158 in sradb.py. Ideally, it should check whether a file with the entered name already exists, and throw the error on line 153 in sradb.py, not a valid SRAmetadb.sqlite file.
Instead, it creates a new .sqlite file at the given path ('xyz.sqlite' in this case) and tries SELECT * from metaInfo
on it. Therefore, it never enters the except block on line 153.
from pysradb import SRAdb
db = SRAdb('this_shouldnt_exist')
df = db.sra_metadata('SRP098789')
df.head()
It returns
Traceback (most recent call last):
File "sample.py", line 2, in <module>
db = SRAdb('this_shouldnt_exist')
File "/usr/local/lib/python3.6/dist-packages/pysradb/sradb.py", line 178, in __init__
_verify_srametadb(sqlite_file)
File "/usr/local/lib/python3.6/dist-packages/pysradb/sradb.py", line 155, in _verify_srametadb
metadata = db.query("SELECT * FROM metaInfo")
File "/usr/local/lib/python3.6/dist-packages/pysradb/basedb.py", line 103, in query
results = self.cursor.execute(sql_query).fetchall()
sqlite3.OperationalError: no such table: metaInfo
and doing an ls on the current dir shows that a file with the entered name has been created:
saad@DaasDaham:~$ ls
Desktop examples.desktop Public sample.py this_shouldnt_exist
Documents Music pysradb_downloads snap Videos
Downloads Pictures R Templates
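The pre-check suggested above could look like this (a sketch of the proposed fix; the function name is illustrative and this is not current pysradb code):

```python
import os

def verify_sqlite_path(sqlite_file):
    """Fail fast before sqlite3.connect() silently creates an empty
    database file at a non-existent path. Sketch of the fix proposed
    above, not pysradb's current behaviour."""
    if not os.path.isfile(sqlite_file):
        raise FileNotFoundError(
            "{} does not exist; download it first with `pysradb metadb`".format(
                sqlite_file
            )
        )
```

Called at the top of SRAdb.__init__, this would turn the misleading "no such table: metaInfo" into a clear missing-file error.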
from pysradb import SRAweb
sc = SRAweb()
sc.gse_to_gsm("GSE34438")  # no keyword arguments
Traceback (most recent call last):
File "test_sraweb.py", line 151, in <module>
sc.gse_to_gsm("GSE34438")
File "/home/dibya/Documents/new_pysradb/pysradb/pysradb/sraweb.py", line 536, in gse_to_gsm
if kwargs["detailed"] == True:
KeyError: 'detailed'
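The usual fix for this pattern is kwargs.get with a default, so the keyword becomes genuinely optional. A minimal sketch of the idea, not the actual pysradb patch:

```python
def read_detailed_flag(**kwargs):
    """kwargs["detailed"] raises KeyError when the caller omits the
    keyword; kwargs.get("detailed", False) makes it optional with a
    sensible default. Sketch of the fix, not the actual patch."""
    return kwargs.get("detailed", False)
```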
Hi,
First I'd like to thank you for this very useful package. I'd love to use SRAweb;
unfortunately, there seems to be something wrong with it compared to SRAdb.
Here are my specs,
I'm trying to get the metadata from a SRA project ID (e.g.: SRP125768).
With local SQL db,
db = SRAdb('SRAmetadb.sqlite')
df1 = db.sra_metadata('SRP125768', detailed=True, expand_sample_attributes=True, sample_attribute=True)
W/o local SQL db,
db = SRAweb()
df2 = db.sra_metadata(srp="SRP125768", detailed=True, expand_sample_attributes=True, sample_attribute=True)
I haven't checked all the entries, but there is definitely something wrong with df2: duplicated rows / missing rows.
I'd be happy to get your feedback and your fix for this :)
Calling the srr_to_gsm function in SRAweb throws a KeyError.
Command run:
from pysradb import SRAweb
sc = SRAweb()
sc.srr_to_gsm("SRR057515")
Traceback:
Traceback (most recent call last):
File "test_sraweb.py", line 113, in <module>
test_srr_to_gsm(sc)
File "test_sraweb.py", line 109, in test_srr_to_gsm
df = sraweb_connection.srr_to_gsm("SRR057513")
File "/home/dibya/Documents/new_pysradb/pysradb/pysradb/sraweb.py", line 640, in srr_to_gsm
return _order_first(joined_df, ["run_accession", "experiment_alias"])
File "/home/dibya/Documents/new_pysradb/pysradb/pysradb/sraweb.py", line 22, in _order_first
return df[columns].drop_duplicates()
File "/home/dibya/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2806, in __getitem__
indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
File "/home/dibya/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 1552, in _get_listlike_indexer
keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
File "/home/dibya/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 1645, in _validate_read_indexer
raise KeyError(f"{not_found} not in index")
KeyError: "['experiment_alias'] not in index"
The initial versions of pysradb relied on SRAmetadb.sqlite for doing the operations. The test suite is currently tailored to that mode and has not been updated to reflect SRAweb.
SRAweb is going to be the only supported mode in the future, as we want to do away with the dependence on SRAmetadb. This will require a new suite of tests to be written based on the existing tests (since the sub-commands are exactly the same).
macOS Catalina 10.15.5
python 3.8
pysradb 0.10.5-dev0
https://www.ncbi.nlm.nih.gov/sra/?term=(rheumatoid+arthritis)+AND+%22Homo+sapiens%22%5Borgn%3A__txid9606%5D yields 6514 entries, while
pysradb search -q "rheumatoid arthritis" -m 9000 returns only 287 entries.
What did I do wrong?
I am trying to run a general search with the "--detailed" argument, but I get a KeyError when converting the taxid for certain SRPs. An example to reproduce this is: pysradb metadata SRP184142 --detailed
The error arises here:
Line 312 in 3b87e9d
A possible fix is
lambda taxid: TAXID_TO_NAME.get(taxid, "NA")
but I do not know whether that is the best long-term solution. I do not personally need any data from those "special" organisms, but they break pysradb search when a broad keyword is used. Ideally, I guess they would all be included in TAXID_TO_NAME?
Thank you for providing this package!
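The suggested fallback amounts to a plain dict `.get` with a default; a minimal sketch with a stand-in TAXID_TO_NAME dict (the real mapping in pysradb is much larger):

```python
# Stand-in for pysradb's TAXID_TO_NAME lookup table.
TAXID_TO_NAME = {9606: "Homo sapiens", 10090: "Mus musculus"}

def taxid_to_name(taxid):
    # .get with a default never raises KeyError for unknown taxids.
    return TAXID_TO_NAME.get(taxid, "NA")

print(taxid_to_name(9606))   # Homo sapiens
print(taxid_to_name(12345))  # NA  (unknown taxid, no KeyError)
```

This keeps broad searches from crashing at the cost of labelling unrecognized organisms "NA".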
I am trying to extract the metadata using the Python API for a number of BioProjects. It works fine for most BioProject accessions, but in some cases --detailed=True results in a ValueError.
db.sra_metadata('PRJNA389455', expand_sample_attributes=True, detailed=True)
This results in:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-54-a36d224bed27> in <module>
----> 1 db.sra_metadata('PRJNA389455', expand_sample_attributes=True, detailed=True)
~/PhDData/software/miniconda3/envs/dataviz/lib/python3.8/site-packages/pysradb/sraweb.py in sra_metadata(self, srp, sample_attribute, detailed, expand_sample_attributes, output_read_lengths, **kwargs)
506 metadata_df = metadata_df.drop_duplicates()
507 metadata_df = metadata_df.replace(r"^\s*$", np.nan, regex=True)
--> 508 ena_results = self.fetch_ena_fastq(srp)
509 if ena_results.shape[0]:
510 metadata_df = metadata_df.merge(
~/PhDData/software/miniconda3/envs/dataviz/lib/python3.8/site-packages/pysradb/sraweb.py in fetch_ena_fastq(self, srp)
149 srr = srr.split("_")[0]
150 if ";" in url1:
--> 151 url1_1, url1_2 = url1.split(";")
152 url1_2 = "http://{}".format(url1_2)
153 url2_1, url2_2 = url2.split(";")
ValueError: too many values to unpack (expected 2)
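The unpack fails because the ENA URL field can contain more than one semicolon (i.e. more than two FASTQ files per run), which breaks a strict two-way `a, b = url.split(";")`. A sketch of a tolerant split (the example URL field below is made up):

```python
# Hypothetical ENA fastq_ftp field with three semicolon-separated files
# (e.g. _1, _2 plus an extra unpaired file), which a two-way unpack rejects.
url_field = (
    "ftp.sra.ebi.ac.uk/x_1.fastq.gz;"
    "ftp.sra.ebi.ac.uk/x_2.fastq.gz;"
    "ftp.sra.ebi.ac.uk/x_3.fastq.gz"
)

# Split without a fixed arity, then prefix each entry uniformly.
urls = ["http://{}".format(u) for u in url_field.split(";") if u]
print(len(urls))  # 3 -- a strict `a, b = url_field.split(";")` would raise
```

Handling the list of URLs generically (rather than assuming exactly two) avoids the ValueError for such records.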
First off, thanks for the awesome tool!
Is your feature request related to a problem? Please describe.
In the examples in the docs and notebooks, the returned dataframe has a library_layout column; however, I do not get this column with the newest version + SRAweb.
import pysradb
from pysradb import SRAweb

print(pysradb.__version__)
print(SRAweb().sra_metadata("SRP016501", detailed=True).columns)
> 0.11.0
>Index(['study_accession', 'experiment_accession', 'experiment_title',
'experiment_desc', 'organism_taxid ', 'organism_name',
'library_strategy', 'library_source', 'library_selection',
'sample_accession', 'sample_title', 'instrument', 'total_spots',
'total_size', 'run_accession', 'run_total_spots', 'run_total_bases',
'run_alias', 'sra_url_alt1', 'sra_url_alt2', 'sra_url',
'experiment_alias', 'source_name', 'tissue', 'sra_url_alt3', 'strain',
'ena_fastq_http_1', 'ena_fastq_http_2', 'ena_fastq_ftp_1',
'ena_fastq_ftp_2'],
dtype='object')
I scanned the source very briefly, but it seems this info is never used:
https://github.com/saketkc/pysradb/blob/master/pysradb/sraweb.py#L416
Is there a reason it's not there?
I want to reproduce your notebook, pysradb/notebooks/01.SRAdb-demo.
I ran that file in my local environment, but the output is different.
I cannot understand this problem.
If you have a solution or idea, please tell me.
Thank you for your attention.
df = db.sra_metadata('SRP017942')
df
Hi @saketkc,
could you please provide information on how to fetch the organism name given a project ID, and on how to get the count of runs and experiments for each project?
I tried "pysradb metadata SRP063732 --desc --expand --detailed", but it fetched all the attribute information except the organism name.
Thanks in advance.
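Until such a summary is built in, it can be assembled with pandas from the detailed metadata frame; a sketch with a toy frame (column names taken from the output listed elsewhere in this thread; the accession values are made up):

```python
import pandas as pd

# Toy frame mimicking the detailed metadata columns; real data would come
# from SRAweb().sra_metadata("SRP063732", detailed=True).
df = pd.DataFrame({
    "study_accession": ["SRP063732"] * 3,
    "organism_name": ["Homo sapiens"] * 3,
    "experiment_accession": ["SRX1", "SRX1", "SRX2"],
    "run_accession": ["SRR1", "SRR2", "SRR3"],
})

# One row per project: organism plus distinct experiment and run counts.
summary = df.groupby("study_accession").agg(
    organism=("organism_name", "first"),
    n_experiments=("experiment_accession", "nunique"),
    n_runs=("run_accession", "nunique"),
)
print(summary)
```

Whether `organism_name` appears in the frame depends on the pysradb version and the record itself, which is what this issue reports.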
ISMB/ECCB 2019 feedback: provide checksums for verifying SRA downloads. These should also be checked at download time to verify the download (currently we rely on partial downloads, but the final download itself is not verified).
I tried to download an SRP project.
pysradb download SRP083135
It returns the help message
When I try:
pysradb metadata SRP083135
I get a table with the study_accession, experiment_accession, sample_accession and run_accession.
Describe the bug
ModuleNotFoundError when importing SRAweb
To Reproduce
conda create -c bioconda -n pysradb PYTHON=3 pysradb
conda activate pysradb
python
from pysradb import SRAweb
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pysradb.sraweb'
I have a problem when using the pipe from metadata to download.
It seems that the output of metadata is not parsed correctly by download: the download command uses the run_total_spots column as if it were the run_accession column.
pysradb metadata SRP026303 --assay | grep 'study\|AUY077' | pysradb download
This error occurs
Traceback (most recent call last):
File "/home/jeanmichel/miniconda3/envs/alignment/bin/pysradb", line 8, in <module>
sys.exit(parse_args())
File "/home/jeanmichel/miniconda3/envs/alignment/lib/python3.7/site-packages/pysradb/cli.py", line 1051, in parse_args
download(args.out_dir, args.db, args.srx, args.srp, args.skip_confirmation)
File "/home/jeanmichel/miniconda3/envs/alignment/lib/python3.7/site-packages/pysradb/cli.py", line 141, in download
protocol=protocol,
File "/home/jeanmichel/miniconda3/envs/alignment/lib/python3.7/site-packages/pysradb/sradb.py", line 1299, in download
+ ".sra"
File "/home/jeanmichel/miniconda3/envs/alignment/lib/python3.7/site-packages/pandas/core/generic.py", line 5270, in __getattr__
return object.__getattribute__(self, name)
File "/home/jeanmichel/miniconda3/envs/alignment/lib/python3.7/site-packages/pandas/core/accessor.py", line 187, in __get__
accessor_obj = self._accessor(obj)
File "/home/jeanmichel/miniconda3/envs/alignment/lib/python3.7/site-packages/pandas/core/strings.py", line 2039, in __init__
self._inferred_dtype = self._validate(data)
File "/home/jeanmichel/miniconda3/envs/alignment/lib/python3.7/site-packages/pandas/core/strings.py", line 2096, in _validate
raise AttributeError("Can only use .str accessor with string values!")
AttributeError: Can only use .str accessor with string values!
I took a look at the parsing, but it does not seem straightforward to fix.
A workaround for this issue could be something like this (note that grep must write to a different file than the one it reads):
pysradb metadata SRP026303 --assay --saveto tmp.txt && grep 'study\|AUY077' tmp.txt > tmp2.txt && pysradb download --input_file tmp2.txt
Is your feature request related to a problem? Please describe.
PMIDs are currently not returned for the metadata frame.
Describe the solution you'd like