
paperscraper's Introduction

[Badges: build · License: MIT · PyPI version · Downloads · Code style: black]

paperscraper

Overview

paperscraper is a Python package, published on PyPI, that facilitates scraping publication metadata as well as full PDF files from PubMed and from preprint servers such as arXiv, medRxiv, bioRxiv and chemRxiv. It provides a streamlined interface for scraping metadata and comes with simple postprocessing functions and plotting routines for meta-analysis.

Since v0.2.4 paperscraper also supports scraping PDF files directly! Thanks to @daenuprobst for suggestions!

Getting started

pip install paperscraper

This is enough to query PubMed, arXiv or Google Scholar.

Download X-rxiv Dumps

However, to scrape publication data from the preprint servers bioRxiv, medRxiv and chemRxiv, the setup is different: the entire dump is downloaded and stored in the server_dumps folder in .jsonl format (one paper per line).

from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
medrxiv()  #  Takes ~30min and should result in ~35 MB file
biorxiv()  # Takes ~1h and should result in ~350 MB file
chemrxiv()  #  Takes ~45min and should result in ~20 MB file

NOTE: Once the dumps are stored, please make sure to restart the Python interpreter so that the changes take effect.
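If a full restart is inconvenient, reloading the dump registry in place is one option. A minimal sketch (it only assumes that QUERY_FN_DICT in paperscraper.load_dumps is populated at import time, which is why the NOTE above asks for a restart; bindings imported elsewhere before the reload are not updated):

import importlib

import paperscraper.load_dumps

# Re-scan the server_dumps folder in the running session and rebuild the registry.
importlib.reload(paperscraper.load_dumps)
print(paperscraper.load_dumps.QUERY_FN_DICT.keys())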

Since v0.2.5, paperscraper also allows scraping {med/bio/chem}rxiv for specific date ranges! Thanks to @achouhan93 for contributing!

medrxiv(begin_date="2023-04-01", end_date="2023-04-08")

But watch out: the resulting .jsonl file will be labelled with the current date, and all subsequent searches will be based on this file only. If you use this option, keep an eye on the source files (paperscraper/server_dumps/*jsonl) to ensure they contain the metadata for all papers you are interested in.
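For example, a quick way to see which dump files are present and how large they are (a minimal sketch, assuming the dumps live in the server_dumps folder inside the installed package, as mentioned above):

# List local dump files and their sizes (assumes the default server_dumps location).
import glob
import os

import paperscraper

dump_dir = os.path.join(os.path.dirname(paperscraper.__file__), "server_dumps")
for path in sorted(glob.glob(os.path.join(dump_dir, "*.jsonl"))):
    print(f"{path}: {os.path.getsize(path) / 1e6:.1f} MB")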

Examples

paperscraper is built on top of the packages pymed, arxiv and scholarly.

Publication keyword search

Consider you want to perform a publication keyword search with the query: COVID-19 AND Artificial Intelligence AND Medical Imaging.

  • Scrape papers from PubMed:
from paperscraper.pubmed import get_and_dump_pubmed_papers
covid19 = ['COVID-19', 'SARS-CoV-2']
ai = ['Artificial intelligence', 'Deep learning', 'Machine learning']
mi = ['Medical imaging']
query = [covid19, ai, mi]

get_and_dump_pubmed_papers(query, output_filepath='covid19_ai_imaging.jsonl')
  • Scrape papers from arXiv:
from paperscraper.arxiv import get_and_dump_arxiv_papers

get_and_dump_arxiv_papers(query, output_filepath='covid19_ai_imaging.jsonl')
  • Scrape papers from bioRxiv, medRxiv or chemRxiv:
from paperscraper.xrxiv.xrxiv_query import XRXivQuery

querier = XRXivQuery('server_dumps/chemrxiv_2020-11-10.jsonl')
querier.search_keywords(query, output_filepath='covid19_ai_imaging.jsonl')

You can also use dump_queries to iterate over a bunch of queries for all available databases.

from paperscraper import dump_queries

queries = [[covid19, ai, mi], [covid19, ai], [ai]]
dump_queries(queries, '.')

Or use the harmonized interface of QUERY_FN_DICT to query multiple databases of your choice:

from paperscraper.load_dumps import QUERY_FN_DICT
print(QUERY_FN_DICT.keys())

QUERY_FN_DICT['biorxiv'](query, output_filepath='biorxiv_covid_ai_imaging.jsonl')
QUERY_FN_DICT['medrxiv'](query, output_filepath='medrxiv_covid_ai_imaging.jsonl')
  • Scrape papers from Google Scholar:

Thanks to scholarly, there is an endpoint for Google Scholar too. Unlike the others, it does not understand Boolean expressions, so use it just like the Google Scholar search field.

from paperscraper.scholar import get_and_dump_scholar_papers
topic = 'Machine Learning'
get_and_dump_scholar_papers(topic)

Scrape PDFs

paperscraper also allows you to download the PDF files.

from paperscraper.pdf import save_pdf
paper_data = {'doi': "10.48550/arXiv.2207.03928"}
save_pdf(paper_data, filepath='gt4sd_paper.pdf')

If you want to batch download all PDFs for your previous metadata search, use the wrapper. Here we scrape the PDFs for the metadata obtained in the previous example.

from paperscraper.pdf import save_pdf_from_dump

# Save PDFs in current folder and name the files by their DOI
save_pdf_from_dump('medrxiv_covid_ai_imaging.jsonl', pdf_path='.', key_to_save='doi')

NOTE: This works robustly for preprint servers, but if you use it on a PubMed dump, don't expect to obtain all PDFs. Many publishers detect and block scraping, and many publications are simply behind paywalls.
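If you try it anyway, it can help to first restrict the dump to entries that actually carry a DOI. A minimal sketch (the filenames are hypothetical; it only assumes one JSON record per line with a doi field):

# Keep only records that carry a DOI before attempting PDF downloads.
# Filenames are hypothetical; assumes one JSON object per line with a 'doi' field.
import json

with open("pubmed_covid_ai_imaging.jsonl") as fin, open("pubmed_with_doi.jsonl", "w") as fout:
    for line in fin:
        record = json.loads(line)
        if record.get("doi"):
            fout.write(json.dumps(record) + "\n")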

Citation search

A plus of the Scholar endpoint is that the number of citations of a paper can be fetched:

from paperscraper.scholar import get_citations_from_title
title = 'Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I.'
get_citations_from_title(title)

NOTE: The Scholar endpoint does not require authentication, but since it regularly prompts with captchas, it is difficult to use at scale.

Journal impact factor

You can also retrieve the impact factor for all journals:

>>> from paperscraper.impact import Impactor
>>> i = Impactor()
>>> i.search("Nat Comms", threshold=85, sort_by='impact')
[
    {'journal': 'Nature Communications', 'factor': 17.694, 'score': 94}, 
    {'journal': 'Natural Computing', 'factor': 1.504, 'score': 88}
]

This performs a fuzzy search with a threshold of 85. threshold defaults to 100, in which case an exact search is performed. You can also search by journal abbreviation, E-ISSN or NLM ID.

i.search("Nat Rev Earth Environ") # [{'journal': 'Nature Reviews Earth & Environment', 'factor': 37.214, 'score': 100}]
i.search("101771060") # [{'journal': 'Nature Reviews Earth & Environment', 'factor': 37.214, 'score': 100}]
i.search('2662-138X') # [{'journal': 'Nature Reviews Earth & Environment', 'factor': 37.214, 'score': 100}]

# Filter results by impact factor
i.search("Neural network", threshold=85, min_impact=1.5, max_impact=20)
# [
#   {'journal': 'IEEE Transactions on Neural Networks and Learning Systems', 'factor': 14.255, 'score': 93}, 
#   {'journal': 'NEURAL NETWORKS', 'factor': 9.657, 'score': 91},
#   {'journal': 'WORK-A Journal of Prevention Assessment & Rehabilitation', 'factor': 1.803, 'score': 86}, 
#   {'journal': 'NETWORK-COMPUTATION IN NEURAL SYSTEMS', 'factor': 1.5, 'score': 92}
# ]

# Show all fields
i.search("quantum information", threshold=90, return_all=True)
# [
#   {'factor': 10.758, 'jcr': 'Q1', 'journal_abbr': 'npj Quantum Inf', 'eissn': '2056-6387', 'journal': 'npj Quantum Information', 'nlm_id': '101722857', 'issn': '', 'score': 92},
#   {'factor': 1.577, 'jcr': 'Q3', 'journal_abbr': 'Nation', 'eissn': '0027-8378', 'journal': 'NATION', 'nlm_id': '9877123', 'issn': '0027-8378', 'score': 91}
# ]
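As a usage example, you could keep the impact factor of the best fuzzy match for a list of journal names. A sketch reusing the Impactor instance i from above and only the 'journal', 'factor' and 'score' result fields shown in the examples:

# Pick the best fuzzy match per journal name and keep its impact factor.
journals = ["Nat Comms", "Nat Rev Earth Environ", "NEURAL NETWORKS"]
factors = {}
for name in journals:
    hits = i.search(name, threshold=85)
    if hits:
        best = max(hits, key=lambda hit: hit["score"])
        factors[best["journal"]] = best["factor"]
print(factors)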

Plotting

When multiple query searches are performed, two types of plots can be generated automatically: Venn diagrams and bar plots.

Barplots

Compare the temporal evolution of different queries across different servers.

import os

from paperscraper import QUERY_FN_DICT
from paperscraper.postprocessing import aggregate_paper
from paperscraper.utils import get_filename_from_query, load_jsonl

# Define search terms and their synonyms
ml = ['Deep learning', 'Neural Network', 'Machine learning']
mol = ['molecule', 'molecular', 'drug', 'ligand', 'compound']
gnn = ['gcn', 'gnn', 'graph neural', 'graph convolutional', 'molecular graph']
smiles = ['SMILES', 'Simplified molecular']
fp = ['fingerprint', 'molecular fingerprint', 'fingerprints']

# Define queries
queries = [[ml, mol, smiles], [ml, mol, fp], [ml, mol, gnn]]

root = '../keyword_dumps'

data_dict = dict()
for query in queries:
    filename = get_filename_from_query(query)
    data_dict[filename] = dict()
    for db,_ in QUERY_FN_DICT.items():
        # Assuming the keyword search has been performed already
        data = load_jsonl(os.path.join(root, db, filename))

        # Unstructured matches are aggregated into 6 bins, 1 per year
        # from 2015 to 2020. Sanity check is performed by having 
        # `filtering=True`, removing papers that don't contain all of
        # the keywords in query.
        data_dict[filename][db], filtered = aggregate_paper(
            data, 2015, bins_per_year=1, filtering=True,
            filter_keys=query, return_filtered=True
        )

# Plotting is now very simple
from paperscraper.plotting import plot_comparison

data_keys = [
    'deeplearning_molecule_fingerprint.jsonl',
    'deeplearning_molecule_smiles.jsonl', 
    'deeplearning_molecule_gcn.jsonl'
]
plot_comparison(
    data_dict,
    data_keys,
    title_text="'Deep Learning' AND 'Molecule' AND X",
    keyword_text=['Fingerprint', 'SMILES', 'Graph'],
    figpath='mol_representation'
)

(Figure: bar-plot comparison of the three queries, saved as mol_representation)

Venn Diagrams

from paperscraper.plotting import (
    plot_venn_two, plot_venn_three, plot_multiple_venn
)

sizes_2020 = (30842, 14474, 2292, 35476, 1904, 1408, 376)
sizes_2019 = (55402, 11899, 2563)
labels_2020 = ('Medical\nImaging', 'Artificial\nIntelligence', 'COVID-19')
labels_2019 = ['Medical Imaging', 'Artificial\nIntelligence']

plot_venn_two(sizes_2019, labels_2019, title='2019', figname='ai_imaging')

(Figure: two-set Venn diagram for 2019, saved as ai_imaging)

plot_venn_three(
    sizes_2020, labels_2020, title='2020', figname='ai_imaging_covid'
)

(Figure: three-set Venn diagram for 2020, saved as ai_imaging_covid)

Or plot both together:

plot_multiple_venn(
    [sizes_2019, sizes_2020], [labels_2019, labels_2020], 
    titles=['2019', '2020'], suptitle='Keyword search comparison', 
    gridspec_kw={'width_ratios': [1, 2]}, figsize=(10, 6),
    figname='both'
)

(Figure: combined Venn diagrams, saved as both)

Citation

If you use paperscraper, please cite the papers that motivated our development of this tool.

@article{born2021trends,
  title={Trends in Deep Learning for Property-driven Drug Design},
  author={Born, Jannis and Manica, Matteo},
  journal={Current Medicinal Chemistry},
  volume={28},
  number={38},
  pages={7862--7886},
  year={2021},
  publisher={Bentham Science Publishers}
}

@article{born2021on,
  title = {On the role of artificial intelligence in medical imaging of COVID-19},
  journal = {Patterns},
  volume = {2},
  number = {6},
  pages = {100269},
  year = {2021},
  issn = {2666-3899},
  url = {https://doi.org/10.1016/j.patter.2021.100269},
  author = {Jannis Born and David Beymer and Deepta Rajan and Adam Coy and Vandana V. Mukherjee and Matteo Manica and Prasanth Prasanna and Deddeh Ballah and Michal Guindy and Dorith Shaham and Pallav L. Shah and Emmanouil Karteris and Jan L. Robertus and Maria Gabrani and Michal Rosen-Zvi}
}

paperscraper's People

Contributors

achouhan93, dependabot[bot], drugilsberg, jannisborn, juliusbierk, lukasschwab, oppih


paperscraper's Issues

Scraper killed

If a "Killed" error is displayed and paper scraping stops, can it be assumed that the IP is blocked?

How can I solve this issue?

get_dumps.chemrxiv does nothing

I got chem_token from figshare.com.

from paperscraper.get_dumps import chemrxiv, biorxiv, medrxiv
...
chemrxiv(save_path=chem_save_path, token=chem_token)

Running:

WARNING:paperscraper.load_dumps: No dump found for chemrxiv. Skipping entry.
WARNING:paperscraper.load_dumps: No dump found for medrxiv. Skipping entry.
0it [00:00, ?it/s]
INFO:paperscraper.get_dumps.utils.chemrxiv.utils:Done, shutting down

File chemrxiv_2021-10-07.jsonl is created but empty.

Meanwhile med and bio seem to work fine!

Scrape X-rxiv via API

Currently, bio/med/chemrxiv scraping requires the user to first download the entire DB and store it locally.

Ideally, these dumps should be stored on a server and updated regularly (cron job). Users would just send requests to the server API. That would become the new default behaviour, but local download should still be supported too.

No DOI given in saved dumps of recent arxiv papers

Ran

from paperscraper.arxiv import get_and_dump_arxiv_papers

prompt = ['prompt engineering llm', 'prompt injection llm']
ai = ['Artificial intelligence', 'Large Language Models', 'OpenAI','LLM']
mi = ['ChatGPT']
query = [prompt, ai, mi]

get_and_dump_arxiv_papers(query, output_filepath='pro_inject.jsonl')

Example: the .jsonl contains title, authors and abstract for the page https://arxiv.org/abs/2302.11382, but journal is always blank and doi is null. This pattern repeats for all results.
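As a side note, arXiv mints DataCite DOIs of the form 10.48550/arXiv.<id> (the same pattern used in the save_pdf example above), so a DOI can be derived from the abstract URL when the dump leaves the field blank. A minimal sketch:

# Derive the DataCite DOI (10.48550/arXiv.<id>) from an arXiv abstract URL.
from urllib.parse import urlparse

def arxiv_url_to_doi(url: str) -> str:
    arxiv_id = urlparse(url).path.rsplit("/", 1)[-1]  # e.g. '2302.11382'
    return f"10.48550/arXiv.{arxiv_id}"

print(arxiv_url_to_doi("https://arxiv.org/abs/2302.11382"))  # 10.48550/arXiv.2302.11382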

Error when importing any of chemrxiv, biorxiv, medrxiv from paperscraper.get_dumps

I just installed paperscraper from pip today. However, I get the following error when doing from paperscraper.get_dumps import chemrxiv, biorxiv, medrxiv.

WARNING:paperscraper.load_dumps: No dump found for biorxiv. Skipping entry.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/opt/conda/envs/python3.7/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Float64HashTable.get_item()

TypeError: must be real number, not str

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
/opt/conda/envs/python3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

/opt/conda/envs/python3.7/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

/opt/conda/envs/python3.7/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

KeyError: 'date'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_529/3049928879.py in <module>
----> 1 from paperscraper.get_dumps import medrxiv
      2 # chemrxiv(token=api_token)
      3 # medrxiv()
      4 # biorxiv()

~/.local/lib/python3.7/site-packages/paperscraper/__init__.py in <module>
      8 from typing import List, Union
      9 
---> 10 from .load_dumps import QUERY_FN_DICT
     11 from .utils import get_filename_from_query
     12 

~/.local/lib/python3.7/site-packages/paperscraper/load_dumps.py in <module>
     29         logger.info(f' Multiple dumps found for {db}, taking most recent one')
     30     path = sorted(dump_paths, reverse=True)[0]
---> 31     querier = XRXivQuery(path)
     32     QUERY_FN_DICT.update({db: querier.search_keywords})
     33 

~/.local/lib/python3.7/site-packages/paperscraper/xrxiv/xrxiv_query.py in __init__(self, dump_filepath, fields)
     23         self.fields = fields
     24         self.df = pd.read_json(self.dump_filepath, lines=True)
---> 25         self.df['date'] = [date.strftime('%Y-%m-%d') for date in self.df['date']]
     26 
     27     def search_keywords(

/opt/conda/envs/python3.7/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

/opt/conda/envs/python3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'date'

Edit 1: The same error also occurs when doing from paperscraper.arxiv import get_and_dump_arxiv_papers. However, if I run from paperscraper.get_dumps import chemrxiv, biorxiv, medrxiv first (even though that one also errors), the second error does not occur.
Edit 2: The version is 0.1.0.

import error

Hi!

Sorry to take up your time with this, but I have a small issue when trying to use the package on chemRxiv/bioRxiv and medRxiv.
The import fails with a "no module found" error (screenshot attached).

I was wondering if I had missed something?

Thank you very much for making this package open source, I look forward to using it!
Best regards,

Claire

ImportError: attempted relative import beyond top-level package

Probably my fault, but I pasted this code:

from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
medrxiv() # Takes ~30min and should result in ~35 MB file
biorxiv() # Takes ~1h and should result in ~350 MB file
chemrxiv() # Takes ~45min and should result in ~20 MB file

into a get_dumps.py in paperscraper/paperscraper and tried running it using python3 paperscraper, and I got this error:

Traceback (most recent call last):
  File "/home/morgan/anaconda3/envs/scraper/paperscraper/paperscraper/get_dumps.py", line 1, in <module>
    from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
  File "/home/morgan/.local/lib/python3.10/site-packages/paperscraper/__init__.py", line 10, in <module>
    from .load_dumps import QUERY_FN_DICT
  File "/home/morgan/.local/lib/python3.10/site-packages/paperscraper/load_dumps.py", line 8, in <module>
    from .arxiv import get_and_dump_arxiv_papers
  File "/home/morgan/.local/lib/python3.10/site-packages/paperscraper/arxiv/__init__.py", line 1, in <module>
    from .arxiv import *  # noqa
  File "/home/morgan/.local/lib/python3.10/site-packages/paperscraper/arxiv/arxiv.py", line 5, in <module>
    import arxiv
  File "/home/morgan/anaconda3/envs/scraper/paperscraper/paperscraper/arxiv/__init__.py", line 1, in <module>
    from .arxiv import *  # noqa
  File "/home/morgan/anaconda3/envs/scraper/paperscraper/paperscraper/arxiv/arxiv.py", line 7, in <module>
    from ..utils import dump_papers
ImportError: attempted relative import beyond top-level package

I'm likely doing something wrong, and I was hoping you could help me figure out what it is in particular.

Thanks!

-Morgan

Edit: I am using a conda environment to run this, if that makes a difference.

UnexpectedEmptyPageError and associated errors

Please excuse me if I do this incorrectly; I am a noob. I am using Python 3.11 on Windows 11 and Ubuntu 22.04.2. I have run into an error like this on arXiv as well as medRxiv:

arxiv.arxiv.UnexpectedEmptyPageError: Page of results was unexpectedly empty (http://export.arxiv.org/api/query?search_query=%28all%3Apyschological+flow+state%29&id_list=&sortBy=relevance&sortOrder=descending&start=29500&max_results=100)

This seems to be an issue in the upstream code and was patched here: lukasschwab/arxiv.py#43

I did not see that and took a similar path. My code checks whether a URL is malformed or returns an empty page, handles it and logs it. If it runs into a URL that is not responding or hangs, it waits a user-defined amount of time and moves on. You can also make it create smaller .jsonl files for various reasons. I was also going to implement querying by date. Right now it is all hardcoded variables, but I was thinking of making the options configurable from the command line or a config file. I am also thinking about multi-threading, throttling calls to the service, and/or a back-off algorithm. I don't know what I am supposed to do: should I provide my fixes, if needed, and how, or should I go to the arxiv team? I also suspect these issues lurk in other libraries, but I have not done extensive testing. Thank you, I appreciate your time and paperscraper.
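For reference, a back-off wrapper along the lines described above could be sketched as follows (generic; fetch_page is a hypothetical stand-in for whatever call raises UnexpectedEmptyPageError or hangs):

# A generic retry-with-exponential-backoff sketch.
import logging
import time


def fetch_with_backoff(fetch_page, *args, retries=5, base_delay=2.0, **kwargs):
    """Call fetch_page, retrying with exponential back-off; return None if all attempts fail."""
    for attempt in range(retries):
        try:
            return fetch_page(*args, **kwargs)
        except Exception as exc:  # narrow this to the specific errors you expect
            delay = base_delay * 2 ** attempt
            logging.warning("Attempt %d failed (%s); retrying in %.0f s", attempt + 1, exc, delay)
            time.sleep(delay)
    return None  # give up and move on, as described above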

HTTPError for paperscraper.get_dumps.chemrxiv()

Hi, I was trying this library out and it worked for biorxiv() and medrxiv(). However, for chemrxiv() I kept getting this error:

Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    xxv_dumps(3)
  File "C:\Users\User\Documents\98_Notes\Data Analytics\use_paperscraper.py", line 37, in xxv_dumps
    chemrxiv()  #  Takes ~45min and should result in ~20 MB file
  File "C:\Users\User\Desktop\Programs\Winpython64_3.11\WPy64-31110\python-3.11.1.amd64\Lib\site-packages\paperscraper\get_dumps\chemrxiv.py", line 42, in chemrxiv
    download_full(save_folder, api)
  File "C:\Users\User\Desktop\Programs\Winpython64_3.11\WPy64-31110\python-3.11.1.amd64\Lib\site-packages\paperscraper\get_dumps\utils\chemrxiv\utils.py", line 132, in download_full
    for preprint in tqdm(api.all_preprints()):
  File "C:\Users\User\Desktop\Programs\Winpython64_3.11\WPy64-31110\python-3.11.1.amd64\Lib\site-packages\tqdm\std.py", line 1178, in __iter__
    for obj in iterable:
  File "C:\Users\User\Desktop\Programs\Winpython64_3.11\WPy64-31110\python-3.11.1.amd64\Lib\site-packages\paperscraper\get_dumps\utils\chemrxiv\chemrxiv_api.py", line 103, in query_generator
    r.raise_for_status()
  File "C:\Users\User\Desktop\Programs\Winpython64_3.11\WPy64-31110\python-3.11.1.amd64\Lib\site-packages\requests\models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: 	https://chemrxiv.org/engage/chemrxiv/public-api/v1%5Citems?limit=50&skip=0&searchDateFrom=2017-01-01&searchDateTo=2023-05-08

After reading through past issues and PRs, I noticed that https://chemrxiv.org/engage/chemrxiv/public-api/v1 is a valid address, and found that '%5C' is the percent-encoding of '\'. So I manually stitched together a valid URL in get_dumps\utils\chemrxiv\chemrxiv_api.query_generator by replacing

r = self.request(os.path.join(self.base, query), method, params=params)
with
r = self.request(self.base+"/"+query, method, params=params)

Result: A bunch of JSONs were dumped in server_dumps, each corresponding to a single paper.
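For context, the %5C comes from building the URL with os.path.join, which uses a backslash separator on Windows; joining URL segments with an explicit '/' avoids it. A minimal sketch:

# Why the URL breaks on Windows, and a portable alternative.
import os
from urllib.parse import urljoin

base = "https://chemrxiv.org/engage/chemrxiv/public-api/v1"

# On Windows, os.path.join uses '\' as the separator, which the HTTP client
# percent-encodes as %5C -> the 404 seen above.
print(os.path.join(base, "items"))

# Joining with an explicit '/' (or urljoin on a trailing-slash base) keeps the URL valid:
print(base + "/items")
print(urljoin(base + "/", "items"))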

Remote disconnected and didn't download files

Hi,
Very cool project! It looks like I installed it correctly, and I ran this code in a Jupyter notebook:

from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
medrxiv()  # Takes ~30min and should result in ~35 MB file
biorxiv()  # Takes ~1h and should result in ~350 MB file
chemrxiv()  # Takes ~45min and should result in ~20 MB file

I get this response:

61032it [20:29, 49.63it/s]
106700it [1:45:02, 16.93it/s]

And then I get the mess below. Any ideas on what I can do? Thank you!!

Sincerely,

tom

RemoteDisconnected                        Traceback (most recent call last)
File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:703, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    702 # Make the request on the httplib connection object.
--> 703 httplib_response = self._make_request(
    704     conn,
    705     method,
    706     url,
    707     timeout=timeout_obj,
    708     body=body,
    709     headers=headers,
    710     chunked=chunked,
    711 )
    713 # If we're going to release the connection in ``finally:``, then
    714 # the response doesn't need to know about the connection. Otherwise
    715 # it will also try to release it and we'll have a double-release
    716 # mess.

File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:449, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    445         except BaseException as e:
    446             # Remove the TypeError from the exception chain in
    447             # Python 3 (including for exceptions like SystemExit).
    448             # Otherwise it looks like a bug in the code.
--> 449             six.raise_from(e, None)
    450 except (SocketTimeout, BaseSSLError, SocketError) as e:

File <string>:3, in raise_from(value, from_value)

File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:444, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    443 try:
--> 444     httplib_response = conn.getresponse()
    445 except BaseException as e:
    446     # Remove the TypeError from the exception chain in
    447     # Python 3 (including for exceptions like SystemExit).
    448     # Otherwise it looks like a bug in the code.

File ~\anaconda3\lib\http\client.py:1377, in HTTPConnection.getresponse(self)
   1376 try:
-> 1377     response.begin()
   1378 except ConnectionError:

File ~\anaconda3\lib\http\client.py:320, in HTTPResponse.begin(self)
    319 while True:
--> 320     version, status, reason = self._read_status()
    321     if status != CONTINUE:

File ~\anaconda3\lib\http\client.py:289, in HTTPResponse._read_status(self)
    286 if not line:
    287     # Presumably, the server closed the connection before
    288     # sending a valid response.
--> 289     raise RemoteDisconnected("Remote end closed connection without"
    290                              " response")
    291 try:

RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

ProtocolError                             Traceback (most recent call last)
File ~\anaconda3\lib\site-packages\requests\adapters.py:486, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    485 try:
--> 486     resp = conn.urlopen(
    487         method=request.method,
    488         url=url,
    489         body=request.body,
    490         headers=request.headers,
    491         redirect=False,
    492         assert_same_host=False,
    493         preload_content=False,
    494         decode_content=False,
    495         retries=self.max_retries,
    496         timeout=timeout,
    497         chunked=chunked,
    498     )
    500 except (ProtocolError, OSError) as err:

File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:785, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    783     e = ProtocolError("Connection aborted.", e)
--> 785 retries = retries.increment(
    786     method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
    787 )
    788 retries.sleep()

File ~\anaconda3\lib\site-packages\urllib3\util\retry.py:550, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
    549 if read is False or not self._is_method_retryable(method):
--> 550     raise six.reraise(type(error), error, _stacktrace)
    551 elif read is not None:

File ~\anaconda3\lib\site-packages\urllib3\packages\six.py:769, in reraise(tp, value, tb)
    768 if value.__traceback__ is not tb:
--> 769     raise value.with_traceback(tb)
    770 raise value

File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:703, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    702 # Make the request on the httplib connection object.
--> 703 httplib_response = self._make_request(
    704     conn,
    705     method,
    706     url,
    707     timeout=timeout_obj,
    708     body=body,
    709     headers=headers,
    710     chunked=chunked,
    711 )
    713 # If we're going to release the connection in ``finally:``, then
    714 # the response doesn't need to know about the connection. Otherwise
    715 # it will also try to release it and we'll have a double-release
    716 # mess.

File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:449, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    445         except BaseException as e:
    446             # Remove the TypeError from the exception chain in
    447             # Python 3 (including for exceptions like SystemExit).
    448             # Otherwise it looks like a bug in the code.
--> 449             six.raise_from(e, None)
    450 except (SocketTimeout, BaseSSLError, SocketError) as e:

File <string>:3, in raise_from(value, from_value)

File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:444, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    443 try:
--> 444     httplib_response = conn.getresponse()
    445 except BaseException as e:
    446     # Remove the TypeError from the exception chain in
    447     # Python 3 (including for exceptions like SystemExit).
    448     # Otherwise it looks like a bug in the code.

File ~\anaconda3\lib\http\client.py:1377, in HTTPConnection.getresponse(self)
   1376 try:
-> 1377     response.begin()
   1378 except ConnectionError:

File ~\anaconda3\lib\http\client.py:320, in HTTPResponse.begin(self)
    319 while True:
--> 320     version, status, reason = self._read_status()
    321     if status != CONTINUE:

File ~\anaconda3\lib\http\client.py:289, in HTTPResponse._read_status(self)
    286 if not line:
    287     # Presumably, the server closed the connection before
    288     # sending a valid response.
--> 289     raise RemoteDisconnected("Remote end closed connection without"
    290                              " response")
    291 try:

ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
File ~\anaconda3\lib\site-packages\paperscraper\xrxiv\xrxiv_api.py:71, in XRXivApi.get_papers(self, begin_date, end_date, fields)
     70 while do_loop:
---> 71     json_response = requests.get(
     72         self.get_papers_url.format(
     73             begin_date=begin_date, end_date=end_date, cursor=cursor
     74         )
     75     ).json()
     76     do_loop = json_response["messages"][0]["status"] == "ok"

File ~\anaconda3\lib\site-packages\requests\api.py:73, in get(url, params, **kwargs)
     63 r"""Sends a GET request.
     64 
     65 :param url: URL for the new :class:`Request` object.
   (...)
     70 :rtype: requests.Response
     71 """
---> 73 return request("get", url, params=params, **kwargs)

File ~\anaconda3\lib\site-packages\requests\api.py:59, in request(method, url, **kwargs)
     58 with sessions.Session() as session:
---> 59     return session.request(method=method, url=url, **kwargs)

File ~\anaconda3\lib\site-packages\requests\sessions.py:589, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    588 send_kwargs.update(settings)
--> 589 resp = self.send(prep, **send_kwargs)
    591 return resp

File ~\anaconda3\lib\site-packages\requests\sessions.py:703, in Session.send(self, request, **kwargs)
    702 # Send the request
--> 703 r = adapter.send(request, **kwargs)
    705 # Total elapsed time of the request (approximately)

File ~\anaconda3\lib\site-packages\requests\adapters.py:501, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    500 except (ProtocolError, OSError) as err:
--> 501     raise ConnectionError(err, request=request)
    503 except MaxRetryError as e:

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
Input In [2], in <cell line: 3>()
      1 from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
      2 medrxiv()  #  Takes ~30min and should result in ~35 MB file
----> 3 biorxiv()  # Takes ~1h and should result in ~350 MB file
      4 chemrxiv()

File ~\anaconda3\lib\site-packages\paperscraper\get_dumps\biorxiv.py:42, in biorxiv(begin_date, end_date, save_path)
     40 # dump all papers
     41 with open(save_path, "w") as fp:
---> 42     for index, paper in enumerate(
     43         tqdm(api.get_papers(begin_date=begin_date, end_date=end_date))
     44     ):
     45         if index > 0:
     46             fp.write(os.linesep)

File ~\anaconda3\lib\site-packages\tqdm\std.py:1195, in tqdm.__iter__(self)
   1192 time = self._time
   1194 try:
-> 1195     for obj in iterable:
   1196         yield obj
   1197         # Update and possibly print the progressbar.
   1198         # Note: does not call self.update(1) for speed optimisation.

File ~\anaconda3\lib\site-packages\paperscraper\xrxiv\xrxiv_api.py:85, in XRXivApi.get_papers(self, begin_date, end_date, fields)
     83                 yield processed_paper
     84 except Exception as exc:
---> 85     raise RuntimeError(
     86         "Failed getting papers: {} - {}".format(exc.__class__.__name__, exc)
     87     )

RuntimeError: Failed getting papers: ConnectionError - ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
