fabiobatalha / crossrefapi
A Python library that implements the Crossref API.
License: BSD 2-Clause "Simplified" License
Hi @fabiobatalha,
I am using this library and I got a "too many requests" error because I made multiple requests at a time.
How do I handle this error?
Please advise.
Description:
The code currently lacks proper documentation, making it difficult for users to understand the classes and methods and their intended usage. In order to improve the code's usability and maintainability, we should add comprehensive documentation.
Documentation Status:
Action Required:
Expected Documentation Style:
We can use PEP 257 style docstrings for documenting classes, methods, and functions. Refer to the PEP 257 documentation for guidelines.
Specific Examples:
The Endpoint and Works classes have no documentation. do_http_request in the HTTPRequest class has no documentation.
Apologies if this isn't the right channel to ask.
I'm trying to match titles to their DOIs with a simple loop:
for article in articles:
    work = works.query(bibliographic=article.title)
    for w in work:
        # w is a dict, so membership is checked with 'in', not hasattr()
        if hasattr(article, 'title') and 'title' in w and w['title'][0] == article.title:
            article.doi = w['DOI']
            print(article.doi)
            article.save()
            break
    else:
        print('not found', article.title)
But since work contains over 80k results, the method is too slow to be valuable. I have also tried .sample(20), hoping it would narrow the search, but it didn't match any titles. Is that because the sample is random?
Is there any way I can just fetch the first items from the work result? They always seem to contain the match I need.
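Since iterating a query result streams every match, one client-side workaround is to stop after the first few items. A minimal sketch using itertools.islice; the crossrefapi usage in the comments is hypothetical:

```python
from itertools import islice

def first_n(results, n):
    """Take at most n items from any iterable of query results."""
    return list(islice(results, n))

# Hypothetical usage against a crossrefapi query result:
# from crossref.restful import Works
# works = Works()
# top = first_n(works.query(bibliographic='some title'), 5)
```

This stops consuming the HTTP-backed iterator after n items instead of walking all 80k results.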
I was trying to get my own works using the from_accepted_date filter and it returned this error.
from crossref.restful import Works

works = Works()
pub_date = '2001'
author = 'aguiam'
pub = works.query(author=author).filter(from_accepted_date=pub_date).sort('published')
UrlSyntaxError: Filter from-accepted-date specified but there is no such filter for this route. Valid filters for this route are: from-event-start-date, has-update, has-abstract, article_number, until-update-date, from-posted-date, license.delay, has-update-policy, prefix, has-content-domain, has-authenticated-orcid, type, relation.type, from-event-end-date, has-orcid, archive, full-text.version, until-event-end-date, from-pub-date, until-index-date, has-full-text, has-assertion, until-posted-date, until-print-pub-date, has-affiliation, funder-doi-asserted-by, license.version, assertion, has-funder, member, from-created-date, has-domain-restriction, from-index-date, full-text.application, has-event, until-pub-date, until-event-start-date, from-deposit-date, relation.object-type, has-award, clinical-trial-number, assertion-group, until-deposit-date, award.funder, until-accepted-date, from-online-pub-date, until-online-pub-date, has-archive, license.url, orcid, type-name, isbn, full-text.type, has-relation, from-print-pub-date, until-created-date, from-update-date, has-clinical-trial-number, has-references, content-domain, doi, award.number, until-issued-date, has-license, issn, alternative_id, group-title, relation.object, is-update, container-title, directory, category-name, funder, from-accepted_date, has-funder-doi, update-type, updates, from-issued-date
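Note that the valid-filter list in the error message contains `from-accepted_date` with an underscore, which looks like a typo in the library's filter table rather than a missing API feature. As a workaround, one can call the REST API directly; the filter name `from-accepted-date` here is an assumption based on the public Crossref API documentation:

```python
from urllib.parse import urlencode

# Build the /works request by hand, bypassing the library's filter validation.
params = {
    "query.author": "aguiam",
    "filter": "from-accepted-date:2001",  # dash-separated REST API filter name
    "sort": "published",
}
url = "https://api.crossref.org/works?" + urlencode(params)
# import requests
# response = requests.get(url).json()
print(url)
```

This sketch only builds the URL; the commented requests call shows how it would be fetched.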
When trying to retrieve information via simple queries, I consistently got outputs that I did not expect. Specifically, the publications which are referred to by the keywords are not returned in the result of the query. I do however get a return with the right publication data via a manual HTTP GET request.
Example code:
from crossref.restful import Works

keyword = 'Albert Einstein Elektrodynamik bewegter Körper'
works = Works()
result = works.query(keyword)
for entry in result:
    print(entry)
    break
>> {'indexed': {'date-parts': [[2019, 11, 19]], 'date-time': '2019-11-19T19:11:52Z', 'timestamp': 1574190712445}, 'reference-count': 0, 'publisher': 'Maney Publishing', 'issue': '1', 'content-domain': {'domain': [], 'crossmark-restriction': False}, 'short-container-title': ['Journal of the American Institute for Conservation'], 'published-print': {'date-parts': [[1980]]}, 'DOI': '10.2307/3179679', 'type': 'journal-article', 'created': {'date-parts': [[2006, 4, 18]], 'date-time': '2006-04-18T05:15:34Z', 'timestamp': 1145337334000}, 'page': '21', 'source': 'Crossref', 'is-referenced-by-count': 0, 'title': ['A Semi-Rigid Transparent Support for Paintings Which Have Both Inscriptions on Their Fabric Reverse and Acute Planar Distortions'], 'prefix': '10.1179', 'volume': '20', 'author': [{'given': 'Albert', 'family': 'Albano', 'sequence': 'first', 'affiliation': []}], 'member': '138', 'container-title': ['Journal of the American Institute for Conservation'], 'deposited': {'date-parts': [[2015, 6, 26]], 'date-time': '2015-06-26T01:05:23Z', 'timestamp': 1435280723000}, 'score': 4.5581737, 'issued': {'date-parts': [[1980]]}, 'references-count': 0, 'journal-issue': {'published-print': {'date-parts': [[1980]]}, 'issue': '1'}, 'URL': 'http://dx.doi.org/10.2307/3179679', 'ISSN': ['0197-1360'], 'issn-type': [{'value': '0197-1360', 'type': 'print'}]}
I get this kind of output, which has nothing to do with my input keyword, with other keywords too. I have tried modifying the order of the result (result.order('desc')), but that does not seem to change anything.
When I then do the same request via HTTP GET and the normal API URL, I get the expected output as the first result:
import requests

keyword = 'Albert Einstein Elektrodynamik bewegter Körper'
keyword = '+'.join(keyword.split())
url = 'https://api.crossref.org/works?query=' + keyword
result = requests.get(url=url)
# Take the first result
result = result.json()['message']['items'][0]
print(result)
>> {'indexed': {'date-parts': [[2020, 5, 25]], 'date-time': '2020-05-25T14:23:45Z', 'timestamp': 1590416625775}, 'publisher-location': 'Wiesbaden', 'reference-count': 0, 'publisher': 'Vieweg+Teubner Verlag', 'isbn-type': [{'value': '9783663193722', 'type': 'print'}, {'value': '9783663195108', 'type': 'electronic'}], 'content-domain': {'domain': [], 'crossmark-restriction': False}, 'published-print': {'date-parts': [[1923]]}, 'DOI': '10.1007/978-3-663-19510-8_3', 'type': 'book-chapter', 'created': {'date-parts': [[2013, 12, 6]], 'date-time': '2013-12-06T02:08:43Z', 'timestamp': 1386295723000}, 'page': '26-50', 'source': 'Crossref', 'is-referenced-by-count': 5, 'title': ['Zur Elektrodynamik bewegter Körper'], 'prefix': '10.1007', 'author': [{'given': 'A.', 'family': 'Einstein', 'sequence': 'first', 'affiliation': []}], 'member': '297', 'container-title': ['Das Relativitätsprinzip'], 'link': [{'URL': 'http://link.springer.com/content/pdf/10.1007/978-3-663-19510-8_3', 'content-type': 'unspecified', 'content-version': 'vor', 'intended-application': 'similarity-checking'}], 'deposited': {'date-parts': [[2013, 12, 6]], 'date-time': '2013-12-06T02:08:45Z', 'timestamp': 1386295725000}, 'score': 53.638336, 'issued': {'date-parts': [[1923]]}, 'ISBN': ['9783663193722', '9783663195108'], 'references-count': 0, 'URL': 'http://dx.doi.org/10.1007/978-3-663-19510-8_3'}
The output that I retrieved with the tool in this repository has nothing to do with my query keyword. Do you have an idea of how I can fix this? I would be very grateful for any help.
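One observation from the two JSON dumps above: the library's first item has score 4.5581737 while the manual GET's first item has score 53.638336, so the library result does not appear to be ordered by relevance. A client-side workaround (a sketch, not a library feature) is to re-rank fetched items by their score field:

```python
def rank_by_score(items):
    """Sort Crossref work records by their relevance 'score', highest first."""
    return sorted(items, key=lambda item: item.get("score", 0.0), reverse=True)

# Scores taken from the two responses shown above.
items = [
    {"DOI": "10.2307/3179679", "score": 4.5581737},
    {"DOI": "10.1007/978-3-663-19510-8_3", "score": 53.638336},
]
best = rank_by_score(items)[0]
print(best["DOI"])  # 10.1007/978-3-663-19510-8_3, the Einstein chapter
```

In practice one would fetch a bounded number of items first and then re-rank them locally.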
I'm interested in using this project to do deposits into crossref. Could you add an example of how to use the Depositor class?
Also, does the Depositor support resource-only deposits?
Thanks!
The members route now supports a few filters, so this should work:
Members().filter(has_public_references="True")
Filter before sample does not seem to work, but it does not complain either:
>>> works.filter(type='book').sample(10).url
https://api.crossref.org/works?sample=10
Sample before filter works as expected:
>>> works.sample(10).filter(type='book').url
https://api.crossref.org/works?sample=10&filter=type%3Abook
We have added some new header and parameter support for providing contact information. This is designed to help us troubleshoot problems with the API. See the section on etiquette
at api.crossref.org. Would love to see support for this.
Great package!
It would be nice to add support for proxies.
The requests.get method accepts proxies (in the form of a dictionary):
dict_proxies = {
    'https': 'https://username:password@HOST:PORT',
    'http': 'http://username:password@HOST:PORT',
}
requests.get(url, proxies=dict_proxies)
I'm using the filter (for i in works.filter(...)) and selecting a journal (using ISSN) and a one-month date window (using from-pub-date and until-pub-date) to gather articles from the same issue/volume of the journal. This seems to work, but when there are more than 1000 records I receive the error "Expecting value: line 1 column 1 (char 0)" when using json.dumps(i), and it kills my code.
I can't figure out why this is happening. Any ideas?
Cool library. However, I find it a pity that only sample(n) is supported for limiting results.
It would be nice if you could also support the rows query parameter to control the number of results returned by a query. This would also help limit the load for common use cases like "find the best (or best n) match based on title and author".
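Until the library grows a rows option, the underlying REST API accepts rows directly (parameter name from the public Crossref API docs). A sketch that builds such a request; the commented fetch shows hypothetical usage:

```python
from urllib.parse import urlencode

def works_url(query, rows):
    """Build a Crossref /works URL limited to `rows` results."""
    return "https://api.crossref.org/works?" + urlencode({"query": query, "rows": rows})

url = works_url("zika", 5)
print(url)  # https://api.crossref.org/works?query=zika&rows=5
# import requests
# items = requests.get(url).json()["message"]["items"]  # at most 5 items
```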
Could one use crossrefapi to construct an author search with both first and last name?
I have tried that a bit, but the results are very imprecise.
The funders route also now supports a location
filter. Would be great to see support for:
Funders().filter(location="Japan")
It would be useful if you could set the API URL for Crossref API requests.
For example, as we are testing, it would be good to make requests to test.crossref.org instead of api.crossref.org so that we are not testing against the production site.
Thanks!
If I run the code below, it gives me 0 results, which is expected:
journals.works('1946-3944').filter(type='journal-article').filter(from_created_date='2021-11-05').filter(until_created_date='2021-11-05').count()
But when I run the same code with all(), it gives me some unexpected results:
journals.works('1946-3944').filter(type='journal-article').filter(from_created_date='2021-11-05').filter(until_created_date='2021-11-05').all()
If I iterate over the results, I get some seemingly random results, even though this code should return an empty array.
I am a new user. When I use the syntax
w1 = works.query(title='zika')
it returns
TypeError: list indices must be integers or slices, not str
But it is OK when I query by author and other fields. Any ideas?
I have got 10,000+ Wiley DOIs via the Crossref API (works.query() without has_abstract=true, because it returns 0 results with has_abstract=true) and tried two different ways to fetch abstracts, but they don't work.
If I try to fetch abstracts via the Wiley API, it downloads the full texts (PDF), which wastes a lot of time in parsing.
So how can I get abstracts via Crossref, or does Crossref not have abstracts for these DOIs?
Thanks for your help!
These are my approaches to fetch abstracts:
1. The from crossref.restful import Works API:
2. The "requests" tool against the Crossref URL:
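When Crossref does hold an abstract for a DOI, it appears under the "abstract" key of the work record (typically as JATS XML); whether that key exists depends on what the publisher deposited, and many Wiley records simply lack it. A sketch, with the DOI and crossrefapi usage in the comments being hypothetical:

```python
def get_abstract(record):
    """Return the deposited abstract from a Crossref work record, or None."""
    return record.get("abstract")

# Hypothetical usage:
# from crossref.restful import Works
# record = Works().doi("10.1000/example")
# print(get_abstract(record))

sample = {"DOI": "10.1000/example", "abstract": "<jats:p>Some abstract.</jats:p>"}
print(get_abstract(sample))
```

If get_abstract returns None, the abstract was never deposited with Crossref and no query will surface it.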
I installed crossrefapi using pip, and it seemed to install fine. When I attempted to use it, I got the following error: ModuleNotFoundError: No module named 'crossref.restful'; 'crossref' is not a package
Here was my code:
from crossref.restful import Journals
journals = Journals()
print(journals.journal('1759-3441'))
Hi @fabiobatalha,
It seems the Crossref API has made some modifications to its header format.
{'date': 'Mon, 26 Jul 2021 11:41:59 GMT', 'content-type': 'application/json', 'transfer-encoding': 'chunked', 'access-control-allow-origin': '*', 'access-control-allow-headers': 'X-Requested-With', 'vary': 'Accept-Encoding', 'content-encoding': 'gzip', 'server': 'Jetty(9.4.40.v20210413)', 'x-ratelimit-limit': '50', 'x-ratelimit-interval': '1s', 'x-rate-limit-limit': '50, 50', 'x-rate-limit-interval': '1s, 1s', 'permissions-policy': 'interest-cohort=()', 'connection': 'close'}
'x-rate-limit-limit': '50, 50', 'x-rate-limit-interval': '1s, 1s',
Running the code below
from crossref.restful import Works

works = Works()
w1 = works.query('zika').sample(20)
for item in w1:
    print(item["title"])
is giving the following error:
Traceback (most recent call last):
File "/home/ankush/.config/JetBrains/PyCharm2021.1/scratches/crossref_scratch.py", line 6, in <module>
for item in w1:
File "/media/ankush/ContinentalGroun/workplace/open_source/crossrefapi/crossref/restful.py", line 264, in __iter__
result = self.do_http_request(
File "/media/ankush/ContinentalGroun/workplace/open_source/crossrefapi/crossref/restful.py", line 80, in do_http_request
self._update_rate_limits(result.headers)
File "/media/ankush/ContinentalGroun/workplace/open_source/crossrefapi/crossref/restful.py", line 43, in _update_rate_limits
self.rate_limits['X-Rate-Limit-Limit'] = int(headers.get('X-Rate-Limit-Limit', 50))
ValueError: invalid literal for int() with base 10: '50, 50'
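The crash comes from calling int() on the now comma-separated header value. A defensive parse that takes the first comma-separated token would tolerate both the old and the new format; this is a sketch of a possible fix, not the library's actual code:

```python
def parse_rate_limit(headers, name="X-Rate-Limit-Limit", default=50):
    """Parse a rate-limit header that may contain duplicated values like '50, 50'."""
    raw = str(headers.get(name, default))
    return int(raw.split(",")[0].strip())

print(parse_rate_limit({"X-Rate-Limit-Limit": "50, 50"}))  # 50
```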
We have added a select parameter that allows finer control of response sizes. The following, for example, will only return the DOI and title for each matching record.
http://api.crossref.org/works?sample=10&select=DOI,title
I sometimes get timeout errors when searching for DOIs:
Traceback (most recent call last):
File "C:\Users\delap\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\models.py", line 910, in json
return complexjson.loads(self.text, **kwargs)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3568.0_x64__qbz5n2kfra8p0\lib\json\__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3568.0_x64__qbz5n2kfra8p0\lib\json\decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3568.0_x64__qbz5n2kfra8p0\lib\json\decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\Projets\h-transport-materials-dashboard\test.py", line 6, in <module>
works.doi("10.1103/PhysRevB.4.330")
File "C:\Users\delap\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\crossref\restful.py", line 957, in doi
result = result.json()
File "C:\Users\delap\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\models.py", line 917, in json
raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: [Errno Expecting value] <html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
</body>
</html>
: 0
This is rather new; I haven't experienced it before.
Here's the code to reproduce:
from crossref.restful import Works
works = Works()
works.doi("10.1103/PhysRevB.4.330")
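Transient 504s like the one above can be worked around with a small retry-with-backoff wrapper around the call. A sketch with hypothetical parameters; the crossrefapi usage in the comments is an assumption:

```python
import time

def with_retries(fn, attempts=3, delay=1.0, backoff=2.0, exceptions=(Exception,)):
    """Call fn(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(delay)
            delay *= backoff

# Hypothetical usage:
# import requests
# from crossref.restful import Works
# works = Works()
# record = with_retries(lambda: works.doi("10.1103/PhysRevB.4.330"),
#                       exceptions=(requests.exceptions.JSONDecodeError,))
```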
I realise this is more of a general question, but I hope I can still get some help.
I would like to get the DOIs of a list of unstructured citations (somehow, similar to this issue).
However, if I run:
from crossref.restful import Works

works = Works()
unstructured_citation = "Jan Hansen, Jochen Hung, Jaroslav Ira, Judit " \
                        "Klement, Sylvain Lesage, Juan Luis Simal and " \
                        "Andrew Tompkins (eds), The European Experience: " \
                        "A Multi-Perspective History of Modern Europe. " \
                        "Cambridge, UK: Open Book Publishers, 2023."
# Pass the variable itself, not the literal string "unstructured_citation"
work = works.query(bibliographic=unstructured_citation).sort("relevance")
I get a huge number of results in the variable work (some of which are not even related).
What am I missing? Is the bibliographic argument meant to be used for work titles only? Should I try to extract the work titles from the raw citations and then use them as part of the query?
Thank you!
crossrefapi/crossref/restful.py
Line 29 in 2660c25
This value doesn't appear to be used anywhere.
Also, did you mean "tuning"? ("tunning" is not a recognised English word.)
I know the exact title of a paper; how can I print its metadata, like the author names?
(My plan is to automatically get information for more than 80 papers for which I only know the titles.)
Does anyone have some ideas?
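One approach (a sketch, not a library feature) is to query with the known title and accept the first record whose title matches after normalisation, then read its metadata fields:

```python
def titles_match(a, b):
    """Compare two titles ignoring case and surrounding whitespace."""
    return a.strip().lower() == b.strip().lower()

def best_match(items, title):
    """Return the first Crossref record whose first title equals `title`."""
    for item in items:
        candidates = item.get("title") or []
        if candidates and titles_match(candidates[0], title):
            return item
    return None

# Hypothetical usage:
# from crossref.restful import Works
# items = Works().query(bibliographic=known_title)
# record = best_match(items, known_title)
# if record:
#     print(record["author"])  # list of {'given': ..., 'family': ...} dicts
```

Combined with a bound on how many items are consumed, this scales to a list of 80 titles.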
Great library. But I just discovered that sample doesn't work when combined with a filter.
w = Works().filter(type='journal-article').sample(5).url
w
returns
'https://api.crossref.org/works?sample=5'
Would expect something like this:
http://api.crossref.org/works?filter=type:journal-article&sample=5
Hi,
I checked the package from different computers and consistently get:
ReadTimeout: HTTPSConnectionPool(host='api.crossref.org', port=443): Read timed out. (read timeout=10).
I was wondering if the library allows fetching only works that have an abstract, and if there is a way to fetch the abstract.
I want to read 200 results in 50-result chunks. The reading is not done concurrently, but sometimes with other requests in between. How do I tell the API to give me the next 50 results (51 to 100)?
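The REST API itself supports offset-based paging via the rows and offset parameters (per the public Crossref docs; for very deep paging Crossref recommends cursors instead), so results 51-100 are rows=50&offset=50. A sketch:

```python
def page_params(page, per_page=50):
    """Query parameters for the given zero-based page of results."""
    return {"rows": per_page, "offset": page * per_page}

print(page_params(1))  # {'rows': 50, 'offset': 50} -> results 51 to 100
# import requests
# resp = requests.get("https://api.crossref.org/works",
#                     params={"query": "zika", **page_params(1)})
```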
Are results ranked in a fixed order? If I search the same keywords with a different number in sample(), how can I get different results the second time?
For example, works.query(bibliographic=key_words).sample(10) gets 10 results.
Then with works.query(bibliographic=key_words).sample(20), how do I get 20 new results instead of the 10 old ones plus 10 new ones?
Thanks!
Hi,
Thank you for maintaining one of the documented libraries for using the Crossref REST API.
We’ve been working on a new version of the REST API, replacing the Solr backend with Elasticsearch and moving from our own hardware in a datacenter to a cloud platform.
We plan to cut over to the new version shortly (expect an official announcement on our blog in the next few days with more details), and wanted to invite you to test it out before the official cutover.
Please check it out at https://api.production.crossref.org/
During the cutover phase (expected to last a few weeks), traffic will be redirected to the above domain on a pool by pool basis. Once all traffic is using the new service, we will continue to use the api.crossref.org
domain, so please do not update anything to use the temporary domain.
Let me know if you have any questions. Issues can be filed into our GitLab issue repository, or I’ll keep an eye on this thread.
Thanks again,
Patrick
When I search for a DOI, the affiliations for the authors are missing. Example:
from crossref.restful import Works
import json
doi = "10.1016/j.jbusvent.2019.105970"
works = Works()
res = works.doi(doi)
with open("doi.json","w", encoding="utf8") as fileh:
json.dump(res, fileh, ensure_ascii=False, indent=4, sort_keys=True)
The relevant rows in the output file doi.json are:
"author": [
{
"affiliation": [],
"family": "Douglas",
"given": "Evan J.",
"sequence": "first"
},
{
"affiliation": [],
"family": "Shepherd",
"given": "Dean A.",
"sequence": "additional"
},
{
"affiliation": [],
"family": "Prentice",
"given": "Catherine",
"sequence": "additional"
}
],
with the affiliations of the authors missing.
When I import crossref.restful, I get an error:
from crossref.restful import Works
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "crossref.py", line 1, in <module>
    from crossref.restful import Works
ImportError: No module named restful
Does this API support passing metadata plus an API token?
How do I query with multiple words? I used "+", but the results I get from the Crossref website are different from those from crossrefapi.
Hi,
https://github.com/CrossRef/rest-api-doc#etiquette says that in order to get into the "polite pool" I have to use HTTPS and include a mailto parameter in the query or in the user agent.
How do I do this with this lib?
Thank you
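Per the etiquette doc linked above, the polite pool only needs HTTPS plus a mailto, either as a query parameter or embedded in the User-Agent. Recent versions of this library expose an Etiquette helper for this (worth checking the README); with plain requests the same effect can be sketched like so, where the app name, version, URL, and email are placeholder values:

```python
def polite_headers(app_name, version, url, email):
    """Build a User-Agent that identifies the caller, per Crossref etiquette."""
    return {"User-Agent": f"{app_name}/{version} ({url}; mailto:{email})"}

headers = polite_headers("MyApp", "0.1", "https://example.org", "me@example.org")
print(headers["User-Agent"])
# import requests
# resp = requests.get("https://api.crossref.org/works",
#                     params={"query": "zika", "mailto": "me@example.org"},
#                     headers=headers)
```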
Does this library automatically apply throttling to comply with Crossref rate limits? I.e., as an extreme example, if a non-Plus caller invokes doi() 100 times a second, would crossrefapi throttle the outgoing requests and make the caller wait so as not to exceed Crossref API limits?
If not, is there a way for the caller to programmatically find out the rate limit currently in effect and throttle its doi() invocations accordingly?
See also: https://api.crossref.org/swagger-ui/index.html. It seems that Crossref signals current rate limits using HTTP headers.
Presumably, complying with rate limits is preferable and guarantees not running into any further limiting.
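Whether or not the library throttles internally, a caller can enforce a client-side ceiling derived from the x-rate-limit headers mentioned above. A minimal sketch with an injected clock so the pacing logic stays testable; the usage comments and the 50 req/s figure are assumptions taken from the headers quoted earlier:

```python
class Throttle:
    """Paces calls so at most max_requests happen per interval seconds."""

    def __init__(self, max_requests, interval=1.0):
        self.min_gap = interval / max_requests
        self.next_allowed = 0.0

    def delay(self, now):
        """Seconds the caller should sleep before issuing the next request."""
        wait = max(0.0, self.next_allowed - now)
        self.next_allowed = max(now, self.next_allowed) + self.min_gap
        return wait

# Hypothetical usage around works.doi():
# import time
# throttle = Throttle(max_requests=50, interval=1.0)  # from x-rate-limit headers
# for doi in dois:
#     time.sleep(throttle.delay(time.monotonic()))
#     record = works.doi(doi)
```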
Hi there, great work!!!
I am a first-time user, trying to wrap my head around downloading a PDF based on its DOI, something like
works.doi.download('10.1590/0102-311x00133115', '~/Downloads/')
that would result in the PDF landing in my Downloads folder.
I guess this is something very simple; however, I could not find any example. Would you please provide one, perhaps even as a Wiki entry?
Many thanks & Merry Christmas,
Stav
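There is no built-in download helper, but Crossref metadata sometimes carries publisher-deposited full-text URLs under the 'link' field (visible in the JSON response quoted earlier in this thread). One can extract those URLs and fetch them with requests, subject to publisher access rules. A sketch; the sample record reuses the Springer link from that earlier response:

```python
def full_text_links(record, content_type=None):
    """Extract full-text URLs from a Crossref work record's 'link' field."""
    links = record.get("link") or []
    return [entry["URL"] for entry in links
            if content_type is None or entry.get("content-type") == content_type]

record = {"link": [{"URL": "http://link.springer.com/content/pdf/10.1007/978-3-663-19510-8_3",
                    "content-type": "unspecified"}]}
print(full_text_links(record))
# import requests
# data = requests.get(full_text_links(record)[0]).content  # access permitting
# with open("paper.pdf", "wb") as fh:
#     fh.write(data)
```

Note that many records have no 'link' entries at all, in which case the list is empty and the PDF must come from elsewhere.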