kousu / isi
Tools for scraping the Thomson Reuters (aka ISI) Web of Science
License: MIT License
ProQuest has a different login flow than most academic publishers. The ezproxy config for it is a bit complicated. It has "patron login" and "SSO login". I was able to try out the SSO mode and discovered that after a couple rounds of redirects between it and Ebook Central, it constructs this URL:
f"https://ebookcentral.proquest.com/lib/{partner}/SignInPartnerUser?ebrary_username={partner}_{username}&partner_key={key}"
partner and key seem to be an institutional login to ProQuest's database, partner being a codeword for the relaying site and key being the corresponding password; username seems to be arbitrary -- it's there to pass to Ebook Central the human username for logging, but is otherwise ignored. If partner and key are good, ebookcentral generates guest session cookies (JSESSIONID, EBSESSIONID and EBUQUSER; the latter two are always equal) and gives them back to the calling user. After that point, the calling user talks to ebookcentral.proquest.com directly, without going through the proxy.
For bibliotecavirtual.uis.edu.co the codeword is partner="bibliouissp" and for librarylogin-um.suagm.edu it is partner="ebooksumet-ebooks", for example.
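As a rough sketch of the URL construction described above (not what ezproxy.py currently does): the function name sign_in_partner_url and the placeholder username and key are made up for illustration, and only the URL shape and the partner codeword come from the observations above.

```python
from urllib.parse import urlencode

EBOOK_CENTRAL = "https://ebookcentral.proquest.com"


def sign_in_partner_url(partner: str, username: str, key: str) -> str:
    """Build the SignInPartnerUser URL that the SSO flow ends up redirecting to.

    Hypothetical helper: `partner` is the relaying site's codeword, `key` its
    password, and `username` appears to be used for logging only.
    """
    query = urlencode({
        "ebrary_username": f"{partner}_{username}",
        "partner_key": key,
    })
    return f"{EBOOK_CENTRAL}/lib/{partner}/SignInPartnerUser?{query}"


# The key is a placeholder; a real one is only handed out by the ezproxy redirect.
url = sign_in_partner_url("bibliouissp", "someuser", "PLACEHOLDER_KEY")
print(url)
```

Using urlencode rather than plain string formatting means a key containing characters like & or = would not break the query string.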
Unlike OAuth or Kerberos, there is no public key scheme or backchannel communication from the auth server ezproxy.whatever.net to the target server ebookcentral.proquest.com. The authorization step happens entirely by ezproxy passing a key in an HTTP Location redirect.
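Since everything needed is carried in that Location header, a special case could probably pull the pieces back out of it. A minimal sketch, assuming the redirect target has exactly the SignInPartnerUser shape shown earlier; parse_partner_redirect is a hypothetical name and the example values are placeholders.

```python
from urllib.parse import parse_qs, urlsplit


def parse_partner_redirect(location: str) -> dict:
    """Extract partner, username and key from a SignInPartnerUser redirect URL.

    Assumes the path looks like /lib/<partner>/SignInPartnerUser and that the
    query carries ebrary_username=<partner>_<username> and partner_key=<key>.
    """
    parts = urlsplit(location)
    segments = parts.path.strip("/").split("/")
    if len(segments) != 3 or segments[0] != "lib" or segments[2] != "SignInPartnerUser":
        raise ValueError(f"not a SignInPartnerUser URL: {location!r}")

    partner = segments[1]
    query = parse_qs(parts.query)
    ebrary_username = query["ebrary_username"][0]
    key = query["partner_key"][0]

    # Strip the "<partner>_" prefix to recover the human username, if present.
    prefix = partner + "_"
    if ebrary_username.startswith(prefix):
        username = ebrary_username[len(prefix):]
    else:
        username = ebrary_username
    return {"partner": partner, "username": username, "key": key}


info = parse_partner_redirect(
    "https://ebookcentral.proquest.com/lib/bibliouissp/SignInPartnerUser"
    "?ebrary_username=bibliouissp_someuser&partner_key=PLACEHOLDER_KEY"
)
print(info)
```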
Since this situation is special-cased in ezproxy, we need to special-case it in ezproxy.py.
Further work would be:
Similar to ProQuest (!8), http://app.knovel.com/web/ uses some sort of SSO process which leaves the user connecting to that site directly but with a login cookie authorized by the relaying ezproxy.
Reverse engineer enough of this to support it in ezproxy.py.
ezproxy can run in two modes: by port or by hostname. In port mode, https://ezproxy.example.com:$port proxies to https://paywalled-site.net. In hostname mode, https://paywalled-site.ezproxy.example.com proxies to paywalled-site.net. See https://help.oclc.org/Library_Management/EZproxy/Get_started/Evaluate_proxy_by_port_versus_proxy_by_hostname.
ezproxy.py is currently only compatible with by-hostname mode, and I'm not 100% convinced it even does that right, because it assumes that the login page is at https://login.ezproxy.example.com/login.
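One way to loosen that assumption would be to make the login host prefix a parameter instead of hard-coding it. A sketch only: login_url and login_prefix are hypothetical names, and the ?url= starting-point form is my assumption about how the target gets passed to the login page; it is not taken from the current code.

```python
def login_url(proxy_host: str, target_url: str, login_prefix: str = "login.") -> str:
    """Build the ezproxy login URL for a given target.

    `login_prefix` defaults to "login.", which is the by-hostname assumption
    described above; pass "" if the login page lives on the bare proxy host.
    The target is appended unencoded after ?url= (an assumption).
    """
    return f"https://{login_prefix}{proxy_host}/login?url={target_url}"


# By-hostname assumption: login page on the login. subdomain.
by_hostname = login_url("ezproxy.example.com", "https://paywalled-site.net")

# A deployment whose login page is on the proxy host itself.
bare_host = login_url("ezproxy.example.com", "https://paywalled-site.net", login_prefix="")

print(by_hostname)
print(bare_host)
```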
The relevant lines are around lines 151 to 160 and lines 102 to 105 in 73a3577.
ISI does have an internal accounts system--they don't just sell themselves through library proxies.
I have no way to test this. If anyone does actually have an ISI account not via a research institution, please get in touch.
I've fallen out of love with inheritance trees and mixins. I would rather make the ezproxy instances I create return a fresh object, with no parent class, that fits the requests.Session API but is not a Session.
I think the easiest way to do this is to override __getattribute__ to proxy calls to requests.Session, except for those we explicitly define.
... but maybe that's just reimplementing Python's inheritance lookup rules all over again? I'm not sure. Anyway, it'll be a good exercise to try.
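A minimal sketch of the delegation idea, with two loud caveats: it uses __getattr__ (only called when normal lookup fails) instead of intercepting every lookup, and it wraps a stand-in FakeSession so it runs without requests installed. ProxiedSession, FakeSession and the URL rewriting in get are all made up for illustration.

```python
class FakeSession:
    """Stand-in for requests.Session so the sketch is self-contained."""

    def __init__(self):
        self.headers = {"User-Agent": "demo"}

    def get(self, url):
        return f"GET {url}"

    def close(self):
        return "closed"


class ProxiedSession:
    """Has no parent class, but forwards unknown attributes to a wrapped session."""

    def __init__(self, session, proxy_host):
        self._session = session
        self._proxy_host = proxy_host

    def get(self, url):
        # Explicitly defined: rewrite the URL, then hand off to the real session.
        return self._session.get(f"https://login.{self._proxy_host}/login?url={url}")

    def __getattr__(self, name):
        # Only reached when normal lookup fails. Refuse private names so a
        # half-initialised object cannot recurse looking for _session.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._session, name)


fake = FakeSession()
proxied = ProxiedSession(fake, "ezproxy.example.com")

print(proxied.get("https://paywalled-site.net"))  # goes through our override
print(proxied.close())                            # delegated to the wrapped session
print(isinstance(proxied, FakeSession))           # not a subclass
```

Note that this only forwards reads: assigning proxied.headers = {...} would set an attribute on the wrapper rather than on the wrapped session, so a fuller version would probably also need __setattr__.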