sidewall's Introduction

Sidewall

Sidewall is a package for interacting with the Dimensions search API. It provides object classes for Dimensions entities, fetches data incrementally, caches results, copes with rate limits, and more, to make working with Dimensions in Python more natural. "Sidewall" is a loose acronym for Simple Dimensions wrapper client library.

Authors: Michael Hucka
Repository: https://github.com/caltechlibrary/sidewall
License: BSD/MIT derivative – see the LICENSE file for more information

๐Ÿ Log of recent changes

Version 1.0.1: This is a significant bug-fix release.

  • Fixed serious bugs in creating Researcher objects from Author objects.
  • Fixed bugs in setting current_organization on Person, Author and Researcher objects.
  • Fixed bugs in setting affiliations on Researcher objects derived from Author objects.
  • Updated examples in the top-level README file
  • Started a CHANGES file

☀ Introduction

Dimensions offers a networked API and search language (the DSL). However, interacting with the DSL currently requires sending a search string to the Dimensions server, then interpreting the JSON results and handling various issues such as iterating to obtain more than 1000 values (which requires the use of multiple queries), staying within API rate limits, and more. Sidewall ("Simple Dimensions wrapper client library") provides a higher-level interface for working more conveniently with the Dimensions DSL and network API. Features of Sidewall include:

  • object classes for different Dimensions data entities
  • lazy object values filled in automatically behind the scenes
  • results iterator fetches data over the net as needed
  • automatic caching of search results for speed and efficiency
  • automatic throttling to keep within API rate limits

✺ Installation instructions

The following is probably the simplest and most direct way to install this software on your computer:

sudo python3 -m pip install git+https://github.com/caltechlibrary/sidewall.git --upgrade

Alternatively, you can clone this GitHub repository and then run setup.py:

git clone https://github.com/caltechlibrary/sidewall.git
cd sidewall
sudo python3 -m pip install . --upgrade

▶ Using Sidewall

Sidewall is meant to be used from other programs; it does not provide a standalone command-line interface or graphical user interface. At this time, Sidewall only supports certain kinds of Dimensions queries as discussed below.

Basic setup and use

To use Sidewall, import the package and the symbol dimensions in your Python code:

import sidewall
from sidewall import dimensions

In case of problems, it may be useful to turn on debugging in Sidewall to see everything that is happening behind the scenes. You can do that by using set_debug() after importing Sidewall:

sidewall.set_debug(True)

To run queries, you first need an account with Dimensions. There are multiple ways of supplying user credentials to Sidewall. The most secure and convenient way is to invoke the login() method without any arguments:

dimensions.login()

When done this way, Sidewall will use the operating system's keyring/keychain functionality (via keyring) to get the user name and password. If the information does not exist from a previous call to dimensions.login(), Sidewall will ask you for the user name and password interactively, and then store them in the keyring/keychain for next time.

If asking the user for credentials interactively on the command line is unsuitable for the application you are writing, you can also supply a user name and password to the login() method as keyword arguments:

dimensions.login(username = 'somelogin', password = 'somepassword')
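
For non-interactive programs, one option is to read the credentials from the environment rather than writing them into source code. The sketch below assumes two environment variable names, DIMENSIONS_USERNAME and DIMENSIONS_PASSWORD, which are invented for this example and are not defined by Sidewall. The helper takes the `dimensions` object as a parameter and only calls `login()` with the keyword arguments shown above.

```python
import os


def login_from_environment(dimensions, environ=os.environ):
    """Log in with credentials taken from environment variables.

    `dimensions` is the object imported with `from sidewall import dimensions`.
    DIMENSIONS_USERNAME and DIMENSIONS_PASSWORD are hypothetical variable
    names chosen for this example; Sidewall does not define them.
    """
    username = environ.get('DIMENSIONS_USERNAME')
    password = environ.get('DIMENSIONS_PASSWORD')
    if not username or not password:
        raise RuntimeError('Set DIMENSIONS_USERNAME and DIMENSIONS_PASSWORD first')
    dimensions.login(username = username, password = password)
```

A program would call `login_from_environment(dimensions)` once, before running any queries.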

Basic principles of running queries

Sidewall defines a method, query(), which you can use to run a search in Dimensions and get back results. The method takes a single argument, a string. Here is an example:

results = dimensions.query('search publications for "SBML" return publications')

The form of the search query string that Sidewall can use is limited in ways described shortly. The query() method returns an object that acts as a Python iterator: you can iterate over the results, use len(), and do other operations.

The items returned by the iterator will be Sidewall objects of the kind discussed in the section below on Data mappings. The specific classes of objects returned will correspond to the type of record expressed in the tail end of the query handed to query(). For example, a query that ends in return publications will produce Sidewall Publication objects; a query that ends in return researchers will produce Sidewall Researcher objects; and so on.

Sidewall currently puts the following limitations on the form of the query search string:

  • it must begin with search
  • it must end with return publications, return researchers, or return grants
  • it must only return a single type of thing (i.e., researchers or publications or grants)
  • it must not put facet specifiers or limits on the returned results
  • it must not use aggregation or other advanced DSL features
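
A program that builds query strings dynamically may want to check them against these rules before sending them. The helper below is a hypothetical sketch that covers only the first two rules (the leading `search` keyword and the permitted endings); it is not part of Sidewall, and Sidewall's own checks may differ.

```python
ALLOWED_ENDINGS = ('return publications', 'return researchers', 'return grants')


def check_query(query):
    """Return a list of problems found in a query string (empty if none).

    Only the first two documented rules are checked: the leading
    `search` keyword and the permitted `return ...` endings.
    """
    problems = []
    text = ' '.join(query.split())      # collapse runs of whitespace
    if not text.startswith('search '):
        problems.append('query must begin with "search"')
    if not text.endswith(ALLOWED_ENDINGS):
        problems.append('query must end with one of: ' + ', '.join(ALLOWED_ENDINGS))
    return problems
```

Because a trailing limit or facet specifier changes the end of the string, the ending check also catches queries such as one that ends in `return publications limit 10`.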

The following is a complete example of using Sidewall to search for publications containing the string "SBML", and then printing the year and DOI for each such publication found:

import sidewall
from sidewall import dimensions

dimensions.login()
results = dimensions.query('search publications for "SBML" return publications')

print('Total found: {}'.format(len(results)))
for pub in results:
    print('{}: {}'.format(pub.year, pub.doi))
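
Since the results object is an ordinary iterator, it can be passed to standard-library tools. As a small illustration, the helper below (a name invented for this example) tallies publications per year; it relies only on the `year` field used in the example above.

```python
from collections import Counter


def count_by_year(publications):
    """Count publications per year.

    `publications` can be any iterable of objects that have a `year`
    attribute, such as the results of dimensions.query(...).
    """
    return Counter(pub.year for pub in publications)
```

For instance, `count_by_year(dimensions.query('search publications for "SBML" return publications'))` would return a Counter keyed by year. Note that this consumes the iterator and fetches every result, which may require many requests for a large result set.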

Data mappings

Sidewall defines object classes such as Researcher, Publication, and a few others to represent the different types of entities returned as the results of a Dimensions search query. Sidewall's objects attempt to smooth over some of the confusing aspects of the data representations in Dimensions by providing single objects that consolidate different fields and facets of the same underlying "thing". Further, some fields of an object are not available from the particular Dimensions query performed by the user but may be available if a different kind of query is performed; Sidewall uses this knowledge in some cases to expand object field values automatically and behind the scenes as needed.

The following data classes are defined by Sidewall at this time; note that this is not all the types of data that Dimensions provides today, but future work may improve Sidewall's coverage.

  • Person, with subclasses Author and Researcher
  • Organization
  • Publication
  • Journal
  • Grant
  • several very simple objects: Category, City, Country, State

Person

Dimensions doesn't expose an underlying base class for people; instead, it returns unnamed data structures that basically refer to people in different contexts. Sidewall currently understands two such contexts: authors of publications (when a query uses return publications), and "researchers" (when a query uses return researchers or objects such as Grant contain "researchers" as a data field). Sidewall introduces a parent class called Person because the objects in these two contexts are so similar, and provides two derived classes: Author and Researcher. Both of the derived classes have the same fields. The distinction provided by the derived classes is necessary because the list of affiliations for an Author is relative to a particular publication and may not be all the affiliations that a person has. Thus, affiliations for authors must be understood in the context of a particular search for publications. The use of two classes indicates the context, so that callers can correctly interpret the list of affiliations.

           ┌──────────────┐
           │    Person    │
           └──────────────┘
                  ^
        ┌─────────┴──────────┐
┌───────┴──────┐      ┌──────┴───────┐
│    Author    │      │  Researcher  │
└──────────────┘      └──────────────┘

The following table describes the fields and how they relate to values returned from Dimensions:

| Field | Type | In return researchers? | In return publications? | In return grants? | Exp.? |
|---|---|---|---|---|---|
| affiliations | [Organization, ...] | via research_orgs | ✓ | ✓ | ✓ |
| current_organization | Organization | n | via current_organization_id | n | ✓ |
| first_name | string | ✓ | ✓ | ✓ | n |
| middle_name | string | n | n | ✓ | n |
| id | string | ✓ | as researcher_id | ✓ | n |
| last_name | string | ✓ | ✓ | ✓ | n |
| orcid | string | as orcid_id | ✓ | orcid_id | ✓ |
| role | string | n | n | ✓ | n |

("Exp." ⇒ filled or expanded by Sidewall via search if needed.)

The affiliations field in Sidewall's Person (and consequently Author and Researcher) is a list of Organization class objects (see below). Although affiliations as returned by Dimensions are sparse when using a query that ends with return researchers (they consist only of organization identifiers), Sidewall hides this by providing complete Organization objects for the affiliations field of a Person, and using behind-the-scenes queries to Dimensions to fill out the organization info when the object field values are accessed. Thus, calling programs do not need to do anything to get organization details in a result regardless of whether they use return publications or return researchers; Sidewall always provides Organization class objects and handles getting the field values automatically.

To make data access more uniform, Sidewall also replaces the field current_organization_id (which in Dimensions is a string, the identifier of an organization) with the field current_organization. Its value is an Organization object corresponding to the organization whose identifier is found in current_organization_id.

Author class objects are returned when returning publication results, and in those cases, the list of a person's affiliations will reflect their affiliations with respect to a particular publication. However, sometimes it's convenient to get more information about an author, such as the complete list of affiliations that Dimensions has for the person in question. Sidewall allows you to create a Researcher object out of an Author object for that reason. Here is an example to illustrate the differences between authors and researchers and how you can convert the former to the latter:

>>> import sidewall
>>> from sidewall import dimensions, Researcher
>>> dimensions.login()
>>> pubs = dimensions.query('search publications in title_only for "SBML" where year=2003 return publications')
>>> pub = next(pubs)
>>> author1 = pub.authors[0]
>>> author1
<Author ur.0665132124.52>
>>> author1.affiliations
[]
>>> researcher1 = Researcher(author1)
>>> researcher1.affiliations
[<Organization grid.20861.3d>, <Organization grid.10392.39>, <Organization grid.214458.e>]

Finally, note that the field role is present for Researcher objects listed only in the context of Grant results. Its value is not filled in other contexts.
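
The conversion shown above can be applied to every author of a publication. The helper below is a sketch with an invented name; it takes the class to convert with (expected to be `sidewall.Researcher`) as a parameter so that the example does not itself import Sidewall, and it assumes each author has a usable `id` value to serve as a dictionary key.

```python
def all_affiliations(publication, researcher_class):
    """Map each author of a publication to their complete affiliation list.

    `researcher_class` is expected to be sidewall.Researcher; it is passed
    in as a parameter so that this sketch does not itself import Sidewall.
    Each conversion may trigger additional queries to Dimensions.
    """
    result = {}
    for author in publication.authors:
        researcher = researcher_class(author)
        result[author.id] = list(researcher.affiliations)
    return result
```

With the session above, `all_affiliations(pub, Researcher)` would return a dictionary keyed by author identifier.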

Organization

Sidewall uses the object class Organization to represent an organization in results returned by Dimensions. In Sidewall, the set of fields possessed by an Organization is the union of all fields that Dimensions provides in different contexts for organizations. The following table describes the fields and how they relate to values returned from Dimensions:

| Field | Type | In "return research_orgs"? | In "return publications"? | Sidewall filled? |
|---|---|---|---|---|
| acronym | string | ✓ | n | ✓ |
| city | string | n | ✓ | n |
| city_id | string | n | ✓ | n |
| country | string | n | ✓ | n |
| country_code | string | n | ✓ | n |
| country_name | string | ✓ | n | ✓ |
| id | string | ✓ | ✓ | n |
| name | string | ✓ | ✓ | n |
| state | string | n | ✓ | n |
| state_code | string | n | ✓ | n |

Dimensions returns different field values in different contexts. For example, the information about organizations included in an author's affiliation list in a publication is somewhat different from what is provided if a search ending in return research_orgs is used. Sidewall makes the assumption that an organization with a given organization identifier ("grid id") is the same organization no matter the context in which it is mentioned in a search result, and so Sidewall smooths over the field differences and, as with Researcher and Author, queries Dimensions behind the scenes to get missing values when it can (and when they exist).
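
The general pattern of filling in missing values on first access can be illustrated with a toy class. This is only a sketch of the idea and not Sidewall's implementation: `LazyRecord`, `fetch_missing`, and the identifier and field values in the demonstration are all invented for the example, and the sketch uses Python's `__getattr__` hook for simplicity.

```python
class LazyRecord:
    """Toy object whose missing fields are fetched the first time they are read."""

    def __init__(self, record_id, known, fetch_missing):
        self.id = record_id
        self._fetch_missing = fetch_missing   # hypothetical: id -> dict of fields
        self._filled = False
        self.__dict__.update(known)           # fields supplied by the first query

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal attribute lookup fails,
        # so fields that are already set never trigger a fetch.
        if name.startswith('_') or self._filled:
            raise AttributeError(name)
        self._filled = True
        for key, value in self._fetch_missing(self.id).items():
            self.__dict__.setdefault(key, value)   # keep values already known
        try:
            return self.__dict__[name]
        except KeyError:
            raise AttributeError(name) from None


# Demonstration with a fake lookup function and made-up values.
calls = []


def fake_fetch(org_id):
    calls.append(org_id)
    return {'name': 'Example University', 'acronym': 'EU'}


org = LazyRecord('grid.0000.0', {'name': 'Example Univ.'}, fake_fetch)
```

Reading `org.name` uses the value already present and makes no request; reading `org.acronym` triggers exactly one call to the lookup function, and later reads reuse the stored values.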

Publication

The Publication object class is mostly unchanged from the Dimensions publication entity, but in Dimensions, different fields are exposed depending on the type of publication and whether fieldset modifiers are being used. (The available fieldsets for publications are basics, extras, and book.) Sidewall's Publication object class contains all possible fields, but the values of some fields may not be filled in depending on the type of publication in question. For example, journal articles will not have a value for book_doi. The following table describes the fields in Publication objects:

| Field | Type | In return publications? |
|---|---|---|
| altmetric | string | ✓ |
| authors | [Author, ...] | via author_affiliations |
| author_affiliations | [Author, ...] | via author_affiliations |
| book_doi | string | ✓ |
| book_series_title | string | ✓ |
| book_title | string | ✓ |
| date | string | ✓ |
| date_inserted | string | ✓ |
| doi | string | ✓ |
| field_citation_ratio | string | ✓ |
| id | string | ✓ |
| issn | string | ✓ |
| issue | string | ✓ |
| journal | Journal | ✓ |
| linkout | string | ✓ |
| mesh_terms | string | ✓ |
| open_access | string | ✓ |
| pages | string | ✓ |
| pmcid | string | ✓ |
| pmid | string | ✓ |
| proceedings_title | string | ✓ |
| publisher | string | ✓ |
| references | string | ✓ |
| relative_citation_ratio | string | ✓ |
| research_org_country_names | string | ✓ |
| research_org_state_names | string | ✓ |
| supporting_grant_ids | string | ✓ |
| times_cited | string | ✓ |
| title | string | ✓ |
| type | string | ✓ |
| volume | string | ✓ |
| year | string | ✓ |

Sidewall's Publication objects use a list of Author objects to represent authors, and introduce an alias called authors for the field author_affiliations. The alias exists for convenience and to make the structure of publication records more intuitive. (The name author_affiliations in the Dimensions data is potentially confusing because it suggests a list of organizations rather than a list of authors. Providing a field named authors removes this ambiguity.)
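
One common way to expose the same data under two names in Python is a read-only property. The toy class below illustrates that technique only; it is not taken from Sidewall's source, and `PublicationSketch` is a name invented for the example.

```python
class PublicationSketch:
    """Toy class showing one way to expose the same list under two names."""

    def __init__(self, author_affiliations):
        self.author_affiliations = author_affiliations

    @property
    def authors(self):
        # Read-only alias: always returns the same list object.
        return self.author_affiliations
```

With this arrangement, `pub.authors` and `pub.author_affiliations` always refer to the same list, so a change made through one name is visible through the other.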

Grant

The Grant object in Sidewall maps directly to the entity representing grants in Dimensions. The fields in Grant are all identical to the Dimensions results, and use lists of other objects where appropriate. For example, the funders field is created as a list of Organization objects.

| Field | Type |
|---|---|
| FOR | [Category, ...] |
| FOR_first | [Category, ...] |
| HRCS_HC | [Category, ...] |
| HRCS_RAC | [Category, ...] |
| RCDC | [Category, ...] |
| abstract | string |
| active_year | [int, ...] |
| date_inserted | string |
| end_date | string |
| funder_countries | [Country, ...] |
| funders | [Organization, ...] |
| funding_aud | float |
| funding_cad | float |
| funding_chf | float |
| funding_eur | float |
| funding_gbp | float |
| funding_jpy | float |
| funding_usd | float |
| funding_org_acronym | string |
| funding_org_city | string |
| funding_org_name | string |
| id | string |
| language | string |
| linkout | string |
| original_title | string |
| project_num | string |
| research_org_cities | [City, ...] |
| research_org_countries | [Country, ...] |
| research_org_name | string |
| research_org_state_codes | [State, ...] |
| research_orgs | [Organization, ...] |
| researchers | [Researcher, ...] |
| start_date | string |
| start_year | int |
| title | string |
| title_language | string |

The Dimensions data fields in grant entities have an anomaly in that funding_org_city is a string, but cities in another field (research_org_cities) are represented as structured objects. The Grant object in Sidewall does not smooth over this inconsistency in its current version, although perhaps it should in a future release.

Journal, Category, City, Country, State

Rounding out the classes implemented in Sidewall are a small number of very simple classes used to store data that Dimensions returns in structured form: Journal, Category, City, Country, State. They are all basically identical, each containing only two static fields having string values. In the case of Journal one of the fields is named differently (title versus name for the others). More specifically, Journal has the following form:

| Field | Type | In return publications? |
|---|---|---|
| id | string | ✓ |
| title | string | ✓ |

All of the other classes (Category, City, Country, State) have the following form:

| Field | Type |
|---|---|
| id | string |
| name | string |

Currently unsupported Dimensions data types

As of this version, Sidewall does not offer support for representing Dimensions policy and patent entities. This is purely due to resource constraints and not due to an inherent limitation in the Sidewall design. Future development could easily add new object classes to support these other data entities.

โ‡ Getting help and support

If you find an issue, please submit it in the GitHub issue tracker for this repository.

☺ Acknowledgments

The vector artwork of a car tire used as a logo for this repository was created by Flanker. It is licensed under the Creative Commons Attribution 3.0 Unported license.

Sidewall makes use of numerous open-source packages, without which it would have been effectively impossible to develop Sidewall with the resources we had. We want to acknowledge this debt. In alphabetical order, the packages are:

  • humanize – print numbers in a human-friendly format
  • keyring – access the system keyring service from Python
  • requests – an HTTP library for Python
  • setuptools – library for setup.py
  • urllib3 – HTTP client library for Python
  • validators – data validation package for Python

☮ Copyright and license

Copyright (C) 2019, Caltech. This software is freely distributed under a BSD/MIT type license. Please see the LICENSE file for more information.

sidewall's Issues

Use grants to fill out organization details

It looks like grant objects in Dimensions may provide the most detailed information about organizations. The Organization object should try to search for organizations in grants like this,

search grants where research_orgs.id="grid.20861.3d" return grants

and use the content of the research_orgs field to extract data about an organization to fill out the fields in the Organization object. (Currently, Organization uses a search against publications to get the data.)

The complication is that not all organizations may have received grants, so Organization will have to try two approaches (searching grants first, and falling back to publications if no grant is found). Unfortunately, the current architecture of Sidewall makes the assumption that there is only one fill-search option available. It will require changing the underlying architecture of the objects and the __getattribute__(...) method on DimensionCore objects. Preliminarily, I think the approach would be to change the _search_tmpl property on objects like Organization to be a list of searches to try in order of preference.
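
The "list of searches to try in order of preference" idea could look roughly like the following sketch. It is not a patch for Sidewall: `first_successful_search` and `run_query` are hypothetical names, and the publications fallback template is an assumed query form written for this example (only the grants template comes from the text above).

```python
def first_successful_search(templates, run_query, **values):
    """Try each query template in order and return the first non-empty result.

    `templates` are str.format-style strings; `run_query` is a hypothetical
    callable that takes a query string and returns a list of records.
    """
    for template in templates:
        results = run_query(template.format(**values))
        if results:
            return results
    return []


ORG_TEMPLATES = [
    'search grants where research_orgs.id="{id}" return grants',
    # Fallback query; its exact form is an assumption made for this example.
    'search publications where research_orgs.id="{id}" return publications',
]
```

A caller would use it as `first_successful_search(ORG_TEMPLATES, run_query, id="grid.20861.3d")`; the second query is sent only if the first returns nothing.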

Save API calls by using lists

Right now, filling in attribute values uses a one-call-one-item approach. However, I just realized today that it's possible to search for such things as researchers using a list syntax like this:

search publications where researchers.id in ["ur.015050223327.40", "ur.07402772253.38"] return researchers

This means that it would be possible to reduce the number of API calls in some situations by getting the results for multiple people at once. This would help avoid hitting the API rate limit as often.

The tricky part is figuring out in what situation you can group a call for data like this. In the current scheme of things, a fill operation is initiated when something accesses an attribute that has not yet been set, like the list of affiliations on a researcher. To do that for multiple researchers in one shot, you would need to know ahead of time that the user's code is going to access the fields on multiple objects, and know which specific objects they will be. That means some kind of predictive heuristic approach. This needs more thought.
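
Setting aside the question of when to batch, building the batched query strings themselves is straightforward. The generator below is a sketch using the list syntax quoted above; the function name and the default batch size are arbitrary choices for this example, not documented limits.

```python
def batched_researcher_queries(researcher_ids, batch_size=100):
    """Yield one query string per group of at most `batch_size` identifiers.

    The query form follows the list syntax quoted above. The default batch
    size is an arbitrary choice for this example, not a documented limit.
    """
    ids = list(researcher_ids)
    for start in range(0, len(ids), batch_size):
        quoted = ', '.join('"{}"'.format(i) for i in ids[start:start + batch_size])
        yield 'search publications where researchers.id in [{}] return researchers'.format(quoted)
```

With three identifiers and `batch_size=2`, this produces two queries instead of three separate ones.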

Add functionality to iterator returned for results

Currently, the queryresults object returned by dimensions.query(...) implements only the iterator protocol plus the ability to take len(...). It would be convenient if the results object could support slicing, and possibly other operations such as list(results). This wouldn't be very hard to implement, and the Python itertools package provides some features that could be useful in the implementation. The main design decisions revolve around how to store the results internally (when they could be very many, as many as 50,000) and what to do if the user tries to use list(results) on a long list of results (which will hang during the time that Sidewall is repeatedly querying the Dimensions server to fill out the list).
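
Until slicing is supported, callers can get a similar effect for the front of the results with itertools.islice, which works on any iterator. The helper name `take` below is invented for this sketch.

```python
from itertools import islice


def take(results, count):
    """Return a list holding at most `count` items from the front of `results`.

    Works on any iterator. The items taken are consumed from `results`,
    so a later loop over the same iterator continues after them.
    """
    return list(islice(results, count))
```

For example, `take(results, 10)` retrieves only the first ten items rather than forcing the whole result set to be fetched.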

Move parsing of id to core.py

Currently, all the subclasses of DimensionsCore handle pulling out the Dimensions id themselves. I did it this way because in the beginning, I wasn't sure if the identifier of every Dimensions object was in an attribute named id or whether it was sometimes named id and sometimes something else (like the case for orcid and orcid_id). It now looks like everything really does have an id, so the best thing to do now would be to pull out id from the list of _new_attributes on the subclasses and put it in _attributes on DimensionCore, then add the necessary code to parse it in _set_attributes() on DimensionCore, and finally remove it from the places where it appears in the subclasses' _set_attributes().
