Giter VIP home page Giter VIP logo

amba-analysis-worker-pubfinder's Introduction

amba-analysis-worker-pubfinder

The Pubfinder component retrieves publication metadata and, if possible, abstracts for publications that are linked to an event. The Pubfinder component is implemented in python by using the amba-event-streams package.

The process of retrieving publication data can be seen in Figure 1. The Pubfinder is built around its main worker utilizing individual sources which are integrated similar. This architecture allows adding, remove or modify sources since their only difference is in the source retrieval and transformation. This means the Kafka and database connections are maintained by the main worker, while the sources are separated to only retrieve information for a given DOI.

pubfinder Process Overview of Retrieving Publication Data

All sources have their own retrieval mechanisms and data transformations, as well as API limits that must be regarded. The following sources are used: ambalytics , CrossRef , OpenAIRE , Semantic Scholar and Meta Tags. CrossRef, OpenAIRE and Semantic Scholar are service that offer APIs, while Meta Tags are Dublin Core HTML header meta tags that may be embedded in the respective publication web page of the publisher. Table 1 shows the individual fields that can be retrieved by a source, as well as their rate limit and URL. Ambalytics is the internal database (see Figure 1), is technically not a real source and therefore not listed in the table. Even though CrossRef and Meta Tags are not rate limited, the number of requests per time unit is limited to comply with the recommended usage of the respective APIs.

Since all sources are needed for a full check, the maximum throughput is limited by the slowest source or in this case, the source with the lowest rate limit. However, using this rate limit as base for the worst-case calculation 100 publications per 5 minutes will amount to 28800 per 24h period. Each source is integrated similarly to the main python program, but runs it own threads independently. In order to allow fast and parallel workloads, the process can either be implemented asynchronously allowing to use the await syntax to avoid blocking the thread or using queues and multi-threading. Unfortunately, the support of asynchronous python libraries is still limited and running blocking functions in asynchronous environment would defeat its workings. Following, the Pubfinder is developed based on the queue architecture pattern.

Source CrossRef OpenAIRE Semantic Scholar Meta Tags
URL api.crossref.org develop.openaire.eu api.semanticscholar.org -
Rate Limit - 3600/h 100/5m -
Title x x x x
Authors x x x x
Year x x x x
Type x - - -
Citation Count x - x x
Pub Date x x - x
Abstract x x x x
References x - - -
Citations x - x x
Publisher x x x x
Field Of Study x x x x
License x - - -

Publication Meta Data Sources

Queues allow each source to run indecently and multi-threaded while reading and writing events into and from queues. A worker queue for each source contains all events, and all sources share a result queue. The queue is implemented as deque allowing for thread-safe operation and adding references and citations. Using a deque allows to add new DOI’s to the front while adding references and citations to the end, resulting in earlier retrieval for the more important event publications. The management of the queues is done in the Pubfinder, checking for finished publications and adding an element to the correct source if more data is required.

Since a publication resolution runs through each source until it is complete and implements different limits and speeds, the order is critical for overall speed. The API with the lowest rate limit should be last. This order of sources reduces the total request count affecting the limit since publications are more likely to be complete before and therefore reducing the count.

Further, it will not slow down the processing. Each source uses the same base way of finding data. The given DOI will request the data in a defined way while ensuring API limits are kept. Once data is retrieved, the needed data is extracted and mapped into the internal publication structure. If a source adds data, it will also add its source. Finally, the publication data is added to the event, and the event is added to the result queue.

Once a publication is complete or run through all sources and contains enough data, it is stored in PostgreSQL. Then publication data is added to the event, the event state is changed to linked, and finally, the event is sent to Kafka.

amba-analysis-worker-pubfinder's People

Contributors

e3dmc2 avatar lukasjesche avatar xlrx avatar

Watchers

 avatar

amba-analysis-worker-pubfinder's Issues

Error Crossref Source

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/multiprocessing/pool.py", line 103, in worker
    initializer(*initargs)
  File "./src/pubfinder_worker.py", line 145, in worker_results
    self.finish_work(result['item'], result['tag'])
  File "./src/pubfinder_worker.py", line 182, in finish_work
    self.dao.save_publication(publication)
  File "/usr/local/lib/python3.6/site-packages/event_stream/dao.py", line 150, in save_publication
    fos = FieldOfStudy(name=fos_data['name'], normalizedName=fos_data['normalizedName'])
TypeError: string indices must be integers

updated for correct error

Known Bugs

  • empty author
  • connection error domain to many retries -> limit per domain? (see sentry)
  • key error (see sentry)
  • type error (see sentry)
  • meta (?) clean abstract better

Error on run

worker-pubfinder      | Exception in thread Thread-18:
worker-pubfinder      | Traceback (most recent call last):
worker-pubfinder      |   File "/usr/local/lib/python3.6/threading.py", line 916, in _bootstrap_inner
worker-pubfinder      |     self.run()
worker-pubfinder      |   File "/usr/local/lib/python3.6/threading.py", line 864, in run
worker-pubfinder      |     self._target(*self._args, **self._kwargs)
worker-pubfinder      |   File "/usr/local/lib/python3.6/multiprocessing/pool.py", line 103, in worker
worker-pubfinder      |     initializer(*initargs)
worker-pubfinder      |   File "/src/src/meta_source.py", line 118, in worker
worker-pubfinder      |     self.pubfinder.finish_work(item, self.tag)
worker-pubfinder      |   File "./src/pubfinder_worker.py", line 208, in finish_work
worker-pubfinder      |     self.save_to_mongo(publication)
worker-pubfinder      |   File "./src/pubfinder_worker.py", line 268, in save_to_mongo
worker-pubfinder      |     pub = self.dao.save_publication(publication)
worker-pubfinder      |   File "/usr/local/lib/python3.6/site-packages/event_stream/dao.py", line 111, in save_publication
worker-pubfinder      |     fos = FieldOfStudy(name=fos_data['name'], normalizedName=fos_data['normalizedName'])
worker-pubfinder      | TypeError: string indices must be integers

Crossref parse error

worker-pubfinder      | 2021-09-03 15:06:22,846.846 Thread-4:MongoDB, not found event
worker-pubfinder      | 2021-09-03 15:06:22,854.854 Thread-9:SourceCrossref work on item 10.15366/tp2021.38
worker-pubfinder      | Exception in thread Thread-9:
worker-pubfinder      | Traceback (most recent call last):
worker-pubfinder      |   File "/usr/local/lib/python3.6/threading.py", line 916, in _bootstrap_inner
worker-pubfinder      |     self.run()
worker-pubfinder      |   File "/usr/local/lib/python3.6/threading.py", line 864, in run
worker-pubfinder      |     self._target(*self._args, **self._kwargs)
worker-pubfinder      |   File "/usr/local/lib/python3.6/multiprocessing/pool.py", line 103, in worker
worker-pubfinder      |     initializer(*initargs)
worker-pubfinder      |   File "/src/src/crossref_source.py", line 110, in worker
worker-pubfinder      |     publication_temp = self.add_data_to_publication(publication)
worker-pubfinder      |   File "/src/src/crossref_source.py", line 129, in add_data_to_publication
worker-pubfinder      |     return self.map(response, publication)
worker-pubfinder      |   File "/src/src/crossref_source.py", line 156, in map
worker-pubfinder      |     publication['title'] = response_data['title'][0]
worker-pubfinder      | IndexError: list index out of range
worker-pubfinder      | 
worker-pubfinder      | 2021-09-03 15:06:24,934.934 Thr

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.