
backend's People

Contributors

afrogenosse, druckdev, marcplustwo, philosapiens, rbalink

Forkers

dankoan

backend's Issues

More efficient database schema

status quo

At the moment the mapping from results to queries/reviews is done as follows:

from pymodm import MongoModel, fields

class Result(MongoModel):
    review = fields.ReferenceField('Review')
    queries = fields.ListField()

We store, on the result itself, which review and which queries it belongs to.
This might lead to some inefficiencies and problems with scaling.

idea

We should store the mapping inside the queries, which are subdocuments of a review.

This would look something like this:

from pymodm import EmbeddedMongoModel, MongoModel, fields


class Review(MongoModel):
    queries = fields.EmbeddedDocumentListField('Query')


class Query(EmbeddedMongoModel):
    _id = fields.ObjectIdField(primary_key=True)
    time = fields.DateTimeField()
    # a query maps to many results, hence a list of references
    results = fields.ListField(fields.ReferenceField('Result'))

To get all results for a review, aggregate the results for all queries of a review.
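
A minimal sketch of that aggregation, assuming the proposed models above:

def results_for_review(review):
    """Collect the results of every query belonging to a review."""
    results = []
    for query in review.queries:
        results.extend(query.results)
    return results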

Meta info for persisted results (total number etc.)

We need to offer methods to access certain meta information for results. This information should ideally be added to each endpoint that requests results from the database.

Ideally it should also appear in the list of queries and in the review endpoint.

How many results do we have persisted for:

  • a query
  • a review
    so that the frontend can determine how many pages to offer (see the sketch below).
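
A minimal sketch of both counts, assuming the current Result model with its review and queries fields:

def persisted_count_for_review(review_id):
    """Number of persisted results belonging to a review."""
    return Result.objects.raw({"review": review_id}).count()

def persisted_count_for_query(query_id):
    """Number of persisted results whose queries list contains the given id."""
    return Result.objects.raw({"queries": query_id}).count()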

duplicate filtering on review-side

Duplicates are filtered out in the results collection, but the results' ids are also saved in the review collection. There, duplicates remain and might lead to odd behavior when querying.
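
A minimal sketch of de-duplicating those ids on the review side, assuming they are stored as a plain list:

def dedup_result_ids(result_ids):
    """Remove duplicate ids while preserving first-seen order."""
    return list(dict.fromkeys(result_ids))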

Error management / info for handler

Wrap each handler's body in try/except to handle failures gracefully and at least inform the frontend what broke.

Something like this:

import json

try:
    # handler code
    pass
except Exception as e:
    # json.dumps(e) raises TypeError, since exceptions are not
    # JSON serializable; convert to a string first
    return {
        "statusCode": 500,
        "body": json.dumps({"error": str(e)})
    }

Timing of different search parameters

At this point the wrappers search through all fields.
Searching only through Title, Abstract, and Keywords, for example, could return results faster.

Wrapper: Get all wrappers

Implement function get_all_wrappers() to get a list of all available wrappers.

A new database wrapper then only needs to be registered there.
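
A minimal sketch, assuming the wrapper classes share a constructor that takes an API key; the Springer module path and the environment-variable naming scheme are assumptions:

import os

from wrapper.elsevierWrapper import ElsevierWrapper
from wrapper.springerWrapper import SpringerWrapper  # assumed module path

# register new database wrappers here
WRAPPER_CLASSES = [ElsevierWrapper, SpringerWrapper]


def get_all_wrappers():
    """Instantiate every wrapper whose API key is configured."""
    wrappers = []
    for cls in WRAPPER_CLASSES:
        # e.g. ELSEVIERWRAPPER_API_KEY (naming scheme is illustrative)
        key = os.environ.get(cls.__name__.upper() + "_API_KEY")
        if key:
            wrappers.append(cls(key))
    return wrappers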

Different output format of Springer and Elsevier wrappers

Springer returns the result as such:

"result": [{
    "total": "Total amount of hits in the DB",
    "start": "Index at which the returned results start",
    "pageLength": "Number of results per page requested",
    "recordsDisplayed": "Number of records this exact query returned"
}]

whereas Elsevier returns the following:

"result": {
    "total": "Total amount of hits in the DB",
    "start": "Index at which the returned results start",
    "pageLength": "Number of results per page requested",
    "recordsDisplayed": "Number of records this exact query returned"
}

Endless loop when no keys are set

persistent_query goes into an endless loop if no API keys are set.

...
No wrappers existing.
No API key specified for ElsevierWrapper.
No API key specified for SpringerWrapper.
No wrappers existing.
No API key specified for ElsevierWrapper.
No API key specified for SpringerWrapper.
No wrappers existing.
...
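
The function could fail fast instead; a minimal sketch of a guard, with the surrounding control flow of persistent_query left out:

def persistent_query(search, num_results):
    wrappers = get_all_wrappers()
    if not wrappers:
        # fail fast instead of calling get_all_wrappers() forever
        raise RuntimeError("No wrappers available: no API keys are set.")
    # ... existing query loop continues here ...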

Do not search in all fields

Not because of #13, but because searching through all fields simply returns too many results. It does not make sense to use the tool for an SLR if it returns several thousand results.

AWS Timeout after 6s

When searching for search terms that are too broad, the wrappers take too long and the lambda function times out after 6s.

This happens with the following request to the dry_query function:

{
    "search": {
        "search_groups": [
            {
                "search_terms": ["testing"],
                "match": "OR"
            }
        ],
        "match": "AND"
    },
    "page": 1,
    "page_length": 100
}

but doesn't happen with

{
    "search": {
        "search_groups": [
            {
                "search_terms": ["blockchain"],
                "match": "OR"
            }
        ],
        "match": "AND"
    },
    "page": 1,
    "page_length": 100
}

We should probably figure out where exactly the query spends that much time.
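
A minimal sketch for narrowing that down, timing each wrapper's callAPI separately:

import time

def timed_call(db_wrapper, search):
    """Call a wrapper and log how long the request took."""
    start = time.perf_counter()
    result = db_wrapper.callAPI(search)
    elapsed = time.perf_counter() - start
    print(f"{type(db_wrapper).__name__} took {elapsed:.2f}s")
    return result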

Wrapper: Error, when no results are returned by database

Ideally, the wrapper would just return an empty list of results instead (see the sketch after the traceback).

Exception has occurred: KeyError
'results'
  File "/home/marc/projects/progpra/aws-python-rest-api-with-pymongo/wrapper/elsevierWrapper.py", line 182, in formatResponse
    "recordsDisplayed": len(response["results"])
  File "/home/marc/projects/progpra/aws-python-rest-api-with-pymongo/wrapper/elsevierWrapper.py", line 227, in callAPI
    return self.formatResponse(response, query, body)
  File "/home/marc/projects/progpra/aws-python-rest-api-with-pymongo/functions/slr.py", line 48, in call_api
    return db_wrapper.callAPI(search)
  File "/home/marc/projects/progpra/aws-python-rest-api-with-pymongo/functions/slr.py", line 74, in do_search
    results.append(call_api(db_wrapper, search, page, page_length))
  File "/home/marc/projects/progpra/aws-python-rest-api-with-pymongo/functions/slr.py", line 95, in <module>
    results = do_search(search, 1, 25)
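
A minimal sketch of the defensive variant, assuming the response is a dict that may lack the "results" key entirely:

def format_records(response):
    """Extract records defensively; a missing "results" key means zero records."""
    records = response.get("results", [])
    return {
        "recordsDisplayed": len(records),
        "records": records,
    }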

Search terms in Query

Initially, we planned to come up with the search terms for a query in advance and save them.

It probably makes more sense to store the search terms in each query, as each query will be one 'session'.
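
A minimal sketch, extending the embedded Query model proposed in the schema issue above; the field name search is an assumption:

from pymodm import EmbeddedMongoModel, fields


class Query(EmbeddedMongoModel):
    _id = fields.ObjectIdField(primary_key=True)
    time = fields.DateTimeField()
    search = fields.DictField()  # the full search specification of this session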

Elsevier Article Metadata API

Elsevier supports two different ScienceDirect APIs: the "ScienceDirect Search V2" and the "Article Metadata" API.
Although both are listed as available collections in the Elsevier wrapper, only the V2 search actually works.
Trying the Metadata collection yields an HTTP error, 405 Client Error: Method Not Allowed for url: https://api.elsevier.com/content/metadata/article, because the wrapper sends a PUT request instead of a GET.

There are two options:

  1. Use the GET API of V2 so that both collections can use the same callAPI
  2. Implement a case distinction so that V2 keeps PUT and the metadata collection uses GET

Since the amount of changes needed is probably very similar for both options, and because Elsevier recommends the PUT API over the GET for V2, I would tend toward the second option.
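
A minimal sketch of option 2 as a method of the wrapper, assuming a collection attribute; names are illustrative:

import requests


def call_api(self, url, headers, body):
    if self.collection == "metadata":
        # the Article Metadata API only accepts GET with query parameters
        response = requests.get(url, headers=headers, params=body)
    else:
        # ScienceDirect Search V2 recommends PUT with a JSON body
        response = requests.put(url, headers=headers, json=body)
    response.raise_for_status()
    return response.json()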

Consistent date format

We need to agree on a standard date format for communication with the frontend. MongoDB's json util spits out epoch time, whereas the database wrappers use some kind of string-based format, IIRC.

Let's talk about this and collect some ideas.
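
One possible direction, a minimal sketch assuming we settle on ISO 8601 strings (the choice is still open) and that MongoDB's json util hands us epoch milliseconds:

from datetime import datetime, timezone


def epoch_ms_to_iso(epoch_ms):
    """Convert MongoDB epoch milliseconds to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc).isoformat()


print(epoch_ms_to_iso(1609459200000))  # 2021-01-01T00:00:00+00:00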

Check if result is already persisted in db

what

When we conduct a dry query, add a boolean field persisted to the output format.

	"records": [{
		"title": "The title of the record",
		"authors": ["Full name of one creator"],
                [...]
                "persisted": "Result present in data base"
	}]

how

  1. Get a list of all DOIs of the results associated with the current review.
  2. Iterate over all results and check whether they are present in the list of DOIs (see the sketch below).
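
A minimal sketch of both steps, assuming the Result model stores a doi field and the dry-query records carry one as well; names are illustrative:

def mark_persisted(records, review_id):
    # 1. collect the DOIs of all results already persisted for this review
    persisted_dois = {r.doi for r in Result.objects.raw({"review": review_id})}
    # 2. flag each dry-query record that is already in the database
    for record in records:
        record["persisted"] = record.get("doi") in persisted_dois
    return records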

Mind the API key limits

Since we are using the users' API keys, we should keep in mind that some of the databases restrict how many queries we can send per second/day/week.
To avoid locking the keys, we should enforce those limits somewhere.

It is clear that the wrapper classes will contain the information about the limits, but I am not sure on which layer those limits should be observed and enforced.
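
One enforcement option, a minimal sketch of a sliding-window limiter that each wrapper instance could own; the actual limit values would come from the wrapper classes:

import time


class RateLimiter:
    """Allow at most max_calls within a sliding window of period seconds."""

    def __init__(self, max_calls, period):
        self.max_calls = max_calls
        self.period = period
        self.calls = []

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = time.monotonic()
        # drop timestamps that have left the window
        self.calls = [t for t in self.calls if now - t < self.period]
        if len(self.calls) >= self.max_calls:
            # sleep until the oldest call in the window expires
            time.sleep(self.period - (now - self.calls[0]))
        self.calls.append(time.monotonic())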
