dawesearch / backend
The backend of the DaWeSearch project - a tool to conduct structured literature reviews (SLR).
License: MIT License
At the moment the mapping from results to queries/reviews is done as follows:
class Result(MongoModel):
    review = fields.ReferenceField('Review')
    queries = fields.ListField()
We save what review and queries a result belongs to with the result.
This might lead to some inefficiencies and problems with scaling.
We should store the mapping inside the queries, which are subdocuments of a review.
This would look something like this:
class Review(MongoModel):
    queries = fields.EmbeddedDocumentListField('Query')

class Query(EmbeddedMongoModel):
    _id = fields.ObjectIdField(primary_key=True)
    time = fields.DateTimeField()
    # a query yields many results, so this should be a list of references
    results = fields.ListField(fields.ReferenceField('Result'))
To get all results for a review, aggregate the results for all queries of a review.
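A minimal sketch of that aggregation in plain Python, operating on dicts rather than model instances (the field names `queries` and `results` follow the model sketch above; everything else is illustrative):

```python
def results_for_review(review: dict) -> list:
    """Collect the result ids of all queries of a review, deduplicated.

    `review` is assumed to have the shape
    {"queries": [{"results": [...]}, ...]} - field names are illustrative.
    """
    seen = set()
    result_ids = []
    for query in review.get("queries", []):
        for result_id in query.get("results", []):
            if result_id not in seen:
                seen.add(result_id)
                result_ids.append(result_id)
    return result_ids
```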
The two code fragments look almost exactly the same. Be DRY, don't be WET.
We can probably implement the pagination part of this query abstractly, as it should work on any QuerySet.
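A sketch of such an abstract pagination helper, assuming 1-indexed pages as in the request examples below; it works on anything sliceable (pymodm QuerySets translate the slice into skip/limit on the database side):

```python
def paginate(query_set, page: int, page_length: int):
    """Return one page of `query_set` (pages are 1-indexed).

    `query_set` only needs to support slicing, so this works on plain
    lists as well as database QuerySets.
    """
    if page < 1 or page_length < 1:
        raise ValueError("page and page_length must be positive")
    start = (page - 1) * page_length
    return query_set[start:start + page_length]
```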
Most of the functions are still missing docstrings.
See https://www.python.org/dev/peps/pep-0257/ for this.
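For illustration, a PEP 257-style docstring on one of the handlers (the function name and signature are assumptions taken from the request examples elsewhere in this document):

```python
def dry_query(search: dict, page: int, page_length: int) -> dict:
    """Run a search against the literature databases without persisting results.

    PEP 257 in short: a one-line summary in imperative mood, then a blank
    line, then any further details about arguments and return value.
    """
```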
We want to be able to filter persisted results e.g. by year, author, title etc.
We need to offer methods to access certain meta information for Results. This information should ideally be added to each endpoint that requests results from the database, and ideally also to the list of queries and the review endpoint.
How many results do we have persisted for:
To discard results.
Duplicates are filtered out in the results collection, but the results' ids are also saved under the review collection. There, duplicates remain and might lead to weird behavior when querying.
Put the function body in a try/except block to gracefully handle failures and at least inform the frontend what broke. Something like this:
try:
    # handler code
    pass
except Exception as e:
    return {
        "statusCode": 500,
        # an Exception itself is not JSON serializable - dump its message
        "body": json.dumps(str(e))
    }
In wrapper/springerWrapper:350 the loop index that handles timeouts in requests is compared to self.maxRecords instead of self.maxRetries. This is a nasty little typo!
The wrapper classes have some redundancies, for example in translateQuery. Those should be removed by creating appropriate functions in utils.py.
At this point the wrappers search through all search fields.
Searching only through title, abstract and keywords, for example, could lead to faster results.
Implement a function get_all_wrappers() that returns a list of all available wrappers.
A new database wrapper then only needs to be registered there.
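A registry-based sketch of that function; the registry list, the key lookup by class name, and the constructor signature are all assumptions, not the project's actual API:

```python
# Hypothetical registry: a new wrapper class only needs to be appended here,
# e.g. [SpringerWrapper, ElsevierWrapper].
WRAPPER_CLASSES = []

def get_all_wrappers(api_keys: dict) -> list:
    """Instantiate every registered wrapper whose API key is present.

    Wrappers without a key are skipped instead of being constructed with
    None as key.
    """
    wrappers = []
    for cls in WRAPPER_CLASSES:
        key = api_keys.get(cls.__name__)
        if key is None:
            print(f"No API key specified for {cls.__name__}.")
            continue
        wrappers.append(cls(key))
    return wrappers
```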
The wrappers seem to create new parentheses for the search groups (as in buildGroup()) in every new call to the API.
We don't necessarily always want to search in all available database wrappers.
Springer returns the result as such:
"result": [{
    "total": "Total amount of hits in the DB",
    "start": "Index at which the returned results start",
    "pageLength": "Number of results per page requested",
    "recordsDisplayed": "Number of records this exact query returned"
}]
whereas Elsevier returns the following:
"result": {
    "total": "Total amount of hits in the DB",
    "start": "Index at which the returned results start",
    "pageLength": "Number of results per page requested",
    "recordsDisplayed": "Number of records this exact query returned"
}
We probably need to adapt both the db model in models.py and the way it is saved in save_results() in connector.py.
persistent_query goes into an endless loop if no API keys are set:
...
No wrappers existing.
No API key specified for ElsevierWrapper.
No API key specified for SpringerWrapper.
No wrappers existing.
No API key specified for ElsevierWrapper.
No API key specified for SpringerWrapper.
No wrappers existing.
...
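One way out of the loop is to fail fast when no wrapper could be constructed; a sketch (the function name and the calling convention are assumptions):

```python
def require_wrappers(wrappers: list) -> list:
    """Raise instead of looping forever when no wrapper is available.

    Intended to be called right after collecting the wrappers, before
    entering any retry loop.
    """
    if not wrappers:
        raise RuntimeError(
            "No wrappers available - check that API keys are configured."
        )
    return wrappers
```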
Currently the result format is not updated when changing collections. Because of that, unsupported formats can be set.
Not because of #13, but because searching through all fields simply returns too many results. It does not make sense to use the tool for an SLR if it returns thousands of results.
When searching for search terms that are too broad, the wrappers take too long and the lambda function times out after 6s.
This happens with the following request to the dry_query function:
{
    "search": {
        "search_groups": [
            {
                "search_terms": ["testing"],
                "match": "OR"
            }
        ],
        "match": "AND"
    },
    "page": 1,
    "page_length": 100
}
but doesn't happen with
{
    "search": {
        "search_groups": [
            {
                "search_terms": ["blockchain"],
                "match": "OR"
            }
        ],
        "match": "AND"
    },
    "page": 1,
    "page_length": 100
}
We should probably figure out where exactly the query takes that long.
Ideally, the wrapper would just return an empty list of results.
Exception has occurred: KeyError
'results'
File "/home/marc/projects/progpra/aws-python-rest-api-with-pymongo/wrapper/elsevierWrapper.py", line 182, in formatResponse
"recordsDisplayed": len(response["results"])
File "/home/marc/projects/progpra/aws-python-rest-api-with-pymongo/wrapper/elsevierWrapper.py", line 227, in callAPI
return self.formatResponse(response, query, body)
File "/home/marc/projects/progpra/aws-python-rest-api-with-pymongo/functions/slr.py", line 48, in call_api
return db_wrapper.callAPI(search)
File "/home/marc/projects/progpra/aws-python-rest-api-with-pymongo/functions/slr.py", line 74, in do_search
results.append(call_api(db_wrapper, search, page, page_length))
File "/home/marc/projects/progpra/aws-python-rest-api-with-pymongo/functions/slr.py", line 95, in <module>
results = do_search(search, 1, 25)
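A defensive sketch of the record extraction that would avoid this KeyError and return the empty list instead; the key name "results" is taken from the traceback, the helper name is an assumption:

```python
def extract_records(response: dict) -> list:
    """Return the record list of an API response, or [] if it is absent.

    Guards against responses that lack the "results" key (or carry a
    non-list value there), which currently raises a KeyError in
    formatResponse.
    """
    records = response.get("results")
    return records if isinstance(records, list) else []
```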
Initially, we planned to come up with the search terms for a query in advance and save them.
It probably makes more sense to store the search terms in each query, as that will be one 'session'.
This shouldn't have happened.
In order to offer a useful pagination for searches directly in the literature databases, we need to be able to set the maximum number of results for an API call.
Elsevier supports two different ScienceDirect APIs: The "ScienceDirect Search V2" and the "Article Metadata".
Although both are listed as available collections in the Elsevier wrapper, the truth is that only the V2 search is working.
If one were to try the Metadata collection, one would get an HTTP error: 405 Client Error: Method Not Allowed for url: https://api.elsevier.com/content/metadata/article, because the wrapper tries a PUT request instead of a GET.
There are two options:
callAPI
Since the amount of changes needed is probably very similar, and because Elsevier recommends the PUT API instead of the GET for V2, I would tend toward the second option.
When we conduct a dry query, add a boolean field persisted to the output format:
"records": [{
    "title": "The title of the record",
    "authors": ["Full name of one creator"],
    [...]
    "persisted": "Result present in database"
}]
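A sketch of how that flag could be filled in; `persisted_dois` would come from a projection query on the results collection, and all names here are illustrative:

```python
def mark_persisted(records: list, persisted_dois: set) -> list:
    """Annotate each record with whether its doi is already in the database.

    Records without a doi are marked as not persisted.
    """
    for record in records:
        record["persisted"] = record.get("doi") in persisted_dois
    return records
```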
Get the dois of the results associated with the current review. Currently, we're using ObjectIds, but it would be easier to use the unique doi field.
Since we are using the users' API keys, we should keep in mind that some of the databases have restrictions on how many queries we can send per second/day/week.
To avoid locking the keys, we should enforce those limits somewhere.
It is pretty clear that the wrapper classes will contain the information about the limits, but I am not too sure on which layer those limits should be observed and enforced.
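A minimal sliding-window rate limiter for illustration; the class name and where it would live are assumptions, and the real per-database limits would be supplied by the wrapper classes:

```python
import time

class RateLimiter:
    """Allow at most `max_calls` calls per `period` seconds.

    wait() blocks until a call is permitted, tracking call timestamps in
    a sliding window.
    """

    def __init__(self, max_calls: int, period: float):
        self.max_calls = max_calls
        self.period = period
        self.calls = []  # timestamps of recent calls

    def wait(self):
        now = time.monotonic()
        # drop timestamps that fell out of the window
        self.calls = [t for t in self.calls if now - t < self.period]
        if len(self.calls) >= self.max_calls:
            # sleep until the oldest call leaves the window
            time.sleep(self.period - (now - self.calls[0]))
        self.calls.append(time.monotonic())
```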
When a key is missing for an API, the query is still sent with None as the key.