dawesearch / backend
The backend of the DaWeSearch project - a tool to conduct structured literature reviews (SLR).
License: MIT License
At the moment the mapping from results to queries/reviews is done as follows:
class Result(MongoModel):
    review = fields.ReferenceField('Review')
    queries = fields.ListField()
We save what review and queries a result belongs to with the result.
This might lead to some inefficiencies and problems with scaling.
We should store the mapping inside the queries, which are subdocuments of a review.
This would look something like this:
class Review(MongoModel):
    queries = fields.EmbeddedDocumentListField('Query')

class Query(EmbeddedMongoModel):
    _id = fields.ObjectIdField(primary_key=True)
    time = fields.DateTimeField()
    # a query yields many results, so this should be a list of references
    results = fields.ListField(fields.ReferenceField('Result'))
To get all results for a review, aggregate the results for all queries of a review.
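A minimal sketch of that aggregation in plain Python, operating on dicts rather than model instances (the field names `queries` and `results` follow the model sketch above; everything else is illustrative):

```python
def results_for_review(review: dict) -> list:
    """Collect the result ids of all queries of a review, deduplicated.

    `review` is assumed to have the shape
    {"queries": [{"results": [...]}, ...]} - field names are illustrative.
    """
    seen = set()
    result_ids = []
    for query in review.get("queries", []):
        for result_id in query.get("results", []):
            if result_id not in seen:
                seen.add(result_id)
                result_ids.append(result_id)
    return result_ids
```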
The two code fragments look almost exactly the same. Be DRY, don't be WET.
We can probably implement the pagination part of this query abstractly, as it should work on any QuerySet.
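A sketch of such an abstract pagination helper, assuming 1-indexed pages as in the request examples below; it works on anything sliceable (pymodm QuerySets translate the slice into skip/limit on the database side):

```python
def paginate(query_set, page: int, page_length: int):
    """Return one page of `query_set` (pages are 1-indexed).

    `query_set` only needs to support slicing, so this works on plain
    lists as well as database QuerySets.
    """
    if page < 1 or page_length < 1:
        raise ValueError("page and page_length must be positive")
    start = (page - 1) * page_length
    return query_set[start:start + page_length]
```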
Most of the functions are still missing docstrings.
See https://www.python.org/dev/peps/pep-0257/ for this.
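For illustration, a PEP 257-style docstring on one of the handlers (the function name and signature are assumptions taken from the request examples elsewhere in this document):

```python
def dry_query(search: dict, page: int, page_length: int) -> dict:
    """Run a search against the literature databases without persisting results.

    PEP 257 in short: a one-line summary in imperative mood, then a blank
    line, then any further details about arguments and return value.
    """
```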
We want to be able to filter persisted results e.g. by year, author, title etc.
We need to offer methods to access certain meta information for Results. This information should ideally be added to each endpoint that requests results from the database, and ideally also to the list of queries and the review endpoint.
How many results do we have persisted for:
To discard results.
Duplicates are filtered out in the results collection, but the results' ids are also saved under the review collection. There, duplicates remain and might lead to weird behavior when querying.
Put the function body in a try/except block to gracefully handle failures and at least inform the frontend what broke. Something like this:
try:
    # handler code
    pass
except Exception as e:
    return {
        "statusCode": 500,
        # an Exception itself is not JSON serializable - dump its message
        "body": json.dumps(str(e))
    }
In wrapper/springerWrapper:350 the loop index that handles timeouts in requests is compared to self.maxRecords instead of self.maxRetries. This is a nasty little typo!
The wrapper classes have some redundancies, for example in translateQuery. Those should be removed by creating appropriate functions in utils.py.
At this point the wrappers search through all search fields.
Searching only through title, abstract and keywords, for example, could lead to faster results.
Implement a function get_all_wrappers() that returns a list of all available wrappers.
A new database wrapper then only needs to be registered there.
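A registry-based sketch of that function; the registry list, the key lookup by class name, and the constructor signature are all assumptions, not the project's actual API:

```python
# Hypothetical registry: a new wrapper class only needs to be appended here,
# e.g. [SpringerWrapper, ElsevierWrapper].
WRAPPER_CLASSES = []

def get_all_wrappers(api_keys: dict) -> list:
    """Instantiate every registered wrapper whose API key is present.

    Wrappers without a key are skipped instead of being constructed with
    None as key.
    """
    wrappers = []
    for cls in WRAPPER_CLASSES:
        key = api_keys.get(cls.__name__)
        if key is None:
            print(f"No API key specified for {cls.__name__}.")
            continue
        wrappers.append(cls(key))
    return wrappers
```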
The wrappers seem to create new parentheses for the search groups (as in buildGroup()) in every new call to the API.
We don't necessarily always want to search in all available database wrappers.
Springer returns the result as such:
"result": [{
    "total": "Total amount of hits in the DB",
    "start": "Index at which the returned results start",
    "pageLength": "Number of results per page requested",
    "recordsDisplayed": "Number of records this exact query returned"
}]
whereas Elsevier returns the following:
"result": {
    "total": "Total amount of hits in the DB",
    "start": "Index at which the returned results start",
    "pageLength": "Number of results per page requested",
    "recordsDisplayed": "Number of records this exact query returned"
}
We probably need to adapt both the db model in models.py and the way it is saved in save_results() in connector.py.
persistent_query goes into an endless loop if no API keys are set:
...
No wrappers existing.
No API key specified for ElsevierWrapper.
No API key specified for SpringerWrapper.
No wrappers existing.
No API key specified for ElsevierWrapper.
No API key specified for SpringerWrapper.
No wrappers existing.
...
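One way out of the loop is to fail fast when no wrapper could be constructed; a sketch (the function name and the calling convention are assumptions):

```python
def require_wrappers(wrappers: list) -> list:
    """Raise instead of looping forever when no wrapper is available.

    Intended to be called right after collecting the wrappers, before
    entering any retry loop.
    """
    if not wrappers:
        raise RuntimeError(
            "No wrappers available - check that API keys are configured."
        )
    return wrappers
```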
Currently the result format is not updated when changing collections. Because of that, unsupported formats can be set.
Not because of #13, but because searching through all fields simply returns too many results. It does not make sense to use the tool for an SLR if it returns thousands of results.
When searching for search terms that are too broad, the wrappers take too long and the lambda function times out after 6s.
This happens with the following request to the dry_query function:
{
    "search": {
        "search_groups": [
            {
                "search_terms": ["testing"],
                "match": "OR"
            }
        ],
        "match": "AND"
    },
    "page": 1,
    "page_length": 100
}
but doesn't happen with
{
    "search": {
        "search_groups": [
            {
                "search_terms": ["blockchain"],
                "match": "OR"
            }
        ],
        "match": "AND"
    },
    "page": 1,
    "page_length": 100
}
We should probably figure out where exactly the query takes that long.
Ideally, the wrapper would just return an empty list of results.
Exception has occurred: KeyError
'results'
File "/home/marc/projects/progpra/aws-python-rest-api-with-pymongo/wrapper/elsevierWrapper.py", line 182, in formatResponse
"recordsDisplayed": len(response["results"])
File "/home/marc/projects/progpra/aws-python-rest-api-with-pymongo/wrapper/elsevierWrapper.py", line 227, in callAPI
return self.formatResponse(response, query, body)
File "/home/marc/projects/progpra/aws-python-rest-api-with-pymongo/functions/slr.py", line 48, in call_api
return db_wrapper.callAPI(search)
File "/home/marc/projects/progpra/aws-python-rest-api-with-pymongo/functions/slr.py", line 74, in do_search
results.append(call_api(db_wrapper, search, page, page_length))
File "/home/marc/projects/progpra/aws-python-rest-api-with-pymongo/functions/slr.py", line 95, in <module>
results = do_search(search, 1, 25)
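A defensive sketch of the record extraction that would avoid this KeyError and return the empty list instead; the key name "results" is taken from the traceback, the helper name is an assumption:

```python
def extract_records(response: dict) -> list:
    """Return the record list of an API response, or [] if it is absent.

    Guards against responses that lack the "results" key (or carry a
    non-list value there), which currently raises a KeyError in
    formatResponse.
    """
    records = response.get("results")
    return records if isinstance(records, list) else []
```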
Initially, we planned to come up with the search terms for a query in advance and save them.
It probably makes more sense to store the search terms in each query, as that will be one 'session'.
This shouldn't have happened.
In order to offer a useful pagination for searches directly in the literature databases, we need to be able to set the maximum number of results for an API call.
Elsevier supports two different ScienceDirect APIs: The "ScienceDirect Search V2" and the "Article Metadata".
Although both are listed as available collections in the Elsevier wrapper, the truth is that only the V2 search is working.
If one were to try the Metadata collection, one would get an HTTP error: 405 Client Error: Method Not Allowed for url: https://api.elsevier.com/content/metadata/article, because the wrapper tries a PUT request instead of a GET.
There are two options:
callAPI
Since the amount of changes needed is probably very similar, and because Elsevier recommends the PUT API instead of the GET for V2, I would tend toward the second option.
When we conduct a dry query, add a boolean field persisted to the output format:
"records": [{
    "title": "The title of the record",
    "authors": ["Full name of one creator"],
    [...]
    "persisted": "Result present in database"
}]
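A sketch of how that flag could be filled in; `persisted_dois` would come from a projection query on the results collection, and all names here are illustrative:

```python
def mark_persisted(records: list, persisted_dois: set) -> list:
    """Annotate each record with whether its doi is already in the database.

    Records without a doi are marked as not persisted.
    """
    for record in records:
        record["persisted"] = record.get("doi") in persisted_dois
    return records
```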
Get the dois of the results associated with the current review. Currently, we're using ObjectIds, but it would be easier to use the unique doi field.
Since we are using the users' API keys, we should keep in mind that some of the databases have restrictions on how many queries we can send per second/day/week.
To avoid locking the keys, we should enforce those limits somewhere.
It is pretty clear that the wrapper classes will contain the information about the limits, but I am not too sure on which layer those limits should be observed and enforced.
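A minimal sliding-window rate limiter for illustration; the class name and where it would live are assumptions, and the real per-database limits would be supplied by the wrapper classes:

```python
import time

class RateLimiter:
    """Allow at most `max_calls` calls per `period` seconds.

    wait() blocks until a call is permitted, tracking call timestamps in
    a sliding window.
    """

    def __init__(self, max_calls: int, period: float):
        self.max_calls = max_calls
        self.period = period
        self.calls = []  # timestamps of recent calls

    def wait(self):
        now = time.monotonic()
        # drop timestamps that fell out of the window
        self.calls = [t for t in self.calls if now - t < self.period]
        if len(self.calls) >= self.max_calls:
            # sleep until the oldest call leaves the window
            time.sleep(self.period - (now - self.calls[0]))
        self.calls.append(time.monotonic())
```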
When a key is missing for an API, the query is still sent with None as the key.