This issue has been migrated from the CC Search Catalog repository
Author: annatuma
Date: Thu Jun 04 2020
Labels: Hacktoberfest, help wanted, providers, ✨ goal: improvement, status: discontinued
Problem Description
We're currently integrated, via CommonCrawl, with Digitalt Museum, a Nordic data portal that aggregates and supports the digital collections of dozens of Nordic museums.
The full count of image records under a CC license is over 2.7 million.
Our current integration with Digitalt Museum covers only about 289 thousand of those records.
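As a quick check of the gap implied by the two figures above, the arithmetic can be sketched as follows. The counts are the approximate values quoted in this issue, not exact totals, and the variable names are only illustrative.

```python
# Approximate counts quoted in this issue (not exact totals).
available_records = 2_700_000  # CC-licensed image records at Digitalt Museum
ingested_records = 289_000     # records covered by the current integration

coverage = ingested_records / available_records
missing = 1 - coverage

print(f"coverage: {coverage:.1%}")
print(f"missing:  {missing:.1%}")
```

Since the available count is stated as "over 2.7 million", the real coverage is, if anything, slightly lower than this estimate.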
Solution Description
We need to redo our integration with Digitalt Museum to use their API. Relying on CommonCrawl rather than the API would explain why we're missing 80%+ of the records we could be ingesting.
We also need to review how this may be a fit for our provider-within-provider scheme, given that Digitalt Museum is an aggregator.
Finally, the API documentation is in Norwegian: http://api.dimu.org/doc/public_api.html
Additional context
None at this time, but may be updated as research is done on the issue.
No development should be done on a Provider API Script until the following info is gathered:
General Recommendations for implementation
- The script should be in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
- The script should have a test suite in the same directory.
- The script must use the ImageStore class (import this from src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
- The script should use the DelayedRequester class (import this from src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
- The script must not use anything from src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since that module is deprecated.
- If the provider API can be queried by 'upload date' or something similar, the script should take a --date parameter when run as a script, giving the date for which we should collect images. The form should be YYYY-MM-DD (so, the script can be run via python my_favorite_provider.py --date 2018-01-01).
- The script must provide a main function that takes the same parameters as the CLI. In our example from above, we'd then have a main function my_favorite_provider.main(date). The main function should do the same thing that calling the script from the CLI would do.
- The script must conform to PEP8. Please use pycodestyle (available via pip install pycodestyle) to check for compliance.
- The script should use small, testable functions.
- The test suite for the script may break PEP8 rules regarding long lines where
appropriate (e.g., long strings for testing).
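The --date and main(date) recommendations above can be sketched as a minimal skeleton. This is only an illustration: parse_date and parse_args are hypothetical helper names, and the actual fetching and storage (via DelayedRequester and ImageStore) is left as a comment because their interfaces are not described in this issue.

```python
import argparse
from datetime import datetime


def parse_date(date_string):
    """Parse a YYYY-MM-DD date and return it in zero-padded form.

    Raises ValueError if the string is not a valid date.
    """
    return datetime.strptime(date_string, "%Y-%m-%d").strftime("%Y-%m-%d")


def main(date):
    """Do the same work whether called from the CLI or imported."""
    date = parse_date(date)
    # Placeholder (assumption): a real script would request the records
    # for `date` and store them here.
    return date


def parse_args(argv=None):
    """Build the CLI so that it mirrors the parameters of main()."""
    parser = argparse.ArgumentParser(
        description="Sketch of a provider API script",
    )
    parser.add_argument(
        "--date",
        required=True,
        type=parse_date,
        help="Date for which to collect images (YYYY-MM-DD)",
    )
    return parser.parse_args(argv)
```

When run as a script, the module would call main(parse_args().date) under the usual if __name__ == "__main__": guard, so the CLI and my_favorite_provider.main(date) share one code path.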
--------
Original Comments:
annatuma commented on Thu Jun 04 2020:
@mathemancer flagging this for you to look at, potentially a fit for @ChariniNana to dig into.
ChariniNana commented on Fri Jun 05 2020:
Will look into this, Anna.
Issue author amartya-dev commented on Sat Oct 03 2020:
@ChariniNana are you still working on this?
ChariniNana commented on Mon Oct 05 2020:
@amartya-dev No. Feel free to start working on this ticket
Issue author amartya-dev commented on Mon Oct 05 2020:
@annatuma @mathemancer
I looked into the documentation provided, and I have a couple of questions before I can start the implementation:
- Since this issue does not have the checklist for verifying the API providers before starting development, should I include that here?
- The systematic way of retrieving info from the API, in my opinion, is to query using fromYear and toYear. There is a published date I thought of using, but that is listed under unprocessed and non-indexed fields. Should I still try accessing info using publishedDate?
I am adding the sample response for reference:
{
"artifact.defaultPictureIndex":37285,
"artifact.ingress.title":"Utsikt fra Odd Fellow-bygningen, Stortingsgt. 28.",
"artifact.uniqueId":"021015452963",
"artifact.ingress.producer":"Wilse, Anders Beer",
"artifact.ingress.production.place":"Norge, Oslo, Oslo, Sentrum, St. Olavs gate 33 Norge, Oslo, Oslo, Sentrum, Frederiks gate 2 Historisk Museum Norge, Oslo, Oslo, Slottsparken, Nisseberget Norge, Oslo, Oslo, Sentrum, Kristian Augusts gate 19 Norge, Oslo, Oslo, Sentrum, Tullinløkka Norge, Oslo, Oslo, Sentrum, Kristian Augusts gate 23 Norge, Oslo, Oslo, Sentrum, Kristian Augusts gate 15 Norge, Oslo, Oslo, Sentrum, Frederiks gate 3 Norge, Oslo, Oslo, Sentrum, St. Olavs gate 27 Norge, Oslo, Oslo, Sentrum, St. Olavs gate 35 Norge, Oslo, Oslo, Sentrum, Kristian Augusts gate 13 Norge, Oslo, Oslo, Sentrum, Kristian Augusts gate 11 Norge, Oslo, Oslo, Sentrum, St. Olavs gate 29 Norge, Oslo, Oslo, Sentrum, Stortingsgata 28 Odd Fellow-gården Norge, Oslo, Oslo, Sentrum, Karl Johans gate 47 Universitetet Norge, Oslo, Oslo, Sentrum, Frederiks gate",
"artifact.publishedDate":"2014-10-25T22:33:06.868Z",
"artifact.ingress.producerRole":"Fotograf",
"identifier.id":"OB.Z04254",
"identifier.owner":"OMU",
"artifact.updatedDate":"2019-12-18T05:02:24.883Z",
"artifact.coordinate":"59.91448119999999, 10.732526900000039",
"artifact.uuid":"20499DF3-C60E-4C81-B0B9-17EAF90B6E80",
"artifact.pictureCount":1,
"artifact.ingress.classification":["219","344","346","361","367"],
"artifact.hasPictures":true,
"artifact.ingress.production.toYear":1933,
"artifact.ingress.production.fromYear":1933,
"artifact.ingress.license":["CC CC0 1.0"],
"artifact.hasChildren":false,
"artifact.defaultPictureDimension":"3600x2372",
"artifact.defaultMediaIdentifier":"012uMXAsuvLF",
"artifact.type":"Photograph",
"artifact.ingress.subjects":["parker","universiteter"]
}
Please let me know your views on this, and I will take this up.
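For reference, a few of the string-encoded fields in the sample response could be parsed roughly as below. This is a sketch based only on the single record shown here: the helper names are hypothetical, and it assumes every record uses the same date, dimension, and coordinate formats as this one.

```python
from datetime import datetime, timezone

# Trimmed copy of the sample response above.
record = {
    "artifact.publishedDate": "2014-10-25T22:33:06.868Z",
    "artifact.coordinate": "59.91448119999999, 10.732526900000039",
    "artifact.defaultPictureDimension": "3600x2372",
}


def parse_published_date(record):
    """Return publishedDate as a timezone-aware UTC datetime."""
    raw = record["artifact.publishedDate"]
    # The trailing "Z" marks UTC; it is matched literally here.
    parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def parse_dimensions(record):
    """Return (width, height) from a 'WIDTHxHEIGHT' string."""
    width, height = record["artifact.defaultPictureDimension"].split("x")
    return int(width), int(height)


def parse_coordinate(record):
    """Return (latitude, longitude) from a 'lat, long' string."""
    latitude, longitude = record["artifact.coordinate"].split(",")
    return float(latitude), float(longitude)
```

Calling parse_published_date(record).date() gives the upload day in the YYYY-MM-DD form that a --date parameter would use.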
mathemancer commented on Mon Oct 05 2020:
@amartya-dev From that sample, it looks like the publishedDate is more appropriate, since it seems to be the date of actual upload. In the end, it doesn't matter much as long as the whole collection gets into our DB. Is there any less-than/more-than functionality for any of the date fields in the API?
As for the checklist, I've edited the original issue comment to add it.
Issue author amartya-dev commented on Mon Oct 05 2020:
I do not think they provide a less-than or more-than search functionality. They actually have two different lists of fields, as follows.
This is the list of the fields that are "fields in index, unprocessed and for display":
- 'identifier.id'
- 'identifier.owner'
- 'artifact.uniqueId'
- 'artifact.type'
- 'artifact.pictureCount'
- 'artifact.hasPictures'
- 'artifact.defaultMediaIdentifier'
- 'artifact.defaultPictureIndex'
- 'artifact.publishedDate'
- 'artifact.updatedDate'
- 'artifact.ingress.title'
- 'artifact.ingress.producer'
- 'artifact.ingress.producerRole'
- 'artifact.ingress.additionalProducers'
- 'artifact.ingress.production.fromYear'
- 'artifact.ingress.production.toYear'
- 'artifact.ingress.production.place'
- 'artifact.ingress.classification'
- 'artifact.ingress.subjects'
- 'artifact.ingress.license'
- 'artifact.coordinate'
Whereas this list is of the fields that are "Indexed fields, processed and search only":
- 'artifact.name'
- 'artifact.title'
- 'artifact.classification'
- 'artifact.producer'
- 'artifact.depictedPerson'
- 'artifact.depictedPlace'
- 'artifact.material'
- 'artifact.technique'
- 'artifact.license'
- 'artifact.eventDescription'
- 'artifact.event.fromYear'
- 'artifact.event.toYear'
- 'artifact.event.place'
- 'artifact.folderUids'
- 'artifact.exhibitionUids'
- 'allContent'
I have tried querying the API with different date strings, but all it returns is Invalid Date String in the response. I will still look into ways in which this can be implemented with publishedDate; otherwise we would need to go with fromYear and toYear.
I will add more details later.
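If the fallback to fromYear and toYear is taken, one way to cover the whole collection is to walk it in fixed-size year windows. The sketch below only generates the (fromYear, toYear) pairs; how they would be passed to the API is not shown, because the query syntax has not been established in this thread. The function name, the window size, and the year bounds in the usage note are illustrative assumptions.

```python
def year_windows(start_year, end_year, window=1):
    """Yield inclusive (from_year, to_year) pairs covering the range."""
    if window < 1:
        raise ValueError("window must be at least 1")
    year = start_year
    while year <= end_year:
        # Clamp the last window so it never runs past end_year.
        yield year, min(year + window - 1, end_year)
        year += window
```

For example, list(year_windows(1930, 1934, window=2)) gives the pairs (1930, 1931), (1932, 1933), and (1934, 1934), each of which could back one request.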