wordpress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.

License: MIT License

Python 60.50% JavaScript 10.28% Shell 0.47% Dockerfile 0.34% Jinja 0.02% HTML 0.16% TypeScript 14.50% Vue 12.44% CSS 0.24% PLpgSQL 0.13% Just 0.90% MDX 0.02%

openverse creative-commons search-engine python javascript hacktoberfest typescript

openverse's Introduction

Openverse

Openverse is a powerful search engine for openly-licensed images, audio, and more. Openverse is live at openverse.org.

Catalog | The Apache Airflow-powered system for downloading and storing Openverse's metadata
Ingestion server | The mechanism for refreshing the data from the catalog to the API
API | The Django REST API for querying the database, used by the frontend
Frontend | The public search engine at openverse.org, built with Vue and Nuxt
Automations | Scripts used for various workflows around Openverse repositories and processes
Utilities | Scripts or utilities which are useful across multiple projects or don't necessarily fit into a specific project.

This repository also contains the following directories.

Brand | Brand assets for Openverse such as logo and icon and guidelines for using these assets
Templates | Jinja templates that can be rendered into common scaffolding code for the project

Keep in touch

You can keep in touch with the project via the following channels:

GitHub
- Issues
- PRs
- Discussions
- Project Board
Community Site
#openverse channel in the Making WordPress Chat
- Weekly Development Chat (Mondays @ 15:00 UTC)
- Monthly Prioritisation Meeting (first Wednesday of every month @ 15:00 UTC)

Documentation

To use the Openverse API, please refer to the API consumer documentation.

Contributing

Pull requests are welcome! Feel free to join us on Slack and discuss the project with the engineers and community members on #openverse.

You are welcome to take any open issue in the tracker labelled help wanted or good first issue; there's no need to ask for permission in advance. Other issues are open for contribution as well, but may be less accessible or well-defined in comparison to those that are explicitly labelled.

See the contribution guide for details.

Acknowledgments

Openverse, previously known as CC Search, was conceived and built at Creative Commons. We thank them for their commitment to open source and openly licensed content, with particular thanks to previous team members @ryanmerkley, @janetpkr, @lizadaly, @sebworks, @pa-w, @kgodey, @annatuma, @mathemancer, @aldenstpage, @brenoferreira, and @sclachar, along with their community of volunteers.

openverse's Issues

Digitalt Museum (Norwegian documentation)

This issue has been migrated from the CC Search Catalog repository

Author: annatuma
Date: Thu Jun 04 2020
Labels: Hacktoberfest,help wanted,providers,✨ goal: improvement,🙅 status: discontinued

Problem Description

We're currently integrated, via CommonCrawl, with Digitalt Museum, a Nordic data portal aggregating/supporting the digital collections of dozens of Nordic museums.

The full count of image records under a CC license is over 2.7 million.

Our current integration with Digitalt Museum covers only about 289 thousand of them.

Solution Description

We need to redo our integration with Digitalt Museum to use their API. Relying on CommonCrawl instead of the API would explain why we're missing 80%+ of the records we could be ingesting.
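As a rough check on the size of the gap, the two counts quoted in this issue can be compared directly. This is only a sketch: it treats "over 2.7 million" and "about 289 thousand" as exact figures, and the variable names are illustrative.

```python
# Figures quoted in this issue, treated as exact for the sake of the estimate.
cc_licensed_records = 2_700_000  # "over 2.7 million" CC-licensed image records
ingested_records = 289_000       # "about 289 thousand" currently ingested

coverage = ingested_records / cc_licensed_records
missing = 1 - coverage

print(f"coverage: {coverage:.1%}, missing: {missing:.1%}")
```

Under these assumptions roughly a tenth of the collection is covered, which is consistent with the "80%+" missing figure.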

We also need to review how this may be a fit for our provider-within-provider scheme, given that Digitalt Museum is an aggregator.

Finally, the API documentation is in Norwegian: http://api.dimu.org/doc/public_api.html

Additional context

None at this time, but may be updated as research is done on the issue.

No development should be done on a Provider API Script until the following info is gathered:

Verify there is a way to retrieve the entire relevant portion of the provider's collection in a systematic way via their API.
Verify the API provides license info (license type and version; license URL provides both, and is preferred)
Verify the API provides stable direct links to individual works.
Verify the API provides a stable landing page URL to individual works.
Note other info the API provides, such as thumbnails, dimensions, attribution info (required if non-CC0 licenses will be kept), title, description, other meta data, tags, etc.
Attach example responses to API queries that have the relevant info.

General Recommendations for implementation

The script should be in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
The script should have a test suite in the same directory.
The script must use the ImageStore class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
The script should use the DelayedRequester class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
The script must not use anything from
src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since
that module is deprecated.
If the provider API can be queried by 'upload date' or something similar,
the script should take a --date parameter when run as a script, giving the
date for which we should collect images. The form should be YYYY-MM-DD (so,
the script can be run via python my_favorite_provider.py --date 2018-01-01).
The script must provide a main function that takes the same parameters as from
the CLI. In our example from above, we'd then have a main function
my_favorite_provider.main(date). The main should do the same thing calling
from the CLI would do.
The script must conform to PEP8. Please use pycodestyle (available via
pip install pycodestyle) to check for compliance.
The script should use small, testable functions.
The test suite for the script may break PEP8 rules regarding long lines where
appropriate (e.g., long strings for testing).
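
The --date and main(date) recommendations above could be laid out roughly as follows. This is a minimal sketch, not the project's actual template: parse_date, build_parser and cli are hypothetical names, and the ImageStore and DelayedRequester classes are deliberately left out because their interfaces are not shown here.

```python
import argparse
from datetime import datetime


def parse_date(value):
    """Parse a YYYY-MM-DD string into a date; raises ValueError if malformed."""
    return datetime.strptime(value, "%Y-%m-%d").date()


def main(date):
    """Entry point shared by the CLI and by callers importing the module.

    A real script would fetch and store the provider's records for this
    date; this placeholder only validates the date and returns it.
    """
    return parse_date(date)


def build_parser():
    """Build the command-line parser with the single --date option."""
    parser = argparse.ArgumentParser(
        description="Hypothetical provider API script"
    )
    parser.add_argument(
        "--date", required=True, help="Date to collect, as YYYY-MM-DD"
    )
    return parser


def cli(argv):
    """Parse command-line arguments and hand them to main()."""
    args = build_parser().parse_args(argv)
    return main(args.date)


# A real script would call cli(sys.argv[1:]) under the usual
# `if __name__ == "__main__":` guard.
```

Keeping all the work in main means that calling cli(["--date", "2018-01-01"]) and calling main("2018-01-01") behave identically, which is what the recommendation asks for.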

--------

Original Comments:

annatuma commented on Thu Jun 04 2020:

@mathemancer flagging this for you to look at, potentially a fit for @ChariniNana to dig into.
source

ChariniNana commented on Fri Jun 05 2020:

Will look into this Anna
source

Issue author amartya-dev commented on Sat Oct 03 2020:

@ChariniNana are you still working on this?
source

ChariniNana commented on Mon Oct 05 2020:

@amartya-dev No. Feel free to start working on this ticket
source

Issue author amartya-dev commented on Mon Oct 05 2020:

@annatuma @mathemancer
I looked into the documentation provided, and I have a couple of questions before I can start the implementation:

Since this issue does not have the checklist for verifying the API providers before starting development, should I include that here?
In my opinion, the systematic way of retrieving info from the API is to query using fromYear and toYear. There is a published date I thought of using, but it is listed under unprocessed and non-indexed fields. Should I still try accessing info using publishedDate?

I am adding the sample response for reference:

{
        "artifact.defaultPictureIndex":37285,
        "artifact.ingress.title":"Utsikt fra Odd Fellow-bygningen, Stortingsgt. 28.",
        "artifact.uniqueId":"021015452963",
        "artifact.ingress.producer":"Wilse, Anders Beer",
        "artifact.ingress.production.place":"Norge, Oslo, Oslo, Sentrum, St. Olavs gate 33 Norge, Oslo, Oslo, Sentrum, Frederiks gate 2 Historisk Museum Norge, Oslo, Oslo, Slottsparken, Nisseberget Norge, Oslo, Oslo, Sentrum, Kristian Augusts gate 19 Norge, Oslo, Oslo, Sentrum, Tullinløkka Norge, Oslo, Oslo, Sentrum, Kristian Augusts gate 23 Norge, Oslo, Oslo, Sentrum, Kristian Augusts gate 15 Norge, Oslo, Oslo, Sentrum, Frederiks gate 3 Norge, Oslo, Oslo, Sentrum, St. Olavs gate 27 Norge, Oslo, Oslo, Sentrum, St. Olavs gate 35 Norge, Oslo, Oslo, Sentrum, Kristian Augusts gate 13 Norge, Oslo, Oslo, Sentrum, Kristian Augusts gate 11 Norge, Oslo, Oslo, Sentrum, St. Olavs gate 29 Norge, Oslo, Oslo, Sentrum, Stortingsgata 28 Odd Fellow-gården Norge, Oslo, Oslo, Sentrum, Karl Johans gate 47 Universitetet Norge, Oslo, Oslo, Sentrum, Frederiks gate",
        "artifact.publishedDate":"2014-10-25T22:33:06.868Z",
        "artifact.ingress.producerRole":"Fotograf",
        "identifier.id":"OB.Z04254",
        "identifier.owner":"OMU",
        "artifact.updatedDate":"2019-12-18T05:02:24.883Z",
        "artifact.coordinate":"59.91448119999999, 10.732526900000039",
        "artifact.uuid":"20499DF3-C60E-4C81-B0B9-17EAF90B6E80",
        "artifact.pictureCount":1,
        "artifact.ingress.classification":["219",
          "344",
          "346",
          "361",
          "367"],
        "artifact.hasPictures":true,
        "artifact.ingress.production.toYear":1933,
        "artifact.ingress.production.fromYear":1933,
        "artifact.ingress.license":["CC CC0 1.0"],
        "artifact.hasChildren":false,
        "artifact.defaultPictureDimension":"3600x2372",
        "artifact.defaultMediaIdentifier":"012uMXAsuvLF",
        "artifact.type":"Photograph",
        "artifact.ingress.subjects":["parker",
          "universiteter"]},

Please let me know your views on this, and I will take this up.
source
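
A few fields of the sample response above can be picked apart as sketched below. The helper names are hypothetical, and the "CC <license> <version>" layout of the license string is an assumption inferred from this single sample, not something the API documentation is known to guarantee.

```python
from datetime import datetime


def parse_dimension(value):
    """Split a 'WIDTHxHEIGHT' string such as '3600x2372' into two ints."""
    width, height = value.lower().split("x")
    return int(width), int(height)


def parse_license(value):
    """Split a string such as 'CC CC0 1.0' into (license, version).

    Assumes the layout 'CC <license> <version>', inferred from one sample.
    """
    prefix, license_name, version = value.split()
    if prefix != "CC":
        raise ValueError(f"unexpected license prefix: {prefix!r}")
    return license_name.lower(), version


# Subset of the sample record quoted in the comment above.
record = {
    "artifact.defaultPictureDimension": "3600x2372",
    "artifact.ingress.license": ["CC CC0 1.0"],
    "artifact.publishedDate": "2014-10-25T22:33:06.868Z",
}

width, height = parse_dimension(record["artifact.defaultPictureDimension"])
license_name, version = parse_license(record["artifact.ingress.license"][0])
published = datetime.strptime(
    record["artifact.publishedDate"], "%Y-%m-%dT%H:%M:%S.%fZ"
)
```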

mathemancer commented on Mon Oct 05 2020:

@amartya-dev From that sample, it looks like the publishedDate is more appropriate, since it seems to be the date of actual upload. In the end, it doesn't matter much as long as the whole collection gets into our DB. Is there any less-than/more-than functionality for any of the date fields in the API ?

As for the checklist, I've edited the original issue comment to add it.

source

Issue author amartya-dev commented on Mon Oct 05 2020:

I do not think they provide a less-than or more-than search functionality. They do have two different lists of fields, as follows:

This is the list of fields that are "fields in index, unprocessed and for display:"

- 'identifier.id'                          
- 'identifier.owner'                    
- 'artifact.uniqueId'                   
- 'artifact.type'                          
- 'artifact.pictureCount'                 
- 'artifact.hasPictures'                   
- 'artifact.defaultMediaIdentifier'      
- 'artifact.defaultPictureIndex'          
- 'artifact.publishedDate'                 
- 'artifact.updatedDate'                   
- 'artifact.ingress.title'                 
- 'artifact.ingress.producer'              
- 'artifact.ingress.producerRole'          
- 'artifact.ingress.additionalProducers'   
- 'artifact.ingress.production.fromYear'   
- 'artifact.ingress.production.toYear'     
- 'artifact.ingress.production.place'      
- 'artifact.ingress.classification'        
- 'artifact.ingress.subjects'             
- 'artifact.ingress.license'               
- 'artifact.coordinate'

whereas, this list is of the fields that are "Indexed fields, processed and search only":

- 'artifact.name'                          
- 'artifact.title'                         
- 'artifact.classification'                
- 'artifact.producer'                      
- 'artifact.depictedPerson'                
- 'artifact.depictedPlace'                 
- 'artifact.material'                      
- 'artifact.technique'                     
- 'artifact.license'                       
- 'artifact.eventDescription'              
- 'artifact.event.fromYear'               
- 'artifact.event.toYear'                  
- 'artifact.event.place'                  
- 'artifact.folderUids'                   
- 'artifact.exhibitionUids'              
- 'allContent'

I have tried querying the API with different date strings, but all it returns is Invalid Date String in the response. I will still look into ways in which this can be implemented with publishedDate; otherwise we would need to go with fromYear and toYear.

I will add more details later.
source

Wellcome Collection

This issue has been migrated from the CC Search Catalog repository

Author: annatuma
Date: Tue Jan 14 2020
Labels: providers,✨ goal: improvement,🙅 status: discontinued

Provider API Endpoint / Documentation

https://developers.wellcomecollection.org

Provider description

Most of the metadata we need is readily available (the license, attribution info, a link, a thumbnail, etc.). They have something we could use for the description meta_data field (which we like for search indexing).

Two considerations to look into further prior to integration:

Not much by way of tags.
Unclear if/how we can get only the newest data (vs having to pull the entire DB for every sync, which would mean less frequent syncs).

SECTIONS BELOW THIS POINT NEED TO BE WORKED ON PRIOR TO INTEGRATION

Licenses Provided

Provider API Technical info

General Recommendations for implementation

The script should be in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
The script should have a test suite in the same directory.
The script must use the ImageStore class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
The script should use the DelayedRequester class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
The script must not use anything from
src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since
that module is deprecated.
If the provider API can be queried by 'upload date' or something similar,
the script should take a --date parameter when run as a script, giving the
date for which we should collect images. The form should be YYYY-MM-DD (so,
the script can be run via python my_favorite_provider.py --date 2018-01-01).
The script must provide a main function that takes the same parameters as from
the CLI. In our example from above, we'd then have a main function
my_favorite_provider.main(date). The main should do the same thing calling
from the CLI would do.
The script must conform to PEP8. Please use pycodestyle (available via
pip install pycodestyle) to check for compliance.
The script should use small, testable functions.
The test suite for the script may break PEP8 rules regarding long lines where
appropriate (e.g., long strings for testing).

Examples of other Provider API Scripts

For example Provider API Scripts and accompanying test suites, please see

src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py, or
src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py.

Original Comments:

ChariniNana commented on Tue Mar 17 2020:

@mathemancer Can I work on this as part of my GSoC project?
source

mathemancer commented on Thu Mar 19 2020:

@ChariniNana Yes, but before beginning development, it's important to answer the following:

What licenses do they use, and how can we find that info?
What metadata do they provide, and how well does it match up with the interface of the ImageStore class?
@annatuma mentioned that it's not clear how to pull just the info updated on a given day from their API. Can you confirm this? What is the total volume of the collection? Is it feasible to pull the whole thing, e.g., monthly?
source

ChariniNana commented on Tue Mar 24 2020:

They seem to provide images under different licenses, with the details provided in the JSON response as follows (please try the GET request https://api.wellcomecollection.org/catalogue/v2/works/mtkdctvn)

"license": {
"id": "cc-by",
"label": "Attribution 4.0 International (CC BY 4.0)",
"type": "License",
"url": "http://creativecommons.org/licenses/by/4.0/"
}

Apart from the image description, I cannot see anything in the response that could go into the meta data field (it does not return information such as date created, views, etc.)

Need to explore a bit more to find if it's possible to pull just the info updated on a given day (no obvious method)
source
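
Since the license object above carries a Creative Commons URL, the license type and version can be read off the URL path. The sketch below uses a hypothetical helper, license_from_url, and only handles the /licenses/<license>/<version>/ layout seen in this sample.

```python
from urllib.parse import urlparse


def license_from_url(license_url):
    """Return (license, version) from a Creative Commons license URL.

    Handles only the '/licenses/<license>/<version>/' layout seen in the
    sample; any other layout raises ValueError.
    """
    parts = [p for p in urlparse(license_url).path.split("/") if p]
    if len(parts) < 3 or parts[0] != "licenses":
        raise ValueError(f"unrecognised license URL: {license_url}")
    return parts[1], parts[2]


# The license object quoted in the comment above.
sample = {
    "id": "cc-by",
    "label": "Attribution 4.0 International (CC BY 4.0)",
    "type": "License",
    "url": "http://creativecommons.org/licenses/by/4.0/",
}

license_name, version = license_from_url(sample["url"])
```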

mathemancer commented on Wed Mar 25 2020:

@ChariniNana It's okay if that's the case, as long as the overall volume isn't too large, and the speed is fast enough. The main thing is to have some strategy to get all the data over time.
source

mathemancer commented on Wed Mar 25 2020:

@annatuma I'm a bit concerned about this line in their developer docs:

There are some licensing restrictions, as different parts of the data may have different licenses. If it’s data that has been created by us, it’s CC0; if it’s not created by us, then it isn’t. We are working to make data licensing clear on a per work basis; in the meantime, if this is a concern, please get in touch.
(emphasis mine)

Have we been in touch with them regarding that?
source

allen505 commented on Mon Mar 30 2020:

In this section, it says that Europeana has used their APIs to include the images into Europeana. If this is the case, when Europeana is integrated into cc-search there could be data duplication.
source

mathemancer commented on Wed Apr 01 2020:

@allen505 It's totally possible to have data duplication with our current set up. We'd need to choose on a case-by-case basis whether to keep the data from Europeana (if it's usefully enriched, for example), or from the upstream provider. Does the Europeana API give an ID of the upstream provider? That would make future deduplication comparatively easy.
source

allen505 commented on Thu Apr 02 2020:

@mathemancer The provider field in the response gives the value of the Provider which should be
"provider": [ "Wellcome Collection" ] in this case.

The following query gives all the items which belong to Wellcome Collection:
https://www.europeana.eu/api/v2/search.json?wskey=API_KEY&query=*:*&qf=PROVIDER:%22Wellcome+Collection%22

source
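
The query URL quoted above can be assembled from its parts as sketched here. The function name provider_query_url is hypothetical, and API_KEY is a placeholder exactly as in the comment; only the endpoint and parameter names come from the URL above.

```python
from urllib.parse import urlencode

EUROPEANA_SEARCH = "https://www.europeana.eu/api/v2/search.json"


def provider_query_url(api_key, provider):
    """Build a search URL restricted to one upstream provider."""
    params = {
        "wskey": api_key,
        "query": "*:*",
        "qf": f'PROVIDER:"{provider}"',
    }
    # Keep '*' and ':' literal so the result matches the URL quoted above.
    return f"{EUROPEANA_SEARCH}?{urlencode(params, safe='*:')}"


url = provider_query_url("API_KEY", "Wellcome Collection")
print(url)
```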

mathemancer commented on Fri Apr 03 2020:

That's awesome. @annatuma I believe I remember that the folks at Europeana said that some of their providers were more reliable than others when it came to license labeling. Do we have records about which providers those are?

We could use the same 'provider' vs 'source' scheme that we do for commoncrawl-sourced images for these aggregators.

@allen505 This is great info, thanks!
source

annatuma commented on Thu Jun 04 2020:

@mathemancer sorry I missed this - we don't have information on that, but we'll check the record count for Wellcome once our Europeana integration is live, and ensure we're getting a full collection.
source

[Feature] Metadata to classify providers (original #571)

This issue has been migrated from the CC Search Catalog repository

Author: zackkrida
Date: Tue Jul 28 2020
Labels: ✨ goal: improvement,🏷 status: label work required,🙅 status: discontinued

Problem

While we don't want the current frontend design of Openverse to lead our API design, we currently hardcode classifications of the providers in the Openverse frontend, and would like to consider adding a mechanism to do this via the API.

Description

Add a new field named classification to the ContentProvider model. The field should be a CharField with choices set to a list of classifications that includes the following:

Cultural institution
Curated stock photography
Social media
Scientific observational data

Actually adding classifications for each content provider will be done in a separate issue (#661). In order to accommodate this, the field must be nullable with a default of None. Yet another future issue will remove the nullability (#660).
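
One way to picture the proposed field is sketched below. Django is not imported so that the snippet runs on its own; the dictionary only collects keyword arguments that could be passed to a CharField on ContentProvider. The stored keys, the max_length value and the blank setting are assumptions, while the four labels, the nullability and the default of None come from the description above.

```python
from enum import Enum


class Classification(Enum):
    """Provider classifications listed in the issue description."""

    CULTURAL_INSTITUTION = "Cultural institution"
    CURATED_STOCK_PHOTOGRAPHY = "Curated stock photography"
    SOCIAL_MEDIA = "Social media"
    SCIENTIFIC_OBSERVATIONAL_DATA = "Scientific observational data"


# (stored value, human-readable label) pairs, the shape Django expects for
# the `choices` argument of a CharField. The stored keys are hypothetical.
CLASSIFICATION_CHOICES = [(c.name.lower(), c.value) for c in Classification]

# Keyword arguments one might pass to models.CharField(...).
# max_length and blank are assumptions; null and default follow the issue.
CLASSIFICATION_FIELD_KWARGS = {
    "max_length": 50,
    "choices": CLASSIFICATION_CHOICES,
    "null": True,
    "blank": True,
    "default": None,
}
```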

Original Comments:

aldenstpage commented on Tue Jul 28 2020:

Who classifies new content sources when they are added to our database, and how? Do we want this to be configurable through the content provider interface? @annatuma @kgodey
source

kgodey commented on Tue Jul 28 2020:

we don't have an answer yet, this will depend on the design work that @annatuma and @panchovm are doing on sources/collections and may take a while. we'll update the issue once there are more details.
source

Internet Archive Audio

This issue has been migrated from the CC Search Catalog repository

Author: annatuma
Date: Wed Mar 11 2020
Labels: providers,✨ goal: improvement,🙅 status: discontinued

At https://archive.org/details/audio users can set CC licenses when uploading their work. I've found examples of CC0 and PDM while browsing the collection at random. Unclear how many objects are here, and whether there's an API with the necessary data.

** More ticket work is required to see if there's a path forward here **

Provider API Endpoint / Documentation

Provider description

https://archive.org/details/audio

Licenses Provided

Provider API Technical info

Checklist to complete before beginning development

No development should be done on a Provider API Script until the following info is gathered:

Verify there is a way to retrieve the entire relevant portion of the provider's collection in a systematic way via their API.
Verify the API provides license info (license type and version; license URL provides both, and is preferred)
Verify the API provides stable direct links to individual works.
Verify the API provides a stable landing page URL to individual works.
Note other info the API provides, such as thumbnails, dimensions, attribution info (required if non-CC0 licenses will be kept), title, description, other meta data, tags, etc.
Attach example responses to API queries that have the relevant info.

General Recommendations for implementation

The script should be in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
The script should have a test suite in the same directory.
The script must use the ImageStore class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
The script should use the DelayedRequester class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
The script must not use anything from
src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since
that module is deprecated.
If the provider API can be queried by 'upload date' or something similar,
the script should take a --date parameter when run as a script, giving the
date for which we should collect images. The form should be YYYY-MM-DD (so,
the script can be run via python my_favorite_provider.py --date 2018-01-01).
The script must provide a main function that takes the same parameters as from
the CLI. In our example from above, we'd then have a main function
my_favorite_provider.main(date). The main should do the same thing calling
from the CLI would do.
The script must conform to PEP8. Please use pycodestyle (available via
pip install pycodestyle) to check for compliance.
The script should use small, testable functions.
The test suite for the script may break PEP8 rules regarding long lines where
appropriate (e.g., long strings for testing).

Examples of other Provider API Scripts

For example Provider API Scripts and accompanying test suites, please see

src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py, or
src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py.

Original Comments:

Issue author amartya-dev commented on Thu Mar 19 2020:

Information gathered about the API

The following is the information about the API:

Verify there is a way to retrieve the entire relevant portion of the provider's collection in a systematic way via their API.
Yes, the files can be fetched systematically after deciding the number of entries that should be included in a page. The API also supports pagination through the page parameter in the API request.
The endpoint is: https://archive.org/advancedsearch.php.
Documentation for the API: https://blog.archive.org/developers/
The other official documentation: https://archive.org/services/docs/api/
The other documentation provides for a command line script and a python wrapper which can be used after obtaining the API credentials from Internet Archive.
Verify the API provides license info (license type and version; license URL provides both, and is preferred)
The API provides the license URL with the licenseurl key in the response JSON.
Verify the API provides stable direct links to individual works.
The API does not provide the links directly but they can be easily formed by the identifier and metadata provided by querying a separate endpoint.
Verify the API provides a stable landing page URL to individual works.
The API provides a stable landing page URL for the works: https://archive.org/details/
Note other info the API provides, such as thumbnails, dimensions, attribution info (required if non-CC0 licenses will be kept), title, description, other meta data, tags, etc.
Attach example responses to API queries that have the relevant info.
Example response:
{'responseHeader': {'status': 0,
'QTime': 788,
'params': {'query': 'mediatype:audio',
'qin': 'mediatype:audio',
'fields': 'identifier,title,mediatype,collection,licenseurl,date',
'wt': 'json',
'rows': '2',
'start': 0}},
'response': {'numFound': 9441652,
'start': 0,
'docs': [{'collection': ['audio_sermons', 'audio_religion'],
'identifier': 'JesusTheRescuer',
'licenseurl': 'http://creativecommons.org/licenses/by-nc-nd/3.0/',
'mediatype': 'audio',
'title': 'Jesus, The Rescuer'},
{'collection': ['audio_sermons', 'audio_religion'],
'date': '2015-07-05T00:00:00Z',
'identifier': 'July52015EveningSermon',
'mediatype': 'audio',
'title': 'What to Do When the Foundations Are Destroyed'}]}}

source
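
The paging and filtering described in this comment might look roughly like the sketch below. The helper names are hypothetical. The q, fl[], rows, page and output parameters are the ones the advancedsearch endpoint is generally documented to accept, but whether it permits paging through the full result set is not confirmed here. Appending the identifier to the details URL is likewise an assumption.

```python
import math
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"
FIELDS = [
    "identifier", "title", "mediatype", "collection", "licenseurl", "date",
]


def search_url(page, rows=100):
    """Build the URL for one page of audio search results."""
    params = [("q", "mediatype:audio")]
    params += [("fl[]", field) for field in FIELDS]
    params += [("rows", rows), ("page", page), ("output", "json")]
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def page_count(num_found, rows=100):
    """Number of pages needed to cover num_found results."""
    return math.ceil(num_found / rows)


def licensed_docs(docs):
    """Keep only documents that carry a licenseurl, adding a landing URL."""
    return [
        {
            **doc,
            # Assumed landing page pattern: details URL plus identifier.
            "landing_url": f"https://archive.org/details/{doc['identifier']}",
        }
        for doc in docs
        if doc.get("licenseurl")
    ]
```

In the example response above, only the first of the two documents has a licenseurl, so a filter like licensed_docs would drop the second.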

mathemancer commented on Tue Mar 24 2020:

Thanks for putting this here.
source

[Source Suggestion] Unsplash.com (original #549)

This issue has been migrated from the CC Search Catalog repository

Author: dgbdgb
Date: Wed Feb 17 2021
Labels: providers,🏷 status: label work required,🙅 status: discontinued

Source Site

unsplash.com

Value Provided

really good quality photographs for many topics

Licenses Provided

CC0

Original Comments:

TimidRobot commented on Fri Feb 19 2021:

🙅🏻 status: discontinued: Project is in maintenance mode (Upcoming Changes to the CC Open Source Community — Creative Commons Open Source).
source

Sörmlands Museum

This issue has been migrated from the CC Search Catalog repository

Author: kgodey
Date: Mon Aug 19 2019
Labels: 🙅 status: discontinued

https://sokisamlingar.sormlandsmuseum.se

Original Comments:

annatuma commented on Mon Feb 24 2020:

I was unable to find licensing information or API documentation. I've sent them an email to inquire on both counts. Moving this to Blocked for now.
source

The Noun Project

This issue has been migrated from the CC Search Catalog repository

Author: dravadhis
Date: Fri Oct 23 2020
Labels: providers,🙅 status: discontinued

Provider API Endpoint / Documentation

http://api.thenounproject.com/

Provider description

The Noun Project API provides a collection of icons and photos.

Licenses Provided

CC BY 3.0
Public Domain Mark 1.0

Provider API Technical info

Rate Limits: 5000 requests/month
Overall volume: 3 million (Rough Indication)
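
These two numbers bound how quickly the whole collection could be pulled. The sketch below is a back-of-the-envelope estimate only: the page size of 50 items per request is a hypothetical value, not taken from the Noun Project API documentation.

```python
import math

REQUESTS_PER_MONTH = 5_000  # rate limit quoted above
TOTAL_ITEMS = 3_000_000     # rough overall volume quoted above
ITEMS_PER_REQUEST = 50      # hypothetical page size, not from the API docs


def months_for_full_pull(total, per_request, requests_per_month):
    """Months needed to fetch every item once under the monthly limit."""
    requests_needed = math.ceil(total / per_request)
    return math.ceil(requests_needed / requests_per_month)


months = months_for_full_pull(
    TOTAL_ITEMS, ITEMS_PER_REQUEST, REQUESTS_PER_MONTH
)
print(months)
```

With these assumed values a full pull would take about a year, so the achievable page size matters a great deal for feasibility.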

Checklist to complete before beginning development

No development should be done on a Provider API Script until the following info is gathered:

Verify there is a way to retrieve the entire relevant portion of the provider's collection in a systematic way via their API.
Verify the API provides license info (license type and version; license URL provides both, and is preferred)
Verify the API provides stable direct links to individual works.
Verify the API provides a stable landing page URL to individual works.
Note other info the API provides, such as thumbnails, dimensions, attribution info (required if non-CC0 licenses will be kept), title, description, other meta data, tags, etc.
Attach example responses to API queries that have the relevant info.

General Recommendations for implementation

The script should be in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
The script should have a test suite in the same directory.
The script must use the ImageStore class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
The script should use the DelayedRequester class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
The script must not use anything from
src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since
that module is deprecated.
If the provider API can be queried by 'upload date' or something similar,
the script should take a --date parameter when run as a script, giving the
date for which we should collect images. The form should be YYYY-MM-DD (so,
the script can be run via python my_favorite_provider.py --date 2018-01-01).
The script must provide a main function that takes the same parameters as from
the CLI. In our example from above, we'd then have a main function
my_favorite_provider.main(date). The main should do the same thing calling
from the CLI would do.
The script must conform to PEP8. Please use pycodestyle (available via
pip install pycodestyle) to check for compliance.
The script should use small, testable functions.
The test suite for the script may break PEP8 rules regarding long lines where
appropriate (e.g., long strings for testing).

Examples of other Provider API Scripts

For example Provider API Scripts and accompanying test suites, please see

src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py, or
src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py.

Original Comments:

Issue author dravadhis commented on Fri Oct 23 2020:

@mathemancer As mentioned in #1560 I will begin work on this.
Thank You!
source

Issue author dravadhis commented on Wed Oct 28 2020:

@mathemancer I am working on issue #1559. The API requires OAuth 1.0 authorisation via an api_key and consumer_secret. To send an authorised request I need to use a library called requests-oauthlib (See here for reference). Should I add this library in the requirements file? Or is there any other way to go about this?
source

mathemancer commented on Thu Oct 29 2020:

@dravadhis I think that should be fine. Just add it to both requirements files (make sure to freeze the version in prod requirements, but not dev).
source

Collect data on API usage

This issue has been migrated from the CC Search API repository

Author: kgodey
Date: Fri Apr 10 2020
Labels: ✨ goal: improvement,🏷 status: label work required,🙅 status: discontinued

Problem Description

We should collect data on how our API is being used. We should be able to create reports of non-CC Search API usage:

by API key
from unauthenticated users (per user, anonymized)

Solution Description

Append API keys (when available) to uwsgi logs.

Identify fields from which the license can be obtained for certain Smithsonian museums (original #472)

This issue has been migrated from the CC Search Frontend repository

Author: ChariniNana
Date: Mon Jul 27 2020
Labels: ✨ goal: improvement,🙅 status: discontinued

Current Situation

In the existing implementation of the Smithsonian provider API script, the license value is obtained from the content.descriptiveNonRepeating.usage.access field. Only the images with a license value CC0 are stored in the image DB. However, for two of the Smithsonian museums (FBR - smithsonian_field_book_project, and NAA - smithsonian_anthropological_archives) this field is unavailable, and thus we lose all their images due to the inability to identify their public domain status.

Suggested Improvement

Identify an alternative field from which to obtain the license

Benefit

Considerable improvement to the completeness of Smithsonian data

Additional context

Servier Medical Art Images

This issue has been migrated from the CC Search Catalog repository

Author: kgodey
Date: Tue Aug 20 2019
Labels: 🙅 status: discontinued

https://smart.servier.com/, user suggestion via email.

Original Comments:

annatuma commented on Mon Feb 24 2020:

3000 objects, no indication of API on website. I've contacted them for more information. Moving this to blocked until we hear back.
source

Idea: Generate social image previews for single images

This issue has been migrated from the CC Search Frontend repository

Author: zackkrida
Date: Thu Jul 23 2020
Labels: ✨ goal: improvement,🏷 status: label work required,🙅 status: discontinued,🚧 status: blocked

It would be really lovely to generate image previews for social media that showed the attribution as a watermark, and perhaps the cc search logo, instead of just showing a plain image.

Original Comments:

zackkrida commented on Thu Jul 23 2020:

Relates to the 'open graph image generator' idea expressed in #1077
source

Paris Musées (French documentation)

This issue has been migrated from the CC Search Catalog repository

Author: annatuma
Date: Thu Jun 04 2020
Labels: Hacktoberfest,help wanted,providers,✨ goal: improvement,🙅 status: discontinued

Please note that all documentation is in French. French language competency is required to work on this integration.

Provider API Endpoint / Documentation

https://apicollections.parismusees.paris.fr/

For build/test purposes, sign up for an API key here:
https://apicollections.parismusees.paris.fr/user/register

CC will swap this API key for our organizational key before the integration goes live.

Provider description

Licenses Provided

Provider API Technical info

Checklist to complete before beginning development

No development should be done on a Provider API Script until the following info is gathered:

Verify there is a way to retrieve the entire relevant portion of the provider's collection in a systematic way via their API.
Verify the API provides license info (license type and version; license URL provides both, and is preferred)
Verify the API provides stable direct links to individual works.
Verify the API provides a stable landing page URL to individual works.
Note other info the API provides, such as thumbnails, dimensions, attribution info (required if non-CC0 licenses will be kept), title, description, other meta data, tags, etc.
Attach example responses to API queries that have the relevant info.

General Recommendations for implementation

The script should be in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
The script should have a test suite in the same directory.
The script must use the ImageStore class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
The script should use the DelayedRequester class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
The script must not use anything from
src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since
that module is deprecated.
If the provider API can be queried by 'upload date' or something similar,
the script should take a --date parameter when run as a script, giving the
date for which we should collect images. The form should be YYYY-MM-DD (so,
the script can be run via python my_favorite_provider.py --date 2018-01-01).
The script must provide a main function that takes the same parameters as the
CLI. In our example from above, we'd then have a main function
my_favorite_provider.main(date). The main function should do the same thing
that calling the script from the CLI would do.
The script must conform to PEP8. Please use pycodestyle (available via
pip install pycodestyle) to check for compliance.
The script should use small, testable functions.
The test suite for the script may break PEP8 rules regarding long lines where
appropriate (e.g., long strings for testing).
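
The --date and main(date) recommendations above can be sketched as follows. This is a minimal skeleton rather than an existing script: parse_date and build_parser are hypothetical helper names, and the body of main is a placeholder where a real script would presumably call DelayedRequester and ImageStore.

```python
import argparse
from datetime import datetime


def parse_date(date_string):
    """Validate a YYYY-MM-DD string and return it unchanged."""
    # strptime raises ValueError when the string does not match the format.
    datetime.strptime(date_string, "%Y-%m-%d")
    return date_string


def build_parser():
    """Build the CLI parser with the single optional --date argument."""
    parser = argparse.ArgumentParser(description="My Favorite Provider API job")
    parser.add_argument(
        "--date",
        type=parse_date,
        help="Collect images uploaded on this date (YYYY-MM-DD).",
    )
    return parser


def main(date):
    """Entry point shared by the CLI and by code importing this module."""
    # Placeholder (assumption): a real script would request the provider's
    # records for `date` here and hand each one to the image store.
    print(f"Collecting images uploaded on {date}")
    return date


if __name__ == "__main__":
    args = build_parser().parse_args()
    main(args.date)
```

Because the CLI branch only parses arguments and delegates to main, calling my_favorite_provider.main("2018-01-01") from another module behaves the same as passing --date 2018-01-01 on the command line.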

Examples of other Provider API Scripts

For example Provider API Scripts and accompanying test suites, please see

src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py, or
src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py.

Process Objects from Queue of User Reported Content (original #351)

This issue has been migrated from the CC Search Catalog repository

Author: annatuma
Date: Fri Apr 03 2020
Labels: 🙅 status: discontinued

Problem Description

We need to develop a process by which we pull objects from a flagging queue, and add metadata in the catalog.

Further ticket work required

Relates to: cc-archive/cccatalog-frontend#848

Biodiversity Heritage Library

This issue has been migrated from the CC Search Catalog repository

Author: annatuma
Date: Wed Feb 26 2020
Labels: 🔒 staff only,🙅 status: discontinued

The Biodiversity Heritage Library licenses all metadata as CC0, as stated in their developer documentation:
https://about.biodiversitylibrary.org/tools-and-services/developer-and-data-tools/

However, individual sources do not appear to have consistent rights statements, and are not using CC legal tools.

E.g. unknown rights: https://www.biodiversitylibrary.org/item/237667#page/2/mode/1up

E.g. public domain (but not PDM): https://www.biodiversitylibrary.org/page/12866543#page/69/mode/1up

This collection needs to be reviewed by Product Counsel before proceeding.

Original Comments:

annatuma commented on Wed Feb 26 2020:

I question if we can ingest this collection as it stands, since only metadata is CC0 licensed, and not the actual objects. @sarahpearson assigning this to you, as next step is legal review.
source

sarahpearson commented on Fri Feb 28 2020:

Thanks for flagging. It looks to me like at least some of the objects in the collection are CC-licensed, according to this: https://about.biodiversitylibrary.org/help/copyright-and-reuse/ (See the "In Copyright" section)

In order to include this repository in the catalog, we would have to be able to isolate and include only those CC-licensed objects. Given that the public domain material in the repository is not marked with either CC0 or the Public Domain Mark, we should not include them at this time.
source

Automatically clean up after failed indexing runs (original #402)

This issue has been migrated from the CC Search API repository

Author: aldenstpage
Date: Tue Jan 14 2020
Labels: Hacktoberfest,help wanted,✨ goal: improvement,🏷 status: label work required,🙅 status: discontinued

When an indexing job fails (such as if a node in our Elasticsearch cluster has a full disk, or a bug in indexer-worker halts the process), the incomplete index is left inside of the Elasticsearch cluster, requiring someone to manually delete it. The indexer should detect this condition when the job starts and handle it.

The production index is determined by the image alias. The indexer should delete any index NOT pointed to by this alias following the naming scheme image-<uuid>.
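
The selection rule described above could look roughly like this. It is a sketch under assumptions: find_stale_indices is a hypothetical helper, the exact UUID formatting in index names (with or without hyphens) is a guess, and listing indices, resolving the alias, and deleting are left to whatever Elasticsearch client the indexer already uses.

```python
import re

# Naming scheme from the issue: "image-" followed by a UUID. Accepting both
# hyphenated and unhyphenated UUIDs is an assumption.
INDEX_PATTERN = re.compile(
    r"^image-[0-9a-f]{8}-?([0-9a-f]{4}-?){3}[0-9a-f]{12}$"
)


def find_stale_indices(all_indices, live_indices):
    """Return image-<uuid> indices that the alias does not point to."""
    live = set(live_indices)
    return sorted(
        name
        for name in all_indices
        if INDEX_PATTERN.match(name) and name not in live
    )


# Example with made-up index names.
all_indices = [
    "image-0c6a4b5e9d3f4a1b8c2d7e6f5a4b3c2d",  # leftover from a failed run
    "image-7f1e2d3c4b5a49688796a5b4c3d2e1f0",  # currently behind the alias
    "some-other-index",                        # does not match the scheme
]
live_indices = ["image-7f1e2d3c4b5a49688796a5b4c3d2e1f0"]
stale = find_stale_indices(all_indices, live_indices)
```

Filtering on the naming scheme first keeps the cleanup from touching indices that belong to something other than the image indexer.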

Original Comments:

hedonhermdev commented on Sat Feb 22 2020:

Can I work on this issue?
source

CodeMonk263 commented on Sun Feb 23 2020:

Can i work on this issue?

source

kgodey commented on Tue Feb 25 2020:

@hedonhermdev go ahead. @CodeMonk263 please find another issue to work on since @hedonhermdev commented first.

DantrazTrev commented on Sat Feb 29 2020:

@hedonhermdev are you still working on this issue?

hedonhermdev commented on Sat Feb 29 2020:

No.

source

DantrazTrev commented on Sat Feb 29 2020:

Can i take it over?
@aldenstpage

kgodey commented on Tue Mar 03 2020:

Go ahead @DantrazTrev

tushar912 commented on Fri Oct 02 2020:

@DantrazTrev are u still working on this?

kgodey commented on Fri Oct 02 2020:

@tushar912 it's been a few months since @DantrazTrev's post, I think you can go ahead and work on this.

tushar912 commented on Fri Oct 02 2020:

Ok

tushar912 commented on Tue Oct 06 2020:

The way I understood this issue is as follows. The main indexing job is done by indexer.py in ingestion_server. The TableIndexer class contains a method _index_table, which checks whether the database is in sync with the index and replicates if not. There are two methods of indexing: reindex, which creates a new index and makes it the live alias, and update, which updates the index. Currently, if the index is not created successfully during reindex, it still persists in the cluster, so the job is to delete the index if indexing fails. @kgodey or @aldenstpage, please tell me if I have understood correctly.
source

tushar912 commented on Tue Oct 06 2020:

Also, I am thinking of modifying the existing consistency_check method and adding it to reindex to delete the index if it is not indexed properly. Am I on the right track?
source

AudioStore development (original #322)

This issue has been migrated from the CC Search Catalog repository

Author: mariuszskon
Date: Fri Mar 13 2020
Labels: 🙅 status: discontinued

As per WordPress/openverse-catalog#311, integrating audio into CC Catalog will require a few different things. This issue hopes to keep track of everything relevant to the development of AudioStore.

(NEW) means a field that was not in ImageStore that was added to AudioStore
Current schema (in audio.py):

Foreign ID
Landing URL
Audio URL
File format (NEW)
Thumbnail URL
Filesize
Duration (NEW)
Samplerate (NEW)
Bitdepth (NEW)
Channels (NEW)
License (like ImageStore, license and license version)
Creator
Creator URL
Title
Collection/Album (NEW)
Type (music, podcast, lecture) (NEW)
Genre (NEW)
Language (NEW)
Metadata
Tags
Provider
Source

Other data to keep, but inside metadata rather than dedicated columns:

Date of publishing
Instruments
Mood
Description
Bitrate (as provided by source, rather than calculating ourselves from samplerate and bitdepth)
Views
Number of ratings
Number of comments
Related (list of foreign IDs of sounds which source considers to be similar)

Inappropriate columns for AudioStore that are in ImageStore:

Width
Height
Watermarked
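
For discussion, the proposed columns can be written down as a Python dataclass. This is only an illustration of the schema listed above: the class name AudioRecord, the attribute names, and the choice of which fields are required are all assumptions, and the actual AudioStore in audio.py may be structured quite differently. The issue does not specify units for duration or filesize, so none are implied here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AudioRecord:
    """Hypothetical container mirroring the proposed AudioStore columns."""

    # Columns shared with ImageStore (treated as required here by assumption).
    foreign_id: str
    landing_url: str
    audio_url: str
    license_: str
    license_version: str
    provider: str
    source: str
    thumbnail_url: Optional[str] = None
    filesize: Optional[int] = None
    creator: Optional[str] = None
    creator_url: Optional[str] = None
    title: Optional[str] = None
    # Columns marked (NEW) in the list above.
    file_format: Optional[str] = None
    duration: Optional[int] = None
    samplerate: Optional[int] = None
    bitdepth: Optional[int] = None
    channels: Optional[int] = None
    collection: Optional[str] = None
    audio_type: Optional[str] = None  # e.g. music, podcast, lecture
    genre: Optional[str] = None
    language: Optional[str] = None
    # Data without a dedicated column (mood, instruments, bitrate, ...).
    meta_data: Dict[str, object] = field(default_factory=dict)
    tags: List[str] = field(default_factory=list)


# Example record with made-up values.
record = AudioRecord(
    foreign_id="123",
    landing_url="https://example.com/sounds/123",
    audio_url="https://example.com/sounds/123.mp3",
    license_="by",
    license_version="4.0",
    provider="example",
    source="example",
    meta_data={"mood": "calm", "instruments": ["piano"]},
)
```

Width, height, and watermarked are deliberately absent, matching the list of ImageStore columns that do not apply to audio.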

Original Comments:

Issue author mariuszskon commented on Fri Mar 13 2020:

I am uncertain about the "type" field (as in music, podcast, lecture) and if it could be combined with genre in some way, or if we should keep it separate. What do you think @mathemancer ?
source

Issue author mariuszskon commented on Fri Mar 13 2020:

Date field blocked by WordPress/openverse-catalog#324
source

mathemancer commented on Mon Mar 16 2020:

@mariuszskon I think Genre and Type are different. We could probably come up with a list of 'allowed' types, e.g., music, podcast, lecture. Then, within music, we'd have the genres: rock, pop, classical, jazz, etc. On the other hand, in the 'podcast' type, for genre, we'd have: 'science', 'comedy', 'news', etc.
source

mathemancer commented on Mon Mar 16 2020:

@brenoferreira Do you have any guidance or ideas based on what you'd like to have available to the front end here?
source

brenoferreira commented on Mon Mar 16 2020:

Based on some of the designs here, it seems very comprehensive to me.
The frontend probably won't look exactly like this, but it'll probably be inspired by it. The only thing missing from the metadata list here are things like music instruments and mood. But I think that's ok
source

Issue author mariuszskon commented on Tue Mar 17 2020:

@mathemancer Indeed I see the distinction much more clearly, and it is certainly worth having them separate.
source

Issue author mariuszskon commented on Tue Mar 17 2020:

@brenoferreira instruments and mood can be added no problem, but I am concerned that audio sources might not have this information available for ingestion.
source

brenoferreira commented on Tue Mar 17 2020:

Yes, that's what I meant. If the source has the info, great. But if not, it should be fine for now.

source

Internet Archive

This issue has been migrated from the CC Search Catalog repository

Author: stragu
Date: Thu Jan 09 2020
Labels: Hacktoberfest,help wanted,providers,✨ goal: improvement,🙅 status: discontinued

Provider API Endpoint / Documentation

Internet Archive has CC-licensed images: https://archive.org/

This integration could be limited to images only using mediatype:"image"
https://archive.org/details/image?and[]=mediatype%3A%22image%22

There are close to 3.5 million images available.

To filter by license, the CC abbreviations can be used. For example:
https://archive.org/details/image?and[]=mediatype%3A%22image%22&and[]=licenseurl:http*by-nc*
...gives more than 110,000 results.

There is more information about license search here:
https://help.archive.org/hc/en-us/articles/360018359991-Search-A-Basic-Guide
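
As a small illustration of the license filter above, the search URL can be assembled from a CC abbreviation. Note the assumptions: build_license_search_url is a hypothetical helper, and it targets the website search URL quoted in this issue, not the Internet Archive API that an actual provider script would need to use.

```python
from urllib.parse import urlencode

# Website search page quoted in the issue (not an API endpoint).
BASE_URL = "https://archive.org/details/image"


def build_license_search_url(license_abbreviation):
    """Build the image search URL filtered by a CC license abbreviation."""
    filters = [
        ("and[]", 'mediatype:"image"'),
        ("and[]", f"licenseurl:http*{license_abbreviation}*"),
    ]
    # Keep brackets, colons and wildcards readable, as in the issue's URL.
    return f"{BASE_URL}?{urlencode(filters, safe='[]:*')}"


url = build_license_search_url("by-nc")
print(url)
```

The wildcard pattern matches any license URL containing the abbreviation, so a filter for by-nc would presumably also match by-nc-sa and by-nc-nd; a real integration would need to tell those apart.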

API docs
Example query

Ticket work required beyond this point

Provider description

Licenses Provided

Provider API Technical info

Checklist to complete before beginning development

No development should be done on a Provider API Script until the following info is gathered:

Verify there is a way to retrieve the entire relevant portion of the provider's collection in a systematic way via their API.
Verify the API provides license info (license type and version; license URL provides both, and is preferred)
Verify the API provides stable direct links to individual works.
Verify the API provides a stable landing page URL to individual works.
Note other info the API provides, such as thumbnails, dimensions, attribution info (required if non-CC0 licenses will be kept), title, description, other meta data, tags, etc.
Attach example responses to API queries that have the relevant info.

General Recommendations for implementation

The script should be in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
The script should have a test suite in the same directory.
The script must use the ImageStore class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
The script should use the DelayedRequester class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
The script must not use anything from
src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since
that module is deprecated.
If the provider API can be queried by 'upload date' or something similar,
the script should take a --date parameter when run as a script, giving the
date for which we should collect images. The form should be YYYY-MM-DD (so,
the script can be run via python my_favorite_provider.py --date 2018-01-01).
The script must provide a main function that takes the same parameters as the
CLI. In our example from above, we'd then have a main function
my_favorite_provider.main(date). The main function should do the same thing
that calling the script from the CLI would do.
The script must conform to PEP8. Please use pycodestyle (available via
pip install pycodestyle) to check for compliance.
The script should use small, testable functions.
The test suite for the script may break PEP8 rules regarding long lines where
appropriate (e.g., long strings for testing).

Examples of other Provider API Scripts

For example Provider API Scripts and accompanying test suites, please see

src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py, or
src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py.

Original Comments:

mathemancer commented on Thu Jan 09 2020:

API docs
Example query
source

annatuma commented on Fri Jan 10 2020:

Thank you @stragu for requesting this. We will do further research and determine prioritization.
source

annatuma commented on Thu Jun 04 2020:

I've updated the original ticket from @stragu to include our standard format for new providers, and the links you included as well @mathemancer

This is now ready for community contribution.
source

stragu commented on Thu Dec 17 2020:

Hi @kgodey ! What does "not suitable for work as repo is in maintenance" mean?
Did the feature make it into the addon at all?
source

kgodey commented on Thu Dec 17 2020:

@stragu Please see this blog post for more details: https://opensource.creativecommons.org/blog/entries/2020-12-07-upcoming-changes-to-community/
source

stragu commented on Fri Dec 18 2020:

Oh that's a real shame about the reduced capacity.
So if we would like the Internet Archive to be added to the CC Search Browser Extension, I should reopen this issue there? cc-archive/ccsearch-browser-extension#64

Or does it simply mean that no new sources will be added to the extension?
source

kgodey commented on Fri Dec 18 2020:

Unfortunately, no new sources will be added at this time. Hopefully that will change in the future.
source

stragu commented on Fri Dec 18 2020:

@kgodey that's unfortunate, but thank you for your work and your quick replies. All the best!
source

Transfer issues from CC projects

We should migrate issues from the Creative Commons project, specifically those issues that were open at the time of shutdown. These issues are labeled with 🙅 status: discontinued.

This should be a manual, one-off script. It can be run on a user account with adequate permissions. Because all new issues will be created by the GitHub user executing the script, we should add a block of metadata to the source of each imported issue, something like the following:

This issue has been migrated from {original-repo-name-linking-to-the-original-issue}.
Author: {author-name}
Date: {post-date}
Labels: {list-of-original-labels}
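
A rough sketch of how the script might render that metadata block is shown below. The function name build_migration_header, its parameters, and the Markdown link syntax are assumptions for illustration; fetching the original issue and creating the new one through the GitHub API are not shown.

```python
from datetime import date


def build_migration_header(repo_name, issue_url, author, posted_on, labels):
    """Render the metadata block placed at the top of a migrated issue."""
    return "\n".join(
        [
            f"This issue has been migrated from [{repo_name}]({issue_url}).",
            f"Author: {author}",
            # Assumes the default C locale for day and month names.
            f"Date: {posted_on.strftime('%a %b %d %Y')}",
            f"Labels: {','.join(labels)}",
        ]
    )


# Example with placeholder values.
header = build_migration_header(
    repo_name="the CC Search API repository",
    issue_url="https://example.com/original-issue",
    author="kgodey",
    posted_on=date(2020, 4, 10),
    labels=["✨ goal: improvement", "🙅 status: discontinued"],
)
print(header)
```

Keeping the formatting in one pure function makes it easy to adjust the template once the open questions about comments and labels are settled.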

Other metadata and suggestions are welcome.

Further Questions

Any thoughts on how we should handle comments?
Any critical metadata I missed?
Should we import/sync labels, or merely reference the originals?

Feature: Progressive Web App

This issue has been migrated from the CC Search Frontend repository

Author: abhisheknaiidu
Date: Wed Jun 17 2020
Labels: ✨ goal: improvement,❓ talk: question,🏷 status: label work required,🙅 status: discontinued,🚧 status: blocked

Implement Progressive Web App for CC Search

We can make both mobile and web as a progressive app! (Add App to Home Screen)

Mobile Compatible
Desktop Compatible (If Needed)

Most of the apps nowadays are sunsetting their mobile web apps and going full PWA, like Dribbble and Twitter.

Original Comments:

kgodey commented on Wed Jun 17 2020:

@abhisheknaiidu new features have to go through @annatuma for triage so I added "awaiting triage" back. Please talk with her directly and get her approval before removing the label.
source

annatuma commented on Thu Jun 18 2020:

Thanks for the suggestion @abhisheknaiidu

While I agree that PWAs are nifty, and may be appropriate for some apps, I'm not convinced that this would be the right move for CC Search, and definitely not in the short term.

We'd need to start by doing user research to understand if there was even a need for this. Assuming there was, we'd need to look into what is feasible for us to load offline for users, while being cognizant of storage space, amongst other things. Further, it would likely require a different definition of the product entirely, since the goal of PWAs is typically to build engagement with the product, while we explicitly steer users through CC Search, treating it as a portal, and send them on to their destinations of sites containing CC licensed content.

That said, the scenario may change in time. I'm going to put this in the parking lot for now and mark it as blocked. That way, anyone with a similar idea can contribute to the discussion here.

source

KarenEfereyan commented on Fri Sep 11 2020:

PWAs? Seems interesting
source

EOL

This issue has been migrated from the CC Search Catalog repository

Author: annatuma
Date: Fri Feb 28 2020
Labels: Hacktoberfest,help wanted,providers,✨ goal: improvement,🙅 status: discontinued

Note we previously integrated via CommonCrawl, and would like to integrate via API now.

Provider API Endpoint / Documentation

https://eol.org/docs/what-is-eol/data-services

SECTIONS BELOW THIS POINT NEED TO BE WORKED ON PRIOR TO INTEGRATION

Provider description

Licenses Provided

Provider API Technical info

Checklist to complete before beginning development

No development should be done on a Provider API Script until the following info is gathered:

Verify there is a way to retrieve the entire relevant portion of the provider's collection in a systematic way via their API.
Verify the API provides license info (license type and version; license URL provides both, and is preferred)
Verify the API provides stable direct links to individual works.
Verify the API provides a stable landing page URL to individual works.
Note other info the API provides, such as thumbnails, dimensions, attribution info (required if non-CC0 licenses will be kept), title, description, other meta data, tags, etc.
Attach example responses to API queries that have the relevant info.

General Recommendations for implementation

The script should be in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
The script should have a test suite in the same directory.
The script must use the ImageStore class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
The script should use the DelayedRequester class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
The script must not use anything from
src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since
that module is deprecated.
If the provider API can be queried by 'upload date' or something similar,
the script should take a --date parameter when run as a script, giving the
date for which we should collect images. The form should be YYYY-MM-DD (so,
the script can be run via python my_favorite_provider.py --date 2018-01-01).
The script must provide a main function that takes the same parameters as the
CLI. In our example from above, we'd then have a main function
my_favorite_provider.main(date). The main function should do the same thing
that calling the script from the CLI would do.
The script must conform to PEP8. Please use pycodestyle (available via
pip install pycodestyle) to check for compliance.
The script should use small, testable functions.
The test suite for the script may break PEP8 rules regarding long lines where
appropriate (e.g., long strings for testing).

Examples of other Provider API Scripts

For example Provider API Scripts and accompanying test suites, please see

src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py, or
src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py.

Original Comments:

hack3r-0m commented on Mon Mar 02 2020:

- EOL provides 1) Structured Data and 2) a Classic API (https://eol.org/docs/what-is-eol/data-services)
- EOL is under the MIT license (https://github.com/EOL/eol/blob/master/MIT-LICENSE.txt)
- They currently have a stable API (in the context of the "Classic API"; not sure about Structured Data)
- They have a stable landing page URL
- Example: https://eol.org/api/search/1.0.json?q=Pinus%2Bcontorta&page=1&key=

However, I am not sure about the license and whether all required endpoints exist. If someone can review, I would like to work on it.
source

annatuma commented on Thu Mar 05 2020:

Flagging here that the first two items on the checklist are crucial to moving forward, and this provider cannot be ingested until those have been verified.

@mathemancer will take a look when time allows.
source

Configure airflow to use remote logging capability (original #215)

This issue has been migrated from the CC Search Catalog repository

Author: mathemancer
Date: Fri Dec 06 2019
Labels: 🔒 staff only,🙅 status: discontinued

Currently, we store airflow logs in the running Docker Container. The logs are ~650-900MB per day, and storing them in the container actually requires approximately double the space on the host machine. This equates to more than 1GB of logs being stored on the ec2 instance disk per day, quickly filling the disk.

Airflow includes out-of-the-box logging to s3 functionality, and we should reconfigure the airflow daemon to use that ability.

Original Comments:

mathemancer commented on Tue Feb 25 2020:

Marking this for CC staff, since it requires access to AWS to implement/test
source

Free SVG

This issue has been migrated from the CC Search Catalog repository

Author: avats-dev
Date: Mon Aug 31 2020
Labels: Hacktoberfest,help wanted,providers,🌟 goal: addition,💻 aspect: code,🙅 status: discontinued,🟨 priority: medium

Provider API Endpoint / Documentation

https://freesvg.org/pages/api-and-usage

Provider description

Collection of public domain clipart content in SVG vector format. (Current count: 120130)

Licenses Provided

CC0 public domain license

https://creativecommons.org/publicdomain/zero/1.0/

Provider API Technical info

endpoint : https://freesvg.org/api/v1/

Use case

req :
curl -H 'Accept: application/json' -H "Authorization: Bearer *-----token---------*" https://freesvg.org/api/v1/svg/83156

response :

{
    "data": {
        "id": 83156,
        "thumb": "1569933538scorpion-clipart-freesvg.org.png",
        "svg": "1569933538scorpion-clipart-freesvg.org.svg",
        "publish_datetime": "01/10/2019 12:37 PM",
        "status": "Published",
        "created_at": "2019-10-01",
        "created_by": "Publicdomainvectors"
    }
}
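
To make the gaps in this response concrete, here is a sketch of pulling the usable fields out of it. The function name extract_svg_record and the output keys are hypothetical. The day-first date format is inferred from the sample, where publish_datetime "01/10/2019" sits next to created_at "2019-10-01".

```python
import json
from datetime import datetime

# The sample response from above, verbatim.
SAMPLE_RESPONSE = """
{
    "data": {
        "id": 83156,
        "thumb": "1569933538scorpion-clipart-freesvg.org.png",
        "svg": "1569933538scorpion-clipart-freesvg.org.svg",
        "publish_datetime": "01/10/2019 12:37 PM",
        "status": "Published",
        "created_at": "2019-10-01",
        "created_by": "Publicdomainvectors"
    }
}
"""


def extract_svg_record(response_json):
    """Pick out the fields a provider script could use from one response."""
    data = response_json["data"]
    return {
        "foreign_id": str(data["id"]),
        # Only bare file names are returned, not full URLs.
        "svg_filename": data["svg"],
        "thumbnail_filename": data["thumb"],
        "creator": data["created_by"],
        # Day-first format is an inference from the sample (see lead-in).
        "published": datetime.strptime(
            data["publish_datetime"], "%d/%m/%Y %I:%M %p"
        ),
    }


record = extract_svg_record(json.loads(SAMPLE_RESPONSE))
```

The sample contains no license field, no landing page URL, and only file names for the SVG and thumbnail, so those would have to come from somewhere other than this endpoint.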

Checklist to complete before beginning development

No development should be done on a Provider API Script until the following info is gathered:

Verify there is a way to retrieve the entire relevant portion of the provider's collection in a systematic way via their API.
Verify the API provides license info (license type and version; license URL provides both, and is preferred)
Verify the API provides stable direct links to individual works.
Verify the API provides a stable landing page URL to individual works.
Note other info the API provides, such as thumbnails, dimensions, attribution info (required if non-CC0 licenses will be kept), title, description, other meta data, tags, etc.
Attach example responses to API queries that have the relevant info.

General Recommendations for implementation

The script should be in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
The script should have a test suite in the same directory.
The script must use the ImageStore class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
The script should use the DelayedRequester class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
The script must not use anything from
src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since
that module is deprecated.
If the provider API can be queried by 'upload date' or something similar,
the script should take a --date parameter when run as a script, giving the
date for which we should collect images. The form should be YYYY-MM-DD (so,
the script can be run via python my_favorite_provider.py --date 2018-01-01).
The script must provide a main function that takes the same parameters as the
CLI. In our example from above, we'd then have a main function
my_favorite_provider.main(date). The main function should do the same thing
that calling the script from the CLI would do.
The script must conform to PEP8. Please use pycodestyle (available via
pip install pycodestyle) to check for compliance.
The script should use small, testable functions.
The test suite for the script may break PEP8 rules regarding long lines where
appropriate (e.g., long strings for testing).

Examples of other Provider API Scripts

For example Provider API Scripts and accompanying test suites, please see

src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py, or
src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py.

Original Comments:

annatuma commented on Tue Sep 08 2020:

Great suggestion, thanks! We're happy for this to be worked on if you'd like to tackle it @avats-dev or anyone else interested.
source

Issue author avats-dev commented on Wed Sep 09 2020:

Hey @annatuma I would like to tackle it and I'm studying other scripts currently used, so it might take some time. I will update you when I will start working here or on slack. Anyone else interested can go on and tackle this 👍 .
source

Issue author dravadhis commented on Wed Sep 23 2020:

@avats-dev I think it is not possible to query by upload date. A good option would be to retrieve all available SVGs because this is the only possible request that can be made. Search term would not cover the entire collection, while querying by ID is not possible. The script can then be run every month or 15 days depending on how quickly the count increases.
source

Issue author tushar912 commented on Mon Oct 12 2020:

Is anyone working on this? If not, can I take it up?
source

Issue author dravadhis commented on Wed Oct 14 2020:

@tushar912 I have been working on this.
source

Issue author dravadhis commented on Wed Oct 14 2020:

@tushar912 @avats-dev
False - Verify there is a way to retrieve the entire relevant portion of the provider's collection in a systematic way via their API.
False - Verify the API provides license info (license type and version; license URL provides both, and is preferred)
False - Verify the API provides stable direct links to individual works.
False - Verify the API provides a stable landing page URL to individual works.

This is what I concluded. Please feel free to take this up if you find a way to retrieve the relevant info.
source

kshitij86 commented on Sun Dec 06 2020:

@kgodey @dravadhis Does this issue still need to be worked on? I'd be happy to take it up.
source

Issue author dravadhis commented on Sun Dec 06 2020:

@tushar912 @avats-dev
False - Verify there is a way to retrieve the entire relevant portion of the provider's collection in a systematic way via their API.
False - Verify the API provides license info (license type and version; license URL provides both, and is preferred)
False - Verify the API provides stable direct links to individual works.
False - Verify the API provides a stable landing page URL to individual works.

This is what I concluded. Please feel free to take this up if you find a way to retrieve the relevant info.

@kshitij86 The issue still needs to be worked on although there are some problems with the source API (see tagged comment).
Also I am not the right person to be asked whether you can take this or not. Please ask @kgodey @mathemancer @annatuma.

Thank You!

source

mathemancer commented on Mon Dec 07 2020:

@kshitij86: @dravadhis is correct, this issue is still in the research phase. The problems noted in that comment need to be solved before proceeding with coding.
source

Investigate use of the BM25 algorithm to search image titles (original #288)

This issue has been migrated from the CC Search API repository

Author: kgodey
Date: Sat Apr 27 2019
Labels: ✨ goal: improvement,🏷 status: label work required,🙅 status: discontinued

The similarity algorithm used to search titles was switched from BM25 to boolean in cc-archive/cccatalog-api#281 to avoid ranking repeated words in titles higher.

We should investigate switching back to BM25 and set the k1 tuning value to a low value just for the title field.

See cc-archive/cccatalog-api#281 (review) and BM25 algorithm docs for more info.
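
To make the effect of k1 concrete, here is a minimal sketch of the term-frequency part of the standard BM25 formula (the IDF factor is left out). The function name and the sample values are illustrative and not taken from the CC Search codebase; an actual change would be made in the search index's similarity settings rather than in Python.

```python
def bm25_tf_weight(tf, k1=1.2, b=0.75, doc_len=1.0, avg_doc_len=1.0):
    """Term-frequency component of BM25 (the IDF factor is omitted)."""
    # Length normalisation: equals 1.0 when the field has average length.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * norm)


# Compare how much a repeated title word gains under different k1 values.
for k1 in (1.2, 0.1, 0.0):
    once = bm25_tf_weight(1, k1=k1)
    thrice = bm25_tf_weight(3, k1=k1)
    print(f"k1={k1}: tf=1 -> {once:.3f}, tf=3 -> {thrice:.3f}")
```

With k1 set to 0 the weight is the same for any nonzero term frequency, so repeated words gain nothing. A small positive k1 lets repeats count only slightly, which is the behaviour this issue proposes to investigate for the title field.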

Original Comments:

annatuma commented on Thu Jan 23 2020:

@aldenstpage I'm putting this in Q2 of the backlog, given that there are other search algorithm improvements scheduled for then. Please evaluate if this is a fit for community contributions and if so, label it accordingly.
source

Behance

This issue has been migrated from the CC Search Catalog repository

Author: janetpkr
Date: Fri Jun 14 2019
Labels: providers,✨ goal: improvement,🔒 staff only,🙅 status: discontinued

see email

Original Comments:

Issue author sclachar commented on Tue Jun 18 2019:

@janeatcc 👆
source

kgodey commented on Thu Jul 11 2019:

Unassigning myself since we have an API key now.
source

annatuma commented on Mon Dec 02 2019:

Moved to CC Catalog. @kgodey please email/share the API Key with @mathemancer so he has it for his records when it's time to start work on this.
source

kgodey commented on Tue Dec 03 2019:

The key is already shared in LastPass.
source

annatuma commented on Tue Feb 04 2020:

Current integration is through Common Crawl.

API documentation: https://www.behance.net/dev
API key: stored in LastPass, internally

Considerations prior to integration:

Can we make use of e.g. the stats in /project_id to bring in popularity data?
Is there an equivalent of NSFW anywhere in the API that we can include in metadata?
source

kgodey commented on Tue Feb 04 2020:

@annatuma yes there is an NSFW flag in the API
source

gauravahlawat81 commented on Thu Feb 20 2020:

Hey, what's the status of this issue? Is it ready to be worked on?
I am currently working on the cccatalog repository, and I think I can work on this issue.
@mathemancer What do you think?
source

mathemancer commented on Mon Feb 24 2020:

Adding the default boilerplate from our fancy new Provider API Integration template:

Provider API Endpoint / Documentation

https://www.behance.net/dev

Provider description

Licenses Provided

Provider API Technical info

General Recommendations for implementation

The script should be in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
The script should have a test suite in the same directory.
The script must use the ImageStore class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
The script should use the DelayedRequester class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
The script must not use anything from
src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since
that module is deprecated.
If the provider API can be queried by 'upload date' or something similar,
the script should take a --date parameter when run as a script, giving the
date for which we should collect images. The form should be YYYY-MM-DD (so,
the script can be run via python my_favorite_provider.py --date 2018-01-01).
The script must provide a main function that takes the same parameters as
the CLI. In our example from above, we'd then have a main function
my_favorite_provider.main(date). The main function should do the same thing
that calling the script from the CLI would do.
The script must conform to PEP8. Please use pycodestyle (available via
pip install pycodestyle) to check for compliance.
The script should use small, testable functions.
The test suite for the script may break PEP8 rules regarding long lines where
appropriate (e.g., long strings for testing).
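
As a rough illustration of the --date and main(date) recommendations above, a provider script could be laid out as in the following sketch. The helper names valid_date and parse_args are hypothetical, and the body of main is a placeholder; a real script would request and store images there.

```python
import argparse
from datetime import datetime


def valid_date(text):
    """Check that the text parses as a YYYY-MM-DD date and return it."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        raise argparse.ArgumentTypeError(
            f"expected YYYY-MM-DD, got {text!r}"
        )
    return text


def parse_args(argv=None):
    """Parse the command line; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(description="Hypothetical provider")
    parser.add_argument(
        "--date",
        type=valid_date,
        required=True,
        help="Upload date to collect images for (YYYY-MM-DD)",
    )
    return parser.parse_args(argv)


def main(date):
    """Entry point shared by the CLI and by code importing the module."""
    # Placeholder: a real script would fetch and store images for `date`.
    print(f"Collecting images uploaded on {date}")
    return date
```

When run as a script, the module would call main(parse_args().date), so that python my_favorite_provider.py --date 2018-01-01 and my_favorite_provider.main("2018-01-01") do the same work.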

Examples of other Provider API Scripts

For example Provider API Scripts and accompanying test suites, please see

src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py, or
src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py.
source

mathemancer commented on Mon Feb 24 2020:

@gauravahlawat81
The issue is almost ready. See my previous comment for the details so far. In order to work on this issue, one needs to obtain an API key from Behance (it's not possible to use the CC org's key as a community contributor). The preferred way to work with it would be to use an environment variable (perhaps BEHANCE_API_KEY) for the key, and read it in the script.

The main info we need before this is ready for work is whether all the relevant pieces (creator, license info, foreign landing url, and image url at a minimum) can be obtained from the API.
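
A minimal sketch of reading the key from the environment, as suggested above, might look like this. The variable name BEHANCE_API_KEY comes from the comment; the function name and the error handling are assumptions for illustration.

```python
import os


def get_api_key(env=os.environ):
    """Return the Behance API key from the environment, or fail clearly."""
    key = env.get("BEHANCE_API_KEY")
    if not key:
        # Failing early avoids sending unauthenticated requests.
        raise RuntimeError("BEHANCE_API_KEY is not set")
    return key
```

Passing the environment as a parameter keeps the function easy to test without touching the real environment.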
source

mathemancer commented on Mon Feb 24 2020:

@gauravahlawat81
I'd prefer you finish up on WordPress/openverse-catalog#255 before claiming this issue to work on. This is quite a large/major issue.

Thank you very much for your initiative!
source

akshgpt7 commented on Sat Mar 14 2020:

@mathemancer Behance has stopped accepting new clients for their API and are not giving new API keys anymore. Is this integration still possible under the SoC project, maybe using the org's key later during the project?
Here's the source
source

mathemancer commented on Mon Mar 16 2020:

@akshgpt7 Oh no! I see that we actually have a Behance API key stored, but we can't give it out to the community for development. Perhaps we'll have to implement this one internally. I'll mark the issue as such for the moment, and update once we have an idea of how others can get keys.
source

[API Integration - AUDIO] Wikimedia Commons (original #316)

This issue has been migrated from the CC Search CCCatalog repository

Author: annatuma
Date: Wed Mar 11 2020
Labels: providers,✨ goal: improvement,🙅 status: discontinued

Note: we already have an API integration with Wikimedia Commons for images. They also have CC licensed sounds.

Overview page in UI:
https://commons.wikimedia.org/wiki/Category:Sound

Example file: https://commons.wikimedia.org/wiki/File:Christoph_Nolte_-_The_Rocky_Road_-_The_Rocky_Road_To_Dublin.ogg

Details below this point needed

Provider API Endpoint / Documentation

Provider description

Licenses Provided

Provider API Technical info

Checklist to complete before beginning development

No development should be done on a Provider API Script until the following info is gathered:

Verify there is a way to retrieve the entire relevant portion of the provider's collection in a systematic way via their API.
Verify the API provides license info (license type and version; license URL provides both, and is preferred)
Verify the API provides stable direct links to individual works.
Verify the API provides a stable landing page URL to individual works.
Note other info the API provides, such as thumbnails, dimensions, attribution info (required if non-CC0 licenses will be kept), title, description, other meta data, tags, etc.
Attach example responses to API queries that have the relevant info.

General Recommendations for implementation

The script should be in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
The script should have a test suite in the same directory.
The script must use the ImageStore class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
The script should use the DelayedRequester class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
The script must not use anything from
src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since
that module is deprecated.
If the provider API can be queried by 'upload date' or something similar,
the script should take a --date parameter when run as a script, giving the
date for which we should collect images. The form should be YYYY-MM-DD (so,
the script can be run via python my_favorite_provider.py --date 2018-01-01).
The script must provide a main function that takes the same parameters as
the CLI. In our example from above, we'd then have a main function
my_favorite_provider.main(date). The main function should do the same thing
that calling the script from the CLI would do.
The script must conform to PEP8. Please use pycodestyle (available via
pip install pycodestyle) to check for compliance.
The script should use small, testable functions.
The test suite for the script may break PEP8 rules regarding long lines where
appropriate (e.g., long strings for testing).

Examples of other Provider API Scripts

For example Provider API Scripts and accompanying test suites, please see

src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py, or
src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py.

Come up with a solution for consuming crawler events (original #457)

This issue has been migrated from the CC Search Catalog repository

Author: aldenstpage
Date: Wed Jul 08 2020
Labels: ✨ goal: improvement,🙅 status: discontinued

We're reading metadata from images on a large scale and sticking it into some Kafka topics. We ought to start incorporating this data into the data layer so we can use it in CC Search. The format of the data is documented here. In summary:

We know the dimensions, filesize, and compression rate of images in the image_metadata_updates topic
In some cases we are able to extract exif metadata, which also goes into the image_metadata_updates topic.
We record 404s in the link_rot topic

This data can be produced continuously by the crawler, so we should prefer building streaming consumers over reading topics in batches.

We know from experience now that dumping this into the meta_data column en masse is not a good option, so this is a good time to start thinking about alternatives.
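
One possible shape for a streaming consumer is sketched below: each message is handled as it arrives and routed by topic. The topic names come from this issue; the payload field image_id, the handler names, and the in-memory state are hypothetical, since the real message format is documented elsewhere. In production the messages would come from a Kafka client rather than a Python iterable.

```python
def handle_metadata_update(state, payload):
    # Keep the most recent metadata seen for each image.
    state["metadata"][payload["image_id"]] = payload


def handle_link_rot(state, payload):
    # Remember which images returned a 404.
    state["dead_links"].add(payload["image_id"])


HANDLERS = {
    "image_metadata_updates": handle_metadata_update,
    "link_rot": handle_link_rot,
}


def consume(messages, state=None):
    """Process (topic, payload) pairs one at a time as they arrive."""
    if state is None:
        state = {"metadata": {}, "dead_links": set()}
    for topic, payload in messages:
        handler = HANDLERS.get(topic)
        if handler is not None:  # ignore topics we do not know about
            handler(state, payload)
    return state
```

The state dictionary stands in for whatever storage replaces the meta_data column; the sketch does not take a position on that choice.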

Implement Lighthouse CI (original #1130)

This issue has been migrated from the CC Search Frontend repository

Author: zackkrida
Date: Mon Aug 10 2020
Labels: ✨ goal: improvement,🏷 status: label work required,💻 aspect: code,🔒 staff only,🙅 status: discontinued,🤖 aspect: dx

We will implement Lighthouse CI to add automated checks for front-end performance and accessibility (a11y).

Implementation

I would be interested in implementing this feature.

Original Comments:

Issue author neeraj-2 commented on Thu Sep 03 2020:

Can I work on this? If so, please guide me on what exactly needs to be done.

source

zackkrida commented on Tue Sep 08 2020:

Hi @neeraj-2, this actually needs to be done by CC staff, since some edits to the repo configuration are needed.
source

Bump pyyaml from 5.3.1 to 5.4 in /src/cc_catalog_airflow (original #555)

This issue has been migrated from the CC Search Catalog repository

Author: dependabot[bot]
Date: Fri Mar 26 2021
Labels: dependencies,🙅 status: discontinued

Bumps pyyaml from 5.3.1 to 5.4.

Changelog

Sourced from pyyaml's changelog.

5.4 (2021-01-19)

yaml/pyyaml#407 -- Build modernization, remove distutils, fix metadata, build wheels, CI to GHA

yaml/pyyaml#472 -- Fix for CVE-2020-14343, moves arbitrary python tags to UnsafeLoader

yaml/pyyaml#441 -- Fix memory leak in implicit resolver setup

yaml/pyyaml#392 -- Fix py2 copy support for timezone objects

yaml/pyyaml#378 -- Fix compatibility with Jython

Commits

58d0cb7 5.4 release
a60f7a1 Fix compatibility with Jython
ee98abd Run CI on PR base branch changes
ddf2033 constructor.timezone: _copy & deepcopy
fc914d5 Avoid repeatedly appending to yaml_implicit_resolvers
a001f27 Fix for CVE-2020-14343
fe15062 Add 3.9 to appveyor file for completeness sake
1e1c7fb Add a newline character to end of pyproject.toml
0b6b7d6 Start sentences and phrases for capital letters
c976915 Shell code improvements
Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

You can disable automated security fix PRs for this repo from the Security Alerts page.

Original Comments:

TimidRobot commented on Mon Mar 29 2021:

🙅🏻 status: discontinued: Project is in maintenance mode (Upcoming Changes to the CC Open Source Community — Creative Commons Open Source).
source

Issue author dependabot[bot] commented on Mon Mar 29 2021:

OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version, let me know by commenting @dependabot ignore this major version or @dependabot ignore this minor version.

If you change your mind, just re-open this PR and I'll resolve any conflicts on it.
source

Auckland Museum

Source Site

https://www.aucklandmuseum.com/discover/collections/our-data

Provider API Endpoint / Documentation

Tutorial: https://github.com/AucklandMuseum/API/wiki/Tutorial#fields-available-for-query-string-searches
Endpoint: https://api.aucklandmuseum.com/

Provider description

The Auckland Museum displays wide varieties of historical artifacts and collections in addition to providing education and research resources.

Provider API Technical info

Rate Limit: 10 requests per second / 1000 requests per day.

This issue has been migrated from the CC Search Catalog repository

Author: akshgpt7
Date: Mon Feb 24 2020
Labels: Hacktoberfest,help wanted,providers,✨ goal: improvement,🙅 status: discontinued

Original Comments:

akshgpt7 commented on Tue Feb 25 2020:

@annatuma, @mathemancer I found out that the Auckland Museum's online collection has CC licensed images, which can be identified by the copyright field in the response JSON.

mathemancer commented on Tue Feb 25 2020:

@akshgpt7 That's great, thank you! Further info to gather would be:

What is the overall volume of objects to be queried?

Is it possible to catalog all CC-licensed items via API calls in some reasonable way? (I.e., there must be some way to systematically loop through all objects, gathering their metadata)

akshgpt7 commented on Fri Feb 28 2020:

@mathemancer

The current volume of objects with CC in the copyright field is a bit over 100k.
To catalog the CC-licensed items, the following endpoint can be used https://api.aucklandmuseum.com/search/collectionsonline/_search?q=copyright:CC&has_image=true. For each object, there's a primaryRepresentation field that contains the image url. I believe it's very much possible to extract the items through the same.
Most of the images are licensed under this license: http://creativecommons.org/licenses/by/4.0/ (Needs confirmation).
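
The query described above could be assembled as in the following sketch. The endpoint, the q and has_image parameters, and the primaryRepresentation field are taken from the comment; the function names are hypothetical, and the overall layout of the API response is not shown because it has not been confirmed here.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.aucklandmuseum.com/search/collectionsonline/_search"


def build_search_url(query="copyright:CC", has_image=True):
    """Build the search URL quoted in the comment above."""
    params = {"q": query, "has_image": str(has_image).lower()}
    # Keep ':' unescaped so the URL matches the form used in the discussion.
    return f"{SEARCH_URL}?{urlencode(params, safe=':')}"


def image_url(record):
    """Return the image URL of one object record, or None if it has none."""
    return record.get("primaryRepresentation")
```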

mathemancer commented on Mon Mar 02 2020:

This is great info @akshgpt7 , thank you!

annatuma commented on Thu Mar 05 2020:

@akshgpt7 you are welcome to tackle this integration if you're interested in doing so. Let us know, for now this ticket is assigned to you to work on.

akshgpt7 commented on Fri May 29 2020:

@mathemancer I have been working on this script.
I have one question moving forward. There is a rate limit of 1000 requests per day. How do I go about handling checks to not exceed that in a day?

I was thinking of something like recording the time of day at the start of the script and maintaining a request_count. Then, on each request, the script would check the time again and stop if request_count hits 1000 on the same day, or reset the count if the day passes before 1000 requests are made.

However, I'm not sure if this is the right way to go about handling the 1000 requests/day limit. Moreover, how do we make sure to start off the next day from the same page where we left off on the previous day?

mathemancer commented on Mon Jun 15 2020:

@akshgpt7 I suggest using the DelayedRequester class with a delay of 87 seconds. This will keep the overall number of requests under the limit.
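
The 87-second figure follows from dividing the seconds in a day by the daily cap: 86400 / 1000 = 86.4, rounded up to the next whole second. The sketch below checks that arithmetic; the function names are illustrative and it does not use the DelayedRequester class itself.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def min_delay_seconds(max_requests_per_day):
    """Smallest whole-second delay that stays within a daily request cap."""
    return math.ceil(SECONDS_PER_DAY / max_requests_per_day)


def max_requests_in_a_day(delay_seconds):
    """Requests sent in 24 hours when one goes out every `delay_seconds`."""
    # The first request is at t=0, then one more per full delay elapsed.
    return (SECONDS_PER_DAY - 1) // delay_seconds + 1


print(min_delay_seconds(1000))
print(max_requests_in_a_day(87), max_requests_in_a_day(86))
```

A delay of 86 seconds would allow slightly more than 1000 requests in a day, which is why the value is rounded up rather than down. The limit of 10 requests per second is satisfied trivially at this pace.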

Images of Empowerment

This issue has been migrated from the CC Search Catalog repository

Author: ircpresident
Date: Thu Apr 16 2020
Labels: Hacktoberfest,help wanted,providers,✨ goal: improvement,🙅 status: discontinued

Provider API Endpoint / Documentation

https://www.imagesofempowerment.org/about/

Provider description

The William and Flora Hewlett Foundation worked with Getty Images in 2015 and 2017 to create a library of powerful, positive and high-quality images showing women’s work and family life around the world. The David and Lucile Packard Foundation added hundreds of photographs to the collection beginning in 2018.

All photographs are available—free of charge—to non-commercial users, thanks to Creative Commons licensing (CC-BY-NC 4.0). The photos funded by the Hewlett Foundation are also available for licensing to Getty Images’ global customer base of creative agencies, businesses, news organizations, and other editorial clients. Photographs were taken by Jonathan Torgovnik, Paula Bronstein, Juan Arredondo, Nina Robinson, and Yagazie Emezi for Getty Images.

The images show the connection between women’s work, their health, and the ability to care for themselves and their families in 11 countries around the world: Colombia, Ghana, India, Kenya, Peru, Rwanda, Senegal, South Africa, Thailand, Uganda, and the United States. They also show women as active participants in their communities, accessing and providing quality reproductive health information and services, and advocating for better working conditions.

The William and Flora Hewlett Foundation has long supported organizations that work to ensure women have full and fair opportunities to earn a living and can choose whether and when to have a family. Similarly, the David and Lucile Packard Foundation supports reproductive health advocates, researchers, and providers to advance quality sexual and reproductive health and rights.

The Images of Empowerment show several non-profit organizations the foundations support, including the African Population and Health Research Center, Amref Health Africa, Marie Stopes International, TOSTAN, the Women in Informal Employment: Globalizing and Organizing (WIEGO), the Centre for Catalyzing Change, Ipas Development Foundation, the Institute of Women and Ethnic Studies, Teen Health Mississippi, the Imbuto Foundation, Medical Students for Choice, and more.

Since the collection launched in 2015, we’ve seen the images used by dozens of nonprofits, at major international conferences like Women Deliver, and by media including the New York Times, Vox, and the Guardian.

Like the Getty Images Lean In Collection that provides powerful depictions of women and girls especially in the United States, these 2,000 images show women living, working and organizing around the world.

Licenses Provided

(CC-BY-NC 4.0)

Provider API Technical info

Checklist to complete before beginning development

No development should be done on a Provider API Script until the following info is gathered:

Verify there is a way to retrieve the entire relevant portion of the provider's collection in a systematic way via their API.
Verify the API provides license info (license type and version; license URL provides both, and is preferred)
Verify the API provides stable direct links to individual works.
Verify the API provides a stable landing page URL to individual works.
Note other info the API provides, such as thumbnails, dimensions, attribution info (required if non-CC0 licenses will be kept), title, description, other meta data, tags, etc.
Attach example responses to API queries that have the relevant info.

General Recommendations for implementation

The script should be in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
The script should have a test suite in the same directory.
The script must use the ImageStore class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
The script should use the DelayedRequester class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
The script must not use anything from
src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since
that module is deprecated.
If the provider API can be queried by 'upload date' or something similar,
the script should take a --date parameter when run as a script, giving the
date for which we should collect images. The form should be YYYY-MM-DD (so,
the script can be run via python my_favorite_provider.py --date 2018-01-01).
The script must provide a main function that takes the same parameters as
the CLI. In our example from above, we'd then have a main function
my_favorite_provider.main(date). The main function should do the same thing
that calling the script from the CLI would do.
The script must conform to PEP8. Please use pycodestyle (available via
pip install pycodestyle) to check for compliance.
The script should use small, testable functions.
The test suite for the script may break PEP8 rules regarding long lines where
appropriate (e.g., long strings for testing).

Examples of other Provider API Scripts

For example Provider API Scripts and accompanying test suites, please see

src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py, or
src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py.

Original Comments:

kss682 commented on Mon Apr 20 2020:

I was just going through the collections, and the images are grouped by location, but I couldn't find any API reference. @ircpresident, did you find any APIs?
source

ircpresident commented on Tue Apr 21 2020:

I don't think so. I have no relationship to the site, but I will ask them through their Contact Us page whether they have an API and update here.
source

OpenClipArt.org

This issue has been migrated from the CC Search Catalog repository

Author: janetpkr
Date: Thu Jul 11 2019
Labels: providers,🙅 status: discontinued

per many twitter user requests! https://twitter.com/rejon/status/1141844914168377346

Original Comments:

annatuma commented on Mon Feb 24 2020:

Provider API Endpoint / Documentation

No documentation found, emailed OpenClipArt.org to find out if this integration is possible.

Provider description

This is a clip art provider, with almost 160,000 objects in the collection.

Licenses Provided

All objects are CC licensed.

SECTIONS BELOW THIS POINT NEED TO BE WORKED ON PRIOR TO INTEGRATION

Provider API Technical info

General Recommendations for implementation

The script should be in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
The script should have a test suite in the same directory.
The script must use the ImageStore class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
The script should use the DelayedRequester class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
The script must not use anything from
src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since
that module is deprecated.
If the provider API can be queried by 'upload date' or something similar,
the script should take a --date parameter when run as a script, giving the
date for which we should collect images. The form should be YYYY-MM-DD (so,
the script can be run via python my_favorite_provider.py --date 2018-01-01).
The script must provide a main function that takes the same parameters as
the CLI. In our example from above, we'd then have a main function
my_favorite_provider.main(date). The main function should do the same thing
that calling the script from the CLI would do.
The script must conform to PEP8. Please use pycodestyle (available via
pip install pycodestyle) to check for compliance.
The script should use small, testable functions.
The test suite for the script may break PEP8 rules regarding long lines where
appropriate (e.g., long strings for testing).

Examples of other Provider API Scripts

For example Provider API Scripts and accompanying test suites, please see

src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py, or
src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py.

source

kgodey commented on Mon Feb 24 2020:

We may need to do this via Common Crawl if there's no API.
source

mathemancer commented on Tue Feb 25 2020:

They have a 'developers' site listing API endpoints, but none seem to work:

https://openclipart.org/developers
source

mathemancer commented on Tue Feb 25 2020:

@annatuma @kgodey Have we had any direct contact with them? They have an email listed for contact regarding their API:

[email protected]
source

annatuma commented on Wed Apr 22 2020:

@annatuma @kgodey Have we had any direct contact with them? They have an email listed for contact regarding their API:

[email protected]

No response on the Feb 24th email, pinged them again.
source

annatuma commented on Mon Apr 27 2020:

Update: they'll be in touch when the API goes live.
source

README broken links (original #542)

This issue has been migrated from the CC Search Catalog repository

Author: aashrafh
Date: Sun Dec 27 2020
Labels: 🏷 status: label work required,🙅 status: discontinued

The following links are broken in the README.md file:

monthlyWorkflow.py .
RawPixel. I found the right link here
Cleveland Museum of Art. I found the right link here
Brooklyn Museum. I found the right link here
NYPL. I found the right link here

Original Comments:

mathemancer commented on Wed Jan 06 2021:

Hey, @aashrafh. Unfortunately, this project is paused for the moment. Please see https://opensource.creativecommons.org/blog/entries/2020-12-07-upcoming-changes-to-community/

source

Fix-It Ticket for Smithsonian Institution Integration (original #397)

This issue has been migrated from the CC Search Frontend repository

Author: annatuma
Date: Mon May 18 2020
Labels: 🙅 status: discontinued

We know that some museums in SI have discrepancies between field names, e.g. in some museums they use "summary" and in others "description" to describe the result.

To improve the results of SI objects, we need to go through each museum within SI to look for any missing metadata mapping/potential improvements.

Original Comments:

annatuma commented on Mon May 18 2020:

Blocked by cc-archive/cccatalog#318
source

ChariniNana commented on Fri Jul 24 2020:

@annatuma @mathemancer

An initial analysis on missing metadata mapping is as follows:

The numbers and percentages of missing creators:-

                         Sub provider | No Creator | Total Images | Missing Percentage
si_national_museum_of_natural_history |    3019894 |      3019894 |              100.0
                         si_libraries |          1 |           55 | 1.8181818181818181
                           si_gardens |        669 |          689 |  97.09724238026125
                  si_portrait_gallery |       7080 |        11661 | 60.715204527913556
           si_american_history_museum |        196 |         2167 |   9.04476234425473
              si_cooper_hewitt_museum |      36105 |        65632 |  55.01127498781082
   si_african_american_history_museum |       3252 |         7519 | 43.250432238329566
               si_american_art_museum |         22 |        11561 | 0.19029495718363462
                  si_anacostia_museum |        322 |          571 |   56.3922942206655
                     si_postal_museum |       2900 |         2951 |  98.27177228058285
              si_freer_gallery_of_art |       2929 |         3875 |  75.58709677419355
              si_air_and_space_museum |        238 |         2516 |   9.45945945945946
                  si_hirshhorn_museum |          2 |          477 | 0.4192872117400419
                si_african_art_museum |          3 |          136 | 2.2058823529411766
            si_american_indian_museum |        187 |          248 |  75.40322580645162

The numbers and percentages of missing descriptions in the meta data field:-

                         Sub provider | No Description | Total Images | Missing Percentage
si_national_museum_of_natural_history |        3007887 |      3019894 |   99.6024032631609
                         si_libraries |             54 |           55 |  98.18181818181819
                           si_gardens |              0 |          689 |                0.0
                  si_portrait_gallery |          11661 |        11661 |              100.0
           si_american_history_museum |            963 |         2167 |  44.43931702814952
              si_cooper_hewitt_museum |           4186 |        65632 | 6.3779863481228665
   si_african_american_history_museum |              0 |         7519 |                0.0
               si_american_art_museum |          11561 |        11561 |              100.0
                  si_anacostia_museum |            501 |          571 |  87.74080560420315
                     si_postal_museum |              2 |         2951 | 0.06777363605557438
              si_freer_gallery_of_art |           3875 |         3875 |              100.0
              si_air_and_space_museum |            319 |         2516 |  12.67885532591415
                  si_hirshhorn_museum |            477 |          477 |              100.0
                si_african_art_museum |              1 |          136 | 0.7352941176470589
            si_american_indian_museum |            248 |          248 |              100.0
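
The percentage column in the tables above is the missing count divided by the total image count, times 100. A minimal sketch that reproduces one row; the helper name `missing_percentage` is hypothetical, and treating an empty sub-provider as 0% missing is an assumption:

```python
def missing_percentage(missing: int, total: int) -> float:
    """Return the share of records missing a field, as a percentage."""
    if total == 0:
        # Assumption: a sub-provider with no images counts as 0% missing.
        return 0.0
    return missing / total * 100


# Row taken from the description table: si_american_history_museum.
history_museum = missing_percentage(963, 2167)
print(round(history_museum, 2))
```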

The creator value is missing because the field that holds it is not yet included in the CREATOR_TYPES dictionary, and the description is missing because its field is not yet covered in DESCRIPTION_TYPES, as defined in the Smithsonian script.

Other findings:-
We entirely lose the following museums because the mandatory value foreign_landing_url is unavailable and/or because we cannot tell whether their images are CC0 licensed:

SIA (smithsonian_institution_archives) - Both the record_link and guid fields from which we get the foreign_landing_url are missing.
NZP (smithsonian_zoo_and_conservation) - Both the record_link and guid fields from which we get the foreign_landing_url are missing.
FBR (smithsonian_field_book_project) - Both the record_link and guid fields from which we get the foreign_landing_url are missing. The usage -> access fields from which we determine whether images are CC0 licensed are also missing.
NAA (smithsonian_anthropological_archives) - Both the record_link and guid fields from which we get the foreign_landing_url are missing. The usage -> access fields from which we determine whether images are CC0 licensed are also missing.

source

ChariniNana commented on Mon Jul 27 2020:

As per the initial research conducted on the NMNH data, it was realised that the creator field may be retrieved from the freetext -> name -> Collector value which appears for some of the images in NMNH. Further discussion is necessary to determine whether this is an appropriate field from which to obtain the creator.

For populating the description information, it was noted that the freetext -> notes -> Notes field would be appropriate for NMNH.
source
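
The two lookups described in this comment could be sketched as below. This is only an illustration: it assumes each freetext section is a list of objects with `label` and `content` keys, which is not confirmed in this thread, and both `get_freetext_value` and the sample record are hypothetical.

```python
def get_freetext_value(record, section, label):
    """Return the first content value under freetext -> section with this label."""
    entries = record.get("freetext", {}).get(section, [])
    for entry in entries:
        if entry.get("label") == label:
            return entry.get("content")
    return None


# Hypothetical record, shaped according to the assumption above.
sample = {
    "freetext": {
        "name": [{"label": "Collector", "content": "J. Doe"}],
        "notes": [{"label": "Notes", "content": "Collected near the river."}],
    }
}

creator = get_freetext_value(sample, "name", "Collector")
description = get_freetext_value(sample, "notes", "Notes")
```

Returning None when a label is absent keeps the caller free to decide whether to fall back to another field.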

ChariniNana commented on Fri Jul 31 2020:

For the four museums with a missing foreign_landing_url (SIA, NZP, FBR, NAA), no alternative field from which to retrieve the URL could be identified in the JSON responses.
For FBR, the content.descriptiveNonRepeating.online_media.media.usage.access path is actually available, so obtaining the license type is possible. However, for most FBR objects we don't find the image list, which the Smithsonian script reads from the path content.descriptiveNonRepeating.online_media.media. For NAA we don't find the image list for any of the objects.
source
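
A rough sketch of walking the paths named in this comment, tolerating objects where the image list is absent. The helper names are hypothetical, and two details are assumptions: that media is a list of objects, and that the access value for CC0 images is the string "CC0".

```python
def get_path(data, dotted_path, default=None):
    """Follow a dotted key path through nested dicts; return default if a key is missing."""
    current = data
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def cc0_images(record):
    """Return the media entries whose usage -> access value is 'CC0' (assumed value)."""
    media = get_path(record, "content.descriptiveNonRepeating.online_media.media", [])
    if not isinstance(media, list):
        return []
    return [m for m in media if get_path(m, "usage.access") == "CC0"]


# Hypothetical record with one CC0 image and one image with other usage terms.
record = {
    "content": {
        "descriptiveNonRepeating": {
            "online_media": {
                "media": [
                    {"content": "image-1", "usage": {"access": "CC0"}},
                    {"content": "image-2", "usage": {"access": "Usage conditions apply"}},
                ]
            }
        }
    }
}
licensed = cc0_images(record)
```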

ChariniNana commented on Fri Jul 31 2020:

For si_postal_museum (NPM), where 98% of the creators are missing, we might be able to use the freetext -> name -> Presentor value as the creator; some records instead have freetext -> name -> Associated Organization. Both fields seem to contain names of places, which could be where the image is presented or a place it has some association with. For certain images, freetext -> name -> Associated Person is available.
source

Add Digital Commons @ ACU (original #185)

This issue has been migrated from the CC Search Catalog repository

Author: kgodey
Date: Mon Aug 19 2019
Labels: 🙅 status: discontinued

https://digitalcommons.acu.edu/

Original Comments:

annatuma commented on Mon Feb 24 2020:

Turns out this is part of http://network.bepress.com/ which has almost 3.5M OA text objects from participating institutions. I've sent them an email to inquire about a partnership. The preference would be to partner at the master level. If that's not an option, we can look at institution by institution. For now, putting this ticket as "Blocked" until we hear back from bepress.
source

Systematically update CC catalog records (original #164)

This issue has been migrated from the CC Search Catalog repository

Author: kgodey
Date: Fri Jun 14 2019
Labels: 🙅 status: discontinued

Currently, when we pull data into the Catalog, it is stored, but never refreshed on future pulls from those sources.

We need to discuss how we want to go about maintaining/updating data in the Catalog.

For a description of the strategy for reingestion we're using (but not the implementation), see:
https://opensource.creativecommons.org/blog/entries/date-partitioned-data-reingestion/

Original Comments:

annatuma commented on Fri Dec 06 2019:

Loosely aiming for Q2 2020 to tackle this.
source

kgodey commented on Fri Mar 20 2020:

@mathemancer has a plan for this, starting with cc-archive/cccatalog#298
source

mathemancer commented on Thu May 14 2020:

#298 has been implemented (see the PR cc-archive/cccatalog#394 ).
source

mathemancer commented on Thu May 14 2020:

There are three fundamentally different types of Provider APIs with regards to
this issue:

APIs that let us ingest metadata related to objects uploaded on a given date.
- For these providers, a strategy like the one outlined in WordPress/openverse-catalog#298 is necessary.
- Of the scripts currently implemented, these are:
  - Flickr
  - Met Museum
  - PhyloPic
  - Wikimedia Commons
- From that list, only Flickr and Wikimedia Commons are at a scale that
  implies difficulty. If we need to reingest the entire Met Museum
  collection, it's possible to do that at any time.
- For the 'smaller' providers, the easiest solution would be to continue
  doing the daily ingestions, and combine them with a monthly 'complete'
  ingestion of their entire catalog.
APIs for which we ingest the entire collection every time we ingest from it.
- For these providers, the problem is already solved.
APIs that offer a 'delta' endpoint that we use (currently none).
- This would be tricky, since if we miss ingesting for some time, it may be
  that we'd have to reingest the entire collection to make sure we're in a
  consistent state.

source
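
The "daily plus monthly complete ingestion" idea for the smaller providers could look roughly like this. It is a sketch only: the function name is hypothetical, and running the complete ingestion on the first day of the month is an assumption, not something decided in this thread.

```python
from datetime import date


def ingestion_mode(run_date: date) -> str:
    """Pick 'complete' on the first day of each month, 'daily' otherwise."""
    # Assumption: the monthly full ingestion is scheduled for day 1.
    return "complete" if run_date.day == 1 else "daily"


modes = [ingestion_mode(date(2020, 5, 31)), ingestion_mode(date(2020, 6, 1))]
```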

mathemancer commented on Thu May 14 2020:

I think we should focus on Wikimedia Commons next, both because it's the only one which isn't solved for which this might be challenging, and because we need more data from WMC for Regoknition analysis.
source

mathemancer commented on Fri May 15 2020:

See issue cc-archive/cccatalog#395 for the Wikimedia Commons plan.
source

mathemancer commented on Fri May 29 2020:

For the following API providers, reingestion is not necessary, because we ingest all their data every time we run:

Cleveland Museum
RawPixel
Smithsonian Institution

For the following API providers, reingestion is not necessary, because we pull from a delta endpoint (i.e., we're separating on date modified, rather than uploaded):

PhyloPic

For the following API Providers, we need to create Apache Airflow DAGs implementing the Date-partitioned reingestion strategy:

Flickr -- see cc-archive/cccatalog#298 and cc-archive/cccatalog#394
Wikimedia Commons -- see cc-archive/cccatalog#395 and cc-archive/cccatalog#402
Europeana -- see cc-archive/cccatalog#412
Met Museum -- see cc-archive/cccatalog#413

The only current API provider not mentioned above is Thingiverse which is BLOCKED by cc-archive/cccatalog#391. Whenever that is implemented, the implementation may or may not need to use a reingestion strategy.
source

mathemancer commented on Fri May 29 2020:

Moving forward, we should use the reingestion strategy out of the gate for any new Provider API Script, whenever it uses the date-partitioning strategy to ingest in the first place.
source

[Feature] Use metadata keywords to help detect if something is NSFW (original #482)

This issue has been migrated from the CC Search API repository

Author: aldenstpage
Date: Tue Apr 21 2020
Labels: 🏷 status: label work required,🙅 status: discontinued

Problem Description

We are trying to make NSFW content in CC Search "opt-in". We can catch a lot of NSFW content by using API specific filters and relying on moderation "upstream" at the source, but sometimes things slip through.

Solution Description

One way we can help prevent this is scanning for NSFW profanity and slurs in the title/tags/artist name and setting nsfw = True in the metadata field if it fails the check. There are 3rd party lists of dirty words that can help us achieve this. In my experience moderating content on CC Search, this will help prevent a lot of embarrassment and indignant emails from teachers.

We can do a one-time scan-and-filter relatively easily, but we will also need a way to filter new content as it is ingested.
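
A minimal sketch of such a check, usable both for a one-time scan and at ingestion time. The record keys (`title`, `creator`, `tags`, `meta_data`), the function names, and the placeholder word list are all assumptions for illustration. Matching on whole words only is one way to reduce false positives from innocent words that merely contain a blocked word.

```python
import re


def build_matcher(blocked_words):
    """Compile one case-insensitive pattern matching any blocked word as a whole word."""
    if not blocked_words:
        raise ValueError("blocked_words must not be empty")
    alternatives = "|".join(re.escape(word) for word in blocked_words)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


def mark_nsfw(record, matcher):
    """Set meta_data['nsfw'] = True if the title, creator, or any tag matches."""
    fields = [record.get("title") or "", record.get("creator") or ""]
    fields.extend(record.get("tags") or [])  # assumption: tags are plain strings
    if any(matcher.search(text) for text in fields):
        record.setdefault("meta_data", {})["nsfw"] = True
    return record


# "badword" stands in for a reviewed third-party word list.
matcher = build_matcher(["badword"])
flagged = mark_nsfw({"title": "A BadWord here", "tags": ["nature"]}, matcher)
clean = mark_nsfw({"title": "Notbadwordish", "tags": ["nature"]}, matcher)
```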

Additional Context

The Scunthorpe Problem

Original Comments:

aldenstpage commented on Tue Apr 21 2020:

Also: we're going to need to review the list of words carefully, because the lists that I linked to are too broad in what they consider NSFW and could have some unwanted inadvertent censorship effects.
source

Issue author brenoferreira commented on Tue Apr 21 2020:

One thing to watch out for in this word list is the potential for false positives that can end up filtering out a lot of content with words that aren't necessarily NSFW.
Edit: when I commented, I had the tab open for a while, so @aldenstpage's comment hadn't loaded yet :D
source

kss682 commented on Wed Apr 22 2020:

For new content, we could add a validator method to the ImageStore class that checks the title, author, and other relevant attributes before inserting into the TSV, so that NSFW content can be flagged and segregated at an early stage. @aldenstpage
source

[Infrastructure] Create meta-DAG to run Loader Workflow DAG in parallel (original #377)

This issue has been migrated from the CC Search Catalog repository

Author: mathemancer
Date: Tue Apr 28 2020
Labels: ✨ goal: improvement,🙅 status: discontinued

Current Situation

The fastest we can run the loader_workflow Apache Airflow DAG (Directed Acyclic Graph) is once per minute. This is the finest resolution at which the Airflow Scheduler will kick off DAG runs. However, we sometimes want to load more than one file per minute.

Suggested Improvement

We should create a factory method to create a meta-DAG that takes a parameter split (or the like) giving how many times per minute we'd like to kick off a run of loader_workflow. The generated meta-DAG then has split number of SubDagOperator nodes that run in parallel, offset by an appropriate number of seconds. If the meta-DAG is run once per minute, we then get loader_workflow run split times per minute.

Benefit

More throughput of TSVs from the EC2 instance to S3, and eventually PostgreSQL. We are currently not overloaded at the one-per-minute rate, but as we increase the volume of data we collect, this will become necessary.

Additional context

The loader_workflow DAG is defined at src/cc_catalog_airflow/dags/loader_workflow.py

Original Comments:

Issue author amartya-dev commented on Thu Apr 30 2020:

I want to work on this. After talking with @mathemancer, my understanding is that we need to create a meta-DAG which will run the loader workflow.

The process is this:

The meta-DAG runs every minute
It has n branches in parallel, each with a SubDAG node (and a delaying node to offset by 60/n seconds).
When it runs, it runs the n branches, thereby running the SubDAGs (loader_workflow) in parallel.
source
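
The offset arithmetic in the steps above (n branches, each delayed by a multiple of 60/n seconds) can be sketched in plain Python. The function name is hypothetical, and this only computes the delays; it does not build an Airflow DAG.

```python
def branch_offsets(split: int, period_seconds: int = 60) -> list:
    """Return the start delay, in seconds, for each of `split` parallel branches."""
    if split < 1:
        raise ValueError("split must be at least 1")
    return [i * period_seconds / split for i in range(split)]


offsets = branch_offsets(4)  # [0.0, 15.0, 30.0, 45.0]
```

With split = 4, the four branches start 15 seconds apart, so a meta-DAG run once per minute would kick off loader_workflow four times per minute.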

Issue author tushar912 commented on Mon Nov 30 2020:

@mathemancer I am interested to work on this.
source

Issue author tushar912 commented on Mon Nov 30 2020:

@mathemancer Just one question: loader_workflow contains two DAGs, tsv_to_postgres_loader and tsv_to_postgres_loader_overwrite. Does the meta-DAG need to contain both as SubDAGs?
source

mathemancer commented on Wed Dec 02 2020:

@tushar912 It should be both.
source

Project board automations

We should re-implement project board automations similar to those we had with the CC repos. The new project board is available here. Off the top of my head, prior automations included:

Automatically adding new issues + prs in the relevant repos to the project board
- The "backlog" column is probably the best fit
Moving issues with opened, linked PRs to the 'in progress' column

Other suggestions and automations are welcome.

Build UI for API consumers to get their key and check their usage (original #335)

This issue has been migrated from the CC Search API repository

Author: kgodey
Date: Fri Jul 19 2019
Labels: 🏷 status: label work required,🙅 status: discontinued

Original Comments:

annatuma commented on Thu Jan 23 2020:

Tentatively slotted for Q3 in Roadmap. Design mockups from UCB student collaboration are a starting point here, when the time comes.
source

Sketchfab

This issue has been migrated from the CC Search Catalog repository

Author: annatuma
Date: Fri Jan 10 2020
Labels: 🔒 staff only,🙅 status: discontinued

Sketchfab has over 250,000 CC-licensed images, many from GLAM institutions. We'd like to prioritize integration with them at this time.

API info:
https://docs.sketchfab.com/data-api/v3/index.html#/

Per our contact, once you have created a user account you're able to apply for an API key. If that isn't the case, we have a contact to quickly connect with for access.

Original Comments:

akshgpt7 commented on Tue Feb 18 2020:

Are only CC-licensed images to be fetched through this API integration?
source

mathemancer commented on Tue Feb 18 2020:

@akshgpt7 That is correct. I should work on this ticket to reflect that. I'll also mark this as in progress, since it's on my plate for this sprint (i.e., I'll be the one working on it).
source

mathemancer commented on Tue Feb 25 2020:

Marking this as 'blocked' until we figure out how to proceed regarding the fact that it doesn't seem possible to systematically pull their entire collection via some sequence of API calls.
source

akshgpt7 commented on Fri Mar 06 2020:

@mathemancer I think we can pull the required data using this way:

The https://api.sketchfab.com/v3/search?type=models&license=cc0 endpoint returns the various models. For each model, on calling the endpoint in the uri field, we get the metadata for that particular model, which we can easily extract.
(In the initial endpoint, we can replace cc0 by by, by-sa, by-nd, by-nc, by-nc-sa, by-nc-nd to get different licensed models)
So, we can either take the license URL from the license field in the response, otherwise, we already know it by the endpoint we're calling. For example https://api.sketchfab.com/v3/search?type=models&license=by-sa responses will only contain BY-SA licenses.
source
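
The per-license search URLs described in this comment could be generated as follows. This sketch only builds the URLs from the endpoint and license slugs quoted above and makes no network requests; the names `search_url` and `LICENSES` are hypothetical.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.sketchfab.com/v3/search"
# License slugs as listed in the comment above.
LICENSES = ["cc0", "by", "by-sa", "by-nd", "by-nc", "by-nc-sa", "by-nc-nd"]


def search_url(license_slug: str) -> str:
    """Build the model search URL for one license slug."""
    query = urlencode({"type": "models", "license": license_slug})
    return f"{SEARCH_ENDPOINT}?{query}"


urls = {slug: search_url(slug) for slug in LICENSES}
```

Because each URL is tied to one license slug, the license of every result is known from the query itself, even if the response's license field were missing.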

mathemancer commented on Mon Mar 09 2020:

@akshgpt7 Some context: There's a technical limitation in their API that means it's impossible to obtain more than 10,000 images per license that way. (That's actually the approach we tried initially). To see the script, check out the sketchfab_integration branch in this repo.
source

[Infrastructure] Replace "Common Crawl Provider Images ETL" (AWS Data Pipeline) with Apache Airflow DAG (original #458)

This issue has been migrated from the CC Search CCCatalog repository

Author: mathemancer
Date: Fri Jul 10 2020
Labels: ✨ goal: improvement,🔒 staff only,🙅 status: discontinued

Current Situation

"Common Crawl Provider Images ETL" is an AWS Data Pipeline. It's difficult to work on, and when it fails, the first notification is usually when data isn't where we expected it to be.

Suggested Improvement

We should define an Apache Airflow DAG that runs the PySpark job run by the aforementioned pipeline instead.

Benefit

After this change, the success or failure of that pipeline will be logged and visible in the same interface as most of our other data jobs.
Also, this will mean that community members will have more access to be able to understand and help work on the pipeline, which is currently hidden inside AWS.
Finally, this is a step to cutting one technology (AWS Data Pipeline) out of our stack, which always feels like the right direction to head (i.e., the direction of a minimal tech stack).

Additional context

This is part of cc-archive/cccatalog#445

Original Comments:

mathemancer commented on Fri Jul 10 2020:

This will be staff only, since as of yet the Data Pipeline is completely hidden within AWS, and since testing this will involve incurring costs (the pipeline creates a relatively huge Spark cluster).
source

mathemancer commented on Mon Nov 02 2020:

It's likely that we'll actually extend the DAG defined by the implementation of WordPress/openverse-catalog#526 instead of making a separate DAG. This will help manage dependencies between the data pipelines involved.
source

IMSLP

This issue has been migrated from the CC Search Catalog repository

Author: annatuma
Date: Thu Apr 09 2020
Labels: providers,✨ goal: improvement,🙅 status: discontinued

Provider API Endpoint / Documentation

https://imslp.org/api.php

Provider description

This is a provider of sheet music and music recordings. There is CC-licensed content in both categories. For this ticket, we'd like to ingest the CC-licensed audio. In the future we may also want to ingest sheet music, but that is out of scope here.

Example file:
https://imslp.org/wiki/Dilatate_sunt_tribulationes_(Abbatini%2C_Antonio_Maria)

Ticket work required beyond this point

Licenses Provided

Provider API Technical info

Checklist to complete before beginning development

No development should be done on a Provider API Script until the following info is gathered:

Verify there is a way to retrieve the entire relevant portion of the provider's collection in a systematic way via their API.
Verify the API provides license info (license type and version; license URL provides both, and is preferred)
Verify the API provides stable direct links to individual works.
Verify the API provides a stable landing page URL to individual works.
Note other info the API provides, such as thumbnails, dimensions, attribution info (required if non-CC0 licenses will be kept), title, description, other meta data, tags, etc.
Attach example responses to API queries that have the relevant info.
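
To illustrate why a license URL is preferred (it carries both the license type and the version), here is a small sketch that splits a Creative Commons license URL into those two parts. The function name is hypothetical, and it only handles the two URL shapes named in the code comment.

```python
from urllib.parse import urlparse


def parse_license_url(license_url: str):
    """Split a CC license URL into (license type, version), or return None."""
    parts = [p for p in urlparse(license_url).path.split("/") if p]
    # Handled shapes: /licenses/<type>/<version>/ and /publicdomain/zero/<version>/
    if len(parts) >= 3 and parts[0] == "licenses":
        return parts[1], parts[2]
    if len(parts) >= 3 and parts[:2] == ["publicdomain", "zero"]:
        return "cc0", parts[2]
    return None


info = parse_license_url("https://creativecommons.org/licenses/by-nc-nd/3.0/")
```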

General Recommendations for implementation

The script should be in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
The script should have a test suite in the same directory.
The script must use the ImageStore class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
The script should use the DelayedRequester class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
The script must not use anything from
src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since
that module is deprecated.
If the provider API can be queried by 'upload date' or something similar,
the script should take a --date parameter when run as a script, giving the
date for which we should collect images. The form should be YYYY-MM-DD (so,
the script can be run via python my_favorite_provider.py --date 2018-01-01).
The script must provide a main function that takes the same parameters as from
the CLI. In our example from above, we'd then have a main function
my_favorite_provider.main(date). The main should do the same thing calling
from the CLI would do.
The script must conform to PEP8. Please use pycodestyle (available via
pip install pycodestyle) to check for compliance.
The script should use small, testable functions.
The test suite for the script may break PEP8 rules regarding long lines where
appropriate (e.g., long strings for testing).
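
As an illustration of the --date and main(date) recommendations above, a skeleton might look like the following. It is a sketch under assumptions: the helper names are hypothetical, main only returns a placeholder string, and the ImageStore and DelayedRequester classes are deliberately left out because their interfaces are not shown here.

```python
import argparse
from datetime import datetime


def valid_date(text: str) -> str:
    """Accept only zero-padded YYYY-MM-DD strings."""
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        raise argparse.ArgumentTypeError(f"not a YYYY-MM-DD date: {text!r}")
    if parsed.strftime("%Y-%m-%d") != text:
        raise argparse.ArgumentTypeError(f"not a YYYY-MM-DD date: {text!r}")
    return text


def main(date):
    """Entry point shared by the CLI and by callers importing this module."""
    # Placeholder: a real script would request and store that day's records here.
    return f"collecting records uploaded on {date}"


def parse_args(argv=None):
    """Parse CLI arguments; argv=None makes argparse read sys.argv."""
    parser = argparse.ArgumentParser(description="Hypothetical provider script")
    parser.add_argument("--date", type=valid_date, required=True, help="YYYY-MM-DD")
    return parser.parse_args(argv)


# Equivalent to: python my_favorite_provider.py --date 2018-01-01
args = parse_args(["--date", "2018-01-01"])
result = main(args.date)
```

In a real script, the last two lines would sit under an `if __name__ == "__main__":` guard and call parse_args() with no arguments, so that importing the module and calling main(date) directly does the same work as the CLI.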

Examples of other Provider API Scripts

For example Provider API Scripts and accompanying test suites, please see

src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py, or
src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py.

Original Comments:

mathemancer commented on Wed Apr 15 2020:

I love IMSLP! However, note that almost all of the audio you'll find there will be MIDI 'performances'. (just play the mp3 for the Example file). Given that, I'm not sure this should be highly-prioritized.
source

Cinnamon

This issue has been migrated from the CC Search Catalog repository

Author: annatuma
Date: Fri Jun 12 2020
Labels: providers,✨ goal: improvement,🙅 status: discontinued

Prerequisites to this:

Support for video in CC Search
Integration of CC license selection on upload for Cinnamon users

We do not expect to be ready to further investigate this integration until some time in 2021

Provider API Endpoint / Documentation

Provider description

Licenses Provided

Provider API Technical info

Checklist to complete before beginning development

No development should be done on a Provider API Script until the following info is gathered:

Verify there is a way to retrieve the entire relevant portion of the provider's collection in a systematic way via their API.
Verify the API provides license info (license type and version; license URL provides both, and is preferred)
Verify the API provides stable direct links to individual works.
Verify the API provides a stable landing page URL to individual works.
Note other info the API provides, such as thumbnails, dimensions, attribution info (required if non-CC0 licenses will be kept), title, description, other meta data, tags, etc.
Attach example responses to API queries that have the relevant info.

General Recommendations for implementation

The script should be in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
The script should have a test suite in the same directory.
The script must use the ImageStore class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
The script should use the DelayedRequester class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
The script must not use anything from
src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since
that module is deprecated.
If the provider API can be queried by 'upload date' or something similar,
the script should take a --date parameter when run as a script, giving the
date for which we should collect images. The form should be YYYY-MM-DD (so,
the script can be run via python my_favorite_provider.py --date 2018-01-01).
The script must provide a main function that takes the same parameters as from
the CLI. In our example from above, we'd then have a main function
my_favorite_provider.main(date). The main should do the same thing calling
from the CLI would do.
The script must conform to PEP8. Please use pycodestyle (available via
pip install pycodestyle) to check for compliance.
The script should use small, testable functions.
The test suite for the script may break PEP8 rules regarding long lines where
appropriate (e.g., long strings for testing).

Examples of other Provider API Scripts

For example Provider API Scripts and accompanying test suites, please see

src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py, or
src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py.

Unglue.it

This issue has been migrated from the CC Search Catalog repository

Author: annatuma
Date: Fri Nov 15 2019
Labels: providers,✨ goal: improvement,🙅 status: discontinued

This is a provider of texts and is therefore blocked by the Catalog not being ready to ingest that content type at this time

Provider API Endpoint / Documentation

https://unglue.it/api/help

Internal users only: CC has an API key for this service, please check CC's password manager.

Provider description

A provider of openly licensed ebooks, some of which are available from Project Gutenberg.

Licenses Provided

They indicate that the works on their site are CC licensed or have another open license. We'd need to restrict ingestion to CC licenses.

Provider API Technical info

There isn't a clear way for a frontend user to filter books on the site by license type.

The basic API documentation doesn't include license info at the high level:
https://unglue.it/api/v1/?format=json

However, they reference an ONIX structure, where rights information is returned in the Epub License field:

CC BY-NC-ND

01
https://creativecommons.org/licenses/by-nc-nd/3.0/

For example:
https://unglue.it/api/onix/by-nc-nd/epub/?max=20

More work is needed to determine if we can get all the information we need for ingestion

General Recommendations for implementation

The script should be in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
The script should have a test suite in the same directory.
The script must use the ImageStore class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
The script should use the DelayedRequester class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
The script must not use anything from
src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since
that module is deprecated.
If the provider API can be queried by 'upload date' or something similar,
the script should take a --date parameter when run as a script, giving the
date for which we should collect images. The form should be YYYY-MM-DD (so,
the script can be run via python my_favorite_provider.py --date 2018-01-01).
The script must provide a main function that takes the same parameters as from
the CLI. In our example from above, we'd then have a main function
my_favorite_provider.main(date). The main should do the same thing calling
from the CLI would do.
The script must conform to PEP8. Please use pycodestyle (available via
pip install pycodestyle) to check for compliance.
The script should use small, testable functions.
The test suite for the script may break PEP8 rules regarding long lines where
appropriate (e.g., long strings for testing).

Examples of other Provider API Scripts

For example Provider API Scripts and accompanying test suites, please see

src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py, or
src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py.

USC Digital Library

This issue has been migrated from the CC Search Catalog repository

Author: kgodey
Date: Mon Aug 19 2019
Labels: providers,🙅 status: discontinued

http://digitallibrary.usc.edu/

Original Comments:

annatuma commented on Mon Feb 24 2020:

Integration is currently blocked due to dual rights of certain objects, e.g. http://digitallibrary.usc.edu/cdm/singleitem/collection/p15799coll95/id/1/rec/1 where the Rights field indicates both Public Domain and CC-BY. We're in communication with the library regarding this.
source

annatuma commented on Wed Feb 26 2020:

The library confirms that the work to review and correct licensing information will not be done in the short term. We can keep this in Blocked and review at a later time.
source

ChariniNana commented on Fri Mar 13 2020:

Would it be possible to work on this for GSoC?
source

mathemancer commented on Mon Mar 16 2020:

@ChariniNana See the comments from @annatuma . It's not likely to be ready for development any time soon.
source

ccMixter

This issue has been migrated from the CC Search Catalog repository

Author: annatuma
Date: Wed Mar 11 2020
Labels: providers,✨ goal: improvement,🙅 status: discontinued

Provider API Endpoint / Documentation

http://ccmixter.org/query-api

Provider description

http://dig.ccmixter.org/free

Licenses Provided

They have over 29,000 objects, but it is unclear from the UI how many are CC licensed. They appear to also have their own license on some objects. We'd need to restrict to CC BY and CC BY-NC, which are the two licenses they appear to support for objects upon upload.

Edit: CC Mixter now supports the CC 4.0 licenses and seems to have around 65k audio files now. Details here and here.

further information required below this point

Provider API Technical info

Checklist to complete before beginning development

No development should be done on a Provider API Script until the following info is gathered:

Verify there is a way to retrieve the entire relevant portion of the provider's collection in a systematic way via their API.
Verify the API provides license info (license type and version; license URL provides both, and is preferred)
Verify the API provides stable direct links to individual works.
Verify the API provides a stable landing page URL to individual works.
Note other info the API provides, such as thumbnails, dimensions, attribution info (required if non-CC0 licenses will be kept), title, description, other meta data, tags, etc.
Attach example responses to API queries that have the relevant info.

General Recommendations for implementation

The script should be in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
The script should have a test suite in the same directory.
The script must use the ImageStore class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
The script should use the DelayedRequester class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
The script must not use anything from
src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since
that module is deprecated.
If the provider API can be queried by 'upload date' or something similar,
the script should take a --date parameter when run as a script, giving the
date for which we should collect images. The form should be YYYY-MM-DD (so,
the script can be run via python my_favorite_provider.py --date 2018-01-01).
The script must provide a main function that takes the same parameters as from
the CLI. In our example from above, we'd then have a main function
my_favorite_provider.main(date). The main should do the same thing calling
from the CLI would do.
The script must conform to PEP8. Please use pycodestyle (available via
pip install pycodestyle) to check for compliance.
The script should use small, testable functions.
The test suite for the script may break PEP8 rules regarding long lines where
appropriate (e.g., long strings for testing).

Examples of other Provider API Scripts

For example Provider API Scripts and accompanying test suites, please see

src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py, or
src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py.

Iconfinder

This issue has been migrated from the CC Search Catalog repository

Author: akshgpt7
Date: Mon May 18 2020
Labels: Hacktoberfest,help wanted,providers,✨ goal: improvement,🙅 status: discontinued

Provider API Endpoint / Documentation

https://developer.iconfinder.com/

Provider description

Iconfinder is a provider of icons, available in both vector and raster formats. It has Premium and Free options. The majority of the Free icons have a CC license.

Licenses Provided

Available through this endpoint: https://developer.iconfinder.com/reference/get-license-details

They appear to have a number of CC licenses in use, in various versions, each identified by a number. Licenses 1-11 are all CC, but we should go through the full list of available licenses to ensure there aren't additional CC licenses later in the license table.

Provider API Technical info

Checklist to complete before beginning development

No development should be done on a Provider API Script until the following info is gathered:

Verify there is a way to retrieve the entire relevant portion of the provider's collection in a systematic way via their API.
Verify the API provides license info (license type and version; license URL provides both, and is preferred)
Verify the API provides stable direct links to individual works.
Verify the API provides a stable landing page URL to individual works.
Note other info the API provides, such as thumbnails, dimensions, attribution info (required if non-CC0 licenses will be kept), title, description, other meta data, tags, etc.
Attach example responses to API queries that have the relevant info.

General Recommendations for implementation

The script should be in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
The script should have a test suite in the same directory.
The script must use the ImageStore class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
The script should use the DelayedRequester class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
The script must not use anything from
src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since
that module is deprecated.
If the provider API can be queried by 'upload date' or something similar,
the script should take a --date parameter when run as a script, giving the
date for which we should collect images. The form should be YYYY-MM-DD (so,
the script can be run via python my_favorite_provider.py --date 2018-01-01).
The script must provide a main function that takes the same parameters as from
the CLI. In our example from above, we'd then have a main function
my_favorite_provider.main(date). The main function should do the same thing
that calling the script from the CLI would do.
The script must conform to PEP8. Please use pycodestyle (available via
pip install pycodestyle) to check for compliance.
The script should use small, testable functions.
The test suite for the script may break PEP8 rules regarding long lines where
appropriate (e.g., long strings for testing).

Examples of other Provider API Scripts

For example Provider API Scripts and accompanying test suites, please see

src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py, or
src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py.

Original Comments:

annatuma commented on Thu Jun 04 2020:

I've updated the description here with a little more information on the licenses. This is now ready for work. Full ticket work/research must be completed before integration starts, as usual.

Thanks for the suggestion @akshgpt7

kss682 commented on Thu Jul 09 2020:

@annatuma @mathemancer
The CC licenses available are


kss682 commented on Fri Jul 10 2020:

Icons can be collected using the https://api.iconfinder.com/v4/icons/search endpoint.
Sample query to search for free icons:

url = "https://api.iconfinder.com/v4/icons/search"

querystring = {"query":":","count":"10","offset":"0","premium":"0"}
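
As a rough illustration of how the query above could be paged, the sketch below builds the search URL for successive offsets without sending any request. The helper names build_search_url and page_urls are hypothetical, and authentication (which the Iconfinder API may require) is deliberately left out.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.iconfinder.com/v4/icons/search"


def build_search_url(offset=0, count=10):
    """Return the search URL for one page of free (non-premium) icons."""
    params = {
        "query": ":",
        "count": str(count),
        "offset": str(offset),
        "premium": "0",
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


def page_urls(pages, count=10):
    """Return the URLs for the first `pages` pages of results."""
    return [build_search_url(offset=i * count, count=count)
            for i in range(pages)]
```

For instance, page_urls(3) yields the URLs for offsets 0, 10, and 20; each would then be fetched with whatever HTTP client and credentials the script uses.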

Image url
The response is a list of icons. From each icon dictionary we can get the image_url and thumbnail_url from the raster_sizes field by choosing the preview URL (the other is the download link). Icons of different sizes are available in raster_sizes.

{
  "size_height": 512,
  "size": 512,
  "size_width": 512,
  "formats": [
    {
      "download_url": "https://api.iconfinder.com/v4/icons/4359331/formats/png/512/download",
      "preview_url": "https://cdn3.iconfinder.com/data/icons/sale-13/32/tag-512.png",
      "format": "png"
    }
  ]
}
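
One way to pull the two URLs out of raster_sizes is sketched below. It assumes each icon dictionary carries a raster_sizes list shaped like the entry above; using the largest size as the image and the smallest as the thumbnail is an illustrative choice, not something the API or this issue prescribes, and pick_preview_urls is a hypothetical name.

```python
def pick_preview_urls(icon):
    """Return (image_url, thumbnail_url) from an icon's raster_sizes.

    Assumption: the largest raster size serves as the image and the
    smallest as the thumbnail. Returns (None, None) when no raster
    size has any format listed.
    """
    sizes = [s for s in icon.get("raster_sizes", []) if s.get("formats")]
    if not sizes:
        return None, None
    largest = max(sizes, key=lambda s: s["size"])
    smallest = min(sizes, key=lambda s: s["size"])
    return (
        largest["formats"][0]["preview_url"],
        smallest["formats"][0]["preview_url"],
    )
```

The preview URL is taken rather than the download URL, matching the note above that the other link is the download link.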

License
The icon object does not include the license in the response when queried through the search endpoint with premium:0, but the icon details endpoint does return it, even though both return the same icon object. This means we might have to query the icon details endpoint for each icon.

Search endpoint:

"published_at": "2019-03-12T11:08:34.980",
"tags": [...],
"is_premium": false,
"vector_sizes": [...],
"containers": [],
"is_icon_glyph": false,
"icon_id": 4359331,
"styles": [...],
"categories": [...],
"type": "vector",
"raster_sizes": [...]

Foreign identifier
We can use the icon id as the foreign identifier.
Foreign landing page
Couldn't find a landing URL in the response. (I doubt icons have a landing URL at all: on their site, clicking an icon only opens a modal showing the icon details.)


[API Integration - AUDIO] Jamendo (original #345)

This issue has been migrated from the CC Search Catalog repository

Author: amartya-dev
Date: Fri Mar 27 2020
Labels: providers,✨ goal: improvement,🙅 status: discontinued

Provider API Endpoint / Documentation

Documentation: https://developer.jamendo.com/v3.0/docs
The generic GET url form is the following: http[s]://api.jamendo.com/version/entity/subentity/?api_parameter=value

Provider description

On Jamendo Music, you can enjoy a wide catalog of more than 500,000 tracks shared by 40,000 artists from over 150 countries all over the world. You can stream all the music for free, download it and support the artist: become a music explorer and be a part of a great discovery experience!

Licenses Provided

Jamendo uses Creative Commons licenses to enable the free distribution of otherwise copyrighted work. CC licenses all grant 'baseline rights', such as the right to distribute the copyrighted work worldwide for non-commercial purposes, and without modification. Artists choose a license according to the conditions they want to be applied to the song.
As per https://www.jamendo.com/legal/creative-commons

Provider API Technical info

The terms of use do not explicitly mention a limit, though there are different plans that become available only after signup. The plans are read-only and write-only.
The API uses OAuth2 based authentication.
Rate Limit: 35,000 requests per month for non-commercial apps
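
To put the rate limit in perspective, the sketch below estimates how many requests a full crawl would take. It combines the 35,000 requests per month above with the 200-row page size and offset parameter noted in the checklist and the catalog size of roughly 500,000 tracks from the provider description; the constant and function names are hypothetical.

```python
import math

PAGE_SIZE = 200          # rows per response, per the checklist
MONTHLY_BUDGET = 35_000  # requests per month, non-commercial apps
CATALOG_SIZE = 500_000   # approximate track count from the description


def page_offsets(total_rows, page_size=PAGE_SIZE):
    """Return the offset values needed to walk total_rows rows."""
    return range(0, total_rows, page_size)


# Requests for one full pass over the catalog, and the budget share used.
requests_needed = math.ceil(CATALOG_SIZE / PAGE_SIZE)
budget_share = requests_needed / MONTHLY_BUDGET
```

Under these figures a full pass takes 2,500 requests, about 7% of the monthly budget, so the limit should not block a complete crawl as long as each track needs no extra per-item request.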

Checklist to complete before beginning development

No development should be done on a Provider API Script until the following info is gathered:

Verify there is a way to retrieve the entire relevant portion of the provider's collection in a systematic way via their API.
There is a read API with https://api.jamendo.com/v3.0 as the base URL; other information can be sent along with the GET request to retrieve specific info. The API returns 200 rows per response and accepts an offset parameter that specifies the position from which to start returning results.
Verify the API provides license info (license type and version; license URL provides both, and is preferred)
It provides the license URL under the key "license_ccurl" in the API response.
Verify the API provides stable direct links to individual works.
It provides the same under the key "audio" in the response. The download link is also provided under "audiodownload"
Verify the API provides a stable landing page URL to individual works.
The API provides a short and a share version of the landing page URL under the keys "shorturl" and "shareurl" respectively.
Note other info the API provides, such as thumbnails, dimensions, attribution info (required if non-CC0 licenses will be kept), title, description, other meta data, tags, etc.
The API provides duration, album name, album image, artist name, release date, image for the track, etc.
Attach example responses to API queries that have the relevant info.

{
  "headers":{
    "status":"success",
    "code":0,
    "error_message":"",
    "warnings":"",
    "results_count":2
  },
  "results":[
    {
      "id":"1630628",
      "name":"Rock_guard",
      "duration":202,
      "artist_id":"493111",
      "artist_name":"Christian Petermann",
      "artist_idstr":"Christian_Petermann",
      "album_name":"Rock Vision",
      "album_id":"183496",
      "license_ccurl":"http:\/\/creativecommons.org\/licenses\/by-nc-nd\/3.0\/",
      "position":10,
      "releasedate":"2019-03-14",
      "album_image":"https:\/\/imgjam2.jamendo.com\/albums\/s183\/183496\/covers\/1.200.jpg",
      "audio":"https:\/\/mp3l.jamendo.com\/?trackid=1630628&format=mp31&from=app-devsite",
      "audiodownload":"https:\/\/mp3d.jamendo.com\/download\/track\/1630628\/mp32\/",
      "prourl":"https:\/\/licensing.jamendo.com\/track\/1630628",
      "shorturl":"https:\/\/jamen.do\/t\/1630628",
      "shareurl":"https:\/\/www.jamendo.com\/track\/1630628",
      "image":"https:\/\/imgjam2.jamendo.com\/albums\/s183\/183496\/covers\/1.200.jpg",
      "musicinfo":{
        "vocalinstrumental":"instrumental",
        "lang":"",
        "gender":"",
        "acousticelectric":"",
        "speed":"high",
        "tags":{
          "genres":[
            "rock"
          ],
          "instruments":[
            "electricguitar",
            "strings"
          ],
          "vartags":[
            "groovy",
            "energetic"
          ]
        }
      }
    },
    {
      "id":"1425156",
      "name":"Skitz",
      "duration":102,
      "artist_id":"497621",
      "artist_name":"Chris Bleau",
      "artist_idstr":"Chris_Bleau",
      "album_name":"San Diego State of Mind",
      "album_id":"166193",
      "license_ccurl":"http:\/\/creativecommons.org\/licenses\/by-nd\/3.0\/",
      "position":3,
      "releasedate":"2017-02-22",
      "album_image":"https:\/\/imgjam1.jamendo.com\/albums\/s166\/166193\/covers\/1.200.jpg",
      "audio":"https:\/\/mp3l.jamendo.com\/?trackid=1425156&format=mp31&from=app-devsite",
      "audiodownload":"https:\/\/mp3d.jamendo.com\/download\/track\/1425156\/mp32\/",
      "prourl":"https:\/\/licensing.jamendo.com\/track\/1425156",
      "shorturl":"https:\/\/jamen.do\/t\/1425156",
      "shareurl":"https:\/\/www.jamendo.com\/track\/1425156",
      "image":"https:\/\/imgjam1.jamendo.com\/albums\/s166\/166193\/covers\/1.200.jpg",
      "musicinfo":{
        "vocalinstrumental":"instrumental",
        "lang":"",
        "gender":"",
        "acousticelectric":"",
        "speed":"high",
        "tags":{
          "genres":[
            "rock"
          ],
          "instruments":[

          ],
          "vartags":[
            "fun",
            "groovy",
            "party"
          ]
        }
      }
    }
  ]
}
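
As a sketch of how one entry of the response above might be turned into a record, the code below splits license_ccurl into a license type and version and picks out the fields named in the checklist. The function names and the output key names are hypothetical and do not reflect the catalog's actual schema.

```python
from urllib.parse import urlparse


def parse_license_url(license_url):
    """Split a CC license URL into (license_type, version).

    Expects a path of the form /licenses/<type>/<version>/ and returns
    (None, None) for anything else (for example, public domain URLs).
    """
    parts = [p for p in urlparse(license_url).path.split("/") if p]
    if len(parts) >= 3 and parts[0] == "licenses":
        return parts[1], parts[2]
    return None, None


def extract_track(result):
    """Map one decoded entry of "results" to a flat record."""
    license_type, version = parse_license_url(
        result.get("license_ccurl", "")
    )
    return {
        "foreign_identifier": result["id"],
        "foreign_landing_url": result["shareurl"],
        "audio_url": result["audio"],
        "title": result["name"],
        "creator": result["artist_name"],
        "duration": result["duration"],
        "license": license_type,
        "license_version": version,
    }
```

Applied to the first result above (after JSON decoding, which removes the escaped slashes), the license comes out as by-nc-nd, version 3.0.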

General Recommendations for implementation

The script should be in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
The script should have a test suite in the same directory.
The script must use the ImageStore class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
The script should use the DelayedRequester class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
The script must not use anything from
src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since
that module is deprecated.
If the provider API can be queried by 'upload date' or something similar,
the script should take a --date parameter when run as a script, giving the
date for which we should collect images. The form should be YYYY-MM-DD (so,
the script can be run via python my_favorite_provider.py --date 2018-01-01).
The script must provide a main function that takes the same parameters as from
the CLI. In our example from above, we'd then have a main function
my_favorite_provider.main(date). The main function should do the same thing
that calling the script from the CLI would do.
The script must conform to PEP8. Please use pycodestyle (available via
pip install pycodestyle) to check for compliance.
The script should use small, testable functions.
The test suite for the script may break PEP8 rules regarding long lines where
appropriate (e.g., long strings for testing).

Examples of other Provider API Scripts

For example Provider API Scripts and accompanying test suites, please see

src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py, or
src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py.

Original Comments:

mathemancer commented on Wed Apr 01 2020:

Hey, for any boxes checked on the required info above, please provide references or evidence to support checking those boxes. This will help us when this issue is ready for development.

Issue author amartya-dev commented on Fri Apr 03 2020:

@mathemancer I have updated the comment in order to include the required info

mathemancer commented on Tue Apr 07 2020:

Okay, thanks!

Collect sensitive content flag data from providers

This issue has been migrated from the CC Search Catalog repository

Providers with sensitive content information

Freesound is_explicit MTG/freesound@0ddf231

Author: kgodey
Date: Fri Jul 19 2019
Labels: 🙅 status: discontinued

I think this is only for Behance but if other providers offer it, please collect it from them as well.

Original Comments:

annatuma commented on Fri Dec 06 2019:

Also related to 425

annatuma commented on Sat Feb 29 2020:

Blocked by cc-archive/cccatalog#163

Use size and compression as metrics (original #575)

This issue has been migrated from the CC Search API repository

Author: aldenstpage
Date: Wed Jul 29 2020
Labels: ✨ goal: improvement,🏷 status: label work required,🙅 status: discontinued

Problem

Images with low resolution or high compression sometimes show up in the first page of results, even with popularity boosting.

This issue blocks on consuming outbound data from the web crawler.

Description

We should heavily down-weight results with low resolution or high compression. Both of these metrics can be distilled into a single "quality_penalty" value (high compression OR low resolution will result in a higher quality penalty). The thinking here is that low resolution or high compression is a strong indicator that an image is not worth showing, but high resolution and low compression do not necessarily correlate with relevance.
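
One possible shape for such a value is sketched below. Everything beyond the name quality_penalty is an assumption made for illustration: compression is approximated by file size in bytes per pixel, the two thresholds are invented defaults, and taking the maximum of the two partial penalties is just one way to express the "OR" in the description.

```python
def quality_penalty(width, height, file_size_bytes,
                    min_pixels=250_000, min_bytes_per_pixel=0.1):
    """Return a penalty in [0, 1]; 0 means no penalty.

    Assumptions (not from the issue): compression is approximated by
    bytes per pixel, and both thresholds are illustrative defaults.
    """
    pixels = width * height
    if pixels <= 0:
        return 1.0  # no usable image data: maximum penalty
    bytes_per_pixel = file_size_bytes / pixels
    # Each partial penalty grows as the image falls below its threshold
    # and is clamped at 0 above it, so good images earn no bonus.
    resolution_penalty = max(0.0, 1 - pixels / min_pixels)
    compression_penalty = max(
        0.0, 1 - bytes_per_pixel / min_bytes_per_pixel
    )
    # "High compression OR low resolution": the worse of the two wins.
    return max(resolution_penalty, compression_penalty)
```

Because both partial penalties are clamped at zero, an image above both thresholds is never boosted, which matches the point that high resolution and low compression do not by themselves indicate relevance.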