Comments (11)
@WordPress/openverse-catalog This seems like an issue that might make more sense to put into the catalogue. Detecting these keywords might fit nicely into initial ingestion. Getting it for historical records could follow the same pattern as the proposed data normalisation RFC (#345). Should we move this issue? Side note that this is one of the projects mentioned as part of the content safety lighthouse goal in the project planning spreadsheet (#343)
from openverse.
Just +1'ing here because I don't want to let my kids use Openverse for their school projects yet while results include NSFW images (I even ran into some when searching for something for my own demo site).
from openverse.
> This seems like an issue that might make more sense to put into the catalogue. Detecting these keywords might fit nicely into initial ingestion.

This seems reasonable to me, although, as you mention, it would require some effort to establish for historical data. I would support this. It seems like one of the original comments here supported this approach as well:

> For the new content we could have a validator method in `ImageStore` class that checks against title, author and relevant attributes before inserting into tsv
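As a rough illustration of the validator idea quoted above, here is a minimal sketch. Everything in it is an assumption made for illustration: the function name, the checked field names, and the placeholder term list are hypothetical and are not taken from the `ImageStore` class or the catalogue code.

```python
# Hypothetical placeholder terms; the real sensitive term list is not part of this thread.
SENSITIVE_TERMS = frozenset({"term-a", "term-b"})

# Assumed field names, loosely following "title, author and relevant attributes".
CHECKED_FIELDS = ("title", "creator", "tags")


def contains_sensitive_term(record, terms=SENSITIVE_TERMS, fields=CHECKED_FIELDS):
    """Return True if any checked field of the record contains a listed term.

    Matching is a naive, case-insensitive substring check.
    """
    for field in fields:
        value = record.get(field) or ""
        if not isinstance(value, str):
            # Flatten list-like values (for example tags) into one string.
            value = " ".join(str(item) for item in value)
        lowered = value.lower()
        if any(term in lowered for term in terms):
            return True
    return False
```

A check like this could run before a record is written to the TSV, either to skip the record or to flag it. Substring matching will produce false positives on words that merely contain a term, so a real implementation would likely need word-boundary handling.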
from openverse.
I came across a potentially interesting approach for doing this, with its own tradeoffs, naturally. ES index aliases support filtering: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/indices-aliases.html#filtered
What if we put sensitive media into their own indexes (`sensitive-images`, `sensitive-audio`, and so on), filtered by the `mature` flag but also by filtering against the sensitive term list at index creation time? Then, at search time, we route to the sensitive/non-sensitive alias based on the sensitive filter.
This would mean that updates to our term list require reindexing, which we currently do weekly as part of the data refresh anyway. I think the requirement is that the sensitive term list can change at all, not that those changes need to take effect instantly.
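To make the filtered alias idea concrete, here is a sketch of a request body for the Elasticsearch `_aliases` endpoint that attaches a filter to an alias. The function name, the alias name, and the text field names (`title`, `description`, `tags.name`) are assumptions for illustration; only the `mature` flag and the term list come from the proposal above.

```python
def non_sensitive_alias_actions(index, alias, terms,
                                fields=("title", "description", "tags.name")):
    """Build a body for POST /_aliases that adds a filtered alias.

    The filter excludes documents flagged as mature and documents whose
    text fields match any of the given terms.
    """
    term_clauses = [
        {"multi_match": {"query": term, "fields": list(fields), "type": "phrase"}}
        for term in terms
    ]
    alias_filter = {
        "bool": {"must_not": [{"term": {"mature": True}}, *term_clauses]}
    }
    return {
        "actions": [
            {"add": {"index": index, "alias": alias, "filter": alias_filter}}
        ]
    }
```

The complementary "sensitive" alias would presumably use the same clauses under `should` rather than `must_not`; that part is left out here because the thread does not spell it out.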
from openverse.
I assume that approach would perform better overall relative to the number of terms. It sounds like a good approach to me!
from openverse.
BTW, the correct documentation for our current version of Elasticsearch for the index aliases is this: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/aliases.html#filter-alias
Upon further investigation of the feature, I don't think it is worth spending time trying it, at least not for performance reasons. The filtered alias does not "pre-filter" the documents, it just applies the filter to every query against that index (as far as I can tell based on filtered alias performance questions raised online).
However, it got me digging around because it seemed like something ES should be able to do (create a new index "type thing" from an existing index based on a filter). Indeed, it can: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/docs-reindex.html
Using the reindex API, we can tell ES to "reindex" an existing index into a differently named index and use a filter to select the documents from the origin index. So essentially we'd be copying index X with the same MultiMatch filter I wrote in WordPress/openverse-api#1108. We'd move the logic to the index creation site and call `reindex` to create the filtered index immediately after an index is created. We'd also need to do that when an index is updated via the `update_index` action.
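For reference, a filtered copy through the reindex API boils down to a request body with a `source` (index plus query) and a `dest`. The sketch below builds such a body; the function name, the destination index name, and the field names are assumptions for illustration, and the exact query used in WordPress/openverse-api#1108 may differ.

```python
def build_filtered_reindex_body(source_index, dest_index, terms,
                                fields=("title", "description", "tags.name")):
    """Build a body for POST /_reindex that copies only documents which
    do NOT match any of the sensitive terms."""
    exclude_clauses = [
        {"multi_match": {"query": term, "fields": list(fields), "type": "phrase"}}
        for term in terms
    ]
    return {
        "source": {
            "index": source_index,
            # Select everything except documents matching a sensitive term.
            "query": {"bool": {"must_not": exclude_clauses}},
        },
        "dest": {"index": dest_index},
    }
```

The resulting dictionary can be sent as JSON to the `_reindex` endpoint once the origin index has been created or updated. Since reindexing copies documents, the destination index takes its own disk space, which is where the questions below about cluster size come from.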
I'm going to move this issue back to the API repository as it now seems to be squarely in the domain of the API/ingestion server rather than the catalogue, at least under our currently agreed-upon approach for sensitive term filtering.
That said, while I think it is a good idea (for query performance) to create these secondary indexes that exclude documents matching sensitive terms, there are a couple of questions/complications I want to raise early so that we can consider them:
- How will this affect the size of our ES cluster, specifically with disk usage?
- Will it also affect memory, as twice as many indexes will now be queried?
- Will this effectively double index creation time? Will it be more because we'll be filtering documents as well (sending a query) to derive the documents for the complementary query?
- What should we call this secondary index that excludes the sensitive terms? `"{model_name}-safe"` comes to mind, but then I thought: what if we give the modified name to the unfiltered index instead? Then we don't have to ponder whether "safe" is a term we want to use at all (given what it might incorrectly imply). If we named the unfiltered (original) index (the one that exists now) `"{model_name}-unfiltered"` and the filtered index just `"{model_name}"`, we'd get around all of these complications, with the added bonus of making it somewhat clearer what is technically different about the two indexes (rather than what may or may not be qualitatively different about them, depending on how you look at it).
- Do we want to keep the ability to apply additional sensitive term filters at query time, following the pattern in WordPress/openverse-api#1108? If we did, I am assuming the list would only be used in an "emergency" type situation where we realised a critical term was left out of the list used to create the filtered index and we want to filter it out of the default searches ASAP. The invention of a new slur or something is the only thing I can really think of where this would apply. I am mildly sceptical that it would be useful to keep, as it seems unlikely we'd need it and "no code is the best code"/YAGNI.
from openverse.
> What should we call this secondary index that excludes the sensitive terms? "{model_name}-safe" comes to mind, but then I thought, what if we actually give the modified name to the unfiltered query, then we don't have to ponder whether "safe" is a term we want to use at all (given what it might incorrectly imply). If we named the unfiltered (original) index (the one that exists now) "{model_name}-unfiltered" and the filtered index just "{model_name}", we'd get around all of these complications with the added bonus of making it somewhat clearer what is technically different about the two indexes (rather than what may or may not be qualitatively different about them, depending on how you look at it).
Thinking more about this, it's far more complicated to rename the origin index at this point, so we could just use the word "filtered" for the new one (which has the same benefits as the `unfiltered` name I suggested before).
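With that naming, choosing an index at search time could look roughly like the sketch below. The `"{model_name}-filtered"` pattern is an assumption modelled on the `"{model_name}-unfiltered"` suggestion above, and the function and parameter names are hypothetical.

```python
def search_index_name(model_name: str, include_sensitive: bool) -> str:
    """Pick the index to query.

    The origin index keeps its existing name; the copy that excludes
    documents matching sensitive terms gets a "-filtered" suffix.
    """
    return model_name if include_sensitive else f"{model_name}-filtered"
```

For example, `search_index_name("image", include_sensitive=False)` selects `image-filtered`, so default searches never touch documents that matched the term list when the filtered index was built.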
from openverse.
I've updated the PR linked to this issue to also enable the creation of filtered indexes. We can remove the Django API behaviour later if we decide it isn't worth keeping around.
The PR creates a new action for the reindexing rather than trying to add it to an existing step. This means we'll need to update the data refresh DAG to call the new action as well as the "POINT_ALIAS" action afterwards, mirroring the changes made to `load_sample_data.sh` in the PR.
from openverse.
@sarayourfriend should we close this issue in favor of some of the other plans/RFCs currently ongoing? Or will we just end up using this issue for that work?
from openverse.
The project thread references it: #377
We could close this issue once the project thread is closed, or close it now as a duplicate. I have no preference.
from openverse.
I'll go ahead and close the issue; it's linked for context anyway, as you mention 🙂
from openverse.