
Comments (11)

sarayourfriend commented on June 15, 2024 2

@WordPress/openverse-catalog This seems like an issue that might make more sense to put into the catalogue. Detecting these keywords might fit nicely into initial ingestion. Getting it for historical records could follow the same pattern as the proposed data normalisation RFC (#345). Should we move this issue? Side note that this is one of the projects mentioned as part of the content safety lighthouse goal in the project planning spreadsheet (#343)

from openverse.

sc0ttkclark commented on June 15, 2024 2

Just +1'ing here because I don't want to let my kids use Openverse for their school projects yet while results include NSFW images (I ran into them even when searching for something for my own demo site).


stacimc commented on June 15, 2024 1

> This seems like an issue that might make more sense to put into the catalogue. Detecting these keywords might fit nicely into initial ingestion.

This seems reasonable to me, although, as you mention, it would require some effort to establish for historical data. I would support this. It seems like one of the original comments here supported this approach as well:

> For the new content we could have a validator method in ImageStore class that checks against title,author and relevant attributes before inserting into tsv
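
For illustration, a minimal sketch of what such a check could look like. This is not the actual ImageStore API: the function name, the field names (`title`, `creator`, `tags`) and the whole-word matching rule are all assumptions made for the example.

```python
import re
from typing import Iterable, Mapping, Set

# Hypothetical field names; the quoted comment only says
# "title, author and relevant attributes".
CHECKED_FIELDS = ("title", "creator", "tags")


def find_sensitive_terms(
    record: Mapping[str, object],
    terms: Iterable[str],
    fields: Iterable[str] = CHECKED_FIELDS,
) -> Set[str]:
    """Return the sensitive terms found as whole words in the checked fields."""
    term_list = [t for t in terms if t]
    if not term_list:
        return set()

    # One case-insensitive pattern that matches any term as a whole word.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in term_list) + r")\b",
        re.IGNORECASE,
    )

    hits: Set[str] = set()
    for field in fields:
        value = record.get(field)
        if value is None:
            continue
        # List-like fields (e.g. tags) are joined into one string.
        if isinstance(value, (list, tuple)):
            text = " ".join(str(v) for v in value)
        else:
            text = str(value)
        hits.update(match.lower() for match in pattern.findall(text))
    return hits
```

A caller could then skip the record, or flag it, whenever the returned set is non-empty before writing the row to the TSV.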


zackkrida commented on June 15, 2024 1

I came across a potentially interesting approach for doing this, with its own tradeoffs, naturally. ES Index Aliases support filtering: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/indices-aliases.html#filtered

What if we put sensitive media into their own indexes (sensitive-images sensitive-audio and so on), filtered by the mature flag but also by filtering against the sensitive term list at index creation time? Then, at search time we route to the sensitive/non-sensitive alias based on the sensitive filter.
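
As a rough sketch of the filtered-alias part of this idea, the request body for Elasticsearch's `_aliases` endpoint could be built like this. The alias names, the `mature` field name and the text fields searched are assumptions for illustration, not the project's actual mapping.

```python
import json

# Hypothetical text fields to check against the sensitive term list.
TEXT_FIELDS = ("title", "description", "tags.name")


def matches_any_term(terms, fields=TEXT_FIELDS):
    """Query clause that matches documents containing any sensitive term."""
    return {
        "bool": {
            "should": [
                {"multi_match": {"query": term, "fields": list(fields)}}
                for term in terms
            ],
            "minimum_should_match": 1,
        }
    }


def alias_actions(index, terms):
    """Body for POST /_aliases adding one sensitive and one non-sensitive alias."""
    is_mature = {"term": {"mature": True}}
    has_term = matches_any_term(terms)
    return {
        "actions": [
            {
                "add": {
                    "index": index,
                    "alias": f"{index}-sensitive",
                    # Sensitive: flagged as mature OR matching a term.
                    "filter": {
                        "bool": {
                            "should": [is_mature, has_term],
                            "minimum_should_match": 1,
                        }
                    },
                }
            },
            {
                "add": {
                    "index": index,
                    "alias": f"{index}-non-sensitive",
                    # Non-sensitive: neither flagged nor matching a term.
                    "filter": {"bool": {"must_not": [is_mature, has_term]}},
                }
            },
        ]
    }


if __name__ == "__main__":
    print(json.dumps(alias_actions("image", ["example-term"]), indent=2))
```

Search code would then query one alias or the other depending on the sensitive filter.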

This would mean that updates to our term list require reindexing, which we currently do weekly as part of the data refresh anyway. I think the idea there is that the sensitive term list can change at all, not that those changes need to be instantaneous.


sarayourfriend commented on June 15, 2024 1

I assume that approach would perform better overall relative to the number of terms. It sounds like a good approach to me!


sarayourfriend commented on June 15, 2024 1

BTW, the correct documentation for our current version of Elasticsearch for the index aliases is this: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/aliases.html#filter-alias

Upon further investigation of the feature, I don't think it is worth spending time trying it, at least not for performance reasons. The filtered alias does not "pre-filter" the documents, it just applies the filter to every query against that index (as far as I can tell based on filtered alias performance questions raised online).

However, it got me digging around because it seemed like something ES should be able to do (create a new index "type thing" from an existing index based on a filter). Indeed, it can: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/docs-reindex.html

Using the reindex API, we can tell ES to "reindex" an existing index into a differently named index and use a filter to select the documents from the origin index. So essentially we'd be copying X index with the same MultiMatch filter I wrote in WordPress/openverse-api#1108. We'd move the logic to the index creation site and call reindex to create the filtered index immediately after an index is created. We'd also need to do that when an index is updated via the update_index action.
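
To make the shape of that call concrete, here is a minimal sketch of a `_reindex` request body that copies only documents not matching any sensitive term. It is not the code from WordPress/openverse-api#1108: the function name, the field list and the destination index name are assumptions.

```python
import json

# Hypothetical text fields; the real filter in the linked PR may differ.
TEXT_FIELDS = ("title", "description", "tags.name")


def filtered_reindex_body(source_index, dest_index, terms, fields=TEXT_FIELDS):
    """Body for POST /_reindex that excludes documents matching any term."""
    return {
        "source": {
            "index": source_index,
            "query": {
                "bool": {
                    # One multi_match clause per term; a document matching
                    # any of them is left out of the destination index.
                    "must_not": [
                        {"multi_match": {"query": term, "fields": list(fields)}}
                        for term in terms
                    ]
                }
            },
        },
        "dest": {"index": dest_index},
    }


if __name__ == "__main__":
    body = filtered_reindex_body("image", "image-filtered", ["example-term"])
    print(json.dumps(body, indent=2))
```

The body would be sent to the `_reindex` endpoint right after the origin index is created or updated. Unlike a filtered alias, this selects documents once at copy time rather than on every search.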

I'm going to move this issue back to the API repository as it now seems to be squarely in the domain of the API/ingestion server rather than the catalogue, at least under our currently agreed-upon approach for sensitive term filtering.

That said, while I think it is a good idea (for query performance) to create these secondary indexes that exclude documents matching sensitive terms, there are a couple of questions/complications I want to raise early so that we can consider them:

  1. How will this affect the size of our ES cluster, specifically with disk usage?
  2. Will it also affect memory, as twice as many indexes will suddenly be queried?
  3. Will this effectively double index creation time? Will it be more because we'll be filtering documents as well (sending a query) to derive the documents for the complementary query?
  4. What should we call this secondary index that excludes the sensitive terms? "{model_name}-safe" comes to mind, but then I thought, what if we actually give the modified name to the unfiltered query, then we don't have to ponder whether "safe" is a term we want to use at all (given what it might incorrectly imply). If we named the unfiltered (original) index (the one that exists now) "{model_name}-unfiltered" and the filtered index just "{model_name}", we'd get around all of these complications with the added bonus of making it somewhat clearer what is technically different about the two indexes (rather than what may or may not be qualitatively different about them, depending on how you look at it).
  5. Do we want to keep the ability to apply additional sensitive term filters at query time, following the pattern in WordPress/openverse-api#1108? If we did, I am assuming the list would only be used in an "emergency" type situation where we realised that a critical term we want to filter out of the default searches ASAP had been left out of the list used to create the filtered index. The invention of a new slur or something is the only thing I can really think of where this would apply. I am mildly sceptical that it would be useful to keep in, as it seems unlikely we'd need it and "no code is the best code"/YAGNI.


sarayourfriend commented on June 15, 2024

> What should we call this secondary index that excludes the sensitive terms? "{model_name}-safe" comes to mind, but then I thought, what if we actually give the modified name to the unfiltered query, then we don't have to ponder whether "safe" is a term we want to use at all (given what it might incorrectly imply). If we named the unfiltered (original) index (the one that exists now) "{model_name}-unfiltered" and the filtered index just "{model_name}", we'd get around all of these complications with the added bonus of making it somewhat clearer what is technically different about the two indexes (rather than what may or may not be qualitatively different about them, depending on how you look at it).

Thinking more about this, it's far more complicated to rename the origin index at this point, so we could just use the word "filtered" for the new one (which has the same benefits of unfiltered I suggested before).
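
Under that naming, picking the index at search time could be as simple as the sketch below. The "-filtered" suffix format and the `include_sensitive` parameter name are assumptions for illustration.

```python
def index_for_search(model_name: str, include_sensitive: bool) -> str:
    """Pick the index to query for a media type.

    The origin index keeps its existing name; the copy that excludes
    documents matching sensitive terms gets a "-filtered" suffix.
    """
    return model_name if include_sensitive else f"{model_name}-filtered"


if __name__ == "__main__":
    print(index_for_search("image", include_sensitive=False))
    print(index_for_search("audio", include_sensitive=True))
```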


sarayourfriend commented on June 15, 2024

I've updated the PR linked to this issue to also enable the creation of filtered indexes. We can remove the Django API behaviour later, depending on whether we decide it's worth keeping around.

The PR creates a new action for the reindexing rather than trying to add it to an existing step. This means we'll need to update the data refresh DAG to call the new action as well as the "POINT_ALIAS" action afterwards, mirroring the changes made to load_sample_data.sh in the PR.


AetherUnbound commented on June 15, 2024

@sarayourfriend should we close this issue in favor of some of the other plans/RFCs currently ongoing? Or will we just end up using this issue for that work?


sarayourfriend commented on June 15, 2024

The project thread references it: #377

We could close this issue once the project thread is closed, or close it now as a duplicate; I have no preference.


AetherUnbound commented on June 15, 2024

I'll go ahead and close the issue, it's linked for context as you mention anyway 🙂

