search-terms-sanitization
Code for evaluating and implementing search terms sanitization.
Working in this repo
Making commits
Open a PR and get one passing review before merging.
Directory structure
This repo's directory structure is minimal for now. We'll add more structure as we go.
| Directory | Contents |
| --- | --- |
| .circleci | CircleCI configuration |
| nightly-job | code for the sanitization job that runs nightly |
| assets | public data like US Census surnames |
| non_sensitive | analyses that do not involve sensitive search data |
| suggest_search_tools | reusable Python code for the research team |
Set-up
- Request access to the [email protected] service account. This documentation describes how.
- Create a GCP-hosted notebook environment and clone this repo into it. This video tutorial demonstrates how.
- Optional: If you want to use the code in the suggest_search_tools/ directory as a Python library, you can pip install it:

      cd search-terms-sanitization/  # make sure you're in the search-terms-sanitization/ directory
      pip install -e .               # -e installs in editable (develop) mode

  This is needed to run the notebooks in non_sensitive/.
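To confirm from a notebook that the optional editable install took effect, a quick import check along these lines may help. This is a minimal sketch: it assumes the installed import name matches the suggest_search_tools/ directory name, and `is_importable` is a hypothetical helper, not part of this repo.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: the import name is the same as the suggest_search_tools/ directory.
if is_importable("suggest_search_tools"):
    print("suggest_search_tools is importable")
else:
    print("suggest_search_tools not found; run `pip install -e .` from the repo root")
```

If the check fails right after installing, restarting the notebook kernel is usually worth trying before reinstalling.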
Outputs
The nightly sanitization job writes data to the following tables:
- sanitized search terms:
moz-fx-data-shared-prod.search_terms_derived.merino_log_sanitized_v3
- the job metadata table:
moz-fx-data-shared-prod.search_terms_derived.sanitization_job_metadata
- the job metadata languages table:
moz-fx-data-shared-prod.search_terms.sanitization_job_languages
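To peek at these tables from a notebook, something like the sketch below could work. The helper names (`OUTPUT_TABLES`, `preview_query`, `preview`) are hypothetical and not part of this repo. The sketch assumes the `google-cloud-bigquery` package is installed and that your credentials can read the tables. No column names are assumed, since the schemas are not described here.

```python
# Table IDs copied from the list above, in project.dataset.table form.
OUTPUT_TABLES = {
    "sanitized_terms": "moz-fx-data-shared-prod.search_terms_derived.merino_log_sanitized_v3",
    "job_metadata": "moz-fx-data-shared-prod.search_terms_derived.sanitization_job_metadata",
    "job_languages": "moz-fx-data-shared-prod.search_terms.sanitization_job_languages",
}


def preview_query(table_key: str, limit: int = 10) -> str:
    """Build a query that returns a few rows without assuming any column names."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    table_id = OUTPUT_TABLES[table_key]
    # Backticks are needed because the project ID contains hyphens.
    return f"SELECT * FROM `{table_id}` LIMIT {int(limit)}"


def preview(table_key: str, limit: int = 10):
    """Run the preview query; needs google-cloud-bigquery and read access."""
    # Imported lazily so the helpers above work even without the package.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses the environment's default credentials
    return list(client.query(preview_query(table_key, limit)).result())
```

For example, `preview("job_metadata", 5)` would fetch five rows of job metadata. If a table turns out to require a partition filter, the query would need a suitable `WHERE` clause added.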
Related artifacts
- Search Terms Sanitization Proposal: lists goals, requirements, high-level sanitization strategies
- Technical report: Evaluation of PII removal strategies.