search-terms-sanitization

Code for evaluating and implementing search terms sanitization.

Working in this repo

Making commits

Open a PR and get one passing review before merging.

Directory structure

This repo's directory structure is minimal for now. We'll add more structure as we go.


.circleci	CircleCI configuration
nightly-job	code for the sanitization job that runs nightly
assets	public data, such as US Census surnames
non_sensitive	analyses that do not involve sensitive search data
suggest_search_tools	reusable Python code for the research team

Set-up

Request access to the [email protected] service account. This documentation describes how.
Create a GCP-hosted notebook environment and clone this repo into it. This video tutorial demonstrates how.

Optional: To use the code in the suggest_search_tools/ directory as a Python library, install it with pip:

cd search-terms-sanitization/  # make sure you're in the search-terms-sanitization/ directory
pip install -e .               # -e installs in editable (develop) mode

Although optional in general, this install is required to run the notebooks in non_sensitive/.
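As a quick check that the editable install worked, a minimal sketch along these lines can be run from the notebook environment. It assumes the package is importable under the name suggest_search_tools (taken from the directory name; the actual import name is set by the repo's packaging config and may differ).

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: the package's import name matches the directory name.
PACKAGE_NAME = "suggest_search_tools"

if __name__ == "__main__":
    if is_importable(PACKAGE_NAME):
        print(f"{PACKAGE_NAME} is installed and importable")
    else:
        print(f"{PACKAGE_NAME} not found; re-run `pip install -e .` from the repo root")
```

If the check fails, the most likely causes are running pip from a different directory or installing into a different Python environment than the one the notebook kernel uses.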

Outputs

The nightly sanitization job writes data to the following tables:

sanitized search terms: moz-fx-data-shared-prod.search_terms_derived.merino_log_sanitized_v3
the job metadata table: moz-fx-data-shared-prod.search_terms_derived.sanitization_job_metadata
the job metadata languages table: moz-fx-data-shared-prod.search_terms.sanitization_job_languages
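To inspect one of these outputs from a notebook, something like the sketch below may help. It assumes the names above are BigQuery table IDs in project.dataset.table form, that the google-cloud-bigquery package is installed, and that your credentials have read access. The helper names (parse_table_id, build_preview_query, preview_table) are illustrative and not part of this repo.

```python
from typing import Dict, List, NamedTuple

# Table ID copied from the list above.
JOB_METADATA_TABLE = (
    "moz-fx-data-shared-prod.search_terms_derived.sanitization_job_metadata"
)


class TableRef(NamedTuple):
    project: str
    dataset: str
    table: str


def parse_table_id(table_id: str) -> TableRef:
    """Split a fully qualified 'project.dataset.table' ID into its parts."""
    parts = table_id.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected project.dataset.table, got {table_id!r}")
    return TableRef(*parts)


def build_preview_query(table_id: str, limit: int = 10) -> str:
    """Build a small preview query; validates the table ID first."""
    parse_table_id(table_id)
    return f"SELECT * FROM `{table_id}` LIMIT {int(limit)}"


def preview_table(table_id: str, limit: int = 10) -> List[Dict]:
    """Run the preview query. Assumes BigQuery access is already configured."""
    from google.cloud import bigquery  # imported lazily; needs google-cloud-bigquery

    client = bigquery.Client()
    rows = client.query(build_preview_query(table_id, limit)).result()
    return [dict(row.items()) for row in rows]


if __name__ == "__main__":
    print(build_preview_query(JOB_METADATA_TABLE, limit=5))
```

Calling preview_table(JOB_METADATA_TABLE) would return the first few rows as dictionaries. The column names are not documented here, so the sketch selects all columns rather than guessing at them.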

Related artifacts

Search Terms Sanitization Proposal: lists goals, requirements, high-level sanitization strategies
Technical report: Evaluation of PII removal strategies.

search-terms-sanitization's Issues

Add open source software license

This Mozilla repository has been identified as lacking a license. Consistent with Mozilla's Licensing Policy, an open source license should be applied to the code in this repository.

Please add an appropriate LICENSE.md file to the root directory of the project. In general, Mozilla's licensing policies are as follows:

Client-side products created by Mozilla employees or contributors should use the Mozilla Public License, Version 2.0 (MPL).
Server-side products or utilities that support Mozilla products may use either the MPL or the Apache License 2.0 (Apache 2.0).

In special cases, another license might be appropriate. If the repository is a fork of another repository, it must apply the license of the original. Similarly, another license might be appropriate to match that of a broader project (for example, Rust crates that Firefox depends on are published under an Apache 2.0 / MIT dual license, as that is the dual license used by the Rust programming language and projects).

Please ensure that any license added to the LICENSE.md file matches other licensing information in the repository (for example, it should match any license indicated in a setup.py or package.json file).

Mozilla staff can access more information in our Software Licensing Runbook – search for “Licensing Runbook” in Confluence to find it.

If you have any questions, contact Daniel Nazer, who can be reached at dnazer on Mozilla email or Slack.

OPENLIC-2023-01

Recent CI runs are failing

Recently, CircleCI build-and-push-image jobs have been failing, e.g., this one.

The "Initialize gcloud CLI" step fails with ERROR: gcloud crashed (ValueError): No key could be detected.
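One possible cause, not confirmed in this issue, is that the service account key given to gcloud is empty or malformed. A minimal diagnostic sketch follows. It assumes the key is supplied as a JSON string in an environment variable; the name GCLOUD_SERVICE_KEY is an assumption and should be replaced with whatever the CI config actually uses. The function check_service_account_key is hypothetical and only checks the key's basic shape, not its validity with Google Cloud.

```python
import json
import os
from typing import List

# Fields that a Google service account JSON key file contains.
REQUIRED_FIELDS = ("type", "private_key", "client_email")


def check_service_account_key(raw: str) -> List[str]:
    """Return a list of problems found in the key text (empty list if none)."""
    if not raw or not raw.strip():
        return ["key is empty or unset"]
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"key is not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["key JSON is not an object"]

    problems = []
    for field in REQUIRED_FIELDS:
        if not data.get(field):
            problems.append(f"missing field: {field}")
    if data.get("type") and data["type"] != "service_account":
        problems.append(f"unexpected type: {data['type']}")
    private_key = data.get("private_key")
    if isinstance(private_key, str) and private_key.strip():
        if not private_key.lstrip().startswith("-----BEGIN"):
            problems.append("private_key does not look PEM-encoded")
    return problems


if __name__ == "__main__":
    # Assumption: the env var name below is illustrative only.
    issues = check_service_account_key(os.environ.get("GCLOUD_SERVICE_KEY", ""))
    print("key looks well-formed" if not issues else "\n".join(issues))
```

The script never prints the key itself, only the names of missing or suspicious fields, so its output is safe to include in CI logs.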
