
googlecloudplatform / dlp-pdf-redaction


License: Apache License 2.0


dlp-pdf-redaction's Introduction

Solution Guide

This solution provides an automated, serverless way to redact sensitive data from PDF files using Google Cloud Services like Data Loss Prevention (DLP), Cloud Workflows, and Cloud Run.

Solution Architecture Diagram

The image below describes the solution architecture of the PDF redaction process.

Architecture Diagram

Workflow Steps

The workflow consists of the following steps:

  1. The user uploads a PDF file to a GCS bucket
  2. A workflow is triggered via Eventarc. This workflow orchestrates the PDF redaction through the following steps:
    • Split the PDF into single pages, convert each page to an image, and store the images in a working bucket
    • Redact each image using the DLP image redaction API
    • Reassemble the PDF from the redacted images and store it in the output bucket on GCS
    • Write the redacted quotes (findings) to BigQuery
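The per-page redaction step calls the DLP `image:redact` REST endpoint. A minimal hedged sketch of such a request, assuming application-default credentials and a base64-encoded page image (the project ID, InfoTypes, and payload below are placeholders, not the solution's actual configuration):

```shell
# Sketch only: the request shape follows the DLP v2 image:redact REST API.
# PROJECT_ID and the base64 payload are placeholders.
PROJECT_ID="my-project"
cat > redact-request.json <<'EOF'
{
  "byteItem": {"type": "IMAGE_PNG", "data": "<base64-encoded page image>"},
  "inspectConfig": {"infoTypes": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}]}
}
EOF
# Actual call (needs network access and credentials, so shown commented out):
# curl -s -X POST \
#   -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#   -H "Content-Type: application/json" \
#   "https://dlp.googleapis.com/v2/projects/${PROJECT_ID}/image:redact" \
#   -d @redact-request.json
echo "request file written"
```

The deployed solution performs this call from the dlp-runner service; the sketch is only meant to show the shape of the API interaction.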

Deploy PDF Redaction app

The terraform folder contains the code needed to deploy the PDF Redaction application.

What resources are created?

Main resources:

  • Workflow
  • Cloud Run services for each component, each with its own service account and permissions:
    1. pdf-splitter - Splits the PDF into single-page image files
    2. dlp-runner - Runs each page image through DLP to redact sensitive information
    3. pdf-merger - Reassembles the pages into a single PDF
    4. findings-writer - Writes findings to BigQuery
  • Cloud Storage buckets
    • Input Bucket - bucket where the original file is stored
    • Working Bucket - a working bucket where all temporary files are stored throughout the different workflow stages
    • Output Bucket - bucket where the redacted file is stored
  • DLP template where InfoTypes and rules are specified. You can modify the dlp.tf file to specify your own InfoTypes and rule sets (refer to the Terraform documentation for DLP templates)
  • BigQuery dataset and table where findings will be written
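After deployment, one way to confirm the DLP template resource exists is the DLP v2 `inspectTemplates` list endpoint. A hedged sketch (the project ID is a placeholder, and the call itself needs credentials, so it is shown commented out):

```shell
# Sketch: list DLP inspect templates to confirm the template was created.
PROJECT_ID="my-project"   # placeholder; use your own project ID
URL="https://dlp.googleapis.com/v2/projects/${PROJECT_ID}/inspectTemplates"
echo "GET ${URL}"
# curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" "${URL}"
```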

How to deploy?

The following steps should be executed in Cloud Shell in the Google Cloud Console.

1. Create a project and enable billing

Follow the steps in this guide.

2. Get the code

Clone this GitHub repository and go to the root of the repository.

git clone https://github.com/GoogleCloudPlatform/dlp-pdf-redaction
cd dlp-pdf-redaction

3. Build images for Cloud Run

You will first need to build the docker images for each microservice.

PROJECT_ID=[YOUR_PROJECT_ID]
gcloud services enable cloudbuild.googleapis.com containerregistry.googleapis.com --project $PROJECT_ID
gcloud builds submit --config ./build-app-images.yaml --project $PROJECT_ID

Note: If you receive a pop-up for permissions, you can authorize gcloud to request your credentials and make a GCP API call.

The above command will build 4 docker images and push them into Google Container Registry (GCR). Run the following command and confirm that the images are present in GCR.

gcloud container images list --project $PROJECT_ID

4. Deploy the infrastructure using Terraform

This Terraform deployment requires the following variables:

  • project_id = "YOUR_PROJECT_ID"
  • region = "YOUR_REGION"
  • wf_region = "YOUR_WORKFLOW_REGION"

From the root folder of this repo, run the following commands:

export TF_VAR_project_id=$PROJECT_ID
terraform -chdir=terraform init
terraform -chdir=terraform apply -auto-approve

Note: Region and Workflow region both default to us-central1. If you wish to deploy the resources in a different region, set the region and wf_region variables (i.e. via TF_VAR_region and TF_VAR_wf_region). Cloud Workflows is only available in specific regions; for more information, check the documentation.
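For example, overriding both regions could look like the following (the region names are illustrative; pick any region supported by Cloud Workflows):

```shell
# Region names below are illustrative, not a recommendation.
export TF_VAR_project_id="${PROJECT_ID}"
export TF_VAR_region="europe-west1"
export TF_VAR_wf_region="europe-west1"
echo "Deploying to ${TF_VAR_region} (workflows: ${TF_VAR_wf_region})"
# Then re-run:
#   terraform -chdir=terraform init
#   terraform -chdir=terraform apply -auto-approve
```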

5. Take note of Terraform Outputs

Once Terraform finishes provisioning all resources, you will see its outputs. Take note of the input_bucket and output_bucket values. Files uploaded to the input_bucket will be processed automatically, and the redacted files will be written to the output_bucket. If you missed the outputs from the first run, you can list them again by running:

terraform -chdir=terraform output

6. Test

Use the command below to upload the test file into the input_bucket. After a few seconds, you should see a redacted PDF file in the output_bucket.

gsutil cp ./test_file.pdf [INPUT_BUCKET_FROM_OUTPUT e.g. gs://pdf-input-bucket-xxxx]

If you are curious about what happens behind the scenes, try the following:

  • Check out the redacted file in the output_bucket:

    gsutil ls [OUTPUT_BUCKET_FROM_OUTPUT e.g. gs://pdf-output-bucket-xxxx]
    
  • Download the redacted pdf file, open it with your preferred pdf reader, and search for text in the PDF file.

  • Look at Cloud Workflows in the GCP web console. You will see that a workflow execution was triggered when you uploaded the file to GCS.

  • Explore the pdf_redaction_xxxx dataset in BigQuery and check out the metadata that was inserted into the findings table.
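A query against the findings table can be sketched as follows. This is hypothetical: the dataset suffix comes from your deployment's Terraform outputs, and the table schema is not documented here, so `SELECT *` keeps the sketch schema-agnostic.

```shell
# Hypothetical query sketch; replace the placeholders with your own values.
PROJECT_ID="my-project"        # placeholder
DATASET="pdf_redaction_xxxx"   # use the suffix from your deployment
SQL="SELECT * FROM \`${PROJECT_ID}.${DATASET}.findings\` LIMIT 10"
echo "${SQL}"
# bq query --use_legacy_sql=false "${SQL}"
```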

dlp-pdf-redaction's People

Contributors: felimartina, gracehoogendoorn

dlp-pdf-redaction's Issues

Seems to be failing on step 2 for me which is Split PDF into Pages

HTTP server responded with error code 503
in step "2. Split PDF into pages", routine "main", line: 48
{
"body": "Service Unavailable",
"code": 503,
"headers": {
"Alt-Svc": "h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000",
"Content-Length": "19",
"Content-Type": "text/plain",
"Date": "Fri, 03 Nov 2023 17:47:53 GMT",
"Server": "Google Frontend",
"X-Cloud-Trace-Context": "853a3ce82faa491cbd266fd3aa572001;o=1"
},
"message": "HTTP server responded with error code 503",
"tags": [
"HttpError"
]
}

Any ideas?

usage of archived TF provider

As per: https://registry.terraform.io/providers/hashicorp/template/latest

This provider has been archived. Please use the templatefile function or the Cloudinit provider instead. See documentation for more details.

The solution can't be deployed from an M1 Mac because the archived provider doesn't support the new architecture.


Current state:

terraform apply fails with the error:

│ Error: Incompatible provider version
│ 
│ Provider registry.terraform.io/hashicorp/template v2.2.0 does not have a package available for your current platform, darwin_arm64.
│ 
│ Provider releases are separate from Terraform CLI releases, so not all providers are available for all platforms. Other versions of this provider may have different platforms supported.
╵

500 server error

BUG: Single-page PDFs fail in the splitter phase.
Description: I get a 500 error any time I upload a single-page PDF; multi-page files work fine.

2022-07-07T21:54:40.463871Z POST 500 653 B 160 ms GoogleCloudWorkflows; (+https://cloud.google.com/workflows/docs) https://pdf-splitter-98ea-buxissg3zq-uc.a.run.app/
2022-07-07T21:54:46.046341Z [2022-07-07 21:54:46 +0000] [1] [INFO] Starting gunicorn 20.1.0
2022-07-07T21:54:46.047609Z [2022-07-07 21:54:46 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
2022-07-07T21:54:46.047657Z [2022-07-07 21:54:46 +0000] [1] [INFO] Using worker: gthread
2022-07-07T21:54:46.079066Z [2022-07-07 21:54:46 +0000] [2] [INFO] Booting worker with pid: 2
2022-07-07T21:57:12.792512Z Downloading file: gs://pdf-input-bucket-98ea/nospacename.pdf
2022-07-07T21:57:12.908398Z Input file downloaded from GCS to 970778b2-3ef6-4a0b-b7d3-b623f3c297fa
2022-07-07T21:57:12.935872Z error: Unable to get page count.
2022-07-07T21:57:12.935893Z Syntax Error: Gen inside xref table too large (bigger than INT_MAX)
2022-07-07T21:57:12.935902Z Syntax Error: Couldn't find trailer dictionary
2022-07-07T21:57:12.935911Z Syntax Error: Gen inside xref table too large (bigger than INT_MAX)
2022-07-07T21:57:12.935919Z Syntax Error: Couldn't find trailer dictionary
2022-07-07T21:57:12.935927Z Syntax Error: Couldn't read xref table
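The poppler errors in that log suggest the uploaded file was corrupted or not a valid PDF. A quick local heuristic check before uploading, under the assumption that a well-formed PDF begins with the "%PDF-" magic bytes (this will not catch every broken xref table, only grossly invalid files):

```shell
# Create a stand-in file for demonstration; replace with your real PDF.
printf '%%PDF-1.7\nfake body\n' > sample.pdf
if head -c 5 sample.pdf | grep -q '%PDF-'; then
  echo "sample.pdf looks like a PDF"
else
  echo "sample.pdf is not a PDF"
fi
```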
