
integrity-preprocessor's People

Contributors

benhylau, makew0rld, walkerlj0, yurkowashere


Forkers

darkdrgn2k

integrity-preprocessor's Issues

Refactor folder

  • Update to use common functions instead of local ones
    • processWacz
    • parse_proofmode_data

Proofmode + Signal Content

Commit 3d6c50a added basic content detection so that not every ProofMode submission is labeled "Authenticated image".

  • Convert to arrays and the in keyword instead of a long list of if/or conditions
  • Deal with the scenario where different content types get sent together (i.e. a mix of audio and video)
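
A minimal sketch of the two bullets above, assuming content type is detected from the file extension. The extension sets, function names, and the "mixed content" label are hypothetical and not taken from the repository.

```python
from pathlib import Path

# Hypothetical extension groups; the real preprocessor may recognize others.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".heic"}
VIDEO_EXTS = {".mp4", ".mov", ".mkv"}
AUDIO_EXTS = {".m4a", ".mp3", ".wav", ".aac"}


def classify(filename: str) -> str:
    """Return a coarse content type using set membership instead of if/or chains."""
    ext = Path(filename).suffix.lower()
    if ext in IMAGE_EXTS:
        return "image"
    if ext in VIDEO_EXTS:
        return "video"
    if ext in AUDIO_EXTS:
        return "audio"
    return "unknown"


def describe_bundle(filenames) -> str:
    """Build a label for a submission, handling a mix of content types."""
    kinds = {classify(name) for name in filenames}
    kinds.discard("unknown")
    if not kinds:
        return "Authenticated content"
    if len(kinds) == 1:
        return f"Authenticated {kinds.pop()}"
    # Hypothetical wording for the mixed case.
    return "Authenticated mixed content (" + ", ".join(sorted(kinds)) + ")"
```

For example, describe_bundle(["clip.mp4", "note.m4a"]) yields a mixed-content label rather than defaulting to "Authenticated image".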

Standardize delivery methods

Currently, the chat-bot delivery method (e.g. Signal) uses a folder sync that can recover from failure, but other ingest methods fail silently.

With proofmode-signal, when signature validation fails, the input files get moved to a failed folder, indicating that they need manual attention. Other delivery methods do not support this.

We need to standardize on delivery methods and add fault tolerance to ensure reliability.
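
One way the "failed folder" behavior could be shared by every delivery method is a small wrapper around the processing step. This is a sketch only: ingest_with_fallback, its parameters, and the folder layout are hypothetical names, not existing code in the repository.

```python
import logging
import shutil
from pathlib import Path


def ingest_with_fallback(src: Path, failed_dir: Path, process) -> bool:
    """Run process(src); on any error, log it and move src into failed_dir.

    Returns True on success and False if the file was set aside for
    manual attention, so no ingest method fails silently.
    """
    try:
        process(src)
        return True
    except Exception:
        # Log the full traceback instead of swallowing the failure.
        logging.exception("Ingest failed for %s", src)
        failed_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(failed_dir / src.name))
        return False
```

Each delivery method (folder, signal, http, and so on) would pass its own process callable, so the failure handling lives in one place.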

Preprocessor fails to validate WACZ files

Error occurs
AttributeError: module 'validate' has no attribute 'Wacz'

https://github.com/starlinglab/integrity-preprocessor/blob/dev/lib/common/__init__.py#L102

Note: When this line is commented out, processing continues.

Oct 3 18:54:59 org2 python3[936029]: 2022-10-03 18:54:59 INFO Recorder Metadata Change Detected
Oct 3 18:54:59 org2 python3[936029]: Traceback (most recent call last):
Oct 3 18:54:59 org2 python3[936029]: File "/root/integrity-preprocessor/browsertrix/main.py", line 434, in
Oct 3 18:54:59 org2 python3[936029]: meta_extra = common.parse_wacz_data_extra(wacz_path)
Oct 3 18:54:59 org2 python3[936029]: File "/root/integrity-preprocessor/browsertrix/../lib/common/__init__.py", line 102, in parse_wacz_data_extra
Oct 3 18:54:59 org2 python3[936029]: if not validate.Wacz(wacz_path).validate():
Oct 3 18:54:59 org2 python3[936029]: AttributeError: module 'validate' has no attribute 'Wacz'
Oct 3 18:54:59 org2 systemd[1]: integrity-preprocessor-browsertrix.service: Main process exited, code=exited, status=1/FAILURE
Oct 3 18:54:59 org2 systemd[1]: integrity-preprocessor-browsertrix.service: Failed with result 'exit-code'

Redact latitude and longitude

Redact lat/lon from manifests. Keep the reverse-geocoded results in the manifest, but omit lat/lon. Not sure when we'll need this, but let's create the task first.
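
A minimal sketch of such a redaction, assuming the manifest is a nested JSON-like structure. The key names in LOCATION_KEYS are hypothetical; the actual manifest fields would need to be checked against the integrity-schema.

```python
# Hypothetical key names for raw coordinates; adjust to the real manifest schema.
LOCATION_KEYS = {"latitude", "longitude", "lat", "lon"}


def redact_coordinates(node):
    """Return a copy of node with coordinate keys removed at any depth.

    Other keys, such as reverse-geocoded place names, are left untouched.
    """
    if isinstance(node, dict):
        return {
            key: redact_coordinates(value)
            for key, value in node.items()
            if key.lower() not in LOCATION_KEYS
        }
    if isinstance(node, list):
        return [redact_coordinates(item) for item in node]
    return node
```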

Document preprocessor

Document the preprocessor architecture.

There are a few concepts in the preprocessor:

  • asset types (e.g. wacz-local, wacz-browsertrix, jpg, proofmode-zip, jpg-sig66)
  • delivery methods (e.g. folder, signal, http, browsertrix docker)
  • root-of-trust signature validators (e.g. numbers signatures, sig66)

We need to document these as a list, and how these can be composed together for a given deployment.

The Starling Integrity Collections (ask for access in chat) and the integrity-schema are good places to understand this.

The output of any preprocessor is ultimately an inputBundle for the integrity-backend (where the format is documented), with a conformant file structure and the metadata files sha256(content)-meta-content.json and sha256(content)-meta-recorder.json.
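
To illustrate the naming convention above, the following sketch derives both metadata file names from the SHA-256 of a content file. The function names are hypothetical, and the rest of the inputBundle layout is defined by the integrity-backend, not by this example.

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20) -> str:
    """Hash a file in chunks so large assets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def metadata_filenames(content_path):
    """Return the (content, recorder) metadata file names for a content file."""
    content_hash = sha256_file(content_path)
    return (
        f"{content_hash}-meta-content.json",
        f"{content_hash}-meta-recorder.json",
    )
```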

Add test coverage

We need test data and unit tests that validate each item described in #51

TODO

  • run pipeline to be able to locally drop items into receiving folders and generate inputBundle
  • check tests folder and gather any missing test data that should be included
  • write missing unit tests
  • ensure all cases are documented in #51

Browsertrix - Cannot Recrawl

The Browsertrix preprocessor checks for duplicates against the crawl template, not the individual crawl. This means re-crawls are skipped instead of being processed.
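
A possible shape for the fix is to key the duplicate check on the crawl's own identifier. The sketch below is an assumption about the data model: the "id" and "template_id" fields and the should_process helper are hypothetical and do not reflect the Browsertrix API.

```python
def should_process(crawl: dict, seen_crawl_ids: set) -> bool:
    """Decide whether a crawl is new, keyed on the crawl itself.

    Keying on crawl["id"] (hypothetical field) rather than on
    crawl["template_id"] lets a re-crawl of the same template through.
    """
    crawl_id = crawl["id"]
    if crawl_id in seen_crawl_ids:
        return False
    seen_crawl_ids.add(crawl_id)
    return True
```

With this check, two crawls that share a template but have different crawl IDs are both processed, while a crawl seen twice is still skipped.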

Repo naming

Rename to integrity-preprocessor or integrate into integrity-backend itself as a subfolder. We'll also have integrity-api eventually once we factor out /create.

Allow for review of WACZ before proceeding into the pipeline

"Poor man's" QA

  • Have browsertrix preprocessor deliver WACZ into a review folder instead of input
  • Add support for "review" flag in collection config

How it works:

  • Preprocessor delivers files into /review instead of /input
  • Human can review WACZ file
  • Move file to /input to continue down the pipeline
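
The routing described above could be sketched as a single decision on the collection config. The "review" key comes from the proposal in this issue; the function name and the dict-based config are assumptions.

```python
from pathlib import Path


def delivery_dir(collection_root, collection_config: dict) -> Path:
    """Pick the folder a finished WACZ is delivered to.

    Collections with "review": true get files in <root>/review, where a
    human inspects them and moves them to <root>/input to continue.
    """
    subfolder = "review" if collection_config.get("review", False) else "input"
    return Path(collection_root) / subfolder
```

Defaulting to input when the flag is absent keeps existing collections unchanged.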

Parse ProofMode JSON instead of CSV

Latest ProofMode release (0.0.16-ALPHA-3) supports JSON metadata output, along with the old CSV format. Switching to parsing the JSON will be simpler, less error-prone, and more future-proof if other keys are added to the metadata.

There is no urgency in this switch as CSV is not going away. Also, this change can only be deployed once we're sure that everyone in the field going forward will use ALPHA-3 or later, and not ALPHA-2.
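
During a transition period, the parser could prefer the JSON file and fall back to CSV for bundles produced by older releases. This is a sketch under assumptions: the file names follow the hash.proof.json / hash.proof.csv pattern seen in ProofMode bundles, the CSV is assumed to have a header row followed by a value row, and load_proof_metadata is a hypothetical helper.

```python
import csv
import json
from pathlib import Path


def load_proof_metadata(proof_dir, content_hash: str) -> dict:
    """Load ProofMode metadata, preferring JSON and falling back to CSV."""
    json_path = Path(proof_dir) / f"{content_hash}.proof.json"
    csv_path = Path(proof_dir) / f"{content_hash}.proof.csv"

    if json_path.exists():
        with open(json_path, encoding="utf-8") as handle:
            return json.load(handle)

    # Assumed CSV layout: one header row, then one row of values.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    return rows[0] if rows else {}
```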

Create a standard understanding of content recorded in metadata-content

The different processors and projects have required different content in common metadata fields. This is specifically about computer-generated text inside some of the metadata fields, not the schema itself.

Specifically

  • name
  • description
  • dateCreated

todo:

  • Record what the defaults are for different scenarios
  • Under what conditions can the metadata be overridden
  • Look at all metadata collected
  • Decide if knobs/switches need to be implemented per collection
  • Consolidate as much as possible under common libraries

Change import order for in-repo modules

Many places in the code do this to allow for local imports:

sys.path.append(os.path.dirname(os.path.realpath(__file__)) + "/../lib")

Instead they should be doing this:

sys.path.insert(0, os.path.dirname(os.path.realpath(__file__)) + "/../lib")

The second method puts the import path at the beginning of sys.path, which prevents global modules with the same name as local ones from being imported instead. This has caused issues in the past.
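
To avoid repeating the one-liner, the insert could be wrapped in a small helper. prepend_lib_path is a hypothetical name; the sketch only relies on the standard behavior that Python searches sys.path in order.

```python
import os
import sys


def prepend_lib_path(script_file: str, relative_lib: str = "../lib") -> str:
    """Put the in-repo lib directory first on sys.path and return it.

    Python searches sys.path in order, so position 0 wins over any
    globally installed module with the same name.
    """
    script_dir = os.path.dirname(os.path.realpath(script_file))
    lib_dir = os.path.realpath(os.path.join(script_dir, relative_lib))
    # Drop an earlier entry (for example one added with append) so the
    # directory is not listed twice.
    if lib_dir in sys.path:
        sys.path.remove(lib_dir)
    sys.path.insert(0, lib_dir)
    return lib_dir
```

A script would call prepend_lib_path(__file__) before importing any in-repo module.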

Folder preprocessor - injected validatedSignatures from another asset

An item that did not have any validatedSignatures received validatedSignatures from another asset. After restarting the preprocessor, the item completed without any validatedSignatures entries, as expected.

Possibly a variable is remembering the last validatedSignatures?

{
    "contentMetadata": {
        "name": "Extracted Page",
        "mime": "image/png",
        "description": "Extracted page from web archives for Bijeljina Investigation",
        "author": {
            "@type": "Organization",
            "identifier": "https://sfi.usc.edu/",
            "name": "USC Shoah Foundation"
        },
        "dateCreated": "2022-12-12T21:26:24Z",
        "timestamp": "2022-12-13T21:48:18.428980Z",
        "sourceId": {
            "key": "data_id",
            "value": "C049-0001"
        },
        "extras": {
            "data_id": "C049-0001",
            "registration_filename": "MRA22591R0000401246_Page_0001.png",
            "relatedAssetCid": "bafybeiapv3jcyh74fwuttudyvg2amfrgnfzjq2fg32w5jghvjytxgvugim"
        },
        "private": {
            <REDACTED>
        },
        "validatedSignatures": [
            {
                "provider": "ProofMode v0.0.17-RC-1 autogenerated=false",
                "algorithm": "proofmode-pgp-rsa",
                "publicKey": "-----BEGIN PGP PUBLIC KEY BLOCK-----\nVersion: BCPG v1.71\n\nmQINBGOBXO4BEAC79RJqDkhlO9M68FLjQ/SYuYHhk6tkKggoWs2N9z8sspkwepsp\nkTQk/Xfu1X+Xs0o2gRJXEziQB37aSqJ2adAE4h7kJOSjUZ13v7TNiEaSD8qX4JQl\nMFstoi//JLxuT7UVBETHoivt3VG64U5Z8SmzFOZVbEK130SaQpK5rOAvrJrcCYQx\nh8t5r+E91hqDmJBHLZoeVKq8TV2jea52x9dHPNoc2SOTjIYS/2tLrlQou+y1NqhZ\nVER+57aKj0DVrHfkdQ5U4sVf9nYfZCOfdDPciGOXIzCaXW4dtxjWI41WqK6JrVQ+\nVkH410IcV44rQugXKL16CRqaAmrVfmd1c+UPer9EGKOaYZi5eNT8To5nKhuYSJDy\ncDZVtO3OCkYxUppKR5j0SoL4Qf1Lr/rfk3hQeRI9DdsekFVXz6HkPIqVJbn2RlU3\naOY403nzCSmGDkoDzGXZHS30gAdRDOjTjtWGuM+5rWYWB3xFWg0J03QdPYJYJ+qf\nWlM93xXkwScgD7RDKxFcWVrSD61TioaEJMa7YlZuYUfAjFiokCjiRSaTc3W0MrJG\nHrBKTu8oHBsVK687JkL5iPIea33qSMyQFj6FHB6xU2/2s0ePN1BUpFWEdv64aNVa\nzSxD+i2gs6Me1NwEZNDpcML76zOzJaAEFtdBXmspBhr6VXsuUDnjlIzUmQARAQAB\ntBtub29uZUBwcm9vZm1vZGUud2l0bmVzcy5vcmeJAi4EEwECABgFAmOBXO8CG4ME\nCwkIBwYVCAIJCgsCHgEACgkQ01jFsT0KxiLUSQ/8CO2vl3QTk5cqi5lLjmIucTYn\n1H9bTybMIxA4mx3OvzN9j36iNbqDCzUKHrKLycc3o3Rb8LVZVc6iNziZVbV2bKgV\n/1Di3IcB4drkIqPlkHf+IL4Qz05oRL1gfDuhKGtsLgGohGKCj2URgZsfotM6185k\ncdZdqmvs+uHGTzOIIyb0UyKBoeLbsN5sMKDFPMetkFF4CGhh+PH4BrA3+4KtywJ8\nN373A4j5tXrGr3rrAjqppZ4jTEbUDKI1/76EmxpWUcP+DUv9sJB1mlqS7npzOBvN\n8v6iSBQ64R0b4/pr6ipQVD78LD+LxZToQO7v8VOXa7ervOMT1EHheDSIeYC6MP3R\nSPgT2grgy04gjZjD3HYyJ+sUwW4slnHNm7XLwKz3GYEYDXSPwPWGrSh3WUstQ6DL\nXPPJvoi+S++yPV0hT/OilJPz/cJBWZ7wAazor0kJSb2vtU/8S+YfJHK4KCgTkyFx\nzeyZptmSJdWyV+/Jey42WtV1u96HYaIwg9TeE3MqFLd3GoM8OAucGUgZMK7k3Qve\nRV3Swz2mAULGUot1CFmAhN2Yk1e5kHyiLwdRege0NtM/Eu2pMSmo72pKWKb4O0Xa\nF5RGfiol1GIX+/O5tteUWSV75Lm4NQObFXAwvwgEvt5/BFZQ/fAAzqC9IQ9z4/sc\nOR4f9Gc/GYBExlCvllm5Ag0EY4Fc7wEQALaTij8Vw2PJC6a0XKGYxYmXRwcs2YBJ\nF3WTE24ArvdAjNnoWGx7zhPWV+10eiJZQRnq7yXZD+1wBA3Wnb38vnlmyuX5l1OA\n3I7hQC5v0qgsy2CcMI6MwlgypipF8JLAI9/yuTGWMuNx4GUGD2B4kkhV6SaIhd1q\nwJnZDsLCqnjsIwQezF2p924z5iOudzihmaZfFwzC5Os99AgWP3e0skR4EO77uVpt\nWe332cQmY54dgGCwzLX1ZkoFrsNNgBNvTsvsD+twrOUewK8Lja5oSd0QAJFAUKr0\nuaBxzVnWaSAwtK9Wgtdv9H3mEDfN+y2vDExdRx5CouZilPMahmq5N5WBuXRLa
Uez\n5atzRknGzhW/sTf5WzqMsQcBEtcq8GJwb8CF77IG3mkLaWwKx5Afx+wLfRsii8Ut\nCCsCiix/wC6rahFoK9pzyihxA+5tZNchUI/0c+caus7AzdVR2A0Ju7U4PqY+ELrh\nAys+1Bk74vSeaJMNe/JBlKjSVGEUEZxiNY2xqprbJbjI3kqt1fxIptJKQJPxvFJV\nlANitXf6WoCS8NzNuNZavY/b4MWYEAdvMRpu1ZjK/EWa2J1UeRpeyNeai8TtO5fc\nNr7ebCBsxm6HZVGAnVaP8M5/bh8eQjJqA7HbcuMHvnPvV9np+pQNl+ZmB/sqwWAj\nmRGb3qXdFCQNABEBAAGJAh8EGAECAAkFAmOBXPACGwwACgkQ01jFsT0KxiLUbg/+\nKplkMOoQEwUb+cLs7dU8SrvQHadshPlAJW4/4cgo3B8dkmD31i5NmurUTW0FW7yz\nonBrzYPiMze7xrUuAYpeBaCTmjYOvmzlPxb5bCLSzOaa4I7UpOn6YmmYHAvB0BXV\nXqZDhdnfCNe9zg9apzU9HgzuIE0BrfPLd6t53hflifBQQIeBOWilxRo27x/ZxGmm\noR7Cddz3aPs37gCY3iCQDNY0NB3wtM8kpMXz0OJR9NDyPcvxSk36cJEY/TMIHc1e\nzrVjH7t1shAxi7yx+swEoXjrKiTk9EJ0d+R2AoBuHf/woTPqWO2ltwGHxVs088Ew\nehHteNg9QDrpwcZ6Mdra3PdGoniuqofzkCXy+bSC4IGdzYV9Rsco1WHsUqB8Kmot\n2HC2nvkNCh12tC6IBsy+Z3PLS6tejnOEeH/bhfHzm8R4sF1UyHWMZ9fFPf4h+3yH\n3jjw/mcyvkHZZ01RmANiZWKEoNDduthUwRTMWN21D6SNLpLPVA2Qod5N0p0or8OS\nI84fzRwDsCO3+qUE5yP99ROO108NLnLFYzfR+4dRmVfmGLU6gJcRnJF0brtp0K9b\n7hmf9A89/8138F8ufKoC2iiG0RG1Nrh9y+IJT2BkZVKpnPZ90T/8kViulHLf/6X9\nc6cSi5Gz/bUhvO8PZ/M95sLHd2RR3KP7rw/rhjvTN3M=\n=vbrb\n-----END PGP PUBLIC KEY BLOCK-----\n",
                "custom": {
                    "c931ee18b3ee2d9fc2d215f03d3d568ed0aa6a7c5afdf78bf5c2ff187a918421.proof.csv": {
                        "signature": "-----BEGIN PGP SIGNATURE-----\nVersion: BCPG v1.71\n\niQIcBAABCAAGBQJjgrhxAAoJENNYxbE9CsYibbYQALKD7GvDOicyWviI35K/dxVn\nMOAHtAokjH9iXC0gIoLwFlOqoJ4MtRgazVZynTT06DdEkgB/hNAY8G7mOAuQOuXY\nc0l/fJ4pzwygWZMihLhgjIA9fc+tGjjg+jqd6hGoc3dchxEpi7z4Au4cn71PFdMk\nbVWIWGN9XYE5QNdFZ+Vxb+WDXYFE/fg62SgEYPGXXMLrQIgKhxMkwb7FpmUKq4MS\nQuPhYg7txSJnjSaMs1NcBr4Do6LBNzYHYWznvoskzz2saohF3hciIAXSCL8MGexV\nNFWuDasPBOvMFtMPQwPClSj9nRa4wNNQNEmtHMmT9fSb3PhWA4fDZerbIy/IrLHH\n6SocnhTwo8Zccdg7RbNEjyeWqrSkAP2FCMja8V3D1BfJ8y1DdwbvRQULHVokz6/3\n4Pr1yL3InsJurr05dGWQ/b8OkispH/751ebUcFurjKUsYNL9oK0gAxI9VZ2TLoDu\nJTBUhntHB3QkaMgcY0krx2tEJXilVA2n3s4M/rFxz7WhFwfTkT1mO/JxWE/aN0Ed\n+cP7ffOypuZuNXLE/AtDAqOawNKhUmMfVI4R9QR86z+67Y9DuCnxJo4ZQ6n6dweY\nNiN2B9L89Vg8Rzc4em9pQuCpPmQHb6CrF9eU42eDkKlLEpjcLBdLCDrJ19bB9cxi\nJHOkaRxYyXIT4VTGFs2t\n=KgHo\n-----END PGP SIGNATURE-----\n",
                        "authenticatedMessage": "eb526a0e5ec96ddcd23bb143e7ba4560f277a19fbe0d894b7ddd07f779a7bc4a",
                        "authenticatedMessageDescription": "SHA256 hash of the signed file"
                    },
                    "IMG_20221127_002811.jpg": {
                        "signature": "-----BEGIN PGP SIGNATURE-----\nVersion: BCPG v1.71\n\niQIcBAABCAAGBQJjgrhxAAoJENNYxbE9CsYi0RoP/2XtHQseXwXCcTgMo+ay4TqB\nsXdulF7YbUw74SJaVbzPtn/qsyRqCd32DupJZ+HUW2IEbLEWSEPGRxZl0sWVlO48\nbrtWjCKqk5A0hhDhGM8p/H6ktng4Dzl+u3ywyaNinbwPlcQ36So7lKcdjmOCdBfU\nGDJaLaKoM4TlCFufzZUVfATd4BTT5HKf/Z+RJLc9fmJZsdx4ULsR0bqBVvJMxdFl\nW6xjJhDRAlQC2DeJA1xkVz6Vjwp33qtAs5+YGIlwk+g88rrLvsqyGW1kO/dz6FKk\nyoFgeg61m9efR43V3x8dzJJb1gVp5fApaUvaesGQmieHu3Cv0lg3/zma+UnJM1SQ\nHo5wss9NEEw3Z+NSX2UNbTc+kg0+ZADmbqtrdsYfuCXQBqRnP6HiyYs5Mw1Aaz7K\nkkIVadGmOq9N+zlMQsi6UKnkCISD9TuST9p+vOXNQDT1Mg14IpRMdad7NhGDrwsT\nGRpauaA3fveR0YGKpQsYDDCdv66eixDu+4m4TYbcOcEfyErxJItMy0rH9Ozf3fWy\nqfdffWdQh3ehSPBUULLa8eF9jwuNxbDmXon3uku1sMTGfo7hvTqOAR/HoKaN8TSi\nMB96vq5qo+8vwdVa07m+iriHhvH1QkGC4Ia717UPc38M5rt7eidmMwkOkAS9TLw/\ndLxSZWn+XmLAu8oX07S7\n=Zp5+\n-----END PGP SIGNATURE-----\n",
                        "authenticatedMessage": "c931ee18b3ee2d9fc2d215f03d3d568ed0aa6a7c5afdf78bf5c2ff187a918421",
                        "authenticatedMessageDescription": "SHA256 hash of the signed file"
                    },
                    "c931ee18b3ee2d9fc2d215f03d3d568ed0aa6a7c5afdf78bf5c2ff187a918421.proof.json": {
                        "signature": "-----BEGIN PGP SIGNATURE-----\nVersion: BCPG v1.71\n\niQIcBAABCAAGBQJjgrhxAAoJENNYxbE9CsYi/2cP+QHqU7qTc+FxSwgg7NdAvGaU\nIMuQjq61KhWNf/CQRtwjXNcSP43ehUd4RrJbZc3p5sQrIf0BwIgOpAxffGE3+V91\nwZB/BzvDMceGwNw4vYDOe2aDdKiRcvwGxS+xa8JOlKtKwg9WeK3g/N2bVaDhP2QC\nhYisVy16d6a/0QYO8dsekf/uFI6EPhciHV45G+vholhWV3mjS531Ez2cIR62FCou\ntQNC1OzNwGgu3WZNJmQljqat8ZDIJG40tT1jcOsKo1LJLgD4nVN+YzzvdcgfRl3V\nCFfUFnayJB1O6E2kjAhz/mePvVtqDRiknsSsAqg2VbblD+J39QMZ4YI03QKr8FsW\ngtUAvxpGVHzyRG/MrNkgM3MBslNPAkS9XziEhlp1bzM4cc+Y9389v43aKaMnTAj4\n/K03aiu4OSYLdlQDtKOIz9QkAF1pF4T4/G73dTFWdZYZ15QPOOOElp0rxmAGISPR\nz/nFgCZRF/YlhUSIJ6dLemJDlWjxl4kZcyuVWh/KPSYrlFojNOQsjbgXVkVlR/Jo\nNioKrgTVh3xpvhr6wOzPKM88hAKx6C+v2VEGwS5iDGP6oqX0NAoHIs6/U5GFe6/f\nY7+8X3W/sZCyr1Chmj9ngT0SuImn8Xj9IY6BxzcQV/4JiDGRf7x1sRABLcCTHmIL\n/OWfKW/jwHqNYlbtlaPb\n=Mx2A\n-----END PGP SIGNATURE-----\n",
                        "authenticatedMessage": "93f1ab9cfefe8e82d4338348800634f9d8964315691d71ce5f2c1e1f62c2d25a",
                        "authenticatedMessageDescription": "SHA256 hash of the signed file"
                    }
                }
            }
        ],
        "relatedAssetCid": "bafybeiapv3jcyh74fwuttudyvg2amfrgnfzjq2fg32w5jghvjytxgvugim"
    }
}
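
The cause is not established in this issue. As an illustration of how a long-running process can "remember" signatures between assets, the sketch below shows one common Python pitfall, a mutable default argument, next to a version that builds a fresh list per call. Both functions are hypothetical and are not taken from the preprocessor code.

```python
def build_metadata_buggy(name, new_signatures, validated=[]):
    """Buggy: the default list is created once and shared by every call."""
    validated.extend(new_signatures)
    return {"name": name, "validatedSignatures": validated}


def build_metadata_fixed(name, new_signatures):
    """Fixed: each asset gets its own list, so nothing carries over."""
    return {"name": name, "validatedSignatures": list(new_signatures)}


# First asset has a signature, second asset has none.
build_metadata_buggy("asset-1", ["sig-from-asset-1"])
leaked = build_metadata_buggy("asset-2", [])

build_metadata_fixed("asset-1", ["sig-from-asset-1"])
clean = build_metadata_fixed("asset-2", [])
```

In the buggy variant the second asset ends up carrying the first asset's signature until the process restarts, which matches the symptom described here; a module-level variable that is never reset would behave the same way.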

Confirm all datetime are timezone adjusted

Go through the code to make sure all dates/times stored in metadata are adjusted for the correct timezone.

  • Timestamps should be converted to UTC ("Z" time)
  • Omit the "Z" if the timezone cannot be determined
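
A minimal sketch of the two rules using the standard datetime module. to_metadata_timestamp is a hypothetical helper; it treats a naive datetime as "timezone cannot be determined".

```python
from datetime import datetime, timezone


def to_metadata_timestamp(dt: datetime) -> str:
    """Format a datetime for metadata.

    Aware datetimes are converted to UTC and suffixed with "Z".
    Naive datetimes are emitted without "Z", since their timezone
    is unknown and must not be presented as UTC.
    """
    if dt.tzinfo is None or dt.utcoffset() is None:
        return dt.isoformat()
    return dt.astimezone(timezone.utc).isoformat().replace("+00:00", "Z")
```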

Identity authorization file

Copied from https://github.com/starlinglab/organizing-private/issues/111

Identity authorization

This allows us to authorize ingest only to known agents.

trusted_devices.txt:

<sig_algo> <sig_identity> <description>

Signers

This can be used to identify the signer and add a line to metadata.

sig66 pubkey "Camera provisioned to ..."
proofmode gpg_key "Phone provisioned to ..."
starling-capture-zion-classic eth_pubkey "Phone provisioned to ..."

Transport

This is already used to identify the collection, author, etc.

dropbox email_address "Person at USC"
signal phone_number "Phone provisioned to ..."
http jwt "Person at partner"

This would probably be better expressed as JSON.
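
As a sketch of what the JSON form might look like, the following converts lines in the three-column format above into a list of objects. The field names mirror the sig_algo, sig_identity, and description columns; the parser itself, its name, and the "#" comment convention are assumptions.

```python
import json
import shlex


def parse_trusted_devices(text: str) -> list:
    """Parse '<sig_algo> <sig_identity> "<description>"' lines into dicts."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        # Skipping blank lines and "#" comments is an assumed convention.
        if not line or line.startswith("#"):
            continue
        # shlex keeps the quoted description together as one field.
        sig_algo, sig_identity, description = shlex.split(line)
        entries.append(
            {
                "sig_algo": sig_algo,
                "sig_identity": sig_identity,
                "description": description,
            }
        )
    return entries


example = 'sig66 pubkey "Camera provisioned to ..."\nhttp jwt "Person at partner"\n'
as_json = json.dumps(parse_trusted_devices(example), indent=2)
```

A JSON file in this shape could later grow extra fields per entry (for example a collection or author) without changing the line format.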
