starlinglab / integrity-preprocessor
Preprocessing server for ingesting and validating asset root-of-trust signatures in the Starling Integrity framework.
License: MIT License
common functions instead of local ones
Commit 3d6c50a added basic detection for content so all proofmode submits are not "Authenticated image"
"in" keyword instead of long list of "if", "and", "or"
Currently, the chat-bot delivery method (e.g. Signal) uses a folder sync that can recover from failure, but other ingest methods fail silently.
With proofmode-signal, when we get signature validation failures, input files get moved to a failed folder, indicating that they need manual attention. Other delivery methods do not support this.
We need to standardize on delivery methods and add fault tolerance to ensure reliability.
Error occurs
AttributeError: module 'validate' has no attribute 'Wacz'
https://github.com/starlinglab/integrity-preprocessor/blob/dev/lib/common/__init__.py#L102
Note: When that line is commented out, the process continues.
Oct 3 18:54:59 org2 python3[936029]: 2022-10-03 18:54:59 INFO Recorder Metadata Change Detected
Oct 3 18:54:59 org2 python3[936029]: Traceback (most recent call last):
Oct 3 18:54:59 org2 python3[936029]: File "/root/integrity-preprocessor/browsertrix/main.py", line 434, in <module>
Oct 3 18:54:59 org2 python3[936029]: meta_extra = common.parse_wacz_data_extra(wacz_path)
Oct 3 18:54:59 org2 python3[936029]: File "/root/integrity-preprocessor/browsertrix/../lib/common/__init__.py", line 102, in parse_wacz_data_extra
Oct 3 18:54:59 org2 python3[936029]: if not validate.Wacz(wacz_path).validate():
Oct 3 18:54:59 org2 python3[936029]: AttributeError: module 'validate' has no attribute 'Wacz'
Oct 3 18:54:59 org2 systemd[1]: integrity-preprocessor-browsertrix.service: Main process exited, code=exited, status=1/FAILURE
Oct 3 18:54:59 org2 systemd[1]: integrity-preprocessor-browsertrix.service: Failed with result 'exit-code'
Redact lat/lon from manifests. Keep the reverse-geocoded results in the manifest, but skip lat/lon. Not sure when we'll need it, but let's make the task first.
Document the preprocessor architecture.
There are a few concepts in the preprocessor:
We need to document these as a list, and how these can be composed together for a given deployment.
The Starling Integrity Collections (ask for access in chat) and the integrity-schema are good places to understand this.
The output of any preprocessor is ultimately an inputBundle for, and as documented in, the integrity-backend with conformant file structure and metadata files sha256(content)-meta-content.json and sha256(content)-meta-recorder.json.
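As a rough illustration of the naming convention above, the following sketch derives the two metadata file names from a content file. It assumes that sha256(content) means the lowercase hex SHA-256 digest of the file's raw bytes; the function name meta_filenames is hypothetical and not part of the preprocessor.

```python
import hashlib


def meta_filenames(content_path):
    """Return the two metadata file names derived from the content's SHA-256.

    Assumption: "sha256(content)" is the lowercase hex digest of the raw bytes.
    """
    digest = hashlib.sha256()
    with open(content_path, "rb") as f:
        # Hash in chunks so large assets do not need to fit in memory.
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    sha = digest.hexdigest()
    return f"{sha}-meta-content.json", f"{sha}-meta-recorder.json"
```

The integrity-backend documentation remains the authority on the full inputBundle layout; this only covers the two file names mentioned here.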
We need test data and unit tests that validate each item described in #51
inputBundle
The Browsertrix preprocessor checks for duplicates against the crawl template, not the crawl. This means re-crawls are skipped instead of being processed.
Rename to integrity-preprocessor or integrate into integrity-backend itself as a subfolder. We'll also have integrity-api eventually once we factor out /create.
Space missing in description:
"description": "Authenticated web archive of [ http://example.org/] captured on 2023-05-24",
"poor mans" QA
review folder instead of input
review support
How it works:
Latest ProofMode release (0.0.16-ALPHA-3) supports JSON metadata output, along with the old CSV format. Switching to parsing the JSON will be simpler, less error-prone, and more future-proof if other keys are added to the metadata.
There is no urgency in this switch, as CSV is not going away. Also, this change can only be deployed once we're sure that everyone in the field going forward will use ALPHA-3 or later, and not ALPHA-2.
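One way to make the transition gradual is to prefer the JSON file when it exists and fall back to the CSV otherwise. The sketch below is only an illustration: the function name load_proof_metadata is hypothetical, the file names follow the <sha256>.proof.json and <sha256>.proof.csv pattern seen in ProofMode bundles, and the CSV layout (one header row followed by one row of values) is an assumption that should be checked against real ProofMode output.

```python
import csv
import json
from pathlib import Path


def load_proof_metadata(proof_dir, sha256_hex):
    """Load ProofMode metadata, preferring JSON and falling back to CSV."""
    base = Path(proof_dir)
    json_path = base / f"{sha256_hex}.proof.json"
    csv_path = base / f"{sha256_hex}.proof.csv"

    if json_path.exists():
        with open(json_path, encoding="utf-8") as f:
            return json.load(f)

    # Assumed CSV layout: one header row followed by one row of values.
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        raise ValueError(f"no metadata rows in {csv_path}")
    return rows[0]
```

With a fallback like this, bundles from ALPHA-2 devices would keep working while newer bundles use the JSON path.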
The different processors and different projects have required different content in common metadata fields. This is specifically about computer-generated text inside some of the metadata fields, not the schema itself.
Specifically
todo:
Many places in the code do this to allow for local imports:
sys.path.append(os.path.dirname(os.path.realpath(__file__)) + "/../lib")
Instead they should be doing this:
sys.path.insert(0, os.path.dirname(os.path.realpath(__file__)) + "/../lib")
The second method puts the import path at the beginning of sys.path, which prevents global modules with the same name as local ones from being imported instead. This has caused issues in the past.
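The difference can be reproduced in isolation. The sketch below creates two throwaway directories that each contain a module with the same name, then checks which copy Python imports under each approach. The module name demo_validate and the directory names are made up for the demonstration; the "site" directory merely stands in for a globally installed package location.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def which_copy_wins(use_insert):
    """Return which copy of a duplicated module name gets imported."""
    with tempfile.TemporaryDirectory() as tmp:
        global_dir = Path(tmp) / "site"  # stands in for site-packages
        local_dir = Path(tmp) / "lib"    # stands in for the repo's ../lib
        for directory, origin in ((global_dir, "global"), (local_dir, "local")):
            directory.mkdir()
            (directory / "demo_validate.py").write_text(f"ORIGIN = {origin!r}\n")

        saved_path = list(sys.path)
        try:
            # The "global" copy is already on the path before our code runs.
            sys.path.insert(0, str(global_dir))
            if use_insert:
                sys.path.insert(0, str(local_dir))  # local copy searched first
            else:
                sys.path.append(str(local_dir))     # local copy searched last
            sys.modules.pop("demo_validate", None)
            importlib.invalidate_caches()
            return importlib.import_module("demo_validate").ORIGIN
        finally:
            # Leave the interpreter state as we found it.
            sys.path[:] = saved_path
            sys.modules.pop("demo_validate", None)


print(which_copy_wins(use_insert=False))  # global
print(which_copy_wins(use_insert=True))   # local
```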
An item that did not have any validatedSignatures received validatedSignatures from another asset. After restarting the preprocessor, the item completed without any validatedSignatures entries, as expected.
Possibly a variable remembering the last validatedSignatures?
{
"contentMetadata": {
"name": "Extracted Page",
"mime": "image/png",
"description": "Extracted page from web archives for Bijeljina Investigation",
"author": {
"@type": "Organization",
"identifier": "https://sfi.usc.edu/",
"name": "USC Shoah Foundation"
},
"dateCreated": "2022-12-12T21:26:24Z",
"timestamp": "2022-12-13T21:48:18.428980Z",
"sourceId": {
"key": "data_id",
"value": "C049-0001"
},
"extras": {
"data_id": "C049-0001",
"registration_filename": "MRA22591R0000401246_Page_0001.png",
"relatedAssetCid": "bafybeiapv3jcyh74fwuttudyvg2amfrgnfzjq2fg32w5jghvjytxgvugim"
},
"private": {
<REDACTED>
},
"validatedSignatures": [
{
"provider": "ProofMode v0.0.17-RC-1 autogenerated=false",
"algorithm": "proofmode-pgp-rsa",
"publicKey": "-----BEGIN PGP PUBLIC KEY BLOCK-----\nVersion: BCPG v1.71\n\nmQINBGOBXO4BEAC79RJqDkhlO9M68FLjQ/SYuYHhk6tkKggoWs2N9z8sspkwepsp\nkTQk/Xfu1X+Xs0o2gRJXEziQB37aSqJ2adAE4h7kJOSjUZ13v7TNiEaSD8qX4JQl\nMFstoi//JLxuT7UVBETHoivt3VG64U5Z8SmzFOZVbEK130SaQpK5rOAvrJrcCYQx\nh8t5r+E91hqDmJBHLZoeVKq8TV2jea52x9dHPNoc2SOTjIYS/2tLrlQou+y1NqhZ\nVER+57aKj0DVrHfkdQ5U4sVf9nYfZCOfdDPciGOXIzCaXW4dtxjWI41WqK6JrVQ+\nVkH410IcV44rQugXKL16CRqaAmrVfmd1c+UPer9EGKOaYZi5eNT8To5nKhuYSJDy\ncDZVtO3OCkYxUppKR5j0SoL4Qf1Lr/rfk3hQeRI9DdsekFVXz6HkPIqVJbn2RlU3\naOY403nzCSmGDkoDzGXZHS30gAdRDOjTjtWGuM+5rWYWB3xFWg0J03QdPYJYJ+qf\nWlM93xXkwScgD7RDKxFcWVrSD61TioaEJMa7YlZuYUfAjFiokCjiRSaTc3W0MrJG\nHrBKTu8oHBsVK687JkL5iPIea33qSMyQFj6FHB6xU2/2s0ePN1BUpFWEdv64aNVa\nzSxD+i2gs6Me1NwEZNDpcML76zOzJaAEFtdBXmspBhr6VXsuUDnjlIzUmQARAQAB\ntBtub29uZUBwcm9vZm1vZGUud2l0bmVzcy5vcmeJAi4EEwECABgFAmOBXO8CG4ME\nCwkIBwYVCAIJCgsCHgEACgkQ01jFsT0KxiLUSQ/8CO2vl3QTk5cqi5lLjmIucTYn\n1H9bTybMIxA4mx3OvzN9j36iNbqDCzUKHrKLycc3o3Rb8LVZVc6iNziZVbV2bKgV\n/1Di3IcB4drkIqPlkHf+IL4Qz05oRL1gfDuhKGtsLgGohGKCj2URgZsfotM6185k\ncdZdqmvs+uHGTzOIIyb0UyKBoeLbsN5sMKDFPMetkFF4CGhh+PH4BrA3+4KtywJ8\nN373A4j5tXrGr3rrAjqppZ4jTEbUDKI1/76EmxpWUcP+DUv9sJB1mlqS7npzOBvN\n8v6iSBQ64R0b4/pr6ipQVD78LD+LxZToQO7v8VOXa7ervOMT1EHheDSIeYC6MP3R\nSPgT2grgy04gjZjD3HYyJ+sUwW4slnHNm7XLwKz3GYEYDXSPwPWGrSh3WUstQ6DL\nXPPJvoi+S++yPV0hT/OilJPz/cJBWZ7wAazor0kJSb2vtU/8S+YfJHK4KCgTkyFx\nzeyZptmSJdWyV+/Jey42WtV1u96HYaIwg9TeE3MqFLd3GoM8OAucGUgZMK7k3Qve\nRV3Swz2mAULGUot1CFmAhN2Yk1e5kHyiLwdRege0NtM/Eu2pMSmo72pKWKb4O0Xa\nF5RGfiol1GIX+/O5tteUWSV75Lm4NQObFXAwvwgEvt5/BFZQ/fAAzqC9IQ9z4/sc\nOR4f9Gc/GYBExlCvllm5Ag0EY4Fc7wEQALaTij8Vw2PJC6a0XKGYxYmXRwcs2YBJ\nF3WTE24ArvdAjNnoWGx7zhPWV+10eiJZQRnq7yXZD+1wBA3Wnb38vnlmyuX5l1OA\n3I7hQC5v0qgsy2CcMI6MwlgypipF8JLAI9/yuTGWMuNx4GUGD2B4kkhV6SaIhd1q\nwJnZDsLCqnjsIwQezF2p924z5iOudzihmaZfFwzC5Os99AgWP3e0skR4EO77uVpt\nWe332cQmY54dgGCwzLX1ZkoFrsNNgBNvTsvsD+twrOUewK8Lja5oSd0QAJFAUKr0\nuaBxzVnWaSAwtK9Wgtdv9H3mEDfN+y2vDExdRx5CouZilPMahmq5N5WBuXRLaUez\n5atzRknGzhW
/sTf5WzqMsQcBEtcq8GJwb8CF77IG3mkLaWwKx5Afx+wLfRsii8Ut\nCCsCiix/wC6rahFoK9pzyihxA+5tZNchUI/0c+caus7AzdVR2A0Ju7U4PqY+ELrh\nAys+1Bk74vSeaJMNe/JBlKjSVGEUEZxiNY2xqprbJbjI3kqt1fxIptJKQJPxvFJV\nlANitXf6WoCS8NzNuNZavY/b4MWYEAdvMRpu1ZjK/EWa2J1UeRpeyNeai8TtO5fc\nNr7ebCBsxm6HZVGAnVaP8M5/bh8eQjJqA7HbcuMHvnPvV9np+pQNl+ZmB/sqwWAj\nmRGb3qXdFCQNABEBAAGJAh8EGAECAAkFAmOBXPACGwwACgkQ01jFsT0KxiLUbg/+\nKplkMOoQEwUb+cLs7dU8SrvQHadshPlAJW4/4cgo3B8dkmD31i5NmurUTW0FW7yz\nonBrzYPiMze7xrUuAYpeBaCTmjYOvmzlPxb5bCLSzOaa4I7UpOn6YmmYHAvB0BXV\nXqZDhdnfCNe9zg9apzU9HgzuIE0BrfPLd6t53hflifBQQIeBOWilxRo27x/ZxGmm\noR7Cddz3aPs37gCY3iCQDNY0NB3wtM8kpMXz0OJR9NDyPcvxSk36cJEY/TMIHc1e\nzrVjH7t1shAxi7yx+swEoXjrKiTk9EJ0d+R2AoBuHf/woTPqWO2ltwGHxVs088Ew\nehHteNg9QDrpwcZ6Mdra3PdGoniuqofzkCXy+bSC4IGdzYV9Rsco1WHsUqB8Kmot\n2HC2nvkNCh12tC6IBsy+Z3PLS6tejnOEeH/bhfHzm8R4sF1UyHWMZ9fFPf4h+3yH\n3jjw/mcyvkHZZ01RmANiZWKEoNDduthUwRTMWN21D6SNLpLPVA2Qod5N0p0or8OS\nI84fzRwDsCO3+qUE5yP99ROO108NLnLFYzfR+4dRmVfmGLU6gJcRnJF0brtp0K9b\n7hmf9A89/8138F8ufKoC2iiG0RG1Nrh9y+IJT2BkZVKpnPZ90T/8kViulHLf/6X9\nc6cSi5Gz/bUhvO8PZ/M95sLHd2RR3KP7rw/rhjvTN3M=\n=vbrb\n-----END PGP PUBLIC KEY BLOCK-----\n",
"custom": {
"c931ee18b3ee2d9fc2d215f03d3d568ed0aa6a7c5afdf78bf5c2ff187a918421.proof.csv": {
"signature": "-----BEGIN PGP SIGNATURE-----\nVersion: BCPG v1.71\n\niQIcBAABCAAGBQJjgrhxAAoJENNYxbE9CsYibbYQALKD7GvDOicyWviI35K/dxVn\nMOAHtAokjH9iXC0gIoLwFlOqoJ4MtRgazVZynTT06DdEkgB/hNAY8G7mOAuQOuXY\nc0l/fJ4pzwygWZMihLhgjIA9fc+tGjjg+jqd6hGoc3dchxEpi7z4Au4cn71PFdMk\nbVWIWGN9XYE5QNdFZ+Vxb+WDXYFE/fg62SgEYPGXXMLrQIgKhxMkwb7FpmUKq4MS\nQuPhYg7txSJnjSaMs1NcBr4Do6LBNzYHYWznvoskzz2saohF3hciIAXSCL8MGexV\nNFWuDasPBOvMFtMPQwPClSj9nRa4wNNQNEmtHMmT9fSb3PhWA4fDZerbIy/IrLHH\n6SocnhTwo8Zccdg7RbNEjyeWqrSkAP2FCMja8V3D1BfJ8y1DdwbvRQULHVokz6/3\n4Pr1yL3InsJurr05dGWQ/b8OkispH/751ebUcFurjKUsYNL9oK0gAxI9VZ2TLoDu\nJTBUhntHB3QkaMgcY0krx2tEJXilVA2n3s4M/rFxz7WhFwfTkT1mO/JxWE/aN0Ed\n+cP7ffOypuZuNXLE/AtDAqOawNKhUmMfVI4R9QR86z+67Y9DuCnxJo4ZQ6n6dweY\nNiN2B9L89Vg8Rzc4em9pQuCpPmQHb6CrF9eU42eDkKlLEpjcLBdLCDrJ19bB9cxi\nJHOkaRxYyXIT4VTGFs2t\n=KgHo\n-----END PGP SIGNATURE-----\n",
"authenticatedMessage": "eb526a0e5ec96ddcd23bb143e7ba4560f277a19fbe0d894b7ddd07f779a7bc4a",
"authenticatedMessageDescription": "SHA256 hash of the signed file"
},
"IMG_20221127_002811.jpg": {
"signature": "-----BEGIN PGP SIGNATURE-----\nVersion: BCPG v1.71\n\niQIcBAABCAAGBQJjgrhxAAoJENNYxbE9CsYi0RoP/2XtHQseXwXCcTgMo+ay4TqB\nsXdulF7YbUw74SJaVbzPtn/qsyRqCd32DupJZ+HUW2IEbLEWSEPGRxZl0sWVlO48\nbrtWjCKqk5A0hhDhGM8p/H6ktng4Dzl+u3ywyaNinbwPlcQ36So7lKcdjmOCdBfU\nGDJaLaKoM4TlCFufzZUVfATd4BTT5HKf/Z+RJLc9fmJZsdx4ULsR0bqBVvJMxdFl\nW6xjJhDRAlQC2DeJA1xkVz6Vjwp33qtAs5+YGIlwk+g88rrLvsqyGW1kO/dz6FKk\nyoFgeg61m9efR43V3x8dzJJb1gVp5fApaUvaesGQmieHu3Cv0lg3/zma+UnJM1SQ\nHo5wss9NEEw3Z+NSX2UNbTc+kg0+ZADmbqtrdsYfuCXQBqRnP6HiyYs5Mw1Aaz7K\nkkIVadGmOq9N+zlMQsi6UKnkCISD9TuST9p+vOXNQDT1Mg14IpRMdad7NhGDrwsT\nGRpauaA3fveR0YGKpQsYDDCdv66eixDu+4m4TYbcOcEfyErxJItMy0rH9Ozf3fWy\nqfdffWdQh3ehSPBUULLa8eF9jwuNxbDmXon3uku1sMTGfo7hvTqOAR/HoKaN8TSi\nMB96vq5qo+8vwdVa07m+iriHhvH1QkGC4Ia717UPc38M5rt7eidmMwkOkAS9TLw/\ndLxSZWn+XmLAu8oX07S7\n=Zp5+\n-----END PGP SIGNATURE-----\n",
"authenticatedMessage": "c931ee18b3ee2d9fc2d215f03d3d568ed0aa6a7c5afdf78bf5c2ff187a918421",
"authenticatedMessageDescription": "SHA256 hash of the signed file"
},
"c931ee18b3ee2d9fc2d215f03d3d568ed0aa6a7c5afdf78bf5c2ff187a918421.proof.json": {
"signature": "-----BEGIN PGP SIGNATURE-----\nVersion: BCPG v1.71\n\niQIcBAABCAAGBQJjgrhxAAoJENNYxbE9CsYi/2cP+QHqU7qTc+FxSwgg7NdAvGaU\nIMuQjq61KhWNf/CQRtwjXNcSP43ehUd4RrJbZc3p5sQrIf0BwIgOpAxffGE3+V91\nwZB/BzvDMceGwNw4vYDOe2aDdKiRcvwGxS+xa8JOlKtKwg9WeK3g/N2bVaDhP2QC\nhYisVy16d6a/0QYO8dsekf/uFI6EPhciHV45G+vholhWV3mjS531Ez2cIR62FCou\ntQNC1OzNwGgu3WZNJmQljqat8ZDIJG40tT1jcOsKo1LJLgD4nVN+YzzvdcgfRl3V\nCFfUFnayJB1O6E2kjAhz/mePvVtqDRiknsSsAqg2VbblD+J39QMZ4YI03QKr8FsW\ngtUAvxpGVHzyRG/MrNkgM3MBslNPAkS9XziEhlp1bzM4cc+Y9389v43aKaMnTAj4\n/K03aiu4OSYLdlQDtKOIz9QkAF1pF4T4/G73dTFWdZYZ15QPOOOElp0rxmAGISPR\nz/nFgCZRF/YlhUSIJ6dLemJDlWjxl4kZcyuVWh/KPSYrlFojNOQsjbgXVkVlR/Jo\nNioKrgTVh3xpvhr6wOzPKM88hAKx6C+v2VEGwS5iDGP6oqX0NAoHIs6/U5GFe6/f\nY7+8X3W/sZCyr1Chmj9ngT0SuImn8Xj9IY6BxzcQV/4JiDGRf7x1sRABLcCTHmIL\n/OWfKW/jwHqNYlbtlaPb\n=Mx2A\n-----END PGP SIGNATURE-----\n",
"authenticatedMessage": "93f1ab9cfefe8e82d4338348800634f9d8964315691d71ce5f2c1e1f62c2d25a",
"authenticatedMessageDescription": "SHA256 hash of the signed file"
}
}
}
],
"relatedAssetCid": "bafybeiapv3jcyh74fwuttudyvg2amfrgnfzjq2fg32w5jghvjytxgvugim"
}
}
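The suspicion raised before the example (a variable remembering the last validatedSignatures) matches a common class of bug in long-running loops. The sketch below is purely hypothetical and is not taken from the preprocessor code; the function names and the "signatures" key are invented to show how a value set outside the loop can leak into the next asset, and how resetting it per asset avoids that.

```python
def process_all_leaky(assets):
    """Buggy variant: 'validated' survives from one asset to the next."""
    results = {}
    validated = []  # initialised once, outside the loop
    for asset in assets:
        if asset.get("signatures"):
            validated = list(asset["signatures"])
        meta = {"name": asset["name"]}
        if validated:
            # Bug: an asset without signatures reuses the previous value.
            meta["validatedSignatures"] = validated
        results[asset["name"]] = meta
    return results


def process_all_fixed(assets):
    """Fixed variant: state is rebuilt from scratch for every asset."""
    results = {}
    for asset in assets:
        validated = list(asset.get("signatures") or [])
        meta = {"name": asset["name"]}
        if validated:
            meta["validatedSignatures"] = validated
        results[asset["name"]] = meta
    return results
```

A restart clears this kind of state, which would be consistent with the item completing correctly afterwards, but the actual cause still needs to be confirmed in the code.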
Go through the code to make sure all the dates/times stored in metadata are adjusted for the correct timezone.
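A small helper can make the convention explicit. This sketch assumes the target format is UTC in ISO 8601 with a trailing "Z", as in the dateCreated value "2022-12-12T21:26:24Z" shown earlier; the helper name to_utc_iso is hypothetical. It rejects naive datetimes so that a missing timezone is caught instead of silently guessed.

```python
from datetime import datetime, timedelta, timezone


def to_utc_iso(dt):
    """Format a timezone-aware datetime as UTC ISO 8601 with a trailing 'Z'."""
    if dt.tzinfo is None or dt.utcoffset() is None:
        # Refuse to guess: a naive datetime has no defined offset.
        raise ValueError("naive datetime: timezone must be set explicitly")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# A capture made at 16:26:24 in a UTC-5 zone is 21:26:24 in UTC.
local = datetime(2022, 12, 12, 16, 26, 24, tzinfo=timezone(timedelta(hours=-5)))
print(to_utc_iso(local))  # 2022-12-12T21:26:24Z
```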
Copied from https://github.com/starlinglab/organizing-private/issues/111
This allows us to authorize ingest only to known agents.
trusted_devices.txt:
<sig_algo> <sig_identity> <description>
This can be used to identify the signer and add a line to metadata.
sig66 pubkey "Camera provisioned to ..."
proofmode gpg_key "Phone provisioned to ..."
starling-capture-zion-classic eth_pubkey "Phone provisioned to ..."
This is already used to identify the collection, author, etc.
dropbox email_address "Person at USC"
signal phone_number "Phone provisioned to ..."
http jwt "Person at partner"
This is probably better expressed in JSON.
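To make the proposed line format concrete, here is a minimal parsing sketch. It assumes each line has exactly three whitespace-separated fields with the description in double quotes, as in the examples above, and it additionally assumes that blank lines and lines starting with "#" should be skipped. The names TrustedDevice and parse_trusted_devices are hypothetical.

```python
import shlex
from typing import NamedTuple


class TrustedDevice(NamedTuple):
    sig_algo: str
    sig_identity: str
    description: str


def parse_trusted_devices(text):
    """Parse '<sig_algo> <sig_identity> <description>' lines into records."""
    devices = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and comments are ignored
        # shlex.split keeps the quoted description together as one field.
        fields = shlex.split(line)
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 fields, got {len(fields)}"
            )
        devices.append(TrustedDevice(*fields))
    return devices
```

If the list moves to JSON as suggested, the same records could be serialized with json.dumps([d._asdict() for d in devices]), which avoids the quoting rules entirely.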