
auto-archiver's Introduction

Auto Archiver


Read the article about Auto Archiver on bellingcat.com.

Python tool to automatically archive social media posts, videos, and images from a Google Sheet, the console, and more. It uses different archivers depending on the platform, and can save content to local storage, an S3 bucket (Digital Ocean Spaces, AWS, ...), or Google Drive. If a Google Sheet is used as the source of links, it is updated with information about the archived content. The tool can be run manually or on an automated schedule.

There are 3 ways to use the auto-archiver:

  1. (easiest installation) via docker
  2. (local python install) pip install auto-archiver
  3. (legacy/development) clone and manually install from repo (see legacy tutorial video)

But you always need a configuration/orchestration file, which is where you configure where, what, and how to archive. Make sure you read the Orchestration section below.

How to install and run the auto-archiver

Option 1 - docker


Docker works like a virtual machine running inside your computer: it isolates everything and makes installation simple. Because it is an isolated environment, you need the -v volume flag to connect folders on your machine to folders inside docker, both to pass in your orchestration file and to get downloaded media out.

  1. install docker
  2. pull the auto-archiver docker image with docker pull bellingcat/auto-archiver
  3. run the docker image locally in a container: docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --config secrets/orchestration.yaml; breaking this command down:
    1. docker run tells docker to start a new container (an instance of the image)
    2. --rm makes sure this container is removed after execution (less garbage locally)
    3. -v $PWD/secrets:/app/secrets - your secrets folder
      1. -v is a volume flag which means a folder that you have on your computer will be connected to a folder inside the docker container
      2. $PWD/secrets points to a secrets/ folder in your current working directory (wherever your console is pointed); as a best practice, we use this folder to hold all the secrets/tokens/passwords/... you need
      3. /app/secrets is the path inside the docker container where that folder will be mounted
    4. -v $PWD/local_archive:/app/local_archive - (optional) if you use local_storage
      1. -v same as above, this is a volume instruction
      2. $PWD/local_archive is a folder local_archive/ in case you want to archive locally and have the files accessible outside docker
      3. /app/local_archive is a folder inside docker that you can reference in your orchestration.yaml file

Option 2 - python package

Python package instructions
  1. make sure you have python 3.8 or higher installed
  2. install the package with pip (or pipenv/conda): pip install auto-archiver
  3. test that it's installed with auto-archiver --help
  4. run it with your orchestration file, passing any flags you want on the command line: auto-archiver --config secrets/orchestration.yaml (assuming your orchestration file is inside a secrets/ folder, which we advise)

You will also need ffmpeg, firefox and geckodriver installed locally, and optionally fonts-noto, just as for the local installation (see Option 3 below).
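If you want to sanity-check those prerequisites before running, a minimal sketch (the binary names are assumptions and may differ per system, e.g. firefox-esr on Debian):

# check_prereqs.py - quick sanity check for local prerequisites (sketch)
import shutil

# "firefox" and "geckodriver" names are assumptions; adjust for your system
for binary in ("ffmpeg", "firefox", "geckodriver"):
    path = shutil.which(binary)
    print(f"{binary}: {'found at ' + path if path else 'NOT FOUND - install it and ensure it is on PATH'}")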

Option 3 - local installation

This can also be used for development.

Legacy instructions; only use these if docker or the package is not an option.

Install the following locally:

  1. ffmpeg, which must be installed locally for this tool to work.
  2. firefox and geckodriver on a path folder like /usr/local/bin.
  3. (optional) fonts-noto to deal with multiple unicode characters during selenium/geckodriver's screenshots: sudo apt install fonts-noto -y.

Clone and run:

  1. git clone https://github.com/bellingcat/auto-archiver
  2. pipenv install
  3. pipenv run python -m src.auto_archiver --config secrets/orchestration.yaml

Orchestration

The archiving work is orchestrated by the following workflow (we call each item a step); a minimal sketch of how they fit together follows the list:

  1. Feeder gets the links (from a spreadsheet, from the console, ...)
  2. Archiver tries to archive the link (twitter, youtube, ...)
  3. Enricher adds more info to the content (hashes, thumbnails, ...)
  4. Formatter creates a report from all the archived content (HTML, PDF, ...)
  5. Database knows what's been archived and also stores the archive result (spreadsheet, CSV, or just the console)
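To make the data flow concrete, here is a minimal, hypothetical sketch of how these steps chain together; the class and method names below are illustrative only and do not mirror the real auto_archiver classes:

# pipeline_sketch.py - illustrative only; the real step classes live in auto_archiver.core
def run_pipeline(feeder, archivers, enrichers, formatter, storages, databases):
    for item in feeder:                       # 1. Feeder yields links
        if any(db.is_archived(item) for db in databases):
            continue                          # already archived, skip
        for archiver in archivers:            # 2. first archiver that succeeds wins
            result = archiver.archive(item)
            if result:
                break
        else:
            continue                          # no archiver could handle this link
        for enricher in enrichers:            # 3. add hashes, thumbnails, ...
            enricher.enrich(result)
        report = formatter.format(result)     # 4. build the HTML/PDF report
        for storage in storages:
            storage.store(report)             # save media and report
        for db in databases:
            db.done(item, result)             # 5. record the outcome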

To set up an auto-archiver instance, create an orchestration.yaml that contains the workflow you would like. We advise putting this file in a secrets/ folder and not sharing it with others, because it will contain passwords and other secrets.

The orchestration file is split into 2 parts: steps (what steps to use) and configurations (how those steps should behave). Here's a simplification:

# orchestration.yaml content
steps:
  feeder: gsheet_feeder
  archivers: # order matters
    - youtubedl_archiver
  enrichers:
    - thumbnail_enricher
  formatter: html_formatter
  storages:
    - local_storage
  databases:
    - gsheet_db

configurations:
  gsheet_feeder:
    sheet: "your google sheet name"
    header: 2 # row with header for your sheet
  # ... configurations for the other steps here ...

To see all available steps (archivers, storages, databases, ...), check the example.orchestration.yaml.

All the configurations in the orchestration.yaml file (you can name it differently, but you need to pass it with the --config FILENAME argument) can be seen in the console by using the --help flag. They can also be overwritten on the command line; for example, if you are using the cli_feeder to archive from the command line and want to provide the URLs, you would do:

auto-archiver --config secrets/orchestration.yaml --cli_feeder.urls="url1,url2,url3"

Here's the complete workflow that the auto-archiver goes through:

graph TD
    s((start)) --> F(fa:fa-table Feeder)
    F -->|get and clean URL| D1{fa:fa-database Database}
    D1 -->|is already archived| e((end))
    D1 -->|not yet archived| a(fa:fa-download Archivers)
    a -->|got media| E(fa:fa-chart-line Enrichers)
    E --> S[fa:fa-box-archive Storages]
    E --> Fo(fa:fa-code Formatter)
    Fo --> S
    Fo -->|update database| D2(fa:fa-database Database)
    D2 --> e

Orchestration checklist

Use this checklist to make sure you did all the required steps (a quick sanity-check script is sketched after the list):

  • you have a /secrets folder with all your configuration files including
    • an orchestration file, e.g. orchestration.yaml, pointing to the correct location of other files
    • (optional if you use GoogleSheets) you have a service_account.json (see how-to)
    • (optional for telegram) an anon.session file, which appears after the first run when you log in to Telegram
      • if you use private channels you need to add channel_invites and set join_channels=true at least once
    • (optional for VK) a vk_config.v2.json
    • (optional for using GoogleDrive storage) gd-token.json (see help script)
    • (optional for instagram) an instaloader.session file, which appears after the first run when you log in to Instagram
    • (optional for browsertrix) profile.tar.gz file
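The sketch below checks that those files are in place; which ones you actually need depends on the steps you enabled, so treat every filename here as an assumption taken from the checklist above:

# check_secrets.py - sketch; only the files for the steps you enabled are actually required
from pathlib import Path

secrets = Path("secrets")
expected = {
    "orchestration.yaml": "always required",
    "service_account.json": "Google Sheets / Google Drive",
    "anon.session": "Telethon (created on first run)",
    "vk_config.v2.json": "VK",
    "gd-token.json": "Google Drive storage",
    "instaloader.session": "Instagram (created on first run)",
    "profile.tar.gz": "browsertrix",
}
for name, needed_for in expected.items():
    status = "ok" if (secrets / name).exists() else "missing"
    print(f"{status:8} {name:25} ({needed_for})")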

Example invocations

The recommended way to run the auto-archiver is through Docker. The invocations below will run the auto-archiver Docker image using a configuration file that you have specified.

# all the configurations come from ./secrets/orchestration.yaml
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --config secrets/orchestration.yaml
# uses the same configurations but for another google docs sheet 
# with a header on row 2 and with some different column names
# notice that columns is a dictionary so you need to pass it as JSON and it will override only the values provided
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --config secrets/orchestration.yaml --gsheet_feeder.sheet="use it on another sheets doc" --gsheet_feeder.header=2 --gsheet_feeder.columns='{"url": "link"}'
# all the configurations come from orchestration.yaml and specifies that s3 files should be private
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --config secrets/orchestration.yaml --s3_storage.private=1

The auto-archiver can also be run locally, if pre-requisites are correctly configured. Equivalent invocations are below.

# all the configurations come from ./secrets/orchestration.yaml
auto-archiver --config secrets/orchestration.yaml
# uses the same configurations but for another google docs sheet 
# with a header on row 2 and with some different column names
# notice that columns is a dictionary so you need to pass it as JSON and it will override only the values provided
auto-archiver --config secrets/orchestration.yaml --gsheet_feeder.sheet="use it on another sheets doc" --gsheet_feeder.header=2 --gsheet_feeder.columns='{"url": "link"}'
# all the configurations come from orchestration.yaml and specifies that s3 files should be private
auto-archiver --config secrets/orchestration.yaml --s3_storage.private=1

Extra notes on configuration

Google Drive

To use Google Drive storage you need the id of the shared folder in the config.yaml file; the folder must be shared with the service account, e.g. [email protected], and then you can use --storage=gd.
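A minimal sketch to confirm that the service account can actually see the shared folder, assuming google-api-python-client and google-auth are installed; the folder id placeholder and secrets path are assumptions:

# check_gdrive_access.py - sketch, not part of the auto-archiver
from google.oauth2 import service_account
from googleapiclient.discovery import build

FOLDER_ID = "folder_id_from_url"  # hypothetical placeholder, copy it from the shared folder's URL

creds = service_account.Credentials.from_service_account_file(
    "secrets/service_account.json",
    scopes=["https://www.googleapis.com/auth/drive.readonly"],
)
drive = build("drive", "v3", credentials=creds)
# if the folder was not shared with the service account, this list will be empty or the call will fail
files = drive.files().list(q=f"'{FOLDER_ID}' in parents", fields="files(id, name)").execute()
print([f["name"] for f in files.get("files", [])])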

Telethon + Instagram with telegram bot

The first time you run it, you will be prompted to authenticate with the associated phone number; alternatively, you can put your anon.session in the root.
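If you prefer to generate the anon.session ahead of time (for example before deploying to a server), a minimal Telethon sketch, assuming your own api_id/api_hash from my.telegram.org:

# create_session.py - sketch; run interactively once, it prompts for your phone number and login code
from telethon.sync import TelegramClient

API_ID = 123456            # assumption: your own values from https://my.telegram.org
API_HASH = "your_api_hash"

# "secrets/anon" creates secrets/anon.session, which the telethon_archiver can then reuse
with TelegramClient("secrets/anon", API_ID, API_HASH) as client:
    print("Logged in as:", client.get_me().username)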

Atlos

When integrating with Atlos, you will need to provide an API token in your configuration. You can learn more about Atlos and how to get an API token here. You will have to provide this token to the atlos_feeder, atlos_storage, and atlos_db steps in your orchestration file. If you use a custom or self-hosted Atlos instance, you can also specify the atlos_url option to point to your custom instance's URL. For example:

# orchestration.yaml content
steps:
  feeder: atlos_feeder
  archivers: # order matters
    - youtubedl_archiver
  enrichers:
    - thumbnail_enricher
    - hash_enricher
  formatter: html_formatter
  storages:
    - atlos_storage
  databases:
    - console_db
    - atlos_db

configurations:
  atlos_feeder:
    atlos_url: "https://platform.atlos.org" # optional
    api_token: "...your API token..."
  atlos_db:
    atlos_url: "https://platform.atlos.org" # optional
    api_token: "...your API token..."
  atlos_storage:
    atlos_url: "https://platform.atlos.org" # optional
    api_token: "...your API token..."
  hash_enricher:
    algorithm: "SHA-256"

Running with the Google Sheets Feeder (gsheet_feeder)

The --gsheet_feeder.sheet property is the name of the Google Sheet to check for URLs. This sheet must be shared with the Google service account used by gspread and must have specific columns (case-insensitive) in the header, as specified in Gsheet.configs. The default names of these columns and their purpose are listed below; a minimal sketch of reading them with gspread follows the column lists.

Inputs:

  • Link (required): the URL of the post to archive
  • Destination folder: custom folder for archived file (regardless of storage)

Outputs:

  • Archive status (required): Status of archive operation
  • Archive location: URL of archived post
  • Archive date: Date archived
  • Thumbnail: Embeds a thumbnail for the post in the spreadsheet
  • Timestamp: Timestamp of original post
  • Title: Post title
  • Text: Post text
  • Screenshot: Link to screenshot of post
  • Hash: Hash of archived HTML file (which contains hashes of post media) - for checksums/verification
  • Perceptual Hash: Perceptual hashes of found images - these can be used for de-duplication of content
  • WACZ: Link to a WACZ web archive of post
  • ReplayWebpage: Link to a ReplayWebpage viewer of the WACZ archive
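As a rough sketch of the feeder-side lookup, you can inspect your own sheet's header with gspread; the sheet name, header row, and column names below are assumptions matching the defaults above, and this is not the actual gsheet_feeder implementation:

# inspect_sheet.py - sketch using gspread
import gspread

gc = gspread.service_account(filename="secrets/service_account.json")
ws = gc.open("your google sheet name").sheet1   # assumption: first worksheet

HEADER_ROW = 1                                   # matches --gsheet_feeder.header
headers = [h.strip().lower() for h in ws.row_values(HEADER_ROW)]
print("found columns:", headers)

# case-insensitive lookup of the required "Link" column, conceptually what the feeder does
if "link" not in headers:
    raise SystemExit("missing required 'Link' column")
link_col = headers.index("link") + 1             # gspread columns are 1-indexed
print("first few links:", ws.col_values(link_col)[HEADER_ROW:HEADER_ROW + 5])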

For example, this is a spreadsheet configured with all of the columns for the auto archiver and a few URLs to archive. (Note that the column names are not case sensitive.)

A screenshot of a Google Spreadsheet with column headers defined as above, and several Youtube and Twitter URLs in the "Link" column

Now the auto archiver can be invoked, in this example with the command: docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver:dockerize --config secrets/orchestration-global.yaml --gsheet_feeder.sheet "Auto archive test 2023-2". Note that the sheet name has been overridden/specified in the command-line invocation.

When the auto archiver starts running, it updates the "Archive status" column.

A screenshot of a Google Spreadsheet with column headers defined as above, and several Youtube and Twitter URLs in the "Link" column. The auto archiver has added "archive in progress" to one of the status columns.

The links are downloaded and archived, and the spreadsheet is updated to the following:

A screenshot of a Google Spreadsheet with videos archived and metadata added per the description of the columns above.

Note that the first row is skipped, as it is assumed to be a header row (--gsheet_feeder.header=1; you can change this if your header is on a different row). Rows with an empty URL column or a non-empty archive status column are also skipped. All worksheets in the document will be checked.

The "archive location" link contains the path of the archived file, in local storage, S3, or in Google Drive.

The archive result for a link in the demo sheet.


Development

Use python -m src.auto_archiver --config secrets/orchestration.yaml to run from the local development environment.

Docker development

Working with docker locally:

  • docker build . -t auto-archiver to build a local image
  • docker run --rm -v $PWD/secrets:/app/secrets auto-archiver --config secrets/orchestration.yaml
    • to use local archive, also create a volume -v for it by adding -v $PWD/local_archive:/app/local_archive

manual release to docker hub

  • docker image tag auto-archiver bellingcat/auto-archiver:latest
  • docker push bellingcat/auto-archiver

RELEASE

  • update version in version.py
  • go to github releases > new release > use vx.y.z for matching version notation
    • package is automatically updated in pypi
    • docker image is automatically pushed to Docker Hub

auto-archiver's People

Contributors

andyfcx, brrttwrks, djhmateer, edsu, emieldatalytica, galenreich, jamesarnall, jettchent, liliakai, loganwilliams, milesmcc, msramalho, tmaybe


auto-archiver's Issues

YoutubeDL can return non-video content

Currently this isn't handled well by the archiver. For example, if YoutubeDL returns a PDF, it will attempt to run ffmpeg on it, resulting in an error. We should be ready to handle non-video content coming from YoutubeDL (and more generally, ref #3)
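One possible guard, sketched below, is to check the mimetype of whatever YoutubeDL returned before handing it to ffmpeg; the function name is hypothetical and this is not the project's actual code:

# sketch of a hypothetical guard
import mimetypes

def looks_like_media(filename: str) -> bool:
    """Return True only for files that ffmpeg-based post-processing makes sense for."""
    mime, _ = mimetypes.guess_type(filename)
    return bool(mime) and mime.split("/")[0] in ("video", "audio", "image")

# e.g. a PDF returned by YoutubeDL would still be kept, but skipped by thumbnail/ffmpeg steps
assert looks_like_media("clip.mp4")
assert not looks_like_media("report.pdf")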

YouTube playlists - probably not intentional from user

https://podrobnosti.ua/2443817-na-kivschin-cherez-vorozhij-obstrl-vinikla-pozhezha.html

This site contains a live link at the top which is a link to a YouTube playlist with 1 item which is a live stream.

Currently the 'is_live' check doesn't catch it because it is a playlist, so the archiver will proceed to download 3.7GB of stream and then create thousands of thumbnails.

I propose a simple fix in youtubedl_archiver.py to stop downloading of playlists, which struck me as probably not what the user would want.

        if info.get('is_live', False):
            logger.warning("Live streaming media, not archiving now")
            return ArchiveResult(status="Streaming media")

        # added this catch below: a single-item playlist here is typically
        # a live stream wrapped in a playlist, which is probably not intended
        infotype = info.get('_type')
        if infotype and 'playlist' in infotype:
            logger.info('found a youtube playlist - this probably is not intended; added as an edge case for a live stream which is a single item in a playlist')
            return ArchiveResult(status="Playlist")

There is probably a much more elegant way to express this!

Can submit a PR if you agree.

Detect columns from headers

Rather than specifying columns to use for archive URL, timestamp, etc as command line flags, these should be determined from headers in the Google Sheet itself.

UX bug: archiving fails if the "url" is replaced with its linked title text

When pasting a URL in Sheets, a helpful little dialog appears and suggests that you "Replace URL with its title" as linked text (see image below). The auto-archiver doesn't know how to handle this format and returns "nothing archived". However, it should be possible to detect and extract the URL when the cell value is a link.

Screenshot 2023-08-01 at 16 43 13
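One possible approach, sketched with the Sheets API v4, which exposes a read-only hyperlink field on cells; the spreadsheet id, range, and credentials path are assumptions, and cells with multiple rich-text links may need extra handling:

# hyperlink_sketch.py - illustrative; assumes google-api-python-client and a service account
from google.oauth2 import service_account
from googleapiclient.discovery import build

creds = service_account.Credentials.from_service_account_file(
    "secrets/service_account.json",
    scopes=["https://www.googleapis.com/auth/spreadsheets.readonly"],
)
sheets = build("sheets", "v4", credentials=creds)
resp = sheets.spreadsheets().get(
    spreadsheetId="your-spreadsheet-id",          # assumption
    ranges=["Sheet1!B2:B"],                       # assumption: the Link column
    fields="sheets(data(rowData(values(formattedValue,hyperlink))))",
).execute()

for row in resp["sheets"][0]["data"][0].get("rowData", []):
    for cell in row.get("values", []):
        # prefer the underlying hyperlink when the visible value is just the title text
        print(cell.get("hyperlink") or cell.get("formattedValue"))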

include exif metadata

Add an additional data point containing EXIF metadata where available; for now I can only think of Telegram media.
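A minimal sketch of what such an enricher could do with Pillow; the names here are illustrative, not an existing enricher, and only formats that carry EXIF (e.g. JPEG) will return anything:

# exif_sketch.py - illustrative only
from PIL import Image, ExifTags

def extract_exif(path: str) -> dict:
    with Image.open(path) as img:
        exif = img.getexif()
        # map numeric EXIF tag ids to their human-readable names
        return {ExifTags.TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}

# e.g. {'DateTime': '2023:08:01 16:43:13', 'Model': '...', ...} when metadata is present
print(extract_exif("downloaded_photo.jpg"))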

archive facebook with archive.ph

https://archive.ph/ does not have an API like the Internet Archive's Wayback Machine, although it can archive Facebook pages, which the Wayback Machine cannot. Could we use Selenium to submit links through the archive.ph UI and thus successfully archive those links?
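A rough Selenium sketch of what that could look like; the form field name ("url") is an assumption taken from archive.ph's public submit form and should be verified, and the site's CAPTCHAs and rate limits are not handled here:

# archiveph_sketch.py - untested sketch, not part of the auto-archiver
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
try:
    driver.get("https://archive.ph/")
    # assumption: the submit form's input is named "url"
    box = driver.find_element(By.NAME, "url")
    box.send_keys("https://www.facebook.com/some/public/post")   # hypothetical target
    box.submit()
    # after submission archive.ph redirects to the capture (or an existing snapshot)
    print("landed on:", driver.current_url)
finally:
    driver.quit()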

generating WACZ without Docker - wacz not working

Getting a proxy connection failure from the wacz_archiver_enricher on all URLs.

First time I've set this up, so probably something simple / maybe I've missed something.

Next step for me is to setup a local dev version and debug it.. but this issue may be useful for others at the same stage as me.

I have the profile set up in secrets/profile.tar.gz, which I created via

# create a new profile
docker run -p 6080:6080 -p 9223:9223 -v $PWD/crawls/profiles:/crawls/profiles/ -it webrecorder/browsertrix-crawler create-login-profile --url "https://twitter.com/"

Output of the run is:

docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --config secrets/orchestration.yaml

2023-08-22 10:50:17.450 | INFO     | auto_archiver.core.config:parse:111 - FEEDER: gsheet_feeder
2023-08-22 10:50:17.450 | INFO     | auto_archiver.core.config:parse:112 - ENRICHERS: ['hash_enricher', 'wacz_archiver_enricher']
2023-08-22 10:50:17.450 | INFO     | auto_archiver.core.config:parse:113 - ARCHIVERS: ['wacz_archiver_enricher']
2023-08-22 10:50:17.450 | INFO     | auto_archiver.core.config:parse:114 - DATABASES: ['gsheet_db']
2023-08-22 10:50:17.450 | INFO     | auto_archiver.core.config:parse:115 - STORAGES: ['local_storage']
2023-08-22 10:50:17.450 | INFO     | auto_archiver.core.config:parse:116 - FORMATTER: html_formatter
2023-08-22 10:50:24.319 | INFO     | auto_archiver.feeders.gsheet_feeder:__iter__:48 - Opening worksheet ii=0: wks.title='Sheet1' header=1
2023-08-22 10:50:26.275 | WARNING  | auto_archiver.databases.gsheet_db:started:28 - STARTED Metadata(status='no archiver', metadata={'_processed_at': datetime.datetime(2023, 8, 22, 10, 50, 26, 274503), 'url': 'https://twitter.com/dave_mateer/status/1505876265504546817'}, media=[])
2023-08-22 10:50:26.916 | INFO     | auto_archiver.core.orchestrator:archive:85 - Trying archiver wacz_archiver_enricher for https://twitter.com/dave_mateer/status/1505876265504546817
2023-08-22 10:50:26.916 | DEBUG    | auto_archiver.enrichers.wacz_enricher:enrich:52 - generating WACZ without Docker for url='https://twitter.com/dave_mateer/status/1505876265504546817'
2023-08-22 10:50:26.916 | INFO     | auto_archiver.enrichers.wacz_enricher:enrich:96 - Running browsertrix-crawler: crawl --url https://twitter.com/dave_mateer/status/1505876265504546817 --scopeType page --generateWACZ --text --screenshot fullPage --collection 5e60e6e9 --id 5e60e6e9 --saveState never --behaviors autoscroll,autoplay,autofetch,siteSpecific --behaviorTimeout 120 --timeout 120 --profile /app/secrets/profile.tar.gz
{"logLevel":"info","timestamp":"2023-08-22T10:50:27.983Z","context":"general","message":"Browsertrix-Crawler 0.10.3 (with warcio.js 1.6.2 pywb 2.7.4)","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:27.983Z","context":"general","message":"Seeds","details":[{"url":"https://twitter.com/dave_mateer/status/1505876265504546817","include":[],"exclude":[],"scopeType":"page","sitemap":false,"allowHash":false,"maxExtraHops":0,"maxDepth":1000000}]}
{"logLevel":"info","timestamp":"2023-08-22T10:50:27.983Z","context":"general","message":"With Browser Profile","details":{"url":"/app/secrets/profile.tar.gz"}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:30.205Z","context":"worker","message":"Creating 1 workers","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:30.205Z","context":"worker","message":"Worker starting","details":{"workerid":0}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:30.263Z","context":"browser","message":"Disabling Service Workers for profile","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:30.269Z","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://twitter.com/dave_mateer/status/1505876265504546817"}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:30.270Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":null,"total":null,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2023-08-22T10:50:30.206Z\",\"url\":\"https://twitter.com/dave_mateer/status/1505876265504546817\",\"added\":\"2023-08-22T10:50:28.119Z\",\"depth\":0}"]}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:32.373Z","context":"general","message":"Awaiting page load","details":{"page":"https://twitter.com/dave_mateer/status/1505876265504546817","workerid":0}}
{"logLevel":"error","timestamp":"2023-08-22T10:51:02.379Z","context":"general","message":"Page Load Error, skipping page","details":{"msg":"net::ERR_TIMED_OUT at https://twitter.com/dave_mateer/status/1505876265504546817","page":"https://twitter.com/dave_mateer/status/1505876265504546817","workerid":0}}
{"logLevel":"error","timestamp":"2023-08-22T10:51:02.379Z","context":"worker","message":"Unknown exception","details":{"type":"exception","message":"net::ERR_TIMED_OUT at https://twitter.com/dave_mateer/status/1505876265504546817","stack":"Error: net::ERR_TIMED_OUT at https://twitter.com/dave_mateer/status/1505876265504546817\n    at navigate (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Frame.js:98:23)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async Deferred.race (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/util/Deferred.js:79:20)\n    at async Frame.goto (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Frame.js:64:21)\n    at async CDPPage.goto (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:578:16)\n    at async Crawler.loadPage (file:///app/crawler.js:1062:20)\n    at async Crawler.default [as driver] (file:///app/defaultDriver.js:3:3)\n    at async Crawler.crawlPage (file:///app/crawler.js:451:5)\n    at async PageWorker.timedCrawlPage (file:///app/util/worker.js:151:7)\n    at async PageWorker.runLoop (file:///app/util/worker.js:192:9)","workerid":0}}
{"logLevel":"warn","timestamp":"2023-08-22T10:51:02.380Z","context":"pageStatus","message":"Page Load Failed","details":{"loadState":0,"page":"https://twitter.com/dave_mateer/status/1505876265504546817","workerid":0}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:02.386Z","context":"worker","message":"Worker exiting, all tasks complete","details":{"workerid":0}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:02.483Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":1,"total":1,"pending":0,"failed":1,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:02.483Z","context":"general","message":"Waiting to ensure pending data is written to WARCs...","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:02.485Z","context":"general","message":"Generating WACZ","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:02.485Z","context":"general","message":"Num WARC Files: 0","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:02.485Z","context":"general","message":"Crawl status: done","details":{}}
2023-08-22 10:51:02.489 | WARNING  | auto_archiver.enrichers.wacz_enricher:enrich:108 - Unable to locate and upload WACZ  filename='collections/5e60e6e9/5e60e6e9.wacz'
2023-08-22 10:51:02.490 | DEBUG    | auto_archiver.enrichers.hash_enricher:enrich:31 - calculating media hashes for url='https://twitter.com/dave_mateer/status/1505876265504546817' (using SHA3-512)
2023-08-22 10:51:02.490 | DEBUG    | auto_archiver.enrichers.wacz_enricher:enrich:52 - generating WACZ without Docker for url='https://twitter.com/dave_mateer/status/1505876265504546817'
2023-08-22 10:51:02.490 | INFO     | auto_archiver.enrichers.wacz_enricher:enrich:96 - Running browsertrix-crawler: crawl --url https://twitter.com/dave_mateer/status/1505876265504546817 --scopeType page --generateWACZ --text --screenshot fullPage --collection c851aa3f --id c851aa3f --saveState never --behaviors autoscroll,autoplay,autofetch,siteSpecific --behaviorTimeout 120 --timeout 120 --profile /app/secrets/profile.tar.gz
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.099Z","context":"general","message":"Browsertrix-Crawler 0.10.3 (with warcio.js 1.6.2 pywb 2.7.4)","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.100Z","context":"general","message":"Seeds","details":[{"url":"https://twitter.com/dave_mateer/status/1505876265504546817","include":[],"exclude":[],"scopeType":"page","sitemap":false,"allowHash":false,"maxExtraHops":0,"maxDepth":1000000}]}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.100Z","context":"general","message":"With Browser Profile","details":{"url":"/app/secrets/profile.tar.gz"}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.653Z","context":"worker","message":"Creating 1 workers","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.653Z","context":"worker","message":"Worker starting","details":{"workerid":0}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.703Z","context":"browser","message":"Disabling Service Workers for profile","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.710Z","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://twitter.com/dave_mateer/status/1505876265504546817"}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.710Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":null,"total":null,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2023-08-22T10:51:03.654Z\",\"url\":\"https://twitter.com/dave_mateer/status/1505876265504546817\",\"added\":\"2023-08-22T10:51:03.157Z\",\"depth\":0}"]}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.392Z","context":"general","message":"Awaiting page load","details":{"page":"https://twitter.com/dave_mateer/status/1505876265504546817","workerid":0}}
{"logLevel":"error","timestamp":"2023-08-22T10:51:05.398Z","context":"general","message":"Page Load Error, skipping page","details":{"msg":"net::ERR_PROXY_CONNECTION_FAILED at https://twitter.com/dave_mateer/status/1505876265504546817","page":"https://twitter.com/dave_mateer/status/1505876265504546817","workerid":0}}
{"logLevel":"error","timestamp":"2023-08-22T10:51:05.399Z","context":"worker","message":"Unknown exception","details":{"type":"exception","message":"net::ERR_PROXY_CONNECTION_FAILED at https://twitter.com/dave_mateer/status/1505876265504546817","stack":"Error: net::ERR_PROXY_CONNECTION_FAILED at https://twitter.com/dave_mateer/status/1505876265504546817\n    at navigate (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Frame.js:98:23)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async Deferred.race (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/util/Deferred.js:79:20)\n    at async Frame.goto (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Frame.js:64:21)\n    at async CDPPage.goto (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:578:16)\n    at async Crawler.loadPage (file:///app/crawler.js:1062:20)\n    at async Crawler.default [as driver] (file:///app/defaultDriver.js:3:3)\n    at async Crawler.crawlPage (file:///app/crawler.js:451:5)\n    at async PageWorker.timedCrawlPage (file:///app/util/worker.js:151:7)\n    at async PageWorker.runLoop (file:///app/util/worker.js:192:9)","workerid":0}}
{"logLevel":"warn","timestamp":"2023-08-22T10:51:05.399Z","context":"pageStatus","message":"Page Load Failed","details":{"loadState":0,"page":"https://twitter.com/dave_mateer/status/1505876265504546817","workerid":0}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.409Z","context":"worker","message":"Worker exiting, all tasks complete","details":{"workerid":0}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.540Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":1,"total":1,"pending":0,"failed":1,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.540Z","context":"general","message":"Waiting to ensure pending data is written to WARCs...","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.542Z","context":"general","message":"Generating WACZ","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.543Z","context":"general","message":"Num WARC Files: 0","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.543Z","context":"general","message":"Crawl status: done","details":{}}
2023-08-22 10:51:05.549 | WARNING  | auto_archiver.enrichers.wacz_enricher:enrich:108 - Unable to locate and upload WACZ  filename='collections/c851aa3f/c851aa3f.wacz'
2023-08-22 10:51:05.549 | DEBUG    | auto_archiver.formatters.html_formatter:format:37 - [SKIP] FORMAT there is no media or metadata to format: url='https://twitter.com/dave_mateer/status/1505876265504546817'
2023-08-22 10:51:05.549 | SUCCESS  | auto_archiver.databases.gsheet_db:done:46 - DONE https://twitter.com/dave_mateer/status/1505876265504546817
2023-08-22 10:51:06.365 | SUCCESS  | auto_archiver.feeders.gsheet_feeder:__iter__:79 - Finished worksheet Sheet1

and orchestration.yaml is:

steps:
  # only 1 feeder allowed
  feeder: gsheet_feeder # defaults to cli_feeder
  archivers: # order matters, uncomment to activate
    # - vk_archiver
    # - telethon_archiver
    # - telegram_archiver
    # - twitter_archiver
    #- twitter_api_archiver
    # - instagram_tbot_archiver
    # - instagram_archiver
    # - tiktok_archiver
    # - youtubedl_archiver
    # - wayback_archiver_enricher
    - wacz_archiver_enricher
  enrichers:
    - hash_enricher
    # - metadata_enricher
    # - screenshot_enricher
    # - thumbnail_enricher
    # - wayback_archiver_enricher
    - wacz_archiver_enricher
    # - pdq_hash_enricher # if you want to calculate hashes for thumbnails, include this after thumbnail_enricher
  formatter: html_formatter # defaults to mute_formatter
  storages:
    - local_storage
    # - s3_storage
    # - gdrive_storage
  databases:
    #- console_db
    # - csv_db
    - gsheet_db
    # - mongo_db

configurations:
  gsheet_feeder:
    sheet: "AA Demo Main"
    header: 1
    service_account: "secrets/service_account.json"
    # allow_worksheets: "only parse this worksheet"
    # block_worksheets: "blocked sheet 1,blocked sheet 2"
    use_sheet_names_in_stored_paths: false
    columns:
      url: link
      status: archive status
      folder: destination folder
      archive: archive location
      date: archive date
      thumbnail: thumbnail
      timestamp: upload timestamp
      title: upload title
      text: textual content
      screenshot: screenshot
      hash: hash
      pdq_hash: perceptual hashes
      wacz: wacz
      replaywebpage: replaywebpage
  instagram_tbot_archiver:
    api_id: "TELEGRAM_BOT_API_ID"
    api_hash: "TELEGRAM_BOT_API_HASH"
    # session_file: "secrets/anon"
  telethon_archiver:
    api_id: "TELEGRAM_BOT_API_ID"
    api_hash: "TELEGRAM_BOT_API_HASH"
    # session_file: "secrets/anon"
    join_channels: false
    channel_invites: # if you want to archive from private channels
      - invite: https://t.me/+123456789
        id: 0000000001
      - invite: https://t.me/+123456788
        id: 0000000002

  twitter_api_archiver:
    # either bearer_token only
    # bearer_token: "TWITTER_BEARER_TOKEN"
   

  instagram_archiver:
    username: "INSTAGRAM_USERNAME"
    password: "INSTAGRAM_PASSWORD"
    # session_file: "secrets/instaloader.session"

  vk_archiver:
    username: "or phone number"
    password: "vk pass"
    session_file: "secrets/vk_config.v2.json"

  screenshot_enricher:
    width: 1280
    height: 2300
  wayback_archiver_enricher:
    timeout: 10
    key: "wayback key"
    secret: "wayback secret"
  hash_enricher:
    algorithm: "SHA3-512" # can also be SHA-256
  wacz_archiver_enricher:
    profile: secrets/profile.tar.gz
  local_storage:
    save_to: "./local_archive"
    save_absolute: true
    filename_generator: static
    path_generator: flat
  s3_storage:
    bucket: your-bucket-name
    region: reg1
    key: S3_KEY
    secret: S3_SECRET
    endpoint_url: "https://{region}.digitaloceanspaces.com"
    cdn_url: "https://{bucket}.{region}.cdn.digitaloceanspaces.com/{key}"
    # if private:true S3 urls will not be readable online
    private: false
    # with 'random' you can generate a random UUID for the URL instead of a predictable path; useful to keep files public but unlisted. The alternative is 'default' (or omit this key from the config)
    key_path: random
  gdrive_storage:
    path_generator: url
    filename_generator: random
    root_folder_id: folder_id_from_url
    oauth_token: secrets/gd-token.json # needs to be generated with scripts/create_update_gdrive_oauth_token.py
    service_account: "secrets/service_account.json"
  csv_db:
    csv_file: "./local_archive/db.csv"

Add a timestamp authority client Step

Following information from this timestamp-authority repo (RFC3161 Timestamp Authority), implement a Step which connects to a timestamp authority server; one example that can be tested right away is
https://freetsa.org/index_en.php

Taken from there, a full example with SHA-512 is:

###########################################################
# 1. create a tsq file (SHA 512)
###########################################################
openssl ts -query -data file.png -no_nonce -sha512 -out file.tsq

# Option -cert: FreeTSA is expected to include its signing certificate (Root + Intermediate Certificates) in the response. (Optional)
# If the tsq was created with the option "-cert", its verification does not require "-untrusted".
#$ openssl ts -query -data file.png -no_nonce -sha512 -cert -out file.tsq


# How to make Timestamps of many files?

# To timestamp multiple files, create a text file with all their SHA-512 hashes and timestamp it.
# Alternatively, you may pack all the files to be timestamped in a zip/rar/img/tar, etc file and timestamp it.

# Generate a text file with all the hashes of the /var/log/ files
$ find /var/log/ -type f -exec sha512sum {} + > compilation.txt

###########################################################
# 2. cURL Time Stamp Request Input (HTTP / HTTPS)
###########################################################

# HTTP 2.0 in cURL: Get the latest cURL release and use this command: curl --http2.
curl -H "Content-Type: application/timestamp-query" --data-binary '@file.tsq' https://freetsa.org/tsr > file.tsr

# Using the Tor-network.
#$ curl -k --socks5-hostname 127.0.0.1:9050 -H "Content-Type: application/timestamp-query" --data-binary '@file.tsq' https://4bvu5sj5xok272x6cjx4uurvsbsdigaxfmzqy3n3eita272vfopforqd.onion/tsr > file.tsr

# tsget is very useful to stamp multiple time-stamp-queries: https://www.openssl.org/docs/manmaster/apps/tsget.html
#$ tsget -h https://freetsa.org/tsr file1.tsq file2.tsq file3.tsq

###########################################################
# 3. Verify tsr file
###########################################################

wget https://freetsa.org/files/tsa.crt
wget https://freetsa.org/files/cacert.pem

# Timestamp Information.
openssl ts -reply -in file.tsr -text

# Verify (two different ways).
# openssl ts -verify -data file -in file.tsr -CAfile cacert.pem -untrusted tsa.crt 
openssl ts -verify -in file.tsr -queryfile file.tsq -CAfile cacert.pem -untrusted tsa.crt
# Verification: OK

Discussion topics

  • If we do this for a final document like the HTML report, it's enough to do it once, but the result cannot then be saved within the HTML report itself, as that would break the hashing (unless we add links to the tsq and tsr files, which are only created after the HTML report is written and hashed). It does not matter that those files don't exist yet: they can be created later, since their content is the actual verification, and they don't need hashes of their own because they contain the hash of the HTML.
  • If we do it for each file it can lead to a lot of overhead, does not sound like a great approach

Given the somewhat circular definition here, I wonder what the best way to implement it is: it needs to run after the HtmlFormatter, which can only happen in a database step, meaning the formatter should only include/display the links if they actually exist.
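A rough sketch of what such a Step's core could look like, simply wrapping the openssl flow shown above; the function name is hypothetical and the freetsa.org endpoint is just the example server from above:

# tsa_sketch.py - hypothetical wrapper around the openssl/curl flow above
import subprocess
import requests

TSA_URL = "https://freetsa.org/tsr"  # example server from above

def timestamp_file(path: str) -> None:
    tsq, tsr = f"{path}.tsq", f"{path}.tsr"
    # 1. create the timestamp query (SHA-512 digest of the file)
    subprocess.run(
        ["openssl", "ts", "-query", "-data", path, "-no_nonce", "-sha512", "-out", tsq],
        check=True,
    )
    # 2. send it to the timestamp authority and save the response token
    with open(tsq, "rb") as f:
        resp = requests.post(
            TSA_URL, data=f.read(),
            headers={"Content-Type": "application/timestamp-query"}, timeout=30,
        )
    resp.raise_for_status()
    with open(tsr, "wb") as f:
        f.write(resp.content)

timestamp_file("archive.html")  # produces archive.html.tsq and archive.html.tsr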

Handle pages with multiple videos

Youtube-dl supports pages that contain multiple videos, but the auto-archiver does not. Instead, these URLs will be skipped over with a notification to the user that pages with multiple videos are not currently supported.

In the current user interface, there is a 1:1 relationship between a row in the spreadsheet and a file to be archived. This needs to be generalized to 1:many in order to support pages with multiple videos. One possible way to do this would be to generate HTML index pages that link/include all videos archived for a page, similar to the way that thumbnail contact sheets are generated.

Improvement suggestions for WaybackArchiver

Hi!
I'm a member of Team Wayback at the Internet Archive.
I have some improvement suggestions for

class WaybackArchiver(Archiver):

  1. You could use the Wayback Machine Availability API to easily get capture info about a captured URL: https://archive.org/help/wayback_api.php. https://web.archive.org/web/<URL> is not recommended because its purpose is to play back the latest capture. You don't need to load the whole data of the latest capture of a URL; you just need to know whether it's available or not.
  2. Save Page Now API has a lot of useful options https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA/edit

if_not_archived_within=<timedelta> should be useful in your case.

Capture web page only if the latest existing capture at the Archive is older than the limit. Its format could be any datetime expression like "3d 5h 20m" or just a number of seconds, e.g. "120". If there is a capture within the defined timedelta, SPN2 returns that as a recent capture. The default system is 30 min.

Cheers!
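A minimal sketch of both suggestions using requests; the SPN2 call assumes you have S3-style API keys from archive.org, and the if_not_archived_within parameter comes from the SPN2 docs linked above:

# wayback_sketch.py - illustrative use of the two suggested APIs
import requests

url = "https://example.com/some/post"

# 1. Availability API: cheap check for an existing capture
avail = requests.get("https://archive.org/wayback/available", params={"url": url}, timeout=30).json()
print(avail.get("archived_snapshots", {}).get("closest", {}).get("url"))

# 2. Save Page Now 2: only capture if the latest snapshot is older than 1 day
resp = requests.post(
    "https://web.archive.org/save",
    data={"url": url, "if_not_archived_within": "1d"},
    headers={
        "Accept": "application/json",
        # assumption: S3-style keys from https://archive.org/account/s3.php
        "Authorization": "LOW MYACCESSKEY:MYSECRET",
    },
    timeout=60,
)
print(resp.status_code, resp.text[:200])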

AttributeError: 'HashEnricher' object has no attribute 'algorithm'

I'm running into an interesting error when archiving a simple URL locally on my computer (macOS) (command is python3 -m auto_archiver --config orch.yaml --cli_feeder.urls="https://miles.land", version is 0.4.4): AttributeError: 'HashEnricher' object has no attribute 'algorithm'.

Here is my config:

steps:
  # only 1 feeder allowed
  feeder: cli_feeder
  archivers: # order matters, uncomment to activate
    # - vk_archiver
    # - telethon_archiver
    - telegram_archiver
    - twitter_archiver
    # - twitter_api_archiver
    # - instagram_tbot_archiver
    # - instagram_archiver
    - tiktok_archiver
    - youtubedl_archiver
    # - wayback_archiver_enricher
  enrichers:
    - hash_enricher
    - screenshot_enricher
    - thumbnail_enricher
    # - wayback_archiver_enricher
    - wacz_enricher

  formatter: html_formatter # defaults to mute_formatter
  storages:
    - local_storage
    # - s3_storage
    # - gdrive_storage
  databases:
    - console_db
    - csv_db
    # - gsheet_db
    # - mongo_db

configurations:
  screenshot_enricher:
    width: 1280
    height: 2300
  hash_enricher:
    algorithm: "SHA-256" # can also be SHA-256
  local_storage:
    save_to: "./local_archive"
    save_absolute: true
    filename_generator: static
    path_generator: flat

And here is the full log:

% python3 -m auto_archiver --config orch.yaml --cli_feeder.urls="https://miles.land"
2023-03-14 11:52:44.363 | INFO     | auto_archiver.core.config:parse:108 - FEEDER: cli_feeder
2023-03-14 11:52:44.363 | INFO     | auto_archiver.core.config:parse:109 - ENRICHERS: ['hash_enricher', 'screenshot_enricher', 'thumbnail_enricher', 'wacz_enricher']
2023-03-14 11:52:44.363 | INFO     | auto_archiver.core.config:parse:110 - ARCHIVERS: ['telegram_archiver', 'twitter_archiver', 'tiktok_archiver', 'youtubedl_archiver']
2023-03-14 11:52:44.363 | INFO     | auto_archiver.core.config:parse:111 - DATABASES: ['console_db', 'csv_db']
2023-03-14 11:52:44.363 | INFO     | auto_archiver.core.config:parse:112 - STORAGES: ['local_storage']
2023-03-14 11:52:44.363 | INFO     | auto_archiver.core.config:parse:113 - FORMATTER: html_formatter
2023-03-14 11:52:44.363 | DEBUG    | auto_archiver.feeders.cli_feeder:__iter__:28 - Processing https://miles.land
2023-03-14 11:52:44.364 | DEBUG    | auto_archiver.core.orchestrator:archive:66 - result.rearchivable=True for url='https://miles.land'
2023-03-14 11:52:44.364 | WARNING  | auto_archiver.databases.console_db:started:22 - STARTED Metadata(status='no archiver', _processed_at=datetime.datetime(2023, 3, 14, 18, 52, 44, 364021), metadata={'url': 'https://miles.land', 'folder': 'cli', 'tmp_dir': './tmpjdhyhj5x'}, media=[], rearchivable=True)
2023-03-14 11:52:44.364 | INFO     | auto_archiver.core.orchestrator:archive:87 - Trying archiver telegram_archiver for https://miles.land
2023-03-14 11:52:44.364 | INFO     | auto_archiver.core.orchestrator:archive:87 - Trying archiver twitter_archiver for https://miles.land
2023-03-14 11:52:44.364 | INFO     | auto_archiver.core.orchestrator:archive:87 - Trying archiver tiktok_archiver for https://miles.land
2023-03-14 11:52:44.364 | INFO     | auto_archiver.core.orchestrator:archive:87 - Trying archiver youtubedl_archiver for https://miles.land
[generic] Extracting URL: https://miles.land
[generic] miles: Downloading webpage
WARNING: [generic] Falling back on generic information extractor
[generic] miles: Extracting information
ERROR: Unsupported URL: https://miles.land
2023-03-14 11:52:45.051 | DEBUG    | auto_archiver.archivers.youtubedl_archiver:download:37 - No video - Youtube normal control flow: ERROR: Unsupported URL: https://miles.land
2023-03-14 11:52:45.051 | DEBUG    | auto_archiver.enrichers.hash_enricher:enrich:30 - calculating media hashes for url='https://miles.land' (using SHA-256)
2023-03-14 11:52:45.051 | DEBUG    | auto_archiver.enrichers.screenshot_enricher:enrich:27 - Enriching screenshot for url='https://miles.land'
2023-03-14 11:52:53.272 | DEBUG    | auto_archiver.enrichers.thumbnail_enricher:enrich:23 - generating thumbnails
2023-03-14 11:52:53.273 | DEBUG    | auto_archiver.enrichers.wacz_enricher:enrich:35 - generating WACZ for url='https://miles.land'
2023-03-14 11:52:53.273 | INFO     | auto_archiver.enrichers.wacz_enricher:enrich:61 - Running browsertrix-crawler: docker run --rm -v /Users/miles/Desktop/tmpjdhyhj5x:/crawls/ webrecorder/browsertrix-crawler crawl --url https://miles.land --scopeType page --generateWACZ --text --collection dd5fef44 --behaviors autoscroll,autoplay,autofetch,siteSpecific --behaviorTimeout 90 --timeout 90
{"logLevel":"info","timestamp":"2023-03-14T18:52:53.832Z","context":"general","message":"Page context being used with 1 worker","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:53.833Z","context":"general","message":"Set netIdleWait to 15 seconds","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:53.833Z","context":"general","message":"Seeds","details":[{"url":"https://miles.land/","include":[],"exclude":[],"scopeType":"page","sitemap":false,"allowHash":false,"maxExtraHops":0,"maxDepth":99999}]}
{"logLevel":"info","timestamp":"2023-03-14T18:52:54.094Z","context":"state","message":"Storing state in memory","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:54.416Z","context":"general","message":"Text Extraction: Enabled","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:54.515Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":0,"total":1,"pending":1,"limit":{"max":0,"hit":false},"pendingPages":["{\"url\":\"https://miles.land/\",\"seedId\":0,\"depth\":0,\"started\":\"2023-03-14T18:52:54.448Z\"}"]}}
{"logLevel":"error","timestamp":"2023-03-14T18:52:58.314Z","context":"general","message":"Invalid Seed \"mailto:[email protected]\" - URL must start with http:// or https://","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.338Z","context":"behavior","message":"Behaviors started","details":{"behaviorTimeout":90,"page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.339Z","context":"behavior","message":"Run Script Started","details":{"frameUrl":"https://miles.land/","page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.340Z","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"Skipping autoscroll, page seems to not be responsive to scrolling events","page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.341Z","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"done!","page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.341Z","context":"behavior","message":"Run Script Finished","details":{"frameUrl":"https://miles.land/","page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.341Z","context":"behavior","message":"Behaviors finished","details":{"finished":1,"page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.341Z","context":"pageStatus","message":"Page finished","details":{"page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.391Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":1,"total":1,"pending":0,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.391Z","context":"general","message":"Waiting to ensure pending data is written to WARCs...","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.396Z","context":"general","message":"Generating WACZ","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.398Z","context":"general","message":"Num WARC Files: 8","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.700Z","context":"general","message":"Validating passed pages.jsonl file\nReading and Indexing All WARCs\nWriting archives...\nWriting logs...\nGenerating page index from passed pages...\nHeader detected in the passed pages.jsonl file\nGenerating datapackage.json\nGenerating datapackage-digest.json\n","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.737Z","context":"general","message":"Crawl status: done","details":{}}
2023-03-14 11:52:58.886 | ERROR    | auto_archiver.core.orchestrator:feed_item:44 - Got unexpected error on item Metadata(status='no archiver', _processed_at=datetime.datetime(2023, 3, 14, 18, 52, 44, 364021), metadata={'url': 'https://miles.land', 'folder': 'cli', 'tmp_dir': './tmpjdhyhj5x'}, media=[Media(filename='./tmpjdhyhj5x/screenshot_8df3af2d.png', key=None, urls=[], _mimetype='image/png', properties={'id': 'screenshot'}), Media(filename='/Users/miles/Desktop/tmpjdhyhj5x/collections/dd5fef44/dd5fef44.wacz', key=None, urls=[], _mimetype=None, properties={'id': 'browsertrix'})], rearchivable=True): 'HashEnricher' object has no attribute 'algorithm'
Traceback (most recent call last):
  File "/Users/miles/.asdf/installs/python/3.10.1/lib/python3.10/site-packages/auto_archiver/core/orchestrator.py", line 37, in feed_item
    return self.archive(item)
  File "/Users/miles/.asdf/installs/python/3.10.1/lib/python3.10/site-packages/auto_archiver/core/orchestrator.py", line 110, in archive
    s.store(m, result)  # modifies media
  File "/Users/miles/.asdf/installs/python/3.10.1/lib/python3.10/site-packages/auto_archiver/storages/storage.py", line 46, in store
    self.set_key(media, item)
  File "/Users/miles/.asdf/installs/python/3.10.1/lib/python3.10/site-packages/auto_archiver/storages/storage.py", line 78, in set_key
    he = HashEnricher({"algorithm": "SHA-256", "chunksize": 1.6e7})
  File "/Users/miles/.asdf/installs/python/3.10.1/lib/python3.10/site-packages/auto_archiver/enrichers/hash_enricher.py", line 18, in __init__
    assert self.algorithm in algo_choices, f"Invalid hash algorithm selected, must be one of {algo_choices} (you selected {self.algorithm})."
AttributeError: 'HashEnricher' object has no attribute 'algorithm'

2023-03-14 11:52:58.887 | ERROR    | auto_archiver.databases.console_db:failed:25 - FAILED Metadata(status='no archiver', _processed_at=datetime.datetime(2023, 3, 14, 18, 52, 44, 364021), metadata={'url': 'https://miles.land', 'folder': 'cli', 'tmp_dir': './tmpjdhyhj5x'}, media=[Media(filename='./tmpjdhyhj5x/screenshot_8df3af2d.png', key=None, urls=[], _mimetype='image/png', properties={'id': 'screenshot'}), Media(filename='/Users/miles/Desktop/tmpjdhyhj5x/collections/dd5fef44/dd5fef44.wacz', key=None, urls=[], _mimetype=None, properties={'id': 'browsertrix'})], rearchivable=True)
2023-03-14 11:52:58.887 | SUCCESS  | auto_archiver.feeders.cli_feeder:__iter__:30 - Processed 1 URL(s)

I can try to investigate and submit a PR, but figured I'd open the issue just to have.
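Without having dug into the real constructor, the traceback suggests the dict passed in storage.py never gets turned into attributes; a hypothetical defensive pattern (not the actual HashEnricher code) would be something like:

# hypothetical sketch of a defensive __init__, not the project's actual implementation
ALGO_CHOICES = {"SHA-256", "SHA3-512"}

class HashEnricherSketch:
    def __init__(self, config: dict = None):
        config = config or {}
        # fall back to a default instead of assuming the attribute was set elsewhere
        self.algorithm = config.get("algorithm", "SHA-256")
        self.chunksize = config.get("chunksize", int(1.6e7))
        assert self.algorithm in ALGO_CHOICES, \
            f"Invalid hash algorithm selected, must be one of {ALGO_CHOICES} (you selected {self.algorithm})."

HashEnricherSketch({"algorithm": "SHA-256", "chunksize": 1.6e7})  # mirrors the call from storage.py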

Google Drive bug with leading /

In telethon, if there are subdirectories that need creating, sometimes a key is passed with a leading / character, which confuses the join.

I've worked around it by adding a catch, but it still needs cleaning up:

  • find the root cause
  • keep the catch but log it to the error log so we know if it happens again

https://github.com/bellingcat/auto-archiver/blob/main/storages/gd_storage.py

    def uploadf(self, file: str, key: str, **_kwargs):
        """
        1. for each sub-folder in the path check if exists or create
        2. upload file to root_id/other_paths.../filename
        """
        # doesn't work if key starts with / which can happen from telethon todo fix
        if key.startswith('/'):
            # remove first character ie /
            key = key[1:]

An example is: https://t.me/witnessdaily/169265

TikTok downloader stalling when video is unavailable - can't reproduce

A bug that I can't reliably reproduce, but it sometimes stalls the whole archiver for many hours until the archiver is restarted.

Given this URL:
https://www.tiktok.com/@jusscomfyyy/video/7090483393586089222

The tiktok downloader stalls.

https://github.com/krypton-byte/tiktok-downloader

The test app https://tkdown.herokuapp.com/ correctly reports an invalid URL.

class TiktokArchiver(Archiver):
    name = "tiktok"

    def download(self, url, check_if_exists=False):
        if 'tiktok.com' not in url:
            return False

        status = 'success'

        try:
            # really slow for some videos here 25minutes plus or stalls
            info = tiktok_downloader.info_post(url)
            key = self.get_key(f'{info.id}.mp4')
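One way to stop a single stalled download from blocking the whole run is to wrap that call in a hard timeout. A generic sketch using concurrent.futures (the 120-second limit is an arbitrary assumption, and note the stalled worker thread is not killed, the archiver simply moves on):

# timeout_sketch.py - generic guard around a call that sometimes hangs
import concurrent.futures
import tiktok_downloader  # the library used by TiktokArchiver, per the snippet above

def info_with_timeout(url: str, seconds: int = 120):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(tiktok_downloader.info_post, url)
    try:
        return future.result(timeout=seconds)
    except concurrent.futures.TimeoutError:
        return None  # treat as "could not fetch info" instead of hanging the archiver
    finally:
        # don't block here waiting for the (possibly stalled) worker thread
        pool.shutdown(wait=False)

info = info_with_timeout("https://www.tiktok.com/@jusscomfyyy/video/7090483393586089222")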

Specify hash algorithm in config

I needed to specify SHA3_512 rather than SHA256. Have a PR coming which passes this through.

  # in base_archiver use SHA256 or SHA3_512
  hash_algorithm: SHA3_512
  # hash_algorithm: SHA256
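For reference, both algorithms are available in the standard library, and hashing in chunks keeps memory bounded for large videos; a minimal sketch (chunk size and file name are placeholders):

# hash_sketch.py - chunked file hashing with either algorithm
import hashlib

def hash_file(path: str, algorithm: str = "SHA3-512", chunksize: int = 16_000_000) -> str:
    h = hashlib.sha3_512() if algorithm == "SHA3-512" else hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunksize):
            h.update(chunk)
    return h.hexdigest()

print(hash_file("video.mp4", algorithm="SHA3-512"))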

When .html is in the path screenshot saves as .html

http://brokenlinkcheckerchecker.com/pagea.html

The screenshot would save the png as wayback_pageb-2022-11-11t10-55-09-277235.html

A simple fix is in base_archiver.py - commented out a line at the bottom of the function:

    def _get_key_from_url(self, url, with_extension: str = None, append_datetime: bool = False):
        """
        Receives a URL and returns a slugified version of the URL path
        if a string is passed in @with_extension the slug is appended with it if there is no "." in the slug
        if @append_date is true, the key adds a timestamp after the URL slug and before the extension
        """
        url_path = urlparse(url).path
        path, ext = os.path.splitext(url_path)
        slug = slugify(path)
        if append_datetime:
            slug += "-" + slugify(datetime.datetime.utcnow().isoformat())
        if len(ext):
            slug += ext
        if with_extension is not None:
            # I have a url with .html in the path, and want the screenshot to be .png
            # eg http://brokenlinkcheckerchecker.com/pageb.html
            # am happy with .html.png as a file extension
            # commented out the following line to fix; unsure as to why it was here
            # if "." not in slug:
            slug += with_extension
        return self.get_key(slug)

which then gives wayback_pageb-2022-11-11t10-55-09-277235.html.png

Happy to do a PR if I haven't misunderstood anything here.

Auto Tweet the Hash

After every successful archive, post a tweet with the hash so that we can prove that on this day the picture/video was as archived.

Need to translate this older code to the new codebase. Use enricher probably.

https://github.com/djhmateer/auto-archiver/blob/main/auto_archive.py#L183

This code uses a SQL database to act as a queue, and another service runs every x minutes to poll the queue to see if anything should be tweeted. It checks NextTweetTime so as not to bombard the API. This has been working for a few months on the free Twitter API tier, which gives 1500 tweets per month.
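A minimal sketch of just the tweeting side with Tweepy's v2 client; the credential names are placeholders, and the queueing/rate-limit logic described above is deliberately left out:

# tweet_hash_sketch.py - illustrative only
import tweepy

client = tweepy.Client(
    consumer_key="API_KEY", consumer_secret="API_SECRET",            # placeholders
    access_token="ACCESS_TOKEN", access_token_secret="ACCESS_SECRET",
)

def tweet_archive_hash(url: str, sha256: str) -> None:
    # keep it short: the hash is what proves the archived content existed in this form today
    client.create_tweet(text=f"Archived {url}\nSHA-256: {sha256}")

tweet_archive_hash("https://example.com/post", "e3b0c44298fc1c149afbf4c8996fb924...")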

Scrape Youtube comments, livechats

youtube-dl, or at least yt-dlp, is capable of downloading and dumping live chat data, machine and manual transcripts of videos, and so on. YouTube comments can be grabbed with the official API with some difficulty, or relatively easily with this project. I've found having all of these around useful for some OSINT tasks.
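For reference, a sketch of the yt-dlp options involved; option names come from yt-dlp's embedded API and are worth double-checking against its docs (live chat is exposed as a "live_chat" subtitle track):

# ytdlp_extras_sketch.py - grab transcripts, live chat and comments alongside (or instead of) the video
from yt_dlp import YoutubeDL

ydl_opts = {
    "skip_download": True,          # metadata only; drop this to also download the video
    "writesubtitles": True,         # manual transcripts + live chat
    "writeautomaticsub": True,      # machine transcripts
    "subtitleslangs": ["en", "live_chat"],
    "getcomments": True,            # pull comments into the info dict
    "writeinfojson": True,          # persist everything to an .info.json next to the media
}
with YoutubeDL(ydl_opts) as ydl:
    info = ydl.extract_info("https://www.youtube.com/watch?v=VIDEO_ID", download=True)
print(len(info.get("comments") or []), "comments fetched")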

Project should be easier to set up and run

Currently, running this project in an automated way requires creating a Digital Ocean Spaces bucket and manually managing cron jobs on a Linux server. Ideally, this would be simpler to deploy so that a new archiving spreadsheet could be set up in a user friendly way, even for non-programmers.

One promising possibility for moving in this direction is as a Google Sheets Add On, but other ideas can also be explored and evaluated.

Use Entry Number for the folder in Google Storage

This is probably an issue for me to implement in the google drive storage.

eg instead of a folder name: https-www-youtube-com-watch-v-wlahzurxrjy-list-pl7a55eb715fbb2940-index-7

I would like it to be the entry number, e.g. AA001, which is taken from the Google spreadsheet.

Perhaps patch in via

filename_generator: static to be filename_generator: entry_number

Test and Demo spreadsheet with urls which test all aspects

A first step to testable code could be a spreadsheet which has expected output columns. This would make it easy to see if anything isn't working (regression testing).

I've got something already and will improve and post here.

This would also be useful for demo purposes, so users can see what is happening (and what should happen).

In order of usefulness to clients / most common links

Twitter

  • 1 image (should work)
  • multiple images (should work)
  • tweet with media sensitive image(s)
  • tweet that brings a login prompt (trick is to get rid of part of the url)
  • check tweet image size is max resolution
  • tweet that contains a non-Twitter video URL, as the intent is probably to get images from the tweet

then video:

  • 1 video
  • multiple videos will not work

Facebook

  • 1 image - will not work
  • multiple images - will not work

then video:

  • 1 video. Handled by youtubedl
  • multiple videos - unusual and won't work

etc...

In order of developers

  • TelethonArchiver (Telegram's API)
  • TikTokArchiver (always getting invalid URL so far)
  • TwitterAPIArchiver (handles all tweets if API key is there)
  • YoutubeDLArchiver (handles youtube, and facebook video)
  • TelegramArchiver (backup if Telethon doesn't work, which is common)
  • TwitterArchiver (only if Twitter API not working)
  • VkArchiver
  • WaybackArchiver

Archive links from the Discord server

Giancarlo spoke to the Bellingcat Community Discord yesterday, mentioning among other things that Bellingcat has an auto-archiver that works with links dropped manually into a Google Sheet. This seems like it could easily be extended to auto-archive any link posted in specific channels on the Discord. It might also turn out to be a faster way to use the auto-archiver for any researchers who use Discord in whatever capacity. The idea is that it gobbles up any URL posted in any message in a channel that matches one of the configured archivers.

If you want this and it wouldn't cost too much to run on specific channels, then I'm happy to build it. There are a few options for implementation that someone might be able to provide some guidance on.

Bot vs batch

I've started creating a bot. I figure that's probably better in terms of not exceeding the Discord API limits by doing a "get everything on this channel" frequently, but depending on how you guys like to run these archivers, there may be disadvantages in that it kinda has to stay running all the time. I suppose there'd be no harm in doing both, a bot that on startup reads the channel histories up to a max # of messages, and then waits quietly for new messages. Where do you stand on that?

How to trigger the actual archiving

You could:

  • Add it to a Google Sheet, and that's it. Let the existing scheduled archiver take care of any added URLs. A fair bit simpler.
  • On receiving a message with a link in it, add it to the sheet and schedule an archive to occur in the same discord_archive.py program. This is cooler because a bot could then put a little react emoji on messages to indicate the archive status (see the sketch after this list).
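
Purely as a sketch of the second option, assuming discord.py and a hypothetical add_url_to_sheet() helper that appends a row to the Google Sheet (none of this exists in the codebase yet):

import re

import discord

URL_RE = re.compile(r"https?://\S+")

intents = discord.Intents.default()
intents.message_content = True  # required to read message text
client = discord.Client(intents=intents)

def add_url_to_sheet(url: str) -> None:
    # Placeholder: in reality this would append a row via gspread so the existing
    # sheet-driven archiver picks it up.
    print("queued for archiving:", url)

@client.event
async def on_message(message: discord.Message):
    if message.author.bot:
        return
    for url in URL_RE.findall(message.content):
        add_url_to_sheet(url)
        await message.add_reaction("🗃️")  # the little status emoji mentioned above

client.run("YOUR_DISCORD_BOT_TOKEN")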

Deduplication

Considering this will be adding a bunch of new links to the archives, I would be worried about whether it's going to clobber previously archived pages in the S3 backend. This is something the archivers themselves are meant to detect, right? I don't think the Twitter one does this. Does DigitalOcean's S3 support version history, just in case? And the archivers don't overwrite anything if they hit a 404, right?


[demo video attachment: discordbot.mov]

"failed: no archiver" in google sheet although can download the screenshot.

I attempted to set up the auto-archiver by following this instructional video (https://www.youtube.com/watch?v=VfAhcuV2tLQ).

Initially, the code was running and the archive status in the Google sheet showed "Archive in progress," but at the end, it displayed "failed: no archiver".

The logs indicate that I have successfully scraped some data. I also downloaded some screenshots (YouTube and video) and the YouTube video in webm format. However, I am unsure why the data is not updated in the Google Sheet.

Also, is it necessary to use browsertrix-crawler? I have installed Docker Desktop and my machine can run browsertrix-crawler, but the error persists.

The error messages are as below:
ERROR | main:process_sheet:138 - Got unexpected error in row 2 with twitter for url='https://twitter.com/anwaribrahim/status/1642750503422685187?cxt=HHwWhsDTsaK4nMwtAAAA': [Errno 2] No such file or directory: '/Users/usr/Documents/python/archiver/browsertrix/crawls/profile.tar.gz'
Traceback (most recent call last):
File "/Users/usr/Documents/python/archiver/auto_archive.py", line 133, in process_sheet
result = archiver.download(url, check_if_exists=c.check_if_exists)
File "/Users/usr/Documents/python/archiver/archivers/twitter_archiver.py", line 42, in download
wacz = self.get_wacz(url)
File "/Users/usr/Documents/python/archiver/archivers/base_archiver.py", line 234, in get_wacz
shutil.copyfile(self.browsertrix.profile, os.path.join(browsertrix_home, "profile.tar.gz"))
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/shutil.py", line 264, in copyfile
with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/usr/Documents/python/archiver/browsertrix/crawls/profile.tar.gz'

2023-04-13 14:47:20.800 | SUCCESS | main:process_sheet:167 - Finished worksheet Sheet1

Google service account credentials usage doesn't match README instructions

Hi there! The README says the Google service account credentials should be placed at ~/.config/gspread/service_account.json, which is where gspread looks by default. However, line 17 of auto_auto_archive.py specifies a credentials file in the same directory as the script, so the script won't run unless you drop service_account.json in the same directory.

So it seems that either auto_auto_archive.py should be changed to call gspread.service_account without a filename parameter; or the README should be changed to specify that service_account.json be created in the application root. If you have a preference, I can open a pull request and change it either way.
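
For reference, both call styles in question are supported by gspread; the issue is only about which one the README and the script should agree on:

import gspread

# Option A: use gspread's default location, ~/.config/gspread/service_account.json,
# as the README currently describes
gc = gspread.service_account()

# Option B: point at a file in the application root, matching what the script does today
gc = gspread.service_account(filename="service_account.json")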

Support Instagram image posts

It would be helpful to have support for archiving Instagram image posts as well as videos (which are currently archived with youtube-dl). This would likely require additional authentication credentials, like a cookie, to be specified in the configuration file.

new enricher: store HTTP headers and SSL certificate

Create a new enricher that can

  • (optionally) store the HTTP request headers - need to think of use-cases and how to be comprehensive
  • (optionally) store the HTTP response headers - same as above
  • (optionally) store the SSL certificate of https connections

The display in the html_formatter should probably be initially hidden.

Another option is to design a high-level activity log that captures all actions and logs and appends them to the end of the HTML report.
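
As a rough sketch of what such an enricher could capture (this is not the project's enricher API, just an illustration of the data involved):

import ssl
from urllib.parse import urlparse

import requests

def collect_http_and_tls_metadata(url: str) -> dict:
    resp = requests.get(url, timeout=30)
    metadata = {
        "request_headers": dict(resp.request.headers),
        "response_headers": dict(resp.headers),
    }
    parsed = urlparse(url)
    if parsed.scheme == "https":
        # PEM-encoded certificate of the server we connected to
        metadata["ssl_certificate"] = ssl.get_server_certificate(
            (parsed.hostname, parsed.port or 443)
        )
    return metadata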

Facebook image archiving

Archiving of image(s) on Facebook is not supported yet and would be very useful.

Placeholder issue to collect ideas on how it could be done.

Background

https://github.com/djhmateer/auto-archiver#archive-logic has a list of what works and what doesn't. Facebook video works using yt-dlp.

In the fork above, to get a Facebook screenshot I am using automation to click through the accept-cookies prompt, as we don't want the cookie popup in the screenshot.
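
As a rough illustration of that approach with Selenium (the selector and button text are assumptions and will change with Facebook's markup, so treat this as a sketch only):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

post_url = "https://www.facebook.com/<page>/posts/<post-id>"  # placeholder URL

driver = webdriver.Firefox()
driver.get(post_url)
try:
    # Hypothetical selector for the consent dialog's accept button; the real markup
    # varies by region and changes over time.
    accept = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable(
            (By.XPATH, "//div[@role='button'][.//span[contains(., 'Allow all cookies')]]")
        )
    )
    accept.click()
except Exception:
    pass  # no consent dialog appeared
driver.save_screenshot("facebook_post.png")
driver.quit()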

To get a Facebook post link

"Each Facebook post has a timestamp on the top (it may be something like Just now, 3 mins or Yesterday). This timestamp contains the link to your post. So, to copy it, simply hover your mouse over the timestamp, right click, then copy link address"

Example

As an example of Facebook images which we would like to archive:

https://www.facebook.com/chelseymateerbeautician/posts/pfbid0mhimrwfeBpWKwBUFna28Q3RfaEK8HETcEpk1QXoEeFXHVwaa7oxLxKTHbBqu5nPpl

https://gist.github.com/pcardune/1332911 - potentially this may help.

#26 - @msramalho talked about the potential of https://archive.ph/

Multiple instances of auto-archiver and Proxmox / Azure

I'm hosting 3 instances of the auto-archiver on 3 separate VMs. I've allocated 4GB of RAM to each and the systems work well.

Is anyone running multiple instances on a single VM, and have you found any issues with simultaneous calls to ffmpeg / geckodriver? That is what concerns me the most.

Running Python in a virtual environment, e.g. pipenv run python auto_archive.py, should segregate that side, I guess.

Archive non-video media (images and sound)

Currently, auto-archiver relies on youtube-dl to download media, which only finds video sources. It would be a significant improvement to download images, and possibly audio as well.

Add Browsertrix support when using docker image

Issue: browsertrix-crawler is executed via docker (docker run ...) and it uses volumes to

  1. pass the profile.tar.gz file
  2. save the results of its execution

If the auto-archiver is running inside docker, we have a docker-in-docker situation and that can be problematic.
One workaround is to share the daemon of the host machine with the auto-archiver docker container via /var/run/docker.sock, for example:

docker run --rm -v /var/run/docker.sock:/var/run/docker.sock -v $PWD/secrets:/app/secrets -e SHARED_PATH=$PWD/secrets/crawls aa --config secrets/config-docker.yaml

However, doing this means that the -v volumes passed when running docker run -v ... browsertrix-crawler are only shared with the host (and not with the docker container running the archiver), so the profile file and the results of extraction are host-path dependent, which adds a layer of complexity.

Additionally, using /var/run/docker.sock is generally undesirable for security, as it gives a lot of permissions to the code in the container.

Challenge: can we find a secure and easy-to-use setup (both via docker and outside docker) for browsertrix-crawler? Would that be a docker-compose with 2 services communicating? A new service that responds to browsertrix-crawl requests?
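
To make the host-path constraint concrete, here is a minimal sketch of how an archiver running inside docker (with the host's /var/run/docker.sock mounted) has to launch browsertrix-crawler; the SHARED_PATH variable mirrors the example command above and the crawl flags are only illustrative:

import os
import subprocess

# SHARED_PATH must be a path on the *host*, because the docker CLI below talks to the
# host daemon via the mounted /var/run/docker.sock, not to this container's filesystem.
shared_host_path = os.environ["SHARED_PATH"]  # e.g. $PWD/secrets/crawls on the host

subprocess.run([
    "docker", "run", "--rm",
    "-v", f"{shared_host_path}:/crawls/",  # resolved on the host, not in this container
    "webrecorder/browsertrix-crawler", "crawl",
    "--url", "https://example.com/",
    "--generateWACZ",
], check=True)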

Whitelist and Blacklist of Worksheet

We have a spreadsheet with multiple worksheets and I'd like to whitelist or blacklist based on the title.

The reason is that one of the worksheets is an exact copy of the worksheet that I want archived, with the same column names. So the archiver picks it up when we don't want it.

I propose adding 2 extra config items, something like this:

execution:
  # spreadsheet name - can be overwritten with CMD --sheet=
  sheet: "Test Hashing"

  # worksheet to blacklist. Leave blank (the default) for none. Useful if users want a MASTERSHEET exact copy of the
  # working worksheet
  worksheet_blacklist: MASTERSHEET
  # only check this worksheet rather than iterating through all worksheets in the spreadsheet. If whitelist is used 
  # then blacklist is ignored as whitelist is most restrictive.
  worksheet_whitelist: Sheet1

I only need a single item in each of the 'lists'.

Happy to code this up and do a PR.
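
A minimal sketch of how the proposed options could be applied when iterating worksheets; the option names follow the config above (they are not existing auto-archiver settings) and process_worksheet stands in for the existing per-worksheet loop:

import gspread

worksheet_whitelist = ["Sheet1"]       # proposed worksheet_whitelist
worksheet_blacklist = ["MASTERSHEET"]  # proposed worksheet_blacklist

def process_worksheet(wks) -> None:
    # Placeholder for the existing per-worksheet archiving logic.
    print("archiving worksheet:", wks.title)

gc = gspread.service_account()
spreadsheet = gc.open("Test Hashing")

for wks in spreadsheet.worksheets():
    # whitelist is most restrictive: if set, only listed worksheets are processed
    if worksheet_whitelist and wks.title not in worksheet_whitelist:
        continue
    if wks.title in worksheet_blacklist:
        continue
    process_worksheet(wks)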
