rs-video-processor's Introduction

Richmond Sunlight Video Processor

The video OCR processor for Richmond Sunlight.

Purpose

This downloads video from the Virginia General Assembly's floor-session video archive and subjects it to various types of analysis. At this writing, that includes OCRing the on-screen chyrons, facial recognition, and closed-caption extraction. To come: voice pitch analysis and improved facial recognition.

History

The video processor was put together, piece by piece, over a decade, as a series of Bash and PHP scripts. This is an effort to consolidate those and turn them into their own project. At the moment, it's still a series of Bash and PHP scripts, lashed together with twine, but isolating them as their own project will make it easier to standardize and improve them.

Infrastructure

It lives on a compute-optimized EC2 instance. Source updates are delivered via Travis CI -> S3, from which the instance pulls updates on boot. (Note that the includes/ directory is pulled from the deploy branch of the richmondsunlight.com repository on each build.) The instance is stopped by default, and is only started once rs-machine identifies a new video's availability. rs-machine communicates this information via SQS, though it fires up the rs-video-processor EC2 instance directly. rs-video-processor grabs the first entry from SQS to run through its processing pipeline, and continues to loop over available SQS entries as long as they exist. When the queue is empty, it shuts itself down.
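In rough terms, the poll-and-shutdown loop looks like the sketch below. This is illustrative rather than the literal contents of handler.sh; $QUEUE_URL and process_video are placeholders.

```bash
#!/bin/bash
# Illustrative sketch of the poll/process/shutdown loop; not the real handler.sh.
while true; do
    message=$(aws sqs receive-message --queue-url "$QUEUE_URL" --max-number-of-messages 1)
    if [ -z "$message" ]; then
        # Queue is empty: power down until rs-machine starts the instance again.
        sudo shutdown -h now
        exit 0
    fi
    video_url=$(echo "$message" | jq -r '.Messages[0].Body')
    receipt=$(echo "$message" | jq -r '.Messages[0].ReceiptHandle')
    process_video "$video_url"    # placeholder for the processing pipeline
    aws sqs delete-message --queue-url "$QUEUE_URL" --receipt-handle "$receipt"
done
```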

rs-video-processor's Issues

Design reprocess functionality

Devise a method to queue videos for reprocessing, in whole or in part. Implementation is out of the scope of this issue—simply figure out what it looks like, and how it will work.

Halt progress if the video is already in the database

We need a verification process, early on, that checks whether this video is already in the database. Perhaps that should be back on rs-machine, perhaps in rs-video-processor, I'm not sure. But, right now, this will cheerfully attempt to ingest a preexisting video.
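A minimal sketch of what that check could look like in Bash, assuming the comparison is against the files table and that the video's source URL is the thing compared; the column name and database name here are illustrative only.

```bash
# Bail out early if this video already has a record in the files table.
existing=$(mysql -N -B -e \
    "SELECT id FROM files WHERE path = '$VIDEO_URL' LIMIT 1;" richmondsunlight)
if [ -n "$existing" ]; then
    echo "Video already in the database (files.id $existing); halting." >&2
    exit 0
fi
```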

Figure out what we want to test

  • can retrieve video from legislature
  • can save a copy of the video to the EC2 instance
  • screenshots are extracted accurately (number, md5)
  • the chyrons are being cropped properly (not too big, not too small)
  • OCR gets chyrons within X% accuracy, for text and timestamp
  • closed-caption text can be found in JSON format
  • closed-caption text is correctly converted to SRT
  • closed-caption text saves to the database correctly
  • SRT is correctly converted to a transcript
  • can save video to S3 bucket
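As an example of the shape these could take, here is a sketch of the screenshot check, assuming a fixture file of known-good MD5 sums; the fixtures/ path is hypothetical.

```bash
# Verify both the number of extracted screenshots and their contents.
expected=$(wc -l < fixtures/screenshots.md5)
extracted=$(ls screenshots/*.jpg 2>/dev/null | wc -l)
if [ "$extracted" -ne "$expected" ]; then
    echo "Expected $expected screenshots, got $extracted" >&2
    exit 1
fi
md5sum -c --quiet fixtures/screenshots.md5 || exit 1
```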

Check on the chyron-crop argument order

I'm seeing lots of errors scroll by in my terminal, and it looks like we're not getting chyron information OCRed. I'm pretty sure there's a crop-order problem, i.e. mixing up the order of height & width and the start coordinates, in ocr.sh's invocation of ImageMagick.
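For reference, ImageMagick puts the size before the offsets in its crop geometry, WIDTHxHEIGHT+X+Y, so a correct invocation looks roughly like this (the pixel values are placeholders, not the real chyron box):

```bash
# Crop a region 640 px wide by 60 px tall, starting 0 px from the left and
# 420 px from the top; +repage discards the original canvas geometry.
convert screenshot.jpg -crop 640x60+0+420 +repage chyron.jpg
```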

Convert Granicus captions to SRT

The Granicus captions (#19) are in their own JSON format, and not anything standard, like SRT. Create a method to convert Granicus captions into SRT, and integrate it into the conversion pipeline.

First, though, research this—it's possible that somebody has already written this code. Perhaps Granicus is using a standard (e.g., SAMI?), but isn't labeling it as such.

  • determine whether Granicus is adhering to a standard
  • determine whether an existing program exists to perform this conversion
  • if necessary, write a conversion method
  • put the conversion method at the end of the caption-importing pipeline
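If nothing off-the-shelf turns up, the conversion itself should be small. A hedged sketch with jq and awk follows; it assumes, purely for illustration, that each caption is a JSON object with startTime and endTime in seconds and a text field, which is not a verified description of the Granicus format.

```bash
#!/bin/bash
# Convert a (hypothetical) Granicus caption JSON file to SRT.
jq -r '.[] | [.startTime, .endTime, .text] | @tsv' captions.json |
awk -F'\t' '
function srt_time(s,    h, m, whole, ms) {
    h = int(s / 3600); m = int((s % 3600) / 60)
    whole = int(s % 60); ms = int((s - int(s)) * 1000)
    return sprintf("%02d:%02d:%02d,%03d", h, m, whole, ms)
}
{ printf "%d\n%s --> %s\n%s\n\n", NR, srt_time($1), srt_time($2), $3 }
' > captions.srt
```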

Replace SRT ingestion with WebVTT ingestion

The Video class and the scraper assume that the provided format will be SRT. Modify that code to instead work on WebVTT files, now that Granicus has moved to WebVTT captions (#21).

Within class.Video.php, these methods must be modified:

  • parse_sbv: clone, change to be parse_webvtt
  • normalize_line_endings
  • time_shift_srt: make this build atop parse_sbv, so it doesn't touch file contents
  • eliminate_duplicates: eliminate this entirely, semi-ironically
  • srt_to_database: make this build atop parse_sbv, so it doesn't touch file contents
  • srt_to_transcript: eliminate entirely

Upload video to Internet Archive

  • write / incorporate the functionality
  • store the API auth pair in Travis
  • drop the functionality into the processing pipeline
  • ensure that this is the URL that's used for path in the files table
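The internetarchive command-line tool (ia) is one likely route; a sketch, assuming credentials have already been configured (e.g., via ia configure) and using placeholder identifier and metadata values:

```bash
# Upload the processed video to the Internet Archive.
ia upload va-house-floor-2020-01-08 video.mp4 \
    --metadata="mediatype:movies" \
    --metadata="title:Virginia House of Delegates floor session, 2020-01-08"
```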

Add AWS key, secret to settings file

Set up AWS_ACCESS_KEY and AWS_SECRET_KEY values within the openva/richmondsunlight.com settings file, and have config_variables.sh populate them. We need them available to get_video.php for SQS access.
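A sketch of how config_variables.sh might populate them, assuming the settings file is the PHP include at includes/settings.inc.php and defines constants with those names; both the path and the extraction method are assumptions.

```bash
# Expose the AWS credentials from the shared PHP settings file to Bash,
# so that the AWS CLI and get_video.php read the same values.
export AWS_ACCESS_KEY="$(php -r "include 'includes/settings.inc.php'; echo AWS_ACCESS_KEY;")"
export AWS_SECRET_KEY="$(php -r "include 'includes/settings.inc.php'; echo AWS_SECRET_KEY;")"
```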

Set up AWS permissions role

This instance needs to be able to read from SQS and write to S3. Add the relevant AWS credentials to Travis.

Share variables between Bash and PHP

A downside of going back and forth between Bash and PHP is that there's no common memory between the two. This means continually re-contextualizing what this video is—when it's from, what chamber it's in, what committee it is, etc. The solution is to build a config file for each video.

Seems to me that either YAML or JSON is the way to go. Of the tools that can be used within Bash, jq is the strongest option, which points to JSON being the better choice.
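A minimal sketch of that per-video file, written and re-read from Bash with jq; the field names are illustrative, and the PHP side would read the same file with json_decode().

```bash
# Write the per-video context once, at the top of the pipeline.
jq -n --arg chamber "house" --arg date "2020-01-08" --arg committee "" \
    '{chamber: $chamber, date: $date, committee: $committee}' > video.json

# Any later Bash step can recover the same context.
chamber="$(jq -r '.chamber' video.json)"
date="$(jq -r '.date' video.json)"
```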

Use an exit trap on handler.sh

We don't want to just flat-out exit—we want to move onto the next video in the queue and, more important, shut down if we're finished completely.
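In sketch form, with hypothetical helper names standing in for the real logic:

```bash
# finish() runs whenever handler.sh exits, for any reason.
finish() {
    if queue_has_entries; then    # hypothetical helper that polls SQS
        exec ./handler.sh         # move on to the next video
    else
        sudo shutdown -h now      # nothing left to do: power down
    fi
}
trap finish EXIT
```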

Rethink capture_directory

What does this even mean, now that these are stored on S3? See how we're using this, and rethink how this is being stored.

Have rs-machine start this instance

The options include having rs-machine start this via the AWS CLI and having SQS trigger CloudWatch to start EC2.

Start by reviewing the comments on #7.

  • identify a method that will work
  • implement it
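The AWS CLI route is the simpler of the two; from rs-machine it amounts to something like this (the instance ID is a placeholder):

```bash
# Start the normally-stopped rs-video-processor instance.
aws ec2 start-instances --instance-ids i-0123456789abcdef0
```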

Set up an RSS checker on Machine

Acceptance criteria:

  • polls House video RSS
  • polls Senate video RSS
  • discovers new entries in the RSS
  • transfers video to S3
  • logs the URL of a new entry to SQS
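In rough Bash terms, the core of the checker might look like the sketch below; the feed URL, seen-list file, and queue URL are placeholders, and real feed parsing should be more careful than a grep.

```bash
# Poll one chamber's video RSS feed and enqueue any URL not seen before.
curl -s "$HOUSE_RSS_URL" | grep -o '<link>[^<]*</link>' | sed 's/<[^>]*>//g' |
while read -r url; do
    if ! grep -qxF "$url" seen-videos.txt; then
        aws sqs send-message --queue-url "$QUEUE_URL" --message-body "$url"
        echo "$url" >> seen-videos.txt
    fi
done
```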

Debug save_metadata.php

This generates pages of errors and doesn't work.

To Do

  • bring the error verbosity way down
  • figure out why mplayer isn't returning useful results
  • fix class/array error
  • have this return only the video ID, as handler.sh expects
  • figure out why the capture rate is way off
  • figure out why capture_directory isn't populated

Figure out how to use Spot Instances

  • How are they requested, programmatically?
  • Is their source a build script (e.g., CodeDeploy, CloudFormation), or an AMI?
  • Is there a minimum-guaranteed run time? Is that long enough for this purpose?
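On the first question: Spot Instances can be requested directly from the AWS CLI, along these lines (spec.json, containing the AMI, instance type, key pair, and security group, is a placeholder file):

```bash
# Request a single one-time Spot Instance using a prepared launch specification.
aws ec2 request-spot-instances \
    --instance-count 1 \
    --type one-time \
    --launch-specification file://spec.json
```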

Figure out how to wire CD to an EC2 instance that's not running

Normally I'd use GitHub -> Travis -> CodeDeploy -> EC2, but if the EC2 instance isn't running, CodeDeploy will error out (I assume). What's the mechanism for deploying updates to EC2?

One guess: let Travis pass code along to CodeDeploy. CodeDeploy will report success back to Travis (because Travis doesn't wait for the actual deployment process). Internally, CodeDeploy will store the new version of the software on S3. Then have the EC2 instance run CodeDeploy at the CLI on each boot, which should pull down the latest version of the code.
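A simpler fallback, consistent with the S3-pull-on-boot arrangement described under Infrastructure, is to skip the agent-side CodeDeploy step and have a boot script sync the latest build artifact straight from S3 (bucket name and paths below are placeholders):

```bash
# Runs at boot (e.g., from a systemd unit): fetch whatever Travis last
# pushed to S3 before the handler starts.
aws s3 sync s3://rs-video-processor-deploy/latest/ /opt/rs-video-processor/ --delete
```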

Change chyron dimensions

The existing dimensions are for 720-pixel-wide video, ripped from DVDs, but the downloaded video is only 640 pixels wide.

Entire thing runs on an infinite loop

Having the script check for an additional video on completion went fine, until I moved that logic into an exit trap (#38), and now of course the whole thing runs infinitely.

The solution is probably to set an environment variable when nothing is found in the queue, and not have the script rerun itself when that variable is set.
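In sketch form (the variable name is illustrative):

```bash
# In the main flow, when polling SQS comes back empty:
QUEUE_EMPTY=1

# In the exit trap, only relaunch when the queue still had work in it:
finish() {
    if [ "$QUEUE_EMPTY" = "1" ]; then
        sudo shutdown -h now
    else
        exec ./handler.sh    # hypothetical self-relaunch
    fi
}
trap finish EXIT
```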

Think through how to handle failures

If we pull a video out of SQS, it's no longer in the queue—the only record of its need is within the currently-running process. So if one of the many components fails, that's the end of that video.

The only way to revisit it would be to modify Machine's RSS poller to compare the available videos against both those it has logged to SQS and those present in the database, allowing for re-queueing. But that is not how it works now.

Another mechanism could be to modify the finish() function in handler.sh, so that in the case of failure (not exiting, but actual failure), the video is re-added to SQS. But I worry that just leads to an infinite loop of video re-processing.
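One way to blunt that risk is to cap the number of re-queues with an SQS message attribute; a sketch, where the attribute name and the limit of three are illustrative:

```bash
# On failure, re-queue the video unless it has already been retried 3 times.
retries="${RETRIES:-0}"
if [ "$retries" -lt 3 ]; then
    aws sqs send-message \
        --queue-url "$QUEUE_URL" \
        --message-body "$VIDEO_URL" \
        --message-attributes \
        "{\"retries\":{\"DataType\":\"Number\",\"StringValue\":\"$((retries + 1))\"}}"
fi
```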

Turn resolve_chyrons.php into a CLI-based application

This should be fine, except for the manual-resolution step. That will need to remain in the admin section, but we'll want a Slack integration to highlight the need to perform that manual review periodically, when enough cruft builds up.
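The Slack piece can be a plain incoming-webhook call from the pipeline; a sketch, assuming a webhook URL in $SLACK_WEBHOOK_URL and a hypothetical count of unresolved chyrons in $pending:

```bash
# Nudge the channel once enough unresolved chyrons have piled up.
if [ "$pending" -gt 100 ]; then
    curl -s -X POST -H 'Content-type: application/json' \
        --data "{\"text\": \"$pending chyrons are awaiting manual review.\"}" \
        "$SLACK_WEBHOOK_URL"
fi
```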

Create a record for this video in the database

At the moment, there's no functionality for this. Create a new PHP script, using the existing admin script as a guide, to insert that new record, in an entirely non-interactive process.

Pull in caption-ingestion functionality

Right now, all of the caption-ingestion functionality lives within the Video class, and isn't triggered automatically, but instead from the video admin interface. Run these methods automatically when the caption files are loaded.

Create a process to export facial-recognition training data

We need a corpus of reliable facial data for all legislators, to retrain the model when new legislators come into office and old ones leave. Put together this export functionality.

  • figure out what the file structure should look like
  • export structured data
  • ensure that the export can be reviewed for accuracy, quickly and easily

Give ocr.sh access to $date

I'm not sure why, but $date isn't set within ocr.sh, which is a deal-breaker for it.

  • have ocr.sh test for the presence of $date, and bail if it doesn't have it
  • fix whatever is stopping it from having $date
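The guard in the first item is a few lines:

```bash
# Bail out early and loudly if the calling script didn't provide $date.
if [ -z "$date" ]; then
    echo "ocr.sh: \$date is not set; aborting." >&2
    exit 1
fi
```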
