rs-video-processor's Introduction

Richmond Sunlight Video Processor

The video OCR processor for Richmond Sunlight.

Purpose

This downloads video from the Virginia General Assembly's floor-session video archive and subjects it to various types of analysis. At this writing, that includes OCRing the on-screen chyrons, facial recognition, and closed-caption extraction. To come: voice pitch analysis and improved facial recognition.

History

The video processor was put together, piece by piece, over a decade, as a series of Bash and PHP scripts. This is an effort to consolidate those and turn them into their own project. At the moment, it's still a series of Bash and PHP scripts, lashed together with twine, but isolating them as their own project will make it easier to standardize and improve them.

Infrastructure

It lives on a compute-optimized EC2 instance. Source updates are delivered via Travis CI -> S3, from which the instance pulls updates on boot. (Note that the includes/ directory is pulled from the deploy branch of the richmondsunlight.com repository on each build.) The instance is stopped by default, and is only started once rs-machine identifies a new video's availability. rs-machine communicates this information via SQS, though it fires up the rs-video-processor EC2 instance directly. rs-video-processor grabs the first entry from SQS to run through its processing pipeline, and continues to loop over available SQS entries as long as they exist. When the queue is empty, it shuts itself down.
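In rough terms, the poll-and-shutdown loop looks like the sketch below. This is illustrative rather than the literal contents of handler.sh; $QUEUE_URL and process_video are placeholders.

```bash
#!/bin/bash
# Illustrative sketch of the poll/process/shutdown loop; not the real handler.sh.
while true; do
    message=$(aws sqs receive-message --queue-url "$QUEUE_URL" --max-number-of-messages 1)
    if [ -z "$message" ]; then
        # Queue is empty: power down until rs-machine starts the instance again.
        sudo shutdown -h now
        exit 0
    fi
    video_url=$(echo "$message" | jq -r '.Messages[0].Body')
    receipt=$(echo "$message" | jq -r '.Messages[0].ReceiptHandle')
    process_video "$video_url"    # placeholder for the processing pipeline
    aws sqs delete-message --queue-url "$QUEUE_URL" --receipt-handle "$receipt"
done
```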

rs-video-processor's Issues

Design reprocess functionality

Devise a method to queue videos for reprocessing, in whole or in part. Implementation is out of the scope of this issue—simply figure out what it looks like, and how it will work.

Halt progress if the video is already in the database

We need a verification process, early on, that checks whether this video is already in the database. Perhaps that should be back on rs-machine, perhaps in rs-video-processor, I'm not sure. But, right now, this will cheerfully attempt to ingest a preexisting video.
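A minimal sketch of what that check could look like in Bash, assuming the comparison is against the files table and that the video's source URL is the thing compared; the column name and database name here are illustrative only.

```bash
# Bail out early if this video already has a record in the files table.
existing=$(mysql -N -B -e \
    "SELECT id FROM files WHERE path = '$VIDEO_URL' LIMIT 1;" richmondsunlight)
if [ -n "$existing" ]; then
    echo "Video already in the database (files.id $existing); halting." >&2
    exit 0
fi
```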

Figure out what we want to test

  • can retrieve video from legislature
  • can save a copy of the video to the EC2 instance
  • screenshots are extracted accurately (number, md5)
  • the chyrons are being cropped properly (not too big, not too small)
  • OCR gets chyrons within X% accuracy, for text and timestamp
  • closed-caption text can be found in JSON format
  • closed-caption text is correctly converted to SRT
  • closed-caption text saves to the database correctly
  • SRT is correctly converted to a transcript
  • can save video to S3 bucket
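As an example of the shape these could take, here is a sketch of the screenshot check, assuming a fixture file of known-good MD5 sums; the fixtures/ path is hypothetical.

```bash
# Verify both the number of extracted screenshots and their contents.
expected=$(wc -l < fixtures/screenshots.md5)
extracted=$(ls screenshots/*.jpg 2>/dev/null | wc -l)
if [ "$extracted" -ne "$expected" ]; then
    echo "Expected $expected screenshots, got $extracted" >&2
    exit 1
fi
md5sum -c --quiet fixtures/screenshots.md5 || exit 1
```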

Check on the chyron-crop argument order

I'm seeing lots of errors scroll by in my terminal, and it looks like we're not getting chyron information OCRed. I'm pretty sure there's a crop-order problem, i.e. mixing up the order of height & width and the start coordinates, in ocr.sh's invocation of ImageMagick.
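For reference, ImageMagick puts the size before the offsets in its crop geometry, WIDTHxHEIGHT+X+Y, so a correct invocation looks roughly like this (the pixel values are placeholders, not the real chyron box):

```bash
# Crop a region 640 px wide by 60 px tall, starting 0 px from the left and
# 420 px from the top; +repage discards the original canvas geometry.
convert screenshot.jpg -crop 640x60+0+420 +repage chyron.jpg
```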

Convert Granicus captions to SRT

The Granicus captions (#19) are in their own JSON format, and not anything standard, like SRT. Create a method to convert Granicus captions into SRT, and integrate it into the conversion pipeline.

First, though, research this—it's possible that somebody has already written this code. Perhaps Granicus is using a standard (e.g., SAMI?), but isn't labeling it as such.

  • determine whether Granicus is adhering to a standard
  • determine whether an existing program exists to perform this conversion
  • if necessary, write a conversion method
  • put the conversion method at the end of the caption-importing pipeline
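If nothing off-the-shelf turns up, the conversion itself should be small. A hedged sketch with jq and awk follows; it assumes, purely for illustration, that each caption is a JSON object with startTime and endTime in seconds and a text field, which is not a verified description of the Granicus format.

```bash
#!/bin/bash
# Convert a (hypothetical) Granicus caption JSON file to SRT.
jq -r '.[] | [.startTime, .endTime, .text] | @tsv' captions.json |
awk -F'\t' '
function srt_time(s,    h, m, whole, ms) {
    h = int(s / 3600); m = int((s % 3600) / 60)
    whole = int(s % 60); ms = int((s - int(s)) * 1000)
    return sprintf("%02d:%02d:%02d,%03d", h, m, whole, ms)
}
{ printf "%d\n%s --> %s\n%s\n\n", NR, srt_time($1), srt_time($2), $3 }
' > captions.srt
```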

Replace SRT ingestion with WebVTT ingestion

The Video class and the scraper assume that the provided format will be SRT. Modify that code to instead work on WebVTT files, now that Granicus has moved to WebVTT captions (#21).

Within class.Video.php, these methods must be modified:

  • parse_sbv: clone, change to be parse_webvtt
  • normalize_line_endings
  • time_shift_srt: make this build atop parse_sbv, so it doesn't touch file contents
  • eliminate_duplicates: eliminate this entirely, semi-ironically
  • srt_to_database: make this build atop parse_sbv, so it doesn't touch file contents
  • srt_to_transcript: eliminate entirely

Upload video to Internet Archive

  • write / incorporate the functionality
  • store the API auth pair in Travis
  • drop the functionality into the processing pipeline
  • ensure that this is the URL that's used for path in the files table
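The internetarchive command-line tool (ia) is one likely route; a sketch, assuming credentials have already been configured (e.g., via ia configure) and using placeholder identifier and metadata values:

```bash
# Upload the processed video to the Internet Archive.
ia upload va-house-floor-2020-01-08 video.mp4 \
    --metadata="mediatype:movies" \
    --metadata="title:Virginia House of Delegates floor session, 2020-01-08"
```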

Add AWS key, secret to settings file

Set up AWS_ACCESS_KEY and AWS_SECRET_KEY values within the openva/richmondsunlight.com settings file, and have config_variables.sh populate them. We need them available to get_video.php for SQS access.
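A sketch of how config_variables.sh might populate them, assuming the settings file is the PHP include at includes/settings.inc.php and defines constants with those names; both the path and the extraction method are assumptions.

```bash
# Expose the AWS credentials from the shared PHP settings file to Bash,
# so that the AWS CLI and get_video.php read the same values.
export AWS_ACCESS_KEY="$(php -r "include 'includes/settings.inc.php'; echo AWS_ACCESS_KEY;")"
export AWS_SECRET_KEY="$(php -r "include 'includes/settings.inc.php'; echo AWS_SECRET_KEY;")"
```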

Set up AWS permissions role

This instance needs to be able to read from SQS and write to S3. Add the relevant AWS credentials to Travis.

Share variables between Bash and PHP

A downside of going back and forth between Bash and PHP is that there's no common memory between the two. This means continually re-contextualizing what this video is—when it's from, what chamber it's in, what committee it is, etc. The solution is to build a config file for each video.

Seems to me that either YAML or JSON is the way to go. Of the tools that can be used within Bash, jq is the strongest option, which points to JSON being the better choice.
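A minimal sketch of that per-video file, written and re-read from Bash with jq; the field names are illustrative, and the PHP side would read the same file with json_decode().

```bash
# Write the per-video context once, at the top of the pipeline.
jq -n --arg chamber "house" --arg date "2020-01-08" --arg committee "" \
    '{chamber: $chamber, date: $date, committee: $committee}' > video.json

# Any later Bash step can recover the same context.
chamber="$(jq -r '.chamber' video.json)"
date="$(jq -r '.date' video.json)"
```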

Use an exit trap on handler.sh

We don't want to just flat-out exit—we want to move onto the next video in the queue and, more important, shut down if we're finished completely.
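In sketch form, with hypothetical helper names standing in for the real logic:

```bash
# finish() runs whenever handler.sh exits, for any reason.
finish() {
    if queue_has_entries; then    # hypothetical helper that polls SQS
        exec ./handler.sh         # move on to the next video
    else
        sudo shutdown -h now      # nothing left to do: power down
    fi
}
trap finish EXIT
```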

Rethink capture_directory

What does this even mean, now that these are stored on S3? See how we're using this, and rethink how this is being stored.

Have rs-machine start this instance

The options include having rs-machine start this via the AWS CLI and having SQS trigger CloudWatch to start EC2.

Start by reviewing the comments on #7.

  • identify a method that will work
  • implement it
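The AWS CLI route is the simpler of the two; from rs-machine it amounts to something like this (the instance ID is a placeholder):

```bash
# Start the normally-stopped rs-video-processor instance.
aws ec2 start-instances --instance-ids i-0123456789abcdef0
```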

Set up an RSS checker on Machine

Acceptance criteria:

  • polls House video RSS
  • polls Senate video RSS
  • discovers new entries in the RSS
  • transfers video to S3
  • logs the URL of a new entry to SQS
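In rough Bash terms, the core of the checker might look like the sketch below; the feed URL, seen-list file, and queue URL are placeholders, and real feed parsing should be more careful than a grep.

```bash
# Poll one chamber's video RSS feed and enqueue any URL not seen before.
curl -s "$HOUSE_RSS_URL" | grep -o '<link>[^<]*</link>' | sed 's/<[^>]*>//g' |
while read -r url; do
    if ! grep -qxF "$url" seen-videos.txt; then
        aws sqs send-message --queue-url "$QUEUE_URL" --message-body "$url"
        echo "$url" >> seen-videos.txt
    fi
done
```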

Debug save_metadata.php

This generates pages of errors and doesn't work.

To Do

  • bring the error verbosity way down
  • figure out why mplayer isn't returning useful results
  • fix class/array error
  • have this return only the video ID, as handler.sh expects
  • figure out why the capture rate is way off
  • figure out why capture_directory isn't populated

Figure out how to use Spot Instances

  • How are they requested, programmatically?
  • Is their source a build script (e.g., CodeDeploy, CloudFormation), or an AMI?
  • Is there a minimum-guaranteed run time? Is that long enough for this purpose?
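On the first question: Spot Instances can be requested directly from the AWS CLI, along these lines (spec.json, containing the AMI, instance type, key pair, and security group, is a placeholder file):

```bash
# Request a single one-time Spot Instance using a prepared launch specification.
aws ec2 request-spot-instances \
    --instance-count 1 \
    --type one-time \
    --launch-specification file://spec.json
```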

Figure out how to wire CD to an EC2 instance that's not running

Normally I'd use GitHub -> Travis -> CodeDeploy -> EC2, but if the EC2 instance isn't running, CodeDeploy will error out (I assume). What's the mechanism for deploying updates to EC2?

One guess: let Travis pass code along to CodeDeploy. CodeDeploy will report success back to Travis (because Travis doesn't wait for the actual deployment process). Internally, CodeDeploy will store the new version of the software on S3. Then have the EC2 instance run CodeDeploy at the CLI on each boot, which should pull down the latest version of the code.
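A simpler fallback, consistent with the S3-pull-on-boot arrangement described under Infrastructure, is to skip the agent-side CodeDeploy step and have a boot script sync the latest build artifact straight from S3 (bucket name and paths below are placeholders):

```bash
# Runs at boot (e.g., from a systemd unit): fetch whatever Travis last
# pushed to S3 before the handler starts.
aws s3 sync s3://rs-video-processor-deploy/latest/ /opt/rs-video-processor/ --delete
```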

Change chyron dimensions

The existing dimensions are for 720-pixel-wide video, ripped from DVDs, but the downloaded video is only 640 pixels wide.

Entire thing runs on an infinite loop

Having the script check for an additional video on completion went fine, until I moved that logic into an exit trap (#38), and now of course the whole thing runs infinitely.

The solution is probably to set an environment variable when nothing is found in the queue, and not have the script rerun itself when that variable is set.
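In sketch form (the variable name is illustrative):

```bash
# In the main flow, when polling SQS comes back empty:
QUEUE_EMPTY=1

# In the exit trap, only relaunch when the queue still had work in it:
finish() {
    if [ "$QUEUE_EMPTY" = "1" ]; then
        sudo shutdown -h now
    else
        exec ./handler.sh    # hypothetical self-relaunch
    fi
}
trap finish EXIT
```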

Think through how to handle failures

If we pull a video out of SQS, it's no longer in the queue—the only record of its need is within the currently-running process. So if one of the many components fails, that's the end of that video.

The only way to revisit it would be to modify Machine's RSS poller to compare the available videos against both those it has logged to SQS and those present in the database, allowing for re-queueing. But that is not how it works now.

Another mechanism could be to modify the finish() function in handler.sh, so that in the case of failure (not exiting, but actual failure), the video is re-added to SQS. But I worry that just leads to an infinite loop of video re-processing.
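One way to blunt that risk is to cap the number of re-queues with an SQS message attribute; a sketch, where the attribute name and the limit of three are illustrative:

```bash
# On failure, re-queue the video unless it has already been retried 3 times.
retries="${RETRIES:-0}"
if [ "$retries" -lt 3 ]; then
    aws sqs send-message \
        --queue-url "$QUEUE_URL" \
        --message-body "$VIDEO_URL" \
        --message-attributes \
        "{\"retries\":{\"DataType\":\"Number\",\"StringValue\":\"$((retries + 1))\"}}"
fi
```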

Turn resolve_chyrons.php into a CLI-based application

This should be fine, except for the manual-resolution step. That will need to remain in the admin section, but we'll want a Slack integration to highlight the need to perform that manual review periodically, when enough cruft builds up.
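The Slack piece can be a plain incoming-webhook call from the pipeline; a sketch, assuming a webhook URL in $SLACK_WEBHOOK_URL and a hypothetical count of unresolved chyrons in $pending:

```bash
# Nudge the channel once enough unresolved chyrons have piled up.
if [ "$pending" -gt 100 ]; then
    curl -s -X POST -H 'Content-type: application/json' \
        --data "{\"text\": \"$pending chyrons are awaiting manual review.\"}" \
        "$SLACK_WEBHOOK_URL"
fi
```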

Create a record for this video in the database

At the moment, there's no functionality for this. Create a new PHP script, using the existing admin script as a guide, to insert that new record, in an entirely non-interactive process.

Pull in caption-ingestion functionality

Right now, all of the caption-ingestion functionality lives within the Video class, and isn't triggered automatically, but instead from the video admin interface. Run these methods automatically when the caption files are loaded.

Create a process to export facial-recognition training data

We need a corpus of reliable facial data for all legislators, to retrain the model when new legislators come into office and old ones leave. Put together this export functionality.

  • figure out what the file structure should look like
  • export structured data
  • ensure that the export can be reviewed for accuracy, quickly and easily

Give ocr.sh access to $date

I'm not sure why, but $date isn't set within ocr.sh, which is a deal-breaker for it.

  • have ocr.sh test for the presence of $date, and bail if it doesn't have it
  • fix whatever is stopping it from having $date
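The guard in the first item is a few lines:

```bash
# Bail out early and loudly if the calling script didn't provide $date.
if [ -z "$date" ]; then
    echo "ocr.sh: \$date is not set; aborting." >&2
    exit 1
fi
```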
