councildataproject / cdp-backend Goto Github PK

View Code? Open in Web Editor NEW

22.0 5.0 25.0 366.12 MB

Data storage utilities and processing pipelines used by CDP instances.

Home Page: https://councildataproject.org/cdp-backend

License: Mozilla Public License 2.0

Python 97.82% Just 2.18%

civic-tech government-data data-processing open-government infrastructure-as-code local-government infrastructure

cdp-backend's People

Contributors

Stargazers

Watchers

cdp-backend's Issues

Add script for processing local file

Use Case

Please provide a use case to help us understand your request in context

To process the various one-off events that occur, for example, I just started processing the 2021 election events for Seattle.

Solution

Please describe your ideal solution

I wrote a tiny script to do this in cdptools that takes a local file + some JSON file for the event / minutes details (the questions asked at the forum), uploads the video to Google storage then processes that uploaded file with the config JSON for event details.

In cdp-backend this is actually easier because all the ingestion models are seriabliable to JSON iirc. dataclasses_json.
So that would mean a user simply has to write the object in Python and serialize or just in pure JSON, then have a script accept the serialized ingestion model + the video path.

Alternatives

Please describe any alternatives you've considered, even if you've dismissed them

Dynamically Generate TypeScript DB Models, Transcript Model, and other Constants

Idea / Feature

Dynamically generate TypeScript object definitions from cdp-backend Python DB models.

Use Case / User Story

As a developer I want to be able to add, remove, or update a database model in a single location rather than multiple.

Solution

Add a script to cdp-backend that when ran will generate a TypeScript package of the database models + extras that we can push to npm on new version push.

Alternatives

Stakeholders

Backend maintainers
Frontend maintainers

Major Components

Database models TypeScript generation
Transcript File Model TypeScript generation
Database "ModelField" Python and TypeScript generation
Database constants TypeScript generation
script added as bin script to cdp-backend
task added to CI building and testing that attempts to generate these models on PR push / main build
task added to CI main building / publish job that pushes the generated TypeScript package dir to npm

Dependencies

Other Notes

Upgrade to GCP Speech-to-Text v2

To stay up-to-date we should upgrade to v2 when we get a chance, fortunately they released an upgrading guide here: https://github.com/googleapis/python-speech/blob/release-v2.0.0/UPGRADING.md

Adopt MyPy for type checking

Use Case

Please provide a use case to help us understand your request in context

As we add more code to the library, tests will largely ensure we aren't adding anything is breaking, but we should also check the documented types.

Solution

Please describe your ideal solution

Adopt mypy for static type checking and add to tox and CI workflows

Probably adopt Prefect's mypy base config as well.

[mypy]
ignore_missing_imports = True
disallow_untyped_defs = True
check_untyped_defs = True

There are bound to be some failures off the bat which is why we do this 🙂

Alternatives

Please describe any alternatives you've considered, even if you've dismissed them

Disable ReferenceField `auto_load` for all Models

Feature Description

A clear and concise description of the feature you're requesting.

Currently ReferenceField properties for all of our models are loading on parent model initialization. What this means is when I request a Vote from the database, it will first load the vote data, then for each reference field load the referenced model, then recursively do that down the model dependency tree. This is great for "simple" access but terrible for large scale access.

Use Case

Please provide a use case to help us understand your request in context.

I want to be able to query the database for all votes and not have my calls rejected by Firebase.

Solution

Please describe your ideal solution.

Turn off auto_load: https://octabyte.io/FireO/fields/reference-field/#auto-load

Additionally, because certain parts of our pipeline depend on / expect those referenced models to exist, the pipeline will likely need to be updated to fetch the referenced model when needed.

Alternatives

Please describe any alternatives you've considered, even if you've dismissed them.

Combine all the video processing tasks into a single task

Feature Description

A clear and concise description of the feature you're requesting.

Reduce the pipeline computation time by combining all the tasks that require the video on the worker together.

Use Case

Please provide a use case to help us understand your request in context.

Drastically reducing the pipeline computation time.

Solution

Please describe your ideal solution.

Move the function calls to generate thumbnails to the start of the pipeline with the strip audio and generate hash.

Alternatives

Please describe any alternatives you've considered, even if you've dismissed them.

Database ORM / Collection Models

Use Case

Please provide a use case to help us understand your request in context

Managing database schema + inserting new items into the database will be simpler with an ORM style API

Solution

Please describe your ideal solution

I.E.:

full_council = Event("Full Council", datetime(2020, 10, 7), "...")
database.insert(full_council)

Event Gather Pipeline: Move database uploads to after pipeline processing

Use Case

It would be nice to handle db uploads in the event gather pipeline to after all the processing logic (audio creation, transcript creation, etc.) so that any errors can get surfaced before uploading to the database.

Solution

This can be done storing the results of processing in a temporary container then doing all the database calls with the results of the processing

Invalid link in README

Describe the Bug

A clear and concise description of the bug.

cdp-backend/dev-infrastructure/README.md

Line 12 in 5836ee3

[example repo](https://github.com/CouncilDataProject/example).

[example repo](https://github.com/CouncilDataProject/example).

contains URL to a non-existing valid repository.

Expected Behavior

What did you expect to happen instead?

Navigate to one of the repositories in https://github.com/CouncilDataProject

Reproduction

Steps to reproduce the behavior and/or a minimal example that exhibits the behavior.

Go to https://github.com/CouncilDataProject/cdp-backend/blob/main/dev-infrastructure/README.md
then click on the "example repo" link near the top of the README file.

demonstration and example data usage in our example repo.

Environment

Any additional information about your environment.

OS Version: 20.04.2-Ubuntu
CDP Backend Version: 3.0.0.dev27
Web Browser: Firefox 94.0 (64-bit)

Add builds for all platforms

Use Case

Please provide a use case to help us understand your request in context

While originally setup thinking that these pipelines would only run on servers, civic tech happens can happen and be used by all platforms and users.... so actually making sure we do that would be a good idea.

Solution

Please describe your ideal solution

Add windows and mac to build CI and fix any bugs.

Alternatives

Please describe any alternatives you've considered, even if you've dismissed them

Add MatterIndexPipeline

Feature Description

A clear and concise description of the feature you're requesting.

Add matter index pipeline to stored indexed terms for matters.

Use Case

Please provide a use case to help us understand your request in context.

Search for matters based off of a plain text search query.

Solution

Please describe your ideal solution.

Similar to event index pipeline but instead of parsing the transcript parse the matter document text and link to matter_ref instead of event_ref.

Alternatives

Please describe any alternatives you've considered, even if you've dismissed them.

Fix deadlink in event gather pipeline docs that links to event ingestion data model

Broken link in docs: https://councildataproject.github.io/cdp-backend/event_gather_pipeline.html

Should be updated to the renamed Event Ingestion Data Model

Add basic instance metadata collection to infrastructure stack

Feature Description

A clear and concise description of the feature you're requesting.

Partially discussed in CouncilDataProject/cdp-roadmap#3, it would be nice to have a collection with a single document to pull metadata about the instance.

Use Case

Please provide a use case to help us understand your request in context.

Most importantly would be which cdp version the instance is currently on. This would help in determine which features the instance supports.

Solution

Please describe your ideal solution.

Use pulumi gcp firestore document resource to create / update the same document every time the infrastructure is upgraded.
https://www.pulumi.com/docs/reference/pkg/gcp/firestore/document/

@tohuynh @isaacna can you think of other metadata that would be good to store in the database itself about the instance?
version, some subset of cookiecutter params maybe? anything else?

Use spacy to separate sentences in SR model transcribe function

Use Case

To account for edge cases in sentence detection (Prefixes like Mr.), we can use the spacy library.

Solution

Use the spacy library to separate sentences. Haven't looked into the library too deeply, but would probably require us to change the iteration logic since we create Word (and increment index) and potentially end the Sentence in the same iteration.

Catch timeout / request rejections during event archiving

Describe the Bug

A clear and concise description of the bug.

When we are storing links to the matter attachments during the event ingestion pipeline store_event_processing_result task, we run existance checks on every supporting file. Because there can be a lot of these supporting files, we occasionally encounter request rejections for hitting the host too often / too quickly.

Expected Behavior

What did you expect to happen instead?

The file should be skipped. Simply catching the exception and moving on instead of failing the pipeline.

Reproduction

Steps to reproduce the behavior and/or a minimal example that exhibits the behavior.

https://github.com/CouncilDataProject/seattle-staging/runs/4930903577?check_suite_focus=true

Environment

Any additional information about your environment.

Ubuntu 20.04 and cdp-backend v3.0.3 -- see GitHub Actions details.

Remove `is_required` from `create_constant_value_validator`

Feature Description

As discussed in #155

We don't need is_required in the function create_constant_value_validator. We can just use the required arg that is built in for all FireO fields. We should just remove the param in create_constant_value_validator and adjust the usages to keep things consistent

Create a bin script to generate the flow visualization

Create a bin script to generate event gather flow visualization to attach to documentation after #16 is completed.

Originally posted by @JacksonMaxfield in #33 (comment)

Configure GoogleCloudSRModel phrases to use tfidf found n-grams

Use Case

Please provide a use case to help us understand your request in context

phrases provides functionality to tune the SR model by providing phrases that the model should recognize / give context to the model about what is being talked about. We should try to improve this wherever possible. I.e. choosing phrases that most improve model performance and context.

Solution

Please describe your ideal solution

We provide the minutes items to the model as just a list of string and then the "clean_phrases" function simply returns the first 500 characters for each of the first 100 phrases of the full phrases list.

However, we could run TFIDF or some n-gram indexing to find the most valuable (specific to that meeting) n-grams in the minutes items and just provide those. This has a cold-start problem but we can probably get around that.

There may also be better methods for phrase selection to improve the model.

Along this entire issue should be tracking and reporting model performance. We could probably add ASV benchmarking as a way of monitoring performance.

Make ingestion model optional in update_db_model function

Description

In upload_db_model, we allow ingestion_model to be an optional param. However, if the db model already exists and an ingestion model isn't passed, then a uniqueness error is thrown.
There are some places in the event gather pipeline where we call upload_db_model_task without a corresponding ingestion model because we don't really need an ingestion model. Th
This is an unlikely edge case, but there could be a situation where an audio file is deleted from the file store, but the corresponding file db models are not. This could be happen if we're doing something like data cleanup or a backfill which causes inconsistencies.
There are a few different ways to refactor this
- We could fix this by changing the upload_db_model logic and removing the ingestion model is not None check, and changing update_db_model to take in Union[Model, IngestionModel] instead. It may also make more sense to have separate params for a db model vs ingestion model in update_db_model since the logic may differ a bit
- We could also change upload_db_model to only accept a db_model instead of ingestion_model, and update_db_model to take in existing_db_model and proposed_db_model. This would probably make the both functions more simple, however would require us to create extra db models outside of those functions

Expected Behavior

Uniqueness error to not get triggered when upload_db_model_task is called with None for ingestion_model when an existing db model is present

Ensure or Convert video to MP4 or WebM before processing

Feature Description

A clear and concise description of the feature you're requesting.

If the video isn't MP4 / WebM, we won't be able to render the video on the website. Ensure or convert to a "web ready" video format.

Use Case

Please provide a use case to help us understand your request in context.

Allow more video formats through the pipeline / catch common video recording format errors. (OBS records videos as MKV by default)

Solution

Please describe your ideal solution.

If the video file isn't MP4 or WEBM, download it, convert, and upload our conversion to file storage. Use our converted video as the basis which would mean hosting it ourselves but 🤷

Alternatives

Please describe any alternatives you've considered, even if you've dismissed them.

There are some duplicate IndexedEventGrams

Describe the Bug

The top five IndexedEventGrams for an event contain duplicates. (There is a couple of examples of this at https://tohuynh.github.io/cdp-frontend/#/events for seattle-staging database).

For example, for this event:

zoo and animals are duplicates. The duplicates are exactly the same except for the values of datetime_weighted_value, value, and id.

Expected Behavior

There should be no duplicates IndexedEventGrams.

I checked the top 20 IndexedEventGrams for the event, my guess is that the old IndexedEventGrams aren't being updated with new datetime_weighted_value and value, instead, new ones are created and the old ones aren't deleted.

cc @JacksonMaxfield

Allow audio files to be processed as events

Describe the Bug

Not sure if I should count this as a feature or bug, but currently, if the file passed to get_static_thumbnail has no fetchable thumbnail (a .wav audio file), the method and event gather pipeline will error out with

[ERROR: runner:  66 2021-08-16 00:06:27,588] Unexpected error: ValueError('Could not find a format to read the specified file in any-mode mode')
Traceback (most recent call last):
  File "/Users/Isaac/.pyenv/versions/3.7.9/lib/python3.7/site-packages/prefect/engine/runner.py", line 48, in inner
    new_state = method(self, state, *args, **kwargs)
  File "/Users/Isaac/.pyenv/versions/3.7.9/lib/python3.7/site-packages/prefect/engine/task_runner.py", line 860, in get_task_run_state
    logger=self.logger,
  File "/Users/Isaac/.pyenv/versions/3.7.9/lib/python3.7/site-packages/prefect/utilities/executors.py", line 298, in run_task_with_timeout
    return task.run(*args, **kwargs)  # type: ignore
  File "/Users/Isaac/Desktop/cdp-backend/cdp_backend/pipeline/event_gather_pipeline.py", line 747, in get_video_and_generate_thumbnails
    tmp_video_path, session_content_hash
  File "/Users/Isaac/Desktop/cdp-backend/cdp_backend/utils/file_utils.py", line 208, in get_static_thumbnail
    reader = imageio.get_reader(video_path)
  File "/Users/Isaac/.pyenv/versions/3.7.9/lib/python3.7/site-packages/imageio/core/functions.py", line 182, in get_reader
    "Could not find a format to read the specified file in %s mode" % modename
ValueError: Could not find a format to read the specified file in any-mode mode

I think to generalize the event gather as much as possible, we should account for events that are just audio files (in addition to video files). This may require some minor refactoring in the pipeline as well if we want to avoid redundancy with the audio splitters.

Expected Behavior

Allow events that are just audio to be processed in the event gather pipeline.

Reproduction

Run an event gather flow with the event linked to a URI of audio instead of video

Add ability to run tests after install to verify

I have looked for a "runTest" target in the Makefile. I can see that tests exist. I just cannot see if any other setup is required before the tests can be run.

I will add a PR for any Makefile or documentation change that I discover is needed.

Use GCS credentials in `resource_copy`

Feature Description

Currently we depend on environment variables to call resource_copy on files stored in GCS ("https://storage.googleapis"). We should explicitly use a credentials file with GCSFileSystem instead just to be safe. This shouldn't be necessary once public read for GCS is enabled

Add a function for downloading youtube videos

Feature Description

Context

Add a function for downloading youtube videos

Use Case

Some deployments (such as portland) may upload their videos via youtube. We need to be able to download the video for both audio stripping for session hash generation, trasncription, and speaker classification

Solution

We can use youtube-dl

Change `Transcript` object repr / to string

I am not sure how to do this for a dataclass but I assume we can just do it in the object definition.

When I print out a Transcript object in a Jupyter notebook or any Python interpreter really, the repr and the tostring include the whole transcript details. I would prefer the shortened details:

Transcript(
    generator='CDP WebVTT Conversion -- CDP v3.0.2',
    confidence=0.9699999999999663,
    session_datetime='2021-11-18T09:30:00-08:00',
    created_datetime='2022-01-12T05:05:00.633612',
    sentences=[...],
    annotations=None
)

Basically hiding the sentences because it makes the repr massive.

Move from Pulumi to Terraform Python

I was primarily using Pulumi when building the infrastructure module because it allowed us to write in Python and import variable from the other cdp-backend modules (namely database). But I somehow missed that Terraform released the Python lib a couple of months ago.

https://github.com/hashicorp/terraform-cdk/tree/master/examples/python/aws
https://github.com/hashicorp/terraform-cdk (pip install cdktf)

Terraform is a bit more low level and doesn't have any account management stuff. This is actually a benefit to us because that means one less credential for us and other devs to manage.

(I think they released Node earlier and then Python later which explains me missing it a bit, but not much. Fortunately it shouldn't be too hard to switch over.)

Add EventProcessingPipeline

Use Case

Please provide a use case to help us understand your request in context

Add the core pipeline to the repo.

Solution

Please describe your ideal solution

Bring over the major components of the existing cdptools.pipelines.EventGatherPipeline but as a Prefect Flow. All files should be handled with gcsfs / fsspec.

Alternatives

Please describe any alternatives you've considered, even if you've dismissed them

Better method for finding/creating context span of an IndexedEventGram (in event index pipeline)

Describe the Bug

IndexedEventGram's context_span provides a surrounding context of the gram. The current method to find/create the context span is not ideal. Sometimes context span doesn't entirely contain the gram.

Expected Behavior

The context span should contain the gram in its entirety.

In the intermediary step of creating n grams, keep track of the gram's start index (and maybe end index) in the original sentence so that a context span could be created from these indices.

Fix broken UML diagram

Description

A clear description of the bug

Somewhere along the way, the DB UML Diagram was broken. You can see the broken diff here

I believe this issue was introduced as a result of changing from _BUILD_IGNORE_CLASSES to having a variable of the classes instead.

Expected Behavior

What did you expect to happen instead?

A return back to the old image.

Add Database Models for Indexed Event Terms

Use Case

Please provide a use case to help us understand your request in context

Need collections to store the created indexed terms for events

Solution

Please describe your ideal solution

Complete the work laid out in cdptools #79 some of this may be incomplete, look to cdptools@feature/update-index-pipeline for most up-to-date.

Alternatives

Please describe any alternatives you've considered, even if you've dismissed them

Debug why some files can't be archived / linked

Describe the Bug

A clear and concise description of the bug.

Was checking the logs of test-deployment and saw that some files fail to be archived: https://github.com/CouncilDataProject/test-deployment/runs/3776409036?check_suite_focus=true#step:8:118

I can click on that link and it is a real file that is publically accessibile.

Expected Behavior

What did you expect to happen instead?

All publically accessible files are archivable.

Make event gather pipeline dry-runnable

Feature Description

A clear and concise description of the feature you're requesting.

Add a parameter to the pipeline that enables us to dryrun it without any database or file store upload.

Use Case

Please provide a use case to help us understand your request in context.

Would be very useful to dryrun the pipeline during deployment bot configuration validation.

Solution

Please describe your ideal solution.

wait for #57 to allow passing a single event or multiple events to the pipeline directly
if the event doesnt come with captions, mock the transcript generation interaction
mock all file store and database upload interactions
add a task to the pipeline that runs if it was a dry run that stores the created database models to a single json file. (this should be possible because fireo has a to_dict method on all model objects.)

Alternatives

Please describe any alternatives you've considered, even if you've dismissed them.

Add a function for replacing all http URIs with https URIs when possible

We already do this for session video URIs:

cdp-backend/cdp_backend/pipeline/event_gather_pipeline.py

Line 337 in 2942549

elif not session.video_uri.startswith("https://"):

However this should be the default standard whenever we check an HTTP like URI.

If given an http(s) uri, try the uri in https first, then try it in http, whichever works, use it.

Context span selection fails during index creation

Describe the Bug

A clear and concise description of the bug.

https://github.com/CouncilDataProject/seattle-staging/runs/4862736150?check_suite_focus=true

Looks like the closest_term was "" and an empty string wasn't found in the list. Regardless of how valuable that is. This was from seattle-staging and we can run and store the index locally so hopefully this will allow us to check out both what term is being processed and what is going on in general.

Store files to unique paths instead of filenames

Feature Description

A clear and concise description of the feature you're requesting.

For certain files that we want to archive (such as person picture) we are simply downloading and then uploading to our own file storage and storing at /[filename]. Instead let's store the file at a unique path so that we can maybe reduce the amount of uploads.

Use Case

Please provide a use case to help us understand your request in context.

In addition to reducing the number of uploads we run, this should also just speed up the pipeline, and crucially: make our data storage "safer", in the case that a person uses the same filenaming for all of their pictures i.e. council.gov/{PERSONNAME}/picture.png the current pipeline would result in only storing a single picture for them all instead of storing them uniquely.

Solution

Please describe your ideal solution.

Anytime we are using the download -> upload pattern: https://github.com/CouncilDataProject/cdp-backend/blob/main/cdp_backend/pipeline/event_gather_pipeline.py#L874

change it to download -> hash -> check if exists (by hash) -> (optional) upload (with hash as the file name).

Alternatives

Please describe any alternatives you've considered, even if you've dismissed them.

Adopt fireo transactional for upload safety

Use Case

Please provide a use case to help us understand your request in context

Further ensure that no "bad" or "incomplete" data is uploaded to a CDP database.

Solution

Please describe your ideal solution

Rework the event gather pipeline to create the fireo models from the ingestion models but do not upload until the very end of the pipeline where we can upload all of them in a single fireo transaction.

Alternatives

Please describe any alternatives you've considered, even if you've dismissed them

Generate thumbnail PNG and GIF for each video

Use Case

Please provide a use case to help us understand your request in context

Using our card example on the frontend documentation: https://councildataproject.org/cdp-frontend/?path=/story/library-cards-events--meeting-search-result

We need a thumbnail for each event. And we planned for this in two ways, our event schema in our database has both a static and hover thumbnail. https://councildataproject.org/cdp-backend/cdp_backend.database.html#cdp_backend.database.models.Event

Solution

Please describe your ideal solution

Write a function for the static thumbnail generation. Takes in a video file path, grabs a frame from the 30 second mark, saves to PNG, returns the save path.
Write a function for the hover thumbnail generation. Takes in a video file path, grabs 10 frames from every 1/10th of the video file. So if the video is 1 hour long, it would grab a frame from every 6 minutes. Saves those frames to a GIF and returns the save path.

imageio is really all that is needed here.

Alternatives

Please describe any alternatives you've considered, even if you've dismissed them

This can become much more complicated later. But for right now we really just need to get work done. The 30 second and the every 1/10th of the video file are chosen some what randomly but they should be okay. Down the line we may want have some better frame selection in place but not now.

After these two functions are written, we will need to add them to the pipeline. I am envisioning that a single prefect task would do the following:

def generate_thumbnails(video_uri: str) -> Tuple[str, str]:
    # Download video
    local_video_copy_path = external_resource_copy(video_uri)

   # Process to get thumbnails
    static_thumbnail_path = get_static_thumbnail(local_video_copy_path)
    hover_thumbnail_path = get_hover_thumbnail(local_video_copy_path)

   # Store files in storage
    static_thumbnail_uri = upload_file(static_thumbnail_path)
    hover_thumbnail_uri = upload_file(hover_thumbnail_path)
    
    # Store file URIs in DB
    static_thumbnail_ref = upload_file_db(static_thumbnail_uri)
    hover_thumbnail_ref = upload_file_db(hover_thumbnail_uri)

    return static_thumbnail_ref, hover_thumbnail_ref

Bug in DB model id generator / model hasher

See CouncilDataProject/cdp-frontend#143
and CouncilDataProject/cdp-scrapers#50

The hasher is sometimes producing multiple sessions for a single event. I believe this happens when the pipeline is running against the same event that has previously been processed.

I.e.

scrapper runs once (and completes processing like normal)
scrapper runs again and produces a different hash from the first run because the Event model is a reference (ReferenceDocLoader) and not an in-memory model

I can show this behavior by running the following:

from cdp_backend.database import models as db_models
import fireo
from google.auth.credentials import AnonymousCredentials
from google.cloud.firestore import Client

from cdp_backend.database.functions import generate_and_attach_doc_hash_as_id

# Connect to the database
fireo.connection(client=Client(
    project="cdp-seattle-staging-dbengvtn",
    credentials=AnonymousCredentials()
))

def print_session_hashes(event_id):
    event = db_models.Event.collection.get(f"{db_models.Event.collection_name}/{event_id}")
    sessions = list(db_models.Session.collection.filter("event_ref", "==", event.key).fetch())
    for session in sessions:
        print(session.id)
        updated = generate_and_attach_doc_hash_as_id(session)
        print(updated.id)

print_session_hashes("c847010ce7a4")
# prints
# 71d1777f22c7
# 5626e734b33d
# e8b6fd84193f
# 5626e734b33d

print_session_hashes("9409f4d1ea09")
# prints
# 25d3a51447d0
# 9679fad34929
# 30701b4a0fbf
# 9679fad34929

Notice that in the above case, the second and fourth hashes match for each event.
This is because I am forcing it to rehash itself -- but, those second and fourth hashes match neither of the original two hashes, which is bad. But since I am using the Python API to pull this data, it would seem to me that this may be happening because they are ReferenceDocLoaders... so something to explore next.

If I change this function to explicitely fetch the event object as an entire model (db_models.Event) in-memory rather than a ReferenceDocLoader object, it produces:

def print_session_hashes(event_id):
    event = db_models.Event.collection.get(f"{db_models.Event.collection_name}/{event_id}")
    sessions = list(db_models.Session.collection.filter("event_ref", "==", event.key).fetch())
    for session in sessions:
        session.event_ref = session.event_ref.get()
        print(session.id)
        updated = generate_and_attach_doc_hash_as_id(session)
        print(updated.id)

print_session_hashes("c847010ce7a4")
# prints
# 71d1777f22c7
# e8b6fd84193f
# e8b6fd84193f
# e8b6fd84193f

print_session_hashes("9409f4d1ea09")
# prints
# 25d3a51447d0
# 30701b4a0fbf
# 30701b4a0fbf
# 30701b4a0fbf

Notice under this case that everything matches after the first hash. To my understanding this is because this is how the pipeline ran (simply due to insert order, the first time the pipeline ran is the third hash in the list, the second time the pipeline ran is the first hash in the list -- i.e. the order of the output hashes printed is most recent pipeline run produced hash, rehash after explicit model pull, original pipeline run produced hash, rehash after explicit model pull). The first time this event was processed, it had the model in memory, then the second time this event was processed (6 or so hours later), it had the Event as a reference and not in memory. If we force the model into memory, then it fixes itself.

I have looked at a ton of events and to be honest I am not sure how this doesn't effect every single event of ours. It seems to have only occurred for events in the Seattle Staging instance and for a couple of events.

I am still not sure if this is what caused the two variants of the issue as well. The scraper issue I originally labeled as "two of the same session for a single event". The frontend issue, @tohuynh found out that they were the same sessions but one was missing a transcript.

I truly do not know how deep this issue goes. I haven't found any King County event that fails.
I have found the following seattle staging events that fail:

http://councildataproject.org/seattle-staging/#/events/c847010ce7a4 (duplicate session data)
http://councildataproject.org/seattle-staging/#/events/9e03815fa280 (duplicate session data)
http://councildataproject.org/seattle-staging/#/events/dbe0224f11a5 (missing transcript for second session)
http://councildataproject.org/seattle-staging/#/events/da2b5f47e912 (missing transcript for second session)
http://councildataproject.org/seattle-staging/#/events/c46c534afdce (missing transcript for second session)
http://councildataproject.org/seattle-staging/#/events/9409f4d1ea09 (missing transcript for second session)
http://councildataproject.org/seattle-staging/#/events/409e3e6f902f (missing transcript for second session)

cc @dphoria @tohuynh @isaacna

Event gather pipeline: Hash videos by content rather than URL

Use Case

Some pages can have the same URL but have changing content. An example would be a "latest" City Council video page that gets updated frequently. Hashing based on content instead of URL will guarantee uniqueness regardless of updates to file content for the same URL.

Solution

This is primarily for the event gather pipeline but should be addressed anywhere else relevant. We can hash the entire file with a library like hashlib or something similar.

Speaker classification

Feature Description

Backend issue for the relevant roadmap issue

Adding speaker classification to CDP transcripts. This could be through a script/class that retroactively attaches the speaker name to a transcript that already has speaker diarization enabled. Prodigy can be used for annotating the training data.

Use Case

With speaker classification we can provide transcripts annotated with the speaker. This can be used in many ways such as through a script or github action

Solution

Very high level idea would be to:

Use GCP's built-in speaker diarization to separate the speakers. We could also create our own audio classification model. We could also use something like Prodigy to annotate the data, but I believe they have their own diarization/transcription models as well.
Figure out how to add the classified speaker names to the diarized transcript. I'm not sure if GCP allows you to provide any training data, but from what I could tell they only separate the speakers, but the models don't take in training data to label them.

A bigger picture breakdown of all the major components can be found on the roadmap issue under "Major components".

ASV for tracking our SR model configuration and performance

With #150, we should come up with a more robust system for tracking how well our speech recognition model performs. Both our google configuration, or, if we ever train our own model or switch services, simply a method for checking with model is currently performing best.

I have previously set up ASV for aicsimageio to track our IO perf per each commit and I think I could setup something up for this as well.

Unlike AICSImageIO's ASV setup which runs on every commit, I think I would only set it up to run / update on every tag / new release to reduce cost.

As for dataset, I think I will try to find 5 meetings of seattle city council that have transcripts converted from closed caption files, and store those transcript JSONs in the repo / library to use as "ground truth". While the closed caption files aren't 100% correct, they are very close to it, and crutially, they are creating by a government employee as a part of the meeting process and used on government websites so it's really asking the question: "does the SR model configuration perform close to manual closed caption file creation"

Finally as for metric, the general idea in my head is to process each of the transcripts with the model and then use word error rate to see how the model performs against the ground truth closed caption converted file.

cc @nniiicc

Add back `resource_exists` validator for `uri` field on File

We had to remove the usage of resources_exists from the uri field validator because we couldn't think of a method to pass the google credentials to the validator. At some point we may want to add this back but this shouldn't hold up #49 from being merged.

Actually after trying this, I think passing another param to the validator is a little tricky.

I can add kwargs to resource_exists, but I'm not sure if there's an easy way to pass in fs into that function inside uri = fields.TextField(required=True, validator=validators.resource_exists).

Because model fields in FireO are assigned like db_file.uri = "gs://stuff", I'm not sure we can pass in other params into the validators arg.

Would it make more sense to just validate separately outside of FireO's built-in field validation? Or is there an easier way to do this I'm missing?

Originally posted by @isaacna in #49 (comment)

Optimize stored thumbnails for web render

Feature Description

A clear and concise description of the feature you're requesting.

Downscale and compress images to preemptively help speed up our webpage rendering times.

Use Case

Please provide a use case to help us understand your request in context.

Specifically I am worried about the event search results / the general events filtering and query page due to trying to render ~10 events with all their thumbnails at the same time.

Solution

Please describe your ideal solution.

if the image is larger than 960 x 540 downscale images and gif to 960 x 540
change the format for the static thumbnail from png to webp

Alternatives

Please describe any alternatives you've considered, even if you've dismissed them.

Set public read during CDPStack creation

Use Case

Please provide a use case to help us understand your request in context

Maybe this should be a parameter in the CDPStack object but currently our standard for infrastructure is to allow public read so during stack creation, we should set the default read permissions for the bucket and database to be public as well.

I say potentially a parameter because dev stacks don't need to be public read which could save money for the dev 🙃

(the dev stack with example data should really only cost like $0.01 though since the example data should be really minimal)

Solution

Please describe your ideal solution

I haven't figured out which Pulumi resource is needed (or really if it is possible to do with the pulumi API, it may need to be a second ComponentResource that calls out to gcloud in a subprocess)

But setting this and managing permissions through code would be incredibly useful.

Alternatives

Please describe any alternatives you've considered, even if you've dismissed them

Add `matter_full_text_uri` property to `Matter` definition and impl

While looking through the frontend and design work on the legislation tracking project, I realized for the first time that I think we may be missing a crucial piece of information which is a link to the actual full matter text.

Current Matter is defined as: link but a matter definitely has full text and we should store a link to that full text. I propose full_text_uri or some varient of that.

Additionally, while we look into this, it would be good to investigate which MatterFile's make it through the pipeline: https://github.com/CouncilDataProject/cdp-backend/blob/main/cdp_backend/pipeline/event_gather_pipeline.py#L1444

I think the above try-except block may be dropping some MatterFile / MinutesItemFile attachments that would be useful to keep and so we may want to try to fix it if we do see that behavior.

Split infra deps from pipeline deps

Further dependency splitting, no need to have them all together.

Migrate cdp-deploy work to an infrastructure module of this codebase

Use Case

Please provide a use case to help us understand your request in context

After talking to the Pulumi team, we can export entire stacks of resources a CustomResource from this codebase and then just use it in a very small __main__.py file in a cookiecutter-cdp-instance repo / cookiecutter.

This would mean that the backend repo would be more tightly coupled to the database models and they would be versioned together. Which, imo, would be a very good thing.

Solution

Please describe your ideal solution

Create an infrastructure module in the cdp_backend library where we can export a CDPStack Pulumi CustomResource.

Follow the work done here as example resource definition
Follow the work done here as example resource usage

There should also be tests added when possible to check the infrastructure. But, at the very least, potentially adding a github action to always preview what would be created under the current version of the infrastructure / backend.

Alternatives

Please describe any alternatives you've considered, even if you've dismissed them

Add EventIndexPipeline

Use Case

Please provide a use case to help us understand your request in context

Create a searchable index for the events gathered and upload it to Firestore.

Solution

Please describe your ideal solution

Migrate over and clean up the work done in cdptools@feature/update-index-pipeline

Additionally, see if this can entirely be done on GH Actions or if we need to use compute engine. Keeping the cost low would be great.

Alternatives

Please describe any alternatives you've considered, even if you've dismissed them

Migrate minimal event data and minimal session data over from cdptools

Use Case

Please provide a use case to help us understand your request in context

Standardize and version the acceptable data format we accept for pipelines with the pipeline code itself rather than scrapper side.

Solution

Please describe your ideal solution

Migrate the work done by @isaacna on cdptools to this repo.

Potentially change the name from MinimalEventData and MinimalSessionData to just EventData and SessionData as I think with the Optional keys, they are the full event data spec.

We should also make objects for VoteData and such, which are a part of the Optional[List[MinutesItems]]. Basically the entire data structure of all minimal data and optional data should be documented as the accepted data structure we accept for pipelines.

Notes

This unfortunately will require a bit of copying the database ORM definitions unless we can think of a way to programmatically construct all of these models. I.E. run through the DB ORM definitions and created NamedTuples of every single model with the appropriate Optional tags. I think it's possible.

councildataproject / cdp-backend Goto Github PK

cdp-backend's People

Contributors

Stargazers

Watchers

Forkers

cdp-backend's Issues

Use Case

Solution

Alternatives

Idea / Feature

Use Case / User Story

Solution

Alternatives

Stakeholders

Major Components

Dependencies

Other Notes

Use Case

Solution

Alternatives

Feature Description

Use Case

Solution

Alternatives

Feature Description

Use Case

Solution

Alternatives

Use Case

Solution

Use Case

Solution

Describe the Bug

Expected Behavior

Reproduction

Environment

Use Case

Solution

Alternatives

Feature Description

Use Case

Solution

Alternatives

Feature Description

Use Case

Solution

Use Case

Solution

Describe the Bug

Expected Behavior

Reproduction

Environment

Feature Description

Use Case

Solution

Description

Expected Behavior

Feature Description

Use Case

Solution

Alternatives

Describe the Bug

Expected Behavior

Describe the Bug

Expected Behavior

Reproduction

Feature Description

Feature Description

Use Case

Solution

Use Case

Solution

Alternatives

Describe the Bug

Expected Behavior

Description

Expected Behavior

Use Case

Solution