councildataproject / cdp-backend
Data storage utilities and processing pipelines used by CDP instances.
Home Page: https://councildataproject.org/cdp-backend
License: Mozilla Public License 2.0
Please provide a use case to help us understand your request in context
To process the various one-off events that occur, for example, I just started processing the 2021 election events for Seattle.
Please describe your ideal solution
I wrote a tiny script to do this in cdptools that takes a local file plus a JSON file of the event / minutes details (the questions asked at the forum), uploads the video to Google storage, and then processes that uploaded file with the JSON config for event details.
In cdp-backend this is actually easier because all the ingestion models are serializable to JSON (dataclasses_json) iirc.
So that would mean a user simply has to write the object in Python and serialize it (or write it in pure JSON), then have a script accept the serialized ingestion model plus the video path.
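As a rough sketch of that round trip (the model below is a made-up, trimmed-down stand-in for the real ingestion models, purely for illustration):

from dataclasses import dataclass
from typing import Optional
from dataclasses_json import dataclass_json

# Hypothetical minimal ingestion model -- not the real CDP definition.
@dataclass_json
@dataclass
class EventIngestionModel:
    body: str
    session_video_uri: str
    agenda_uri: Optional[str] = None

event = EventIngestionModel(
    body="Full Council",
    session_video_uri="file:///local/forum-recording.mp4",
)

# Serialize so a small one-off script can pick it up later
with open("event.json", "w") as f:
    f.write(event.to_json())

# Script side: read the JSON back into the ingestion model
with open("event.json") as f:
    loaded = EventIngestionModel.from_json(f.read())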
Please describe any alternatives you've considered, even if you've dismissed them
Dynamically generate TypeScript object definitions from cdp-backend Python DB models.
As a developer I want to be able to add, remove, or update a database model in a single location rather than multiple.
Add a script to cdp-backend that, when run, will generate a TypeScript package of the database models + extras that we can push to npm on each new version push.
Backend maintainers
Frontend maintainers
cdp-backend
To stay up-to-date we should upgrade to v2 when we get a chance, fortunately they released an upgrading guide here: https://github.com/googleapis/python-speech/blob/release-v2.0.0/UPGRADING.md
Please provide a use case to help us understand your request in context
As we add more code to the library, tests will largely ensure we aren't adding anything that breaks, but we should also check the documented types.
Please describe your ideal solution
Adopt mypy for static type checking and add it to tox and the CI workflows.
Probably adopt Prefect's mypy base config as well.
[mypy]
ignore_missing_imports = True
disallow_untyped_defs = True
check_untyped_defs = True
There are bound to be some failures off the bat, which is why we do this.
Please describe any alternatives you've considered, even if you've dismissed them
A clear and concise description of the feature you're requesting.
Currently, ReferenceField properties for all of our models are loaded on parent model initialization. What this means is that when I request a Vote from the database, it will first load the vote data, then for each reference field load the referenced model, then recursively do that down the model dependency tree. This is great for "simple" access but terrible for large scale access.
Please provide a use case to help us understand your request in context.
I want to be able to query the database for all votes and not have my calls rejected by Firebase.
Please describe your ideal solution.
Turn off auto_load: https://octabyte.io/FireO/fields/reference-field/#auto-load
Additionally, because certain parts of our pipeline depend on / expect those referenced models to exist, the pipeline will likely need to be updated to fetch the referenced model when needed.
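A minimal FireO sketch of what that could look like (models here are illustrative only, not the actual CDP schema):

from fireo.models import Model
from fireo.fields import ReferenceField, TextField

class Event(Model):
    body_name = TextField()

class Vote(Model):
    decision = TextField()
    # auto_load=False stops FireO from fetching the referenced Event
    # (and everything below it) every time a Vote is loaded.
    event_ref = ReferenceField(Event, auto_load=False)

vote = Vote.collection.get("vote/some-vote-id")
event = vote.event_ref.get()  # explicit fetch only where the pipeline needs it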
Please describe any alternatives you've considered, even if you've dismissed them.
A clear and concise description of the feature you're requesting.
Reduce the pipeline computation time by combining all the tasks that require the video on the worker together.
Please provide a use case to help us understand your request in context.
Drastically reducing the pipeline computation time.
Please describe your ideal solution.
Move the thumbnail generation function calls to the start of the pipeline, alongside the strip-audio and generate-hash tasks.
Please describe any alternatives you've considered, even if you've dismissed them.
Please provide a use case to help us understand your request in context
Managing database schema + inserting new items into the database will be simpler with an ORM style API
Please describe your ideal solution
E.g.:
full_council = Event("Full Council", datetime(2020, 10, 7), "...")
database.insert(full_council)
It would be nice to move db uploads in the event gather pipeline to after all the processing logic (audio creation, transcript creation, etc.) so that any errors get surfaced before uploading to the database.
This can be done by storing the results of processing in a temporary container and then doing all the database calls with the results of the processing.
A clear and concise description of the bug.
cdp-backend/dev-infrastructure/README.md
Line 12 in 5836ee3
[example repo](https://github.com/CouncilDataProject/example).
contains a URL to a repository that does not exist.
What did you expect to happen instead?
The link should navigate to one of the existing repositories in https://github.com/CouncilDataProject
Steps to reproduce the behavior and/or a minimal example that exhibits the behavior.
Go to https://github.com/CouncilDataProject/cdp-backend/blob/main/dev-infrastructure/README.md
then click on the "example repo" link near the top of the README file.
demonstration and example data usage in our example repo.
Any additional information about your environment.
Please provide a use case to help us understand your request in context
While this was originally set up with the thinking that these pipelines would only run on servers, civic tech can happen on, and be used from, all platforms and by all users... so actually making sure we support that would be a good idea.
Please describe your ideal solution
Add Windows and macOS to the build CI and fix any resulting bugs.
Please describe any alternatives you've considered, even if you've dismissed them
A clear and concise description of the feature you're requesting.
Add a matter index pipeline to store indexed terms for matters.
Please provide a use case to help us understand your request in context.
Search for matters based off of a plain text search query.
Please describe your ideal solution.
Similar to the event index pipeline, but instead of parsing the transcript, parse the matter document text and link to matter_ref instead of event_ref.
Please describe any alternatives you've considered, even if you've dismissed them.
Broken link in docs: https://councildataproject.github.io/cdp-backend/event_gather_pipeline.html
Should be updated to the renamed Event Ingestion Data Model
A clear and concise description of the feature you're requesting.
Partially discussed in CouncilDataProject/cdp-roadmap#3, it would be nice to have a collection with a single document to pull metadata about the instance.
Please provide a use case to help us understand your request in context.
Most importantly would be which CDP version the instance is currently on. This would help in determining which features the instance supports.
Please describe your ideal solution.
Use pulumi gcp firestore document resource to create / update the same document every time the infrastructure is upgraded.
https://www.pulumi.com/docs/reference/pkg/gcp/firestore/document/
@tohuynh @isaacna can you think of other metadata that would be good to store in the database itself about the instance?
version, some subset of cookiecutter params maybe? anything else?
To account for edge cases in sentence detection (prefixes like "Mr."), we can use the spacy library.
Use the spacy library to separate sentences. I haven't looked into the library too deeply, but it would probably require us to change the iteration logic, since we create a Word (and increment the index) and potentially end the Sentence in the same iteration.
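A small sketch of what that could look like with spaCy (the model name and iteration shape are illustrative; the real Word/Sentence construction logic would wrap this loop):

import spacy

# spaCy's sentence segmentation handles abbreviations like "Mr." that a
# naive split on periods would break on.
nlp = spacy.load("en_core_web_sm")

doc = nlp("Mr. Smith presented the budget. The council then voted on it.")
for sentence in doc.sents:
    for index, token in enumerate(sentence):
        print(index, token.text)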
A clear and concise description of the bug.
When we are storing links to the matter attachments during the event ingestion pipeline store_event_processing_result task, we run existence checks on every supporting file. Because there can be a lot of these supporting files, we occasionally encounter request rejections for hitting the host too often / too quickly.
What did you expect to happen instead?
The file should be skipped: simply catch the exception and move on instead of failing the pipeline.
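One way that could look, as a simplified stand-in for the pipeline's actual existence-check helper (names here are hypothetical):

import logging
import requests

log = logging.getLogger(__name__)

def safe_resource_exists(uri: str) -> bool:
    # Catch the request failure and skip the file rather than failing the task.
    try:
        return requests.head(uri, timeout=10, allow_redirects=True).ok
    except requests.RequestException as e:
        log.warning("Skipping existence check for %s: %s", uri, e)
        return False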
Steps to reproduce the behavior and/or a minimal example that exhibits the behavior.
https://github.com/CouncilDataProject/seattle-staging/runs/4930903577?check_suite_focus=true
Any additional information about your environment.
Ubuntu 20.04 and cdp-backend v3.0.3 -- see GitHub Actions details.
As discussed in #155, we don't need is_required in the function create_constant_value_validator. We can just use the required arg that is built in for all FireO fields. We should just remove the param from create_constant_value_validator and adjust the usages to keep things consistent.
Create a bin script to generate event gather flow visualization to attach to documentation after #16 is completed.
Originally posted by @JacksonMaxfield in #33 (comment)
Please provide a use case to help us understand your request in context
phrases provides functionality to tune the SR model by providing phrases that the model should recognize / give context to the model about what is being talked about. We should try to improve this wherever possible, i.e. choosing phrases that most improve model performance and context.
Please describe your ideal solution
We provide the minutes items to the model as just a list of strings, and then the "clean_phrases" function simply returns the first 500 characters for each of the first 100 phrases of the full phrases list.
However, we could run TFIDF or some n-gram indexing to find the most valuable (specific to that meeting) n-grams in the minutes items and just provide those. This has a cold-start problem but we can probably get around that.
There may also be better methods for phrase selection to improve the model.
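A rough sketch of the TF-IDF idea (parameter choices and the helper name are illustrative, and the cold-start question of where the comparison corpus comes from is left open):

from typing import List
from sklearn.feature_extraction.text import TfidfVectorizer

def select_phrases(
    minutes_items: List[str],
    other_meetings_minutes: List[str],
    top_k: int = 100,
) -> List[str]:
    # Score this meeting's combined minutes against other meetings and keep
    # the highest-TFIDF n-grams as speech context phrases.
    vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
    corpus = ["\n".join(minutes_items)] + other_meetings_minutes
    tfidf = vectorizer.fit_transform(corpus)
    terms = vectorizer.get_feature_names_out()
    scores = tfidf[0].toarray().ravel()  # row 0 is this meeting's document
    ranked = sorted(zip(terms, scores), key=lambda pair: pair[1], reverse=True)
    return [term for term, score in ranked[:top_k] if score > 0]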
Alongside this entire issue we should be tracking and reporting model performance. We could probably add ASV benchmarking as a way of monitoring performance.
In upload_db_model, we allow ingestion_model to be an optional param. However, if the db model already exists and an ingestion model isn't passed, then a uniqueness error is thrown.
We may want to call upload_db_model_task without a corresponding ingestion model because we don't really need an ingestion model. This could happen if we're doing something like data cleanup or a backfill which causes inconsistencies.
A couple of options:
- Keep the upload_db_model logic but remove the ingestion model is not None check, and change update_db_model to take in Union[Model, IngestionModel] instead. It may also make more sense to have separate params for a db model vs an ingestion model in update_db_model since the logic may differ a bit.
- Change upload_db_model to only accept a db_model instead of ingestion_model, and update_db_model to take in existing_db_model and proposed_db_model. This would probably make both functions simpler, however it would require us to create extra db models outside of those functions.
Either way, the current failure occurs when upload_db_model_task is called with None for ingestion_model and an existing db model is present.
A clear and concise description of the feature you're requesting.
If the video isn't MP4 / WebM, we won't be able to render the video on the website. Ensure or convert to a "web ready" video format.
Please provide a use case to help us understand your request in context.
Allow more video formats through the pipeline / catch common video recording format errors. (OBS records videos as MKV by default)
Please describe your ideal solution.
If the video file isn't MP4 or WEBM, download it, convert it, and upload our conversion to file storage. Use our converted video as the basis, which would mean hosting it ourselves.
Please describe any alternatives you've considered, even if you've dismissed them.
The top five IndexedEventGrams for an event contain duplicates. (There are a couple of examples of this at https://tohuynh.github.io/cdp-frontend/#/events for the seattle-staging database.)
For example, for this event, zoo and animals are duplicates. The duplicates are exactly the same except for the values of datetime_weighted_value, value, and id.
There should be no duplicate IndexedEventGrams.
I checked the top 20 IndexedEventGrams for the event; my guess is that the old IndexedEventGrams aren't being updated with new datetime_weighted_value and value values. Instead, new ones are created and the old ones aren't deleted.
cc @JacksonMaxfield
Not sure if I should count this as a feature or bug, but currently, if the file passed to get_static_thumbnail has no fetchable thumbnail (e.g. a .wav audio file), the method and the event gather pipeline will error out with:
[ERROR: runner: 66 2021-08-16 00:06:27,588] Unexpected error: ValueError('Could not find a format to read the specified file in any-mode mode')
Traceback (most recent call last):
File "/Users/Isaac/.pyenv/versions/3.7.9/lib/python3.7/site-packages/prefect/engine/runner.py", line 48, in inner
new_state = method(self, state, *args, **kwargs)
File "/Users/Isaac/.pyenv/versions/3.7.9/lib/python3.7/site-packages/prefect/engine/task_runner.py", line 860, in get_task_run_state
logger=self.logger,
File "/Users/Isaac/.pyenv/versions/3.7.9/lib/python3.7/site-packages/prefect/utilities/executors.py", line 298, in run_task_with_timeout
return task.run(*args, **kwargs) # type: ignore
File "/Users/Isaac/Desktop/cdp-backend/cdp_backend/pipeline/event_gather_pipeline.py", line 747, in get_video_and_generate_thumbnails
tmp_video_path, session_content_hash
File "/Users/Isaac/Desktop/cdp-backend/cdp_backend/utils/file_utils.py", line 208, in get_static_thumbnail
reader = imageio.get_reader(video_path)
File "/Users/Isaac/.pyenv/versions/3.7.9/lib/python3.7/site-packages/imageio/core/functions.py", line 182, in get_reader
"Could not find a format to read the specified file in %s mode" % modename
ValueError: Could not find a format to read the specified file in any-mode mode
I think to generalize the event gather as much as possible, we should account for events that are just audio files (in addition to video files). This may require some minor refactoring in the pipeline as well if we want to avoid redundancy with the audio splitters.
Allow events that are just audio to be processed in the event gather pipeline.
Run an event gather flow with the event linked to a URI of audio instead of video
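A hedged sketch of the fallback (treating "no readable video stream" as "no thumbnail" rather than letting the ValueError fail the flow; the helper name is hypothetical):

from typing import Optional
import imageio

def try_get_video_reader(video_path: str) -> Optional[object]:
    try:
        return imageio.get_reader(video_path)
    except ValueError:
        # e.g. a .wav audio-only session file
        return None

reader = try_get_video_reader("session.wav")
if reader is None:
    print("Audio-only session, skipping thumbnail generation")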
I have looked for a "runTest" target in the Makefile. I can see that tests exist. I just cannot see if any other setup is required before the tests can be run.
I will add a PR for any Makefile or documentation change that I discover is needed.
Currently we depend on environment variables to call resource_copy on files stored in GCS ("https://storage.googleapis"). We should explicitly use a credentials file with GCSFileSystem instead, just to be safe. This shouldn't be necessary once public read for GCS is enabled.
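A minimal sketch of the explicit-credentials form (paths and bucket name are placeholders):

from gcsfs import GCSFileSystem

# Pass the service account credentials file explicitly instead of relying
# on environment variables being set on the worker.
fs = GCSFileSystem(token="/path/to/service-account-credentials.json")
fs.get("gs://example-cdp-bucket/session-video.mp4", "local-session-video.mp4")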
Add a function for downloading YouTube videos.
Some deployments (such as Portland) may upload their videos via YouTube. We need to be able to download the video for audio stripping for session hash generation, transcription, and speaker classification.
We can use youtube-dl.
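A minimal sketch with youtube-dl (options and the function name are illustrative only):

import youtube_dl

def download_youtube_video(url: str, out_template: str = "%(id)s.%(ext)s") -> None:
    # Download the session video locally so audio stripping, hashing,
    # and transcription can treat it like any other file.
    ydl_opts = {"format": "mp4", "outtmpl": out_template}
    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
        ydl.download([url])

download_youtube_video("https://www.youtube.com/watch?v=VIDEO_ID")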
I am not sure how to do this for a dataclass but I assume we can just do it in the object definition.
When I print out a Transcript object in a Jupyter notebook (or any Python interpreter really), the repr and the to-string include the whole transcript details. I would prefer the shortened details:
Transcript(
generator='CDP WebVTT Conversion -- CDP v3.0.2',
confidence=0.9699999999999663,
session_datetime='2021-11-18T09:30:00-08:00',
created_datetime='2022-01-12T05:05:00.633612',
sentences=[...],
annotations=None
)
Basically hiding the sentences because it makes the repr massive.
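On a plain dataclass the built-in way to do this is field(repr=False); a sketch (the real Transcript definition would need the equivalent change, and the field types here are simplified):

from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class Transcript:
    generator: str
    confidence: float
    session_datetime: str
    created_datetime: str
    # repr=False keeps the (potentially huge) sentence list out of __repr__
    sentences: List[Any] = field(repr=False, default_factory=list)
    annotations: Optional[dict] = None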
I was primarily using Pulumi when building the infrastructure module because it allowed us to write in Python and import variables from the other cdp-backend modules (namely database). But I somehow missed that Terraform released the Python lib a couple of months ago.
https://github.com/hashicorp/terraform-cdk/tree/master/examples/python/aws
https://github.com/hashicorp/terraform-cdk (pip install cdktf)
Terraform is a bit more low level and doesn't have any account management stuff. This is actually a benefit to us because that means one less credential for us and other devs to manage.
(I think they released Node earlier and then Python later, which somewhat explains how I missed it, but not much. Fortunately it shouldn't be too hard to switch over.)
Please provide a use case to help us understand your request in context
Add the core pipeline to the repo.
Please describe your ideal solution
Bring over the major components of the existing cdptools.pipelines.EventGatherPipeline but as a Prefect Flow. All files should be handled with gcsfs / fsspec.
Please describe any alternatives you've considered, even if you've dismissed them
IndexedEventGram's context_span provides the surrounding context of the gram. The current method to find/create the context span is not ideal: sometimes the context span doesn't entirely contain the gram.
The context span should contain the gram in its entirety.
In the intermediary step of creating n-grams, keep track of the gram's start index (and maybe end index) in the original sentence so that a context span could be created from these indices.
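A rough sketch of carrying the span along while building n-grams (helper name and tuple shape are illustrative):

from typing import List, Tuple

def ngrams_with_spans(sentence: str, n: int = 2) -> List[Tuple[str, int, int]]:
    # Record each word's character offset so every n-gram can report the
    # (start, end) span it occupies in the original sentence.
    words = sentence.split()
    offsets = []
    cursor = 0
    for word in words:
        cursor = sentence.index(word, cursor)
        offsets.append(cursor)
        cursor += len(word)
    grams = []
    for i in range(len(words) - n + 1):
        start = offsets[i]
        end = offsets[i + n - 1] + len(words[i + n - 1])
        grams.append((" ".join(words[i : i + n]), start, end))
    return grams

# ngrams_with_spans("zoo animals need care", 2)
# -> [("zoo animals", 0, 11), ("animals need", 4, 16), ("need care", 12, 21)]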
A clear description of the bug
Somewhere along the way, the DB UML Diagram was broken. You can see the broken diff here.
I believe this issue was introduced as a result of changing from _BUILD_IGNORE_CLASSES to having a variable of the classes instead.
What did you expect to happen instead?
A return back to the old image.
Please provide a use case to help us understand your request in context
Need collections to store the created indexed terms for events
Please describe your ideal solution
Complete the work laid out in cdptools #79. Some of this may be incomplete; look to cdptools@feature/update-index-pipeline for the most up-to-date version.
Please describe any alternatives you've considered, even if you've dismissed them
A clear and concise description of the bug.
Was checking the logs of test-deployment and saw that some files fail to be archived: https://github.com/CouncilDataProject/test-deployment/runs/3776409036?check_suite_focus=true#step:8:118
I can click on that link and it is a real file that is publicly accessible.
What did you expect to happen instead?
All publicly accessible files are archivable.
A clear and concise description of the feature you're requesting.
Add a parameter to the pipeline that enables us to dryrun it without any database or file store upload.
Please provide a use case to help us understand your request in context.
Would be very useful to dryrun the pipeline during deployment bot configuration validation.
Please describe your ideal solution.
(fireo has a to_dict method on all model objects.)
Please describe any alternatives you've considered, even if you've dismissed them.
We already do this for session video URIs:
However, this should be the default standard whenever we check an HTTP-like URI.
If given an http(s) URI, try it over https first, then over http; whichever works, use it.
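A hedged sketch of that check (helper name is illustrative):

import requests

def resolve_uri_scheme(uri: str) -> str:
    # Try https first, then http; return the first scheme that responds OK.
    bare = uri.split("://", 1)[-1]
    for scheme in ("https", "http"):
        candidate = f"{scheme}://{bare}"
        try:
            if requests.head(candidate, timeout=10, allow_redirects=True).ok:
                return candidate
        except requests.RequestException:
            continue
    raise ValueError(f"Neither https nor http worked for: {uri}")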
A clear and concise description of the bug.
https://github.com/CouncilDataProject/seattle-staging/runs/4862736150?check_suite_focus=true
Looks like the closest_term was "" and an empty string wasn't found in the list (regardless of how valuable that is). This was from seattle-staging, and we can run and store the index locally, so hopefully this will allow us to check both what term is being processed and what is going on in general.
A clear and concise description of the feature you're requesting.
For certain files that we want to archive (such as a person's picture) we are simply downloading and then uploading to our own file storage, storing at /[filename]. Instead, let's store the file at a unique path so that we can maybe reduce the number of uploads.
Please provide a use case to help us understand your request in context.
In addition to reducing the number of uploads we run, this should also speed up the pipeline and, crucially, make our data storage "safer". If a source uses the same file naming for all of their pictures, i.e. council.gov/{PERSONNAME}/picture.png, the current pipeline would result in storing only a single picture for them all instead of storing them uniquely.
Please describe your ideal solution.
Anytime we are using the download -> upload pattern (https://github.com/CouncilDataProject/cdp-backend/blob/main/cdp_backend/pipeline/event_gather_pipeline.py#L874), change it to download -> hash -> check if exists (by hash) -> (optional) upload (with hash as the file name).
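A sketch of the proposed pattern (bucket name and helper name are placeholders):

import hashlib
from gcsfs import GCSFileSystem

def archive_by_content_hash(fs: GCSFileSystem, bucket: str, local_path: str) -> str:
    # download -> hash -> check if exists (by hash) -> upload only if missing
    with open(local_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    suffix = local_path.rsplit(".", 1)[-1]
    remote_uri = f"gs://{bucket}/{digest}.{suffix}"
    if not fs.exists(remote_uri):
        fs.put(local_path, remote_uri)
    return remote_uri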
Please describe any alternatives you've considered, even if you've dismissed them.
Please provide a use case to help us understand your request in context
Further ensure that no "bad" or "incomplete" data is uploaded to a CDP database.
Please describe your ideal solution
Rework the event gather pipeline to create the fireo models from the ingestion models but do not upload until the very end of the pipeline where we can upload all of them in a single fireo transaction.
Please describe any alternatives you've considered, even if you've dismissed them
Please provide a use case to help us understand your request in context
Using our card example on the frontend documentation: https://councildataproject.org/cdp-frontend/?path=/story/library-cards-events--meeting-search-result
We need a thumbnail for each event. And we planned for this in two ways: our event schema in our database has both a static and a hover thumbnail. https://councildataproject.org/cdp-backend/cdp_backend.database.html#cdp_backend.database.models.Event
Please describe your ideal solution
imageio is really all that is needed here.
Please describe any alternatives you've considered, even if you've dismissed them
This can become much more complicated later, but for right now we really just need to get work done. The 30-second mark and the every-1/10th-of-the-video sampling were chosen somewhat arbitrarily but they should be okay. Down the line we may want to have some better frame selection in place, but not now.
After these two functions are written, we will need to add them to the pipeline. I am envisioning that a single prefect task would do the following:
def generate_thumbnails(video_uri: str) -> Tuple[str, str]:
# Download video
local_video_copy_path = external_resource_copy(video_uri)
# Process to get thumbnails
static_thumbnail_path = get_static_thumbnail(local_video_copy_path)
hover_thumbnail_path = get_hover_thumbnail(local_video_copy_path)
# Store files in storage
static_thumbnail_uri = upload_file(static_thumbnail_path)
hover_thumbnail_uri = upload_file(hover_thumbnail_path)
# Store file URIs in DB
static_thumbnail_ref = upload_file_db(static_thumbnail_uri)
hover_thumbnail_ref = upload_file_db(hover_thumbnail_uri)
return static_thumbnail_ref, hover_thumbnail_ref
See CouncilDataProject/cdp-frontend#143 and CouncilDataProject/cdp-scrapers#50
The hasher is sometimes producing multiple sessions for a single event. I believe this happens when the pipeline is running against the same event that has previously been processed.
I.e. the Event model is a reference (ReferenceDocLoader) and not an in-memory model.
I can show this behavior by running the following:
from cdp_backend.database import models as db_models
import fireo
from google.auth.credentials import AnonymousCredentials
from google.cloud.firestore import Client
from cdp_backend.database.functions import generate_and_attach_doc_hash_as_id
# Connect to the database
fireo.connection(client=Client(
project="cdp-seattle-staging-dbengvtn",
credentials=AnonymousCredentials()
))
def print_session_hashes(event_id):
event = db_models.Event.collection.get(f"{db_models.Event.collection_name}/{event_id}")
sessions = list(db_models.Session.collection.filter("event_ref", "==", event.key).fetch())
for session in sessions:
print(session.id)
updated = generate_and_attach_doc_hash_as_id(session)
print(updated.id)
print_session_hashes("c847010ce7a4")
# prints
# 71d1777f22c7
# 5626e734b33d
# e8b6fd84193f
# 5626e734b33d
print_session_hashes("9409f4d1ea09")
# prints
# 25d3a51447d0
# 9679fad34929
# 30701b4a0fbf
# 9679fad34929
Notice that in the above case, the second and fourth hashes match for each event.
This is because I am forcing it to rehash itself -- but those second and fourth hashes match neither of the original two hashes, which is bad. But since I am using the Python API to pull this data, it would seem to me that this may be happening because they are ReferenceDocLoaders... so something to explore next.
If I change this function to explicitly fetch the event object as an entire model (db_models.Event) in-memory rather than a ReferenceDocLoader object, it produces:
def print_session_hashes(event_id):
event = db_models.Event.collection.get(f"{db_models.Event.collection_name}/{event_id}")
sessions = list(db_models.Session.collection.filter("event_ref", "==", event.key).fetch())
for session in sessions:
session.event_ref = session.event_ref.get()
print(session.id)
updated = generate_and_attach_doc_hash_as_id(session)
print(updated.id)
print_session_hashes("c847010ce7a4")
# prints
# 71d1777f22c7
# e8b6fd84193f
# e8b6fd84193f
# e8b6fd84193f
print_session_hashes("9409f4d1ea09")
# prints
# 25d3a51447d0
# 30701b4a0fbf
# 30701b4a0fbf
# 30701b4a0fbf
Notice under this case that everything matches after the first hash. To my understanding this is because this is how the pipeline ran (simply due to insert order, the first time the pipeline ran is the third hash in the list, the second time the pipeline ran is the first hash in the list -- i.e. the order of the output hashes printed is most recent pipeline run produced hash, rehash after explicit model pull, original pipeline run produced hash, rehash after explicit model pull). The first time this event was processed, it had the model in memory, then the second time this event was processed (6 or so hours later), it had the Event as a reference and not in memory. If we force the model into memory, then it fixes itself.
I have looked at a ton of events and, to be honest, I am not sure how this doesn't affect every single event of ours. It seems to have only occurred for a couple of events in the Seattle Staging instance.
I am still not sure if this is what caused the two variants of the issue as well. The scraper issue I originally labeled as "two of the same session for a single event". For the frontend issue, @tohuynh found out that they were the same sessions but one was missing a transcript.
I truly do not know how deep this issue goes. I haven't found any King County event that fails.
I have found the following seattle staging events that fail:
Some pages can have the same URL but have changing content. An example would be a "latest" City Council video page that gets updated frequently. Hashing based on content instead of URL will guarantee uniqueness regardless of updates to file content for the same URL.
This is primarily for the event gather pipeline but should be addressed anywhere else relevant. We can hash the entire file with a library like hashlib or something similar.
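A small sketch of the content hash (chunked so large video files never need to fit in memory; sha256 is just one reasonable choice):

import hashlib

def hash_file_contents(path: str, chunk_size: int = 64 * 1024) -> str:
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()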
Backend issue for the relevant roadmap issue
Adding speaker classification to CDP transcripts. This could be through a script/class that retroactively attaches the speaker name to a transcript that already has speaker diarization enabled. Prodigy can be used for annotating the training data.
With speaker classification we can provide transcripts annotated with the speaker. This can be used in many ways such as through a script or github action
Very high level idea would be to:
A bigger picture breakdown of all the major components can be found on the roadmap issue under "Major components".
With #150, we should come up with a more robust system for tracking how well our speech recognition model performs: both our Google configuration and, if we ever train our own model or switch services, simply a method for checking which model is currently performing best.
I have previously set up ASV for aicsimageio to track our IO perf per each commit and I think I could setup something up for this as well.
Unlike AICSImageIO's ASV setup which runs on every commit, I think I would only set it up to run / update on every tag / new release to reduce cost.
As for the dataset, I think I will try to find 5 meetings of Seattle City Council that have transcripts converted from closed caption files, and store those transcript JSONs in the repo / library to use as "ground truth". While the closed caption files aren't 100% correct, they are very close to it, and crucially, they are created by a government employee as a part of the meeting process and used on government websites. So it's really asking the question: "does the SR model configuration perform close to manual closed caption file creation?"
Finally as for metric, the general idea in my head is to process each of the transcripts with the model and then use word error rate to see how the model performs against the ground truth closed caption converted file.
cc @nniiicc
We had to remove the usage of resources_exists from the uri field validator because we couldn't think of a method to pass the Google credentials to the validator. At some point we may want to add this back, but this shouldn't hold up #49 from being merged.
Actually, after trying this, I think passing another param to the validator is a little tricky.
I can add kwargs to resource_exists, but I'm not sure if there's an easy way to pass fs into that function inside uri = fields.TextField(required=True, validator=validators.resource_exists).
Because model fields in FireO are assigned like db_file.uri = "gs://stuff", I'm not sure we can pass other params into the validator arg.
Would it make more sense to just validate separately outside of FireO's built-in field validation? Or is there an easier way to do this I'm missing?
Originally posted by @isaacna in #49 (comment)
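One possible direction, as an untested sketch: bind the credentialed filesystem into the validator up front with functools.partial so FireO still only passes the field value at validation time. The helper name is hypothetical and FireO's exact validator call signature would need to be double checked.

from functools import partial
from fireo.fields import TextField
from gcsfs import GCSFileSystem

def resource_exists_with_fs(fs: GCSFileSystem, value: str) -> bool:
    return fs.exists(value)

fs = GCSFileSystem(token="/path/to/credentials.json")

uri = TextField(
    required=True,
    validator=partial(resource_exists_with_fs, fs),
)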
A clear and concise description of the feature you're requesting.
Downscale and compress images to preemptively help speed up our webpage rendering times.
Please provide a use case to help us understand your request in context.
Specifically I am worried about the event search results / the general events filtering and query page due to trying to render ~10 events with all their thumbnails at the same time.
Please describe your ideal solution.
Please describe any alternatives you've considered, even if you've dismissed them.
Please provide a use case to help us understand your request in context
Maybe this should be a parameter on the CDPStack object, but currently our standard for infrastructure is to allow public read, so during stack creation we should set the default read permissions for the bucket and database to be public as well.
I say potentially a parameter because dev stacks don't need to be public read, which could save money for the dev (the dev stack with example data should really only cost like $0.01 though, since the example data should be really minimal).
Please describe your ideal solution
I haven't figured out which Pulumi resource is needed (or really if it is possible to do with the Pulumi API; it may need to be a second ComponentResource that calls out to gcloud in a subprocess).
But setting this and managing permissions through code would be incredibly useful.
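If no Pulumi resource covers the bucket ACL, the subprocess fallback mentioned above could be as simple as shelling out to gsutil (bucket name is a placeholder, and the Firestore rules side would still need its own handling):

import subprocess

# Grant public read on the file store bucket.
subprocess.run(
    ["gsutil", "iam", "ch", "allUsers:objectViewer", "gs://example-cdp-file-store"],
    check=True,
)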
Please describe any alternatives you've considered, even if you've dismissed them
While looking through the frontend and design work on the legislation tracking project, I realized for the first time that I think we may be missing a crucial piece of information which is a link to the actual full matter text.
The current Matter is defined as: link, but a matter definitely has full text and we should store a link to that full text. I propose full_text_uri or some variant of that.
Additionally, while we look into this, it would be good to investigate which MatterFiles make it through the pipeline: https://github.com/CouncilDataProject/cdp-backend/blob/main/cdp_backend/pipeline/event_gather_pipeline.py#L1444
I think the above try-except block may be dropping some MatterFile / MinutesItemFile attachments that would be useful to keep, so we may want to try to fix it if we do see that behavior.
Further dependency splitting, no need to have them all together.
Please provide a use case to help us understand your request in context
After talking to the Pulumi team, we can export entire stacks of resources as a CustomResource from this codebase and then just use it in a very small __main__.py file in a cookiecutter-cdp-instance repo / cookiecutter.
This would mean that the backend repo would be more tightly coupled to the database models and they would be versioned together. Which, imo, would be a very good thing.
Please describe your ideal solution
Create an infrastructure module in the cdp_backend library where we can export a CDPStack Pulumi CustomResource.
Follow the work done here as example resource definition
Follow the work done here as example resource usage
There should also be tests added when possible to check the infrastructure. But, at the very least, potentially add a GitHub action to always preview what would be created under the current version of the infrastructure / backend.
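A very rough sketch of what an exported CDPStack could look like (the type token, resource names, and child resources are illustrative only):

import pulumi
from pulumi_gcp import storage

class CDPStack(pulumi.ComponentResource):
    def __init__(self, name: str, opts: pulumi.ResourceOptions = None):
        super().__init__("cdp:infrastructure:CDPStack", name, None, opts)
        child_opts = pulumi.ResourceOptions(parent=self)
        # File store bucket as an example child resource
        self.file_store = storage.Bucket(
            f"{name}-file-store", location="US", opts=child_opts
        )
        self.register_outputs({"file_store_name": self.file_store.name})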
Please describe any alternatives you've considered, even if you've dismissed them
Please provide a use case to help us understand your request in context
Create a searchable index for the events gathered and upload it to Firestore.
Please describe your ideal solution
Migrate over and clean up the work done in cdptools@feature/update-index-pipeline
Additionally, see if this can entirely be done on GH Actions or if we need to use compute engine. Keeping the cost low would be great.
Please describe any alternatives you've considered, even if you've dismissed them
Please provide a use case to help us understand your request in context
Standardize and version the data format we accept for pipelines alongside the pipeline code itself, rather than on the scraper side.
Please describe your ideal solution
Migrate the work done by @isaacna on cdptools to this repo.
Potentially change the name from MinimalEventData and MinimalSessionData to just EventData and SessionData as I think with the Optional keys, they are the full event data spec.
We should also make objects for VoteData and such, which are a part of the Optional[List[MinutesItems]]. Basically, the entire data structure of all minimal data and optional data should be documented as the data structure we accept for pipelines.
This unfortunately will require a bit of copying of the database ORM definitions unless we can think of a way to programmatically construct all of these models, i.e. run through the DB ORM definitions and create NamedTuples of every single model with the appropriate Optional tags. I think it's possible.