ccao-data / data-architecture Goto Github PK

View Code? Open in Web Editor NEW

5.0 5.0 3.0 31.02 MB

Codebase for CCAO data infrastructure construction and management

Home Page: https://ccao-data.github.io/data-architecture/

Python 8.86% R 90.69% Mermaid 0.29% CSS 0.06% JavaScript 0.09%

aws aws-athena aws-s3 data-architecture data-engineering

data-architecture's People

Contributors

Stargazers

Forkers

bradley39e gaybro8777 suveer698

data-architecture's Issues

[Data catalog] Document remaining tables and views

The initial dbt project we added in #23 contains documentation for many of the Athena views that we build on top of our source data, but not all of them. It also does not contain documentation for source tables yet. Add documentation for these tables and views.

Update internal views

Was looking through the views when I was linting things a few weeks back and found a few to-dos:

The values_by_year CTE in vw_assessment_roll can be replaced with the equivalent view
The values_by_year CTE in vw_top_5 can be replaced with the equivalent view
The res model view needs to be updated with proper filling + linted with SQLfluff
Same as above for the condo model view

Update existing hydrology/distance to water feature

The current "distance to water" feature is based on the hydrology layer from TIGER/the U.S. Census. It doesn't really have any concept of whether each PIN is close to "good water" (a nice pond, the lake, pretty river, etc.) or "bad water" (retention pond, streams that cause flooding). We should update the feature with a labeled layer of good vs. bad water, and potentially split the feature into two separate ones. Ray is reaching out to Highway/GIS to see if we have a more complete or labeled water layer already.

[Data catalog] Define GitHub Actions workflow for rebuilding and deploying docs site

Every commit to the master branch should run a workflow that rebuilds and deploys our documentation site (#33).

Bonus points if you can get deploy previews set up, but not required.

Depends on #33.

Collect aggregate UAD statistics in S3

FHFA publishes quarterly UAD statistics aggregated to the Census tract level. We should collect this data in S3 for potential use in modeling and reporting.

We need to:

Create a raw data collection script. Download, unzip, etc.
Create a warehousing script. Collect from raw, rename, clean, impute, save as parquet

We should also:

Test using some aspects of this data directly in the Residential and condominium models

[Data Catalog] Finish outlining tests for default and add reporting and iasworld tests to sql guide

@jeancochrane is making progress on implementing dbt. We need to

Provide tests for more databases
Make sure the "default" database is fully covered

[Data catalog] Improve Jinja templating indentation in dbt files

It seems like SQLFluff doesn't quite understand how to handle for loops in our dbt Jinja templates, and it keeps wanting to pull the {% endfor %} tag onto its own line. For example:

data-architecture/dbt/tests/generic/test_column_length.sql

Lines 16 to 20 in 3f31ca0

 {% for column in columns %} 

 {%- set length_columns = length_columns.append( 

 [column, 'len_' + column] 

 ) %} 

 {% endfor %}

It's also not totally clear if SQLFluff wants the body of {% test %} and {% macro %} blocks to be indented. I would prefer if their bodies weren't indented, since the tags act more like file metadata than proper blocks given dbt's organization schema for tests and macros (where each test/macro gets its own file), but either way we should get SQLFluff to enforce the rule automatically.

Update our SQLFluff indentation rules so that we can fix and enforce both of these policies.

Eliminate randomness in `vw_pin_sale` caused by `row_number` and sale dates

When we join vw_pin_sale to other tables/views, row count is inconsistent. This is likely cause by

ROW_NUMBER() OVER (
    PARTITION BY sales.parid, sales.saledt ORDER BY sales.price DESC
) AS max_price,
-- Some pins sell for the exact same price a few months after
-- they're sold. These sales are unecessary for modeling and may be
-- duplicates
LAG(DATE_PARSE(SUBSTR(sales.saledt, 1, 10), '%Y-%m-%d')) OVER (
    PARTITION BY sales.parid, sales.price ORDER BY sales.saledt
) AS same_price_earlier_date

We may be able to fix this by ordering by wen field or something similar as well.

[Data catalog] Implement source freshness tests for critical iasWorld tables

Certain iasWorld tables are constantly being updated with new data. When running unit tests and reports, it is critical that these tables be as up-to-date as possible.

dbt builds in tests for source freshness that we can use to check that the most recent possible data is loaded. Now that all iasWorld tables are sources (as of #71), we should use this functionality as an additional check (in addition to the current CloudWatch checks) to ensure that iasWorld data was successfully updated by sqoop.

Specifically, we should target the following tables:

HTPAR
ASMT_ALL
PERMIT

dbt source freshness requires a loaded_at_field. In our case, we should use the wen column. It can also use an optional filter, which in our case should constrain the check to the most recent tax year.

Remove deactivated rows from appeal view

Some tables in iasWorld contain rows that have been deactivated - these rows should not be used for anything by Data, they are maintained in iasWorld purely for historical purposes. As of now, I only know of three tables where we need to be wary of removing deactivated rows - htpar, land, and asmt_all. I've added the conditional WHERE deactivat IS NULL where necessary.

[Data catalog] Merge `data-catalog` branch into `master`

Once we're feeling ready to put our data catalog work into production, merge the data-catalog branch into the master branch of this repo.

This will require updating our GitHub Actions workflows (#31) to remove references to data-catalog and point any operations that were running against that environment to run against the prod environment. Note that since the data-catalog branch runs dbt in the ci profile, we'll also want to double-check references to --target ci in the data-catalog branch context and switch them to --target prod where appropriate.

Fix land sf aggregation in res and condo chars views

This issue came up in appeal worksheet process and should be addressed in our characteristics views as well. Some parcels have multiple landlines that shouldn't actually be added given a non-null value for influ in iasworld.land

Remove MAX clauses from reporting.vw_assessment_roll_test and use conditionals to deduplicate instead

Update data onboarding ticket

We should update the data onboarding ticket to reflect the current state of this repo + include a requirement/task to update the data catalog.

Gather 2023 existing data for 2024 modeling

Collect res and condo third-party data (spatial data, environmental data, etc). Update the ingest scripts in aws-s3 directory.

Combine MS building footprint data with LIDAR to estimate building sqft

Microsoft publishes US building footprint data that contains GeoJSON polygons of (almost) every building in Cook County. ISGS publishes LiDAR-based estimates of height, including building height.

Condos currently lack building or unit square footage, but we may be able to combine these two data sources to produce an estimate of building size or even a rough unit sqft.

Steps

Download and join building footprint data to Cook County PINs
- Get lat/lon of all 299/399 properties convert to geography st_coords
- Get all building footprints in Cook County
  - Either clip overall US spatial layer with Cook County shape
  - OR if they include GEOID from Census, then use 17031
- Perform a spatial intersection polygon-to-point, attaching polygon geometry of footprint to point (meta_bldg_pin)
Download LiDAR (.las) point-cloud data from ISGS. Estimate building altitude by averaging Z coord for all points classified as "building" within a footprint. Estimate ground altitude using the average of Z coords from a small (2m) buffer of points classified as "ground" around each building. Take the difference to find the approximate building height.
Combine building footprint and height to approximate the total square feet of the building. Estimate internal area (sq ft) using an average # of feet per story (10ft, or look for a better estimate)
Confirm the accuracy of the footprint, LiDAR, and combined data using a combination of City footprint/height/story data and spot checks of buildings for which we can find existing measurements (look for historic buildings, they usually have lots of info).
Produce a final Rmarkdown report on the accuracy of the data sources and the result of their combination. Would like to see:
- Map of sample building footprints + their height
- Distributions of error for each data source (MS footprints vs City, LiDAR vs City or spot check, some measure of the accuracy of # of stories estimate)
- Very brief walkthrough of the steps taken to produce the combined data
- Outline of any known or potential problems with the data or methodology used to clean/combine it
- Brief review of the licensing of both data sets to ensure we can use them
Commit all code for cleaning, joining, report, etc. to a separate branch named after this issue

[Data catalog] Rename dbt models to remove the `_test` suffix

The dbt models defined in #23 (and potentially #24) all included a _test suffix in order to prevent us from overwriting production views before we had dedicated dev/prod environments in place (#28).

Once the environments have been established, update all relevant models to remove the unnecessary _test suffix from their model names, which will allow us to start tracking our production models using dbt.

Depends on #28.

Add building touching / shared wall indicator

Many buildings in the city and denser suburbs have shared walls (i.e. they are touching). This not only applies to rowhomes and townhomes, but also mixed multi-family housing like 211s and 212s.

We should create a "shared wall" or "touching" indicator for each parcel. This indicator may not only affect value, it is also useful for things like averaging groups of like buildings (based on the assumption that building groups that share walls are likely to have similar values).

We can create this feature using parcel boundary and building footprint data already in S3. Spatial relations such as "touching" should be sufficient for a first pass attempt.

Decompose modelling views into reusable components

The modeling views share a lot of code that shouldn't be reproduced. We can move this code into its own views that can be shared across the modelling views to ensure consistency:

acs5 AS (
    SELECT *
    FROM census.vw_acs5_stat
    WHERE geography = 'tract'
),

housing_index AS (
    SELECT
        geoid,
        year,
        AVG(CAST(ihs_index AS DOUBLE)) AS ihs_avg_year_index
    FROM other.ihs_index
    GROUP BY geoid, year
),

sqft_percentiles AS (
    SELECT
        ch.year,
        leg.user1 AS township_code,
        CAST(APPROX_PERCENTILE(ch.char_land_sf, 0.95) AS INT)
            AS char_land_sf_95_percentile
    FROM default.vw_pin_condo_char AS ch
    LEFT JOIN iasworld.legdat AS leg
        ON ch.pin = leg.parid AND ch.year = leg.taxyr
    GROUP BY ch.year, leg.user1
),

tax_bill_amount AS (
    SELECT
        pardat.parid AS pin,
        pardat.taxyr AS year,
        pin.tax_bill_total AS tot_tax_amt,
        tax_code.tax_code_rate AS tax_rate
    FROM iasworld.pardat
    LEFT JOIN tax.pin
        ON pardat.parid = pin.pin
        AND (
            CASE WHEN pardat.taxyr > (SELECT MAX(year) FROM tax.pin)
                    THEN (SELECT MAX(year) FROM tax.pin)
                ELSE pardat.taxyr
            END = pin.year
        )
    LEFT JOIN (
        SELECT DISTINCT
            year,
            tax_code_num,
            tax_code_rate
        FROM tax.tax_code
    ) AS tax_code
        ON pin.tax_code_num = tax_code.tax_code_num
        AND pin.year = tax_code.year

    WHERE pin.pin IS NOT NULL
),

school_district_ratings AS (
    SELECT
        district_geoid,
        district_type,
        AVG(rating) AS school_district_avg_rating,
        COUNT(*) AS num_schools_in_district
    FROM other.great_schools_rating
    GROUP BY district_geoid, district_type
)

Consolidate IC valuation model data

Currently the public needs to download individual files from our website if they want commercial valuation data. We can aggregate it and host it on socrata.

[Data catalog] Update views to point to dbt stub models

Once stub dbt models are implemented for non-DAG tables in #26, we'll be able to update the FROM clauses in our view definitions to ref() the models. This will allow us to produce a full lineage graph of our views in our dbt docs.

Depends on #26.

Investigate existing data integrity issues

We need to investigate the following data integrity tests that are currently failing.

Queries for each of these tests are saved under the test name in the Athena saved query dashboard. These queries should be useful for investigation.

`vw_pin_appeal`

vw_pin_appeal_unique_by_14_digit_pin_and_year
- Failures: 280662
vw_pin_appeal_unique_by_case_number_and_year
- Failures: 365905
vw_pin_appeal_no_unexpected_change_values
- Failures: 43
vw_pin_appeal_change_matches_appeal_outcome
- Failures: 266758

`vw_card_res_char`

vw_card_res_char_unique_by_card_and_year
- Failures: 913
vw_card_res_char_renovation_fields_match
- Failures: 73941

`vw_pin_address`

vw_pin_address_no_extra_whitespace
- Failures: 880607

`vw_pin_condo_char`

vw_pin_condo_char_unique_by_14_digit_pin_and_year
- Failures: 70297

vw_pin_sale

unique_vw_pin_sale_doc_no
- Failures: 1711769
vw_pin_sale_reasonable_number_of_sales_per_year
- Failures: 3192

vw_pin_universe

vw_pin_universe_class_count_is_consistent_by_year
- Failures: 24

Scoping a "data mart"

Placeholder issue for upcoming "Data Mart" / "Factbook" progress in Q423 or Q124.

[Data catalog] Add script to support incremental testing of dbt models

I anticipate we will be much faster at defining data integrity checks for our models than we are at actually fixing the data problems revealed by those checks. If that's true, we'll need to wrap dbt test in a script that can distinguish between expected and unexpected test failures, or else our CI test runs will fail constantly.

I'm not totally sure what the best way to do this is, so this issue will require some design work. We'll probably want some sort of version controlled config file that we can use to map tests to their expected number of failures; perhaps we can even include this in the dbt schema.yml files via a plugin API.

Implement a data catalog system for existing CCAO data architecture

Goal

Design and implement a data catalog system for internal use that improves our data documentation, quality, and orchestration.

Overview

The Data Department currently has an extremely ad-hoc and incomplete data catalog system, spread out over multiple spreadsheets and services. We should consolidate these efforts into a single place or platform to serve as a "source of truth" for our department. Ideally, the catalog should provide (in descending rank order of priority):

Schema / referential integrity validation between linked tables
Data quality and integrity metrics, see #15
Pipeline orchestration and scheduling + validation
Some insight into data sourcing and lineage (what changed, when, etc.)
Documentation about tables, columns, and relationships (metadata, notes, etc.)

Problems to solve

See "Initial Issues" in the draft architecture project overview here.

Possible solutions

This seems like an area where lots of good software solutions already exist, since these needs are common across organizations. The two I've looked at are DBT and Glue, both of which could likely meet the needs provided above. However, I'm sure there are plenty of other possible solutions.

@jeancochrane

[Data catalog] Flesh out dev docs on how to build and test with dbt

We have some barebones documentation in dbt/README.md, but currently it only explains how to install requirements and run some basic build and test commands. We should also document the scope of the project more broadly and explain how to develop it.

Some topics to cover, off the top of my head:

Motivation & link to design doc
Link to published documentation
Development
- Adding a new data source
  - materialization trickiness
- #108
- Adding a new model
- Adding a new test
- How and when tests are run automatically
- What to do if cleanup-dbt-resources fails
- Testing as you develop
- Debugging this error by clearing the state:

Compilation Error
  dbt found two schema.yml entries for the same resource named location.vw_pin10_location. Resources and their associated columns may only be described a single time. To fix this, remove one of the resource entries for location.vw_pin10_location in this file:
   - models/location/schema.yml

Depends on the final pieces of infrastructure being complete, namely
#55 and ccao-data/service-sqoop-iasworld#1.

Replace aggregated land calculation in all views with default.vw_pin_land

Land aggregation is tricky. To make sure we get it right across all views, we should only pull it from default.vw_pin_land.

[Data catalog] Document existing data integrity problems

I need to produce an issue with the following information:

Which data integrity checks are failing
What the checks mean
How to investigate them
Status of the investigation

The goal here is to have a central source of truth that we can use to track our progress resolving these issues.

[Data catalog] Deploy the dbt-generated documentation as a static site

Running dbt docs generate in the dbt/ directory will generate target/index.html, an HTML version of the documentation for our dbt models and tests. Use this file to deploy our documentation as a static site.

There are a number of good options for deploying static sites; in roughly descending order of preference:

GitHub Pages
AWS S3 with a CloudFront CDN
Netlify

GitHub Pages is preferred since it is easy to use and already integrated into GitHub; S3 + CloudFront is next on the list since we already pay for AWS; and Netlify is last in the priority queue because we don't use it yet, but it's very easy to set up so it's a plausible option if the first two don't work for whatever reason.

Consider a geography_type / PIN-level indicator for assessment stage

[Data catalog] Add stub models for tables/views defined outside the DAG

The models defined in our dbt project currently only represent the Athena views that we build on top of our source data, but we would also like to include models for our source data tables so that we can add them to our data catalog documentation. To accomplish this, we'll need to add "stub" models that don't actually build new data, they just allow dbt to recognize the model as a valid entity that can be documented.

For guidance, see this dbt community post: https://discourse.getdbt.com/t/representing-non-sql-models-in-a-dbt-dag/2083

If this issue gets picked up before we have a proper dev/prod split set up (#28), it might be wise to continue following the pattern of suffixing models with _test until we have environments up and we can safely remove the suffix (#29).

[Data catalog] Factor out dbt models to share logic with Athena views during development

Inspired by this thread. Briefly: We want to avoid duplication between dbt/models/ and aws-athena/views/ as much as possible during development of the data catalog; in order to do that, we need to have these two versions of our views share as much logic as possible.

To do that, we plan to create a long-lived branch named data-catalog that we will use for development of the data catalog. We will merge all of our data catalog work into this branch. On the data-catalog branch, dbt models will be symlinked to the view definitions in aws-athena/views/, and tweaked minimally so as to make them work with dbt (most importantly, removing CREATE TABLE AS statements and using dbt references for other views in the DAG). Then, as changes are made to the views on the master branch, we can merge them into data-catalog and resolve merge conflicts; when we're ready to go live with the data catalog, we can merge data-catalog into master.

Simplify, update, and de-duplicate condo characteristics view

A couple (2) duplicates have made there way back into this view as it's gotten larger and more complicated. It also includes PINs it shouldn't (namely those that have been deactivated).

Create corner lot indicator

Overview

Assumption: corner lots tend to be more valuable than non-corner lots due to their size, increased accessibility (for commercial), and prominent location.

Given this assumption, we should add a corner lot indicator variable to all AVMs, as well as to commercial valuation spreadsheets/methods. We can impute a corner lot variable using a variety of GIS methods, including:

Examining the nodal complexity of the street network
Touching parcel boundaries at specific angles
Lot size compared to neighbors

We should use a combination of methods to output a single indicator (0 or 1) for each PIN in the county. N at Mansueto and J at CSDS will likely have good ideas on this.

Task

Create an R script that takes property locations (lat, lon) or shapes (parcel file) as an input, then outputs a corner lot indicator (1 or 0) for each PIN. Start with a prototype script using local data. The final script should pull raw data from S3 and write back to S3. It should run as a glue job. The script should be saved to was-glue/jobs/location-corners.R. Use the branch attached to this issue to save your work and submit it for review.

@miaoqi, figuring out how to identify these lots algorithmically will be a large part of this task. There is existing literature in the space in various geography and spatial data science journals. I can also put you in touch with folks at UChicago who can provide help if needed. Let's aim to have a running script by Feb 15th.

Some R libraries you'll probably need to get started:

sf, for all spatial operations
igraph, for any graph manipulation i.e. street networks
osmdata, for OpenStreetMap data i.e. roads

Monitor taxdist field in LEGDAT

The taxdist field in iasWorld's LEGDAT table contains the unique Cook County tax code of each PIN. Tax codes can change year-over-year as new taxing districts are created or destroyed. The LEGDAT table taxdist field is currently not updated to reflect those changes, it is frozen to tax year 2021 (even for later years). As such, this field should not be used for anything other than determining a PIN's township code.

Once the Clerk migrates to iasWorld, the responsibility for updating this field (and retroactively updating it) will fall to them.

[Data catalog] Define GitHub Actions workflows for building the dbt DAG and running tests

Overview

We want to use GitHub Actions to run the dbt run and dbt test commands that build and test our DAG, respectively. To do this, we'll need to add GitHub Actions workflow definitions to this repository.

Workflow definitions

Our workflow definition should support two different types of flows:

Rebuild all models that have changed since the last cached run, and run their tests
Run all the tests, regardless of the result of the last cached run

These two flows will be used in two different ways:

will provide continuous integration (CI) for dbt models in this repository, building and testing models when we make code changes to them; while
will provide a test interface that we can call from the GitHub Actions workflow API to run data integrity checks after pulling fresh source data from the system of record each night.

Caching is important in CI to help speed up development cycles, but it's unnecessary in the context of our nightly data integrity checks, where we want to validate all of the data on each run.

Cache behavior

The CI workflow (1 above) should exhibit the following cache behavior:

On every PR:
- Run the build and tests for models that have changed since last commit to master OR since the last successful workflow run for this PR
  - In other words: The first workflow run for any PR should use the cache from the master branch, and subsequent runs should use the cache from the first successful run on the PR branch
  - These builds and tests should run in a separate development environment, ideally one that is created exclusively for the PR and not shared by other PRs; we should use the same environment scheme set up in #28
  - The master branch cache should never be updated by this flow
On commits to master:
- Run builds and tests for models that have changed since the last commit to master
  - These builds and tests should run against the prod Athena environment
  - The master branch cache should be updated when this flow succeeds

AWS credentials

In order to run dbt commands against Athena from the context of a GitHub Action workflow, we'll need to inject valid AWS credentials into the workflow. Credentials should be stored as encrypted secrets and should have their permissions restricted as much as possible to reduce the attack surface of the credentials. See the dbt-athena docs for a list of the required permissions for the adapter.

Incremental testing

Since our CI tests will only be useful if they can distinguish between expected and unexpected data integrity issues, this issue depends on #32.

Create reporting view with most recent boundaries

We anticipate being asked to provide pins within or in addition to their current political boundaries. This data doesn't live in any tables or views currently because it's not relevant to our modeling pipeline.

Add check for 299 switching to vacant land PIN

Per valuations, when a condo building is demolished, often its primary PIN will revert to vacant land or an otherwise land-only class, however, the sub-PINs for each individual unit may still be 299s and thus included in reporting etc.

We should add a check for this edge case to the model and reporting to ensure anything that changes from a 299 to a land-only class is booted from the model and stats.

Store data infrastructure configurations as code

Currently, AWS resources are configured manually via the console. However, this approach will not scale as our infrastructure becomes more complicated. We should create configurations for Glue, S3, etc. using CloudFormation or Terraform.

[Data catalog] Add remaining tests to DAG models

Once we've added models and documentation for the remaining Athena views that we want to build with dbt (#26), we'll want to add tests for them. We'll also want to finish up the tests for all of the existing models that are marked with TODO (indicating that Jean didn't understand the test and required clarification). Billy will be a helpful resource for defining the remaining tests.

See this doc for explanations of the remaining tests.

This is a separate story since I'm anticipating that the work to define and write these tests will generate enough code to warrant a dedicated pull request, but if that's not true, feel free to roll it into #26.

Depends on #26 and #61.

[Data catalog] dbt CI is not properly selecting for new and modified models

I noticed that the most recent CI run for the data-catalog branch did not (re)build any of the models we added or changed in #64. Testing things out locally, I noticed two problems:

We're missing the state:new selector in our workflow configuration, which is intended to select for new models
In my local environment, neither state:new nor state:modified properly picks up additions or changes to the model definition

Investigate replacing AWS Athena with DuckDB

DuckDB is the new thing in the data engineering space right now. It's basically SQLite but for analytical workloads instead of application workloads. It might fit our use-case pretty well and simplify some existing/future complexity. We should investigate its viability for our data at some point in the near future.

DuckDB's biggest value add for us is its ability to read Parquet files directly from S3. Something like the code below could be used to create a local database file full of views that are effectively pointers to S3 buckets. We could then build views on top of those pointers, just like in Athena. This would essentially make a DuckDB file our main query layer, replacing Athena.

library(duckdb)

conn <- DBI::dbConnect(duckdb("test.db"))

dbExecute(
  conn,
  "
  INSTALL httpfs;
  LOAD httpfs;
  SET s3_region='---';
  SET s3_access_key_id='---';
  SET s3_secret_access_key='---';
  "
)

dbExecute(
  conn,
  "
  CREATE OR REPLACE SCHEMA model;
  CREATE OR REPLACE VIEW model.shap AS
    SELECT *
    FROM parquet_scan('s3://bucket_name/shap/year=2023/run_id=2023-03-15-clever-kyra/*/*.parquet', hive_partitioning=true)
  "
)

shap <- dbGetQuery(
  conn,
  "
  SELECT *
  FROM model.shap
  WHERE run_id = '2023-03-15-clever-kyra'
  AND year = '2023'
  AND township_code = '77'
  "
)

Why (would we do this)

DuckDB is fast. It's much faster than Athena, even over network. The simple test query above runs in 1.4s via DuckDB vs 9.2s via Athena. Views and more complicated queries could see even bigger speedups. Though query speed is not a top concern, this would help with some use cases (feeding Tableau, future live applications, model input views).
No need for AWS Glue or crawlers. DuckDB uses the metadata of the Parquet files for queries, meaning we wouldn't need to maintain a metastore via Glue's data catalog and new partitions would be found automatically. This does have some huge potential downsides, see complications.
Easier local and CI testing. Now that we're building out dbt, we'll need separate test/prod environments in Athena. This will involve a lot of complexity (see #31 and #28), as we'll need to potentially maintain separate tables and views for testing, or do something truly heinous like this. For DuckDB, building and testing new versions of views could be done locally or entirely within GitHub Actions, no separate test env needed.
DuckDB supports more of standard SQL than Athena, including constraints, indexes, and keys. Those could be useful for our efforts to enforce referential integrity on our existing data.
Smaller one-off databases can be built to support specific applications and speed up query times.

How (would we do this)

I think DuckDB could essentially be a drop-in replacement for Athena. We would create stub views (as shown above) that act as tables currently do in Athena. We would then just build our existing views on top of those pointers.

The result would be a local file that contains pointers and view definitions for our entire data lake. This file would be built and tested via GitHub Actions and then uploaded to S3. From there, we could download a single local copy of the file to the main Data Dept. server, and point all connections to that file. Any changes to the views or underlying data would kick off a new workflow and replace the local file.

If we wanted, we could also just build the entire current version of the data lake into the DB file. This would likely speed things up even further for local queries on the server.

Complications

DuckDB is intended to be something similar to SQLite i.e. it's not intended for multiplayer, online use. Syncing the local copy of the DB file would be a pain. There are lots of people working on solutions to this problem, but they are pretty complicated:
Applications like Tableau would need a local copy of the DB file in order to stay current.
We currently use Glue for orchestration stuff, and we still are going to need an orchestrator for some things. This doesn't solve that problem.
Can you replace a DuckDB file mid-connection? Because that would likely happen constantly if we were replacing a local copy via GHA. Edit: it seems like the R package stores the S3 pointers in-memory, as you can still query them even after completely deleting the local DB file.
Certain schema issues are currently fatal or nearly unworkable with DuckDB. See duckdb/duckdb#8018. This would be a big issue for us as many of our tables have changing columns over time.

Add pre-commit styling and linting to R and Python scripts

We should add style and lint pre-commit hooks for everything in aws-s3/. Update .pre-commit-config.yaml then run pre-commit run --all-files.

I would consider adding the Action for GitHub's super linter.

Remove inconsistent row counts from default.vw_card_res_char

char_recent_renovation isn't partitioned by card, which it must be since dweldat isn't unique by just PIN.

[Data catalog] Add production profile to the dbt configuration

The dbt profile created in #23 only supports a development Athena database. We would also like to add a profile that supports building views and tables into the production Athena databases.

This might be slightly tricky, since we override the database defined in profiles.yml via a folder-level config in the models attribute of dbt_project.yml. We do this so that we can build models into separate databases depending on their surrounding subdirectory in the models/ directory, e.g. default/ or location/. This means we can't just create a new prod entry in profiles.yml, which is the recommended way of supporting multiple environments.

We may want to investigate using a separate Athena Data Catalog for dev and production, but that would depend on dev being able to read data from the production catalog, since our source data is built directly into the production Data Catalog.

Ingest CSDS travel time matrices

The UChicago Center for Spatial Data Science (CSDS) has created a new set of national tract-to-tract travel time matrices for different modes of transit. We should ingest these and then potentially replace some of our "distance to X" features with actual travel times. Long-term, exploratory project, no rush. Ask @dfsnow for the link to the files.

Add remaining views and tables to sql guide document

We need to complete our testing regime for DBT by documenting how our tables and views should function.

Setup S3 triggers for Glue crawlers

Currently, all our AWS Glue crawlers must be triggered manually or are run on a schedule (immediately after some files are expected to update). A better design would have the crawlers run automatically when new files are posted to the relevant S3 bucket.

This is a common design pattern for which there are many tutorials:

Automate data integrity checks

Goal

Design and implement a system of automated data integrity checks to validate our expectations about fields and join results. We need unit tests for data + transforms.

Overview

The CCAO has a complex backend system-of-record that ingests data from many processes, people, and places. Unfortunately, this system also lacks many of the data validation and integrity checks found in more robust systems. The result is frequent issues with incorrect records and invalid data types (e.g. a PIN populating a numeric field). Such errors flow downstream until they reach the Data Department's lake, where we will manually flag them for correction (via email).

We need to develop a much more robust and automated process for performing checks to detect invalid data and bad joins. I'm imagining this as a sort of "unit tests for data," wherein we have expectations/assertions about what the possible values of a field are (or what the results of a join should be) that must be met. Ideally, such a system would also provide a way to catalog any errors and notify relevant parties. I imagine this as an extension of #10, and likely built on top of that system.

Problems to solve

See "Initial Issues" in the draft architecture project overview here.

Notes

@wrridgeway Has created a spreadsheet of assertions/assumptions that we already have about columns in the iasWorld database. This would be a great place to start when building tests.
Other parties in the CCAO, including Tia, Mirella, and Legacy, run their own system of QC checks. However, many of their checks are manual and happen at the end of other review processes. My hope is that we can take the checks they're already running and incorporate them into this system, thus catching problems much earlier (and automatically).

@jeancochrane

[Data catalog] Add documentation for tables/views defined outside the DAG

Once we have stub models for the tables/views that are not actually built by the dbt DAG (#24) we should add documentation for those tables and views.

Depends on #24.

	{% for column in columns %}
	{%- set length_columns = length_columns.append(
	[column, 'len_' + column]
	) %}
	{% endfor %}

ccao-data / data-architecture Goto Github PK

data-architecture's People

Contributors

Stargazers

Forkers

data-architecture's Issues

Steps

vw_pin_appeal

vw_card_res_char

vw_pin_address

vw_pin_condo_char

vw_pin_sale

vw_pin_universe

Goal

Overview

Problems to solve

Possible solutions

Overview

Task

Overview

Workflow definitions

Cache behavior

AWS credentials

Incremental testing

Why (would we do this)

How (would we do this)

Complications

Goal

Overview

Problems to solve

Notes

Recommend Projects

Recommend Topics

Recommend Org

`vw_pin_appeal`

`vw_card_res_char`

`vw_pin_address`

`vw_pin_condo_char`