
Disclosure Backend Static

The disclosure-backend-static repo is the backend powering Open Disclosure California.

It was created in haste in the run-up to the 2016 election, and is thus engineered around a "get it done" philosophy. At that time, we had already designed an API and built (most of) a frontend; this repo was created to implement both as quickly as possible.

This project implements a basic ETL pipeline that downloads the Oakland Netfile data, downloads the human-curated CSV data for Oakland, and combines the two. The output is a directory of JSON files that mimic the existing API structure, so no client code changes are required.

Prerequisites

  • Ruby (see version in .ruby-version)

Installation

Note: You do not need to run these commands to develop on the frontend. All you need to do is clone this repository adjacent to the frontend repo.

If you'll be changing the backend code, install the development dependencies with these commands:

brew install postgresql
sudo pip install -r requirements.txt
gem install pg bundler
bundle install

Note: There appears to be a problem on Macs with Apple silicon. If, when running make import, you get:

ImportError: You don't appear to have the necessary database backend installed for connection string you're trying to use. Available backends include:

PostgreSQL:	pip install psycopg2

Try the following:

pip uninstall psycopg2-binary
pip install psycopg2-binary --no-cache-dir

Codespaces

This repository is set up to run in a container under Codespaces. In other words, you can start a fully configured environment without performing any of the installation steps required for a local setup. This is a handy way to troubleshoot code before it is committed to the production pipeline. To get started with Codespaces:

  1. Go to the page displaying the repository code: https://github.com/caciviclab/disclosure-backend-static
  2. Change the branch to the branch that you're interested in running and editing.
  3. Click on the Code button and click the Codespaces tab in the dropdown
  4. Click on the button to start a new codespace
  5. Wait for the container to be set up and for a terminal prompt at /workspace to appear in the web page; the interface will look familiar if you've worked with VS Code before
  6. In the terminal, you can start running the commands from the next section, such as make download
  7. A Postgres server will already be running on localhost; simply type psql in the terminal to connect to it
  8. The make import command will populate the Postgres database
  9. Click the cylinder icon on the left to access the SQL Tools extension, which will allow you to connect to Postgres and view its schema, as well as query the data that you imported
  10. The container has Python 3.9 and a current version of Ruby installed, along with all the required packages
  11. We will try to get the container to match the Travis CI environment, including installing the same Ruby version
  12. This same setup allows anyone to run the container on their local machine in VS Code using the Dev Containers extension, but we are mainly focused on enabling Codespaces right now and hardening that setup first. (We'll add instructions for Dev Containers once this is solid.)
  13. If you make changes in your Codespaces environment, don't forget to push them after committing with Git. The codespace is like another machine, so your changes won't end up in the GitHub repository unless you do a git push

Running

Download the raw data files. You only need to run this once in a while to get the latest data.

$ make download

Import the data into the database for easier processing. You only need to run this after you've downloaded new data.

$ make import

Run the calculators. Everything is output into the "build" folder.

$ make process

Optionally, reindex the build outputs into Algolia. (Reindexing requires the ALGOLIASEARCH_APPLICATION_ID and ALGOLIASEARCH_API_KEY environment variables).

$ make reindex

If you want to serve the static JSON files via a local web server:

$ make run

Developing

Maintaining database schema

When make import is run, a number of Postgres tables are created to hold the downloaded data. The schema of these tables is explicitly defined in the dbschema directory and may have to be updated to accommodate future data. In particular, columns that hold string data may not be sized large enough: if a name column accepts names of at most 20 characters and we later receive data where a name is 21 characters long, the import will fail. When this occurs, update the corresponding schema file in dbschema to allow more characters, then re-run make import to verify that it succeeds.
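As a sanity check before running make import, you can scan a downloaded CSV for values that exceed the declared column widths and find which dbschema file needs widening. This is a minimal sketch, not repo code; the COLUMN_WIDTHS mapping and the Filer_NamL column are hypothetical stand-ins for whatever the relevant dbschema file actually declares.

```python
import csv
import io

# Hypothetical widths; the real ones are declared in the SQL files
# under dbschema/.
COLUMN_WIDTHS = {"Filer_NamL": 20}

def find_overflows(csv_text, widths):
    """Return (column, value) pairs whose value exceeds its declared width."""
    overflows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, width in widths.items():
            value = row.get(column) or ""
            if len(value) > width:
                overflows.append((column, value))
    return overflows

sample = "Filer_NamL\nShort Name\nA Name That Is Way Over Twenty Chars\n"
print(find_overflows(sample, COLUMN_WIDTHS))
```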

Checking output data changes

This repository generates the data files used by the website. After make process is run, a build directory is generated containing the data files. This directory is checked in to the repository and later checked out when generating the website. After making code changes, it is important to compare the newly generated build directory against the one generated before your changes and verify that the differences are what you expect.

Because a strict comparison of the full build directory will always show changes that occur independent of any code change, every developer would have to know about these expected changes in order to perform this check. To remove that need, a dedicated script, bin/create-digests.py, generates digests for the JSON data in the build directory after excluding the expected changes. To look for meaningful changes, simply look for a change in the build/digests.json file.

Currently, these are the expected changes that occur independent of any code change:

  • timestamps change for each run
  • top contributors lists contain undefined ordering of contributors with the same contribution
  • top spenders lists contain undefined ordering of spenders with the same spending
  • rounding differences for floats

The expected changes are excluded before generating digests for data in the build directory. The logic for this lives in the clean_data function in bin/create-digests.py. Once the code is modified so that an expected change no longer occurs, the corresponding exclusion can be removed from clean_data. For example, the rounding of floats is not consistent between runs of make process, due to differences in the environment. When the code is fixed so that rounding is stable as long as the data hasn't changed, the round_float call in clean_data can be removed.
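Conceptually, the digest step looks like the following sketch. This is not the actual bin/create-digests.py: the "timestamp" field name and the rounding precision are hypothetical, and the real clean_data handles more cases.

```python
import hashlib
import json

def clean_data(obj):
    """Strip expected run-to-run differences before hashing (a sketch;
    the real logic lives in clean_data in bin/create-digests.py)."""
    if isinstance(obj, dict):
        # Drop timestamp-like fields, which change on every run.
        return {k: clean_data(v) for k, v in obj.items() if k != "timestamp"}
    if isinstance(obj, list):
        # Sort lists so ties (e.g. equal top contributors) compare stably.
        return sorted((clean_data(v) for v in obj), key=json.dumps)
    if isinstance(obj, float):
        # Mask environment-dependent rounding differences.
        return round(obj, 2)
    return obj

def digest(obj):
    """Stable digest of the cleaned data."""
    canonical = json.dumps(clean_data(obj), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()
```

With this in place, two runs that differ only in timestamps, tie ordering, or float noise produce identical digests, so any digests.json change signals a real data change.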

An additional script has been created to generate a report that enables comparing the totals for candidates. The script is bin/report-candidates.py and it generates build/candidates.csv and build/candidates.xlsx. The reports include a list of all the candidates and totals calculated multiple ways that should add up to the same number.

To ensure that database schema changes are visible in pull requests, the complete Postgres schema is also saved to a schema.sql file in the build directory. Because the build directory is automatically re-built for each branch in a PR and committed to the repository, any schema change caused by a code change will show up as a difference in schema.sql when reviewing the PR.

Adding a calculator

Each metric about a candidate is calculated independently. A metric might be something like "total contributions received" or something more complex like "percentage of contributions that are less than $100".

When adding a new calculation, the official Form 460 is a good place to start. Is the data you are looking for reported on that form? If so, you will probably find it in your database after the import process. There are also a couple of other forms that we import, like Form 496. (These are the names of the files in the input directory. Check those out.)

Each schedule of each form is imported into a separate postgres table. For example, Schedule A of Form 460 is imported into the A-Contributions table.
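For instance, using an in-memory SQLite database as a stand-in for Postgres, summing one filer's Schedule A contributions looks like the sketch below. The table and column names follow the A-Contributions / Tran_Amt1 convention above; the Filer_ID values are made up.

```python
import sqlite3

# Toy stand-in for the imported postgres tables: one table per schedule.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "A-Contributions" (Filer_ID TEXT, Tran_Amt1 REAL)')
conn.executemany(
    'INSERT INTO "A-Contributions" VALUES (?, ?)',
    [("1234567", 100.0), ("1234567", 50.0), ("7654321", 25.0)],
)

# Total contributions received by one filer.
total, = conn.execute(
    'SELECT SUM("Tran_Amt1") FROM "A-Contributions" WHERE "Filer_ID" = ?',
    ("1234567",),
).fetchone()
print(total)
```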

Now that you have a way of querying the data, come up with a SQL query that calculates the value you are trying to get. Once you can express your calculation as SQL, put it in a calculator file like so:

  1. Create a new file named calculators/[your_thing]_calculator.rb
  2. Here is some boilerplate for that file:
# the name of this class _must_ match the filename of this file, i.e. end
# with "Calculator" if the file ends with "_calculator.rb"
class YourThingCalculator
  def initialize(candidates: [], ballot_measures: [], committees: [])
    @candidates = candidates
    @candidates_by_filer_id = @candidates.where('"FPPC" IS NOT NULL')
      .index_by { |candidate| candidate['FPPC'] }
  end

  def fetch
    @results = ActiveRecord::Base.connection.execute(<<-SQL)
      -- your sql query here
    SQL

    @results.each do |row|
      # make sure Filer_ID is returned as a column by your query!
      candidate = @candidates_by_filer_id[row['Filer_ID'].to_i]

      # change this!
      candidate.save_calculation(:your_thing, row[column_with_your_desired_data])
    end
  end
end
  3. You will want to fill in the SQL query and make sure that the query selects the Filer_ID column.
  4. Make sure to update the call to candidate.save_calculation. That method will serialize its second argument as JSON, so it can store any kind of data.
  5. Your calculation can be retrieved with candidate.calculation(:your_thing). You will want to add this into an API response in the process.rb file.

Data flow

This is how the data flows through the back end. Finance data is pulled from Netfile and supplemented by a Google Sheet mapping filer IDs to ballot information like candidate names, offices, ballot measures, etc. Once the data is filtered, aggregated, and transformed, the front end consumes it and builds the static HTML front end.

Diagram showing how finance data flows through different disclosure components

Common Errors

During Bundle Install

error: use of undeclared identifier 'LZMA_OK'

Try:

brew unlink xz
bundle install
brew link xz

During make download

wget: command not found

Run brew install wget.

disclosure-backend-static's People

Contributors

adborden, anlawyer, chenglimear, ckingbailey, dependabot[bot], gauravmk, joheyjo, joshling1919, matthewchatham, mford008, michaelliu2, mikeubell, monkeyhippies, r-i-c-h, tdooner


disclosure-backend-static's Issues

Primarily formed ballot measure committees not appearing on referendum list of supporting/opposing committees

@tdooner @mikeubell, @sfdoran's staff brought this up when combing over the site this week. It looks like there are committees that aren't showing up as supporting/opposing ballot measures. Unfortunately, this means the data we're showing is misleading, so we'd like to fix this ASAP.

In most of these cases, these are Ballot Measure Committees (BMC) formed primarily to support or defeat a specific ballot measure. Such a committee usually does not specify on Schedule D whether its expenditures were in support of or opposition to a ballot measure; it just leaves that field blank. In that case, we should attribute those expenditures to the ballot measure the committee was primarily formed for.

For primarily formed ballot measure committees, we have the measure number they support or oppose in the committee spreadsheet, although I am not sure whether we also need the support/oppose value; I hope one of you can comment on this.

Need to exclude data for officeholder accounts from campaign totals

Issue by sfdoran
Wednesday Nov 15, 2017 at 21:25 GMT
Originally opened as caciviclab/disclosure-backend#280


Larry Reid and Lynette Gibson-McElhaney changed their campaign committees to Officeholder committees in the midst of a reporting period.

Transactions after they filed the change should be excluded/subtracted from the campaign totals.

Larry Reid for Council 2016 Officeholder Committee FPPC ID 1387905 re-designated 12/19/2016
Lynette Gibson-McElhaney 2016 Officeholder Committee FPPC ID 1375179 re-designated 8/29/2017

Include Form 496 data in ballot measure candidate / opposition totals

This form lists independent expenditures in support or opposition to a ballot measure or candidate.

Currently there is not much here except for some school advocacy, but this sheet is gonna fill up soon.

  • ballot measure supporting / opposing calculator
  • ? candidate opposing calculator ( 🌟 NEW!)

Matching up to candidate is going to be tricky since the form doesn't include the candidate's recipient committee id :(. For any ballot measures without a Bal_Num we can use, add to the mapping in models/oakland_referendum.rb.

Endpoint for candidate contributions

Not sure where this stacks up in priorities or difficulty. Looking for an endpoint to get the contributions made to a specific candidate. Maybe /candidate/:id/contributions/:type with some kind of limit and offset parameter so we can scroll the dataset. And maybe we can break this down a bit, since it's kind of 3 datasets.

Here's what the frontend mock is asking for:
'sm-mobile-6-candidate-contributions'

There should be data for 2018 June Measure D

There is a committee in support of the ballot measure now, as well as a major donor that is supporting the measure.

Let's make sure that that is represented on Open Disclosure ASAP.

Primary:
Protect Oakland Libraries - 1400467

Major Donor:
Friends of the Oakland Public Library - Pending

Create ballot for San Francisco November 6, 2018 election

We need folks to seek out all candidate and referendum info for San Francisco in the November 6, 2018 election. Info we need is:

Candidate

  • FPPC#
  • Committee Name
  • Candidate
  • Aliases
  • Office
  • Incumbent
  • Accepted expenditure ceiling
  • Website
  • Twitter
  • Party Affiliation
  • Occupation
  • Bio
  • Photo
  • VotersEdge
  • Internal Notes

Referendum

  • Measure number
  • Short Title
  • Full Title
  • Summary
  • VotersEdge
  • Internal notes

Create ballot for San Francisco June 5, 2018 special election

We need folks to seek out all candidate and referendum info for San Francisco in the June 5, 2018 special election. Info we need is:

Candidate

  • FPPC#
  • Committee Name
  • Candidate
  • Aliases
  • Office
  • Incumbent
  • Accepted expenditure ceiling
  • Website
  • Twitter
  • Party Affiliation
  • Occupation
  • Bio
  • Photo
  • VotersEdge
  • Internal Notes

Referendum

  • Measure number
  • Short Title
  • Full Title
  • Summary
  • VotersEdge
  • Internal notes

There are non-matching contributions / receipts in Form 497

Question for Suzanne: If everyone is in compliance, when a contribution over $1000 is made from a committee -> committee, we should see it twice in Form 497, right?

(We would see it once on F497 Page 1 filed by the contributing committee, and once on F497 Page 2 filed by the receiving committee.)

But there are instances right now where a contribution is not reported by both committees. For instance:

Oakland Police Officer's Association - Political Action Committee
gave $1400 to Filer ID 1387192 ("Viola Gonzales for City Council 2016") but the corresponding receipt by her committee does not appear.

Is this a known filing discrepancy to the PEC?

Define an API for Search

Hey @adborden, we probably need to figure out an API for the search feature specified here:

https://docs.google.com/document/d/1fu0GMbicaERV_7R0aoLBw1ST-yFM1P3K_8mqhDTOcGs/edit#heading=h.chyrjd53oksh

It's possible that we can just query the cloud search provider directly... and thus not need to do an API on the backend. I've never designed a search API including dropdown results and other stuff, so I'd prefer to avoid that if we can just query the cloud search engine directly.

/office_election/:office_election_id should include candidate total_contributions

$ curl -v 'https://disclosure-backend-static.tdooner.com/office_election/1'
{
  "id": 1,
  "contest_type": "Office",
  "name": "School Board District 1",
  "candidates": [
    {
      "id": 26,
      "name": "Donald Macleay",
      "photo_url": null,
      "website_url": null,
      "twitter_url": "@donmacleay",
      "first_name": "Donald",
      "last_name": "Macleay",
      "ballot_item": 1,
      "office_election": 1
    }
    // .... snip
  ],
  "ballot_id": 1
}

In order to render the list of candidates with the total collected, each candidate object should include a total_contributions field.

Figure out how to carry through expenditures

If Committee A gives money to Committee B which then spends the money supporting a ballot measure, we might accidentally double-count that money in support of the measure. (i.e. Just Cause vs Committee To Protect Oakland Renters)

Add "public campaign financing received" to backend

Suzanne will input this into the spreadsheet manually and then we'll expose it as data that the frontend can use.

The number should be null if the candidate is not eligible for public funding (i.e. Mayoral candidates, citywide offices, or OUSD), or a number greater than or equal to zero if they are eligible.

Process contributions from Form 497

The filing deadlines for Form 460 are sparse -- only two until November (9/29, 10/27) -- and between the deadlines the only data we get is from Form 497.

This form contains some sizeable contributions: $1.5M as of writing.

We will have to figure out how to deduplicate these with the Form 460 contributions after the filing deadlines, but for now we just need to include them at all.

Automated Updating

We identified this as a major task at the OD offsite on 6/11/17. The rationale behind the feature is exactly what you'd expect: we need to present a product that is as up-to-date as possible so that we aren't misleading users with stale data.

Charting the shortest course from where we are now to automated updating, we can probably get away without rewriting the backend. Travis CI supports cron jobs, and Heroku Scheduler could also be an approach we use.

Either way, here is my proposed TODO list of sub-issues in this milestone:

  • Remove ssconvert dependency (it is heavy and doesn't do much for us)
  • Consider removing the csvsql dependency too (it is python and everything else is ruby)
  • Create a script which installs postgres, ruby, python, anything else to do the processing. The script will need credentials to push to Github at the end 😬
  • Test this script in Docker locally, maybe
  • Wire up the script inside of Travis or Heroku.

Handle case where candidates don't meet the reporting threshold

The reporting threshold is $2000 (of combined expenditures and contributions).

The ballot qualification deadline is in early August -- we should wait until the deadline to do the background research on candidates without registered committees. Until then, we should group all candidates without committees under a new section on the page that reads something like:

Other Candidates

Candidates who have raised and spent less than $2,000 are not required to file campaign finance statements.

Endpoint for ballot measure contributors

"It's like #20 but for ballot measures!"

Mock: https://cloud.githubusercontent.com/assets/6467815/18463189/0cc264b0-793d-11e6-8956-afc64f1c1f3d.png

For each committee that has a contribution to a ballot measure, we should return a list of other things that committee has done. Look through the data: are these independent committees financing candidates? Or doing anything more than just supporting/opposing a single ballot measure? If not, we probably don't need to build this page.

Otherwise, we should create some object like /committee/:id/supported_referendums or something like that which returns that data in some format.

Missing contributions?

Going through all the contributions we have in Oakland, I was expecting something around 1400 (before the 9/29 filing deadline), but I'm only seeing 679.

$ ls -1 build/committee/ | while read fppc; do curl "https://disclosure-backend-static.tdooner.com/committee/$fppc/contributions" | jq '.[]'; done  |  jq --slurp '. | length'
679

cc @KyleW

Add calculator for total loans

The table efile_COAK_2016_B1-Loans / efile_COAK_2016_B2-Loans / efile_COAK_2016_H-Loans has a bunch of loan data that we need to process.

Include total amounts in committee endpoint

For the committee page, we want to show the total expenditures/contributions from this committee, so we'd like to have total_expenditures and total_contributions appear in /committee/:fppc_id. I believe we already have the total expenditures but it's only getting included on the /referendum/:id/supporting endpoint.

cc @KyleW

Contribution & expenditure data prior to 2016 missing from calculations

Issue by sfdoran
Wednesday Nov 15, 2017 at 21:09 GMT
Originally opened as caciviclab/disclosure-backend#279


See campaigns for Jody London, Rebecca Kaplan, Lynette Gibson-McElhaney, and Roseann Torres

McElhaney & Kaplan did significant fundraising & spending in 2015.

I did not include 2017 statements in my calculations, so the totals may change when those are included as well.

Spreadsheet attached.
Summary totals 2016 election (updated thru 11.9.17).xlsx

Update data output structure to better support the front end

This is more of a discussion than a specific bug or proposal.

The front end wants to be the source for ballot information (who/what is on the ballot) and to consume the finance data from the back end in a hierarchical form. The JSON files on the back end are organized by entity/ID, which isn't helpful because the IDs can change between builds.

For example, we could write front end code to iterate over candidate data, looking at office election Ids in the candidate files and working backward to the election, but maybe the back end should just output the data in a structure that fits the front end. That would look more like:

locality/
└── election (month-year or YYYY-mm-dd)
    ├── candidates
    │   └── filer-id.json
    ├── referendum_supporting 
    │   └── referendum-name-slug.json
    ├── referendum_opposing
    │   └── referendum-name-slug.json
    └── contributions
        └── filer-id.json

Other options:

  • Fix Ids to specific entities (e.g. explicit Id from the Google Sheet)
  • Write code on the front end to dynamically figure out the Ids when pulling in the finance data

Fix accidental summation of amended filings

This perennial problem bites again -- my database locally has duplicate transactions. The transactions are equal except for the Report_Num column which is 000 for the first and 001 for the second instance.

We should only use the transactions from the latest report.
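Keeping only the latest report amounts to a group-by-transaction, keep-the-highest-Report_Num pass. A minimal sketch (the Tran_ID / Report_Num field names follow the columns described above; real transactions carry many more fields):

```python
def latest_reports(transactions):
    """Keep only the copy of each transaction from the highest-numbered
    report, since amended filings supersede the original 000 report."""
    latest = {}
    for txn in transactions:
        key = txn["Tran_ID"]
        # Report_Num values like "000"/"001" are zero-padded, so string
        # comparison orders them correctly.
        if key not in latest or txn["Report_Num"] > latest[key]["Report_Num"]:
            latest[key] = txn
    return list(latest.values())

txns = [
    {"Tran_ID": "X1", "Report_Num": "000", "Tran_Amt1": 100.0},
    {"Tran_ID": "X1", "Report_Num": "001", "Tran_Amt1": 100.0},
    {"Tran_ID": "Y2", "Report_Num": "000", "Tran_Amt1": 25.0},
]
print(latest_reports(txns))
```

Summing over the deduplicated list yields 125.0 instead of the double-counted 225.0.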

Determine the quantity of committee to committee expenditures for ballot measures

Related to #29, there are some instances of potential double-counting expenditures in support/opposition to ballot measures.

For instance, transaction ID OgSkVRBY4uzb ($59,000!) in inputs/efile_COAK_2016_497.csv was given to the Committee To Protect Oakland Renters.

That committee then also has spent $60,000 in support of the ballot measure, as determined by this query:

 select sum("Tran_Amt1") from "efile_COAK_2016_A-Contributions" where "Report_Num" = '001' AND "Filer_ID" = '1364564';

How many contributions go from committee to committee? Which committees do this more than others? How deep does the rabbit hole go....? 🐰
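A first pass at quantifying this: flag contributions whose contributor is itself a committee we track. This is a sketch with hypothetical field names (contributor_id, amount), not the actual schema:

```python
def committee_to_committee(contributions, committee_ids):
    """Return contributions made by entities that are themselves filing
    committees (hypothetical field names, not the actual schema)."""
    return [c for c in contributions if c["contributor_id"] in committee_ids]

contributions = [
    {"contributor_id": "1364564", "amount": 59000.0},  # a known committee
    {"contributor_id": "person-1", "amount": 100.0},
]
print(committee_to_committee(contributions, {"1364564"}))
```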

Track / output data completeness

This is a tracking issue for places we might want to measure stats about our data processing:

For instance:

  • how many 497 contributions cannot be matched up to a supporting/opposing committee
  • matching committee entities to schedule D expenditure records
  • last updated date

Remove Google Sheet for ballot data

The problem with the Google Sheet is that it's not easy to recruit folks to add new localities to it: you have to invite/approve their access to the sheet, and each election results in several tabs. As part of this work, I want to document the instructions for adding your locality to ODCA so that folks can submit a PR to add a locality instead of having to reach out and work directly with the ODCA team.

There are a few alternatives:

  • Using CSV files instead (which can be exported from any Google Sheet)
  • Using YAML data files

The YAML approach would use individual YAML files for each candidate, referendum, and committee, sans the finance data, which would be generated separately.

If folks prefer working in a sheet over YAML, we could set up a template Google Sheet that folks could copy, fill out, and then export to a CSV which they could add to the repo on GitHub.

/referendum/:id/supporting does not include fppc_id for supporting_organizations

I believe we already updated the committees primary key to be fppc_id rather than some integer. /referendum/:id/supporting does not include fppc_id for supporting_organizations. Also applies to /opposing.

http://disclosure.caciviclab.org/#!/city/2/oakland/2016/referendum/1/sugar-sweetened-beveragetax/
Gets a 404 from the API when clicking on the opposing committee, trying to fetch http://disclosure-backend-static.f.tdooner.com/committee/10. I'm assuming the frontend code won't change and the id will just be set to the fppc_id; is that fair?

cc @KyleW
