department-of-general-services / boe_tabulator

Reads PDFs from Baltimore's archive of minutes from the Board of Estimates and places the data in a searchable table.

License: MIT License
Probably the best place to start is the `agreements` table, but the `contractors` table is another possibility.
The problem is in the html on the source web page. If the source html is like this:

```html
<a href="/files/4983-52172016-11-30pdf">​November 30, 2016</a>
```

then Python gets confused by the invisible character (a zero-width character sits just before "November") and can't identify the month as November.
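A minimal sketch of a guard against this (the helper name `clean_link_text` is hypothetical; the ascii encode/decode round-trip mirrors the one already used in `store_boe_pdfs()`):

```python
def clean_link_text(raw_text):
    """Strip hidden characters (e.g. zero-width spaces) from link text."""
    # encoding to ascii with errors="ignore" drops non-ascii characters,
    # including the invisible ones that break month parsing
    return raw_text.strip().encode("ascii", "ignore").decode("ascii")


# "\u200b" is a zero-width space like the one in the anchor tag above
assert clean_link_text("\u200bNovember 30, 2016") == "November 30, 2016"
```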
Most minutes PDFs have a relatively simple top-level structure, with a somewhat more complex and unpredictable second-level structure. The top-level structure starts with:
This issue will not change the core data pipeline, but will rather inform work on #42 by showing exactly how consistent the sectioning of the minutes pdfs is.
Add a test for the collection of annual links from the minutes page, and write a function that makes this test pass, which can eventually replace the following segment of code within `store_boe_pdfs()`:

```python
# find all links where the associated text contains the year
link = soup.find("a", href=True, text=str(year))
annual_url = base_url + link["href"]
print(f"Saving files from url: {annual_url}")
```
Create a test class with the following methods:

```python
class TestGetAnnualLinks:
    def test_get_annual_links(self):
        pass

    def test_fix_absolute_ref(self):
        pass

    def test_new_year(self):
        pass

    def test_exclude_non_year_links(self):
        pass
```
`test_get_annual_links()` should:

- Call `get_annual_links()` and capture the output
- Assert that the `get_annual_links()` output captured in the execution matches the dictionary passed in the setup

These tests imply the following behavior from the code:

- A function `get_annual_links()` that accepts a soupified version of an html page output by `check_page_setup()` and returns a dictionary of links with the following structure:

```
{2009: 'https://comptroller.baltimorecity.gov/minutes-2009',
 ...
 2020: 'https://comptroller.baltimorecity.gov/minutes-2020'}
```
- All of the returned links should use `base_url` as their root
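A rough sketch of a `get_annual_links()` that satisfies these constraints (the four-digit-year regex filter and the use of `urljoin` are assumptions on my part):

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://comptroller.baltimorecity.gov"


def get_annual_links(soup):
    """Return a dict mapping each year to its minutes page url."""
    annual_links = {}
    for link in soup.find_all("a", href=True):
        # only keep links whose text is a four-digit year, which
        # excludes non-year links elsewhere on the page
        match = re.fullmatch(r"(19|20)\d{2}", link.get_text(strip=True))
        if match:
            year = int(match.group())
            # urljoin keeps an href that is already absolute as-is and
            # resolves a relative one against BASE_URL, avoiding the
            # duplicated-prefix bug seen in store_boe_pdfs()
            annual_links[year] = urljoin(BASE_URL, link["href"])
    return annual_links
```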
As we continue to reason about which tables should go into the database and how they relate to one another, it'll be important to make sure all contributors to the repo share an understanding of what each table will contain.

Let's add a one- to two-sentence SQL comment ahead of each table that just explains in plain English what entity that table will store data about.
This code will contain all the parts that read through the text and figure out the rows in the `agreements` table.

A PR has been submitted with an alternate scraping function that should perform slightly better and serve as a drop-in replacement.
Since the encoding/decoding process for these pdfs seems to have some errors (characters being incorrectly decoded), we need to update the list of character replacements. Currently we have only two replacements identified, but there are more than just those two.
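A sketch of how the replacement list might be structured so new mappings are easy to add (the specific pairs below are illustrative, not the repo's actual list):

```python
# hypothetical mapping of mis-decoded sequences to their intended
# characters; extend this dict as new decoding errors are identified
CHAR_REPLACEMENTS = {
    "\u2019": "'",   # right single quotation mark -> apostrophe
    "\ufb01": "fi",  # "fi" ligature -> plain letters
}


def replace_chars(text):
    """Apply every known character replacement to the extracted text."""
    for bad, good in CHAR_REPLACEMENTS.items():
        text = text.replace(bad, good)
    return text
```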
Anyone running the notebook is currently using functions that are not the newest. Updating the notebook to use the new functions will also serve as a test that everything works together as expected.
We need a more flexible date parser, specifically with regard to handling misspellings of the month. Currently we can handle any single-letter deletion, but not substitutions or additions.
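One possible approach, sketched here with `difflib` from the standard library, which tolerates substitutions and additions as well as deletions (the cutoff value is a guess that would need tuning):

```python
import calendar
import difflib

MONTHS = [m.lower() for m in calendar.month_name[1:]]  # january..december


def match_month(token):
    """Return the best-guess month name for a possibly misspelled token."""
    # get_close_matches scores candidates by similarity, so it handles
    # deletions ("novembe"), substitutions ("novenber"), and additions
    matches = difflib.get_close_matches(token.lower(), MONTHS, n=1, cutoff=0.75)
    return matches[0] if matches else None


assert match_month("Novenber") == "november"
```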
I think it'll be helpful to set a couple of ground rules at the outset.
This will be a living document; changes to it can be pushed just like we'd push code.
Split the functionality of retrieving the links to the minutes pdfs out into its own function.
The new function should:

- Accept a CSS selector as a parameter (e.g. `div[class='field field-name-body'] > p:last-of-type`). This would allow us to accommodate future pages that may differ in their DOM structure. This also assumes that there is a tag somewhere on the page that serves as a "container" which all the links are in. This is still tentative and might not make the final cut.
- Return a dictionary keyed by dates with the format `YYYY-MM-DDx`, where `x` is an optional identifier for meetings that occurred on the same day (e.g., `2017-06-12b`)

Create a test class with the following methods:

```python
class TestGetMinutesLinks:
    def test_get_minutes_links(self):
        pass
```
`test_get_minutes_links()` should:

- Call the `get_minutes_links()` function and store the output
- Assert that the output of `get_minutes_links()` matches the expected output

These tests imply a function `get_minutes_links(year_url, css_selector=<default value>)`, called like this:

```python
get_minutes_links(
    "https://comptroller.baltimorecity.gov/boe/meetings/minutes",
    "div[class='field field-name-body'] > p:last-of-type",
)
```
and returning a dictionary with the following structure:

```
{
    "2017-01-11": "https://comptroller.baltimorecity.gov/files/0001-00792017-01-11pdf",
    ...,
    "2017-06-12b": "https://comptroller.baltimorecity.gov/files/2186-22002017-06-12pdf",
    ...,
    "2017-12-20": "https://comptroller.baltimorecity.gov/files/5482-55802017-12-20pdf"
}
```
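For reference, a sketch of how `get_minutes_links()` might be implemented against that spec. It reuses `parse_long_dates()` from the existing code (assumed to live in `utils.py`); the same-day suffix logic and the hyphenation of keys are guesses:

```python
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

from utils import parse_long_dates  # existing date helper quoted elsewhere in this repo


def get_minutes_links(
    year_url,
    css_selector="div[class='field field-name-body'] > p:last-of-type",
):
    """Return a dict mapping 'YYYY-MM-DD(x)' keys to minutes pdf urls."""
    soup = BeautifulSoup(requests.get(year_url).text, "html.parser")
    minutes_links = {}
    for container in soup.select(css_selector):
        for link in container.find_all("a", href=re.compile("files")):
            parsed, date = parse_long_dates(link.get_text(strip=True))
            if not parsed:
                continue
            key = date.replace("_", "-")  # match the hyphenated key format
            suffix = "b"  # 'b', 'c', ... distinguishes same-day meetings
            while key in minutes_links:
                key = date.replace("_", "-") + suffix
                suffix = chr(ord(suffix) + 1)
            minutes_links[key] = urljoin(year_url, link["href"])
    return minutes_links
```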
Explore use cases and how this data can be used:

- What types of questions are most critical to be answered within the city?
- What types of questions are most critical to users outside of the city?

Please store the results in this location: `/discovery/operational_questions.md`
Running `tabulator.ipynb` currently produces the following output:

```
Saving files from url: https://comptroller.baltimorecity.gov/http://comptroller.baltimorecity.gov/minutes-2009
Saving files from url: https://comptroller.baltimorecity.gov/http://comptroller.baltimorecity.gov/minutes-2010
...
Saving files from url: https://comptroller.baltimorecity.gov/http://comptroller.baltimorecity.gov/minutes-2020
Wrote 0 .pdf files to local repo.
```
link["href"]
is the full url rather than a relative reference, therefore when it's appended to base_url
it duplicates the first part of the url.store_boe_pdfs()
function was writtencheck_page_setup()
that makes the first test passget_annual_links()
that makes the second test passcheck_page_setup()
and get_annual_links()
in bike_rack/store_boe_pdfs_helper_functions.py
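For the fix itself, `urllib.parse.urljoin` handles both absolute and relative hrefs, and could replace the string concatenation:

```python
from urllib.parse import urljoin

base_url = "https://comptroller.baltimorecity.gov"

# an href that is already absolute passes through unchanged
assert (
    urljoin(base_url, "http://comptroller.baltimorecity.gov/minutes-2009")
    == "http://comptroller.baltimorecity.gov/minutes-2009"
)
# a relative href is resolved against base_url
assert (
    urljoin(base_url, "/minutes-2009")
    == "https://comptroller.baltimorecity.gov/minutes-2009"
)
```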
Create a function `check_missing_pdfs()` that accepts the output of `get_meeting_links()` and returns a dictionary with all of the dates and links to the pdfs that can't be found within the `pdf_files` directory.

By accepting the output of `get_meeting_links()` and passing the missing pdfs to a function called `download_pdf()` (which still needs to be created), the following lines of code in `store_boe_pdfs()`:
```python
response_annual = requests.get(annual_url)
soup_annual = BeautifulSoup(response_annual.text, "html.parser")
pdf_links = soup_annual.find_all(name="a", href=re.compile("files"))
for idx, link in enumerate(pdf_links):
    pdf_location = link["href"]
    pdf_url = base_url + pdf_location
    pdf_file = requests.get(pdf_url)
    # derive name of the pdf file we're going to create
    # encoding and decoding removes hidden characters
    pdf_html_text = (
        link.get_text().strip().encode("ascii", "ignore").decode("utf-8")
    )
    # handle cases where the date is written out in long form
    parsed, pdf_date = parse_long_dates(pdf_html_text)
    if not parsed:
        print(pdf_date)  # error message
        continue
    pdf_filename = pdf_date + ".pdf"
    try:
        with open(save_path / pdf_filename, "wb") as f:
            f.write(pdf_file.content)
            total_counter += 1
    except TypeError as err:
        print(f"an error occurred with path {pdf_location}: {err}")
```
can be replaced with this:

```python
year_links = get_year_links(soup)
meeting_links = {}
for year, link in year_links.items():
    page = check_and_parse_page(link)
    meeting_links[year] = get_meeting_links(page)
missing_pdfs = check_missing_pdfs(meeting_links)
for year, meetings in missing_pdfs.items():
    for meeting, link in meetings.items():
        download_pdf(year, meeting, link)
```
The function `check_missing_pdfs(meeting_links)` accepts a dictionary with the following structure:

```
{'2020': {'2020_11_10': 'https://comptroller...pdf'},
 ...
 '2009': {'2009_10_21': 'https://comptroller...pdf'}}
```

It returns a dictionary of the pdfs missing from the `pdf_files` directory, with the same structure as the input dict. For any pdfs that are absent from the `meeting_links` input dictionary, it prints their names out to the console.

This function is currently in the bike rack, and was created there as a safety measure. Since PR#49 is merged now, this function is approved and should be incorporated into the main `get_boe_pdfs()` script.
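A minimal sketch of what that could look like, assuming pdfs are saved as `pdf_files/<year>/<YYYY_MM_DD>.pdf` (the console-printing behavior is omitted here):

```python
from pathlib import Path


def check_missing_pdfs(meeting_links, pdf_dir=Path("pdf_files")):
    """Return the subset of meeting_links whose pdfs are not saved locally."""
    missing = {}
    for year, meetings in meeting_links.items():
        for date, link in meetings.items():
            # pdfs are assumed stored in per-year sub-directories
            # with names like 2020_11_10.pdf
            if not (pdf_dir / year / f"{date}.pdf").exists():
                missing.setdefault(year, {})[date] = link
    return missing
```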
Refactor both `utils.py` and `tests/` to organize them by major category of functions.

`tests/` becomes:

```
tests/
├── conftest.py
├── scrape/
│   ├── test_scrape.py
│   ├── sample_data.py
│   └── sample_html.py
└── parse/
    ├── test_parse.py
    ├── sample_data.py
    ├── 2013_11_20.pdf
    └── 2010_03_17.pdf
```

`utils.py` becomes:

```
common/
├── utils.py
├── scrape_utils.py
└── parse_utils.py
```

When refactoring, make sure the jupyter notebook is also updated.
Create a function `download_pdf()` that accepts the year, date, and url of the minutes for a BOE meeting, downloads the pdf specified at the url, and stores it within `pdf_files/`, in a sub-directory corresponding to the year in which the meeting occurred.
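A minimal sketch, assuming `requests` and `pathlib` (the `pdf_dir` parameter is an addition for testability, not part of the spec):

```python
from pathlib import Path

import requests


def download_pdf(year, date, url, pdf_dir=Path("pdf_files")):
    """Download the minutes pdf at url into pdf_files/<year>/<date>.pdf."""
    save_dir = pdf_dir / year
    save_dir.mkdir(parents=True, exist_ok=True)
    response = requests.get(url)
    response.raise_for_status()  # surface bad links instead of saving junk
    (save_dir / f"{date}.pdf").write_bytes(response.content)
```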
By accepting the output of `check_missing_pdfs()`, the following lines of code in `store_boe_pdfs()`:
```python
response_annual = requests.get(annual_url)
soup_annual = BeautifulSoup(response_annual.text, "html.parser")
pdf_links = soup_annual.find_all(name="a", href=re.compile("files"))
for idx, link in enumerate(pdf_links):
    pdf_location = link["href"]
    pdf_url = base_url + pdf_location
    pdf_file = requests.get(pdf_url)
    # derive name of the pdf file we're going to create
    # encoding and decoding removes hidden characters
    pdf_html_text = (
        link.get_text().strip().encode("ascii", "ignore").decode("utf-8")
    )
    # handle cases where the date is written out in long form
    parsed, pdf_date = parse_long_dates(pdf_html_text)
    if not parsed:
        print(pdf_date)  # error message
        continue
    pdf_filename = pdf_date + ".pdf"
    try:
        with open(save_path / pdf_filename, "wb") as f:
            f.write(pdf_file.content)
            total_counter += 1
    except TypeError as err:
        print(f"an error occurred with path {pdf_location}: {err}")
```
can be replaced with something similar to the following code:

```python
year_links = get_year_links(soup)
meeting_links = {}
for year, link in year_links.items():
    page = check_and_parse_page(link)
    meeting_links[year] = get_meeting_links(page)
missing_pdfs = check_missing_pdfs(meeting_links)
for year, meetings in missing_pdfs.items():
    for date, link in meetings.items():
        download_pdf(year, date, link)
```
Create a function `download_pdfs()` that accepts the following inputs:

- a meeting date with the format `YYYY-MM-DD`
- a target directory, defaulting to `pdf_files/` if no directory is specified

And behaves as follows:

- Returns `True`, which can be stored in a variable `passed`, when the pdf is downloaded and stored as `YYYY_MM_DD.pdf`
- Returns `False`, which can be stored in a variable `passed`, when the download fails

Remember to add any new dependencies to `requirements.txt`.
Currently, line 71 in `utils.py` reads:

```python
for year in range(2009, 2021):
```

Let's adjust it to use the code currently in `bike_rack/alternative_scraping_function.py` to fix this one problem only.
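For comparison, a dynamic alternative that collects the years from the page itself rather than hard-coding them; this may or may not match what the bike rack function does (`soup` here is the parsed minutes page from `store_boe_pdfs()`):

```python
import re

# instead of: for year in range(2009, 2021):
# collect every four-digit year that appears as link text on the page,
# so new years are picked up without a code change
years = sorted(
    int(link.get_text(strip=True))
    for link in soup.find_all("a", href=True)
    if re.fullmatch(r"(19|20)\d{2}", link.get_text(strip=True))
)
for year in years:
    ...
```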
Refactor a portion of the code from `store_pdf_text_to_df()` into its own class called `Minutes`, which will allow us to parse the text for each set of minutes one meeting at a time instead of having to do all of them at once.

Attributes:

- `page_count`: The number of pages within the document
- `year`: The year in which the BOE met and these minutes were recorded
- `meeting_date`: The date on which the BOE met and these minutes were recorded, with format `YYYY-MM-DD`
- `df`: The data frame of parsed text; this will be empty until the `parse_pages()` method is called, to save unnecessary work
- `sections`: Stores the split text of the parsed pages, exact structure TBD

Methods:

- `parse_pages()`: Accepts an input that specifies the set of pages to parse and stores the resulting parsed pages in the `df` attribute as a DataFrame. Parameters:
  - `range_type`: an enum that specifies which type of range to expect; options:
    - `all`: parses all of the pages within the pdf, no additional parameters required
    - `range`: parses all of the pages within a given range, requires the `start_page` and `end_page` parameters be filled out
    - `list`: parses all of the pages specified directly within the `page_list` parameter, requires the `page_list` parameter be filled out
  - `start_page`: Accepts a number within the range of `page_count`; indicates the first page to start parsing
  - `end_page`: Accepts a number within the range of `page_count`; indicates the last page to parse
  - `page_list`: Accepts a list of page numbers within the range of `page_count`, which indicates all of the pages to parse and store in the `df` attribute
- `get_sections()`: Splits the text that has been parsed into the `df` attribute into a list of callable sections that can be accessed through the `sections` attribute, exact specification TBD

The basic test structure will resemble the following:
```python
class TestMinutes:
    def test_instantiation(self):
        assert 1

    def test_parse_pages(self):
        assert 1
```
- Copy a pdf from the `pdf_files/` directory into the `tests/data/` subdirectory as a sample pdf to run the tests against
- Fill out `test_instantiation()` and `test_parse_pages()` to confirm that:
  - a `Minutes` class exists which accepts the path to a pdf file with the following pattern `*/YYYY_MM_DD.pdf` when instantiating a new instance of the minutes
  - instantiation is handled gracefully when `Minutes` is passed a date for which there is no pdf saved
  - the attributes `page_count`, `meeting_date`, and `year` exist and all return the correct value for a given meeting date
  - `parse_pages()` works and accepts each of the range types specified above
  - the output of `parse_pages()` is a dataframe that can be accessed from the attribute `df`
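A skeleton of how the class might come together, sketched with pdfplumber (the repo's actual pdf library may differ, and `range_type` is simplified here to a plain string rather than an enum):

```python
from pathlib import Path

import pandas as pd
import pdfplumber


class Minutes:
    """Parsed minutes for a single BOE meeting."""

    def __init__(self, pdf_path):
        self.path = Path(pdf_path)  # expects a */YYYY_MM_DD.pdf name
        if not self.path.exists():
            raise FileNotFoundError(f"no pdf saved at {self.path}")
        self.meeting_date = self.path.stem.replace("_", "-")
        self.year = int(self.meeting_date[:4])
        with pdfplumber.open(self.path) as pdf:
            self.page_count = len(pdf.pages)
        self.df = pd.DataFrame()  # empty until parse_pages() is called
        self.sections = None      # populated by get_sections(), TBD

    def parse_pages(self, range_type="all", start_page=None, end_page=None, page_list=None):
        """Extract text for the requested pages into the df attribute."""
        if range_type == "all":
            pages = range(1, self.page_count + 1)
        elif range_type == "range":
            pages = range(start_page, end_page + 1)
        elif range_type == "list":
            pages = page_list
        with pdfplumber.open(self.path) as pdf:
            rows = [
                {"page": n, "text": pdf.pages[n - 1].extract_text()}
                for n in pages
            ]
        self.df = pd.DataFrame(rows)
        return self.df

    def get_sections(self):
        """Split parsed text into sections; exact specification TBD."""
        raise NotImplementedError
```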
The minutes typically feature a section of agreements formatted like this:
```
STRONG CITY BALTIMORE, INC. $52,000.00
Account: 5000-506316-6397-460505-601002
The organization will provide a member of their staff
(Baltimore Corps Fellow) to coordinate youth services
related special projects for the YouthWorks Summer program.
The funds will be drawn from the Maryland Department of
Labor, Licensing, and Regulation and State General Funds.
The period of the agreement is September 1, 2016 through
August 31, 2017.
```
How should this map onto the `agreements` table?
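As a starting point, a sketch of a regex that pulls the vendor, amount, and account out of a block shaped like the one above (`AGREEMENT_PATTERN` is hypothetical, and real minutes will have variations it won't cover):

```python
import re

AGREEMENT_PATTERN = re.compile(
    r"(?P<vendor>[A-Z][A-Z ,.'&-]+?)\s+"      # vendor name in caps
    r"\$(?P<amount>[\d,]+\.\d{2})\s*\n"       # dollar amount
    r"Account:\s*(?P<account>[\d-]+)\s*\n"    # account number
    r"(?P<description>(?:.+\n?)+)"            # free-text description
)

sample = """STRONG CITY BALTIMORE, INC. $52,000.00
Account: 5000-506316-6397-460505-601002
The organization will provide a member of their staff..."""

match = AGREEMENT_PATTERN.search(sample)
print(match.group("vendor"), match.group("amount"), match.group("account"))
# STRONG CITY BALTIMORE, INC. 52,000.00 5000-506316-6397-460505-601002
```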
Perhaps we begin by just breaking the `agreements` section away from everything else.
One approach is to make each .pdf-derived text into an instance of a class, with methods that break it up.
Let's create a new .py file with all the functions that achieve this breaking-up of the original text.
The `bike_rack.py` file holds code that is under construction or being stored as a resource.
Set up a unit test framework and spec out an initial set of tests for the existing code base. Tests can be stubbed with `pass` or `assert True == True`, and should be marked with `# TO DO:` comments for completion in a later issue.

Just want to generate a few line plots so we'll have something to show as an example of the kinds of insights we can get.
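A minimal sketch of the kind of plot this might be, assuming a hypothetical `agreements` dataframe with `meeting_date` and `amount` columns:

```python
import matplotlib.pyplot as plt
import pandas as pd

# hypothetical agreements dataframe produced by the parsing pipeline;
# the rows here are stand-in sample values, not real data
agreements = pd.DataFrame({
    "meeting_date": pd.to_datetime(["2016-11-30", "2017-06-12", "2017-12-20"]),
    "amount": [52000.00, 13500.00, 98000.00],
})

# total agreement dollars per year as a simple line plot
totals = agreements.groupby(agreements["meeting_date"].dt.year)["amount"].sum()
totals.plot(kind="line", marker="o", title="Total agreement spending by year")
plt.xlabel("Year")
plt.ylabel("Dollars")
plt.show()
```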