
boe_tabulator's People

Contributors

james-trimarco, robertnunn, widal001

boe_tabulator's Issues

Develop new dataframes

Probably the best place to start is the agreements table, but the contractors table is another possibility.

Fix bug where some dates can't be found

The problem is in the html on the source web page. If the source html is like this:

<a href="/files/4983-52172016-11-30pdf">&#8203;​November 30, 2016</a>

Then Python gets confused by the invisible zero-width space (&#8203;) in the link text and can't identify the month as November.
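
A possible fix (a minimal sketch, assuming we strip the offending characters before date parsing; the helper name is hypothetical) is to drop Unicode format characters such as the zero-width space from the link text:

import unicodedata

def clean_link_text(text):
    """Remove zero-width spaces and other invisible characters from link text."""
    # Characters in Unicode category "Cf" (format), which includes U+200B
    # (zero-width space), are dropped before the date parser sees the text.
    cleaned = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    return cleaned.strip()

# The zero-width space no longer hides the month name.
assert clean_link_text("\u200bNovember 30, 2016") == "November 30, 2016"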

Test consistency of PDF sectioning

Most minutes PDFs have a relatively simple top-level structure, with a somewhat more complex and unpredictable second-level structure. The top-level structure starts with:

  1. BOARDS AND COMMISSIONS
  2. OPTIONS/CONDEMNATION/QUICK-TAKES
  3. TRANSFERS OF FUNDS

This issue will not change the core data pipeline, but will rather inform work on #42 by showing exactly how consistent the sectioning of the minutes pdfs is.

Write unit test for collection of annual links from the base page

Proposed Improvement

Add a test for the collection of annual links from the minutes page, and write a function that makes the test pass. That function can eventually replace the following segment of code within store_boe_pdfs():

# find the link whose associated text contains the year
link = soup.find("a", href=True, text=str(year))
annual_url = base_url + link["href"]
print(f"Saving files from url: {annual_url}")

Test Structure

Create a test class with the following methods:

class TestGetAnnualLinks:
    def test_get_annual_links(self):
        pass
    
    def test_fix_absolute_ref(self): 
        pass

    def test_new_year(self):
        pass
    
    def test_exclude_non_year_links(self):
        pass

test_get_annual_links()

  • Setup: Provide sample HTML copied directly from the base url page, parse it into a BeautifulSoup object, and specify the dictionary of annual links that get_annual_links() should retrieve from that page (a pytest sketch of this flow follows the list)
  • Execution: Pass the parsed HTML to get_annual_links() and capture the output
  • Validation: Check that the output from get_annual_links() captured in the execution matches the dictionary passed in the setup
  • Cleanup: No cleanup needed
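
A minimal pytest sketch of that setup/execution/validation flow, assuming get_annual_links() lives in utils.py; the sample HTML and expected dictionary below are placeholders, not copied from the live page:

from bs4 import BeautifulSoup

from utils import get_annual_links  # assumed import path

# Placeholder standing in for HTML copied from the base url page.
SAMPLE_HTML = """
<div>
  <a href="/minutes-2019">2019</a>
  <a href="/minutes-2020">2020</a>
</div>
"""

class TestGetAnnualLinks:
    def test_get_annual_links(self):
        """get_annual_links() should return the expected {year: url} dictionary."""
        # Setup: parse the sample HTML and define the expected output.
        soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
        expected = {
            2019: "https://comptroller.baltimorecity.gov/minutes-2019",
            2020: "https://comptroller.baltimorecity.gov/minutes-2020",
        }
        # Execution: pass the parsed HTML to the function under test.
        output = get_annual_links(soup)
        # Validation: the captured output matches the expected dictionary.
        assert output == expected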

Test Implications

These tests imply the following behavior from the code:

  • There is a function get_annual_links() that accepts the BeautifulSoup-parsed html page returned by check_page_setup() and returns a dictionary of links with the following structure:
    {2009: 'https://comptroller.baltimorecity.gov/minutes-2009',
     ...
     2020: 'https://comptroller.baltimorecity.gov/minutes-2020'}
    
  • The function ensures that the links included in the output dictionary are absolute references with the base_url as their root
  • The function will dynamically capture new links to future years of BOE minutes, while excluding similarly formatted links elsewhere on the page (see the sketch below)
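
A hedged sketch of a get_annual_links() that would satisfy these implications; it keeps anchor tags whose visible text is a four-digit year and uses urllib.parse.urljoin so the result is always an absolute reference (the exact signature and base url default are assumptions):

import re
from urllib.parse import urljoin

BASE_URL = "https://comptroller.baltimorecity.gov"

def get_annual_links(soup, base_url=BASE_URL):
    """Return a {year: absolute_url} dictionary of links to each year's minutes."""
    annual_links = {}
    for link in soup.find_all("a", href=True):
        text = link.get_text(strip=True)
        # Only keep links whose visible text is a four-digit year; this
        # excludes similarly formatted links elsewhere on the page.
        if re.fullmatch(r"(19|20)\d{2}", text):
            # urljoin leaves absolute hrefs untouched and resolves relative
            # ones against base_url, so the output is always absolute.
            annual_links[int(text)] = urljoin(base_url, link["href"])
    return annual_links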

Add comments explaining what each entity is for tables in `boe_min.sql`

As we continue to reason about which tables should go into the database and how they relate to one another, it'll be important to make sure all contributors to the repo share an understanding of what each table will contain.

Let's add a one to two-sentence SQL comment ahead of each table that just explains in plain English what entity that table will store data about.

Alternate scraping function

PR submitted with an alternate scraping function that should perform slightly better and be a drop-in replacement.

Updated character replacements list

As the encoding/decoding process for these pdfs seems to introduce some errors (characters being incorrectly decoded), we need to update the list of character replacements. Currently only two replacements are identified, but there are more than just those two.
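
One way to keep the replacements maintainable is a single module-level dictionary that grows as new mis-decoded characters are catalogued. This is only a sketch; the mappings shown are illustrative assumptions, not the confirmed list:

# Illustrative only: the actual mis-decoded characters still need to be
# catalogued as part of this issue.
CHARACTER_REPLACEMENTS = {
    "\u2019": "'",   # right single quotation mark -> apostrophe
    "\u2013": "-",   # en dash -> hyphen
    "\u00a0": " ",   # non-breaking space -> regular space
}

def replace_chars(text):
    """Apply every known character replacement to the extracted pdf text."""
    for bad, good in CHARACTER_REPLACEMENTS.items():
        text = text.replace(bad, good)
    return text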

Harmonize Jupyter notebook and new functions

Anyone running the notebook is currently using functions that are not the newest. Updating the notebook to use the new functions will also serve as a test that everything works together as expected.

Updated date parsing

We need a more flexible date parser, specifically with regard to handling misspellings of the month. Currently, we can handle any single-letter deletion, but not substitutions or additions.
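
One candidate approach (a sketch, not the confirmed design) is fuzzy matching against the canonical month names with difflib.get_close_matches, which tolerates substitutions and additions as well as deletions; the 0.7 cutoff is an assumption to tune:

import calendar
from difflib import get_close_matches
from typing import Optional

MONTHS = [m.lower() for m in calendar.month_name if m]  # "january" ... "december"

def match_month(token: str) -> Optional[str]:
    """Return the canonical month name closest to a possibly misspelled token."""
    # get_close_matches handles deletions, substitutions, and additions, so
    # "Novembr", "Novmeber", and "Novemberr" all resolve to "november".
    matches = get_close_matches(token.lower(), MONTHS, n=1, cutoff=0.7)
    return matches[0] if matches else None

assert match_month("Novembr") == "november"
assert match_month("Janaury") == "january"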

Initiate repo style guide in Markdown

I think it'll be helpful to set a couple of ground rules at the outset.

This will be a living document; changes to it can be pushed just like we'd push code.

get_minutes_links() and the associated unit tests

Proposed Improvement

Split the functionality of retrieving the links to the minutes pdfs out into its own function.

The Function

Inputs:

  • year link url: An absolute URL that links to the minutes for a given year
  • CSS selector/find_all() arguments: OPTIONAL The selector/arguments that specify where to find the minutes links or the tag that contains them (e.g., div[class='field field-name-body'] > p:last-of-type). This would allow us to accommodate future pages that may differ in their DOM structure. It also assumes that there is a tag somewhere on the page that serves as a "container" for all of the links. This is still tentative and might not make the final cut.

Outputs:

  • minutes_links_dict: a dictionary with...
    • keys: the parsed date string as a key in the format of YYYY-MM-DDx, where x is an optional identifier for meetings that occurred on the same day (e.g., 2017-06-12b)
    • values: an absolute URL that points to the minutes pdf for the day/meeting specified by the key

Tests:

class TestGetMinutesLinks:
    def test_get_minutes_links(self):
        pass
  • Setup: Provide a relative URL that points to an html file in the tests directory. This html file is a local copy of the base url page and will be parsed by BeautifulSoup.
  • Execution: Pass the relative URL to the get_minutes_links() function and store the output.
  • Validation: Check that the dictionary output of get_minutes_links() matches the expected output.
  • Cleanup: No cleanup required.

Example Input:

get_minutes_links(year_url, css_selector=<default value>)
get_minutes_links("https://comptroller.baltimorecity.gov/boe/meetings/minutes", "div[class='field field-name-body'] > p:last-of-type")

Example Output:

{
 "2017-01-11": "https://comptroller.baltimorecity.gov/files/0001-00792017-01-11pdf",
    ...,
 "2017-06-12b": "https://comptroller.baltimorecity.gov/files/2186-22002017-06-12pdf",
    ...,
 "2017-12-20": "https://comptroller.baltimorecity.gov/files/5482-55802017-12-20pdf"
}
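
Given the example input and output above, a minimal sketch of how get_minutes_links() could be implemented; the default selector, the reliance on the existing parse_long_dates() helper, and its import path are assumptions:

import requests
from urllib.parse import urljoin

from bs4 import BeautifulSoup

from utils import parse_long_dates  # existing helper; import path assumed

def get_minutes_links(year_url, css_selector="a[href*='files']"):
    """Return a {date_string: absolute_pdf_url} dict for one year's minutes page."""
    response = requests.get(year_url)
    soup = BeautifulSoup(response.text, "html.parser")

    minutes_links = {}
    for link in soup.select(css_selector):
        # parse_long_dates() returns (parsed_ok, date_string or error message);
        # the date string is assumed here to use the YYYY-MM-DD format.
        parsed, date_key = parse_long_dates(link.get_text(strip=True))
        if not parsed:
            continue
        # Disambiguate multiple meetings on the same day with a letter suffix,
        # e.g. "2017-06-12b" for the second meeting on that date.
        key, suffix = date_key, "b"
        while key in minutes_links:
            key = date_key + suffix
            suffix = chr(ord(suffix) + 1)
        minutes_links[key] = urljoin(year_url, link["href"])
    return minutes_links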

Annual links aren't being collected correctly

Steps to reproduce

  • Run the first three cells of tabulator.ipynb
  • The output returned is:
    Saving files from url: https://comptroller.baltimorecity.gov/http://comptroller.baltimorecity.gov/minutes-2009
    Saving files from url: https://comptroller.baltimorecity.gov/http://comptroller.baltimorecity.gov/minutes-2010
    ...
    Saving files from url: https://comptroller.baltimorecity.gov/http://comptroller.baltimorecity.gov/minutes-2020
    Wrote 0 .pdf files to local repo.
    

Issue and potential explanations

  • Issue: The value returned by link["href"] is the full url rather than a relative reference, therefore when it's appended to base_url it duplicates the first part of the url.
  • Potential Explanations:
    • The page structure has changed since the original store_boe_pdfs() function was written
    • Different versions of Beautiful Soup automatically return an absolute ref instead of a relative ref

Proposed solutions

  • Write a test that checks that the current structure of the live page specified by the base url matches the expected structure of the page for which the functions were written
  • Write a test that checks the accuracy of the annual links gathered from a static copy of the html from the page specified by the base url
  • Create a function check_page_setup() that makes the first test pass (sketched below)
  • Create a function get_annual_links() that makes the second test pass
  • Store both check_page_setup() and get_annual_links() in bike_rack/store_boe_pdfs_helper_functions.py
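
A hedged sketch of what check_page_setup() might look like; the specific structural check (that the page still contains four-digit-year links) is an assumption about what "expected structure" means, and the base url shown is taken from the get_minutes_links() issue above:

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://comptroller.baltimorecity.gov/boe/meetings/minutes"

def check_page_setup(url=BASE_URL):
    """Fetch the base page and confirm it still matches the expected structure.

    Returns the parsed page so get_annual_links() can reuse it; raises
    ValueError if the structure appears to have changed.
    """
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Expected structure (assumed): anchor tags whose visible text is a
    # four-digit year, one per year of minutes.
    year_links = [
        a for a in soup.find_all("a", href=True)
        if a.get_text(strip=True).isdigit() and len(a.get_text(strip=True)) == 4
    ]
    if not year_links:
        raise ValueError(f"Page at {url} no longer contains year links")
    return soup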

Create `check_missing_pdfs()`

Proposed Improvement

Create a function check_missing_pdfs() that accepts the output of get_meeting_links() and returns a dictionary with all of the dates and links to the pdfs that can't be found within the pdf_files directory.

By accepting the output of get_meeting_links() and passing the missing pdfs to a function called download_pdf() (which still needs to be created), the following lines of code in store_boe_pdfs()

response_annual = requests.get(annual_url)
soup_annual = BeautifulSoup(response_annual.text, "html.parser")
pdf_links = soup_annual.find_all(name="a", href=re.compile("files"))
for idx, link in enumerate(pdf_links):
    pdf_location = link["href"]
    pdf_url = base_url + pdf_location
    pdf_file = requests.get(pdf_url)
    # derive name of the pdf file we're going to create
    # encoding and decoding removes hidden characters
    pdf_html_text = (
        link.get_text().strip().encode("ascii", "ignore").decode("utf-8")
    )
    # handle cases where the date is written out in long form
    parsed, pdf_date = parse_long_dates(pdf_html_text)
    if not parsed:
        print(pdf_date) # error message
        continue
    pdf_filename = pdf_date + ".pdf"
    try:
        with open(save_path / pdf_filename, "wb") as f:
            f.write(pdf_file.content)
        total_counter += 1
    except TypeError as err:
        print(f"an error occurred with path {pdf_location}: {err}")

Can be replaced with this:

year_links = get_year_links(soup)

meeting_links = {}
for year, link in year_links.items():
    page = check_and_parse_page(link)
    meeting_links[year] = get_meeting_links(page)

missing_pdfs = check_missing_pdfs(meeting_links)

for year, meetings in missing_pdfs.items():
    for meeting, link in meetings.items():
        download_pdf(year, meeting, link)

Test Implications

  • There is a function check_missing_pdfs(meeting_links) that accepts a dictionary with the following structure:
    {'2020': {'2020_11_10': 'https://comptroller...pdf'},
    ...
    '2009': {'2009_10_21': 'https://comptroller...pdf'}}
    
  • The function returns a dictionary of the pdfs missing from the pdf_files directory, with the same structure as the input dict
  • If it finds any local files that aren't in the meeting_links input dictionary, it prints their names to the console (a sketch follows this list)
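
A minimal sketch of a check_missing_pdfs() that satisfies these implications; the per-year sub-directory layout under pdf_files/ is assumed from the download_pdfs() issue below:

from pathlib import Path

def check_missing_pdfs(meeting_links, pdf_dir=Path("pdf_files")):
    """Return the subset of meeting_links whose pdfs are not yet downloaded."""
    missing = {}
    expected = set()
    for year, meetings in meeting_links.items():
        for meeting_date, url in meetings.items():
            path = pdf_dir / year / f"{meeting_date}.pdf"
            expected.add(path.resolve())
            if not path.exists():
                missing.setdefault(year, {})[meeting_date] = url

    # Report any local files that don't correspond to a scraped meeting link.
    for pdf in pdf_dir.glob("*/*.pdf"):
        if pdf.resolve() not in expected:
            print(f"Unexpected file not in meeting_links: {pdf.name}")
    return missing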

Refactor tests and utils

Proposed Improvement

Refactor both utils.py and tests/ to organize them by major category of functions.

File Structure

For tests/

tests/
    conftest.py
    scrape/
        test_scrape.py
        sample_data.py
        sample_html.py
    parse/
        test_parse.py
        sample_data.py
        2013_11_20.pdf
        2010_03_17.pdf

For utils.py

common/
    utils.py
    scrape_utils.py
    parse_utils.py

Notes and Considerations

When refactoring, make sure the Jupyter notebook is also updated.

Write tests and code for `download_pdfs()`

Summary

Create a function download_pdf() that accepts the year, date, and url of the minutes for a BOE meeting, downloads the pdf at that url, and stores it in pdf_files/ within a sub-directory corresponding to the year in which the meeting occurred.

Code being improved

By accepting the output of check_missing_pdfs(), the following lines of code in store_boe_pdfs()

response_annual = requests.get(annual_url)
soup_annual = BeautifulSoup(response_annual.text, "html.parser")
pdf_links = soup_annual.find_all(name="a", href=re.compile("files"))
for idx, link in enumerate(pdf_links):
    pdf_location = link["href"]
    pdf_url = base_url + pdf_location
    pdf_file = requests.get(pdf_url)
    # derive name of the pdf file we're going to create
    # encoding and decoding removes hidden characters
    pdf_html_text = (
        link.get_text().strip().encode("ascii", "ignore").decode("utf-8")
    )
    # handle cases where the date is written out in long form
    parsed, pdf_date = parse_long_dates(pdf_html_text)
    if not parsed:
        print(pdf_date) # error message
        continue
    pdf_filename = pdf_date + ".pdf"
    try:
        with open(save_path / pdf_filename, "wb") as f:
            f.write(pdf_file.content)
        total_counter += 1
    except TypeError as err:
        print(f"an error occurred with path {pdf_location}: {err}")

Can be replaced with something similar to the following code:

year_links = get_year_links(soup)

meeting_links = {}
for year, link in year_links.items():
    page = check_and_parse_page(link)
    meeting_links[year] = get_meeting_links(page)

missing_pdfs = check_missing_pdfs(meeting_links)

for year, meetings in missing_pdfs.items():
    for date, link in meetings.items():
        download_pdf(year, date, link)

Tests

  • There is a function download_pdfs() that accepts the following inputs:
    • year: string of the year in which the meeting occurred
    • date: string of the date on which the meeting occurred with format YYYY-MM-DD
    • url: string which specifies the url to request to download the pdf
    • dir: Path to directory in which to store the pdf, defaults to pdf_files/ if no directory is specified
  • If given a valid url to a pdf, the function downloads and stores the pdf in the sub-directory that matches the year in which the meeting occurred and returns True, which the caller can capture in a variable
  • The name of the pdf matches the following pattern YYYY_MM_DD.pdf
  • If given a url that isn't a valid pdf, it returns False, which the caller can likewise capture in a variable (see the sketch below)
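
A hedged sketch of a download_pdf() that would satisfy these tests; treating the response's Content-Type header as the definition of a "valid pdf" is an assumption:

from pathlib import Path

import requests

def download_pdf(year, date, url, dir=Path("pdf_files")):
    """Download the minutes pdf at url into dir/<year>/YYYY_MM_DD.pdf.

    Returns True on success, False if the url does not point to a valid pdf.
    """
    response = requests.get(url)
    # A failed request or a non-pdf response counts as an invalid url here.
    content_type = response.headers.get("Content-Type", "").lower()
    if not response.ok or "pdf" not in content_type:
        return False

    save_dir = dir / str(year)
    save_dir.mkdir(parents=True, exist_ok=True)
    filename = date.replace("-", "_") + ".pdf"  # YYYY-MM-DD -> YYYY_MM_DD.pdf
    (save_dir / filename).write_bytes(response.content)
    return True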

Create Minutes class to assist with pdf parsing

Proposed Improvement

Refactor a portion of the code from store_pdf_text_to_df() into its own class called Minutes, which will allow us to parse the text for each set of minutes one meeting at a time instead of having to do all of them at once.

Class Structure

Attributes

  • page_count: The number of pages within the document
  • year: The year in which BOE met and these minutes were recorded
  • meeting_date: The date on which BOE met and these minutes were recorded, with format YYYY-MM-DD
  • df: The data frame of parsed text; this will be empty until the parse_pages() method is called, to avoid unnecessary work
  • sections: Stores the split text of the parsed pages; exact structure TBD

Methods

  • parse_pages(): Accepts an input that specifies the set of pages to parse and stores the resulting parsed pages in the df attribute as a DataFrame.
    • Inputs
      • range_type: an enum that specifies which type of range to expect; options:
        • all: parses all of the pages within the pdf, no additional parameters required
        • range: parses all of the pages within a given range; requires the start_page and end_page parameters
        • list: parses all of the pages specified directly in the page_list parameter; requires the page_list parameter
      • start_page: Accepts a number within the range of page_count that indicates the first page to start parsing
      • end_page: Accepts a number within the range of page_count that indicates the last page to parse
      • page_list: Accepts a list of page numbers within the range of page_count that indicates all of the pages to parse
    • Output: A dataframe of the parsed pages that can be accessed through the df attribute
  • get_sections(): Splits the text that has been parsed into the df attribute into a list of callable sections that can be accessed through the sections attribute; exact specification TBD (a skeletal sketch of the class follows this list)
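
A skeletal sketch of how the class might be laid out; deriving meeting_date and year from the file name, and the use of pdfplumber to count pages, are assumptions, and the method bodies are left as stubs:

from pathlib import Path

import pandas as pd
import pdfplumber  # assumed pdf library; the project may settle on another

class Minutes:
    """Parsed minutes for a single BOE meeting, loaded from */YYYY_MM_DD.pdf."""

    def __init__(self, pdf_path):
        self.path = Path(pdf_path)
        if not self.path.exists():
            raise FileNotFoundError(f"No minutes pdf found at {self.path}")
        # Derive meeting_date and year from the YYYY_MM_DD file name.
        stem = self.path.stem  # e.g. "2013_11_20"
        self.meeting_date = stem.replace("_", "-")
        self.year = int(stem.split("_")[0])
        with pdfplumber.open(self.path) as pdf:
            self.page_count = len(pdf.pages)
        self.df = pd.DataFrame()  # empty until parse_pages() is called
        self.sections = None      # populated by get_sections()

    def parse_pages(self, range_type="all", start_page=None, end_page=None,
                    page_list=None):
        """Parse the requested pages into self.df and return it (stub)."""
        raise NotImplementedError

    def get_sections(self):
        """Split the parsed text into sections stored on self.sections (stub)."""
        raise NotImplementedError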

Tests

Test Structure

The basic test structure will resemble the following:

class TestMinutes:
    def test_instantiation(self):
        assert 1

    def test_parse_pages(self):
        assert 1
  • Setup: Copy one of the pdfs from pdf_files/ directory into the tests/data/ subdirectory as a sample pdf to run the tests against
  • Execution: Use the sample to test both test_instantiation() and test_parse_pages()
  • Validation: Validate the assumptions listed below
  • Cleanup: No cleanup necessary

Testing Implications

  • That a Minutes class exists which accepts the path to a pdf file with the following pattern */YYYY_MM_DD.pdf when instantiating a new instance of the minutes
  • That an error message is returned if Minutes is passed a date for which there is no pdf saved
  • That the attributes page_count, meeting_date, and year exist and all return the correct value for a given meeting date
  • That the method parse_pages() works and accepts each of the range types specified above
  • That the output of parse_pages() is a dataframe that can be accessed from the attribute df

Think through what we can know about agreements

The minutes typically feature a section of agreements formatted like this:

STRONG CITY BALTIMORE, INC. $52,000.00
Account: 5000-506316-6397-460505-601002
The organization will provide a member of their staff
(Baltimore Corps Fellow) to coordinate youth services
related special projects for the YouthWorks Summer program.
The funds will be drawn from the Maryland Department of
Labor, Licensing, and Regulation and State General Funds.
The period of the agreement is September 1, 2016 through
August 31, 2017.

  • What can we know about each agreement?
  • What are the columns we expect to include in each row in an agreements table?
  • What are some initial ideas about how the code can correctly identify each agreement?
  • Are there reasonable approaches for validating the accuracy of agreements data?
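
As a starting point for the third question, a hedged regex sketch for spotting the header line of each agreement (an all-caps organization name followed by a dollar amount); the pattern is an assumption that would need testing against real minutes text:

import re

# Assumed pattern: an organization name in capital letters followed by a
# dollar amount marks the start of a new agreement record.
AGREEMENT_HEADER = re.compile(
    r"^(?P<org>[A-Z][A-Z0-9 .,&'/-]+?)\s+\$(?P<amount>[\d,]+\.\d{2})\s*$",
    re.MULTILINE,
)

sample = (
    "STRONG CITY BALTIMORE, INC. $52,000.00\n"
    "Account: 5000-506316-6397-460505-601002\n"
    "The organization will provide a member of their staff..."
)

match = AGREEMENT_HEADER.search(sample)
print(match.group("org"), match.group("amount"))
# STRONG CITY BALTIMORE, INC. 52,000.00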

Break giant text string from each PDF into modular sections

Perhaps we begin by just breaking the agreements section away from everything else.

One approach is to make each .pdf-derived text into an instance of a class, with methods that break it up.

Let's create a new .py file with all the functions that achieve this breaking-up of the original text.

Unit Test Framework Setup

Set up unit test framework and spec out initial set of tests for the existing code base.

In Scope

  • Selection and installation of a testing library (e.g., pytest, unittest)
  • Setup of a unit test sub-directory
  • Creation of unit test methods/classes with doc strings describing what each method should be testing
  • Update the README.md with instructions for how to run the tests upon installation and setup

Out of Scope

  • Writing actual tests. Tests should simply be substituted with pass or assert True == True and should be marked with # TO DO: comments for completion in a later issue.
  • Adjusting any of the existing classes, methods, or functions to make them easier to test
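
For example, a stubbed test might look like the following until a later issue fills in the real assertion (the file path and names are placeholders):

# tests/test_store_boe_pdfs.py  (placeholder path)

class TestParseLongDates:
    def test_parse_long_dates(self):
        """Should convert long-form dates like 'November 30, 2016' into the pdf file-name format."""
        # TO DO: replace with a real assertion in a later issue.
        assert True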
