department-of-general-services / boe_tabulator

Reads PDFs from Baltimore's archive of minutes from the Board of Estimates and places the data in a searchable table.

License: MIT License
Probably the best place to start is the `agreements` table, but the `contractors` table is another possibility.
The problem is in the html on the source web page. If the source html is like this:

```html
<a href="/files/4983-52172016-11-30pdf">​November 30, 2016</a>
```

then Python gets confused by the invisible character (a zero-width character sits just before "November") and can't identify the month as November.
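A minimal sketch of a guard against this (the helper name `clean_link_text` is hypothetical; the ascii encode/decode round-trip mirrors the one already used in `store_boe_pdfs()`):

```python
def clean_link_text(raw_text):
    """Strip hidden characters (e.g. zero-width spaces) from link text."""
    # encoding to ascii with errors="ignore" drops non-ascii characters,
    # including the invisible ones that break month parsing
    return raw_text.strip().encode("ascii", "ignore").decode("ascii")


# "\u200b" is a zero-width space like the one in the anchor tag above
assert clean_link_text("\u200bNovember 30, 2016") == "November 30, 2016"
```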
Most minutes PDFs have a relatively simple top-level structure, with a somewhat more complex and unpredictable second-level structure. The top-level structure starts with:
This issue will not change the core data pipeline, but will rather inform work on #42 by showing exactly how consistent the sectioning of the minutes pdfs is.
Add a test for the collection of annual links from the minutes page, and write a function that makes this test pass, which can eventually replace the following segment of code within `store_boe_pdfs()`:

```python
# find all links where the associated text contains the year
link = soup.find("a", href=True, text=str(year))
annual_url = base_url + link["href"]
print(f"Saving files from url: {annual_url}")
```
Create a test class with the following methods:

```python
class TestGetAnnualLinks:
    def test_get_annual_links(self):
        pass

    def test_fix_absolute_ref(self):
        pass

    def test_new_year(self):
        pass

    def test_exclude_non_year_links(self):
        pass
```
`test_get_annual_links()` should:

- Call `get_annual_links()` and capture the output
- Assert that the `get_annual_links()` output captured in the execution matches the dictionary passed in the setup

These tests imply the following behavior from the code:

- A function `get_annual_links()` that accepts a soupified version of an html page output by `check_page_setup()` and returns a dictionary of links with the following structure:

```
{2009: 'https://comptroller.baltimorecity.gov/minutes-2009',
 ...
 2020: 'https://comptroller.baltimorecity.gov/minutes-2020'}
```
- All of the returned links should use `base_url` as their root
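A rough sketch of a `get_annual_links()` that satisfies these constraints (the four-digit-year regex filter and the use of `urljoin` are assumptions on my part):

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://comptroller.baltimorecity.gov"


def get_annual_links(soup):
    """Return a dict mapping each year to its minutes page url."""
    annual_links = {}
    for link in soup.find_all("a", href=True):
        # only keep links whose text is a four-digit year, which
        # excludes non-year links elsewhere on the page
        match = re.fullmatch(r"(19|20)\d{2}", link.get_text(strip=True))
        if match:
            year = int(match.group())
            # urljoin keeps an href that is already absolute as-is and
            # resolves a relative one against BASE_URL, avoiding the
            # duplicated-prefix bug seen in store_boe_pdfs()
            annual_links[year] = urljoin(BASE_URL, link["href"])
    return annual_links
```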
As we continue to reason about which tables should go into the database and how they relate to one another, it'll be important to make sure all contributors to the repo share an understanding of what each table will contain.

Let's add a one- to two-sentence SQL comment ahead of each table that just explains in plain English what entity that table will store data about.
This code will contain all the parts that read through the text and figure out the rows in the `agreements` table.

A PR has been submitted with an alternate scraping function that should perform slightly better and serve as a drop-in replacement.
Since the encoding/decoding process for these pdfs seems to have some errors (characters being incorrectly decoded), we need to update the list of character replacements. Currently we have only two replacements identified, but there are more than just those two.
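A sketch of how the replacement list might be structured so new mappings are easy to add (the specific pairs below are illustrative, not the repo's actual list):

```python
# hypothetical mapping of mis-decoded sequences to their intended
# characters; extend this dict as new decoding errors are identified
CHAR_REPLACEMENTS = {
    "\u2019": "'",   # right single quotation mark -> apostrophe
    "\ufb01": "fi",  # "fi" ligature -> plain letters
}


def replace_chars(text):
    """Apply every known character replacement to the extracted text."""
    for bad, good in CHAR_REPLACEMENTS.items():
        text = text.replace(bad, good)
    return text
```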
Anyone running the notebook is currently using functions that are not the newest. Updating the notebook to use the new functions will also serve as a test that everything works together as expected.
We need a more flexible date parser, specifically with regard to handling misspellings of the month. Currently we can handle any single-letter deletion, but not substitutions or additions.
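One possible approach, sketched here with `difflib` from the standard library, which tolerates substitutions and additions as well as deletions (the cutoff value is a guess that would need tuning):

```python
import calendar
import difflib

MONTHS = [m.lower() for m in calendar.month_name[1:]]  # january..december


def match_month(token):
    """Return the best-guess month name for a possibly misspelled token."""
    # get_close_matches scores candidates by similarity, so it handles
    # deletions ("novembe"), substitutions ("novenber"), and additions
    matches = difflib.get_close_matches(token.lower(), MONTHS, n=1, cutoff=0.75)
    return matches[0] if matches else None


assert match_month("Novenber") == "november"
```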
I think it'll be helpful to set a couple of ground rules at the outset.
This will be a living document; changes to it can be pushed just like we'd push code.
Split the functionality of retrieving the links to the minutes pdfs out into its own function.
The new function should:

- Accept a CSS selector as a parameter (e.g. `div[class='field field-name-body'] > p:last-of-type`). This would allow us to accommodate future pages that may differ in their DOM structure. This also assumes that there is a tag somewhere on the page that serves as a "container" which all the links are in. This is still tentative and might not make the final cut.
- Return a dictionary keyed by dates with the format `YYYY-MM-DDx`, where `x` is an optional identifier for meetings that occurred on the same day (e.g., `2017-06-12b`)

Create a test class with the following methods:

```python
class TestGetMinutesLinks:
    def test_get_minutes_links(self):
        pass
```
`test_get_minutes_links()` should:

- Call the `get_minutes_links()` function and store the output
- Assert that the output of `get_minutes_links()` matches the expected output

These tests imply a function `get_minutes_links(year_url, css_selector=<default value>)`, called like this:

```python
get_minutes_links(
    "https://comptroller.baltimorecity.gov/boe/meetings/minutes",
    "div[class='field field-name-body'] > p:last-of-type",
)
```
and returning a dictionary with the following structure:

```
{
    "2017-01-11": "https://comptroller.baltimorecity.gov/files/0001-00792017-01-11pdf",
    ...,
    "2017-06-12b": "https://comptroller.baltimorecity.gov/files/2186-22002017-06-12pdf",
    ...,
    "2017-12-20": "https://comptroller.baltimorecity.gov/files/5482-55802017-12-20pdf"
}
```
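For reference, a sketch of how `get_minutes_links()` might be implemented against that spec. It reuses `parse_long_dates()` from the existing code (assumed to live in `utils.py`); the same-day suffix logic and the hyphenation of keys are guesses:

```python
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

from utils import parse_long_dates  # existing date helper quoted elsewhere in this repo


def get_minutes_links(
    year_url,
    css_selector="div[class='field field-name-body'] > p:last-of-type",
):
    """Return a dict mapping 'YYYY-MM-DD(x)' keys to minutes pdf urls."""
    soup = BeautifulSoup(requests.get(year_url).text, "html.parser")
    minutes_links = {}
    for container in soup.select(css_selector):
        for link in container.find_all("a", href=re.compile("files")):
            parsed, date = parse_long_dates(link.get_text(strip=True))
            if not parsed:
                continue
            key = date.replace("_", "-")  # match the hyphenated key format
            suffix = "b"  # 'b', 'c', ... distinguishes same-day meetings
            while key in minutes_links:
                key = date.replace("_", "-") + suffix
                suffix = chr(ord(suffix) + 1)
            minutes_links[key] = urljoin(year_url, link["href"])
    return minutes_links
```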
Explore use cases and how this data can be used:

- What types of questions are most critical to be answered within the city?
- What types of questions are most critical to users outside of the city?

Please store the results in this location: `/discovery/operational_questions.md`
Running `tabulator.ipynb` currently produces the following output:

```
Saving files from url: https://comptroller.baltimorecity.gov/http://comptroller.baltimorecity.gov/minutes-2009
Saving files from url: https://comptroller.baltimorecity.gov/http://comptroller.baltimorecity.gov/minutes-2010
...
Saving files from url: https://comptroller.baltimorecity.gov/http://comptroller.baltimorecity.gov/minutes-2020
Wrote 0 .pdf files to local repo.
```
link["href"]
is the full url rather than a relative reference, therefore when it's appended to base_url
it duplicates the first part of the url.store_boe_pdfs()
function was writtencheck_page_setup()
that makes the first test passget_annual_links()
that makes the second test passcheck_page_setup()
and get_annual_links()
in bike_rack/store_boe_pdfs_helper_functions.py
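For the fix itself, `urllib.parse.urljoin` handles both absolute and relative hrefs, and could replace the string concatenation:

```python
from urllib.parse import urljoin

base_url = "https://comptroller.baltimorecity.gov"

# an href that is already absolute passes through unchanged
assert (
    urljoin(base_url, "http://comptroller.baltimorecity.gov/minutes-2009")
    == "http://comptroller.baltimorecity.gov/minutes-2009"
)
# a relative href is resolved against base_url
assert (
    urljoin(base_url, "/minutes-2009")
    == "https://comptroller.baltimorecity.gov/minutes-2009"
)
```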
Create a function `check_missing_pdfs()` that accepts the output of `get_meeting_links()` and returns a dictionary with all of the dates and links to the pdfs that can't be found within the `pdf_files` directory.

By accepting the output of `get_meeting_links()` and passing the missing pdfs to a function called `download_pdf()` (which still needs to be created), the following lines of code in `store_boe_pdfs()`:
```python
response_annual = requests.get(annual_url)
soup_annual = BeautifulSoup(response_annual.text, "html.parser")
pdf_links = soup_annual.find_all(name="a", href=re.compile("files"))
for idx, link in enumerate(pdf_links):
    pdf_location = link["href"]
    pdf_url = base_url + pdf_location
    pdf_file = requests.get(pdf_url)
    # derive name of the pdf file we're going to create
    # encoding and decoding removes hidden characters
    pdf_html_text = (
        link.get_text().strip().encode("ascii", "ignore").decode("utf-8")
    )
    # handle cases where the date is written out in long form
    parsed, pdf_date = parse_long_dates(pdf_html_text)
    if not parsed:
        print(pdf_date)  # error message
        continue
    pdf_filename = pdf_date + ".pdf"
    try:
        with open(save_path / pdf_filename, "wb") as f:
            f.write(pdf_file.content)
            total_counter += 1
    except TypeError as err:
        print(f"an error occurred with path {pdf_location}: {err}")
```
can be replaced with this:

```python
year_links = get_year_links(soup)
meeting_links = {}
for year, link in year_links.items():
    page = check_and_parse_page(link)
    meeting_links[year] = get_meeting_links(page)
missing_pdfs = check_missing_pdfs(meeting_links)
for year, meetings in missing_pdfs.items():
    for meeting, link in meetings.items():
        download_pdf(year, meeting, link)
```
The function `check_missing_pdfs(meeting_links)` accepts a dictionary with the following structure:

```
{'2020': {'2020_11_10': 'https://comptroller...pdf'},
 ...
 '2009': {'2009_10_21': 'https://comptroller...pdf'}}
```

It returns a dictionary of the pdfs missing from the `pdf_files` directory, with the same structure as the input dict. For any pdfs that are absent from the `meeting_links` input dictionary, it prints their names out to the console.

This function is currently in the bike rack, and was created there as a safety measure. Since PR#49 is merged now, this function is approved and should be incorporated into the main `get_boe_pdfs()` script.
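A minimal sketch of what that could look like, assuming pdfs are saved as `pdf_files/<year>/<YYYY_MM_DD>.pdf` (the console-printing behavior is omitted here):

```python
from pathlib import Path


def check_missing_pdfs(meeting_links, pdf_dir=Path("pdf_files")):
    """Return the subset of meeting_links whose pdfs are not saved locally."""
    missing = {}
    for year, meetings in meeting_links.items():
        for date, link in meetings.items():
            # pdfs are assumed stored in per-year sub-directories
            # with names like 2020_11_10.pdf
            if not (pdf_dir / year / f"{date}.pdf").exists():
                missing.setdefault(year, {})[date] = link
    return missing
```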
Refactor both `utils.py` and `tests/` to organize them by major category of functions.

`tests/` becomes:

```
tests/
├── conftest.py
├── scrape/
│   ├── test_scrape.py
│   ├── sample_data.py
│   └── sample_html.py
└── parse/
    ├── test_parse.py
    ├── sample_data.py
    ├── 2013_11_20.pdf
    └── 2010_03_17.pdf
```

`utils.py` becomes:

```
common/
├── utils.py
├── scrape_utils.py
└── parse_utils.py
```

When refactoring, make sure the jupyter notebook is also updated.
Create a function `download_pdf()` that accepts the year, date, and url of the minutes for a BOE meeting, downloads the pdf specified at the url, and stores it within `pdf_files/`, in a sub-directory corresponding to the year in which the meeting occurred.
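A minimal sketch, assuming `requests` and `pathlib` (the `pdf_dir` parameter is an addition for testability, not part of the spec):

```python
from pathlib import Path

import requests


def download_pdf(year, date, url, pdf_dir=Path("pdf_files")):
    """Download the minutes pdf at url into pdf_files/<year>/<date>.pdf."""
    save_dir = pdf_dir / year
    save_dir.mkdir(parents=True, exist_ok=True)
    response = requests.get(url)
    response.raise_for_status()  # surface bad links instead of saving junk
    (save_dir / f"{date}.pdf").write_bytes(response.content)
```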
By accepting the output of `check_missing_pdfs()`, the following lines of code in `store_boe_pdfs()`:
```python
response_annual = requests.get(annual_url)
soup_annual = BeautifulSoup(response_annual.text, "html.parser")
pdf_links = soup_annual.find_all(name="a", href=re.compile("files"))
for idx, link in enumerate(pdf_links):
    pdf_location = link["href"]
    pdf_url = base_url + pdf_location
    pdf_file = requests.get(pdf_url)
    # derive name of the pdf file we're going to create
    # encoding and decoding removes hidden characters
    pdf_html_text = (
        link.get_text().strip().encode("ascii", "ignore").decode("utf-8")
    )
    # handle cases where the date is written out in long form
    parsed, pdf_date = parse_long_dates(pdf_html_text)
    if not parsed:
        print(pdf_date)  # error message
        continue
    pdf_filename = pdf_date + ".pdf"
    try:
        with open(save_path / pdf_filename, "wb") as f:
            f.write(pdf_file.content)
            total_counter += 1
    except TypeError as err:
        print(f"an error occurred with path {pdf_location}: {err}")
```
can be replaced with something similar to the following code:

```python
year_links = get_year_links(soup)
meeting_links = {}
for year, link in year_links.items():
    page = check_and_parse_page(link)
    meeting_links[year] = get_meeting_links(page)
missing_pdfs = check_missing_pdfs(meeting_links)
for year, meetings in missing_pdfs.items():
    for date, link in meetings.items():
        download_pdf(year, date, link)
```
Create a function `download_pdfs()` that accepts the following inputs:

- a meeting date with the format `YYYY-MM-DD`
- a target directory, defaulting to `pdf_files/` if no directory is specified

And behaves as follows:

- Returns `True`, which can be stored in a variable `passed`, when the pdf is downloaded and stored as `YYYY_MM_DD.pdf`
- Returns `False`, which can be stored in a variable `passed`, when the download fails

Remember to add any new dependencies to `requirements.txt`.
Currently, line 71 in `utils.py` reads:

```python
for year in range(2009, 2021):
```

Let's adjust it to use the code currently in `bike_rack/alternative_scraping_function.py` to fix this one problem only.
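For comparison, a dynamic alternative that collects the years from the page itself rather than hard-coding them; this may or may not match what the bike rack function does (`soup` here is the parsed minutes page from `store_boe_pdfs()`):

```python
import re

# instead of: for year in range(2009, 2021):
# collect every four-digit year that appears as link text on the page,
# so new years are picked up without a code change
years = sorted(
    int(link.get_text(strip=True))
    for link in soup.find_all("a", href=True)
    if re.fullmatch(r"(19|20)\d{2}", link.get_text(strip=True))
)
for year in years:
    ...
```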
Refactor a portion of the code from `store_pdf_text_to_df()` into its own class called `Minutes`, which will allow us to parse the text for each set of minutes one meeting at a time instead of having to do all of them at once.

Attributes:

- `page_count`: The number of pages within the document
- `year`: The year in which the BOE met and these minutes were recorded
- `meeting_date`: The date on which the BOE met and these minutes were recorded, with format `YYYY-MM-DD`
- `df`: The data frame of parsed text; this will be empty until the `parse_pages()` method is called, to save unnecessary work
- `sections`: Stores the split text of the parsed pages, exact structure TBD

Methods:

- `parse_pages()`: Accepts an input that specifies the set of pages to parse and stores the resulting parsed pages in the `df` attribute as a DataFrame. Parameters:
  - `range_type`: an enum that specifies which type of range to expect; options:
    - `all`: parses all of the pages within the pdf, no additional parameters required
    - `range`: parses all of the pages within a given range, requires the `start_page` and `end_page` parameters be filled out
    - `list`: parses all of the pages specified directly within the `page_list` parameter, requires the `page_list` parameter be filled out
  - `start_page`: Accepts a number within the range of `page_count`; indicates the first page to start parsing
  - `end_page`: Accepts a number within the range of `page_count`; indicates the last page to parse
  - `page_list`: Accepts a list of page numbers within the range of `page_count`, which indicates all of the pages to parse and store in the `df` attribute
- `get_sections()`: Splits the text that has been parsed into the `df` attribute into a list of callable sections that can be accessed through the `sections` attribute, exact specification TBD

The basic test structure will resemble the following:
```python
class TestMinutes:
    def test_instantiation(self):
        assert 1

    def test_parse_pages(self):
        assert 1
```
- Copy a pdf from the `pdf_files/` directory into the `tests/data/` subdirectory as a sample pdf to run the tests against
- Fill out `test_instantiation()` and `test_parse_pages()` to confirm that:
  - a `Minutes` class exists which accepts the path to a pdf file with the following pattern `*/YYYY_MM_DD.pdf` when instantiating a new instance of the minutes
  - instantiation is handled gracefully when `Minutes` is passed a date for which there is no pdf saved
  - the attributes `page_count`, `meeting_date`, and `year` exist and all return the correct value for a given meeting date
  - `parse_pages()` works and accepts each of the range types specified above
  - the output of `parse_pages()` is a dataframe that can be accessed from the attribute `df`
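A skeleton of how the class might come together, sketched with pdfplumber (the repo's actual pdf library may differ, and `range_type` is simplified here to a plain string rather than an enum):

```python
from pathlib import Path

import pandas as pd
import pdfplumber


class Minutes:
    """Parsed minutes for a single BOE meeting."""

    def __init__(self, pdf_path):
        self.path = Path(pdf_path)  # expects a */YYYY_MM_DD.pdf name
        if not self.path.exists():
            raise FileNotFoundError(f"no pdf saved at {self.path}")
        self.meeting_date = self.path.stem.replace("_", "-")
        self.year = int(self.meeting_date[:4])
        with pdfplumber.open(self.path) as pdf:
            self.page_count = len(pdf.pages)
        self.df = pd.DataFrame()  # empty until parse_pages() is called
        self.sections = None      # populated by get_sections(), TBD

    def parse_pages(self, range_type="all", start_page=None, end_page=None, page_list=None):
        """Extract text for the requested pages into the df attribute."""
        if range_type == "all":
            pages = range(1, self.page_count + 1)
        elif range_type == "range":
            pages = range(start_page, end_page + 1)
        elif range_type == "list":
            pages = page_list
        with pdfplumber.open(self.path) as pdf:
            rows = [
                {"page": n, "text": pdf.pages[n - 1].extract_text()}
                for n in pages
            ]
        self.df = pd.DataFrame(rows)
        return self.df

    def get_sections(self):
        """Split parsed text into sections; exact specification TBD."""
        raise NotImplementedError
```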
The minutes typically feature a section of agreements formatted like this:
```
STRONG CITY BALTIMORE, INC. $52,000.00
Account: 5000-506316-6397-460505-601002
The organization will provide a member of their staff
(Baltimore Corps Fellow) to coordinate youth services
related special projects for the YouthWorks Summer program.
The funds will be drawn from the Maryland Department of
Labor, Licensing, and Regulation and State General Funds.
The period of the agreement is September 1, 2016 through
August 31, 2017.
```
How should this map onto the `agreements` table?
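As a starting point, a sketch of a regex that pulls the vendor, amount, and account out of a block shaped like the one above (`AGREEMENT_PATTERN` is hypothetical, and real minutes will have variations it won't cover):

```python
import re

AGREEMENT_PATTERN = re.compile(
    r"(?P<vendor>[A-Z][A-Z ,.'&-]+?)\s+"      # vendor name in caps
    r"\$(?P<amount>[\d,]+\.\d{2})\s*\n"       # dollar amount
    r"Account:\s*(?P<account>[\d-]+)\s*\n"    # account number
    r"(?P<description>(?:.+\n?)+)"            # free-text description
)

sample = """STRONG CITY BALTIMORE, INC. $52,000.00
Account: 5000-506316-6397-460505-601002
The organization will provide a member of their staff..."""

match = AGREEMENT_PATTERN.search(sample)
print(match.group("vendor"), match.group("amount"), match.group("account"))
# STRONG CITY BALTIMORE, INC. 52,000.00 5000-506316-6397-460505-601002
```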
Perhaps we begin by just breaking the `agreements` section away from everything else.
One approach is to make each .pdf-derived text into an instance of a class, with methods that break it up.
Let's create a new .py file with all the functions that achieve this breaking-up of the original text.
The `bike_rack.py` file holds code that is under construction or being stored as a resource.
Set up a unit test framework and spec out an initial set of tests for the existing code base. Tests can be stubbed with `pass` or `assert True == True`, and should be marked with `# TO DO:` comments for completion in a later issue.

Just want to generate a few line plots so we'll have something to show as an example of the kinds of insights we can get.
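A minimal sketch of the kind of plot this might be, assuming a hypothetical `agreements` dataframe with `meeting_date` and `amount` columns:

```python
import matplotlib.pyplot as plt
import pandas as pd

# hypothetical agreements dataframe produced by the parsing pipeline;
# the rows here are stand-in sample values, not real data
agreements = pd.DataFrame({
    "meeting_date": pd.to_datetime(["2016-11-30", "2017-06-12", "2017-12-20"]),
    "amount": [52000.00, 13500.00, 98000.00],
})

# total agreement dollars per year as a simple line plot
totals = agreements.groupby(agreements["meeting_date"].dt.year)["amount"].sum()
totals.plot(kind="line", marker="o", title="Total agreement spending by year")
plt.xlabel("Year")
plt.ylabel("Dollars")
plt.show()
```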