CAFR Parsing

Automated data extraction from U.S. state Comprehensive Annual Financial Reports (CAFR).

Directory Structure

  • taxonomy: The XBRL taxonomy.
  • templates: Where table templates (explained below) should be located. Any .txt files in this directory will be loaded automatically by miner.py (a loading sketch follows this list).
  • data: The state CAFR files we are using as inputs.
  • example-output: Some pre-generated examples of the output this system produces, in XBRL-XML format, CSV format, and MS-Excel format.
  • analysis: Various explorations performed around XBRL, CAFR, and PDF conversions. Everything in this folder documents our thought process; nothing in it is used by the system.
  • results: This folder should be empty in the repository. It is where miner.py places its output.
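
For reference, the automatic template loading mentioned above could look something like the following sketch. This is an illustration only, not the actual miner.py code; the function name and return shape are assumptions.

    # Sketch only: one way to load every .txt template in templates/ as JSON.
    # This is not the code miner.py actually uses.
    import glob
    import json
    import os

    def load_templates(template_dir="templates"):
        """Load each templates/*.txt file and return {template name: parsed JSON}."""
        templates = {}
        for path in glob.glob(os.path.join(template_dir, "*.txt")):
            name = os.path.splitext(os.path.basename(path))[0]
            with open(path) as f:
                templates[name] = json.load(f)
        return templates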

Installation

First-time setup (Linux)

You'll need a basic Python environment and the ability to check out this code repository, which means at least the following packages:

  • git
  • python (specifically python2)

Depending on how your OS distribution packages Python, you may need to manually install setuptools. See the setuptools documentation for details; on Debian GNU/Linux (and probably Ubuntu), you can just run sudo apt-get update && sudo apt-get install python-setuptools.

Then install pip to facilitate the rest of the dependency installation process:

sudo easy_install pip

(If your distribution uses Python 3 by default, you may need to change easy_install to easy_install-2.7.)

Now install some Python tools that will help you bootstrap the Python environment.

sudo pip install -UI setuptools pip virtualenv

Pick a place to store the repo. I usually put projects in a Code directory in my home folder, but you can adjust this accordingly. cd into that directory (e.g. cd ~/Code). Then:

git clone https://github.com/OpenTechStrategies/cafr-parsing
virtualenv cafr-parsing

(If your distribution uses Python 3 by default, you'll need to change the virtualenv line to be virtualenv -p /usr/bin/python2 cafr-parsing or something along those lines.)

Now we'll cd into the cafr-parsing repo and "activate" this environment. Then, using pip, we'll install all the Python libraries defined in the requirements.txt file. (This is sort of like a Ruby Gemfile.)

cd cafr-parsing
source bin/activate
pip install -r requirements.txt

You can verify that the virtual environment is using an isolated copy of Python 2 (which python should point inside the cafr-parsing directory, and python --version should report a 2.x release):

which python
python --version

Usage

miner.py parses CAFR files, which are PDF documents, and produces JSON files that can easily be translated to other formats (e.g., XBRL, CSV, .xlsx).
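
To make "easily translated" concrete, here is a minimal sketch of re-emitting a flat JSON record as CSV using only the standard library. The file and field names are hypothetical; they are not the project's actual output schema.

    # Illustration only: converting one flat JSON record to CSV.
    import csv
    import json

    with open("results/example.json") as f:       # hypothetical miner.py output
        record = json.load(f)                     # e.g. {"line_item": "...", "amount": 123}

    with open("results/example.csv", "wb") as f:  # "wb" per the Python 2 csv module docs
        writer = csv.writer(f)
        writer.writerow(list(record.keys()))
        writer.writerow(list(record.values()))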

In order to know which tables to extract from the PDF files and what their fields mean, miner.py must be supplied with a "template" for each table: a manually-constructed JSON file that tells miner.py exactly how to recognize that table and how to map the data in the table to the desired output fields.
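
As a rough illustration of the kind of information a template carries, the sketch below builds a hypothetical template in Python and prints it as JSON. The keys shown here are invented for the example; consult the .txt files in templates/ for the format this project actually uses.

    # Hypothetical template structure -- the keys are made up, not the real schema.
    import json

    example_template = {
        "table_title": "Statement of Net Assets",  # text used to recognize the table in the PDF
        "columns": ["Governmental Activities", "Business-type Activities", "Total"],
        "rows": {
            "Cash and cash equivalents": "CashAndCashEquivalents",  # PDF row label -> output field
        },
    }

    print(json.dumps(example_template, indent=2, sort_keys=True))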

For now, the invocation process is just to open up miner.py in a text editor and add calls to the end like this:

    process_pdf("data/AL_cafr2011.pdf")
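
If you want to process several CAFRs, you can add one such call per PDF, or append a loop like the sketch below to the end of miner.py (where process_pdf is defined). It assumes each PDF in data/ has matching templates.

    # Sketch: process every CAFR PDF in data/ instead of listing calls one by one.
    import glob

    for pdf_path in sorted(glob.glob("data/*.pdf")):
        process_pdf(pdf_path)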

Once you've set up as many calls as you want, run miner.py (assuming you've already done the setup steps listed above in the "Installation" section):

    $ python miner.py

It may take a while to run, possibly several minutes. When it's done, the results will be in the results/ directory. There will be one result file for each table extracted from a given state's CAFR for a given year.

  • data/AL_cafr2011.pdf is an example CAFR input file
  • templates/AL_statewidenetassets.txt is an example template file
  • results/AL_cafr2011-statewide_net_assets.xml is an example output file (this won't exist until you run miner.py)

Viewing XBRL

XBRL is not particularly useful to humans without software to render the content. Example CSV output, which was created by exporting the XBRL output to CSV with an XBRL viewer, can be found in the example-output/csv directory. Even more readable examples in XLSX format are located in the example-output/xlsx directory. Note, when reading the CSV and XLSX files, that the columns may appear in a different order than they do in the original PDFs.

Alternatively, examples of XBRL output can be found in the example-output/xbrl-xml directory.

To view the XBRL directly:

  1. Download and install an XBRL viewer.
  2. Copy the taxonomy files (located in the taxonomy directory) into a working folder of your choosing.
  3. Copy the results files (located in the results directory) into that working folder.
  4. Open the results files in the XBRL viewer.

Resources

These are resources that were helpful while exploring:

  • XBRL Taxonomy Information

Next Steps

  • There are dozens of TODO flags scattered throughout the code base. Some are minor, some are major.
  • Continued refinement of the taxonomy structure.
  • Creation of additional templates for additional states.
  • Tools to assist in template generation.
  • Command-line tools for running the miner (a possible shape is sketched after this list).
  • Validation tools to identify when a template no longer matches the schema.
  • Think about whether this ties in with http://open-data-standards.github.io/ (and thence https://github.com/open-data-standards) in any useful way.
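
For the command-line item above, a wrapper script along these lines is one possible shape. Nothing like this exists in the repository yet; the interface and the import are assumptions.

    # Hypothetical CLI wrapper -- not part of the repository.
    import argparse

    from miner import process_pdf  # assumes process_pdf can be imported from miner.py

    def main():
        parser = argparse.ArgumentParser(description="Extract tables from CAFR PDFs.")
        parser.add_argument("pdfs", nargs="+", help="CAFR PDF files to process")
        args = parser.parse_args()
        for pdf in args.pdfs:
            process_pdf(pdf)

    if __name__ == "__main__":
        main()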

Related Projects

  • pdftabextract (a set of tools to help "extract tabular data from (OCR-processed) PDF files") might be useful to look at.
