CAFR Parsing

Automated data extraction from U.S. state Comprehensive Annual Financial Reports (CAFR).

Directory Structure

  • taxonomy: The XBRL taxonomy.
  • templates: Where table templates (explained below) should be located. Any .txt files in this directory will be loaded automatically by miner.py (a loading sketch follows this list).
  • data: The state CAFR files we are using as inputs.
  • example-output: Some pre-generated examples of the output this system produces, in XBRL-XML format, CSV format, and MS-Excel format.
  • analysis: Various explorations performed around XBRL, CAFR, and PDF conversions. Everything in this folder documents our thought process; nothing in it is used by the system.
  • results: This folder should be empty in the repository. It is where miner.py places its output.
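
For reference, the automatic template loading mentioned above could look something like the following sketch. This is an illustration only, not the actual miner.py code; the function name and return shape are assumptions.

    # Sketch only: one way to load every .txt template in templates/ as JSON.
    # This is not the code miner.py actually uses.
    import glob
    import json
    import os

    def load_templates(template_dir="templates"):
        """Load each templates/*.txt file and return {template name: parsed JSON}."""
        templates = {}
        for path in glob.glob(os.path.join(template_dir, "*.txt")):
            name = os.path.splitext(os.path.basename(path))[0]
            with open(path) as f:
                templates[name] = json.load(f)
        return templates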

Installation

First-time setup (Linux)

You'll need a basic Python environment and the ability to check out this code repository, which means at least the following packages:

  • git
  • python (specifically python2)

Depending on how your OS distribution packages Python, you may need to manually install setuptools. See the setuptools documentation for details; on Debian GNU/Linux (and probably Ubuntu), you can just run sudo apt-get update && sudo apt-get install python-setuptools.

Then install pip to facilitate the rest of the dependency installation process:

sudo easy_install pip

(If your distribution uses Python 3 by default, you may need to change easy_install to easy_install-2.7.)

Now install some Python tools that will help you bootstrap the Python environment.

sudo pip install -UI setuptools pip virtualenv

Pick a place to store the repo. I usually put projects in a Code directory in my home folder, but you can adjust this accordingly. cd into that directory (e.g. cd ~/Code). Then:

git clone https://github.com/OpenTechStrategies/cafr-parsing
virtualenv cafr-parsing

(If your distribution uses Python 3 by default, you'll need to change the virtualenv line to be virtualenv -p /usr/bin/python2 cafr-parsing or something along those lines.)

Now we'll cd into the cafr-parsing repo and "activate" this environment. Then, using pip, we'll install all the Python libraries defined in the requirements.txt file. (This is sort of like a Ruby Gemfile.)

cd cafr-parsing
source bin/activate
pip install -r requirements.txt

You can verify that the virtual environment is using an isolated copy of Python 2 (which python should point inside the cafr-parsing directory, and python --version should report a 2.x release):

which python
python --version

Usage

miner.py parses CAFR files, which are PDF documents, and produces JSON files that can easily be translated to other formats (e.g., XBRL, CSV, .xlsx).
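
To make "easily translated" concrete, here is a minimal sketch of re-emitting a flat JSON record as CSV using only the standard library. The file and field names are hypothetical; they are not the project's actual output schema.

    # Illustration only: converting one flat JSON record to CSV.
    import csv
    import json

    with open("results/example.json") as f:       # hypothetical miner.py output
        record = json.load(f)                     # e.g. {"line_item": "...", "amount": 123}

    with open("results/example.csv", "wb") as f:  # "wb" per the Python 2 csv module docs
        writer = csv.writer(f)
        writer.writerow(list(record.keys()))
        writer.writerow(list(record.values()))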

In order to know which tables to extract from the PDF files and what their fields mean, miner.py must be supplied with a "template" for each table: a manually-constructed JSON file that tells miner.py exactly how to recognize that table and how to map the data in the table to the desired output fields.
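
As a rough illustration of the kind of information a template carries, the sketch below builds a hypothetical template in Python and prints it as JSON. The keys shown here are invented for the example; consult the .txt files in templates/ for the format this project actually uses.

    # Hypothetical template structure -- the keys are made up, not the real schema.
    import json

    example_template = {
        "table_title": "Statement of Net Assets",  # text used to recognize the table in the PDF
        "columns": ["Governmental Activities", "Business-type Activities", "Total"],
        "rows": {
            "Cash and cash equivalents": "CashAndCashEquivalents",  # PDF row label -> output field
        },
    }

    print(json.dumps(example_template, indent=2, sort_keys=True))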

For now, the invocation process is just to open up miner.py in a text editor and add calls to the end like this:

    process_pdf("data/AL_cafr2011.pdf")
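
If you want to process several CAFRs, you can add one such call per PDF, or append a loop like the sketch below to the end of miner.py (where process_pdf is defined). It assumes each PDF in data/ has matching templates.

    # Sketch: process every CAFR PDF in data/ instead of listing calls one by one.
    import glob

    for pdf_path in sorted(glob.glob("data/*.pdf")):
        process_pdf(pdf_path)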

Once you've set up as many calls as you want, run miner.py (assuming you've already done the setup steps listed above in the "Installation" section):

    $ python miner.py

It may take a while to run, possibly several minutes. When it's done, the results will be in the results/ directory. There will be one result file for each table extracted from a given state's CAFR for a given year.

  • data/AL_cafr2011.pdf is an example CAFR input file
  • templates/AL_statewidenetassets.txt is an example template file
  • results/AL_cafr2011-statewide_net_assets.xml is an example output file (this won't exist until you run miner.py)

Viewing XBRL

XBRL is not particularly useful to humans without software to render the content. Example CSV output, which was created by exporting the XBRL output to CSV with an XBRL viewer, can be found in the example-output/csv directory. Even more readable examples in XLSX format are located in the example-output/xlsx directory. Note, when reading the CSV and XLSX files, that the columns may appear in a different order than they do in the original PDFs.

Alternatively, examples of XBRL output can be found in the example-output/xbrl-xml directory.

To view the XBRL directly:

  1. Download and install an XBRL viewer.
  2. Copy the taxonomy files (located in the taxonomy directory) into a working folder of your choosing.
  3. Copy the results files (located in the results directory) into that working folder.
  4. Open the results files in the XBRL viewer.

Resources

These are resources that were helpful while exploring:

  • XBRL Taxonomy Information

Next Steps

  • There are dozens of TODO flags scattered throughout the code base. Some are minor, some are major.
  • Continued refinement of the taxonomy structure.
  • Creation of additional templates for additional states.
  • Tools to assist in template generation.
  • Command-line tools for running the miner (a possible shape is sketched after this list).
  • Validation tools to identify when a template no longer matches the schema.
  • Think about whether this ties in with http://open-data-standards.github.io/ (and thence https://github.com/open-data-standards) in any useful way.
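
For the command-line item above, a wrapper script along these lines is one possible shape. Nothing like this exists in the repository yet; the interface and the import are assumptions.

    # Hypothetical CLI wrapper -- not part of the repository.
    import argparse

    from miner import process_pdf  # assumes process_pdf can be imported from miner.py

    def main():
        parser = argparse.ArgumentParser(description="Extract tables from CAFR PDFs.")
        parser.add_argument("pdfs", nargs="+", help="CAFR PDF files to process")
        args = parser.parse_args()
        for pdf in args.pdfs:
            process_pdf(pdf)

    if __name__ == "__main__":
        main()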

Related Projects

  • pdftabextract (a set of tools to help "extract tabular data from (OCR-processed) PDF files") might be useful to look at.
