census-data-downloader

Download American Community Survey data from the U.S. Census Bureau and reformat it for humans.

What's available

All of the data files processed by this repository are published in the data/processed/ folder. They can be loaded into applications via their raw URLs, such as https://raw.githubusercontent.com/datadesk/census-data-downloader/master/data/processed/acs5_2017_population_counties.csv
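
As a minimal sketch, a raw URL can be assembled from the parts of the file name and read with Python's standard library. The naming pattern (survey, year, table, geography) is inferred from the example URL above and is not documented here, and `build_processed_url` and `fetch_rows` are hypothetical helper names.

```python
import csv
import io
import urllib.request

BASE_URL = (
    "https://raw.githubusercontent.com/datadesk/"
    "census-data-downloader/master/data/processed/"
)


def build_processed_url(survey, year, table, geography):
    """Build a raw URL, assuming a <survey>_<year>_<table>_<geography>.csv pattern."""
    return f"{BASE_URL}{survey}_{year}_{table}_{geography}.csv"


def fetch_rows(url):
    """Download a processed CSV and return its rows as a list of dicts."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


url = build_processed_url("acs5", 2017, "population", "counties")
print(url)
# rows = fetch_rows(url)  # uncomment to download; requires network access
```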

The command-line interface

The library can be installed as a command-line interface that lets you download files on demand.

Installation

$ pipenv install census-data-downloader

Command-line usage

There's now a tool named censusdatadownloader ready for you.

Usage: censusdatadownloader [OPTIONS] TABLE COMMAND [ARGS]...

  Download Census data and reformat it for humans

Options:
  --data-dir TEXT     The folder where you want to download the data
  --year [2009-2021]  The years of data to download. By default it gets only
                      the latest year. Not all data are available for every
                      year. Submit 'all' to get every year.
  --force             Force the downloading of the data
  --help              Show this message and exit.

Commands:
  aiannhhomelands            Download American Indian, Alaska Native and...
  cnectas                    Download combined New England city and town...
  congressionaldistricts     Download Congressional districts
  counties                   Download counties in all states
  countysubdivision          Download county subdivisions
  csas                       Download combined statistical areas
  divisions                  Download divisions
  elementaryschooldistricts  Download elementary school districts
  everything                 Download everything from everywhere
  msas                       Download metropolitian statistical areas
  nationwide                 Download nationwide data
  nectas                     Download New England city and town areas
  places                     Download Census-designated places
  pumas                      Download public use microdata areas
  regions                    Download regions
  secondaryschooldistricts   Download secondary school districts
  statelegislativedistricts  Download statehouse districts
  states                     Download states
  tracts                     Download Census tracts
  unifiedschooldistricts     Download unified school districts
  urbanareas                 Download urban areas
  zctas                      Download ZIP Code tabulation areas

Before you can use it, you will need to add your CENSUS_API_KEY to your environment. If you don't have an API key, you can request one from the Census Bureau at https://api.census.gov/data/key_signup.html. One quick way to add your key:

$ export CENSUS_API_KEY='<your API key>'

Using it is as simple as providing one of our processed table names to one of the download subcommands.

Here's an example of downloading all state-level data from the medianage dataset.

$ censusdatadownloader medianage states

You can specify the download directory with --data-dir.

$ censusdatadownloader --data-dir ./my-special-folder/ medianage states

And you can change the year you download with --year.

$ censusdatadownloader --year 2010 medianage states

That's it. Mix and match tables and subcommands to get whatever you need.
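
If you need many combinations, one option is to script the CLI from Python. The sketch below only builds and prints the commands, using the --data-dir and --year options documented above; `build_commands` and `run_all` are hypothetical helpers and not part of the library.

```python
import itertools
import subprocess


def build_commands(tables, geographies, data_dir=None, year=None):
    """Return one censusdatadownloader command per table/geography pair."""
    commands = []
    for table, geography in itertools.product(tables, geographies):
        command = ["censusdatadownloader"]
        if data_dir is not None:
            command += ["--data-dir", data_dir]
        if year is not None:
            command += ["--year", str(year)]
        command += [table, geography]
        commands.append(command)
    return commands


def run_all(commands):
    """Run each command in turn, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


commands = build_commands(["medianage"], ["states", "counties"], year=2010)
for command in commands:
    print(" ".join(command))
# run_all(commands)  # uncomment to download; needs the CLI and CENSUS_API_KEY
```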

Python usage

You can also download tables from Python scripts. Import the class of the processed table you wish to retrieve and pass in your API key. Then call one of the download methods.

This example brings in all state-level data from the medianhouseholdincomeblack dataset.

>>> from census_data_downloader.tables import MedianHouseholdIncomeBlackDownloader
>>> downloader = MedianHouseholdIncomeBlackDownloader('<YOUR KEY>')
>>> downloader.download_states()

You can specify the data directory and the years by passing in the data_dir and years keyword arguments.

>>> downloader = MedianHouseholdIncomeBlackDownloader('<YOUR KEY>', data_dir='./', years=2016)
>>> downloader.download_states()
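
Rather than pasting the key into a script, you may prefer to read it from the same CENSUS_API_KEY environment variable the CLI uses. This is a sketch under that assumption; `get_census_api_key` and `download_state_income` are hypothetical helpers, and the library import is deferred so the snippet loads even where the package is not installed.

```python
import os


def get_census_api_key(env=None):
    """Return the API key from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("CENSUS_API_KEY", "").strip()
    if not key:
        raise RuntimeError("CENSUS_API_KEY is not set in the environment")
    return key


def download_state_income(data_dir="./", years=2016):
    """Download state-level data using the class and keyword arguments shown above."""
    # Imported here so the helper above works without the package installed.
    from census_data_downloader.tables import MedianHouseholdIncomeBlackDownloader

    downloader = MedianHouseholdIncomeBlackDownloader(
        get_census_api_key(), data_dir=data_dir, years=years
    )
    downloader.download_states()
```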

Usage examples

A gallery of graphics powered by our data is available on Observable.

The Los Angeles Times used this library for an analysis of Census undercounts on Native American reservations. The code that powers it is available as an open-source computational notebook.

Contributing to the library

Adding support for a new table

Subclass our downloader and provide it with its required inputs.

import collections
from census_data_downloader.core.tables import BaseTableConfig
from census_data_downloader.core.decorators import register


@register
class MedianHouseholdIncomeDownloader(BaseTableConfig):
    PROCESSED_TABLE_NAME = "medianhouseholdincome"  # Your humanized table name
    UNIVERSE = "households"  # The universe value for this table
    RAW_TABLE_NAME = 'B19013'  # The id of the source table
    RAW_FIELD_CROSSWALK = collections.OrderedDict({
        # A crosswalk between the raw field name and our humanized field name.
        "001": "median"
    })

Add it to the imports in the __init__.py file and it's good to go.

Developing the CLI

The command-line interface is implemented using Click and setuptools. To install it locally for development inside your virtual environment, run the following installation command, as prescribed by the Click documentation.

$ pip install --editable .

That's it. If you add useful tables or improvements, please consider submitting them as pull requests so everyone can benefit.

census-data-downloader's Issues

add processed us_citizen_total column to CitizenDownloader

Switch the Internet downloader table

Remove E suffix from all tables

Write out a companion README style Markdown file for every processed table

It could include:

The raw table ID
A URL leading to the Census Reporter page about the table
A humanized version of the metadata found in the file name
A URL leading to the raw data file
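
A rough sketch of what such a generator might look like follows. It assumes file names follow the survey_year_table_geography pattern seen in the published CSVs and that Census Reporter table pages live at censusreporter.org/tables/<ID>/; both are assumptions to verify. The function names are hypothetical and the raw data URL is left for the caller to supply.

```python
def describe_filename(filename):
    """Split a name like 'acs5_2017_population_counties.csv' into labeled parts."""
    stem = filename.rsplit(".", 1)[0]
    survey, year, table, geography = stem.split("_", 3)
    return {"survey": survey, "year": year, "table": table, "geography": geography}


def render_table_readme(filename, raw_table_id, raw_data_url):
    """Return Markdown text describing one processed file."""
    parts = describe_filename(filename)
    lines = [
        f"# {filename}",
        "",
        f"- Raw table ID: {raw_table_id}",
        # Assumed URL pattern for Census Reporter table pages.
        f"- Census Reporter: https://censusreporter.org/tables/{raw_table_id}/",
        f"- Survey: {parts['survey']}",
        f"- Year: {parts['year']}",
        f"- Table: {parts['table']}",
        f"- Geography: {parts['geography']}",
        f"- Raw data: {raw_data_url}",
    ]
    return "\n".join(lines) + "\n"


print(render_table_readme(
    "acs5_2017_population_counties.csv", "<raw table ID>", "<raw data URL>"
))
```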

Possible error in column headers

The two income columns for acs5_2017_poverty_zctas.csv both say that the column is counting the number of people below poverty level: income_past12months_below_poverty_level and income_past12months_at_or_below_poverty_level

Add in options for survey choice in the CLI

5 year and 1 yr surveys, maybe even SF1

Update CLI with new geographies

Add margin of error fields to ACS downloads

Estimate and MOE Maps

-5..., -3..., -2... --> margin of error

look for + or - --> estimate (not just single +/-)

"c" in estimate N/A in moe

Add carveouts for downloading legislative districts for places that don't have them

DC, NE

Could our table config classes be simplified further?

Perhaps a YAML or JSON config input that is parsed by Python?
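
One way this could look, sketched with JSON and a stand-in base class so it runs on its own: the attribute names mirror the table config example in the README, while the JSON key names and `build_table_class` are hypothetical.

```python
import collections
import json


class BaseTableConfig:
    """Stand-in for the library's base class so this sketch is self-contained."""


CONFIG_JSON = """
{
  "class_name": "MedianHouseholdIncomeDownloader",
  "processed_table_name": "medianhouseholdincome",
  "universe": "households",
  "raw_table_name": "B19013",
  "raw_field_crosswalk": {"001": "median"}
}
"""


def build_table_class(config_text, base=BaseTableConfig):
    """Create a table config class from JSON text, preserving crosswalk order."""
    config = json.loads(config_text, object_pairs_hook=collections.OrderedDict)
    attributes = {
        "PROCESSED_TABLE_NAME": config["processed_table_name"],
        "UNIVERSE": config["universe"],
        "RAW_TABLE_NAME": config["raw_table_name"],
        "RAW_FIELD_CROSSWALK": config["raw_field_crosswalk"],
    }
    return type(config["class_name"], (base,), attributes)


table_class = build_table_class(CONFIG_JSON)
print(table_class.__name__, table_class.RAW_TABLE_NAME)
```

A real version would still need to pass the generated class through the register decorator, which this sketch leaves out.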

Connect README with setup.py so it shows up on PyPI

Processed data contains duplicate data for multiple geographies

Bug/Issue

Census data downloader correctly downloads raw data but creates a CSV with duplicated data in the processed directory.

Environment

Python 3.8
Pipenv version 2018.11.27.dev0
Latest version of censusdatadownloader

Reproduce

Install the package and then try to download a data set.

pipenv install census-data-downloader
censusdatadownloader --data-dir data/census race states

Expected behavior

A 52 row CSV file with total population by race in the processed directory.

Actual behavior

A 52-row CSV with the same data for each column in the processed directory.

Possible issues/solutions

It looks like the data is correctly downloaded in the raw directory which makes me think something's happening in the process step. I'm seeing this behavior specifically with the race [geography] arguments.

I noticed the same behavior for internet counties but did get the correct data when I used internet states.

I'll see if I can debug what's happening at the process step but in the meantime I'll rely on the raw data. Thanks for your work on this!
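
For anyone trying to confirm the symptom, here is a small check, assuming the problem shows up as different geographies sharing identical values in every data column. The column names in the sample (geoid, name) are made up for illustration and `find_repeated_rows` is a hypothetical helper.

```python
import collections
import csv
import io


def find_repeated_rows(csv_text, id_columns):
    """Group geographies whose non-identifier values are exactly the same."""
    reader = csv.DictReader(io.StringIO(csv_text))
    groups = collections.defaultdict(list)
    for row in reader:
        values = tuple(v for name, v in row.items() if name not in id_columns)
        label = tuple(row[name] for name in id_columns)
        groups[values].append(label)
    return [labels for labels in groups.values() if len(labels) > 1]


SAMPLE = """geoid,name,total,white
01,Alabama,100,60
02,Alaska,100,60
04,Arizona,200,90
"""

for labels in find_repeated_rows(SAMPLE, id_columns=("geoid", "name")):
    print("Identical values for:", labels)
```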

Add more detailed poverty tables

By gender
By age
By age and gender

Need a way to limit geography levels too it seems

Support more data sources

Sources available via the Census API include:

update with newest 2022 5 year release

generate source file for tables

--help and other docs should list all of the available tables somehow

Fix README image references so they work on PyPI

Undeclared dependency on census-data-aggregator

This package seems to depend on census-data-aggregator, which is not listed in the setup.py. In a fresh environment it therefore fails to import.

Publish an INVENTORY.md file with a list of all processed CSVs

This could be automatically generated by our classes via a template.
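
As a simpler stand-in for the template idea, the sketch below scans a folder for CSV files instead of asking the table classes. The function name and the Markdown layout are assumptions, and the demo uses a temporary folder with empty placeholder files.

```python
import tempfile
from pathlib import Path


def render_inventory(processed_dir):
    """Return Markdown listing every CSV file in the folder, sorted by name."""
    names = sorted(path.name for path in Path(processed_dir).glob("*.csv"))
    lines = ["# Inventory", ""]
    lines += [f"- {name}" for name in names]
    return "\n".join(lines) + "\n"


# Demo against a throwaway folder containing two empty placeholder files.
with tempfile.TemporaryDirectory() as folder:
    for name in ("acs5_2017_poverty_zctas.csv", "acs5_2017_population_counties.csv"):
        Path(folder, name).touch()
    print(render_inventory(folder))
```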

Migrate to GitHub actions for testing

Am I going crazy or is DC no longer pulling at the tract level?

I've had a set of scripts using census-data-downloader to pull tract data for the U.S. (thanks for the awesome library, btw).

I noticed today that DC is suddenly missing. I thought it was possibly related to python-us's handling of DC. I added a DC_STATEHOOD=1 environment variable but I'm still not pulling any data for DC.

I could have sworn the data was always there, but maybe I'm misremembering. Regardless, is there a way to debug how census-data-downloader pulls records by state? If so, I'm happy to fork and debug further. Thanks!

Some system for merging with the Census SHP files

It would be nice to quickly be able to inspect a map of each dataset.

This could be accomplished by providing a ".csvt" companion for each processed CSV file that includes the data types in a format QGIS will respect. Then the files could be manually merged with the shapes.

Another approach would be to have the downloader, or some downstream module based on it, do the merger automatically. Such a system could create a companion shapefile for each set with the data already merged in.
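
A sketch of the ".csvt" idea: the companion file is a single line of double-quoted type names ("String", "Integer", "Real") in column order, which is the convention the GDAL CSV driver behind QGIS reads. The type inference rules here are my own assumption. In particular, digit strings with leading zeros are kept as strings so identifiers such as FIPS codes survive. The column names in the sample are illustrative.

```python
import csv
import io


def infer_csvt_type(values):
    """Guess a .csvt type name for one column of string values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "String"
    # Keep zero-padded identifiers (e.g. "01") as text.
    if any(len(v) > 1 and v.startswith("0") and v.isdigit() for v in non_empty):
        return "String"
    for caster, label in ((int, "Integer"), (float, "Real")):
        try:
            for v in non_empty:
                caster(v)
        except ValueError:
            continue
        return label
    return "String"


def build_csvt_line(csv_text):
    """Return the one-line contents of a .csvt companion for the given CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)
    types = [infer_csvt_type([row[i] for row in rows]) for i in range(len(header))]
    return ",".join(f'"{name}"' for name in types)


SAMPLE = """geoid,name,universe,median
01,Alabama,100,38.5
02,Alaska,50,34.0
"""

print(build_csvt_line(SAMPLE))
```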

Is _process_raw_data busted?

I bet I busted it.

Save and publish GEO.id and GEO.id2 fields with the processed files

They are explained here

Better organize tables by folders

issue downloading 2009 annotation data

Look to see why this is happening

Standardize the name of the universe column

Add margin of error approximation for all aggregated fields in the process functions

-6666666 annotation values still show up in the estimate fields

Bug spotted by @irisslee
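
A possible cleanup step, sketched outside the library: treat the Census Bureau's negative sentinel numbers as missing rather than as estimates. The set of values below is my understanding of the published ACS annotation codes and should be verified against the Bureau's documentation. `clean_estimate` is a hypothetical helper.

```python
# Assumed list of ACS annotation sentinels; verify before relying on it.
ANNOTATION_VALUES = {
    -222222222,
    -333333333,
    -555555555,
    -666666666,
    -888888888,
    -999999999,
}


def clean_estimate(value):
    """Return the value as a float, or None if it is missing or an annotation code."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    if number in ANNOTATION_VALUES:
        return None
    return number


for raw in ("38.5", "-666666666", ""):
    print(repr(raw), "->", clean_estimate(raw))
```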

Support more geographies

Here's what the API now supports:

Have aggregator in processing step be aware of jam values

Drop jam values from aggregation and report what is dropped

Verify that margin of error download is accurate

API key. Windows 'export' is not recognized as an internal or external command, operable program or batch file.

Hi there, I'm unable to set the API key in Windows CMD.
I've tried

export CENSUS_API_KEY=......

The command below also fails, reporting that the API key is not present.

censusdatadownloader --year 2018 --data-dir data/external/ race counties

Where should I add the API key in the query above?
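
For context, export is a Unix shell command; in Windows CMD the equivalent is set CENSUS_API_KEY=<your API key> (no quotes), run in the same window before the download command. A platform-independent alternative, sketched here with hypothetical helper names, is to launch the CLI from Python with the variable added to the child process environment.

```python
import os
import subprocess


def build_environment(api_key, base_env=None):
    """Copy the environment and add CENSUS_API_KEY for the child process."""
    env = dict(os.environ if base_env is None else base_env)
    env["CENSUS_API_KEY"] = api_key
    return env


def run_download(api_key, arguments):
    """Run censusdatadownloader with the key set, on any operating system."""
    command = ["censusdatadownloader", *arguments]
    return subprocess.run(command, env=build_environment(api_key), check=True)


# Example call (not executed here); the arguments mirror the command above.
# run_download(
#     "<your API key>",
#     ["--year", "2018", "--data-dir", "data/external/", "race", "counties"],
# )
```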

add in csvt

add this in so it downloads with data types automatically

Jam value information

As we think about integrating the output of census-data-downloader to be an input into the census-data-aggregator, is it possible to include the jam value information in the headers of the processed files?

For example, in the household income table the column name "10_and_under" would be replaced with "2499_10." The convention of having the headers in units of 1000s makes the 2499 representation a bit awkward though.

Integrate processing crosswalk for humanizing annotation values

Custom groups missing MOEs

For tables with custom groups (such as PovertyAge), the MOEs are not returned.

Add additional methods to base classes to let users support additional sources

This is somewhat related to #2.

I find this project to be extremely useful and a great framework for a task that I have to do often. In my projects, I've found myself using the base classes and concepts from this project when I want to download and process data from other Census Bureau API sources.

However, for non-ACS sources, I find myself entirely reimplementing many of the methods on my geotype downloader classes because the changes in functionality aren't possible by just calling super() and then adding additional logic.

I think adding these methods to BaseGeoTypeDownloader could make adding additional data sources easier, both in this project, and for other users in their own projects:

BaseGeoTypeDownloader.get_api_client(): This would be called from the constructor to set self.api and allow subclasses to specify a customized subclass of census.Census that supports additional API endpoints.
BaseGeoTypeDownloader.get_field_type_map(): This would be similar to BaseGeoTypeDownloader.get_raw_field_map() except it would map from raw field names to types that would be passed to pd.Series.astype(). Like BaseGeoTypeDownloader.get_raw_field_map(), this would be called from BaseGeoTypeDownloader.process() when setting the column types after reading in the raw table. The implementation could check for the existence of a FIELD_TYPES attribute on the table configuration class, and if that doesn't exist, default to the existing logic for ACS tables that checks the field name suffix. Adding the ability to explicitly set type conversions allows supporting non-ACS tables that might have field names that don't have the same suffix convention as ACS tables.

Error when installing CLI

Hi all,

Excited to use this tool! I noticed this error when I just tried installing census-data-downloader:

Traceback (most recent call last):
  File "/Users/williamsa/.local/share/virtualenvs/fema-X3N8u1-d/bin/censusdatadownloader", line 6, in <module>
    from census_data_downloader.cli import cmd
  File "/Users/williamsa/.local/share/virtualenvs/fema-X3N8u1-d/lib/python3.7/site-packages/census_data_downloader/__init__.py", line 5, in <module>
    from .tables import TABLE_LIST
ModuleNotFoundError: No module named 'census_data_downloader.tables'

I tried installing via pipx and pipenv and received the same error.

Fix annotation decoding step

Rework YearStructureBuilt loader to adapt the crosswalk by year using the reworked core framework

update readme with a list of tables & where to get API key

Clarify that this tool works only for ACS tables in README

I could be wrong about this, but looking through the source code quickly, it seems like all of the tables are ACS tables. I love this tool, but always forget whether or not it supports other kinds of census data, e.g. population estimates. It would be selfishly helpful to have a note in the README that clarifies this.

If I'm wrong about this only covering ACS tables, please let me know.

If my request makes sense to you all, I'm happy to make a pull request for this little documentation change.
