ipums / nhgisxwalk

Spatio-temporal NHGIS Crosswalks

Home Page: https://ipums.github.io/nhgisxwalk/

License: Mozilla Public License 2.0

Python 98.40% TeX 1.60%

nhgis crosswalks gis spatio-temporal-data

nhgisxwalk's People

Contributors

j-p-schroeder jgaboardi

nhgisxwalk's Issues

post-2010 state split

From the Specs for Crosswalk Deliverables document.

Split files by state: All of the above (including "existing" files)

Idea: Single state file for 2010-2019 target zones except for AZ, CA, NY, which need separate files for 2010 & later target zones.

For NY: 2010 in one file; 2011-2019 in another

For CA & AZ: 2010-2011 in one file; 2012-2019 in another
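
The splitting rule above can be sketched as a small lookup. This is only an illustration: the function name `target_year_groups` and the use of postal abbreviations as keys are assumptions, not part of nhgisxwalk.

```python
# Hypothetical helper: which target-zone year ranges need separate files.
# The break years come from the rule stated above; the rest is assumed.
FIRST_YEAR, LAST_YEAR = 2010, 2019

# Last year covered by the first file for the states that need two files.
SPLIT_AFTER = {"NY": 2010, "CA": 2011, "AZ": 2011}


def target_year_groups(state):
    """Return a list of (start_year, end_year) tuples, one per output file."""
    split = SPLIT_AFTER.get(state.upper())
    if split is None:
        # Every other state gets a single 2010-2019 file.
        return [(FIRST_YEAR, LAST_YEAR)]
    return [(FIRST_YEAR, split), (split + 1, LAST_YEAR)]


print(target_year_groups("MN"))
print(target_year_groups("NY"))
print(target_year_groups("CA"))
```

Keeping the break years in one dictionary means a newly discovered state split would only require one more entry.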

Swap out WY (56) for DC (11) for sample data / testing / example. While Wyoming has a small population, it has a (relatively) large number of blocks (220,401) in 2010. By comparison, DC has only 8,362 in 2010.

crosswalk README file within product?

The question was raised about whether to include a README.txt within a zipped archive for all crosswalk products. This is currently done with the original nhgis_blkYEAR-blkYEAR_g{e}{j} data available here.

handcalcs for synthetic-example.ipynb

Try using handcalcs in the formula cells for the Synthetic Example notebook.

markdown cell below Cell 6

[RESOLVED] Subset results unreliable?

RESOLVED

~~Crosswalks should not be generated from subsets of data~~
See #1, #7, #11, #14, #15, #16

Problem

Crosswalk should be generated from a complete national base block crosswalk (and supplementary data if needed), then subset to a single state. Creating a crosswalk from a subset of data seemingly gives inconsistent results, though it is not clear exactly why this is happening. At first glance, this appeared to be a function of bordering state subsets acting upon each other, but even Hawaii gives bad results.

~~After further investigation this appears to be an issue with the generation of 1990 BGPs.~~
~~Will update after checking 2000 BGP results~~

Solution

~~Maybe try whole nation for testing and drop the subset testing?~~
~~this is ideal for completeness, but would require storing all data on GitHub and dramatically increase testing time~~

This was not an issue with the functionality of nhgisxwalk, but a mistake with how the supplementary 1990 BGP subset files were being saved. This has been corrected and accurate crosswalks are now being generated from subset files.

Create MVP crosswalks

Create national crosswalks for the following:

bgp1990-trt2010
bgp2000-trt2010
bgp1990-bkg2010
bgp2000-bkg2010

Create state crosswalks for the following:

bgp1990-trt2010
bgp2000-trt2010
bgp1990-bkg2010
bgp2000-bkg2010

crosswalk naming / saving?

Consider crosswalk naming that mirrors the current NHGIS block-block geographic crosswalks (nhgis_blk1990_blk2010_gj). For example, nhgis_bgp1990_trt2010.

Also, rethink upper level (non-nhgisxwalk) directory structure and where to save out "results".

maybe drop the ./results/ directory and put directly into ./crosswalks/
maybe a subdir for state breakdowns?

Change license

nhgisxwalk is currently licensed under the BSD 3-Clause but should be relicensed to Mozilla Public License v2.0 in preparation for the transfer to IPUMS (#70).

Add/remove/update notebooks

Update/rerun

notebooks/data-subset-sample-workflow-bgp1990trt2010.ipynb
notebooks/data-subset-sample-workflow-bgp2000trt2010.ipynb
notebooks/synthetic-example.ipynb
notebooks/weighted-portion-synthetic-atoms.ipynb

Add

notebooks/build_subset_1990.ipynb
notebooks/build_subset_2000.ipynb
notebooks/test_subset_1990.ipynb
notebooks/test_subset_2000.ipynb

Remove

notebooks/build_subset.ipynb

Block Group Part GISJOIN correction

Following the correction of the 1990 GISJOIN ID for block group parts (103rd congress vs. 101st congress) the code base should be updated/pruned.

Check/update the following:

delay codecov until after all builds are complete

Add the following sections to codecov.yml

codecov:
  notify:
    after_n_builds: 18

...

comment:
  layout: "reach, diff, files"
  behavior: once
  after_n_builds: 18
  require_changes: true

update compression options / index

New options are available for compression in pandas v1.0. Also, the data frame indices should be omitted when writing out the crosswalks. These two fixes should be updated in GeoCrossWalk.xwalk_to_csv().

Example

compression_opts = dict(method="zip", archive_name="without_index.csv")
df.to_csv("without_index.zip", compression=compression_opts, index=False)

Add headers to source files

Header in Source Files

In each source file that you've written, place a comment header at the top of the file. [NOTE: The example below uses the comment notation ("#") recognized by most scripting languages (e.g., Bash, Perl, Python, R, Ruby). Adjust that notation accordingly based on the language your code is written in.]
# This file is part of the Minnesota Population Center's {PROJECT TITLE}.
# For copyright and licensing information, see the NOTICE and LICENSE files
# in this project's top-level directory, and also on-line at:
#   https://github.com/mnpopcenter/{REPO-NAME}
Replace the placeholders {PROJECT TITLE} and {REPO-NAME}.

Character code / naming conventions?

Originally in the Crosswalk Specs

Geographic abbreviations

decided / not yet used

Level	Option A: 3 char [initial scheme]	Option B: 2 char	Option C: mix
Block	blk [in use]	bk	blk
Block group	bkg	bg	bg
Block group part	bgp	bp	bgp
Census tract	trt	tr , ct ?	tr , ct
County	cty	co	co
Place	plc ?	pl	pl
County subdivision	csd ?	cs	cs , csub , cosub ?
CBSA	cbs , msa ?	cb , ma ?	cb , cbsa , msa ?

Examples:

Option A: 3 char	Option B: 2 char	Option C: mix
nhgis_blk1990_blk2010_gj nhgis_blk1990_blk2010_gj_27 nhgis_bgp1990_bkg2010 nhgis_bgp1990_trt2010 nhgis_bgp1990_cty2010	nhgis_bk1990_bk2010_gj nhgis_bk1990_bk2010_gj_27 nhgis_bp1990_bg2010 nhgis_bp1990_tr2010 nhgis_bp1990_ct2010 nhgis_bp1990_co2010	nhgis_blk1990_blk2010_gj nhgis_blk1990_blk2010_gj_27 nhgis_bgp1990_bg2010 nhgis_bgp1990_tr2010 nhgis_bgp1990_ct2010 nhgis_bgp1990_co2010

State FIPS vs. postal codes

Examples:

Option A: FIPS [initial scheme]	Option B: upper case postal
nhgis_blk1990_blk2010_gj_01 nhgis_blk1990_blk2010_gj_02 nhgis_blk1990_blk2010_gj_27 nhgis_blk1990_blk2010_gj_56	nhgis_blk1990_blk2010_gj_AL nhgis_blk1990_blk2010_gj_AK nhgis_blk1990_blk2010_gj_MN nhgis_blk1990_blk2010_gj_WY
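
If Option B were chosen, existing FIPS-suffixed names could be converted mechanically. The sketch below covers only the four states shown in the examples; the function name `fips_name_to_postal_name` is hypothetical.

```python
# Only the four states shown in the examples above; a full lookup table
# would be needed in practice.
FIPS_TO_POSTAL = {"01": "AL", "02": "AK", "27": "MN", "56": "WY"}


def fips_name_to_postal_name(name):
    """Swap a trailing two-digit state FIPS suffix for the postal code."""
    stem, _, suffix = name.rpartition("_")
    if suffix not in FIPS_TO_POSTAL:
        raise ValueError(f"unrecognized state FIPS suffix: {suffix!r}")
    return f"{stem}_{FIPS_TO_POSTAL[suffix]}"


print(fips_name_to_postal_name("nhgis_blk1990_blk2010_gj_27"))
```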

Functionality with NHGIS API?

Incorporate functionality with the NHGIS API for fetching data and metadata.
See here for code examples.
Specifically check for cross availability through the API.

Add NOTICE.txt

NOTICE.txt

The copyright notice(s) for a project are placed in a file called NOTICE.txt in the project's top-level (root) folder. The .txt extension in the file's name makes it easier for developers and users of cross-platform projects to work with this file on Windows.

This file also contains the list of people -- both at MPC and outside our center -- who've contributed to the project. Including the contributors list in this file makes it easier to see the association between copyrights held by organizations and their employees who made contributions to the project.

A template for this file is available for new projects. After downloading a copy of the template into your clone, replace these placeholders with the appropriate information:

{PROJECT TITLE}

{REPO-NAME}

{YEAR}

{YOUR NAME}

Flesh out docstring Examples

Flesh out docstring Examples in:

Specs for README files

Jonathan's specs for included README.txt files as per here.

Readme files

One readme file per zip file
Content will be identical for all BGP crosswalks
- See readme for BGP crosswalks for starting content (also see #74)
- Proposed file name: nhgis_bgp_crosswalk_README.txt
For block-to-block crosswalks, re-use the existing four readme files
- Copy into each of the state-specific zip files with no changes (no need to make the readme state-specific, too)

links and improved instructions for downloading NHGIS data

Add links and instructions for downloading NHGIS data. While this is not essential, it fosters good-faith openness for reproducibility.

This should eventually be superseded in v1.0.0 by interoperability with the NHGIS API (#12)

Finalized README.txt

Following #72 two remaining details must be finalized:

the name for the single block group part *_README.txt
the actual text within the single block group part *_README.txt

Handling "no-data" in 1990 blk / bgp

Implement the new method for handling "no-data" in 1990 blocks and block group parts.

Review "framework"
Review "handling no data"
work through a synthetic example
update weighted/portion synthetic examples notebooks
incorporate the synthetic example into example_crosswalk_data
test against Jonathan's results

Consider doc site

Consider a documentation site modeled after pysal/spaghetti.

Key resources:

Content for README

See also #73 and Jonathan's note here.

Documentation for NHGIS crosswalks from block group parts to later units

NHGIS crosswalk from 1990 to 2010 census blocks with GISJOIN identifiers

Contents

Data Summary
Notes
Citation and Use

Additional documentation on NHGIS crosswalks is available at:
https://www.nhgis.org/user-resources/geographic-crosswalks

Data Summary

Each NHGIS crosswalk file provides interpolation weights for allocating census counts from a specified
set of source zones to a specified set of target zones. Each record in the crosswalk represents a spatial
intersection between a single source zone and a single target zone.
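
As a rough illustration of how such weights are applied, the sketch below multiplies each source count by its record's weight and sums the results per target zone. The zone IDs, counts, and the function name `apply_crosswalk` are made up for the example.

```python
from collections import defaultdict


def apply_crosswalk(source_counts, crosswalk_rows):
    """Allocate source-zone counts to target zones using crosswalk weights.

    source_counts: dict mapping source ID -> count
    crosswalk_rows: iterable of (source_id, target_id, weight) records
    """
    target_counts = defaultdict(float)
    for source_id, target_id, weight in crosswalk_rows:
        target_counts[target_id] += source_counts.get(source_id, 0) * weight
    return dict(target_counts)


# Toy data: source zone "s1" is split evenly between targets "t1" and "t2",
# while "s2" falls entirely within "t2".
rows = [("s1", "t1", 0.5), ("s1", "t2", 0.5), ("s2", "t2", 1.0)]
counts = {"s1": 100, "s2": 40}
print(apply_crosswalk(counts, rows))
```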

File naming scheme: nhgis_[source geog][source year]_[target geog][target year]{_state FIPS}.csv

Geographic unit codes:
      blk - Block
      bgp - Block group part (intersections between block groups, places, county subdivisions, etc.)
      bg - Block group
      tr - Census tract
      co - County
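
A minimal sketch of the naming scheme, assuming the optional state suffix is a two-digit, zero-padded FIPS code. The function name `crosswalk_file_name` is hypothetical.

```python
def crosswalk_file_name(source_geog, source_year, target_geog, target_year,
                        state_fips=None):
    """Build a file name following the scheme described above."""
    name = f"nhgis_{source_geog}{source_year}_{target_geog}{target_year}"
    if state_fips is not None:
        # Assumption: the state suffix is zero-padded to two digits.
        name += f"_{int(state_fips):02d}"
    return name + ".csv"


print(crosswalk_file_name("bgp", 1990, "tr", 2010))
print(crosswalk_file_name("bgp", 1990, "tr", 2010, state_fips=10))
```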

--> Remainder copied from the 1990 block to 2010 block GISJOIN crosswalk readme <--

--> Must edit <--

Content:

The top row is a header row
Each subsequent row represents a potential intersection between a 1990 block and 2010 block
The GJOIN1990 and GJOIN2010 fields contain NHGIS-standard GISJOIN block identifiers:
- A block GISJOIN is a concatenation of:
  - "G"
  - State NHGIS code: 3 digits (FIPS + "0")
  - County NHGIS code: 4 digits (FIPS + "0")
  - Census tract code: 4 or 6 digits in 1990; 6 digits in 2010
  - Census block code: 3 or 4 digits in 1990; 4 digits in 2010
- The GJOIN1990 field contains numerous blank values. These represent cases where the only 1990 blocks intersecting the corresponding 2010 block are offshore, lying in coastal or Great Lakes waters, which are excluded from NHGIS's block boundary files. None of the missing 1990 blocks had any reported population or housing units. The blank values are included here to ensure that all 2010 blocks are represented in the file.
The WEIGHT field contains the interpolation weights NHGIS uses to allocate portions of 1990 block counts to 2010 blocks for geographically standardized time series tables
The PAREA_VIA_BLK00 field contains the approximate portion of the 1990 block's land* area lying in the 2010 block, based on intersections that the 1990 and 2010 block have with 2000 blocks in 2000 and 2010 TIGER/Line files (i.e. indirect overlay via 2000 blocks).
- If a 1990 block's area is entirely water, then this value is based on the block's total area including water
- NHGIS uses these values to compute lower and upper bounds on 1990 estimates: for any record with a value greater than 0 and less than 1, it is assumed that either all or none of the 1990 block's characteristics could be located in the corresponding 2010 block.
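
The block GISJOIN layout listed above can be sketched as a simple concatenation. The function name `block_gisjoin` and the component values are made up for illustration; arguments are strings so that leading zeros survive.

```python
def block_gisjoin(state_fips, county_fips, tract, block):
    """Concatenate the components listed above into a block GISJOIN.

    All arguments are strings so that leading zeros are preserved.
    """
    # "G" + state FIPS + "0" + county FIPS + "0" + tract code + block code
    return f"G{state_fips}0{county_fips}0{tract}{block}"


# Made-up component values, for illustration only.
print(block_gisjoin("10", "003", "000200", "1000"))
```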

Notes

NHGIS uses this crosswalk to generate 1990 data standardized to 2010 census units for NHGIS time series tables. Complete documentation on the interpolation model used to generate the weights in the crosswalk is provided at https://www.nhgis.org/documentation/time-series/1990-blocks-to-2010-geog.

In short, the model is based on "cascading density weighting", as introduced in Chapter 3 of Jonathan Schroeder's dissertation (Visualizing Patterns in U.S. Urban Population Trends, University of Minnesota) available here: http://hdl.handle.net/11299/48076.

The general sequence of operations:

Estimate 2000 population and housing unit counts for each intersection between 2000 and 2010 blocks.

Our basic "cascading density weighting" model does this by allocating 2000 counts among 2010 blocks in proportion to 2010 block population and housing densities (population and housing summed together).
We use this basic approach only for 2000 blocks that are not split by the boundaries of a 2010 target unit, where "target units" are the areas for which NHGIS plans to release standardized data: block groups, places, county subdivisions, school districts, ZCTA's, urban areas, congressional districts (111th and 113th), and any units that can be constructed from these (e.g., census tracts, counties, etc.).
For 2000 blocks that are split by the boundaries of a 2010 target unit, we use NHGIS's more advanced hybrid interpolation model (see https://www.nhgis.org/documentation/time-series/2000-blocks-to-2010-geog) to allocate 2000 counts among 2010 blocks.
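
The basic density-weighted allocation in step 1 can be sketched roughly as follows. This is an illustration under stated assumptions, not the NHGIS implementation: it assumes each piece of a 2000 block is weighted by its area times the density (population plus housing units per unit area) of the 2010 block it falls in, and it falls back to plain area weighting when every density is zero. All names and numbers are hypothetical.

```python
def density_weights(pieces):
    """Return allocation shares for the pieces of one 2000 block.

    pieces: list of dicts, one per 2000-2010 block intersection, with
      "piece_area" - area of the intersection
      "pop2010"    - population of the intersecting 2010 block
      "hu2010"     - housing units of the intersecting 2010 block
      "area2010"   - total area of the intersecting 2010 block
    """
    # Expected 2010 population + housing units within each piece, assuming
    # they are spread uniformly across the 2010 block.
    expected = [
        p["piece_area"] * (p["pop2010"] + p["hu2010"]) / p["area2010"]
        for p in pieces
    ]
    total = sum(expected)
    if total == 0:
        # Assumed fallback when all densities are zero: weight by area.
        areas = [p["piece_area"] for p in pieces]
        return [a / sum(areas) for a in areas]
    return [e / total for e in expected]


pieces = [
    {"piece_area": 1.0, "pop2010": 30, "hu2010": 10, "area2010": 2.0},
    {"piece_area": 1.0, "pop2010": 50, "hu2010": 10, "area2010": 1.0},
]
weights = density_weights(pieces)
count_2000 = 80  # count reported for the 2000 block
print([count_2000 * w for w in weights])
```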

Use the estimated 2000 population and housing unit densities from step 1 to guide the allocation of 1990 counts among 1990-2000-2010 block intersections.

The procedure also combines two types of overlay to model intersections between 1990, 2000, and 2010 blocks:

"Direct overlay" of 1990 & 2000 block polygons from 2000 TIGER/Line files with 2000 & 2010 block polygons from 2010 TIGER/Line files (with a preliminary step to georectify Hawaii's 2000 TIGER polygons to 2010 TIGER features in order to accommodate a systematic change in the coordinate system used to represent Hawaii features between the two TIGER versions)
"Indirect overlay":
a. Overlay 1990 & 2000 block polygons using the 2000 TIGER/Line basis
b. Overlay 2000 & 2010 block polygons using the 2010 TIGER/Line basis
c. Multiply 1990-2000 intersection proportions from step 2a with 2000-2010 proportions from step 2b to compute estimated proportions of each 1990 block within each 2010 block. (This is how the crosswalk's "PAREA_VIA_BLK00" values are derived.)
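
Step 2c can be sketched as a chained product over shared 2000 blocks. Summing the products across all 2000 blocks that link the same 1990 and 2010 blocks is an assumption of this sketch, and the block IDs are made up.

```python
from collections import defaultdict


def indirect_overlay(p_1990_2000, p_2000_2010):
    """Chain two sets of area proportions through shared 2000 blocks.

    p_1990_2000: dict (blk1990, blk2000) -> share of the 1990 block in the 2000 block
    p_2000_2010: dict (blk2000, blk2010) -> share of the 2000 block in the 2010 block
    Returns dict (blk1990, blk2010) -> estimated share of the 1990 block.
    """
    # Index the 2000-2010 proportions by 2000 block for quick lookup.
    by_2000 = defaultdict(list)
    for (b00, b10), share in p_2000_2010.items():
        by_2000[b00].append((b10, share))

    result = defaultdict(float)
    for (b90, b00), share_a in p_1990_2000.items():
        for b10, share_b in by_2000[b00]:
            result[(b90, b10)] += share_a * share_b
    return dict(result)


p_a = {("A", "x"): 0.5, ("A", "y"): 0.5}
p_b = {("x", "P"): 1.0, ("y", "P"): 0.5, ("y", "Q"): 0.5}
print(indirect_overlay(p_a, p_b))
```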

The direct overlay weights are constrained to eliminate any 1990-2010 intersections that are not valid in the indirect overlay. This prevents most "slivers" (invalid intersections caused by changes in TIGER feature representations) from being assigned any weight.

The final weighting blends weights from constrained direct overlay (CDO) and indirect overlay (IO) through a weighted average. CDO receives high weight (and IO low weight) where the two TIGER/Line representations of a 2000 block align well and where the 1990-2000 block intersection and the 2000-2010 block intersection both comprise less than the entirety of the 2000 block. Where the block intersections cover the entirety of a 2000 block, or where the block intersection from one TIGER/Line version has no valid intersection with the corresponding 2000 block in the other TIGER/Line version, the weighting is based on IO alone.

Citation and Use

All persons are granted a limited license to use this documentation and the
accompanying data, subject to the following conditions:

Publications and research reports employing NHGIS data must cite it appropriately. The citation should include the following:

    Steven Manson, Jonathan Schroeder, David Van Riper, and Steven Ruggles.
    IPUMS National Historical Geographic Information System: Version 12.0 [Database].
    Minneapolis: University of Minnesota. 2017.
    http://doi.org/10.18128/D050.V12.0

For policy briefs or articles in the popular press, we recommend that you cite the use of NHGIS data as follows:

IPUMS NHGIS, University of Minnesota, www.nhgis.org.

In addition, we request that users send us a copy of any publications, research
reports, or educational material making use of the data or documentation.
Printed matter should be sent to:

    IPUMS NHGIS
    Minnesota Population Center
    University of Minnesota
    50 Willey Hall
    225 19th Ave S
    Minneapolis, MN 55455

Send electronic material to: [email protected]

Swap example data to Delaware

Swap all example data to Delaware. More chance of testing edge cases than DC. Related to #1.

base crosswalks
- blk_1990-blk2010_gj (#19)
- blk_2000-blk2010_gj (#15)
data
- 1990
  - blocks (#19)
  - block group parts
- 2000
  - blocks (#15)
- 2010
  - blocks (#19)
  - ~~tracts (#19)~~
notebooks
- data-subset-sample-workflow-bgp1990trt2010 (#19)
- data-subset-sample-workflow-bgp2000trt2010 (#15)
unittests
- 1990 test_nhgisxwalk.py (#19)
- 2000 test_nhgisxwalk.py (#15)

bgp README files

Replace the 6 current BGP README files with a single file as per our 2020-08-05 meeting:

Q. Should the readme file names vary among BGP crosswalks? Or should we just use a single name for all BGP crosswalk readmes?

I.e., if I want to update content, is there a single template file I can update, or must I update content in all of the separate files?

A. Let's use just one file with one name. James will make this edit...

Unify generate_XXXX_ids

Need to unify the generate_source_ids and generate_target_ids methods of GeoCrossWalk and generalize them into one method, generate_ids.

They are acceptable in the current format, but need to be updated soon so that we can deal with source and target geographies other than 1990 BGP and 2010 Tract.

YYYY block group parts X 2010 counties

Add functionality:

bgp1990_to_cty2010
bgp2000_to_cty2010

Update:

Notebooks:

data-subset-sample-workflow-bgp1990cty2010.ipynb
data-subset-sample-workflow-bgp2000cty2010.ipynb
generate-nation-and-states-bgp1990cty2010.ipynb
generate-nation-and-states-bgp2000cty2010.ipynb

GISJOIN vs. GEOID

Following this decision, all crosswalks created through nhgisxwalk will be keyed on the NHGIS-style GISJOIN ID, not the original census-style GEOID. A converter utility may be provided in the future. This would initially support the conversion from 1990, 2000, and 2010 block IDs, but may subsequently include more geographies.

May still implement a converter utility, but will also have GEOIDs in the same file following this decision. Also, need to create GISJOIN - GEOID crosswalks:

src year `gj`	trg year `gj`	trg year `ge`
`GJOIN1990`	`GJOIN2010`	`GEOID2010`

national files
state level files
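
A converter for 2010 blocks could look roughly like the sketch below. It assumes the 18-character GISJOIN layout in which a "0" follows the two-digit state FIPS and the three-digit county FIPS; the function name `gisjoin_to_geoid_blk2010` and the sample ID are hypothetical.

```python
def gisjoin_to_geoid_blk2010(gisjoin):
    """Convert a 2010 block GISJOIN to a census-style GEOID.

    Assumes the 18-character layout: "G" + state (2 digits + "0")
    + county (3 digits + "0") + tract (6 digits) + block (4 digits).
    """
    if len(gisjoin) != 18 or not gisjoin.startswith("G"):
        raise ValueError(f"unexpected 2010 block GISJOIN: {gisjoin!r}")
    state = gisjoin[1:3]      # skip the trailing "0" at index 3
    county = gisjoin[4:7]     # skip the trailing "0" at index 7
    tract_block = gisjoin[8:]
    return state + county + tract_block


# Made-up identifier, for illustration only.
print(gisjoin_to_geoid_blk2010("G10000300002001000"))
```

Other years and geographies use different code lengths, so each would need its own layout rather than reusing these slice positions.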

Transfer to IPUMS

in code base
- update setup.py 2bae2cb
- review all links within code itself 7f2464f
- update environment.yml 3eb9b9a
in documentation
- text within README.md 2bae2cb
- links within README.md 2bae2cb
- text within resources/ 2bae2cb
- links within resources/ 2bae2cb
in resources/notebooks
- review all links within notebooks a0a8f73
- review binder functionality
in tools/
- first cell of gitcount.ipynb 13889c0

weights rounding functionality

The only differences I found were rounding differences. (I rounded my weights to 10 decimal digits, just to make sure anything that "should" sum to 1 actually does sum to exactly 1.)

There should be a (default) option to round weights. This can easily be accomplished with pandas.DataFrame.round. It might make sense to put this at the end of GeoCrossWalk.__init__? Or maybe to have it as a stand-alone method?

Example

import pandas

col1 = ["a", "b", "c"]
col2 = ["X", "Y", "Z"]
col3 = [0.123456, 0.99999999999999, 0.6666666666667]
cols = {"col1": col1, "col2": col2, "col3": col3}
df = pandas.DataFrame(cols)
print(df)

  col1 col2      col3
0    a    X  0.123456
1    b    Y  1.000000
2    c    Z  0.666667

print(df.round(3))

  col1 col2   col3
0    a    X  0.123
1    b    Y  1.000
2    c    Z  0.667

Add descriptive blurb into README.md

Add descriptive blurb into README.md
Maybe adapt from nhgisxwalk.GeoCrossWalk:

Generate a temporal crosswalk for census geography data built from the smallest intersecting units (atoms). Each row in a crosswalk represents a single atom and is comprised of a source ID (geo+year+gj), a target ID (geo+year+gj), and at least one column of weights. An example of a source ID is bgp1990gj (block group parts from 1990) and an example of a target ID is trt2010gj (tracts from 2010); see the nhgis_bgp1990gj_to_trt2010gj crosswalk extract of Delaware. The weights are the interpolated proportions of source attributes that are calculated as being within the target units. For a description of the algorithmic workflow see the General Crosswalk Construction Framework. Data from 1990 pose a specific problem because the US Census Bureau did not explicitly include blocks with no population/housing units in the summary files (SF1). For a description of the algorithmic workflow in the 1990 "no data" scenarios see Handling 1990 No-Data Blocks in Crosswalks. For more information on the base crosswalks see their technical details. For further description see Schroeder (2007).

Currently supported crosswalks include:

source	target
1990 block group parts	2010 tracts
2000 block group parts	2010 tracts
1990 block group parts	2010 block groups
2000 block group parts	2010 block groups

Planned supported crosswalks include:

source	target
1990 block group parts	2012 tracts
2000 block group parts	2012 tracts
1990 block group parts	2012 block groups
2000 block group parts	2012 block groups

Schroeder, J. P. 2007. Target-density weighting interpolation and uncertainty evaluation for temporal analysis of census data. Geographical Analysis 39 (3):311–335.

rethink static name checking

From Jonathan:

b8a2860 still hard-codes the expected length for geo codes (now either 2 or 3 characters instead of exactly 3). I’m not sure it’s worth fixing now, but to handle potentially longer codes in the future, I’d suggest instead grabbing a set of characters from “c” that is equal in length to “geo” and comparing those strings.

try comparing to self.source_geo/self.target_geo
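
The suggested comparison can be sketched as a prefix check whose length comes from the geography code itself. The function name `column_matches_geo` is hypothetical; `c` and `geo` follow the names used in the comment above.

```python
def column_matches_geo(c, geo):
    """Check whether column name `c` begins with the geography code `geo`."""
    # Slice as many characters as `geo` has, so no code length is hard-coded.
    return c[: len(geo)] == geo


print(column_matches_geo("bgp1990gj", "bgp"))
print(column_matches_geo("bg2010gj", "bg"))
print(column_matches_geo("trt2010gj", "bgp"))
```

Note that a shorter code such as bg is also a prefix of a longer one such as bgp, so the characters that follow the code may need checking as well.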

block group part X block groups

Add functionality:

bgp1990_to_bkg2010
bgp2000_to_bkg2010

See the Minimum Viable Products (MVPs) in the Specs for Crosswalk Deliverables document.

Recreate subsets / unittests / etc.

Recreate unittests following the implementation of:

#9 (1990) (#19)
#8 (2000) (#15)

Specifically:

Start with the dev_ideas/build_subset.ipynb
Decide where this notebook should live for reproducibility

split base crosswalks by (target) state

nhgis_blk1990_blk2010_ge.csv.zip
nhgis_blk1990_blk2010_gj.csv.zip
nhgis_blk2000_blk2010_ge.csv.zip
nhgis_blk2000_blk2010_gj.csv.zip
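
For the GISJOIN (_gj) files, splitting by target state could be sketched as below, assuming the two state FIPS digits immediately follow the leading "G" in the GJOIN2010 field. The function name `split_by_target_state` and the sample records are made up; the GEOID (_ge) files would need a different slice.

```python
from collections import defaultdict


def split_by_target_state(rows, target_field="GJOIN2010"):
    """Group crosswalk records by the state of their target GISJOIN."""
    by_state = defaultdict(list)
    for row in rows:
        gisjoin = row[target_field]
        state_fips = gisjoin[1:3]  # two FIPS digits that follow the leading "G"
        by_state[state_fips].append(row)
    return dict(by_state)


# Made-up records, for illustration only.
rows = [
    {"GJOIN2010": "G10000300002001000", "WEIGHT": 1.0},
    {"GJOIN2010": "G10000300002001001", "WEIGHT": 0.4},
    {"GJOIN2010": "G56000100096001000", "WEIGHT": 0.6},
]
groups = split_by_target_state(rows)
print({state: len(records) for state, records in groups.items()})
```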

start documenting release notes

start a gitcount, like in spaghetti
test for functionality

Urban/Rural codes for 2000 blocks?

Need a better way to handle/build Urban and Rural codes of 2000 block data.
These codes are needed to generate a block group part ID.

standard case example / tests

Maybe have two (toy) example datasets:

standard case example
special case example

These will be based on the 1990 stuff ("no-data") and 2000 stuff ("all data")

Add unittests for each.
