
nhgisxwalk's People

Contributors

dependabot[bot], j-p-schroeder, jgaboardi

nhgisxwalk's Issues

post-2010 state split

From the Specs for Crosswalk Deliverables document.

Split files by state: All of the above (including "existing" files)

  • Idea: Single state file for 2010-2019 target zones except for AZ, CA, NY, which need separate files for 2010 & later target zones.
    • For NY: 2010 in one file; 2011-2019 in another
    • For CA & AZ: 2010-2011 in one file; 2012-2019 in another
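The split rule above can be sketched as a small lookup. This is only an illustration and not part of nhgisxwalk: the name `target_year_ranges` and the use of postal abbreviations as keys are assumptions made for the example.

```python
# Hypothetical helper: which target-year ranges need separate files per state.
# Keys are postal abbreviations (an assumption; FIPS codes would work equally well).
SPLIT_STATES = {
    "NY": [(2010, 2010), (2011, 2019)],
    "CA": [(2010, 2011), (2012, 2019)],
    "AZ": [(2010, 2011), (2012, 2019)],
}
# Every other state gets a single file covering all target years.
DEFAULT_RANGE = [(2010, 2019)]


def target_year_ranges(state):
    """Return a list of (first_year, last_year) tuples, one per output file."""
    return SPLIT_STATES.get(state.upper(), DEFAULT_RANGE)


for st in ["MN", "NY", "CA"]:
    print(st, target_year_ranges(st))
```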

swap out sample data

Swap out WY (56) for DC (11) for sample data / testing / example. While Wyoming has a small population, it has a (relatively) large number of blocks (220,401) in 2010. By comparison, DC has only 8,362 in 2010.

[RESOLVED] Subset results unreliable?

RESOLVED

Crosswalks should not be generated from subsets of data
See #1, #7, #11, #14, #15, #16

Problem

Crosswalks should be generated from a complete national base block crosswalk (and supplementary data if needed), then subset to a single state. Creating a crosswalk from a subset of data seemingly gives inconsistent results, though it is not clear exactly why this is happening. At first glance, this appeared to be a function of bordering state subsets acting upon each other, but even Hawaii, which borders no other state, gives bad results.

  • After further investigation this appears to be an issue with the generation of 1990 BGPs.
  • Will update after checking 2000 BGP results

Solution

  • Maybe try whole nation for testing and drop the subset testing?
  • this is ideal for completeness, but would require storing all data on GitHub and dramatically increase testing time

This was not an issue with the functionality of nhgisxwalk, but a mistake with how the supplementary 1990 BGP subset files were being saved. This has been corrected and accurate crosswalks are now being generated from subset files.
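For illustration, the "build nationally, then subset to a state" workflow could look like the sketch below. It assumes ID columns named `bgp1990gj` and `trt2010gj` that hold GISJOIN strings of the form "G" + 2-digit state FIPS + "0" + the rest; the function name `subset_by_state` and all IDs and weights are made up for the example and are not taken from nhgisxwalk.

```python
import pandas


def subset_by_state(xwalk, column, state_fips):
    """Keep rows whose GISJOIN in `column` belongs to `state_fips`.

    Assumes GISJOINs look like "G" + state FIPS (2 digits) + "0" + ...
    Rows with a missing ID in `column` are dropped.
    """
    in_state = xwalk[column].str[1:3] == state_fips
    return xwalk[in_state].reset_index(drop=True)


# Toy "national" crosswalk (IDs and weights are invented).
national = pandas.DataFrame(
    {
        "bgp1990gj": ["G100_src1", "G100_src2", "G240_src3"],
        "trt2010gj": ["G1000010040100", "G1000010040200", "G2400150030100"],
        "wt_pop": [0.4, 0.6, 1.0],
    }
)

delaware = subset_by_state(national, "trt2010gj", "10")
print(delaware)
```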

Create MVP crosswalks

Create national crosswalks for the following:

  • bgp1990-trt2010
  • bgp2000-trt2010
  • bgp1990-bkg2010
  • bgp2000-bkg2010

Create state crosswalks for the following:

  • bgp1990-trt2010
  • bgp2000-trt2010
  • bgp1990-bkg2010
  • bgp2000-bkg2010

crosswalk naming / saving?

Consider crosswalk naming that mirrors the current NHGIS block-block geographic crosswalks (nhgis_blk1990_blk2010_gj). For example, nhgis_bgp1990_trt2010.

Also, rethink upper level (non-nhgisxwalk) directory structure and where to save out "results".

  • maybe drop the ./results/ directory and put directly into ./crosswalks/
  • maybe a subdir for state breakdowns?

Change license

nhgisxwalk is currently licensed under the BSD 3-Clause but should be relicensed to Mozilla Public License v2.0 in preparation for the transfer to IPUMS (#70).

Add/remove/update notebooks

Update/rerun

  • notebooks/data-subset-sample-workflow-bgp1990trt2010.ipynb
  • notebooks/data-subset-sample-workflow-bgp2000trt2010.ipynb
  • notebooks/synthetic-example.ipynb
  • notebooks/weighted-portion-synthetic-atoms.ipynb

Add

  • notebooks/build_subset_1990.ipynb
  • notebooks/build_subset_2000.ipynb
  • notebooks/test_subset_1990.ipynb
  • notebooks/test_subset_2000.ipynb

Remove

  • notebooks/build_subset.ipynb

Block Group Part GISJOIN correction

Following the correction of the 1990 GISJOIN ID for block group parts (103rd congress vs. 101st congress) the code base should be updated/pruned.

Check/update the following:

update compression options / index

New options are available for compression in pandas v1.0. Also, the data frame indices should be omitted when writing out the crosswalks. These two fixes should be updated in GeoCrossWalk.xwalk_to_csv().

Example

compression_opts = dict(method="zip", archive_name="without_index.csv")
df.to_csv("without_index.zip", compression=compression_opts, index=False)
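A self-contained version of the same idea, written as a round trip so the result can be checked. The toy data frame and the temporary directory are assumptions for the example; the `compression` dictionary form requires pandas 1.0 or later.

```python
import os
import tempfile

import pandas

# Toy crosswalk-like frame (made-up values).
df = pandas.DataFrame(
    {"source": ["a", "b"], "target": ["X", "Y"], "wt": [0.25, 0.75]}
)

with tempfile.TemporaryDirectory() as tmpdir:
    zip_path = os.path.join(tmpdir, "without_index.zip")
    # The dict form lets us name the CSV stored inside the zip archive.
    compression_opts = dict(method="zip", archive_name="without_index.csv")
    df.to_csv(zip_path, compression=compression_opts, index=False)

    # On read, compression is inferred from the ".zip" extension.
    round_trip = pandas.read_csv(zip_path)

# No stray index column should appear after the round trip.
print(list(round_trip.columns))
```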

Add headers to source files

Header in Source Files

In each source file that you've written, place a comment header at the top of the file. [NOTE: The example below uses the comment notation ("#") recognized by most scripting languages (e.g., Bash, Perl, Python, R, Ruby). Adjust that notation accordingly based on the language your code is written in.]

# This file is part of the Minnesota Population Center's {PROJECT TITLE}.
# For copyright and licensing information, see the NOTICE and LICENSE files
# in this project's top-level directory, and also on-line at:
#   https://github.com/mnpopcenter/{REPO-NAME}

Replace the placeholders {PROJECT TITLE} and {REPO-NAME}
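If many files need the header, the placeholder substitution could be scripted. The sketch below is one way to do it; `render_header` and the example values passed to it are hypothetical.

```python
# Header text copied from the template above; placeholders kept verbatim.
HEADER_TEMPLATE = (
    "# This file is part of the Minnesota Population Center's {PROJECT TITLE}.\n"
    "# For copyright and licensing information, see the NOTICE and LICENSE files\n"
    "# in this project's top-level directory, and also on-line at:\n"
    "#   https://github.com/mnpopcenter/{REPO-NAME}\n"
)


def render_header(project_title, repo_name):
    """Fill in the two placeholders with plain string replacement."""
    return HEADER_TEMPLATE.replace("{PROJECT TITLE}", project_title).replace(
        "{REPO-NAME}", repo_name
    )


# Example values only; substitute the real project title and repository name.
print(render_header("Example Project", "example-repo"))
```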

Character code / naming conventions?

Originally in the Crosswalk Specs

Geographic abbreviations

decided / not yet used

Level               Option A: 3 char [initial scheme]   Option B: 2 char   Option C: mix
Block               blk [in use]                        bk                 blk
Block group         bkg                                 bg                 bg
Block group part    bgp                                 bp                 bgp
Census tract        trt                                 tr, ct ?           tr, ct
County              cty                                 co                 co
Place               plc ?                               pl                 pl
County subdivision  csd ?                               cs                 cs, csub, cosub ?
CBSA                cbs, msa ?                          cb, ma ?           cb, cbsa, msa ?

Examples:

Option A: 3 char

  • nhgis_blk1990_blk2010_gj
  • nhgis_blk1990_blk2010_gj_27
  • nhgis_bgp1990_bkg2010
  • nhgis_bgp1990_trt2010
  • nhgis_bgp1990_cty2010

Option B: 2 char

  • nhgis_bk1990_bk2010_gj
  • nhgis_bk1990_bk2010_gj_27
  • nhgis_bp1990_bg2010
  • nhgis_bp1990_tr2010
  • nhgis_bp1990_ct2010
  • nhgis_bp1990_co2010

Option C: mix

  • nhgis_blk1990_blk2010_gj
  • nhgis_blk1990_blk2010_gj_27
  • nhgis_bgp1990_bg2010
  • nhgis_bgp1990_tr2010
  • nhgis_bgp1990_ct2010
  • nhgis_bgp1990_co2010

State FIPS vs. postal codes

Examples:
Option A: FIPS [initial scheme]

  • nhgis_blk1990_blk2010_gj_01
  • nhgis_blk1990_blk2010_gj_02
  • nhgis_blk1990_blk2010_gj_27
  • nhgis_blk1990_blk2010_gj_56

Option B: upper case postal

  • nhgis_blk1990_blk2010_gj_AL
  • nhgis_blk1990_blk2010_gj_AK
  • nhgis_blk1990_blk2010_gj_MN
  • nhgis_blk1990_blk2010_gj_WY

Add NOTICE.txt

NOTICE.txt

The copyright notice(s) for a project are placed in a file called NOTICE.txt in the project's top-level (root) folder. The .txt extension in the file's name makes it easier for developers and users of cross-platform projects to work with this file on Windows.

This file also contains the list of people -- both at MPC and outside our center -- who've contributed to the project. Including the contributors list in this file makes it easier to see the association between copyrights held by organizations and their employees who made contributions to the project.

A template for this file is available for new projects. After downloading a copy of the template into your clone, replace these placeholders with the appropriate information:

  • {PROJECT TITLE}
  • {REPO-NAME}
  • {YEAR}
  • {YOUR NAME}

Specs for README files

Jonathan's specs for included README.txt files as per here.

Readme files

  • One readme file per zip file
  • Content will be identical for all BGP crosswalks
  • For block-to-block crosswalks, re-use the existing four readme files
    • Copy into each of the state-specific zip files with no changes (no need to make the readme state-specific, too)

links and improved instructions for downloading NHGIS data

Add links and instructions for downloading NHGIS data. While this is not essential, it fosters good-faith openness for reproducibility.

  • Base-level crosswalks
    • 1990-2010
    • 2000-2010
  • 1990 BLK SF
  • 1990 BGP SF
    • w/directions on navigation to compound geographies
  • 2000 BLK SF

This should eventually be superseded in v1.0.0 by interoperability with the NHGIS API (#12)

Finalized README.txt

Following #72 two remaining details must be finalized:

  • the name for the single block group part *_README.txt
  • the actual text within the single block group part *_README.txt

Content for README

See also #73 and Jonathan's note here.


Documentation for NHGIS crosswalks from block group parts to later units

NHGIS crosswalk from 1990 to 2010 census blocks with GISJOIN identifiers

 
Contents


Data Summary

 
Each NHGIS crosswalk file provides interpolation weights for allocating census counts from a specified
set of source zones to a specified set of target zones. Each record in the crosswalk represents a spatial
intersection between a single source zone and a single target zone.
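As a rough illustration of how such weights are typically applied, the sketch below allocates source-zone counts to target zones. The column names (`source`, `target`, `weight`, `pop`) and all values are made up for the example and do not match the actual crosswalk files.

```python
import pandas

# Toy source-zone counts (made-up values).
source_counts = pandas.DataFrame({"source": ["s1", "s2"], "pop": [100.0, 50.0]})

# Toy crosswalk: one row per source/target intersection.
xwalk = pandas.DataFrame(
    {
        "source": ["s1", "s1", "s2"],
        "target": ["t1", "t2", "t2"],
        "weight": [0.25, 0.75, 1.0],
    }
)

# Attach each source count to its intersections and scale by the weight.
atoms = xwalk.merge(source_counts, on="source", how="left")
atoms["pop_alloc"] = atoms["pop"] * atoms["weight"]

# Sum the allocated portions within each target zone.
target_est = atoms.groupby("target")["pop_alloc"].sum()
print(target_est)
```

Because the weights for each source zone sum to 1 in this toy example, the total count is preserved after allocation.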

File naming scheme:  nhgis_[source geog][source year]_[target geog][target year]{_state FIPS}.csv

Geographic unit codes:
      blk - Block
      bgp - Block group part (intersections between block groups, places, county subdivisions, etc.)
      bg - Block group
      tr - Census tract
      co - County
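The naming scheme can be expressed as a small helper. This is a sketch only; `crosswalk_filename` is a hypothetical name and not an nhgisxwalk function.

```python
def crosswalk_filename(source_geog, source_year, target_geog, target_year,
                       state_fips=None):
    """Build nhgis_[source geog][source year]_[target geog][target year]{_state FIPS}.csv.

    `state_fips` is optional; when omitted the name refers to a national file.
    """
    name = f"nhgis_{source_geog}{source_year}_{target_geog}{target_year}"
    if state_fips is not None:
        name += f"_{state_fips}"
    return name + ".csv"


print(crosswalk_filename("bgp", 1990, "tr", 2010))
print(crosswalk_filename("bgp", 2000, "bg", 2010, state_fips="10"))
```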


--> Remainder copied from the 1990 block to 2010 block GISJOIN crosswalk readme <--

--> Must edit <--


Content:

  • The top row is a header row
  • Each subsequent row represents a potential intersection between a 1990 block and 2010 block
  • The GJOIN1990 and GJOIN2010 fields contain NHGIS-standard GISJOIN block identifiers:
    • A block GISJOIN is a concatenation of:
      • "G"
      • State NHGIS code: 3 digits (FIPS + "0")
      • County NHGIS code: 4 digits (FIPS + "0")
      • Census tract code: 4 or 6 digits in 1990; 6 digits in 2010
      • Census block code: 3 or 4 digits in 1990; 4 digits in 2010
    • The GJOIN1990 field contains numerous blank values. These represent cases where the only 1990 blocks intersecting the corresponding 2010 block are offshore, lying in coastal or Great Lakes waters, which are excluded from NHGIS's block boundary files. None of the missing 1990 blocks had any reported population or housing units. The blank values are included here to ensure that all 2010 blocks are represented in the file.
  • The WEIGHT field contains the interpolation weights NHGIS uses to allocate portions of 1990 block counts to 2010 blocks for geographically standardized time series tables
  • The PAREA_VIA_BLK00 field contains the approximate portion of the 1990 block's land* area lying in the 2010 block, based on intersections that the 1990 and 2010 block have with 2000 blocks in 2000 and 2010 TIGER/Line files (i.e. indirect overlay via 2000 blocks).
    • If a 1990 block's area is entirely water, then this value is based on the block's total area including water
    • NHGIS uses these values to compute lower and upper bounds on 1990 estimates: for any record with a value greater than 0 and less than 1, it is assumed that either all or none of the 1990 block's characteristics could be located in the corresponding 2010 block.
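One possible reading of the bounds rule in the last bullet, sketched with toy data: a 1990 block counts toward a 2010 block's lower bound only when it lies entirely inside it (value of 1), and counts in full toward the upper bound whenever there is any overlap (value greater than 0). The IDs and the `pop` column are invented, and this is not NHGIS's actual procedure.

```python
import pandas

# Toy crosswalk rows (IDs and values are made up).
xwalk = pandas.DataFrame(
    {
        "GJOIN1990": ["a", "a", "b"],
        "GJOIN2010": ["x", "y", "y"],
        "PAREA_VIA_BLK00": [0.3, 0.7, 1.0],
    }
)
# Toy 1990 block counts.
counts_1990 = pandas.DataFrame({"GJOIN1990": ["a", "b"], "pop": [10, 20]})

rows = xwalk.merge(counts_1990, on="GJOIN1990", how="left")

# Lower bound: only 1990 blocks lying wholly in the 2010 block contribute.
rows["lower"] = rows["pop"].where(rows["PAREA_VIA_BLK00"] >= 1, 0)
# Upper bound: every 1990 block with any overlap contributes its full count.
rows["upper"] = rows["pop"].where(rows["PAREA_VIA_BLK00"] > 0, 0)

bounds = rows.groupby("GJOIN2010")[["lower", "upper"]].sum()
print(bounds)
```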

Notes

NHGIS uses this crosswalk to generate 1990 data standardized to 2010 census units for NHGIS time series tables. Complete documentation on the interpolation model used to generate the weights in the crosswalk is provided at https://www.nhgis.org/documentation/time-series/1990-blocks-to-2010-geog.

In short, the model is based on "cascading density weighting", as introduced in Chapter 3 of Jonathan Schroeder's dissertation (Visualizing Patterns in U.S. Urban Population Trends, University of Minnesota) available here: http://hdl.handle.net/11299/48076.

The general sequence of operations:

  1. Estimate 2000 population and housing unit counts for each intersection between 2000 and 2010 blocks.
    • Our basic "cascading density weighting" model does this by allocating 2000 counts among 2010 blocks in proportion to 2010 block population and housing densities (population and housing summed together).
    • We use this basic approach only for 2000 blocks that are not split by the boundaries of a 2010 target unit, where "target units" are the areas for which NHGIS plans to release standardized data: block groups, places, county subdivisions, school districts, ZCTAs, urban areas, congressional districts (111th and 113th), and any units that can be constructed from these (e.g., census tracts, counties, etc.).
    • For 2000 blocks that are split by the boundaries of a 2010 target unit, we use NHGIS's more advanced hybrid interpolation model (see https://www.nhgis.org/documentation/time-series/2000-blocks-to-2010-geog) to allocate 2000 counts among 2010 blocks.
  2. Use the estimated 2000 population and housing unit densities from step 1 to guide the allocation of 1990 counts among 1990-2000-2010 block intersections.
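A heavily simplified sketch of the density-weighting idea in step 1: each intersection of a 2000 block with a 2010 block is weighted by its area times the 2010 block's combined population and housing density, then normalized within the 2000 block. All column names and values are invented, and the real NHGIS model (including the hybrid approach and the 1990 cascade) involves considerably more than this.

```python
import pandas

# Toy 2000-2010 block intersections (all values are made up).
atoms = pandas.DataFrame(
    {
        "blk2000": ["A", "A", "B"],
        "blk2010": ["x", "y", "y"],
        "atom_area": [1.0, 1.0, 2.0],   # area of each intersection
        "pop2010": [30, 10, 10],        # population of the 2010 block
        "hu2010": [10, 10, 10],         # housing units of the 2010 block
        "area2010": [2.0, 4.0, 4.0],    # total area of the 2010 block
    }
)

# Combined 2010 density: population and housing units summed, per unit area.
density = (atoms["pop2010"] + atoms["hu2010"]) / atoms["area2010"]

# Expected share of each intersection is proportional to area x density.
atoms["expected"] = atoms["atom_area"] * density

# Normalize within each 2000 block so its weights sum to 1.
totals = atoms.groupby("blk2000")["expected"].transform("sum")
atoms["weight"] = atoms["expected"] / totals
print(atoms[["blk2000", "blk2010", "weight"]])
```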

The procedure also combines two types of overlay to model intersections between 1990, 2000, and 2010 blocks:

  1. "Direct overlay" of 1990 & 2000 block polygons from 2000 TIGER/Line files with 2000 & 2010 block polygons from 2010 TIGER/Line files (with a preliminary step to georectify Hawaii's 2000 TIGER polygons to 2010 TIGER features in order to accommodate a systematic change in the coordinate system used to represent Hawaii features between the two TIGER versions)
  2. "Indirect overlay":
    a. Overlay 1990 & 2000 block polygons using the 2000 TIGER/Line basis
    b. Overlay 2000 & 2010 block polygons using the 2010 TIGER/Line basis
    c. Multiply 1990-2000 intersection proportions from step 2a with 2000-2010 proportions from step 2b to compute estimated proportions of each 1990 block within each 2010 block. (This is how the crosswalk's "PAREA_VIA_BLK00" values are derived.)
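Step 2c can be sketched as a join on the shared 2000 block followed by a multiplication. The sketch assumes the 1990-2000 proportion is the share of the 1990 block lying in the 2000 block, the 2000-2010 proportion is the share of the 2000 block lying in the 2010 block, and that products are summed over all intermediate 2000 blocks; the column names and values are invented.

```python
import pandas

# Step 2a (toy): share of each 1990 block lying in each 2000 block.
p90_00 = pandas.DataFrame(
    {
        "blk1990": ["a", "a", "b"],
        "blk2000": ["m", "n", "n"],
        "p_90_in_00": [0.5, 0.5, 1.0],
    }
)
# Step 2b (toy): share of each 2000 block lying in each 2010 block.
p00_10 = pandas.DataFrame(
    {
        "blk2000": ["m", "n", "n"],
        "blk2010": ["x", "x", "y"],
        "p_00_in_10": [1.0, 0.5, 0.5],
    }
)

# Step 2c: chain the two overlays through the shared 2000 block.
chain = p90_00.merge(p00_10, on="blk2000")
chain["parea"] = chain["p_90_in_00"] * chain["p_00_in_10"]

# Assumption: sum over the intermediate 2000 blocks for each 1990-2010 pair.
parea = chain.groupby(["blk1990", "blk2010"], as_index=False)["parea"].sum()
print(parea)
```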

The direct overlay weights are constrained to eliminate any 1990-2010 intersections that are not valid in the indirect overlay. This prevents most "slivers" (invalid intersections caused by changes in TIGER feature representations) from being assigned any weight.

The final weighting blends weights from constrained direct overlay (CDO) and indirect overlay (IO) through a weighted average, giving high weight to CDO (and low weight to IO) in cases where the two TIGER/Line representations of a 2000 block align well and where the 1990-2000 block intersection and the 2000-2010 block intersection both comprise less than the entirety of the 2000 block. In cases where the block intersections cover the entirety of a 2000 block, or where the block intersection from one TIGER/Line version has no valid intersection with the corresponding 2000 block in the other TIGER/Line version, the weighting is based on IO alone.

 

Citation and Use

 
All persons are granted a limited license to use this documentation and the
accompanying data, subject to the following conditions:

  • Publications and research reports employing NHGIS data must cite it appropriately. The citation should include the following:

    Steven Manson, Jonathan Schroeder, David Van Riper, and Steven Ruggles. 
    IPUMS National Historical Geographic Information System: Version 12.0 [Database]. 
    Minneapolis: University of Minnesota. 2017. 
    http://doi.org/10.18128/D050.V12.0

  • For policy briefs or articles in the popular press, we recommend that you cite the use of NHGIS data as follows:

    IPUMS NHGIS, University of Minnesota, www.nhgis.org.

In addition, we request that users send us a copy of any publications, research
reports, or educational material making use of the data or documentation.
Printed matter should be sent to:

    IPUMS NHGIS
    Minnesota Population Center
    University of Minnesota
    50 Willey Hall
    225 19th Ave S
    Minneapolis, MN 55455

Send electronic material to: [email protected]

Swap example data to Delaware

Swap all example data to Delaware. More chance of testing edge cases than DC. Related to #1.

  • base crosswalks
    • blk_1990-blk2010_gj (#19)
    • blk_2000-blk2010_gj (#15)
  • data
    • 1990
      • blocks (#19)
      • block groups parts
    • 2000
    • 2010
  • notebooks
    • data-subset-sample-workflow-bgp1990trt2010 (#19)
    • data-subset-sample-workflow-bgp2000trt2010 (#15)
  • unittests
    • 1990 test_nhgisxwalk.py (#19)
    • 2000 test_nhgisxwalk.py (#15)

bgp README files

Replace the 6 current BGP README files with a single file as per our 2020-08-05 meeting:

  • Q. Should the readme file names vary among BGP crosswalks? Or should we just use a single name for all BGP crosswalk readmes?
  • I.e., if I want to update content, is there a single template file I can update, or must I update content in all of the separate files?
  • A. Let's use just one file with one name. James will make this edit...

Unify generate_XXXX_ids

Need to unify the generate_source_ids and generate_target_ids methods of GeoCrossWalk and generalize them into one method generate_ids.

They are acceptable in the current format, but need to be updated soon so that we can deal with source and target geographies other than 1990 BGP and 2010 Tract.

YYYY block group parts X 2010 counties

Add functionality:

  • bgp1990_to_cty2010
  • bgp2000_to_cty2010

Update:

  • README.md
  • __init__.py
  • geocrosswalk.py
  • id_codes.py
  • __code_components.py
  • tests/test_nhgisxwalk.py

Notebooks:

  • data-subset-sample-workflow-bgp1990cty2010.ipynb
  • data-subset-sample-workflow-bgp2000cty2010.ipynb
  • generate-nation-and-states-bgp1990cty2010.ipynb
  • generate-nation-and-states-bgp2000cty2010.ipynb

GISJOIN vs. GEOID

Following this decision, all crosswalks created through nhgisxwalk will be keyed on the NHGIS-style GISJOIN ID, not the original census-style GEOID. A converter utility may be provided in the future. This would initially support the conversion from 1990, 2000, and 2010 block IDs, but may subsequently include more geographies.

May still implement a converter utility, but will also have GEOIDs in the same file following this decision. Also, need to create GISJOIN - GEOID crosswalks:

src year gj   trg year gj   trg year ge
GJOIN1990     GJOIN2010     GEOID2010
  • national files
  • state level files
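A possible shape for the converter, limited to 2010 blocks. It assumes the block GISJOIN layout "G" + state FIPS (2 digits) + "0" + county FIPS (3 digits) + "0" + tract (6 digits) + block (4 digits), and that the census GEOID is the same codes without the "G" and the two padding zeros. The function name and the toy IDs are hypothetical.

```python
import pandas


def block_gisjoin_to_geoid_2010(gisjoin):
    """Convert an 18-character 2010 block GISJOIN to a 15-character GEOID."""
    if len(gisjoin) != 18 or not gisjoin.startswith("G"):
        raise ValueError(f"unexpected 2010 block GISJOIN: {gisjoin!r}")
    # Drop "G", the zero after the state code, and the zero after the county code.
    return gisjoin[1:3] + gisjoin[4:7] + gisjoin[8:]


# Toy GISJOIN - GEOID crosswalk rows (IDs are illustrative only).
xwalk = pandas.DataFrame(
    {
        "GJOIN1990": ["G10000100401101", "G10000100401102"],
        "GJOIN2010": ["G10000100401001000", "G10000100401001001"],
    }
)
xwalk["GEOID2010"] = xwalk["GJOIN2010"].map(block_gisjoin_to_geoid_2010)
print(xwalk)
```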

Transfer to IPUMS

  • in code base
  • in documentation
  • in resources/notebooks
    • review all links within notebooks a0a8f73
    • review binder functionality
  • in tools/
    • first cell of gitcount.ipynb 13889c0

weights rounding functionality

the only differences I found were rounding differences. (I rounded my weights to 10 decimal digits, just to make sure anything that "should" sum to 1 actually does sum to exactly 1.)

There should be a (default) option to round weights. This can easily be accomplished with pandas.DataFrame.round. It might make sense to put this at the end of GeoCrossWalk.__init__? Or maybe to have it as a stand-alone method?

Example

import pandas

col1 = ["a", "b", "c"]
col2 = ["X", "Y", "Z"]
col3 = [0.123456, 0.99999999999999, 0.6666666666667]
cols = {"col1": col1, "col2": col2, "col3": col3}
df = pandas.DataFrame(cols)
print(df)
  col1 col2      col3
0    a    X  0.123456
1    b    Y  1.000000
2    c    Z  0.666667
print(df.round(3))
  col1 col2   col3
0    a    X  0.123
1    b    Y  1.000
2    c    Z  0.667
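A stand-alone version of the rounding option might look like the following sketch. The function name `round_weights`, its parameters, and the `wt_pop` column are assumptions for illustration; the only library call relied on is pandas.DataFrame.round.

```python
import pandas


def round_weights(xwalk, weight_cols, decimals=10):
    """Return a copy of `xwalk` with the weight columns rounded."""
    out = xwalk.copy()
    out[weight_cols] = out[weight_cols].round(decimals)
    return out


# Toy crosswalk with weights that "should" sum to 1 per source (made-up values).
xwalk = pandas.DataFrame(
    {
        "source": ["a", "a", "b"],
        "target": ["X", "Y", "Z"],
        "wt_pop": [0.333333333333333, 0.666666666666667, 0.99999999999999],
    }
)

rounded = round_weights(xwalk, ["wt_pop"], decimals=10)

# Check the per-source sums after rounding.
sums = rounded.groupby("source")["wt_pop"].sum()
print(sums)
```

Returning a copy keeps the unrounded weights available, which may help when comparing against independently generated crosswalks.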

Add descriptive blurb into README.md

  • Add descriptive blurb into README.md
  • Maybe adapt from nhgisxwalk.GeoCrossWalk:

Generate a temporal crosswalk for census geography data built from the smallest intersecting units (atoms). Each row in a crosswalk represents a single atom and is comprised of a source ID (geo+year+gj), a target ID (geo+year+gj), and at least one column of weights. An example of a source ID is bgp1990gj (block group parts from 1990) and an example of a target ID is trt2010gj (tracts from 2010); see the nhgis_bgp1990gj_to_trt1990gj crosswalk extract of Delaware. The weights are the interpolated proportions of source attributes that are calculated as being within the target units. For a description of the algorithmic workflow see the General Crosswalk Construction Framework. Data from 1990 pose a specific problem because the US Census Bureau did not explicitly include blocks with no population/housing units in the summary files (SF1). For a description of the algorithmic workflow in the 1990 "no data" scenarios see Handling 1990 No-Data Blocks in Crosswalks. For more information on the base crosswalks see their technical details. For further description see Schroeder (2007).

Currently supported crosswalks include:

source                   target
1990 block group parts   2010 tracts
2000 block group parts   2010 tracts
1990 block group parts   2010 block groups
2000 block group parts   2010 block groups

Planned supported crosswalks include:

source                   target
1990 block group parts   2012 tracts
2000 block group parts   2012 tracts
1990 block group parts   2012 block groups
2000 block group parts   2012 block groups

  • Schroeder, J. P. 2007. Target-density weighting interpolation and uncertainty evaluation for temporal analysis of census data. Geographical Analysis 39 (3):311–335.

rethink static name checking

From Jonathan:

b8a2860 still hard-codes the expected length for geo codes (now either 2 or 3 characters instead of exactly 3). I’m not sure it’s worth fixing now, but to handle potentially longer codes in the future, I’d suggest instead grabbing a set of characters from “c” that is equal in length to “geo” and comparing those strings.

try comparing to self.source_geo/self.target_geo
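Jonathan's suggestion could be sketched as below: slice as many characters from the column name as the geography code is long and compare. The extra check that the next character is a digit is an addition made here so that "bg" does not match "bgp1990gj"; the function name is hypothetical and column names are assumed to follow the geo+year+gj pattern.

```python
def column_matches_geo(c, geo):
    """True if column name `c` starts with the geography code `geo`.

    Works for codes of any length. The character right after the code must be
    a digit (the start of the year), so "bg" does not match "bgp1990gj".
    """
    n = len(geo)
    return c[:n] == geo and c[n:n + 1].isdigit()


for c, geo in [("bgp1990gj", "bgp"), ("bgp1990gj", "bg"), ("cosub2010gj", "cosub")]:
    print(c, geo, column_matches_geo(c, geo))
```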

Recreate subsets / unittests / etc.

Recreate unittests following the implementation of:

Specifically:

  • Start with the dev_ideas/build_subset.ipynb
  • Decide where this notebook should live for reproducibility

Urban/Rural codes for 2000 blocks?

  • Need a better way to handle/build Urban and Rural codes of 2000 block data.
  • These codes are needed to generate a block group part ID.

standard case example / tests

Maybe have two (toy) example datasets:

  • standard case example
  • special case example

These will be based on the 1990 stuff ("no-data") and 2000 stuff ("all data")

Add unittests for each.
