ipums / nhgisxwalk
Spatio-temporal NHGIS Crosswalks
Home Page: https://ipums.github.io/nhgisxwalk/
License: Mozilla Public License 2.0
Crosswalks should not be generated from subsets of data
See #1, #7, #11, #14, #15, #16
Crosswalks should be generated from a complete national base block crosswalk (and supplementary data if needed), then subset to a single state. Creating a crosswalk from a subset of data seemingly gives inconsistent results, though it is not clear exactly why this is happening. At first glance, this appeared to be a function of bordering state subsets acting upon each other, but even Hawaii gives bad results.
This was not an issue with the functionality of nhgisxwalk, but a mistake in how the supplementary 1990 BGP subset files were being saved. This has been corrected and accurate crosswalks are now being generated from subset files.
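The "build nationally, then subset to a state" workflow can be sketched with pandas. This is only an illustration: `subset_by_state`, the toy IDs, and the column name are hypothetical, and the sketch assumes GISJOIN-style IDs that begin with "G", the 2-digit state FIPS code, and a "0".

```python
import pandas as pd


def subset_by_state(xwalk: pd.DataFrame, id_col: str, state_fips: str) -> pd.DataFrame:
    """Keep the rows of a national crosswalk whose ID belongs to one state."""
    # Assumption: IDs look like "G" + 2-digit state FIPS + "0" + remaining codes.
    state_prefix = "G" + state_fips + "0"
    mask = xwalk[id_col].str.startswith(state_prefix)
    return xwalk[mask].reset_index(drop=True)


# Toy "national" crosswalk with made-up IDs for Delaware (10) and DC (11).
national = pd.DataFrame(
    {
        "bgp1990gj": ["G10000100AAA", "G11000100BBB", "G10000300CCC"],
        "wt_pop": [1.0, 1.0, 1.0],
    }
)
delaware = subset_by_state(national, "bgp1990gj", "10")
print(delaware)
```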
Create national crosswalks for the following:
bgp1990-trt2010
bgp2000-trt2010
bgp1990-bkg2010
bgp2000-bkg2010
Create state crosswalks for the following:
bgp1990-trt2010
bgp2000-trt2010
bgp1990-bkg2010
bgp2000-bkg2010
Implement the new method for handling "no-data" in 1990 blocks and block group parts.
example_crosswalk_data
Consider a documentation site modeled after pysal/spaghetti.
Key resources:
From the Specs for Crosswalk Deliverables document.
Split files by state: All of the above (including "existing" files)
Following #72 two remaining details must be finalized:
*_README.txt
*_README.txt
From Jonathan:
b8a2860 still hard-codes the expected length for geo codes (now either 2 or 3 characters instead of exactly 3). I’m not sure it’s worth fixing now, but to handle potentially longer codes in the future, I’d suggest instead grabbing a set of characters from “c” that is equal in length to “geo” and comparing those strings.
try comparing to self.source_geo / self.target_geo
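Jonathan's suggestion, taking as many leading characters from "c" as "geo" has and comparing the two strings, might look like the sketch below. The helper name `column_matches_geo` and the sample column names are hypothetical and not taken from the nhgisxwalk code base.

```python
def column_matches_geo(c: str, geo: str) -> bool:
    """Return True when column name `c` begins with the geography code `geo`."""
    # Slice `c` to the length of `geo` so codes of any length are handled.
    return c[: len(geo)] == geo


# Hypothetical column names, for illustration only.
columns = ["bgp1990gj", "trt2010gj", "tr2010ge", "wt_pop"]
source_cols = [c for c in columns if column_matches_geo(c, "bgp")]
print(source_cols)
```

A prefix comparison like this treats a shorter code as a match for any longer code that starts with it (for example, "tr" also matches "trt2010gj"), so it is only safe when the set of codes in use avoids such overlaps.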
Consider crosswalk naming that mirrors the current NHGIS block-block geographic crosswalks (nhgis_blk1990_blk2010_gj). For example, nhgis_bgp1990_trt2010.
Also, rethink upper level (non-nhgisxwalk) directory structure and where to save out "results".
./results/ directory and put directly into ./crosswalks/
Replace the 6 current BGP README files with a single file as per our 2020-08-05 meeting:
- Q. Should the readme file names vary among BGP crosswalks? Or should we just use a single name for all BGP crosswalk readmes?
- I.e., if I want to update content, is there a single template file I can update, or must I update content in all of the separate files?
- A. Let's use just one file with one name. James will make this edit...
README.md
nhgisxwalk.GeoCrossWalk: Generate a temporal crosswalk for census geography data built from the smallest intersecting units (atoms). Each row in a crosswalk represents a single atom and is comprised of a source ID (geo+year+gj), a target ID (geo+year+gj), and at least one column of weights. An example of a source ID is bgp1990gj (block group parts from 1990) and an example of a target ID is trt2010gj (tracts from 2010); see the nhgis_bgp1990gj_to_trt2010gj crosswalk extract of Delaware. The weights are the interpolated proportions of source attributes that are calculated as being within the target units. For a description of the algorithmic workflow see the General Crosswalk Construction Framework. Data from 1990 poses a specific problem because the US Census Bureau did not explicitly include blocks with no population/housing units in the summary files (SF1). For a description of the algorithmic workflow in the 1990 "no data" scenarios see Handling 1990 No-Data Blocks in Crosswalks. For more information on the base crosswalks see their technical details. For further description see Schroeder (2007).
source | target |
---|---|
1990 block group parts | 2010 tracts |
2000 block group parts | 2010 tracts |
1990 block group parts | 2010 block groups |
2000 block group parts | 2010 block groups |
source | target |
---|---|
1990 block group parts | 2012 tracts |
2000 block group parts | 2012 tracts |
1990 block group parts | 2012 block groups |
2000 block group parts | 2012 block groups |
Add functionality:
bgp1990_to_bkg2010
bgp2000_to_bkg2010
See the Minimum Viable Products (MVPs) in the Specs for Crosswalk Deliverables document.
Add functionality:
bgp1990_to_cty2010
bgp2000_to_cty2010
Update:
README.md
__init__.py
geocrosswalk.py
id_codes.py
__code_components.py
tests/test_nhgisxwalk.py
Notebooks:
data-subset-sample-workflow-bgp1990cty2010.ipynb
data-subset-sample-workflow-bgp2000cty2010.ipynb
generate-nation-and-states-bgp1990cty2010.ipynb
generate-nation-and-states-bgp2000cty2010.ipynb
Flesh out docstring Examples in:
GeoCrossWalk
calculate_atoms
New options are available for compression in pandas v1.0. Also, the data frame indices should be omitted when writing out the crosswalks. Both changes should be made in GeoCrossWalk.xwalk_to_csv().
Example
compression_opts = dict(method="zip", archive_name="without_index.csv")
df.to_csv("without_index.zip", compression=compression_opts, index=False)
notebooks/data-subset-sample-workflow-bgp1990trt2010.ipynb
notebooks/data-subset-sample-workflow-bgp2000trt2010.ipynb
notebooks/synthetic-example.ipynb
notebooks/weighted-portion-synthetic-atoms.ipynb
notebooks/build_subset_1990.ipynb
notebooks/build_subset_2000.ipynb
notebooks/test_subset_1990.ipynb
notebooks/test_subset_2000.ipynb
notebooks/build_subset.ipynb
nhgis_blk1990_blk2010_ge.csv.zip
nhgis_blk1990_blk2010_gj.csv.zip
nhgis_blk2000_blk2010_ge.csv.zip
nhgis_blk2000_blk2010_gj.csv.zip
Swap all example data to Delaware. More chance of testing edge cases than DC. Related to #1.
NOTICE.txt
The copyright notice(s) for a project are placed in a file called NOTICE.txt in the project's top-level (root) folder. The .txt extension in the file's name makes it easier for developers and users of cross-platform projects to work with this file on Windows.
This file also contains the list of people -- both at MPC and outside our center -- who've contributed to the project. Including the contributors list in this file makes it easier to see the association between copyrights held by organizations and their employees who made contributions to the project.
A template for this file is available for new projects. After downloading a copy of the template into your clone, replace these placeholders with the appropriate information:
- {PROJECT TITLE}
- {REPO-NAME}
- {YEAR}
- {YOUR NAME}
nhgisxwalk is currently licensed under the BSD 3-Clause but should be relicensed to Mozilla Public License v2.0 in preparation for the transfer to IPUMS (#70).
Jonathan's specs for included README.txt files as per here.
Readme files
nhgis_bgp_crosswalk_README.txt
Following the correction of the 1990 GISJOIN ID for block group parts (103rd congress vs. 101st congress) the code base should be updated/pruned.
Check/update the following:
id_codes.code_cols
__code_components.bgp1990
geocrosswalk.handle_1990_no_data (step 3a)
test_nhgisxwalk
build_subset_1990
test_subset_1990
data-subset-sample-workflow-bgp1990bg2010
data-subset-sample-workflow-bgp1990co2010
data-subset-sample-workflow-bgp1990tr2010
build_subset_1990.ipynb (cell 24-ish)
README.md (if needed)
See also #73 and Jonathan's note here.
Documentation for NHGIS crosswalks from block group parts to later units
Contents
Each NHGIS crosswalk file provides interpolation weights for allocating census counts from a specified
set of source zones to a specified set of target zones. Each record in the crosswalk represents a spatial
intersection between a single source zone and a single target zone.
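As an illustration of how such interpolation weights are typically applied, the sketch below allocates source-zone counts to target zones with pandas. The zone IDs, column names, and values are all made up for this example and do not come from an actual NHGIS crosswalk.

```python
import pandas as pd

# Hypothetical crosswalk: one row per source/target intersection.
xwalk = pd.DataFrame(
    {
        "source": ["A", "A", "B"],
        "target": ["X", "Y", "Y"],
        "wt_pop": [0.25, 0.75, 1.0],
    }
)

# Hypothetical counts reported for the source zones.
counts = pd.DataFrame({"source": ["A", "B"], "pop": [100, 40]})

# Attach each source count to its intersections, scale by the weight,
# then sum the allocated portions within each target zone.
merged = xwalk.merge(counts, on="source", how="left")
merged["pop_alloc"] = merged["pop"] * merged["wt_pop"]
target_pop = merged.groupby("target")["pop_alloc"].sum()
print(target_pop)
```

Because the weights for each source zone in this toy example sum to 1, the total count is the same before and after allocation.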
File naming scheme: nhgis_[source geog][source year]_[target geog][target year]{_state FIPS}.csv
Geographic unit codes:
blk - Block
bgp - Block group part (intersections between block groups, places, county subdivisions, etc.)
bg - Block group
tr - Census tract
co - County
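The file naming scheme and unit codes above can be combined in a small helper. This is a sketch only: `crosswalk_file_name` and its parameters are hypothetical and are not part of nhgisxwalk.

```python
from typing import Optional


def crosswalk_file_name(
    source_geog: str,
    source_year: int,
    target_geog: str,
    target_year: int,
    state_fips: Optional[str] = None,
) -> str:
    """Build nhgis_[source geog][source year]_[target geog][target year]{_state FIPS}.csv."""
    name = f"nhgis_{source_geog}{source_year}_{target_geog}{target_year}"
    # The state FIPS suffix is optional and only added for state-level files.
    if state_fips is not None:
        name += f"_{state_fips}"
    return name + ".csv"


print(crosswalk_file_name("bgp", 1990, "tr", 2010))
print(crosswalk_file_name("bgp", 1990, "tr", 2010, state_fips="10"))
```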
Content:
NHGIS uses this crosswalk to generate 1990 data standardized to 2010 census units for NHGIS time series tables. Complete documentation on the interpolation model used to generate the weights in the crosswalk is provided at https://www.nhgis.org/documentation/time-series/1990-blocks-to-2010-geog.
In short, the model is based on "cascading density weighting", as introduced in Chapter 3 of Jonathan Schroeder's dissertation (Visualizing Patterns in U.S. Urban Population Trends, University of Minnesota) available here: http://hdl.handle.net/11299/48076.
The general sequence of operations:
The procedure also combines two types of overlay to model intersections between 1990, 2000, and 2010 blocks:
The direct overlay weights are constrained to eliminate any 1990-2010 intersections that are not valid in the indirect overlay. This prevents most "slivers" (invalid intersections caused by changes in TIGER feature representations) from being assigned any weight.
The final weighting blends weights from constrained direct overlay (CDO) and indirect overlay (IO) through a weighted average, giving high weight to CDO (and low weight to IO) in cases where the two TIGER/Line representations of a 2000 block align well and where the 1990-2000 block intersection and the 2000-2010 block intersection both comprise less than the entirety of the 2000 block. In cases where the block intersections cover the entirety of a 2000 block, or the block intersection from one TIGER/Line version has no valid intersection with the corresponding 2000 block in the other TIGER/Line version, the weighting is based on IO alone.
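The blending step is a weighted average of two weights. The sketch below shows only that generic form; the function name and the `cdo_share` parameter are hypothetical, and how NHGIS actually derives the share given to CDO is not specified here.

```python
def blend_weights(cdo_weight: float, io_weight: float, cdo_share: float) -> float:
    """Weighted average of a CDO weight and an IO weight.

    `cdo_share` is a hypothetical blending factor in [0, 1]:
    1.0 relies on CDO alone and 0.0 relies on IO alone.
    """
    if not 0.0 <= cdo_share <= 1.0:
        raise ValueError("cdo_share must be between 0 and 1")
    return cdo_share * cdo_weight + (1.0 - cdo_share) * io_weight


# With a share of 0 the result is the IO weight alone.
print(blend_weights(0.8, 0.4, 0.0))
# A high share pulls the result toward the CDO weight.
print(blend_weights(0.8, 0.4, 0.9))
```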
All persons are granted a limited license to use this documentation and the
accompanying data, subject to the following conditions:
Steven Manson, Jonathan Schroeder, David Van Riper, and Steven Ruggles.
IPUMS National Historical Geographic Information System: Version 12.0 [Database].
Minneapolis: University of Minnesota. 2017.
http://doi.org/10.18128/D050.V12.0
IPUMS NHGIS, University of Minnesota, www.nhgis.org.
In addition, we request that users send us a copy of any publications, research
reports, or educational material making use of the data or documentation.
Printed matter should be sent to:
IPUMS NHGIS
Minnesota Population Center
University of Minnesota
50 Willey Hall
225 19th Ave S
Minneapolis, MN 55455
Send electronic material to: [email protected]
Need to unify the generate_source_ids and generate_target_ids methods of GeoCrossWalk and generalize them into one method, generate_ids.
They are acceptable in the current format, but need to be updated soon so that we can deal with source and target geographies other than 1990 BGP and 2010 Tract.
Maybe have two (toy) example datasets:
These will be based on the 1990 stuff ("no-data") and 2000 stuff ("all data")
Add unittests for each.
The question was raised about whether to include a README.txt within a zipped archive for all crosswalk products. This is currently done with the original nhgis_blkYEAR-blkYEAR_g{e}{j} data available here.
Swap out WY (56) for DC (11) for sample data / testing / example. While Wyoming has a small population, it has a (relatively) large number of blocks (220,401) in 2010. By comparison, DC has only 8,362 in 2010.
Originally in the Crosswalk Specs
decided / not yet used
Level | Option A: 3 char [initial scheme] | Option B: 2 char | Option C: mix |
---|---|---|---|
Block | blk [in use] | bk | blk |
Block group | bkg | bg | bg |
Block group part | bgp | bp | bgp |
Census tract | trt | tr , ct ? | tr , ct |
County | cty | co | co |
Place | plc ? | pl | pl |
County subdivision | csd ? | cs | cs , csub , cosub ? |
CBSA | cbs , msa ? | cb , ma ? | cb , cbsa , msa ? |
Option A: 3 char | Option B: 2 char | Option C: mix |
---|---|---|
nhgis_blk1990_blk2010_gj | nhgis_bk1990_bk2010_gj | nhgis_blk1990_blk2010_gj |
nhgis_blk1990_blk2010_gj_27 | nhgis_bk1990_bk2010_gj_27 | nhgis_blk1990_blk2010_gj_27 |
nhgis_bgp1990_bkg2010 | nhgis_bp1990_bg2010 | nhgis_bgp1990_bg2010 |
nhgis_bgp1990_trt2010 | nhgis_bp1990_tr2010 | nhgis_bgp1990_tr2010 |
 | nhgis_bp1990_ct2010 | nhgis_bgp1990_ct2010 |
nhgis_bgp1990_cty2010 | nhgis_bp1990_co2010 | nhgis_bgp1990_co2010 |
Option A: FIPS [initial scheme] | Option B: upper case postal |
---|---|
nhgis_blk1990_blk2010_gj_01 | nhgis_blk1990_blk2010_gj_AL |
nhgis_blk1990_blk2010_gj_02 | nhgis_blk1990_blk2010_gj_AK |
nhgis_blk1990_blk2010_gj_27 | nhgis_blk1990_blk2010_gj_MN |
nhgis_blk1990_blk2010_gj_56 | nhgis_blk1990_blk2010_gj_WY |
gitcount, like in spaghetti
Try using handcalcs in the formula cells for the Synthetic Example notebook.
Add links and instructions for downloading NHGIS data. While this is not essential, it fosters good-faith openness for reproducibility.
This should eventually be superseded in v1.0.0 by interoperability with the NHGIS API (#12)
Header in Source Files
In each source file that you've written, place a comment header at the top of the file. [NOTE: The example below uses the comment notation ("#") recognized by most scripting languages (e.g., Bash, Perl, Python, R, Ruby). Adjust that notation accordingly based on the language your code is written in.]
# This file is part of the Minnesota Population Center's {PROJECT TITLE}.
# For copyright and licensing information, see the NOTICE and LICENSE files
# in this project's top-level directory, and also on-line at:
#   https://github.com/mnpopcenter/{REPO-NAME}
Replace the placeholders {PROJECT TITLE} and {REPO-NAME}
Following this decision, all crosswalks created through nhgisxwalk will be keyed on the NHGIS-style GISJOIN ID, not the original census-style GEOID. A converter utility may be provided in the future. This would initially support the conversion from 1990, 2000, and 2010 block IDs, but may subsequently include more geographies.
May still implement a converter utility, but will also have GEOIDs in the same file following this decision. Also, need to create GISJOIN - GEOID crosswalks:
src year gj | trg year gj | trg year ge |
---|---|---|
GJOIN1990 | GJOIN2010 | GEOID2010 |
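A converter of the kind discussed above could be sketched for 2010 blocks as follows. The function name is hypothetical, and the ID layout is an assumption: an 18-character GISJOIN made of "G", the 2-digit state FIPS plus "0", the 3-digit county FIPS plus "0", a 6-digit tract code, and a 4-digit block code. 1990 block IDs are not covered by this sketch.

```python
def blk2010_gisjoin_to_geoid(gisjoin: str) -> str:
    """Convert an assumed 2010 block GISJOIN to a 15-character census GEOID."""
    # Assumed layout: "G" + SS + "0" + CCC + "0" + TTTTTT + BBBB (18 characters).
    if len(gisjoin) != 18 or not gisjoin.startswith("G"):
        raise ValueError(f"unexpected 2010 block GISJOIN: {gisjoin!r}")
    state = gisjoin[1:3]
    county = gisjoin[4:7]
    tract = gisjoin[8:14]
    block = gisjoin[14:18]
    return state + county + tract + block


# Made-up Delaware-style ID, assembled piece by piece for readability.
example_gj = "G" + "10" + "0" + "001" + "0" + "040100" + "1000"
print(blk2010_gisjoin_to_geoid(example_gj))
```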
The only differences I found were rounding differences. (I rounded my weights to 10 decimal digits, just to make sure anything that "should" sum to 1 actually does sum to exactly 1.)
There should be a (default) option to round weights. This can easily be accomplished with pandas.DataFrame.round. It might make sense to put this at the end of GeoCrossWalk.__init__? Or maybe to have it as a stand-alone method?
Example
import pandas

col1 = ["a", "b", "c"]
col2 = ["X", "Y", "Z"]
col3 = [0.123456, 0.99999999999999, 0.6666666666667]
cols = {"col1": col1, "col2": col2, "col3": col3}
df = pandas.DataFrame(cols)
print(df)
#   col1 col2      col3
# 0    a    X  0.123456
# 1    b    Y  1.000000
# 2    c    Z  0.666667
print(df.round(3))
#   col1 col2   col3
# 0    a    X  0.123
# 1    b    Y  1.000
# 2    c    Z  0.667
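If rounding ends up as a stand-alone function, it might look like the sketch below. The name `round_weights`, its default of 10 decimals, and the toy column names are assumptions for illustration, not existing nhgisxwalk API.

```python
import pandas as pd


def round_weights(xwalk: pd.DataFrame, weight_cols: list, decimals: int = 10) -> pd.DataFrame:
    """Return a copy of the crosswalk with only the weight columns rounded."""
    rounded = xwalk.copy()
    rounded[weight_cols] = rounded[weight_cols].round(decimals)
    return rounded


# Toy crosswalk with made-up IDs and weights carrying floating-point noise.
xwalk = pd.DataFrame(
    {
        "source": ["A", "A", "B"],
        "target": ["X", "Y", "Y"],
        "wt_pop": [0.33333333333333337, 0.6666666666666666, 0.99999999999999],
    }
)
rounded = round_weights(xwalk, ["wt_pop"])

# Check how close the weights for each source unit are to 1 after rounding.
sums = rounded.groupby("source")["wt_pop"].sum()
print(sums)
```

Rounding removes noise in individual weights, but it does not by itself guarantee that each group sums to exactly 1, so a tolerance-based check is the safer test.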
Add the following sections to codecov.yml:
codecov:
  notify:
    after_n_builds: 18
...
comment:
  layout: "reach, diff, files"
  behavior: once
  after_n_builds: 18
  require_changes: true