
opm-match's Introduction

title: OPM Plan B
author: Lars Vilhuber
date: 2/22/2019
output: html_document (keep_md: true, number_sections: true)

Situation

For a variety of reasons, the DUA (Data Use Agreement) between OPM and the Census Bureau has not been finalized for a significant time. This puts dependent data products (QWI, J2J, LODES) at risk. The present document outlines a fallback strategy that may allow such dependent products to continue to be produced, albeit with some quality reductions. The quality compromises are expected to be minimal in the short run but would grow over time. We address some workarounds at the end of this document.

Naming convention

  • OPM(Census) – OPM microdata acquired through DUA (last year of data: 2015)
  • OPM(FOIA[x]) – OPM microdata acquired through FOIA request to OPM (x = Cornell1, Cornell2, Buzzfeed). Time coverage varies
  • OPM(PU) – OPM microdata publicly available at Fedscope.gov
  • ECF(A) – ECF built with dataset A

Availability of Data

/home/ssgprojects/project0002/cdl77/opm-clean/

Locations at Cornell

kable(opmlocs)

  • /data/clean/opm-foia
  • /ssgprojects/project0002/cdl77/opm-clean/outputs/2016
  • /ssgprojects/project0002/cdl77/opm-clean/outputs/buzzfeed
  • /data/clean/opm

Sources:

  • OPM "/data/doc/opm/SRC.txt"
  • Buzzfeed "/data/doc/opm-foia/Buzzfeed-20170524-Were Sharing A Vast Trove Of Federal Payroll Records.pdf"
  • Cornell-FOIA 2013 "/data/doc/opm-foia/20131126154301380.pdf"
  • Cornell-FOIA 2016 "/data/doc/opm-foia/OPM letter FOIA response 201611.pdf"
  • Fedscope "/data/doc/opm/FS_Employment_Sep2011_Documentation.pdf"

Variables

TODO: This still needs the data elements from the internal (Census) data.

The various data sources do not all have the same data elements (full list):

library(readxl)
library(dplyr)
library(knitr)
library(kableExtra)

# Read the variable overview and keep only the columns describing each source
overview <- read_excel("overview.xlsx") %>%
  select("Variable", "Buzzfeed", "Cornell-FOIA 2013", "Cornell FOIA 2016",
         "Fedscope-old", "Fedscope-new")

kable(overview %>% slice(1:10)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
| Variable | Buzzfeed | Cornell-FOIA 2013 | Cornell FOIA 2016 | Fedscope-old | Fedscope-new |
|---|---|---|---|---|---|
| Employee Name | 1 | 0 | 0 | 0 | 0 |
| Pseudo ID | 1 | 2 | 3 | 0 | 0 |
| Agency/Subelement | 1 | 1 | 1 | 1 | 1 |
| Duty Station | 1 | 1 | 1 | 0 | 0 |
| Location (State/Country) | 0 | 0 | 0 | 1 | 1 |
| Age Level | 1 | 1 | 1 | 1 | 1 |
| Sex | 0 | 1 | 0 | 1 | 0 |
| Ethnicity | 0 | 0 | 0 | 1 | 1 |
| Race | 0 | 0 | 0 | 1 | 1 |
| GS-Equivalent Grade | 0 | 1 | 0 | 1 | 1 |

(with another 28 rows not shown)

In particular, 12 variables are common to all public datasets, but key variables are present only on one or two datasets:

| Variable | Buzzfeed | Cornell-FOIA 2013 | Cornell FOIA 2016 | Fedscope-old | Fedscope-new | common |
|---|---|---|---|---|---|---|
| Duty Station | 1 | 1 | 1 | 0 | 0 | 3 |
| Location (State/Country) | 0 | 0 | 0 | 1 | 1 | 2 |
| Sex | 0 | 1 | 0 | 1 | 0 | 2 |
| Ethnicity | 0 | 0 | 0 | 1 | 1 | 2 |
| Race | 0 | 0 | 0 | 1 | 1 | 2 |
| GS-Equivalent Grade | 0 | 1 | 0 | 1 | 1 | 3 |
| Length of Service | 0 | 1 | 1 | 1 | 1 | 4 |
| Average Salary | 0 | 0 | 0 | 1 | 1 | 2 |
| Average Length of Service | 0 | 0 | 0 | 1 | 1 | 2 |
| Employment | 0 | 0 | 0 | 1 | 1 | 2 |
| Supervisory Status | 1 | 0 | 0 | 1 | 1 | 3 |
| Work Status | 0 | 0 | 0 | 1 | 1 | 2 |
| CBSA | 0 | 1 | 1 | 0 | 0 | 2 |
| MSA-RETRO | 0 | 1 | 0 | 1 | 1 | 3 |
| STEM Occupation | 0 | 0 | 0 | 1 | 1 | 2 |
| File Date | 0 | 1 | 1 | 0 | 0 | 2 |
| Start Date | 0 | 1 | 1 | 0 | 0 | 2 |
| End Date | 0 | 1 | 1 | 0 | 0 | 2 |
| Accession | 1 | 0 | 0 | 1 | 1 | 3 |
| Effective Date of Accession | 1 | 0 | 0 | 1 | 1 | 3 |
| Separation | 1 | 0 | 0 | 1 | 1 | 3 |
| Effective Date of Separation | 1 | 0 | 0 | 1 | 1 | 3 |

Data for tabulation purposes

For tabulation purposes, a few key variables are missing from some of the public-use data, which means no single data source is adequate for LEHD purposes:

| Variable | Buzzfeed | Cornell-FOIA 2013 | Cornell FOIA 2016 | Fedscope-old | Fedscope-new | common |
|---|---|---|---|---|---|---|
| Employee Name | 1 | 0 | 0 | 0 | 0 | 1 |
| Duty Station | 1 | 1 | 1 | 0 | 0 | 3 |
| Age Level | 1 | 1 | 1 | 1 | 1 | 5 |
| Sex | 0 | 1 | 0 | 1 | 0 | 2 |
| Ethnicity | 0 | 0 | 0 | 1 | 1 | 2 |
| Race | 0 | 0 | 0 | 1 | 1 | 2 |

Note that the combination of Employee Name, Sex, and Age Level may be sufficient to acquire a PIK within the secure confines of the Census Bureau:

| Variable | Buzzfeed | Cornell-FOIA 2013 | Cornell FOIA 2016 | Fedscope-old | Fedscope-new | common |
|---|---|---|---|---|---|---|
| Employee Name | 1 | 0 | 0 | 0 | 0 | 1 |
| Age Level | 1 | 1 | 1 | 1 | 1 | 5 |
| Sex | 0 | 1 | 0 | 1 | 0 | 2 |

Alternatively, matching the consolidated public-use file to the confidential internal-use file by the same methods will pick up a `PIK` from historical files.
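A minimal sketch of assembling such a matching input from the merged public file is below; the variable and file names are hypothetical, and the name variable would only be populated on Buzzfeed-derived records.

* build the name + demographics input for PIK assignment (sketch only)
use "$data/opm_merged.dta", clear
keep entity_id employee_name sex age_level            // hypothetical variable names
gen name_clean = upper(itrim(trim(employee_name)))    // light standardization of the name
save "$data/pik_search_input.dta", replace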

None of the files contains a residential address, which is required for LODES processing; obtaining it therefore requires the acquisition of a PIK.

The Plan

Step 1: Entity resolution for public data

Using the Chen, Shrivastava, and Steorts (2018) algorithms (fasthash), resolve the combined public data to unique persons, using the common variables as distinguishers. This still requires some work: fasthash estimates the number of unique entities but does not output the entities themselves. This step generates OPM(merged).
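As a placeholder for that step, a crude baseline is to treat records that agree on all common variables as a single candidate entity. The Stata sketch below does exactly that; it is not the fasthash estimator, and the variable and file names are hypothetical.

* baseline sketch only: exact agreement on the common variables,
* not the Chen/Shrivastava/Steorts fasthash estimator
local common agency duty_station age_level occupation grade    // hypothetical variable list
use "$data/opm_public_combined.dta", clear                      // hypothetical combined public file
egen entity_id = group(`common'), missing
duplicates report `common'
save "$data/opm_merged.dta", replace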

Step 2: Repeat entity resolution using private data

We then repeat the process with the private data. This attaches a PIK to most records. Standard LEHD imputation procedures will need to handle the remaining records.

Step 2a: Alternative match

Alternatively, the OPM(merged) file can be matched to OPM(Census) using classical two-file matchers. This does not have the best statistical properties, but may be a feasible workaround.
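For illustration, a first deterministic pass of such a two-file match could look like the following sketch: an exact merge on a block of shared variables. All file and variable names are hypothetical, and a production run would add a probabilistic matcher on top.

* exact-key pass only; a classical probabilistic matcher would refine this
use "$data/opm_census.dta", clear                     // hypothetical confidential extract
keep pik agency age_level sex grade                   // hypothetical key variables
duplicates drop agency age_level sex grade, force
tempfile census
save `census'

use "$data/opm_merged.dta", clear
merge m:1 agency age_level sex grade using `census', keep(master match) generate(_pikmerge)
tab _pikmerge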

Step 2b: Matching to Numident

Note that one possibility is to include the Census Numident in the set of files that are matched against (using a subset of variables), leveraging the demographics available on the OPM(merged) file. However, the match will be less certain, given the paucity of common information.

Scope

For a given end year t of OPM(Census), this will yield at least a t+1 file. OPM(Fedscope-new) is released every quarter. OPM(FOIA-new) can be generated annually at some cost. As the link to OPM(Buzzfeed) and OPM(Census) grows more distant (t+k), match quality will decrease and the share of non-matchable records will increase.

Quality assessment

We would want to leverage the uncertainty in the linkage for tabulation purposes, providing a measure of the uncertainty to the tabulation system (imputed demographics are already carried forward in 10 implicates).
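For instance, a tabulation that carries the 10 implicates through to a between-implicate spread could look like the sketch below; the file and variable names are hypothetical.

* cell counts per implicate, then mean and between-implicate spread per cell
use "$data/opm_tabulation_input.dta", clear           // hypothetical long file with implicate = 1..10
collapse (count) n = pik, by(implicate agency sex)
collapse (mean) n_mean = n (sd) n_spread = n, by(agency sex)
list in 1/10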

![graph](RB 033-2018-08-16 17.12.36.jpg)

opm-match's People

Contributors

cdlin, larsvilhuber


opm-match's Issues

Create config file

  • create a config.do and put all parameters in there (one hard-coded path per directory)
  • create global basedir "/home/ssgprojects/project0002/cdl77/opm-match/"
  • rephrase all the others as a function of $basedir, e.g., global data "$basedir/inputs" (more portable); see the sketch below
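A minimal config.do along the lines of the bullets above might look like this; the directories other than basedir are assumptions about the repository layout.

* config.do -- one place for all parameters
global basedir  "/home/ssgprojects/project0002/cdl77/opm-match"
global data     "$basedir/inputs"
global outputs  "$basedir/outputs"      // hypothetical: adjust to the actual directories
global programs "$basedir/programs"     // hypothetical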

Create a 02 program for descriptive statistics

Create a 02 program that describes the proposed match. In particular,

* flag exact duplicates on the match variables in $varlist1
duplicates tag $varlist1, generate(dups1)
tab dups1

This will show the extent of the duplicates by ID variable in a simpler fashion

More relevant, I want a count of all records that are duplicates, since they count as a type of exact match for this purpose (see the sketch after this list):

  • suppose you have three white men, age 45, occ=A, etc.
  • it doesn't matter who is who: they are the same on all the relevant characteristics.
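A minimal sketch of the requested count, building on the dups1 flag above; the grouping via egen group() is an assumption about how the distinct combinations would be counted.

* records flagged as exact duplicates on $varlist1 count as exact matches here
count if dups1 > 0
display as text "records in exact-duplicate groups: " r(N)

* number of distinct combinations among those records
egen dupgrp1 = group($varlist1) if dups1 > 0
quietly summarize dupgrp1
display as text "distinct duplicate groups:          " r(max)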

Create new variables

Before splitting the files (in 01), create some longitudinal variables, based on the within-file variables

  • length of attachment (tenure) to agency
  • length of attachment to duty station (posting_length)
  • length of occupation (though that SHOULD be the same as tenure)
  • quarter-on-quarter earnings change (or earnings-level change)

and then match/count duplicates on those as well, separately.

Also consider counting exact duplicates when NOT matching on file_date, start_date, and end_date.

(You might coarsen the change variable to integer percent values; see the sketch below.)
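A sketch of these constructions, assuming a long quarterly file with a person identifier entity_id and the variable names below (all hypothetical placeholders):

* spell-based attachment lengths and quarter-on-quarter earnings changes (sketch only)
bysort entity_id (file_date): gen agency_spell = sum(agency != agency[_n-1])
bysort entity_id agency_spell (file_date): gen tenure = _n            // quarters attached to the agency

bysort entity_id (file_date): gen duty_spell = sum(duty_station != duty_station[_n-1])
bysort entity_id duty_spell (file_date): gen posting_length = _n      // quarters at the duty station

bysort entity_id (file_date): gen earn_change = 100 * (earnings - earnings[_n-1]) / earnings[_n-1]
gen earn_change_coarse = round(earn_change)                           // coarsened to integer percent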
