
opm-match's Introduction

title: OPM Plan B
author: Lars Vilhuber
date: 2/22/2019
output: html_document (keep_md: true, number_sections: true)

Situation

For a variety of reasons, the DUA (Data Use Agreement) between OPM and the Census Bureau has not been finalized for a significant time. This puts dependent data products (QWI, J2J, LODES) at risk. The present document outlines a fallback strategy that may allow such dependent products to continue to be produced, albeit with some quality reductions. The quality compromises are expected to be minimal in the short run but would grow over time. We address some workarounds at the end of this document.

Naming convention

  • OPM(Census) – OPM microdata acquired through DUA (last year of data: 2015)
  • OPM(FOIA[x]) – OPM microdata acquired through FOIA request to OPM (x = Cornell1, Cornell2, Buzzfeed). Time coverage varies
  • OPM(PU) – OPM microdata publicly available at Fedscope.gov
  • ECF(A) – ECF built with dataset A

Availability of Data

/home/ssgprojects/project0002/cdl77/opm-clean/

Locations at Cornell

kable(opmlocs)

  • /data/clean/opm-foia
  • /ssgprojects/project0002/cdl77/opm-clean/outputs/2016
  • /ssgprojects/project0002/cdl77/opm-clean/outputs/buzzfeed
  • /data/clean/opm

Sources:

  • OPM "/data/doc/opm/SRC.txt"
  • Buzzfeed "/data/doc/opm-foia/Buzzfeed-20170524-Were Sharing A Vast Trove Of Federal Payroll Records.pdf"
  • Cornell-FOIA 2013 "/data/doc/opm-foia/20131126154301380.pdf"
  • Cornell-FOIA 2016 "/data/doc/opm-foia/OPM letter FOIA response 201611.pdf"
  • Fedscope "/data/doc/opm/FS_Employment_Sep2011_Documentation.pdf"

Variables

TODO: This still needs the data elements from the internal (Census) data.

The various data sources do not all have the same data elements (full list):

library(readxl)
library(dplyr)
library(knitr)
library(kableExtra)

# Read the variable overview and keep only the columns describing each source
overview <- read_excel("overview.xlsx") %>%
  select("Variable", "Buzzfeed", "Cornell-FOIA 2013", "Cornell FOIA 2016",
         "Fedscope-old", "Fedscope-new")

kable(overview %>% slice(1:10)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
| Variable | Buzzfeed | Cornell-FOIA 2013 | Cornell FOIA 2016 | Fedscope-old | Fedscope-new |
|---|---|---|---|---|---|
| Employee Name | 1 | 0 | 0 | 0 | 0 |
| Pseudo ID | 1 | 2 | 3 | 0 | 0 |
| Agency/Subelement | 1 | 1 | 1 | 1 | 1 |
| Duty Station | 1 | 1 | 1 | 0 | 0 |
| Location (State/Country) | 0 | 0 | 0 | 1 | 1 |
| Age Level | 1 | 1 | 1 | 1 | 1 |
| Sex | 0 | 1 | 0 | 1 | 0 |
| Ethnicity | 0 | 0 | 0 | 1 | 1 |
| Race | 0 | 0 | 0 | 1 | 1 |
| GS-Equivalent Grade | 0 | 1 | 0 | 1 | 1 |

(with another 28 rows not shown)

In particular, 12 variables are common to all public datasets, but key variables are present only on one or two datasets:

| Variable | Buzzfeed | Cornell-FOIA 2013 | Cornell FOIA 2016 | Fedscope-old | Fedscope-new | common |
|---|---|---|---|---|---|---|
| Duty Station | 1 | 1 | 1 | 0 | 0 | 3 |
| Location (State/Country) | 0 | 0 | 0 | 1 | 1 | 2 |
| Sex | 0 | 1 | 0 | 1 | 0 | 2 |
| Ethnicity | 0 | 0 | 0 | 1 | 1 | 2 |
| Race | 0 | 0 | 0 | 1 | 1 | 2 |
| GS-Equivalent Grade | 0 | 1 | 0 | 1 | 1 | 3 |
| Length of Service | 0 | 1 | 1 | 1 | 1 | 4 |
| Average Salary | 0 | 0 | 0 | 1 | 1 | 2 |
| Average Length of Service | 0 | 0 | 0 | 1 | 1 | 2 |
| Employment | 0 | 0 | 0 | 1 | 1 | 2 |
| Supervisory Status | 1 | 0 | 0 | 1 | 1 | 3 |
| Work Status | 0 | 0 | 0 | 1 | 1 | 2 |
| CBSA | 0 | 1 | 1 | 0 | 0 | 2 |
| MSA-RETRO | 0 | 1 | 0 | 1 | 1 | 3 |
| STEM Occupation | 0 | 0 | 0 | 1 | 1 | 2 |
| File Date | 0 | 1 | 1 | 0 | 0 | 2 |
| Start Date | 0 | 1 | 1 | 0 | 0 | 2 |
| End Date | 0 | 1 | 1 | 0 | 0 | 2 |
| Accession | 1 | 0 | 0 | 1 | 1 | 3 |
| Effective Date of Accession | 1 | 0 | 0 | 1 | 1 | 3 |
| Separation | 1 | 0 | 0 | 1 | 1 | 3 |
| Effective Date of Separation | 1 | 0 | 0 | 1 | 1 | 3 |

Data for tabulation purposes

For tabulation purposes, a few key variables are missing from some of the public-use data, which means no single data source is adequate for LEHD purposes:

| Variable | Buzzfeed | Cornell-FOIA 2013 | Cornell FOIA 2016 | Fedscope-old | Fedscope-new | common |
|---|---|---|---|---|---|---|
| Employee Name | 1 | 0 | 0 | 0 | 0 | 1 |
| Duty Station | 1 | 1 | 1 | 0 | 0 | 3 |
| Age Level | 1 | 1 | 1 | 1 | 1 | 5 |
| Sex | 0 | 1 | 0 | 1 | 0 | 2 |
| Ethnicity | 0 | 0 | 0 | 1 | 1 | 2 |
| Race | 0 | 0 | 0 | 1 | 1 | 2 |

Note that the combination of Employee Name, Sex, and Age Level may be sufficient to acquire a PIK within the secure confines of the Census Bureau:

| Variable | Buzzfeed | Cornell-FOIA 2013 | Cornell FOIA 2016 | Fedscope-old | Fedscope-new | common |
|---|---|---|---|---|---|---|
| Employee Name | 1 | 0 | 0 | 0 | 0 | 1 |
| Age Level | 1 | 1 | 1 | 1 | 1 | 5 |
| Sex | 0 | 1 | 0 | 1 | 0 | 2 |

Alternatively, matching the consolidated public-use file to the confidential internal-use file by the same methods will pick up a `PIK` from historical files.
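A minimal sketch of assembling such a matching input from the merged public file is below; the variable and file names are hypothetical, and the name variable would only be populated on Buzzfeed-derived records.

* build the name + demographics input for PIK assignment (sketch only)
use "$data/opm_merged.dta", clear
keep entity_id employee_name sex age_level            // hypothetical variable names
gen name_clean = upper(itrim(trim(employee_name)))    // light standardization of the name
save "$data/pik_search_input.dta", replace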

None of the files contains a residential address, which is required for LODES processing; obtaining it therefore requires the acquisition of a PIK.

The Plan

Step 1: Entity resolution for public data

Using the Chen, Shrivastava, and Steorts (2018) algorithms (fasthash), resolve the combined public data to unique persons, using the common variables as distinguishers. This still requires some work: fasthash estimates the number of unique entities but does not output the entities themselves. This step generates OPM(merged).
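As a placeholder for that step, a crude baseline is to treat records that agree on all common variables as a single candidate entity. The Stata sketch below does exactly that; it is not the fasthash estimator, and the variable and file names are hypothetical.

* baseline sketch only: exact agreement on the common variables,
* not the Chen/Shrivastava/Steorts fasthash estimator
local common agency duty_station age_level occupation grade    // hypothetical variable list
use "$data/opm_public_combined.dta", clear                      // hypothetical combined public file
egen entity_id = group(`common'), missing
duplicates report `common'
save "$data/opm_merged.dta", replace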

Step 2: Repeat entity resolution using private data

We then repeat the process with the private data. This attaches a PIK to most records. Standard LEHD imputation procedures will need to handle the remaining records.

Step 2a: Alternative match

Alternatively, the OPM(merged) file can be matched to OPM(Census) using classical two-file matchers. This does not have the best statistical properties, but may be a feasible workaround.
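For illustration, a first deterministic pass of such a two-file match could look like the following sketch: an exact merge on a block of shared variables. All file and variable names are hypothetical, and a production run would add a probabilistic matcher on top.

* exact-key pass only; a classical probabilistic matcher would refine this
use "$data/opm_census.dta", clear                     // hypothetical confidential extract
keep pik agency age_level sex grade                   // hypothetical key variables
duplicates drop agency age_level sex grade, force
tempfile census
save `census'

use "$data/opm_merged.dta", clear
merge m:1 agency age_level sex grade using `census', keep(master match) generate(_pikmerge)
tab _pikmerge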

Step 2b: Matching to Numident

Note that one possibility is to include the Census Numident in the set of files that are matched against (using a subset of variables), leveraging the demographics available on the OPM(merged) file. However, the match will be less certain, given the paucity of common information.

Scope

For a given end year t of OPM(Census), this will yield at least a t+1 file. OPM(Fedscope-new) is released every quarter. OPM(FOIA-new) can be generated annually at some cost. As the link to OPM(Buzzfeed) and OPM(Census) grows more distant (t+k), match quality will decrease and the share of non-matchable records will increase.

Quality assessment

We would want to leverage the uncertainty in the linkage for tabulation purposes, providing a measure of the uncertainty to the tabulation system (imputed demographics are already carried forward in 10 implicates).
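For instance, a tabulation that carries the 10 implicates through to a between-implicate spread could look like the sketch below; the file and variable names are hypothetical.

* cell counts per implicate, then mean and between-implicate spread per cell
use "$data/opm_tabulation_input.dta", clear           // hypothetical long file with implicate = 1..10
collapse (count) n = pik, by(implicate agency sex)
collapse (mean) n_mean = n (sd) n_spread = n, by(agency sex)
list in 1/10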

![graph](RB 033-2018-08-16 17.12.36.jpg)

opm-match's People

Contributors

cdlin, larsvilhuber


opm-match's Issues

Create config file

  • create a config.do and put all parameters in there (one hard-coded path per directory)
  • create global basedir "/home/ssgprojects/project0002/cdl77/opm-match/"
  • rephrase all the others as a function of $basedir, e.g., global data "$basedir/inputs" (more portable); see the sketch below
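A minimal config.do along the lines of the bullets above might look like this; the directories other than basedir are assumptions about the repository layout.

* config.do -- one place for all parameters
global basedir  "/home/ssgprojects/project0002/cdl77/opm-match"
global data     "$basedir/inputs"
global outputs  "$basedir/outputs"      // hypothetical: adjust to the actual directories
global programs "$basedir/programs"     // hypothetical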

Create a 02 program for descriptive statistics

Create a 02 program that describes the proposed match. In particular,

* flag exact duplicates on the match variables in $varlist1
duplicates tag $varlist1, generate(dups1)
tab dups1

This will show the extent of the duplicates by ID variable in a simpler fashion

More relevant, I want a count of all records that are duplicates, since they count as a type of exact match for this purpose (see the sketch after this list):

  • suppose you have three white men, age 45, occ=A, etc.
  • it doesn't matter who is who: they are the same on all the relevant characteristics.
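A minimal sketch of the requested count, building on the dups1 flag above; the grouping via egen group() is an assumption about how the distinct combinations would be counted.

* records flagged as exact duplicates on $varlist1 count as exact matches here
count if dups1 > 0
display as text "records in exact-duplicate groups: " r(N)

* number of distinct combinations among those records
egen dupgrp1 = group($varlist1) if dups1 > 0
quietly summarize dupgrp1
display as text "distinct duplicate groups:          " r(max)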

Create new variables

Before splitting the files (in 01), create some longitudinal variables, based on the within-file variables

  • length of attachment (tenure) to agency
  • length of attachment to duty station (posting_length)
  • length of occupation (though that SHOULD be the same as tenure)
  • quarter-on-quarter earnings change (or earnings-level change)

and then match/count duplicates on those as well, separately.

Also consider counting exact duplicates when NOT matching on file_date, start_date, and end_date.

(You might coarsen the change variable to integer percent values; see the sketch below.)
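A sketch of these constructions, assuming a long quarterly file with a person identifier entity_id and the variable names below (all hypothetical placeholders):

* spell-based attachment lengths and quarter-on-quarter earnings changes (sketch only)
bysort entity_id (file_date): gen agency_spell = sum(agency != agency[_n-1])
bysort entity_id agency_spell (file_date): gen tenure = _n            // quarters attached to the agency

bysort entity_id (file_date): gen duty_spell = sum(duty_station != duty_station[_n-1])
bysort entity_id duty_spell (file_date): gen posting_length = _n      // quarters at the duty station

bysort entity_id (file_date): gen earn_change = 100 * (earnings - earnings[_n-1]) / earnings[_n-1]
gen earn_change_coarse = round(earn_change)                           // coarsened to integer percent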
