
njschooldata

a simple interface for accessing NJ DOE school data in R

It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data (Dasu and Johnson 2003). Data preparation is not just a first step, but must be repeated many times over the course of analysis as new problems come to light or new data is collected. -- Hadley Wickham, "Tidy Data"

The State of NJ has been posting raw, fixed width text files with all the assessment results for NJ schools/districts for well over a decade now. That's great!

Unfortunately, those files are a bit of a pain to work with, especially if you're trying to work with multiple grades or multiple years of data: layouts change, file paths aren't consistent, and so on.

There are also Excel files posted with all the data, but they aren't much better - for instance, for NJASK data (the assessment used until the transition to PARCC), for every year / grade combination there are on the order of 5 worksheets/tabs per file... a copy/paste nightmare of epic proportions.

njschooldata attempts to simplify the task of working with NJ education data by providing a concise and consistent interface for reading state files into R. We make heavy use of the tidyverse and aim to create a consistent, pipeable interface into NJ state education data.

Points of Interest

  • For any year/grade combination from 2015-2017, a call to fetch_parcc(end_year, grade_or_subj, subj) will return relevant statewide PARCC data. fetch_all_parcc() will return data for all years, all grades, all subjects.

  • For any year/grade combination from 2004-2014 (before the transition to PARCC/Common Core), a call to fetch_nj_assess(end_year, grade) will return the desired data frame as it appears on the state site, and fetch_nj_assess(end_year, grade, tidy=TRUE) will return a cleaned up version suitable for longitudinal data analysis.

Installation

library("devtools")
devtools::install_github("almartin82/njschooldata")
library(njschooldata)

Usage

Common Core / PARCC era (2015-present)

read in the 2015 grade 7 PARCC ELA data file:

fetch_parcc(end_year = 2015, grade_or_subj = 7, subj = 'ela')

read in the 2016 grade 4 PARCC Math data file:

fetch_parcc(end_year = 2016, grade_or_subj = 4, subj = 'math')

read in the 2017 HS Algebra data file:

fetch_parcc(end_year = 2017, grade_or_subj = 'ALG1', subj = 'math')
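pull PARCC results for every year, grade, and subject in one call, via the fetch_all_parcc() convenience mentioned in Points of Interest above:

fetch_all_parcc()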

Pre-Common Core / NJASK era (2004-2014)

read in the 2010 grade 5 NJASK data file:

fetch_nj_assess(end_year = 2010, grade = 5)

read in the 2007 High School Proficiency Assessment (HSPA) data file:

fetch_nj_assess(end_year = 2007, grade = 11)

read in the 2005 state enrollment data file:

fetch_enr(end_year = 2005)

read in the 2014 HS cohort graduation rate data file (NJ has charmingly named this 'grate'):

fetch_grate(end_year = 2014, tidy = TRUE)

read in the 2002 HS graduate data file:

fetch_grate(end_year = 2002, tidy = TRUE)

Contributing

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Contributions are welcome!

Comments? Questions? Problems? Want to contribute to development? File an issue or send me an email.

Coverage

Anytime a year is passed as a parameter for assessment data, it refers to the 'end_year' -- i.e., the 2014-15 school year is 2015.
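For example, this call returns results for the 2014-15 school year:

fetch_parcc(end_year = 2015, grade_or_subj = 7, subj = 'ela')   # 2014-15 grade 7 ELA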

PARCC data runs from 2015 to the present and covers grades 3-11 in ELA, and grades 3-8 plus the subjects 'ALG1', 'GEO', and 'ALG2' in Math.

NJASK data runs roughly from 2004-2014 (there were a number of revisions to the assessment program over that span, so grade coverage depends on the year; look at valid call for the gory details).

Longitudinal Analysis

The flat files provided by the state are painful to work with: the layout isn't consistent across years or assessments, which makes longitudinal analysis tedious. Here's what the first 50 columns of the 2014 NJASK data file look like:

  [1] "CDS_Code"                                                                              
  [2] "County_Code/DFG/Aggregation_Code"                                                      
  [3] "District_Code"                                                                         
  [4] "School_Code"                                                                           
  [5] "County_Name"                                                                           
  [6] "District_Name"                                                                         
  [7] "School_Name"                                                                           
  [8] "DFG"                                                                                   
  [9] "Special_Needs"                                                                         
 [10] "TOTAL_POPULATION_Number_Enrolled_ELA"                                                  
 [11] "TOTAL_POPULATION_LANGUAGE_ARTS_Number_Not_Present"                                     
 [12] "TOTAL_POPULATION_LANGUAGE_ARTS_Number_of_Voids"                                        
 [13] "TOTAL_POPULATION_LANGUAGE_ARTS_Number_APA"                                             
 [14] "TOTAL_POPULATION_LANGUAGE_ARTS_Number_of_Valid_Scale_Scores"                           
 [15] "TOTAL_POPULATION_LANGUAGE_ARTS_Partially_Proficient_Percentage"                        
 [16] "TOTAL_POPULATION_LANGUAGE_ARTS_Proficient_Percentage"                                  
 [17] "TOTAL_POPULATION_LANGUAGE_ARTS_Advanced_Proficient_Percentage"                         
 [18] "TOTAL_POPULATION_LANGUAGE_ARTS_Scale_Score_Mean"                                       
 [19] "TOTAL_POPULATION_MATHEMATICS_Number_Enrolled_Math"                                   
 [20] "TOTAL_POPULATION_MATHEMATICS_Number_Not_Present"                                       
 [21] "TOTAL_POPULATION_MATHEMATICS_Number_of_Voids"                                        
 [22] "TOTAL_POPULATION_MATHEMATICS_Number_APA"                                               
 [23] "TOTAL_POPULATION_MATHEMATICS_Number_of_Valid_Scale_Scores"                             
 [24] "TOTAL_POPULATION_MATHEMATICS_Partially_Proficient_Percentage"                          
 [25] "TOTAL_POPULATION_MATHEMATICS_Proficient_Percentage"                                    
 [26] "TOTAL_POPULATION_MATHEMATICS_Advanced_Proficient_Percentage"                           
 [27] "TOTAL_POPULATION_MATHEMATICS_Scale_Score_Mean"                                         
 [28] "TOTAL_POPULATION_SCIENCE_Number_Enrolled_Science"                                      
 [29] "TOTAL_POPULATION_SCIENCE_Number_Not_Present"                                           
 [30] "TOTAL_POPULATION_SCIENCE_Number_of_Voids"                                              
 [31] "TOTAL_POPULATION_SCIENCE_Number_APA"                                                   
 [32] "TOTAL_POPULATION_SCIENCE_Number_of_Valid_Scale_Scores"                                 
 [33] "TOTAL_POPULATION_SCIENCE_Partially_Proficient_Percentage"                              
 [34] "TOTAL_POPULATION_SCIENCE_Proficient_Percentage"                                        
 [35] "TOTAL_POPULATION_SCIENCE_Advanced_Proficient_Percentage"                               
 [36] "TOTAL_POPULATION_SCIENCE_Scale_Score_Mean"                                             
 [37] "GENERAL_EDUCATION_Number_Enrolled_ELA"                                                 
 [38] "GENERAL_EDUCATION_LANGUAGE_ARTS_Number_Not_Present"                                    
 [39] "GENERAL_EDUCATION_LANGUAGE_ARTS_Number_of_Voids"                                       
 [40] "GENERAL_EDUCATION_LANGUAGE_ARTS_Number_APA"                                            
 [41] "GENERAL_EDUCATION_LANGUAGE_ARTS_Number_of_Valid_Scale_Scores"                          
 [42] "GENERAL_EDUCATION_LANGUAGE_ARTS_Partially_Proficient_Percentage"                       
 [43] "GENERAL_EDUCATION_LANGUAGE_ARTS_Proficient_Percentage"                                 
 [44] "GENERAL_EDUCATION_LANGUAGE_ARTS_Advanced_Proficient_Percentage"
 [45] "GENERAL_EDUCATION_LANGUAGE_ARTS_Scale_Score_Mean"                                      
 [46] "GENERAL_EDUCATION_MATHEMATICS_Number_Enrolled_Math"                                    
 [47] "GENERAL_EDUCATION_MATHEMATICS_Number_Not_Present"                                      
 [48] "GENERAL_EDUCATION_MATHEMATICS_Number_of_Voids"                                         
 [49] "GENERAL_EDUCATION_MATHEMATICS_Number_APA"                                          
 [50] "GENERAL_EDUCATION_MATHEMATICS_Number_of_Valid_Scale_Scores"    

(and on and on and on, for a grand total of 551 columns.) Aside from the virtue of one row per school, there's not a lot to be said about this format - it violates multiple tidy data principles.

fetch_nj_assess has a tidy parameter that returns a processed version of the assessment results, designed to facilitate longitudinal data analysis. Instead of 500+ columns, a consistent data frame structure is returned; instead of encoding values in column headers, subgroup and test name are stored as variables. This makes the resulting data frame considerably longer (69,960 rows vs. 1,160 rows for a recent NJASK example), but significantly easier to work with.
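A minimal sketch of that workflow, assuming grade 5 NJASK results exist for each requested year (check the coverage notes above before picking a range):

library(purrr)

# one tidied year/grade -- the sample output below comes from a call like this
njask_2011_g5 <- fetch_nj_assess(end_year = 2011, grade = 5, tidy = TRUE)

# stack several years of the same grade for longitudinal analysis;
# the tidy frames share a structure, so they row-bind cleanly
njask_g5 <- map_df(2009:2013, fetch_nj_assess, grade = 5, tidy = TRUE)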

Here's an example of a tidied NJASK data file:

  assess_name testing_year grade county_code district_code school_code district_name
1       NJASK         2011     5          ST            NA          NA              
2       NJASK         2011     5          NS            NA          NA              
3       NJASK         2011     5          SN            NA          NA              
4       NJASK         2011     5          25           100          NA   ASBURY PARK
5       NJASK         2011     5          25           100          40   ASBURY PARK
6       NJASK         2011     5          01           110          NA ATLANTIC CITY
         school_name dfg special_needs         subgroup    assessment number_enrolled
1                     NA            NA total_population language_arts          103759
2                     NA            NA total_population language_arts           83778
3                     NA            NA total_population language_arts           19981
4                     NA            NA total_population language_arts             155
5 BRADLEY ELEMENTARY  NA            NA total_population language_arts              32
6                     NA            NA total_population language_arts             439
  number_not_present number_of_voids number_of_valid_classifications number_apa
1                164          103759                              NA        893
2                113           83778                              NA        696
3                 51           19981                              NA        197
4                  0             155                              NA          0
5                  0              32                              NA          0
6                  1             439                              NA          2
  number_valid_scale_scores partially_proficient proficient advanced_proficient
1                    102320                 39.1       54.8                 6.1
2                     82708                 32.7       60.0                 7.3
3                     19612                 65.9       33.1                 1.0
4                       154                 83.8       16.2                 0.0
5                        31                 80.6       19.4                 0.0
6                       433                 58.4       40.6                 0.9
  scale_score_mean
1            205.0
2            209.3
3            186.8
4            174.7
5            180.8
6            190.9


Issues

1999-2000 NJ enrollment data is screwy

the published dictionary is incomplete -- codes "26", "34", "35", "36", "37", "38", and "25" are missing. Sent an email to NJDOE; am using the prior year's dictionary for these codes as a best guess at their meaning.

get CDS data for every year of the RC

Right now, report card data (SAT, college matriculation, etc.) reports back using county / district / school codes. Every report card includes a lookup table with the school name and district name. We should make it easy for a user to name / identify the schools in the SAT tables.
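A rough sketch of what that could look like (sat_table and rc_names are hypothetical names for a report-card SAT table and its school/district lookup table; the join keys are assumptions):

library(dplyr)

# attach school and district names to the SAT table using the lookup
# table shipped in the same report card file (hypothetical objects)
sat_named <- sat_table %>%
  left_join(rc_names, by = c('county_code', 'district_code', 'school_code'))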

make identifying charter host city a one-shot process

if we're standardizing to a common interface, it should be easy enough to (see the sketch after this list):

  • identify which are district codes
  • identify which are charter (since they sit in their own county)
  • match them to charter_city
  • report / warn if any are unmatched
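A rough sketch of that flow (the object names, the charter county code, and the join key are all assumptions here; charter_city is the lookup table referenced above):

library(dplyr)

# charters sit in their own county code (assumed to be '80' in this sketch)
charters <- df %>%
  filter(county_code == '80')

matched <- left_join(charters, charter_city, by = 'district_code')
unmatched <- anti_join(charters, charter_city, by = 'district_code')

if (nrow(unmatched) > 0) {
  warning(nrow(unmatched), " charter rows have no charter_city match")
}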

Missing schools from charter_city

International Academy of Trenton, Trenton STEM-to-Civics, and Great Futures HS for the Health Sciences are missing from charter_city, and Paterson Arts and Science CS has either changed codes or is coded wrong.

District rank is missing for all years

@almartin82 District rank is missing for all years when using tges.R. Tried to run the following code, but it did not work:

force_indicator_types <- function(df) {
  if ('rk' %in% names(df)) df$rk <- as.integer(str_split(df$rk, "\\|")[[1]][1])
  if ('rksal' %in% names(df)) df$rksal <- as.integer(str_split(df$rksal, "\\|")[[1]][1])
  if ('rrk' %in% names(df)) df$rrk <- as.integer(str_split(df$rrk, "\\|")[[1]][1])
  df
}

y1_df <- force_indicator_types(y1_df)
y2_df <- force_indicator_types(y2_df)
y3_df <- force_indicator_types(y3_df)
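For reference, a vectorized sketch of the same idea -- splitting on the pipe and keeping the integer that precedes it (the 'rank|total'-style column format is an assumption based on the snippet above):

library(stringr)

force_indicator_types <- function(df) {
  # for each rank column present, keep only the portion before the '|', as an integer
  rank_cols <- intersect(c('rk', 'rksal', 'rrk'), names(df))
  for (col in rank_cols) {
    df[[col]] <- as.integer(str_split_fixed(df[[col]], "\\|", 2)[, 1])
  }
  df
}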

error with binding end_year 2018

Here's an example fail:

library(purrr)   # map_df

enr_all <- map_df(
  c(2017:2018),
  fetch_enr
)

I think the problem is in the clean_enr_data function.

Adding the following entry to the enr_types <- list( ) inside clean_enr_data seems to make it work:
'CDS_Code' = 'character'

That is:

clean_enr_data <- function(df) {
  
  enr_types <- list(
    'county_id' = 'character',
    'CDS_Code' = 'character',
    'county_name' = 'character',
    'district_id' = 'character',
    'district_name' = 'character',
    'school_id' = 'character',
    'school_name' = 'character',
    'program_code' = 'character',
    'program_name' = 'character',
    'grade_level' = 'character',
    'white_m' = 'numeric',
    'white_f' = 'numeric',
    'black_m' = 'numeric',
    'black_f' = 'numeric',
    'hispanic_m' = 'numeric',
    'hispanic_f' = 'numeric',
    'asian_m' = 'numeric',
    'asian_f' = 'numeric',
    'native_american_m' = 'numeric',
    'native_american_f' = 'numeric',
    'pacific_islander_m' = 'numeric',
    'pacific_islander_f' = 'numeric',
    'multiracial_m' = 'numeric',
    'multiracial_f' = 'numeric',
    'free_lunch' = 'numeric',
    'reduced_lunch' = 'numeric',
    'lep' = 'numeric',
    'migrant' = 'numeric',
    'row_total' = 'numeric',
    'homeless' = 'numeric',
    'special_ed' = 'numeric',
    'title_1' = 'numeric',
    'end_year' = 'numeric'
  )
  
  df <- as.data.frame(df)
  
  # some old files (e.g., 02-03) have random, unlabeled rows; drop those
  df <- df[nchar(df$county_name) > 0, ]
  
  for (i in 1:ncol(df)) {
    z = enr_types[[names(df)[i]]]
    if (z=='numeric') {
      df[, i] <- as.numeric(df[, i])
    } else if (z=='character') {
      df[, i] <- trim_whitespace(as.character(df[, i]))
      
    }
  }
  
  #make CDS_code
  df$CDS_Code <- paste0(
    stringr::str_pad(df$county_id, width=2, side='left', pad='0'),
    stringr::str_pad(df$district_id, width=4, side='left', pad='0'),
    stringr::str_pad(df$school_id, width=3, side='left', pad='0')
  )
  
  return(df)  
}

some state data isn't parsing

for instance, 2006 g3 HSPA. Why not? Because the ST (statewide) rows are bottom-sorted, so read_fwf's type hinting mistakes the character county_code column for numeric. Ugh.
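One way around that kind of guessing failure is to pin the column types when reading the fixed-width file; a sketch with readr (the path and widths below are placeholders, not the real layout):

library(readr)

# force the id columns to character so bottom-sorted 'ST' rows
# don't trip up the type guesser
dat <- read_fwf(
  'hspa_2006.txt',   # placeholder path
  fwf_widths(c(2, 4, 3), c('county_code', 'district_code', 'school_code')),
  col_types = cols(
    county_code = col_character(),
    district_code = col_character(),
    school_code = col_character()
  )
)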

tidy NJASK data when returned

the flat file is an unwieldy, 550+ column data frame, with lots of important information (subgroup, subject) locked away in column headers. Fix that.

process program codes in enr data

every enrollment file record has a program code; these need to be converted from codes into program names, and the program names need to be cleaned up (there are many different ways of referring to kindergarten, for instance).
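A minimal sketch of that cleanup (the codes and labels here are hypothetical, not the real NJ program code dictionary):

library(dplyr)

enr <- enr %>%
  mutate(
    program_name = case_when(
      program_code == '1'  ~ 'Kindergarten (half day)',   # hypothetical code/label pairs
      program_code == '2'  ~ 'Kindergarten (full day)',
      program_code == '55' ~ 'Grade 5',
      TRUE ~ program_name                                 # leave everything else as-is
    )
  )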

leading zeros

lots of R utilities for reading data use type hinting that converts school / district ids with leading zeros to numeric -- e.g., 0010 ABSECON CITY becomes 10 ABSECON CITY.

left-padding ids is only used to indicate the length of the id (i.e., district ids are 4 digits long), so converting to numeric when possible is fine -- we don't lose any information.
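If the padding does need to come back (e.g., to rebuild a CDS code), stringr::str_pad restores it -- the same approach clean_enr_data uses above:

library(stringr)

district_id <- 10                                  # a numeric read dropped the leading zero
str_pad(district_id, width = 4, side = 'left', pad = '0')
#> [1] "0010"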

hspa data isn't reading correctly

hspa_ex <- standard_assess(2014, 11)

produces

Error in read_tokens_(data, tokenizer, col_specs, col_names, locale_,  : 
  Overlapping specification not supported. Begin offset (0) must be greater than or equal to previous end offset (9)
Called from: read_tokens_(data, tokenizer, col_specs, col_names, locale_, 
    n_max, progress)
