colearendt / tidyjson

Tidy your JSON data in R with tidyjson

License: Other

tidyjson's Introduction

tidyjson

tidyjson provides tools for turning complex JSON into tidy data.

Installation

Get the released version from CRAN:

install.packages("tidyjson")

or the development version from GitHub:

devtools::install_github("colearendt/tidyjson")

Examples

The following example takes a character vector of 500 documents in the worldbank dataset and spreads out all objects.
Each JSON object key gets its own column, with the type inferred, as long as its value is not an array. When recursive = TRUE (the default), spread_all also descends into nested objects and builds column names with the sep parameter (e.g., {"a":{"b":1}} with sep = "." yields a single column, a.b).

library(dplyr)
library(tidyjson)

worldbank %>% spread_all
#> # A tbl_json: 500 x 9 tibble with a "JSON" attribute
#>    ..JSON        docum…¹ board…² closi…³ count…⁴ proje…⁵ regio…⁶ total…⁷ _id.$…⁸
#>    <chr>           <int> <chr>   <chr>   <chr>   <chr>   <chr>     <dbl> <chr>  
#>  1 "{\"_id\":{\…       1 2013-1… 2018-0… Ethiop… Ethiop… Africa   1.3 e8 52b213…
#>  2 "{\"_id\":{\…       2 2013-1… <NA>    Tunisia TN: DT… Middle…  0      52b213…
#>  3 "{\"_id\":{\…       3 2013-1… <NA>    Tuvalu  Tuvalu… East A…  6.06e6 52b213…
#>  4 "{\"_id\":{\…       4 2013-1… <NA>    Yemen,… Gov't … Middle…  0      52b213…
#>  5 "{\"_id\":{\…       5 2013-1… 2019-0… Lesotho Second… Africa   1.31e7 52b213…
#>  6 "{\"_id\":{\…       6 2013-1… <NA>    Kenya   Additi… Africa   1   e7 52b213…
#>  7 "{\"_id\":{\…       7 2013-1… 2019-0… India   Nation… South …  5   e8 52b213…
#>  8 "{\"_id\":{\…       8 2013-1… <NA>    China   China … East A…  0      52b213…
#>  9 "{\"_id\":{\…       9 2013-1… 2018-1… India   Rajast… South …  1.6 e8 52b213…
#> 10 "{\"_id\":{\…      10 2013-1… 2014-1… Morocco MA Acc… Middle…  2   e8 52b213…
#> # … with 490 more rows, and abbreviated variable names ¹document.id,
#> #   ²boardapprovaldate, ³closingdate, ⁴countryshortname, ⁵project_name,
#> #   ⁶regionname, ⁷totalamt, ⁸`_id.$oid`
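
As a smaller illustration of the naming rule, the sketch below spreads one hand-written document (made up for this example, not part of worldbank). The nested key should surface as a.b, and the array under c should be skipped.

```r
library(dplyr)
library(tidyjson)

# Hypothetical document: one nested object and one array
doc <- '{"a": {"b": 1}, "c": [1, 2]}'

# sep = "." joins parent and child key names
flat <- doc %>% spread_all(sep = ".")

# Expect an "a.b" column and no column for the array "c"
names(flat)
```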

Some values in worldbank are arrays, which spread_all does not handle. This example shows how to quickly summarize the top-level structure of a JSON collection:

worldbank %>% gather_object %>% json_types %>% count(name, type)
#> # A tibble: 8 × 3
#>   name                type       n
#>   <chr>               <fct>  <int>
#> 1 _id                 object   500
#> 2 boardapprovaldate   string   500
#> 3 closingdate         string   370
#> 4 countryshortname    string   500
#> 5 majorsector_percent array    500
#> 6 project_name        string   500
#> 7 regionname          string   500
#> 8 totalamt            number   500

To capture the data in the majorsector_percent array, we can use enter_object to enter that array, gather_array to stack its elements, and spread_all to spread the objects it contains into columns.

worldbank %>%
  enter_object(majorsector_percent) %>%
  gather_array %>%
  spread_all %>%
  select(-document.id, -array.index)
#> # A tbl_json: 1,405 x 3 tibble with a "JSON" attribute
#>    ..JSON                  Name                                    Percent
#>    <chr>                   <chr>                                     <dbl>
#>  1 "{\"Name\":\"Educat..." Education                                    46
#>  2 "{\"Name\":\"Educat..." Education                                    26
#>  3 "{\"Name\":\"Public..." Public Administration, Law, and Justice      16
#>  4 "{\"Name\":\"Educat..." Education                                    12
#>  5 "{\"Name\":\"Public..." Public Administration, Law, and Justice      70
#>  6 "{\"Name\":\"Public..." Public Administration, Law, and Justice      30
#>  7 "{\"Name\":\"Transp..." Transportation                              100
#>  8 "{\"Name\":\"Health..." Health and other social services            100
#>  9 "{\"Name\":\"Indust..." Industry and trade                           50
#> 10 "{\"Name\":\"Indust..." Industry and trade                           40
#> # … with 1,395 more rows
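
The same three-step pattern can be tried on a couple of hand-written documents. This is a minimal sketch: the documents and the key names project and parts are invented for illustration, and the second document has an empty array to show that it contributes no rows.

```r
library(dplyr)
library(tidyjson)

# Two hypothetical documents; the second has an empty "parts" array
docs <- c(
  '{"project": "A", "parts": [{"Name": "x", "Percent": 60}, {"Name": "y", "Percent": 40}]}',
  '{"project": "B", "parts": []}'
)

parts <- docs %>%
  spread_values(project = jstring("project")) %>%  # keep a top-level value first
  enter_object("parts") %>%                        # move into the array
  gather_array() %>%                               # one row per array element
  spread_all()                                     # spread each element's keys

parts
```

Capturing project before entering the array keeps that value attached to every stacked row.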

API

Spreading objects into columns

spread_all() for spreading all object values into new columns, with nested objects having concatenated names
spread_values() for specifying a subset of object values to spread into new columns using the jstring(), jinteger(), jdouble() and jlogical() functions. Pass multiple path components to extract data from nested objects (e.g., jstring('a','b')).
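
A short sketch of spread_values on a made-up document follows. It uses jnumber(), which appears in the issues further down this page rather than in the list above, so treat its availability in your installed version as an assumption.

```r
library(dplyr)
library(tidyjson)

# Hypothetical document with a nested object
doc <- '{"name": {"first": "Bob", "last": "Jones"}, "age": 32}'

out <- doc %>%
  spread_values(
    first = jstring("name", "first"),  # nested path: name -> first
    age   = jnumber("age")             # top-level numeric value
  )

out
```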

Object navigation

enter_object() for entering into an object by name, discarding all other JSON (and rows without the corresponding object name) and allowing further operations on the object value
gather_object() for stacking all object name-value pairs by name, expanding the rows of the tbl_json object accordingly

Array navigation

gather_array() for stacking all array values by index, expanding the rows of the tbl_json object accordingly

JSON inspection

json_types() for identifying JSON data types
json_lengths() for computing the length of JSON data (can be larger than 1 for objects and arrays)
json_complexity() for computing the length of the unnested JSON, i.e., how many terminal leaves there are in a complex JSON structure
is_json family of functions for testing the type of JSON data
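
The inspection functions can be chained after gather_object to profile each top-level value. This sketch assumes the length function is exported as json_lengths() and that the default output columns are named name, type and length; the document is invented for the example.

```r
library(dplyr)
library(tidyjson)

# Hypothetical document with a scalar, an array and an object
doc <- '{"a": 1, "b": [1, 2, 3], "c": {"k": "v"}}'

profile <- doc %>%
  gather_object() %>%  # one row per top-level key
  json_types() %>%     # adds a "type" column
  json_lengths()       # adds a "length" column

profile
```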

JSON summarization

json_structure() for creating a single fixed column data.frame that recursively structures arbitrary JSON data
json_schema() for representing the schema of complex JSON, unioned across disparate JSON documents, and collapsing arrays to their most complex type representation

Creating tbl_json objects

as.tbl_json() for converting a string or character vector into a tbl_json object, or for converting a data.frame with a JSON column using the json.column argument
tbl_json() for combining a data.frame and associated list derived from JSON data into a tbl_json object
read_json() for reading JSON data from a file
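
When JSON strings already sit in a data.frame column, the json.column argument mentioned above can be used as sketched here. The data frame and its column names (id, payload) are hypothetical.

```r
library(dplyr)
library(tidyjson)

# Hypothetical data frame with one JSON string per row
df <- data.frame(
  id = c("x", "y"),
  payload = c('{"v": 1}', '{"v": 2}'),
  stringsAsFactors = FALSE
)

tj <- df %>%
  as.tbl_json(json.column = "payload") %>%  # parse the payload column as JSON
  spread_values(v = jnumber("v"))           # pull one value out of each document

tj
```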

Converting tbl_json objects

as.character.tbl_json for converting the JSON attribute of a tbl_json object back into a JSON character string

Included JSON data

commits: commit data for the dplyr repo from the GitHub API
issues: issue data for the dplyr repo from the GitHub API
worldbank: world bank funded projects from jsonstudio
companies: startup company data from jsonstudio

Philosophy

The goal is to turn complex JSON data, which is often represented as nested lists, into tidy data frames that can be more easily manipulated.

Work on a single JSON document, or on a collection of related documents
Create pipelines with %>%, producing code that can be read from left to right
Guarantee the structure of the data produced, even if the input JSON structure changes (with the exception of spread_all)
Work with arbitrarily nested arrays or objects
Handle ‘ragged’ arrays and / or objects (varying lengths by document)
Allow for extraction of data in values or object names
Ensure edge cases are handled correctly (especially empty data)
Integrate seamlessly with dplyr, allowing tbl_json objects to pipe in and out of dplyr verbs where reasonable
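
The structure guarantee can be seen in a small sketch: when a requested key is missing from a document, spread_values still returns one row per document and fills the gap with NA. The two documents below are made up for illustration.

```r
library(dplyr)
library(tidyjson)

# Hypothetical collection: the second document lacks key "a"
docs <- c('{"a": 1}', '{"b": 2}')

out <- docs %>% spread_values(a = jnumber("a"))

# Two rows regardless of input shape; the missing value becomes NA
out
```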

Related Work

tidyjson depends upon:

magrittr for the %>% pipe operator
jsonlite for converting JSON strings into nested lists
purrr for list operators
tidyr for unnesting and spreading

Further, there are other R packages that can be used to better understand JSON data:

listviewer for viewing JSON data interactively

tidyjson's Issues

Undo fixed dependency on purrr for CRAN submission

https://github.com/jeremystan/tidyjson/blob/master/DESCRIPTION#L13

Create json_schema

Should do the following:

Work like json_structure, but aggregate across many documents
Arrays should be collapsed into a union of their structures
Should keep a count of how often each structure appears
Should be able to visualize the result as a graph per the visualizing JSON vignette

Fix "no visible binding for global variable" notes in cmd_check()

Currently getting this

checking R code for possible problems ... NOTE
json_structure: no visible binding for global variable ‘level’
json_structure_arrays: no visible binding for global variable ‘type’
json_structure_arrays: no visible binding for global variable
  ‘document.id’
json_structure_arrays: no visible binding for global variable
  ‘child.id’
json_structure_arrays: no visible binding for global variable ‘level’
json_structure_arrays: no visible binding for global variable
  ‘parent.id’
... 9 lines ...
  ‘parent.id’
json_structure_objects: no visible binding for global variable ‘index’
json_structure_objects: no visible binding for global variable ‘key’
read_json: no visible global function definition for ‘tail’
should_json_structure_expand_more: no visible binding for global
  variable ‘level’
Undefined global functions or variables:
  child.id document.id index key level parent.id tail type
Consider adding
  importFrom("utils", "tail")
to your NAMESPACE file.

I believe this can be solved by avoiding non-standard evaluation, and using the _ version of dplyr functions instead.
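
A different route from the one proposed in this issue is the .data pronoun, which refers to columns explicitly so that the checker does not see a bare global name. This is only a sketch of the idea on a toy table: filter_level is a hypothetical helper, not a tidyjson function, and inside a package .data would also need to be imported from rlang.

```r
library(dplyr)

# Toy table standing in for the internal structure table
nodes <- tibble(level = c(1, 2, 3))

# Hypothetical helper: refers to the column as .data$level
# instead of the bare name `level`
filter_level <- function(d, max_level) {
  dplyr::filter(d, .data$level <= max_level)
}

kept <- filter_level(nodes, 2)
kept
```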

Warn on gather functions on name conflict only if non-default name is specified

Should not throw a warning:

json %>% gather_array %>% gather_array

Should throw a warning:

json %>% gather_array("special") %>% gather_array("special")

spread_all(recursive = FALSE) failing

issues %>% gather_array %>% spread_all(recursive = FALSE)
#> Error in `[.data.frame`(z, , final_columns, drop = FALSE) : 
#>   undefined columns selected

Remove use of `[[` or other operators in code

Last time I submitted to CRAN they complained about these. Import functions from magrittr instead.

Create a vignette to compare tidyjson with purrr

Recreate every key tidyjson verb with purrr.

import the %>% operator

Same as is done in https://github.com/hadley/purrr/blob/master/R/utils.R#L1

See if you can remove all subsequent calls to library(magrittr)

first argument to verbs should not be x

Causes this not to work:

'{"x": 1}' %>% spread_values(x = jstring("x"))
#> Error in UseMethod("as.tbl_json") :
#>  no applicable method for 'as.tbl_json' applied to an object of class "function"

Yet this works:

'{"x": 1}' %>% spread_values(y = jstring("x"))
#>   document.id y
#> 1           1 1

change tbl_json to tidyjson per tibble?

Maybe it makes more sense for the tbl_json object to be a tidyjson object, like tbl_df has moved to tibble?

Set up new travis CI

Set up a new travis CI integration

Allow spread_values functions to work with unquoted paths

The following works:

'{"key": "value"}' %>% spread_values(key = jstring("key"))
#>   document.id   key
#> 1           1 value

but this does not:

'{"key": "value"}' %>% spread_values(key = jstring(key))
#> Error in as_function(.f, ...) : object 'key' not found

Create a programming with tidyjson vignette

Should:

cover use of spread_values versus spread_all
show using is_json_X functions to check inputs

Create a json_complexity function that computes the recursively un-nested length of JSON

Counts the total number of nodes in the JSON.

Create as.character.tbl_json to convert back into a character string

Can wrap around toJSON, but with appropriate arguments.

Should spread_all discard scalar values from associated JSON?

Currently it leaves the JSON as is

'{"a": 1, "b": [1, 2, 3]}' %>% spread_all
#> # A tbl_json: 1 x 2 tibble with a "JSON" attribute
#>    `attr(., "JSON")` document.id     a
#>                <chr>       <int> <dbl>
#> 1 {"a":1,"b":[1,2...           1     1

Perhaps instead it should strip these away:

'{"a": 1, "b": [1, 2, 3]}' %>% spread_all
#> # A tbl_json: 1 x 2 tibble with a "JSON" attribute
#>    `attr(., "JSON")` document.id     a
#>                <chr>       <int> <dbl>
#> 1 {"b":[1,2,3]}                1     1

This makes sense since they are already captured in the tbl_json object, and it will make it easier to see that the next step should be enter_object and then gather_array.

Rename and export determine_types

Using this in the purrr vignette, so should export it. Need to make it clearly different from json_types.

Generate readme from .rmd

E.g., https://github.com/jonocarroll/ggghost/blob/master/README.Rmd

Change spread_values to determine type and have spread_values_<type> for specific types

Ideally, would:

Use unquoted notation, e.g., spread_values(key1) instead of spread_values("key1")
spread_values would determine type automatically (converting NULLs to NAs of appropriate type)
How to handle nested keys, e.g., spread_values(key1, key2) could be two top level keys or key2 nested under key1

Use .null argument in map to handle missing object keys

Should cover package internals as well as purrr vignette.

Throw error on spread_all in case of name conflict

If spread_all generates a name that already exists in the data frame, then throw a meaningful error about the name conflict.

create spread_all to automatically spread all keys

Should work like:

'{"a": 1, "b": "x", "c": true}' %>% spread_all_values

Should not affect the state of the JSON object
Should work with nested objects
Should take a sep argument used to separate key names when objects are nested
Should just ignore arrays automatically
NULLs should be cast to NA

add a plot_json_graph example to readme header

Can be something simple.

Create vignette to visualize JSON structure

Use the companies dataset and visualize the structure of the JSON documents.

Add other badges (CRAN, others?)

Like https://raw.githubusercontent.com/jonocarroll/ggghost/master/README.Rmd

Change gather_keys to gather_object

Deprecate gather_keys with .Deprecated function, see backwards compatibility section in http://r-pkgs.had.co.nz/release.html#undefined

Print tbl_json objects with truncated JSON string

tbl_json objects should print like tbl_df objects, except they should have an additional column at the end, titled something like attr("JSON"), that shows the first N characters of the concise JSON representation of the JSON attribute.

Something like:

document.id key attr("JSON")
----------- --- ------------
1           "a" [1, 2, 3]
2           "b" true
3           "c" {"k1": "value", "k2": [1, 2], "k3...

Create a json_structure function to recursively identify the structure of a document

Every row should correspond to a "node" in the JSON document with a unique ID, and should identify the most recent key used to access the node, its type, length and parent.

Change gather_keys to gather_object to be consistent with gather_array

Perhaps also unify their code.

Try to use map_df in purrr vignette to operate on json directly

e.g.,

json %>%
  at_depth(2, `%||%`, NA) %>%
  map_df(. %$% tibble(name, email_address, number_of_employees, founded_year))

Create plot_json_graph

Should use json_structure and create an igraph object. Initial version of code is in visualization vignette.

spread_values should not coerce types

This should throw an error:

'{"key": "1"}' %>% spread_values(int = jnumber("key"))
#>   document.id int
#> 1           1   1

Update NEWS.md

How to treat nested arrays?

Nested arrays are difficult to work with. For example,

x <- '[[1, 2], 1]' %>% gather_array %>% json_types
x
#>   document.id array.index   type
#> 1           1           1  array
#> 2           1           2 number

At this point, there is no way to gather the next array unless we filter on type == 'array'.

x %>% gather_array("level2")
#> Error in gather_array(., "level2") : 1 records are not arrays
x %>% filter(type == "array") %>% gather_array("level2")
#>   document.id array.index  type level2
#> 1           1           1 array      1
#> 2           1           1 array      2

append_values_number works, but returns NA for the array, and recursive = TRUE doesn't work through the second level array. Further, it could be that the types are mixed.

Increase the number of lines of JSON converted to strings in print.tbl_json

This will be very confusing to users:

companies[1:5] %>% gather_keys %>% filter(is_json_object(.)) %>% gather_keys("key2")
#> # A tbl_json: 15 x 3 tibble with a "JSON" attribute
#>     `attr(., "JSON")` document.id   key            key2
#>                 <chr>       <int> <chr>           <chr>
#> 1  "52cdef7e4bab8b...           1   _id            $oid
#> 2  [[[150,22],"ass...           1 image available_sizes
#> 3                null           1 image     attribution
#> 4  "52cdef7f4bab8b...           2   _id            $oid
#> 5  [[[150,38],"ass...           2 image available_sizes
#> 6                null           2 image     attribution
#> 7  "52cdef7d4bab8b...           3   _id            $oid
#> 8  [[[150,36],"ass...           3 image available_sizes
#> 9                null           3 image     attribution
#> 10 "52cdef7d4bab8b...           4   _id            $oid
#> 11                ...           4 image available_sizes
#> 12                ...           4 image     attribution
#> 13                ...           5   _id            $oid
#> 14                ...           5 image available_sizes
#> 15                ...           5 image     attribution

dplyr::slice isn't filtering JSON appropriately

Filter works:

companies[1:5] %>% as.tbl_json %>% filter(document.id == 1) %>% attr("JSON") %>% length
#> [1] 1

but slice does not:

companies[1:5] %>% as.tbl_json %>% slice(1) %>% attr("JSON") %>% length
#> [1] 5

new <- '[1, 2, 3]' %>% gather_array("num") %>%
  left_join(data_frame(num = 1:3, letters = letters[1:3]), by = "num")

expect_is(new, "tbl_json")

"[[1, 2], [1, 2]]" %>% gather_array %>% gather_array
#> Error: found duplicated column name: array.index
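
One way to sidestep the duplicated-name error is to give each level its own index column, as in the gather_array("special") calls earlier on this page. The sketch below assumes the column name can be passed as the first extra argument and uses the made-up names outer and inner; append_values_number then captures the scalar at each leaf.

```r
library(dplyr)
library(tidyjson)

nested <- "[[1, 2], [3, 4]]" %>%
  gather_array("outer") %>%        # index of each inner array
  gather_array("inner") %>%        # index within each inner array
  append_values_number("value")    # the number found at that position

nested
```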

Use \code and \link in documentation

E.g., from purrr https://github.com/hadley/purrr/blob/master/R/along.R

#' These functions take the idea of \code{\link{seq_along}} and generalise
#' it to creating lists (\code{list_along}) and repeating values
#' (\code{rep_along}).

Try to use unnest in gather_keys and gather_array

json_structure fails with empty object

'{}' %>% json_structure
#> Error: wrong result size (2), expected 0 or 1