vary

Overview

Methods to automate the loading of semi-structured data (ex. user modified files, OCR output) which are reliable enough to form a process around, but vary too much to immediately work with using typical data manipulation packages.

Streamlines a few string methods to correct naming convention differences with fuzzy matching and collapse unstructured text into a tibble by any given break-point string. Also adds utilities to drop entries based on NA thresholds and load files without specifying local paths,

Installation

# install.packages("devtools")
devtools::install_github("ulchc/vary")

Usage

After installation from GitHub, you can load it with:

library(vary)

Using fuzzy_rename() when it is known that the underlying data between two sources is equivalent

# The attributes of data and messy_data are the same, but naming conventions between the sources differ

data
#> # A tibble: 1 × 6
#>   ID    Code  Name  Day    Month Amount
#>   <chr> <chr> <chr> <chr>  <chr> <chr> 
#> 1 5.1.0 222   Book  Friday APR   19.00
messy_data
#> # A tibble: 1 × 6
#>   `Amount $` `Month (MMM) ` `Day of \n the week` `Product\nName` Barcode `ID #`
#>   <chr>      <chr>          <chr>                <chr>           <chr>   <chr> 
#> 1 20.00      MAY            Saturday             Notebook        223     5.1.1

names(data) %in% names(messy_data)
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE

# No names are compatible between sources

names(data) %in% names(fuzzy_rename(messy_data, names(data)))
#> > Fuzzy Matches
#> `ID #` -> `ID`
#> `Barcode` -> `Code`
#> `Product\nName` -> `Name`
#> `Day of \n the week` -> `Day`
#> `Month (MMM) ` -> `Month`
#> `Amount $` -> `Amount`
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE

# fuzzy_rename() will match names and print out the changes

Automatically match, reorder, and combine without making manual adjustments

messy_data %>%
  fuzzy_rename(data) %>%
  select(names(data)) %>%
  rbind(data)
#> > Fuzzy Matches
#> `ID #` -> `ID`
#> `Barcode` -> `Code`
#> `Product\nName` -> `Name`
#> `Day of \n the week` -> `Day`
#> `Month (MMM) ` -> `Month`
#> `Amount $` -> `Amount`
#> # A tibble: 2 × 6
#>   ID    Code  Name     Day      Month Amount
#>   <chr> <chr> <chr>    <chr>    <chr> <chr> 
#> 1 5.1.1 223   Notebook Saturday MAY   20.00 
#> 2 5.1.0 222   Book     Friday   APR   19.00

Using fuzzy_match() to categorize and handle spelling mistakes from OCR text

colors_list
#> [1] "Red"    "Blue"   "Green"  "Yellow" "Violet" "Purple" "Orange"

color_phrases
#> [1] "The sunrise was 'yellovv'"   "There were 'purp/e' flowers"
#> [3] "The fruit was 'orang e'"

colors_mentioned <- fuzzy_match(color_phrases, colors_list)
#> > Fuzzy Matches
#> `The fruit was 'orang e'` -> `Orange`
#> `The sunrise was 'yellovv'` -> `Yellow`
#> `There were 'purp/e' flowers` -> `Purple`

# A message will indicate when there is a large string distance between fuzzy matches

writeLines(paste0("The colors mentioned were: ", paste0(colors_mentioned, collapse = ", ")))
#> The colors mentioned were: Yellow, Purple, Orange

Use which_rows() to filter data with mismatched columns

# Mismatched columns: country < -- > type

mismatched
#> # A tibble: 12 × 4
#>   country      year type           count
#>   <chr>       <int> <chr>          <int>
#> 1 cases        1999 Afghanistan      745
#> 2 population   1999 Afghanistan 19987071
#> 3 Afghanistan  2000 cases           2666
#> 4 population   2000 Afghanistan 20595360
#> 5 Brazil       1999 cases          37737
#> # … with 7 more rows

row_index <-
  which_rows(
    mismatched,
    contain_strings = c("CASES", "2000"),
    all_strings = TRUE,
    case_sensitive = FALSE,
    flatten = TRUE
  )

# Using which_rows() to filter data prior to resolving mismatched attributes

mismatched[row_index, ]
#> # A tibble: 3 × 4
#>   country      year type    count
#>   <chr>       <int> <chr>   <int>
#> 1 Afghanistan  2000 cases    2666
#> 2 cases        2000 Brazil  80488
#> 3 China        2000 cases  213766

# dplyr::filter is, of course, not designed to work under such conditions and would only return 2/3 of the rows

mismatched %>% filter(type == "cases" & year == 2000)
#> # A tibble: 2 × 4
#>   country      year type   count
#>   <chr>       <int> <chr>  <int>
#> 1 Afghanistan  2000 cases   2666
#> 2 China        2000 cases 213766

R Documentation

Use ?vary in R to view a linked list of all functions

ulchc / vary Goto Github PK

vary's Introduction

vary

Overview

Installation

Usage

R Documentation

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent