vary

Overview

Methods to automate the loading of semi-structured data (e.g. user-modified files, OCR output) that is reliable enough to build a process around, but varies too much to work with directly using typical data manipulation packages.

Streamlines a few string methods to correct naming convention differences with fuzzy matching and to collapse unstructured text into a tibble at any given break-point string. Also adds utilities to drop entries based on NA thresholds and to load files without specifying local paths.

Installation

# install.packages("devtools")
devtools::install_github("ulchc/vary")

Usage

After installation from GitHub, you can load it with:

library(vary)

Using fuzzy_rename() when it is known that the underlying data between two sources is equivalent

# The attributes of data and messy_data are the same, but naming conventions between the sources differ
data
#> # A tibble: 1 × 6
#>   ID    Code  Name  Day    Month Amount
#>   <chr> <chr> <chr> <chr>  <chr> <chr> 
#> 1 5.1.0 222   Book  Friday APR   19.00
messy_data
#> # A tibble: 1 × 6
#>   `Amount $` `Month (MMM) ` `Day of \n the week` `Product\nName` Barcode `ID #`
#>   <chr>      <chr>          <chr>                <chr>           <chr>   <chr> 
#> 1 20.00      MAY            Saturday             Notebook        223     5.1.1
names(data) %in% names(messy_data)
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE

# No names are compatible between sources
names(data) %in% names(fuzzy_rename(messy_data, names(data)))
#> > Fuzzy Matches
#> `ID #` -> `ID`
#> `Barcode` -> `Code`
#> `Product\nName` -> `Name`
#> `Day of \n the week` -> `Day`
#> `Month (MMM) ` -> `Month`
#> `Amount $` -> `Amount`
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE

# fuzzy_rename() will match names and print out the changes
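
For intuition about the kind of matching involved, here is a minimal base R sketch built on utils::adist(). It is not vary's implementation: closest_names() is a hypothetical helper, and the cleaning rule (lowercase, keep only letters and digits) is an assumption made for this example.

```r
# Hypothetical helper, NOT part of vary: rename each column of `df` to the
# target it most resembles, using approximate substring distance.
closest_names <- function(df, targets) {
  # Assumed normalization: lowercase, then drop everything but letters and digits
  clean <- function(x) gsub("[^a-z0-9]", "", tolower(x))
  # Rows are targets, columns are current names; partial = TRUE scores each
  # target against its best-matching substring of each current name
  d <- adist(clean(targets), clean(names(df)), partial = TRUE)
  # For every target, the position of the closest current column name
  best <- apply(d, 1, which.min)
  names(df)[best] <- targets
  df
}

# Same column names as messy_data above, rebuilt as a plain data frame
messy <- setNames(
  data.frame("20.00", "MAY", "Saturday", "Notebook", "223", "5.1.1"),
  c("Amount $", "Month (MMM) ", "Day of \n the week", "Product\nName", "Barcode", "ID #")
)

renamed <- closest_names(messy, c("ID", "Code", "Name", "Day", "Month", "Amount"))
names(renamed)
#> [1] "Amount" "Month"  "Day"    "Name"   "Code"   "ID"
```

This sketch does not guard against two targets claiming the same column, which a real implementation would need to handle.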

Automatically match, reorder, and combine without making manual adjustments

library(dplyr) # assumed here for %>% and select()
messy_data %>%
  fuzzy_rename(data) %>%
  select(names(data)) %>%
  rbind(data)
#> > Fuzzy Matches
#> `ID #` -> `ID`
#> `Barcode` -> `Code`
#> `Product\nName` -> `Name`
#> `Day of \n the week` -> `Day`
#> `Month (MMM) ` -> `Month`
#> `Amount $` -> `Amount`
#> # A tibble: 2 × 6
#>   ID    Code  Name     Day      Month Amount
#>   <chr> <chr> <chr>    <chr>    <chr> <chr> 
#> 1 5.1.1 223   Notebook Saturday MAY   20.00 
#> 2 5.1.0 222   Book     Friday   APR   19.00

Using fuzzy_match() to categorize and handle spelling mistakes from OCR text

colors_list
#> [1] "Red"    "Blue"   "Green"  "Yellow" "Violet" "Purple" "Orange"
color_phrases
#> [1] "The sunrise was 'yellovv'"   "There were 'purp/e' flowers"
#> [3] "The fruit was 'orang e'"
colors_mentioned <- fuzzy_match(color_phrases, colors_list)
#> > Fuzzy Matches
#> `The fruit was 'orang e'` -> `Orange`
#> `The sunrise was 'yellovv'` -> `Yellow`
#> `There were 'purp/e' flowers` -> `Purple`

# A message will indicate when there is a large string distance between fuzzy matches
writeLines(paste0("The colors mentioned were: ", paste0(colors_mentioned, collapse = ", ")))
#> The colors mentioned were: Yellow, Purple, Orange
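
A rough base R approximation of this categorization, for readers curious how OCR noise can be tolerated. It assumes approximate substring distance from utils::adist(), scaled by the length of each candidate so that short words such as "Red" do not win on small absolute distances. The scaling rule is an assumption for this sketch, not something taken from vary.

```r
colors_list <- c("Red", "Blue", "Green", "Yellow", "Violet", "Purple", "Orange")
color_phrases <- c(
  "The sunrise was 'yellovv'",
  "There were 'purp/e' flowers",
  "The fruit was 'orang e'"
)

# Rows are colors, columns are phrases; partial = TRUE compares each color
# with its best-matching substring of each phrase
d <- adist(tolower(colors_list), tolower(color_phrases), partial = TRUE)

# Assumed scoring: divide each row by the length of its color name
scaled <- d / nchar(colors_list)

# For every phrase, pick the color with the smallest scaled distance
best <- apply(scaled, 2, which.min)
colors_list[best]
#> [1] "Yellow" "Purple" "Orange"
```

Without the length scaling, "Red" would tie with "Purple" on the second phrase, because "re" occurs in "There" and "were".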

Use which_rows() to filter data with mismatched columns

# Mismatched columns: country <--> type
mismatched
#> # A tibble: 12 × 4
#>   country      year type           count
#>   <chr>       <int> <chr>          <int>
#> 1 cases        1999 Afghanistan      745
#> 2 population   1999 Afghanistan 19987071
#> 3 Afghanistan  2000 cases           2666
#> 4 population   2000 Afghanistan 20595360
#> 5 Brazil       1999 cases          37737
#> # … with 7 more rows
row_index <-
  which_rows(
    mismatched,
    contain_strings = c("CASES", "2000"),
    all_strings = TRUE,
    case_sensitive = FALSE,
    flatten = TRUE
  )
# Using which_rows() to filter data prior to resolving mismatched attributes
mismatched[row_index, ]
#> # A tibble: 3 × 4
#>   country      year type    count
#>   <chr>       <int> <chr>   <int>
#> 1 Afghanistan  2000 cases    2666
#> 2 cases        2000 Brazil  80488
#> 3 China        2000 cases  213766
# dplyr::filter() is not designed to work under such conditions and returns only 2 of the 3 rows
mismatched %>% filter(type == "cases" & year == 2000)
#> # A tibble: 2 × 4
#>   country      year type   count
#>   <chr>       <int> <chr>  <int>
#> 1 Afghanistan  2000 cases   2666
#> 2 China        2000 cases 213766
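
The idea of searching whole rows rather than single columns can be sketched in base R as follows. rows_containing() is a hypothetical helper and not vary's implementation, and mismatched_sample holds only the seven rows printed above rather than the full 12-row tibble.

```r
# The seven rows of `mismatched` that are visible in the output above
mismatched_sample <- data.frame(
  country = c("cases", "population", "Afghanistan", "population", "Brazil", "cases", "China"),
  year    = c(1999L, 1999L, 2000L, 2000L, 1999L, 2000L, 2000L),
  type    = c("Afghanistan", "Afghanistan", "cases", "Afghanistan", "cases", "Brazil", "cases"),
  count   = c(745L, 19987071L, 2666L, 20595360L, 37737L, 80488L, 213766L)
)

# Hypothetical helper: indices of rows in which every string occurs in some column
rows_containing <- function(df, strings, ignore_case = TRUE) {
  # Collapse each row into one string so a hit in any column counts
  flat <- do.call(paste, df)
  if (ignore_case) {
    flat <- tolower(flat)
    strings <- tolower(strings)
  }
  # One logical vector per search string, combined with AND
  hits <- lapply(strings, function(s) grepl(s, flat, fixed = TRUE))
  which(Reduce(`&`, hits))
}

rows_containing(mismatched_sample, c("CASES", "2000"))
#> [1] 3 6 7
```

Because this is plain substring matching, "2000" would also match inside a longer number such as 120003, so a stricter version might match on whole cell values instead.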

R Documentation

Use ?vary in R to view a linked list of all functions
