onehot's Introduction

Onehot package

Installation

devtools::install_github("https://github.com/Zelazny7/onehot")

Usage

set.seed(100)
test <- data.frame(
  factor    = factor(sample(c(NA, letters[1:3]), 100, replace = TRUE)),
  integer   = as.integer(runif(100) * 10),
  real      = rnorm(100),
  logical   = sample(c(TRUE, FALSE), 100, replace = TRUE),
  character = sample(letters, 100, replace = TRUE),
  stringsAsFactors = FALSE)

head(test)

##   factor integer       real logical character
## 1      a       3 -0.3329234   FALSE         f
## 2      a       3  1.3631137    TRUE         t
## 3      b       0 -0.4691473    TRUE         h
## 4   <NA>       3  0.8428756    TRUE         k
## 5      a       5 -1.4579937   FALSE         k
## 6      a       6 -0.4003059   FALSE         l

Create a onehot object

A onehot object contains information about the data.frame. It is used to transform a data.frame into a onehot encoded matrix. Save it so that future datasets can be transformed into exactly the same layout.

library(onehot)

## Loading required package: Matrix

encoder <- onehot(test)

## Warning: Variables excluded for having levels > max_levels: character

## print a summary
encoder

## Onehot Specification
## |-   1 Factors  => 4 Indicators 
## |-   3 Numerics => (NA <- -999)

Transforming data.frames

The onehot object has a predict method that transforms a data.frame. Factors are onehot encoded; character variables are skipped. However, calling predict with stringsAsFactors=TRUE will convert character vectors to factors first.

train_data <- predict(encoder, test)
head(train_data)

##      factor_a factor_b factor_c factor_NA integer       real logical
## [1,]        1        0        0         0       3 -0.3329234       0
## [2,]        1        0        0         0       3  1.3631137       1
## [3,]        0        1        0         0       0 -0.4691473       1
## [4,]        0        0        0         1       3  0.8428756       1
## [5,]        1        0        0         0       5 -1.4579937       0
## [6,]        1        0        0         0       6 -0.4003059       0

NA indicator columns

add_NA_factors=TRUE (the default) will create an NA indicator column for every factor column. If NA is already a factor level, an indicator column is created for it even without this option.

encoder <- onehot(test, add_NA_factors=TRUE)

## Warning: Variables excluded for having levels > max_levels: character

train_data <- predict(encoder, test)
head(train_data)

##      factor_a factor_b factor_c factor_NA integer       real logical
## [1,]        1        0        0         0       3 -0.3329234       0
## [2,]        1        0        0         0       3  1.3631137       1
## [3,]        0        1        0         0       0 -0.4691473       1
## [4,]        0        0        0         1       3  0.8428756       1
## [5,]        1        0        0         0       5 -1.4579937       0
## [6,]        1        0        0         0       6 -0.4003059       0
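
To make the idea of an NA indicator column concrete, here is a minimal base R sketch of the encoding for a single factor. It is an illustration only, not the package's implementation: encode_factor() is a hypothetical helper, and it assumes NA is not already one of the factor's levels.

```r
# Base R sketch: one-hot encode a factor and append an NA indicator column.
# encode_factor() is a hypothetical helper, not part of the onehot package.
encode_factor <- function(x, name) {
  stopifnot(is.factor(x))
  lv <- levels(x)
  # One 0/1 column per level; rows where x is NA get 0 in all of them.
  out <- vapply(lv, function(l) as.integer(!is.na(x) & x == l),
                integer(length(x)))
  # Extra column that flags the missing values.
  out <- cbind(out, as.integer(is.na(x)))
  colnames(out) <- paste(name, c(lv, "NA"), sep = "_")
  out
}

f <- factor(c("a", NA, "b", "a"))
m <- encode_factor(f, "factor")
m
```

Every row ends up with exactly one 1, so a missing value is represented explicitly instead of as an all-zero row.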

Sentinel values for numeric columns

The sentinel=VALUE argument replaces all numeric NAs with the provided value. Some ML algorithms, such as randomForest and xgboost, do not handle NA values. With a sentinel value, however, such algorithms can usually separate the missing values given enough decision-tree splits. The default value is -999.
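
The effect of a sentinel can be sketched in base R as follows. This only illustrates the idea and is not how the package does it internally; replace_numeric_na() is a hypothetical helper, and unlike the summary shown earlier it treats only integer and double columns as numeric.

```r
# Base R sketch: replace NA in numeric columns with a sentinel value.
# replace_numeric_na() is a hypothetical helper, not part of the onehot package.
replace_numeric_na <- function(df, sentinel = -999) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- sentinel
    }
  }
  df
}

d <- data.frame(
  x = c(1.5, NA, 3),
  n = c(NA, 2L, 3L),
  s = c("a", NA, "c"),
  stringsAsFactors = FALSE)

filled <- replace_numeric_na(d)
filled
```

The sentinel should lie well outside the observed range of every numeric column, so that a tree split can isolate the formerly missing rows.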

Sparse Matrices

onehot also supports predicting sparse, column-compressed matrices from the Matrix package:

encoder <- onehot(test)

## Warning: Variables excluded for having levels > max_levels: character

train_data <- predict(encoder, test, sparse=TRUE)
head(train_data)

## 6 x 7 sparse Matrix of class "dgCMatrix"
##      factor_a factor_b factor_c factor_NA integer       real logical
## [1,]        1        .        .         .       3 -0.3329234       .
## [2,]        1        .        .         .       3  1.3631137       1
## [3,]        .        1        .         .       . -0.4691473       1
## [4,]        .        .        .         1       3  0.8428756       1
## [5,]        1        .        .         .       5 -1.4579937       .
## [6,]        1        .        .         .       6 -0.4003059       .
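
Why sparse output matters can be seen with a small sketch that uses only the Matrix package. The data here is made up for illustration: a single factor with 50 levels over 1000 rows, so each row of the indicator matrix contains exactly one non-zero entry.

```r
library(Matrix)

set.seed(1)
n_rows   <- 1000
n_levels <- 50

# Simulated level index for each row (illustrative data, not from the README).
level_idx <- sample(n_levels, n_rows, replace = TRUE)

# Dense indicator matrix: one column per level, a single 1 per row.
dense <- matrix(0, nrow = n_rows, ncol = n_levels)
dense[cbind(seq_len(n_rows), level_idx)] <- 1

# Same content stored in compressed sparse column form.
sparse <- Matrix(dense, sparse = TRUE)

class(sparse)
object.size(dense)
object.size(sparse)
```

Because only the non-zero entries are stored, the sparse version takes far less memory, and the gap grows with the number of factor levels.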

onehot's Issues

Does not work with NA?

Hi,

Perhaps I did something wrong, but it seems that it does not work with NAs...

Here is what I have done:

install.packages("onehot")
library("onehot")

Am I installing the correct package? This is the package from CRAN.

getOption("na.action")
[1] "na.omit"

Create a simple data frame

number = data.frame(x = c(NA, NA, "apple", "apple", "banana", "banana", "peer", "peer"),
y = c("a", "a", "b", "b", "c", "c", NA, NA))
encode = onehot(number, addNA = TRUE, max_levels = 5)
number = as.data.frame(predict(encode, number))

See the outcome

number

x=apple x=banana x=peer x=NA y=a y=b y=c y=NA
0 0 0 0 1 0 0 0
0 0 0 0 1 0 0 0
1 0 0 0 0 1 0 0
1 0 0 0 0 1 0 0
0 1 0 0 0 0 1 0
0 1 0 0 0 0 1 0
0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 0

Note that in column 4 (i.e., "x=NA"), the values are all 0, whereas the first two rows should be 1 because they correspond to NAs.

Thank you in advance!

Best regards,
Sihan

Check for singularity?

Does it make sense to check for and remove linear combinations in the expanded matrix? Scenarios such as the coincidence of missing values should probably be condensed to a single column. In some cases I don't think it makes a difference but if the modeling algorithm relies on non-singularity or incorporates randomization of covariates, it should have an impact.
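
One way such a check could look, sketched in base R on a hand-built indicator matrix: the two NA columns coincide, and each factor's indicators sum to a column of ones, so the matrix is rank deficient. This is only an illustration of the idea using qr(); it is not something the package is known to do.

```r
# Hand-built indicator matrix for two factors whose NAs always coincide.
X <- cbind(
  a_x  = c(1, 0, 1, 0, 0, 0),
  a_y  = c(0, 1, 0, 1, 0, 0),
  a_NA = c(0, 0, 0, 0, 1, 1),
  b_u  = c(1, 1, 0, 0, 0, 0),
  b_v  = c(0, 0, 1, 1, 0, 0),
  b_NA = c(0, 0, 0, 0, 1, 1)
)

# The QR decomposition reports the numerical rank of X.
decomp <- qr(X)
is_singular <- decomp$rank < ncol(X)

# Keep only the pivot columns that form a linearly independent set.
keep <- sort(decomp$pivot[seq_len(decomp$rank)])
X_reduced <- X[, keep, drop = FALSE]

is_singular
colnames(X_reduced)
```

Which of the dependent columns gets dropped depends on column order, so the column set that was kept would need to be stored alongside the encoder to treat future data the same way.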

predict crash on row count > 10000 and col count = 396

I have a modestly wide data set with 10 categorical features; the total number of columns after the one-hot transformation is 396. The predict function crashes R in this use case.

I am not sure where the problem is or what the root cause might be. Training works on data of a similar size. Has this package been through any stress testing?