sixianghu / dataman Goto Github PK
View Code? Open in Web Editor NEWR package for data cleaning, preliminary data analysis and modeling assessing with visualisation.
R package for data cleaning, preliminary data analysis and modeling assessing with visualisation.
Nan
and 0
by
parameterWhen populating the missing value for Date
variable, an origin
is needed when the replace
parameter given is an negative number.
test = data.frame(a = c(1,NA),b = c(1,1))
DataSummary(test)
VarName VarType Unique Missing Missing_pct Mean Min Max Entropy
1: a numeric 2 1 0.5 0.5 1 1 0
2: b numeric 1 0 0.0 1 1 1 0
It is related to:
https://github.com/SixiangHu/DataMan/blob/master/R/DataSummary.R#L75
That when x
include NA
s, the sum of weighted is less than sum(weight,na.rm=TRUE)
because there is no NA
in weight
.
Find a way to get rid of the legend in dataPlot
's and modePlot
's legend in hist
plot.
Currently, this is not supported by rbokeh
We may need a weight
variable like other plot functions.
Need a model comparison function for all ML methods
If there is a time variable with class of "POSIXct" or "POSIXt", both of time classes will be outputed, hence in the final tables, errors will jump out saying:
Error in .Method(..., row.names = row.names, check.rows = check.rows, :
arguments imply differing number of rows: 217, 218
rateComp <- function(df,fac,peril,reorder=FALSE){
df_sub <- filter(df,(KEY==fac) & (Peril == peril))
if(reorder){
suppressWarnings(min_x <- min(as.numeric(df_sub$Level1),na.rm = TRUE))
suppressWarnings(max_x <- max(as.numeric(df_sub$Level1),na.rm = TRUE))
suppressWarnings(oth_lvl <- as.character(df_sub$Level1[is.na(as.numeric(df_sub$Level1))]))
df_sub$Level1 <- ordered(factor(df_sub$Level1,levels=c(paste(min_x:max_x),oth_lvl)))
}
df_sub <- df_sub[order(df_sub$Level1),]
plot_ly(df_sub,x=Level1,y=Value.x, xaxis = "x1", yaxis = "y1",name="Model1") %>%
add_trace(df,x=Level1,y=Value.y, xaxis = "x1", yaxis = "y1",name="Model2") %>%
layout(xaxis=list(title=fac),yaxis=list(title=peril))
}
dataComp <- function(old,new,alpha=0.05){
strName <- names(old)
res <- data.frame(Name = NULL, Changes = NULL)
for (i in strName){
if (i %in% names(new)){
x<- chisq.test(table(old[,i]),p=table(new[,i])/sum(table(new[,i])))
if (x$p.value > alpha) {
tmp <-data.frame(Name=i,Changes=FALSE)
res <- rbind(res,tmp)
}
}
}
res
}
This initial model assessing step is to check each factor's ability of prediction based on different metrics in a stepwise analysis:
Because caret
package uses a wrapped predict
function, which requires type
to be one of raw
and prob
.
Hence need to change the code in modelPlot
to support caret
's train
model object.
A structure or template is needed to conduct this model comparison.
A PMML
structure would be a good starting point, even though not all ML package support it.
liftPlot
height and sizeDESCRIPTION
file: move rbokeh
from Depends
to Imports
glyph=25
glyph=24
y_axis
of ly_hist
and ly_bar
to scientific
hover
in dataPlot
, modelPlot
are too simple which cause warnings.by
doesn't look right when the by
variables has been regrouped as interval factor.ordered factor
.Depricated
ly_hist
and ly_bar
cannot plot missing valuerbokeh
from git_hub.onLoad
function to install rbokeh
from github (e.g. https://github.com/trinker/gmailR/blob/master/R/zzz.RWhen there are lots of null
or missing values, DataSummary
should still populate the min and max from non-missing values.
This may not be the final solution, but, can use rgl
or rbokeh
for this purpose.
If it is under rbokeh
, then probably will use hexbin plot ly_hexbin
or image plot ly_image
Need a efficiency on CramersV, which is slow atm.
Possible solution is call C or C++ implementation.
NaN
and 0by
parameterlibrary(RColorBrewer)
modelCompPlot <- function(x,act,pred,by=NULL,weights=NULL,newGroupNum=10){
if (is.null(x)) stop("x provided is blank.")
if (is.null(act)) stop("act provided is blank.")
if (is.null(pred)) stop("pred provided is blank.")
str_pred <- NULL
num_pred <- NULL
if (length(x) != length(act)) stop("x and act don't have the same length")
if (is.vector(pred)){
str_pred <- "pred"
num_pred <- 1
if(length(x) != length(pred)) stop("x and pred don't have the same length")
}
else {
str_pred <- colnames(pred)
if (is.null(str_pred)) str_pred <- paste("Pred",1:nrow(pred),sep="")
num_pred <- ncol(pred)
if(length(x) != nrow(pred)) stop("x and pred don't have the same length")
}
if(is.null(weights)) {weights <- rep(1,length(x))}
#New Group for data which has too much levels.
if ( (is.numeric(x) || is.integer(x)) && nlevels(as.factor(x))>100 ) {
if ( is.null(newGroupNum) ) newGroupNum <- 10
new_band <- seq(min(x, na.rm = TRUE),max(x, na.rm = TRUE),length.out=newGroupNum)
x <- as.character(cut(x,new_band,include.lowest = TRUE))
}
if(!is.null(by)){
data.plot <- data.table::as.data.table(as.data.frame(cbind(xvar=x,by=by,act,pred,weights),stringsAsFactors=FALSE))
setkey(data.plot,xvar,by)
dp_name_str <- c("act",str_pred,"weights")
data.plot <- data.plot[,lapply(.SD,as.numeric),by=list(xvar,by),.SDcols=dp_name_str]
data.agg <- data.plot[,lapply(.SD,weighted.mean,w=weights),by=list(xvar,by),.SDcols=dp_name_str]
p1 <- rbokeh::figure(xlab="x",ylab="",height=500) %>%
rbokeh::ly_lines(xvar,act,color=by,type=list(2),
width=2,data=data.agg)
for (i in num_pred){
p1 <- p1 %>% rbokeh::ly_lines(xvar,deparse(substitute(str_pred[i])),color=by,type=list(2),
width=2,data=data.agg)
}
}
else {
data.plot <- data.table::as.data.table(as.data.frame(cbind(xvar=x,act,pred,weights),stringsAsFactors=FALSE))
setkey(data.plot,xvar)
dp_name_str <- c("act",str_pred,"weights")
data.plot <- data.plot[,lapply(.SD,as.numeric),by=xvar,.SDcols=dp_name_str]
data.agg <- data.plot[,lapply(.SD,weighted.mean,w=weights),by=xvar,.SDcols=dp_name_str]
p1 <- rbokeh::figure(xlab="x",ylab="",width=700,height=800) %>%
rbokeh::ly_lines(xvar,act,color="#CC3399",
width=2,data=data.agg,legend = "Actual")
str_col <- brewer.pal(max(num_pred,3), "Spectral")
for (i in 1:num_pred){
print(str_pred[i])
p1 <- p1 %>% rbokeh::ly_lines(xvar,str_pred[i],color=str_col[i],width=2,data=data.agg,
legend=str_pred[i])
}
}
p1
}
Add #' @useDynLib CramersV
to the CramersV
c++
function to let Roxygen2
to generate the NAMESPACE
automatically.
Done:
networkD3
package for the network plot, or use the DiagrammeR
, igraph
and vivagraph
http://www.buildingwidgets.com/blog/?offset=1435611529118. [#26]Rcpp
or cleaner and elegant coding in RNot Now:
PMML
structure would be a good starting point, even though not all ML packages support it. Or use caret
package straight away.AbnormalDetection
package and changepoint
package.resiPlot
has contour plot for this.)resiPlot
has contour plot for this.)RcppParalle
for CramersV
function (REASON: not urgent need. And parallel overheads may have inverse effect.)Current modelPlot
ignores the actual base of variables in a glm, instead, it recalcualtes the mean or mode of the variable and treat them as the base.
If modeller specified another base and used in glm, the modePlot
may not recognise it, hence may give wrong "model prediction at base level" line.
Can use the networkD3
package for the network plot, or use the DiagrammeR
, igraph
and vivagraph
in interPlot
plot (in README file as well), the order of axis ticktext is not in order.
The issue is that then creating the matrix for plotly
, the column header is not created in order when given factor is not sorted.
A force sort may be needed for variable used.
A more realistic solution is that user define the order before feading into the interPlot
function.
This function can only apply to a model structure with explicit link function.
A drop=FALSE
or with=FALSE
needed in PopMiss
when the input is a data.table.
Because data.frame
and data.table
have different method to drop variables.
dataPlot <- function(data,xvar,yvar,weights=NULL,byvar=NULL,newGroupNum=NULL,interactive=FALSE,...){
opts.list<-list(...)
opts <- names(list(...))
if("xlim" %in% opts) xlim<-opts.list$xlim
if("ylim" %in% opts) ylim<-opts.list$ylim
if("binwidth" %in% opts) binwidth<-opts.list$binwidth else binwidth <- 1
# Error Trapping
if( is.null(data) ) stop("data set provided is null.")
if( is.null(xvar) ) stop("X variable provided is null.")
if( is.null(yvar) ) stop("Responce variable provided is null.")
if (is.character(xvar)) {
if(!xvar %in% colnames(data)) stop(paste("xvar variable (",xvar,") cannot be found.",""))
x <- data[,which(names(data)==xvar)]
xname <- xvar
}
else if (is.integer(xvar)) {
x <- data[,xvar]
xname <- names(data)[xvar]
}
else stop ("xvar provided is either a character (variable name) or integer (position of the variable).")
if (is.character(yvar)) {
if(!yvar %in% colnames(data)) stop(paste("yvar variable (",yvar,") cannot be found.",""))
yname <- yvar
y <- data[,which(names(data)==yvar)]
}
else if (is.integer(yvar)) {
y <- data[,yvar]
yname <- names(data)[yvar]
}
else stop ("yvar provided is either a character (variable name) or integer (position of the variable).")
if( !is.null(byvar) ){
if (is.character(byvar)){
if(!byvar %in% colnames(data) ) stop(paste("xvar variable (",byvar,") cannot be found.",""))
by <- data[,which(names(data)==byvar)]
byname <- byvar
}
else if (is.integer(byvar)) {
xname <- names(data)[byvar]
by <- as.character(data[,byvar])
}
else by <- NULL
}
else by <- NULL
if( !is.null(weights) ){
wname = "w"
if (is.character(weights)){
if(!weights %in% colnames(data) ) stop(paste("xvar variable (",weights,") cannot be found.",""))
w <- data[,which(names(data)==weights)]
}
else if (is.integer(weights) && length(weights)==1) {
w <- data[,weights]
}
else if (is.integer(weights) && length(weights)>1){
if ( dim(data)[1] != length(weights) ) stop ("Length of weights is not the same as dimension of the data provided.")
w <- weights
}
}
else w <- rep(1,dim(data)[1])
#New Group for xvar which has too much levels.
if ( (is.numeric(x) || is.integer(x) ) && nlevels(as.factor(x))>100 ) {
if ( is.null(newGroupNum) ) newGroupNum <- 10
new_band <- seq(min(x),max(x),length.out=newGroupNum)
x <- cut(x,new_band,include.lowest = TRUE)
}
#New Group for byvar which has too much levels.
if(!is.null(by)){
if ( (is.numeric(by) || is.integer(by)) && nlevels(as.factor(by))>100 ) {
if ( is.null(newGroupNum) ) newGroupNum <- 10
new_band <- seq(min(by),max(by),length.out=newGroupNum)
by <- cut(by,new_band,include.lowest = TRUE)
}
}
#Data for plot
if (is.null(by)) {
data.plot <- data.table::as.data.table(as.data.frame(cbind(x=x,y=y,w=w),stringsAsFactors=FALSE))
data.table::setkey(data.plot,x)
data.plot <- data.plot[,lapply(.SD,as.numeric),by=x,.SDcols=c("y","w")]
data.agg <- as.data.frame(data.plot[,lapply(.SD,weighted.mean,w=w),by=x,.SDcols=c("y","w")],row.names=c("xvar","weights","observed"))
data.freq <- as.data.frame(data.plot[,sum(w),by=x][,freq:=V1/sum(V1)])
data.melt <- reshape2::melt(data.agg[,-3],id=c("x"))
#line graph
strV1 <- paste("Observation Analysis on: ",xname)
gLine <- ggplot2::ggplot(data=data.melt,aes(x=x,y=value)) +
ggplot2::geom_line(size=1,colour= "magenta3") + ggplot2::geom_point(size=4,fill="white",shape=22)
if(("xlim" %in% opts) && is.numeric(data.melt$x)) gLine <-gLine + ggplot2::scale_x_continuous(limits=xlim)
else if(("xlim" %in% opts) && !is.numeric(data.melt$x)) gLine <-gLine + ggplot2::scale_x_discrete(limits=xlim)
else if(("xlim" %in% opts) && is(data.melt[,"x"],"Date")) gLine <-gLine + ggplot2::scale_x_date(label=date_format("%y%m"),limits=xlim)
if("ylim" %in% opts) gLine <-gLine + ggplot2::ylim(ylim)
gLine <- gLine + ggplot2::xlab("") + ggplot2::ylab(yname)+ ggplot2::ggtitle(strV1)+ theme_mp_line
if(nlevels(as.factor(data.melt$x))>25) gLine <- gLine + ggplot2::theme(axis.text.x = element_text(angle = 90,hjust=0.5,vjust=0.5))
#histogram graph
ghist <- ggplot2::ggplot(data=data.freq,aes(x=x,y=freq))+
ggplot2::geom_histogram(stat="identity",colour="black",fill="yellow")
if(("xlim" %in% opts) && is.numeric(data.melt$x)) ghist <-ghist + ggplot2::scale_x_continuous(limits=xlim)
else if(("xlim" %in% opts) && !is.numeric(data.melt$x)) ghist <-ghist + ggplot2::scale_x_discrete(limits=xlim)
ghist <- ghist + ggplot2::ylab("percent (%)")+
ggplot2::scale_y_continuous(labels = percent)+
ggplot2::xlab("")+ theme_mp_hist
gridExtra::grid.arrange(gLine,ghist,ncol=1,nrow=2,heights=c(4,1))
if (interactive) {
df <- data.frame(data.agg,freq=data.freq$freq)
gvisSingleOptionList <- list(pointSize=8,
series="[
{targetAxisIndex:0, type:'line',color:'magenta',pointShape: 'square'},
{targetAxisIndex:1, type:'bars',color:'yellow'}]",
crosshair="{trigger:'both'}",
hAxis.title=xname,
theme="maximized",
title=paste0("Observation analysis on ",xname, " Observed"),
vAxes="{1:{format:'##.#%',maxValue:1}}",
explorer="{ actions: ['dragToZoom', 'rightClickToReset'],keepInBounds: true }",
chartArea="{width:'90%',height:'90%'}",
height=750)
plot(googleVis::gvisComboChart(df,xvar="x",yvar="y",options=gvisSingleOptionList))
}
}
else{
data.plot <- data.table::as.data.table(as.data.frame(cbind(x=x,y=y,w=w,by=by),stringsAsFactors=FALSE))
data.table::setkey(data.plot,x,by)
data.plot <- data.plot[,lapply(.SD,as.numeric),by=list(x,by),.SDcols=c("y","w")]
data.agg <- as.data.frame(data.plot[,lapply(.SD,weighted.mean,w=w),by=list(x,by),.SDcols=c("y","w")],row.names=c("xvar","by","weights","observed"))
data.freq <- as.data.frame(data.plot[,sum(w),by=list(x,by)][,freq:=V1/sum(V1)])
#line graph
gLine1 <- ggplot2::ggplot(data=data.agg,aes(x=x,y=y,group=factor(by),colour=factor(by)))+
ggplot2::geom_line(size=1) + ggplot2::geom_point(size=4,fill="white")
if("xlim" %in% opts) gLine1 <-gLine1 + xlim(xlim)
if("ylim" %in% opts) gLine1 <-gLine1 + ylim(ylim)
gLine1 <- gLine1+ggplot2::xlab("")+ggplot2::ylab(yname)+ ggplot2::ggtitle(paste("Observation Analysis on: ",xname," by ",byname))+
theme_mp_line
if(nlevels(as.factor(data.agg$x))>25) gLine1 <- gLine1 + ggplot2::theme(axis.text.x = element_text(angle = 90,hjust=0.5,vjust=0.5))
#histogram graph
ghist <- ggplot2::ggplot(data=data.freq,aes(x=x,y=freq,fill=factor(by)))+ ggplot2::geom_histogram(stat="identity",binwidth=1)
if("xlim" %in% opts) ghist <-ghist + ggplot2::xlim(xlim)
ghist <- ghist + ggplot2::xlab("")+ ggplot2::ylab("percent (%)")+ ggplot2::scale_y_continuous(labels = percent) +
theme_mp_hist+ theme(legend.position="none")
gridExtra::grid.arrange(gLine1,ghist,ncol=1,nrow=2,heights=c(4,1))
}
}
Tried to have an improvement about the speed of cor
function.
Currently, the following code works (with inline
and RcppArmadillo
)
src <- '
mat m1 = as<mat>(x);
return(wrap(cor(m1)));
'
fx <- inline::cxxfunction( signature(x = "numeric") ,
body = src,
plugin='RcppArmadillo',
includes = 'using namespace Rcpp;
using namespace arma;')
It is slightly quicker than cor
:
library(microbenchmark)
x <- as.matrix(mtcars)
microbenchmark(
fx(x),
cor(x)
)
Unit: milliseconds
expr min lq mean median uq max neval cld
fx(x) 11.50953 11.67405 11.96615 11.82167 12.21154 13.20968 100 a
cor(x) 13.92392 14.22428 14.46255 14.37597 14.72041 15.27963 100 b
But on a 300 x 300 matrix, the time spending is the same. In this case, there is no benefit to continue on this?
No max
or mean.or.mode
for characters in a data.table
CramersV forgets to convert non-integer vector to integer.
This issue is in the CramersV.data.frame
.
There are 2 solutions:
1: Use template
in CramersV.cpp
2: Convert all non-integer vector to integer in CramersV.r
in R
Some tests can be used from RegressionTestsInterface
in fRegression
package:
For regression (linear) residual test (because of the assumption that the variance must be constant)
https://cran.r-project.org/web/packages/fRegression/fRegression.pdf
When the class of response variable is character or factor, the positive response may be recognized as negative label.
Please check roc
function in pROC
package for more information (the levels
parameter).
Need more efficiency on DataSummary
, DetMiss
and PopMiss
.
Enhance the data report marrkdown:
formattable
plotly
in rmarkdownSometimes we may want to monitoring the trends:
There should have a function to give this feature. A starting point is the AbnormlDetection
package and changepoint
package.
It would be good to have control over the location of the legend and the max number of bins on the x axis.
I think this is easy to do with the following amendments to the code (I'm only showing lines that need to change with the changes in bold font)
dataPlot <- function(data,xvar,yvar,byvar=NULL,weights=NULL,
newGroupNum=10, legend_x= 0.5, legend_y = 1.0, x_maxBins = 100, y_maxBins = 20)
if ( (is.numeric(x) || is.integer(x) ) && nlevels(as.factor(x))>=x_maxBins ) {
#New Group for byvar if it has too many levels.
if(!is.null(by)){
if ( (is.numeric(by) || is.integer(by)) && nlevels(as.factor(by))>y_maxBins) {
l <- list(bordercolor = "#000000",borderwidth=1,x = legend_x, y = legend_y)
Regards
Nigel
A partition function to split the data into different proportion.
multiParts <- function(y,p){
if(is.null(y)) stop("no inputs")
if(sum(p)<1) warning("Sum of proportion is less than 1,
rest of the data will be dropped.")
if(sum(p)>1) stop("Sum of the proportion is larger than 1.")
rn <- 1:length(y)
np <- length(p)
res <- list()
# Use the createDataPartition from caret package:
# for (i in 1:np){
# ind <- caret::createDataPartition(y=rn,p=p[i])
# res[[i]] <- rn[ind[[1]]]
# rn <- rn[-ind[[1]]]
# p <- p/(sum(p[-c(1:i)]))
# }
# Use sample function instead
for (i in 1:np){
ind <- sample(1:length(rn),size=floor(p[i]*length(rn)))
res[[i]] <- rn[ind]
rn <- rn[-ind]
p <- p/(sum(p[-c(1:i)]))
}
res
}
When there a variable has many unique levels, modelPlot
tries to banded the variable into a default 20 groups
interPlot
method may need a separate solution because of the melting process.x <- c(1,NA,2)
df <- as.data.frame(cbind(x,x,x))
names(df)
PopMiss(df,na.treatment = "replace",0)
[1] "x" "x" "x"
x x x
1 1 1 1
2 0 NA NA
3 2 2 2
CramersV function failed on diamonds
dataset due to variable type Ord.factor
.
Needs to check the second class if there is any. Other wise, drop the variable to make sure the other variables will be analyised.
test = data.frame(a = c("A","NA","NULL","NaN",""),b = c(1,NA,NaN,Inf,-Inf))
DataSummary(test)
VarName VarType Unique Missing Missing_pct Mean Min Max Entropy
1: a factor 5 1 0.2 2 1 2 0
2: b numeric 5 2 0.4 NaN -Inf Inf 0
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.