Giter VIP home page Giter VIP logo

dataman's People

Contributors

sixianghu avatar

Stargazers

 avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

dataman's Issues

Consistency Test

  • Consistency test by time and randno
  • Deal with Nan and 0
  • Necessary of vector in by parameter
  • More efficient

Legend on hist plot

Find a way to get rid of the legend in dataPlot's and modePlot's legend in hist plot.

Currently, this is not supported by rbokeh

Class of Time Variable Cause One More Output to Final Table

If there is a time variable with class of "POSIXct" or "POSIXt", both of time classes will be outputed, hence in the final tables, errors will jump out saying:

Error in .Method(..., row.names = row.names, check.rows = check.rows, :
arguments imply differing number of rows: 217, 218

Features

  • which terms in the model have been affect by new factors and in what way?
  • coefficient
  • changes by level (vis)
  • Is it because of correlation?
  • What this new factor has explained?

Rate Compare

rateComp <- function(df,fac,peril,reorder=FALSE){
  df_sub <- filter(df,(KEY==fac) & (Peril == peril))

  if(reorder){
    suppressWarnings(min_x <- min(as.numeric(df_sub$Level1),na.rm = TRUE))
    suppressWarnings(max_x <- max(as.numeric(df_sub$Level1),na.rm = TRUE))
    suppressWarnings(oth_lvl <- as.character(df_sub$Level1[is.na(as.numeric(df_sub$Level1))]))
    df_sub$Level1 <- ordered(factor(df_sub$Level1,levels=c(paste(min_x:max_x),oth_lvl)))
  }
  df_sub <- df_sub[order(df_sub$Level1),]
  plot_ly(df_sub,x=Level1,y=Value.x, xaxis = "x1", yaxis = "y1",name="Model1") %>%
          add_trace(df,x=Level1,y=Value.y, xaxis = "x1", yaxis = "y1",name="Model2") %>%
    layout(xaxis=list(title=fac),yaxis=list(title=peril))
}

Function to check the consistency of the data

dataComp <- function(old,new,alpha=0.05){
  strName <- names(old)
  res <- data.frame(Name = NULL, Changes = NULL)
  for (i in strName){
    if (i %in% names(new)){
       x<- chisq.test(table(old[,i]),p=table(new[,i])/sum(table(new[,i])))
       if (x$p.value > alpha) {
         tmp <-data.frame(Name=i,Changes=FALSE)
         res <- rbind(res,tmp)
       }
    }
  }
  res
}

Initial model assessing

This initial model assessing step is to check each factor's ability of prediction based on different metrics in a stepwise analysis:

  • - F-Score from partial F-test glm
  • - Variance
  • - Correlation to Response
  • - Correlation feature selection
  • - Gain from quick decision tree / randomForest / xgboost
  • - Information Value

Support `train` in `modelPlot`

Because caret package uses a wrapped predict function, which requires type to be one of raw and prob.

Hence need to change the code in modelPlot to support caret's train model object.

Issues relate to v0.4

  • liftPlot height and size
  • change DESCRIPTION file: move rbokeh from Depends to Imports
  • point glyphs need to be changed as:
    1. Model Prediction at Base Levels glyph=25
    2. Fitted Average glyph=24
  • Residual Assess (AvsE, by factor) (v0.4)
  • lift curve (v0.4)
  • AUC (v0.4)
  • Set y_axis of ly_hist and ly_bar to scientific
  • hover in dataPlot, modelPlot are too simple which cause warnings.
  • order of the legend of by doesn't look right when the by variables has been regrouped as interval factor.
  • Support ordered factor.

Depricated

Min and Max Value Improvement

When there are lots of null or missing values, DataSummary should still populate the min and max from non-missing values.

3D plot for 2-way interactions

This may not be the final solution, but, can use rgl or rbokeh for this purpose.

If it is under rbokeh, then probably will use hexbin plot ly_hexbin or image plot ly_image

Efficiency on CramersV

Need a efficiency on CramersV, which is slow atm.
Possible solution is call C or C++ implementation.

Issue relate to v0.3

  • Consistency test by time and random number (v0.3)
    1. Deal with NaN and 0
    2. Necessary of vector in by parameter
  • CramersV test (v0.3)
  • Data Exploratory Report Template (RMarkdown) (v0.3)
  • residual Plot (v0.3)

Need a prediction compare view

library(RColorBrewer)

modelCompPlot <- function(x,act,pred,by=NULL,weights=NULL,newGroupNum=10){
  if (is.null(x)) stop("x provided is blank.")
  if (is.null(act)) stop("act provided is blank.")
  if (is.null(pred)) stop("pred provided is blank.")

  str_pred <- NULL
  num_pred <- NULL
  if (length(x) != length(act)) stop("x and act don't have the same length")
  if (is.vector(pred)){
    str_pred <- "pred"
    num_pred <- 1
    if(length(x) != length(pred)) stop("x and pred don't have the same length")
  }
  else {
    str_pred <- colnames(pred)
    if (is.null(str_pred)) str_pred <- paste("Pred",1:nrow(pred),sep="")
    num_pred <- ncol(pred)
    if(length(x) != nrow(pred)) stop("x and pred don't have the same length")
  }

  if(is.null(weights)) {weights <- rep(1,length(x))}

  #New Group for data which has too much levels.
  if ( (is.numeric(x) || is.integer(x)) && nlevels(as.factor(x))>100 ) {
    if ( is.null(newGroupNum) ) newGroupNum <- 10

    new_band <- seq(min(x, na.rm = TRUE),max(x, na.rm = TRUE),length.out=newGroupNum)
    x <- as.character(cut(x,new_band,include.lowest = TRUE))
  }

  if(!is.null(by)){
    data.plot <- data.table::as.data.table(as.data.frame(cbind(xvar=x,by=by,act,pred,weights),stringsAsFactors=FALSE))
    setkey(data.plot,xvar,by)

    dp_name_str <- c("act",str_pred,"weights")
    data.plot <- data.plot[,lapply(.SD,as.numeric),by=list(xvar,by),.SDcols=dp_name_str]
    data.agg  <- data.plot[,lapply(.SD,weighted.mean,w=weights),by=list(xvar,by),.SDcols=dp_name_str]

    p1 <- rbokeh::figure(xlab="x",ylab="",height=500) %>% 
      rbokeh::ly_lines(xvar,act,color=by,type=list(2),
                      width=2,data=data.agg) 

    for (i in num_pred){
      p1 <- p1 %>% rbokeh::ly_lines(xvar,deparse(substitute(str_pred[i])),color=by,type=list(2),
                                    width=2,data=data.agg) 
    }
  }
  else {
    data.plot <- data.table::as.data.table(as.data.frame(cbind(xvar=x,act,pred,weights),stringsAsFactors=FALSE))
    setkey(data.plot,xvar)

    dp_name_str <- c("act",str_pred,"weights")
    data.plot <- data.plot[,lapply(.SD,as.numeric),by=xvar,.SDcols=dp_name_str]
    data.agg  <- data.plot[,lapply(.SD,weighted.mean,w=weights),by=xvar,.SDcols=dp_name_str]

    p1 <- rbokeh::figure(xlab="x",ylab="",width=700,height=800) %>% 
      rbokeh::ly_lines(xvar,act,color="#CC3399",
                       width=2,data=data.agg,legend = "Actual") 

    str_col <- brewer.pal(max(num_pred,3), "Spectral")
    for (i in 1:num_pred){
      print(str_pred[i])
      p1 <- p1 %>% rbokeh::ly_lines(xvar,str_pred[i],color=str_col[i],width=2,data=data.agg,
                                    legend=str_pred[i]) 
    }
  }

  p1
}

To Do List

  • boxplot for segmentation outlier
  • residual to tree methods (check variable missing)
  • single Tree visualisation on 2 dimension split, which can compare with different tree methods.
  • Travis.yml update to R template from current c template
  • EDA report
  • Quick Xgboost run for important variable
  • Quick abnormal detection run

Done:

  • Need a function to visualise the interactions or correlations:
    Can use the networkD3 package for the network plot, or use the DiagrammeR , igraph and vivagraph http://www.buildingwidgets.com/blog/?offset=1435611529118. [#26]
  • Code efficient by either Rcpp or cleaner and elegant coding in R
  • different residuals can be used on residual Plot (e.g. student, pearson, cooker's, etc.) [#52]
  • deviance residual vs predicted target value [#52]
  • different residuals can be used on residual Plot (e.g. student, pearson, cooker's, etc.) [#52]

Not Now:

  • Correlation matrix calculation for numerical matrix or data frame. Or find a quicker way for correlation calculation on big data. [#24]
  • Distributed / parallel Calculation
  • Need a model comparison function for all ML methods. A structure or template is needed to conduct this model comparison. A PMML structure would be a good starting point, even though not all ML packages support it. Or use caret package straight away.
  • which terms in the model have been affect by new factors and in what way?
    1. Coefficient
    2. Changes by level (vis)
    3. Is it because of correlation?
    4. What this new factor has explained?
      This is more like feature selection, and may to specific to glm / lm method.
  • Trends analysis:
    1. Within the model to assess the consistency
    2. Between datasets on different times to assess the development
    3. There should have a function to give this feature. A starting point is the AbnormalDetection package and changepoint package.
      More timeseries analysis. Cannot find specific / achievable criteria.
  • historgram of deviance residual (symmetry). (REASON: currentl resiPlot has contour plot for this.)
  • deviance residual (sqrt(weighted deviance) * sign(a-e)). (REASON: #52 provides an open slot for this.)
  • deviance plot for outlier visualisation. (REASON: currentl resiPlot has contour plot for this.)
  • support multinomial analysis. (REASON: not urgent need.)
  • RcppParalle for CramersV function (REASON: not urgent need. And parallel overheads may have inverse effect.)

change of the base of a variable in a glm model may affect the modelPlot

Current modelPlot ignores the actual base of variables in a glm, instead, it recalcualtes the mean or mode of the variable and treat them as the base.

If modeller specified another base and used in glm, the modePlot may not recognise it, hence may give wrong "model prediction at base level" line.

order of the axis ticktext is not in desired order in interPlot

in interPlot plot (in README file as well), the order of axis ticktext is not in order.

The issue is that then creating the matrix for plotly, the column header is not created in order when given factor is not sorted.

A force sort may be needed for variable used.

A more realistic solution is that user define the order before feading into the interPlot function.

view the feature without request of the model

dataPlot <- function(data,xvar,yvar,weights=NULL,byvar=NULL,newGroupNum=NULL,interactive=FALSE,...){

  opts.list<-list(...)
  opts <- names(list(...))
  if("xlim" %in% opts) xlim<-opts.list$xlim
  if("ylim" %in% opts) ylim<-opts.list$ylim
  if("binwidth" %in% opts) binwidth<-opts.list$binwidth else binwidth <- 1

  # Error Trapping
  if( is.null(data) ) stop("data set provided is null.")
  if( is.null(xvar) ) stop("X variable provided is null.") 
  if( is.null(yvar) ) stop("Responce variable provided is null.")

  if (is.character(xvar)) {
    if(!xvar %in% colnames(data)) stop(paste("xvar variable (",xvar,") cannot be found.",""))
    x <- data[,which(names(data)==xvar)]
    xname <- xvar
  }
  else if (is.integer(xvar)) { 
    x <- data[,xvar]
    xname <- names(data)[xvar]
  }
  else stop ("xvar provided is either a character (variable name) or integer (position of the variable).")

  if (is.character(yvar)) {
    if(!yvar %in% colnames(data)) stop(paste("yvar variable (",yvar,") cannot be found.",""))
    yname <- yvar
    y <- data[,which(names(data)==yvar)]
  }
  else if (is.integer(yvar)) {
    y <- data[,yvar]
    yname <- names(data)[yvar]
  }
  else stop ("yvar provided is either a character (variable name) or integer (position of the variable).")

  if( !is.null(byvar) ){
    if (is.character(byvar)){
      if(!byvar %in% colnames(data) ) stop(paste("xvar variable (",byvar,") cannot be found.",""))
      by <- data[,which(names(data)==byvar)]
      byname <- byvar
    }
    else if (is.integer(byvar)) {
      xname <- names(data)[byvar]
      by <- as.character(data[,byvar])
    }
    else by <- NULL
  }
  else by <- NULL

  if( !is.null(weights) ){
    wname = "w"
    if (is.character(weights)){
      if(!weights %in% colnames(data) ) stop(paste("xvar variable (",weights,") cannot be found.",""))
      w <- data[,which(names(data)==weights)]
    }
    else if (is.integer(weights) && length(weights)==1) {
      w <- data[,weights]
    }
    else if (is.integer(weights) && length(weights)>1){
      if ( dim(data)[1] != length(weights) ) stop ("Length of weights is not the same as dimension of the data provided.")
      w <- weights
    }
  }
  else w <- rep(1,dim(data)[1])

  #New Group for xvar which has too much levels.
  if ( (is.numeric(x) || is.integer(x) ) && nlevels(as.factor(x))>100 ) {
    if ( is.null(newGroupNum) ) newGroupNum <- 10

    new_band <- seq(min(x),max(x),length.out=newGroupNum)
    x <- cut(x,new_band,include.lowest = TRUE)
  }

  #New Group for byvar which has too much levels.
  if(!is.null(by)){
    if ( (is.numeric(by) || is.integer(by)) && nlevels(as.factor(by))>100 ) {
      if ( is.null(newGroupNum) ) newGroupNum <- 10

      new_band <- seq(min(by),max(by),length.out=newGroupNum)
      by <- cut(by,new_band,include.lowest = TRUE)
    }
  }

  #Data for plot
  if (is.null(by)) {
    data.plot <- data.table::as.data.table(as.data.frame(cbind(x=x,y=y,w=w),stringsAsFactors=FALSE))
    data.table::setkey(data.plot,x)

    data.plot <- data.plot[,lapply(.SD,as.numeric),by=x,.SDcols=c("y","w")]
    data.agg <- as.data.frame(data.plot[,lapply(.SD,weighted.mean,w=w),by=x,.SDcols=c("y","w")],row.names=c("xvar","weights","observed"))
    data.freq <- as.data.frame(data.plot[,sum(w),by=x][,freq:=V1/sum(V1)])

    data.melt <- reshape2::melt(data.agg[,-3],id=c("x"))

    #line graph

    strV1 <- paste("Observation Analysis on: ",xname)

    gLine <- ggplot2::ggplot(data=data.melt,aes(x=x,y=value)) + 
      ggplot2::geom_line(size=1,colour= "magenta3") + ggplot2::geom_point(size=4,fill="white",shape=22)
    if(("xlim" %in% opts) && is.numeric(data.melt$x)) gLine <-gLine + ggplot2::scale_x_continuous(limits=xlim)
    else if(("xlim" %in% opts) && !is.numeric(data.melt$x)) gLine <-gLine + ggplot2::scale_x_discrete(limits=xlim)
    else if(("xlim" %in% opts) && is(data.melt[,"x"],"Date")) gLine <-gLine + ggplot2::scale_x_date(label=date_format("%y%m"),limits=xlim)
    if("ylim" %in% opts) gLine <-gLine + ggplot2::ylim(ylim)
    gLine <- gLine + ggplot2::xlab("") + ggplot2::ylab(yname)+ ggplot2::ggtitle(strV1)+ theme_mp_line
    if(nlevels(as.factor(data.melt$x))>25) gLine <- gLine + ggplot2::theme(axis.text.x = element_text(angle = 90,hjust=0.5,vjust=0.5))

    #histogram graph
    ghist <- ggplot2::ggplot(data=data.freq,aes(x=x,y=freq))+
      ggplot2::geom_histogram(stat="identity",colour="black",fill="yellow")
    if(("xlim" %in% opts) && is.numeric(data.melt$x)) ghist <-ghist + ggplot2::scale_x_continuous(limits=xlim)
    else if(("xlim" %in% opts) && !is.numeric(data.melt$x)) ghist <-ghist + ggplot2::scale_x_discrete(limits=xlim)
    ghist <- ghist + ggplot2::ylab("percent (%)")+
      ggplot2::scale_y_continuous(labels = percent)+
      ggplot2::xlab("")+ theme_mp_hist

    gridExtra::grid.arrange(gLine,ghist,ncol=1,nrow=2,heights=c(4,1))

    if (interactive) {
      df <- data.frame(data.agg,freq=data.freq$freq)
      gvisSingleOptionList <- list(pointSize=8,
                                   series="[
                                   {targetAxisIndex:0, type:'line',color:'magenta',pointShape: 'square'},
                                   {targetAxisIndex:1, type:'bars',color:'yellow'}]",
                                   crosshair="{trigger:'both'}",
                                   hAxis.title=xname,
                                   theme="maximized",
                                   title=paste0("Observation analysis on ",xname, " Observed"),
                                   vAxes="{1:{format:'##.#%',maxValue:1}}",
                                   explorer="{ actions: ['dragToZoom', 'rightClickToReset'],keepInBounds: true }",
                                   chartArea="{width:'90%',height:'90%'}",
                                   height=750)

      plot(googleVis::gvisComboChart(df,xvar="x",yvar="y",options=gvisSingleOptionList))

    }
  }
  else{
    data.plot <- data.table::as.data.table(as.data.frame(cbind(x=x,y=y,w=w,by=by),stringsAsFactors=FALSE))
    data.table::setkey(data.plot,x,by)

    data.plot <- data.plot[,lapply(.SD,as.numeric),by=list(x,by),.SDcols=c("y","w")]

    data.agg <- as.data.frame(data.plot[,lapply(.SD,weighted.mean,w=w),by=list(x,by),.SDcols=c("y","w")],row.names=c("xvar","by","weights","observed"))

    data.freq <- as.data.frame(data.plot[,sum(w),by=list(x,by)][,freq:=V1/sum(V1)])

    #line graph
    gLine1 <- ggplot2::ggplot(data=data.agg,aes(x=x,y=y,group=factor(by),colour=factor(by)))+
      ggplot2::geom_line(size=1) + ggplot2::geom_point(size=4,fill="white")
    if("xlim" %in% opts) gLine1 <-gLine1 + xlim(xlim)
    if("ylim" %in% opts) gLine1 <-gLine1 + ylim(ylim)
    gLine1 <- gLine1+ggplot2::xlab("")+ggplot2::ylab(yname)+ ggplot2::ggtitle(paste("Observation Analysis on: ",xname," by ",byname))+
      theme_mp_line
    if(nlevels(as.factor(data.agg$x))>25) gLine1 <- gLine1 + ggplot2::theme(axis.text.x = element_text(angle = 90,hjust=0.5,vjust=0.5))

    #histogram graph
    ghist <- ggplot2::ggplot(data=data.freq,aes(x=x,y=freq,fill=factor(by)))+ ggplot2::geom_histogram(stat="identity",binwidth=1)
    if("xlim" %in% opts) ghist <-ghist + ggplot2::xlim(xlim)
    ghist <- ghist + ggplot2::xlab("")+ ggplot2::ylab("percent (%)")+ ggplot2::scale_y_continuous(labels = percent) + 
      theme_mp_hist+ theme(legend.position="none")

    gridExtra::grid.arrange(gLine1,ghist,ncol=1,nrow=2,heights=c(4,1))
  }
}

correlation calculation on numerical dataset

Tried to have an improvement about the speed of cor function.

Currently, the following code works (with inline and RcppArmadillo)

src <- '
  mat m1 = as<mat>(x);
  return(wrap(cor(m1)));
  '

  fx <- inline::cxxfunction( signature(x = "numeric") , 
                     body = src, 
                     plugin='RcppArmadillo', 
                     includes = 'using namespace Rcpp;
                                 using namespace arma;')

It is slightly quicker than cor:

library(microbenchmark)
x <- as.matrix(mtcars)
microbenchmark(
  fx(x),
  cor(x)
)
Unit: milliseconds
   expr      min       lq     mean   median       uq      max neval cld
  fx(x) 11.50953 11.67405 11.96615 11.82167 12.21154 13.20968   100  a 
 cor(x) 13.92392 14.22428 14.46255 14.37597 14.72041 15.27963   100   b

But on a 300 x 300 matrix, the time spending is the same. In this case, there is no benefit to continue on this?

Model tests

Some tests can be used from RegressionTestsInterface in fRegression package:

  • "bg" Breusch–Godfrey test for higher order serial correlation,
  • "dw" Durbin–Watson test for autocorrelation of disturbances,
  • "harv" Harvey–Collier test for linearity,
  • "rain" Rainbow test for linearity, and
  • "reset" Ramsey’s RESET test for functional relation.

For regression (linear) residual test (because of the assumption that the variance must be constant)

https://cran.r-project.org/web/packages/fRegression/fRegression.pdf

[rocPlot]In some cases, the ROC is 1-ROC

When the class of response variable is character or factor, the positive response may be recognized as negative label.

Please check roc function in pROC package for more information (the levels parameter).

Data Report Markdown

Enhance the data report marrkdown:

  • using formattable
  • plotly in rmarkdown
  • parametise the rmarkdown

Detecting Trends

Sometimes we may want to monitoring the trends:

  • within the model to assess the consistency
  • between datasets on different times to assess the development

There should have a function to give this feature. A starting point is the AbnormlDetection package and changepoint package.

Suggestions for improvements to the DataPlot function

It would be good to have control over the location of the legend and the max number of bins on the x axis.

I think this is easy to do with the following amendments to the code (I'm only showing lines that need to change with the changes in bold font)

dataPlot <- function(data,xvar,yvar,byvar=NULL,weights=NULL,
newGroupNum=10, legend_x= 0.5, legend_y = 1.0, x_maxBins = 100, y_maxBins = 20)

if ( (is.numeric(x) || is.integer(x) ) && nlevels(as.factor(x))>=x_maxBins ) {

#New Group for byvar if it has too many levels.
if(!is.null(by)){
if ( (is.numeric(by) || is.integer(by)) && nlevels(as.factor(by))>y_maxBins) {

l <- list(bordercolor = "#000000",borderwidth=1,x = legend_x, y = legend_y)

Regards

Nigel

A data partition function to split data into multi parts

A partition function to split the data into different proportion.

multiParts <- function(y,p){
  if(is.null(y)) stop("no inputs")

  if(sum(p)<1) warning("Sum of proportion is less than 1, 
                       rest of the data will be dropped.")

  if(sum(p)>1) stop("Sum of the proportion is larger than 1.")

  rn <- 1:length(y)
  np <- length(p)

  res <- list()

#   Use the createDataPartition from caret package: 
#   for (i in 1:np){
#     ind <- caret::createDataPartition(y=rn,p=p[i])
#     res[[i]] <- rn[ind[[1]]]
#     rn <- rn[-ind[[1]]]
#     p <- p/(sum(p[-c(1:i)]))
#   }


#   Use sample function instead
  for (i in 1:np){
    ind <- sample(1:length(rn),size=floor(p[i]*length(rn)))
    res[[i]] <- rn[ind]
    rn <- rn[-ind]
    p <- p/(sum(p[-c(1:i)]))
  }  
  res
}

A nice cutter to set a proper range for plotting

When there a variable has many unique levels, modelPlot tries to banded the variable into a default 20 groups

  • Will this method can be improved?
  • interPlot method may need a separate solution because of the melting process.

CramersV on user defined class

CramersV function failed on diamonds dataset due to variable type Ord.factor.

Needs to check the second class if there is any. Other wise, drop the variable to make sure the other variables will be analyised.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.