Comments (13)

elinw commented on June 15, 2024

Summary: Build a way to search sample data sets in R packages to identify packages with different characteristics such as the format of the data set (e.g. data frame, matrix, dist, ts) and where appropriate the types of variables (e.g. factor, numeric, ts).

from unconf18.

boshek commented on June 15, 2024

Cool idea! Trying to figure out if I understand correctly. Do you mean something like:

check_dataset(package = "datasets")
# A tibble: 8 x 6
  Package  Item          Title                                                        Rows  Cols Class     
  <chr>    <chr>         <chr>                                                       <int> <int> <chr>     
1 datasets ability.cov   Ability and Intelligence Tests                                  6     8 list      
2 datasets airmiles      Passenger Miles on Commercial US Airlines, 1937-1960           24     2 ts        
3 datasets AirPassengers Monthly Airline Passenger Numbers 1949-1960                   144     2 ts        
4 datasets airquality    New York Air Quality Measurements                             153     6 data.frame
5 datasets anscombe      Anscombe's Quartet of 'Identical' Simple Linear Regressions    11     8 data.frame
6 datasets attenu        The Joyner-Boore Attenuation Data                             182     5 data.frame
7 datasets attitude      The Chatterjee-Price Attitude Data                             30     7 data.frame
8 datasets austres       Quarterly Time Series of the Number of Australian Residents    89     2 ts   

Then you narrow down if you are looking for data.frame, list, etc.? So a function that a) returns a tibble and b) accepts one or more packages as an argument?
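
A minimal sketch of how a function like check_dataset() might be put together, in base R only. This is an assumption about the shape of such a function, not the code behind the output above: it returns a plain data frame instead of a tibble, it assumes the package lazy-loads its data (true for "datasets"), and its Rows and Cols come from NROW() and NCOL(), so they may not match the numbers shown above.

```r
# Hypothetical sketch of a check_dataset()-style helper, base R only.
# Assumes the package lazy-loads its data (true for "datasets").
check_dataset <- function(package = "datasets") {
  res <- as.data.frame(utils::data(package = package)[["results"]],
                       stringsAsFactors = FALSE)
  # Items can look like "beaver1 (beavers)"; keep only the object name.
  item <- sub(" .*$", "", res$Item)
  objs <- lapply(item, function(nm) getExportedValue(package, nm))
  data.frame(
    Package = res$Package,
    Item    = item,
    Title   = res$Title,
    Rows    = vapply(objs, NROW, integer(1)),
    Cols    = vapply(objs, NCOL, integer(1)),
    Class   = vapply(objs, function(x) class(x)[1], character(1)),
    stringsAsFactors = FALSE
  )
}

overview <- check_dataset("datasets")
# Narrow down to one kind of object, e.g. data frames only.
head(overview[overview$Class == "data.frame", ])
```

Filtering on the Class column is then ordinary subsetting, which is probably enough for a first version.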

mpadge commented on June 15, 2024

Concur here too: cool idea! It would also be pretty straightforward to integrate that within flipper. The mooted extension to trawling all /man directories is technically straightforward, and could very easily include functionality to trawl any @docType data to enable those to be text-searched, and to group by return type (@format).

elinw commented on June 15, 2024

Cool, yes, something similar to what @boshek has. I started playing a bit just to see what the complications would be. My basic idea would be to be able to

  1. Search for a data set of a particular type (e.g. data frame, ts, mts, matrix etc)
  2. Be able to search (within data frames I guess) for presence of variables with specific classes.
    So if you take a package name as an argument and get all the information about the data into a tibble, then you'd be able to say: give me all the data frames with a factor.

So this is just a quick script for making a data frame from the core data. I wanted to see what some of the complications would be. The two I hit are spaces plus extra words in the Item field, and objects with multiple classes.

dataset_list <- data(package = "datasets")
datasets_df <- as.data.frame(dataset_list[["results"]], stringsAsFactors = FALSE)
# Item can look like "beaver1 (beavers)"; keep only the object name.
datasets_df$short <- gsub(" .*$", "", datasets_df$Item)

# Create the result columns up front instead of growing them in the loop.
datasets_df$class <- NA_character_
datasets_df$n_classes <- NA_integer_
for (i in seq_len(nrow(datasets_df))) {
  dataset <- get(datasets_df$short[i])
  # Keep the first class name when there is more than one.
  class_names <- class(dataset)
  datasets_df$class[i] <- class_names[1]
  datasets_df$n_classes[i] <- length(class_names)
}

And then something like the below to get the classes but the question would be how to organize the information to make it most useful.
For example maybe something like a set of logical variables: has_numeric, has_factor, has_logical, has_integer, has_character etc.

# Figure out what would work best for people in terms of searching.
# Use the cleaned-up name: Item can contain spaces and extra words, which breaks get().
unlist(lapply(get(datasets_df$short[i]), class))
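
One possible way to organize that information is sketched below: a single row of logical flags per data frame. The helper name variable_flags() and the exact set of flags are assumptions for illustration; ordered factors are counted as factors here, which may or may not be what people want when searching.

```r
# Hypothetical helper: one row of logical flags for a single data frame.
variable_flags <- function(df) {
  # Use the first class of each column, as in the loop above.
  classes <- vapply(df, function(col) class(col)[1], character(1))
  data.frame(
    has_numeric   = any(classes == "numeric"),
    has_integer   = any(classes == "integer"),
    has_factor    = any(classes %in% c("factor", "ordered")),
    has_logical   = any(classes == "logical"),
    has_character = any(classes == "character")
  )
}

flags <- variable_flags(iris)
flags
```

Binding these rows together for every data frame in a package would give a table that can be filtered with ordinary subsetting.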

jtr13 commented on June 15, 2024

Love this idea. Beyond class, it would be helpful to have information about the data types. Often I need several categorical variables, and while I do love the Titanic dataset, some more diversity would be a good thing. When writing exams I searched through the Sleuth3 manual for particular criteria but it was very time-consuming.

noamross commented on June 15, 2024

A helpful starting point might be last year's project examining data packages on CRAN: https://github.com/ropenscilabs/data-packages.

elinw commented on June 15, 2024

@jtr13 That's what I mean by class. There are so few ordered factors! So actually it would be good to know the number of each type, e.g. 3 factors, 2 ordered factors, 5 numeric. I agree that it's the combinations that get really frustrating. When you want a simple example, having to convert types can be a distraction from the main lesson.
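
A rough sketch of the per-type counts described above, assuming a hypothetical helper called count_variable_types(). It pastes all classes of a column together so that ordered factors ("ordered/factor") stay distinct from plain factors; the toy data frame is made up for illustration.

```r
# Hypothetical helper: count the columns of each class in a data frame.
count_variable_types <- function(df) {
  # Collapse multi-class columns so ordered factors stay distinct
  # from plain factors ("ordered/factor" vs "factor").
  classes <- vapply(df, function(col) paste(class(col), collapse = "/"),
                    character(1))
  table(classes)
}

# Made-up example data: one factor, one ordered factor, one numeric.
toy <- data.frame(
  grade  = factor(c("a", "b", "a")),
  size   = factor(c("S", "M", "L"), levels = c("S", "M", "L"), ordered = TRUE),
  weight = c(1.5, 2.0, 3.2)
)
counts <- count_variable_types(toy)
counts
```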

@noamross If packages were documented like that it would be cool and we could definitely include it in a dashboard. We could at least provide a URL for the description (although we can also try to scrape them). The other thing is packages that wrap APIs for accessing data. The main thing is to make it automated.

Then maybe if we have a sense of what is there, that lets us think about what's missing.

jtr13 commented on June 15, 2024

@elinw Got it. When do you use ordered factors? I never do! (This probably isn't the right place to discuss this...)

elinw commented on June 15, 2024

jtr13 commented on June 15, 2024

I've never had to use ordered factors for that kind of data for my purposes (usually visualization). I just order the levels of regular factors.

laderast commented on June 15, 2024

Cool idea! One thought might be that oftentimes, when I'm looking for a teaching dataset, I'm looking for the presence of variable relationships in the data, such as smoking status (categorical) vs. BMI (continuous). So could this be another way of classifying the datasets?
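
As a very rough first cut at that kind of classification, one could flag data frames that contain at least one categorical and at least one continuous variable. The sketch below is only an assumption about how that might look: the helper has_factor_and_numeric() is hypothetical, it treats factors as categorical and any numeric column as continuous, and it says nothing about whether the variables are actually related.

```r
# Hypothetical helper: does a data set pair a categorical variable
# (factor) with a numeric one? Non-data-frames are skipped.
has_factor_and_numeric <- function(x) {
  if (!is.data.frame(x)) return(FALSE)
  any(vapply(x, is.factor, logical(1))) &&
    any(vapply(x, is.numeric, logical(1)))
}

# Apply it to everything in the attached "datasets" package.
items <- sub(" .*$", "", data(package = "datasets")[["results"]][, "Item"])
keep <- vapply(
  items,
  function(nm) has_factor_and_numeric(get(nm, "package:datasets")),
  logical(1)
)
candidates <- items[keep]
head(candidates)
```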

elinw commented on June 15, 2024

Yes so that's what I was trying to say about getting the classes of the variables for the data frames.
https://github.com/elinw/dataestsearch/blob/master/R/datasetsearch.R

It's a concept but not that well coded (loops!!) ... and it doesn't handle getting the variable types for tibbles, but it does work for data frames. I mean, this is just a concept, but if we have a bunch of people we could make it really nice and figure out what is useful.

laderast commented on June 15, 2024

Ah, ok, that makes sense. I did something similar with a shiny workshop in identifying variables from a data.frame so that factor, character, and continuous variables would populate the right dropdowns for any dataset that was loaded into an app. It's the same idea as your code: https://github.com/laderast/gradual_shiny/blob/master/03_observe_update/helper.R
