Datasets search about unconf18 HOT 13 OPEN

ropensci commented on June 15, 2024 6

Datasets search

from unconf18.

Comments (13)

elinw commented on June 15, 2024 1

Summary: Build a way to search the sample data sets in R packages to identify data sets (and the packages that contain them) with particular characteristics, such as the format of the data set (e.g. data frame, matrix, dist, ts) and, where appropriate, the types of variables (e.g. factor, numeric, ts).

boshek commented on June 15, 2024

Cool idea! Trying to figure out if I understand correctly. Do you mean something like:

check_dataset(package = "datasets")
# A tibble: 8 x 6
  Package  Item          Title                                                        Rows  Cols Class     
  <chr>    <chr>         <chr>                                                       <int> <int> <chr>     
1 datasets ability.cov   Ability and Intelligence Tests                                  6     8 list      
2 datasets airmiles      Passenger Miles on Commercial US Airlines, 1937-1960           24     2 ts        
3 datasets AirPassengers Monthly Airline Passenger Numbers 1949-1960                   144     2 ts        
4 datasets airquality    New York Air Quality Measurements                             153     6 data.frame
5 datasets anscombe      Anscombe's Quartet of 'Identical' Simple Linear Regressions    11     8 data.frame
6 datasets attenu        The Joyner-Boore Attenuation Data                             182     5 data.frame
7 datasets attitude      The Chatterjee-Price Attitude Data                             30     7 data.frame
8 datasets austres       Quarterly Time Series of the Number of Australian Residents    89     2 ts

Then you narrow down depending on whether you are looking for a data.frame, list, etc.? So a function that a) returns a tibble and b) accepts one or more packages as an argument?
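
A minimal sketch of what such a function might look like, using only base R. The name check_dataset() is taken from the example above, but this body is an assumption and not the code that produced that output. It returns a plain data frame rather than a tibble, and it assumes the packages are installed and lazy-load their data so that getExportedValue() can find each object. Rows and Cols come from NROW() and NCOL(), so they may differ from the table above for non-rectangular objects.

```r
# Hypothetical sketch of check_dataset(); not the original implementation.
check_dataset <- function(package = "datasets") {
  # data() returns a character matrix with columns Package, LibPath, Item, Title
  results <- data(package = package)[["results"]]
  pkg_names <- unname(results[, "Package"])
  # Item can look like "beaver1 (beavers)"; keep only the object name
  object_names <- sub(" .*$", "", unname(results[, "Item"]))
  # Assumes lazy-loaded data, which getExportedValue() can look up
  objects <- unname(Map(getExportedValue, pkg_names, object_names))
  data.frame(
    Package = pkg_names,
    Item = object_names,
    Title = unname(results[, "Title"]),
    Rows = vapply(objects, NROW, integer(1)),
    Cols = vapply(objects, NCOL, integer(1)),
    # Report only the first class when an object has several
    Class = vapply(objects, function(x) class(x)[1], character(1)),
    stringsAsFactors = FALSE
  )
}

all_datasets <- check_dataset("datasets")
# Narrow down to data frames only
data_frames <- all_datasets[all_datasets$Class == "data.frame", ]
```

Narrowing down is then ordinary subsetting on the Class column.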

mpadge commented on June 15, 2024

Concur here too: cool idea! It would also be pretty straightforward to integrate that within flipper. The mooted extension to trawling all /man directories is technically straightforward, and could very easily include functionality to trawl any @docType data entries so that they can be text-searched and grouped by return type (@format).
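
As a rough illustration of the idea (not flipper's code), the parsed help pages of an installed package can be read with tools::Rd_db() and filtered on the \docType and \format sections, which is what @docType and @format become in the generated Rd files. The helper rd_section() is a hypothetical name introduced here, and the sketch works on an installed package rather than on a /man directory.

```r
# Parsed Rd help pages of an installed package, one list element per page
rd_db <- tools::Rd_db("datasets")

# Hypothetical helper: text of the first top-level section with the given
# Rd tag, or NA when the page has no such section.
rd_section <- function(rd, tag) {
  tags <- vapply(rd, function(el) {
    el_tag <- attr(el, "Rd_tag")
    if (is.null(el_tag)) "" else el_tag
  }, character(1))
  hits <- unclass(rd)[tags == tag]
  if (length(hits) == 0) return(NA_character_)
  trimws(paste(unlist(hits[[1]]), collapse = " "))
}

# Keep the pages documented as data
doc_types <- vapply(rd_db, rd_section, character(1), tag = "\\docType")
data_pages <- rd_db[!is.na(doc_types) & doc_types == "data"]

# The free-text format descriptions can then be text-searched
formats <- vapply(data_pages, rd_section, character(1), tag = "\\format")
mentions_data_frame <- names(formats)[grepl("data frame", formats, ignore.case = TRUE)]
```

Because \format is free text, grouping on it would probably need some normalisation first.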

elinw commented on June 15, 2024

Cool, yes, something similar to what @boshek has. I started playing a bit just to see what the complications would be. My basic idea would be to be able to:

Search for a data set of a particular type (e.g. data frame, ts, mts, matrix etc.).
Search (within data frames, I guess) for the presence of variables with specific classes.
So if you take a package name as an argument and get all the information about the data into a tibble, you'd then be able to say "give me all the data frames with a factor".

So this is just a quick script for making a data frame from the core data. I wanted to see what some of the complications would be, and they are that the Item field can contain spaces plus extra words, and that an object can have multiple classes.

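# List the data sets in the datasets package as a data frame
dataset_list <- data(package = "datasets")
datasets_df <- as.data.frame(dataset_list[["results"]], stringsAsFactors = FALSE)
# Item can contain extra words, e.g. "beaver1 (beavers)"; keep the object name only
datasets_df$short <- gsub(" .*$", "", datasets_df$Item)

# Preallocate the columns filled in by the loop
datasets_df$class <- NA_character_
datasets_df$n_classes <- NA_integer_
for (i in seq_len(nrow(datasets_df))) {
  class_name <- class(get(datasets_df$short[i]))
  # Keep the first class name when there is more than one
  datasets_df$class[i] <- class_name[1]
  datasets_df$n_classes[i] <- length(class_name)
}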

Then something like the line below gets the classes of the variables, but the question is how to organize the information to make it most useful.
For example, maybe a set of logical variables: has_numeric, has_factor, has_logical, has_integer, has_character etc.

# Figure out what would work best for people in terms of searching
# (uses the cleaned name in short, since Item can contain extra words)
unlist(lapply(get(datasets_df$short[i]), class))
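
One way the has_* idea could be sketched is with one predicate per flag. The function name variable_flags() and the exact set of flags are assumptions for illustration. Note that is.numeric() is TRUE for integer columns as well as doubles, so has_numeric and has_integer can both be TRUE.

```r
# Hypothetical helper: one logical flag per kind of variable in a data frame
variable_flags <- function(df) {
  stopifnot(is.data.frame(df))
  predicates <- list(
    has_numeric   = is.numeric,   # TRUE for integer and double columns
    has_integer   = is.integer,
    has_factor    = is.factor,    # TRUE for ordered factors as well
    has_ordered   = is.ordered,
    has_logical   = is.logical,
    has_character = is.character
  )
  # A flag is TRUE when at least one column satisfies the predicate
  vapply(
    predicates,
    function(p) any(vapply(df, p, logical(1))),
    logical(1)
  )
}

iris_flags <- variable_flags(iris)
airquality_flags <- variable_flags(airquality)
```

Binding these vectors together row-wise, one row per data set, would give a table that can be filtered on any combination of flags.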

jtr13 commented on June 15, 2024

Love this idea. Beyond class, it would be helpful to have information about the data types. Often I need several categorical variables, and while I do love the Titanic dataset, some more diversity would be a good thing. When writing exams I searched through the Sleuth3 manual for particular criteria but it was very time-consuming.

noamross commented on June 15, 2024

A helpful starting point might be last year's project examining data packages on CRAN: https://github.com/ropenscilabs/data-packages.

elinw commented on June 15, 2024

@jtr13 That's what I mean by class. There are so few ordered factors! So actually it would be good to know the number of each type, e.g. 3 factors, 2 ordered factors, 5 numeric. I agree that it's the combinations that get really frustrating. When you want a simple example, having to convert types can be a distraction from the main lesson.
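
Counting the number of variables of each type could be sketched roughly like this. The function name count_variable_types() is hypothetical, and pasting multiple classes into a single label is just one possible convention for keeping ordered factors apart from plain factors.

```r
# Hypothetical helper: how many columns of each class a data frame has
count_variable_types <- function(df) {
  stopifnot(is.data.frame(df))
  # Collapse multiple classes into one label, so an ordered factor is
  # counted as "ordered/factor" rather than as a plain "factor"
  col_types <- vapply(
    df,
    function(col) paste(class(col), collapse = "/"),
    character(1)
  )
  table(col_types)
}

iris_counts <- count_variable_types(iris)
# esoph is a base data set that contains ordered factors
esoph_counts <- count_variable_types(esoph)
```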

@noamross If packages were documented like that it would be cool, and we could definitely include them in a dashboard. We could at least provide a URL for the description (although we can also try to scrape them). The other thing is packages that wrap APIs for accessing data. The main thing is to make it automated.

Then maybe if we have a sense of what is there, that lets us think about what's missing.

jtr13 commented on June 15, 2024

@elinw Got it. When do you use ordered factors? I never do! (This probably isn't the right place to discuss this...)

elinw commented on June 15, 2024

All those “too much, too little, about right” questions, for one … Which leads to a whole other set of things. One of the big issues for me with the base categorical data is that it is formatted into table classes, but I want my students to see it as a data frame, meaning a more realistic setting where there are variables of at least two types.
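
For illustration, a table of counts such as Titanic can be expanded into a data frame with one row per case. This is only a sketch of the conversion step; the result here still contains factors only, so it does not by itself give variables of two types.

```r
# Titanic is stored as a 4-dimensional table of counts;
# as.data.frame() gives one row per cell, with the count in Freq
titanic_counts <- as.data.frame(Titanic, stringsAsFactors = TRUE)

# Repeat each row Freq times to get one row per person
row_index <- rep(seq_len(nrow(titanic_counts)), times = titanic_counts$Freq)
titanic_cases <- titanic_counts[row_index, names(titanic_counts) != "Freq"]
rownames(titanic_cases) <- NULL
```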

jtr13 commented on June 15, 2024

I've never had to use ordered factors for that kind of data for my purposes (usually visualization). I just order the levels of regular factors.
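
A small sketch of the difference, with made-up responses: setting the levels of a regular factor is enough to control the order in tables and plots, while an ordered factor additionally supports comparisons between levels.

```r
# Made-up survey responses, for illustration only
responses <- c("too much", "about right", "too little", "about right")
level_order <- c("too little", "about right", "too much")

# Regular factor with explicitly ordered levels
plain <- factor(responses, levels = level_order)
# Ordered factor with the same levels
ord <- factor(responses, levels = level_order, ordered = TRUE)

# Both tabulate (and plot) in the level order given above
plain_table <- table(plain)

# Only the ordered factor supports comparisons such as "less than"
below_too_much <- ord < "too much"
```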

laderast commented on June 15, 2024

Cool idea! One thought might be that oftentimes, when I'm looking for a teaching dataset, I'm looking for the presence of variable relationships in the data, such as smoking status (categorical) vs. BMI (continuous). So could this be another way of classifying the datasets?
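
A first approximation could be to flag data frames that contain at least one categorical and at least one continuous variable. The sketch below assumes that "categorical" means a factor column and "continuous" means a numeric column; the function name and the list of candidate data sets are hypothetical. It says nothing about whether the variables are actually related.

```r
# Hypothetical check: does a data set have at least one factor and
# at least one numeric column?
has_factor_and_numeric <- function(x) {
  is.data.frame(x) &&
    any(vapply(x, is.factor, logical(1))) &&
    any(vapply(x, is.numeric, logical(1)))
}

# A few base data sets chosen for illustration
candidates <- c("iris", "airquality", "ToothGrowth", "Titanic")
objects <- lapply(candidates, function(nm) getExportedValue("datasets", nm))
matches <- candidates[vapply(objects, has_factor_and_numeric, logical(1))]
```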

elinw commented on June 15, 2024

Yes, so that's what I was trying to say about getting the classes of the variables for the data frames.
https://github.com/elinw/dataestsearch/blob/master/R/datasetsearch.R

It is a concept but not that well coded (loops!!) ... and it doesn't handle getting the variable types for tibbles, but it does work for data frames. I mean, this is just a concept, but if we have a bunch of people we could make it really nice and figure out what is useful.

laderast commented on June 15, 2024

Ah, ok, that makes sense. I did something similar with a shiny workshop in identifying variables from a data.frame so that factor, character, and continuous variables would populate the right dropdowns for any dataset that was loaded into an app. It's the same idea as your code: https://github.com/laderast/gradual_shiny/blob/master/03_observe_update/helper.R
