Comments (13)
Summary: Build a way to search sample data sets in R packages to identify packages with different characteristics such as the format of the data set (e.g. data frame, matrix, dist, ts) and where appropriate the types of variables (e.g. factor, numeric, ts).
from unconf18.
Cool idea! Trying to figure if I understand correctly. Do you mean something like:
check_dataset(package = "datasets")
# A tibble: 8 x 6
  Package  Item          Title                                                        Rows  Cols Class
  <chr>    <chr>         <chr>                                                       <int> <int> <chr>
1 datasets ability.cov   Ability and Intelligence Tests                                  6     8 list
2 datasets airmiles      Passenger Miles on Commercial US Airlines, 1937-1960           24     2 ts
3 datasets AirPassengers Monthly Airline Passenger Numbers 1949-1960                   144     2 ts
4 datasets airquality    New York Air Quality Measurements                             153     6 data.frame
5 datasets anscombe      Anscombe's Quartet of 'Identical' Simple Linear Regressions    11     8 data.frame
6 datasets attenu        The Joyner-Boore Attenuation Data                             182     5 data.frame
7 datasets attitude      The Chatterjee-Price Attitude Data                             30     7 data.frame
8 datasets austres       Quarterly Time Series of the Number of Australian Residents    89     2 ts
Then you narrow down depending on whether you are looking for a data.frame, list, etc.? So a function that a) returns a tibble and b) accepts one or more packages as an argument?
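A rough sketch of what such a function could look like in base R. The name `list_datasets()` and the way rows and columns are counted are assumptions made for illustration, not the `check_dataset()` shown above, and it returns a plain data frame rather than a tibble:

```r
# Hypothetical helper: one row per data set in each requested package.
list_datasets <- function(packages = "datasets") {
  results <- data(package = packages)$results
  rows <- lapply(seq_len(nrow(results)), function(i) {
    pkg <- results[i, "Package"]
    item <- results[i, "Item"]
    # Item can look like "beaver1 (beavers)": object name, then the file to load.
    name <- sub(" .*$", "", item)
    file <- if (grepl("\\(", item)) sub("^.*\\((.*)\\)$", "\\1", item) else name
    # Load into a scratch environment so the global workspace stays untouched.
    env <- new.env()
    data(list = file, package = pkg, envir = env)
    obj <- get(name, envir = env)
    dims <- dim(obj)
    data.frame(
      Package = pkg,
      Item = name,
      Title = results[i, "Title"],
      # Objects without dimensions (e.g. a univariate ts) report their length.
      Rows = if (is.null(dims)) length(obj) else dims[1],
      Cols = if (is.null(dims)) NA_integer_ else dims[2],
      Class = class(obj)[1],
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

catalogue <- list_datasets("datasets")
# Narrow down to data frames only.
head(subset(catalogue, Class == "data.frame"))
```

Only the first class of each object is kept here, so an object with several classes would need extra handling.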
Concur here too: cool idea! It would also be pretty straightforward to integrate that within flipper. The mooted extension to trawling all /man directories is technically straightforward, and could very easily include functionality to trawl any @docType data entries, to enable those to be text-searched and grouped by return type (@format).
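As a hedged illustration of the documentation-trawling idea, the sketch below reads the parsed Rd files of an installed package with `tools::Rd_db()` and keeps the entries whose `\docType` is `data`, together with their `\format` text. This is not how flipper works; the helper names `rd_tags()`, `rd_section()` and `data_docs()` are made up for this example, and it reads installed help pages rather than raw /man directories.

```r
library(tools)

# Tag (e.g. "\\docType") of each top-level element in a parsed Rd object.
rd_tags <- function(rd) {
  tags <- lapply(unclass(rd), attr, "Rd_tag")
  vapply(tags, function(t) if (is.null(t)) NA_character_ else t, character(1))
}

# Text of the first section carrying `tag`, or NA when there is none.
rd_section <- function(rd, tag) {
  hit <- which(rd_tags(rd) == tag)
  if (length(hit) == 0) return(NA_character_)
  trimws(paste(unlist(rd[[hit[1]]]), collapse = ""))
}

# Hypothetical helper: help pages marked \docType{data}, with their \format text.
data_docs <- function(package) {
  db <- Rd_db(package)
  out <- data.frame(
    file = names(db),
    doc_type = unname(vapply(db, rd_section, character(1), tag = "\\docType")),
    format = unname(vapply(db, rd_section, character(1), tag = "\\format")),
    stringsAsFactors = FALSE
  )
  out[!is.na(out$doc_type) & out$doc_type == "data", ]
}

docs <- data_docs("datasets")
# Text search over the format descriptions, e.g. for data frames.
head(docs$file[grepl("data frame", docs$format)])
```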
Cool, yes, something similar to what @boshek has. I started playing a bit just to see what the complications would be. My basic idea would be to be able to:
- Search for a data set of a particular type (e.g. data frame, ts, mts, matrix etc)
- Be able to search (within data frames I guess) for presence of variables with specific classes.
So if you take a package name as an argument and get all the information about its data into a tibble, you'd then be able to say "give me all the data frames with a factor."
So this is just a quick script for making a data frame from the core data. I wanted to see what some of the complications would be. They are: the Item field can contain spaces plus extra words, and an object can have multiple classes.
dataset_list <- data(package = "datasets")
datasets_df <- as.data.frame(dataset_list[["results"]], stringsAsFactors = FALSE)
# Item can look like "beaver1 (beavers)"; keep only the object name.
datasets_df$short <- gsub(" .*$", "", datasets_df$Item)
# Create the new columns up front so the loop only fills in values.
datasets_df$class <- NA_character_
datasets_df$n_classes <- NA_integer_
for (i in seq_len(nrow(datasets_df))) {
  dataset <- get(datasets_df$short[i])
  class_name <- class(dataset)
  # Keep the first class name when there is more than one.
  datasets_df$class[i] <- class_name[1]
  datasets_df$n_classes[i] <- length(class_name)
}
And then something like the below to get the classes but the question would be how to organize the information to make it most useful.
For example maybe something like a set of logical variables: has_numeric, has_factor, has_logical, has_integer, has_character etc.
# Figure out what would work best for people in terms of searching
# Use the cleaned-up name (short), since Item may carry extra words.
unlist(lapply(get(datasets_df$short[i]), class))
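One possible shape for the logical-flag idea, as a minimal sketch. The function name `column_flags()` and the fixed list of classes are assumptions for illustration; only the first class of each column is considered, so an ordered factor counts as "ordered" rather than "factor".

```r
# Hypothetical helper: which column classes does a data frame contain?
column_flags <- function(df) {
  stopifnot(is.data.frame(df))
  # First class of every column, e.g. "numeric", "factor", "ordered".
  classes <- vapply(df, function(col) class(col)[1], character(1))
  wanted <- c("numeric", "integer", "factor", "ordered", "logical", "character")
  flags <- as.list(wanted %in% classes)
  names(flags) <- paste0("has_", wanted)
  # One-row data frame, so results for many data sets can be row-bound.
  as.data.frame(flags)
}

column_flags(iris)
column_flags(airquality)
```

Returning one row per data set means the flags could be bound onto the catalogue and filtered with ordinary subsetting, e.g. all rows where `has_factor` is `TRUE`.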
Love this idea. Beyond class, it would be helpful to have information about the data types. Often I need several categorical variables, and while I do love the Titanic dataset, some more diversity would be a good thing. When writing exams I searched through the Sleuth3 manual for particular criteria but it was very time-consuming.
A helpful starting point might be last year's project examining data packages on CRAN: https://github.com/ropenscilabs/data-packages.
@jtr13 That's what I mean by class. There are so few ordered factors! So actually it would be good to know the number of each type, e.g. 3 factors, 2 ordered factors, 5 numeric. I agree that it's the combinations that get really frustrating. When you want a simple example, having to convert types can be a distraction from the main lesson.
@noamross If packages were documented like that it would be cool, and we could definitely include them in a dashboard. We could at least provide a URL for the description (although we can also try to scrape them). The other thing is packages that wrap APIs for accessing data. The main thing is to make it automated.
Then maybe, if we have a sense of what is there, that lets us think about what's missing.
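Counting the number of columns of each type could be sketched as below. The name `count_column_types()` and the toy data frame are illustrative assumptions; as elsewhere in this thread, only the first class of each column is used.

```r
# Hypothetical helper: number of columns of each class in a data frame.
count_column_types <- function(df) {
  stopifnot(is.data.frame(df))
  classes <- vapply(df, function(col) class(col)[1], character(1))
  counts <- table(classes)
  # Return a plain named integer vector, e.g. c(factor = 1, numeric = 4).
  setNames(as.integer(counts), names(counts))
}

# Toy example with one column of each kind.
toy <- data.frame(
  group = factor(c("a", "b")),
  level = factor(c("lo", "hi"), levels = c("lo", "hi"), ordered = TRUE),
  score = c(1.5, 2.5),
  n = 1:2
)
count_column_types(toy)
count_column_types(iris)
```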
@elinw Got it. When do you use ordered factors? I never do! (This probably isn't the right place to discuss this...)
I've never had to use ordered factors for that kind of data for my purposes (usually visualization). I just order the levels of regular factors.
Cool idea! One thought might be that oftentimes, when I'm looking for a teaching dataset, I'm looking for the presence of variable relationships in the data, such as smoking status (categorical) vs. BMI (continuous). So could this be another way of classifying the datasets?
Yes, so that's what I was trying to say about getting the classes of the variables for the data frames.
https://github.com/elinw/dataestsearch/blob/master/R/datasetsearch.R is a concept, but not that well coded (loops!!), and it doesn't handle getting the variable types for tibbles, though it does work for data frames. I mean, this is just a concept, but if we have a bunch of people we could make it really nice and figure out what is useful.
Ah, ok, that makes sense. I did something similar with a shiny workshop in identifying variables from a data.frame so that factor, character, and continuous variables would populate the right dropdowns for any dataset that was loaded into an app. It's the same idea as your code: https://github.com/laderast/gradual_shiny/blob/master/03_observe_update/helper.R