ropensci / unconf18

Home Page: http://unconf18.ropensci.org/


unconf18's Introduction

rOpenSci 2018 unconference

(invitation only), May 21 - 22, 2018. Seattle

Welcome to the repository for the 2018 unconference. rOpenSci will be hosting its fifth major developer meeting and open science hackathon, this time in Seattle.

Participants
Please post ideas for projects, discussion topics, and sessions as issues.

Event hashtag is #runconf18

Code of conduct

To ensure a safe, enjoyable, and friendly experience for everyone who participates, we have a code of conduct. This applies to people attending in person or remotely, and to interactions in the issues.

Support

This meeting is made possible by generous support from:

The Helmsley Charitable Trust
Google
Microsoft
RStudio
NumFOCUS


unconf18's Issues

teach clean code principles with examples in R

Writing clean code is important for efficient work in every programming language. However, most of the resources available (at least those I have encountered so far) contain examples primarily in object-oriented languages, making them less accessible for R programmers who are not software engineers.

Clean code makes it easier to collaborate, to avoid bugs and write working code faster. It also helps with understanding code written by others.

I envision a blog post series / bookdown with the relevant principles adapted from the Clean Code book, with examples in R, focusing on everyday R users and not necessarily those writing production R code (i.e. do not mention error classes or R6 classes, but focus on the most common use cases).

See:

Connecting R educators "in the wild"

I know it is late to propose a new project, so consider this an invite to talk with me more about this idea at unconf if it resonates with you!

Problem:
People who teach R create awesome R markdown, blogdown, and bookdown materials for teaching, most of which are stored on GitHub. But, they can be hard to find (everyone knows @STAT545-UBC, but discoverability of even these materials is low for people not fully steeped in #rstats). The tidyverse site has some links to courses, but the materials are variable: some are PDF syllabi, some are full repos, some are formal university course listings.

Idea:
Inspired by @batpigandme's idea (#48), I've been thinking of a website to aggregate existing educational materials from GitHub. Ideally, one could search GitHub for repos that include words in the title/tag/README like "curriculum", "course", "workshop", "bootcamp", and tag them as such (I want to catch repos like @hadley's Data Challenge Labs: https://github.com/dcl-2017-04/curriculum). Other items on my "would be nice" list:

Tag with blogdown, bookdown, or R markdown site
Tag with type of license, if there is one, re: reuse/attribution/etc.
Provide a "tidyverse" percentage: something like, of the packages loaded in the repo, what percent are in the tidyverse ecosystem?
Provide way to see "last updated" easily, and perhaps in navigable interface allow users to sort by this
Some kind of topic tagging: like statistics, machine learning, data science, data visualization, natural language processing, etc.
Perhaps a level tag, like undergrad, grad, K-12, etc.
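
As a rough sketch of the "tidyverse percentage" item above: one could pull package names out of `library()` / `require()` calls and compare them against a list of tidyverse packages. The helper names and the short package list below are assumptions for illustration only, and the regular expression only catches the first call on each line.

```r
# Illustrative list only: the real tidyverse ecosystem is larger and changes over time.
tidyverse_pkgs <- c("ggplot2", "dplyr", "tidyr", "readr",
                    "purrr", "tibble", "stringr", "forcats")

# Extract package names from library()/require() calls in lines of R code.
extract_packages <- function(code_lines) {
  pattern <- "(library|require)\\([[:space:]]*[\"']?([A-Za-z][A-Za-z0-9.]*)[\"']?"
  hits <- regmatches(code_lines, regexpr(pattern, code_lines))
  unique(sub(pattern, "\\2", hits))
}

# Percentage of the loaded packages that appear in the tidyverse list.
tidyverse_percent <- function(code_lines) {
  pkgs <- extract_packages(code_lines)
  if (length(pkgs) == 0) return(NA_real_)
  100 * mean(pkgs %in% tidyverse_pkgs)
}

example_code <- c("library(dplyr)", "library(\"ggplot2\")",
                  "require(lme4)", "x <- 1", "library(data.table)")
tidyverse_percent(example_code)
```

In a real aggregator the input would be the lines of every `.R` and `.Rmd` file in a repo, and calls such as `dplyr::filter()` would need separate handling.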

Selfishly, I would find this type of resource very useful! But past-me would have found it invaluable. I frequently see professors in my own computer science group using Matlab for example because they don't know how to start teaching material they know using a language they don't know. It would be great to be able to forward them to courses on machine learning using R, for example. Just overheard yesterday a student lamenting that all course materials for ML are in Matlab, the TA only knows python, and she wants to use R, so I think this could also help students.

More broadly, I would love to establish an educators' collaborative around teaching R or with R. My university created one, and they worded it so nicely I'm just going to plagiarize:

"The Educators' Collaborative (EC) is a community of practice for people who are interested in education, including direct teaching, innovation, scholarship, curriculum design and mentoring. A community of practice is a group of skilled practitioners who interact regularly to learn from and with one another for the purpose of professional and personal development. Through in-person or online engagement, they create a shared understanding of purpose and develop communal resources to enhance their respective practices. (Lave & Wenger, 1991; Wenger, 1998)."

I have increasingly been working on team-taught courses and see real value in collaborating on curricula with other R educators. But not everyone has this luxury; it would be great to provide an organization to support innovative R education efforts.

Tagging folks that @stefaniebutland tagged on the Slack channel for interest/involvement in education:
@jennybc @laderast @hadley @jtr13 @czeildi @elinw @seankross @aurielfournier (I can't find Jenny Draper on GitHub, so I'm sorry for not tagging here!)

Modular tools for a drat / mini-cran repository

rOpenSci maintains a drat repository that is built nightly by Circle-CI (with the help of drat.builder) at http://packages.ropensci.org. At the moment the utility of this is somewhat minimal; it provides an alternative to devtools::install_github() and hosts some supplemental data packages too large for CRAN.

Maintaining a CRAN-like repo of development versions could be a lot more compelling if we explored some additional features. I think some of these could also be seen as useful services / perks for a maintainer having their package on-boarded. Ideally these would be implemented in modular tools other groups could adopt on an ad-hoc basis as well. What would you like to see?

prebuilt binaries (Mac, Windows, even Linux?) Could significantly reduce install times when downloading packages, possibly useful for CI setups too. (I believe @jeroen has some thoughts here).
Nightly builds. Particularly useful for packages that interact with an API that might introduce a breaking change. Currently we do this already via Travis, but with so many rOpenSci packages that can back up Travis builds of packages under active development.
Dashboard summaries: a convenient dashboard to see which rOpenSci packages are failing tests on what platform, how frequently they have been downloaded (from the drat/mini-cran, also from CRAN), GitHub issues, other information.
security features. Previous unconfs (@hrbrmstr and others) have brainstormed about what a secure package repository might look like, signed packages etc. A working platform (with binaries too!) might give us an interesting platform to try out these ideas?
reverse dependency checks. particularly for packages in our ecosystem that depend on other rOpenSci packages.
... Other ideas?

Would love to see feedback / interest in fleshing out any of the above, as well as suggestions for other related services you could imagine for such an rOpenSci central repository.

Meta-Redux: Shiny App to Help You Choose What to Work On

Hello fellow Excel jockeys, metadata journalists, and SPSS licensees,

Just like last year (ropensci/unconf17#84) I set up a little app to help you browse proposed projects: https://jhubiostatistics.shinyapps.io/runconf18-app/. If there are other features you want to see in the app let me know on the project repo or better yet send me a pull request!

Testing and reporting performance regressions / tracking performance over time

There were two issues last year (ropensci/unconf17#90 and ropensci/unconf17#56) devoted to a service that would allow package authors to track the performance of their package over time.

I think this is still an unmet need, maybe it would not be too much work to create a web service to do the storage and reporting for this with plumber?

rOpenSci storage for package caching

Lots of packages need caching. When things get complicated enough, packages may need to access their own external data to do their internal stuff. This means caching somewhere, somehow, in a form that is reliable and available. This costs money.

Oh, I don't have any of that; okay, I can't do that package.

And there it ends. How about considering applications to rOpenSci to (financially) support caching via some suitable provider? The flipper package is a case in point. This works at the moment because it only trawls the CRAN_package_db. We would like to extend this to all man/ directories, all non-CRAN packages on GitHub, and many potential other places. This is impossible without some sorta cloudy caching scheme.

Any chance of rOpenSci having an application scheme whereby those with existing rOpenSci packages apply for access to a wee chunk of server space?

low-friction private data share & data publication

I'd love to have a robust and simple way to deal with data associated with a project.

For individual data files < 50 MB, I have bliss. I can commit these files to a private GitHub repo with ease; they are automatically available to Travis for any checks (no encoding private credentials); I can flip a switch and get a DOI for every new release once I make my data public.

Or as another concrete example: my students are all asking me how to get their ~200 MB spatial data files onto the private Travis builds we use in class.

For larger data files, life is not so sweet. The alternatives and their pitfalls, as I see them:

Amazon S3. Setup is far less trivial. Can be expensive if lots of people download my large files (maybe not an issue if it's only for private data). Working on Travis requires encrypting keys. No convenient button to press to make this public with DOI when ready (though could manually upload to Zenodo). Ability to directly access individual files (by URL).
datastorr. Nearly perfect for data < 2 GB (adds data as "attachments" to GitHub releases, which aren't version controlled). Would love to see the branch that supports private authentication merged and a preliminary CRAN release. Maybe good fodder for Unconf?
Git LFS: Closest to my workflow for small data, but GitHub's pricing model basically renders this unworkable. (also no idea if Zenodo import captures LFS files). @jimhester posted a brilliant work-around for this at https://github.com/jimhester/test-glfs using GitLab for the LFS side to store the large data files of a repo on GitHub (up to 10 GB), but I could never get this working myself. (Would love to get unstuck).

Other scientific repositories with less ideal solutions:

Zenodo. Supports direct uploads with up to 50 GB of data, making it a great option for easy public data sharing. No private option, no ability to download directly from DOI address.
figshare. Allows for private sharing and public sharing, DOIs for public data. Not sure about file limits. rfigshare package not actively maintained... No ability to download directly from DOI address.
DataONE. Allows private and public sharing, supports ORCID auth, rich metadata model (burdensome to enter at first but could be useful with better tooling). Requires re-authenticating with time-expiring token. Provides DOIs and other identifiers. No ability to download directly from DOI address, but does support ability to access individual files without downloading entire archive...
... more / other related strategies?

Things that might be strategies but somehow never work well for me in this context:

Box / Dropbox
Google Drive

Tensorflow Probability in R

Earlier today, Google announced TensorFlow Probability: a probabilistic programming toolbox for machine learning. The full text of the announcement is available on Medium.

See the article for full details, but at a high level:

TF Probability provides an incredibly flexible language for specifying models imperatively
You work with distributions as first class objects
When it comes time to fit your model, TFP has a host of tools (like MCMC and variational inference) to get the job done

This notebook provides an end-to-end walkthrough on fitting a linear mixed effects model using the InstEval data from lme4.

For this project, unconf participants should come up with a design for how TF Probability will work in R, referring to RStudio's work on keras and tfestimators. Participants will be able to write some of these wrappers, and should hope to complete some example notebooks before the end of the event. It would be great if we could do an R version of the notebook linked above, and maybe others too.

R already has access to other probabilistic programming languages, such as Stan, and there are other R projects that try to build up a probabilistic programming language for TensorFlow (greta). But this will be the primary Google-supported project in this area, with a lot of new features coming soon.

A single interface to navigate the documentation of multiple packages

(Relates to #48, #25.)

How can you search all functions, datasets and help files across multiple packages within a GitHub organization (e.g. tidyverse, rOpenSci)?

You may use https://www.rdocumentation.org/ or the search field of a GitHub organization; but I think those tools are too general. I propose to build a small package to create a searchable and clickable table of all functions or datasets by package -- including any number of packages within a GitHub organization. A good place to host such a table is on the website of the relevant GitHub organization.

Here is an example, which I built to solve my problem at the time. We could make the code more general and package it up.
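
To make the idea concrete, here is a minimal sketch of the table-building step, assuming the packages of interest are already installed locally. `index_exports()` and `search_index()` are made-up names, and this only covers exported objects, not datasets or help files.

```r
# Build a table with one row per exported object for a set of installed packages.
index_exports <- function(packages) {
  rows <- lapply(packages, function(pkg) {
    exports <- sort(getNamespaceExports(pkg))
    data.frame(package = rep(pkg, length(exports)),
               name = exports,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Case-insensitive search over the object names in the index.
search_index <- function(index, pattern) {
  index[grepl(pattern, index$name, ignore.case = TRUE), , drop = FALSE]
}

idx <- index_exports(c("stats", "utils"))
search_index(idx, "^median$")
```

The resulting data frame could then be rendered as a searchable, clickable table on the organization's website, with each name linking to its reference page.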

parallel progress bars

Getting progress bars and other information in the console for big jobs running in parallel is something I've wanted for a looong time. It is possible to get GUI progress bars on windows (using TK), but this method apparently doesn't work on mac/linux, and doesn't print to the console.

It would be awesome if this functionality could be integrated with the future package, so that it can be used on any parallel backend the future API supports. It would be super awesome if we could enable export of the widely used progress bars in utils, and the swanky progress bars in progress.

There are technical hurdles around communication between processes and differences between operating systems, but it's definitely achievable. I've put together a gist¹ with a prototype that does this the dumb (but generalisable) way; writing progress information to tempfiles which are read by the main process:

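```r
library(future)
# update_parallel_progress() and future_replicate() are assumed to be defined by the sourced gist
source("https://gist.githubusercontent.com/goldingn/d5a3aebfbc63eaadd92f0ff5ca811a5d/raw/12b552722020626e3f7014e1d9314266287acee0/parallel_progress.R")

foo <- function (n) {
  for (i in seq_len(n)) {
    update_parallel_progress(i, n)
    Sys.sleep(runif(1))
  }
  "success!"
}

# note: later releases of future deprecated multiprocess in favour of multisession / multicore
plan(multiprocess)
future_replicate(4, foo(30))
```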

There are various ways this could be improved:

Printing progress bars rather than just a percentage for each process (preferably just embedding bars from the utils and progress packages).
Sending progress information from processes running on another file system (e.g. remote servers²)
Handling more processes than threads
Handling sequential execution
Proper integration with the future package
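
For anyone who doesn't want to read the gist, the tempfile mechanism described above boils down to something like the following. This is a simplified base-R sketch, not the gist's actual code: the function names and file layout are made up, and it ignores the case where a file is read while it is still being written.

```r
# Each worker overwrites its own small file with "<current> <total>".
write_progress <- function(dir, worker_id, i, n) {
  path <- file.path(dir, sprintf("worker-%s.txt", worker_id))
  writeLines(sprintf("%d %d", i, n), path)
}

# The main process polls the directory and returns percent complete per worker.
read_progress <- function(dir) {
  files <- list.files(dir, pattern = "^worker-.*\\.txt$", full.names = TRUE)
  if (length(files) == 0) return(numeric(0))
  vapply(files, function(f) {
    parts <- as.numeric(strsplit(readLines(f, n = 1), " ")[[1]])
    100 * parts[1] / parts[2]
  }, numeric(1), USE.NAMES = FALSE)
}

progress_dir <- tempfile("progress-")
dir.create(progress_dir)

# Simulate two workers reporting at different stages of their loops.
write_progress(progress_dir, 1, 15, 30)
write_progress(progress_dir, 2, 30, 30)
cat(sprintf("%3.0f%%", read_progress(progress_dir)), "\n")
```

This only works when all processes can see the same directory, which is why remote file systems are on the list of improvements.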

Related discussions:

Re. progress information in future in which @HenrikBengtsson says he'd rather it were a separate package, and suggests using processx.

Re. multiple progress bars in progress - having progress bars on separate lines isn't trivial since not all consoles allow overwriting of more than one line of output.

Heads up to @HenrikBengtsson and @gaborcsardi, in case they know of progress on this topic that I'm not aware of!

¹ https://gist.github.com/goldingn/d5a3aebfbc63eaadd92f0ff5ca811a5d
² my main motivation for this is getting live console progress bars for greta jobs running on Google CloudML

Running minimal code with changed inputs

I am not sure if this topic was tackled from all sides but thought about sharing.

Problem
When one runs a script, there can be computationally heavy parts, but not all of these parts need re-running. The inputs to these parts could be the same, the imported data might not have changed since the last time, etc.

In Rmarkdown, one can use caching in chunks to save some data importing. And in scripts one could write some conditions to control running some code.

It would be more efficient if there were a simple way to detect changes and decide automatically which parts need to be re-run.

So if there are available solutions for this, let me know. If not we might think about certain functions or settings to help with this.
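
To illustrate the kind of function I mean, here is a minimal sketch that re-runs a step only when the checksum of its input file has changed. `run_if_changed()` and the cache layout are hypothetical; `tools::md5sum()` is the only non-trivial base function used.

```r
# Re-run an expensive step only when the checksum of its input file changes.
run_if_changed <- function(input, cache_file, compute) {
  current <- unname(tools::md5sum(input))
  if (file.exists(cache_file)) {
    cached <- readRDS(cache_file)
    if (identical(cached$checksum, current)) {
      return(cached$value)          # input unchanged: reuse the stored result
    }
  }
  value <- compute(input)           # input is new or changed: recompute
  saveRDS(list(checksum = current, value = value), cache_file)
  value
}

input <- tempfile(fileext = ".csv")
cache <- tempfile(fileext = ".rds")
write.csv(data.frame(x = 1:3), input, row.names = FALSE)

calls <- 0                          # counts how often the heavy step really runs
slow_sum <- function(path) {
  calls <<- calls + 1
  sum(read.csv(path)$x)
}

run_if_changed(input, cache, slow_sum)  # computes
run_if_changed(input, cache, slow_sum)  # reuses the cached value
write.csv(data.frame(x = 4:6), input, row.names = FALSE)
run_if_changed(input, cache, slow_sum)  # input changed, so it recomputes
```

This tracks a single input file only; changes to the code of the step itself, or to upstream steps, would need to be tracked as well for a general solution.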

Workflow for publications using Rmarkdown with users that won't get past Word/Google docs

This is a specific issue related to #27 (and somewhat to #22). How can we successfully and painlessly collaborate in a publication workflow using Rmarkdown with researchers (or others) who are not interested at all in getting past Google docs?

I have no idea how to tackle this. I know this is something I would use a lot.

There is some discussion and ideas in this thread: https://twitter.com/CMastication/status/942151771627155457

And a deeper discussion here: https://community.rstudio.com/t/publishing-rmarkdown-to-google-docs/832 where @jennybc says "The problem we ran into is that compiling .Rmd to Google Doc is not that hard. But for the whole workflow to truly be useful, you then want to go the other direction. And that is really hard." Wondering if this got any easier since last time this was discussed.

FWIW, my 2 cents for the 2018 unconf 🙂

Datasets search

When I'm writing tutorials or documentation or when I'm teaching I often fall back on the same sample data sets over and over. At the same time, when I need something specific such as an ordered factor I have to search around to find one. I try to stick to the base datasets. I was thinking that it would be neat to have something (a package or a shiny app or a combination) that would let you search for a specific class of data structure (data frame, matrix, ts, dist, cube etc (there are a lot)) and also for specific variable types for those structures that support multiple types. Maybe also experimental versus observational? https://vincentarelbundock.github.io/Rdatasets/datasets.html has a list of the data sets, but the purpose of that archive is more to put them all into csv format in a consistent manner.

An added bonus would be to be able to make the api generic enough to search other packages but my initial goal would be the ones in datasets.
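
A quick sketch of what the search could look like for the datasets package, using only base R. The function names are made up; the approach simply loads each object and inspects its class.

```r
# Names of the objects shipped in the datasets package.
dataset_names <- function() {
  items <- utils::data(package = "datasets")$results[, "Item"]
  # Entries such as "beaver1 (beavers)" name an object and its source file.
  unique(sub(" .*$", "", items))
}

# Load one dataset by name; returns NULL if it cannot be found.
load_dataset <- function(name) {
  tryCatch(getExportedValue("datasets", name), error = function(e) NULL)
}

# Table of dataset names and the first class of each object.
dataset_classes <- function() {
  nms <- dataset_names()
  cls <- vapply(nms, function(nm) class(load_dataset(nm))[1], character(1))
  data.frame(name = nms, class = unname(cls), stringsAsFactors = FALSE)
}

# Data frames that contain at least one column inheriting from a given class.
find_by_column_class <- function(column_class) {
  nms <- dataset_names()
  keep <- vapply(nms, function(nm) {
    obj <- load_dataset(nm)
    is.data.frame(obj) &&
      any(vapply(obj, inherits, logical(1), what = column_class))
  }, logical(1))
  nms[keep]
}

find_by_column_class("ordered")
```

Filtering `dataset_classes()` on the `class` column covers the "data frame, matrix, ts, dist" part of the search, and `find_by_column_class()` covers the variable types.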

Lesson/Examples of how to clean 'field' data

This might be too field ecology specific, but I think it could be useful more broadly.

This is a situation I ran into in my grad school work, and I know many others doing field work run into it too: they collect data on hard copy and then enter it every few days over several months of work.

There are data entry errors, spelling issues, etc., plus you also end up with dozens of files that have been entered, probably by different people.

I dealt with this in my own field work by creating a script I ran over all the files, checked them for the correct spelling of different things, and then printed out the things that were wrong.

This code is not my finest, but it got me through my phd.

Maybe something that does this better already exists, and I just need to learn what it is so I can point others with this issue towards it.

But if it doesn't, this would be something I'd love to work on building.

I realize this functionality already exists in OpenRefine, but I personally don't care for OpenRefine, so I did it this way.


```r
# these are the vectors of values that I am ok with, with the correct spellings

# areas are my study areas
areas <- c("nvca","scnwr","fgca","slnwr","tsca","bkca","ccnwr","dcca","osca","tmpca")

# impound is my wetland impoundments
impound <- c("rail","sanctuary","ash","scmsu2","scmsu3","sgd","sgb","pool2","pool2w","pool3w","m11","m10","m13","ts2a","ts4a","ts6a","ts8a","kt9","kt2","kt5","kt6","ccmsu1","ccmsu2","ccmsu12","dc14","dc18","dc20","dc22","os21","os23","pooli","poole","poolc")

# regions are the four regions
regions <- c("nw","nc","ne","se")

# plant spellings that are correct
plant <- c("reed canary grass","primrose","millet","bulrush","partridge pea","spikerush","a smartweed","p smartweed","willow","tree","buttonbush","arrowhead","river bulrush","biden","upland","cocklebur","lotus","grass","cattail","prairie cord grass","plantain","sedge","sesbania","typha","corn","sumpweed","toothcup","frogfruit","canola","sedge","crop","rush","goldenrod",NA)

# file_names is assumed to already hold the paths of the entered data files, e.g. from
# list.files("path/to/entered/data", pattern = "\\.csv$", full.names = TRUE)

for (i in seq_along(file_names)) {
  int <- read.csv(file_names[i])
  # this prints out instances where there are things that are not part of the lists above
  # and includes the file name so I can go and find the issue.
  print(paste0(int[!(int$region %in% regions), ]$region, " ", file_names[i], " region"))
  print(paste0(int[!(int$area %in% areas), ]$area, " ", file_names[i], " area"))
  print(paste0(int[!(int$impound %in% impound), ]$impound, " ", file_names[i], " impound"))
  print(paste0(int[!(int$plant1 %in% plant), ]$plant1, " ", file_names[i], " plant1"))
  print(paste0(int[!(int$plant2 %in% plant), ]$plant2, " ", file_names[i], " plant2"))
  print(paste0(int[!(int$plant3 %in% plant), ]$plant3, " ", file_names[i], " plant3"))
}

## once I resolve all of the issues identified from above I then read in all the files,
## put them in a list and I can stitch them together into one master file.

vegsheets <- list()

for (i in seq_along(file_names)) {
  vegsheets[[i]] <- read.csv(file_names[i])
}

## this takes the list and combines it all together into one data frame
masterdat <- do.call(rbind, vegsheets)

# write it out into a master file
write.csv(masterdat, "~/Github/data/2015_veg_master.csv", row.names = FALSE)
```
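
If this were turned into something reusable, the repeated checks above could collapse into one helper that takes a named list of accepted values and returns the problems as a data frame. The sketch below is only one possible shape for it; `check_allowed()` and the tiny example sheet are made up.

```r
# Report values that are not in the accepted list for each checked column.
check_allowed <- function(df, allowed, source = NA_character_) {
  problems <- lapply(names(allowed), function(col) {
    bad <- which(!(df[[col]] %in% allowed[[col]]))
    data.frame(source = rep(source, length(bad)),
               column = rep(col, length(bad)),
               row    = bad,
               value  = as.character(df[[col]][bad]),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, problems)
}

# Accepted values per column; NA is listed where a blank entry is acceptable.
allowed <- list(region = c("nw", "nc", "ne", "se"),
                plant1 = c("millet", "bulrush", "willow", NA))

sheet <- data.frame(region = c("nw", "sw", "ne"),
                    plant1 = c("millet", "bullrush", NA),
                    stringsAsFactors = FALSE)

check_allowed(sheet, allowed, source = "sheet1.csv")
```

Returning a data frame instead of printing makes it easy to run the check over many files, bind the results together, and hand the list of problems back to whoever entered the data.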

An on-boarding process for 'research compendia'?

A late submission here, but the notion of a research compendium has been a frequent theme at previous unconfs, including the reproducibility guide (@iamciera et al), rrrpkg (@jennybc et al), checkers (@noamross et al), also @benmarwick's rrtools etc. Despite this, I think there are still quite a lot of open questions as to what exactly a compendium is, what tooling we need to support it, etc.

Given the success of the rOpenSci onboarding process in fostering discussion, promoting norms and boosting the visibility of packages, I wonder if a similar approach would be viable for reviewing/collating/promoting research compendia? Just as a compendium is generally viewed as something less than/simpler than a package, I imagine the review process would be somewhat of a lower bar; primarily a way to verify: "can I reproduce the outputs presented here?" and "can I understand what's going on here and how it's organized?" At the same time, it could be a good venue for learning about ways to improve (e.g. "this looks computationally intensive, you might want to talk to @wlandau about using drake here", or "you might want to put the associated data on Zenodo", etc.).

So I have this vague notion that a similar on-boarding process could help build momentum/community/examples of research compendia, but there are still plenty of open questions as to how to pull this off. Should this be done under the auspices of rOpenSci or only synergistically? Should there be a more explicit journal connection, or a JOSS-like journal/index of compendia? How well would this work across domains? And most importantly, is there sufficient interest (editors/reviewers) to pull this off at all?

(idea originally based on discussion with @benmarwick at a DataONE meeting last fall, summarized here: https://github.com/benmarwick/onboarding-reproducible-compendia, but could go in different directions).

Caching for drake

Data scientists are experts at mining large volumes of data to produce insights, predict outcomes, and/or create visuals quickly and methodically. drake (https://github.com/ropensci/drake) has solved a lot of problems in the data science pipeline, but one thing we still struggle with is how to effectively collaborate on a large-scale project without each contributor needing to run all of the workflow, or separating the workflow into many disjointed smaller workflows. In some large-scale projects, this is just not feasible.

It would be awesome if a wide community of R developers could come together and try to create a way for drake to have a collaborative caching feature.

My group had set up a wrapper package for remake (drake's predecessor) that allows tiny indicator files to be pushed up to GitHub. These indicator files let the user know that the target was complete and the data had been pushed up to some common caching location. The next user would pull from the upstream repository on GitHub to get the indicator file. That user would not need to re-run a target that some other collaborator had already run, but could instead pull the data down (if it's needed) rather than create it from the workflow. It got a bit awkward because we needed 2-3 remake targets to accomplish this, and that tripped up our "non-power-user" collaborators.

I'd propose the first step would be to develop a caching workflow for Google Drive (using the googledrive package). Once the process was fleshed out using Google Drive, it could be more easily expanded to other data storage options (AWS using the aws.s3 package, for example).

My gut says this might need to be a wrapper or companion package to drake (to keep the dependent packages minimized), but not sure. @wlandau and other drake experts: I would looove to hear any feedback you have on this idea. If in fact this issue is a non-issue (i.e. drake can already handle caching and I just missed it...totally possible...), then we could morph this issue into a group that helps create more content for a drake blogdown/bookdown book!

The wrapper package for remake is here:
https://github.com/USGS-R/scipiper

#12 is another drake-based project.

Review of literature for API mocking and testing

A pretty frequent topic of conversation in rOpenSci circles is approaches to testing packages that call Web APIs. I think this topic could use a "review of literature" - a systematic look at the different packages for this in the ecosystem, how their interfaces differ, the advantages and disadvantages of each, missing functionalities to be developed, and best practices/design patterns for testing packages.

Outputs for this might just be a blog post or vignette, or a series of issues and PRs across the package ecosystem.

Notably @sckott is nearing release of vcr, it could host the vignette and this review might be good for making sure it covers the use-cases that we explore.

Collaborating with NOT-users of R, RStudio, or Git

(For more on RStudio, Git and GitHub see #22.)

In my experience, valuable collaborators often stay out of the collaboration loop that happens on GitHub. Not only do I miss their input, but I also struggle to integrate the contributions they make outside my GitHub-based workflow. They may not want to learn complex tools and have little motivation (often they are already at the peak of their academic careers).

What is a good workflow to collaborate with those who don't use R, RStudio or Git (but might use e.g. Google Sheets and GitHub)?
How can we maximize their input?
How can we minimize the problems caused by contributions outside the GitHub workflow?

Test project / package after hypothetical package update

Related to #31 and also the following comment of @noamross in #35

If you want to do live testing of a package, like seeing what system files/folders it modifies, I'm working on a Dockerized setup for our standard package tests: https://github.com/noamross/launchboat, so one could run tests in an isolated environment before installing.

Before updating a dependency to the newest version see which tests would break.

I can imagine this as a service (or maybe Rhub can already do this??) or a local version with Docker ensuring a separate environment.

Discussion: Expanding peer review of code

rOpenSci has long been interested in incubating projects that adopt our approach to open peer review of code in areas outside our scope, such as in other languages or, especially, for implementations of statistical algorithms. A few unconf attendees (inc. @mmulvahill, @dynamicwebpaige, @jenniferthompson) have expressed interest in this, so it would be good to set aside time to discuss prospects for new code review projects. I suggest this would be a second-day 60-90 minute lunch discussion rather than a full two-day project, but depending on people's interest some of us could run with it!

Promote data-packages to facilitate project-oriented workflows

Can we facilitate project-oriented workflows by promoting data-packages?

Although researchers would benefit from using self-contained projects, they rarely do. This workflow seems more common:

Store the data locally in one directory.
Import the data from multiple R sessions and run analyses.

This approach is problematic because each R session is not self-contained.

A neat solution is to build a data-package. While its source may live in a single local directory, the data can be accessed by loading the data from any project, keeping it self-contained.

But many researchers don't know this approach or believe that building a package is too difficult. In fact, building a basic data-package requires relatively few tools. The process can largely be handled with the usethis package (by Hadley Wickham, @jennybc and RStudio). Can we describe the steps required to build a data-package via a series of usethis functions?

(As a starting point here is a checklist I use to build data-packages.)
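
To show how little a data-package actually contains, here is a hand-rolled sketch in base R of the skeleton that, as far as I understand, helpers like `usethis::create_package()` and `usethis::use_data()` set up for you. The package name `mydata` and the dataset are made up, and a real package would need more (author fields, documentation for each dataset).

```r
# Create a bare-bones data package skeleton in a temporary directory.
pkg_dir <- file.path(tempfile("pkgs-"), "mydata")
dir.create(file.path(pkg_dir, "data"), recursive = TRUE)

# Minimal DESCRIPTION; LazyData lets users access datasets without calling data().
writeLines(c(
  "Package: mydata",
  "Title: Example Data Package",
  "Version: 0.0.1",
  "Description: Holds datasets shared across several analysis projects.",
  "License: CC0",
  "LazyData: true"
), file.path(pkg_dir, "DESCRIPTION"))

# Save one dataset as data/<name>.rda, one object per file.
plots <- data.frame(plot_id = 1:3, area_ha = c(0.5, 1.0, 0.25))
save(plots, file = file.path(pkg_dir, "data", "plots.rda"))

list.files(pkg_dir, recursive = TRUE)
```

Once a package like this is installed, any project can get at `plots` by loading the package, so the data lives in exactly one place.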

Toward a general-purpose API response data tidier

I think the goal for the unconf would be to lay the foundation for an eventual package that is meant to sit in a user’s pipeline directly after a jsonlite::fromJSON() call, in place of initial bespoke munging of the nested list. The package would:

Tidy the nested list output into a tibble with one nested row per input
- Will be a best guess at how the user wants data formatted; at worst it should be easier to work with than the raw output
Allow for recursively filling .empty values at each level of the list (i.e., convert empty elements and NULLs to NAs or a user-defined value) so the tibble can be easily unnested
Include special handling for the most commonly used APIs so we know we’re getting these right (?)
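
As a first stab at the recursive filling step in the list above, something like this might do. It assumes the response has already been parsed into a plain nested list (for example with `jsonlite::fromJSON(..., simplifyVector = FALSE)`), and `fill_empty()` is a placeholder name rather than an existing function.

```r
# Recursively replace NULLs and zero-length elements in a nested list.
fill_empty <- function(x, fill = NA) {
  if (is.null(x) || length(x) == 0) {
    return(fill)                    # empty at this level: substitute the fill value
  }
  if (is.list(x)) {
    return(lapply(x, fill_empty, fill = fill))  # descend into nested lists
  }
  x                                 # non-empty atomic value: keep as is
}

response <- list(
  list(id = 1, name = "a", tags = list()),
  list(id = 2, name = NULL, tags = list("x", "y"))
)

filled <- fill_empty(response)
str(filled)
```

After this step every record has the same set of non-empty fields, which is what makes the later unnesting straightforward.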

The first step here (which is the goal for the unconf -- many thanks to @jennybc for working through the initial idea with me) seems to me to be defining patterns 1) in API requests and their resulting responses and 2) in the most common/successful strategies people use in their tidying process. I think it would be useful to query a few more-or-less representative RESTful APIs and note the commonalities in the solutions for tidying them. The idea would be to extract the intersection of these solutions into general-purpose verbs and also to identify where these approaches fail.

I could see this package being useful not just for one-off data grabbing and tidying jobs but also for developing packages that interface with an arbitrary API. Could of course be used on any nested list, but I think it makes sense to keep the scope of the package focused to API data.

For a name, I’m thinking roomba¹ but definitely not married to it.

¹ Provided that’s cool with the relevant trademark attorneys 😆

Write answers to common questions that R users ask about R packages

I propose to write answers to the commonly asked questions about R packages, specifically targeting an imaginary R user with no background.

This issue is motivated by the discussion with comments by @jennybc and @batpigandme (thread). A draft is here. I would love some help to finish it. Major edits are welcome and I'm happy to move the article wherever.

Building meta-packages (e.g. tidyverse) to centralize multiple packages

Do you find yourself repeatedly installing or loading the same packages? How about building a meta-package?

The main goal of a meta-package is to install and load other packages (a famous example is the tidyverse package). But also you may use the pkgdown website of a meta-package as a kind of meeting point for all the packages it gathers (e.g. tidyverse; fgeo).

It turns out that building a meta-package is surprisingly simple. The source code of the tidyverse is an excellent template. You could use it to build your own meta-package, or we could generalize it to create a package that creates a meta-package. Is this getting too meta? :)
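As a rough illustration of the "load other packages" part, a meta-package could call something like the helper below from its attach hook. This is only a sketch: `attach_all()` is a hypothetical function, not how the tidyverse itself is implemented, and base packages stand in for the packages a real meta-package would gather.

```r
# Attach a set of packages in one call, failing early with a clear message
# if any of them is not installed. `attach_all()` is a hypothetical helper.
attach_all <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  absent <- pkgs[!installed]
  if (length(absent) > 0) {
    stop("Please install: ", paste(absent, collapse = ", "), call. = FALSE)
  }
  # library() needs character.only = TRUE when given package names as strings
  invisible(lapply(pkgs, library, character.only = TRUE))
}

# Base packages are used here so the example runs anywhere
attach_all(c("stats", "utils"))
```

In a real meta-package, the vector of package names would typically be derived from the Imports field of its DESCRIPTION file rather than hard-coded.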

Security/Safety "Best Practices" for rOpenSci Package Developers/Reviewers

We've done a bit of this ad-hoc, but we could spend some dedicated cycles ensuring that rOpenSci not only has the best technical and maintenance standards — which it most certainly does — but is also the de-facto standard to replicate when considering safety/security.

Help researchers track results in manuscript back to source code.

How do you link a result in your manuscript back to its source code? This is fundamental to reproducible research. It seems basic and straightforward but, in the wild world I live in, it is not. Research gets messy quickly: after a few weeks out of touch with a project, wish me luck finding my own stuff; and forget about finding code in a project managed by someone else.

My inelegant solution is this:

I tag each analysis with a random label and a description.

ab12 <- "Code whose result proves that Earth is not flat."
result <- code  # `code` is a placeholder for the actual analysis

I keep the tag associated with that analysis throughout the lifecycle of the manuscript.

Whenever I need to go back to the source code, I use RStudio's Go to File/Function.

Is there a tool or better approach? What general recommendations do you have for researchers across a range of willingness to use version control and RStudio projects?

Discussion: Time management strategies and work-life balance

The great amount and quality of work that some people do seems impossible. Such people include @jennybc, @jimhester, @hadley, and many others from rOpenSci. What they achieve would not only take me a new brain, but also 10 lives, 60 marriages, and all of my mental health. What habits, strategies, and tools do they have and use?

In some countries kids learn time management skills early, at school. Not where I grew up, and likely not in too many other places. As an adult I learned some helpful strategies, but I continue to drift to bad habits and poor work-life balance. I would love to casually discuss this issue and learn from your experience.

Git + R nirvana: how to get there

The tools most of us use to accomplish Git/GitHub magic from R are the RStudio IDE and the git2r package (part of the rOpenSci org).

Under the hood, these exploit different tools to enact Git operations: system Git (RStudio IDE) and libgit2 (git2r). This means that various aspects of Git configuration can be good to go for one but not the other. This is mostly about configuring credentials for Git remotes, e.g. setting up SSH keys.

I've done a fair amount of testing and documenting for Happy Git. But I think initial setup could become even easier and better documented with respect to tricky bits, such as using passphrase-protected SSH keys on Windows. I'd love to stress-test and improve setup instructions for Git so that more people have more success across Mac/Windows/Linux for command line Git (--> RStudio IDE) and git2r (--> devtools, usethis, etc.).

color palette aggregator/modifier tool

There are loads of packages/ways to choose color palettes for R graphics, but I still end up doing a lot of iterations through many packages until I find a suitable color palette.

Existing packages (please add):
colorspace has a GUI that makes choosing colors easy, with a preview function for different types of plots and sliders for Hue, Chroma, Luminance, and Power, but it doesn't integrate seamlessly with other color packages.
RColorBrewer has many palettes, but you have to use the colorRampPalette() function to get more than 9 or 11 colors.
hues allows you to pick palettes with many colors, but requires numeric HCL inputs for colors.
There are many other great palettes out there (e.g. beyonce or wesanderson), but checking how your graph looks across palettes from different packages requires loading each package and knowing its syntax.

Idea:
A one-stop-shop for colors in R that combines the best of existing tools. The goal would be to change the workflow from iterating across multiple packages to find a suitable palette to using one tool that:

Aggregates and displays color palettes across packages
Allows users to increase/decrease the number of colors in these palettes
Lets users pick and choose colors from predefined palettes to create their own custom palette
Adds transparency or ensures colorblind-friendly schemes
Gives immediate previews of what a/your graph would look like when you change any of these options
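Two of these operations, changing the number of colors and adding transparency, can already be sketched with base R's grDevices. The helper names `expand_palette()` and `fade_palette()` and the three starting colors are assumptions made for this example only.

```r
# Base R only (grDevices); the three starting colors are arbitrary examples.
base_pal <- c("#E41A1C", "#377EB8", "#4DAF4A")

# Interpolate a palette up (or down) to n colors.
expand_palette <- function(pal, n) {
  grDevices::colorRampPalette(pal)(n)
}

# Add transparency: alpha = 1 is opaque, 0 is fully transparent.
fade_palette <- function(pal, alpha = 0.5) {
  grDevices::adjustcolor(pal, alpha.f = alpha)
}

pal12 <- expand_palette(base_pal, 12)
pal12_faded <- fade_palette(pal12, 0.5)

# Quick preview as a row of colored bars (only when run interactively)
if (interactive()) {
  barplot(rep(1, length(pal12_faded)), col = pal12_faded,
          border = NA, axes = FALSE)
}
```

An aggregator tool would wrap palettes from other packages as plain character vectors of hex codes, so that helpers like these work the same way regardless of where a palette came from.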

My colleague Margaret Siple inspired this idea.

Extensions to R / RStudio's autocompletion system

This was proposed by @kevinushey for last year's unconf, but unfortunately Kevin had to miss the event. (ropensci/unconf17#52)

In the meantime I have started a proof of concept package https://github.com/jimhester/completeme, that implements some of what he proposed, but it would be great to improve the RStudio IDE integration and discuss ways to make this more full featured and robust, both for terminal completions, things like https://github.com/REditorSupport/languageserver and in the IDE.

Code review tools

This is a fork of #37

Develop tools to help R code reviewers better understand the code they are reviewing and make the process easier and more robust, especially for new reviewers.

Expand on or consolidate https://github.com/ropenscilabs/pkgreviewr, https://github.com/lockedata/PackageReviewR, and https://github.com/noamross/launchboat

I note that, for non-R package code, there are also guides and tools from last year's checker unconf project: https://github.com/ropenscilabs/checkers

@maelle @boshek @goldingn @annakrystalli

Package usage "in the wild"

Related to #25 (see #25 (comment)), but I think it's a sufficiently distinct approach to merit its own thread.

Overall idea

As a user, I often find it helpful to see use cases for packages and/or functions "in the wild" (i.e. in the context of some workflow or task). Some packages have great vignettes that cover this, but there's simply no way for a handful of maintainers/developers to think of all the possible ways a package might come in handy. It can also be extremely helpful to read explanations from people who didn't write the package, since they have a sort of "beginner's mind." (I've done a few "roundups" of tweets, often of blog posts using various packages, e.g. for purrr here, for this reason.)

I imagine (and have anecdotal twitter evidence 😏) that maintainers also like seeing how their packages are being used, but don't always get that feedback, even when it exists, since the avenues are somewhat limited.

A very roughshod diagram of relationships among packages/feedback that exists "formally"

Here's where it gets fuzzy implementation-wise, but I've been wondering if there would be a good way to highlight package usage in blog posts or case studies (e.g. with blogdown), in such a way that users and maintainers would be able to easily find relevant content.

Carl Goodwin's been doing something to this effect by including tables of packages and functions used in his blog posts (see example from Surprising stories hide in seemingly mundane data below), but this is (to my knowledge) done by hand, and isn't something one would necessarily be able to find from any docs related to, say, rgeolocate.

Stumbling blocks

Implementation (want to make it useful, without being platform-specific).
Would want it to be opt-in for package maintainers(?)
Possibly just a human communication issue that could be encouraged by, say, talking to other humans.
Breaking changes — blog posts might be an ephemeral format for this very reason.

Summarize change of packages since previous versions

In a bigger project, or if you take up a project again after some time, you may want to update your packages, and after such an update you may need to make several changes to your code. This can happen without any package management system or with packrat / checkpoint / docker etc. Even if you use packrat / checkpoint / docker to ensure reproducibility, you will want to update dependencies from time to time: either you need something specific from a newer version, or you just know that sooner or later you have to update and the longer you wait the more difficult it will be. However, you should be able to assess beforehand the amount of work needed after an update.

I propose a package that could generate a digestible summary of all the changes in the dependencies of your project between the current state and the newest versions. It would ideally work with or without an existing dependency manager.

identify current versions of dependencies and their newest versions
collect news from CRAN (+ GitHub) (+ any other repository)
create an html / web page from these, ideally organized by breaking changes / new features / fixes
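The first step, finding which dependencies are behind, could be sketched as below. The function `outdated_summary()` and the example data are hypothetical; in practice the two data frames would be built from installed.packages() and available.packages(), and utils::old.packages() already performs a similar comparison against the configured repositories.

```r
# `installed` and `available` are assumed to be data frames with columns
# Package and Version; the example values below are made up.
outdated_summary <- function(installed, available) {
  both <- merge(installed, available, by = "Package",
                suffixes = c("_installed", "_available"))
  # compareVersion() returns -1 when the first version is older
  behind <- unname(mapply(
    function(old, new) utils::compareVersion(old, new) < 0,
    both$Version_installed, both$Version_available
  ))
  both[behind, , drop = FALSE]
}

installed <- data.frame(Package = c("pkgA", "pkgB", "pkgC"),
                        Version = c("1.0.0", "0.9.1", "2.3"),
                        stringsAsFactors = FALSE)
available <- data.frame(Package = c("pkgA", "pkgB", "pkgC"),
                        Version = c("1.2.0", "0.9.1", "2.10"),
                        stringsAsFactors = FALSE)

outdated_summary(installed, available)
```

Versions are compared component by component rather than as strings, so "2.10" is correctly treated as newer than "2.3". The resulting table is the list of packages whose news entries would then need to be collected.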

Improvement possibilities:

work for a subset of packages
work for CRAN newest and dev versions as well
identify usages in your code
propose how to change your code: copy-pastable code / functions to reformat if possible (i.e. only renames)
parse changes from code instead of news.md: would be more accurate

generate news.md skeleton based on code changes since last package version

I believe we could build on several existing packages: possibly containerit to identify dependencies, gh to pull news from github

"Safety Profiler" for User package libraries

We embark on unconf 2018 at an apropos time: R 3.5.0 launches and lots of folks are feeling the x.y package upgrade/sidegrade process.

However, we could build a profiler/auditor — much like the emergent node audit — which would let folks know just how lagging they are and what potential safety issues they may be facing as a result.

This could take a bit of work and would also require delving into packages that include C[++] libraries with them.

Improved visualization for drake

Current capabilities

As with many similar reproducible pipeline toolkits, the drake package can display the dependency networks of declarative workflows.

devtools::install_github("ropensci/drake")
library(drake)
load_basic_example() # Call make(my_plan) to run the project.
config <- drake_config(my_plan)
vis_drake_graph(config)

The visNetwork package powers interactivity behind the scenes. Click here for the true, interactive version of the above screenshot. There, you can hover, click, drag, zoom, and pan to explore the graph.

Start fresh and customize!

Using the dataframes_graph() function, you can directly access the network data, including the nodes, edges, and relevant metadata. That means you can create your own custom visualizations without needing to develop drake itself. You can start from a clean slate and create your own fresh tool.

Unconf18 projects ideas

Condensed graphs

Ref: ropensci/drake#229. Network graphs of large workflows are cumbersome. Even with interactivity, graphs with hundreds of nodes are difficult to understand, and larger ones can max out a computer's memory and lag. Condensed graphs could potentially respond faster and more easily guide intuition. There are multiple approaches for simplifying, clustering, and downsizing. Examples:

EDIT: per ropensci/drake#229 (comment), base drake is likely to support a rudimentary form of clustering. But a separate tool could account for nested groupings, and a shiny app could allow users to assign nodes to clusters interactively.

Static graphs

Ref: ropensci/drake#279. To print a visNetwork, you can either take a screenshot or export a file from RStudio's viewer pane. Either way, you need to go through a point-and-click tool or one of the screenshot tools @maelle mentioned in #11. Drake cannot yet create static images on its own, and such images could be crisper than screenshots and would enhance reproducible examples.

Workflow plan generation

In drake, the declarative outline of a workflow is a data frame of targets and commands.

load_basic_example()
head(my_plan)

## # A tibble: 6 x 2
##   target            command                                                                      
##   <chr>             <chr>                                                                        
## 1 ""                "knit(knitr_in(\"report.Rmd\"), file_out(\"report.md\"), quiet = TRUE)"
## 2 small             simulate(48)                                                                 
## 3 large             simulate(64)                                                                 
## 4 regression1_small reg1(small)                                                                  
## 5 regression1_large reg1(large)                                                                  
## 6 regression2_small reg2(small)

The make() function resolves the dependency network and builds the targets.

make(my_plan) 
## target large
## target small
## target regression1_large
## target regression1_small
## target regression2_large
## ...

Currently, users need to write code to construct workflow plans (see drake_plan(), wildcard templating, and ropensci/drake#233). To begin a large project, I usually need to iterate between drake_plan() and vis_drake_graph() several times before all the nodes connect properly. A shiny app could interactively build an already-connected workflow graph and then generate a matching plan for make().

Alternative graphical arrangements (re: #12 (comment))

The default graphical arrangement in drake can be counter-intuitive. The dependency graph shows how the targets and imports depend on each other, which is super important, but it is not necessarily the order in which these objects are used chronologically. For example, in this network from vis_drake_graph(), the reg1() function appears upstream from small even though reg1() takes small as an argument to build regression1_small. An optional "code graph" or "call graph" could better demonstrate the flow of execution during make().

Final (initial?) thoughts

Drake stands out from its many peers with its intense focus on R. R stands out because of its strong community and visualization power. Collaboration on visuals will really help drake shine and hopefully improve reproducible research.

cc @krlmlr, @AlexAxthelm, @dapperjapper, @kendonB, @rkrug

Create educational docs/blog for claims like “R can’t be used in production”!

RStudio is looking for a Director of Education who can help with this project.
https://hire.withgoogle.com/public/jobs/rstudiocom/view/P_AAAAAACAAADJv_c3liAM2t

How can we ensure that uptake of R is not limited by misconceptions like “R isn’t a real programming language” or “R can’t be used in production”?

But why don't we, as a community of R users, work on a doc or blog post where data scientists talk about their experience of using R for a lot more than just statistics? That could help remove the stigma that is passed on to young and aspiring data scientists like me. In all honesty, when I started learning data science, the instructors were pretty harsh in their view of R without even knowing its full capabilities.

Why can't we collaborate with an open-source business like RStudio and work with them to help promote R for higher-level programming?

Tools for discovering new packages (again)

Direct follow-on from last year's two related issues, thanks to @sfirke. The flipper package is kinda developed, kinda stalled, but I personally would love to get that a bit more developed. It currently does full heavyweight text analysis of DESCRIPTION files of all CRAN packages and produces a document similarity matrix that is used to connect one package to another.

The original vision of @njtierney was a standard swipe interface, which we re-branded "flip", to enable quick and easy package browsing. In its current state, one can simply:

flipper::flip("package about a bunch of interesting stuff")

And it'll find a starting point in the matrix and then traverse strongest connections. We think that alone is kinda nifty, so please try! Required/desired refinements include:

Refining methods of traversing the matrix, including incorporating user stats with all associated concerns raised in previous issue. Extension to an ML framework would be very straightforward, because the whole thing works on fixed-sized binary vectors (like/dislike next jump along vector).
As @jimhester pointed out in original issue, trawling man files is likely to be even more informative. The infrastructure for this is all there, but it might push the limits of text similarity matrix processing?
Extension to all non-CRAN packages on github (I know there's a list somewhere, and @maelle has her excellent is_package function for repo enquiry.)
Slick flippable interface

That's all it would take to have most of the infrastructure there for one to type some text and start flipping through R packages until one discovered something desirable, interesting, or at least unexpected.
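To make the idea of a document similarity matrix concrete, here is a toy version in base R. It is not flipper's actual implementation: it uses binary word-occurrence vectors and cosine similarity on three made-up descriptions, and `nearest()` is a hypothetical helper for taking one step through the matrix.

```r
# Toy package descriptions; real input would be DESCRIPTION text from CRAN.
descriptions <- c(
  pkgA = "tools for reading json data from web apis",
  pkgB = "parse json and xml data from apis",
  pkgC = "color palettes for plots"
)

# Build a binary term-document matrix (1 if the word occurs in the text).
tokens <- strsplit(tolower(descriptions), "[^a-z]+")
vocab <- sort(unique(unlist(tokens)))
tdm <- t(vapply(tokens, function(tok) as.numeric(vocab %in% tok),
                numeric(length(vocab))))
colnames(tdm) <- vocab

# Cosine similarity between every pair of descriptions.
norms <- sqrt(rowSums(tdm^2))
similarity <- (tdm %*% t(tdm)) / (norms %o% norms)

# Strongest connection from a starting package, excluding itself.
nearest <- function(pkg, sim) {
  others <- sim[pkg, colnames(sim) != pkg]
  names(others)[which.max(others)]
}

nearest("pkgA", similarity)
```

Traversing the matrix then amounts to calling `nearest()` repeatedly while skipping packages already shown; a like/dislike signal from the user could be used to reweight which neighbour is chosen next.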

Foster an LGBTQ community for R

R-Ladies has been enormously successful at promoting gender diversity within the R community, and bringing many new women (and their diverse perspectives) into the R community. Can we do something similar for the LGBTQ community?

There are some differences between the communities we'd need to consider:

The LGBTQ community is much smaller than the R-Ladies community; would the intersection of R and LGBTQ be large enough to sustain a vibrant community?
On that note, there are many LGBTQ allies in the R community; how could we incorporate allies into this community as well?
Are there existing LGBTQ communities for data science / machine learning? Perhaps we could form an R community there?
R-Ladies is already very welcoming to diverse participants generally; perhaps a program within R-Ladies would make the most sense.
How would LGBTQ participants best like to engage within such a community? Thinking here about private forums vs public forums, etc.
Other thoughts/issues? (Please suggest in the comments.)

I propose we form a small group to discuss these issues and to make some recommendations for fostering an LGBTQ community as a part of the R community.

Percentiles and z scores in maternal child health

I often need to calculate percentiles, z scores, and other measures of growth in maternal & child health research. There are some SAS macros out there and a couple of R packages. The 2 R packages I found don't have all of the measures I need and are a bit clunky to use with tidyverse packages. There are other measures that don't have a SAS macro or R package, just a data table of LMS parameters in a manuscript (PDF). Ideally these methods would be available all in one place in an R package!

Here are some things the package could calculate:

Percentiles and z scores for BMI using the CDC reference charts (SAS program: https://www.cdc.gov/nccdphp/dnpao/growthcharts/resources/sas.htm). I've found 1 package (childsds) that does this, but it does not have all of the options of the SAS macro. Importantly, it doesn't calculate %BMIp95 (percent of the 95th percentile of BMI), which is being used more in obesity research. The zscorer package only uses WHO growth charts. The CDC charts are standard in US pediatric obesity research. I have been in contact with the author of the SAS macro and he is very nice and fully supportive of turning this into an R package!
Birth weight for gestational age z-score. I haven't seen a publicly available program for this. I use a SAS program copied from another analyst. (Ref: Oken, E., Kleinman, K. P., Rich-Edwards, J., & Gillman, M. W. (2003). A nearly continuous measure of birth weight for gestational age using a United States national reference. BMC pediatrics, 3(1), 6.)
Child blood pressure percentiles. (https://sites.google.com/a/channing.harvard.edu/bernardrosner/pediatric-blood-press/childhood-blood-pressure/childhoodbppctsas)
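For measures published only as tables of LMS parameters, the core calculation is small. The sketch below uses the standard LMS transformation; the function names and the L, M, S values are hypothetical (they are not taken from the CDC tables), and any special handling of extreme values done by the reference SAS programs is not reproduced here.

```r
# z-score from LMS parameters:
#   z = ((x / M)^L - 1) / (L * S)   if L != 0
#   z = log(x / M) / S              if L == 0
lms_zscore <- function(x, L, M, S) {
  ifelse(abs(L) < 1e-8,
         log(x / M) / S,
         ((x / M)^L - 1) / (L * S))
}

# Inverse transformation: the measurement that corresponds to a given z-score
lms_value <- function(z, L, M, S) {
  ifelse(abs(L) < 1e-8,
         M * exp(S * z),
         M * (1 + L * S * z)^(1 / L))
}

# Hypothetical LMS parameters for one age/sex group (NOT real reference values)
L <- -2
M <- 16
S <- 0.1
bmi <- 19

z <- lms_zscore(bmi, L, M, S)            # z-score
pct <- 100 * pnorm(z)                    # percentile
p95 <- lms_value(qnorm(0.95), L, M, S)   # BMI at the 95th percentile
pct_of_p95 <- 100 * bmi / p95            # percent of the 95th percentile
```

In a package, L, M and S would be looked up from a reference table by age and sex before calling these functions, which also makes them easy to use inside dplyr::mutate().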

Eventually this could also include measures of maternal gestational weight gain.

After speaking with Stefanie about this, it seems like this could be a good project for someone who hasn't made an R package (myself included). The calculations aren't too difficult.

Taking some pain out of finding/linking to unique IDs?

A wish/need/dread for data standards came up in issue 41, and brought a few ideas to mind:

For cleaning Darwin Core/biodiversity data, there are some good tools (e.g., Kurator, which looks like it's getting some translation to R).
For finding IDs for publications, people, specimens, taxa, etc, there are lots of great resources (fulltext, rorcid, spocc, taxize...)
But for actually finding & linking the pieces (specifically, the unique IDs for publications, specimens, people, etc), projects often run out of energy/time/awareness

Any thoughts on a helper/gentle-reminder app or lesson for suggesting linkable values contained within datasets or papers -- for instance, by indexing what types of fields/records exist in a given dataset, and suggesting relevant packages from CRAN or ropensci that could retrieve identifiers?

I realize I'm glossing over some major obstacles to actually linking data (e.g., cleaning free text values & resolving entities is enough of a mountain; plan-ahead is better than fix-it-in-post when possible), so I'm all ears if this could use more [or a different] focus.
Or if something magical already exists along these lines.
...Or if there's a good/sustainable alternative to developing tools/packages that rely on multiple API wrappers?...

Compiling R logic to the browser

(A rambling post that might converge into an actual unconf project)

I've become intrigued lately with the applications of WebAssembly, the latest approach to running compiled code in the browser. I wrote up a proof-of-concept of how one could run the same C++ code via Rcpp or in an htmlwidget. I think some exploration of use cases and design patterns of shared R/browser implementations of algorithms in this way could make a good unconf project.

It's a huge project to try to port the R runtime to the browser this way, but there are a number of R packages that generate C(++) code for fast model execution that could then be compiled to WebAssembly in the browser:

NIMBLE compiles a sub-set of R code into C++ code that uses the Eigen library.
odin by @richfitz generates C++ code from ODE systems defined in R
Stan models are compiled to C++ before being compiled to binary.
pomp models use C templates with snippets provided by the model-builder.

For any of these, one could imagine fitting and optimizing models in R, then porting them to the browser to power pages/apps built on simulation or prediction from those models. For instance, I could use odin to fit a set of ODEs to data, and then port the fit odin model to an htmlwidget that lets the user simulate and visualize those ODE systems, with the option of changing some parameters or initial conditions. Getting a Stan model to run in the browser in an htmlwidget seems super interesting, too - it would be like shinystan without Shiny. (One of the Stan devs has said they would be interested in helping with this.)

Alternatively, keeping the concept of compiling to the browser but in a very different way, we could compile dplyr code to the browser by using dbplyr to create SQL from R expressions, then sending this SQL to be executed by the browser via a JavaScript SQL implementation such as alasql. One could create "live" htmlwidgets this way: they could query a remote data source, transform the data, and then feed it to an output like a plot.ly graph or a DataTable, all in the browser with no server back-end.

Finally, a totally different idea suggested to me recently was using this for distributed computing. Could one create htmlwidgets that run some code in the user's browser and send results back home? (For instance, running one MCMC chain of a Stan model?)

Thoughts?

Screenshot tools in R

I've recently discovered that RSelenium and seleniumPipes offer screenshot capabilities, including the use of CSS selectors. Their syntax might be easier to use than webshot when one needs to perform an action on the webpage before taking the screenshot. 📸

I think it could be useful to

assess and compare the screenshot functionalities of these three packages 📄
design a new function or package using them that'd make screenshots even easier. 🔧

A good tool for screenshots would:

be well documented
have all the arguments webshot has (CSS selection, cliprect, expand, etc.)
support the more "intuitive" commands one uses in e.g. seleniumPipes instead of requiring JavaScript
support different browsers
cover both websites and Shiny apps, like these three packages do
use magick for image processing and return a magick object instead of writing the screenshot to disk (or at least offer this possibility).

A bonus would be to also cover screencasts, cf rDVR. 🎥

Understanding CRAN Incoming

Recently in the Slack channel I semi-jokingly suggested creating an app that runs a cron job to scrape the incoming folder at CRAN. The aim would be to understand empirically how long packages stay in each state, where they move to, what the bottlenecks are, and what the most common kinds of problems are (for example, what kinds of Notes are most common, and which operating system environments have the most failures and why). Maybe the CRAN managers already know this, but maintainers don't seem to. Ideally, if we had the data, we could use it to make suggestions about how to deal with bottlenecks, to improve documentation to help people avoid problems, or (my pet issue) to improve the understandability of notes.

Debugging workflow

I would be very interested in a brief tutorial or demo by @jimhester or @kevinushey on debugging a C problem without printf(). Preferably on both macOS and Windows. cc @jennybc.

R in Minecraft, the next generation

At rOpenSci 2017, a team created the miner package with R functions to control the Minecraft world. The goal was to provide a framework for young people to learn R, motivated by their desire to automate the construction of objects in Minecraft or to control the world in other ways. An accompanying package, craft, was also created with functions to implement some larger-scale projects (for example, an elsify function to give the player Elsa's power to freeze water behind their feet). A bookdown book collected these projects into book form.

With a small team and another couple of days, we could get the miner package ready for CRAN, and add additional examples to craft and the book. Things we could work on include:

Cleaning up the API, in particular making location functions more consistent, and using R vectors instead of separate x, z, y arguments
Making the server connection more robust (right now it tends to disconnect every few minutes)
Providing a cleaner interface to Minecraft block types
Creating an up-to-date Docker image for the Spigot server, and simplifying the installation process

The Spigot/RaspberryJuice server is a bit limited too; there's no way to change the orientation of the player, for example. We might be able to explore other Minecraft APIs and see if they're more suitable for this project.

packrat: ease the use of external libraries

First case: a team collaborates on a project with packrat and one of the team members would like to use a package not closely related to the project, like colorout. AFAIK you can add this package as an external package, but then it would be required as an external package for all project members. You can work around this by manipulating .Rprofile files, but it is quite cumbersome.

Second case: there are some "meta" packages that do not necessarily have a place as project dependencies, like usethis, lintr, goodpractice, covr, pkgdown etc. I take advantage of these packages in almost all of my projects, but I do not necessarily want to add them as dependencies. I can use packrat::with_extlib, but it is not enough to specify the main package, like pkgdown: I also have to list all of its dependencies that are not present in my project, which varies and makes it somewhat cumbersome to use. I think we could automate this.

packrat::with_extlib(c("pkgdown", "rstudioapi", "highlight", "debugme", "callr", "rematch"), build_site())
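One possible way to automate the dependency list is to compute it with tools::package_dependencies(). The sketch below is untested against packrat itself: `with_deps()` is a hypothetical helper, and it assumes that the `db` argument can be pointed at the user library (inside a packrat project, installed.packages() would need the appropriate lib.loc). Base packages appear in the result and may need filtering out.

```r
# Collect packages plus their recursive dependencies, using an installed
# library as the dependency database. `with_deps()` is a hypothetical helper.
with_deps <- function(pkgs, db = utils::installed.packages()) {
  deps <- tools::package_dependencies(
    pkgs,
    db = db,
    which = c("Depends", "Imports", "LinkingTo"),
    recursive = TRUE
  )
  unique(c(pkgs, unlist(deps, use.names = FALSE)))
}

# A base package is used here so the example runs without extra installs
with_deps("stats")

# Intended use (needs packrat and pkgdown, so it is left commented out):
# packrat::with_extlib(with_deps("pkgdown"), pkgdown::build_site())
```

This replaces the hand-maintained vector of package names with a single call, at the cost of loading more packages than strictly necessary.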

Tagging @kevinushey as the developer of packrat - what do you think? Is there already a solution I missed?

Finally build out the Security CRAN Task View

I started a repo but time is an enemy. A cpl of days in the Evergreen City could end up. @davidski had a stellar idea to use rOpenSci Unconf 2018 as a potential place to get'er done and I'll gladly lead the way to do this.

It won't take both days and folks can likely be involved in other projects before/during/after this one.

Implementation of non-linear dimensionality reduction algorithm (UMAP)

I recently read about a new non-linear dimensionality reduction algorithm called UMAP (github, arxiv), which is much faster than t-SNE, while producing two-dimensional visualizations that share many characteristics with t-SNE. I initially found out about it in the context of use on high-dimensional single-cell data in this paper.

The reference implementation is in Python (see github link above). It can be run in R through rPython as shown here. There is an R package designed for comparing dimensionality reduction techniques that contains an implementation of UMAP, but this package is "not suitable for large scale visualization" and I'm not completely sure based on the README whether it is an accurate or approximate implementation.

My thought is that the ideal would be a package focused on UMAP specifically, implemented in R or Rcpp. Unfortunately I am not at all an expert in this topic or familiar with the mathematics involved, so the best I would be able to do is try to translate the Python implementation into R.

Color coded errors/warnings/messages/printed text

By default, errors, warnings and messages are all printed in red in R, which can be confusing: red usually means something is wrong, but a message can be purely informative. I think it would be nice to provide an easy way to create colored messages, e.g. by default red for errors, yellow for warnings, blue for friendly information etc. With the crayon package this should not be too difficult.

I am thinking that either the end user or the package author should be able to control this. We could distinguish style further based on the class of the condition.
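A minimal sketch of the mechanism, using calling handlers to intercept conditions and reprint them in color. To keep it dependency-free it uses raw ANSI escape codes rather than crayon (which would handle terminal detection more robustly); `colorize()` and `with_colored_conditions()` are hypothetical names, and errors are left untouched here.

```r
# Wrap text in ANSI color codes; assumes a terminal that supports them.
colorize <- function(text, color = c("red", "yellow", "blue")) {
  codes <- c(red = 31, yellow = 33, blue = 34)
  color <- match.arg(color)
  paste0("\033[", codes[[color]], "m", text, "\033[0m")
}

# Evaluate `expr`, reprinting warnings in yellow and messages in blue.
with_colored_conditions <- function(expr) {
  withCallingHandlers(
    expr,
    warning = function(w) {
      cat(colorize(paste("Warning:", conditionMessage(w)), "yellow"),
          "\n", sep = "", file = stderr())
      invokeRestart("muffleWarning")  # suppress the default (uncolored) output
    },
    message = function(m) {
      # conditionMessage() of a message already ends with a newline
      cat(colorize(conditionMessage(m), "blue"), file = stderr())
      invokeRestart("muffleMessage")
    }
  )
}

result <- with_colored_conditions({
  message("Loading data")
  warning("3 rows dropped")
  42
})
```

Styling by condition class would follow the same pattern: add one handler per class, named after the class, and pick the color inside it. A user-facing option could map class names to colors.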

Batch File Processing Workflow

Hi Everyone,

I've been kicking this idea around for a little bit. Our group does a lot of batch processing of input files when we run our pipeline for flow cytometry data. Sometimes the output of a step will fail, and we have to flag the files that fail so they aren't passed through further steps in the pipeline.

When I do this currently, I basically build file manifests (location of files with relevant metadata) and run some sort of processing in R. I was thinking that by incorporating data assertions (like with assertr), we could have a workflow that shows when files pass a step, and flags those files that fail a processing step. In the end, we can display to users of the pipeline which files passed, which files didn't, and at which steps.
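A bare-bones version of that workflow might look like the sketch below. It uses plain predicate functions instead of assertr, and everything in it (the `run_pipeline()` helper, the step names, the temporary files) is made up for illustration.

```r
# Run each named check over every file in a manifest and record the first
# step at which a file fails. `run_pipeline()` is a hypothetical helper.
run_pipeline <- function(manifest, steps) {
  manifest$status <- "passed"
  manifest$failed_step <- NA_character_
  for (i in seq_len(nrow(manifest))) {
    for (step_name in names(steps)) {
      # A step fails if it returns anything but TRUE or throws an error
      ok <- tryCatch(
        isTRUE(steps[[step_name]](manifest$path[i])),
        error = function(e) FALSE
      )
      if (!ok) {
        manifest$status[i] <- "failed"
        manifest$failed_step[i] <- step_name
        break  # do not pass a failed file to later steps
      }
    }
  }
  manifest
}

# Toy manifest: one good file, one empty file, one missing file
good <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,2"), good)
empty <- tempfile(fileext = ".csv")
file.create(empty)
manifest <- data.frame(
  path = c(good, empty, file.path(tempdir(), "missing.csv")),
  stringsAsFactors = FALSE
)

steps <- list(
  exists = function(p) file.exists(p),
  non_empty = function(p) file.size(p) > 0,
  has_header = function(p) grepl(",", readLines(p, n = 1))
)

report <- run_pipeline(manifest, steps)
report[, c("status", "failed_step")]
```

The returned manifest doubles as the report for pipeline users: one row per file, with the step that flagged it. Real processing steps would return their output as well, which this sketch leaves out.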

Maybe there's a little germ of an idea here that might work for the unconf. I'm not sure, so I'm putting it out there.