ropensci / auunconf Goto Github PK

View Code? Open in Web Editor NEW

18.0 18.0 4.0 14.81 MB

repository for the Australian rOpenSci unconference 2016!

HTML 24.02% JavaScript 50.05% CSS 25.93%

auunconf unconf

auunconf's Introduction

rOpenSci

This repository has been archived. The former README is now in README-NOT.md.

auunconf's People

Contributors

Stargazers

Watchers

Forkers

milesmcbain a-simmons dicook yaozhao94

auunconf's Issues

Exploration and visualisation of missing data

In my PhD research I work with medical data and there are often large amounts of it missing. In my attempts to explore missing data problems and make my life easier I have done some work on two packages: ggmissing with Di Cook, and mex with Damjan Vukcevic. But, as my PhD research continues, I have been finding it hard to dedicate some serious time to continue work on these packages.

I'd like to propose a project on one, or perhaps both of these packages.

A bit more about them:

ggmissing extends ggplot to allow for missing data to be visualised. This would basically involve creating a couple of ggplot geom_missing_* functions that could be added as a layer to a plot. For example, geom_missing_point() would add in and colour the missing points. You can see more about it on the github repo, and at these slides.

mex is a missingness exploration package. This extends off of some research that I have done into using decision trees to explore missing data. The original idea of the package was to create a framework or even a recommended path for handling missing data. One idea was to break it into exploring, modelling, and confirming.

Exploring would include:

Creating a better, fast version of Little's MCAR test
Tabulation of missing data
use of t-tests/chi^2 to explore whether missingness affects values/counts
Tools and variations on function from previous work in packages like MissingDataGUI
Incorporating visualisations from visdat

Modelling would include:

Using machine learning methods to explore missing data.
Identifying clusters of missing data and then predicting these clusters with machine learning methods

Confirming might be something like:

Using cross validation to explore how accurate the missing data mechanism is

I'm very much open to suggestions about how to implement these ideas.

Assess the quality of open data in an open data portal

Create a tool to assess the quality of open data in an open data portal.
a challenge by ODI Queensland

Build on prior r work:

testdat
An analysis by the ODI of CSV files on Data.gov.uk

Leverage existing validation tools:

CSVLint.io a tool from the ODI to validate CSV files
GoodTables by Open Knowledge Labs

Apply standards, best practices or quality measures:

Open Data Certificates by the ODI
Frictionless Data and Data Packages by Open Knowledge
W3C CSV for the Web
Tau - a metric to assess the timeliness of data in catalogues.

Assess an open data portal or two:

Use any or none of these suggestions to provide insights about the quality of open data and how it is published.

Help open data publishers improve so the data they publish can be used to deliver ongoing value.

Thinking about taking the challenge? Got questions? Reply below and we'll do our best to answer.

Visualise citation network in a hive plot

from (http://www.hiveplot.net/)

A citation network usually looks like the one on the right. Martin Krzywinski from the Genome Sciences Center would call it a hairball, and he makes a convincing argument why the hive plots are better.

I wonder if we could visualise citation network data (say from the Web of Knowledge databases) in a hive plot and get more out of the data than from a force-directed network. Here's an example from Mike Bostock, showing the dependency graph of the Flare visualization toolkit.

Of course, it doesn't have to be a hive plot. The question is, how we could better visualise citation network? matrix diagrams and hierarchical edge bundling are two other approaches.

hextri package, documentation, vignette

R package for automatically spinning up a cluster on Amazon EC2 for use with SNOW/Snowfall

This is an idea for an R package that I started thinking about some time ago and I have made a good start on some scripts that could be used as a basis for this. Essentially, the idea would be to create an R package that would allow researchers to easily use other R packages like SNOW and Snowfall (for parallel computing on a cluster) on an Amazon Web Services (AWS) Elastic Compute (EC2) platform. The package would aim to make the process of spinning up a cluster on AWS as simple as possible, with functions that take care of this process and return the IP addresses of the workers. These in turn can then we handed to SNOW/Snowfall for easy parallel computation.

Shiny app for analysis of historical Summer Olympic Games performance

Topical for 2016 perhaps; an analysis of the results of all the modern Summer Olympic Games to date, presented as a multi-tabbed shinydashboard. Breakdowns by country, sport, event, gender, year, etc... Analysis of medal tallies, world-records, games-records, etc... by the same. It's a good chance to produce some really great example graphs out of ggplot2/ggvis/<your flavour>.

This looks like a reasonable starting point (needs cleaning up) and there must be dozens of ways to slice it up and present/analyse it. It would be interesting to see how things have changed over time (major winners, plateau of records, clustering of countries towards their strength sports).

I can't immediately see a package on CRAN that already does this, and if we have some forethought we can include various predictions for the 2016 games.

Interactive model diagnostic plots for the next generation of data modellers.

So a little while ago I put together shist(), kind of on a lark. It's an interactive histogram with a slider to set the bins. I blogged about on the BRAG one weiRd tip site.

I realise the idea is not original, but it does beg the question: can we provide a better user experience than static images for the common workhorse plots and diagnostics that modellers use all the time?

A good example is the residuals vs fits plot for generalised linear models. When a troublesome point calls out to you, you should be able to interrogate it with your mouse to get the metrics that indicate its leverage. Not waste time creating another plot for that. It might even give you the data (or index), because it's highly likely that's the next thing you want to know.

Bayesians aren't left out by any means: I rarely see an [MCMC trace](MCMC trace) that doesn't look like an amplitude plot for an audio file on soundcloud. For this to be of any real use I reckon it needs to be zoomable and scrubbable. The autocorrelation plots would probably likewise benefit from some opportunity to filter down to a selectable window of samples.

You all probably have your own key plots for your respective disciplines, the ones you read about in the handbook. I'd bet they can be improved with a bit of a redesign incorporating the interactive elements we have at our fingertips thanks to the likes of shiny, ggvis, plotly et. al. It wouldn't take much work to get a few of these together into a package that could have a pretty high impact.

Creating high quality metadata using ropensci/EML (and ingesting these into workflows)

Ecological Metadata Language interface for R: synthesis and integration of heterogenous data

This EML metadata standard allows highly detailed information about methods, classificatory protocols, spatial and temporal coverage and ownership/intellectual rights. It is also the language used by the open-source metacat portal for publishing data.

The ropensci EML package https://github.com/ropensci/EML has been in development for years and is now approaching a release to CRAN in the near future.

I suggest to consider the idea raised in the issue below to "have a little mini-hackathon where EML/R users could propose use cases, and we could review and code solutions to make those use cases for metadata creation straightforward"
ropensci/EML#144 (comment)

Render large spatial datasets in shiny leaflet

Overview

The shiny R package has become one of the most popular ways to interactively explore and visualise data. The leaflet R package provides additional functionality that allows developers to embed interactive maps in their web-apps. These interactive web apps allow users to visualise spatial data (check out this tutorial). However, a lot of spatial data is currently too large for leaflet to natively render in a feasible amount of time (for example the worlds protected area network. I propose developing a fork of the leaflet R package with this capability. This fork would hopefully be integrated into the future versions of the leaflet R package.

Technical problem

The leaflet R package renders vector and raster data as html objects. Each coordinate or pixel must be loaded from a server to be rendered in the browser. This is not a problem for small datasets. But for large datasets, the browser must load hundreds of megabytes worth of coordinates or pixels to render the dataset. This causes the web-app to stall, or crash the browser if the dataset is too big.

Proposed solution

Tiles. Map tile layers are used to render detailed spatial data at various zoom levels. For instance, the following layer is rendered using tiles. See how fast it is at rendering this dataset?

library(leaflet)
leaflet() %>% addTiles() %>% addWMSTiles(
  "http://mesonet.agron.iastate.edu/cgi-bin/wms/nexrad/n0r.cgi",
  layers = 'nexrad-n0r-900913',
  options = WMSTileOptions(format = 'image/png', transparent = TRUE),
  attribution = "Weather data © 2012 IEM Nexrad"
)

I propose we add in the functionality to automatically convert vector and raster data to a tile layer, and render this tile layer instead of the raw data. This repository contains a python script designed to take an image and convert it to a set of tiles specifically for leaflet (modified from gdal2tiles in GDAL).

The trick will be linking everything together. I imagine the process will go something like this:

user inputs vector/raster dataset and a colour scheme
the dataset converted to a RGB 3-band .tif image
.tif image saved to disk
the python script used to generate tiles from .tif
location of the tiles on disk passed to addTiles function
data gets rendered as a tile layer

It would be super cool if we could implement an R version of the python script. As currently, the proposed functionality requires that both python and GDAL are installed on the users machine. However, we might be able to (ab)use the rgdal or gdalUtils R packages to have GDAL installed on the users machine. In theory, the geoprocessing could be handled by the rgeos R package.

Desired functionality

The leaflet R package contains functions to render different types of spatial data. For brevity, I'll just show an example of what I'm thinking of using raster datasets.

Here we have an example showing how you could normally render a raster dataset in R with leaflet.

library(raster)
library(leaflet)
filename <- system.file("external/test.grd", package="raster")
rast <- raster(filename)
leaflet() %>% addRasterImage(rast)

I propose modifying this function so that it has extra arguments that specify if the dataset should be converted to tiles for rendering (tiled; defaults to FALSE), and if so, where the files should be stored (dir; defaults to temporary directory but can be specified elsewhere so), and if a tiled dataset is already present at the location should it be overwritten or just used to render the dataset (overwrite; defaults to FALSE).

Note that the default options for the proposed function yield the same behaviour as the function in the current version of the package.

library(raster)
library(leaflet)
filename <- system.file("external/test.grd", package="raster")
rast <- raster(filename)
leaflet() %>% addRasterImage(rast, tiled=TRUE, dir=tempdir(), overwrite=FALSE)

I think if we could implement something like this for the addRasterImage and addGeoJSON functions in the leaflet R package that would be awesome. Or, if we could do this for all the add* functions that would be amazing.

Assess the quality of satellite sea surface and depth temperature data using physical measures

One of the persisting challenges in biosecurity is to predict the parts of a system that will be vulnerable to invasion by a pest. Here, we focus on the possible invasion of pests into Australian ports using ballast water as a vector. Marine vessels routinely exchange ballast water as a safety and efficiency measure. Unfortunately, ballast water carries biota, so exchange within ports can create a risk of invasion.

Presently CEBRA engages with ABARES within the Department of Agriculture and Water Resources to improve the quality of sea temperature data in order to identify ports that are vulnerable to invasion from one of several marine pests. Sea temperature is measured at a number of different scales by different organizations for different means. This challenge is to automate the collection, assessment, and comparison of different sources of sea temperature data.

simplify and/or automate access to satellite-based sea surface and depth temperature time series data for Australian ports
identify, access, and ease access to any kind of sea temperature data as measured physically, including tidal gauges, buoys, fishing data, and so on.
develop models and graphical displays that will simply and compactly compare the satellite to the locally measured data, focusing on: differences in means, min, max, amplitude, and so on.

R package for time series visualization

There have been ubiquitous in seeing large collections of time series in the practice, but it seems to me that there doesn't exist an R package aiming to provide efficient and straightforward tools for visualising sufficiently long time series, a handful of multivariate time series and even big time series. Many issues are worth giving a look/thought. For example,

What's the best data structure for time series data? A data frame with multiple keys (Year, Quarter, Month, Day, Hour, and etc.) and other variables?
It's kind of a pain to switch between a time series object (ts and mts in R) for analysing time series and a data frame for plotting such series in ggplot2. Is there other way around that we can make these two objects transform more seamlessly?
Rob's anomalous package provides a few useful cognostics to summarize time series characteristics, which makes big time series visualization accessible. Can we add more insightful and interesting cognostics to give a comprehensive picture of big time series?

Package for analysing and visualising Australian election data

I just read about this new R package for NZ election data that looks pretty neat: http://ellisp.github.io/blog/2016/04/03/nzelect1/ & https://github.com/ellisp/nzelect/blob/master/README.md

It got me thinking that something similar for Australia could be a good open data project for this unconf. We could go further with some spatial viz and interactivity in a shiny app. Perhaps even some Benford's law analysis ;)

Explore data from "Motivation, values, and work design as drivers of participation in the R open source project for statistical computing"

The paper http://www.pnas.org/content/112/48/14788.abstract was published last December. They did a survey of package developers, and collected lots of information including country and gender. I have this data.

I would like to do an exploration or the survey results, and make a lot of plots, to see what else we can learn about R package authors and why they contribute, and where they come from.

Another tslm problem

library(fpp)
Cityp <- pmax(fuel$City - 25, 0)
x <- 15:50
z <- pmax(x - 25, 0)
fit3 <- lm(Carbon ~ City + I(City^2) + I(City^3) + I(Cityp^3), data = fuel)
fcast3 <- forecast(fit3, newdata = data.frame(City = x, Cityp = z))

package to make it easier to pull data from The Australian Bureau of Meteorology

A package similar to rnoaa[https://www.google.com.au/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=rnoaa%20cran] to make it easy to get data from the Australian Bureau of Meterology.

Down under R stats conference under R consortium umbrella

https://twitter.com/SteffLocke/status/716652963423498240/photo/1

Experiment with (and potentially build on) the ALA R package

From https://github.com/AtlasOfLivingAustralia/ALA4R
"The Atlas of Living Australia (ALA) provides tools to enable users of biodiversity information to find, access, combine and visualise data on Australian plants and animals; these have been made available from http://www.ala.org.au/. Here we provide a subset of the tools to be directly used within R.

ALA4R enables the R community to directly access data and resources hosted by the ALA. Our goal is to enable outputs (e.g. observations of species) to be queried and output in a range of standard formats."

My particular interest is in citizen science contributions to the ALA (I'm not sure what information is available), and how the contributions are assessed.

R package to aid cleaning/checking/formatting data using Codebooks/Data Dictionaries

Subject: R package to aid cleaning/checking/formatting data using Codebooks/Data Dictionaries

Statisticians and scientists producing primary data often have different needs to those scraping secondary and tertiary data off the web.

Often in medical, epidemiological, social, psych and other scientific studies, researchers use codebooks to document data formats, variable labels, factor labels, ranges for continuous variables, details of measuring instruments etc. Sometimes statisticians get a photocopied codebook or pdf but my preference (and that of aware researchers) is a spreadsheet so that these meta data can be used.

For small data sets its probably OK to manually set up factor labels and check for non-defined factor levels and identify out of range values for continuous variables. For data sets with hundreds of variables or when there are many data files with similar structure it is probably better to automate these procedures.

A package for extracting information from codebooks and using the meta data to assist labelling and data cleaning would prove useful to statisticians and scientists at the coal face of producing primary data.

Discussion around accommodation

Hi Everyone,

This is a place to discuss accommodation for the unconference here in Brisbane.

Stay tuned!

R package for accessing data.gov.au open data sets via API

I've worked with some data sets from data.gov.au and noticed that they do have a CKAN API that I can't immediately find any R package associated with (most happy to be corrected).

GET http://www.data.gov.au/api/3/action/group_list

The data is mostly well-organised with attached metadata, various formats, and proper attributions to the relevant department. It's an under-utilised resource as far as I can tell, and there are currently big pushes to better use this (e.g. GovHack challenges).

This doesn't immediately strike me as an insurmountable task (just watch, it will prove me wrong) but one that would have a pretty good payoff in data availability.

R package to store/access metadata associated with data/functions

First off, I see that there is already ropensci/EML and the associated idea, but I'm not a fan of S4, and I'm thinking bigger.

I've brought this up in discussions elsewhere in the past and I know that hadley hasn't made attributes a priority in his workflows (e.g. in relation to assertr() https://twitter.com/hadleywickham/status/559183346144522241) -- in fact, it was only recently that attributes were preserved in dplyr pipelines. They're certainly not preserved in plyr functions.

I'd love to be able to attach a python-esque docstring to data and functions that can be printed without invoking the full help menu (?library), which might contain the last time the object was updated (either automated or manually stated), source, attribution, etc... It's certainly possible to use comment() on a data.frame but I'm thinking perhaps these can be stored similarly to .Rmd files (with full markdown capability?) in a cache and searched/loaded independently to ensure they survive processing. This could include a checksum on the object to enforce reproducibility and perhaps even a trigger system if an object is declared immutable but is altered (override <- ... does one dare?). Needless to say, these would have to be transparent to existing structures, so that would need some careful consideration and balance.

Just thoughts at this stage.

Map structures for vectored maps to use with ggplot2

I find myself often re-doing the same code to extract data from a map repository, extract the polygons, and identifiers, in order to make chloropleth maps, or map backgrounds for spatial data. For example:

#### Set up map - yes, pbly should be recoded with purrr
world <- getMap(resolution = "low")
extractPolys <- function(p) {
  polys <- NULL
  for (i in 1:length(p)) {
    for (j in 1:length(p[[i]]@Polygons)) {
      x <- p[[i]]@Polygons[[j]]@coords
      polys$lon <- c(polys$lon, x[,1])
      polys$lat <- c(polys$lat, x[,2])
      polys$ID <- c(polys$ID, rep(p[[i]]@ID, nrow(x)))
      polys$region <- c(polys$region, rep(paste(p[[i]]@ID, j, sep="_"), nrow(x)))
      polys$order <- c(polys$order, 1:nrow(x))
    }
  }
  return(data.frame(polys))
}
polys <- extractPolys(world@polygons)

#### Map theme
theme_map <- theme_bw()
theme_map$line <- element_blank()
theme_map$strip.text <- element_blank()
theme_map$axis.text <- element_blank()
theme_map$plot.title <- element_blank()
theme_map$axis.title <- element_blank()
theme_map$panel.border <- element_rect(colour = "grey90", size=1, fill=NA)

#### Plot 
qplot(lon, lat, data=polys, group=region, geom="path") + 
  theme_map + coord_equal()

#### Merge data with map
#### Match country names to map names
cntrynames <- unique(datraw$country)
polynames <- unique(polys$ID)
setdiff(cntrynames, polynames)

#### Tabulate the countributing countries
cntry_count <- datraw %>% group_by(country) %>% tally()

#### Join to map
polys_cntry <- merge(polys, cntry_count, by.x="ID", by.y="country", all.x=TRUE)
polys_cntry <- polys_cntry %>% arrange(region, order)
ggplot(data=polys_cntry, aes(x=lon, y=lat)) + 
  geom_polygon(aes(group=region, fill=n), color="grey90", size=0.1) + 
  scale_fill_gradient("", low="#e0f3db", high="#43a2ca", na.value="white") + 
  scale_x_continuous(expand=c(0,0)) + scale_y_continuous(expand=c(0,0)) +
  coord_equal() + theme_map

This is the code that I put together to look at the R contributor survey. I wonder if it would be a good idea to have this packaged for more generally working with spatial data.

Optimizing reproducible research with R and related tools

One of the great strengths of R is how it enables reproducible research. I'm interested in the use of R packages as research compendia to accompany published articles and reports. I'd love to learn more and see some demos of how people are using R and related tools (such as Docker and make) to simplify the reproducibility of their research, and find out where the pain points are for others.

R package for storage/mapping of Australian-specific data sets

(I've done nothing in the way of seeing if anything like this already exists, so this may be a non-starter).

In a similar-yet-different approach to a previous entry, and which could tie in with #20, #16, and #6, what about suring up some Aussie maps with appropriate meta-data? States/boundaries/local boundaries/major and minor cities/etc...

I know that ggmap::get_map can get the ball rolling on a location, but I'd like to see (if it doesn't already exist) a package that loads up a heap of up-to-date and historical Australian data sets that can be added to a stored shapefile map, with the appropriate level factors to do so by state/city/etc... possibly without having to refer to an external map source.

plot_map_AUS("average_rainfall", "Adelaide", 2014)

If one was doing an analysis on such data, that could be extracted and processed however one liked,

histRainData_Adelaide <- plot_map_AUS("average_rainfall", "Adelaide", 2004:2014, plot=FALSE)

No more going and getting the ll/ul/lr/ur from the map bounding box, no more manually looking up a lat/lon to go get a map, or having to wait for the Google API limit to reset. Plot the Australian data NOW.

R package for packaging workloads, uploading and running on AWS

In past projects I've often found myself constrained by the resources of my local machine and wondered why can't I package up my scripts and data (similar to publishing a shiny app) upload it to Amazon, start an EC2 instance start and run my job, then download the result.

Of course I could do this manually, but that would get tedious, I often work offline so I can't work in AWS full time, and I only need the full compute power/memory of an EC2 instance until I'm ready to work on the full dataset.

So my proposal is an R package which:

packages up data and scripts
uploads it to Amazon
starts a server with an appropriate configuration
runs the package and saves the output somewhere
stops the instance

There may be some points of convergence with #12.

Classification for imbalanced data

Recently, I have been exploring the methods for classification on imbalanced data. As I know, the most commonly used technique is a combination of resampling/subsampling techniques plus classification models, like boosted decision tree, random forest, or others. There are various resources online which discuss about this problem, for example, the paper "Handling Imbalanced Data in Customer Churn Prediction Using Combined Sampling and Weighted Random", the blog "8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset", and the GitHub resource topepo/ICHPS2015_Class_Imbalance@master. One question coming into my mind is how imbalanced the data could be and how we can make our model perform better when the proportion of minority class goes down to 10%, 1%, or even 0.1% level. Hope this is also an interesting topic for you.

Visualising and communicating uncertainty

With the launch of the National Innovation and Science Agenda late last year, there is a strong focus to develop new and innovative ideas to help government and industry solve real world problems. While this sounds easy enough, deriving solutions for real world problems can be quite challenging, particularly if we are trying to model complex system processes and reconcile this with actual measurements.

I think the one thing that is lacking is the interpretation of these complex modelling frameworks and more importantly, the communication of uncertainty and what it means to the end user. More often than not, we present the outputs from our “model” in a paper/report that does not have any uptake or stakeholder engagement. How can we change this? How can our models have impact on society?

I would be interested in engaging in a discussion around this topic to determine if there is a general tool or suite of tools that can be developed to assist end users/stakeholders with the interpretation of the outputs from complex models and translate the outputs into impactful decisions. I recognise that this may be problem specific but I think the topic of communicating and visualising uncertainty is an important one that is not done well.

To help facilitate this discussion I have an example from the Great Barrier Reef where interest lies with the prediction of sediment loads. We have developed a Bayesian Hierarchical model for blending modelled output with measurements to provide a spatial map of sediment predictions through time with uncertainties. We are currently exploring ways to present these results to end users through a tool to assist with decision making. It occurred to me that the problem of communicating and visualising uncertainty and making decisions with confidence may not be unique and may benefit other modellers and end users wishing to understand the model that has been delivered.

Enhance the nullabor package

Get vignette into a form for a paper
Add power calculations
Add new null-generating mechanisms