ceu-dv2's Introduction

This is the R script repository of the "Data Visualization 2: Practical Data Visualization with R" course in the 2021/2022 Winter term, part of the MSc in Business Analytics at CEU. For the previous editions, see 2019/2020 Spring and 2020/2021 Winter.

Schedule

3 x 90 mins on Jan 12:

  • 13:30 - 15:00 session 1
  • 15:00 - 15:30 break
  • 15:30 - 17:00 session 2
  • 17:00 - 17:30 break
  • 17:30 - 19:00 session 3

2 x 2 x 75 mins on Jan 19 and 26:

  • 16:30 - 17:45 session 1
  • 17:45 - 18:00 break
  • 18:00 - 19:15 session 2

Location

Hybrid: in-person at the Budapest campus and on Zoom. Zoom URL shared in Moodle.

Syllabus

Please find it in the syllabus folder of this repository.

Technical Prerequisites

Please bring your own laptop* and make sure to install the items below before attending the first class:

  1. Install R from https://cran.r-project.org
  2. Install RStudio Desktop (Open Source License) from https://www.rstudio.com/products/rstudio/download
  3. Register an account at https://github.com
  4. Enter the following commands in the R console (bottom left panel of RStudio) and make sure you see a plot in the bottom right panel and no errors in the R console:
install.packages(c('ggplot2', 'gganimate', 'transformr', 'gifski'))
library(ggplot2)
library(gganimate)
ggplot(diamonds, aes(cut)) + geom_bar() +
    transition_states(color, state_length = 0.1)
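
If the plot does not show up, it can help to first check which of the four packages actually got installed. The following is a minimal sketch using only base R; the variable names `required`, `installed` and `not_installed` are made up for this illustration:

```r
# Packages listed in the install.packages() call above
required <- c("ggplot2", "gganimate", "transformr", "gifski")

# requireNamespace() returns TRUE/FALSE without attaching the package
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Report anything that still needs to be installed
not_installed <- names(installed)[!installed]
if (length(not_installed) > 0) {
  message("Still missing: ", paste(not_installed, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

Any package reported as missing can be installed again on its own with `install.packages()`.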

Optional steps I highly recommend completing before attending the class if you plan to use git:

  1. Bookmark, watch or star this repository so that you can easily find it later

  2. Install git from https://git-scm.com/

  3. Verify that in RStudio, you can see the path of the git executable binary in the Tools/Global Options menu's "Git/Svn" tab -- if not, then you might have to restart RStudio (if you installed git after starting RStudio), or you installed git without adding it to the PATH on Windows. Either way, browse to the "git executable" manually (look for the git executable file in some bin folder).

  4. Create an RSA key (optionally with a passphrase for increased security -- which you then have to enter every time you push to or pull from GitHub). Copy the public key and add it to the SSH keys on your GitHub profile.

  5. Create a new project choosing "version control", then "git" and paste the SSH version of the repo URL copied from GitHub in the pop-up -- now RStudio should be able to download the repo. If it asks you to accept GitHub's fingerprint, say "Yes".

  6. If RStudio/git is complaining that you have to set your identity, click on the "Git" tab in the top-right panel, then click on the Gear icon and then "Shell" -- here you can set your username and e-mail address in the command line, so that RStudio/git integration can work. Use the following commands:

    $ git config --global user.name "Your Name"
    $ git config --global user.email "Your e-mail address"

    Close this window, then commit and push your changes, and you are all set.
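
The git setup from the steps above can also be checked from the R console. This is only a sketch under the assumption that git is expected to be on the PATH; `git_path` is a name chosen for this example:

```r
# Sys.which() returns the full path of the executable, or "" if it is not found
git_path <- Sys.which("git")

if (nzchar(git_path)) {
  # Print the installed git version
  system2(git_path, "--version")
  # Print the configured identity; no output means it has not been set yet
  system2(git_path, c("config", "--global", "user.name"))
  system2(git_path, c("config", "--global", "user.email"))
} else {
  message("git was not found on the PATH; set the git executable manually in RStudio.")
}
```

An empty `git_path` most likely corresponds to the PATH problem described in step 3.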

Find more resources in Jenny Bryan's "Happy Git and GitHub for the useR" tutorial if in doubt or contact me.

(*) If you cannot use your own laptop, there's a shared RStudio Server set up in AWS for you. Look up the class Slack channel for how to access it, or find below the steps showing how the service was configured:

💪 RStudio Server installation steps
echo "deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/" | sudo tee -a /etc/apt/sources.list.d/cran.list
wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc
sudo add-apt-repository ppa:c2d4u.team/c2d4u4.0+
sudo apt update && sudo apt upgrade
sudo apt install r-base gdebi-core r-cran-ggplot2 r-cran-gganimate
sudo apt install cargo libudunits2-dev libssl-dev libgdal-dev
wget https://download2.rstudio.org/server/bionic/amd64/rstudio-server-2021.09.2-382-amd64.deb
sudo gdebi rstudio-server-2021.09.2-382-amd64.deb
💪 Creating users
secret <- 'something super secret'
users <- c('list', 'of', 'users')

library(logger)
library(glue)
for (user in users) {

  ## replace characters that are not valid in usernames
  user <- sub('@.*', '', user)
  user <- gsub('-', '_', user, fixed = TRUE)
  user <- gsub('.', '_', user, fixed = TRUE)
  user <- tolower(user)

  log_info('Creating {user}')
  system(glue("sudo adduser --disabled-password --quiet --gecos '' {user}"))

  log_info('Setting password for {user}')
  system(glue("echo '{user}:{secret}' | sudo chpasswd")) # note the single quotes + placement of sudo

  log_info('Adding {user} to sudo group')
  system(glue('sudo adduser {user} sudo'))

}
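
The username clean-up inside the loop can be tried on its own, without creating any system users. Below is a minimal sketch: `sanitize_username` is a hypothetical helper name and the input addresses are invented. It uses `gsub()` so that every hyphen and dot is replaced, not only the first one:

```r
# Hypothetical helper: turn an e-mail address or display name into a
# lowercase login name containing only letters, digits and underscores
sanitize_username <- function(x) {
  x <- sub("@.*", "", x)                # drop the e-mail domain, if any
  x <- gsub("-", "_", x, fixed = TRUE)  # replace every hyphen
  x <- gsub(".", "_", x, fixed = TRUE)  # replace every literal dot
  tolower(x)
}

sanitize_username(c("Jane.Doe-Smith@example.com", "j.r.r.tolkien"))
# → "jane_doe_smith" "j_r_r_tolkien"
```

The `fixed = TRUE` argument matters for the dot: as a regular expression, `.` would match any character.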

Class Schedule

Week 1

  1. Warm-up exercise and security reminder: 1.R
  2. Intro / recap on R and ggplot2 from previous courses by introducing MDS: 1.R
  3. Scaling / standardizing variables: 1.R
  4. Simpson's paradox: 1.R
  5. Intro to data.table: 1.R
  6. Anscombe's quartet: 1.R
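
As a taste of the last topic, the summary statistics of Anscombe's quartet can be reproduced with the `anscombe` dataset that ships with base R. This is an illustrative sketch and not the content of 1.R; the object name `stats` and the column labels are chosen for this example:

```r
# anscombe is part of the datasets package (columns x1..x4 and y1..y4)
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y), cor_xy = cor(x, y))
})
colnames(stats) <- paste0("set", 1:4)

# The four sets share nearly identical means and correlations ...
round(stats, 2)

# ... but look very different once plotted
par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
```

The point of the exercise is that summary numbers alone can hide very different data patterns, which is why plotting comes first.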

Suggested reading:

Homework 1

  1. Load the nycflights13 package and check what kind of datasets exist in the package, then create a copy of the flights dataset as a data.table object called flight_data.
  2. Which destination had the lowest average arrival delay from LGA, considering only destinations with at least 100 flights?
  3. Which destination's flights were the most on time (average arrival delay closest to zero) from LGA, considering only destinations with at least 100 flights?
  4. Who is the manufacturer of the plane that flies the most to the CHS destination?
  5. Which airline (carrier) flew the most by distance?
  6. Plot the monthly number of flights with 20+ mins arrival delay!
  7. Plot the departure delay of flights going to IAH and the related day's wind speed on a scatterplot! Is there any association between the two variables? Try adding a linear model.
  8. Plot the airports as per their geolocation on a world map, mapping the number of flights going to that destination to the size of the symbol!

If in doubt about the results and outputs, see this example submission prepared by Misi.

Submission: prepare an R markdown document that includes the exercise as a regular paragraph then the solution in an R code chunk (printing both the code and its output) and knit to HTML or PDF and upload to Moodle before Jan 19 noon (CET).
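
The expected document layout can be sketched by generating a small R markdown skeleton from R. The file name `homework1.Rmd`, the title, and the chunk content are assumptions made for this illustration, not a required template:

```r
# Build the code fence programmatically so this example can be shown inside a fence
fence <- strrep("`", 3)

rmd <- c(
  "---",
  'title: "Homework 1"',
  "output: html_document",
  "---",
  "",
  # The exercise goes in as a regular paragraph ...
  "Load the nycflights13 package and check what kind of datasets exist in the package.",
  "",
  # ... followed by the solution in an R code chunk; with the default
  # chunk options both the code and its output are printed
  paste0(fence, "{r}"),
  "library(nycflights13)",
  "data(package = 'nycflights13')",
  fence
)

path <- file.path(tempdir(), "homework1.Rmd")
writeLines(rmd, path)

# Knit to HTML (needs the rmarkdown and nycflights13 packages):
# rmarkdown::render(path)
```

In RStudio the same result comes from opening the Rmd file and clicking the Knit button.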

Week 2

  1. Homework format and solutions: 2.R
  2. Geocoding: 2.R
  3. Alternatives to boxplot: 2.R
  4. Data patterns: 2.R
  5. Animations for hierarchical clustering: 2.R

Suggested reading:

Homework 2

Replicate https://rpubs.com/daroczig-ceu/dv2-h2. Find source dataset on Moodle.

Submission: prepare an R markdown document that includes the exercise as a regular paragraph then the solution in an R code chunk (printing both the code and its output) and knit to HTML or PDF and upload to Moodle before Jan 26 noon (CET).

Week 3

  1. Homework bonus exercise solutions: 3.R
  2. Loading and rendering shapefiles: 3.R
  3. datasaurus: 3.R
  4. Creating factors from numeric variables: 3.R
  5. Summaries with data.table: 3.R
  6. ggplot2 themes: 3.R
  7. Interactive plots: 3.R
  8. PCA demo on image processing: 3.R

Final project

Use any publicly accessible dataset (preferably from the TidyTuesday projects at https://github.com/rfordatascience/tidytuesday) and do data transformations that seem useful, optionally merge external datasets, generate data visualizations that make sense and are insightful, plus provide comments on those in plain English.

Submission: prepare an R markdown document that includes a plain English text description of the dataset, the problems/questions you analyzed, actual R code chunks (printing both the code and its output) doing the analysis, comments and a summary/conclusion of the results, and knit the Rmd to HTML, then upload to Moodle before Feb 16, 2022 midnight (CET). Please don't leave the submission to the last minute, and be sure to submit by Feb 9, 2022 if you would like to get some feedback before the final deadline.

Required items:

  • use 5 different types of plots (e.g. a scatterplot, boxplot, barchart, map etc.)
  • tweak the axis labels (e.g. add axis titles + units of measurement), provide title and subtitle
  • get rid of the gray panel background
  • create an animation
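
Two of the required items, the labels and the panel background, can be illustrated on the built-in `mtcars` dataset. This is a sketch of one possible approach, with a made-up title and subtitle; `theme_bw()` is just one of several ggplot2 themes without the gray panel:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Axis titles with units of measurement, plus title and subtitle
  labs(
    title = "Heavier cars tend to be less fuel efficient",
    subtitle = "32 car models from the mtcars dataset",
    x = "Weight (1000 lbs)",
    y = "Fuel efficiency (miles per gallon)"
  ) +
  # Replace the default gray panel background with a white one
  theme_bw()

print(p)
```

Other built-in themes such as `theme_minimal()` or `theme_classic()` also drop the gray panel, as does a custom `theme(panel.background = element_blank())`.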

The above items, together with proper homework solutions from the first two weeks, will result in a "B" grade.

For "A", please also work on the below extra items:

  • use data.table
  • add custom style to your plots by specifying non-default colors, font family, grid etc.
  • if the dataset has any spatial aspect, try to create a map (even if some geocoding is required), otherwise try to use some of the stats methods covered in the class (MDS, clustering, PCA)
  • publish your results on RPubs.com

Contact

File a GitHub ticket.
