nceas / nceas-training

Training materials and modules from R-based data science short courses at NCEAS

TeX 33.78% R 56.44% CSS 5.85% Shell 0.14% JavaScript 3.79%

nceas-training's Introduction

NCEAS Training

This repository contains lessons used in NCEAS training events. The lessons are all written in RMarkdown and set up so that they build as a bookdown.

To contribute, see our contributing document

Customizing Materials

To create a custom book for a specific training, create a new branch for the training event (e.g., 2019-11-RRCourse). In that branch, you can edit _bookdown.yml to specify which content to include, and you can modify chapters. The built book should be hosted on a separate repository specific to that training event, not this repository. Please do not commit built versions of the book. Additionally, when adding material, please carefully consider file size. PDF presentations should be compressed, and data files, if absolutely necessary, should be small (< 1 MB).
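As a rough sketch of what that customization might look like, the R snippet below writes a trimmed-down _bookdown.yml for an event branch. The chapter file names and the book_filename value are hypothetical placeholders, not files confirmed to exist in this repository; rmd_files is the bookdown setting that controls which chapters are built and in what order.

```r
# Hypothetical chapter files for a training event; substitute the real ones.
chapters <- c("index.Rmd", "r-intro.Rmd", "git-setup.Rmd")

# Build the YAML by hand so the sketch has no dependencies beyond base R.
config_lines <- c(
  'book_filename: "2019-11-RRCourse"',
  "rmd_files:",
  paste0('  - "', chapters, '"')
)

# Written to a temporary directory here; in an event branch this would be
# the _bookdown.yml at the repository root.
config_path <- file.path(tempdir(), "_bookdown.yml")
writeLines(config_lines, config_path)

# Print the generated configuration for inspection.
cat(readLines(config_path), sep = "\n")
```

Only the chapters listed under rmd_files are rendered, so an event branch can leave unused lessons in place without including them in the built book.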

Updating Materials

Changes to chapters that would be beneficial to other training events should be merged back into the master branch.

nceas-training's Issues

Remaining NEON edits / changes

remote lesson plans: Data Management Plans

Delivery Format

proposed

Synchronous presentation (30 min)

Asynchronous presentation and exercise

Resources Needed

Zoom or screencasting software

Session 8 fixes

8.1.4 Joins in dplyr: the notes refer to read.csv, but the code chunk uses read_csv
stringsAsFactors is no longer needed in R >= 4.0.0 (also noted in #89)

sites_df <- data.frame(site = c("HAW-101",
                                "HAW-103",
                                "OAH-320",
                                "OAH-219",
                                "MAI-039"),
                       stringsAsFactors = FALSE)
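A quick check, assuming R >= 4.0.0 (where the default for stringsAsFactors changed to FALSE), suggests the argument can be dropped from the lesson without changing the result:

```r
sites <- c("HAW-101", "HAW-103", "OAH-320", "OAH-219", "MAI-039")

# Without the argument: relies on the R >= 4.0.0 default.
sites_df <- data.frame(site = sites)

# With the argument, as currently written in the lesson.
explicit_df <- data.frame(site = sites, stringsAsFactors = FALSE)

# On R >= 4.0.0 the column is character either way, so the two are identical.
is.character(sites_df$site)
identical(sites_df, explicit_df)
```

Learners on older R versions would still get a factor column from the first form, which may be a reason to keep a short note in the text.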

Other, less important items that are more of a stylistic choice:

show that you can actually separate mutates with a comma instead of adding an additional call to mutate:

catch_clean <- catch_data %>% 
  mutate(Chinook = ifelse(Chinook == "I", 1, Chinook)) %>%
  mutate(Chinook = as.integer(Chinook))

catch_clean <- catch_data %>%
  mutate(Chinook = ifelse(Chinook == "I", 1, Chinook),
         Chinook = as.integer(Chinook))

as a side note - dplyr::if_else vs. base::ifelse
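One way the difference between the two functions could be illustrated is sketched below. The chinook vector is a made-up stand-in for the Chinook column, not the lesson's actual data.

```r
library(dplyr)

# Toy stand-in for the Chinook column: counts stored as text, with a stray "I".
chinook <- c("12", "I", "7")

# base::ifelse() silently coerces the numeric 1 to character.
base_result <- ifelse(chinook == "I", 1, chinook)

# dplyr::if_else() refuses to mix a number with a character vector,
# so the error message is captured here instead of stopping the script.
strict_error <- tryCatch(
  if_else(chinook == "I", 1, chinook),
  error = function(e) conditionMessage(e)
)

# Supplying a replacement of the matching type works, and the conversion to
# integer then becomes an explicit, visible step.
strict_result <- as.integer(if_else(chinook == "I", "1", chinook))
```

The stricter type checking in if_else tends to surface mixed-type columns early, at the cost of a slightly more verbose call.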

reorder git setup into the git lessons

In the past, we introduced the concepts behind GitHub before we set up the tools. During today's arctic lesson, we went through the complex git setup before explaining any of it. We should move the git setup into the git intro lesson itself, so that it follows some of the intro material. This would also let us shorten the intro setup section and get to the more interesting RMarkdown material sooner.

remote lesson plans: Data modeling

Delivery Format

proposed

Synchronous presentation (30 min)
Small group exercise in breakout rooms with facilitators (30 min)
Main room takeaways from breakout sessions (15 min)

Resources Needed

Zoom with breakout rooms
HackMD
Excalidraw?

tidy data for social science surveys

Need to write a lesson on how to create tidy data structures out of survey data

key points to hit:

entities, observations, variables (survey population, individual, question response)
consistent coding of variables
open formats
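A minimal sketch of how these points might look in code, using a made-up three-respondent survey (the column names and the response codes are assumptions for illustration only):

```r
library(dplyr)
library(tidyr)

# Hypothetical wide survey export: one row per individual, one column per
# question, with inconsistently coded yes/no answers.
survey_wide <- data.frame(
  respondent_id = c(1, 2, 3),
  q1_owns_boat = c("Y", "yes", "No"),
  q2_fishes_weekly = c("no", "Yes", "N")
)

# Tidy form: one row per question response, with a single consistent coding.
survey_tidy <- survey_wide %>%
  pivot_longer(cols = -respondent_id,
               names_to = "question",
               values_to = "response") %>%
  mutate(response = case_when(
    tolower(response) %in% c("y", "yes") ~ "yes",
    tolower(response) %in% c("n", "no") ~ "no",
    TRUE ~ NA_character_
  ))

# Save in an open, plain-text format rather than a spreadsheet format.
tidy_path <- file.path(tempdir(), "survey_tidy.csv")
write.csv(survey_tidy, tidy_path, row.names = FALSE)
```

Here the individual is the entity, each question response is an observation, and the response coding is made consistent before the data are written to CSV.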

our example dataset might feature:

excel format with tabs?
inconsistent coding variables
other?

@mbjones would like your input here

Remote lesson plans: git/GitHub publishing analysis

Delivery Format

proposed

Asynchronous recording (30 minutes)
Office hours

Resources Needed

Screencasting software

remote lesson plans: Data cleaning and manipulation exercise

Delivery format

proposed

Synchronous introduction to problem (15 minutes)
Breakout room hackathon with facilitators (1 hour)
Main room discussion and wrap up (15 minutes)

Resources needed

Zoom with breakout rooms
HackMD

remote lesson plans: Introduction to R and RMarkdown

Delivery Format

proposed

Asynchronous recording (75 minutes)
Office hours

Resources Needed

Screencasting software
Slack

remote lesson plans: Collaboration, authorship, and data policies

Delivery Format

Proposed

Synchronous presentation (30 min)
Small breakout discussion/questions with facilitators (15 min)
Regroup to deliver anything that came up in breakout discussions to entire group (15 min)

Synchronous presentation (30 min)
Addtl. discussion of questions that came up in HackMD (30 min)

Resources Needed

Zoom with breakout rooms
HackMD

Session 9 Fixes

missing a return in this section? - ggplot vs base vs lattice vs XYZ…

* ggplot2 All of them work! I use base graphics for simple, quick and dirty plots. I use ggplot2 for most everything else. ggplot2 excels at making complicated plots easy and easy plots simple enough.

consider adding some help/ hints for the 9.2.2 challenge question? For example, adding functions they could use and maybe to try starting with getting the year?
might be worth differentiating the color and fill variables
add section on saving with ggsave ?
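A possible starting point for the last two items is sketched below. It uses the built-in mtcars data rather than the lesson's dataset, and the output file name is a placeholder.

```r
library(ggplot2)

# fill is mapped to a variable (inside aes), so it colors the bar interiors by
# gear; color is set to a constant (outside aes), so it only draws the outline.
p <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear))) +
  geom_bar(color = "black") +
  labs(x = "Cylinders", y = "Count", fill = "Gears")

# ggsave() picks the graphics device from the file extension.
out_file <- file.path(tempdir(), "cyl-by-gear.pdf")
ggsave(out_file, plot = p, width = 6, height = 4, units = "in")
```

Passing the plot object explicitly avoids relying on whichever plot happened to be displayed last, which is the default ggsave falls back to.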

identify OSS lessons to convert to current format

from the OSS folder, need to identify the content we want to convert into our current bookdown structure for the Delta training

remote lesson plans: Writing and publishing metadata in R

Delivery Format

proposed

Resources Needed

Replace lead-in image in Session 5 (git collab+conflict)

The first image in session 5 (git-collaboration-conflicts.Rmd, really) is a picture of an RStudio git commit dialog:

but I think we meant to have a picture of the hub-and-spoke workflow described in the preceding paragraphs. The image I'm thinking of is

It looks like it got yoinked out by accident maybe? in b680927#diff-bf084fad119baf4391b358d39b465153ccba987417166c74e58d605d91b09e46. @jeanetteclark what do you think?

I'm going to make the change once I hit submit on this so the book builds for tomorrow but I'll leave this open for comments.

remote lesson plans: Metadata best practices

Delivery Format

proposed

Synchronous presentation (45 minutes)
Wrap up questions (15 minutes)

Resources Needed

Zoom
HackMD

remote lesson plans: creating packages

Delivery Format

proposed

Resources Needed

remote lesson plans: Provenance and reproducibility

Delivery Format

proposed

Resources Needed

Consider changing stringsAsFactors and 'master' branch references in course materials

We have various references to stringsAsFactors which we might want to take out at some point. Maybe not yet as R 4.0 is still fairly new?
We also have references to the master branch in at least the [web publication chapter](https://learning.nceas.ucsb.edu/2021-02-RRCourse/session-9-data-visualisation-and-publishing-to-the-web.html).

Might be good to grep the modules and change some of these.
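A small helper along these lines could do the search from R. The function name find_references and the default pattern are illustrative assumptions, and matches on "master" would still need manual review since the word also appears in unrelated contexts.

```r
# Hypothetical helper: list every line in the .Rmd files under `dir` that
# matches `pattern`, with the file name and line number.
find_references <- function(dir = ".",
                            pattern = "stringsAsFactors|\\bmaster\\b") {
  files <- list.files(dir, pattern = "\\.Rmd$", recursive = TRUE,
                      full.names = TRUE)
  hits <- lapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    idx <- grep(pattern, lines, perl = TRUE)
    if (length(idx) == 0) return(NULL)
    data.frame(file = f, line = idx, text = trimws(lines[idx]))
  })
  # Files without matches contribute NULL, which rbind() ignores.
  do.call(rbind, hits)
}

# Demonstration on a throwaway directory so the sketch runs anywhere.
demo_dir <- tempfile("modules")
dir.create(demo_dir)
writeLines(
  c("df <- data.frame(a = 1, stringsAsFactors = FALSE)",
    "An unrelated line.",
    "git push origin master"),
  file.path(demo_dir, "example.Rmd")
)
found <- find_references(demo_dir)
print(found[, c("line", "text")])
```

Run from the repository root, find_references() would scan all modules with the default pattern.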

Remote lesson plans: Git conflicts

Delivery Format

proposed

Asynchronous recording (1 hour)?
Breakout room practice session (1 hour)
OR
Office hours (2 person hours)

Resources Needed

Screencasting software

Fix broken link in 2020-11-RRCourse Ch6

In https://learning.nceas.ucsb.edu/2020-11-RRCourse/session-6-social-aspects-of-collaboration.html

Bad link: https://learning.nceas.ucsb.edu/2020-11-RRCourse/files/ThinkingPreferenceMapping.pdf

Working link: https://learning.nceas.ucsb.edu/2020-11-RRCourse/files/ThinkingPreferencesMapping.pdf

Alexandra Etheridge (USGS) training inquiry

on Dec 30 Alexandra inquired about hosting a training for USGS in Sacramento

She is currently on furlough (federal shutdown). Need to follow up with estimated costs when government reopens

update data submission lesson to include annotations

The description of attributes has been enhanced with attribute annotations, and we should review these in section 2.1, and add annotations during the walkthrough in 2.1.7.1.

remote lesson plans: Data visualization (shiny)

Delivery Format

proposed

Synchronous presentation (30 minutes)
Breakout hackathon (45 minutes)
Main room wrap up (15 minutes)

Resources Needed

Zoom with breakout rooms
HackMD

Session 7: Data Modeling Fixes

7.1.4 - spelling wold - Note that, should one encounter a new species in the survey, we wold have to add new columns to the table. This is difficult to analyze, understand, and maintain.
7.1.6 Challenge - spelling diatram -Draw a new ER diatram showing this re-designed data structure
7.1.6 Challenge - still references excalidraw, should be invision? - Using the Excalidraw live session your instructor start
7.1.6 Challenge Maybe mention that the first identifier column is created automatically by the table and not part of the original data?

replace DT with reactable

reactable seems more powerful and looks a bit nicer than DT

remote lesson plans: NetCDF intro

Delivery Format

proposed

Asynchronous presentation (30 min)

Resources Needed

Screencasting software

Fix copying and pasting

Copying and pasting works inconsistently for some people. Not sure which combinations of OS/browser don't work but I'll test. Make sure we can copy the text in the block and click the copy button too.

include `pins` package for downloads

https://pins.rstudio.com/

Integrate my git intro slides into course book somehow

The intro slides I've given the last two times for the git module help me out a lot in teaching the module and setting up the why and what of git but they aren't integrated into the lesson.

I'd like to take a look at integrating them in one or more of the not-mutually-exclusive ways:

Modifying the introductory text (if it even needs it) to get across some of the same messaging
Providing annotated slides in the appendix
Linking a PDF of the slides from the lesson and storing the PDF in the book repo

remote lesson plans: Tidyverse data cleaning

Delivery Format

proposed

Asynchronous recording of lesson (~45 minutes)
Office hours (2 person hours)

Resources Needed

Screen recording software
Slack?

remote lesson plans: R/RStudio/git introduction and set up

Delivery Format

proposed

Asynchronous presentation
Office hours

Resources Needed

Screencasting software
Slack

remote lesson plans: creating functions

Delivery Format

proposed

Synchronous presentation (30 minutes)
Breakout room exercise (45 minutes)
Main room wrap up (15 minutes)

Resources Needed

Zoom with breakout rooms
HackMD

update recommendations to geospatial vector data file formats

GeoPackage is a better alternative

Update the Session 6 - PR and Branches with new github option

Potentially update the materials to use the new contribute button in the forked repository

remote lesson plans: Intro to ADC policies

Delivery Format

Proposed

Synchronous presentation (30 min)
Small breakout discussion/questions with ADC facilitators (15 min)
Regroup to deliver anything that came up in breakout discussions to entire group (15 min)

Synchronous presentation (30 min)
Addtl. discussion of questions that came up in HackMD (30 min)

Resources Needed

Zoom with breakout rooms
HackMD to track questions as they come up

Jeffrey Blanchard (UMASS) training inquiry

Hi,

I am interested in learning more about hosting a "Trainings in
Environmental Data Science" workshop here at UMass for the broader
community in the area of New England. It is similar to the modular
workshops that our graduate students are requesting.

Regards, Jeff

remote lesson plans: git pull requests and branches

Delivery Format

proposed

Asynchronous recording?

Resources Needed

Session 6: Git Pull Requests and Branches - images still say master?

The text in there says main but the images still say master

remote lesson plans: Data visualization (ggplot/leaflet)

Delivery Format

proposed

Asynchronous presentation (30 minutes)
Office hours

Synchronous presentation (30 minutes)
Breakout group exercise (30 minutes)

Resources Needed

Zoom with breakout rooms
HackMD
Screencasting software

remote lesson plans: Collaboration thinking preferences

Delivery format

Resources needed

remote lesson plans: Submitting metadata

Delivery Format

proposed

Asynchronous presentation (30 minutes)

Resources Needed

Screencasting software

NB: This can also be a resource for the Arctic Data Center in general

remote lesson plans: Geospatial vector analysis

Delivery Format

proposed

Asynchronous presentation (1 hour)
Office hours (2 person hours)

Resources Needed

Screencasting software
Slack?

split data publishing demo into its own section

This would give it a 2nd level header

Teaching Notes: Best Practices Data and Metadata

These are the specific things I highlighted in my written teaching notes when I taught this section for the Arctic Data Center training in Oct 2020. These notes are meant to complement, not replace, the written material.

Introduction

Who you are and what you do for NCEAS/Arctic Data Center
Going to go through best practices for data and metadata, then go through an example of creating metadata and submitting to a repository.
What is metadata?
Mentimeter questions - for word cloud:

Before we jump into the lesson, what do you think some best practices are for your data and your metadata?
How often would you say you and/or your lab follow those best practices?
What gets in your way or prevents you from following "best practices" for data / metadata?

Quick discussion about the answers to all three questions.
Transition - hopefully this lesson will give you some tools to circumvent the things that get in your way.

Overview

Good data management is important for all types of data - small or large.
Don't need a fancy database system to have well formatted data.
First - why both? Why is this important?
Start early and often for good data management but it's never too late to go back to your data.

Organizing Data

High points of the linked papers:

Use a scripted program
Open file formats - computers change but open formats will live on
Keep your raw data
Descriptive names
Plain text

With these guidelines, others can start with your raw data and take the same steps as you did.
Design your data to be tidy.

Metadata

We defined metadata earlier in the lesson as data about data.
Good metadata contains lots of details so it's good to compile this info as you go.
Go through bibliographic, discovery, interpretation, data structure, and rights details, emphasizing why each piece is important:

Biblio - you want credit for this data
Discovery - you want others to discover your data so it can be used in more studies
Interpretation - you want your data to be interpreted correctly so it isn't used out of context
Structure - define variables in your metadata so that your data can be found by others who want to use it
Rights - you want others to use your data appropriately

EML is what we'll be working with today.

Data Identifiers

DOIs refer to the exact version you use even if later on you need to update it - this helps us track uses of the dataset, like views, citations, and downloads.

Data Citation

Talk about data citation at the Arctic Data Center as why this is important.

Provenance

Many repos want to preserve more than just data and metadata - we're one of those, and we're able to preserve software and provenance as well.
Does anyone know what provenance is in the context of data and metadata?
Preserving provenance and code is a cool way to help researchers build on the work you did - standing on the shoulders of giants.
This is why one of the best practices is to clean your data programmatically in a script rather than just deleting cells in Excel.
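A toy illustration of that practice, with made-up data and a hypothetical -999 missing-value code: the raw values are left untouched, and the cleaning step lives in code that anyone can rerun.

```r
# Made-up raw data; -999 stands in for a missing-value code.
raw <- data.frame(
  site = c("A", "B", "C"),
  temp_c = c(12.1, -999, 13.4)
)

# Clean a copy in code, so the step is recorded and repeatable.
clean <- raw
clean$temp_c[clean$temp_c == -999] <- NA

# Keep the raw file and write the cleaned version alongside it.
raw_path <- file.path(tempdir(), "temps_raw.csv")
clean_path <- file.path(tempdir(), "temps_clean.csv")
write.csv(raw, raw_path, row.names = FALSE)
write.csv(clean, clean_path, row.names = FALSE)
```

Because the raw file is preserved and the script records what changed, the path from raw to cleaned data is part of the dataset's provenance.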

Data Documentation and Publishing

Reusing data is the goal but we can't get there without sharing data, and we can't get there without a good data management plan.

Data repositories

Highlight that Github isn't an archival location - researchers should want a repo that gives them a DOI for their data.
Highlighted that we're working on a game to help researchers learn more about what repo to choose for their data, as well as building a centralized hub of resources. Not ready yet, so feel free to skip.

Metadata

Fundamentally important for future understanding of your data.
It takes time to preserve data well but it's worth the effort - and it's easier if you do it as you go. Don't think about it as doing the minimum required steps - you want others / future you to really understand the data.

Structure of a data package

Identifiers are important because they help the researcher cite the exact version of the dataset used.
Transition - we are a member of the DataONE federation, so let's zoom out from thinking about the Arctic Data Center and think about the larger repository landscape.

DataONE

Transition - Now, onto the hands on piece. We're going to randomly assign you to breakouts and an NCEAS staff member will walk you all through uploading some sample data into the Arctic Data Center.

Hands on exercise

Check for completeness when everyone's logged in with their ORCID and at other points throughout.
Ask for questions throughout as well.

mermaid syntax for ER diagrams

Consider using Mermaid for ER diagrams and other technical drawings in the lessons.

Here's an example ER diagram from the data modeling lesson:

  erDiagram
    Site ||--o{ SpeciesObservation : contains
    Site {
        int site
        string name
        float temp
    }
    SpeciesObservation {
        int id
        string date
        int site
        string spcode
        string height
    }

The syntax should also allow specification of primary and foreign keys, but when used, I see GitHub rendering issues, so this needs to be explored further. For example:

  erDiagram
    Site ||--o{ SpeciesObservation : contains
    Site {
        int site PK
        string name
        float temp
    }
    SpeciesObservation {
        int id PK
        string date
        int site FK
        string spcode
        string height
    }

I think the cause for this issue has been identified upstream in mermaid issue mermaid-js/mermaid#2548.

Delivery Format

proposed

Synchronous presentation (45 minutes)
Breakout room practice (30 minutes)
Main room discussion and wrap up (15 minutes)

Resources Needed

Zoom with breakout rooms
HackMD

remote lesson plans: Parallel computing in R

Delivery Format

proposed
