global_warming_effects_on_ice_thickness's Introduction

Global Warming Effects on Ice Thickness

Members:

  • Jayme Gordon
  • Mo Garoub
  • Sasha Babicki
  • Syad Khan

Data analysis project for DSCI 522: Data Science Workflows

About

We aim to answer the question: did the median ice thickness in the Canadian Arctic change by a statistically significant amount between 1984 and 1996? The question stems from rising global temperatures and a curiosity about how this warming affects the depth of the ice. Exploratory data analysis (EDA) showed that the ice thickness measurements are skewed, so rather than comparing means we test whether the observed difference in medians between the two years could be due to chance, using a permutation (simulation-based) hypothesis test.

The dataset used in this analysis contains weekly measurements of ice thickness at established monitoring stations in the Canadian Arctic. The data are made available by the Government of Canada, and the monitoring is carried out by the Canadian Ice Thickness Program. Information about the program can be accessed through the Government of Canada, and the specific dataset we are using is publicly available here.
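
The analysis scripts implement this test in R with the infer package (see Dependencies below). Purely as an illustration of the idea, here is a minimal sketch of the same permutation test in Python with NumPy; the file path and column names are assumptions, not the repository's actual ones.

import numpy as np
import pandas as pd

# Hypothetical cleaned data: one ice thickness measurement per station-week
# (the actual file and column names in this repo may differ)
df = pd.read_csv("data/processed/ice_thickness_clean.csv")

a = df.loc[df["year"] == 1984, "ice_thickness"].to_numpy()
b = df.loc[df["year"] == 1996, "ice_thickness"].to_numpy()

observed = np.median(a) - np.median(b)
pooled = np.concatenate([a, b])

# Build the null distribution by shuffling which year each measurement
# belongs to and recomputing the difference in medians each time
rng = np.random.default_rng(522)
null = np.empty(10_000)
for i in range(null.size):
    perm = rng.permutation(pooled)
    null[i] = np.median(perm[:a.size]) - np.median(perm[a.size:])

# Two-sided p-value: proportion of permuted differences at least as extreme
p_value = np.mean(np.abs(null) >= np.abs(observed))
print(f"Observed difference in medians: {observed:.2f}, p = {p_value:.4f}")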

Summary Report

The summary report can be found here.

Usage

To replicate the analysis, clone this GitHub repository and follow the instructions for one of the options below:

1. Using Docker

Install Docker. Run the following command in the command line from the root directory of this project:

docker run -it --rm -v $(pwd):/home/ice_thickness syadk/ice-thickness:v0.3.0 make -C /home/ice_thickness all

To reset the repo to a clean state, with no intermediate or results files, run the following command at the command line/terminal from the root directory of this project:

docker run -it --rm -v $(pwd):/home/ice_thickness syadk/ice-thickness:v0.3.0 make -C /home/ice_thickness clean

2. Without using Docker

Install all the dependencies listed under the "Dependencies" header.

Python dependencies can be installed using the conda environment file 522_grp_13.yaml provided in this repository. To create and activate the environment, run the following commands at the command line from the root directory of this project:

conda env create --file 522_grp_13.yaml
conda activate 522_grp_13
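
For reference, a conda environment file for this project would look roughly like the sketch below, assembled from the versions listed under Dependencies; the 522_grp_13.yaml committed to the repository is the authoritative version.

# Sketch of a conda environment file based on the Dependencies list;
# the committed 522_grp_13.yaml is authoritative.
name: 522_grp_13
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.8.0
  - pandas=1.1.4
  - ipykernel=5.3.4
  - xlrd=1.2.0
  - docopt=0.6.2
  - altair=4.1.0
  - pandas-profiling=2.9.0
  - pytest=6.1.2
  - altair_saver=0.5.0
  - vega_datasets=0.9.0
  - python-chromedriver-binary=88.0.*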

R dependencies can be installed by running the following command in the R terminal:

install.packages(c("tidyverse", "dplyr", "infer", "ggplot2", "purrr", "knitr", "docopt", "svglite", "rmarkdown"))

Once dependencies are installed, run the following command at the command line from the root directory of this project:

make all

To reset the repo to a clean state, with no intermediate or results files, run the following command at the command line from the root directory of this project:

make clean

Dependencies

  • Python 3.8.0 and Python packages:

    • pandas==1.1.4
    • ipykernel==5.3.4
    • xlrd==1.2.0
    • docopt==0.6.2
    • altair==4.1.0
    • pandas-profiling==2.9.0
    • pytest==6.1.2
    • altair_saver==0.5.0
    • vega_datasets==0.9.0
    • python-chromedriver-binary==88.0.*
  • R 4.0.3 and R libraries:

    • tidyverse==1.3.0
    • dplyr==1.0.2
    • datateachr==0.2.1
    • infer==0.5.3
    • ggplot2==3.3.2
    • purrr==0.3.4
    • knitr==1.30
    • docopt==0.7.1
    • svglite==1.2.3.2
    • rmarkdown==2.5

Makefile Dependency Graph

License

The Ice Thickness Program Collection, 1947-2002 data contains information licensed under the Open Government Licence – Canada (version 2.0).

References

Government of Canada (2020). Ice thickness data. Retrieved from: https://www.canada.ca/en/environment-climate-change/services/ice-forecasts-observations/latest-conditions/archive-overview/thickness-data.html

Timbers, T. (2020). DSCI 552 Statistical Inference and Computation I. Retrieved from: https://github.ubc.ca/MDS-2020-21/DSCI_552_stat-inf-1_students

global_warming_effects_on_ice_thickness's People

Contributors

jaymegordo, mgaroub, sbabicki, syadk


global_warming_effects_on_ice_thickness's Issues

Fix blurry graph

#76: The first graph is blurry. Make it smaller, save a larger file, or save it in a different format such as PDF or SVG.

Script 2 - Read local data and clean/process it

A second script that reads the data produced by the first script and performs any data cleaning/pre-processing, transforming, and/or partitioning that needs to happen before exploratory data analysis or modeling takes place (see the sketch below). This should take at least two arguments:

  • a path/filename pointing to the data to be read in
  • a path/filename pointing to where the cleaned/processed/transformed/partitioned data should live
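
A minimal sketch of what such a script could look like, using docopt as the project's other scripts do; the column names and cleaning steps are placeholders, not the repository's actual code.

# clean_data.py (illustrative sketch, not the repository's script)
"""Usage: clean_data.py --input=<input> --output=<output>

Options:
--input=<input>    Path to the raw data file to read
--output=<output>  Path where the cleaned/processed data should be written
"""
import pandas as pd
from docopt import docopt

def main(input_path, output_path):
    raw = pd.read_csv(input_path)
    # Placeholder cleaning steps: tidy column names, drop missing measurements,
    # and keep only the years used in the analysis
    clean = (
        raw.rename(columns=str.lower)
           .dropna(subset=["ice_thickness"])   # assumed column name
           .query("1984 <= year <= 1996")      # assumed column name
    )
    clean.to_csv(output_path, index=False)

if __name__ == "__main__":
    opt = docopt(__doc__)
    main(opt["--input"], opt["--output"])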

Check the years and make sure they are consistent; make the transition for filtering years clearer.

#76: The years you highlight and compare change throughout the report: 1984 to 2002, then 1984 vs. 1996, then 1984 vs. 1994. I would expect the last two to be the same comparison. Because the ranges change, I am not sure how the narrative flows: you describe one comparison, then run a statistical test on another, and I can't compare them easily since they are two different comparisons, even though the trends are likely the same.

Update analysis method

#76: I'm also wondering if it would be best to compare the individual stations in 1984 to the same stations in 1994, because the distributions in your EDA show very different ice thicknesses for each station, and the magnitude of change across a year also differs by station. I do like how you are moving to comparing individual months instead of just values for a whole year.

Fix density.svg plot

The density plot is failing to save; Altair emits the following warnings:

WARN Ignoring an invalid transform: {"as":["mean_ice_thickness","density"],"density":"mean_ice_thickness","groupby":["year","month"]}.
WARN Infinite extent for field "density": [Infinity, -Infinity]
WARN Ignoring an invalid transform: {"as":["mean_ice_thickness","density"],"density":"mean_ice_thickness","groupby":["year","month"]}.
WARN Infinite extent for field "density": [Infinity, -Infinity]
WARN facet encoding should be discrete (ordinal / nominal / binned).
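
Two plausible causes, given these messages, are a mean_ice_thickness column that is not numeric (or is empty) when the chart is rendered, and a non-discrete field in the facet encoding. A hedged sketch of how the density transform and facet could be specified so Vega-Lite accepts them is below; field and file names mirror the warning messages, and the exact fix in the repository may differ.

import altair as alt
import pandas as pd

df = pd.read_csv("data/processed/ice_thickness_clean.csv")  # assumed path

# The field being density-estimated must be numeric and non-empty, otherwise
# Vega-Lite reports an invalid transform / infinite extent
df["mean_ice_thickness"] = pd.to_numeric(df["mean_ice_thickness"], errors="coerce")
df = df.dropna(subset=["mean_ice_thickness"])

density = (
    alt.Chart(df)
    .transform_density(
        "mean_ice_thickness",
        as_=["mean_ice_thickness", "density"],
        groupby=["year", "month"],
    )
    .mark_area(opacity=0.5)
    .encode(
        x="mean_ice_thickness:Q",
        y="density:Q",
        color="year:N",
        facet=alt.Facet("month:O"),  # facet field must be discrete (ordinal/nominal)
    )
)

density.save("results/density.svg")  # needs altair_saver and a chromedriver backend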

Final review - script documentation and README

Each individual script must be well documented. This means comments throughout, readable code and a brief summary at the top of each file that answers who wrote it, when it was written and what it does. Additionally, in your project's README.md file, explain how to reproducibly run your scripts from top to bottom, including the arguments that need to be provided to each script.

Update graph type in report

#76: First graph in the report: a different plot instead of a bar plot might be good; I am wondering whether certain stations have different ice thicknesses and how that varies within each year. At a minimum, add error bars to the bar graph.

Script 5 - Summary Report

A fifth script: an .Rmd or .ipynb file that presents the key useful (not all!) exploratory data analysis as well as the statistical summaries and figures in a short report. There should be a written narrative in this document that introduces and justifies your question, introduces the data set, presents the findings/results, and interprets them in the context of the question. Some critique of the analysis is also expected (limitations, assumptions, etc.), along with a statement of future directions (what you would do next if you had more time to work on this). The report is expected to be 1-2 written pages (excluding figures, tables and references). You are expected to have a reference section and cite 2-3 external sources (the data source can be one of these citations) in addition to citing the programming languages and packages used. Yes, you need to cite the programming languages and packages used in your project. You will learn how to do this in lecture.

Create milestone 4 release and submit to Canvas

You must include the URL of your public project repository in your milestone 4 submission on Canvas, so we know where to find it.
Just before you submit milestone 4, create a release on your project repository on GitHub and name it 0.3.0. Include the URL to this release in your milestone 4 submission on Canvas, so we can easily jump to the state of your repository at the time of submission for grading purposes while you continue to work on your project for the next milestone.

PDF export not working with makefile

Also, the export takes a long time; can someone else try it on their computer and see whether it works?

The code is just commented out in the Makefile for now.

Milestone 2 General Feedback

Hello Global Warming Effects on Ice Thickness group!

Overall I enjoyed reading your report, and I was able to run your code without problems!
  • code well formatted and concise, commented as needed, some checks/tests in separate code files
  • good use of issues for project tracking and discussion; I like the added labels too
  • rationale good and clear!

Suggestions for the Report:

  1. Names and date are missing in the report.
  2. The first graph is blurry; make it smaller, save a larger file, or save it in a different format such as PDF or SVG. A different plot instead of a bar plot might also be good: I am wondering whether certain stations have different ice thicknesses and how that varies within each year. At a minimum, add error bars to the bar graph.
  3. I'm also wondering if it would be best to compare the individual stations in 1984 to the same stations in 1994, because the distributions in your EDA show very different ice thicknesses for each station, and the magnitude of change across a year also differs by station. I do like how you are moving to comparing individual months instead of just values for a whole year.
  4. You could add a conclusion/result to the summary in the report.
  5. Minor writing errors; the end of the report should say 'in the figure above'.
  6. The years you highlight and compare change throughout the report: 1984 to 2002, then 1984 vs. 1996, then 1984 vs. 1994. I would expect the last two to be the same comparison. Because the ranges change, I am not sure how the narrative flows: you describe one comparison, then run a statistical test on another, and I can't compare them easily since they are two different comparisons, even though the trends are likely the same.

Update yaml to include all dependencies

I can't run the EDA script with the current setup on my computer when I use the 522_grp_13 environment. Not urgent, but we should look into it next week.

Script 1 - Download and save data

A first script that downloads some data from the internet and saves it locally (a minimal sketch follows below). This should take at least two arguments:

  • the path to the input file (a URL or a relative local path, such as data/file.csv)
  • a path/filename specifying where to write the file and what to call it (e.g., data/cleaned_data.csv)

Note 1: you already wrote this script for your last milestone; here you should just improve it based on any feedback you received from the TAs.
Note 2: choose more descriptive filenames than the ones used here for illustrative purposes.
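
Under those constraints, the script could look roughly like this docopt-based sketch; the option names are illustrative, and the xlrd pin in the dependencies suggests the raw file is an Excel workbook.

# download_data.py (illustrative sketch, not the repository's script)
"""Usage: download_data.py --url=<url> --out_file=<out_file>

Options:
--url=<url>            URL (or relative local path) of the raw data file
--out_file=<out_file>  Path, including filename, where the data should be saved
"""
import os
import pandas as pd
from docopt import docopt

def main(url, out_file):
    # read_excel accepts a URL or a local path; swap in read_csv if the
    # raw data is actually CSV
    data = pd.read_excel(url)
    os.makedirs(os.path.dirname(out_file) or ".", exist_ok=True)
    data.to_csv(out_file, index=False)

if __name__ == "__main__":
    opt = docopt(__doc__)
    main(opt["--url"], opt["--out_file"])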

Test Dockerfile

Someone not responsible for creating the file should test running it from scratch.

Create Dockerfile

Wrap your project up in a nice fancy Docker container. To do this, you should:

  • Write your own Dockerfile (this should include all the software and libraries/packages needed to run your analysis pipeline, but not the project itself, i.e., do not git clone your project repo inside your Dockerfile). This Dockerfile should live in the root of your project repository.
  • Use GitHub Actions to automate the building of your Docker image from your Dockerfile. You can use and edit the GitHub Actions workflow file from the gha_docker_build example repository I have created.
  • Do NOT include any of your data or code in the Docker image!
  • Include instructions in the project README on how to use your project with and without Docker. I envision that your instructions for using it with Docker should go something like this:

To run this analysis using Docker, clone/download this repository, use the command line to navigate to the root of this project on your computer, and then type the following (filling in PATH_ON_YOUR_COMPUTER with the absolute path to the root of this project on your computer).

docker run --rm -v PATH_ON_YOUR_COMPUTER:/home/data_analysis_eg ttimbers/data_analysis_pipeline_eg make -C '/home/data_analysis_eg' all
To clean up the analysis type:

docker run --rm -v PATH_ON_YOUR_COMPUTER:/home/data_analysis_eg ttimbers/data_analysis_pipeline_eg make -C '/home/data_analysis_eg' clean
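
A minimal sketch of what such a Dockerfile could look like for this project, assuming a miniconda base image and conda-forge packages; the syadk/ice-thickness image referenced in the Usage section is built from the project's actual Dockerfile, which may differ.

# Illustrative Dockerfile sketch; the project's actual Dockerfile may differ
FROM continuumio/miniconda3:4.9.2

# Python tooling for the download/clean/EDA scripts
RUN conda install -y -c conda-forge \
    pandas=1.1.4 docopt=0.6.2 altair=4.1.0 altair_saver=0.5.0 pandas-profiling=2.9.0

# R plus the packages used by the analysis script and the R Markdown report
RUN conda install -y -c conda-forge \
    r-base=4.0.3 r-tidyverse=1.3.0 r-infer r-docopt r-svglite r-rmarkdown \
    make pandoc

# The project's data and code are NOT copied into the image; the repository is
# mounted at run time with `docker run -v ...` as shown above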

Script 4 - Analysis

A fourth script that reads the data from the second script, performs some statistical or machine learning analysis, and summarizes the results as one or more figures and tables, written to files. This should take two arguments:

  • a path/filename pointing to the data
  • a path/filename prefix specifying where to write the figure(s)/table(s) and what to call them (e.g., results/this_analysis)

Makefile

Write another script, a Makefile (literally called Makefile), to act as a driver script to rule them all. This script should run the others in sequence, hard coding in the appropriate arguments (a minimal sketch is shown below). This script should:

  • live in the project's root directory and be named Makefile
  • be well documented (using the project README and comments inside the Makefile to explain what it does and how to use it)
  • have an all: target so that you can easily run the entire analysis from top to bottom by running make all at the command line
  • have a clean: target so that you can easily "undo" your analysis by running make clean at the command line
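
A minimal sketch of such a Makefile; the target, script, and output names are illustrative placeholders rather than the repository's real ones, and recipe lines must be indented with a tab.

# Makefile: driver script for the pipeline; run `make all` or `make clean`
# (all file and script names below are illustrative placeholders)

all: doc/ice_thickness_report.md

# 1. download and save the raw data (DATA_URL must be set to the public dataset URL)
data/raw/ice_thickness.csv: src/download_data.py
	python src/download_data.py --url=$(DATA_URL) --out_file=$@

# 2. clean/process the raw data
data/processed/ice_thickness_clean.csv: src/clean_data.py data/raw/ice_thickness.csv
	python src/clean_data.py --input=data/raw/ice_thickness.csv --output=$@

# 3. render the summary report from the processed data
doc/ice_thickness_report.md: doc/ice_thickness_report.Rmd data/processed/ice_thickness_clean.csv
	Rscript -e "rmarkdown::render('doc/ice_thickness_report.Rmd')"

clean:
	rm -f data/raw/*.csv data/processed/*.csv results/* doc/ice_thickness_report.md

.PHONY: all clean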

PDF needs some editing.

  • Problem reading the figure titles
  • Unnecessary space under Results
  • References should be on a separate page

Add extra detail to team work contract

From https://github.ubc.ca/MDS-2020-21/DSCI_522_dsci-workflows_students/blob/master/release/milestone1/milestone1.md:

  • How will work be distributed in a fair and equitable way?
  • How often will group meetings occur?
  • Will you have meeting agendas and minutes? If so, what will be the system for rotating through these responsibilities?
  • What will be the style of working?
  • Will you start each day with stand-ups, submit a summary of your contributions 4 hours before each meeting, or something else?
  • What is the quality of work each team member expects from themselves and each other?

Update README for Milestone 2

  • In the main README.md file for this project you should explain the project in a short summary (view from 10,000 feet) and also explain how to run your data analysis (which order scripts are run in, what expected inputs are). Your explanation could include a flow chart/dependency diagram. This should be an evolution of what was submitted at the proposal stage. Yes, we are asking you to overwrite your proposal here, and so if you want to keep that in a more findable place than in the Git history, do create a new file in the doc directory to archive it there (e.g., proposal.md).

  • List the software dependencies for your analysis in the README (or if you use conda, point to the environment.yml file)

Script 3 - Create EDA

A third script which creates exploratory data visualization(s) and table(s) that are useful to help the reader/consumer understand the dataset. This should take two arguments:

  • a path/filename pointing to the data
  • a path/filename prefix specifying where to write the figure and what to call it (e.g., results/this_eda.png)

EDA figures + code

Many figures get cut between pages in the PDF document, which makes them hard to read. Also, it would be helpful to summarize all the EDA and at least connect the main findings between the plots, since there are many figures.
