fri-datascience / course_ids

License: Other

Introduction to data science

This repository contains lecture materials for the "Introduction to data science" course in the Data Science Master's program at the University of Ljubljana, Faculty of Computer and Information Science.

A compiled version of the materials is available at https://fri-datascience.github.io/course_ids.

Introductory lesson slides are available at https://fri-datascience.github.io/course_ids/slides/00-intro.slides.html.

The course is administered by professors Erik Štrumbelj, Tomaž Curk and Slavko Žitnik.

Repository updates

The repository was initially created and used in Fall 2019 and is updated each year for the course. You are also invited to contribute to the repository.

We suggest using the RStudio IDE to make updates. Before working on the project, we advise installing the following dependencies:

install.packages("devtools")

remotes::install_github("slowkow/ggrepel")
install.packages(c("FactoMineR", "ggpubr"))
devtools::install_github("kassambara/factoextra")

install.packages(c("GPArotation", "bookdown", "reticulate", "moments", "ggcorrplot", "tmvnsim", "mnormt", "psych", "Rtsne", "naniar", "mice", "caret", "ggplot2", "gbm"), repos="https://cran.wu.ac.at/")

After that, build the bookdown project; the newly compiled materials will be available in the folder /docs/handbook. Use the following R commands:

# Clean existing book data
bookdown::clean_book(TRUE)

# Clean R environment
rm(list = ls()) 

# Build HTML gitbook
bookdown::render_book('index.Rmd', 'bookdown::gitbook')

This work is published under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license (CC BY-NC-SA). Creative Commons licenses are built from four building blocks, each corresponding to a different requirement. The three that make up this license are:

BY (attribution): this is a mandatory element of every CC license. Contrary to what is commonly believed, the attribution obligation in CC licenses extends beyond a simple indication of the name of the author; in fact, the user is obliged to retain a copyright notice (e.g. "(c) 2016 Paweł Kamocki"), a license notice (e.g. "This work is licensed under a Creative Commons Attribution 4.0 International License"), a disclaimer of warranties (if supplied) and a link to the licensed material.
NC (non-commercial) means that no commercial use can be made. Commercial use is defined as use primarily intended for commercial advantage or monetary compensation. Please note that this category is extremely unclear and can discourage potential users. In our view, when it comes to licensing of research data, this requirement should be avoided.
SA (share-alike): according to this requirement, if derivative works are made, they have to be licensed under the same or compatible license, i.e. a license containing the same (or compatible) requirements. There is only one license approved for compatibility with CC BY-SA 4.0 license: the Free Art License 1.3. In every other case, in order to comply with the SA requirement, you will have to re-license the derivative work under the same CC license, or its more recent version.

course_ids's Issues

Docker bibtex entry ends up at end of last chapter

I think we should remove all bibtex entries and just use links or put the reference into Further reading.

Are all the data going to be in a data folder? Or do we make them available online?

Kaggle description in Chapter 1

I believe Chapter 1 could include a description of and a link to kaggle.com.
Kaggle offers a no-setup, customizable Jupyter Notebooks environment, with access to free GPUs and a huge repository of community-published data and code.
The description could go under 1.8 Further reading and references.

[Docker] Change Dockerfile example

# Use an official Ubuntu runtime as a parent image
FROM ubuntu:18.04

# Set the current working directory to /work
WORKDIR /work

# Copy the current directory contents into the container at /work
ADD ./ .

# Install and configure your environment
RUN apt-get update \
    && apt-get install -y python3 python3-pip \
    && pip3 install flask

# Make port 8787 available to the world outside this container (i.e. docker world)
EXPOSE 8787

# Run server.py when the container launches
ENTRYPOINT python3 server.py

This Dockerfile does not follow good practices:

Copying the project files as the first step invalidates the Docker cache whenever the code changes and causes all later steps to execute again. Here that means ALL the requirements are reinstalled, which causes a significant delay because they have to be fetched from the internet and installed. In Dockerfiles, the proper ordering of commands is very important. Code, as it changes frequently, has to be copied into the container as late as possible.
For Python, it is good practice to use virtualenv even in the container. But using virtualenv in a container has many gotchas (e.g. see https://pythonspeed.com/articles/activate-virtualenv-dockerfile/).
Python modules in the official Ubuntu repository are old. It is better to teach students to install dependencies via a standard requirements.txt.
It is good practice (a must in production environments!) to lock dependency versions (again, teach them to lock versions in requirements.txt). If you don't lock the versions, the build is NOT DETERMINISTIC (it can break later when a new version of a package is released).
A newer recommended way to define requirements is pipenv, but for small projects requirements.txt is OK.
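
The point about locking versions can be checked mechanically. The following is a minimal sketch, not part of the course materials: a standard-library Python helper that reports which lines of a requirements.txt are not pinned with `==`. The function name `unpinned_requirements` and the example package versions are assumptions made for illustration, and the pattern is deliberately simple (it also flags option lines such as `-r other.txt`).

```python
import re

# Matches "name==version", optionally with extras, e.g. "flask==1.1.2".
# Wildcards such as "1.1.*" are treated as unpinned.
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?==[^=\s*]+$")


def unpinned_requirements(text):
    """Return the requirement lines that are not pinned to an exact version."""
    loose = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if not PINNED.match(line):
            loose.append(line)
    return loose


if __name__ == "__main__":
    # Hypothetical requirements.txt contents, for illustration only.
    example = """
    flask==1.1.2
    requests>=2.0
    numpy
    """
    print(unpinned_requirements(example))
```

Running such a check before building the image would flag `requests>=2.0` and `numpy` in the hypothetical file above, since either could resolve to a different version in a later build.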

Since a properly written Dockerfile greatly improves the user experience, I suggest adding this exercise to the course:

First, let the students use the original version of the Dockerfile so they can see for themselves that a small change in the Python code causes a significant delay in building a new version of the container while they wait for the dependencies to install.
Then introduce the changes to the Dockerfile (and explain what they do) and show them that a change in the code now leads to a MUCH quicker build for all subsequent versions of the app's Docker image, as Docker reuses cached layers. This is VERY important if you use big dependencies (e.g. TensorFlow, Keras, ...)!
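
Before running the exercise, the caching behaviour can be reasoned about with a toy model. The sketch below is a simplification written for illustration, not how Docker is implemented: it assumes that a step is rebuilt when any file it reads has changed, and that every step after a rebuilt step is rebuilt too. The function name `rebuilt_steps` and the step descriptions are hypothetical.

```python
def rebuilt_steps(steps, changed_inputs):
    """Simplified model of the Docker build cache.

    steps: ordered list of (instruction, inputs) pairs, where inputs is the
           set of files the step reads from the build context.
    changed_inputs: set of files that changed since the previous build.
    Returns the instructions that must run again: the first step whose
    inputs changed, and every step after it.
    """
    for index, (_, inputs) in enumerate(steps):
        if inputs & changed_inputs:
            return [instruction for instruction, _ in steps[index:]]
    return []


# Copying the code first: the dependency step sits after the changed layer.
code_first = [
    ("ADD ./ .", {"server.py", "requirements.txt"}),
    ("RUN install dependencies", set()),
]

# Copying the code last: the dependency layer stays cached.
code_last = [
    ("COPY requirements.txt .", {"requirements.txt"}),
    ("RUN install dependencies", set()),
    ("COPY ./ .", {"server.py", "requirements.txt"}),
]

if __name__ == "__main__":
    print(rebuilt_steps(code_first, {"server.py"}))
    print(rebuilt_steps(code_last, {"server.py"}))
```

In this model, editing only server.py reruns both steps in the first ordering but only the final copy step in the second, which is the difference the students should observe in build time.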

A better example (adapted from the link above, but not tested with your example code!). It needs flask listed in requirements.txt:

FROM ubuntu:18.04

# Install Python
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-virtualenv

# Create virtualenv
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m virtualenv --python=/usr/bin/python3 $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# Install dependencies (working in /work, as in the original example):
WORKDIR /work
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Make port 8787 available to the world outside this container (i.e. docker world)
EXPOSE 8787

# Copy the application:
COPY ./ .

# Run the application
CMD ["python3", "server.py"]

Git cheatsheet

One possible addition at the end of Chapter 2 is a short list of all the git commands mentioned in the chapter (like a cheatsheet). This would save newcomers from searching the text for the git commands, as they would have them all in one place at the end.

Pandas

I think Chapter 1 could be enriched with a dedicated Pandas section that covers at least the basics, along with a toy dataset for students to experiment with, as it is a very useful library in the domain of Data Science.
