course_ids's Introduction

Introduction to data science

This repository contains lecture materials for the "Introduction to data science" course in the Data Science Master's program at the University of Ljubljana, Faculty of Computer and Information Science.

A compiled version of the materials is available at https://fri-datascience.github.io/course_ids.

Introductory lesson slides are available at https://fri-datascience.github.io/course_ids/slides/00-intro.slides.html.

The course is administered by professors Erik Štrumbelj, Tomaž Curk and Slavko Žitnik.

Repository updates

The repository was initially created and used in Fall 2019 and is now updated each year for the course. You are also invited to contribute to the repository.

To make updates, we suggest using the RStudio IDE. Before working on the project, we advise installing the following dependencies:

install.packages("devtools")

remotes::install_github("slowkow/ggrepel")
install.packages(c("FactoMineR", "ggpubr"))
devtools::install_github("kassambara/factoextra")

install.packages(c("GPArotation", "bookdown", "reticulate", "moments", "ggcorrplot", "tmvnsim", "mnormt", "psych", "Rtsne", "naniar", "mice", "caret", "ggplot2", "gbm"), repos="https://cran.wu.ac.at/")

After that, build the bookdown project; the newly compiled materials will be available in the /docs/handbook folder. Use the following R commands:

# Clean existing book data
bookdown::clean_book(TRUE)

# Clean R environment
rm(list = ls()) 

# Build HTML gitbook
bookdown::render_book('index.Rmd', 'bookdown::gitbook')

--

This work is published under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license (CC BY-NC-SA). Creative Commons licenses are built from four building blocks, each corresponding to a different requirement; the three that apply to this license are:

  • BY (attribution): this is a mandatory element of every CC license. Contrary to what is commonly believed, the attribution obligation in CC licenses extends beyond a simple indication of the name of the author; in fact, the user is obliged to retain a copyright notice (e.g. "(c) 2016 Paweł Kamocki"), a license notice (e.g. "This work is licensed under a Creative Commons Attribution 4.0 International License"), a disclaimer of warranties (if supplied) and a link to the licensed material.
  • NC (non-commercial) means that no commercial use can be made. Commercial use is defined as use primarily intended for commercial advantage or monetary compensation. Please note that this category is extremely unclear and can discourage potential users. In our view, when it comes to licensing of research data, this requirement should be avoided.
  • SA (share-alike): according to this requirement, if derivative works are made, they have to be licensed under the same or a compatible license, i.e. a license containing the same (or compatible) requirements. There is only one license approved for compatibility with the CC BY-SA 4.0 license: the Free Art License 1.3. In every other case, in order to comply with the SA requirement, you will have to re-license the derivative work under the same CC license, or a more recent version of it.

course_ids's People

Contributors

estrumbelj, hitkodev, marcheivanovski, martinjurkovic, speedarj, szitnik, tomazc, zrimseku

course_ids's Issues

Kaggle description in Chapter 1

I suggest adding a description of, and a link to, kaggle.com in Chapter 1.
Kaggle offers a no-setup, customizable Jupyter Notebooks environment, with access to free GPUs and a huge repository of community-published data and code.
The description could go under 1.8 Further reading and references.

[Docker] Change Dockerfile example

# Use an official Ubuntu runtime as a parent image
FROM ubuntu:18.04

# Set the current working directory to /work
WORKDIR /work

# Copy the current directory contents into the container at /work
ADD ./ .

# Install and configure your environment
RUN apt-get update \
    && apt-get install -y python3 python3-pip \
    && pip3 install flask

# Make port 8787 available to the world outside this container (i.e. docker world)
EXPOSE 8787

# Run server.py when the container launches
ENTRYPOINT python3 server.py

This Dockerfile is not written according to good practices:

  • If you copy project files as the first step, any change to them invalidates the Docker cache and causes all other steps to execute again. Here this means that it will install ALL the requirements, which causes a significant delay as it has to fetch and install them all from the internet. In Dockerfiles, proper ordering of commands is very important. Code, as it changes frequently, has to be copied into the container as late as possible.
  • For Python, it is good practice to use virtualenv even in the container. But using virtualenv in the container has many gotchas (e.g. see https://pythonspeed.com/articles/activate-virtualenv-dockerfile/).
  • Python modules in the official Ubuntu repository are old. It is better to teach students to declare and install dependencies via a standard requirements.txt.
  • It is good practice (a must in production environments!) to lock dependency versions (again, teach them to lock versions in requirements.txt). If you don't lock the versions, the build is NOT DETERMINISTIC (as it can break later when a new version of a package is released).
  • A newer recommended way to define requirements is to use pipenv, but for small projects requirements.txt is OK.
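The point about locked versions can be checked mechanically. The following is only a minimal sketch, assuming that "locked" means an exact `==` pin; the function name `unpinned_requirements` and the sample file contents (including the version number) are hypothetical and not part of the course materials.

```python
import re

# A requirement counts as "locked" here only if it uses an exact `==` pin.
PINNED = re.compile(r"^[A-Za-z0-9._\[\]-]+\s*==\s*[^\s;]+")


def unpinned_requirements(text):
    """Return requirement lines that do not pin an exact version."""
    loose = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options such as -r
            continue
        if not PINNED.match(line):
            loose.append(line)
    return loose


# Hypothetical requirements.txt contents; the version number is illustrative only.
example = """\
flask==2.0.3
requests>=2.0
numpy
"""

print(unpinned_requirements(example))  # ['requests>=2.0', 'numpy']
```

A check like this could run before building the image, so that an unpinned dependency is noticed before it makes the build non-deterministic.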

Since a properly written Dockerfile greatly improves the user experience, I would suggest adding this exercise to the course:

  • First, let them use the original version of the Dockerfile so they can see for themselves that a small change in the Python code causes a significant delay in building a new version of the container, as they wait for dependencies to install.
  • Then introduce changes in the Dockerfile (and explain what they do) and show them that a change in code now causes a MUCH quicker build for all subsequent versions of the app's Docker image, as Docker reuses cached layers. This is VERY important if you use big dependencies (e.g. TensorFlow, Keras, ...)!
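The effect described in the two steps above can be illustrated with a toy model of layer caching. This is a deliberately simplified sketch, not Docker's actual implementation: it assumes each step's cache key depends on the previous key, the instruction text, and the contents of any files the step copies. The helper names and the file contents are hypothetical.

```python
import hashlib


def layer_keys(steps, files):
    """Chain a cache key per step: parent key + instruction (+ copied file contents)."""
    keys, parent = [], ""
    for instruction, copied in steps:
        h = hashlib.sha256(parent.encode())
        h.update(instruction.encode())
        for name in sorted(copied):
            h.update(name.encode())
            h.update(files[name].encode())
        parent = h.hexdigest()
        keys.append(parent)
    return keys


def rebuilt_steps(steps, files_before, files_after):
    """Return the instructions whose cache key changes between two builds."""
    before = layer_keys(steps, files_before)
    after = layer_keys(steps, files_after)
    return [instr for (instr, _), b, a in zip(steps, before, after) if b != a]


ALL = ["requirements.txt", "server.py"]

# Code copied first, as in the original Dockerfile.
code_first = [
    ("FROM ubuntu:18.04", []),
    ("ADD ./ .", ALL),
    ("RUN pip3 install flask", []),
]
# Dependencies installed before the code is copied.
deps_first = [
    ("FROM ubuntu:18.04", []),
    ("COPY requirements.txt .", ["requirements.txt"]),
    ("RUN pip install -r requirements.txt", []),
    ("COPY ./ .", ALL),
]

# Two hypothetical versions of the project that differ only in server.py.
v1 = {"requirements.txt": "flask", "server.py": "print('v1')"}
v2 = {"requirements.txt": "flask", "server.py": "print('v2')"}

print(rebuilt_steps(code_first, v1, v2))  # ['ADD ./ .', 'RUN pip3 install flask']
print(rebuilt_steps(deps_first, v1, v2))  # ['COPY ./ .']
```

In the first ordering a one-line code change forces the dependency installation to run again; in the second only the final copy step is rebuilt.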

A better example (adapted from the link above, but not tested with your example code!):

  • needs flask in requirements.txt
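FROM ubuntu:18.04

# Install Python and virtualenv, then drop the apt cache to keep the layer small
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-virtualenv && \
    rm -rf /var/lib/apt/lists/*

# Create virtualenv and put it first on PATH so later steps use it
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m virtualenv --python=/usr/bin/python3 $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# Set the current working directory to /work (as in the original example)
WORKDIR /work

# Install dependencies first, so this layer stays cached until requirements.txt changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Make port 8787 available to the world outside this container (i.e. docker world)
EXPOSE 8787

# Copy the application last, since code changes most often
COPY ./ .

# Run the application
CMD ["python3", "server.py"]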

Git cheatsheet

One possible addition at the end of Chapter 2 is a short list of all the Git commands mentioned in the chapter (like a cheatsheet). This would save newcomers from searching the script for the commands, since they would all be collected at the end.

Pandas

I think Chapter 1 could be enriched with a dedicated Pandas section, covering at least the basics along with a toy dataset for students to experiment with, as Pandas is a very useful library in data science.
