strata_bootcamp's Introduction

README

All code, slides and notes in support of "Data Bootcamp" tutorial at O'Reilly's Strata Conference 2011.

For more information on the tutorial, see http://strataconf.com/strata2011/public/schedule/detail/17164

Authors

License Information

Note: The workshop source code and lecture materials are distributed under different licenses.

All source code is licensed under the Simplified BSD License: http://www.opensource.org/licenses/bsd-license.php.

All workshop material other than source code (slides, handouts, etc.) is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License: http://creativecommons.org/licenses/by-sa/3.0/us/.

Software Requirements

Data Bootcamp participants who wish to follow along with the instructors will need to have several software tools pre-installed. If you do not wish to practice during the session, you do not need to install these tools before bootcamp, but you will need them to replicate the methods on your own afterwards.

For those running a UNIX distribution or Mac OS X, all of the base tools (bash, Python, and R) are already installed, so you will only need to make sure that you have the supporting packages listed below. Windows users will need to install the tools separately from binaries, which you can download at the following sites:

UNIX bash

A large part of analyzing data is dealing with structured and unstructured text. As such, there are several command-line tools that allow for "quick and dirty" handling of this data. For this tutorial we will rely on the following set, which comes with any UNIX-like distribution:

  • sed
  • awk
  • grep

Python

Python is a powerful high-level scripting language that is well suited for manipulating and analyzing data of all kinds. There are a number of Python libraries for analyzing data, but for this tutorial we will focus on the following:

  • email: For parsing email data
  • Natural Language Toolkit (NLTK): Powerful set of tools for performing natural language processing on text
  • NumPy, SciPy, matplotlib: A trio of scientific computing libraries in Python that provide data types and functions for numeric and statistical analysis, as well as visualization
  • Python Imaging Library (PIL): For the statistical analysis of image data
  • NetworkX: For the creation, manipulation, and study of the structure, dynamics, and functions of complex networks
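As a small illustration of the first item in the list, here is a minimal sketch of parsing a message with the standard-library email module. The message text, the addresses, and the summarize helper are all made up for this example and are not part of the tutorial materials.

```python
from email import message_from_string
from email.utils import getaddresses

# A made-up message used only for illustration (not tutorial data).
RAW_MESSAGE = """\
From: Alice Example <alice@example.com>
To: Bob Example <bob@example.com>, carol@example.com
Subject: Bootcamp data

The attachment has the data set we discussed.
"""


def summarize(raw):
    """Hypothetical helper: pull the sender, recipients, subject and body
    out of a raw, single-part email string."""
    msg = message_from_string(raw)
    senders = getaddresses(msg.get_all("From", []))
    recipients = getaddresses(msg.get_all("To", []))
    return {
        "from": [addr for _name, addr in senders],
        "to": [addr for _name, addr in recipients],
        "subject": msg["Subject"],
        "body": msg.get_payload().strip(),
    }


summary = summarize(RAW_MESSAGE)
print(summary)
```

The sender and recipient addresses extracted this way are the kind of pairs that could later be fed to NetworkX as graph edges, although that step is not shown here.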

There are a few ways to install Python packages, but we recommend either of the following. If you have Python setuptools installed, you can download and install any of the above libraries with the following command:

$ easy_install {package_name}

For example, to install NetworkX simply type:

$ easy_install networkx

You can also install packages from source by downloading the source files from the sites referenced above. Unarchive the source code, navigate to the folder where it is located, and run the following command:

$ python setup.py install
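Once the packages are installed, it can be handy to confirm that Python can actually find them. The following is a minimal sketch, not part of the tutorial materials: the check_installed helper and the REQUIRED mapping are names invented for this example, and the values are the usual import names for the libraries listed above.

```python
from importlib.util import find_spec

# Display name -> import name. The mapping itself is an assumption made
# for this example; adjust it if your environment differs.
REQUIRED = {
    "email": "email",
    "NLTK": "nltk",
    "NumPy": "numpy",
    "SciPy": "scipy",
    "matplotlib": "matplotlib",
    "PIL": "PIL",
    "NetworkX": "networkx",
}


def check_installed(modules):
    """Hypothetical helper: report whether each top-level module can be
    located, without importing it."""
    return {label: find_spec(name) is not None for label, name in modules.items()}


status = check_installed(REQUIRED)
for label, found in status.items():
    print(f"{label:12s} {'ok' if found else 'MISSING'}")
```

Any library reported as MISSING can then be installed with one of the two methods described above.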

R

The R statistical programming language has become the de facto lingua franca for statistical analysis. There are thousands of R packages available on CRAN to perform any number of analyses. For the purposes of this tutorial we will use the extremely powerful ggplot2 package by Hadley Wickham for data visualization.

To install packages in R we use the install.packages command:

> install.packages("ggplot2", dependencies=TRUE)

Note: ggplot2 requires several other packages, so if you are running a new R installation this may take a few minutes.

Additional Software

During the tutorial there will be an opportunity to visualize network relationships. A very useful tool for visualizing networks is Gephi, which is a standalone application. If you wish to follow along with this portion of the tutorial, please download and install Gephi.

strata_bootcamp's People

Contributors

jhofman, drewconway
