Big Data with Spark HATS
This Hands-on Advanced Tutorial Session (HATS) is presented by the LPC to demonstrate a CMS analysis using Apache Spark, Spark-ROOT, Histogrammar, and Matplotlib. After an introduction to Spark and the programming paradigm it brings, students will learn some basic building blocks, then combine them to perform a basic measurement of the Z-boson mass using CMS data recorded in 2016.
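To give a feel for the quantity being measured, here is a minimal sketch of the two-lepton invariant mass calculation that underlies a Z-boson mass measurement. It is not taken from the tutorial notebooks: the function name `invariant_mass` and the example kinematic values are hypothetical, and the formula assumes the lepton rest masses are negligible. In the tutorial itself, a calculation like this would be applied through Spark rather than on plain Python numbers.

```python
import math


def invariant_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass (GeV) of two leptons, neglecting their rest masses."""
    # Massless approximation: m^2 = 2 * pT1 * pT2 * (cosh(d_eta) - cos(d_phi))
    m_squared = 2.0 * pt1 * pt2 * (math.cosh(eta1 - eta2) - math.cos(phi1 - phi2))
    # Guard against tiny negative values caused by floating-point rounding
    return math.sqrt(max(m_squared, 0.0))


# Hypothetical muon pair: equal pT, both at eta = 0, back to back in phi
mass = invariant_mass(45.0, 0.0, 0.0, 45.0, 0.0, math.pi)
print(f"{mass:.1f} GeV")
```

Filling a histogram with this value for every selected lepton pair produces the mass peak from which the Z-boson mass is read off.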
Getting Started
Students of the HATS will be provided access to Vanderbilt's Jupyter instance using their GitHub username. The Jupyter instance contains this repository and all necessary software, preconfigured.
Pre-Exercises
The day before the tutorial, it's critical that each student perform the
pre-exercises. This way, any potential technical/login issues can be cleared
up beforehand. To perform the pre-exercises, connect to
Jupyter. You will first need to log
in to GitHub and authorize Jupyter to authenticate (don't worry, GitHub
doesn't transfer your password, just a secret authentication token). You will
get a request to give me, PerilousApricot, your credentials.
Once you've given Jupyter permission to authenticate, click "Start My Server" to start your Jupyter instance.
Once your server starts, you'll be placed into the Jupyter file browser. Then, navigate to spark-hats/notebooks/00-preexercise.ipynb to begin the pre-exercise.
Accessing this Tutorial in Jupyter
Once logged into Jupyter, navigate to the spark-hats directory and open the notebook named Start-Here.ipynb.
Built With
- Jupyter - Interactive Python notebook interface
- Apache Spark - Fast and general engine for large-scale data processing
- Spark-ROOT - Scala-based ROOT/IO interface to Spark
- Histogrammar - Functional histogramming framework, optimized for Spark
- Matplotlib - Python plotting library
Authors
- Andrew Melo - [http://lpc.fnal.gov/fellows/2017/Andrew_Melo.shtml]
Acknowledgments
- The LPC Distinguished Researcher Program (link) - Support for the author
- Advanced Computing Center for Research and Education (ACCRE) (link) - Host facility and sysadmin support
- The Diana-HEP project (link) - Interoperability and compatibility libraries
- Vanderbilt Trans Institutional Program (TIPs) Award (link) - Big Data hardware seed funding