In this lesson, we will look at data visualization using Pandas and Matplotlib, modules that we have already seen and used. Pandas uses Matplotlib under the hood for data visualization and provides handy, efficient functions for visualizing data from DataFrames.
You will be able to:
- Understand the relation between pandas and matplotlib plots and their attributes
- Plot data from single variables using scatter plots, histograms, line plots, boxplots and KDE plots in pandas
- Plot multi-dimensional data using scatter matrix and parallel coordinate plots.
Before we dive into data visualization in Pandas, it is a good idea to get a quick introduction to Matplotlib's style package. Matplotlib comes with a number of predefined styles to customize plots. These styles generally change the look of plots by changing color maps, line styles, backgrounds, etc. Because Pandas is built on Matplotlib for visualizations, setting a style also changes the look of our Pandas graphs, as we shall see below.
We can use plt.style.available to see a list of the predefined styles available in Matplotlib. The %matplotlib notebook magic below renders plots as interactive figures inside Jupyter notebooks:
import matplotlib.pyplot as plt
%matplotlib notebook
plt.style.available
['seaborn-dark',
'seaborn-darkgrid',
'seaborn-ticks',
'fivethirtyeight',
'seaborn-whitegrid',
'classic',
'_classic_test',
'fast',
'seaborn-talk',
'seaborn-dark-palette',
'seaborn-bright',
'seaborn-pastel',
'grayscale',
'seaborn-notebook',
'ggplot',
'seaborn-colorblind',
'seaborn-muted',
'seaborn',
'Solarize_Light2',
'seaborn-paper',
'bmh',
'seaborn-white',
'dark_background',
'seaborn-poster',
'seaborn-deep']
This gives us the list of available styles. To use a style, we simply call plt.style.use(<style name>). Let's use ggplot for now and see how it changes the default style. Feel free to try other styles and see how they impact the look and feel of the plots!
plt.style.use('ggplot')
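The list of style names depends on the installed Matplotlib version (for example, the seaborn styles were renamed with a `seaborn-v0_8-` prefix in Matplotlib 3.6), so a name from the list above may not exist on your machine. The following is a minimal sketch, not part of the original lesson, that picks the first available style from a list of candidates and applies it only temporarily with `plt.style.context`. The candidate list and variable names are illustrative assumptions.

```python
import matplotlib.pyplot as plt

# Candidate style names in order of preference (illustrative choice);
# which of them exist depends on the installed Matplotlib version.
candidates = ['seaborn-v0_8-whitegrid', 'seaborn-whitegrid', 'ggplot']
chosen = next((name for name in candidates if name in plt.style.available),
              'default')

facecolor_before = plt.rcParams['axes.facecolor']

# The style applies only inside the with-block; the previous
# settings are restored automatically when the block exits.
with plt.style.context(chosen):
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])
    ax.set_title(f'Drawn with the {chosen!r} style')

facecolor_after = plt.rcParams['axes.facecolor']
```

Unlike `plt.style.use`, which changes the style for the rest of the session, the context manager is convenient when comparing several styles side by side.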
Pandas offers excellent built-in visualization features. They are particularly useful for exploratory analysis of data stored as a Pandas Series or DataFrame.
Let's build a synthetic temporal DataFrame with the following steps:
- Create a DataFrame with three columns: A, B and C.
- For the data in each column, use a random number generator to generate 365 numbers (to reflect the days in a year) using np.random.randn().
- Using numpy's cumsum (cumulative sum) method, cumulatively sum the generated random numbers in each column.
- Offset column B by +25 and column C by -25 with respect to column A, which remains unchanged.
- Using pd.date_range, set the index to every day in 2018 (starting from 1st January).
We shall also set a seed for controlling the randomization, allowing us to reproduce the data.
It is always a good idea to set a random seed when dealing with probabilistic outputs.
Let's give this a go:
import pandas as pd
import numpy as np
np.random.seed(777)
data = pd.DataFrame({'A': np.random.randn(365).cumsum(),
                     'B': np.random.randn(365).cumsum() + 25,
                     'C': np.random.randn(365).cumsum() - 25},
                    index=pd.date_range('1/1/2018', periods=365))
data.head()
|  | A | B | C |
|---|---|---|---|
| 2018-01-01 | -0.468209 | 25.435990 | -22.997943 |
| 2018-01-02 | -1.291034 | 26.479220 | -22.673404 |
| 2018-01-03 | -1.356414 | 25.832356 | -21.669027 |
| 2018-01-04 | -2.069776 | 26.456703 | -21.408310 |
| 2018-01-05 | -1.163425 | 25.864281 | -22.685208 |
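To see what the random seed buys us, here is a small sketch that wraps the construction above in a helper and builds the frame several times. The helper name `make_data` is hypothetical and introduced only for illustration.

```python
import numpy as np
import pandas as pd

def make_data(seed):
    """Build the three-column random-walk DataFrame for a given seed.

    Hypothetical helper that mirrors the construction used in this lesson.
    """
    np.random.seed(seed)
    return pd.DataFrame({'A': np.random.randn(365).cumsum(),
                         'B': np.random.randn(365).cumsum() + 25,
                         'C': np.random.randn(365).cumsum() - 25},
                        index=pd.date_range('1/1/2018', periods=365))

first = make_data(777)
second = make_data(777)     # same seed: the same numbers are generated again
different = make_data(778)  # different seed: a different random walk

print(first.equals(second))
print(first.equals(different))
```

Frames built with the same seed are identical, which is what makes the outputs in this lesson reproducible.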
This is great. We now have a dataset with three columns that we can treat as time series. Let's inspect our data visually. To plot this data, we can simply call the .plot() method on the DataFrame.
data.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1a208d2cf8>
This is sweet. We didn't have to define our canvas, axes, labels, etc. This is where pandas really shines. By default, the DataFrame.plot() method is a simple wrapper around plt.plot() that draws line plots. So when we call data.plot(), we get a line graph of all the columns in the DataFrame, with labels.
Also notice how this plot differs from Matplotlib's default look and feel. This is because of the style we set earlier. In addition, the %matplotlib notebook magic makes the plots interactive. Try clicking, dragging and zooming on the plot above to see how this works.
Try changing to a different style and see which one you prefer.
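Because the plotting call returns a Matplotlib Axes object (the AxesSubplot shown in the output above), the usual Matplotlib methods can be applied to it afterwards. The sketch below rebuilds the same data so that it runs on its own; the title and label strings are illustrative choices, not part of the lesson.

```python
import numpy as np
import pandas as pd

np.random.seed(777)
data = pd.DataFrame({'A': np.random.randn(365).cumsum(),
                     'B': np.random.randn(365).cumsum() + 25,
                     'C': np.random.randn(365).cumsum() - 25},
                    index=pd.date_range('1/1/2018', periods=365))

# DataFrame.plot() returns the Axes it drew on, so we can keep a reference
ax = data.plot(title='Synthetic random walks, 2018', figsize=(8, 4))

# Standard Matplotlib Axes methods work on the returned object
ax.set_xlabel('Date')
ax.set_ylabel('Cumulative sum')
ax.legend(title='Column')
```

Keeping the returned Axes is a common way to combine the convenience of pandas plotting with Matplotlib's finer control.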
The DataFrame.plot() method allows us to draw a number of different kinds of plots. We can select which plot we want by passing it to the kind parameter. Here is the complete list from the documentation:
kind : str
‘line’ : line plot (default)
‘bar’ : vertical bar plot
‘barh’ : horizontal bar plot
‘hist’ : histogram
‘box’ : boxplot
‘kde’ : Kernel Density Estimation plot
‘density’ : same as ‘kde’
‘area’ : area plot
‘pie’ : pie plot
‘scatter’ : scatter plot
‘hexbin’ : hexbin plot
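As a quick illustration of the kind parameter, the sketch below draws a histogram and a boxplot of the same synthetic data. The data is rebuilt so that the block runs on its own, and the `bins` and `alpha` values are arbitrary choices for readability.

```python
import numpy as np
import pandas as pd

np.random.seed(777)
data = pd.DataFrame({'A': np.random.randn(365).cumsum(),
                     'B': np.random.randn(365).cumsum() + 25,
                     'C': np.random.randn(365).cumsum() - 25},
                    index=pd.date_range('1/1/2018', periods=365))

# Overlaid histograms of the three columns; alpha makes overlaps visible
hist_ax = data.plot(kind='hist', bins=20, alpha=0.5)

# One box per column, summarizing median, quartiles and outliers
box_ax = data.plot(kind='box')
```

Each call creates its own figure, so the two plots appear separately.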
Let's create a scatter plot that takes the A and B columns of data. We pass "scatter" to the kind parameter to change the plot type. Also note that putting a semicolon at the end of the plotting call suppresses the extra text output.
data.plot('A', 'B', kind='scatter' );
We can also choose the plot kind by calling a method of the form dataframe.plot.<kind> instead of passing the kind argument, as we shall see below. Let's now create another scatter plot with points varying in color and size. We'll perform the following steps:
- Use df.plot.scatter and pass in columns A and C.
- Set the color c and size s of the data points to change based on the value of column B.
- Choose the color palette by passing a string into the parameter colormap.
A complete list of colormaps is available in the official Matplotlib documentation.
Let's see this in action:
data.plot.scatter('A', 'C',
                  c='B',
                  s=data['B'],
                  colormap='viridis');
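Matplotlib interprets s as the marker area in points squared, so passing raw column values only works well when those values happen to fall in a sensible positive range. A possible refinement, sketched below under the assumption that marker areas between 10 and 100 are acceptable, is to rescale column B before using it as the size. The data is rebuilt so that the block runs on its own.

```python
import numpy as np
import pandas as pd

np.random.seed(777)
data = pd.DataFrame({'A': np.random.randn(365).cumsum(),
                     'B': np.random.randn(365).cumsum() + 25,
                     'C': np.random.randn(365).cumsum() - 25},
                    index=pd.date_range('1/1/2018', periods=365))

# Min-max rescale column B to marker areas between 10 and 100 (arbitrary range)
b = data['B']
sizes = 10 + 90 * (b - b.min()) / (b.max() - b.min())

# Color still follows the raw B values; only the marker size is rescaled
ax = data.plot.scatter('A', 'C', c='B', s=sizes, colormap='viridis')
```

The rescaling keeps every marker visible and avoids negative sizes if the column ever drops below zero.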