Giter VIP home page Giter VIP logo

dsc-probability-mass-function-houston-ds-042219's Introduction

The Probability Mass Function (PMF)

Introduction

In this lesson you'll look at a way to represent discrete distributions - the probability mass function (pmf), which maps from each value to its probability. You'll explore Probability Density Functions (PDFs) for continuous data later!

We'll investigate a similar concept, the probability density function (PDF) the next lessons.

Objectives

You will be able to:

  • Develop a mathematical understanding of a discrete variable PMF
  • Develop an intuition for PMF while working with a toy example
  • Visualize the PMF and describe its relationship with histograms

What is a Probability Mass Function (PMF)?

A probability mass function (pmf), sometimes referred to as a frequency function, are functions that associate probabilities with discrete random variables. You already learned about this in the context of coin flips and dice rolls. The discrete part in discrete distributions means that there is a known number of possible outcomes.

Based on your experience of rolling a dice, you know can develop a PMF showing the probabilities of each possible value between 1 and 6 occurring.

More formally:

The Probability Mass Function (PMF) maps a probability ($P$) of observing an outcome $x$ of our discrete random variable $X$ in a way that this function takes the form $f(x) = P(X = x)$.

$X$ being a discrete random variable, we can say that the range $R_X$ is a countable set of all possible values of X. They can be represented as a list as follows:

$R_x = { x_1,x_2,x_3,\ldots }$

where $x_1,x_2,x_3,\ldots$ are the possible values of $x$.

We are interested in quantifying the probability that $X$ is equal to some given quantity $x_3$. That is, we want to know $P(x_3)$. In the case of our dice, we might be interested in the probability of getting a 3 which would be $P(3) = \dfrac{1}{6}$

Think of the event $A$, so that $A = { X = x_k }$ is defined as the set of outcomes $s$ in the sample space $S$ for which the corresponding value of $X$ is equal to $x_k$. This can be written as:

$$\large A = { s \in S \mid X(s) = x_k }$$

PMF Intuition

Let's work through a brief example calculating the PMF for a discrete random variable!

You have previously seen that a probability is a number in the range [0,1] that is calculated as frequency expressed as a fraction of the sample size. This means that, in order to convert and random variable's frequency into a probability, we need to perform the following steps:

  • Get the frequency of each possible value in the dataset
  • Divide the frequency of each value by the total number of values (length of dataset)
  • Get the probability for each value

Let's show this using a simple toy example:

# Count the frequency of values in a given dataset
import collections
x = [1,1,1,1,2,2,2,2,3,3,4,5,5]
counter = collections.Counter(x)
print(counter)

print(len(x))
Counter({1: 4, 2: 4, 3: 2, 5: 2, 4: 1})
13

You'll notice that this returned a dictionary, with keys being the possible outcomes, and values of these keys set to the frequency of items. You can calculate the PMF using step 2 above.

# Convert frequency to probability - divide each frequency value by total number of values
pmf = []
for key,val in counter.items():
    pmf.append(round(val/len(x), 2))
    
print(counter.keys(), pmf)
dict_keys([1, 2, 3, 4, 5]) [0.31, 0.31, 0.15, 0.08, 0.15]

You notice that the PMF is normalized so total probability is 1.

import numpy as np

np.array(pmf).sum()
1.0

Visualizing a PMF

You can inspect the PMF of a discrete variable by visualizing the distribution using matplotlib. You can use a simple bar graph to show the PMF using the probabilities calculated above. Here's the code:

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

plt.bar(counter.keys(), pmf);
plt.title("A Probability Mass Function");

png

This looks pretty familiar. It's essentially a normalized histogram! You can use plt.hist(x) to obtain the histogram.

plt.hist(x);
plt.title('Histogram');

png

If you look carefully, there is only a difference in the y-axis. A histogram shows the frequency count of each value in a dataset, whereas the bar plot here shows probabilities.

You can alter your histogram to show probabilities instead of frequency counts using the density = True argument.

While we're at it, let's rescale our x-axis a little bit better in our histogram. You can also change the width of your vertical bars using the argument rwidth.

xtick_locations = np.arange(1.5, 7.5, 1)  # x=5, 15, 25, ...
xtick_labels = ['1', '2', '3', '4', '5']
bins = range(1, 7, 1) 
plt.xticks(xtick_locations, xtick_labels)

plt.hist(x, bins=bins,  rwidth=0.25, density=True);

png

Expected Value and Variance

When talking about distributions, there will generally be two descriptive quantities you're interested in: the expected value and the mean. For discrete distributions, the expected Value of your discrete random value X is given by:

$$\large E(X) = \mu = \sum_i p(x_i)x_i$$

The variance is given by:

$$\large E((X-\mu)^2) = \sigma^2 = \sum_i p(x_i)(x_i-\mu)^2$$

The table below pust these formulas into practice using our example to get a better understanding!

outcome 1 2 3 4 5 $\sum$
probability $0.31$ $ 0.31$ $0.15$ $0.08$ $0.15$ $ 1$
$p(x_i)x_i$ $0.31$ $0.62$ $0.45$ $0.32$ $0.75$ $2.45$
$(x_i-\mu)^2$ $(-1.45)^2 = 2.1025$ $(-0.45)^2=0.2025$ $(0.55)^2=0.3025$ $(1.55)^2=2.4025$ $(2.55)^2=6.5025$
$ p(x_i)(x_i-\mu)^2$ $0.65175$ $0.062775$ $0.045375$ $0.1922$ $0.975375$ $1.927475$

As you can see from the far right column, the expected vslue is equal to 2.45, and the variance is equal to 1.927475. Even though for this example these values may not be super informative, you'll learn how these two descriptive quantities are often important parameters in many distributions to come!

NOTE: In some literature, the PMF is also called just the probability distribution. The phrase distribution function is usually reserved exclusively for the cumulative distribution function CDF.

Summary

In this lesson you learned more about the probability mass function, and how to get a list of probabilities for each possible value in a discrete random variable, by looking at their frequencies. You also learned about the concept of expected value and variance for discrete distributions. Moving on, you'll learn about probability density functions for continuous variables.

dsc-probability-mass-function-houston-ds-042219's People

Contributors

loredirick avatar shakeelraja avatar mathymitchell avatar tkoar avatar

Watchers

James Cloos avatar Lindsey avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.