Giter VIP home page Giter VIP logo

dsc-central-limit-theorem-lab's Introduction

Central Limit Theorem - Lab

Introduction

In this lab, we'll learn how to use the Central Limit Theorem to work with non-normally distributed datasets as if they were normally distributed.

Objectives

You will be able to:

  • Use built-in methods to detect non-normal datasets
  • Create a sampling distribution of sample means to demonstrate the central limit theorem

Let's get started!

First, import the required libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import scipy.stats as st
np.random.seed(0) #set a random seed for reproducibility

Next, read in the dataset. A dataset of 10,000 numbers is stored in non_normal_dataset.csv. Use pandas to read the data into a series.

Hint: Any of the read_ methods in pandas will store 1-dimensional in a Series instead of a DataFrame if passed the optimal parameter squeeze=True.

# Your code here

Detecting Non-Normal Datasets

Before we can make use of the normal distribution, we need to first confirm that our data is normally distributed. If it is not, then we'll need to use the Central Limit Theorem to create a sampling distribution of sample means that will be normally distributed.

There are two main ways to check if a sample follows the normal distribution or not. The easiest is to simply plot the data and visually check if the data follows a normal curve or not.

In the cell below, use seaborn's distplot method to visualize a histogram of the distribution overlaid with the probability density curve.

# Your code here

As expected, this dataset is not normally distributed.

For a more formal way to check if a dataset is normally distributed or not, we can make use of a statistical test. There are many different statistical tests that can be used to check for normality, but we'll keep it simple and just make use of the normaltest() function from scipy.stats, which we imported as st --see the documentation if you have questions about how to use this method.

In the cell below, use normaltest() to check if the dataset is normally distributed.

# Your code here

The output may seem a bit hard to interpret since we haven't covered hypothesis testing and p-values in further detail yet. However, the function tests the hypothesis that the distribution passed into the function differs from the normal distribution. The null hypothesis would then be that the data is normally distributed. We typically reject the null hypothesis if the p-value is less than 0.05. For now, that's all you need to remember--this will make more sense once you work with p-values more which you'll do subsequently.

Since our dataset is non-normal, that means we'll need to use the Central Limit Theorem.

Sampling With Replacement

In order to create a Sampling Distribution of Sample Means, we need to first write a function that can sample with replacement.

In the cell below, write a function that takes in an array of numbers data and a sample size n and returns an array that is a random sample of data, of size n. Additionally, we've added a marker for random seed for reproducability.

def get_sample(data, n, seed):
    #Adding random seed for reproducibility
    np.random.seed(seed)
    
    #Your code here

    pass

test_sample = get_sample(data, 30, 0)
print(test_sample[:5]) 
# [56, 12, 73, 24, 8] (This will change if you run it multiple times)

Generating a Sample Mean

Next, we'll write another helper function that takes in a sample and returns the mean of that sample.

def get_sample_mean(sample):
    
    # Your code here

    pass

test_sample2 = get_sample(data, 30, 0)
test_sample2_mean = get_sample_mean(test_sample2)
print(test_sample2_mean) 
# 32.733333333333334

Creating a Sampling Distribution of Sample Means

Now that we have helper functions to help us sample with replacement and calculate sample means, we just need to bring it all together and write a function that creates a sampling distribution of sample means!

In the cell below, write a function that takes in 3 arguments: the dataset, the size of the distribution to create, and the size of each individual sample. The function should return a sampling distribution of sample means of the given size.

Make sure to include some way to change the seed as your function proceeds!

def create_sample_distribution(data, dist_size=100, n=30):
    pass

test_sample_dist = create_sample_distribution(data)
print(test_sample_dist[:5]) 

# If you set your seed to start at zero and iterate by 1 each sample you should get:
# [32.733333333333334, 54.266666666666666, 50.7, 36.53333333333333, 40.0]

Visualizing the Sampling Distribution as it Becomes Normal

The sampling distribution of sample means isn't guaranteed to be normal after it hits a magic size. Instead, the distribution begins to approximate a normal distribution as it gets larger and larger. Generally, 30 is accepted as the sample size where the Central Limit Theorem begins to kick in--however, there are no magic numbers when it comes to probability. On average, and only on average, a sampling distribution of sample means where the individual sample sizes were 29 would only be slightly less normal, while one with sample sizes of 31 would likely only be slightly more normal.

Let's create some sampling distributions of different sizes and watch the Central Limit Theorem kick in. As the sample size increases, you'll see the distributions begin to approximate a normal distribution more closely.

In the cell below, create a sampling distribution from data of dist_size 10, with a sample size n of 3. Then, visualize this sampling distribution with displot.

# Your code here

Now, let's increase the dist_size to 30, and n to 10. Create another visualization to compare how it changes as size increases.

# Your code here

The data is already looking much more 'normal' than the first sampling distribution, and much more 'normal' that the raw non-normal distribution we're sampling from.

In the cell below, create another sampling distribution of data with dist_size 1000 and n of 30. Visualize it to confirm the normality of this new distribution.

# Your code here

Great! As you can see, the dataset approximates a normal distribution. It isn't pretty, but it's generally normal enough that we can use it to answer statistical questions using $z$-scores and p-values.

Another handy feature of the Central Limit Theorem is that the mean and standard deviation of the sampling distribution should also approximate the population mean and standard deviation from the original non-normal dataset! Although it's outside the scope of this lab, we could also use the same sampling methods seen here to approximate other parameters from any non-normal distribution, such as the median or mode!

Summary

In this lab, we learned to apply the central limit theorem in practice. We learned how to determine if a dataset is normally distributed or not. From there, we used a function to sample with replacement and generate sample means. Afterwards, we created a normal distribution of sample means in order to answer questions about non-normally distributed datasets.

dsc-central-limit-theorem-lab's People

Contributors

bpurdy-ds avatar cheffrey2000 avatar hoffm386 avatar julianward147 avatar lmcm18 avatar loredirick avatar mas16 avatar mathymitchell avatar

Stargazers

 avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

dsc-central-limit-theorem-lab's Issues

Code missing value

Link to Canvas

https://learning.flatironschool.com/courses/6852/assignments/251031?module_item_id=592806
https://github.com/learn-co-curriculum/dsc-central-limit-theorem-lab/tree/master

Issue Subtype

  • Master branch code
  • Solution branch code
  • Code tests
  • Layout/rendering issue
  • Instructions unclear
  • Other (explain below)

Describe the Issue

Source

1) Missing "seed" value = 0 in code in all branches.
2) The result of running code in solution branch does not correspond to the right answer.

Concern

Screenshot 2023-10-14 194908
Screenshot 2023-10-14 190904

(Optional) Proposed Solution

test_sample2 = get_sample(data, 30, 0)

What OS Are You Using?

  • OS X
  • Windows
  • WSL
  • Linux
  • Saturn Cloud from Canvas

Any Additional Context?

The additional resources link most of them don't work!

Link to Canvas

Issue Subtype

  • Master branch code
  • Solution branch code
  • Code tests
  • Layout/rendering issue
  • Instructions unclear
  • Other (explain below)

Describe the Issue

Source


Concern

(Optional) Proposed Solution

What OS Are You Using?

  • OS X
  • Windows
  • WSL
  • Linux
  • Saturn Cloud from Canvas

Any Additional Context?

Sample vs Sampling Distribution Confusion

Link to Canvas

https://github.com/learn-co-curriculum/dsc-central-limit-theorem-lab

Issue Subtype

  • [x ] Master branch code
  • [x ] Solution branch code
  • Code tests
  • Layout/rendering issue
  • [x ] Instructions unclear
  • Other (explain below)

Describe the Issue

Source

The examples are numerous throughout this lab. Here's one.
Before we can make use of the normal distribution, we need to first confirm that our data is normally distributed. If it is not, then we'll need to use the Central Limit Theorem to create a sample distribution of sample means that will be normally distributed.

Concern

The lab refers to a sample distribution of sample means instead of a sampling distribution of sample means. The sampling distribution language is used in the Objectives and then reverts to sample throughout the lab. Two labs later in the Confidence Intervals Lab, a distinction is called out between sample and sampling distributions, but this lab seems to ignore this difference.

(Optional) Proposed Solution

Replace instances of sample distribution with sampling distribution when describing the distribution of sample means for consistency with other course material.

What OS Are You Using?

  • OS X
  • [x ] Windows
  • WSL
  • Linux
  • IllumiDesk

Any Additional Context?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.