
pratibhaawasthi / cleaning-and-preparing-imdb-data


This document outlines a data cleaning project for the IMDb dataset. The project covers loading the dataset, dropping unnecessary columns, identifying and filling missing values, and formatting and cleaning the data. The timeline for the project is 3 days, with specific tasks assigned to each day.


Cleaning-and-Preparing-IMDB-Data

Introduction:

The IMDb dataset is a collection of movies released over the years. It contains information about movie titles, release years, ratings, budgets, gross earnings, and more. This project aims to clean and preprocess the dataset to make it suitable for further analysis.

Tasks:

  1. Load the IMDb data.
  2. Rename the file to the dataset name.
  3. Using the Pandas dataframe drop function, get rid of unnecessary columns.
  4. Determine the number of columns removed. In this case, we removed one column, budget, because it was highly correlated with the gross attribute.
  5. Identify the number of missing values within each column.
  6. Determine which missing values can be filled without distorting the basic statistics of the data. In this case, the "gross" column can be filled with its mean, which leaves the column mean unchanged (although it lowers the variance).
  7. Fill the columns from step 6 with the fillna function.
  8. Uppercase all of the country values and replace any reference to "United States" with "USA".
  9. Replace "N/A", "NaN", and "Null" with an empty string.
  10. Fix any unicode issues in the "movie_title" column with the ftfy library.
  11. Assume a movie cannot be shorter than 10 minutes or longer than 300 minutes. If a movie's duration is outside those bounds, set the value to 0.
  12. Identify what would be considered an outlier for the "imdb_score" column.
  13. Fix "imdb_score" and "title_year" outliers.
  14. Write the cleaned data to a new CSV file called "clean_imdb.csv".
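
Step 6 depends on how mean imputation affects a column's summary statistics. The following is a minimal sketch on made-up numbers (not values from the IMDb dataset) that compares the mean and standard deviation of a "gross"-like column before and after `fillna`:

```python
import pandas as pd

# Made-up values for illustration only; None marks a missing entry.
gross = pd.Series([100.0, 200.0, None, 400.0, None, 500.0], name="gross")

# Fill the gaps with the mean of the observed values.
filled = gross.fillna(gross.mean())

# Compare the basic statistics before and after filling.
summary = pd.DataFrame({
    "before": gross.agg(["count", "mean", "std"]),
    "after": filled.agg(["count", "mean", "std"]),
})
print(summary)
```

In this sketch the mean is identical in both columns, while the standard deviation shrinks because the filled rows sit exactly on the mean.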

Timeline:

Day 1:

  • Load the IMDb dataset.
  • Rename the file to the dataset name.
  • Use the Pandas dataframe drop function to get rid of unnecessary columns.
  • Identify the number of columns removed.
  • Determine the number of missing values within each column.
  • Determine which missing values can be easily filled without changing the basic statistics of the data.
  • Fill the columns from the previous step with the fillna function.
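
The Day 1 steps could look roughly like the sketch below. It reads a tiny inline CSV instead of the real dataset file, so the rows are invented; only the `budget` and `gross` column names come from the task list above. The `day_one` helper is a hypothetical name used for illustration.

```python
import io
import pandas as pd

# Invented stand-in for the IMDb CSV; the real project would pass the
# dataset's file path to pd.read_csv instead.
RAW_CSV = """movie_title,budget,gross,country
Film A,100,300,USA
Film B,200,,UK
Film C,150,500,United States
"""

def day_one(csv_source):
    """Load the data, drop 'budget', count missing values, fill 'gross'."""
    df = pd.read_csv(csv_source)
    columns_before = df.shape[1]

    # Drop the column judged unnecessary (budget, per the task list).
    df = df.drop(columns=["budget"])
    n_removed = columns_before - df.shape[1]

    # Count missing values per column before any filling.
    missing = df.isna().sum()

    # Fill missing gross values with the column mean.
    df["gross"] = df["gross"].fillna(df["gross"].mean())
    return df, n_removed, missing

df, n_removed, missing = day_one(io.StringIO(RAW_CSV))
print(n_removed)
print(missing)
```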

Day 2:

  • Uppercase all of the country values and replace any reference to "United States" with "USA".
  • Replace "N/A", "NaN", and "Null" with an empty string.
  • Fix any unicode issues in the "movie_title" column with the ftfy library.
  • Assume a movie cannot be shorter than 10 minutes or longer than 300 minutes. If a movie's duration is outside those bounds, set the value to 0.
  • Identify what would be considered an outlier for the "imdb_score" column.
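
A minimal sketch of the Day 2 steps follows, run on invented rows. Several details are assumptions rather than part of the original project: the `duration` and `language` column names, the use of the 1.5 × IQR rule to flag "imdb_score" outliers, and the fallback that leaves titles untouched when ftfy is not installed.

```python
import pandas as pd

try:
    from ftfy import fix_text  # repairs mojibake such as "Ã©" -> "é"
except ImportError:
    # Fallback so the sketch still runs without ftfy; titles stay as-is.
    def fix_text(text):
        return text

# Invented rows; "duration" and "language" are assumed column names.
df = pd.DataFrame({
    "movie_title": ["Caf\u00c3\u00a9 Story", "Film B", "Film C", "Film D", "Film E"],
    "country": ["France", "United States", "usa", "uk", "United States"],
    "language": ["French", "N/A", "Null", "English", "NaN"],
    "duration": [122, 5, 960, 117, 124],
    "imdb_score": [6.0, 7.0, 7.5, 8.0, 1.0],
})

# Replace placeholder strings with an empty string.
df = df.replace(["N/A", "NaN", "Null"], "")

# Uppercase the country values, then map the long form to "USA".
df["country"] = df["country"].str.upper().replace({"UNITED STATES": "USA"})

# Fix unicode issues in the titles.
df["movie_title"] = df["movie_title"].map(fix_text)

# Durations outside 10-300 minutes are set to 0.
out_of_bounds = (df["duration"] < 10) | (df["duration"] > 300)
df.loc[out_of_bounds, "duration"] = 0

# Flag imdb_score outliers with the 1.5 * IQR rule (an assumption).
q1, q3 = df["imdb_score"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (df["imdb_score"] < lower) | (df["imdb_score"] > upper)
print(df[is_outlier])
```

Uppercasing before the replacement means that "United States", "united states", and similar variants all collapse to the same "USA" value.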

Day 3:

  • Fix "imdb_score" and "title_year" outliers.
  • Write the cleaned data to a new CSV file called "clean_imdb.csv".
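
The README does not say how the outliers are fixed, so the sketch below makes its own assumptions: scores outside the 1 to 10 rating scale and years outside an illustrative 1888 to 2030 window are treated as invalid and blanked out. The rows are invented, `fix_outliers` is a hypothetical helper, and the file is written to a temporary directory so the example leaves no files behind.

```python
import os
import tempfile
import pandas as pd

# Assumed valid ranges; neither is stated in the project description.
SCORE_RANGE = (1.0, 10.0)   # IMDb ratings use a 1-10 scale
YEAR_RANGE = (1888, 2030)   # illustrative plausibility window

def fix_outliers(df):
    """Return a copy with out-of-range scores and years set to NaN."""
    out = df.copy()
    ranges = {"imdb_score": SCORE_RANGE, "title_year": YEAR_RANGE}
    for col, (low, high) in ranges.items():
        # Keep values inside [low, high]; everything else becomes NaN.
        out[col] = out[col].where(out[col].between(low, high))
    return out

# Invented rows with one bad score and one bad year.
df = pd.DataFrame({
    "movie_title": ["Film A", "Film B", "Film C"],
    "imdb_score": [7.5, 72.0, 6.1],
    "title_year": [2009, 2012, 202],
})

clean = fix_outliers(df)

# Write the result; index=False keeps the row index out of the CSV.
out_path = os.path.join(tempfile.mkdtemp(), "clean_imdb.csv")
clean.to_csv(out_path, index=False)
print(clean)
```

Blanking the value rather than clipping it avoids inventing a score or year for the affected row; clipping to the nearest bound would be an equally simple alternative.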

Conclusion:

The IMDb dataset was successfully cleaned and preprocessed using the methods outlined above. The cleaned file, clean_imdb.csv, can be used for further analysis and modeling.
