cse5525-final-project's Introduction

Requirements

scikit-learn scipy numpy pandas tqdm

Dataset

The movies dataset from Kaggle: [https://www.kaggle.com/rounakbanik/the-movies-dataset]. Data directory: ./data/.

Context

These files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages.

This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website.

Content

  • movies_metadata.csv: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies. This file contains both tmdbId and imdbId, but only tmdbId is present in every record.

  • credits.csv: Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object. This file uses tmdbId as the index.

  • links.csv: The file that contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset. This file has movieId, tmdbId, and imdbId columns, none of which are consecutive. The processed data assigns consecutive mIds to movies and provides mappings from mId to tmdbId and movieId.

  • ratings.csv: This file contains 26 million ratings from 270,000 users on all 45,000 movies.
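
The consecutive mId assignment described for links.csv can be sketched roughly as follows. This is only an illustration, not the project's actual code: the function name build_id_mappings, the choice to start mIds at 0, and the sample rows are all assumptions.

```python
def build_id_mappings(links):
    """Assign consecutive mIds to (movieId, tmdbId) pairs.

    `links` is an iterable of (movieId, tmdbId) tuples, e.g. rows read
    from links.csv. Starting mIds at 0 is an assumption of this sketch.
    """
    mid_to_movie = {}
    mid_to_tmdb = {}
    for m_id, (movie_id, tmdb_id) in enumerate(links):
        mid_to_movie[m_id] = movie_id
        mid_to_tmdb[m_id] = tmdb_id
    return mid_to_movie, mid_to_tmdb


# Illustrative rows only: the raw ids have gaps, the new mIds do not.
sample_links = [(1, 862), (2, 8844), (5, 11862)]
mid_to_movie, mid_to_tmdb = build_id_mappings(sample_links)
print(mid_to_movie)
print(mid_to_tmdb)
```

Keeping two separate dictionaries mirrors the two mappings mentioned above (mId to movieId and mId to tmdbId); a single table with three columns would work equally well.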

Data processing

preprocessing.py creates relationships and mappings from the raw dataset. It accepts the following arguments:

  • --links: the path to links.csv, default is data/links.csv
  • --metadata: the path to the movies' metadata file, default is data/movies_metadata.csv
  • --credit: the path to credit data, default is data/credits.csv
  • --rating: the path to rating data, default is data/ratings.csv
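
For reference, a command-line interface with these four arguments and defaults could be declared with argparse along the following lines. This is a minimal sketch of the interface described above, not the contents of preprocessing.py; the function name build_parser and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser with the four documented arguments and defaults."""
    parser = argparse.ArgumentParser(
        description="Build mappings from the raw movies dataset (sketch)."
    )
    parser.add_argument("--links", default="data/links.csv",
                        help="path to links.csv")
    parser.add_argument("--metadata", default="data/movies_metadata.csv",
                        help="path to the movies' metadata file")
    parser.add_argument("--credit", default="data/credits.csv",
                        help="path to credit data")
    parser.add_argument("--rating", default="data/ratings.csv",
                        help="path to rating data")
    return parser


# Parsing an empty argument list yields the documented defaults.
args = build_parser().parse_args([])
print(args.links, args.metadata, args.credit, args.rating)
```

With defaults like these, running the script without any arguments reads everything from the data/ directory, and any single path can be overridden on its own, for example by passing only --rating.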

Running preprocessing.py will build the following files in the ./processed_data/ directory.

  • Genre2Id.txt: 20 lines, each line has <gId, genre>.
  • mId2CC.txt: each line has <mId, # of cast/crew members, cIds> representing each movie's director(s) and top 8 cast members.
  • mId2Genre.txt: each line has <mId, # of genres, gIds> representing each movie's genre attributes.
  • mId2Title.csv: each line consists of <mId, tmdbId, title>.
  • overviews.csv: each line has <mId, overview>.
  • rating_test.csv: each line has <uId, mId, binary, rating>: a user, a movie, a binary label indicating whether the user's rating of that movie is > 3.5 (50%), and the exact rating (0.5 - 5.0).
  • rating_train.csv: same format as rating_test.csv.
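
Two of the formats above can be illustrated with a short sketch: the binary label in the rating files, and the variable-length <mId, count, ids> lines used by mId2CC.txt and mId2Genre.txt. The helper names and the comma separator are assumptions; the actual delimiter written by preprocessing.py may differ.

```python
def binarize(rating, threshold=3.5):
    """Return 1 if the rating is strictly above the threshold, else 0."""
    return 1 if rating > threshold else 0


def parse_id_list_line(line, sep=","):
    """Parse a '<mId, count, ids...>' line into (mId, [ids]).

    The separator is an assumption of this sketch. The count field is
    checked against the number of ids that actually follow it.
    """
    fields = [int(field) for field in line.strip().split(sep)]
    m_id, count, ids = fields[0], fields[1], fields[2:]
    if count != len(ids):
        raise ValueError(f"expected {count} ids, found {len(ids)}")
    return m_id, ids


# Illustrative values only.
labels = [binarize(r) for r in (0.5, 3.5, 4.0, 5.0)]
m_id, genre_ids = parse_id_list_line("12,3,0,4,7")
print(labels)
print(m_id, genre_ids)
```

Note that a rating of exactly 3.5 maps to 0 here, since the description above uses a strict "> 3.5" comparison.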

Brainstorm

Reference:

Idea01: Product-Catalog-Size-Recommendation-Framework

Idea02: Movie Recommendation Systems

Idea03: Fashion Recommendation System

Idea04: Book Recommendation System
