
sdl60660 / letterboxd_recommendations


Scraping publicly-accessible Letterboxd data and creating a movie recommendation model with it that can generate recommendations when provided with a Letterboxd username

Home Page: https://bit.ly/movie-recs-letterboxd

License: GNU General Public License v3.0

Python 37.36% Shell 0.92% CSS 9.97% HTML 9.96% JavaScript 32.92% SCSS 8.80% Procfile 0.06%
letterboxd-recommendations letterboxd movie-recommendations svd web-scraping redis-queue collaborative-filtering flask

letterboxd_recommendations's Introduction

Letterboxd Recommendations

This project scrapes publicly-accessible Letterboxd data and creates a movie recommendation model with it that can generate recommendations when provided with a Letterboxd username.

Live project lives here: https://letterboxd.samlearner.com/

Methodology

A user's "star" ratings are scraped from their Letterboxd profile and assigned numerical ratings from 1 to 10 (accounting for half stars). Their ratings are then combined with a sample of ratings from the top 4000 most active users on the site to create a collaborative filtering recommender model using singular value decomposition (SVD). All movies in the full dataset that the user has not rated are run through the model for predicted scores, and the items with the top predicted scores are returned. Due to constraints in time and computing power, the maximum sample size a user can select is 500,000 ratings, though the full dataset contains over five million ratings from the top 4000 Letterboxd users alone.
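For illustration, here is a minimal sketch of that collaborative-filtering step, assuming the scikit-surprise library; the toy data and column names below are placeholders, not the project's actual schema.

```python
# Minimal sketch: fit an SVD model on combined ratings, then score unseen movies
# for a target user. All data and names here are illustrative.
import pandas as pd
from surprise import SVD, Dataset, Reader

# Toy ratings on the 1-10 scale (half stars doubled), combining sampled
# Letterboxd users with the target user's own ratings.
ratings = pd.DataFrame({
    "user_id":  ["a", "a", "b", "b", "target", "target"],
    "movie_id": ["m1", "m2", "m1", "m3", "m1", "m2"],
    "rating":   [8, 6, 9, 7, 10, 4],
})

reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(ratings[["user_id", "movie_id", "rating"]], reader)
algo = SVD()
algo.fit(data.build_full_trainset())

# Predict a score for every movie the target user hasn't rated,
# then keep the highest-scoring items as recommendations.
seen = set(ratings.loc[ratings.user_id == "target", "movie_id"])
unseen = [m for m in ratings["movie_id"].unique() if m not in seen]
recs = sorted((algo.predict("target", m) for m in unseen),
              key=lambda p: p.est, reverse=True)
print([(p.iid, round(p.est, 2)) for p in recs])
```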

Notes

The underlying model is completely blind to genres, themes, directors, cast, or any other content information; it recommends only based on similarities in rating patterns between users and movies. I've found that it tends to recommend very popular movies, regardless of an individual user's taste ("basically everyone who watches 12 Angry Men seems to like it, so why wouldn't you?"). To help counteract that, I included a popularity filter that filters by how many times a movie has been rated in the dataset, so that users can specifically find more obscure recommendations. I've also found that it occasionally just completely whiffs (I guess most people who watch "Taylor Swift: Reputation Stadium Tour" do like it, but it's not really my thing). I think that's just the nature of the beast, to some extent, particularly when working with a relatively small sample. It returns 50 recommendations, and that's usually enough to work with if I'm looking for something to watch, even if there are a couple of misses here or there.
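As a rough, self-contained sketch of that popularity filter idea (all names and numbers below are illustrative, not the project's actual code): keep a recommendation only if its movie has been rated at most N times in the dataset.

```python
# Hedged sketch of a popularity filter: surface obscure titles by dropping
# movies that appear more than a user-chosen number of times in the dataset.
from collections import Counter

rated_movie_ids = ["m1", "m1", "m1", "m2", "m3", "m3"]   # one entry per rating
predicted = [("m2", 8.9), ("m1", 8.7), ("m3", 8.1)]       # (movie_id, predicted score)

counts = Counter(rated_movie_ids)
max_ratings = 2  # user-selected obscurity threshold (illustrative)
obscure = [(m, s) for m, s in predicted if counts[m] <= max_ratings]
print(obscure)  # [('m2', 8.9), ('m3', 8.1)]
```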

Running this on your own

The web crawling/data processing portion of this project (everything that isn't related to what happens on the webpage) lives in the data_processing subdirectory. There you'll find a bash script called run_scripts.sh. Use this as your guide for running the crawler, building a training data set, or running the model on your own. Keep in mind, however, that a full crawl of users, ratings, and movies will take several hours. If you'd like to skip that step, I'll keep this Kaggle dataset up to date with the data from my latest crawl. Regardless of whether you run the crawl yourself or download the exported data from Kaggle, there are two very quick things you'll need to do to get up and running, beyond installing the dependencies in the Pipfile using pipenv:

  1. Start up a local MongoDB server (ideally at the default port 27017)
  2. Add a file called "db_config" to the data_processing subdirectory with some basic information on your MongoDB server. If you're running a local server on the default port, all the file needs is the config dictionary shown in the sketch below.
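For reference, and assuming the file is a small Python module (e.g. data_processing/db_config.py), its entire contents for a default local setup would be:

```python
# data_processing/db_config.py -- assuming a local MongoDB on the default port
config = {
    'MONGO_DB': 'letterboxd',
    'CONNECTION_URL': 'mongodb://localhost:27017/'
}
```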

At that point, if you'd like to run the crawl on your own, you can just run the first three scripts listed in data_processing/run_scripts.sh (get_users.py, get_ratings.py, get_movies.py). If you download the data from Kaggle, you'll just need to import each CSV into your Mongo database as its own collection. The other three Python scripts (create_training_data.py, build_model.py, run_model.py) will build and run the SVD model for you.
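If you go the Kaggle route, a minimal sketch of importing one exported CSV into a local Mongo collection with pandas and pymongo could look like this; the filename and collection name are assumptions, so adjust them to match the actual exports.

```python
# Hedged sketch: load one CSV export into its own Mongo collection.
import pandas as pd
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["letterboxd"]

# Hypothetical export filename; repeat for each CSV, one collection per file.
df = pd.read_csv("ratings_export.csv")
db["ratings"].insert_many(df.to_dict("records"))  # one document per CSV row
```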

If you'd like to run the web server with the front-end locally, you'll need to run a local Redis instance as well. You can then run pipenv run python worker.py to start the Redis worker in the background, and start the web server by running pipenv run uvicorn main:app --reload. Navigate into the frontend directory and run npm install to install packages, then npm start to start the frontend React app.
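For context, a Redis Queue worker of the kind worker.py starts boils down to something like the following sketch (assuming the rq library; the queue name here is an assumption, not necessarily the one the project uses).

```python
# Minimal sketch of an rq worker that pulls and runs queued jobs.
import redis
from rq import Queue, Worker

conn = redis.Redis()                        # local Redis on the default port
queue = Queue("default", connection=conn)   # queue name is an assumption
Worker([queue], connection=conn).work()     # block and process queued jobs
```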

Built With

  • Python (requests/BeautifulSoup/asyncio/aiohttp) to scrape review data
  • MongoDB (pymongo) to store user/rating/movie data
  • FastAPI as a python web server
  • HTML/CSS/JavaScript/React/MaterialUI on the front-end
  • Redis/redis queue for managing queued tasks (scraping user data, building/running the model)
  • Heroku/Vercel for hosting

letterboxd_recommendations's People

Contributors

sdl60660


letterboxd_recommendations's Issues

Add alternate gallery/grid view for results

It's arguably easier to mentally filter results from the poster alone, for example, ignoring movies you've already seen. I suggest a grid or gallery view of 5 columns by 10 rows, since the search returns 50 results.

Dataset snapshots

Would it be possible to redistribute snapshots of the dataset so that scraping wouldn't need to be duplicated, e.g. via automated torrents or something similar?

Duplicates

The recommendation list is showing movies that I've already seen.

Some deleted films not identifiable from download, only from web view

The download I just did from your engine included the row https://letterboxd.com/film/film:590400/, which had no match when uploaded to Letterboxd. I went back to the list, still open on your site, and it was identified as Das Boot (1987) (https://letterboxd.com/film/film:590400/, rated 8.36), so presumably the TV series was briefly in Letterboxd before being deleted by them.

As the additional identifying data was in your cache anyway, would you consider an option to include it in the download? It's not uncommon for obscure but interesting films to disappear from Letterboxd, whether because they're ‘video’, or have never had a theatrical release or festival screening.

Add accordion to each result to "see details"

  • Add a Material UI accordion to the Result component with more details retrieved from TMDb
    (the data is already in the Mongo database and being returned to the client, just not being surfaced right now)

Decade Slider and Genre Exclusion filter

Hi Sam, as per our mail, I think the usefulness of the tool could be increased a lot by having the results fall within a decade of the user's choosing, as well as by excluding all films of a certain 'meta' genre, those being documentary and animation. I know movielens.org suffers from this same issue and will often end up recommending things like 'stand-up' comedy specials or documentaries, even though your preferences are clearly for fiction films.

I don't think it's necessary to have every genre listed, but I do think documentary and animation should definitely be their own thing when generating a list of results. As for the decade, it makes it a lot easier when doing any serious kind of film research and wanting to get a good feel for what the highest-rated films looked or sounded like at that specific point in time.

Genre blacklist doesn't work

Hello, I tried the website today and the genre blacklist section doesn't work. For example, I asked for no romance or animation movies, and my recommendations were full of romance movies and some animations too.

The date filter does work fine, though.

