letterboxd-list-scraper's Introduction

Letterboxd-list-scraper

A tool for scraping Letterboxd lists from a simple URL. The output is a file with film titles, release year, director, cast, owner rating, average rating and a whole lot more (see example CSVs and JSONs in /example_output/).

Version v2.1.0 supports the scraping of:

  • Lists (e.g. https://letterboxd.com/bjornbork/list/het-huis-anubis/)
  • Watchlists (e.g. https://letterboxd.com/joelhaver/watchlist/)
  • User films (e.g. https://letterboxd.com/mscorsese/films/)
  • Generic Letterboxd films (e.g. https://letterboxd.com/films/popular/this/week/genre/documentary/)

The current scrape rate is about 1.2 films per second. Multiple lists can be concurrently scraped using separate CPU threads (default max of 4 threads, but this is configurable).
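As a rough illustration of the threading model described above, the sketch below scrapes several lists concurrently with a capped thread pool. It is not the project's actual code: `scrape_list`, `scrape_all` and `MAX_THREADS` are hypothetical names, and the per-list work is a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_THREADS = 4  # assumption: mirrors the default maximum of 4 threads mentioned above


def scrape_list(url: str) -> dict:
    """Hypothetical stand-in for the real per-list scraping work."""
    # A real implementation would fetch and parse the list pages here.
    return {"url": url, "films": []}


def scrape_all(urls: list, max_threads: int = MAX_THREADS) -> list:
    """Scrape several lists concurrently, one list per worker thread."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Executor.map() returns results in the same order as the input URLs.
        return list(pool.map(scrape_list, urls))


if __name__ == "__main__":
    results = scrape_all([
        "https://letterboxd.com/bjornbork/list/het-huis-anubis/",
        "https://letterboxd.com/joelhaver/watchlist/",
    ])
    for result in results:
        print(result["url"], len(result["films"]))
```

Threads suit this kind of workload because each list spends most of its time waiting on network responses rather than computing.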

Getting Started

Dependencies

Requires Python 3.x, numpy, BeautifulSoup (bs4), requests, tqdm and lxml.

If any dependencies are missing, install them all at once with pip install -r requirements.txt (ideally in a clean virtual environment).

Installing

  • Clone the repository and work in there.

Executing program

  • Execute the program by running python -m listscraper [options] [list-url] on the command line in the project directory.

    Multiple list URLs can be provided, separated by a space. The output file(s) can then be found in the folder /scraper_outputs/, which will be created if not already present. Some of the optional flags are:

    • -p or --pages can be used to select specific pages.
    • -on or --output-name can be used to give the output file(s) a user-specified name.
    • -f or --file can be used to import a .txt file with multiple list URLs that should be scraped.
    • -op or --output-path can be used to write the output file(s) to a desired directory.
    • -ofe or --output-file-extension can be used to specify the type of output file (CSV and JSON are supported).
    • --concat will concatenate all films of the given lists and output them in a single file.
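To make the flags listed above concrete, here is a minimal sketch of how such a command-line interface could be declared with `argparse`. It is only an illustration under assumptions: the function name `build_parser`, the help texts and the defaults are hypothetical, and the project's real parser may differ. Run python -m listscraper --help for the authoritative descriptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags named in this README."""
    parser = argparse.ArgumentParser(prog="listscraper")
    # Zero or more list URLs, separated by spaces on the command line.
    parser.add_argument("list_url", nargs="*", help="list URL(s) to scrape")
    # The page-selection syntax is not specified here, so keep it as a raw string.
    parser.add_argument("-p", "--pages", help="select specific pages")
    parser.add_argument("-on", "--output-name", help="name for the output file(s)")
    parser.add_argument("-f", "--file", help=".txt file containing list URLs")
    # Assumed default, based on the /scraper_outputs/ folder mentioned above.
    parser.add_argument("-op", "--output-path", default="scraper_outputs")
    parser.add_argument("-ofe", "--output-file-extension", help="CSV or JSON output")
    parser.add_argument("--concat", action="store_true",
                        help="write all films of the given lists to a single file")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With this sketch, options are given before the URLs, matching the [options] [list-url] order shown above.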

Note

Please use python -m listscraper --help for a full list of all available flags including extensive descriptions on how to use them.

Tip

Scraping multiple lists is most easily done by running python -m listscraper -f <file> with a custom .txt file that contains one list URL per line. Each line can carry its own -p and -on optional flags. For an example of such a file, please see target_lists.txt.
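The sketch below shows one way a line of such a file could be split into a URL and its per-line options. It assumes each line has the shape `<url> [-p <pages>] [-on <name>]`; that layout, the function name `parse_target_line` and the returned keys are assumptions for illustration, so check target_lists.txt for the format the program actually expects.

```python
import shlex
from typing import Optional

# Flags that, per the tip above, may appear on individual lines.
FLAG_KEYS = {
    "-p": "pages",
    "--pages": "pages",
    "-on": "output_name",
    "--output-name": "output_name",
}


def parse_target_line(line: str) -> Optional[dict]:
    """Split one line into its URL and optional flag values (hypothetical helper)."""
    tokens = shlex.split(line)
    if not tokens:
        return None  # blank line: nothing to scrape
    entry = {"url": tokens[0], "pages": None, "output_name": None}
    rest = tokens[1:]
    if len(rest) % 2 != 0:
        raise ValueError(f"flag without a value in line: {line!r}")
    # Walk the remaining tokens as (flag, value) pairs.
    for flag, value in zip(rest[0::2], rest[1::2]):
        if flag not in FLAG_KEYS:
            raise ValueError(f"unsupported flag in list file: {flag}")
        entry[FLAG_KEYS[flag]] = value
    return entry


if __name__ == "__main__":
    print(parse_target_line("https://letterboxd.com/joelhaver/watchlist/ -p 1 -on haver"))
```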

Important

The program currently does not support scraping extremely long generic Letterboxd pages (e.g. https://letterboxd.com/films/popular/this/week/genre/documentary/, which contains ~152000 films). To work around this, use the -p flag to select a smaller range of pages.

TODO

  • Add further output options (CSV and JSON are currently supported).
  • Add scrape functionality for user top 4 and diary.
  • Add -u <username> flag that scrapes the diary, top 4, films and lists of a single user.
  • Add a --meta-data flag to print original list name, scrape date, username above CSV header.
  • Optimize thread usage to increase scrape speed.

Authors

Arno Lafontaine

Acknowledgments

Thanks to BBottoml for the inspiration for this project: https://github.com/BBottoml/Letterboxd-friend-ranker.

letterboxd-list-scraper's People

Contributors

alouafi, besweets, denjackson42, jonathanhouge, l-dot


letterboxd-list-scraper's Issues

UnicodeEncodeError

Some Unicode characters are not recognized and, when this happens, the program stops writing the CSV.
UnicodeEncodeError: 'charmap' codec can't encode character '\u014d' in position 556: character maps to <undefined>
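
A 'charmap' error like this typically appears when a file is opened with the platform's default encoding (often a legacy code page on Windows) that cannot represent characters such as 'ō' (\u014d). The sketch below shows the usual remedy of opening the CSV file explicitly as UTF-8. It is an illustration, not the project's actual writing code: `write_csv` and the sample rows are hypothetical.

```python
import csv
import os
import tempfile


def write_csv(path: str, rows: list) -> None:
    """Write rows to a CSV file using an explicit UTF-8 encoding."""
    # encoding="utf-8" avoids falling back to the platform default code page;
    # newline="" is what the csv module documentation recommends for writers.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)


if __name__ == "__main__":
    rows = [["Film_title", "Director"], ["Tōkyō monogatari", "Yasujirō Ozu"]]
    path = os.path.join(tempfile.gettempdir(), "unicode_example.csv")
    write_csv(path, rows)
    print("Written to", path)
```

If the file is later opened in a spreadsheet application that expects a byte order mark, encoding="utf-8-sig" is a common alternative.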

Any plans to update the app?

Hey there, absolutely loving the hell out of this.

Is there any plan to add any more features to this app? (Like finding the number of watches, appearances on lists, etc)

Would also love to be able to run the scraper on the main pages.

Keep up the good work!

help me...

Hi guys, I don't know anything about Python or code, but this is incredible. I really want to figure out how to make this work, but I'm so dumb. Could someone explain it better or give me another way? Just for the record, this is incredible and I'm trying a lot, but I'm an "old guy", not a technology guy. Still, studying cinema is basically organization, and this program helps me a lot! Thanks, and any help is welcome.

The CSV files generated from these lists are almost empty (more details inside)

Hi, first off, hats off to you for creating this scraper. I'm trying to build a dataset from the Letterboxd users I follow and this has been a timesaver. I was just manually scraping LOL :)

I ran into an issue with this link and this link as both of them return almost empty CSV files. I've tried other links before with a similar option and they came out OK. The first one should have returned a CSV with 87 titles and the second one should return 145 titles. What got generated are 1kb CSV files with only 1 title each. I also noticed there was no notification of "Written to xxx-film.csv" either.

I'd like to understand what's causing the issues for these 2 particular links. Again, thank you for creating this scraper tool and I hope you have other development plans for this in the future.

That is not a valid list URL, please try again.

Awesome looking project! I'm hoping it will enable me to export-->import just one list from another user. However, it's failing as such:

python3 /Users/redacted/Downloads/Letterboxd-list-scraper-master/main.py
====================================================
Welcome to the Letterboxd List scraper!
Provided with an URL, this program outputs a CSV file
of movie title, release data and Letterboxd link.
Example url: https://letterboxd.com/.../list/short-films/).
The program currently only supports lists and watchlists.
Enter q or quit to exit the program.
====================================================

Enter the URL of the list you wish to scrape:https://letterboxd.com/bjornbork/list/het-huis-anubis/

Scraping list data...

  0%|          | 0/100 [00:00<?, ?it/s]
That is not a valid list URL, please try again.
