l-dot / letterboxd-list-scraper

A program that scrapes Letterboxd lists from an input URL. The output CSV contains the film title, release year, director, cast, personal rating, average rating and a lot more.

License: MIT License

Python 100.00%

letterboxd csv python scraper

letterboxd-list-scraper's Introduction

Letterboxd-list-scraper

A tool for scraping Letterboxd lists from a simple URL. The output is a file with film titles, release year, director, cast, owner rating, average rating and a whole lot more (see example CSVs and JSONs in /example_output/).

Version v2.1.0 supports the scraping of:

Lists (e.g. https://letterboxd.com/bjornbork/list/het-huis-anubis/)
Watchlists (e.g. https://letterboxd.com/joelhaver/watchlist/)
User films (e.g. https://letterboxd.com/mscorsese/films/)
Generic Letterboxd films (e.g. https://letterboxd.com/films/popular/this/week/genre/documentary/)
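
As a rough illustration of how these four URL types differ, the sketch below classifies a URL with regular expressions. The patterns and the classify_url helper are hypothetical and inferred only from the example URLs above; the project's own validation logic may differ.

```python
import re
from typing import Optional

# Hypothetical patterns, inferred only from the example URLs in this README.
# Order matters: the first matching pattern wins.
URL_PATTERNS = {
    "list": re.compile(r"^https://letterboxd\.com/[^/]+/list/[^/]+/"),
    "watchlist": re.compile(r"^https://letterboxd\.com/[^/]+/watchlist/"),
    # A user's films page has a username before /films/.
    "user_films": re.compile(r"^https://letterboxd\.com/(?!films/)[^/]+/films/"),
    # Generic film pages start directly with /films/.
    "generic_films": re.compile(r"^https://letterboxd\.com/films/"),
}


def classify_url(url: str) -> Optional[str]:
    """Return the assumed URL type, or None if no pattern matches."""
    for kind, pattern in URL_PATTERNS.items():
        if pattern.match(url):
            return kind
    return None
```

A URL that matches none of the patterns returns None, which a caller could treat as invalid input.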

The current scrape rate is about 1.2 films per second. Multiple lists can be concurrently scraped using separate CPU threads (default max of 4 threads, but this is configurable).
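
The concurrency described above can be sketched with Python's standard thread pool. This is a minimal illustration, not the project's implementation: scrape_list is a hypothetical placeholder, and only the default of 4 threads comes from this README.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_THREADS = 4  # default maximum mentioned in this README


def scrape_list(url):
    """Hypothetical placeholder standing in for the real scraping work."""
    return {"url": url, "films": []}


def scrape_all(urls, max_threads=MAX_THREADS):
    """Scrape several lists concurrently, one list per worker thread."""
    with ThreadPoolExecutor(max_workers=max_threads) as executor:
        # executor.map returns results in the same order as the input URLs.
        return list(executor.map(scrape_list, urls))
```

With this layout each list is handled by one thread, so a single very long list would not be sped up by adding threads.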

Getting Started

Dependencies

Requires Python 3.x, numpy, BeautifulSoup (bs4), requests, tqdm and lxml.

If the dependencies are not met, it is recommended to install everything in one go using pip install -r requirements.txt (ideally in a clean virtual environment).

Installing

Clone the repository and work in there.

Executing program

Execute the program by running python -m listscraper [options] [list-url] on the command line in the project directory.

Multiple list URLs can be provided, separated by a space. The output file(s) can then be found in the folder /scraper_outputs/, which will be created if not already present. Some of the optional flags are:
- -p or --pages can be used to select specific pages.
- -on or --output-name can be used to give the output file(s) a user-specified name.
- -f or --file can be used to import a .txt file with multiple list URLs that should be scraped.
- -op or --output-path can be used to write the output file(s) to a desired directory.
- -ofe or --output-file-extension can be used to specify the output file type (CSV and JSON are supported).
- --concat will concatenate all films of the given lists and output them in a single file.
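
To show how the flags listed above fit together, here is a minimal argparse sketch. It only mirrors the flag names from this README; the value formats, defaults and help texts are assumptions, and python -m listscraper --help remains the authoritative reference.

```python
import argparse


def build_parser():
    """Illustrative parser that mirrors the flag names listed in this README."""
    parser = argparse.ArgumentParser(prog="listscraper")
    parser.add_argument("urls", nargs="*", help="one or more list URLs")
    # The accepted value syntax for each flag is an assumption (plain strings).
    parser.add_argument("-p", "--pages", help="page selection")
    parser.add_argument("-on", "--output-name", help="name for the output file(s)")
    parser.add_argument("-f", "--file", help=".txt file with list URLs")
    parser.add_argument("-op", "--output-path", help="output directory")
    parser.add_argument("-ofe", "--output-file-extension", help="output file type")
    parser.add_argument("--concat", action="store_true",
                        help="write all films to a single file")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Options are placed before the URLs, matching the [options] [list-url] order shown above.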

Note

Please use python -m listscraper --help for a full list of all available flags including extensive descriptions on how to use them.

Tip

Scraping multiple lists is most easily done by running python -m listscraper -f <file> with a custom .txt file that contains one URL per line. Each line can carry its own optional -p and -on flags. For an example of such a file, please see target_lists.txt.
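
The sketch below shows one way such a file could be read, assuming each line holds a URL optionally followed by -p and -on flags. The exact line layout is an assumption, and parse_target_line and parse_target_text are hypothetical helpers; target_lists.txt in the repository shows the real format.

```python
import shlex


def parse_target_line(line):
    """Split one line into a URL plus optional -p / -on values (assumed layout)."""
    tokens = shlex.split(line)
    entry = {"url": None, "pages": None, "output_name": None}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in ("-p", "--pages") and i + 1 < len(tokens):
            entry["pages"] = tokens[i + 1]
            i += 2
        elif token in ("-on", "--output-name") and i + 1 < len(tokens):
            entry["output_name"] = tokens[i + 1]
            i += 2
        else:
            # Any other token is treated as the URL.
            entry["url"] = token
            i += 1
    return entry


def parse_target_text(text):
    """Parse every non-empty line of the file contents."""
    return [parse_target_line(line) for line in text.splitlines() if line.strip()]
```

Using shlex.split means an output name containing spaces can be quoted on its line.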

Important

The program currently does not support scraping extremely long generic Letterboxd pages (e.g. https://letterboxd.com/films/popular/this/week/genre/documentary/, which contains ~152000 films). To work around this, please use the -p flag to make a smaller page selection.
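
To get a feel for the scale involved, the back-of-the-envelope sketch below converts a film count into scraping time. It assumes the stated rate of about 1.2 films per second holds throughout and that the page is scraped sequentially; estimate_scrape_hours is a hypothetical helper, not part of the program.

```python
FILMS_PER_SECOND = 1.2  # approximate scrape rate stated in this README


def estimate_scrape_hours(n_films, rate=FILMS_PER_SECOND):
    """Estimate scraping time in hours for n_films at a constant rate."""
    seconds = n_films / rate
    return seconds / 3600


if __name__ == "__main__":
    # ~152000 films, the size of the example page mentioned above.
    print(f"{estimate_scrape_hours(152000):.1f} hours")
```

Under these assumptions the example page would take roughly 35 hours, which is why a smaller page selection is the practical choice.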

TODO

Add further options for output (currently supports CSV and JSON).
Add scrape functionality for user top 4 and diary.
Add -u <username> flag that scrapes the diary, top 4, films and lists of a single user.
Add a --meta-data flag to print original list name, scrape date, username above CSV header.
Optimize thread usage to increase scrape speed.

Authors

Arno Lafontaine

Acknowledgments

Thanks to BBottoml for the inspiration for this project: https://github.com/BBottoml/Letterboxd-friend-ranker.

letterboxd-list-scraper's Issues

UnicodeEncodeError

Some Unicode characters are not recognized and, when this happens, the program stops writing the CSV.
UnicodeEncodeError: 'charmap' codec can't encode character '\u014d' in position 556: character maps to <undefined>
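
A 'charmap' error like this typically appears when a file is opened with a legacy platform default encoding (for example a Windows code page) that cannot represent the character. A minimal sketch of a common workaround is to open the output file explicitly as UTF-8. Whether the scraper opens its files this way is an assumption, and write_films_csv and the column names are hypothetical.

```python
import csv


def write_films_csv(path, rows, fieldnames):
    """Write rows to a CSV file using an explicit UTF-8 encoding."""
    # encoding="utf-8" avoids relying on the platform default encoding;
    # newline="" is what the csv module documentation recommends for writers.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    # '\u014d' is the character from the error message above.
    rows = [{"title": "Y\u014djinb\u014d", "year": 1961}]
    write_films_csv("films.csv", rows, fieldnames=["title", "year"])
```

Spreadsheet programs that expect a byte order mark may need encoding="utf-8-sig" instead.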

Specific list outputs almost empty .csv

This list outputs a .csv with only one entry. I've attempted it twice. All the other lists I've tried seem to have outputted everything right.

Any plans to update the app?

Hey there, absolutely loving the hell out of this.

Is there any plan to add any more features to this app? (Like finding the number of watches, appearances on lists, etc)

Would also love to be able to run the scraper on the main pages.

Keep up the good work!

help me...

Hi guys, I don't know anything about Python or code, but this is incredible. I really want to figure out how to make this work, but I'm struggling. Could someone explain it better or suggest another way? Just for the record, this is incredible and I'm trying a lot, but I'm an "old guy", not a technology guy. However, studying cinema is basically organization, and this program helps me a lot! Thanks, and any help is welcome.

The CSV files generated from these lists are almost empty (more details inside)

Hi, first off, hats off to you for creating this scraper. I'm trying to build a dataset from the Letterboxd users I follow and this has been a timesaver. I was just manually scraping LOL :)

I ran into an issue with this link and this link as both of them return almost empty CSV files. I've tried other links before with a similar option and they came out OK. The first one should have returned a CSV with 87 titles and the second one should return 145 titles. What got generated are 1kb CSV files with only 1 title each. I also noticed there was no notification of "Written to xxx-film.csv" either.

I'd like to understand what's causing the issues for these 2 particular links. Again, thank you for creating this scraper tool and I hope you have other development plans for this in the future.

That is not a valid list URL, please try again.

Awesome looking project! I'm hoping it will enable me to export-->import just one list from another user. However, it's failing as such:

python3 /Users/redacted/Downloads/Letterboxd-list-scraper-master/main.py
====================================================
Welcome to the Letterboxd List scraper!
Provided with an URL, this program outputs a CSV file
of movie title, release data and Letterboxd link.
Example url: https://letterboxd.com/.../list/short-films/).
The program currently only supports lists and watchlists.
Enter q or quit to exit the program.
====================================================

Enter the URL of the list you wish to scrape:https://letterboxd.com/bjornbork/list/het-huis-anubis/

Scraping list data...

  0%|          | 0/100 [00:00<?, ?it/s]
That is not a valid list URL, please try again.