timf34 / substack2markdown

Script to scrape free and premium Substack posts, saving them as Markdown files. Also generates HTML interfaces to allow you to browse and sort the markdown files for each author.

License: MIT License

Python 81.48% JavaScript 7.94% HTML 3.24% CSS 7.34%

substack-scraper-python python scraper selenium substack markdown html javascript ui

substack2markdown's Issues

[Feature request] Metadata for posts?

Thank you for developing this!

I was wondering if you might consider adding the option to download the metadata (i.e., author name, date created, etc.) for the posts.

Thanks again!
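One way this request could be met is by writing the metadata as a YAML front-matter block at the top of each Markdown file. The sketch below is only an illustration: `build_front_matter` and `with_front_matter` are hypothetical helpers, not part of substack2markdown, and the field names are assumptions.

```python
from datetime import date


def build_front_matter(title: str, author: str, published: date, url: str) -> str:
    """Return a YAML front-matter block (hypothetical helper, not in the repo)."""

    def quote(value: str) -> str:
        # Double-quote and escape so colons or quotes in a title stay valid YAML.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    lines = [
        "---",
        f"title: {quote(title)}",
        f"author: {quote(author)}",
        f"date: {published.isoformat()}",
        f"url: {quote(url)}",
        "---",
        "",
    ]
    return "\n".join(lines)


def with_front_matter(markdown_body: str, **metadata) -> str:
    """Prepend the front matter to an already-converted Markdown body."""
    return build_front_matter(**metadata) + "\n" + markdown_body
```

Many Markdown viewers and static-site generators read a block like this, so the metadata would travel with the file instead of living in a separate index.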

[Feature request] Comments?

I was wondering if you might consider adding the option of downloading the comments for a post. Thank you!

Errors downloading from a paid subscription

(Using the latest source code from 2024-03-29)

I requested three posts, but only one downloaded successfully; the others hit the following errors.

FIRST post download hit exception at

def get_url_soup(self, url: str) -> BeautifulSoup:
    """
    Gets soup from URL using logged in selenium driver
    """
    try:
        self.driver.get(url)  # <-- EXCEPTION HIT HERE
        return BeautifulSoup(self.driver.page_source, "html.parser")
    except Exception as e:
        raise ValueError(f"Error fetching page: {e}") from e

CALLSTACK
get_url_soup (/Users/username/Dev/Substack2Markdown/substack_scraper.py:341)
scrape_posts (/Users/username/Dev/Substack2Markdown/substack_scraper.py:228)
main (/Users/username/Dev/Substack2Markdown/substack_scraper.py:394)
(/Users/username/Dev/Substack2Markdown/substack_scraper.py:398)

OUTPUT
0%| | 0/3 [00:00<?, ?it/s]Error scraping post: Error fetching page: Message: no such execution context
(Session info: MicrosoftEdge=123.0.2420.65)
Stacktrace:
0 msedgedriver 0x0000000104bc99d8 msedgedriver + 4823512
1 msedgedriver 0x0000000104bc1a13 msedgedriver + 4790803
2 msedgedriver 0x0000000104787d35 msedgedriver + 359733
3 msedgedriver 0x000000010477434a msedgedriver + 279370
4 msedgedriver 0x00000001047732a3 msedgedriver + 275107
5 msedgedriver 0x00000001047736df msedgedriver + 276191
6 msedgedriver 0x0000000104781fa4 msedgedriver + 335780
7 msedgedriver 0x000000010479211b msedgedriver + 401691
8 msedgedriver 0x00000001047968ab msedgedriver + 420011
9 msedgedriver 0x0000000104773c8b msedgedriver + 277643
10 msedgedriver 0x0000000104791da0 msedgedriver + 400800
11 msedgedriver 0x000000010480887f msedgedriver + 886911
12 msedgedriver 0x00000001047ec543 msedgedriver + 771395
13 msedgedriver 0x00000001047c0dbf msedgedriver + 593343
14 msedgedriver 0x00000001047c171e msedgedriver + 595742
15 msedgedriver 0x0000000104b85f32 msedgedriver + 4546354
16 msedgedriver 0x0000000104b8c2c6 msedgedriver + 4571846
17 msedgedriver 0x0000000104b67d5a msedgedriver + 4423002
18 msedgedriver 0x0000000104b8cd2d msedgedriver + 4574509
19 msedgedriver 0x0000000104b583d4 msedgedriver + 4359124
20 msedgedriver 0x0000000104bb0348 msedgedriver + 4719432
21 msedgedriver 0x0000000104bb04c1 msedgedriver + 4719809
22 msedgedriver 0x0000000104bc15a7 msedgedriver + 4789671
23 libsystem_pthread.dylib 0x00007ff803e6818b _pthread_start + 99
24 libsystem_pthread.dylib 0x00007ff803e63ae3 thread_start + 15

33%|███████████████████████████████████████████████████████████████▎ | 1/3 [00:16<00:32, 16.22s/it]Error scraping post: 'NoneType' object has no attribute 'text'
67%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 2/3 [00:50<00:25, 25.26s/it]
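
If the "no such execution context" failure turns out to be transient, retrying the page load a few times may be enough. The report does not establish the cause, so this is only a sketch under that assumption; `retry` is a hypothetical helper, not something the repository provides.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T], attempts: int = 3, delay_seconds: float = 2.0) -> T:
    """Call `action` until it succeeds or `attempts` is exhausted (hypothetical helper)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # broad on purpose: this is a sketch
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)  # give the browser time to settle
    raise last_error
```

With a scraper instance named `scraper` (an assumed name), the call site might look like `soup = retry(lambda: scraper.get_url_soup(url))`.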

SECOND post download hit exception at

def scrape_posts(self, num_posts_to_scrape: int = 0) -> None:
    """
    Iterates over all posts and saves them as markdown files
    """
    ... 
    
    title, subtitle, like_count, date, md = self.extract_post_data(soup)  # <-- EXCEPTION HIT HERE

CALLSTACK
scrape_posts (/Users/avib/AviDev/Substack2Markdown/substack_scraper.py:232)
main (/Users/avib/AviDev/Substack2Markdown/substack_scraper.py:394)
(/Users/avib/AviDev/Substack2Markdown/substack_scraper.py:398)

OUTPUT
33%|████████████████████████████▎ | 1/3 [02:02<04:05, 122.79s/it]Error scraping post: 'NoneType' object has no attribute 'text'
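
The `'NoneType' object has no attribute 'text'` message suggests that an element lookup found nothing (BeautifulSoup's `find` and `select_one` return `None` when there is no match) and `.text` was then read from the result. Which lookup fails is not shown in the report. A minimal guard, assuming a fallback value is acceptable, could look like this; `text_or_default` is a hypothetical helper.

```python
from typing import Optional


def text_or_default(element: Optional[object], default: str = "") -> str:
    """Return the element's stripped text, or `default` when the lookup found nothing."""
    if element is None:
        return default
    # Works for any object exposing `.text`, such as a BeautifulSoup Tag.
    return str(getattr(element, "text", default)).strip()
```

For example, `title = text_or_default(soup.select_one("h1"), "Untitled")`, where the `h1` selector is illustrative only. This avoids the crash but can hide a page that failed to load, so logging the URL when the fallback is used may be worthwhile.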

[Feature Request] Add date of posts to their file names

So that posts appear in the order they should be read, even when browsing the folder directly.
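
A minimal sketch of the naming scheme being asked for, assuming the post date is already available as a `datetime.date`. `dated_filename` is a hypothetical helper and the slug rules are an assumption, not the repository's current behaviour.

```python
import re
from datetime import date


def dated_filename(published: date, title: str, extension: str = ".md") -> str:
    """Build a filename such as 2024-03-29-my-post.md (hypothetical helper)."""
    # Lowercase the title and collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{published.isoformat()}-{slug or 'untitled'}{extension}"
```

Because ISO 8601 dates sort lexicographically, a plain alphabetical listing of the folder then matches chronological order.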

Can't use Edge Dev on Mac

I'm unable to get the script to work using

# Specify the path to Microsoft Edge Dev
edge_binary_path = "/Applications/Microsoft Edge Dev.app/Contents/MacOS/Microsoft Edge Dev" 

# Set the Edge options
options = EdgeOptions()
options.binary_location = edge_binary_path

Is there something else I need to modify?

Substack Requiring Captcha at Login

Description

When using the premium scraper in headless mode, I received the below error:

Exception: Warning: Login unsuccessful. Please check your email and password, or your account status.
Use the non-premium scraper for the non-paid posts.

After investigating this error, I noticed that Substack was throwing a captcha within an error-container.

Fix

The issue can be resolved by adding a valid user agent to EdgeOptions() in PremiumSubstackScraper. The following works on my end:

options.add_argument(
    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246"
)

[Feature Request] Rather than opening the file://<file-name>.md why not embed it into a html to cleanly view it?

This could use a Markdown-to-HTML web component such as:

https://zerodevx.github.io/zero-md/

https://github.com/leaverou/md-block
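
As a rough sketch of the idea, the scraper could write one HTML wrapper per post with the Markdown inlined inside a `<zero-md>` element, which zero-md renders in the browser. `build_viewer_html` is a hypothetical helper; the component's script URL is passed in rather than hard-coded and should be taken from the zero-md documentation.

```python
from html import escape


def build_viewer_html(markdown_text: str, title: str, zero_md_script_url: str) -> str:
    """Wrap Markdown in a page rendered by the zero-md component (hypothetical helper)."""
    # Inline script content cannot be HTML-escaped, so reject text that would
    # close the <script> element early. A real implementation would need a better answer.
    if "</script" in markdown_text.lower():
        raise ValueError("markdown contains '</script', which cannot be inlined safely")
    parts = [
        "<!DOCTYPE html>",
        "<html>",
        "<head>",
        '<meta charset="utf-8">',
        f"<title>{escape(title)}</title>",
        f'<script type="module" src="{escape(zero_md_script_url, quote=True)}"></script>',
        "</head>",
        "<body>",
        "<zero-md>",
        '<script type="text/markdown">',
        markdown_text,
        "</script>",
        "</zero-md>",
        "</body>",
        "</html>",
    ]
    return "\n".join(parts)
```

Inlining the text rather than pointing the component at a `.md` file avoids fetching over `file://`, which browsers commonly block.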
