timf34 / substack2markdown

Script to scrape free and premium Substack posts, saving them as Markdown files. Also generates HTML interfaces to allow you to browse and sort the markdown files for each author.

License: MIT License

Languages: Python 81.48%, JavaScript 7.94%, CSS 7.34%, HTML 3.24%
Topics: substack-scraper-python python scraper selenium substack markdown html javascript ui

substack2markdown's Issues

[Feature request] Metadata for posts?

Thank you for developing this!

I was wondering if you might consider adding the option to download the metadata (i.e., author name, date created, etc.) for the posts.

Thanks again!

[Feature request] Comments?

I was wondering if you might consider adding the option of downloading the comments for a post. Thank you!

Errors downloading from a paid subscription

(Using the latest source code as of 2024-03-29)

I requested three posts. Only one downloaded successfully; the others hit the errors below.

The FIRST post download hit an exception at:

def get_url_soup(self, url: str) -> BeautifulSoup:
    """
    Gets soup from URL using logged in selenium driver
    """
    try:
        self.driver.get(url)  # <-- EXCEPTION HIT HERE
        return BeautifulSoup(self.driver.page_source, "html.parser")
    except Exception as e:
        raise ValueError(f"Error fetching page: {e}") from e

CALLSTACK
get_url_soup (/Users/username/Dev/Substack2Markdown/substack_scraper.py:341)
scrape_posts (/Users/username/Dev/Substack2Markdown/substack_scraper.py:228)
main (/Users/username/Dev/Substack2Markdown/substack_scraper.py:394)
(/Users/username/Dev/Substack2Markdown/substack_scraper.py:398)

OUTPUT
0%| | 0/3 [00:00<?, ?it/s]Error scraping post: Error fetching page: Message: no such execution context
(Session info: MicrosoftEdge=123.0.2420.65)
Stacktrace:
0 msedgedriver 0x0000000104bc99d8 msedgedriver + 4823512
1 msedgedriver 0x0000000104bc1a13 msedgedriver + 4790803
2 msedgedriver 0x0000000104787d35 msedgedriver + 359733
3 msedgedriver 0x000000010477434a msedgedriver + 279370
4 msedgedriver 0x00000001047732a3 msedgedriver + 275107
5 msedgedriver 0x00000001047736df msedgedriver + 276191
6 msedgedriver 0x0000000104781fa4 msedgedriver + 335780
7 msedgedriver 0x000000010479211b msedgedriver + 401691
8 msedgedriver 0x00000001047968ab msedgedriver + 420011
9 msedgedriver 0x0000000104773c8b msedgedriver + 277643
10 msedgedriver 0x0000000104791da0 msedgedriver + 400800
11 msedgedriver 0x000000010480887f msedgedriver + 886911
12 msedgedriver 0x00000001047ec543 msedgedriver + 771395
13 msedgedriver 0x00000001047c0dbf msedgedriver + 593343
14 msedgedriver 0x00000001047c171e msedgedriver + 595742
15 msedgedriver 0x0000000104b85f32 msedgedriver + 4546354
16 msedgedriver 0x0000000104b8c2c6 msedgedriver + 4571846
17 msedgedriver 0x0000000104b67d5a msedgedriver + 4423002
18 msedgedriver 0x0000000104b8cd2d msedgedriver + 4574509
19 msedgedriver 0x0000000104b583d4 msedgedriver + 4359124
20 msedgedriver 0x0000000104bb0348 msedgedriver + 4719432
21 msedgedriver 0x0000000104bb04c1 msedgedriver + 4719809
22 msedgedriver 0x0000000104bc15a7 msedgedriver + 4789671
23 libsystem_pthread.dylib 0x00007ff803e6818b _pthread_start + 99
24 libsystem_pthread.dylib 0x00007ff803e63ae3 thread_start + 15

33%|███████████████████████████████████████████████████████████████▎ | 1/3 [00:16<00:32, 16.22s/it]Error scraping post: 'NoneType' object has no attribute 'text'
67%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 2/3 [00:50<00:25, 25.26s/it]
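
If the "no such execution context" error turns out to be transient (an assumption; the report does not establish the cause), retrying the page fetch a few times may be enough to get past it. The sketch below is not part of substack2markdown: `fetch_with_retry`, `attempts`, and `delay_seconds` are hypothetical names, and the helper wraps any callable that takes a URL.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[str], T],
    url: str,
    attempts: int = 3,
    delay_seconds: float = 2.0,
) -> T:
    """Call fetch(url), retrying on any exception up to `attempts` times.

    Hypothetical helper: `fetch` stands in for a method such as
    get_url_soup from the snippet above.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose: driver errors vary
            last_error = error
            if attempt < attempts:
                # Pause before the next try so the browser can settle.
                time.sleep(delay_seconds)

    # Every attempt failed: surface the last error with context.
    raise ValueError(
        f"Error fetching page after {attempts} attempts: {last_error}"
    ) from last_error
```

Inside the scraper this could be called as `fetch_with_retry(self.get_url_soup, url)`, so that a single failed navigation does not immediately skip the post.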

The SECOND post download hit an exception at:

def scrape_posts(self, num_posts_to_scrape: int = 0) -> None:
    """
    Iterates over all posts and saves them as markdown files
    """
    ...

    title, subtitle, like_count, date, md = self.extract_post_data(soup)  # <-- EXCEPTION HIT HERE

CALLSTACK
scrape_posts (/Users/avib/AviDev/Substack2Markdown/substack_scraper.py:232)
main (/Users/avib/AviDev/Substack2Markdown/substack_scraper.py:394)
(/Users/avib/AviDev/Substack2Markdown/substack_scraper.py:398)

OUTPUT
33%|████████████████████████████▎ | 1/3 [02:02<04:05, 122.79s/it]Error scraping post: 'NoneType' object has no attribute 'text'
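
The message `'NoneType' object has no attribute 'text'` usually means a lookup returned `None` and `.text` was then read from it; BeautifulSoup's `find` and `select_one` both return `None` when nothing matches. A minimal sketch of a guard follows, assuming this is what happens inside `extract_post_data` (the report does not show that code). `text_or_default` is a hypothetical helper, not something from the repository.

```python
from typing import Any, Optional


def text_or_default(node: Optional[Any], default: str = "") -> str:
    """Return node.text stripped of whitespace, or `default` if node is None.

    `node` is expected to be whatever the HTML lookup returned: an
    element object with a `.text` attribute, or None when nothing matched.
    """
    if node is None:
        return default
    return node.text.strip()
```

A call such as `text_or_default(soup.select_one("h1"), "Untitled")` (the selector here is only illustrative) would then yield a fallback title instead of raising, which makes it easier to see which element is actually missing from the fetched page.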

can't use Mac Edge Dev

I'm unable to get the script to work using the following configuration:

from selenium.webdriver.edge.options import Options as EdgeOptions

# Specify the path to Microsoft Edge Dev
edge_binary_path = "/Applications/Microsoft Edge Dev.app/Contents/MacOS/Microsoft Edge Dev"

# Set the Edge options
options = EdgeOptions()
options.binary_location = edge_binary_path

Is there something else I need to modify?

Substack Requiring Captcha at Login

Description

When using the premium scraper in headless mode, I received the following error:

Exception: Warning: Login unsuccessful. Please check your email and password, or your account status.
Use the non-premium scraper for the non-paid posts.

After investigating this error, I noticed that Substack was throwing a captcha within an error-container.

Screenshot

[Image: ss_captcha]

Fix

The issue can be resolved by adding a valid user agent to EdgeOptions() in PremiumSubstackScraper. The following works on my end:

options.add_argument(
    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246"
)
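
The same fix can be wrapped in a small helper so the user agent is set in one place. This is only a sketch: `apply_user_agent` and `DEFAULT_USER_AGENT` are hypothetical names, and the helper assumes the object passed in exposes `add_argument`, as the `options` object in the snippet above does.

```python
# User agent string taken from the fix above.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246"
)


def apply_user_agent(options, user_agent: str = DEFAULT_USER_AGENT):
    """Add a --user-agent argument to a browser options object.

    `options` is assumed to provide add_argument(str), like the
    EdgeOptions instance used in the issue.
    """
    if not user_agent.strip():
        raise ValueError("user_agent must be a non-empty string")
    options.add_argument(f"--user-agent={user_agent}")
    return options
```

With the scraper's own options object this would be called as `apply_user_agent(options)` before the driver is created.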
