A Sanf*undry Scraper for use in 2023 and after
- Download the `samyfoundry.py` file from this GitHub repo.
- Install the required libraries (in Command Prompt on Windows):
```
pip install cfscrape
pip install beautifulsoup4
pip install pdfkit
```
- Also install the `wkhtmltopdf` tool from https://wkhtmltopdf.org/downloads.html, or build it from source.
- Then add its binary folder to the PATH variable in Environment Variables.
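If you'd rather not edit PATH, `pdfkit` can also be pointed at the `wkhtmltopdf` binary directly. A minimal configuration sketch (the install path below is the Windows installer's default and may differ on your machine; the file names are illustrative):

```python
import pdfkit

# Assumption: default Windows install location; adjust to wherever you
# installed wkhtmltopdf. On Linux/macOS the binary is usually already
# on PATH as "wkhtmltopdf".
config = pdfkit.configuration(
    wkhtmltopdf=r"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe"
)

# Convert an already-scraped HTML file to a PDF using that explicit binary.
pdfkit.from_file("scraped.html", "scraped.pdf", configuration=config)
```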
- Run the `samyfoundry.py` file with `python samyfoundry.py` or `python3 samyfoundry.py` in the command prompt.
- Enter the Sanf*undry page link containing all the topics of a particular chapter.
- The `cfscrape` module bypasses the Cloudflare security check and captcha shown when visiting the provided link.
- It then parses the HTML content of the webpage with the help of `BeautifulSoup`.
- The code automatically creates an HTML file in the same folder as `samyfoundry.py`, with a name scraped from the provided link.
- It scrapes the initial page containing all the topic links and gathers them into a list.
- It then parses each topic link and scrapes all the questions and answers, skipping the sticky banner ads, links, and other advertisements on the site.
- It stores/appends each topic's Q/A to the HTML file created at the beginning (along with all the tags from those sections).
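The link-gathering step above can be sketched as follows. This is a minimal illustration on a static HTML snippet, not the repo's actual code: in `samyfoundry.py` the page content comes from `cfscrape.create_scraper().get(url).content`, and the CSS selector and URLs here are assumptions.

```python
from bs4 import BeautifulSoup

# Stand-in for the chapter page that cfscrape would fetch over the network.
CHAPTER_HTML = """
<div class="entry-content">
  <ul>
    <li><a href="https://example.com/topic-1/">Topic 1</a></li>
    <li><a href="https://example.com/topic-2/">Topic 2</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(CHAPTER_HTML, "html.parser")

# Gather every topic link from the initial page into a list, keeping only
# real http(s) links so ads and in-page anchors are skipped.
topic_links = [a["href"] for a in soup.select("div.entry-content a[href]")
               if a["href"].startswith("http")]
print(topic_links)
```

Each collected link would then be fetched and parsed the same way to extract its questions and answers.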
- After it has finished parsing and scraping all the topic links, it stops the scraping loop, prints `Total Topics Found:` and `Total Q/A Found:`, converts the HTML file to a PDF of the same name using `pdfkit` and `wkhtmltopdf`, prints `End`, closes the file, and stops execution.
- In the meantime, while the code is running, it keeps printing each topic number and the number of its questions successfully scraped and written to the file.
Update: Added the functionality of producing a PDF file directly from the HTML, instead of a txt file.
Note: This code was written by me (Soumya Majhi / Samy) in April 2023 for my own personal academic purposes, as all the other code available on GitHub and elsewhere for scraping this website was outdated and didn't produce proper output. So it should only be used for personal/academic purposes, not for any kind of commercial reason. And if anyone would like to use this code anywhere, just provide a link to this repository of mine; that is all that is needed.
P.S. There is one known problem in this code: it writes lines containing code as single unformatted lines, instead of preserving their indented form.
So if anyone has a solution to this problem, please feel free to reach out to me or write about it in the Discussions section.
(An important fact: the code snippets are contained in `<pre>` tags.)
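Since the code snippets live in `<pre>` tags, one possible fix (a sketch, not the repo's actual code; the sample HTML and selector are assumptions) is to copy each `<pre>` element's markup verbatim into the output HTML instead of its flattened text. HTML rendering collapses whitespace everywhere except inside `<pre>`, so keeping the tag keeps the indentation:

```python
from bs4 import BeautifulSoup

# Stand-in for one scraped question section containing a code snippet.
SAMPLE = """
<div class="entry-content">
  <p>3. What is the output of the following code?</p>
  <pre>for i in range(2):
    print(i)</pre>
</div>
"""

def extract_block(tag):
    """Keep <pre> blocks as raw HTML so indentation survives rendering;
    flatten everything else to plain text wrapped in <p> tags."""
    if tag.name == "pre":
        return str(tag)  # verbatim markup, newlines and indentation kept
    return "<p>" + tag.get_text(" ", strip=True) + "</p>"

soup = BeautifulSoup(SAMPLE, "html.parser")
chunks = [extract_block(t) for t in soup.select("div.entry-content > *")]
print(chunks[1])
```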
P.S. Another current problem is that it sometimes doesn't print certain special characters correctly (they come out as: �). Any suggestion would be helpful.
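The � characters are often an encoding issue rather than a scraping one: on Windows, `open()` defaults to the legacy code page (e.g. cp1252), which mangles characters outside it. A possible fix (a sketch; the file name is illustrative) is to open the output file with an explicit UTF-8 encoding and declare the charset in the HTML, so wkhtmltopdf decodes it the same way:

```python
import os
import tempfile

# Assumption: "scraped.html" stands in for the file samyfoundry.py creates.
path = os.path.join(tempfile.gettempdir(), "scraped.html")

# Writing with encoding="utf-8" keeps characters outside the Windows code
# page (e.g. "≤") intact; the <meta> tag tells the renderer the encoding.
with open(path, "w", encoding="utf-8") as f:
    f.write('<meta charset="utf-8">\n')
    f.write("<p>a ≤ b</p>\n")

# Reading it back with the same encoding recovers the characters unchanged.
with open(path, encoding="utf-8") as f:
    text = f.read()

print("≤" in text)  # → True
```

When converting, `pdfkit.from_file` also accepts `options={"encoding": "UTF-8"}`, which passes wkhtmltopdf's `--encoding` flag so both sides agree.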