Description:
A simple and interesting web scraping project to gather all chapters of the Jujutsu Kaisen manga released to date.
Tech Stack: Python + various modules such as re, os, bs4 (BeautifulSoup), urllib, requests, zipfile, and Selenium (basics)
Additional Description
Web scraping is an essential tool for gathering data, and there are plenty of examples that focus on extracting tabular data from websites. I would like to go beyond that and extract and download other resources, such as images and links. This tutorial covers the following topics, in order:
- Dealing with single-page applications and infinite-scrolling websites using Selenium
- Extracting useful links using BeautifulSoup and requests
- Cleaning the data using custom regex
- Downloading all the chapters as JPEG (can easily be customized to limit the number of chapters):
  - Using the urllib and os modules
  - Mocking request headers to bypass rate limiting and bot detection
- Organizing and grouping the chapters
- Zipping and converting the chapters into the .cbz format:
  - Can be extended to volume-based zipping (for example, Volume 1 has Chapters 1 to 60, Volume 2 has Chapters 61 to 120, and so on)
  - The file format can be customized as well (for example, .pdf)
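To give a feel for the Selenium step, here is a minimal sketch of the infinite-scroll loop: keep scrolling to the bottom until the page height stops growing, then return the fully rendered HTML. It assumes an already-created Selenium WebDriver (e.g. `webdriver.Chrome()`); only the loop logic is shown, and the parameter values are illustrative.

```python
import time

def scroll_to_bottom(driver, pause=2.0, max_rounds=30):
    """Scroll an infinite-scroll page until no new content loads.

    `driver` is assumed to be a Selenium WebDriver (e.g. webdriver.Chrome()).
    Returns the fully rendered HTML via driver.page_source.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded content time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # nothing new loaded -> we are done
            break
        last_height = new_height
    return driver.page_source
```

Passing the driver in (rather than creating it inside) keeps the loop logic independent of which browser or driver binary is installed.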
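The link-extraction and regex-cleaning steps can be sketched together: parse the rendered HTML with BeautifulSoup, then use a regex to keep only real chapter links and discard navigation links. The HTML snippet, URL structure, and base URL below are hypothetical, purely for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet of a chapter-list page (structure is illustrative only).
HTML = """
<div class="chapters">
  <a href="/manga/jujutsu-kaisen/chapter-1">Chapter 1</a>
  <a href="/manga/jujutsu-kaisen/chapter-2">Chapter 2</a>
  <a href="/about">About us</a>
</div>
"""

CHAPTER_RE = re.compile(r"/chapter-(\d+)$")  # keep only real chapter links

def extract_chapter_links(html, base="https://example-manga-site.com"):
    """Return a sorted list of (chapter_number, absolute_url) tuples."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        m = CHAPTER_RE.search(a["href"])
        if m:  # discard navigation links such as /about
            links.append((int(m.group(1)), base + a["href"]))
    return sorted(links)

print(extract_chapter_links(HTML))
```

Sorting by the parsed chapter number keeps the download order correct even when the site lists chapters newest-first.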
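The download step with urllib and os, including the header mocking, might look like the sketch below: attach a browser-like `User-Agent` (plain urllib requests are often rejected as bots) and save each page image into a chapter folder with zero-padded filenames so they sort correctly. The header values and naming scheme are assumptions, not the exact ones from my code.

```python
import os
import urllib.request

# Browser-like headers; bare urllib requests are frequently blocked as bots.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
}

def save_page(img_url, out_dir, page_no):
    """Download one page image into out_dir as a zero-padded .jpg file."""
    os.makedirs(out_dir, exist_ok=True)
    req = urllib.request.Request(img_url, headers=HEADERS)
    dest = os.path.join(out_dir, f"{page_no:03d}.jpg")  # 007.jpg sorts before 010.jpg
    with urllib.request.urlopen(req) as resp, open(dest, "wb") as fh:
        fh.write(resp.read())
    return dest
```

Grouping pages into one folder per chapter (e.g. `Chapter-001/`) also makes the later zipping step trivial, since each folder maps one-to-one onto a `.cbz` archive.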
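The final step is simple because a `.cbz` file is just a ZIP archive of images with a different extension, so the standard zipfile module is all that is needed. The folder naming below is a hypothetical example of the per-chapter layout.

```python
import os
import tempfile
import zipfile

def zip_chapter_to_cbz(chapter_dir):
    """Pack every image in chapter_dir into a sibling .cbz (a plain ZIP)."""
    cbz_path = chapter_dir.rstrip(os.sep) + ".cbz"
    with zipfile.ZipFile(cbz_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(os.listdir(chapter_dir)):  # preserve page order
            if name.lower().endswith((".jpg", ".jpeg", ".png")):
                zf.write(os.path.join(chapter_dir, name), arcname=name)
    return cbz_path

# Demo with a throwaway chapter folder of fake pages:
with tempfile.TemporaryDirectory() as tmp:
    chap = os.path.join(tmp, "Chapter-001")
    os.makedirs(chap)
    for i in range(1, 4):
        with open(os.path.join(chap, f"{i:03d}.jpg"), "wb") as fh:
            fh.write(b"\xff\xd8fake")
    print(zip_chapter_to_cbz(chap).endswith(".cbz"))  # True
```

Extending this to volume-based zipping is just a matter of pointing the same function at a folder that groups several chapter folders, or collecting their pages first.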
Other
I have the code ready and have already used it myself. I'm including a screenshot to give an idea of the downloaded file and folder structure. It might require a bit of refactoring and some additional comments, but that should not take long.
@Mrinank-Bhowmick Let me know what you think and whether this can be added to your repo. If so, please assign this task to me along with the hacktoberfest2022 tag.
Thank you in advance!