richardjrl / pocketmagstopdf

License: Creative Commons Zero v1.0 Universal

Python 100.00%

pocketmagstopdf's Introduction

Download pocketmags magazines in PDF format from the HTML5 reader.

PLEASE USE THIS SCRIPT RESPONSIBLY. THE MAGAZINE PUBLISHING INDUSTRY RELIES HEAVILY ON INCOME FROM SALES WITH VERY SLIM PROFIT MARGINS.

Acknowledgements:

This is a modified version of the GitHub Gist called pmdown.py written by the GitHub user rjw57. I would have contributed my changes to the original but alas it is only a Gist, not a GitHub Repository.

With thanks to:

rjw57 for the original pmdown.py Python script.
bani6809 for revealing in the comments that the "high" and "extrahigh" quality image URLs end in bin, not jpg.
shirblc for replacing my collection of Python print statements with proper Python logging.

NB: I have only been able to test this on the small number of magazines I have purchased on pocketmags.com

Feature Additions:

14/07/2022

Add the option to enable downloading of magazines in the elusive "high" quality format (only when --quality=high is used, otherwise the default is "mid").
Add the option to insert a custom title into the generated PDF's metadata to replace the default of "untitled.pdf".

13/08/2022

Add the option to specify a range of pages to download, rather than the whole magazine.
Add the option to save images to a separate directory in addition to generating the PDF.
Add the option to set a delay between downloading pages in case of any server-imposed rate-limiting.

30/09/2022

Add the option to enable downloading of magazines in the Holy-Grail "original" format (only when --quality=original is used, otherwise the default is "mid").
Add options to alter the verbosity level of the program's output:
- --quiet suppresses all output except warnings and errors.
- No option given will present a normal level of informational output.
- --debug prints comprehensive PDF-related information.
Add the option to hide the User UUID watermark that is inserted on each page of the PDF when --quality=original is used.

09/12/2022

Add proper Python Logger support (implemented by shirblc)

19/03/2023

Add the option to enable downloading of magazines in the newly-discovered "extrahigh" quality format (only when --quality=extrahigh is used, otherwise the default is "mid").

Usage:

pocketmagstopdf.py (-h | --help)
pocketmagstopdf.py [options] <pdf> <url>

Options:

-h, --help                  Print brief usage summary.

--quality=QUALITY           Set magazine download quality.
                            Choose from extralow, low, mid, high, extrahigh or original. (Optional)
                            [default: mid]

--dpi=DPI                   Set image resolution in dots per inch. (Optional)
                            Not used with '--quality=original'.
                            [default: 150]

--title=TITLE               Set magazine title in the PDF metadata. (Optional)
                            Not used with '--quality=original'.
                            Default value is the filename with:
                                - underscores replaced with spaces
                                - the file extension removed

--range-from=PAGE-FROM      Define a portion of the magazine to download, starting from this page number. (Optional)
                            Downloads from the beginning of the magazine - page 1 - if absent.
                            [default: 1]

--range-to=PAGE-TO          Define a portion of the magazine to download, ending on this page number. (Optional)
                            Downloads to the end of the magazine if absent.
                            [default: 999]

--delay=DELAY               Set the time in seconds to wait between downloading each page of the magazine. (Optional)
                            There is no delay if absent. The value of the delay may be integer or decimal.
                            Used both whenever probing for the last valid page number of the magazine and
                            between downloading each individual page for all quality settings except 'original'.
                            [default: 0]

--save-images               Save the downloaded JPEG images of the magazine pages to a subdirectory with the same
                            name as the magazine in addition to generating the PDF of the magazine.
                            Not used with '--quality=original'.
                            [default: False]

--image-subdir-prefix=PFX   If --save-images is set, prefix the name of the subdirectory the images are saved to with
                            this string. Blank by default. (Optional)
                            Not used with '--quality=original'.
                            [default: ]

--image-subdir-suffix=SFX   If --save-images is set, suffix the name of the subdirectory the images are saved to with
                            this string. Blank by default. (Optional)
                            Not used with '--quality=original'.
                            [default: ]

--uuid=UUID                 Specifies the User UUID to use to download the PDF when '--quality=original' is used
                            and --uuid-randomise is not used.
                            Read the 'Notes' section below for details of how to find it. (Optional/Required)
                            Only used with '--quality=original'.
                            [default: None]

--uuid-randomise            Uses a random UUID to download the PDF when '--quality=original' is specified. (Optional)
                            [default: False]

--uuid-hide                 Hides the User UUID watermark on each page of the PDF by making it transparent.
                            This option is overridden by '--uuid-destroy'.
                            Only used with '--quality=original' as watermark not present on lower quality downloads.
                            [default: False]

--uuid-destroy              Completely wipes the User UUID watermark from each page of the PDF. (Experimental)
                            This option overrides '--uuid-hide'.
                            Only used with '--quality=original' as watermark not present on lower quality downloads.
                            [default: False]

--timestamp-change          Alters the timestamp within the downloaded PDF.
                            Only used with '--quality=original'.
                            [default: False]

--quiet                     Suppress printing of all output except warning and error messages.
                            [default: False]

--debug                     Print extra output to aid debugging of the program.
                            Setting both '--quiet' and '--debug' is contradictory.
                            If this happens, a warning is issued and the debug setting overrides the quiet setting.
                            [default: False]

<pdf>                       Save output to this file. (Required)
<url>                       A URL to one image from the magazine. (Required)
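
Two of the rules in the option list above (the default PDF title, and '--debug' taking precedence over '--quiet') can be illustrated with a minimal sketch. The function names default_title and resolve_log_level are hypothetical and are not taken from pocketmagstopdf.py; the script itself may implement these rules differently.

```python
import logging
from pathlib import Path


def default_title(pdf_filename: str) -> str:
    """Hypothetical helper: derive a title from the output filename.

    Follows the documented rule: drop the file extension and
    replace underscores with spaces.
    """
    stem = Path(pdf_filename).stem
    return stem.replace("_", " ")


def resolve_log_level(quiet: bool, debug: bool) -> int:
    """Hypothetical helper: map the two flags onto a logging level.

    Follows the documented rule: if both flags are set, warn and let
    the debug setting override the quiet setting.
    """
    if quiet and debug:
        logging.warning("--quiet and --debug are contradictory; using --debug")
        return logging.DEBUG
    if debug:
        return logging.DEBUG
    if quiet:
        return logging.WARNING
    return logging.INFO
```

For example, default_title("my_magazine.pdf") gives "my magazine".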

Examples:

pocketmagstopdf.py --quality=extrahigh --delay=2 --title="My Magazine, Issue 73, October 2022" my_magazine.pdf https://mcdatastore.blob.core.windows.net/mcmags/<STORAGE_BUCKET_UUID>/<ISSUE_UUID>/extralow/0000.jpg

pocketmagstopdf.py --quality=original --delay=0.5 --uuid-hide --uuid=<USER_UUID> my_magazine.pdf https://mcdatastore.blob.core.windows.net/mcmags/<STORAGE_BUCKET_UUID>/<ISSUE_UUID>/extralow/0000.jpg

Notes:

URLs for pocketmags images and User UUIDs can be found by using the HTML5 reader, right-clicking on a page and selecting "inspect element". Look for URLs of the form:

https://mcdatastore.blob.core.windows.net/mcmags/<uuid1>/<uuid2>/extralow/<num>.jpg

where <uuid1> and <uuid2> are strings of letters and numbers with dashes separating them and <num> is some 4-digit number.
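
To check whether a copied URL matches this form, a small sketch along the following lines may help. The regular expression and the name parse_page_url are illustrative assumptions, not code from the script; in particular, the accepted character set for the UUIDs simply follows the "letters, numbers and dashes" description.

```python
import re

# Assumed pattern for the documented URL form (not taken from the script).
PAGE_URL_RE = re.compile(
    r"^https://mcdatastore\.blob\.core\.windows\.net/mcmags/"
    r"(?P<uuid1>[0-9A-Za-z-]+)/"
    r"(?P<uuid2>[0-9A-Za-z-]+)/"
    r"(?P<quality>[a-z]+)/"
    r"(?P<num>\d{4})\.jpg$"
)


def parse_page_url(url: str) -> dict:
    """Hypothetical helper: split a page image URL into its named parts."""
    match = PAGE_URL_RE.match(url)
    if match is None:
        raise ValueError(f"URL does not match the expected form: {url}")
    return match.groupdict()
```

A URL that raises ValueError here is probably not one of the page image URLs described above.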

The User UUID required for downloading the magazine when '--quality=original' can be found by searching the HTML for the text "userGuid:" and copying the hexadecimal value that follows it without the surrounding single quote characters.
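
If the page source has been saved to a file, the search described above could be automated roughly as follows. This is a sketch under assumptions: the exact spacing around "userGuid:" and the presence of dashes in the value are guesses, and find_user_guid is a hypothetical name.

```python
import re
from typing import Optional

# Assumed shape: userGuid: '<hex digits, possibly with dashes>'
USER_GUID_RE = re.compile(r"userGuid:\s*'([0-9A-Fa-f-]+)'")


def find_user_guid(html: str) -> Optional[str]:
    """Hypothetical helper: return the value after "userGuid:" without
    the surrounding single quotes, or None if it is not found."""
    match = USER_GUID_RE.search(html)
    return match.group(1) if match else None
```

The returned string is what would be passed to --uuid.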

pocketmagstopdf's People

luis-guilherme shirblc security-101 andresant fralacticus lkowolowski darthbill07 j-mcavoy

pocketmagstopdf's Issues

HTTP error code 405

Python neophyte here. I was able to find the various IDs and to get the latest script running, but after finding the last good page of the mag, the script terminates with ERROR - Unable to download magazine: HTTP error code 405. Any guidance would be appreciated.

Error message repeating no matter what I do

Hi,

I keep getting the following error message in CMD no matter what I seem to type:

File "C:\Users\Dan\Downloads\pocketmagstopdf.py", line 67
<title>pocketmagstopdf/pocketmagstopdf.py at main · RichardJRL/pocketmagstopdf · GitHub</title>
^
SyntaxError: invalid character '·' (U+00B7)

The image I'm using from Pocketmags is https://mcdatastore.blob.core.windows.net/mcmags/7073c0ad-b1fd-4ecc-a098-fa2963898205/08d67783-d6cd-42e1-9983-417f5b6b46f2/extralow/0000.jpg

I've installed Python, and updated pip, docopt, Pillow & reportlab. I'm very new to Python so I may be doing something wrong, but I'm not sure whereabouts that would be. Any assistance would be greatly appreciated!
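
The line quoted in the error is an HTML <title> tag from a GitHub page, which suggests the saved file is the GitHub web page for the script rather than the Python script itself. A quick way to check a downloaded file is sketched below; looks_like_html is a hypothetical helper and is not part of pocketmagstopdf.

```python
def looks_like_html(text: str) -> bool:
    """Hypothetical check: does this text look like a saved web page
    rather than Python source?"""
    stripped = text.lstrip().lower()
    # A saved web page normally starts with a doctype or an <html> tag.
    if stripped.startswith(("<!doctype html", "<html")):
        return True
    # Otherwise look for a line that opens with a <title> tag, as in the error.
    return any(
        line.strip().lower().startswith("<title>") for line in text.splitlines()
    )
```

If this returns True for the contents of pocketmagstopdf.py, downloading the raw file from the repository instead should give a runnable script.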

SyntaxError: invalid syntax

I'm probably being an absolute dullard, but I keep getting "SyntaxError: invalid syntax" with the https part of the blob URL highlighted. What am I doing wrong? Many thanks!

corrupt pages inserted into assembled pdf

It seems that corrupt pages sometimes get put into the assembled PDF, for example /high/0031.bin from Retro Gamer issue 168.

When I download the blob files individually and modify the header, the JPG appears normal. I'm not sure why they are being corrupted when inserted into the assembled PDF.

R6034 Runtime Error

Hi,
First of all, thanks for creating this, it's been a massive help with my work. I've been using it for many months now with little issue, but this morning I've encountered this error:

And these are the errors in CMD:

C:\Users\Dan\Downloads>pocketmagstopdf.py --quality=high Woman_09012023_UploadedByUK.pdf https://mcdatastore.blob.core.windows.net/mcmags/9f77c4ca-97be-41c0-85da-3afa0a8a36b1/41586ac2-e951-45fd-ab76-b132ae72be63/extralow/0000.jpg
Traceback (most recent call last):
  File "C:\Users\Dan\Downloads\pocketmagstopdf.py", line 129, in <module>
    from urllib.request import urlopen
  File "C:\Users\Dan\AppData\Local\Programs\Python\Python311\Lib\urllib\request.py", line 88, in <module>
    import http.client
  File "C:\Users\Dan\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 71, in <module>
    import email.parser
  File "C:\Users\Dan\AppData\Local\Programs\Python\Python311\Lib\email\parser.py", line 12, in <module>
    from email.feedparser import FeedParser, BytesFeedParser
  File "C:\Users\Dan\AppData\Local\Programs\Python\Python311\Lib\email\feedparser.py", line 27, in <module>
    from email._policybase import compat32
  File "C:\Users\Dan\AppData\Local\Programs\Python\Python311\Lib\email\_policybase.py", line 9, in <module>
    from email.utils import _has_surrogates
  File "C:\Users\Dan\AppData\Local\Programs\Python\Python311\Lib\email\utils.py", line 29, in <module>
    import socket
  File "C:\Users\Dan\AppData\Local\Programs\Python\Python311\Lib\socket.py", line 51, in <module>
    import _socket
ImportError: DLL load failed while importing _socket: A dynamic link library (DLL) initialization routine failed.

When I booted up the computer this morning the Windows installation welcome screen appeared for some reason, so this may be related. Just wondering if you've encountered this before and how I would go about rectifying it. I've tried rebooting a couple of times but no joy.

Cheers,
Dan

HTTP 403 Forbidden

Today I've started getting 403 Forbidden responses when the script tries downloading the bin files:

2024-06-30 20:04:51,568 - DEBUG - https://files.magazineclonercdn.com/mcmags/xxx/yyy/extrahigh/0000.bin
Traceback (most recent call last):
  File "pocketmagstopdf/pocketmagstopdf.py", line 769, in <module>
    main()
  File "pocketmagstopdf/pocketmagstopdf.py", line 411, in main
    raise e
  File "pocketmagstopdf/pocketmagstopdf.py", line 373, in main
    with urlopen(req) as f:
         ^^^^^^^^^^^^
  File "/usr/lib/python3.12/urllib/request.py", line 215, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/urllib/request.py", line 521, in open
    response = meth(req, response)
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/urllib/request.py", line 630, in http_response
    response = self.parent.error(
               ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/urllib/request.py", line 559, in error
    return self._call_chain(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/urllib/request.py", line 492, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/usr/lib/python3.12/urllib/request.py", line 639, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

I reproduced this via curl on my system and found that it happens regardless of system proxy settings, and that it happens when making GET requests to https://files.magazineclonercdn.com/mcmags/xxx/yyy/extrahigh/0000.bin with no User-Agent header.

Fixing the script to include a User-Agent header resolved this issue:

2024-06-30 20:06:52,178 - INFO - User-Agent is Mozilla/5.0.0
2024-06-30 20:06:52,178 - INFO - Quiet output is false
2024-06-30 20:06:52,178 - INFO - Debug output is true
2024-06-30 20:06:52,178 - DEBUG - https://files.magazineclonercdn.com/mcmags/xxx/yyy/extrahigh/0000.bin
2024-06-30 20:06:55,127 - INFO - Downloading page 1 from https://files.magazineclonercdn.com/mcmags/xxx/yyy/extrahigh/0000.bin...
2024-06-30 20:06:55,838 - INFO - Image is 2365 x 3072 pixels and 15.77in x 20.48in at 150.0 DPI

j-mcavoy@5555094
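
For readers who want to see the general shape of such a change without opening the commit, here is a minimal sketch using urllib. It is not the code from the commit referenced above: build_request and fetch are hypothetical names, and the User-Agent string is simply copied from the log output shown.

```python
from urllib.request import Request, urlopen

# Value copied from the log output above; any other value is an assumption.
USER_AGENT = "Mozilla/5.0.0"


def build_request(url: str, user_agent: str = USER_AGENT) -> Request:
    """Hypothetical helper: wrap a URL in a Request carrying a User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url: str) -> bytes:
    """Hypothetical helper: download the raw bytes of one page."""
    with urlopen(build_request(url)) as response:
        return response.read()
```

Passing a Request object rather than a bare URL string to urlopen is what allows the extra header to be sent.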

pocketmags now uses extrahigh quality

pocketmags now uses https://mcdatastore.blob.core.windows.net/mcmags/{uuid}/{uuid}/extrahigh/0000.bin as a quality level, which pocketmagstopdf doesn't recognize as a valid url pattern (line 140).
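
To make the relationship between the sample URL and the per-quality page URLs concrete, here is a rough sketch. It is not the script's own logic: page_url is a hypothetical helper, the zero-based 4-digit page index is inferred from the example URLs, and using .jpg for every quality other than "high" and "extrahigh" is an assumption based on the README acknowledgements.

```python
# Per-page image qualities listed in the usage text ("original" is a PDF download).
IMAGE_QUALITIES = ("extralow", "low", "mid", "high", "extrahigh")
# Qualities whose files end in .bin, per the README acknowledgements.
BIN_QUALITIES = {"high", "extrahigh"}


def page_url(sample_url: str, quality: str, page_index: int) -> str:
    """Hypothetical helper: derive a page URL for another quality level
    from any one image URL of the same magazine."""
    if quality not in IMAGE_QUALITIES:
        raise ValueError(f"not a per-page image quality: {quality!r}")
    # Drop the trailing "<quality>/<num>.<ext>" from the sample URL.
    base, _sample_quality, _sample_file = sample_url.rsplit("/", 2)
    extension = "bin" if quality in BIN_QUALITIES else "jpg"
    return f"{base}/{quality}/{page_index:04d}.{extension}"
```

With this helper, an extralow sample URL ending in extralow/0000.jpg maps to extrahigh/0000.bin for the first page at "extrahigh" quality.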

URL is now different for magazines

Hi,

The URL for jpgs has now changed from mcdatastore.blob.core.windows.net to files.magazineclonercdn.com - which has broken the script. Hopefully it's a simple fix!

Cheers,

Dan
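
One way to tolerate a host change like this is to validate against a set of known hosts instead of a single hard-coded one. The sketch below is an illustration only: KNOWN_HOSTS and is_known_magazine_url are hypothetical names, and the set contains just the two hosts mentioned in these issues.

```python
from urllib.parse import urlsplit

# The two hosts mentioned in the issues on this page.
KNOWN_HOSTS = {
    "mcdatastore.blob.core.windows.net",
    "files.magazineclonercdn.com",
}


def is_known_magazine_url(url: str) -> bool:
    """Hypothetical check: https URL on a known host under /mcmags/."""
    parts = urlsplit(url)
    return (
        parts.scheme == "https"
        and parts.hostname in KNOWN_HOSTS
        and parts.path.startswith("/mcmags/")
    )
```

A further host change would then only require adding one entry to the set.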

download actual PDFs?

Is there any way to automate downloading the actual PDFs from the pocketmags reader?

Right now you use the pocketmags online reader, hit the printer icon, select up to 2 pages and hit 'download pdf'. It then sends you to blob:https://pocketmags.com/{some guid}, which is a PDF instead of a JPG. This is the absolute highest quality of the magazine. It is possible to download entire issues this way, but it is incredibly slow because it has to be done manually.

verification problems???

Fairly new to python and stuff, have used this in the past to download magazines just fine. Bought some new ones and ran into some issues, one of which was already described here due to problems with the "original" quality setting. I downloaded the newest version of the code and tried to get the extrahigh quality to work, but something is still going wrong. I've included the readout below:

Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 1348, in do_open
h.request(req.get_method(), req.selector, req.data, headers,
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py", line 1282, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py", line 1328, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py", line 1277, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py", line 1037, in _send_output
self.send(msg)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py", line 975, in send
self.connect()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py", line 1454, in connect
self.sock = self._context.wrap_socket(self.sock,
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/ssl.py", line 513, in wrap_socket
return self.sslsocket_class._create(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/ssl.py", line 1071, in _create
self.do_handshake()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/ssl.py", line 1342, in do_handshake
self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/alexis/Downloads/pocketmagstopdf-main/pocketmagstopdf.py", line 757, in <module>
main()
File "/Users/alexis/Downloads/pocketmagstopdf-main/pocketmagstopdf.py", line 361, in main
with urlopen(page_url) as f:
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 216, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 519, in open
response = self._open(req, data)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 536, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 496, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 1391, in https_open
return self.do_open(http.client.HTTPSConnection, req,
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 1351, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)>

I do not have enough experience with coding to even begin to understand what this means. Thanks in advance for your help :)
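
The final line of the traceback says that Python could not verify the server's certificate because it could not find a local issuer certificate. As a diagnostic sketch only (it does not fix anything), the standard ssl module can report where this Python installation looks for its certificate authority files. The helper name describe_ca_locations is hypothetical.

```python
import os
import ssl


def describe_ca_locations() -> dict:
    """Hypothetical diagnostic: report where this Python looks for CA certificates."""
    paths = ssl.get_default_verify_paths()
    return {
        "cafile": paths.cafile,            # CA bundle in use, or None if not found
        "capath": paths.capath,            # CA directory in use, or None if not found
        "openssl_cafile": paths.openssl_cafile,  # compiled-in default bundle path
        "openssl_cafile_exists": os.path.exists(paths.openssl_cafile),
    }


if __name__ == "__main__":
    for key, value in describe_ca_locations().items():
        print(f"{key}: {value}")
```

If both cafile and capath come back as None, the installation has no certificate bundle to verify against, which would be consistent with the error shown.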

"Reports"

I am less than a python newbie, so please forgive me.

Managed to download python and followed the instructions for the dependencies.

I get this when I try to run it

Traceback (most recent call last):
File "pocketmagstopdf.py", line 134, in <module>
import requests as requests
ModuleNotFoundError: No module named 'requests'

Any idea how to fix?
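
The error means the third-party requests package is not installed for the Python interpreter being used. A small sketch for checking which packages are missing before running the script is shown below. The package list is an assumption assembled from the modules named in the issues on this page (docopt, Pillow, reportlab and requests), and missing_packages is a hypothetical helper.

```python
from importlib.util import find_spec

# Maps import name -> pip package name. The list is an assumption based on
# the packages mentioned in the issues above, not on the script's own docs.
REQUIRED = {
    "docopt": "docopt",
    "PIL": "Pillow",
    "reportlab": "reportlab",
    "requests": "requests",
}


def missing_packages(required: dict = REQUIRED) -> list:
    """Hypothetical helper: return pip names of packages that cannot be imported."""
    return [pkg for module, pkg in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All listed packages are importable.")
```

Anything reported as missing can then be installed with pip for the same Python installation that runs the script.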