xword-dl's Introduction

xword-dl

xword-dl is a command-line tool to download .puz files for online crossword puzzles from supported outlets or arbitrary URLs with embedded crossword solvers. For a supported outlet, you can easily download the latest puzzle, or specify one from the archives.

Supported outlets:

Outlet Keyword Download latest Search by date Search by URL
Atlantic atl ✔️ ✔️
Crossword Club club ✔️ ✔️ ✔️
The Daily Beast db ✔️
Daily Pop pop ✔️ ✔️
Der Standard std ✔️ ✔️
The Globe And Mail cryptic tgam ✔️ ✔️ ✔️
Guardian Cryptic grdc ✔️ ✔️
Guardian Everyman grde ✔️ ✔️
Guardian Prize grdp ✔️ ✔️
Guardian Quick grdq ✔️ ✔️
Guardian Quiptic grdu ✔️ ✔️
Guardian Speedy grds ✔️ ✔️
Guardian Weekend grdw ✔️ ✔️
Los Angeles Times lat ✔️ ✔️
The McKinsey Crossword mck ✔️ ✔️ ✔️
New York Times nyt ✔️ ✔️ ✔️
New York Times Mini nytm ✔️ ✔️ ✔️
New York Times Variety nytv ✔️
The New Yorker tny ✔️ ✔️ ✔️
Newsday nd ✔️ ✔️
Puzzmo pzm ✔️
Simply Daily Puzzles sdp ✔️ ✔️ ✔️
Simply Daily Puzzles Cryptic sdpc ✔️ ✔️ ✔️
Simply Daily Puzzles Quick sdpq ✔️ ✔️ ✔️
Universal uni ✔️ ✔️
USA Today usa ✔️ ✔️
Vox vox ✔️
Washington Post wp ✔️ ✔️

To download a puzzle, install xword-dl and run it on the command line.

Installation

The easiest way to install xword-dl is through pip. Install the latest version with:

pip install xword-dl

You can also install xword-dl by downloading or cloning this repository from GitHub. From a terminal, simply running

python setup.py install

in the downloaded directory may be enough.

But in either case, you probably want to install xword-dl and its dependencies in a dedicated virtual environment. I use virtualenv and virtualenvwrapper personally, but that's a matter of preference. If you're already feeling overwhelmed by the thought of managing Python packages, know you're not alone. The official documentation is pretty good, but it's a hard problem, and it's not just you. If it's any consolation, learning how to use virtual environments today on something sort of frivolous like a crossword puzzle downloader will probably save you from serious headaches in the future when the stakes are higher.

Usage

Once installed, you can invoke xword-dl, providing the short code of the site from which to download. If you run xword-dl without providing a site keyword, it will print some usage instructions and then exit.

For example, to download the latest Newsday puzzle, you could run:

xword-dl nd --latest

or simply

xword-dl nd

You can also download puzzles that are embedded in AmuseLabs solvers or on supported sites by providing a URL, such as:

xword-dl https://rosswordpuzzles.com/2021/01/03/cover-up/

In either case, the resulting .puz file can be opened with cursewords or any other puz file reader.

Due to the constraints of the .puz format, xword-dl's conversion may be a bit lossy. For example, the most common form of .puz file only supports Latin-1 text encoding, which means that some special characters (and even “curly quotes”) need to be converted before saving.
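
As a rough illustration of that conversion (a sketch only, not xword-dl's actual code; the function name and the replacement table are assumptions), curly quotes and similar characters can be mapped to plain equivalents before the text is encoded as Latin-1:

```python
# Hypothetical helper; the mapping below is illustrative, not exhaustive.
REPLACEMENTS = str.maketrans({
    '\u201c': '"',    # left double quotation mark
    '\u201d': '"',    # right double quotation mark
    '\u2018': "'",    # left single quotation mark
    '\u2019': "'",    # right single quotation mark
    '\u2026': '...',  # horizontal ellipsis
})


def to_latin1(text):
    """Return text limited to characters that Latin-1 can represent."""
    converted = text.translate(REPLACEMENTS)
    # Anything still outside Latin-1 becomes '?' instead of raising.
    return converted.encode('latin-1', errors='replace').decode('latin-1')


print(to_latin1('\u201cCaf\u00e9\u201d d\u00e9j\u00e0 vu\u2026'))
```

Accented characters such as é survive because Latin-1 includes them; only characters outside that range are rewritten or replaced.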

xword-dl will also, by default, convert provided HTML to plaintext markdown. If you want to skip that step, you can provide the --preserve-html flag at runtime or set the preserve-html key to True in your config file.

Specifying puzzle date

Some outlets allow specification of a puzzle to download by date using the --date or -d flag. For example, to download the Universal puzzle from September 22, 2021, you could run:

xword-dl uni --date 9/22/21

The argument provided after the flag is parsed pretty liberally, and you can use relative descriptors such as "yesterday" or "monday". Use quotes if your date contains spaces (such as "June 16, 2022").

Specifying filenames

By default, files will be given a descriptive name based on puzzle metadata. If you want to specify a name for a given download, you can do so with the -o or --output flag. The following tokens are available:

Token Value
%outlet Outlet name
%prefix Hardcoded outlet prefix
%title Puzzle title
%author Puzzle author
%cmd Puzzle outlet keyword
%netloc Network location (domain and subdomain)
date tokens strftime tokens
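
To show how these tokens could combine, here is a minimal sketch. It is an assumption about the mechanics, not xword-dl's implementation, and render_filename is a hypothetical name. The named tokens are expanded first and the remaining strftime tokens afterwards:

```python
from datetime import date


def render_filename(template, tokens, puzzle_date):
    """Expand named tokens, then strftime tokens, and append .puz."""
    # Named tokens go first: %author and %prefix would otherwise be read
    # as the strftime directives %a and %p followed by stray letters.
    for name in sorted(tokens, key=len, reverse=True):
        value = tokens[name].replace('%', '%%')  # keep literal percent signs
        template = template.replace('%' + name, value)
    return puzzle_date.strftime(template) + '.puz'


name = render_filename(
    '%prefix - %author - %title - %y%m%d',
    {'prefix': 'USA Today',
     'author': 'By Brooke Husic  Ed. Erik Agard',
     'title': 'Right Turns'},
    date(2022, 11, 15),
)
print(name)
```

The ordering is the design point: because several named tokens begin with the same letter as a strftime directive, handing the raw template straight to strftime would mangle them.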

Configuration file

When running xword-dl, a configuration file is created to store persistent settings. By default, this file is located at ~/.config/xword-dl/xword-dl.yaml. You can manually edit this file to pass options to xword-dl at runtime.

Most settings are specified by the command keyword. For example, if you want to save USA Today puzzles in this format:

USA Today - By Brooke Husic  Ed. Erik Agard - Right Turns - 221115.puz

you can specify that by editing your config file to include the following lines:

usa:
  filename: '%prefix - %author - %title - %y%m%d'

In addition to command keywords, you can also use the keys general (to apply to all puzzles), url (to apply to embedded puzzles selected by URL at runtime), or a given netloc (to apply to embedded puzzles at that domain or subdomain).
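
One way to picture how such sections could be consulted is sketched below. The precedence shown (most specific section first, general last) is an assumption made for illustration, since the order is not spelled out here, and lookup is a hypothetical helper operating on the config file's contents as a Python dict:

```python
def lookup(config, key, sections):
    """Return the first value for key found in the given sections.

    The search order is an assumption for illustration: the caller lists
    the most specific section first and 'general' last.
    """
    for section in sections:
        value = config.get(section, {}).get(key)
        if value is not None:
            return value
    return None


# A config file's contents, shown here as the equivalent Python dict.
config = {
    'general': {'preserve-html': True},
    'usa': {'filename': '%prefix - %author - %title - %y%m%d'},
    'rosswordpuzzles.com': {'filename': '%title'},
}

print(lookup(config, 'filename', ['usa', 'general']))
print(lookup(config, 'preserve-html', ['usa', 'general']))
print(lookup(config, 'filename', ['rosswordpuzzles.com', 'url', 'general']))
```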

New York Times authentication

New York Times puzzles are only available to subscribers. Attempting to download with the nyt keyword without authentication will fail. To authenticate, run:

xword-dl nyt --authenticate

and you will be prompted for your New York Times username and password. (Those credentials can also be passed at runtime with the --username and --password flags.)

If authentication is successful, an authentication token will be stored in a config file. Once that token is stored, you can download puzzles with xword-dl nyt.

In some cases, the authentication may fail because of anti-automation efforts on New York Times servers. If the automatic authentication doesn't work for you, you can manually find your NYT-S token and save it in your config file.

xword-dl's People

Contributors

afontenot, blandford, boisvert42, dependabot[bot], divergentdave, ehcloninger, federicomenaquintero, integrair2021, rchrdlln, samirw, sergent12345, thisisparker, twirlip-of-the-mists

xword-dl's Issues

USA Today url is a little flaky

It sometimes doesn't work on the first attempt but then works on the second. I should build in some retry logic for that one and maybe the rest of them, too.
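
A minimal sketch of what that retry logic might look like. The helper name, attempt count and delay are all assumptions, and the flaky function stands in for a real request:

```python
import time


def with_retries(fetch, attempts=3, delay=1.0):
    """Call fetch() until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError:
            # Network errors, including those raised by requests,
            # derive from OSError.
            if attempt == attempts:
                raise
            time.sleep(delay)


calls = []


def flaky():
    """Stand-in for a download that fails once and then succeeds."""
    calls.append(1)
    if len(calls) < 2:
        raise ConnectionError('first attempt failed')
    return 'puzzle data'


print(with_retries(flaky, delay=0))
```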

NYT authentication fails with a 403: Forbidden for url <url>

From spotchecking it looks like maybe the iOS app, which I was emulating in my login requests, has changed the data it sends. But it also looks like reproducing that login request exactly in curl returns a CAPTCHA, so it's possible that the NYT has stepped up anti-automation features.

Previously completed logins continue to work, and the stored NYT-S cookie should have no problems for existing sessions.

I'm looking into alternative solutions for new authentications, but it's also possible that the CAPTCHA measures are temporary and this will start working again.

Problems with AmuseLabs and PuzzleMe puzzles? New obfuscation?

I seem to be having issues with PuzzleMe and AmuseLab Puzzles. It's probably something I've changed on my end, but maybe not.

I'm trying to get these by URL, but I cannot.
https://joeadultman.blogspot.com/2022/11/bangers-and-mash.html
https://mpcryptics.blogspot.com/2022/12/advent-4.html

I get:
  File "/usr/local/lib/python3.7/site-packages/xword_dl-2022.11.29-py3.7.egg/xword_dl/downloader/amuselabsdownloader.py", line 124, in fetch_data
    xword_data = load_rawc(rawc, amuseKey=amuseKey)
  File "/usr/local/lib/python3.7/site-packages/xword_dl-2022.11.29-py3.7.egg/xword_dl/downloader/amuselabsdownloader.py", line 122, in load_rawc
    return json.loads(base64.b64decode(amuse_b64(rawc, amuseKey)).decode("utf-8"))
  File "/usr/local/lib/python3.7/site-packages/xword_dl-2022.11.29-py3.7.egg/xword_dl/downloader/amuselabsdownloader.py", line 115, in amuse_b64
    while F<len(H):J=H[F];K=int(J,16);E.append(K);F+=1
TypeError: object of type 'NoneType' has no len()

error fetching several puzzle sources: invalid start byte

Hi, I have recently started receiving an error when trying to use xword-dl to fetch New Yorker's puzzles:

'utf-8' codec can't decode byte 0x9b in position 1: invalid start byte

This is on Ubuntu 20.04.04, Python 3.8.10 and the latest xword-dl 2022.05.21 - all the other outlets/shortcodes seem fine! I'm not sure exactly how far back this started, probably a week or two. Thank you for this wonderful tool 😄
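
For context, this message means that bytes which are not valid UTF-8 text were decoded as UTF-8. The sketch below only reproduces the message with made-up bytes; it does not establish what the New Yorker server was actually sending:

```python
# Made-up bytes: 0x9b can only be a continuation byte in UTF-8, so it is
# invalid wherever a new character would have to start.
payload = b'\x1f\x9b\x08'

try:
    payload.decode('utf-8')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```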

Sanitized clues erroneously end in new lines

Thanks to some sleuthing over in jpnance/xw#3, I've learned that html2text() appends a newline to the end of the strings it processes. That behavior probably makes enough sense when you're using it on full documents, but it's a little weird for short strings like crossword clues.

I currently use html2text to clean up AmuseLabs puzzles and WSJ puzzles. Easy enough to strip out, and that will probably result in more "standard" puz files.
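
The stripping step could be as small as the sketch below, where clean_clue is a hypothetical helper and the sample string stands in for converter output:

```python
def clean_clue(converted):
    """Drop trailing whitespace, including newlines, from a converted clue."""
    return converted.rstrip()


# Stand-in for what an HTML-to-text converter might return for a short clue.
converted = 'Type of bean in ful medames\n\n'
print(repr(clean_clue(converted)))
```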

Better filenames

Currently the default filename is a bit utilitarian, and while that works for me, I bet people would prefer more descriptive names. For example, named puzzles should probably include that name. I'm thinking the scheme should probably be

outlet - YYMMDD - puzzle name.puz

with maybe the constructor name added, or something similar, omitting any parts that aren't available. And more of that can probably go up into the BaseDownloader class! Don't repeat yourself!

(In the long run, I could maybe take tokens from the user if they have their own naming convention, but that seems like overkill at present.)
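
A sketch of the proposed scheme, with unavailable parts omitted. The function name and signature are assumptions, not existing code:

```python
from datetime import date


def default_filename(outlet, puzzle_date=None, title=None, author=None):
    """Join whichever parts are available as 'outlet - YYMMDD - title.puz'."""
    parts = [
        outlet,
        puzzle_date.strftime('%y%m%d') if puzzle_date else None,
        title,
        author,
    ]
    return ' - '.join(part for part in parts if part) + '.puz'


print(default_filename('Newsday', date(2021, 1, 3)))
print(default_filename('Newsday', date(2021, 1, 3), title='Cover Up'))
```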

Handling of italics in clues

If (part of) a clue is in italics, it is currently parsed and shown with underscores, rather than being in italics.

As an example, 67A of Friday 6 January's New Yorker puzzle displayed:

Type of bean in _ful_ _medames_

Edit: related to #9?
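
If the goal were simply to drop the markers, one possible approach (an assumption, not a decided fix) is a regular expression that removes underscore emphasis while leaving fill-in-the-blank underscores alone:

```python
import re

# Matches _text_ emphasis but not runs of underscores used as blanks.
EMPHASIS = re.compile(r'(?<!\w)_([^_\s][^_]*?)_(?!\w)')


def strip_emphasis(clue):
    """Remove underscore emphasis markers, keeping the emphasized text."""
    return EMPHASIS.sub(r'\1', clue)


print(strip_emphasis('Type of bean in _ful_ _medames_'))
print(strip_emphasis('___ and behold'))
```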

Incorrect usage message

The usage message implies that all switches go before the source, e.g. "xword-dl -d yesterday nyt", which is the Unix standard (all switches first). In fact, the source must be first: "xword-dl nyt -d yesterday".

Obviously an easy fix. I'll do a fix and PR if I can find the time.

Install should remove or explain hard dateparser dependency

Apparently the dateparser library relies on regex, which is not released as a binary for Mac platforms. So when a Mac user tries to install, they will have to compile regex locally, which introduces a requirement for Xcode's command line tools. That may or may not be a blocker for folks, so I am going to look into whether to include steps on how to install the compiler or to make the dateparser dependency optional at install time (providing date-guessing from text where it is installed, but otherwise using strict date formatting).
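
The optional-dependency idea might be sketched like this. The list of strict formats is an assumption; the dateparser call mirrors the one that appears in the tracebacks reported against xword-dl:

```python
from datetime import datetime

try:
    import dateparser  # optional: enables inputs like "yesterday"
except ImportError:
    dateparser = None

STRICT_FORMATS = ('%Y-%m-%d', '%m/%d/%y', '%m/%d/%Y')  # assumed formats


def parse_date(text):
    """Parse a date liberally if dateparser is installed, strictly if not."""
    if dateparser is not None:
        parsed = dateparser.parse(text, settings={'PREFER_DATES_FROM': 'past'})
        if parsed is not None:
            return parsed
    for fmt in STRICT_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


print(parse_date('2021-09-22'))
```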

Guardian Cryptic Support Request

    While this is open I'll add the feature request for the Guardian puzzles. I saw you were helping rentalcustard with the Guardian cryptics at https://github.com/rentalcustard/guardianpuz. He's got the basic parsing, but doesn't put the solutions in the .puz output. I'll take a look at your comments on his repo and see if I can parse the solutions, but if you're adding some cryptics, my vote would be for the Guardian ones as well.

Originally posted by @mixographer in #52 (comment)

Limited rebus support

Related to #9, this script should start to support rebuses in puzzles where possible. I need to find an online puzzle to test with for each scraper.

NYT authentication fails with a 403: Forbidden for url

Might be the same as #38, not sure if NYT has put in place some mitigation measures or not.

Invoked with xword-dl -u <myuser> -p <mypass> nyt

requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://myaccount.nytimes.com/svc/ios/v2/login

I was able to authenticate from one location (home), but not from another (office). Manually copying an NYT-S cookie from an already authenticated session and dropping it into the yaml file for the office worked.

Any ideas on what to try?

Support for the Globe and Mail's cryptic crosswords?

This is purely a wishlist item on my part, as I have no competency to provide a patch, but I'm filing it in case someone is interested by the challenge :)

My dad plays The Globe and Mail's "Cryptic Crossword" every day and it would be nice if he didn't have to go through a browser for it, and if it was made available in GNOME Crosswords, so I'll presume this library here is where the scraping magic happens. The Globe and Mail has various puzzles and crosswords:

  • Universal crossword: somehow that one doesn't seem to load in my browser, I don't know what's wrong with that one, and if your parser can do some miracles with it.
  • The Daily Cryptic Crossword: this used to have issues, but nowadays it loads even with uBlock Origin activated in Firefox, apparently.

Naming consistency, download by url (Guardian)

I was wondering if the naming convention inconsistency is intentional. For instance, if I download the latest Quiptic I might get the file named:
xword-dl % xword-dl grdu
Puzzle downloaded and saved as Guardian Quiptic - 20221113 - Quiptic crossword No 1,200.puz.

while if I download by url:
xword-dl % xword-dl 'https://www.theguardian.com/crosswords/quiptic/1200'
Puzzle downloaded and saved as Guardian - 20221113 - Quiptic crossword No 1,200.puz.

I realize this is in the xword-dl.py and not in the downloader, so maybe it is intentional and the end-user should set up tokens to name the file the way they like?

Guardian 'no solution provided' question.

If I grab the puzzle at 'https://www.theguardian.com/crosswords/everyman/3969'

I get this message:
Puzzle downloaded and saved as Guardian - 20221105 - Everyman crossword No 3,969 - no solution provided.puz.

and the file is named with the 'no solution...' naming convention,

but the solution is provided and saved in the .puz file. Something changes in the JSON when they add the solutions after the initial publication. I'll take a look and see if I can provide more details.

No versioning scheme

Now that there have been a few updates to this, it'd be nice to include a versioning mechanism. I think probably just date-based is fine.

New Yorker puzzles not downloading

Starting on Wednesday, I'm no longer able to download New Yorker puzzles. All of the other sites I pull from (WSJ, Atlantic, LAT) work fine; only the New Yorker fails, and the error is consistent every time. Am I the only one having a problem?

Error message:

Traceback (most recent call last):
  File "/Users/aa/opt/anaconda3/bin/xword-dl", line 8, in <module>
    sys.exit(main())
  File "/Users/aa/opt/anaconda3/lib/python3.9/site-packages/xword_dl.py", line 1139, in main
    puzzle, filename = by_url(args.source,
  File "/Users/aa/opt/anaconda3/lib/python3.9/site-packages/xword_dl.py", line 102, in by_url
    puzzle = dl.download(puzzle_url)
  File "/Users/aa/opt/anaconda3/lib/python3.9/site-packages/xword_dl.py", line 265, in download
    solver_url = self.find_solver(url)
  File "/Users/aa/opt/anaconda3/lib/python3.9/site-packages/xword_dl.py", line 616, in find_solver
    pubdate_dt = dateparser.parse(pubdate)
  File "/Users/aa/opt/anaconda3/lib/python3.9/site-packages/dateparser/conf.py", line 89, in wrapper
    return f(*args, **kwargs)
  File "/Users/aa/opt/anaconda3/lib/python3.9/site-packages/dateparser/__init__.py", line 54, in parse
    data = parser.get_date_data(date_string, date_formats)
  File "/Users/aa/opt/anaconda3/lib/python3.9/site-packages/dateparser/date.py", line 421, in get_date_data
    parsed_date = _DateLocaleParser.parse(
  File "/Users/aa/opt/anaconda3/lib/python3.9/site-packages/dateparser/date.py", line 178, in parse
    return instance._parse()
  File "/Users/aa/opt/anaconda3/lib/python3.9/site-packages/dateparser/date.py", line 182, in _parse
    date_data = self._parsers[parser_name]()
  File "/Users/aa/opt/anaconda3/lib/python3.9/site-packages/dateparser/date.py", line 196, in _try_freshness_parser
    return freshness_date_parser.get_date_data(self._get_translated_date(), self._settings)
  File "/Users/aa/opt/anaconda3/lib/python3.9/site-packages/dateparser/date.py", line 234, in _get_translated_date
    self._translated_date = self.locale.translate(
  File "/Users/aa/opt/anaconda3/lib/python3.9/site-packages/dateparser/languages/locale.py", line 131, in translate
    relative_translations = self._get_relative_translations(settings=settings)
  File "/Users/aa/opt/anaconda3/lib/python3.9/site-packages/dateparser/languages/locale.py", line 158, in _get_relative_translations
    self._generate_relative_translations(normalize=True))
  File "/Users/aa/opt/anaconda3/lib/python3.9/site-packages/dateparser/languages/locale.py", line 172, in _generate_relative_translations
    pattern = DIGIT_GROUP_PATTERN.sub(r'?P<n>\d+', pattern)
  File "/Users/aa/opt/anaconda3/lib/python3.9/site-packages/regex/regex.py", line 702, in _compile_replacement_helper
    is_group, items = _compile_replacement(source, pattern, is_unicode)
  File "/Users/aa/opt/anaconda3/lib/python3.9/site-packages/regex/_regex_core.py", line 1737, in _compile_replacement
    raise error("bad escape \\%s" % ch, source.string, source.pos)
regex._regex_core.error: bad escape \d at position 7

I tried uninstalling and reinstalling, and I uninstalled the regex package thinking maybe I had a weird version. I also created a new venv. Nothing seems to fix it and, like I said, all the other puzzles seem to work. I tried the normal behavior (xword-dl tny) and tried downloading by URL; the problem happens regardless.

This does sync up with some updating of my Python environment that I did so I was pretty sure this was my problem but the fact that it only happens to New Yorker puzzles made me wonder. Apologies in advance if it does end up being my problem, it's happened many times...

Add Universal Uclick (daily) support

Will require a custom parser but it looks like the online solver (which I didn't realize existed!) exposes JSON data for puzzles sorted by date in the same way as the USA Today puzzle does. Should be pretty straightforward but I might want to shuffle some of the class structures there to combine shared code.

Can't specify dates at all

Greetings, I am unable to specify a date of any kind to download a puzzle. For instance, I downloaded the current universal puzzle using the command
xword-dl uni --latest

But any addition of a date, such as:
xword-dl uni --date 8/12/22
or the example in the README
xword-dl uni --date 9/22/21
or
xword-dl uni --date yesterday
etc

Fails. Here's the error output from one of them:

 xword-dl uni --date 8/10/2022
Traceback (most recent call last):
  File "/usr/local/bin/xword-dl", line 33, in <module>
    sys.exit(load_entry_point('xword-dl==2022.6.20', 'console_scripts', 'xword-dl')())
  File "/usr/local/lib/python3.10/site-packages/xword_dl-2022.6.20-py3.10.egg/xword_dl.py", line 1142, in main
  File "/usr/local/lib/python3.10/site-packages/xword_dl-2022.6.20-py3.10.egg/xword_dl.py", line 58, in by_keyword
  File "/usr/local/lib/python3.10/site-packages/xword_dl-2022.6.20-py3.10.egg/xword_dl.py", line 153, in parse_date_or_exit
  File "/usr/local/lib/python3.10/site-packages/xword_dl-2022.6.20-py3.10.egg/xword_dl.py", line 149, in parse_date
  File "/usr/local/lib/python3.10/site-packages/dateparser-1.0.0-py3.10.egg/dateparser/conf.py", line 89, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/dateparser-1.0.0-py3.10.egg/dateparser/__init__.py", line 54, in parse
    data = parser.get_date_data(date_string, date_formats)
  File "/usr/local/lib/python3.10/site-packages/dateparser-1.0.0-py3.10.egg/dateparser/date.py", line 421, in get_date_data
    parsed_date = _DateLocaleParser.parse(
  File "/usr/local/lib/python3.10/site-packages/dateparser-1.0.0-py3.10.egg/dateparser/date.py", line 178, in parse
    return instance._parse()
  File "/usr/local/lib/python3.10/site-packages/dateparser-1.0.0-py3.10.egg/dateparser/date.py", line 182, in _parse
    date_data = self._parsers[parser_name]()
  File "/usr/local/lib/python3.10/site-packages/dateparser-1.0.0-py3.10.egg/dateparser/date.py", line 196, in _try_freshness_parser
    return freshness_date_parser.get_date_data(self._get_translated_date(), self._settings)
  File "/usr/local/lib/python3.10/site-packages/dateparser-1.0.0-py3.10.egg/dateparser/date.py", line 234, in _get_translated_date
    self._translated_date = self.locale.translate(
  File "/usr/local/lib/python3.10/site-packages/dateparser-1.0.0-py3.10.egg/dateparser/languages/locale.py", line 131, in translate
    relative_translations = self._get_relative_translations(settings=settings)
  File "/usr/local/lib/python3.10/site-packages/dateparser-1.0.0-py3.10.egg/dateparser/languages/locale.py", line 158, in _get_relative_translations
    self._generate_relative_translations(normalize=True))
  File "/usr/local/lib/python3.10/site-packages/dateparser-1.0.0-py3.10.egg/dateparser/languages/locale.py", line 172, in _generate_relative_translations
    pattern = DIGIT_GROUP_PATTERN.sub(r'?P<n>\d+', pattern)
  File "/usr/local/lib/python3.10/site-packages/regex-2022.7.25-py3.10-macosx-12-x86_64.egg/regex/regex.py", line 710, in _compile_replacement_helper
    is_group, items = _compile_replacement(source, pattern, is_unicode)
  File "/usr/local/lib/python3.10/site-packages/regex-2022.7.25-py3.10-macosx-12-x86_64.egg/regex/_regex_core.py", line 1737, in _compile_replacement
    raise error("bad escape \\%s" % ch, source.string, source.pos)
regex._regex_core.error: bad escape \d at position 7

Standardize options passed at runtime and in settings file

Thinking about adding a flag to preserve clue HTML when provided (as discussed in #89) has resurfaced some inconsistency in the names of settings when they're provided from the command line vs. from the settings file vs. as a keyword in a Python environment. Ideally those would be exactly the same, and the cascading hierarchy of their application should be clear!

One technical note on that from the example at hand: if the command line flag is something like --preserve-html, the keyword argument in Python will use an underscore instead (preserve_html), and the settings file keys should follow one or the other consistently.
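
To make the naming point concrete, here is a small sketch using the standard library's argparse. The normalize_keys helper is hypothetical, one possible way to bring settings-file keys in line with the Python keyword form:

```python
import argparse

parser = argparse.ArgumentParser(prog='xword-dl')
parser.add_argument('--preserve-html', action='store_true')

# argparse stores --preserve-html under the attribute preserve_html.
args = parser.parse_args(['--preserve-html'])


def normalize_keys(settings):
    """Rewrite settings-file keys to the underscore form used in Python."""
    return {key.replace('-', '_'): value for key, value in settings.items()}


file_settings = normalize_keys({'preserve-html': True})
print(vars(args))
print(file_settings)
```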

regex._regex_core.error: bad escape \d at position 7

Getting this error when trying to use the date option:

/usr/local/bin/xword-dl atl -d 8/1/2022
Traceback (most recent call last):
  File "/usr/local/bin/xword-dl", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/xword_dl.py", line 1142, in main
    puzzle, filename = by_keyword(args.source, **options)
  File "/usr/local/lib/python3.7/site-packages/xword_dl.py", line 58, in by_keyword
    parsed_date = parse_date_or_exit(date)
  File "/usr/local/lib/python3.7/site-packages/xword_dl.py", line 153, in parse_date_or_exit
    guessed_dt = parse_date(entered_date)
  File "/usr/local/lib/python3.7/site-packages/xword_dl.py", line 149, in parse_date
    return dateparser.parse(entered_date, settings={'PREFER_DATES_FROM':'past'})
  File "/usr/local/lib/python3.7/site-packages/dateparser/conf.py", line 89, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dateparser/__init__.py", line 54, in parse
    data = parser.get_date_data(date_string, date_formats)
  File "/usr/local/lib/python3.7/site-packages/dateparser/date.py", line 422, in get_date_data
    locale, date_string, date_formats, settings=self._settings)
  File "/usr/local/lib/python3.7/site-packages/dateparser/date.py", line 178, in parse
    return instance._parse()
  File "/usr/local/lib/python3.7/site-packages/dateparser/date.py", line 182, in _parse
    date_data = self._parsers[parser_name]()
  File "/usr/local/lib/python3.7/site-packages/dateparser/date.py", line 196, in _try_freshness_parser
    return freshness_date_parser.get_date_data(self._get_translated_date(), self._settings)
  File "/usr/local/lib/python3.7/site-packages/dateparser/date.py", line 235, in _get_translated_date
    self.date_string, keep_formatting=False, settings=self._settings)
  File "/usr/local/lib/python3.7/site-packages/dateparser/languages/locale.py", line 131, in translate
    relative_translations = self._get_relative_translations(settings=settings)
  File "/usr/local/lib/python3.7/site-packages/dateparser/languages/locale.py", line 158, in _get_relative_translations
    self._generate_relative_translations(normalize=True))
  File "/usr/local/lib/python3.7/site-packages/dateparser/languages/locale.py", line 172, in _generate_relative_translations
    pattern = DIGIT_GROUP_PATTERN.sub(r'?P<n>\d+', pattern)
  File "/usr/local/lib/python3.7/site-packages/regex/regex.py", line 702, in _compile_replacement_helper
    is_group, items = _compile_replacement(source, pattern, is_unicode)
  File "/usr/local/lib/python3.7/site-packages/regex/_regex_core.py", line 1737, in _compile_replacement
    raise error("bad escape \\%s" % ch, source.string, source.pos)
regex._regex_core.error: bad escape \d at position 7

It doesn't matter what date I use or the format of the date, or the source.

Running:
macOS Monterey 12.5
Terminal 2.12.7
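
For anyone trying to understand the failure itself: the traceback shows the third-party regex module rejecting \d inside a replacement string. Python's built-in re module also treats unknown letter escapes in replacement templates as errors (since Python 3.7), so the behaviour can be reproduced without dateparser. The pattern and sample text below are made up for illustration, and the workaround shown is a general technique, not necessarily the fix that dateparser or xword-dl adopted:

```python
import re

text = 'in {0} days'            # made-up sample text
replacement = r'?P<n>\d+'       # the replacement string from the traceback

try:
    re.sub(r'\{0\}', replacement, text)
    template_error = None
except re.error as exc:
    template_error = exc        # "\d" is rejected as an unknown escape

# A callable replacement is inserted literally, so no escapes are parsed.
fixed = re.sub(r'\{0\}', lambda match: replacement, text)
print(template_error)
print(fixed)
```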

Support for The Modern

The Modern Crossword hasn't yet launched but will this week, and if it's like other Puzzle Society puzzles it should make at least the latest available in a "Uclick" xml format that we don't yet have a parser for but shouldn't be too bad to rig up. I think the other Puzzle Society puzzles are all available elsewhere but building this now would make it easier to support them or other new puzzles in the future.

Generalize date-finding method

Currently, AmuseLabs puzzles try to suss out their date from their id if a date isn't set by the time a filename gets set. But probably, the parent filename picker should do that check, and it should delegate to each class where they might find the date. One obvious possibility is the puzzle title!

Downloader classes should define whether they match a given URL

Currently I've got a map of hardcoded netlocs to Downloaders, which is suboptimal because

  • updates require changes to the main module file
  • it "locks in" a given set of Downloaders instead of accepting flexibility
  • netloc matching is too rigid for the task anyway, requiring some hackiness to e.g. distinguish NYT from NYT Variety puzzles

What I think should happen is that Downloader classes should have a static method for whether they match a URL pattern, and maybe a class method to initialize a new class from a URL.
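
A rough sketch of that design follows. The method names matches_url and from_url, the URL patterns, the NewYorkTimesVarietyDownloader class and the DOWNLOADERS list are all assumptions for illustration; only the general shape comes from the proposal above:

```python
import re


class BaseDownloader:
    def __init__(self, url=None):
        self.url = url

    @staticmethod
    def matches_url(url):
        """Subclasses override this to claim the URLs they can handle."""
        return False

    @classmethod
    def from_url(cls, url):
        """Build a downloader instance for a specific puzzle URL."""
        return cls(url=url)


class NewYorkTimesVarietyDownloader(BaseDownloader):
    @staticmethod
    def matches_url(url):
        return re.search(r'nytimes\.com/crosswords/game/variety/', url) is not None


class NewYorkTimesDownloader(BaseDownloader):
    @staticmethod
    def matches_url(url):
        return re.search(r'nytimes\.com/crosswords/game/', url) is not None


# Order matters: the more specific variety pattern is tried first.
DOWNLOADERS = [NewYorkTimesVarietyDownloader, NewYorkTimesDownloader]


def downloader_for(url):
    """Return an instance of the first downloader that claims the URL."""
    for cls in DOWNLOADERS:
        if cls.matches_url(url):
            return cls.from_url(url)
    return None


found = downloader_for('https://www.nytimes.com/crosswords/game/variety/2022/10/02')
print(type(found).__name__)
```

Matching on a URL pattern rather than on the netloc alone is what lets the two NYT downloaders be told apart without special cases in the main module.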

Support for derstandard.at crossword puzzles

It's an AmuseLabs based crossword puzzle, e.g., you can see one here:
https://www.derstandard.at/story/2000128976430/kreuzwortraetsel-d-9870

However, your AmuseLabs detector doesn't detect a puzzle, possibly because you have to confirm a "yes, watch page with ads" dialog before seeing the actual puzzle.

Is there a good way to use your tool to download those puzzles? If necessary, and you know how, I can also provide a pull request if you point me in the right direction. However, I'm not sure how, or whether, you can accept such a "view with ads" dialog.

Best regards,
D.R.

AmuseLabs variable name change

Some outlets that use AmuseLabs, and URLs where AmuseLabs puzzles are embedded, may be failing because of a small change on their end to where the puzzle data is stored. Should be a quick fix but will require a new version.

Support of Times UK crosswords?

The Times has a lot of great crossword varieties, and the data is in a good format for parsing. It does require a subscription, similar to the way the NY Times works. I don't know if you're interested in adding more 'authenticated' puzzle sources. I'm using 'browser_cookie3' for my cookiejar and then parsing "oApp.puzzle_json" quite easily.

Can xword-dl download NYT variety puzzles?

(I've manually edited my xword-dl.yaml to circumvent #58)

# python xword-dl/xword_dl.py nyt --latest
Puzzle downloaded and saved as NY Times - 20221011.puz.

# python xword-dl/xword_dl.py https://www.nytimes.com/crosswords/game/variety/2022/10/02
Unable to find a puzzle at https://www.nytimes.com/crosswords/game/variety/2022/10/02.

I've determined that this is likely because NewYorkTimesDownloader isn't in supported_sites:

xword-dl/xword_dl.py

Lines 76 to 78 in bbb4877

supported_sites = [('wsj.com', WSJDownloader),
                   ('newyorker.com', NewYorkerDownloader),
                   ('amuselabs.com', AmuseLabsDownloader)]

However, adding ('nytimes.com', NewYorkTimesDownloader) to the list produces a JSON error, which I don't think I'm well-equipped to make sense of:

# python xword-dl/xword_dl.py https://www.nytimes.com/crosswords/game/variety/2022/10/02
Traceback (most recent call last):
  File "/home/george/pandas/xword-dl/xword_dl.py", line 1162, in <module>
    main()
  File "/home/george/pandas/xword-dl/xword_dl.py", line 1145, in main
    puzzle, filename = by_url(args.source,
  File "/home/george/pandas/xword-dl/xword_dl.py", line 104, in by_url
    puzzle = dl.download(puzzle_url)
  File "/home/george/pandas/xword-dl/xword_dl.py", line 267, in download
    xword_data = self.fetch_data(solver_url)
  File "/home/george/pandas/xword-dl/xword_dl.py", line 971, in fetch_data
    return res.json()['results'][0]
  File "/home/george/miniconda3/lib/python3.9/site-packages/requests/models.py", line 910, in json
    return complexjson.loads(self.text, **kwargs)
  File "/home/george/miniconda3/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/home/george/miniconda3/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/george/miniconda3/lib/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

NYT 2022-05-01 crashes

The NYT puzzle for 2022-05-01 crashes xword-dl:

Traceback (most recent call last):
  File "/usr/bin/xword-dl", line 33, in <module>
    sys.exit(load_entry_point('xword-dl==2022.2.16', 'console_scripts', 'xword-dl')())
  File "/usr/lib/python3.6/site-packages/xword_dl-2022.2.16-py3.6.egg/xword_dl.py", line 1096, in main
  File "/usr/lib/python3.6/site-packages/xword_dl-2022.2.16-py3.6.egg/xword_dl.py", line 63, in by_keyword
  File "/usr/lib/python3.6/site-packages/xword_dl-2022.2.16-py3.6.egg/xword_dl.py", line 265, in download
  File "/usr/lib/python3.6/site-packages/xword_dl-2022.2.16-py3.6.egg/xword_dl.py", line 962, in parse_xword
IndexError: string index out of range

That line is solution += square[0][0], which happens after some other code that proves the puzzle isn't empty, and after some checks on the shape of the square. I can see that it would fail, for example, if square was multiple characters. So my guess is that the NYT has violated some standard (is there one?) or some assumption about puzzle formats.

No installation flow

I have to work out some way to get this installed with a command line entry point, etc. pip feels a little silly but might be the move.

filenames are not sanitized

Now that filenames include the puzzle title, they should be sanitized to remove any characters that are invalid for that purpose. Those characters can probably just be stripped out.
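
A minimal sketch of that stripping step. The exact character set is an assumption, chosen to cover characters that common filesystems reject:

```python
import re

# Characters rejected by common filesystems, plus ASCII control characters.
INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitize_filename(name):
    """Strip characters that are unsafe in filenames and tidy whitespace."""
    cleaned = INVALID_CHARS.sub('', name)
    return ' '.join(cleaned.split())


print(sanitize_filename('Who/What: "Why?"'))
```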

Daily Beast dropped the date in the title?

I saw db broke today, and when I looked at the PuzzleMe embed on the site, today's puzzle doesn't have the date in its title.

Then I noticed this comment:
# Daily Beast puzzle IDs, unusually for AmuseLabs puzzles, don't include
# the date. This pulls it out of the puzzle title, which will work
# as long as that stays consistent.

I'm not familiar yet with the format, so I don't have a PR, but I thought I'd let you know. Maybe tomorrow they'll add the date to today's puzzle, and it's only the current puzzle that goes undated.

Not possible to search WSJ by date

But it would be nice and it doesn't seem like it is impossible. The issue is that the landing pages for a given puzzle are not at reliable URLs (because they include a slugified version of the title and a long id string, e.g. https://www.wsj.com/articles/floating-upstream-thursday-crossword-december-1-11669405858); the embedded iframe puzzle there includes the date but also a shorter identifier in the URL (e.g. https://www.wsj.com/puzzles/crossword/20221201/52272/?embed=1); and the underlying puzzle data is at a location derived from that URL (e.g. https://www.wsj.com/puzzles/crossword/20221201/52272/data.json).

So what we would need is a way to go from date to any of those URLs. It seems doable, but I don't know how!
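The date-to-URL step is still the open question, but the other direction can be sketched: given an embed URL of the shape quoted above, derive the data.json location and read the date back out of the path. The function names here are hypothetical, and the sketch assumes the URL layout in the examples stays as shown.

```python
from datetime import date, datetime
from urllib.parse import urljoin, urlsplit, urlunsplit


def wsj_data_url(embed_url: str) -> str:
    """Turn .../crossword/<date>/<id>/?embed=1 into .../<id>/data.json."""
    parts = urlsplit(embed_url)
    path = parts.path if parts.path.endswith('/') else parts.path + '/'
    # Drop the query string and fragment, keep the directory-style path.
    base = urlunsplit((parts.scheme, parts.netloc, path, '', ''))
    return urljoin(base, 'data.json')


def wsj_puzzle_date(embed_url: str) -> date:
    """Read the YYYYMMDD path segment that precedes the short identifier."""
    segments = [s for s in urlsplit(embed_url).path.split('/') if s]
    return datetime.strptime(segments[-2], '%Y%m%d').date()
```

With the example above, wsj_data_url('https://www.wsj.com/puzzles/crossword/20221201/52272/?embed=1') yields the data.json URL quoted in the text.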

WSJ new Json format from 1/18/23?

I'm wondering if WSJ changed their format. I'm getting JSON errors on any WSJ puzzle now. (Although I did break my Python environment pretty badly yesterday, so it could be me.)

Add Vox support

Vox is using AmuseLabs for its puzzles, so adding support should be pretty straightforward.

New York Times changed puzzle format?

I can't seem to get the Times puzzle starting today.

I get this:
Traceback (most recent call last):
  File "./xword_dl.py", line 1153, in
    main()
  File "./xword_dl.py", line 1142, in main
    puzzle, filename = by_keyword(args.source, **options)
  File "./xword_dl.py", line 65, in by_keyword
    puzzle = dl.download(puzzle_url)
  File "./xword_dl.py", line 267, in download
    puzzle = self.parse_xword(xword_data)
  File "./xword_dl.py", line 989, in parse_xword
    puzzle_data = xword_data['puzzle_data']
KeyError: 'puzzle_data'

and if I go for a certain date:

$ ./xword_dl.py https://www.nytimes.com/crosswords/game/daily/2022/10/02

Unable to find a puzzle at https://www.nytimes.com/crosswords/game/daily/2022/10/02.

Readme should document limitations

This tool has limited support for markup and no support for rebuses. I can build some of that in (and I can probably get all of the Amuse readers in one fell swoop, which is the bulk of the outlets) but in the meantime I should explain some of that in the Readme. Maybe a chart!

Clarify project license

I couldn't find any reference to what license this project is distributed under. Could you please clarify that, and ideally add a LICENSE file to the repo for future reference? Thanks!

Usage warning: PytzUsageWarning: The localize method is no longer necessary

Receiving this warning when I download. Not sure if this is a code thing or a configuration thing on my side?

/usr/local/lib/python3.10/site-packages/dateparser-1.0.0-py3.10.egg/dateparser/date_parser.py:35: PytzUsageWarning: The localize method is no longer necessary, as this time zone supports the fold attribute (PEP 495). For more details on migrating to a PEP 495-compliant implementation, see https://pytz-deprecation-shim.readthedocs.io/en/latest/migration.html
  date_obj = stz.localize(date_obj)
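The path in the warning points at the dateparser package and not at xword-dl's own code. As a stopgap on the user's side, the message can be filtered with the standard warnings module. This is only a sketch: the function name is hypothetical, and it matches on the message text so that it does not need to import the warning class.

```python
import warnings


def silence_pytz_localize_warning() -> None:
    """Ignore the 'localize method is no longer necessary' warning."""
    # `message` is a regex matched against the start of the warning text.
    warnings.filterwarnings(
        "ignore",
        message=r"The localize method is no longer necessary",
    )
```

Calling this once before any dates are parsed should hide the message while leaving other warnings visible.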

Need less brittle NYT authentication method

Currently, xword-dl kind of spoofs the NYT Crosswords app with a request that passes a user-supplied username and password and then saves the authenticated NYT-S token it gets in response.[1] That appears to work inconsistently, resulting in occasional 403 errors that have been difficult to diagnose and mitigate (e.g. #51). Given that this approach is unreliable now and likely to become even less reliable, it would be good if xword-dl had a more straightforward approach that was less subject to breakage.

One option I'm considering is using browser-cookie3, which could retrieve a cookie from basically any browser's storage without interaction. It seems like there might be complications with casting a very wide net here, like a user might be logged in to different accounts on different browsers, or use some kind of custom profile that the library can't figure out, and debugging that might be a bit of a chore.

I could imagine combining it with a webbrowser approach, where --authenticate would

  1. get a webbrowser controller object (e.g. from webbrowser.get()), which I think would expose to me what the user's default browser is
  2. open a new tab or window in that browser pointed to e.g. https://nytimes.com/login
  3. [... somehow determine that the user was done with login ...]
  4. use browser-cookie3 to check the nytimes.com specific cookies for that specific browser

That seems a little less invasive and could even be reused for other sites that require authentication. But I'm not sure if there's something I'm overlooking! Welcoming feedback here before I start trying to implement anything.
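A rough sketch of how the open-then-read flow might fit together, leaving aside the unresolved step of detecting when login has finished. The function names are hypothetical, the cookie name NYT-S is assumed to match the token name used above, and browser_cookie3 is a third-party package that is imported lazily so the remaining code runs without it.

```python
import webbrowser
from typing import Iterable, Optional

LOGIN_URL = "https://nytimes.com/login"
COOKIE_NAME = "NYT-S"  # assumption: the cookie shares the token's name


def open_login_page(url: str = LOGIN_URL) -> bool:
    """Open the login page in a new tab of the user's default browser."""
    return webbrowser.open_new_tab(url)


def load_browser_cookies():
    """Load nytimes.com cookies from whichever browsers can be read."""
    import browser_cookie3  # third-party; not required by the other helpers
    return browser_cookie3.load(domain_name="nytimes.com")


def find_nyt_token(cookies: Iterable) -> Optional[str]:
    """Return the NYT-S cookie value from cookie-like objects, if present."""
    for cookie in cookies:
        if cookie.name == COOKIE_NAME and cookie.domain.endswith("nytimes.com"):
            return cookie.value
    return None
```

find_nyt_token only needs objects with name, domain and value attributes, so it can be exercised without touching a real browser profile.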

Footnotes

  1. The NYT-S token is just saved in the config file for subsequent requests, so manually copying a valid token into that file will work just as well.

Data URL for the WSJ downloader is outdated

Thanks for making this tool available and keeping it maintained! I've been having trouble with the WSJ downloader for a while, and I finally took the time to investigate. It looks like the data URL is no longer valid. I was able to fix it by replacing:

data_url = solver_url.replace('index.html', 'data.json')

with:

data_url = solver_url.replace('?embed=1', 'data.json')

in WSJDownloader.fetch_data().

This worked on the couple of puzzles I tested it on, using the latest release of xword-dl (2021.6.24).
