aparrish / gutenberg-dammit Goto Github PK


I wanted all of plaintext Project Gutenberg in an easy-to-use format, so I made this

Python 100.00%

gutenberg-dammit's People

hugovk jesseabeyta ryanavella patrickdrouin armendk rossgoodwin marcelraschke textangel senderle rngwrldngnr esatboucaud joeyburzynski hanok2 kostasx

gutenberg-dammit's Issues

at least one file's utf-8 encoding is wrong, presumably more?

Hi, thanks for this excellent work!

I suspect it's not an isolated incident, but don't presently have anything beyond a single anecdote:

夢溪筆談 is valid UTF-8 Chinese text on the Project Gutenberg website.
But file 073/07317.txt in the gutenberg-dammit corpus is valid UTF-8 gibberish.
If you take the gutenberg-dammit file and convert it from UTF-8 to "latin-1", you end up with a file which chardet says is Big5-encoded text. This appears to be mostly correct, except that there is some garbage in it, so it cannot be recoded successfully by any of the few different tools I tried.

Anyway that's the data I have for now…
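
For anyone who wants to try the round trip described above, here is a minimal sketch. It assumes the damage is exactly "Big5 bytes decoded as latin-1", which is only what chardet suggested for this one file. The helper name `undo_mojibake` is made up for illustration and is not part of gutenberg-dammit.

```python
def undo_mojibake(garbled: str, real_encoding: str = "big5", errors: str = "replace") -> str:
    """Reverse a wrong latin-1 decode: recover the raw bytes, then decode them properly.

    Hypothetical helper; assumes the text was originally decoded as latin-1.
    """
    # latin-1 maps code points 0-255 one-to-one onto bytes, so this restores the bytes.
    # It raises UnicodeEncodeError if the text contains characters above U+00FF.
    raw = garbled.encode("latin-1")
    # errors="replace" turns undecodable byte sequences into U+FFFD instead of failing.
    return raw.decode(real_encoding, errors=errors)


# Simulate the damage on a short string: Big5 bytes misread as latin-1.
title = "夢溪筆談"
garbled = title.encode("big5").decode("latin-1")
repaired = undo_mojibake(garbled)
```

With `errors="replace"`, stray bytes like the garbage mentioned above would show up as U+FFFD in the output rather than aborting the whole conversion, which at least makes the bad spots easy to locate.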

metadata missing some titles

I dunno whether this is a helpful kind of issue to report here or if it just reflects some upstream problem, but reporting here is pretty easy, so:

The metadata has "?" for some titles. When I searched the internet for those works, I found these titles:

50624: Lorenzo de' Medici, the Magnificent (vol. 1 of 2)
50625: Lorenzo de' Medici, the Magnificent (vol. 2 of 2)
51307: This House to Let
51950: The Prodigal Son

When I eyeball these records, it looks like they're also missing Author info, maybe other stuff? I stopped looking.

Make the corpus zip file into a torrent or something

From https://github.com/aparrish/gutenberg-dammit/blob/master/README.md#next-steps:

Make the corpus zip file into a torrent or something so I'm not paying for every download

Maybe the Internet Archive? They also provide torrents.

Any chance of getting an updated zip file? gutenberg-dammit-files-v003.zip, maybe?

I know this is related to #3 and #5, but I wonder if it'd be possible to get a data dump from the current state of Project Gutenberg. I also thought about the mirror, but it's painful to set up.

Around 50 files have broken encodings on french.

Hello,

For those of you who intend to use the French documents in this corpus, be aware that of the 2647 French books included, 49 have broken encodings: all accented letters have been removed.
A quick way to find the culprits is to search each book for the letter 'é'.
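
The 'é' check can be sketched in a few lines. This assumes the texts are already loaded into a dict keyed by file name; the function name and the sample file names are hypothetical. A very short French text could legitimately contain no 'é', so treat the result as a list of suspects rather than a verdict.

```python
from typing import Dict, List


def files_missing_accent(texts: Dict[str, str], marker: str = "é") -> List[str]:
    """Return the names of texts that never contain the marker character."""
    return sorted(name for name, text in texts.items() if marker not in text)


# Tiny illustration with made-up file names and contents.
sample = {
    "good.txt": "Il était une fois un étudiant.",
    "bad.txt": "Il tait une fois un tudiant.",
}
suspects = files_missing_accent(sample)
```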

remove dependency on GutenTag

GutenTag is an amazing project but I really only used its code and corpus as a way to quickly "bootstrap" the necessary code and files. I don't think it's a sustainable foundation moving forward, especially for keeping the files in this corpus up-to-date with the latest releases (and metadata changes) on Project Gutenberg itself. My best idea so far is to set up a Project Gutenberg mirror and modify the code to work directly on the files from the mirror, but that obviously takes a lot of effort (and hard drive space). Open to other suggestions.

cache/store chardet results per file

The chardet detection works pretty well but also takes a long time—I didn't time it exactly, but it felt like it added at least an hour to the time it took the corpus-building process to run on my MacBook Air. Since these files aren't going to change, it makes sense to pre-build and cache the results so that subsequent runs of the corpus-building script don't need to re-run the detection process.
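
One possible shape for such a cache, as a rough sketch: a JSON file mapping each source file's path to its detected encoding. The detector is passed in as a plain callable, and every name here (`load_cache`, `save_cache`, `detect_cached`, the cache file name, the example key) is an assumption for illustration, not existing code in this repository.

```python
import json
import os
import tempfile
from typing import Callable, Dict, Optional


def load_cache(cache_path: str) -> Dict[str, Optional[str]]:
    """Load previously stored detection results, or start with an empty cache."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)
    return {}


def save_cache(cache: Dict[str, Optional[str]], cache_path: str) -> None:
    """Write the cache back to disk as JSON."""
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(cache, f)


def detect_cached(key: str, data: bytes, cache: Dict[str, Optional[str]],
                  detect: Callable[[bytes], Optional[str]]) -> Optional[str]:
    """Run the (slow) detector only when the key has no cached result yet."""
    if key not in cache:
        cache[key] = detect(data)
    return cache[key]


# Demonstration with a stand-in detector that records how often it is called.
calls = []

def fake_detect(data: bytes) -> str:
    calls.append(data)
    return "utf-8"

cache_file = os.path.join(tempfile.mkdtemp(), "encodings.json")
cache = load_cache(cache_file)
first = detect_cached("some/file.zip", b"hello", cache, fake_detect)
save_cache(cache, cache_file)

# A later run reloads the cache and skips detection entirely.
cache = load_cache(cache_file)
second = detect_cached("some/file.zip", b"hello", cache, fake_detect)
```

With chardet installed, something like `lambda data: chardet.detect(data)["encoding"]` could be passed as `detect`. Keying on the path alone relies on the files never changing, as noted above.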

UnicodeDecodeError/NameError when installing from setup.py (Windows-related?)

First of all, I wanted to say that this package is awesome! I have been wanting to interact with the Gutenberg corpus for a while now, but I always ended up running into obstacles and giving up prematurely. I'm glad someone else beat me to the punch!

So I've had some issues getting this package up and running on Windows. I'm running a 64-bit install of Python 3.6.5, for reference.

I initially cloned the repository and then attempted to install from setup.py. This is the output I saw:

>>> python setup.py install
Traceback (most recent call last):
  File "setup.py", line 4, in <module>
    readme = readme_file.read()
  File "C:\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10949: character maps to <undefined>

I think that this issue is Windows-related. The open() function on line 3 seems to be defaulting to the system's locale encoding (Windows-1252 here) rather than UTF-8. I was able to fix it by specifying the encoding on line 3.

from setuptools import setup
    
with open('README.md', encoding='UTF-8') as readme_file:
    readme = readme_file.read()
...

After changing line 3, I tried installing again and received the following output:

>>> python setup.py install
Traceback (most recent call last):
  File "setup.py", line 14, in <module>
    packages=setuptools.find_packages(),
NameError: name 'setuptools' is not defined

This issue seems more related to the version of Python I'm running (at least, I doubt it is platform dependent like the UnicodeDecodeError). I was able to fix it by adding an explicit import at the top of the file:

import setuptools
from setuptools import setup
...

I'm willing to submit a pull request with the above changes if they all seem fine to you.

update metadata?

In some cases, it looks like the metadata from the GutenTag dump (itself based on the DVD ISO) is out-of-date with the live Project Gutenberg site. For example, Coleridge's Complete Poetical Works has a subject tag on the live site, but that subject tag is missing in the GutenTag HTML metadata (and thus from the metadata in the Gutenberg, dammit archive). Fixing this might depend on a fix for #3, but could also possibly be fixed by just using the most up-to-date RDFs from the catalog data?

Can this generate a CSV of the metadata of all Gutenberg books?

Aloha,
Can this generate a CSV file of the key metadata (title, author, year, link to text) for Gutenberg books? If so, I can make a pretty cool tile in Ohayo to allow people to instantly do text analysis and other things on any Gutenberg title (breck7/ohayo#24)
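
A rough sketch of what such an export could look like, assuming the metadata is available as a list of dicts. The field names below (`Num`, `Title`, `Author`, `Link`) and the sample record are placeholders chosen for illustration; the real metadata keys may differ, and a year field may not exist at all.

```python
import csv
import io
from typing import Dict, Iterable, List

# Hypothetical column names; adjust to whatever keys the metadata actually uses.
FIELDS = ["Num", "Title", "Author", "Link"]


def metadata_to_csv(records: Iterable[Dict[str, object]], fields: List[str] = FIELDS) -> str:
    """Flatten metadata records into CSV text, joining list values with '; '."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for record in records:
        row = {}
        for field in fields:
            value = record.get(field, "")
            if isinstance(value, list):
                value = "; ".join(str(item) for item in value)
            row[field] = value
        writer.writerow(row)
    return buffer.getvalue()


# Made-up record, only to show the output shape.
records = [
    {
        "Num": "1",
        "Title": ["Example Title"],
        "Author": ["Doe, Jane", "Roe, Richard"],
        "Link": "https://example.org/1",
    }
]
csv_text = metadata_to_csv(records)
```

Fields that contain commas, such as "Last, First" author names, are quoted automatically by the `csv` module, so the output stays parseable.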

check file extension when retrieving files from zip

A small subset of files (e.g. etext98/sesli10.zip, etext04/stryb10.zip) have a JPEG as the first entry in their ZIP file from the ISO, which the code blithely interprets as a text file (since it's only looking for the first entry in the ZIP, see this line). It should probably only look at files with a particular extension (i.e., .txt).
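
A minimal sketch of the extension check using the standard `zipfile` module. The helper name `first_text_member` and the in-memory archive are illustrative only; this is not the project's current code.

```python
import io
import zipfile
from typing import Optional


def first_text_member(archive: zipfile.ZipFile, extension: str = ".txt") -> Optional[str]:
    """Return the first entry whose name ends with the extension, or None."""
    for name in archive.namelist():
        if name.lower().endswith(extension):
            return name
    return None


# Build a small in-memory ZIP whose first entry is an image, as in the problem files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("cover.jpg", b"\xff\xd8\xff")
    zf.writestr("book10.txt", b"Some text")

with zipfile.ZipFile(buffer) as zf:
    chosen = first_text_member(zf)
    content = zf.read(chosen)
```

Returning `None` when no `.txt` entry exists lets the caller decide whether to skip the archive or log it, instead of silently treating binary data as text.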
