Ebook Corpus - A parser and extractor for electronic books

Ebook Corpus is a set of tools for parsing and extracting the text of ebooks in various formats, designed for the purpose of creating large multilingual ebook-based text corpora.

Many people have amassed enormous collections of ebooks, often containing millions of lines of text when taken as a whole, so it is always surprising to find that there aren't more tools and libraries available to work with ebooks as a corpus source. It seems that almost all the existing tools are focused on consuming (reading) ebooks, while the remaining few provide the functionality to create ebooks to be thus consumed.

As wonderful as ebooks are, they are often packaged in formats that are incredibly underspecified, or worse, that don't follow the specifications that do exist. A remarkable number of parsing libraries choke on very simple books even in presumably well-supported formats like EPUB3.

There are many ways for an ebook to defy the expectations of the parser -- perhaps it has been written in Unicode and the parser only handles US-ASCII, or the parser expects Unicode and it's written in KOI-8. Maybe the ebook contains an OPF file called content.opf in the root directory, or maybe it's in a separate CONTENT subfolder -- or called something completely different, like mytoc.opf or 目录.opf.
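For EPUB specifically, the container format does give a way to locate the OPF file without guessing its name: META-INF/container.xml lists the package document in the full-path attribute of a rootfile element. The following is a minimal sketch of that lookup, not the code used by ebook.rb. The helper name opf_path_from_container and the sample XML are made up for illustration, and the sketch assumes the contents of container.xml have already been read out of the archive. It relies on REXML, which ships with Ruby as a bundled gem.

```ruby
require "rexml/document"

# Hypothetical helper (not part of ebook.rb): return the OPF path declared
# in the text of META-INF/container.xml, or nil if no rootfile is declared.
def opf_path_from_container(container_xml)
  doc = REXML::Document.new(container_xml)
  # Match on local-name() so the lookup does not depend on namespace prefixes.
  rootfile = REXML::XPath.first(doc, "//*[local-name()='rootfile']")
  rootfile && rootfile.attributes["full-path"]
end

# Illustrative container.xml pointing at a non-default OPF location.
SAMPLE_CONTAINER = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
    <rootfiles>
      <rootfile full-path="CONTENT/mytoc.opf" media-type="application/oebps-package+xml"/>
    </rootfiles>
  </container>
XML

puts opf_path_from_container(SAMPLE_CONTAINER)
```

Books that omit or mangle container.xml would still need a fallback, such as scanning the archive for any file ending in .opf.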

The Ebook Corpus tools won't solve all of these problems, but they nevertheless provide a number of options to make it easier to work with large, multilingual collections of ebooks as a raw text source.

Usage

Invoking the program on the command-line is straightforward:

./ebook.rb [options] [filename]

Here, [filename] is the path to the ebook file that you want to work with. If the file has a standard extension (*.epub, *.mobi, *.fb2), its format should be detected automatically.
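As a rough illustration of extension-based detection, here is a minimal sketch. The FORMATS table and the detect_format helper are hypothetical names based on the extensions listed in this README, not taken from ebook.rb, and the real script may detect formats differently.

```ruby
# Hypothetical extension-to-format table (an assumption for illustration).
FORMATS = {
  ".epub" => :epub,
  ".fb2"  => :fictionbook,
  ".mobi" => :mobipocket,
  ".prc"  => :mobipocket,
  ".azw"  => :mobipocket
}.freeze

# Return the format symbol for a path, or nil for an unknown extension.
def detect_format(path)
  FORMATS[File.extname(path).downcase]
end

puts detect_format("War and Peace.EPUB").inspect
puts detect_format("notes.txt").inspect
```

Lower-casing the extension keeps files such as BOOK.EPUB from being rejected. A more robust approach would also inspect the file contents, since extensions can be wrong or missing.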

Options

  • -a or --all: Extract all contents of the EPUB
  • -c or --cover: Extract cover image
  • -f or --flatten-dir: Save all files to the current folder rather than an individual directory
  • -h or --html: Extract raw HTML
  • -i or --images: Extract images to a separate folder
  • -m or --metadata: Print metadata
    • -T or --title: Print title metadata only
    • -A or --author: Print author metadata only
    • -I or --isbn: Print ISBN metadata only
    • -L or --language: Print language metadata only
    • -P or --publisher: Print publisher metadata only
    • -D or --description: Print description metadata only
  • -o or --output-dir DIR: Save output to specified directory
  • -s or --save: Save (text or HTML) to file instead of printing
  • -t or --text: Extract plain text
  • -T or --tests: Run test suite
  • -p or --pager: View text in pager
  • -v or --view: Open images in viewer
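To show how flags like these fit together, here is a minimal sketch that parses a small subset of them with Ruby's standard OptionParser. The parse_options helper and the option hash keys are hypothetical. This is not how ebook.rb is actually written, and only four of the options above are covered.

```ruby
require "optparse"

# Hypothetical sketch: parse a subset of the documented flags into a hash
# and return it together with the first non-option argument (the filename).
def parse_options(argv)
  options = {}
  parser = OptionParser.new do |opts|
    opts.banner = "Usage: ./ebook.rb [options] [filename]"
    opts.on("-t", "--text", "Extract plain text") { options[:text] = true }
    opts.on("-s", "--save", "Save to file instead of printing") { options[:save] = true }
    opts.on("-m", "--metadata", "Print metadata") { options[:metadata] = true }
    opts.on("-o", "--output-dir DIR", "Save output to specified directory") do |dir|
      options[:output_dir] = dir
    end
  end
  # parse leaves argv untouched and returns the non-option arguments.
  rest = parser.parse(argv)
  [options, rest.first]
end

options, filename = parse_options(%w[-t -s -o out book.epub])
puts options.inspect
puts filename
```

With the arguments shown, the text and save flags are set, the output directory is "out", and the filename is "book.epub".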

Supported formats

Format        File extension
EPUB          .epub
FictionBook   .fb2
Mobipocket    .mobi, .prc, .azw

Support for Mobipocket files is provided via a wrapper for the Python script mobiunpack.py by @kevinhendricks (released under the GPLv3). If you know of a drop-in replacement library in Ruby for parsing MOBI files (or are interested in writing one), please let me know!

Note that only ebooks without DRM will work with this script.

Contributing

PRs, suggestions, examples of ebooks that don't parse properly, and other contributions are always welcome! Providing support for additional formats or opening issues for bugs are examples of ways to help.

MOBI support has only been tested against files with the .mobi extension. It should in theory also work for other extensions. If you have access to ebooks with a .prc or .azw file extension and can confirm this, that would be appreciated!

To do

The code is pretty ad hoc at the moment and generally in need of a cleanup. Different formats are handled separately, but the handling should probably be merged.

Other things:

  • Guess alternately-named content.opf files
  • Figure out cross-platform way of opening images in default viewer (current kludge is hard-coded to open image folder in Gwenview since xdg-open doesn't play nicely with cleaning up temporary files after viewing)
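For the image viewer item, one possible starting point is to choose the opener command from the host platform. The sketch below is only an assumption about how this could look: opener_command and open_command_for are hypothetical helpers, and the sketch does not address the temporary-file cleanup problem mentioned above, since commands like xdg-open return before the viewer is closed.

```ruby
require "rbconfig"

# Hypothetical helper: pick an "open with the default application" command
# for a host OS string (defaults to the platform Ruby is running on).
def opener_command(host_os = RbConfig::CONFIG["host_os"])
  case host_os
  when /darwin/i then ["open"]
  # The empty string is the window title argument that "start" expects.
  when /mswin|mingw|cygwin/i then ["cmd", "/c", "start", ""]
  else ["xdg-open"]
  end
end

# Build the full argument list for opening a given path.
def open_command_for(path, host_os = RbConfig::CONFIG["host_os"])
  opener_command(host_os) + [path]
end

puts open_command_for("images/cover.jpg").inspect
```

The resulting array can be passed to system(*open_command_for(path)), which avoids shell quoting problems with unusual filenames.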

License

MIT.
