It-Chapterize

Chapterize by Jonathan Reeve is a command-line tool that breaks up English plain-text e-books from Project Gutenberg into chapters, removing both the chapter headings and any text that does not fall between headings.

It-Chapterize is an adaptation of Chapterize for the Italian language, with minor additional changes to the output.

Main Changes
Installation and Testing
State of the Tool
Tested on

Main Changes

All regular expressions were modified to detect the most likely Italian chapter headings
Chapter headings are included at the beginning of each extracted chapter
The value of the delta variable, used to remove chapter headings that are likely part of a Table of Contents, was increased
An additional function removes short detected chapters, which are likely to be false positives or spurious text
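
The changes listed above can be illustrated with a minimal sketch. This is not the code shipped in itchapterize.py: the pattern HEADING_RE, the function names, and the default values of delta and min_words are all hypothetical, chosen only to show the general technique (heading detection, Table of Contents filtering by line distance, and removal of very short chapters).

```python
import re

# Hypothetical pattern (not the tool's actual regexes): matches lines such as
# "CAPITOLO I", "Capitolo 12.", "Parte seconda", or a bare Roman numeral "XIV."
HEADING_RE = re.compile(
    r"^\s*(?:(?i:capitolo|parte|libro)\s+\w+|[IVXLCDM]+)\.?\s*$"
)


def find_headings(lines):
    """Return the indices of lines that look like chapter headings."""
    return [i for i, line in enumerate(lines) if HEADING_RE.match(line)]


def drop_toc_headings(indices, delta=8):
    """Discard headings closer than `delta` lines to a neighbouring heading.

    Headings packed tightly together are assumed to be Table of Contents
    entries. The default of 8 is an illustrative value only.
    """
    close = set()
    for a, b in zip(indices, indices[1:]):
        if b - a < delta:
            close.update((a, b))
    return [i for i in indices if i not in close]


def split_chapters(lines, indices, min_words=50):
    """Cut the text at each heading and drop chapters under `min_words` words.

    The heading line is kept as the first line of its chapter. In this
    sketch the last chapter simply runs to the end of the input.
    """
    chapters = []
    for start, end in zip(indices, indices[1:] + [len(lines)]):
        chunk = lines[start:end]
        if sum(len(line.split()) for line in chunk) >= min_words:
            chapters.append("\n".join(chunk))
    return chapters


def chapterize(text, delta=8, min_words=50):
    """Run the three steps in order on a whole book given as one string."""
    lines = text.splitlines()
    indices = drop_toc_headings(find_headings(lines), delta)
    return split_chapters(lines, indices, min_words)
```

Calling chapterize(text) on the contents of an e-book returns a list of chapter strings, each starting with its heading. A larger delta filters out more widely spaced Table of Contents entries, at the cost of also discarding genuine headings that sit close together.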

Installation and Testing

# Clone the repository
git clone https://github.com/GiuseppeDellaCorte/It-Chapterize.git

# Grab a copy of "I tre Moschettieri - Volume 1" from Project Gutenberg:
wget https://www.gutenberg.org/files/60641/60641-0.txt

# Run It-Chapterize on it as follows:
python /path-to/itchapterize/itchapterize.py /path-to/60641-0.txt

This creates a new directory named 60641-0.txt-chapters in the current working directory, containing the files 01.txt through 16.txt.
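
To check the result quickly, a short script along the following lines can list each chapter file with its heading and word count. It is only a sketch: the helper summarize_chapters is hypothetical and not part of It-Chapterize, and it assumes the chapter files are UTF-8 encoded.

```python
from pathlib import Path


def summarize_chapters(out_dir):
    """Return (file name, first non-empty line, word count) per chapter file."""
    rows = []
    for path in sorted(Path(out_dir).glob("*.txt")):
        # Assumption: the output files are UTF-8 encoded.
        text = path.read_text(encoding="utf-8")
        first = next(
            (line.strip() for line in text.splitlines() if line.strip()), ""
        )
        rows.append((path.name, first, len(text.split())))
    return rows


if __name__ == "__main__":
    # Directory name taken from the example run above.
    for name, heading, words in summarize_chapters("60641-0.txt-chapters"):
        print(f"{name}  {words:>6} words  {heading}")
```

A chapter with an unexpectedly low word count, or whose first line does not look like a heading, is a good candidate for manual inspection.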

State of the Tool

It-Chapterize has been tested on only a small set of Italian e-books, so the tool does not handle many of the possible Italian chapter heading styles.

Tested on

It-Chapterize has been tested successfully on these Italian Project Gutenberg files:

It-Chapterize has also been tested on the Project Gutenberg files that follow this paragraph. It worked relatively well on them, but not perfectly: the output text files include one or two false-positive chapters. In addition, for a few of them, spurious text is sometimes included, usually in the first or last extracted chapter. Manual correction of false negatives requires around one to two minutes per parsed file.
