A tool for extracting chapters from Gutenberg Project Italian raw text e-books. Regular expressions match the chapter headings, and the text between them is extracted.

License: GNU General Public License v3.0


It-Chapterize

Chapterize by Jonathan Reeve is a command-line tool that breaks up Gutenberg Project English plain text e-books into chapters, removing both the chapter headings and the text not included between headings.

It-Chapterize is an adaptation of Chapterize for the Italian language with additional minor changes concerning the output.

Main Changes

  • All regular expressions were modified to detect the most likely Italian chapter headings
  • Chapter headings are included at the beginning of each extracted chapter
  • The value of the delta variable for removing chapter headings that are likely to be part of a Table of Contents was increased
  • An additional function removes short detected chapters, which are likely to be false positives or spurious text
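
The changes listed above can be illustrated with a minimal sketch. This is not the tool's actual code: the pattern `HEADING_RE`, the helper names, and the default values of `delta` and `min_chars` are all assumptions chosen for illustration.

```python
import re

# Hypothetical pattern: a whole line such as "CAPITOLO I", "Capitolo 12."
# or "Capitolo primo". The real tool uses its own set of expressions.
HEADING_RE = re.compile(r"capitolo\s+(?:\d+|[a-z]+)\.?", re.IGNORECASE)


def find_headings(text):
    """Return (line_number, char_offset) for every heading-like line."""
    headings = []
    offset = 0
    for line_no, line in enumerate(text.splitlines(keepends=True)):
        if HEADING_RE.fullmatch(line.strip()):
            headings.append((line_no, offset))
        offset += len(line)
    return headings


def drop_toc_headings(headings, delta=8):
    """Drop headings followed by another heading fewer than `delta` lines
    later; headings packed that closely are likely Table of Contents entries."""
    kept = []
    for i, (line_no, offset) in enumerate(headings):
        is_last = i == len(headings) - 1
        if is_last or headings[i + 1][0] - line_no >= delta:
            kept.append((line_no, offset))
    return kept


def split_chapters(text, delta=8, min_chars=200):
    """Split text at the surviving headings, keeping each heading at the
    start of its chapter, then discard chapters shorter than `min_chars`."""
    headings = drop_toc_headings(find_headings(text), delta)
    offsets = [offset for _, offset in headings] + [len(text)]
    chapters = [text[a:b].strip() for a, b in zip(offsets, offsets[1:])]
    return [chapter for chapter in chapters if len(chapter) >= min_chars]
```

In this sketch the last Table of Contents entry can survive the `delta` check when the first real chapter starts many lines later. The resulting chunk is very short, so the length filter drops it, which is one reason a filter of that kind is useful.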

Installation and Testing

# Clone the repository
git clone https://github.com/GiuseppeDellaCorte/It-Chapterize.git

# Grab a copy of "I tre Moschettieri - Volume 1" from Project Gutenberg:
wget https://www.gutenberg.org/files/60641/60641-0.txt

# Run It-Chapterize on it as follows:
python /path-to/itchapterize/itchapterize.py /path-to/60641-0.txt

It will output a new directory in the current working directory named 60641-0.txt-chapters, containing files ranging from 01.txt to 16.txt.
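
The output layout described above can be reproduced with a short sketch. The function `write_chapters` and its parameters are hypothetical and only mirror the naming scheme (`<input file name>-chapters`, with zero-padded file names); the tool itself may write its files differently.

```python
from pathlib import Path


def write_chapters(chapters, source_path, base_dir="."):
    """Write each chapter to <base_dir>/<source file name>-chapters/NN.txt
    and return the output directory."""
    out_dir = Path(base_dir) / (Path(source_path).name + "-chapters")
    out_dir.mkdir(parents=True, exist_ok=True)
    # Pad to at least two digits so the files sort correctly: 01.txt, 02.txt, ...
    width = max(2, len(str(len(chapters))))
    for number, chapter in enumerate(chapters, start=1):
        target = out_dir / f"{number:0{width}d}.txt"
        target.write_text(chapter, encoding="utf-8")
    return out_dir
```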

State of the Tool

It-Chapterize has been tested on only a small set of Italian e-books, so it does not yet handle many possible Italian chapter headings.

Tested on

It-Chapterize has been tested successfully on these Italian Gutenberg Project files:

It-Chapterize has also been tested on the Gutenberg Project files that follow this paragraph. It worked relatively well on them, but not perfectly: the output text files include one or two false-positive chapters. In addition, for a few of them, spurious information is sometimes included, usually in the first or last extracted chapter. Manual correction of false negatives takes around 1-2 minutes per parsed file.
