among / fusus

a workflow to transform Arabic classical works in printed form to structured text

Home Page: https://among.github.io/fusus/fusus/index.html

License: MIT License

Python 0.27% Jupyter Notebook 55.70% Shell 0.01% HTML 44.03%
arabic ocr workflow text-processing image-processing python opencv kraken text-fabric digital-humanities

fusus's Introduction

Project Status: WIP - Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.

Fusus

This is a workflow that transforms scanned pages into readable text.

The pages come from several printed Arabic books from the past few centuries.

The workflow takes care of cleaning, OCR, proofing, and conversion to tab-separated files and from there to Text-Fabric, where the text material can be processed further.

(Figure: pipeline diagram)

Features

  • cleaning is included: specks and symbols can be specified for cleaning by copying and pasting such fragments and storing them in a designated directory;
  • column layout and line boundaries are detected prior to OCRing;
  • individual lines are passed to the OCR engine, which is Kraken using a model trained on many printed Arabic books, see model;
  • the results are stored in tab-separated files, retaining bounding boxes and confidences;
  • proofing pages can be generated for manually checking the OCR results;
  • the OCR results of each book are composed into Text-Fabric datasets.
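As an illustration of what can be done with tab-separated OCR output that retains boxes and confidences, here is a minimal Python sketch. The column names (`page`, `line`, `left`, `top`, `right`, `bottom`, `conf`, `text`), the sample rows and the threshold are assumptions made for this example; the actual files produced by the workflow may be laid out differently.

```python
import csv
import io

# Hypothetical column layout and sample data; the real fusus files
# may use other column names, another order, or extra columns.
SAMPLE_TSV = (
    "page\tline\tleft\ttop\tright\tbottom\tconf\ttext\n"
    "1\t1\t120\t40\t180\t72\t0.98\tفصوص\n"
    "1\t1\t60\t40\t112\t72\t0.61\tالحكم\n"
    "1\t2\t130\t80\t182\t110\t0.93\tكتاب\n"
)


def read_words(tsv_text):
    """Parse tab-separated OCR output into a list of dicts with typed fields."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    words = []
    for row in reader:
        words.append({
            "page": int(row["page"]),
            "line": int(row["line"]),
            # Bounding box as (left, top, right, bottom) in pixels.
            "box": tuple(int(row[k]) for k in ("left", "top", "right", "bottom")),
            "conf": float(row["conf"]),
            "text": row["text"],
        })
    return words


def low_confidence(words, threshold=0.8):
    """Return the words whose OCR confidence is below the threshold."""
    return [w for w in words if w["conf"] < threshold]


words = read_words(SAMPLE_TSV)
suspects = low_confidence(words)
for w in suspects:
    print(w["page"], w["line"], w["box"], w["conf"], w["text"])
```

A list like `suspects` could serve as a starting point for manual proofing, since each word keeps its position on the page.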

This lays the foundations for:

  • correcting OCR mistakes;
  • enriching the text with morphological/linguistic annotations, named entities;
  • performing intertextuality research between the ground work (the "Fusus" by Ibn Arabi) and its commentary books.

A lot of cleaning has been carried out on two editions of the Fusus: Lakhnawi and Afifi. After that, these editions were aligned and brought together in a single dataset, from which it is possible to read back the individual editions.

Text-Fabric interface

Get started with the tutorial.

We have also generated a static search interface.

Just click fusus-search and off you go.

You can do full-text search via regular expressions, not only in the text itself, but also in attributes of the text, notably the bounding box information of each word.
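The idea of searching both the text and its attributes with regular expressions can be sketched in a few lines of Python. This is not how the search interface is implemented; the record fields (`text`, `left`, `top`) and the sample values are hypothetical and serve only to show the principle.

```python
import re

# Hypothetical word records; field names and values are made up for this sketch.
WORDS = [
    {"text": "فصوص", "left": 120, "top": 40},
    {"text": "الحكم", "left": 60, "top": 40},
    {"text": "كتاب", "left": 130, "top": 80},
]


def search(words, field, pattern):
    """Return the records whose given field, rendered as a string, matches the pattern."""
    rx = re.compile(pattern)
    return [w for w in words if rx.search(str(w[field]))]


# Words whose text starts with the article "ال".
by_text = search(WORDS, "text", r"^ال")

# Words whose left coordinate is a three-digit number starting with 1.
by_left = search(WORDS, "left", r"^1\d\d$")

print([w["text"] for w in by_text])
print([w["left"] for w in by_left])
```

Treating numeric attributes as strings keeps one search mechanism for everything, at the cost of expressing ranges as patterns rather than as comparisons.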

Authors

Project

Fusus has been funded by the IT Research Innovation Fund.

It was developed between 2020-03-01 and 2021-03-01.

Correction, enrichment and alignment of the two Fusus editions were carried out from the end of the project until the end of 2021.

Docs

There is more documentation about sources, the research project, and how to use this software in the docs.

fusus's People

Contributors

dirkroorda, lwcvl

Forkers

lwcvl

fusus's Issues

Adaption to demargined

Demargined should choose the largest area and keep only that one, so that false stripes are also moved into the margin.

The red arrows indicate what can be removed.

Screen Shot 2020-04-02 at 13 54 25
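One possible reading of this request, sketched in Python: if the detected stripes split the page into horizontal bands, keep only the tallest band. The function name `largest_band`, the representation of stripes as y-coordinates and the sample numbers are assumptions for illustration; this is not the project's code, and the issue may intend a different notion of "area".

```python
def largest_band(height, stripes):
    """Return (top, bottom) of the tallest band between horizontal stripes.

    height  -- page height in pixels
    stripes -- y-coordinates of detected stripes (hypothetical input format)
    """
    # Page top and bottom act as outer boundaries; stripes outside the page are ignored.
    edges = [0] + sorted(s for s in stripes if 0 < s < height) + [height]
    # Consecutive edges delimit the candidate bands.
    bands = list(zip(edges, edges[1:]))
    return max(bands, key=lambda band: band[1] - band[0])


# Stripes at 80 and 900 could be real header and footer separators,
# while 120 could be a false stripe just below the header.
main_band = largest_band(1000, [80, 120, 900])
print(main_band)
```

With this choice, a false stripe near a page edge ends up outside the kept band instead of cutting the main text area in two.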

Relating marks to bands

In the 'box' view, the marks could also carry the name (or an abbreviation) of the band in which they are searched for. That makes it quicker to see whether the bands need to be adjusted.

Input should allow tiff/tif/jpg

The scripts are currently too clean: they hide too much of the mechanics behind the scenes, which makes them difficult to adapt to different situations. Most images we will process are tiff or tif, so the input for pages and elements should accommodate these formats.
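A minimal sketch of how input collection could accept several image extensions, written in Python with only the standard library. The function name `find_images` and the set `IMAGE_EXTENSIONS` are hypothetical; they are not taken from the fusus code.

```python
import tempfile
from pathlib import Path

# Extensions named in this issue; extend the set as needed.
IMAGE_EXTENSIONS = {".tif", ".tiff", ".jpg"}


def find_images(directory, extensions=IMAGE_EXTENSIONS):
    """Return the image files in a directory, sorted, matching extensions case-insensitively."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )


# Small demonstration in a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("001.tif", "002.TIFF", "003.jpg", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [p.name for p in find_images(tmp)]

print(found)
```

Comparing the lowercased suffix against a set means that supporting another format is a one-line change.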
