omeka-s-module-extractocr's Introduction

Extract OCR (plugin upgraded for Omeka S)

Module for Omeka S that extracts OCR text from PDF files as XML or TSV, allowing instant full-text searching within any IIIF viewer, such as Universal Viewer or Mirador, with the IIIF-Search module.

The XML format is either the simple pdf2xml or the more common ALTO standard. The TSV format is a simple two-column file containing the words and the list of their positions by page.

The TSV format is recommended because it is much quicker, in particular for items with many pages.

Installation

  • This module needs the pdftohtml command-line tool on your server, from the poppler utilities:
# Debian and derivatives
sudo apt install poppler-utils
# Red Hat and derivatives
sudo dnf install poppler-utils
  • Before Omeka S version 3.1, the module requires the base URI to be set in the Omeka config file config/local.config.php in order to upload the file in the background:
    'file_store' => [
        'local' => [
            'base_path' => null, // Or the full path on the server if needed.
            'base_uri' => 'https://example.org/files', // To be removed in Omeka S v3.1.
        ],
    ],
  • Upload and unzip the Extract OCR module folder into your modules folder on the server, or install the module via GitHub:
cd omeka-s/modules
git clone git@github.com:bubdxm/Omeka-S-module-ExtractOcr.git "ExtractOcr"
  • Make sure the folder is named "ExtractOcr".
  • Install it from Admin → Modules → Extract OCR → Install.
  • Extract OCR automatically allows the upload of XML files.
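
Before installing the module, it can help to confirm that pdftohtml is actually on the PATH of the server. A minimal sketch, in which check_tool is a hypothetical helper name and not part of the module:

```shell
# Report whether a command-line tool is available on the PATH.
# check_tool is a hypothetical helper, not something the module provides.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
    return 1
  fi
}

check_tool pdftohtml || echo "Install poppler-utils before enabling the module."
```

If the web server runs with a different PATH than your login shell, run the check as the web server user.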

Using the Extract OCR module

  • Create an item.
  • Save this item.
  • After saving, add PDF file(s) to this item.
  • To locate the extracted OCR XML or TSV file, select the item to which the PDF is attached. Normally, you should see an XML or TSV file attached to the record, with the same filename as the PDF file.

Optional modules

  • IIIF-Server: Module for Omeka S that adds the IIIF specifications to serve any images and media.
  • IIIF-Search: Module for Omeka S that adds the IIIF Search API for full-text searching in Universal Viewer.
  • Universal Viewer: Module for Omeka S that includes UniversalViewer, a unified online player for any file. It can display books, images, maps, audio, movies, pdf, 3D views, and anything else as long as the appropriate extensions are installed.
  • Mirador
  • Or any other IIIF viewer, such as Diva.

TODO

  • Extract strings using pdftotext with the -tsv argument and store them in a file or in the database for simpler and quicker search.
  • Extract strings by word, but with one position per row, which allows searching with "AND", not only "OR".
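
The second item can be illustrated with a small sketch of an "AND" search over a file that holds one word occurrence per row. The row layout (word, page, position, separated by tabs) and the helper name pages_with_all are assumptions made for this example. The TODO does not define the final format, and the positions below are made-up placeholders.

```shell
# Print, in numeric order, the pages on which every given word occurs (AND).
# Assumed row layout (hypothetical): word<TAB>page<TAB>position
pages_with_all() {
  tsv="$1"
  shift
  awk -F '\t' -v words="$*" '
    BEGIN { n = split(tolower(words), want, " ") }
    {
      word = tolower($1)
      for (i = 1; i <= n; i++) {
        # Count each (search term, page) pair only once.
        if (word == want[i] && !((i, $2) in seen)) {
          seen[i, $2] = 1
          hits[$2]++
        }
      }
    }
    END { for (page in hits) if (hits[page] == n) print page }
  ' "$tsv" | sort -n
}

# Sample data: one word occurrence per row (positions are placeholders).
sample="$(mktemp)"
printf 'omeka\t1\t10,20\nviewer\t1\t30,20\nomeka\t2\t10,40\nviewer\t3\t15,60\n' > "$sample"

pages_with_all "$sample" omeka viewer   # only page 1 contains both words
rm -f "$sample"
```

With one row per occurrence, a page matches only if each search term was seen on it at least once. An "OR" search would instead print every page that has any hit.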

Troubleshooting

See the online Extract OCR issues.

License

This module is published under GNU/GPL.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

Copyright

  • Copyright Sylvain Machefert, Université Bordeaux 3 (see symac)
  • Copyright Daniel Berthereau, 2020-2024 (see Daniel-KM on GitLab)

omeka-s-module-extractocr's Issues

Directory check

@Daniel-KM I have just reinstalled a fresh Omeka and wanted to add this module to it. During the installation of the module I get an error message that seemed odd to me: it says that the files/temp folder is not writable.

Looking at it more closely, I have the impression that this is linked to commit ee4cfef (three years ago, so not recent, in Module.php at line 301). In checkdir, the code tests whether the basename is writable without being in the right context, so the answer is always no. I think that testing the dirname rather than the basename would solve the problem. Does that seem right to you?
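
The behaviour described in this report can be reproduced outside the module. The sketch below is a shell analogue, not the module's PHP code: check_writable and the temporary directories are hypothetical stand-ins. It shows that a bare name such as "temp" is resolved against the current directory, whereas the parent directory of the absolute path is checked where it really is.

```shell
# Report whether the bare name and the parent directory of a path are writable.
check_writable() {
  path="$1"
  name="$(basename "$path")"    # bare name, resolved against the current directory
  parent="$(dirname "$path")"   # parent directory of the given absolute path
  if [ -w "$name" ]; then echo "basename: writable"; else echo "basename: not writable"; fi
  if [ -w "$parent" ]; then echo "dirname: writable"; else echo "dirname: not writable"; fi
}

files_dir="$(mktemp -d)"    # stand-in for the Omeka files directory
other_dir="$(mktemp -d)"    # unrelated working directory
mkdir "$files_dir/temp"

# Run the check from another directory, in a subshell so the current directory is kept.
( cd "$other_dir" && check_writable "$files_dir/temp" )
rm -rf "$files_dir" "$other_dir"
```

Because the working directory contains no entry named "temp", the basename test fails even though the real folder exists and is writable.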
