Islandora hOCR

Introduction

Adds the hOCR derivative functionality.

Installation

Install as usual; see this for further information.

This module contains a migration that creates a media use term for use in common Islandora configurations. Enabling the module exposes the islandora_hocr_media_uses migration, which generates a media use term with the URI https://discoverygarden.ca/use#hocr.

# Flow might be something like:
drush en islandora_hocr
drush migrate:import islandora_hocr_media_uses

Configuration

Derivatives

An action must be created and configured to generate an hOCR derivative. The action must also be triggered by a context in order for the derivative to be made. Refer to the official Islandora docs for more information.

Solr

We expect to make use of the Solr OCR Highlighting Plugin. The particulars of its installation are ultimately up to the environment into which it is being installed.

A single environment variable specifies the path of the library on the Solr instance, so that the path can be added to the configset for Solr:

  • SOLR_HOCR_PLUGIN_PATH: A path resolvable by Solr to the directory containing the OCR Highlighting Plugin JAR.

There are a couple of config entities included:

  • the islandora_hocr field type to perform tokenization
  • the "Select w/ HOCR highlighting" /select_ocr request handler.

HOCR Indexing

To node entities, we have added the ability to index HOCR from related media, making use of the Solr OCR Highlighting Plugin.

As an example, you might add the islandora_hocr_field:content property to be indexed in Solr via the Search API Solr config, as islandora_hocr_field, as a Fulltext ("islandora_hocr") field.

Something of an aside, but the islandora_hocr_field:uri property is presently prototypical. The Solr OCR Highlighting Plugin has another character filter which handles resolving paths into the contents of the files; however, when components communicate over the network, such access might not always be possible, particularly should access control enter into the equation. As such, we presently expect the full page-level OCR document to be pushed for each page.

Usage

Assuming indexing is configured as above, with an islandora_hocr_field, you might programmatically perform a Search API query with something like:

$index = \Drupal\search_api\Entity\Index::load('default_solr_index');
$query = $index->query();

// The search term(s).
$query->keys('bravo');
// Additional conditions, as desired.
$query->addCondition('type', 'islandora_object');
// Activate our highlighting behaviour.
$query->setOption('islandora_hocr_properties', [
  'islandora_hocr_field' => [],
]);

// Perform the query.
$results = $query->execute();

// Get the additionally-populated property info, so we can identify what fields from the highlighted results correspond to which property.
$info = $results->getQuery()->getOption('islandora_hocr_properties');
// This should be an associative array mapping language codes to Solr fields,
// which can then be found in the $highlights below.
$language_fields = $info['islandora_hocr_field']['language_fields'];

// When processing the results, the highlights are available on each item.
foreach ($results as $result) {
  // Highlighting info can be acquired from the items. The format here is the
  // same as the format from https://dbmdz.github.io/solr-ocrhighlighting/0.8.3/query/#response-format
  // for the given item/document.
  $highlights = $result->getExtraData('islandora_hocr_highlights');
}
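
As a rough sketch of what might be done with the highlights inside that loop, the following collects snippet text per language. It assumes the per-item data is a plain array keyed by Solr field name, with each entry holding a `snippets` list whose members carry a `text` key, as described in the response format linked above. The function name `collect_hocr_snippets` and the sample field name are hypothetical and are not part of this module.

```php
<?php

/**
 * Collects highlighted snippet text per language for one result item.
 *
 * Hypothetical helper; not provided by islandora_hocr.
 *
 * @param array $highlights
 *   Highlight data for one item, assumed to be keyed by Solr field name.
 * @param array $language_fields
 *   Map of language code => Solr field name.
 *
 * @return array
 *   Map of language code => list of snippet strings.
 */
function collect_hocr_snippets(array $highlights, array $language_fields): array {
  $snippets = [];
  foreach ($language_fields as $langcode => $solr_field) {
    // Fields without any hit are simply absent, so default to an empty list.
    foreach ($highlights[$solr_field]['snippets'] ?? [] as $snippet) {
      if (isset($snippet['text'])) {
        $snippets[$langcode][] = $snippet['text'];
      }
    }
  }
  return $snippets;
}

// Hypothetical sample data standing in for $language_fields and $highlights.
$sample_language_fields = ['en' => 'example_solr_field_en'];
$sample_highlights = [
  'example_solr_field_en' => [
    'snippets' => [
      ['text' => 'alpha <em>bravo</em> charlie', 'score' => 1.0],
    ],
    'numTotal' => 1,
  ],
];

foreach (collect_hocr_snippets($sample_highlights, $sample_language_fields) as $langcode => $texts) {
  foreach ($texts as $text) {
    print $langcode . ': ' . $text . PHP_EOL;
  }
}
```

Within the loop above, the call would then look like `collect_hocr_snippets($highlights, $language_fields)`, provided `getExtraData()` returned an array for that item. Each snippet also carries region and coordinate information that this sketch ignores.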

Troubleshooting/Issues

Having problems or solved one? Contact discoverygarden.

Known issues

  • Solr Cloud Package (in)compatibility: The path to the library could be omitted; however, the conditional inclusion of prefixes in the config entities is problematic.

Maintainers/Sponsors

Current maintainers:

Sponsor:

License

GPLv3

islandora_hocr's People

Contributors

adam-vessey, bibliophileaxe, jordandukart, lutaylor, nchiasson-dgi, willtp87

Forkers

lutaylor

islandora_hocr's Issues

hOCR Files not being Indexed

Setup

  1. Got the latest isle-dc stack up
  2. Installed and enabled the module
  3. Ran drush migrate:import islandora_hocr_media_uses
  4. Added the islandora_hocr_field:content property to be indexed in Solr via the Search API and set its Type to Fulltext ("islandora_hocr")
  5. Set up a Repository item with a .tif file and a .hocr file in media
     • The .hocr file has the hOCR media use on it
  6. Installed the Solr OCR Highlighting Plugin per the instructions from the documentation (in case I messed up the config, here are my configs for Solr)
     • The schema for the installation is here
     • I've also included the plugin's directive in the solrconfig.xml and added the needed lines of config in solrconfig_extra.xml here
  7. Set the correct path for the SOLR_HOCR_PLUGIN_PATH environment variable
  8. Restarted the Solr container and indexed the nodes in Drupal

Problem

I cannot seem to get the hOCR to be indexed into Solr even after all the above setup steps. I've traced the code and found that the processor is properly doing its job of reading the content out and adding the value into Solr. However, using the Solr web interface I cannot see the field when I perform a query. I can see the field as the raw file content in the Solr query if I change the islandora_hocr_field:content property type to Fulltext. The OCR highlighting also doesn't show anything.

Am I missing something from the setup steps that is preventing the module from working? Some guidance would be appreciated! Thanks!
