wagtail_textract's Introduction

⚠️ Deprecation warning

This package is unmaintained, and we have no plans to maintain it.

We advise you to use it as an example, maybe copy the code into your own project, but don't install the package.

Text extraction for Wagtail document search

This package replaces Wagtail's Document class with one that makes the contents of Document files searchable, using textract.

Textract can extract text from (among others) PDF, Excel and Word files.

The package was inspired by the "Search: Extract text from documents" issue in Wagtail.

Documents will work as before, except that Document search in Wagtail's admin interface will also find search terms in the files' contents.

Some screenshots to illustrate.

In our fresh Wagtail site with wagtail_textract installed, we uploaded a file called test_document.pdf with handwritten text in it. It is listed in the admin interface under Documents:

Document List

If we now search in Documents for the word correct, which is one of the handwritten words, the live search finds it:

Document Search finds PDF by searching for "staple"

The assumption is that this search should not only be available in Wagtail's admin interface, but also in a public-facing search view, for which we provide a code example.

Requirements

Maturity

We have been using this package in production since August 2018 on https://nuffic.nl.

Installation

  • Install the Textract dependencies
  • Add wagtail_textract to your requirements and/or pip install wagtail_textract
  • Add wagtail_textract to your Django INSTALLED_APPS.
  • Put WAGTAILDOCS_DOCUMENT_MODEL = "wagtail_textract.document" in your Django settings.
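
As a minimal sketch, the last two steps could look like this in a settings module. The other entries in INSTALLED_APPS are placeholders, and listing wagtail.documents before wagtail_textract is an assumption rather than a documented requirement:

```python
# settings.py (fragment; the surrounding app list is a placeholder)
INSTALLED_APPS = [
    # ... your other Django and Wagtail apps ...
    "wagtail.documents",
    "wagtail_textract",
]

# Point Wagtail at the Document model that carries the transcription.
WAGTAILDOCS_DOCUMENT_MODEL = "wagtail_textract.document"
```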

Note: You'll get an incompatibility warning during installation of wagtail_textract (Wagtail 2.0.1 installed):

requests 2.18.4 has requirement chardet<3.1.0,>=3.0.2, but you'll have chardet 2.3.0 which is incompatible.
textract 1.6.1 has requirement beautifulsoup4==4.5.3, but you'll have beautifulsoup4 4.6.0 which is incompatible.

We haven't seen this leading to problems, but it's something to keep in mind.

Tesseract

In order to make textract use Tesseract, which happens if regular textract finds no text, you need to add the data files that Tesseract can base its word matching on.

Create a tessdata directory in your project directory, and download the languages you want.
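
The fallback described above, where Tesseract is only tried when the regular extraction finds no text, can be sketched roughly as follows. This is an illustration, not the package's actual code: extract_with_fallback and its two callables are hypothetical names. In a real project, extract might be textract.process and extract_ocr a small wrapper that asks textract to use its Tesseract method.

```python
def extract_with_fallback(path, extract, extract_ocr):
    """Return the text found in `path`, trying OCR only if the first pass is empty.

    `extract` and `extract_ocr` are callables taking a file path and
    returning text (hypothetical stand-ins for the real extractors).
    """
    text = extract(path).strip()
    if not text:
        # Nothing found by the regular extractor: fall back to OCR.
        text = extract_ocr(path).strip()
    return text
```

Passing the extractors in as arguments keeps the sketch independent of textract, so the fallback logic can be tested without any OCR tooling installed.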

Transcribing

Transcription is done automatically after Document save, in an asyncio executor to prevent blocking the response during processing.
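
To illustrate the idea of handing slow work to an asyncio executor so the caller is not blocked, here is a minimal, self-contained sketch. It is not the package's implementation: slow_extract stands in for the real text extraction, and schedule_transcription is a hypothetical helper name.

```python
import asyncio
import time


def slow_extract(path):
    """Stand-in for a blocking text extraction call."""
    time.sleep(0.1)
    return "text extracted from " + path


def schedule_transcription(loop, path):
    """Hand the blocking call to the loop's default thread pool and return at once."""
    return loop.run_in_executor(None, slow_extract, path)


async def main():
    loop = asyncio.get_running_loop()
    future = schedule_transcription(loop, "test_document.pdf")
    # Scheduling returns immediately; the work is still running in a thread.
    returned_before_done = not future.done()
    text = await future
    return returned_before_done, text
```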

To transcribe all existing Documents, run the management command:

./manage.py transcribe_documents

This may take a long time, depending on the number and size of your Documents.

Usage in custom view

Here is a code example for a search view (outside Wagtail's admin interface) that shows both Page and Document results.

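from itertools import chain

from django.shortcuts import render
from wagtail.core.models import Page
from wagtail.documents.models import get_document_model
from wagtail.search.models import Query

# Resolve the configured Document model (see WAGTAILDOCS_DOCUMENT_MODEL).
Document = get_document_model()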


def search(request):
    # Search
    search_query = request.GET.get('query', None)
    if search_query:
        page_results = Page.objects.live().search(search_query)
        document_results = Document.objects.search(search_query)
        search_results = list(chain(page_results, document_results))

        # Log the query so Wagtail can suggest promoted results
        Query.get(search_query).add_hit()
    else:
        search_results = Page.objects.none()

    # Render template
    return render(request, 'website/search_results.html', {
        'search_query': search_query,
        'search_results': search_results,
    })

Your template should allow for handling Documents differently than Pages, because you can't do pageurl result on a Document:

{% if result.file %}
   <a href="{{ result.url }}">{{ result }}</a>
{% else %}
   <a href="{% pageurl result %}">{{ result }}</a>
{% endif %}

What if you already use a custom Document model?

In order to use wagtail_textract, your CustomizedDocument model should do the same as wagtail_textract's Document:

  • subclass TranscriptionMixin
  • alter search_fields

from wagtail.search import index
from wagtail_textract.models import TranscriptionMixin


class CustomizedDocument(TranscriptionMixin, ...):
    """Extra fields and methods for Document model."""
    search_fields = ... + [
        index.SearchField(
            'transcription',
            partial_match=False,
        ),
    ]

Note that the first class to subclass should be TranscriptionMixin, so its save() takes precedence over that of the other parent classes.
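
Why the order of base classes matters can be shown with plain Python classes. The names below (BaseDocument, SketchTranscriptionMixin, WrongOrder) are hypothetical stand-ins and do not reproduce the package's models. They only demonstrate how the method resolution order decides which save() runs:

```python
class BaseDocument:
    """Stand-in for a Document model whose save() does not call super()."""

    def save(self):
        return ["BaseDocument.save"]


class SketchTranscriptionMixin:
    """Stand-in mixin that wraps save() with an extra step."""

    def save(self):
        calls = super().save()
        calls.append("SketchTranscriptionMixin.save")
        return calls


class CustomizedDocument(SketchTranscriptionMixin, BaseDocument):
    """Mixin first: its save() runs and wraps the base save()."""


class WrongOrder(BaseDocument, SketchTranscriptionMixin):
    """Base first: the mixin's save() is never reached."""
```

With the mixin listed first, CustomizedDocument().save() goes through both methods. With the order reversed, only BaseDocument.save() runs, so the extra step in the mixin is silently skipped.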

Tests

To run the tests, check out this repository and run:

make test

Coverage

A coverage report will be generated in ./coverage_html_report/.

Contributors

  • Karl Hobley
  • Bertrand Bordage
  • Kees Hink
  • Tom Hendrikx
  • Coen van der Kamp
  • Mike Overkamp
  • Thibaud Colas
  • Dan Braghis
  • Dan Swain

wagtail_textract's Issues

Order of installed apps

In the readme it says:

Add to your Django INSTALLED_APPS, after wagtail.documents.

Why after? From the Django docs:

When several applications provide different versions of the same resource (template, static file, management command, translation), the application listed first in INSTALLED_APPS has precedence.

Or is this order not relevant for Python code?

How to make wagtail_textract work with Media Storage Backend

Hi there,

I am trying to use wagtail_textract for my project. I tried previously using just textract but am interested in some of the helper utilities of wagtail_textract. I am wondering how wagtail_textract will work in production with Docker and a Media Storage backend, such as Azure.

The line here is referencing a file path:
text = textract.process(document.file.path).strip()

but when in production using a Media Storage backend, it seems like this will fail because it does not have a proper file system. Has this been tested or does anybody know how I might be able to get this to work? Any help would be much appreciated! Let me know if you need any more info about my project setup.
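
One possible workaround, sketched here under assumptions and not tested against any particular storage backend, is to copy the stored file into a local temporary file and pass that path to the extractor. The helper name extract_via_tempfile is hypothetical. In a real project, fileobj might be the document's file opened through the storage API and process might be textract.process. The original file extension is kept on the temporary file because textract chooses its parser by extension.

```python
import os
import tempfile


def extract_via_tempfile(fileobj, name, process):
    """Copy `fileobj` to a local temp file and run `process` on its path.

    `fileobj` is any binary file-like object, `name` the original file name
    (used only for its extension), `process` a callable taking a file path.
    """
    suffix = os.path.splitext(name)[1]
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        # Copy in chunks so large documents are not held in memory at once.
        for chunk in iter(lambda: fileobj.read(64 * 1024), b""):
            tmp.write(chunk)
        tmp_path = tmp.name
    try:
        return process(tmp_path)
    finally:
        # Always clean up the local copy, even if extraction fails.
        os.remove(tmp_path)
```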

Textract dependency issue; Wagtail version dependency

I’m working to set up Wagtail Textract. I use pipenv and was getting package mismatch errors due to Textract on PyPI not being updated with the latest repo from https://github.com/deanmalmgren/textract (there was a chardet dependency error). However, @deanmalmgren’s repo DOES have an updated chardet dependency (3.0.4, the latest at this point), so I was able to get around all but one of the errors by installing directly from the repo:
pip install git+https://github.com/deanmalmgren/textract.git --upgrade

One remaining error (I’m at the latest Wagtail, 2.4):

wagtail-textract 1.0 has requirement wagtail<2.2,>=2, but you'll have wagtail 2.4 which is incompatible.

Would you be willing to remove the wagtail<2.2 dependency? If not, I could do a little testing for you by forking and removing that dependency and installing from my fork, but my testing wouldn’t be extensive. I would have around a hundred documents that I could run the transcription command on, but none of them would require OCR.

I would be willing to propose a re-write of your installation instructions based on the above (you could likely get rid of having to mention the statements about incompatibility errors).

Store OCR'ed data in pdf

I have OCR'ed my first set of documents with the fallback to Tesseract. It worked very well. In order for this to be most useful, OCR'ed text should be saved not only to the database but also back into the pdf. That way a user can do a Ctrl+F to find text within the document when viewing it. Have you thought about implementing this functionality?

Not compatible with current Wagtail.

ERROR: wagtail-textract 1.2 has requirement wagtail<2.6,>=2, but you'll have wagtail 2.9.3 which is incompatible.
ERROR: textract 1.6.3 has requirement beautifulsoup4==4.8.0, but you'll have beautifulsoup4 4.8.2 which is incompatible.

I'm aware this is a different project from Textract. But, the problem is related. So I figured I would include that error as well.

Lots of database connections created by transcribe_documents?

Just had a quick browse of the code and noticed that it uses asyncio to create background threads which fetch/extract text from documents.

Is it likely that Django would start handling the next request before the background thread has finished running? Because if the same database connection is used by both the text extraction and the new request at the same time, this could cause issues as database connections are not thread safe.

EDIT: looks like Django has this covered: https://github.com/django/django/blob/master/django/db/utils.py#L142

This might cause another issue: asyncio uses a thread pool of 5 * num_cpus by default, which might create too many connections for some users (e.g. on shared hosting). Maybe we should add a "concurrency" parameter to the "transcribe_documents" command that lets the user limit the number of worker threads? (You can specify this through the executor you pass to run_in_executor.)
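
A rough sketch of what such a limit could look like, assuming a hypothetical concurrency value supplied by the user: a ThreadPoolExecutor with max_workers set is passed to run_in_executor instead of the default pool. fake_transcribe stands in for the real per-document work. None of this is existing package code.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def fake_transcribe(doc_id):
    """Stand-in for transcribing one document."""
    return doc_id, "text of document {}".format(doc_id)


async def transcribe_all(doc_ids, concurrency):
    """Transcribe all documents using at most `concurrency` worker threads."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [
            loop.run_in_executor(pool, fake_transcribe, doc_id)
            for doc_id in doc_ids
        ]
        # gather() returns results in the same order as doc_ids.
        return await asyncio.gather(*futures)
```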

Textract runs when document file hasn't updated

Currently, it appears there's no check for whether the file has actually changed before rerunning textract, so it probably reruns even if the user has only updated the title.

@gasman and I were discussing adding file hashing to Wagtail Images/Documents for cache-busting, which might help solve this issue too.
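
The hashing idea could be sketched like this. The function names and the notion of a stored hash are assumptions for illustration; neither is part of the package or of Wagtail as described in this thread:

```python
import hashlib


def file_hash(fileobj):
    """Return the SHA-256 hex digest of a binary file-like object."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: fileobj.read(64 * 1024), b""):
        digest.update(chunk)
    return digest.hexdigest()


def needs_transcription(fileobj, stored_hash):
    """True if the file content differs from the hash recorded at the last run."""
    return file_hash(fileobj) != stored_hash
```

With a check like this, a save that only changes the title would leave the hash unchanged, and the text extraction could be skipped.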
