fourdigits / wagtail_textract


Text extraction for Wagtail document search

License: BSD 3-Clause "New" or "Revised" License

Python 87.16% Shell 10.72% Makefile 2.11%

wagtail textract text-extraction tesseract django search

wagtail_textract's Introduction

⚠️ Deprecation warning

This package is unmaintained, and we have no plans to maintain it.

We advise you to use it as an example, maybe copy the code into your own project, but don't install the package.

Text extraction for Wagtail document search

This package is for replacing Wagtail's Document class with one that allows searching in Document file contents using textract.

Textract can extract text from (among others) PDF, Excel and Word files.

The package was inspired by the "Search: Extract text from documents" issue in Wagtail.

Documents will work as before, except that Document search in Wagtail's admin interface will also find search terms in the files' contents.

Some screenshots to illustrate.

In our fresh Wagtail site with wagtail_textract installed, we uploaded a file called test_document.pdf with handwritten text in it. It is listed in the admin interface under Documents:

If we now search in Documents for the word correct, which is one of the handwritten words, the live search finds it:

The assumption is that this search should not only be available in Wagtail's admin interface, but also in a public-facing search view, for which we provide a code example.

Requirements

Wagtail 2 (see tox.ini)
The Textract dependencies

Maturity

We have been using this package in production since August 2018 on https://nuffic.nl.

Installation

Install the Textract dependencies
Add wagtail_textract to your requirements and/or pip install wagtail_textract
Add wagtail_textract to your Django INSTALLED_APPS.
Put WAGTAILDOCS_DOCUMENT_MODEL = "wagtail_textract.document" in your Django settings.
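
As a minimal sketch, the relevant part of a settings module might look like the fragment below. The surrounding app list is abbreviated, and placing wagtail_textract after wagtail.documents is an assumption about a typical Wagtail 2 project. Only the wagtail_textract entry and the WAGTAILDOCS_DOCUMENT_MODEL value come from the steps above.

```python
# settings.py (fragment): a sketch, not a complete settings module.
INSTALLED_APPS = [
    # ... your other Django and Wagtail apps go here ...
    "wagtail.documents",
    "wagtail_textract",  # assumption: listed after wagtail.documents
]

# Make Wagtail use the Document model provided by wagtail_textract.
WAGTAILDOCS_DOCUMENT_MODEL = "wagtail_textract.document"
```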

Note: You'll get an incompatibility warning during installation of wagtail_textract (Wagtail 2.0.1 installed):

requests 2.18.4 has requirement chardet<3.1.0,>=3.0.2, but you'll have chardet 2.3.0 which is incompatible.
textract 1.6.1 has requirement beautifulsoup4==4.5.3, but you'll have beautifulsoup4 4.6.0 which is incompatible.

We haven't seen this leading to problems, but it's something to keep in mind.

Tesseract

Textract falls back to Tesseract when regular text extraction finds no text. For this to work, you need to add the data files that Tesseract bases its word matching on.

Create a tessdata directory in your project directory, and download the languages you want.
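
A minimal sketch of that download step, using only the Python standard library. It assumes the language files are fetched from the tesseract-ocr/tessdata repository on GitHub and that the repository's default branch is named main. The helper names traineddata_url and download_languages are hypothetical and not part of wagtail_textract.

```python
import os
import urllib.request

# Assumption: trained data files are served from the tesseract-ocr/tessdata repository.
TESSDATA_BASE_URL = "https://github.com/tesseract-ocr/tessdata/raw/main"


def traineddata_url(language):
    """Return the assumed download URL for a language code such as 'eng'."""
    return "{}/{}.traineddata".format(TESSDATA_BASE_URL, language)


def download_languages(languages, target_dir="tessdata"):
    """Download one .traineddata file per language code into target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    for language in languages:
        destination = os.path.join(target_dir, language + ".traineddata")
        urllib.request.urlretrieve(traineddata_url(language), destination)
        print("Downloaded", destination)
```

Calling download_languages(["eng", "nld"]) from your project directory would then fetch the English and Dutch files, provided the assumed URL layout holds.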

Transcribing

Transcription is done automatically after Document save, in an asyncio executor to prevent blocking the response during processing.
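
The general technique can be sketched as follows. This is not the package's actual code: blocking_transcribe is a hypothetical stand-in for the slow extraction call, and the sketch only shows how a blocking function is handed to an asyncio executor so that it runs in a worker thread.

```python
import asyncio
import time


def blocking_transcribe(filename):
    """Stand-in for a slow, blocking text-extraction call (hypothetical)."""
    time.sleep(0.1)
    return "text extracted from " + filename


async def save_and_transcribe(filename):
    loop = asyncio.get_running_loop()
    # Passing None selects the loop's default thread pool executor.
    future = loop.run_in_executor(None, blocking_transcribe, filename)
    # While the worker thread runs, the event loop is free for other tasks.
    return await future


text = asyncio.run(save_and_transcribe("test_document.pdf"))
print(text)
```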

To transcribe all existing Documents, run the management command:

./manage.py transcribe_documents

This may take a long time if you have many Documents.

Usage in custom view

Here is a code example for a search view (outside Wagtail's admin interface) that shows both Page and Document results.

from itertools import chain

from django.shortcuts import render
from wagtail.core.models import Page
from wagtail.documents.models import get_document_model
from wagtail.search.models import Query

# Resolve the Document model configured in WAGTAILDOCS_DOCUMENT_MODEL
Document = get_document_model()


def search(request):
    # Search
    search_query = request.GET.get('query', None)
    if search_query:
        page_results = Page.objects.live().search(search_query)
        document_results = Document.objects.search(search_query)
        # Combine both result sets into a single list for the template
        search_results = list(chain(page_results, document_results))

        # Log the query so Wagtail can suggest promoted results
        Query.get(search_query).add_hit()
    else:
        search_results = Page.objects.none()

    # Render template
    return render(request, 'website/search_results.html', {
        'search_query': search_query,
        'search_results': search_results,
    })

Your template should handle Documents differently from Pages, because you can't use pageurl on a Document:

{% if result.file %}
   <a href="{{ result.url }}">{{ result }}</a>
{% else %}
   <a href="{% pageurl result %}">{{ result }}</a>
{% endif %}

What if you already use a custom Document model?

In order to use wagtail_textract, your CustomizedDocument model should do the same as wagtail_textract's Document:

subclass TranscriptionMixin
alter search_fields

from wagtail.search import index
from wagtail_textract.models import TranscriptionMixin


class CustomizedDocument(TranscriptionMixin, ...):
    """Extra fields and methods for Document model."""
    # Extend the existing search fields with the transcription text
    search_fields = ... + [
        index.SearchField(
            'transcription',
            partial_match=False,
        ),
    ]

Note that the first class to subclass should be TranscriptionMixin, so its save() takes precedence over that of the other parent classes.
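
The effect of the base class order can be illustrated with plain Python classes. All class names below are hypothetical stand-ins, not the real Wagtail or wagtail_textract classes; the sketch assumes that the mixin's save() calls super().save() and that the base model's save() does not delegate further.

```python
class BaseDocumentSketch:
    """Hypothetical stand-in for the Document base class."""

    def __init__(self):
        self.calls = []

    def save(self, *args, **kwargs):
        # Does not call super().save(), so nothing after it in the MRO runs.
        self.calls.append("document saved")


class TranscriptionMixinSketch:
    """Hypothetical stand-in for TranscriptionMixin."""

    def save(self, *args, **kwargs):
        # Let the next class in the MRO perform the real save first ...
        super().save(*args, **kwargs)
        # ... then trigger transcription.
        self.calls.append("transcription scheduled")


class MixinFirst(TranscriptionMixinSketch, BaseDocumentSketch):
    pass


class MixinLast(BaseDocumentSketch, TranscriptionMixinSketch):
    pass


good = MixinFirst()
good.save()
print(good.calls)  # ['document saved', 'transcription scheduled']

bad = MixinLast()
bad.save()
print(bad.calls)  # ['document saved']
```

With the mixin listed last, its save() is never reached, so in this sketch no transcription would be scheduled.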

Tests

To run tests, check out this repository and run:

make test

Coverage

A coverage report will be generated in ./coverage_html_report/.

Contributors

Karl Hobley
Bertrand Bordage
Kees Hink
Tom Hendrikx
Coen van der Kamp
Mike Overkamp
Thibaud Colas
Dan Braghis
Dan Swain

wagtail_textract's Issues

Order of installed apps

In the readme it says:

Add to your Django INSTALLED_APPS, after wagtail.documents.

Why after? From the Django docs:

When several applications provide different versions of the same resource (template, static file, management command, translation), the application listed first in INSTALLED_APPS has precedence.

Or is this order not relevant for Python code?

How to make wagtail_textract work with Media Storage Backend

Hi there,

I am trying to use wagtail_textract for my project. I tried previously using just textract but am interested in some of the helper utilities of wagtail_textract. I am wondering how wagtail_textract will work in production with Docker and a Media Storage backend, such as Azure.

The line here is referencing a file path:
text = textract.process(document.file.path).strip()

but when in production using a Media Storage backend, it seems like this will fail because it does not have a proper file system. Has this been tested or does anybody know how I might be able to get this to work? Any help would be much appreciated! Let me know if you need any more info about my project setup.
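
One possible workaround, sketched under assumptions and not tested against wagtail_textract or any particular storage backend: copy the stored file to a local temporary file and hand that path to the extractor. The helper names are hypothetical. In a Django project the file object would presumably come from something like document.file.open("rb"), and extract would be textract.process. The suffix is kept because path-based extractors often choose a parser from the file extension.

```python
import io
import os
import shutil
import tempfile


def extract_via_temp_file(file_obj, extract, suffix=""):
    """Copy file_obj to a local temp file and call extract(path) on it."""
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        shutil.copyfileobj(file_obj, tmp)
        temp_path = tmp.name
    try:
        return extract(temp_path)
    finally:
        # Always clean up the local copy, even if extraction fails.
        os.remove(temp_path)


def read_bytes(path):
    """Stand-in extractor (hypothetical); swap in the real extraction call."""
    with open(path, "rb") as handle:
        return handle.read()


remote_file = io.BytesIO(b"hello from remote storage")
print(extract_via_temp_file(remote_file, read_bytes, suffix=".txt"))
```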

Error when uploading a zero-byte file

When uploading a zero-byte file, one gets an error.

Textract dependency issue; Wagtail version dependency

I’m working to set up Wagtail Textract. I use pipenv and was getting package mismatch errors due to Textract on PyPI not being updated with the latest repo from https://github.com/deanmalmgren/textract (there was a chardet dependency error). However, @deanmalmgren ’s repo DOES have an updated chardet dependency (3.0.4, the latest at this point), so I was able to get around all but one of the errors by installing directly from the repo:
pip install git+https://github.com/deanmalmgren/textract.git --upgrade

One remaining error (I’m at the latest Wagtail, 2.4):

wagtail-textract 1.0 has requirement wagtail<2.2,>=2, but you'll have wagtail 2.4 which is incompatible.

Would you be willing to remove the wagtail<2.2 dependency? If not, I could do a little testing for you by forking and removing that dependency and installing from my fork, but my testing wouldn’t be extensive. I would have around a hundred documents that I could run the transcription command on, but none of them would require OCR.

I would be willing to propose a re-write of your installation instructions based on the above (you could likely get rid of having to mention the statements about incompatibility errors).

Store OCR'ed data in pdf

I have OCR'ed my first set of documents with the fallback to Tesseract. It worked very well. In order for this to be most useful, OCR'ed text should be saved not only to the database but also back into the pdf. That way a user can do a Ctrl+F to find text within the document when viewing it. Have you thought about implementing this functionality?

Not compatible with current Wagtail.

ERROR: wagtail-textract 1.2 has requirement wagtail<2.6,>=2, but you'll have wagtail 2.9.3 which is incompatible.
ERROR: textract 1.6.3 has requirement beautifulsoup4==4.8.0, but you'll have beautifulsoup4 4.8.2 which is incompatible.

I'm aware this is a different project from Textract. But, the problem is related. So I figured I would include that error as well.

Lots of database connections created by transcribe_documents?

Just had a quick browse of the code and noticed that it uses asyncio to create background threads which fetch/extract text from documents.

Is it likely that Django would start handling the next request before the background thread has finished running? Because if the same database connection is used by both the text extraction and the new request at the same time, this could cause issues as database connections are not thread safe.

EDIT: looks like Django has this covered: https://github.com/django/django/blob/master/django/db/utils.py#L142

This might cause another issue: asyncio uses a thread pool of 5 * num_cpus workers by default, which might create too many connections for some users (e.g. on shared hosting). Maybe we should add a "concurrency" parameter to the "transcribe_documents" command that lets the user limit the number of worker threads? (You can do this by passing a size-limited executor to run_in_executor.)
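
A rough sketch of how such a limit could work, assuming a ThreadPoolExecutor with max_workers is passed to run_in_executor. The transcribe function and the peak counter are hypothetical and only exist to make the limit observable; this is not the command's actual code.

```python
import asyncio
import threading
import time
from concurrent.futures import ThreadPoolExecutor

_lock = threading.Lock()
_active = 0
peak_active = 0


def transcribe(name):
    """Stand-in for transcribing one document (hypothetical)."""
    global _active, peak_active
    with _lock:
        _active += 1
        peak_active = max(peak_active, _active)
    time.sleep(0.05)  # simulate slow extraction
    with _lock:
        _active -= 1
    return "transcribed " + name


async def transcribe_all(names, concurrency):
    loop = asyncio.get_running_loop()
    # The pool size caps how many transcriptions run at the same time.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [loop.run_in_executor(pool, transcribe, name) for name in names]
        return await asyncio.gather(*futures)


names = ["a.pdf", "b.pdf", "c.pdf", "d.pdf"]
results = asyncio.run(transcribe_all(names, concurrency=2))
print(results, "peak threads:", peak_active)
```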

Textract runs when document file hasn't updated

Currently, there appears to be no check for whether the file has actually changed before rerunning textract, so it probably reruns even if the user has only updated the title.

@gasman and I were discussing adding file hashing to Wagtail Images/Documents for cache-busting, which might help solve this issue too.
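
The hashing idea could look roughly like this. The function names are hypothetical, and where the previous hash is stored (for example in a model field) is left open; the sketch only shows the comparison.

```python
import hashlib
import io


def file_hash(file_obj, chunk_size=64 * 1024):
    """Return the SHA-256 hex digest of a binary file-like object."""
    digest = hashlib.sha256()
    # Read in chunks so large documents are not loaded into memory at once.
    for chunk in iter(lambda: file_obj.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def needs_transcription(file_obj, stored_hash):
    """True when the file content differs from the previously stored hash."""
    return file_hash(file_obj) != stored_hash


stored = file_hash(io.BytesIO(b"first version"))
print(needs_transcription(io.BytesIO(b"first version"), stored))   # False
print(needs_transcription(io.BytesIO(b"second version"), stored))  # True
```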

Project name inconsistency

In README.md, the project is mentioned with the name wagtail_textrans.

I guess this has to do with our discussion in wagtail/wagtail#542 😉
