Comments (10)

aborroy commented on July 23, 2024

What ACS version are you using?

On my local deployment, when using "mput" to add files over the WebDAV protocol, folder rules are not fired at all.

Any additional details on how to reproduce this problem?

from alf-tengine-ocr.

jdelker commented on July 23, 2024

That's on ACS 7.0.0. Basically an "all-in-one" docker deployment, enhanced with OCR.

The use case described above is actually a multi-step process, in which docs are uploaded via WebDAV into a site-folder, converted from TIF to PDF, and finally OCRed with this t-engine. For this particular issue, I'm currently stripping that down to the simplest procedure that still shows the problem. I will provide the exact steps later on.

Just one thing, regarding that WebDAV upload:
There seems to be another issue with that, which causes the direct OCR processing of a WebDAV upload to fail.
If I take a "sample.pdf" file and drag&drop it into the site-folder, it gets uploaded and OCRed without problems. Uploading the same document via WebDAV to the same folder leaves an unprocessed document, together with an exception from the t-engine:

transform-ocr_1       |    INFO - Input file is not a PDF, checking if it is an image...
transform-ocr_1       |   ERROR - cannot identify image file '/tmp/Alfresco/source_2579956250969298231_tmp.pdf'
transform-ocr_1       |   ERROR - UnsupportedImageFormatError
transform-ocr_1       |
transform-ocr_1       |
transform-ocr_1       | org.alfresco.transform.exceptions.TransformException: Transformer exit code was not 0:
transform-ocr_1       |    INFO - Input file is not a PDF, checking if it is an image...
transform-ocr_1       |   ERROR - cannot identify image file '/tmp/Alfresco/source_2579956250969298231_tmp.pdf'
transform-ocr_1       |   ERROR - UnsupportedImageFormatError
transform-ocr_1       |
transform-ocr_1       | 	at org.alfresco.transformer.executors.AbstractCommandExecutor.run(AbstractCommandExecutor.java:59) ~[alfresco-transformer-base-2.5.3.jar!/:1.0.0]
transform-ocr_1       | 	at org.alfresco.transformer.executors.CommandExecutor.run(CommandExecutor.java:56) ~[alfresco-transformer-base-2.5.3.jar!/:1.0.0]
...

However, sample.pdf is certainly a PDF (running "file sample.pdf" reports: sample.pdf: PDF document, version 1.3).
But that's probably another story. I don't want to overcomplicate things for this issue and stick to the UI drag&drop for now. I'll provide some reproduction steps for that later on ...
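
To narrow down the WebDAV case, one option is to look at what the content handed to the t-engine actually starts with, since a PDF begins with the "%PDF-" marker. The following is a minimal Python sketch, assuming the uploaded content (or the temp file named in the log) can be copied to a local path. The function names and the 1024-byte scan window are illustrative assumptions, not part of Alfresco or ocrmypdf.

```python
PDF_MAGIC = b"%PDF-"


def read_header(path, length=1024):
    """Return the first `length` bytes of the file (fewer if the file is shorter)."""
    with open(path, "rb") as handle:
        return handle.read(length)


def looks_like_pdf(path, scan_bytes=1024):
    """Return True if the PDF header marker appears within the first `scan_bytes` bytes.

    The marker is normally at offset 0; the scan window is a lenient
    assumption to tolerate a few leading bytes.
    """
    return PDF_MAGIC in read_header(path, scan_bytes)
```

If looks_like_pdf() returns False for the file the t-engine received but True for the local sample.pdf, the content was altered or incomplete before it reached the transformer, and printing read_header(path, 16) shows what arrived instead.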

jdelker commented on July 23, 2024

Ok, here we go:

  1. Take an ACS 7.0.0 setup + your t-engine
  2. Create a new site-folder and add embed-metadata-action (as described in your notes).
  3. Prepare a bunch of test-documents on your local host (see sample.pdf). I duplicated that into sample1.pdf to sample10.pdf.
  4. Open the site-folder in a web browser (I used Firefox), where you can drag&drop files.
  5. Drag one of the test documents to the site-folder. Verify that your t-engine is fired and that the folder contains the OCRed document shortly after. (Note: I found that being able to select text in the displayed PDF document is the only reliable way to test this.)
  6. If that was successful so far, delete the document in the site-folder, and repeat step 5, dragging a chosen number of documents at once. Verify that a matching number of transformation actions are fired (number of parallel ocrmypdf/tesseract processes).
  7. Wait until all transformations have completed (no more ocrmypdf/tesseract processes).
  8. Check whether each uploaded document was OCRed.
  9. Notice that some (or possibly even all) documents were not OCRed.

Notes:
The particular behavior of this test may depend on the power of your system. I performed this on a rather low-powered host, where the transformations run in the range of minutes (when fired in parallel for multiple documents). The number of test documents may need to be increased to see the same problem on more powerful systems.

If there is anything I can set on my setup to increase logging, let me know.

aborroy commented on July 23, 2024

I guess the problem is directly related to resources.

I was able to OCR 10 sample.pdf files in a row using the following Docker Compose deployment:
docker-compose.zip

This is a 16 GB RAM based deployment.

What are the resources of your environment?

jdelker commented on July 23, 2024

Well, that Docker deployment runs on a 32 GB RAM host, but the host is shared with other things, so there is certainly less available.
But actually, I don't want to argue about resources here. In fact, I agree with you that this is most probably related to resources and the possibly longer execution times resulting from that.
The underlying problem may just show up much sooner with fewer resources available, but it still persists in general. I guess the critical level for your environment is simply further up. As uploaded documents are processed in parallel, it is just a matter of the number of documents uploaded at once.

The problem with this is that the failure to OCR a document remains undetected. This makes any archiving process very unreliable, because you fill your content store with potentially unsearchable documents.

So, I'm very interested in resolving this in a constructive and reliable way.
My assumption is that either the unlimited parallel processing is running into some resource limits, or that timeouts are encountered.
I would like to determine the root cause, and either find a way to limit the number of documents processed in parallel and/or possibly adjust timeouts.

Coming back to my previous question: Do you have hints for more verbose logging, to make the process more transparent?

aborroy commented on July 23, 2024

I was reading some code and I guess there are no useful logging details from the Transform AIO service.

The main problem is that Community uses synchronous transformation requests. This means that every transformation generates an HTTP transformation request to the target T-Engine (OCR in this use case). There is no pool or control setting to limit the rate of these requests.

An alternative would be to add some code to this project to process only a limited number of requests in parallel, but this could cause HTTP timeouts in the repository code.
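
The trade-off can be illustrated with a small simulation. This is a minimal Python sketch, not code from this project or from the repository: run_throttled and its parameters are hypothetical names, time.sleep stands in for the OCR work, and the caller's timeout is simplified to the time spent waiting for a free slot rather than the full HTTP request time.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_throttled(jobs, max_parallel, job_seconds, caller_timeout):
    """Simulate `jobs` simultaneous requests against an engine that
    processes at most `max_parallel` of them at a time.

    Returns a (completed, timed_out) tuple of counts.
    """
    slots = threading.BoundedSemaphore(max_parallel)

    def handle_request(_):
        # The caller gives up if no slot becomes free within its timeout.
        if not slots.acquire(timeout=caller_timeout):
            return False
        try:
            time.sleep(job_seconds)  # stand-in for the actual OCR work
            return True
        finally:
            slots.release()

    # One thread per request, mimicking unthrottled synchronous callers.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = list(pool.map(handle_request, range(jobs)))
    return results.count(True), results.count(False)
```

With four requests, two slots, and jobs that run longer than the callers are willing to wait, for example run_throttled(4, 2, 0.5, 0.1), the two requests that do not get a slot give up. Raising the timeout above the total queueing time lets all of them complete, at the cost of callers blocking for longer.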

jdelker commented on July 23, 2024

Does activemq help here?
Shouldn't that be capable of queuing the requests properly?
The transform-core-aio utilizes that also, doesn't it?

aborroy commented on July 23, 2024

For LocalTransformer (Community supported), ActiveMQ is not used.

jdelker commented on July 23, 2024

Ok, so must this be considered a general design problem then - at least for the Community Edition?
If the t-engine generally has to cope with an unknown number of parallel requests, it's just a matter of quantity before this collapses. Either the available resources are exhausted, or timeouts occur if requests are throttled (and therefore queued for processing) in the t-engine.

I'm currently experimenting with the latter, which already looks promising for my use case.
But still, I'm concerned that there is some timeout barrier I haven't reached yet, which will result in unprocessed documents again.

aborroy commented on July 23, 2024

Hi, I've been reviewing some code and I guess that you can control the number of transformations requested at once by using the Asynchronous check in the folder rule for OCR.

When using this mode, actions are executed using a Thread Pool controlled by the following properties:
https://github.com/Alfresco/alfresco-community-repo/blob/master/repository/src/main/resources/alfresco/repository.properties#L640
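
As a generic illustration of why a bounded thread pool caps the number of actions running at once, here is a minimal Python sketch. It is not Alfresco code and does not use the repository's property names: peak_concurrency, pool_size, and the sleep-based action are assumptions made for the example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def peak_concurrency(tasks, pool_size, task_seconds=0.05):
    """Run `tasks` dummy actions on a pool of `pool_size` threads and
    return the highest number of actions observed running at once."""
    lock = threading.Lock()
    running = 0
    peak = 0

    def action(_):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(task_seconds)  # stand-in for one asynchronous rule action
        with lock:
            running -= 1

    # Tasks beyond pool_size wait in the executor's queue until a thread is free.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        list(pool.map(action, range(tasks)))
    return peak
```

Submitting ten actions to a pool of three never shows more than three running together; the rest wait in the queue instead of all starting OCR at the same time.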
