aborroy / alf-tengine-ocr
Alfresco Transformer for ACS 7.0+ from PDF to OCRd PDF
License: GNU General Public License v2.0
Line 18 references "-SNAPSHOT", which doesn't appear to be valid as of today (my luck!). Removing that suffix and using "2.5.4" appears to work for me. (This is my first submitted issue, so be gentle.)
Error building alfresco/tengine-ocr on Windows 10 (WSL2 enabled DockerDesktop) environment
Environment Details:
wsl -l -v
NAME STATE VERSION
* docker-desktop-data Running 2
docker-desktop Running 2
docker -v
Docker version 20.10.20, build 9fdeb9c
docker-compose -v
Docker Compose version v2.12.1
java -version
java version "11.0.16.1" 2022-08-18 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.16.1+1-LTS-1)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.16.1+1-LTS-1, mixed mode)
mvn -v
Apache Maven 3.8.6 (84538c9988a25aec085021c365c560670ad80f63)
Maven home: C:\Abhinav\Softwares\Java\Maven\apache-maven-3.8.6
Java version: 11.0.16.1, vendor: Oracle Corporation, runtime: C:\Program Files\Java\jdk-11.0.16.1
Default locale: en_US, platform encoding: Cp1252
OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows"
---------------------------------------------------------------------------------------------------------
**Steps:**
1- Cloned "https://github.com/aborroy/alf-tengine-ocr.git"
2- Changed directory to : C:\Downloads\alf-tengine-ocr\ats-transformer-ocr
3- Started Docker Desktop (WSL2 enabled on Windows 10)
4- Executed `mvn clean install` command.
5- Build failed with error:
`[INFO] --- spring-boot-maven-plugin:2.5.4:repackage (repackage) @ ats-transformer-ocr ---
[INFO] Replacing main artifact with repackaged archive
[INFO]
[INFO] --- spring-boot-maven-plugin:2.5.4:repackage (default) @ ats-transformer-ocr ---
[INFO] Replacing main artifact with repackaged archive
[INFO]
[INFO] --- docker-maven-plugin:0.34.1:build (build-image) @ ats-transformer-ocr ---
[INFO] Building tar: C:\Downloads\alf-tengine-ocr\ats-transformer-ocr\target\docker\alfresco\tengine-ocr\latest\tmp\docker-build.tar
[INFO] DOCKER> [alfresco/tengine-ocr:latest]: Created docker-build.tar in 559 milliseconds
[ERROR] DOCKER> Unable to build image [alfresco/tengine-ocr:latest] : "The command '/bin/sh -c set -eux; ARCH=\"$(dpkg --print-architecture)\"; case \"${ARCH}\" in armhf) ESUM='c6b1fda3f8807028cbfcc34a4ded2e8a5a6b6239d2bcc1f06673ea6b1530df94'; BINARY_URL='https://github.com/AdoptOpenJDK/openjdk11-binaries/releases/download/jdk-11.0.5%2B10/OpenJDK11U-jdk_arm_linux_hotspot_11.0.5_10.tar.gz'; ;; ppc64el|ppc64le) ESUM='d763481ddc29ac0bdefb24216b3a0bf9afbb058552682567a075f9c0f7da5814'; BINARY_URL='https://github.com/AdoptOpenJDK/openjdk11-binaries/releases/download/jdk-11.0.5%2B10/OpenJDK11U-jdk_ppc64le_linux_hotspot_11.0.5_10.tar.gz'; ;; amd64|x86_64) ESUM='6dd0c9c8a740e6c19149e98034fba8e368fd9aa16ab417aa636854d40db1a161'; BINARY_URL='https://github.com/AdoptOpenJDK/openjdk11-binaries/releases/download/jdk-11.0.5%2B10/OpenJDK11U-jdk_x64_linux_hotspot_11.0.5_10.tar.gz'; ;; *) echo \"Unsupported arch: ${ARCH}\"; exit 1; ;; esac; curl -LfsSo /tmp/openjdk.tar.gz ${BINARY_URL}; echo \"${ESUM} */tmp/openjdk.tar.gz\" | sha256sum -c -; mkdir -p /opt/java/openjdk; cd /opt/java/openjdk; tar -xf /tmp/openjdk.tar.gz --strip-components=1; rm -rf /tmp/openjdk.tar.gz;' returned a non-zero code: 35" ["The command '/bin/sh -c set -eux; ARCH=\"$(dpkg --print-architecture)\"; case \"${ARCH}\" in armhf) ESUM='c6b1fda3f8807028cbfcc34a4ded2e8a5a6b6239d2bcc1f06673ea6b1530df94'; BINARY_URL='https://github.com/AdoptOpenJDK/openjdk11-binaries/releases/download/jdk-11.0.5%2B10/OpenJDK11U-jdk_arm_linux_hotspot_11.0.5_10.tar.gz'; ;; ppc64el|ppc64le) ESUM='d763481ddc29ac0bdefb24216b3a0bf9afbb058552682567a075f9c0f7da5814'; BINARY_URL='https://github.com/AdoptOpenJDK/openjdk11-binaries/releases/download/jdk-11.0.5%2B10/OpenJDK11U-jdk_ppc64le_linux_hotspot_11.0.5_10.tar.gz'; ;; amd64|x86_64) ESUM='6dd0c9c8a740e6c19149e98034fba8e368fd9aa16ab417aa636854d40db1a161'; 
BINARY_URL='https://github.com/AdoptOpenJDK/openjdk11-binaries/releases/download/jdk-11.0.5%2B10/OpenJDK11U-jdk_x64_linux_hotspot_11.0.5_10.tar.gz'; ;; *) echo \"Unsupported arch: ${ARCH}\"; exit 1; ;; esac; curl -LfsSo /tmp/openjdk.tar.gz ${BINARY_URL}; echo \"${ESUM} */tmp/openjdk.tar.gz\" | sha256sum -c -; mkdir -p /opt/java/openjdk; cd /opt/java/openjdk; tar -xf /tmp/openjdk.tar.gz --strip-components=1; rm -rf /tmp/openjdk.tar.gz;' returned a non-zero code: 35" ]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 7.080 s
[INFO] Finished at: 2022-11-10T13:35:02-05:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal io.fabric8:docker-maven-plugin:0.34.1:build (build-image) on project ats-transformer-ocr: Unable to build image [alfresco/tengine-ocr:latest] : "The command '/bin/sh -c set -eux; ARCH=\"$(dpkg --print-architecture)\"; case \"${ARCH}\" in armhf) ESUM='c6b1fda3f8807028cbfcc34a4ded2e8a5a6b6239d2bcc1f06673ea6b1530df94'; BINARY_URL='https://github.com/AdoptOpenJDK/openjdk11-binaries/releases/download/jdk-11.0.5%2B10/OpenJDK11U-jdk_arm_linux_hotspot_11.0.5_10.tar.gz'; ;; ppc64el|ppc64le) ESUM='d763481ddc29ac0bdefb24216b3a0bf9afbb058552682567a075f9c0f7da5814'; BINARY_URL='https://github.com/AdoptOpenJDK/openjdk11-binaries/releases/download/jdk-11.0.5%2B10/OpenJDK11U-jdk_ppc64le_linux_hotspot_11.0.5_10.tar.gz'; ;; amd64|x86_64) ESUM='6dd0c9c8a740e6c19149e98034fba8e368fd9aa16ab417aa636854d40db1a161'; BINARY_URL='https://github.com/AdoptOpenJDK/openjdk11-binaries/releases/download/jdk-11.0.5%2B10/OpenJDK11U-jdk_x64_linux_hotspot_11.0.5_10.tar.gz'; ;; *) echo \"Unsupported arch: ${ARCH}\"; exit 1; ;; esac; curl -LfsSo /tmp/openjdk.tar.gz ${BINARY_URL}; echo \"${ESUM} */tmp/openjdk.tar.gz\" | sha256sum -c -; mkdir -p /opt/java/openjdk; cd /opt/java/openjdk; tar -xf /tmp/openjdk.tar.gz --strip-components=1; rm -rf /tmp/openjdk.tar.gz;' returned a non-zero code: 35" -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException`
It's not really an issue, but could you please point me to some documentation on how to use a Share rule to convert a PDF to an OCRed PDF?
I only found the REST call to create a rendition, but no really useful way to integrate this.
My scenario is that every file uploaded to Alfresco should automatically be OCRed and become full-text searchable. I can't figure out how to do this.
Any help is appreciated.
Regards,
Stefan
Does this support the Khmer language? If not, can it be configured? If yes, how?
Hi there Angel,
I'm trying to rebuild the Docker image to include Portuguese support for alf-tengine-ocr, as it's not enabled by default. When running mvn clean package in the ats-transformer-ocr folder, I get the following error:
[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] /tmp/1/alf-tengine-ocr/ats-transformer-ocr/src/main/java/org/alfresco/transformer/executors/OcrmypdfCommandExecutor.java:[29,44] cannot access org.alfresco.transformer.util.RequestParamMap
bad class file: /root/.m2/repository/org/alfresco/alfresco-transformer-base/2.5.4/alfresco-transformer-base-2.5.4.jar(org/alfresco/transformer/util/RequestParamMap.class)
class file has wrong version 55.0, should be 52.0
Please remove or make sure it appears in the correct subdirectory of the classpath.
[INFO] 1 error
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 02:04 min
[INFO] Finished at: 2024-01-29T09:19:24-03:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.1:compile (default-compile) on project ats-transformer-ocr: Compilation failure
[ERROR] /tmp/1/alf-tengine-ocr/ats-transformer-ocr/src/main/java/org/alfresco/transformer/executors/OcrmypdfCommandExecutor.java:[29,44] cannot access org.alfresco.transformer.util.RequestParamMap
[ERROR] bad class file: /root/.m2/repository/org/alfresco/alfresco-transformer-base/2.5.4/alfresco-transformer-base-2.5.4.jar(org/alfresco/transformer/util/RequestParamMap.class)
[ERROR] class file has wrong version 55.0, should be 52.0
[ERROR] Please remove or make sure it appears in the correct subdirectory of the classpath.
[ERROR]
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
Can you please advise? Also, is it possible to publish a Docker image on Docker Hub with English + Portuguese + French + Spanish?
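A note on the compilation error above: class file version 55.0 corresponds to Java 11 and 52.0 to Java 8, so the message suggests that the alfresco-transformer-base 2.5.4 jar was compiled for Java 11 while Maven is running with a Java 8 compiler. Running the build with JDK 11 or later is the likely fix, though that is an inference from the log rather than something confirmed in this thread. The sketch below reads the major version from a class file header and maps it to a Java release; the class and method names are hypothetical and used for illustration only.

```java
import java.nio.ByteBuffer;

public class ClassVersionCheck {

    /** Returns the major version stored in bytes 6-7 of a class file header. */
    static int majorVersion(byte[] classBytes) {
        if (classBytes.length < 8) {
            throw new IllegalArgumentException("Too short to be a class file");
        }
        ByteBuffer header = ByteBuffer.wrap(classBytes); // big-endian by default
        if (header.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("Missing 0xCAFEBABE magic number");
        }
        header.getShort();                 // minor version, ignored here
        return header.getShort() & 0xFFFF; // major version, read as unsigned
    }

    /** Maps a major version to a Java release (holds for Java 5 and later). */
    static int javaRelease(int major) {
        return major - 44;
    }

    public static void main(String[] args) {
        // Hand-built headers only: magic number, minor version 0, major version.
        byte[] java11Header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 55};
        byte[] java8Header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 52};
        System.out.println("55 -> Java " + javaRelease(majorVersion(java11Header)));
        System.out.println("52 -> Java " + javaRelease(majorVersion(java8Header)));
    }
}
```

To check a real class, extract it from the jar and pass the result of `Files.readAllBytes(...)` to `majorVersion`. Comparing that release with the output of `mvn -v` shows whether the JDK used by Maven is older than the dependency requires.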
I know this is probably not the best way to ask this question, but I can't find another place where to ask.
What would be the procedure (if possible) for migrating/upgrading from Alfresco 6.2 with simple-ocr (Docker) to Alfresco 7.1 with the Transform Engine OCR (Docker)?
I've successfully backed up my data from 6.2 and restored it to the 7.1 installation, and almost everything works as expected.
But in some cases, for instance when trying to access the User folder from ACA, there is an error that I've tracked down to:
{"error":{"errorKey":"framework.exception.ApiDefault","statusCode":500,"briefSummary":"A namespace prefix is not registered for uri http://www.keensoft.es/model/content/ocr/1.0","stackTrace":"For security reasons the stack trace is no longer displayed, but the property is kept for previous versions","descriptionURL":"https://api-explorer.alfresco.com","logId":"ae148681-aa5b-4cf3-b33d-a46201617545"}}
I guess my documents still reference the namespace of the previous module, which is no longer registered.
What should be done to fix the situation?
Hey @aborroy - first off: thanks for the OCR transformer, it looks nice and lean!
I'm struggling with a migration from a hand-rolled OCR pipeline with Alfresco 5.0 (CE) to your OCR transformer with Alfresco 7.4 (CE). The direct integration as a folder rule would be much simpler. My setup works to the point that I can upload the quick.pdf from this repo and the OCR magic (new document version) works as expected. That's great!
Here's my problem: when I upload a real PDF file (426 KB, one page, PDF version 1.4), no new document version is ever created. My guess is that the issue is caused by resource limits. I've experimented with file size, and I think it's more related to the execution time: a bigger file (508 KB, one page, PDF version 1.4) sometimes results in a new document version, but not always. I'm pretty sure it's not the file size, as the OCR transformer does not configure maxSourceSizeBytes, which defaults to -1 (no limit) according to the docs.
Here are some screenshots:
I searched for transformer timeouts and configured on the repository the following settings:
-Dtransformer.timeout.default=300
-Dtransformserver.transformationTimeout=300
-Dcontent.transformer.default.timeoutMs=300000
but this does not change the situation. Unfortunately, I was not able to figure out where transformOptions.get(TIMEOUT) comes from or how to set it properly.
While digging into this, I noticed that when the execution time is less than 5 seconds, the new document version is created. I didn't find any defaults for the timeout in transformOptions.
Maybe you could give me a hint? :)
Hey,
we noticed that the tag 1.0.0-deu-fra-spa-ita does not support the amd64 architecture. The tag 1.0.0 does support it.
Was there an error when deploying the latest changes?
I have configured tengine-ocr as described in the documentation, with this code, but it still does not appear in the dropdown box when we try to configure it in Alfresco Share.
Have you faced this issue?
Hey @aborroy,
I noticed problems with processing multiple documents at once. Some documents are OCRed, some are not.
What I did & observed
Conclusions/Assumptions:
I wonder: