Comments (6)
The usage information printed after the error message lists the available transformations. Currently there are only alto2.0/alto2.1 to hocr transformations. Try changing the namespace of your file:
+<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v3# http://www.loc.gov/alto/v3/alto-3-0.xsd">
-<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v3# http://www.loc.gov/alto/v3/alto-3-0.xsd">
and then use
ocr-transform alto2.0 hocr < alto.xml
I will create an issue upstream to also support other ALTO versions in the transformations.
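If many files need the namespace workaround described above, it could be scripted. The following is only a minimal sketch, not part of ocr-fileformat: the function name `downgrade_alto_namespace` is hypothetical, and it assumes the root element declares the ALTO v3 namespace as the default `xmlns` exactly as in the diff above. Like the diff, it leaves `xsi:schemaLocation` untouched.

```python
# Hypothetical helper, not part of ocr-fileformat.
ALTO_V2_NS = "http://www.loc.gov/standards/alto/ns-v2#"
ALTO_V3_NS = "http://www.loc.gov/standards/alto/ns-v3#"


def downgrade_alto_namespace(xml_text: str) -> str:
    """Rewrite the default ALTO v3 namespace declaration to the v2 one.

    Only the first xmlns="..." declaration is replaced; other attributes
    such as xsi:schemaLocation are left as they are.
    """
    old = f'xmlns="{ALTO_V3_NS}"'
    new = f'xmlns="{ALTO_V2_NS}"'
    return xml_text.replace(old, new, 1)


# Small demonstration on an inline root element.
sample = f'<alto xmlns="{ALTO_V3_NS}"></alto>'
print(downgrade_alto_namespace(sample))
```

The rewritten text can then be saved as alto.xml and passed to the ocr-transform command shown above. This is a plain string replacement, so it changes nothing if the namespace is declared with a prefix or with single quotes.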
from ocr-fileformat.
ALTO v3 & v4 should be supported now.
from ocr-fileformat.
Have you tried https://github.com/filak/hOCR-to-ALTO/blob/master/alto__hocr.xsl ?
from ocr-fileformat.
I have seen the changes because they broke our test case in one PR. Therefore we switched to a fixed commit instead of always using the newest version. But this is still on my to-do list. Maybe I can do this now...
The different alto*__hocr files look almost identical. Therefore I would instead try to write a more generalized transformation, alto__hocr, that is applicable to ALTO files of different versions. Should I try to do that as a PR?
We would then need to change some things here in order to integrate the new file names, but that can be done afterwards.
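To illustrate why one generalized transformation is plausible, here is a rough Python sketch (not XSLT, and not the actual alto__hocr stylesheet). It assumes the ALTO versions differ mainly in their namespace URI (`ns-v2#`, `ns-v3#`, `ns-v4#`), so elements can be matched by local name regardless of version. The function names `alto_version` and `string_contents` are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Pattern for the namespaced root tag as ElementTree reports it: "{uri}alto".
_ALTO_ROOT = re.compile(r"\{http://www\.loc\.gov/standards/alto/ns-v(\d+)#\}alto")


def alto_version(xml_text):
    """Return the ALTO major version taken from the root namespace, or None."""
    root = ET.fromstring(xml_text)
    match = _ALTO_ROOT.fullmatch(root.tag)
    return int(match.group(1)) if match else None


def string_contents(xml_text):
    """Collect CONTENT of every String element, ignoring the namespace."""
    root = ET.fromstring(xml_text)
    words = []
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "String" and element.get("CONTENT") is not None:
            words.append(element.get("CONTENT"))
    return words
```

In XSLT the same idea would roughly correspond to matching on `local-name()` instead of a fixed namespace prefix, though whether that covers all differences between ALTO versions is not confirmed here.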
from ocr-fileformat.
I can confirm that the transformation of @jtlz2's file now works with the latest version: https://github.com/filak/hOCR-to-ALTO/blob/master/alto__hocr.xsl
from ocr-fileformat.
Yes, thank you. This is updated in e0d9250, which is now part of v0.3.0.
from ocr-fileformat.