Comments (6)
Hello JBaiter,
It sounds like the namespaces are declared, but the other XHTML elements are inheriting the namespace because they do not have the XHTML namespace explicitly declared. It would be great if you could upload an example file so that I can confirm my theory. What program did you use to create the source file?
from ocr-fileformat.
The file comes from a partner's custom OCR engine. Unfortunately I don't think I'm allowed to share a sample file, but here's a small excerpt that should suffice to confirm your theory. The elements indeed don't explicitly state the namespace but inherit it from the root element:
<span class="ocr_line" title="bbox 394 1972 1419 2019;x_wconf 86">
<span class="ocrx_word" title="bbox 394 1972 447 2019;x_wconf 95">In</span> <span class="ocrx_word" title="bbox 459 1972 561 2019;x_wconf 95">Paris</span> <span class="ocrx_word" title="bbox 573 1972 656 2019;x_wconf 78">trägt</span> <span class="ocrx_word" title="bbox 669 1972 745 2019;x_wconf 95">man</span> <span class="ocrx_word" title="bbox 752 1972 1125 2019;x_wconf 61">Busen-Hemdknöpfchen</span> <span class="ocrx_word" title="bbox 1139 1972 1200 2019;x_wconf 95">mit</span> <span class="ocrx_word" title="bbox 1208 1972 1369 2019;x_wconf 95">Brillanten</span> <span class="ocrx_word" title="bbox 1376 1972 1419 2019;x_wconf 78">ein</span>
</span>
The XSLT scripts match on local-name only, i.e. non-namespaced, cf. https://github.com/filak/hOCR-to-ALTO/blob/master/hocr2alto2.1.xsl. I think I ran into this before filak/hOCR-to-ALTO@9f8026c.
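To illustrate the namespace inheritance discussed above, and why matching on the local name sidesteps it, here is a minimal Python sketch using only the standard library. The wrapping `html`/`body` document and the `local_name` helper are assumptions made up for illustration; they are not taken from the partner's file or from the hOCR-to-ALTO stylesheets, which do the equivalent in XSLT.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical minimal hOCR document: only the root element declares the
# XHTML namespace, so every descendant inherits it implicitly.
HOCR = f"""<html xmlns="{XHTML_NS}">
  <body>
    <span class="ocr_line" title="bbox 394 1972 1419 2019;x_wconf 86">
      <span class="ocrx_word" title="bbox 394 1972 447 2019;x_wconf 95">In</span>
      <span class="ocrx_word" title="bbox 459 1972 561 2019;x_wconf 95">Paris</span>
    </span>
  </body>
</html>"""


def local_name(tag: str) -> str:
    """Strip ElementTree's '{namespace}' prefix, similar to XPath local-name()."""
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring(HOCR)

# A non-namespaced lookup finds nothing, because each <span> is in the
# XHTML namespace inherited from the root.
plain_spans = root.findall(".//span")

# A namespace-qualified lookup finds the line span and both word spans.
qualified_spans = root.findall(f".//{{{XHTML_NS}}}span")

# Matching on the local name works whether or not a namespace is present.
words = [
    el.text
    for el in root.iter()
    if local_name(el.tag) == "span" and el.get("class") == "ocrx_word"
]

print(len(plain_spans), len(qualified_spans), words)
```

The same idea in XSLT would be a match on `*[local-name()='span']` rather than on a bare `span`, which is what "local-name only" refers to here.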
Sorry for the late response. I am still trying to fix the problem. Your code snippet runs without any problems on my local machine but not on the server, and it seems that I have another parsing issue.
For now, I encourage you to update the program and give it another try.
For the example file https://raw.githubusercontent.com/kba/ocr-fileformat-samples/master/samples/hocr/1.1/417576986_0013.hocr I can run hocr2alto2.0 and the result looks fine. However, when I run hocr2alto2.1 the result does not look right, and this only happens when I use the web GUI. Is this a bug on our side?
With the updated SAXON and updated hocr__alto scripts I can no longer reproduce this issue. The file I linked above works fine for all transformations in v0.3.0. @jbaiter Can you test your example files again now and let us know if there is still a problem? Otherwise I suggest closing this issue.