
Comments (6)

JKamlah commented on June 7, 2024

Hello JBaiter,
it sounds like the namespaces are declared, but the other XHTML elements inherit the namespace because they do not declare the XHTML namespace explicitly. It would be great if you could upload an example file so that I can confirm my theory. Which program did you use to create the source file?

from ocr-fileformat.

jbaiter commented on June 7, 2024

The file comes from a partner's custom OCR engine. Unfortunately, I don't think I'm allowed to share a sample file, but here's a small excerpt that should suffice to confirm your theory. The elements indeed don't explicitly state the namespace but inherit it from the root element:

<span class="ocr_line" title="bbox 394 1972 1419 2019;x_wconf 86">
  <span class="ocrx_word" title="bbox 394 1972 447 2019;x_wconf 95">In</span> <span class="ocrx_word" title="bbox 459 1972 561 2019;x_wconf 95">Paris</span> <span class="ocrx_word" title="bbox 573 1972 656 2019;x_wconf 78">tr&#228;gt</span> <span class="ocrx_word" title="bbox 669 1972 745 2019;x_wconf 95">man</span> <span class="ocrx_word" title="bbox 752 1972 1125 2019;x_wconf 61">Busen-Hemdkn&#246;pfchen</span> <span class="ocrx_word" title="bbox 1139 1972 1200 2019;x_wconf 95">mit</span> <span class="ocrx_word" title="bbox 1208 1972 1369 2019;x_wconf 95">Brillanten</span> <span class="ocrx_word" title="bbox 1376 1972 1419 2019;x_wconf 78">ein</span>
</span>
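
To illustrate the inheritance described above, here is a minimal Python sketch using only the standard library. The wrapping `html`/`body` elements and the shortened line are assumptions made for illustration; the real file's root element is not shown in this thread. It suggests why a lookup for a plain `span` finds nothing once the root declares the XHTML namespace:

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical minimal hOCR document: only the root declares the namespace,
# the span elements themselves carry no xmlns attribute.
HOCR = f"""<html xmlns="{XHTML_NS}">
  <body>
    <span class="ocr_line" title="bbox 394 1972 1419 2019;x_wconf 86">
      <span class="ocrx_word" title="bbox 394 1972 447 2019;x_wconf 95">In</span>
      <span class="ocrx_word" title="bbox 459 1972 561 2019;x_wconf 95">Paris</span>
    </span>
  </body>
</html>"""

root = ET.fromstring(HOCR)

# A non-namespaced lookup does not match, because every descendant
# inherited the default namespace from the root element.
plain_matches = root.findall(".//span")

# The same lookup with the namespace-qualified name matches all spans.
qualified_matches = root.findall(f".//{{{XHTML_NS}}}span")

# The parser reports the inherited namespace as part of each tag.
first_tag = qualified_matches[0].tag
```

Here `plain_matches` is empty while `qualified_matches` contains the line span and both word spans, which is consistent with the theory that the elements are namespaced even though they never say so themselves.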


kba commented on June 7, 2024

The XSLT scripts match on local-name only, without namespaces, cf. https://github.com/filak/hOCR-to-ALTO/blob/master/hocr2alto2.1.xsl. I think I ran into this before: filak/hOCR-to-ALTO@9f8026c.
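
As a rough analogue of matching on the local name only, the following Python sketch strips the namespace part of each tag before comparing. It is not the logic of the linked XSLT scripts; the helper `words_by_local_name` and both sample documents are hypothetical and only show the general idea that a local-name comparison works whether or not the input is namespaced:

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return the tag without its '{namespace}' prefix, if it has one."""
    return tag.rsplit("}", 1)[-1]


def words_by_local_name(xml_text):
    """Collect the text of all ocrx_word spans, ignoring namespaces."""
    root = ET.fromstring(xml_text)
    return [
        el.text
        for el in root.iter()
        if local_name(el.tag) == "span" and el.get("class") == "ocrx_word"
    ]


# Hypothetical inputs: the same content with and without a default namespace.
NAMESPACED = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<span class="ocrx_word">In</span> <span class="ocrx_word">Paris</span>'
    "</body></html>"
)
PLAIN = (
    "<html><body>"
    '<span class="ocrx_word">In</span> <span class="ocrx_word">Paris</span>'
    "</body></html>"
)

namespaced_words = words_by_local_name(NAMESPACED)
plain_words = words_by_local_name(PLAIN)
```

Both calls return the same list of words, so a transformation written this way would not depend on whether the OCR engine emits a namespace declaration.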


JKamlah commented on June 7, 2024

Sorry for the late response. I am still trying to fix the problem. Your code snippet runs without any problems on my local machine but not on the server, and it seems that I have another parsing issue.
For now, I encourage you to update the program and give it another try.


zuphilip commented on June 7, 2024

For the example file https://raw.githubusercontent.com/kba/ocr-fileformat-samples/master/samples/hocr/1.1/417576986_0013.hocr I can run hocr2alto2.0 and the result looks fine. However, when I run hocr2alto2.1, the result does not look right. This only happens when I use the web GUI, though. Is this a bug on our side?


zuphilip commented on June 7, 2024

With the updated SAXON and the updated hocr__alto scripts, I can no longer reproduce this issue. The file I linked above works fine for all transformations in v0.3.0. @jbaiter Can you test your example files again and let us know if there is still a problem? Otherwise I suggest closing this issue.

