papyri / xsugar

XSugar code for EpiDoc<->Leiden+ transformation

Ruby 26.01% Shell 0.18% HTML 58.77% Java 14.54% CSS 0.13% Dockerfile 0.05% Haml 0.32%

xsugar's Introduction

XSugar

Requirements

JRuby 9.1.17.0 - preferably managed with rbenv
Bundler - gem install bundler

Usage

This project contains JRuby libraries and wrappers for XSugar as well as grammars for converting between EpiDoc XML and Leiden+ (a Leiden-style plaintext markup). There is also a pure-Java standalone XSugar transformation servlet in src/standalone.

To convert between EpiDoc and Leiden+, the utility scripts xml2nonxml.rb and nonxml2xml.rb are provided.

To use them you can simply run:

./bin/xml2nonxml.rb < epidoc.xml > leiden.txt

or:

./bin/nonxml2xml.rb < leiden.txt > epidoc.xml
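
Because the two scripts are inverses of each other, they can be chained to check whether a document survives a round trip. The following is a minimal sketch, not part of the project: run_script and round_trip_stable? are hypothetical helpers, and exact string equality is an assumed (strict) criterion that may be too strict if the scripts change whitespace.

```ruby
require 'open3'

# Hypothetical helper: pipes +input+ through one of the bin/ scripts and
# returns its standard output. Raises if the script exits non-zero.
def run_script(script, input)
  output, status = Open3.capture2(script, stdin_data: input)
  raise "#{script} failed with status #{status.exitstatus}" unless status.success?
  output
end

# Returns true when XML -> Leiden+ -> XML reproduces the input exactly.
# Both converters are injectable, so the check can be tried without JRuby.
def round_trip_stable?(xml,
                       to_leiden: ->(s) { run_script('./bin/xml2nonxml.rb', s) },
                       to_xml:    ->(s) { run_script('./bin/nonxml2xml.rb', s) })
  to_xml.call(to_leiden.call(xml)) == xml
end
```

Called as round_trip_stable?(File.read('epidoc.xml')) from the project root, this would shell out to the two scripts shown above; passing other lambdas for to_leiden and to_xml swaps in a different converter.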

File Structure

bin/                     command-line scripts
    blackboard_agent.rb  blackboard XSugar transformer agent (run many)
    blackboard_server.rb blackboard XSugar transformer server (run one)
    coverage.sh          IDP2 grammar coverage script
    xml2nonxml.rb        command-line xml->non-xml RXSugar utility
    nonxml2xml.rb        command-line non-xml->xml RXSugar utility
epidoc.xsg               Leiden+ XSugar grammar
init.rb                  Rails plugin init script
lib/                     source code
    coverage/            classes for testing XSugar coverage
    standalone/          classes for warming up standalone server
    jruby_helper.rb      helper classes for invoking RXSugar from JRuby
    modules_jruby.rb     Java->Ruby module conversion for JRuby
    modules_rjb.rb       Java->Ruby module conversion for RJB
    rxsugar.rb           main Ruby XSugar wrapper class
    rxsugar_helper.rb    helper classes for using Ruby XSugar wrapper
    util_helper.rb       helper classes for command-line scripts
    xsugar-all.jar       compiled upstream XSugar JAR
src/                     Java source code
    standalone/          source code for standalone transformation server
    xsugar/              upstream XSugar source code
test/                    source code for unit testing
translation_epidoc.xsg   XSugar grammar for EpiDoc translations

Testing

The Ruby testing uses bundler for gem dependencies, so you should invoke rake with:

bundle exec rake

Upstream

The Java XSugar source is tracked on the xsugar-vendor branch. Customizations for this project live in xsugar-customizations, which is merged into master. To update to a new upstream version of XSugar, unpack it to src/xsugar on the xsugar-vendor branch and commit the changes. Then rebase the changes in xsugar-customizations onto the new xsugar-vendor, merge xsugar-customizations into master, rebuild lib/xsugar-all.jar (using the rake task java:xsugar:build), and commit. Do not make changes or customizations to the Java XSugar source on master; make them on xsugar-customizations so that the upstream merge/rebase/merge workflow stays straightforward.

xsugar's People

Contributors

Stargazers

Watchers

Forkers

dclp fmcc dchandekstark samosafuz

xsugar's Issues

Round trip sometimes moves accents onto angle brackets

This will introduce characters such as >ͅ or >́ (possibly others?)
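
One way to look for affected files might be to search for a closing angle bracket immediately followed by a combining mark. This is only a sketch: STRAY_MARK and stray_marks are hypothetical names, and matching the whole Unicode category M is an assumption that may flag more than the cases reported here.

```ruby
# Matches ">" immediately followed by a combining mark (Unicode category M),
# for example ">" + U+0345 or ">" + U+0301.
STRAY_MARK = />\p{M}/

# Returns [character offset of ">", code point of the mark] for each hit.
def stray_marks(xml)
  hits = []
  pos = 0
  while (i = xml.index(STRAY_MARK, pos))
    hits << [i, format('U+%04X', xml[i + 1].ord)]
    pos = i + 1
  end
  hits
end
```

An empty result means no bracket in the string is directly followed by a combining mark.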

`<unclear>` sometimes split to multiple `<unclear>`s

Example:

<unclear>πόλ</unclear>

Becomes:

<unclear>π</unclear><unclear>όλ</unclear>

Files already exhibiting this phenomenon (NB: these all appear to be with <unclear> containing <g type="…">):

DDB_EpiDoc_XML/p.brux.bawit/p.brux.bawit.45.xml
DDB_EpiDoc_XML/p.cair.masp/p.cair.masp.2/p.cair.masp.2.67240.xml
DDB_EpiDoc_XML/p.oxy/p.oxy.54/p.oxy.54.3756.xml
DDB_EpiDoc_XML/p.sijp/p.sijp.36.xml

Files which would exhibit this phenomenon with an idempotent Leiden+ transform:

DDB_EpiDoc_XML/cpr/cpr.22/cpr.22.48.xml
DDB_EpiDoc_XML/o.ashm/o.ashm.6.xml
DDB_EpiDoc_XML/o.bodl/o.bodl.2/o.bodl.2.1849.xml
DDB_EpiDoc_XML/o.petr.mus/o.petr.mus.556.xml
DDB_EpiDoc_XML/o.theb/o.theb.16.xml
DDB_EpiDoc_XML/p.flor/p.flor.1/p.flor.1.93dupl.xml
DDB_EpiDoc_XML/p.mich/p.mich.11/p.mich.11.603.xml
DDB_EpiDoc_XML/p.palaurib/p.palaurib.13.xml
DDB_EpiDoc_XML/p.wisc/p.wisc.1/p.wisc.1.1.xml
DDB_EpiDoc_XML/psi.congr.xxi/psi.congr.xxi.6.xml

Error parsing Leiden+

Some papyri have correct Leiden+ and correct XML, and display correctly on papyri.info. But when their Leiden+ is edited, saving produces an error at a position that was not edited. This has happened to me with several different papyri, which makes them impossible to edit (except through the workaround of editing the XML). This appears to be a Leiden+ <-> XML parsing error.

Steps to reproduce the error

Navigate to http://papyri.info/ddbdp/p.cair.masp;1;67109
Click on "open in editor"
Click on DDbDP > Leiden+
Click on "Save" (without making any changes to the Leiden+ !!!)
You will get an error on line 8: Διοσκόρου υ ἱ̣(¨)οPOSSIBLE ERRORῦ αὐτοῦ

`<gap reason="illegible" unit="character"/>.` gets round-tripped to `.<lb/>`

Example:

 <gap reason="illegible" quantity="4" unit="character"/>.

Becomes:

 .<lb n="4"/>

Latin "ca" followed by `<gap>` round-trips to `@precision="low"`

Example:

ca<gap reason="illegible" quantity="2" unit="character"/>

Becomes:

<gap reason="illegible" quantity="2" unit="character" precision="low"/>
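
A plausible reading of this behaviour is that plain text and gap notation collide once they are concatenated in Leiden+. The toy serializer below is not the XSugar grammar; it assumes that an illegible gap of N characters is written ".N" and that precision="low" is written with a "ca" prefix, and the Gap struct and to_leiden function are hypothetical.

```ruby
# Toy model only. Assumed Leiden+ notation:
#   ".N"   illegible gap of N characters
#   "ca.N" the same gap with precision="low"
Gap = Struct.new(:quantity, :precision)

# Serializes a list of strings and Gap values by naive concatenation.
def to_leiden(nodes)
  nodes.map do |node|
    next node unless node.is_a?(Gap)
    prefix = node.precision == 'low' ? 'ca' : ''
    "#{prefix}.#{node.quantity}"
  end.join
end

text_then_gap = ['ca', Gap.new(2, nil)]  # ca<gap quantity="2"/>
imprecise_gap = [Gap.new(2, 'low')]      # <gap quantity="2" precision="low"/>

# Both inputs serialize to the same string, so a parser cannot tell them apart.
puts to_leiden(text_then_gap) == to_leiden(imprecise_gap)
```

Under the same assumptions, the digits-around-a-gap issue reported further down follows the same pattern: '6', a one-character gap and '6' concatenate to "6.16".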

`<hi rend="supraline"></hi>` with terminal NFD combining accent gets mangled

Example:

<hi rend="supraline">μιᾷ</hi>

Becomes:

<hi rend="supraline">μιᾶ</hi>ͅ

Character codes:

1FB7  GREEK SMALL LETTER ALPHA WITH PERISPOMENI AND YPOGEGRAMMENI
1FB6  GREEK SMALL LETTER ALPHA WITH PERISPOMENI
0345  COMBINING GREEK YPOGEGRAMMENI
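
The relationship between these code points can be checked with Ruby's built-in String#unicode_normalize (Ruby 2.2 and later). This sketch only illustrates the Unicode decomposition; it does not claim to show where XSugar misplaces the mark, and code_points is a hypothetical helper.

```ruby
# Formats each code point of +str+ as U+XXXX.
def code_points(str)
  str.codepoints.map { |cp| format('U+%04X', cp) }
end

alpha = "\u1FB7"  # GREEK SMALL LETTER ALPHA WITH PERISPOMENI AND YPOGEGRAMMENI

# NFD splits the letter into a base alpha followed by two combining marks;
# the ypogegrammeni (U+0345) comes last.
nfd = alpha.unicode_normalize(:nfd)
puts code_points(nfd).join(' ')

# Recomposing the pair seen in the mangled output, U+1FB6 followed by
# U+0345, yields the original precomposed letter again.
recomposed = "\u1FB6\u0345".unicode_normalize(:nfc)
puts recomposed == alpha
```

In the mangled output the terminal U+0345 has ended up outside the closing </hi> tag, so the two halves can no longer be recomposed into U+1FB7.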

Leiden+ for figures malfunctioning in DCLP

In Leiden+, one normally uses # to generate the xml for a figure: e.g., #seal >> <figure><figDesc>seal</figDesc></figure>. Attempts to use this Leiden+ in DCLP, however, result in the POSSIBLE ERROR warning. It appears something is amiss with XSugar.

Combining <num> with <gap> and @rend="tick"

In the course of encoding some mathematical papyri, I've discovered a gap in XSugar that it would be nice to fill. If we are dealing with a multi-digit whole number or fraction and one of the digits is in a <gap>, it is not possible to add @rend="tick" to <num> without breaking XSugar.

So, for example, we might consider the following. This first scenario behaves as expected, with XSugar conversion from xml to Leiden+ and back.

κ[.] (= print)
<#κ[.1]=21-29#> (= Leiden+)
<num atLeast="21" atMost="29">κ<gap reason="lost" quantity="1" unit="character"/></num> (= xml)

The addition of a tick is problematic because XSugar does not recognize it in the vicinity of <gap>, as the xml indicates:

κ[.] ´ (= print)
<#κ[.1] '=21-29#> (= Leiden+)
<num atLeast="21" atMost="29">κ<gap reason="lost" quantity="1" unit="character"/> '</num> (= xml)

Of course, one can add @rend="tick" manually to the xml, but although this displays correctly in PN or in PE under 'preview', it breaks Leiden+, rendering the file broken for future editors.

This bug persists regardless of the position of <gap>; no matter which digit is lacunose, XSugar fails to process the tick properly.

JRuby 1.7.x compatibility

Currently, a lot of the standalone scripts and tasks may fail if run under JRuby 1.7.x instead of (EOL'd) 1.6.8.

The big difference between the two is that 1.7.x defaults to Ruby 1.9 compat mode, while 1.6.8 defaults to Ruby 1.8 compat. You can get JRuby 1.7.x to run in Ruby 1.8 compat mode with e.g. jruby -Xcompat.version=1.8 -S bundle exec [command goes here], which may fix certain things. But we should probably upgrade things so they can run fine in 1.9 mode (with an eventual eye to 2.0 mode and JRuby 9.x).

The big changes between 1.8 and 1.9 (for this project) are how character encoding is handled and some differences in REXML (which JRuby compat modes may not completely account for).
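
As a small illustration of the encoding difference: from Ruby 1.9 on, a String carries an encoding and counts characters, whereas Ruby 1.8 treated a String as a plain byte sequence. The sketch below runs on 1.9 and later and only approximates the 1.8 view by reinterpreting the bytes as binary; which parts of this project actually depend on the difference is not stated here.

```ruby
word = "\u03C0\u03CC\u03BB"  # three precomposed Greek letters, as in the <unclear> example

puts word.encoding   # a literal with \u escapes is UTF-8
puts word.length     # number of characters
puts word.bytesize   # number of bytes

# Reinterpreting the same bytes as binary approximates how Ruby 1.8 saw the
# string: every byte counts as one "character".
as_bytes = word.dup.force_encoding(Encoding::BINARY)
puts as_bytes.length
```

Code that indexes or slices strings by position therefore behaves differently for Greek text depending on the compat mode.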

Add Arabic support in Epidoc TEXTLANGLIST

See sosol/sosol#80

`<lb break="no">` inside `<supplied>` gets round-tripped to `.- `

Example:

<supplied reason="omitted">α<lb n="31" break="no"/>ζ</supplied>

Becomes:

<supplied reason="omitted">α31.- ζ</supplied>

From p.oxy/p.oxy.62/p.oxy.62.4340.xml. May also need to check this doesn't happen with other elements?

Travis tests failing

Related to #16. JRuby 1.6.8 doesn't install by default with RVM under Travis, causing the tests to fail.

Multi-line quotes break on conversion after split

If a <q> happens to span lines and the splitting process splits inside the quote, then Leiden+ conversion breaks. This may be a tough one to solve for the splitter, given that there's no way to tell a start quote from an end quote other than counting, and split chunks aren't aware of each other. It might be far easier to use "smart" quotes (“”) instead, because they balance properly, maybe even Guillemets («») in addition.

It's probably possible to work around this if smart quotes are undesirable, but tricky....
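
The difference between the two quote styles can be sketched as follows. This is not the splitter's code; smart_quote_imbalance is a hypothetical helper, and the example assumes that quotes are currently written with the straight " character.

```ruby
# With straight quotes, a chunk holding an opening quote and a chunk holding
# a closing quote look the same when counted in isolation.
first_chunk  = 'before "start of quote'
second_chunk = 'end of quote" after'
puts first_chunk.count('"') == second_chunk.count('"')

# With smart quotes, a left-to-right scan tells opening from closing, so each
# chunk can report what it is missing without seeing its neighbours.
# Returns [closers with no opener in this chunk, openers left unclosed].
def smart_quote_imbalance(chunk)
  depth = 0
  unopened = 0
  chunk.each_char do |ch|
    case ch
    when "\u201C"
      depth += 1
    when "\u201D"
      if depth.zero?
        unopened += 1
      else
        depth -= 1
      end
    end
  end
  [unopened, depth]
end
```

A result of [0, 1] marks a chunk whose quote continues into the next chunk, and [1, 0] marks a chunk that closes a quote opened earlier.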

JRuby 9.2.x+

Blocks #31

`<unclear>` with accents inside `<hi rend="supraline">` gets moved in front

Example:

<hi rend="supraline"><unclear>Ἡ</unclear></hi>

Becomes:

<unclear>Η</unclear><hi rend="supraline">̔</hi>

Example from sb/sb.20/sb.20.14085.xml. See also #8.

Translation Leiden not translating Leiden->XML for `<gap>`

e.g. [...] and ... do not get transformed into <gap reason='lost' extent='unknown' unit='character'/> and <gap reason='illegible' extent='unknown' unit='character'/> respectively.

https://github.com/papyri/xsugar/blame/master/translation_epidoc.xsg#L152-L157

XML->Leiden seems to work.

<unclear> within <hi> should work round-trip

E.g.:

 <hi rend="diaeresis"><unclear>ἱ</unclear></hi>

Gets transformed to:

ἱ̣(¨)

But fails when transforming back.

More complicated/alternate versions of this in: #4 #10 #15

Latin "lin" preceded by `<gap unit="character">` becomes `unit="line"`

Example:

<gap reason="illegible" quantity="2" unit="character"/>lin

Becomes:

<gap reason="illegible" quantity="2" unit="line"/>

`<gap reason="lost" extent="unknown" unit="line"/>` sometimes becomes `lost<gap reason="illegible" extent="unknown" unit="character"/>lin`

Example:

<gap reason="lost" extent="unknown" unit="line"/>

Becomes:

lost<gap reason="illegible" extent="unknown" unit="character"/>lin

Files already exhibiting this phenomenon:

DDB_EpiDoc_XML/cpr/cpr.4/cpr.4.43.xml
DDB_EpiDoc_XML/p.koeln/p.koeln.11/p.koeln.11.443.xml
DDB_EpiDoc_XML/stud.pal/stud.pal.3(2).1/stud.pal.3(2).1.55.xml

Files which would exhibit this phenomenon with an idempotent Leiden+ transform:

DDB_EpiDoc_XML/bgu/bgu.15/bgu.15.2476.xml
DDB_EpiDoc_XML/bgu/bgu.3/bgu.3.942.xml
DDB_EpiDoc_XML/cpr/cpr.1/cpr.1.83.xml
DDB_EpiDoc_XML/cpr/cpr.14/cpr.14.52.xml
DDB_EpiDoc_XML/cpr/cpr.15/cpr.15.27.xml
DDB_EpiDoc_XML/o.stras/o.stras.1/o.stras.1.645.xml
DDB_EpiDoc_XML/p.ashm/p.ashm.1/p.ashm.1.22.xml
DDB_EpiDoc_XML/p.bad/p.bad.2/p.bad.2.28.xml
DDB_EpiDoc_XML/p.fouad/p.fouad.50.xml
DDB_EpiDoc_XML/p.freib/p.freib.4/p.freib.4.65.xml
DDB_EpiDoc_XML/p.iand.zen/p.iand.zen.68.xml
DDB_EpiDoc_XML/p.oslo/p.oslo.3/p.oslo.3.129.xml
DDB_EpiDoc_XML/p.oxy/p.oxy.67/p.oxy.67.4615.xml
DDB_EpiDoc_XML/p.oxy/p.oxy.68/p.oxy.68.4689.xml
DDB_EpiDoc_XML/p.sijp/p.sijp.12c.xml
DDB_EpiDoc_XML/p.tebt/p.tebt.2/p.tebt.2.291.xml
DDB_EpiDoc_XML/sb/sb.1/sb.1.4870.xml
DDB_EpiDoc_XML/sb/sb.24/sb.24.16129.xml
DDB_EpiDoc_XML/sb/sb.24/sb.24.16220.xml
DDB_EpiDoc_XML/sb/sb.26/sb.26.16417.xml

Leiden+ desiderata

Notes from P3 meeting in Heidelberg:

Support for <note> (and maybe <wit>?) in apparatus.
Tables.

`<gap reason="illegible">` followed by numbers coerces numbers to quantity

Example:

6<gap reason="illegible" quantity="1" unit="character"/>6<unclear>VI</unclear>

Becomes:

6<gap reason="illegible" quantity="16" unit="character"/><unclear>VI</unclear>

`<unclear>` within `<hi rend="diaeresis">` within `<expan>` does not work in Leiden+

The following contains valid XML:

https://github.com/papyri/idp.data/blob/master/DDB_EpiDoc_XML/p.cair.masp/p.cair.masp.1/p.cair.masp.1.67006.xml#L306

<expan><hi rend="diaeresis"><unclear>ἰ</unclear></hi><unclear>ν</unclear>δικ<ex>τίονος</ex></expan>

In Leiden+ this becomes the following, which breaks when saved:

112. ( ἰ̣(¨)ν̣δικ(τίονος))

Without the underdot, the Leiden+ is fine:

112. ( ἰ(¨)ν̣δικ(τίονος))

The corresponding XML is also fine:

<expan><hi rend="diaeresis">ἰ</hi><unclear>ν</unclear>δικ<ex>τίονος</ex></expan>

One can, of course, save the correct XML directly:

<expan><hi rend="diaeresis"><unclear>ἰ</unclear></hi><unclear>ν</unclear>δικ<ex>τίονος</ex></expan>

I have done so and submitted it, but it would obviously be great if XSugar could handle the underdot.

`<hi rend="supraline"><unclear>ή</unclear></hi>` gets mangled by round-trip

Example:

<hi rend="supraline"><unclear>ή</unclear></hi>

Becomes:

<unclear>η</unclear><hi rend="supraline">́</hi>

`<milestone rend="paragraphos" unit="undefined"/>` gets round-tripped to `----`

Example:

<milestone rend="paragraphos" unit="undefined"/>

Becomes:

----

Files which would exhibit this phenomenon with an idempotent Leiden+ transform:

DDB_EpiDoc_XML/c.pap.gr/c.pap.gr.2.1/c.pap.gr.2.1.15.xml
DDB_EpiDoc_XML/p.oxy/p.oxy.2/p.oxy.2.261.xml
DDB_EpiDoc_XML/o.krok/o.krok.1/o.krok.1.89.xml

<milestone> "paragraphos" followed by "horizontal-rule" gets flipped by round-trip

Example:

<milestone rend="paragraphos" unit="undefined"/><milestone rend="horizontal-rule" unit="undefined"/>

Becomes:

<milestone rend="horizontal-rule" unit="undefined"/><milestone rend="paragraphos" unit="undefined"/>
