

XSugar


Requirements

Usage

This project contains JRuby libraries and wrappers for XSugar as well as grammars for converting between EpiDoc XML and Leiden+ (a Leiden-style plaintext markup). There is also a pure-Java standalone XSugar transformation servlet in src/standalone.

To convert between EpiDoc and Leiden+, the utility scripts xml2nonxml.rb and nonxml2xml.rb are provided.

To use them, run:

./bin/xml2nonxml.rb < epidoc.xml > leiden.txt

or:

./bin/nonxml2xml.rb < leiden.txt > epidoc.xml
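
Because both scripts read standard input and write standard output, they can be chained to check whether a file survives a round trip. The following is a minimal sketch only: pipe_through and round_trip are hypothetical helper names, not part of this project, and the converter commands are passed in as parameters rather than hard-coded.

```ruby
require 'open3'

# Pipe +input+ through +command+ (an argv array) and return its stdout.
# Raises if the command exits with a non-zero status.
def pipe_through(command, input)
  output, status = Open3.capture2(*command, stdin_data: input)
  raise "#{command.join(' ')} failed with status #{status.exitstatus}" unless status.success?
  output
end

# Convert XML to non-XML and back again, returning both results.
# (Hypothetical helper: the two converter commands are supplied by the caller.)
def round_trip(xml, to_nonxml, to_xml)
  nonxml = pipe_through(to_nonxml, xml)
  { nonxml: nonxml, xml: pipe_through(to_xml, nonxml) }
end

# Example, assuming it is run from the project root:
#   original = File.read('epidoc.xml')
#   result = round_trip(original, ['./bin/xml2nonxml.rb'], ['./bin/nonxml2xml.rb'])
#   puts 'round trip changed the XML' unless result[:xml] == original
```

A byte-for-byte comparison is strict; whitespace-only differences would also be reported as changes.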

File Structure

bin/                     command-line scripts
    blackboard_agent.rb  blackboard XSugar transformer agent (run many)
    blackboard_server.rb blackboard XSugar transformer server (run one)
    coverage.sh          IDP2 grammar coverage script
    xml2nonxml.rb        command-line xml->non-xml RXSugar utility
    nonxml2xml.rb        command-line non-xml->xml RXSugar utility
epidoc.xsg               Leiden+ XSugar grammar
init.rb                  Rails plugin init script
lib/                     source code
    coverage/            classes for testing XSugar coverage
    standalone/          classes for warming up standalone server
    jruby_helper.rb      helper classes for invoking RXSugar from JRuby
    modules_jruby.rb     Java->Ruby module conversion for JRuby
    modules_rjb.rb       Java->Ruby module conversion for RJB
    rxsugar.rb           main Ruby XSugar wrapper class
    rxsugar_helper.rb    helper classes for using Ruby XSugar wrapper
    util_helper.rb       helper classes for command-line scripts
    xsugar-all.jar       compiled upstream XSugar JAR
src/                     Java source code
    standalone/          source code for standalone transformation server
    xsugar/              upstream XSugar source code
test/                    source code for unit testing
translation_epidoc.xsg   XSugar grammar for EpiDoc translations

Testing

The Ruby tests use Bundler for gem dependencies, so invoke rake with:

bundle exec rake

Upstream

The Java XSugar source is tracked on the xsugar-vendor branch. Customizations for this project live on the xsugar-customizations branch, which is merged into master. To update to a new upstream version of XSugar:

  1. Unpack it to src/xsugar on the xsugar-vendor branch and commit the changes.
  2. Rebase the changes in xsugar-customizations onto the new xsugar-vendor.
  3. Merge xsugar-customizations into master.
  4. Rebuild lib/xsugar-all.jar (using the rake task java:xsugar:build) and commit.

Do not make changes or customizations to the Java XSugar source on master; make them on xsugar-customizations so that the merge upstream/rebase/merge workflow stays straightforward.


xsugar's Issues

`<unclear>` sometimes split to multiple `<unclear>`s

Example:

<unclear>πόλ</unclear>

Becomes:

<unclear>π</unclear><unclear>όλ</unclear>

Files already exhibiting this phenomenon (NB: these all appear to be with <unclear> containing <g type="…">):

DDB_EpiDoc_XML/p.brux.bawit/p.brux.bawit.45.xml
DDB_EpiDoc_XML/p.cair.masp/p.cair.masp.2/p.cair.masp.2.67240.xml
DDB_EpiDoc_XML/p.oxy/p.oxy.54/p.oxy.54.3756.xml
DDB_EpiDoc_XML/p.sijp/p.sijp.36.xml

Files which would exhibit this phenomenon with an idempotent Leiden+ transform:

DDB_EpiDoc_XML/cpr/cpr.22/cpr.22.48.xml
DDB_EpiDoc_XML/o.ashm/o.ashm.6.xml
DDB_EpiDoc_XML/o.bodl/o.bodl.2/o.bodl.2.1849.xml
DDB_EpiDoc_XML/o.petr.mus/o.petr.mus.556.xml
DDB_EpiDoc_XML/o.theb/o.theb.16.xml
DDB_EpiDoc_XML/p.flor/p.flor.1/p.flor.1.93dupl.xml
DDB_EpiDoc_XML/p.mich/p.mich.11/p.mich.11.603.xml
DDB_EpiDoc_XML/p.palaurib/p.palaurib.13.xml
DDB_EpiDoc_XML/p.wisc/p.wisc.1/p.wisc.1.1.xml
DDB_EpiDoc_XML/psi.congr.xxi/psi.congr.xxi.6.xml
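
Until the split is fixed in the grammar, one way to compare output against input is to normalize adjacent `<unclear>` elements first. This is a rough sketch of a post-processing workaround, not the project's fix: merge_adjacent_unclear is a hypothetical helper, and it assumes attribute-free `<unclear>` tags with nothing, not even whitespace, between them.

```ruby
# encoding: utf-8

# Collapse directly adjacent <unclear> elements into a single element.
# Only the exact sequence "</unclear><unclear>" is removed, so elements
# separated by text, whitespace, or other markup are left untouched.
def merge_adjacent_unclear(xml)
  xml.gsub('</unclear><unclear>', '')
end

puts merge_adjacent_unclear('<unclear>π</unclear><unclear>όλ</unclear>')
# prints <unclear>πόλ</unclear>
```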

Error parsing Leiden+

Some papyri have correct Leiden+ and correct XML, and display correctly on papyri.info. But when their Leiden+ is opened for editing, saving produces an error at a position that was not edited. This has happened to me with several different papyri, which makes them impossible to edit (except via the workaround of editing the XML directly). This appears to be a Leiden+ <-> XML parsing error.

Steps to reproduce the error

  1. Navigate to http://papyri.info/ddbdp/p.cair.masp;1;67109
  2. Click on "open in editor"
  3. Click on DDbDP > Leiden+
  4. Click on "Save" (without making any changes to the Leiden+ !!!)
  5. You will get an error on line 8: Διοσκόρου υ ἱ̣(¨)οPOSSIBLE ERRORῦ αὐτοῦ

Leiden+ for figures malfunctioning in DCLP

In Leiden+, one normally uses # to generate the xml for a figure: e.g., #seal >> <figure><figDesc>seal</figDesc></figure>. Attempts to use this Leiden+ in DCLP, however, result in the POSSIBLE ERROR warning. It appears something is amiss with XSugar.

Combining <num> with <gap> and @rend="tick"

In the course of encoding some mathematical papyri, I've discovered a gap in XSugar that it would be nice to fill. If we are dealing with a multi-digit whole number or fraction and one of the digits is in a <gap>, it is not possible to add @rend="tick" to <num> without breaking XSugar.

So, for example, we might consider the following. This first scenario behaves as expected, with XSugar conversion from xml to Leiden+ and back.

  • κ[.] (= print)
  • <#κ[.1]=21-29#> (= Leiden+)
  • <num atLeast="21" atMost="29">κ<gap reason="lost" quantity="1" unit="character"/></num> (= xml)

The addition of a tick is problematic because XSugar does not recognize it in the vicinity of <gap>, as the xml indicates:

  • κ[.] ´ (= print)
  • <#κ[.1] '=21-29#> (= Leiden+)
  • <num atLeast="21" atMost="29">κ<gap reason="lost" quantity="1" unit="character"/> '</num> (= xml)

Of course, one can add @rend="tick" manually to the xml, but although this displays correctly in PN or in PE under 'preview', it breaks Leiden+, rendering the file broken for future editors.

This bug persists regardless of the position of <gap>; no matter which digit is lacunose, XSugar fails to process the tick properly.

JRuby 1.7.x compatibility

Currently, a lot of the standalone scripts and tasks may fail if run under JRuby 1.7.x instead of (EOL'd) 1.6.8.

The big difference between the two is that 1.7.x defaults to Ruby 1.9 compat mode, while 1.6.8 defaults to Ruby 1.8 compat. You can get JRuby 1.7.x to run in Ruby 1.8 compat mode with e.g. jruby -Xcompat.version=1.8 -S bundle exec [command goes here], which may fix certain things. But we should probably upgrade things so they can run fine in 1.9 mode (with an eventual eye to 2.0 mode and JRuby 9.x).

The big changes between 1.8 and 1.9 (for this project) are how character encoding is handled and some differences in REXML (which JRuby compat modes may not completely account for).
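
As an illustration of the encoding difference: in 1.9 mode every string carries an encoding, so input that 1.8 treated as plain bytes may need to be tagged explicitly. The sketch below shows one way to do that which also still runs in 1.8 mode. read_utf8 is a hypothetical helper, not code from this project.

```ruby
# encoding: utf-8
require 'stringio'

# Read everything from +io+ and tag the result as UTF-8,
# failing early if the bytes are not valid UTF-8.
def read_utf8(io)
  data = io.read.to_s.dup
  # String#force_encoding exists only in Ruby 1.9+; 1.8 strings are plain bytes.
  if data.respond_to?(:force_encoding)
    data.force_encoding('UTF-8')
    raise ArgumentError, 'input is not valid UTF-8' unless data.valid_encoding?
  end
  data
end

# Example: text = read_utf8($stdin)
```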

Travis tests failing

Related to #16. JRuby 1.6.8 doesn't install by default with RVM under Travis, causing the tests to fail.

Multi-line quotes break on conversion after split

If a <q> happens to span lines and the splitting process splits inside the quote, then Leiden+ conversion breaks. This may be a tough one to solve for the splitter, given that there's no way to tell a start quote from an end quote other than counting, and split chunks aren't aware of each other. It might be far easier to use "smart" quotes (“”) instead, because they balance properly, and maybe even guillemets («») in addition.

It's probably possible to work around this if smart quotes are undesirable, but tricky....
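
The counting approach could look roughly like the sketch below, which carries quote parity from one chunk to the next so that each chunk knows whether it begins inside an open quote. It assumes the quote delimiter is the straight double quote and that the chunks are available in order. starts_inside_quote is a hypothetical helper, not part of the splitter.

```ruby
# For each chunk, report whether it begins inside a quote that an
# earlier chunk opened but did not close.
def starts_inside_quote(chunks, quote = '"')
  inside = false
  chunks.map do |chunk|
    state = inside
    # An odd number of quote characters flips the open/closed state.
    inside = !inside if chunk.count(quote).odd?
    state
  end
end

p starts_inside_quote(['1. a "b', '2. c" d', '3. e'])
# prints [false, true, false]
```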

`<gap reason="lost" extent="unknown" unit="line"/>` sometimes becomes `lost<gap reason="illegible" extent="unknown" unit="character"/>lin`

Example:

<gap reason="lost" extent="unknown" unit="line"/>

Becomes:

lost<gap reason="illegible" extent="unknown" unit="character"/>lin

Files already exhibiting this phenomenon:

DDB_EpiDoc_XML/cpr/cpr.4/cpr.4.43.xml
DDB_EpiDoc_XML/p.koeln/p.koeln.11/p.koeln.11.443.xml
DDB_EpiDoc_XML/stud.pal/stud.pal.3(2).1/stud.pal.3(2).1.55.xml

Files which would exhibit this phenomenon with an idempotent Leiden+ transform:

DDB_EpiDoc_XML/bgu/bgu.15/bgu.15.2476.xml
DDB_EpiDoc_XML/bgu/bgu.3/bgu.3.942.xml
DDB_EpiDoc_XML/cpr/cpr.1/cpr.1.83.xml
DDB_EpiDoc_XML/cpr/cpr.14/cpr.14.52.xml
DDB_EpiDoc_XML/cpr/cpr.15/cpr.15.27.xml
DDB_EpiDoc_XML/o.stras/o.stras.1/o.stras.1.645.xml
DDB_EpiDoc_XML/p.ashm/p.ashm.1/p.ashm.1.22.xml
DDB_EpiDoc_XML/p.bad/p.bad.2/p.bad.2.28.xml
DDB_EpiDoc_XML/p.fouad/p.fouad.50.xml
DDB_EpiDoc_XML/p.freib/p.freib.4/p.freib.4.65.xml
DDB_EpiDoc_XML/p.iand.zen/p.iand.zen.68.xml
DDB_EpiDoc_XML/p.oslo/p.oslo.3/p.oslo.3.129.xml
DDB_EpiDoc_XML/p.oxy/p.oxy.67/p.oxy.67.4615.xml
DDB_EpiDoc_XML/p.oxy/p.oxy.68/p.oxy.68.4689.xml
DDB_EpiDoc_XML/p.sijp/p.sijp.12c.xml
DDB_EpiDoc_XML/p.tebt/p.tebt.2/p.tebt.2.291.xml
DDB_EpiDoc_XML/sb/sb.1/sb.1.4870.xml
DDB_EpiDoc_XML/sb/sb.24/sb.24.16129.xml
DDB_EpiDoc_XML/sb/sb.24/sb.24.16220.xml
DDB_EpiDoc_XML/sb/sb.26/sb.26.16417.xml
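
Files that already contain the corrupted form can be found by searching for the literal string from the example above. A minimal sketch, with hypothetical helper names and an assumed local checkout of the data directory:

```ruby
# The corrupted output quoted in the example above, matched literally.
CORRUPTED_GAP = 'lost<gap reason="illegible" extent="unknown" unit="character"/>lin'.freeze

# True if the given XML text contains the corrupted gap sequence.
def corrupted_gap?(xml)
  xml.include?(CORRUPTED_GAP)
end

# List the .xml files under +root+ that contain the corrupted sequence.
def corrupted_gap_files(root)
  Dir.glob(File.join(root, '**', '*.xml')).sort.select do |path|
    corrupted_gap?(File.read(path))
  end
end

# Example, assuming DDB_EpiDoc_XML is checked out in the current directory:
#   puts corrupted_gap_files('DDB_EpiDoc_XML')
```

This only finds files already exhibiting the problem; the second list requires actually running the transform.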

Leiden+ desiderata

Notes from P3 meeting in Heidelberg:

  1. Support for <note> (and maybe <wit>?) in apparatus.
  2. Tables.

`<unclear>` within `<hi rend="diaeresis">` within `<expan>` does not work in Leiden+

The following line contains valid XML:

https://github.com/papyri/idp.data/blob/master/DDB_EpiDoc_XML/p.cair.masp/p.cair.masp.1/p.cair.masp.1.67006.xml#L306

<expan><hi rend="diaeresis"><unclear>ἰ</unclear></hi><unclear>ν</unclear>δικ<ex>τίονος</ex></expan>

Its Leiden+ equivalent breaks when saved in Leiden+:

112. ( ἰ̣(¨)ν̣δικ(τίονος))

Without the underdot, the Leiden+ is fine:

112. ( ἰ(¨)ν̣δικ(τίονος))

and so is the corresponding XML:

<expan><hi rend="diaeresis">ἰ</hi><unclear>ν</unclear>δικ<ex>τίονος</ex></expan>

One can, of course, save the correct form in XML:

<expan><hi rend="diaeresis"><unclear>ἰ</unclear></hi><unclear>ν</unclear>δικ<ex>τίονος</ex></expan>

and submit it, which I have done, but it would be great if XSugar could handle the underdot.
