
OpenXLIFF Filters


An open source set of Java filters for creating, merging and validating XLIFF 1.2, 2.0 and 2.1 files.

With OpenXLIFF Filters you can create XLIFF files that don't use proprietary markup and are compatible with most CAT (Computer Assisted Translation) tools.

Features

Filters Configuration

XML and JSON filters are configurable

Related Projects

  • XLIFF Manager implements an easy-to-use user interface for creating, merging, validating, and manipulating XLIFF files in a graphical environment.
  • Swordfish uses OpenXLIFF Filters to extract translatable text from supported formats.
  • RemoteTM uses OpenXLIFF Filters to handle all TMX processing.
  • Stingray uses OpenXLIFF Filters for extracting the text to align from supported monolingual documents.
  • TMXEditor relies on OpenXLIFF Filters XML support for processing TMX files.
  • XLIFF Validation, a web-based XLIFF validation service.
  • Fluenta, a Translation Manager that uses OpenXLIFF Filters to generate XLIFF from DITA projects.
  • JavaPM, a set of scripts for localizing Java .properties files using XLIFF.

Releases

Version Comment Release Date
3.20.0 Improved region merging for Word documents March 17th, 2024
3.19.0 Moved Language Tags handling to BCP47J March 7th, 2024
3.18.0 Fixed scope builder for DITA filter; removed Machine Translation engines February 5th, 2024
3.17.0 Improved extraction from Word text boxes; switched to Java 21 January 6th, 2024
3.16.0 Added catalog for XLIFF 2.2; adjusted ChatGPT models October 29th, 2023
3.15.0 Added new options to JSON filter configuration September 13th, 2023
3.14.0 Added option to generate XLIFF 2.1 September 1st, 2023
3.13.0 Added export as TMX scripts; allowed relative paths from command line August 27th, 2023
3.12.0 Improved JSON support and localization August 16th, 2023
3.11.0 Improved SVG handling and language sorting July 31st, 2023
3.10.0 Fixed TMX exports July 10th, 2023
3.9.0 Improved Machine Translation support and internal code June 30th, 2023
3.8.0 Improved localization handling June 2nd, 2023
3.7.0 Extracted "alt" text from Word images May 15th, 2023
3.6.0 Improved segmentation of XLIFF 2.0 April 1st, 2023
3.5.0 Improved HTML filter March 15th, 2023
3.4.0 Extracted localizable strings February 27th, 2023
3.3.0 Detected loops while processing @keyref in DITA February 20th, 2023
3.2.0 Improved file management February 4th, 2023
3.1.0 Improved DITA merge January 24th, 2023
3.0.0 Moved XML code to project XMLJava January 9th, 2023
2.13.0 Ignored tracked changes from Oxygen XML Editor December 22nd, 2022
2.12.0 Added "xmlfilter" parameter to conversion options December 5th, 2022
2.11.0 Improved support for DITA from Astoria CMS December 2nd, 2022
2.10.0 Fixed DITA segmentation November 22nd, 2022
2.9.1 Fixed joining of XLIFF 2.0 files and improved PHP Array support October 22nd, 2022
2.9.0 Added support for PHP Arrays October 21st, 2022
2.8.0 Updated TXLF, JSON and DITA filters October 8th, 2022
2.7.0 Fixed resegmenter for XLIFF 2.0 August 12th, 2022
2.6.0 Converted HTML fragments in Excel & Word files to tags July 16th, 2022
2.5.0 Added configuration options to JSON filter; Added scripts to approve all segments; Updated language list July 6th, 2022
2.4.2 Improved support for Trados Studio Packages June 18th, 2022
2.4.1 Fixed conversion of third party XLIFF files June 10th, 2022
2.4.0 Added remove all targets; added feedback for Fluenta on DITA filter June 9th, 2022
2.3.0 Added copy source to target; Fixed DITA conversion and merge May 25th, 2022
2.2.0 Added pseudo-translation May 11th, 2022
2.1.0 Updated dependencies and improved validation of XLIFF 2.x April 21st, 2022
2.0.0 Moved server code to XLIFF Manager project March 29th, 2022
1.17.5 Updated DITA keyscope handling March 18th, 2022
1.17.4 Fixed handling of nested untranslatables in DITA; Improved XLIFF 2.0 support March 6th, 2022
1.17.2 Fixed support for FrameMaker MIF files February 6th, 2022
1.17.1 Improved support for DITA February 5th, 2022
1.17.0 Improved validation of XLIFF 2.0; Added SVG statistics for XLIFF 2.0 December 1st, 2021
1.16.0 Improved support for XLIFF 2.0; Switched to Java 17 November 23rd, 2021
1.15.2 MS Office and DITA fixes November 13th, 2021
1.15.0 Initial support for RELAX NG grammars November 8th, 2021
1.14.0 Improved segmentation for XLIFF 2.0 October 3rd, 2021
1.13.0 Improved DITA support August 31st, 2021
1.12.7 Improved round trip 1.2 -> 2.0 -> 1.2; Ignored untranslatable SVG in DITA maps July 4th, 2021
1.12.6 Improved validation; updated language management June 19th, 2021
1.12.5 Improved support for bilingual files February 3rd, 2021
1.12.4 Allowed concurrent access for XLIFF Validation January 22nd, 2021
1.12.3 Improved support for Trados Studio packages January 1st, 2021
1.12.2 Suppressed output for confirmed empty targets December 7th, 2020
1.12.1 Improved conversion of XLIFF 1.2 files December 3rd, 2020
1.12.0 Added support for Adobe InCopy ICML and SRT subtitles November 23rd, 2020
1.11.1 Fixed JSON encoding; fixed import of XLIFF matches November 1st, 2020
1.11.0 Added support for JSON files September 25th, 2020
1.10.1 Fixed handling of TXLF files and improved XML catalog September 5th, 2020
1.10.0 Added conversion of 3rd party XLIFF; improved support for XLIFF 2.0; fixed issues with Trados Studio packages August 25th, 2020
1.9.1 Added improvements required by Swordfish IV. August 13th, 2020
1.9.0 Added 5 Machine Translation (MT) engines (Google, Microsoft Azure, DeepL, MyMemory & Yandex) May 18th, 2020
1.8.4 Improved catalog and other minor fixes April 25th, 2020
1.8.3 Fixed conversion of PO files April 17th, 2020
1.8.2 Switched to synchronized classes in XML package April 10th, 2020
1.8.1 Improved support for Trados Studio packages April 3rd, 2020
1.8.0 Implemented support for Trados Studio packages March 29th, 2020
1.7.0 Major code cleanup; Changed segmentation model for XLIFF 2.0 January 1st, 2020
1.6.0 Added support for XLIFF files from WPML WordPress Plugin December 1st, 2019
1.5.2 Improved segmenter performance October 29th, 2019
1.5.1 Fixed catalog on Windows September 22nd, 2019
1.5.0 Improved support for empty <xref/> elements in DITA; improved support for XML catalogs September 5th, 2019
1.4.2 Added option to join XLIFF files; Fixed merge errors in XLIFF 2.0; added tool info to XLIFF files; cleaned DITA attributes on merging August 14th, 2019
1.4.1 Improved performance embedding skeleton; added Apache Ant building option; renamed module to 'openxliff' July 25th, 2019
1.4.0 Improved report of task results July 17th, 2019
1.3.3 Fixed merging of MS Office files from XLIFF 2.0 July 5th, 2019
1.3.2 Updated for Swordfish 3.4.3 June 30th, 2019
1.3.1 Updated for Swordfish 3.4.0 April 30th, 2019
1.3.0 Added option to export approved XLIFF segments as TMX April 18th, 2019
1.2.1 Improved validation of XLIFF 2.0 April 6th, 2019
1.2.0 Added Translation Status Analysis March 3rd, 2019
1.1.0 Incorporated XLIFFChecker code November 20th, 2018
1.0.0 Initial Release November 12th, 2018

Supported File Formats

OpenXLIFF Filters can generate XLIFF 1.2 and XLIFF 2.0 from these formats:

  • General Documentation
    • Adobe InCopy ICML
    • Adobe InDesign Interchange (INX)
    • Adobe InDesign IDML CS4, CS5, CS6 & CC
    • HTML
    • Microsoft Office (2007 and newer)
    • Microsoft Visio XML Drawings (2007 and newer)
    • MIF (Maker Interchange Format)
    • OpenOffice / LibreOffice / StarOffice
    • PHP Arrays
    • Plain Text
    • SDLXLIFF (Trados Studio)
    • SRT Subtitles
    • Trados Studio Packages (*.sdlppx)
    • TXML (GlobalLink/Wordfast PRO)
    • WPML XLIFF (WordPress Multilingual Plugin)
    • Wordfast/GlobalLink XLIFF (*.txlf)
    • XLIFF from Other Tools (.mqxliff, .txlf, .xliff, etc.)
  • XML Formats
    • XML (Generic)
    • DITA 1.0, 1.1, 1.2 and 1.3
    • DocBook 3.x, 4.x and 5.x
    • SVG
    • Word 2003 ML
    • XHTML
  • Software Development
    • JavaScript
    • JSON
    • Java Properties
    • PHP Arrays
    • PO (Portable Objects)
    • RC (Windows C/C++ Resources)
    • ResX (Windows .NET Resources)
    • TS (Qt Linguist translation source)

Requirements

  • JDK 21 or newer is required for compiling and building. Pre-built binaries already include everything you need to run all options.
  • Apache Ant 1.10.13 or newer

Building

  • Check out this repository.
  • Point your JAVA_HOME variable to JDK 21
  • Run ant to generate a binary distribution in ./dist

Steps for building

  git clone https://github.com/rmraya/OpenXLIFF.git
  cd OpenXLIFF
  ant

A binary distribution will be created in the ./dist folder.

Convert Document to XLIFF

You can use the library in your own Java code. Conversion to XLIFF is handled by the class com.maxprograms.converters.Convert.

If you use binaries from the command line, running .\convert.bat or ./convert.sh without parameters displays help for XLIFF generation.

Usage:

   convert.sh [-help] [-version] -file sourceFile -srcLang sourceLang
        [-tgtLang targetLang] [-skl skeletonFile] [-xliff xliffFile] [-type fileType]
        [-enc encoding] [-srx srxFile] [-catalog catalogFile] [-ditaval ditaval]
        [-config configFile] [-embed] [-paragraph] [-xmlfilter folder] [-2.0]
        [-ignoretc] [-charsets]

Where:

   -help:      (optional) Display this help information and exit
   -version:   (optional) Display version & build information and exit
   -file:      source file to convert
   -srcLang:   source language code
   -tgtLang:   (optional) target language code
   -xliff:     (optional) XLIFF file to generate
   -skl:       (optional) skeleton file to generate
   -type:      (optional) document type
   -enc:       (optional) character set code for the source file
   -srx:       (optional) SRX file to use for segmentation
   -catalog:   (optional) XML catalog to use for processing
   -ditaval:   (optional) conditional processing file to use when converting DITA maps
   -config:    (optional) configuration file to use when converting JSON documents
   -embed:     (optional) store skeleton inside the XLIFF file
   -paragraph: (optional) use paragraph segmentation
   -xmlfilter: (optional) folder containing configuration files for the XML filter
   -ignoretc:  (optional) ignore tracked changes from Oxygen XML Editor in XML files
   -2.0:       (optional) generate XLIFF 2.0
   -charsets:  (optional) display a list of available character sets and exit

Document Types

   INX = Adobe InDesign Interchange
   ICML = Adobe InCopy ICML
   IDML = Adobe InDesign IDML
   DITA = DITA Map
   HTML = HTML Page
   JS = JavaScript
   JSON = JSON
   JAVA = Java Properties
   MIF = MIF (Maker Interchange Format)
   OFF = Microsoft Office 2007 Document
   OO = OpenOffice Document
   PHPA = PHP Array
   PO = PO (Portable Objects)
   RC = RC (Windows C/C++ Resources)
   RESX = ResX (Windows .NET Resources)
   SDLPPX = Trados Studio Package
   SDLXLIFF = SDLXLIFF Document
   SRT = SRT Subtitle
   TEXT = Plain Text
   TS = TS (Qt Linguist translation source)
   TXLF = Wordfast/GlobalLink XLIFF
   TXML = TXML Document
   WPML = WPML XLIFF
   XLIFF = XLIFF Document
   XML = XML Document
   XMLG = XML (Generic)

Only two parameters are absolutely required: -file and -srcLang. The library tries to automatically detect format and encoding and exits with an error message if it can't guess them. If automatic detection doesn't work, add -type and -enc parameters.

Character sets vary with the operating system. Run the conversion script with -charsets to get a list of character sets available in your OS.

By default, XLIFF and skeleton are generated in the folder where the source document is located. Extensions used for XLIFF and Skeleton are .xlf and .skl.

The XML type handles multiple document formats, like XHTML, SVG or DocBook files.

Default XML catalog and SRX file are provided. You can also use custom versions if required.
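If you prefer to keep the command-line scripts as the integration point, the options above can also be assembled programmatically. This is a minimal sketch, not part of the library; the script path "./dist/convert.sh" and the sample file names are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: driving the convert script from Java via ProcessBuilder.
// The script path and file names are assumptions; adjust them to your setup.
public class ConvertInvocation {

    static List<String> buildConvertArgs(String script, String file, String srcLang,
            String tgtLang, boolean xliff20) {
        List<String> args = new ArrayList<>();
        args.add(script);
        args.add("-file");
        args.add(file);
        args.add("-srcLang");
        args.add(srcLang);
        if (tgtLang != null) {   // -tgtLang is optional
            args.add("-tgtLang");
            args.add(tgtLang);
        }
        if (xliff20) {           // request XLIFF 2.0 instead of the default 1.2
            args.add("-2.0");
        }
        return args;
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = buildConvertArgs("./dist/convert.sh", "manual.docx", "en", "sv", true);
        System.out.println(String.join(" ", cmd));
        // To actually run the conversion, uncomment the next line:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

The same pattern applies to the merge, validation, and analysis scripts.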

Convert XLIFF to Original Format

You can convert XLIFF files created with OpenXLIFF Filters to original format using class com.maxprograms.converters.Merge in your Java code.

If you use binaries from the command line, running .\merge.bat or ./merge.sh without parameters will display the information you need to merge an XLIFF file.

Usage:

   merge.bat [-help] [-version] -xliff xliffFile -target targetFile [-catalog catalogFile] [-unapproved] [-export]

Where:

   -help:       (optional) Display this help information and exit
   -version:    (optional) Display version & build information and exit
   -xliff:      XLIFF file to merge
   -target:     translated file or folder where to store translated files
   -catalog:    (optional) XML catalog to use for processing
   -unapproved: (optional) accept translations from unapproved segments
   -export:     (optional) generate TMX file from approved segments

XLIFF Validation

The original XLIFFChecker code supports XLIFF 1.0, 1.1 and 1.2. The new version incorporated in OpenXLIFF Filters also supports XLIFF 2.0.

Standard XML Schema validation does not detect the use of duplicated 'id' attributes, wrong language codes and other constraints written in the different XLIFF specifications.

All XLIFF 2.0 modules are validated using XML Schema validation in a first pass. Extra validation is then performed using Java code for XLIFF 2.0 Core and for Metadata, Matches and Glossary modules.
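As an illustration of the kind of constraint that plain schema validation misses, the following self-contained sketch (JDK DOM parsing only, not OpenXLIFF code) detects duplicated 'id' attributes on <unit> elements:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch of a constraint check that XML Schema validation does not perform:
// scan every <unit> element and flag duplicated "id" attributes.
public class DuplicateIdCheck {

    static boolean hasDuplicateUnitIds(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList units = doc.getElementsByTagNameNS("*", "unit");
        Set<String> seen = new HashSet<>();
        for (int i = 0; i < units.getLength(); i++) {
            String id = ((Element) units.item(i)).getAttribute("id");
            if (!seen.add(id)) {
                return true; // same id used twice
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        String xliff = "<xliff xmlns=\"urn:oasis:names:tc:xliff:document:2.0\" version=\"2.0\" srcLang=\"en\">"
                + "<file id=\"f1\"><unit id=\"u1\"/><unit id=\"u1\"/></file></xliff>";
        System.out.println(hasDuplicateUnitIds(xliff)); // prints true
    }
}
```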

You can validate XLIFF files using your own Java code. Validation of XLIFF files is handled by the class com.maxprograms.validation.XliffChecker.

If you use binaries from the command line, running .\xliffchecker.bat or ./xliffchecker.sh without parameters displays help for XLIFF validation.

Usage:

   xliffchecker.bat [-help] -file xliffFile [-catalog catalogFile]

Where:

   -help:      (optional) Display this help information and exit
   -file:      XLIFF file to validate
   -catalog:   (optional) XML catalog to use for processing

XLIFF Validation Service

You can validate your XLIFF files online at https://dev.maxprograms.com/Validation/

Translation Status Analysis

This library lets you produce an HTML file with word counts and segment status statistics from an XLIFF file.

If you use binaries from the command line, running .\analysis.bat or ./analysis.sh without parameters displays help for statistics generation.

You can generate statistics using your own Java code. Statistics generation is handled by the class com.maxprograms.stats.RepetitionAnalysis.

Usage:

   analysis.sh [-help] -file xliffFile [-catalog catalogFile]

Where:

   -help:      (optional) Display this help information and exit
   -file:      XLIFF file to analyze
   -catalog:   (optional) XML catalog to use for processing

The HTML report is generated in the folder where the XLIFF file is located and its name is the name of the XLIFF plus .log.html.

Join multiple XLIFF files

You can combine several XLIFF files into a larger one using class com.maxprograms.converters.Join from your Java code or using the provided scripts.

Running .\join.bat or ./join.sh without parameters displays help for joining files.

Usage:

   join.sh [-help] -target targetFile -files file1,file2,file3...

Where:

   -help:     (optional) Display this help information and exit
   -target:   combined output XLIFF file
   -files:    comma-separated list of XLIFF files to join

The merge process automatically splits joined files when converting back to the original format.

Pseudo-translate XLIFF file

You can pseudo-translate all untranslated segments using class com.maxprograms.converters.PseudoTranslation from your Java code or using the provided scripts.

Running .\pseudotranslate.bat or ./pseudotranslate.sh without parameters displays help for pseudo-translating an XLIFF file.

Usage:

   pseudotranslate.bat [-help] -xliff xliffFile [-catalog catalogFile]

Where:

   -help:      (optional) Display this help information and exit
   -xliff:     XLIFF file to pseudo-translate
   -catalog:   (optional) XML catalog to use for processing

Copy Source to Target

You can copy the content of <source> elements to new <target> elements for all untranslated segments using class com.maxprograms.converters.CopySources from your Java code or using the provided scripts.

Running .\copysources.bat or ./copysources.sh without parameters displays help for copying source to target in an XLIFF file.

Usage:

   copysources.bat [-help] -xliff xliffFile [-catalog catalogFile]

Where:

   -help:      (optional) Display this help information and exit
   -xliff:     XLIFF file to process
   -catalog:   (optional) XML catalog to use for processing

Approve All Segments

You can set all <trans-unit> or <segment> elements as approved or final if they contain target text using class com.maxprograms.converters.ApproveAll from your Java code or using the provided scripts.

Running .\approveall.bat or ./approveall.sh without parameters displays help for approving or confirming all segments in an XLIFF file.

Usage:

   approveall.bat [-help] -xliff xliffFile [-catalog catalogFile]

Where:

   -help:      (optional) Display this help information and exit
   -xliff:     XLIFF file to process
   -catalog:   (optional) XML catalog to use for processing

Remove All Targets

You can remove <target> elements from all <segment> or <trans-unit> elements using class com.maxprograms.converters.RemoveTargets from your Java code or using the provided scripts.

Running .\removetargets.bat or ./removetargets.sh without parameters displays help for removing targets from an XLIFF file.

Usage:

   removetargets.bat [-help] -xliff xliffFile [-catalog catalogFile]

Where:

   -help:      (optional) Display this help information and exit
   -xliff:     XLIFF file to process
   -catalog:   (optional) XML catalog to use for processing

Export Approved Segments as TMX

You can export all approved segments from an XLIFF file as TMX using class com.maxprograms.converters.TmxExporter from your Java code or using the provided scripts.

Running .\exporttmx.bat or ./exporttmx.sh without parameters displays help for exporting approved segments from an XLIFF file.

Usage:

   exporttmx.sh [-help] -xliff xliffFile [-tmx tmxFile] [-catalog catalogFile]

Where:

    -help:      (optional) Display this help information and exit
    -xliff:     XLIFF file to process
    -tmx:       (optional) TMX file to generate
    -catalog:   (optional) XML catalog to use for processing

If the optional -tmx parameter is not provided, the TMX file will be generated in the same folder as the XLIFF file and its name will be the same as the XLIFF file plus .tmx.

openxliff's People

Contributors

rmraya

openxliff's Issues

Setting to treat SVGs as binary

A client using the Oxygen Fluenta add-on would like the generated XLIFF to treat SVGs as binary images, since the client is not interested in translating them. Looking at the code, there does not seem to be a setting to state that SVGs should be treated as binary.

Invalid result after merge operation (related to Element#mergeText)

Hello, thank you for sharing this tool.

I found a problem related to the merge operation; I'll try to describe it in this issue.

How to reproduce

test.skl

<?xml version="1.0" encoding="UTF-8"?>
<ROOT>
  <NIV1>%%%1%%%
</NIV1>
</ROOT>

test.xlf

<?xml version="1.0" encoding="UTF-8"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" xmlns:mtc="urn:oasis:names:tc:xliff:matches:2.0" xmlns:mda="urn:oasis:names:tc:xliff:metadata:2.0" srcLang="fr" version="2.1" trgLang="en-US">
  <file original="test.xml" id="1">
    <skeleton href="test.skl"/>
    <mda:metadata>
      <mda:metaGroup category="format">
        <mda:meta type="datatype">xml</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="tool">
        <mda:meta type="tool-id">OpenXLIFF</mda:meta>
        <mda:meta type="tool-name">OpenXLIFF Filters</mda:meta>
        <mda:meta type="tool-version">3.15.0 20230913_0710</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="PI">
        <mda:meta type="encoding">UTF-8</mda:meta>
      </mda:metaGroup>
    </mda:metadata>
    <unit id="1">
      <mda:metadata id="1">
        <mda:metaGroup category="attributes" id="ph0">
          <mda:meta type="ctype">x-bold</mda:meta>
        </mda:metaGroup>
      </mda:metadata>
      <originalData>
        <data id="ph0">&lt;G&gt;</data>
        <data id="ph1">&lt;/G&gt;</data>
      </originalData>
      <ignorable>
        <source xml:space="preserve">
          <ph id="ph0"/>
        </source>
      </ignorable>
      <segment state="final" id="1-0">
        <source xml:space="preserve"> Bonjour. </source>
        <target> Hello. </target>
      </segment>
      <ignorable>
        <source xml:space="preserve">
          <ph id="ph1"/>
        </source>
      </ignorable>
      <segment state="final" id="1-1">
        <source xml:space="preserve"> Ce text devrait être traduit </source>
        <target> This text should be translated </target>
      </segment>
    </unit>
  </file>
</xliff>

command

./merge.sh -xliff test.xlf -target result.xml

actual xml result

<?xml version="1.0" encoding="UTF-8"?>
<ROOT>
  <NIV1>
          <G>
         Bonjour. 
           Hello. 
          </G>
         Ce text devrait être traduit  This text should be translated </NIV1>
</ROOT>

expected xml result

<?xml version="1.0" encoding="UTF-8"?>
<ROOT>
  <NIV1><G> Hello. </G> This text should be translated </NIV1>
</ROOT>

Investigation

First, I observed that this issue appears because I formatted the XLIFF file.
If I roll back some changes, the result is correct. The rolled-back changes are:

      <ignorable>
        <source xml:space="preserve">
          <ph id="ph0"/>
        </source>
      </ignorable>

      <ignorable>
        <source xml:space="preserve"><ph id="ph0"/></source>
      </ignorable>

(needed as well for ph1)

Looking into the code, I think I understand the problem.

In https://github.com/rmraya/OpenXLIFF/blob/v3.15.0/src/com/maxprograms/xliff2/FromXliff2.java#L294-L326

We have:

  • l. 301: joinedSource.addContent(src.getContent());
  • l. 308: joinedTarget.addContent(src.getContent());

At this point joinedSource and joinedTarget point to the same content (the same Java objects, by reference).

Then at l. 326 we have:

src.setContent(harvestContent(joinedSource, tags, attributes));

This internally calls joinedSource.getContent(), which itself calls Element#mergeText.

And this is precisely where the issue lies.
When harvesting the source, we call the mergeText function, which mutates some XMLNodes that are also referenced by the target. After harvesting the source, the content of the target becomes invalid.

To confirm this hypothesis, I simply removed the call to #mergeText in Element#getContent and it actually fixes the problem.
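The aliasing described above is easy to reproduce with a toy model. TextNode and Element here are illustrative stand-ins, not the real XMLJava classes: two elements share the same node objects, and a getter that merges text in place corrupts the second element.

```java
import java.util.ArrayList;
import java.util.List;

// Toy reproduction of the shared-reference problem described above.
// TextNode/Element are illustrative stand-ins, not the real XMLJava classes.
public class SharedContentDemo {

    static class TextNode {
        String text;
        TextNode(String text) { this.text = text; }
    }

    static class Element {
        private final List<TextNode> content = new ArrayList<>();

        void addContent(List<TextNode> nodes) {
            content.addAll(nodes); // stores the SAME node references, no copy
        }

        // A getter that merges adjacent text nodes in place, mimicking
        // Element#mergeText: it mutates nodes that another Element may share.
        List<TextNode> getContent() {
            while (content.size() > 1) {
                content.get(0).text += content.get(1).text;
                content.remove(1);
            }
            return content;
        }
    }

    static String demo() {
        List<TextNode> shared = new ArrayList<>();
        shared.add(new TextNode("Hello"));
        shared.add(new TextNode(" world"));

        Element source = new Element();
        Element target = new Element();
        source.addContent(shared); // both elements now reference
        target.addContent(shared); // the same TextNode objects

        source.getContent(); // "harvesting" the source mutates the shared nodes
        return target.getContent().get(0).text;
    }

    public static void main(String[] args) {
        // prints "Hello world world" instead of the expected "Hello world"
        System.out.println(demo());
    }
}
```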

Solution

I don't know what would be the best way to fix this issue; I think there are different possibilities, like:

  • Not mutating state in a method that apparently only reads data (we don't expect #getContent to mutate the state of Element)
    • I saw that this method was also called in the equals method, which can be dangerous as well
  • Not sharing data between two Element objects, which would require copying the XMLNodes

If you want me to share a PR with a fix proposal, feel free to ask.

By the way, I think it could be very helpful to introduce a pom.xml in order to have Maven dependencies (in which case it would be necessary to publish XMLJava to Maven Central).

For Indian Languages: Source text gets stored in a weird fashion in XLIFF format

For Indian languages like Hindi, Sanskrit, etc., the "source" field contains metadata after each word, in addition to the original text. This is unusual and doesn't happen with Western languages like English, French, or German. It is problematic because CAT tools present the source field as-is in the source-language columns.


I am attaching the original text file and the converted XLIFF files as well
OpenXLIFF.zip

Update:
This is happening with OFF files but not with TEXT files.

Hello, IDML files to Xliff issue

When an IDML file is converted to an XLIFF file, the order of the opening and closing tags (of the "Content" and "CharacterStyleRange" elements) is reversed.
Can you fix it?

-srx option doesn't work

I used LanguageTool's SRX file but can't see that the extracted file is segmented according to it. I've compared the unsegmented output with the segmented one and still can't see a difference.

command I ran: ./convert.sh -file ~/Documents/sample.xliff -srcLang en -srx ~/Documents/languagetoolorg-srx.srx

I've attached sample file

Archive.zip

Feature request: Hide leading whitespace

Currently any whitespace before a sentence and between sentences is included in the translatable segment, but not the trailing whitespace (see example below). The suggestion is that all leading and trailing whitespace is hidden from translation.
(It is of course easy to hide it the CAT tool, but it could be an improvement for the OpenXLIFF library anyway.)

With an example document like this, with spaces before, between, and after the sentences:
" Sentence one. Sentence two. "

you get xliff like this:

<source xml:space="preserve">   Sentence one.</source>
...
<source xml:space="preserve"> Sentence two.</source>

(I.e., the spaces are there, except after sentence two. I suppose the last whitespace is hidden in the skeleton?)

How do I write a given <target> on XLIFF 2.0?

I'm using okapi-lib-xliff2 to read and write a .xlf file generated by OpenXLIFF's ./convert.sh, but after that, Word deems the file corrupted when I merge the XLIFF file back using OpenXLIFF's ./merge.sh, because okapi-lib writes and saves the XLIFF file in a different format.

Any API/Library to write your generated .xlf file after ./convert.sh?

Feature suggestion: Change spelling language of target document to the target language

I noticed that the output document keeps the spelling language of the source document.
Not sure whether this is an easy change or not, but it would be a nice improvement if the output document had the same spelling language as the target language.
Example of current behavior:

  • ./convert.sh -file msdoc.docx -srcLang en -tgtLang sv -2.0 -embed
  • translate the text to Swedish
  • ./merge.sh -xliff msdoc.docx.xlf -target msdoc_sv.docx
  • open the output document msdoc_sv.docx,
  • spelling language is English

StackOverflowException when exporting XLIFF and there is a link inside a list item pointing to itself

Actually inside the list item there are 2 paragraphs, and at some point in the second paragraph there is a cross reference pointing to the first one. I attached a minimal sample: testFluenta.zip.

The error is the following:

Exception in thread "Thread-35" java.lang.StackOverflowError
        at java.base/java.lang.RuntimeException.<init>(RuntimeException.java:52)
        at java.base/java.lang.IllegalArgumentException.<init>(IllegalArgumentException.java:40)
        at java.base/java.util.regex.PatternSyntaxException.<init>(PatternSyntaxException.java:58)
        at java.base/java.util.regex.Pattern.error(Pattern.java:2028)
        at java.base/java.util.regex.Pattern.<init>(Pattern.java:1432)
        at java.base/java.util.regex.Pattern.compile(Pattern.java:1069)
        at java.base/java.lang.String.split(String.java:3155)
        at java.base/java.lang.String.split(String.java:3201)
        at com.maxprograms.converters.ditamap.DitaParser.ditaClass(Unknown Source)
        at com.maxprograms.converters.ditamap.DitaParser.recurse(Unknown Source)
        at com.maxprograms.converters.ditamap.DitaParser.recurse(Unknown Source)
        at com.maxprograms.converters.ditamap.DitaParser.recurse(Unknown Source)
        at com.maxprograms.converters.ditamap.DitaParser.recurse(Unknown Source)
        at com.maxprograms.converters.ditamap.DitaParser.recurse(Unknown Source)
        at com.maxprograms.converters.ditamap.DitaParser.recurse(Unknown Source)
        at com.maxprograms.converters.ditamap.DitaParser.recurse(Unknown Source)
        at com.maxprograms.converters.ditamap.DitaParser.recurse(Unknown Source)
        at com.maxprograms.converters.ditamap.DitaParser.recurse(Unknown Source)
        at com.maxprograms.converters.ditamap.DitaParser.recurse(Unknown Source)
        at com.maxprograms.converters.ditamap.DitaParser.recurse(Unknown Source)
        at com.maxprograms.converters.ditamap.DitaParser.recurse(Unknown Source)
        at com.maxprograms.converters.ditamap.DitaParser.recurse(Unknown Source)
        at com.maxprograms.converters.ditamap.DitaParser.recurse(Unknown Source)
        at com.maxprograms.converters.ditamap.DitaParser.recurse(Unknown Source)
        at com.maxprograms.converters.ditamap.DitaParser.recurse(Unknown Source)
        at com.maxprograms.converters.ditamap.DitaParser.recurse(Unknown Source)

Make ILogger deterministic for some stages of processing

Hello,

We are currently building an Oxygen XML Editor plugin that integrates Fluenta into the DITA Maps Manager. We are using com.maxprograms.fluenta.API for calling Fluenta operations. I've noticed that there is a com.maxprograms.tmengine.ILogger interface for notifying progress. The progress is indeterminate, but for some stages it looks like it could become deterministic. For example, in DitaMap2Xliff, after setting the stage to "Processing Files" you know how many files will be processed.

Possible performance improvement for Segmenter.segment

Hi. I noticed that for a particular docx file, the convert process takes a long time (about 20 minutes on my computer). The file is not huge (258kB), but it probably has some unusually large section. I have attached the file. I have only seen the problem on this file, but I still thought that it would be good to report it. The file is notice.docx
Example: ./convert.sh -file notice.docx -srcLang en -tgtLang sv -2.0

I have pinned down the bottleneck to Segmenter.segment, in which the time consumer seems to be
the calls to hideTags(pureText.substring(...));. The length of pureText was 16681 in this case.

So there are three levels of nested loops:
the for-loop over pureText, the while-loop in hideTags(), and String.substring() in the while-loop.
Each loop level has a length of 16681 in this case, which explains the high time consumption.

Perhaps it is possible to solve this with StringBuilder, or by changing the string handling in some other way.
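The cost pattern the reporter describes can be sketched in isolation (this is illustrative code, not the actual Segmenter): calling substring inside a loop over the same string copies the tail on every iteration, while a single charAt pass (or a StringBuilder) touches each character once.

```java
// Sketch of the quadratic pattern described above versus a linear rewrite.
// Both methods count sentence-break characters; only the cost differs.
public class SubstringCost {

    // O(n^2): each iteration copies the remaining tail of the string.
    static int countBreaksQuadratic(String pureText, char breakChar) {
        int count = 0;
        for (int i = 0; i < pureText.length(); i++) {
            String tail = pureText.substring(i); // copies up to n chars every time
            if (tail.charAt(0) == breakChar) {
                count++;
            }
        }
        return count;
    }

    // O(n): one pass over the characters, no intermediate copies.
    static int countBreaksLinear(String pureText, char breakChar) {
        int count = 0;
        for (int i = 0; i < pureText.length(); i++) {
            if (pureText.charAt(i) == breakChar) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "One. Two. Three.";
        System.out.println(countBreaksQuadratic(text, '.')); // 3
        System.out.println(countBreaksLinear(text, '.'));    // 3
    }
}
```

With a 16681-character pureText, the quadratic variant performs on the order of 10^8 character copies, which matches the observed slowdown.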

XLIFF 2.0 validator invalidates user-defined subState/subType values

According to the XLIFF 2.0 specification, users can define subState/subType values with custom namespaces.

For example, the XLIFF below defines a custom namespace abc, and abc:mt is used for the subState attribute in a <segment> element.

<?xml version="1.0" encoding="UTF-8"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0"
 srcLang="en" trgLang="ja" xmlns:abc="http://example.com/xliff/abc">
<file id="f1">
<unit id="u1">
  <segment id="s1" state="translated" subState="abc:mt">
    <source>Hello</source>
    <target>こんにちは</target>
  </segment>
</unit>
</file>
</xliff>

My understanding is that this is supported by the XLIFF 2.0 specification. However, the validator returns an error: Invalid prefix 'abc' in "subState" attribute.

com.maxprograms.validation.Xliff20 has a list of known prefixes (namespaces) as below.

	private List<String> knownPrefixes = Arrays.asList("xlf", "mtc", "gls", "fs", "mda", "res", "ctr", "slr", "val",
			"its", "my");

The list is fixed, so any prefix not included in it is rejected. By the way, "my" in this list is not defined by the XLIFF specification, but it is used in some examples in the specification.
The validator should probably append the namespaces declared on the <xliff> element to knownPrefixes when validating subState/subType values.
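A possible fix could look like the sketch below. It works on a plain list of attribute names rather than the validator's actual XML classes (which differ), and merges the spec-defined prefixes with any xmlns:* prefixes declared on the root element:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PrefixSketch {

    // Sketch: combine the fixed list of spec-defined prefixes with the
    // prefixes of all xmlns:* declarations found on the <xliff> element.
    // attributeNames is a hypothetical stand-in for iterating the root's
    // attributes; the real validator reads them from its own DOM classes.
    static List<String> knownPrefixes(List<String> attributeNames) {
        List<String> known = new ArrayList<>(Arrays.asList("xlf", "mtc", "gls",
                "fs", "mda", "res", "ctr", "slr", "val", "its"));
        for (String name : attributeNames) {
            if (name.startsWith("xmlns:")) {
                known.add(name.substring("xmlns:".length()));
            }
        }
        return known;
    }

    public static void main(String[] args) {
        // the user-declared prefix "abc" from the example above is now accepted
        System.out.println(knownPrefixes(List.of("srcLang", "xmlns:abc")));
    }
}
```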

IDML to XLIFF: identify page

First of all, thanks for this great tool suite 👍
I've converted an IDML file to XLIFF, and I want to know: is there any way to identify the pages to which the translation units belong?

config_dita | Missing config

According to this example:

<title>Learning Overview topic</title> <title>Objectives</title> When you complete this lesson, you'll know how to do the following: Create a good learning overview topic. Identify clear learning objectives. Add good test items to assess knowledge gained.

The following configuration entry is missing:
lcObjective

compilation: unmappable character for encoding US-ASCII

Hello,

I had an issue compiling OpenXLIFF from master with Ant, but I managed to fix it by setting the javac encoding option in build.xml as shown below. Would you like me to submit a pull request?


Problem

# ant
  ...
compile:
    [javac] Compiling 112 source files to /root/OpenXLIFF-master/bin
    [javac] /root/OpenXLIFF-master/src/com/maxprograms/converters/xml/Xml2Xliff.java:842: error: unmappable character (0xC2) for encoding US-ASCII
    [javac] 			if (" \u00A0\r\n\f\t\u2028\u2029,.;\":<>?????!()[]{}=+/*\u00AB\u00BB\u201C\u201D\u201E\uFF00"
    [javac] 			                                        ^
    [javac] /root/OpenXLIFF-master/src/com/maxprograms/converters/xml/Xml2Xliff.java:842: error: unmappable character (0xBF) for encoding US-ASCII
    [javac] 			if (" \u00A0\r\n\f\t\u2028\u2029,.;\":<>?????!()[]{}=+/*\u00AB\u00BB\u201C\u201D\u201E\uFF00"
    [javac] 			                                         ^
    [javac] /root/OpenXLIFF-master/src/com/maxprograms/converters/xml/Xml2Xliff.java:842: error: unmappable character (0xC2) for encoding US-ASCII
    [javac] 			if (" \u00A0\r\n\f\t\u2028\u2029,.;\":<>?????!()[]{}=+/*\u00AB\u00BB\u201C\u201D\u201E\uFF00"
    [javac] 			                                           ^
    [javac] /root/OpenXLIFF-master/src/com/maxprograms/converters/xml/Xml2Xliff.java:842: error: unmappable character (0xA1) for encoding US-ASCII
    [javac] 			if (" \u00A0\r\n\f\t\u2028\u2029,.;\":<>?????!()[]{}=+/*\u00AB\u00BB\u201C\u201D\u201E\uFF00"
    [javac] 			                                            ^
    [javac] 4 errors

Fix

in build.xml:

<javac srcdir="src" destdir="bin" classpathref="OpenXLIFF.classpath" modulepathref="OpenXLIFF.classpath" includeAntRuntime="false">
    <compilerarg line="-encoding utf-8" />
</javac>

Environment

# uname -r
5.10.61
# cat /etc/debian_version
11.0
# java --version
openjdk 11.0.12 2021-07-20
OpenJDK Runtime Environment (build 11.0.12+7-post-Debian-2)
OpenJDK 64-Bit Server VM (build 11.0.12+7-post-Debian-2, mixed mode, sharing)
# ant -version
Apache Ant(TM) version 1.10.11 compiled on July 10 2021
# locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

Newline lost in conversion cycle for some cases

In certain cases, when running convert + merge without making any changes to the XLIFF file, the output document differs from the input: a newline is lost. I'm wondering whether this is expected behavior when using default.srx, or a bug.

See attached example files:
input: test3.docx

Test?
Example.

output: test3_sv.docx

Test?Example.

Steps:
./convert.sh -file test3.docx -srcLang en -tgtLang sv -2.0 -embed
./merge.sh -xliff test3.docx.xlf -target test3_sv.docx

Version used: latest master branch, bddd767
The provided default.srx was used (though as far as I understand, segmentation should not affect whether the output document matches the input).

Performance improvement for convert with -embed and -2.0

I noticed that the -2.0 option in combination with -embed (which is my use case) makes the convert step take a very long time. Maybe you are already aware of it, but here are some measurements and a small investigation.
All examples use the same test.docx (4.3 MB, 4400 words):

without -embed flag:
./convert.sh -file test.docx -srcLang da -tgtLang sv -2.0
13 seconds

without -2.0 flag:
./convert.sh -file test.docx -srcLang da -tgtLang sv -embed
13 seconds

with both -2.0 and -embed flag:
./convert.sh -file test.docx -srcLang da -tgtLang sv -2.0 -embed
77 seconds

I did some debugging: one particular call to com.maxprograms.xml.Element.mergeText() takes about a minute to complete, and the bottleneck seems to be this line:
https://github.com/rmraya/OpenXLIFF/blob/master/src/com/maxprograms/xml/Element.java#L167

When mergeText() runs for the <internal-file> element (i.e. all the base64 skeleton data), the content member is a big vector (37000 entries in my case), which is concatenated entry by entry into a new string:
t.setText(t.getText() + ((TextNode) n).getText());

This can probably be improved fairly easily, so that it runs almost instantly, for example by using a StringBuilder for the concatenation.
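The slowdown is the classic repeated-concatenation trap. A minimal sketch of the difference (not the actual mergeText() code, which operates on TextNode objects) under the assumption that only text accumulation matters:

```java
import java.util.List;

public class MergeSketch {

    // Quadratic: t.setText(t.getText() + next) copies the entire
    // accumulated text on every call, O(n^2) over n text nodes.
    static String mergeQuadratic(List<String> nodes) {
        String text = "";
        for (String n : nodes) {
            text = text + n;
        }
        return text;
    }

    // Linear: StringBuilder appends in amortized O(1) and pays
    // for a single final copy in toString().
    static String mergeLinear(List<String> nodes) {
        StringBuilder sb = new StringBuilder();
        for (String n : nodes) {
            sb.append(n);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("a", "b", "c");
        // both produce the same result; only the cost differs
        System.out.println(mergeLinear(nodes).equals(mergeQuadratic(nodes)));
    }
}
```

For 37000 entries the linear version should bring the step from about a minute down to well under a second.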

Getting NullPointerException when trying to convert a DITA map to XLIFF 1.2

Hi, I am getting the following stack trace:

Cannot invoke "Object.hashCode()" because "key" is null
	at java.base/java.util.Hashtable.containsKey(Hashtable.java:353)
	at openxliff/com.maxprograms.xml.Catalog.resolveEntity(Catalog.java:334)
	at java.xml/com.sun.org.apache.xerces.internal.util.EntityResolver2Wrapper.resolveEntity(EntityResolver2Wrapper.java:178)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLEntityManager.resolveEntityAsPerStax(XMLEntityManager.java:1026)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1307)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.startPE(XMLDTDScannerImpl.java:732)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.skipSeparator(XMLDTDScannerImpl.java:2101)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanDecls(XMLDTDScannerImpl.java:2064)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanDTDExternalSubset(XMLDTDScannerImpl.java:299)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1165)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1040)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:917)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:542)
	at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:889)
	at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:825)
	at java.xml/com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
	at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1224)
	at java.xml/com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:637)
	at openxliff/com.maxprograms.xml.SAXBuilder.build(SAXBuilder.java:170)
	at openxliff/com.maxprograms.xml.SAXBuilder.build(SAXBuilder.java:69)
	at openxliff/com.maxprograms.converters.ditamap.ScopeBuilder.recurse(ScopeBuilder.java:117)
	at openxliff/com.maxprograms.converters.ditamap.ScopeBuilder.recurse(ScopeBuilder.java:172)
	at openxliff/com.maxprograms.converters.ditamap.ScopeBuilder.recurse(ScopeBuilder.java:172)
	at openxliff/com.maxprograms.converters.ditamap.ScopeBuilder.recurse(ScopeBuilder.java:172)
	at openxliff/com.maxprograms.converters.ditamap.ScopeBuilder.recurse(ScopeBuilder.java:172)
	at openxliff/com.maxprograms.converters.ditamap.ScopeBuilder.recurse(ScopeBuilder.java:119)
	at openxliff/com.maxprograms.converters.ditamap.ScopeBuilder.recurse(ScopeBuilder.java:172)
	at openxliff/com.maxprograms.converters.ditamap.ScopeBuilder.recurse(ScopeBuilder.java:172)
	at openxliff/com.maxprograms.converters.ditamap.ScopeBuilder.buildScope(ScopeBuilder.java:74)
	at openxliff/com.maxprograms.converters.ditamap.DitaParser.run(DitaParser.java:154)
	at openxliff/com.maxprograms.converters.ditamap.DitaMap2Xliff.run(DitaMap2Xliff.java:99)
	at openxliff/com.maxprograms.converters.Convert.run(Convert.java:405)
	at openxliff/com.maxprograms.converters.Convert.main(Convert.java:280)
It happens in the method public InputSource resolveEntity(String name, String publicId, String baseURI, String systemId) in Catalog.java.

Apparently the publicId is null in this case, whereas the systemId is non-null. The baseURI is null in every case; it seems to be set to null by default. So the code reaches dtdEntities.containsKey(publicId), which throws this error, because Hashtable does not accept null keys. Any ideas how to get around it?
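A defensive fix could look like the sketch below. It uses a plain Map instead of the actual Catalog internals (which differ): guard the publicId lookup with a null check and fall back to resolving via the non-null systemId:

```java
import java.util.HashMap;
import java.util.Map;

public class ResolveSketch {

    // Sketch only: dtdEntities here is a plain Map standing in for the
    // Catalog's lookup table. The real class uses a Hashtable, whose
    // containsKey(null) throws NullPointerException, hence the guard.
    static String resolve(Map<String, String> dtdEntities, String publicId, String systemId) {
        if (publicId != null && dtdEntities.containsKey(publicId)) {
            return dtdEntities.get(publicId);
        }
        // publicId absent or unknown: fall back to the systemId
        return systemId;
    }

    public static void main(String[] args) {
        Map<String, String> entities = new HashMap<>();
        entities.put("-//OASIS//DTD DITA Map//EN", "map.dtd");
        // null publicId no longer crashes; the systemId wins instead
        System.out.println(resolve(entities, null, "topic.dtd"));
    }
}
```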

SAXException when converting docx file

For a certain file with lots of tags, an exception occurs during conversion:

Steps:
./convert.sh -file notice.docx -srcLang en -tgtLang sv -2.0
(file is attached)

Output:

Oct 24, 2019 11:45:17 AM com.maxprograms.xml.CustomErrorHandler fatalError
SEVERE: 1:250 Element type "p" must be followed by either attribute specifications, ">" or "/>".
Oct 24, 2019 11:45:17 AM com.maxprograms.converters.msoffice.MSOffice2Xliff run
SEVERE: Error converting MS Office file
org.xml.sax.SAXException: [Fatal Error] 1:250 Element type "p" must be followed by either attribute specifications, ">" or "/>".
	at openxliff/com.maxprograms.xml.CustomErrorHandler.fatalError(CustomErrorHandler.java:43)
	at java.xml/com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:181)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1471)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.seekCloseOfStartTag(XMLDocumentFragmentScannerImpl.java:1433)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:242)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2710)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:534)
	at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:888)
	at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:824)
	at java.xml/com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
	at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1216)
	at java.xml/com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:635)
	at openxliff/com.maxprograms.xml.SAXBuilder.build(SAXBuilder.java:89)
	at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.writeSegment(MSOffice2Xliff.java:141)
	at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePara(MSOffice2Xliff.java:386)
	at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePhrase(MSOffice2Xliff.java:587)
	at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePhrase(MSOffice2Xliff.java:589)
	at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePhrase(MSOffice2Xliff.java:589)
	at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePhrase(MSOffice2Xliff.java:589)
	at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePhrase(MSOffice2Xliff.java:589)
	at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePhrase(MSOffice2Xliff.java:589)
	at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePhrase(MSOffice2Xliff.java:589)
	at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePhrase(MSOffice2Xliff.java:589)
	at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePhrase(MSOffice2Xliff.java:589)
	at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePhrase(MSOffice2Xliff.java:589)
	at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePhrase(MSOffice2Xliff.java:589)
	at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePara(MSOffice2Xliff.java:419)
	at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recurse(MSOffice2Xliff.java:283)
	at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recurse(MSOffice2Xliff.java:285)
	at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.run(MSOffice2Xliff.java:97)
	at openxliff/com.maxprograms.converters.office.Office2Xliff.run(Office2Xliff.java:131)
	at openxliff/com.maxprograms.converters.Convert.run(Convert.java:366)
	at openxliff/com.maxprograms.converters.Convert.main(Convert.java:238)

notice.docx
