
translation-memory-tools's Introduction


Introduction

This is the toolset used at Softcatalà to build the translation memories for all the projects that we know exist in the Catalan language and have their translations openly available. This project acts as a data pipeline, and the project https://github.com/Softcatala/translation-memory-tools-webservice then provides an API on top of the data.

You can see it online at https://www.softcatala.org/recursos/memories/

The toolset contains the following components, each with its own responsibility:

Builder (fetch and build memories)

  • Download and unpack the files from source repositories
  • Convert from the different translation formats (ts, strings, etc) to PO
  • Create a translation memory for each project in PO and TMX formats
  • Produce a single translation memory file that contains all the projects
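
For orientation, this pipeline can be approximated with off-the-shelf tools. The sketch below is illustrative only (file names are hypothetical and this is not the actual builder.py code); it uses translate-toolkit's ts2po and po2tmx plus gettext's msgcat:

# Illustrative builder pipeline sketch; not the real builder.py code.
import subprocess

# Convert a downloaded Qt Linguist file to PO (translate-toolkit).
subprocess.run(["ts2po", "-i", "project.ts", "-o", "project.po"], check=True)

# Merge per-file catalogs into one memory per project (gettext).
subprocess.run(["msgcat", "project.po", "extra.po", "-o", "project-tm.po"], check=True)

# Export the project memory to TMX (translate-toolkit).
subprocess.run(["po2tmx", "-l", "ca", "-i", "project-tm.po", "-o", "project-tm.tmx"], check=True)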

Terminology (terminology extraction)

  • Analyzes the PO files and creates a report with the most common terminology across the projects
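
A minimal sketch of the idea, using polib and collections.Counter (illustrative; the real terminology component is more elaborate, and the file names are hypothetical):

# Illustrative terminology count across several project PO files.
from collections import Counter
import polib

counter = Counter()
for path in ["gnome.po", "libreoffice.po"]:  # hypothetical file names
    for entry in polib.pofile(path).translated_entries():
        for word in entry.msgid.lower().split():
            counter[word] += 1

print(counter.most_common(20))  # most common source-language terms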

Quality (feedback on how to improve translations)

  • Runs Pology and LanguageTool and generates HTML reports on translation quality
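
As a rough illustration (the actual report generation differs), a quality pass could invoke Pology's posieve and the public LanguageTool HTTP API:

# Illustrative quality checks; not the actual reports pipeline.
import subprocess
import requests

# Run a Pology rule check over a PO file.
subprocess.run(["posieve", "check-rules", "project.po"], check=True)

# Ask LanguageTool to review a translated string.
response = requests.post(
    "https://api.languagetool.org/v2/check",
    data={"text": "Aquesta es una frase", "language": "ca"},
)
for match in response.json()["matches"]:
    print(match["message"])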

Installation

Setting up before execution

To download the translations of some of the projects you need credentials for those systems, for example API keys.

builder.py expects the credentials to be defined in the following locations:

  • At cfg/credentials in the different YAML files: zanata.yaml for Zanata, weblate.yaml for Weblate and crowdin.yaml for Crowdin. The *-sample files provide examples of how these files should be structured (see the reading sketch after this list).
  • For Transifex, the credentials should be at ~/.transifexrc, since this is where the Transifex CLI tool expects them.
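
As a rough reading sketch (the actual keys are whatever the corresponding *-sample file defines, not what is shown here):

# Hypothetical credentials loading; check weblate.yaml-sample for the real keys.
import yaml

with open("cfg/credentials/weblate.yaml") as f:
    credentials = yaml.safe_load(f)

api_key = credentials["api-key"]  # assumed key name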

All these projects require you to have the right credentials, and often to be a member of the Catalan project, to be able to download the translations.

If you are building a local Docker image, place your Transifex credentials file at cfg/credentials/transifexrc; it will be copied to the right location in the Docker image. Remember that the Docker build context cannot access your ~ directory.

Running the builder code locally

This section focuses on helping you run the builder component locally, in case you want to quickly test new project configurations. For any other use case, we recommend using Docker.

Debian:

sudo apt-get update -y && sudo apt-get install python3-dev libhunspell-dev libyaml-dev gettext zip mercurial bzr ruby git curl wget g++ subversion bzip2 python2-dev -y
curl https://raw.githubusercontent.com/transifex/cli/master/install.sh | bash && sudo mv ./tx /usr/bin/
sudo gem install i18n-translators-tools
pip3 install -r requirements.txt

macOS:

brew install python3 breezy hunspell libyaml gettext zip mercurial ruby git curl wget gcc subversion bzip2
curl https://raw.githubusercontent.com/transifex/cli/master/install.sh | bash
sudo gem install i18n-translators-tools
pip3 install -r requirements.txt

For example, to download only the Abiword project:

cd src
./builder.py -p Abiword

Running the system locally using Docker

This requires that you have docker, docker-compose and make installed on your system.

First download the data for the projects and generate the data quality reports:

make docker-run-builder

Downloading all the projects can take up to a day, which is not acceptable for a development cycle. In docker/local.yml, the DEV_SMALL_SET variable forces the builder to download only a small subset of projects. This subset does not require any specific credentials to be defined in order to download it.
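
As an illustration of the mechanism only (the real logic lives in the builder and docker/local.yml), such a switch might gate the project list like this:

# Illustrative only; not the actual builder code.
import os

ALL_PROJECTS = ["Abiword", "GNOME", "LibreOffice"]  # hypothetical list
SMALL_SET = ["Abiword"]

projects = SMALL_SET if os.environ.get("DEV_SMALL_SET") else ALL_PROJECTS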

The output files are copied to the local web-docker directory to make it easy for you to explore the results.

Contributing

If you are looking at how to contribute to the project, see HOW-TO.md.

Contact Information

Jordi Mas: [email protected]

translation-memory-tools's People

Contributors

bellaperez, davidcanovas, ecron, ereza, gforcada, jaumeortola, jmaspons, jmontane, jordibrus, jordimas, jordis, julen, marcriera, miniangel, pereorga, rbnval, rbuj, toniher, txemaq, unho, xaloc33, xavivars, xispa


translation-memory-tools's Issues

Split some pages

For example, it might be a good idea to put the TM search and the TM list on separate pages.

jQuery is missing in HTML files

Although all SC-related HTML files include some jQuery-dependent scripts (like the cookiechecker), they do not include the jQuery library itself, which generates an error when the pages are displayed in the browser.

Make sure which Python this works on

In the README it says "python 2.7 or higher".

But I have serious doubts about whether this code can work on Python 3 (I am thinking of things like print "some string", where print is treated as a keyword instead of a function).

Also, I am curious why this is not supposed to work on Python 2.6 (maybe it is because of format?).
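
For reference, the incompatibilities in question:

# Python 2 only: print as a statement (a syntax error on Python 3).
# print "some string"

# Python 3: print is a function.
print("some string")

# str.format() exists on 2.6, but auto-numbered fields need 2.7+:
"{0} {1}".format("a", "b")  # works on 2.6
"{} {}".format("a", "b")    # 2.7 and later only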

Choose coding style for shell scripting

It is not clear which coding style is used for the various shell scripts present in this repository. Having some clear guidelines, such as how many characters to use for indentation and whether then appears on the same line as if, would help create consistent code that is easier for humans to read.

Rearchitect the tool for a version 2.0

Background

This ticket collects all the architecture improvements needed to fully support the new set of requirements and to address the limitations that we have learned about up to September 2014.

Downloading, converting and building translation memories

  • Decouple the download process #63
  • Create a configuration subdirectory where every project has its own json file #62

Solution: Consider moving to TMX as the native format for the tool. Some considerations:

  • The limitation here is that there are no out-of-the-box tools for merging TMX catalogs (as msgcat does for PO).
  • Probably we will need to build something and consider contributing it to the Translate Toolkit.
  • Many tools that convert from other formats (TS, strings, etc.) convert to PO (e.g. ts2po), not to TMX. We need to think about whether we are OK converting from these formats to PO and then to TMX, or whether we need native converters.
  • This will require rewriting the index creator, the terminology analysis and other tools, since they all rely on PO files as the source format.

Limitation: Conversion from other formats to PO is limited. The problems observed are:

  • Currently we use the file extensions to identify the formats. In the case of INI or strings files, you sometimes need to be more specific, since these can have different variations.

Solution: By default, as today, we have converters associated with extensions. However, we should also have some kind of pattern matching in projects.json where you can specify per project which converters to use.
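
A hypothetical sketch of that dispatch (all names here are invented for illustration):

# Hypothetical converter dispatch: extension defaults plus per-project
# pattern overrides read from projects.json.
import fnmatch

DEFAULT_CONVERTERS = {".ts": "ts2po", ".strings": "prop2po", ".ini": "ini2po"}

def pick_converter(filename, project_overrides):
    # A per-project pattern wins over the extension default.
    for pattern, converter in project_overrides.items():
        if fnmatch.fnmatch(filename, pattern):
            return converter
    for extension, converter in DEFAULT_CONVERTERS.items():
        if filename.endswith(extension):
            return converter
    return None

print(pick_converter("app.ts", {}))                       # -> ts2po
print(pick_converter("ca.ini", {"ca.ini": "mozlang2po"}))  # override wins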

Web application

Limitation: Currently the whole Softcatalà application is tightly coupled with the backends.
Solution: The Softcatalà application, and any other front ends, should be independent applications, maintained by different teams, that use APIs to interact with the system. On GitHub, we should have a simple agnostic web application to show how the APIs work (instead of the Softcatalà one). We should provide 3 APIs:

  • API to search the text index (#28)
  • API to access the translation memory downloads created (date, file, etc)
  • API to access the terminology items created (glossaries)

Limitation: The web application is written using CGI
Solution: Write the application using MVC (#24)

Text Search engine

Potential limitation: We are currently using Whoosh as the full-text search engine. We are not sure how this will scale if we add, for example, 50 more languages and 50 more projects.
Solution: See (#24)
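
For context, a minimal Whoosh query looks like this (the field and directory names are assumptions, not the project's actual schema):

# Minimal Whoosh search sketch; "source"/"target" field names are assumed.
from whoosh.index import open_dir
from whoosh.qparser import QueryParser

ix = open_dir("indexdir")  # hypothetical index directory
with ix.searcher() as searcher:
    query = QueryParser("source", ix.schema).parse("window")
    for hit in searcher.search(query, limit=20):
        print(hit["source"], hit["target"])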

Integration with Translation Memory servers

The vision here is not to implement a Translation Memory server. Implement #23 to integrate with Amagama (https://github.com/translate/amagama).

Have versioning for TMs

This means keeping TMs for old releases, like GNOME 2.28 or GNOME 2.30, available for checking how the translations evolve.

These old TMs shouldn't be queried in the default search.

[TRACKER] Add docs

  • Simplify the root README (2 or 3 paragraphs intro, and move the rest to the docs/ directory) (#51)
  • Include this project docs in Read the Docs (#52)
  • Remove all READMEs (or other notes files) in the repository except the root README (#57)
  • Add docs about how to deploy (#53)
  • Add docs about how to import translations (#54)
  • Add docs about how to help in development (#55)
  • Add coding style for Python (#37)
  • Add coding style for CSS (#38)
  • Add coding style for JavaScript (#39)
  • Add coding style for HTML (#40)
  • Add coding style for shell scripts (#8)

Add license

I can't believe this has no license or copying file.

[TRACKER] Allow generating TMs for the different projects at different frequencies

This is necessary because some projects are updated only a few times over long periods, and some other projects are simply dead. Also, not all projects use a translation format that can be easily converted to PO or TMX, at least not without human intervention.

Acceptable frequencies can be:

  • never (this TM was created manually and is not meant to be recreated automatically)
  • yearly
  • monthly
  • weekly (most important FLOSS projects)

In the future it must be possible to specify different frequencies per language.

This issue has to be split into several smaller ones, i.e. this is a tracker issue.

Add live form to filter returned search results by target translations

This is for providing a way to quickly find all the ways a given English word is translated into a target language.

For example, if we want to know all the ways the word "window" is translated into Galician, we can start by specifying "xanela", then "fiestra"..., hiding the results that match while counting the occurrences per project. This is useful when discussing terminology.

This can be achieved with a bit of JavaScript on a special search page that returns all the results for the specified query.

[TRACKER] Reorganize/cleanup code

  • Translate comments in Catalan to English (#30)
  • Add license (#35)
  • Softcatalà has a grave accent (so à not á) (#68)
  • Add docs (#14)
  • Include building docs on Travis (#69)
  • Include run of integration tests on Travis (#70)
  • Remove specific references in code to jmas home (and stuff like this) (#56)
  • Move images to img/ directory (#9)
  • Move CSS files to css/ directory (#10)
  • Move templates to templates/ directory (#11)
  • Move Softcatalà specifics out of repo (#27)
  • Move all tests code to tests/ directory (#31)
  • Perform a cleaning sweep on all the code to ensure we have valid HTML5 (#48)
  • Perform a cleaning sweep on all the code to ensure we have valid CSS (#49)
  • Perform a cleaning sweep on all the code to ensure we have Python following agreed coding style (#50)
  • Get rid of globals on Python code (#60)
  • Move embedded HTML in Python code to standalone templates (#41)
  • Make sure which Python this works on (#43)
  • Remove all PO and TMX files that are not used for testing purposes (#44)
  • Move the commands to a bin/ directory (not having a bunch of .py files in src/)
  • Make it clear which one is the original files directory, the PO directory, the TMX directory... for the importing process.
  • Discuss maybe separating the translations retrieval (including the TMX and PO compendia creation) and the web query server (including index creation and code for the server) into two different repositories.

Decouple web from API

This means providing search results using only a JSON API. The web page holding the search form can then retrieve the results from the API and append them to the search page itself. That way the search page can be served as static HTML, and we already have an API capable of feeding CAT tools.

pushState support is essential: it must be possible to share specific search URLs with other people.
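
A sketch of what consuming such an API could look like (the endpoint and response shape are invented for illustration, not the actual webservice API):

# Purely illustrative client; endpoint and JSON shape are assumptions.
import requests

response = requests.get(
    "https://example.org/api/search",  # hypothetical endpoint
    params={"source": "window", "project": "GNOME"},
)
for result in response.json():
    print(result)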

Depends on #23

Allow assigning a quality level to each TM

In order to:

  • Prioritize the returned results
  • Allow querying by default only the TMs with higher quality (in some special scenarios it is still necessary to query all the existing TMs)
