translate-html's Introduction

translate-html

Translate HTML using Beautiful Soup and Argos Translate

Install

pip install translatehtml

import argostranslate.package, argostranslate.translate
import translatehtml

from_code = "es"
to_code = "en"

html_doc = """<div><h1>Perro</h1></div>"""

# Download and install Argos Translate package
available_packages = argostranslate.package.get_available_packages()
available_package = list(
    filter(
        lambda x: x.from_code == from_code and x.to_code == to_code, available_packages
    )
)[0]
download_path = available_package.download()
argostranslate.package.install_from_path(download_path)

# Translate
installed_languages = argostranslate.translate.get_installed_languages()
from_lang = list(filter(lambda x: x.code == from_code, installed_languages))[0]
to_lang = list(filter(lambda x: x.code == to_code, installed_languages))[0]

translation = from_lang.get_translation(to_lang)

translated_soup = translatehtml.translate_html(translation, html_doc)

print(translated_soup)

translate-html's People

Contributors

argosopentech, dingedi, misuzu, pierotofy, pj-finlay

translate-html's Issues

Doesn't work with beautifulsoup4==4.12.2

When running the example with the latest beautifulsoup4, the text is not translated:

$ pip freeze | grep -i beautifulsoup4
beautifulsoup4==4.9.3
>>> print(translated_soup)
<div><h1>Dog</h1></div>
$ pip install -U beautifulsoup4
$ pip freeze | grep -i beautifulsoup4
beautifulsoup4==4.12.2
>>> print(translated_soup)
<div><h1>Perro</h1></div>

translate-html should parse whitespace in HTML style

When an HTML string contains whitespace such as tabs or newlines inside a tag, translate-html returns separate translations for each line (or each tab-separated piece of text). This is inconsistent with how HTML is normally parsed. For examples, please see LibreTranslate/LibreTranslate#288.

Every component involved in the translation works correctly on its own: BeautifulSoup preserves whitespace (as it should) when returning the tag tree, and the translation function, which no longer knows that a string came from an HTML tag, correctly assumes that a new line means new content.

So somewhere along the way the HTML should be "minified" so that it is consistent with how a browser parses it, at the cost of losing the source's visual formatting in the translated output. I am not sure where this is best done, but I would suggest doing it in the translate_html function. I will create a corresponding PR.
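
The "minification" proposed above can be sketched as follows. This is only an illustration of browser-style whitespace collapsing on a single text string, not the change that was actually submitted; the helper name collapse_whitespace is hypothetical.

```python
import re

# HTML treats space, tab, line feed, form feed and carriage return as
# interchangeable whitespace; a browser renders any run of them as one space.
_HTML_WHITESPACE = re.compile(r"[ \t\n\f\r]+")


def collapse_whitespace(text):
    """Collapse each run of HTML whitespace into a single space (hypothetical helper)."""
    return _HTML_WHITESPACE.sub(" ", text)


source = "Hello\n\t   world,\nhow are you?"
collapsed = collapse_whitespace(source)
print(collapsed)
```

A real implementation would have to skip elements such as pre, where whitespace is significant.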

Translated comments shown in final document

The contents of the comments get translated correctly, but when reconstructing the BeautifulSoup object their tags are lost, causing the final translated document to show the contents of the comments when opened with a web browser.

The issue, as far as I can work out, is that neither itag_of_soup nor soup_of_itag differentiates between a bs4.element.NavigableString and a bs4.element.Comment (which inherits from the former).

So itag_of_soup returns a str object regardless of whether it's processing a NavigableString or a Comment. When soup_of_itag is called, it checks whether the object passed to it is an instance of str and, if so, constructs a NavigableString, which for comments results in losing the <!-- --> delimiters in the final document.

Here's an example:

import argostranslate.translate
import translatehtml

# Original "file"
content = """
<html>
    <head>
        <title>Test</title>
    </head>
    <body>
        <!-- This should not be seen in a browser -->
        <h1>Welcome to Test!</h1>
    </body>
</html>
"""

# Define languages for translation from English to Hindi
en = argostranslate.translate.get_language_from_code("en")
hi = argostranslate.translate.get_language_from_code("hi")
ut = en.get_translation(hi)

# Translate the file with translate_html
content = translatehtml.translate_html(ut, content)

# Write the translated file
with open("test.html", "wt") as fp:
    fp.write(str(content))

[Screenshot: original file]
[Screenshot: translation]
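
The type distinction described in this issue can be illustrated with Beautiful Soup alone. The sketch below does not use translate-html's itag_of_soup or soup_of_itag; the describe helper is hypothetical. It only shows that a Comment passes an isinstance check for NavigableString, and that rebuilding a node from its plain string as a NavigableString drops the comment delimiters, while rebuilding it as a Comment keeps them.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup("<body><!-- hidden --><h1>Hi</h1></body>", "html.parser")


def describe(node):
    """Classify a node (hypothetical helper). Comment must be tested first,
    because it is a subclass of NavigableString."""
    if isinstance(node, Comment):
        return "comment"
    if isinstance(node, NavigableString):
        return "text"
    return "tag"


kinds = [describe(node) for node in soup.body.children]

# Converting the comment to a plain str discards its type.
plain = str(soup.body.contents[0])

# Rebuilt as ordinary text, the content becomes visible in the document.
as_text = BeautifulSoup("<p></p>", "html.parser")
as_text.p.append(NavigableString(plain))

# Rebuilt as a Comment, the delimiters are written out again.
as_comment = BeautifulSoup("<p></p>", "html.parser")
as_comment.p.append(Comment(plain))

print(kinds, str(as_text), str(as_comment))
```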

doctype is broken after translating

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd">

is turned into

html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd"

This is one example with this particular doctype, but in general any complex doctype breaks the output.
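
One possible workaround, sketched here under the assumption that the doctype sits at the very start of the document, is to set the declaration aside before translation and put it back afterwards. The helper names split_doctype and translate_preserving_doctype are hypothetical, and translate_markup stands in for whatever performs the translation (for instance a wrapper around translatehtml.translate_html); this is not how the library itself handles doctypes.

```python
import re

# A leading doctype declaration, matched case-insensitively. This simple
# pattern does not handle doctypes that contain a ">" inside an internal subset.
_DOCTYPE = re.compile(r"\s*<!DOCTYPE[^>]*>", re.IGNORECASE)


def split_doctype(html):
    """Return (doctype, rest); doctype is "" when the document has none."""
    match = _DOCTYPE.match(html)
    if match is None:
        return "", html
    return match.group(0), html[match.end():]


def translate_preserving_doctype(html, translate_markup):
    """Translate everything after the doctype and re-attach it unchanged."""
    doctype, rest = split_doctype(html)
    return doctype + str(translate_markup(rest))


doc = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd">'
    "<p>Hallo</p>"
)
# Stand-in translator used only for this demonstration.
result = translate_preserving_doctype(doc, lambda m: m.replace("Hallo", "Hello"))
print(result)
```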

Translation issues (duplication and missing words) for German source with HTML-encoded &lt;

First of all: thanks for your work!

I ran into the following issues.

Using the example with from_code = "de" and to_code = "en":

html_doc = """<p>&lt;</p>"""

translates to:

<p>&lt;= &lt;= &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt;</p>

expected translation: <p>&lt;</p>

and

html_doc = """<p>&lt; Fehler</p>"""

translates to:
<<p>&lt;=</p>

expected translation: <p>&lt; error</p>
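
The first case can be avoided on the caller's side by not sending symbol-only strings to the translation model at all. The sketch below is an assumption about a possible guard, not the library's behavior; has_translatable_text and translate_text_node are hypothetical names, and translate_text stands in for the real translation call. It does not address the second case, where the model mistranslates text that does contain a word.

```python
import html


def has_translatable_text(text):
    """True if the string contains at least one letter (hypothetical guard)."""
    return any(ch.isalpha() for ch in text)


def translate_text_node(text, translate_text):
    """Pass the text to translate_text only when it contains letters."""
    if not has_translatable_text(text):
        return text
    return translate_text(text)


# The parser hands the translator the decoded character, not the entity.
decoded = html.unescape("&lt;")

# Stand-in translator used only for this demonstration.
fake_translate = lambda s: s.replace("Fehler", "error")

untouched = translate_text_node(decoded, fake_translate)
translated = translate_text_node("< Fehler", fake_translate)
print(html.escape(untouched), html.escape(translated))
```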

Release for exclusions of tags

Hi,

Could you release a new version on PyPI that includes the exclusion of HTML tags marked with translate="no"?

Translate some HTML attributes

Can attribute translation be easily implemented?

For example, translating the "alt" attribute of an image:

<img src="..." alt="some text" />
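
As a rough idea of what this request could look like, the sketch below walks a Beautiful Soup tree and translates selected attribute values. It is not part of translate-html: translate_attributes and the TRANSLATABLE_ATTRS list are hypothetical, and translate_text stands in for a real translation call (in the README's setup, the translate method of the translation object could presumably be passed).

```python
from bs4 import BeautifulSoup

# Assumption: which attributes carry human-readable text worth translating.
TRANSLATABLE_ATTRS = ("alt", "title")


def translate_attributes(html, translate_text, attrs=TRANSLATABLE_ATTRS):
    """Return a soup in which the listed attribute values are translated."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):  # True matches every tag
        for name in attrs:
            value = tag.get(name)
            # Skip missing, empty, or multi-valued attributes.
            if isinstance(value, str) and value.strip():
                tag[name] = translate_text(value)
    return soup


# Stand-in translator used only for this demonstration.
glossary = {"Hund": "dog"}
soup = translate_attributes(
    '<img src="dog.png" alt="Hund"/>', lambda s: glossary.get(s, s)
)
print(soup)
```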
