Comments (10)
and the "html" parser will convert everything <apertium-notrans> as superblanks...
...which is the problem. That ruins contexts. It'd be much better to let it have an analysis that makes sense in context.
Something like <apertium-notrans pos="np">STARTUP</apertium-notrans>
which would get turned into ^STARTUP/STARTUP<np><m:notrans>$
by something and respected by lt-proc, or done by lt-proc directly.
Yes, it would require changes to lt-proc, but that will be needed anyway for things like markup handling. Might as well make that part generic also.
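To make the proposal concrete, here is a minimal Python sketch of the conversion described above. It is an illustration only, not code from Apertium or lt-proc: the function name `notrans_to_lu` and the regex are hypothetical, the `pos` attribute and the `<m:notrans>` secondary tag are taken from the suggestion in this comment, and real input would also need stream-format escaping, which is skipped here.

```python
import re

# Hypothetical pattern for the proposed markup:
#   <apertium-notrans pos="np">STARTUP</apertium-notrans>
NOTRANS = re.compile(
    r'<apertium-notrans\s+pos="(?P<pos>[^"]+)">(?P<word>.*?)</apertium-notrans>',
    re.DOTALL,
)


def notrans_to_lu(text: str) -> str:
    """Replace each marked-up word with a lexical unit carrying the
    requested part of speech plus an <m:notrans> secondary tag."""

    def to_lu(match: re.Match) -> str:
        word = match.group("word")
        pos = match.group("pos")
        # Surface form and lemma are the same: the word is passed through as-is.
        return f"^{word}/{word}<{pos}><m:notrans>$"

    return NOTRANS.sub(to_lu, text)


if __name__ == "__main__":
    print(notrans_to_lu(
        'A new <apertium-notrans pos="np">STARTUP</apertium-notrans> opened.'
    ))
    # prints: A new ^STARTUP/STARTUP<np><m:notrans>$ opened.
```

Whether this step lives in a separate tool or inside lt-proc is exactly the open question; the sketch only shows the intended input and output.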
from apertium.
@unhammer Secondary tags could help here.
Yup. All sorts of outside information could be passed along this way.
I think the key point here is "outside the pipeline". How would you add secondary tags to a word from outside the pipeline? I guess that would also require changes to lt-proc, so that it turns the new "do-not-translate-this-word" markup into a secondary tag...
@xavivars I know it says outside the pipeline, but as I understand it, whether or not to translate a word is something that will be computed in the pipeline (sort of a Named Entity Recognition thingy), unless you want to manually provide a list of words that shouldn't be translated. (Isn't this something that can be done in t1x already? Give a list, and if the word is part of that list, propagate the SL lemma. It won't be generated, but that's fine.)
Or when you say outside the pipeline, do you mean actually having a markup in the input corpus? If you mean that analysis of the word should produce the LU with a <don't translate> tag, that can certainly be done in the monodix if needed.
Yes, I mean completely outside the pipeline. Apertium currently supports doing that "from outside" the pipeline. You can send this text:
<apertium-notrans>This text will not be translated</apertium-notrans>, but this one will.
and the "html" parser will convert everything inside <apertium-notrans> into superblanks before the morphological analyser starts processing the text.
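As a rough illustration of that behaviour, the Python sketch below wraps each `<apertium-notrans>` span in square brackets, the superblank delimiters of the Apertium stream format. It is a simplified stand-in, not the actual deformatter: the name `hide_notrans` is hypothetical, and escaping of reserved characters and handling of other HTML tags are left out.

```python
import re

# Matches a whole <apertium-notrans>...</apertium-notrans> span, tags included.
NOTRANS_SPAN = re.compile(r"<apertium-notrans>.*?</apertium-notrans>", re.DOTALL)


def hide_notrans(text: str) -> str:
    """Wrap every notrans span in [ ] so later stages treat it as a
    superblank (formatting to pass through) rather than text to analyse."""
    return NOTRANS_SPAN.sub(lambda m: "[" + m.group(0) + "]", text)


if __name__ == "__main__":
    source = (
        "<apertium-notrans>This text will not be translated</apertium-notrans>, "
        "but this one will."
    )
    print(hide_notrans(source))
    # prints: [<apertium-notrans>This text will not be translated</apertium-notrans>], but this one will.
```

Because the whole span becomes a blank, the analyser never sees the words inside it, so the surrounding rules get no analysis to work with.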
Definitely, I'm not saying we can't touch lt-proc. My point was a reply to this:
@xavivars I know it says outside the pipeline, but as I understand it, whether to translate or not translate a word is something that will be computed in the pipeline.
Unhammer's request was being able to do that from outside the pipeline.
Yes, Xavi interpreted my request correctly :) This was for users who don't want to / are not able to change the translator at all, but just need a way to make something in their texts untranslatable.
Ah, alright. I thought the request was about a more general "mark words that one shouldn't translate". But my initial comment was about secondary tags helping and, as @TinoDidriksen said, the markup can be converted to a secondary tag (the same way we would convert HTML tags to secondary tags and attach them to word LUs).
Note that this is also related to the issue of codeswitching, e.g. identifying and marking spans as not translatable because they are in another language. I put some thoughts here.