Comments (11)

asticode commented on May 25, 2024

Hi @NhanNguyen700, you made very good points in your last comment. I haven't found time yet to read it more thoroughly though. But I'll try to do it when I find some time 👍

from go-astisub.

asticode commented on May 25, 2024

Sorry for answering that late!

I've added escape/unescape to webvtt in master 👍

I haven't found any indication of whether & is allowed in some parts of WebVTT (such as voice names or comments, for instance), so I've added a global escape/unescape.

Let me know if that fixes the issue and/or if you need me to create a specific tag.

tobbee commented on May 25, 2024

A single & is not a valid character in WebVTT, so it should not be automatically escaped. One could argue that an error should be raised.

According to the WebVTT spec, the following should happen if an & is encountered:

U+0026 AMPERSAND (&)
Set tokenizer state to the HTML character reference in data state, and jump to the step labeled next.

SRT is much less well-specified, so it is not clear what should happen, but I think it is reasonable to do the same thing as for WebVTT, and not automatically change & to &amp; or vice versa.

Therefore I don't think this should be treated as a bug, and I think the issue should be dropped.

asticode commented on May 25, 2024

I agree with @tobbee 👍

NhanNguyen700 commented on May 25, 2024

I don't get it. That document is about reading WebVTT from a WebVTT file, right? As I understand it, if a WebVTT file contains the text I &amp; you are talking, then a program that reads it and prints the text to the console should display I & you are talking, right? &amp; is a character reference in the HTML spec (https://html.spec.whatwg.org/multipage/syntax.html#character-references), and it should be unescaped to & when displayed. But as far as I can see, it prints I &amp; you are talking to the console.

Let's talk about tobbee's idea a little. A single & is not valid in WebVTT; I agree, I cannot argue with that. When reading a WebVTT file whose data contains a single &, we should warn, stop parsing, or otherwise report the problem, without automatically escaping that single &. But I am not talking about reading or parsing WebVTT; I am talking about converting to WebVTT from other formats and vice versa.

Back to my question: assume that an SRT file contains the text I & you are talking.
When that text is read from the SRT file and then written into a WebVTT file (that is, converting from SRT to WebVTT), which one do you expect in the WebVTT file: I & you are talking or I &amp; you are talking? As I understand tobbee's explanation, a single & is not a valid character in WebVTT, so I think it should be escaped to &amp; before being written into the WebVTT file, and the result should be I &amp; you are talking. But currently, the actual result when I convert with the astisub command is I & you are talking in the WebVTT file.

If you are not sure what should be done for SRT, and we should keep & in the converted WebVTT without escaping it (for reasons I don't know yet; converting means changing from one format to another without breaking any rules, right? Why should we keep something that breaks a WebVTT rule?), then consider TTML with this text: <p begin="00:00:00:00" end="00:00:01:00"> I &amp; you are talking</p>.
What do you expect when converting to WebVTT? I &amp; you are talking, right? TTML and WebVTT have similar rules for escaping and unescaping, as in XML. But, as far as I can see, the text in the WebVTT file is I & you are talking when I convert with the astisub command. The reason is that when reading TTML (Go deserializes TTML, which is XML, with its default deserializer), &amp; is automatically unescaped to &, so the TTML text I &amp; you are talking becomes I & you are talking, and our WebVTT writer writes that unescaped text to the file as-is.
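
The unescaping step described above can be reproduced with Go's standard encoding/xml package. This is only a minimal sketch: the paragraph struct and decodeParagraph helper are hypothetical stand-ins for illustration, not go-astisub's actual TTML types.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// paragraph is a hypothetical, minimal stand-in for a TTML <p> element.
// It is NOT the type go-astisub uses internally.
type paragraph struct {
	Begin string `xml:"begin,attr"`
	End   string `xml:"end,attr"`
	Text  string `xml:",chardata"`
}

// decodeParagraph parses a single <p> element with encoding/xml.
// The decoder resolves character references such as &amp; while parsing,
// so the returned Text holds the unescaped characters.
func decodeParagraph(src string) (paragraph, error) {
	var p paragraph
	err := xml.Unmarshal([]byte(src), &p)
	return p, err
}

func main() {
	p, err := decodeParagraph(`<p begin="00:00:00:00" end="00:00:01:00"> I &amp; you are talking</p>`)
	if err != nil {
		panic(err)
	}
	// The text now contains a bare "&", which a writer must escape again
	// before emitting it into a format that reserves that character.
	fmt.Printf("%q\n", p.Text)
}
```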

NhanNguyen700 commented on May 25, 2024

Hi @tobbee and @asticode, do you have any comments about the cases I mentioned above?

tobbee commented on May 25, 2024

The main issue I have is that SRT is not a well-defined format. It started out with ASCII-support only, but over time people have started to use other encodings like UTF-8 and also insert HTML-like formatting like < i >italics< /I> (spaces there only to avoid unescaping).

Regarding a single "&", I made the case that it should not be allowed, but given SRT's history as a text-only protocol, I can agree to change that view and instead escape it. The question then is what one should do with
SRT input that is already escaped, like "& amp;" (ignore the separator inside the entity; it is only there so that the characters are not converted). Should that be escaped to & amp;& amp;, or should there be some sort of smartness to only escape a single character "&"?

In general, there are only two characters that need to be escaped in TTML and WebVTT, and they are "&" and "<".

Maybe a reasonable approach is that both these characters should be escaped if provided as input in SRT and then unescaped when outputting SRT again, while in WebVTT input they should be reported as errors or escaped, but not unescaped when outputting WebVTT (so that the output WebVTT is valid, although the input is not)?
Of course, this depends on the internal representation. If the two characters are not escaped in the internal representation, we need to unescape the WebVTT input.
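
As a rough illustration of escaping just these two characters, here is a minimal Go sketch. The names escapeCueText and unescapeCueText are hypothetical and not part of go-astisub; the sketch assumes the input is plain cue text with no formatting tags, since blindly escaping "<" would also escape tags such as <i>.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// cueEscaper replaces the two reserved characters in a single pass,
// so the "&" introduced by "&lt;" is never escaped a second time.
var cueEscaper = strings.NewReplacer("&", "&amp;", "<", "&lt;")

// escapeCueText is a hypothetical helper: it escapes "&" and "<" in plain
// text. It assumes the text holds no formatting tags that must survive.
func escapeCueText(s string) string {
	return cueEscaper.Replace(s)
}

// unescapeCueText is a hypothetical helper that resolves HTML character
// references using the standard library.
func unescapeCueText(s string) string {
	return html.UnescapeString(s)
}

func main() {
	fmt.Println(escapeCueText("I & you < me"))
	// Input that already looks escaped is escaped again; no "smartness" here.
	fmt.Println(escapeCueText("I &amp; you"))
	fmt.Println(unescapeCueText("I &amp; you &lt; me"))
}
```

Note that this naive version treats already-escaped SRT input as literal text, which is one possible answer to the question above, not necessarily the right one.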

NhanNguyen700 commented on May 25, 2024

I only have shallow knowledge of this area. However, as far as I know, SRT does not have any escaping rules; it can handle a single & without escaping that character. So, from my perspective, if we meet &amp; in SRT, we should not unescape it. When we need to convert it to WebVTT, we should then escape it to &amp;amp; in the WebVTT output.

IMO we should unescape every input source following its own format's standard, and escape again when writing into any other format, because each standard has its own escaping rules.
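
A minimal Go sketch of that idea, under stated assumptions: the codec type, the codecs table, and convert are all hypothetical names invented for illustration (they are not go-astisub's API), SRT is assumed to have no escaping rules, and WebVTT text is assumed to contain no formatting tags.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// codec is a hypothetical description of one format's escaping rules.
type codec struct {
	unescape func(string) string // applied when reading the format
	escape   func(string) string // applied when writing the format
}

// markupEscaper escapes "&" and "<" in one pass.
var markupEscaper = strings.NewReplacer("&", "&amp;", "<", "&lt;")

func identity(s string) string { return s }

// codecs is an assumed table: SRT passes text through untouched, while
// WebVTT resolves character references on read and escapes on write.
var codecs = map[string]codec{
	"srt":    {unescape: identity, escape: identity},
	"webvtt": {unescape: html.UnescapeString, escape: markupEscaper.Replace},
}

// convert unescapes text by the source format's rules, then escapes it by
// the target format's rules, so the intermediate text is always raw.
func convert(text, from, to string) (string, error) {
	src, ok := codecs[from]
	if !ok {
		return "", fmt.Errorf("unknown source format %q", from)
	}
	dst, ok := codecs[to]
	if !ok {
		return "", fmt.Errorf("unknown target format %q", to)
	}
	return dst.escape(src.unescape(text)), nil
}

func main() {
	for _, in := range []string{"I & you are talking", "I &amp; you are talking"} {
		out, err := convert(in, "srt", "webvtt")
		if err != nil {
			panic(err)
		}
		fmt.Printf("%q -> %q\n", in, out)
	}
}
```

Keeping the internal representation unescaped means each writer only needs to know its own format's rules.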

kloon15 commented on May 25, 2024

This change breaks styling such as italics in the written VTT subtitle: the opening < is replaced with &lt;

asticode commented on May 25, 2024

Can you open a new issue and share a test file to reproduce the error ?

kloon15 commented on May 25, 2024

> Can you open a new issue and share a test file to reproduce the error ?

Sure thing.
