Comments (11)
Hi @NhanNguyen700, you made very good points in your last comment. I haven't found time yet to read it more thoroughly though. But I'll try to do it when I find some time 👍
from go-astisub.
Sorry for answering that late!
I've added escape/unescape to webvtt in master
👍
I haven't found any indication of whether & is allowed in some parts of WebVTT (such as voice names or comments, for instance), so I've added a global escape/unescape.
Let me know if that fixes the issue and/or if you need me to create a specific tag.
from go-astisub.
A single & is not a valid character in WebVTT, so it should not be automatically escaped. One could argue that an error should be raised.
According to the WebVTT spec, the following should happen if a & is encountered:
U+0026 AMPERSAND (&)
Set tokenizer state to the HTML character reference in data state, and jump to the step labeled next.
SRT is much less well-specified, so it is not clear what should happen, but I think it is reasonable to do the same thing as for WebVTT, and not automatically change from & to &amp; or vice versa.
Therefore I don't think this should be treated as a bug, and the report should be dropped.
from go-astisub.
I agree with @tobbee 👍
from go-astisub.
I don't get it. That document is about reading WebVTT from a WebVTT file, right? As I understand it, if a WebVTT file contains text like I &amp; you are talking, then when a program reads it and prints the text to the console, the console should display I & you are talking, right? &amp; is a character reference in the HTML spec (https://html.spec.whatwg.org/multipage/syntax.html#character-references), and it should be unescaped to & when displayed. But as far as I can see, it prints I &amp; you are talking to the console.
Let's talk about Tobbe's idea a little. A single & is not valid in WebVTT; I agree, and I cannot argue with that. When reading a WebVTT file whose data contains a single &, we should warn, stop parsing, or do something else to report the problem, without automatically escaping that single &. But I am not talking about reading or parsing WebVTT. I am talking about converting to WebVTT from other formats and vice versa.
Back to my question. Assume that an SRT file contains the text I & you are talking.
When reading that text from the SRT file and then writing it into a WebVTT file (that is, converting from SRT to WebVTT), which one do you expect in the WebVTT file: I & you are talking or I &amp; you are talking? As I understand Tobbe's explanation, a single & is not a valid character in WebVTT, so I think it should be escaped to &amp; before being written into the WebVTT file, and the result should be I &amp; you are talking. But currently, the actual result when I convert with the astisub command is I & you are talking in the WebVTT file.
Suppose you are not sure what should be done for SRT and think we should keep & in the converted WebVTT without escaping it (for reasons I don't know yet; converting means changing from one format to another without breaking any rules, right? Why should we keep something that breaks a WebVTT rule?). Then consider TTML with this text: <p begin="00:00:00:00" end="00:00:01:00"> I &amp; you are talking</p>.
What do you expect when converting to WebVTT? I &amp; you are talking, right? TTML and WebVTT have similar rules for escaping and unescaping, in the XML sense. But as I see it, the text in the WebVTT file is I & you are talking when I convert with the astisub command. The reason is that when reading TTML (Go deserializes TTML, which is XML, with its default deserializer), &amp; is automatically unescaped to &, so the TTML text I &amp; you are talking becomes I & you are talking, and our WebVTT writer carelessly writes that unescaped text to the file.
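The TTML behaviour described above can be reproduced with the standard library alone. The following is a minimal sketch, not go-astisub's actual reader: the `paragraph` type and `decodeParagraph` helper are hypothetical names introduced only for illustration. It shows that `encoding/xml` hands back character data with `&amp;` already resolved to `&`.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// paragraph is a hypothetical, minimal stand-in for a TTML <p> element.
// It is not the type go-astisub uses internally.
type paragraph struct {
	Begin string `xml:"begin,attr"`
	End   string `xml:"end,attr"`
	Text  string `xml:",chardata"`
}

// decodeParagraph parses a single <p> element and returns its text content.
// encoding/xml resolves character references while decoding, so the
// returned string contains a raw "&" rather than "&amp;".
func decodeParagraph(src string) (string, error) {
	var p paragraph
	if err := xml.Unmarshal([]byte(src), &p); err != nil {
		return "", err
	}
	return p.Text, nil
}

func main() {
	text, err := decodeParagraph(`<p begin="00:00:00:00" end="00:00:01:00">I &amp; you are talking</p>`)
	if err != nil {
		panic(err)
	}
	fmt.Println(text) // prints: I & you are talking
}
```

If a writer emits this string unchanged, the resulting WebVTT cue contains a bare `&`, which matches the behaviour reported in this comment.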
from go-astisub.
Hi @tobbee and @asticode, do you have any comments on the cases I mentioned above?
from go-astisub.
The main issue I have is that SRT is not a well-defined format. It started out with ASCII support only, but over time people have started to use other encodings like UTF-8 and to insert HTML-like formatting such as < i >italics< /i > (the spaces are there only to avoid unescaping).
Regarding a single "&", I made the case that it should not be allowed, but given SRT's history as a text-only protocol, I can agree to change that view and escape it instead. The question then is what one should do with SRT input that is already escaped, like "& amp;" (ignore the space; it is only there to keep the characters from being converted). Should that be escaped to "& amp;amp;", or should there be some sort of smartness to escape only a single "&" character?
In general, there are only two characters that need to be escaped in TTML and WebVTT, and they are "&" and "<".
Maybe a reasonable approach is that both these characters should be escaped if provided as input in SRT and then unescaped when outputting SRT again, while they should be reported as errors or escaped as input in WebVTT, but not unescaped when outputting WebVTT (so that the output WebVTT is valid, although the input is not)?
Of course, this depends on the internal representation. If the two characters are not escaped in the internal representation, we need to unescape the WebVTT input.
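The two options raised in this comment (escape everything, or escape only a lone "&") can be compared with a short sketch. It is an illustration under stated assumptions rather than go-astisub's implementation: `escapeCue`, `unescapeCue`, and `escapeBareAmpersands` are hypothetical names, and the "already escaped" check is a simple pattern heuristic, not a full HTML character reference parser.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Only "&" and "<" are handled, following the comment above.
var (
	cueEscaper   = strings.NewReplacer("&", "&amp;", "<", "&lt;")
	cueUnescaper = strings.NewReplacer("&lt;", "<", "&amp;", "&")
)

// escapeCue escapes every "&" and "<", including an "&" that already
// starts a character reference.
func escapeCue(s string) string { return cueEscaper.Replace(s) }

// unescapeCue reverses escapeCue.
func unescapeCue(s string) string { return cueUnescaper.Replace(s) }

// ampOrRef matches either something that looks like a character
// reference (named, decimal, or hexadecimal) or a lone "&".
var ampOrRef = regexp.MustCompile(`&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);|&`)

// escapeBareAmpersands is the "smart" variant: it escapes only an "&"
// that does not already look like the start of a character reference.
func escapeBareAmpersands(s string) string {
	return ampOrRef.ReplaceAllStringFunc(s, func(m string) string {
		if m == "&" {
			return "&amp;"
		}
		return m // leave reference-like sequences untouched
	})
}

func main() {
	fmt.Println(escapeCue("I & you < me"))                // I &amp; you &lt; me
	fmt.Println(escapeCue("&amp;"))                       // &amp;amp;
	fmt.Println(escapeBareAmpersands("I & you &amp; me")) // I &amp; you &amp; me
	fmt.Println(unescapeCue(escapeCue("&amp;")))          // &amp;
}
```

The plain variant round-trips any input exactly, at the cost of turning pre-escaped SRT text into a double escape. The smart variant avoids the double escape but can no longer tell whether an author meant the literal characters `&amp;`, which is the trade-off the question above points at.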
from go-astisub.
I only have shallow knowledge of this area. However, as far as I know, SRT does not have any escaping rules; it can understand a single & without that character being escaped. So, from my perspective, if we meet & in SRT, we should not unescape it. Once we need to convert it to WebVTT, we should then escape it to &amp; in the WebVTT output.
IMO we should unescape every input source according to its own format standard, and escape again when writing to any other format, because each standard has its own escaping rules.
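The rule proposed here (decode each input by its own format's rules, keep raw text internally, encode by the output format's rules) might look like the sketch below. The `format` type and the `decodeText`, `encodeText`, and `convertText` helpers are hypothetical and do not reflect go-astisub's API. Treating SRT as having no escaping at all is the assumption made in this comment, not something a specification states.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// format identifies a subtitle format in this sketch only.
type format int

const (
	formatSRT format = iota
	formatWebVTT
)

// vttEscaper escapes the two characters that must not appear raw in
// WebVTT cue text.
var vttEscaper = strings.NewReplacer("&", "&amp;", "<", "&lt;")

// decodeText turns format-specific text into raw internal text.
// Assumption: SRT text is taken literally; WebVTT text uses HTML
// character references.
func decodeText(f format, s string) string {
	if f == formatWebVTT {
		return html.UnescapeString(s)
	}
	return s
}

// encodeText turns raw internal text into format-specific text.
func encodeText(f format, s string) string {
	if f == formatWebVTT {
		return vttEscaper.Replace(s)
	}
	return s
}

// convertText decodes with the source rules, then encodes with the
// target rules.
func convertText(from, to format, s string) string {
	return encodeText(to, decodeText(from, s))
}

func main() {
	fmt.Println(convertText(formatSRT, formatWebVTT, "I & you are talking"))        // I &amp; you are talking
	fmt.Println(convertText(formatWebVTT, formatSRT, "I &amp; you are talking"))    // I & you are talking
	fmt.Println(convertText(formatWebVTT, formatWebVTT, "I &amp; you are talking")) // I &amp; you are talking
}
```

Because the internal text is always raw, no writer needs to know which reader produced it, and valid WebVTT input survives a WebVTT-to-WebVTT pass unchanged.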
from go-astisub.
This change breaks styling like italics in the written VTT subtitle: the opening < is replaced with &lt;
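One plausible way to get this symptom is an escaper that runs over the whole cue line, tags included. The sketch below reproduces that and shows one possible workaround that escapes only the text between recognised tags. It is a guess at the failure mode, not a diagnosis of the actual change: `escapeAll`, `escapePreservingTags`, and the `cueTag` pattern are hypothetical, and the tag list is a simplified assumption about WebVTT cue markup.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var textEscaper = strings.NewReplacer("&", "&amp;", "<", "&lt;")

// escapeAll escapes the whole line, so tag delimiters are escaped too.
func escapeAll(s string) string { return textEscaper.Replace(s) }

// cueTag is a simplified pattern for WebVTT cue tags (c, i, b, u, v,
// lang, ruby, rt, with optional classes or annotation) and timestamp tags.
var cueTag = regexp.MustCompile(`</?(?:c|i|b|u|v|lang|ruby|rt)(?:[ .][^<>]*)?>|<\d{2,}:\d{2}:\d{2}\.\d{3}>`)

// escapePreservingTags escapes only the text between recognised tags
// and copies the tags themselves through unchanged.
func escapePreservingTags(s string) string {
	var b strings.Builder
	last := 0
	for _, loc := range cueTag.FindAllStringIndex(s, -1) {
		b.WriteString(textEscaper.Replace(s[last:loc[0]])) // text before the tag
		b.WriteString(s[loc[0]:loc[1]])                    // the tag itself
		last = loc[1]
	}
	b.WriteString(textEscaper.Replace(s[last:])) // trailing text
	return b.String()
}

func main() {
	line := "<i>Tom & Jerry</i> 1 < 2"
	fmt.Println(escapeAll(line))            // &lt;i>Tom &amp; Jerry&lt;/i> 1 &lt; 2
	fmt.Println(escapePreservingTags(line)) // <i>Tom &amp; Jerry</i> 1 &lt; 2
}
```

A cleaner design would keep styling as structured data and escape only the text of each item when writing, so no pattern matching on tags is needed.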
from go-astisub.
Can you open a new issue and share a test file to reproduce the error ?
from go-astisub.
Can you open a new issue and share a test file to reproduce the error ?
Sure thing.
from go-astisub.