
Comments (11)

kylef commented on May 30, 2024

Specification already states that the source maps contain character maps and not bytes:

This location SHOULD include the characters used to build the parent element
A source map is a series of character-blocks

Zero-based index of a character in the source document
Count of characters starting from the character index.

So no change is needed in the specification. However, neither the API Blueprint nor Swagger parsers are following these rules.

from api-elements.

kylef commented on May 30, 2024

After giving this a bit of thought, I think we should use byte-based source maps referencing the original source document. Using characters will be problematic for the following reasons:

  • Various programming languages / string implementations have different ideas of what a character is. Swift seems to be the only one that follows the Unicode grapheme cluster boundary rules correctly (example: \r\n being a single character). Both Node/JS and Python are problematic and have a different idea of what a character should be (by design or not).
  • If the document contains invalid or illegal Unicode values, it will be problematic to refer to them, or to anything after them, via a source map. With a byte-based source map we have more control and can even provide source maps for an individual byte inside a Unicode grapheme cluster.
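A quick Python illustration of the first point, using illustrative strings of my own choosing; Python 3's `len()` counts Unicode code points, not grapheme clusters:

```python
# The same "characters" yield different counts depending on the unit used.
family = "\U0001F468\u200d\U0001F468\u200d\U0001F466\u200d\U0001F466"  # 👨‍👨‍👦‍👦

print(len("\r\n"))                  # 2 code points, but 1 grapheme cluster
print(len(family))                  # 7 code points, but 1 grapheme cluster
print(len(family.encode("utf-8")))  # 25 bytes in UTF-8
```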

Steps to Proceed

  • Update API Elements specification to use bytes (which is what the biggest implementation already does in Drafter)
  • Update Swagger adapter to use bytes
  • Test Apiary tools to ensure correct support for bytes. I think they already use bytes, as that is what Snowcrash/Drafter has used for some time.
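For the adapters, a byte-based location/length can be derived from any substring by encoding the prefix. A minimal Python sketch (`doc` and `target` are made-up examples, not from any parser API):

```python
doc = "name: héllo"
target = "héllo"

char_loc = doc.index(target)                    # code-point offset: 6
byte_loc = len(doc[:char_loc].encode("utf-8"))  # byte offset into the UTF-8 source
byte_len = len(target.encode("utf-8"))          # "é" is 2 bytes in UTF-8
print(byte_loc, byte_len)
```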


w-vi commented on May 30, 2024

I'll come back to this later, but a quick note: relying on JavaScript strings is a little unfortunate, as we would rely on the string implementation of one runtime and the rest would then need to treat it the same way. AFAIK JavaScript uses UTF-16.


kylef commented on May 30, 2024

Yes, and this could lead to different behaviours depending on the language and string implementation. Conversions through buffer -> string -> buffer may actually be lossy for implementations that perform Unicode normalisation or apply grapheme breaking rules.
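A quick Python illustration of such a lossy round trip (the byte values here are made up for the example):

```python
# buffer -> string -> buffer can be lossy: an invalid UTF-8 byte is
# replaced with U+FFFD during decoding and cannot be recovered afterwards.
data = b"e\xff"                                # \xff is not valid UTF-8
text = data.decode("utf-8", errors="replace")  # 'e\ufffd'
roundtrip = text.encode("utf-8")               # b'e\xef\xbf\xbd'
print(roundtrip == data)                       # False
```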


w-vi commented on May 30, 2024

I have thought about this a bit, and I think it is best to use an offset in characters, where a character is not a byte but a logical unit whose size depends on the encoding of the document. In the case of utf-8, a string might be 3 characters (runes) but 13 bytes. What do you think @kylef ?


kylef commented on May 30, 2024

@w-vi If I understand you correctly, then I agree.

Due to language differences this can be confusing, especially when languages have different grapheme breaking rules. I could see it being very problematic for consumers to find the original source from a source map.

I think we should alter our JS parsing libraries (Fury, Protagonist) to accept a buffer instead of a string, to keep intact the original source document and how its graphemes are broken down.

For the Helium parsing service, the input document is embedded inside a JSON payload and is already serialised. I fear we could lose the original encoding, and the source maps could become incorrect. api.apibleuprint.org deals with unserialised data and thus wouldn't have this problem.

Just to confirm we're on the same page. Here's an example:

Given I have five characters (e, \r\n, é, é and 👨‍👨‍👦‍👦) encoded as utf-8, and I want to refer to the family (👨‍👨‍👦‍👦). A character is equal to a Unicode grapheme cluster. Would you propose the source map be 4 for location and 1 for length?

I want to point out that not all of those characters are the same, although some look identical (é vs é). Also note that \r\n is a single grapheme cluster, not two (http://www.unicode.org/reports/tr29/#GB3).
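The two é forms can be distinguished (and collapsed) with Unicode normalisation, which is exactly what would shift byte offsets if any layer normalises. A small Python check:

```python
import unicodedata

precomposed = "\u00e9"  # é as a single code point (U+00E9)
combined = "e\u0301"    # e followed by combining acute accent (U+0301)

print(precomposed == combined)                                # False
print(unicodedata.normalize("NFC", combined) == precomposed)  # True
# Different UTF-8 byte lengths for visually identical text:
print(len(precomposed.encode("utf-8")), len(combined.encode("utf-8")))  # 2 3
```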

Here is Swift showing the lengths of those strings, and the characters encoded in base64.

let eAcute: Character = "\u{E9}"                         // é
let combinedEAcute: Character = "\u{65}\u{301}"          // e followed by combining acute accent
// eAcute is é, combinedEAcute is é

let characters: [Character] = ["e", "\r\n", eAcute, combinedEAcute, "👨‍👨‍👦‍👦"]
let string = String(characters)
let utf8Data = string.data(using: .utf8)

print(characters)
// ["e", "\r\n", "é", "é", "👨‍👨‍👦‍👦"]

print(characters.map { String($0).utf8.count })
// [1, 2, 2, 3, 25]

print(characters.count)
// 5
print(string.utf8.count)
// 33

print(string.utf16.count)
// 17

print(utf8Data)
// Optional(33 bytes)

print(utf8Data?.base64EncodedString())
// Optional("ZQ0Kw6llzIHwn5Go4oCN8J+RqOKAjfCfkabigI3wn5Gm")

Then let's take the base64 and decode it in Python 3:

>>> import base64
>>> data = base64.decodebytes(b'ZQ0Kw6llzIHwn5Go4oCN8J+RqOKAjfCfkabigI3wn5Gm')
>>> string = data.decode('utf-8')
>>> data
b'e\r\n\xc3\xa9e\xcc\x81\xf0\x9f\x91\xa8\xe2\x80\x8d\xf0\x9f\x91\xa8\xe2\x80\x8d\xf0\x9f\x91\xa6\xe2\x80\x8d\xf0\x9f\x91\xa6'
>>> len(data)
33
>>> string
'e\r\néé👨\u200d👨\u200d👦\u200d👦'
>>> len(string)
13
>>> string[1:2]
'\r'
>>> string[2:3]
'\n'

I am not sure how Python got 13 as the length; this would seem to be a bug in grapheme breaking. It is not the number of characters, nor the utf-8 or utf-16 length. Python is also treating \r and \n as separate characters.
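For what it's worth, Python 3's `len()` counts Unicode code points rather than grapheme clusters, which does account for the 13 (6 code points before the emoji plus 7 in the ZWJ sequence):

```python
s = "e\r\n\u00e9e\u0301" \
    "\U0001F468\u200d\U0001F468\u200d\U0001F466\u200d\U0001F466"
# e, \r, \n, é, e, U+0301 = 6 code points,
# plus 4 emoji + 3 zero-width joiners = 7 code points
print(len(s))  # 13
```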

Then in Node 6 (perhaps there is another way of doing this; I am not that proficient in Node):

> const { StringDecoder } = require('string_decoder')
> const data = Buffer.from('ZQ0Kw6llzIHwn5Go4oCN8J+RqOKAjfCfkabigI3wn5Gm', 'base64')
undefined
> data
<Buffer 65 0d 0a c3 a9 65 cc 81 f0 9f 91 a8 e2 80 8d f0 9f 91 a8 e2 80 8d f0 9f 91 a6 e2 80 8d f0 9f 91 a6>
> data.length
33
> const decoder = new StringDecoder('utf8');
undefined
> console.log(decoder.write(data));
e
éé👨‍👨‍👦‍👦
undefined
> console.log(decoder.write(data).length);
17

It looks like strings are internally stored as utf-16, which means that when serialising back to utf-8 the output may be normalised (both é would become the same size if normalisation occurs). Since we embed the payload as a Unicode string in a JSON payload for the parsing service request, we have likely lost the original form, which can make the source maps incorrect with respect to what the user requested. The APIs for Fury/Protagonist/Drafter.js accept strings and not buffers, so deserialisation has already occurred, and re-serialisation into utf-8 for the parser may also lose information.
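The 17 reported by Node is consistent with counting UTF-16 code units (each emoji in the ZWJ sequence is a surrogate pair), which can be cross-checked in Python:

```python
s = "e\r\n\u00e9e\u0301" \
    "\U0001F468\u200d\U0001F468\u200d\U0001F466\u200d\U0001F466"
# 6 BMP code points = 6 units; 4 emoji as surrogate pairs = 8 units;
# 3 zero-width joiners = 3 units; total 17 UTF-16 code units.
utf16_units = len(s.encode("utf-16-le")) // 2
print(utf16_units)  # 17
```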


w-vi commented on May 30, 2024

Yes, we are on the same page and mean the same thing. The question is whether we should impose utf-8 as the only acceptable encoding and, in the case of Helium, require the input to be base64 encoded so that we get the raw bytes instead of a string and thus don't lose the original document. JavaScript uses utf-16 for strings; that is not Node specific but JavaScript in general. And the Python result looks funny; I'll probably look into it in more detail.


pksunkara commented on May 30, 2024

Since we are embedding the payload as a unicode string into a JSON payload for the parsing service request

Wait, we do that? I thought the strings we are passing are utf-8 or something.

Other than this, can we all agree that no change is needed in the Refract spec other than saying that the location and length of the source map element refer to bytes and not characters?


pksunkara commented on May 30, 2024

I thought we wanted to change it to bytes according to the above discussion.


kylef commented on May 30, 2024

Additional note: I think we should provide conveniences in Fury/Minim-API-Description to convert a source map to a specific line number, as this is a common pattern that various editors and tooling re-implement.
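Such a convenience could look roughly like the following. This is a hypothetical sketch (`line_and_column` is not an existing Fury/Minim API), assuming byte-based source maps over a UTF-8 document:

```python
def line_and_column(source: bytes, byte_offset: int):
    """Map a byte offset to a 1-based (line, column-in-bytes) pair."""
    before = source[:byte_offset]
    line = before.count(b"\n") + 1
    last_newline = before.rfind(b"\n")
    column = byte_offset - (last_newline + 1) + 1
    return line, column

doc = "# API\n\nname: héllo\n".encode("utf-8")
print(line_and_column(doc, 7))   # start of "name" on line 3
print(line_and_column(doc, 13))  # start of "héllo" on line 3
```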


tjanc commented on May 30, 2024

Ideally we'd like to index by character. But if we do that, we need to understand all supported encodings.
@kylef What are the encodings we officially support? What about the option of imposing utf-8, as mentioned by @w-vi ?

