lower('Შ') returns the exact same character as the

Even as we implement <a class="issue-link js-issue-link" data-error-text="Failed to lo

<a class="user-mention notranslate" data-hovercard-type="user" data-hover

cc <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

CC: <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url=

Incorrect result of lower('Შ') about presto HOT 15 OPEN

kagamiori commented on July 20, 2024

Incorrect result of lower('Შ')

from presto.

Comments (15)

elharo commented on July 20, 2024 3

Nice analysis. We could use ICU4J and ICU4C to get compatibility between C++ and Java, as well as more modern Unicode, regardless of JDK version. I'm not sure what the performance implications of that would be or the priority of this work.

from presto.

tdcmeehan commented on July 20, 2024 3

Even as we implement #16268, we are still handicapped on the minimum required version to support Presto on Spark. As I understand it, this requires Java 8. Meaning, Presto on Spark may retain a correctness bug related to the unicode version, even if we've fixed it by updating our JDK elsewhere. So, I like @elharo's suggestion (which I was unfamiliar with before), because it seems this enables us to decouple these two desires and prioritize them independently.

from presto.

elharo commented on July 20, 2024 1

Do we specify the Unicode version in use by Presto anywhere?

from presto.

kagamiori commented on July 20, 2024 1

@kagamiori Regarding

The lower() function is implemented through the Java's Character.toLowerCase(codepoint) method.

Looks like its here its calling toLowerCase from airlift

presto/presto-main/src/main/java/com/facebook/presto/operator/scalar/StringFunctions.java

Line 645 in 5b1cde7

return toLowerCase(slice);

Which I think is calling this code

https://github.com/airlift/slice/blob/master/src/main/java/io/airlift/slice/SliceUtf8.java#L297-L302

Hi @amitkdutta, LOWER_CODE_POINTS is constructed by calling Java's Character.toLowerCase().
https://github.com/airlift/slice/blob/87deb8a298a433e95e6a9061fb6df3e89469635d/src/main/java/io/airlift/slice/SliceUtf8.java#L52

from presto.

mbasmanova commented on July 20, 2024 1

More generally, the issue is that Unicode version supported by Presto Java and Velox are different.

Velox uses a copy of utf8proc 2.5.0 library which supports Unicode 13.0: https://juliastrings.github.io/utf8proc/releases/

Presto Java in Meta environment uses JDK 11, which supports Unicode 10.0.

Support for Unicode 13.0 was added in JDK 15: https://www.oracle.com/java/technologies/javase/15-relnote-issues.html.

Support for Unicode 14.0 was added in JDK 19: https://www.oracle.com/java/technologies/javase/19-relnote-issues.html

To match Unicode version supported in Velox, JDK version used by Presto Java needs to be in [15, 19) range (at least 15, bot not 19 or newer). If Unicode versions supported in Java and C++ do not match, there will be inconsistencies between behavior applied during constant folding and during evaluation on the worker.

Latest Unicode version is 15.1: https://www.unicode.org/versions/Unicode15.1.0/

It would be nice to figure out how to support latest Unicode version in both Java and C++.

from presto.

mbasmanova commented on July 20, 2024 1

Nice analysis. We could use ICU4J and ICU4C to get compatibility between C++ and Java, as well as more modern Unicode, regardless of JDK version. I'm not sure what the performance implications of that would be or the priority of this work.

FYI, looks like Spark 4.0 switched to using ICU.

from presto.

kagamiori commented on July 20, 2024

cc @mbasmanova

from presto.

mbasmanova commented on July 20, 2024

CC: @rschlussel @tdcmeehan

from presto.

mbasmanova commented on July 20, 2024

CC: @kaikalur

from presto.

mbasmanova commented on July 20, 2024

For reference: https://codepoints.net/U+1CA8

U+1CA8 Georgian Mtavruli Capital Letter Shin

U+1CA8 was added in Unicode version 11.0 in 2018. It belongs to the block Georgian Extended in the Basic Multilingual Plane.

This character is a Uppercase Letter and is mainly used in the Georgian script. Its lowercase variant is Georgian Letter Shin.

from presto.

mbasmanova commented on July 20, 2024

This character appears to be introduced in Unicode 11.0, which is supported only starting from JDK 12:

"The JDK 12 release includes support for Unicode 11.0.0. Following the release of JDK 11, which supported Unicode 10.0.0, Unicode 11.0.0 introduced the following new features that are now included in JDK 12:"

https://www.oracle.com/java/technologies/javase/12-relnote-issues.html#JDK-8209923

"Support has been added for Unicode 10.0.0. Java Platform, Standard Edition (Java SE) 9 and 10 supported Unicode 8.0."

https://docs.oracle.com/en/java/javase/11/intl/internationalization-enhancements1.html#GUID-D0AF5316-F01C-4A3A-A3CA-7875C3D34601

from presto.

amitkdutta commented on July 20, 2024

@kagamiori Regarding

The lower() function is implemented through the Java's Character.toLowerCase(codepoint) method.

Looks like its here its calling toLowerCase from airlift

presto/presto-main/src/main/java/com/facebook/presto/operator/scalar/StringFunctions.java

Line 645 in 5b1cde7

return toLowerCase(slice);

Which I think is calling this code

https://github.com/airlift/slice/blob/master/src/main/java/io/airlift/slice/SliceUtf8.java#L297-L302

from presto.

mbasmanova commented on July 20, 2024

Do we specify the Unicode version in use by Presto anywhere?

I don't believe so, but assuming we compile with JDK 8, then it would be Unicode 7.0.

CC: @tdcmeehan @steveburnett

from presto.

mbasmanova commented on July 20, 2024

It looks like Trino moved to Java 22:

"Trino requires a 64-bit version of Java 22, with a minimum required version of 22.0.0. Earlier versions such as Java 8, Java 11, Java 17 or Java 21 do not work. Newer versions such as Java 23 are not supported – they may work, but are not tested."

from presto.

steveburnett commented on July 20, 2024

Do we specify the Unicode version in use by Presto anywhere?

I don't believe so, but assuming we compile with JDK 8, then it would be Unicode 7.0.

CC: @tdcmeehan @steveburnett

We do not specify the Unicode version in use by Presto.

from presto.

Incorrect result of lower('Შ') about presto HOT 15 OPEN

Comments (15)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent