Giter VIP home page Giter VIP logo

Comments (15)

elharo avatar elharo commented on July 20, 2024 3

Nice analysis. We could use ICU4J and ICU4C to get compatibility between C++ and Java, as well as more modern Unicode, regardless of JDK version. I'm not sure what the performance implications of that would be or the priority of this work.

from presto.

tdcmeehan avatar tdcmeehan commented on July 20, 2024 3

Even as we implement #16268, we are still handicapped on the minimum required version to support Presto on Spark. As I understand it, this requires Java 8. Meaning, Presto on Spark may retain a correctness bug related to the unicode version, even if we've fixed it by updating our JDK elsewhere. So, I like @elharo's suggestion (which I was unfamiliar with before), because it seems this enables us to decouple these two desires and prioritize them independently.

from presto.

elharo avatar elharo commented on July 20, 2024 1

Do we specify the Unicode version in use by Presto anywhere?

from presto.

kagamiori avatar kagamiori commented on July 20, 2024 1

@kagamiori Regarding

The lower() function is implemented through the Java's Character.toLowerCase(codepoint) method.

Looks like its here its calling toLowerCase from airlift

Which I think is calling this code

https://github.com/airlift/slice/blob/master/src/main/java/io/airlift/slice/SliceUtf8.java#L297-L302

Hi @amitkdutta, LOWER_CODE_POINTS is constructed by calling Java's Character.toLowerCase().
https://github.com/airlift/slice/blob/87deb8a298a433e95e6a9061fb6df3e89469635d/src/main/java/io/airlift/slice/SliceUtf8.java#L52

from presto.

mbasmanova avatar mbasmanova commented on July 20, 2024 1

More generally, the issue is that Unicode version supported by Presto Java and Velox are different.

Velox uses a copy of utf8proc 2.5.0 library which supports Unicode 13.0: https://juliastrings.github.io/utf8proc/releases/

Presto Java in Meta environment uses JDK 11, which supports Unicode 10.0.

Support for Unicode 13.0 was added in JDK 15: https://www.oracle.com/java/technologies/javase/15-relnote-issues.html.

Support for Unicode 14.0 was added in JDK 19: https://www.oracle.com/java/technologies/javase/19-relnote-issues.html

To match Unicode version supported in Velox, JDK version used by Presto Java needs to be in [15, 19) range (at least 15, bot not 19 or newer). If Unicode versions supported in Java and C++ do not match, there will be inconsistencies between behavior applied during constant folding and during evaluation on the worker.

Latest Unicode version is 15.1: https://www.unicode.org/versions/Unicode15.1.0/

It would be nice to figure out how to support latest Unicode version in both Java and C++.

from presto.

mbasmanova avatar mbasmanova commented on July 20, 2024 1

Nice analysis. We could use ICU4J and ICU4C to get compatibility between C++ and Java, as well as more modern Unicode, regardless of JDK version. I'm not sure what the performance implications of that would be or the priority of this work.

FYI, looks like Spark 4.0 switched to using ICU.

from presto.

kagamiori avatar kagamiori commented on July 20, 2024

cc @mbasmanova

from presto.

mbasmanova avatar mbasmanova commented on July 20, 2024

CC: @rschlussel @tdcmeehan

from presto.

mbasmanova avatar mbasmanova commented on July 20, 2024

CC: @kaikalur

from presto.

mbasmanova avatar mbasmanova commented on July 20, 2024

For reference: https://codepoints.net/U+1CA8

U+1CA8 Georgian Mtavruli Capital Letter Shin

U+1CA8 was added in Unicode version 11.0 in 2018. It belongs to the block Georgian Extended in the Basic Multilingual Plane.

This character is a Uppercase Letter and is mainly used in the Georgian script. Its lowercase variant is Georgian Letter Shin.

from presto.

mbasmanova avatar mbasmanova commented on July 20, 2024

This character appears to be introduced in Unicode 11.0, which is supported only starting from JDK 12:

"The JDK 12 release includes support for Unicode 11.0.0. Following the release of JDK 11, which supported Unicode 10.0.0, Unicode 11.0.0 introduced the following new features that are now included in JDK 12:"

https://www.oracle.com/java/technologies/javase/12-relnote-issues.html#JDK-8209923

"Support has been added for Unicode 10.0.0. Java Platform, Standard Edition (Java SE) 9 and 10 supported Unicode 8.0."

https://docs.oracle.com/en/java/javase/11/intl/internationalization-enhancements1.html#GUID-D0AF5316-F01C-4A3A-A3CA-7875C3D34601

from presto.

amitkdutta avatar amitkdutta commented on July 20, 2024

@kagamiori Regarding

The lower() function is implemented through the Java's Character.toLowerCase(codepoint) method.

Looks like its here its calling toLowerCase from airlift

Which I think is calling this code

https://github.com/airlift/slice/blob/master/src/main/java/io/airlift/slice/SliceUtf8.java#L297-L302

from presto.

mbasmanova avatar mbasmanova commented on July 20, 2024

Do we specify the Unicode version in use by Presto anywhere?

I don't believe so, but assuming we compile with JDK 8, then it would be Unicode 7.0.

CC: @tdcmeehan @steveburnett

from presto.

mbasmanova avatar mbasmanova commented on July 20, 2024

It looks like Trino moved to Java 22:

"Trino requires a 64-bit version of Java 22, with a minimum required version of 22.0.0. Earlier versions such as Java 8, Java 11, Java 17 or Java 21 do not work. Newer versions such as Java 23 are not supported – they may work, but are not tested."

from presto.

steveburnett avatar steveburnett commented on July 20, 2024

Do we specify the Unicode version in use by Presto anywhere?

I don't believe so, but assuming we compile with JDK 8, then it would be Unicode 7.0.

CC: @tdcmeehan @steveburnett

We do not specify the Unicode version in use by Presto.

from presto.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.