Comments (15)
Nice analysis. We could use ICU4J and ICU4C to get compatibility between C++ and Java, as well as more modern Unicode, regardless of JDK version. I'm not sure what the performance implications of that would be or the priority of this work.
from presto.
Even as we implement #16268, we are still constrained by the minimum Java version required to support Presto on Spark. As I understand it, that is Java 8, which means Presto on Spark may retain a correctness bug related to the Unicode version even if we have fixed it elsewhere by updating our JDK. So I like @elharo's suggestion (which I was unfamiliar with before), because it seems to let us decouple these two goals and prioritize them independently.
from presto.
Do we specify the Unicode version in use by Presto anywhere?
from presto.
@kagamiori Regarding
The lower() function is implemented through Java's Character.toLowerCase(codepoint) method.
It looks like it is calling toLowerCase from airlift, which I think is calling this code:
https://github.com/airlift/slice/blob/master/src/main/java/io/airlift/slice/SliceUtf8.java#L297-L302
Hi @amitkdutta, LOWER_CODE_POINTS is constructed by calling Java's Character.toLowerCase():
https://github.com/airlift/slice/blob/87deb8a298a433e95e6a9061fb6df3e89469635d/src/main/java/io/airlift/slice/SliceUtf8.java#L52
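To illustrate why a precomputed table inherits the JVM's Unicode version, here is a minimal sketch. It is not the airlift source: the class name LowerTableSketch and its LOWER table are hypothetical, and the only assumption is that the table is filled by calling Character.toLowerCase(int) for every code point, as described above.

```java
// Hypothetical sketch, not the airlift/slice implementation.
class LowerTableSketch {
    // One entry per Unicode code point, filled once at class-load time.
    static final int[] LOWER = buildLowerTable();

    static int[] buildLowerTable() {
        int[] table = new int[Character.MAX_CODE_POINT + 1];
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            // The mapping comes from the Unicode data bundled with the running JVM,
            // so the table is only as current as that JDK.
            table[cp] = Character.toLowerCase(cp);
        }
        return table;
    }

    static int lower(int codePoint) {
        return LOWER[codePoint];
    }

    public static void main(String[] args) {
        System.out.printf("U+0041 -> U+%04X%n", lower('A'));
        // The result for U+1CA8 depends on the JDK running this code.
        System.out.printf("U+1CA8 -> U+%04X%n", lower(0x1CA8));
    }
}
```

A table built this way cannot know about case mappings added after the JDK's Unicode version, no matter how the lookup itself is implemented.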
from presto.
More generally, the issue is that the Unicode versions supported by Presto Java and Velox are different.
Velox uses a copy of the utf8proc 2.5.0 library, which supports Unicode 13.0: https://juliastrings.github.io/utf8proc/releases/
Presto Java in Meta's environment uses JDK 11, which supports Unicode 10.0.
Support for Unicode 13.0 was added in JDK 15: https://www.oracle.com/java/technologies/javase/15-relnote-issues.html.
Support for Unicode 14.0 was added in JDK 19: https://www.oracle.com/java/technologies/javase/19-relnote-issues.html
To match the Unicode version supported in Velox, the JDK version used by Presto Java needs to be in the [15, 19) range (at least 15, but not 19 or newer). If the Unicode versions supported in Java and C++ do not match, there will be inconsistencies between the behavior applied during constant folding and during evaluation on the worker.
The latest Unicode version is 15.1: https://www.unicode.org/versions/Unicode15.1.0/
It would be nice to figure out how to support the latest Unicode version in both Java and C++.
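The [15, 19) constraint above could be expressed as a simple startup check. This is only a sketch: the class and method names are hypothetical, the bounds are the ones quoted in this comment, and nothing like this exists in Presto as far as this thread shows.

```java
// Hypothetical sketch of a JDK range check; not part of Presto.
class UnicodeVersionCheck {
    static final int MIN_JDK_INCLUSIVE = 15; // first JDK with Unicode 13.0, per the release notes above
    static final int MAX_JDK_EXCLUSIVE = 19; // first JDK with Unicode 14.0, per the release notes above

    // True when the given JDK feature version ships the Unicode version Velox uses (13.0).
    static boolean matchesVeloxUnicode(int jdkFeatureVersion) {
        return jdkFeatureVersion >= MIN_JDK_INCLUSIVE && jdkFeatureVersion < MAX_JDK_EXCLUSIVE;
    }

    // Parses "1.8" as 8 and "11", "17", ... as themselves.
    static int runningJdkFeatureVersion() {
        String spec = System.getProperty("java.specification.version");
        if (spec.startsWith("1.")) {
            spec = spec.substring(2);
        }
        return Integer.parseInt(spec);
    }

    public static void main(String[] args) {
        int jdk = runningJdkFeatureVersion();
        System.out.println("JDK " + jdk + " matches Velox Unicode 13.0: " + matchesVeloxUnicode(jdk));
    }
}
```

A check like this only detects the mismatch. It does not remove it, which is why a shared Unicode implementation on both sides is the more durable option.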
from presto.
Nice analysis. We could use ICU4J and ICU4C to get compatibility between C++ and Java, as well as more modern Unicode, regardless of JDK version. I'm not sure what the performance implications of that would be or the priority of this work.
FYI, looks like Spark 4.0 switched to using ICU.
from presto.
cc @mbasmanova
from presto.
CC: @kaikalur
from presto.
For reference: https://codepoints.net/U+1CA8
U+1CA8 Georgian Mtavruli Capital Letter Shin
U+1CA8 was added in Unicode version 11.0 in 2018. It belongs to the block Georgian Extended in the Basic Multilingual Plane.
This character is a Uppercase Letter and is mainly used in the Georgian script. Its lowercase variant is Georgian Letter Shin.
from presto.
This character appears to be introduced in Unicode 11.0, which is supported only starting from JDK 12:
"The JDK 12 release includes support for Unicode 11.0.0. Following the release of JDK 11, which supported Unicode 10.0.0, Unicode 11.0.0 introduced the following new features that are now included in JDK 12:"
https://www.oracle.com/java/technologies/javase/12-relnote-issues.html#JDK-8209923
"Support has been added for Unicode 10.0.0. Java Platform, Standard Edition (Java SE) 9 and 10 supported Unicode 8.0."
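A quick way to see this on a given JVM is to probe the character directly. The sketch below uses only java.lang.Character; the class name is hypothetical, and U+10E8 as the code point of Georgian Letter Shin is supplied here rather than taken from the thread.

```java
// Hypothetical probe for Unicode 11.0 support in the running JVM.
class MtavruliProbe {
    static final int MTAVRULI_SHIN = 0x1CA8;  // Georgian Mtavruli Capital Letter Shin (Unicode 11.0)
    static final int MKHEDRULI_SHIN = 0x10E8; // Georgian Letter Shin, its lowercase form

    // True when the JVM's Unicode data already defines U+1CA8.
    static boolean knowsUnicode11() {
        return Character.isDefined(MTAVRULI_SHIN);
    }

    static int lower(int codePoint) {
        return Character.toLowerCase(codePoint);
    }

    public static void main(String[] args) {
        System.out.println("U+1CA8 defined: " + knowsUnicode11());
        // On a JVM without Unicode 11.0 the code point is returned unchanged.
        System.out.printf("lower(U+1CA8) = U+%04X%n", lower(MTAVRULI_SHIN));
    }
}
```

On a JVM whose Unicode data predates 11.0, the code point is unassigned and toLowerCase returns it unchanged, which is the behavior that diverges from Velox.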
from presto.
@kagamiori Regarding
The lower() function is implemented through Java's Character.toLowerCase(codepoint) method.
It looks like it is calling toLowerCase from airlift, which I think is calling this code:
https://github.com/airlift/slice/blob/master/src/main/java/io/airlift/slice/SliceUtf8.java#L297-L302
from presto.
Do we specify the Unicode version in use by Presto anywhere?
I don't believe so, but assuming we compile with JDK 8, it would be Unicode 6.2, the version Java SE 8 supports.
from presto.
It looks like Trino moved to Java 22:
- trinodb/trino#20980
- https://trino.io/docs/current/installation/deployment.html#java-runtime-environment
"Trino requires a 64-bit version of Java 22, with a minimum required version of 22.0.0. Earlier versions such as Java 8, Java 11, Java 17 or Java 21 do not work. Newer versions such as Java 23 are not supported – they may work, but are not tested."
from presto.
Do we specify the Unicode version in use by Presto anywhere?
I don't believe so, but assuming we compile with JDK 8, it would be Unicode 6.2, the version Java SE 8 supports.
We do not specify the Unicode version in use by Presto.
from presto.