jecolon / ziglyph


207.0 5.0 7.0 33.52 MB

Unicode text processing for the Zig programming language.

unicode zig characters utf-8 strings normalization collation word-wrap grapheme-cl display-width

ziglyph's Introduction

ziglyph

ziglyph is now at: https://codeberg.org/dude_the_builder/ziglyph

ziglyph's People

Contributors

Stargazers

Watchers

Forkers

slimsag courajs ikskuh jeffbdavenport der-teufel-programming 0xaltcunningham cryptocode

ziglyph's Issues

cross validation and comparison (to libgrapheme)

https://git.suckless.org/libgrapheme/file/README.html

You probably also want to explain when to use which of your libraries.
It's not immediately clear from the description which parts deal with graphemes, glyphs, code points, and code units.
https://stackoverflow.com/a/27331885
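
To make the terminology concrete, here is a minimal sketch using only Zig's standard library (not ziglyph). It shows that the same string has different lengths depending on whether you count UTF-8 code units or code points. The grapheme count is stated in a comment only, since it needs a segmenter such as ziglyph's GraphemeIterator.

```zig
const std = @import("std");

test "code units vs. code points" {
    // "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    const s = "e\u{301}";

    // UTF-8 code units are bytes: 1 for 'e' plus 2 for U+0301.
    try std.testing.expectEqual(@as(usize, 3), s.len);

    // Code points: U+0065 and U+0301.
    try std.testing.expectEqual(@as(usize, 2), try std.unicode.utf8CountCodepoints(s));

    // A grapheme segmenter is expected to report a single grapheme cluster
    // here, because the combining mark attaches to the preceding 'e'.
    // A glyph is what the font finally draws, which is outside the scope
    // of text-processing libraries.
}
```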

Question: More size-optimal grapheme cluster sorting?

Hi @jecolon ! I'm using ziglyph in something like a regexp engine, but instead of traditional [\x{1F600}-\x{1F64F}] Unicode code point ranges, which require abominations like this to match all valid emojis, I plan to implement range matching in terms of grapheme clusters.

I can do this today with the Unicode Collation Algorithm, but because it depends on normalization and the Unicode sort keys (UCA/latest/allkeys.txt), it ends up being quite a large binary. Even with gzip --best, the two data files come to around half a MB:

hexops-mac:zorex slimsag$ du -sh asset/
568K	asset/
hexops-mac:zorex slimsag$ du -sh asset/*
308K	asset/uca-allkeys.txt.gz
260K	asset/ucd-UnicodeData.txt.gz

I'm aiming to have WebAssembly support be quite nice, so I'm looking for ways to reduce the file size needed for sorting grapheme clusters.

I have a few ideas I'm planning to explore - namely that I suspect a binary encoding of the sort keys and UnicodeData.txt could go a long way in reducing the file size (it also seems these files may have comments in them that could be omitted perhaps?)

But I figured I'd reach out first and ask: has anyone else thought about this? Are there perhaps better ways to do grapheme cluster sorting that I have missed?
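
As a rough illustration of the binary-encoding idea, the sketch below packs one allkeys.txt entry into a fixed 10-byte record and drops the comment. It is only a sketch under several assumptions: it handles a single code point with a single collation element in the three-weight form `[.PPPP.SSSS.TTTT]` with four hex digits per weight, the record layout and the names `Element`, `parseElement`, and `packLine` are made up for this example, the weights in the sample line are illustrative, and it uses builtins from Zig 0.11 or later. Real data also contains contractions and expansions, which would need a variable-length layout.

```zig
const std = @import("std");

/// One collation element: three 16-bit weights plus the "variable" flag.
const Element = struct { variable: bool, primary: u16, secondary: u16, tertiary: u16 };

/// Parses a single "[.PPPP.SSSS.TTTT]" or "[*PPPP.SSSS.TTTT]" element.
fn parseElement(text: []const u8) !Element {
    if (text.len != 17 or text[0] != '[' or text[16] != ']') return error.BadElement;
    return Element{
        .variable = text[1] == '*',
        .primary = try std.fmt.parseInt(u16, text[2..6], 16),
        .secondary = try std.fmt.parseInt(u16, text[7..11], 16),
        .tertiary = try std.fmt.parseInt(u16, text[12..16], 16),
    };
}

/// Packs "CODEPOINT ; [element] # comment" into 10 big-endian bytes:
/// 3 bytes code point, 1 byte variable flag, 3 x 2 bytes of weights.
fn packLine(line: []const u8) ![10]u8 {
    // Drop the trailing "# comment", then split at the ';' separator.
    const hash = std.mem.indexOfScalar(u8, line, '#') orelse line.len;
    const body = line[0..hash];
    const semi = std.mem.indexOfScalar(u8, body, ';') orelse return error.BadLine;
    const cp = try std.fmt.parseInt(u21, std.mem.trim(u8, body[0..semi], " \t"), 16);
    const e = try parseElement(std.mem.trim(u8, body[semi + 1 ..], " \t"));

    var out: [10]u8 = undefined;
    const c: u32 = cp;
    out[0] = @truncate(c >> 16);
    out[1] = @truncate(c >> 8);
    out[2] = @truncate(c);
    out[3] = @intFromBool(e.variable);
    out[4] = @truncate(e.primary >> 8);
    out[5] = @truncate(e.primary);
    out[6] = @truncate(e.secondary >> 8);
    out[7] = @truncate(e.secondary);
    out[8] = @truncate(e.tertiary >> 8);
    out[9] = @truncate(e.tertiary);
    return out;
}

test "pack one allkeys.txt entry" {
    // About 50 bytes of text shrink to a 10-byte record.
    const line = "0041  ; [.1CAD.0020.0008] # LATIN CAPITAL LETTER A";
    const record = try packLine(line);
    const expected = [_]u8{ 0x00, 0x00, 0x41, 0x00, 0x1C, 0xAD, 0x00, 0x20, 0x00, 0x08 };
    try std.testing.expectEqualSlices(u8, &expected, &record);
}
```

Whether this beats gzip on the original text would have to be measured, since fixed-width records may compress differently from the source file.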

Streaming Segmenters

I want to use this library to write my parser. The problem I am facing is that I don't know whether the bytes in the current buffer are a valid UTF-8 string or just a prefix of one.

I checked GraphemeIterator: it validates the entire byte array up front, so it will not work if I don't have the whole string yet...

ziglyph/src/segmenter/Grapheme.zig

Line 80 in 531d309

if (!unicode.utf8ValidateSlice(str)) return error.InvalidUtf8;

So is there any workaround?

Or put simply, I can't make this library work well with Readers, because it expects everything to be already read.
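
One possible workaround, sketched here with only `std.unicode`, is to split each buffer at the last complete code point, process that prefix, and carry the remaining bytes over to the next read. `completePrefixLen` is a hypothetical helper written for this example, not part of ziglyph. Note that it only guards against cutting a code point in half: a grapheme cluster can still span two buffers, so the last cluster of each chunk would also need to be held back until more input arrives.

```zig
const std = @import("std");

/// Returns how many leading bytes of `buf` end on a code point boundary.
/// Any bytes after that length are the start of a code point whose
/// remaining bytes have not arrived yet.
fn completePrefixLen(buf: []const u8) usize {
    // A UTF-8 sequence is at most 4 bytes long, so looking at the last
    // 4 bytes is enough to find the lead byte of an unfinished sequence.
    var i: usize = buf.len;
    var looked: usize = 0;
    while (i > 0 and looked < 4) : (looked += 1) {
        i -= 1;
        const b = buf[i];
        if ((b & 0xC0) == 0x80) continue; // continuation byte, keep looking back
        // `b` starts a sequence. If it is not a valid lead byte, return the
        // full length so that a later validation step reports the error.
        const need: usize = std.unicode.utf8ByteSequenceLength(b) catch return buf.len;
        return if (i + need > buf.len) i else buf.len;
    }
    return buf.len;
}

test "split a buffer that ends in the middle of a code point" {
    const full = "a\u{20AC}"; // EURO SIGN is encoded as E2 82 AC
    const partial = full[0..3]; // 'a', E2, 82: the last byte is missing
    const n = completePrefixLen(partial);
    try std.testing.expectEqual(@as(usize, 1), n);
    try std.testing.expect(std.unicode.utf8ValidateSlice(partial[0..n]));
    try std.testing.expectEqual(@as(usize, 4), completePrefixLen(full));
}
```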

Grapheme segmentation with ZWJ sequences

Hello, and thank you for all the hard work on this great library! It has been a pleasure to use.

I do think I have found a bug though. I am using the GraphemeIterator and I noticed that multiple emoji with Zero Width Joiners in a row are only counted as one grapheme. For example, 🐻‍❄️🐻‍❄️ would be considered a single grapheme. Here is a minimal reproducing example

const std = @import("std");
const ziglyph = @import("ziglyph");

pub fn main() !void {
    var iter = try ziglyph.GraphemeIterator.init("🐻‍❄️🐻‍❄️");
    while (iter.next()) |grapheme| {
        std.debug.print("Found Grapheme: {s} at offset: {}\n", .{grapheme.bytes, grapheme.offset});
    }
}

outputs

Found Grapheme: 🐻‍❄️🐻‍❄️ at offset: 0

Separating the polar bears with a single space leads to three graphemes displayed as expected

Found Grapheme: 🐻‍❄️ at offset: 0
Found Grapheme:   at offset: 13
Found Grapheme: 🐻‍❄️ at offset: 14

It's possible that I misunderstand grapheme clusters, but I was under the impression that these emoji should be considered distinct "characters"/"symbols". When rendered in the browser for example, each is a distinct symbol, not joined as one.

Also, this specific emoji ends with U+FE0F, but I tested several other ZWJ emoji with the same result.
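
For reference, the polar bear is the sequence U+1F43B U+200D U+2744 U+FE0F, which is 13 bytes in UTF-8 and matches the offsets above. Under UAX #29, rule GB11 only suppresses a break between a ZWJ and a following Extended_Pictographic code point, so a break is expected between U+FE0F and the second U+1F43B. The sketch below illustrates that expectation with a toy segmenter. It is not ziglyph's implementation: it applies only rules GB9 and GB11, and the property checks `isExtPict` and `isExtend` are hypothetical stand-ins that cover just the code points in this example.

```zig
const std = @import("std");

// Toy property lookups covering only the code points used below; a real
// segmenter would consult the Unicode property tables.
fn isExtPict(cp: u21) bool {
    return cp == 0x1F43B or cp == 0x2744; // BEAR FACE, SNOWFLAKE
}

fn isExtend(cp: u21) bool {
    return cp == 0xFE0F; // VARIATION SELECTOR-16
}

const zwj: u21 = 0x200D;

/// Returns the byte length of the first cluster in `s`.
fn firstClusterLen(s: []const u8) !usize {
    var it = (try std.unicode.Utf8View.init(s)).iterator();
    var pict = false; // cluster so far ends in ExtPict Extend*
    var joinable = false; // previous code point was a ZWJ right after ExtPict Extend*
    while (true) {
        const start = it.i;
        const cp = it.nextCodepoint() orelse return start;
        if (start == 0) {
            pict = isExtPict(cp);
        } else if (isExtend(cp)) {
            joinable = false; // GB9: no break before Extend
        } else if (cp == zwj) {
            joinable = pict; // GB9: no break before ZWJ
            pict = false;
        } else if (joinable and isExtPict(cp)) {
            pict = true; // GB11: ExtPict Extend* ZWJ x ExtPict
            joinable = false;
        } else {
            return start; // otherwise break before this code point
        }
    }
}

test "two polar bears are two clusters" {
    const bear = "\u{1F43B}\u{200D}\u{2744}\u{FE0F}";
    const s = bear ++ bear;
    try std.testing.expectEqual(@as(usize, 13), try firstClusterLen(s));
    try std.testing.expectEqual(@as(usize, 13), try firstClusterLen(s[13..]));
}
```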

FileNotFound on library

384fd04 changed the casing of the package's entry point from src/Ziglyph.zig to src/ziglyph.zig; however, the file in the source tree is still uppercase. Trying to build a project now fails.

Link problem under Windows in debug mode: unresolved external symbol NtClose

Hello
Thanks for your great work on Ziglyph!

I integrated Ziglyph in a C++ project, using a small layer cziglyph to export the functions I need.
https://github.com/jlfaucher/executor/tree/master/sandbox/jlf/trunk/interpreter/classes/support/m17n/cziglyph
https://github.com/jlfaucher/executor/blob/fe41555797b241d20cd99422e1998e6bfaf023e3/sandbox/jlf/trunk/CMakeLists.txt#L696

It works well, but I get a link error when building in debug mode:

[ 62%] Linking CXX shared library bin\rexx.dll
   Creating library lib\rexx.lib and object lib\rexx.exp
cziglyph.lib(cziglyph.obj) : error LNK2019: unresolved external symbol NtClose referenced in function std.os.windows.CloseHandle
cziglyph.lib(cziglyph.obj) : error LNK2019: unresolved external symbol NtCreateFile referenced in function std.target.Arch.isWasm
cziglyph.lib(cziglyph.obj) : error LNK2019: unresolved external symbol NtLockFile referenced in function std.target.Arch.isWasm
cziglyph.lib(cziglyph.obj) : error LNK2019: unresolved external symbol NtWaitForKeyedEvent referenced in function std.meta.assumeSentinel.176
cziglyph.lib(cziglyph.obj) : error LNK2019: unresolved external symbol NtCreateKeyedEvent referenced in function std.fmt.formatValue.226
bin\rexx.dll : fatal error LNK1120: 5 unresolved externals

As a workaround, I always build cziglyph in release mode, and that works fine.
I'm curious to know whether I am doing something wrong... I have very limited knowledge of Zig.

Jean-Louis

How to reach optimal performance with stream?

As title. In the source code I see this:

ziglyph/src/segmenter/Grapheme.zig

Line 244 in 531d309

 while (try input_stream.readUntilDelimiterOrEof(&buf, '\n')) |raw| : (line_no += 1) { 

This breaks the text down by lines. However, if a single line is very long, it will not fit in the [4096]u8 buffer. Also, if there are too many newlines, this will be inefficient, because the read function is called many times. Of course we could use a bufferedReader, but that feels like redundant work, i.e. buffering on top of an already buffered Reader.

Is there a workaround that would make this better?

Thank you!

XID_Start/XID_Continue categories?

Hi, just a short request. Could you please add category tests for the XID_Start and XID_Continue categories, so that I can use this library for a small lexer project I'm working on? It would be much appreciated.

It's entirely possible I'm missing something, and you already provide this.
