ziglyph's People

Contributors

cryptocode, jecolon, natecraddock, nektro, rsaihe, slimsag

ziglyph's Issues

Question: More size-optimal grapheme cluster sorting?

Hi @jecolon! I'm using ziglyph in something like a regexp engine, but instead of traditional [\x{1F600}-\x{1F64F}] Unicode codepoint ranges, which require abominations like this to match all valid emojis, I plan to implement range matching in terms of grapheme clusters.

I can do this today with the Unicode collation algorithm, but because it depends on normalization and the Unicode sort keys (UCA/latest/allkeys.txt), it ends up being quite a large binary. With gzip --best the two files clock in at around half a MB:

hexops-mac:zorex slimsag$ du -sh asset/
568K	asset/
hexops-mac:zorex slimsag$ du -sh asset/*
308K	asset/uca-allkeys.txt.gz
260K	asset/ucd-UnicodeData.txt.gz

I'm aiming to have WebAssembly support be quite nice, so I'm looking for ways to reduce the file size needed for sorting grapheme clusters.

I have a few ideas I'm planning to explore. Namely, I suspect a binary encoding of the sort keys and UnicodeData.txt could go a long way in reducing the file size (it also seems these files may contain comments that could perhaps be omitted).

But I figured I'd reach out first and ask: has anyone else thought about this? Are there perhaps better ways to do grapheme cluster sorting that I have missed?

Streaming Segmenters

I want to use this library to write my parser. The problem I am facing is that I don't know whether the bytes in the current buffer are a valid UTF-8 string or just a prefix of a valid UTF-8 string.

I checked the GraphemeIterator: it validates every byte in the given slice, so it will not work if I don't have the whole string yet...

if (!unicode.utf8ValidateSlice(str)) return error.InvalidUtf8;

So is there any workaround?

Or put simply, I can't make this library work well with Readers, because it expects everything to be already read.
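One possible workaround, sketched here under assumptions, is to hand the iterator only the part of the buffer that ends on a complete UTF-8 sequence and carry the remaining bytes over to the next read. The helper name completePrefixLen is hypothetical and not part of ziglyph. It only detects a truncated trailing sequence; it does not validate the rest of the buffer, and a grapheme cluster may still span two reads.

```zig
const std = @import("std");

/// Hypothetical helper (not part of ziglyph): returns the length of the
/// longest prefix of `buf` that does not end in a truncated UTF-8 sequence.
/// Bytes after that prefix should be kept and prepended to the next read.
fn completePrefixLen(buf: []const u8) usize {
    var i: usize = buf.len;
    var back: usize = 0;
    // A UTF-8 sequence is at most 4 bytes, so look back at most 4 bytes.
    while (i > 0 and back < 4) {
        i -= 1;
        back += 1;
        const b = buf[i];
        if ((b & 0xC0) != 0x80) {
            // `b` is ASCII or a lead byte: check whether its sequence fits.
            // An invalid lead byte is left for the real validator to reject.
            const need = std.unicode.utf8ByteSequenceLength(b) catch return buf.len;
            if (need > buf.len - i) return i; // sequence is cut off
            return buf.len;
        }
    }
    // Empty input or only continuation bytes: nothing to trim here.
    return buf.len;
}
```

With this, a read loop could pass buf[0..completePrefixLen(buf)] to the segmenter and move the leftover bytes to the front of the buffer before reading again.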

Grapheme segmentation with ZWJ sequences

Hello, and thank you for all the hard work on this great library! It has been a pleasure to use.

I do think I have found a bug though. I am using the GraphemeIterator and I noticed that multiple emoji with Zero Width Joiners in a row are only counted as one grapheme. For example, 🐻‍❄️🐻‍❄️ would be considered a single grapheme. Here is a minimal reproducing example:

const std = @import("std");
const ziglyph = @import("ziglyph");

pub fn main() !void {
    var iter = try ziglyph.GraphemeIterator.init("🐻‍❄️🐻‍❄️");
    while (iter.next()) |grapheme| {
        std.debug.print("Found Grapheme: {s} at offset: {}\n", .{grapheme.bytes, grapheme.offset});
    }
}

outputs

Found Grapheme: 🐻‍❄️🐻‍❄️ at offset: 0

Separating the polar bears with a single space yields three graphemes, as expected:

Found Grapheme: 🐻‍❄️ at offset: 0
Found Grapheme:   at offset: 13
Found Grapheme: 🐻‍❄️ at offset: 14

It's possible that I misunderstand grapheme clusters, but I was under the impression that these emoji should be considered distinct "characters"/"symbols". When rendered in the browser for example, each is a distinct symbol, not joined as one.

Also, this specific emoji ends with U+FE0F, but I tested several other ZWJ emoji with the same result.
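For reference, the polar bear is the four-codepoint sequence U+1F43B, U+200D (ZWJ), U+2744, U+FE0F. Rule GB11 of UAX #29 only suppresses a break before an Extended_Pictographic codepoint when a ZWJ immediately precedes it, so a second bear following U+FE0F would be expected to start a new cluster. The sketch below uses only std.unicode to list the codepoints and the byte length; the constant name polar_bear is illustrative.

```zig
const std = @import("std");

// One polar bear: bear face, ZWJ, snowflake, variation selector-16.
const polar_bear = "\u{1F43B}\u{200D}\u{2744}\u{FE0F}";

pub fn main() !void {
    // Iterate over the code points of a single emoji.
    var it = (try std.unicode.Utf8View.init(polar_bear)).iterator();
    while (it.nextCodepoint()) |cp| {
        std.debug.print("U+{X:0>4}\n", .{cp});
    }
    // 4 + 3 + 3 + 3 bytes in UTF-8.
    std.debug.print("{d} bytes\n", .{polar_bear.len});
}
```

The 13-byte length agrees with the offset of 13 reported for the space in the output above.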

FileNotFound on library

384fd04 changed the casing of the package's entry point from src/Ziglyph.zig to src/ziglyph.zig; however, the file in the source tree is still uppercase. Trying to build a project now fails.

Link problem under Windows in debug mode: unresolved external symbol NtClose

Hello
Thanks for your great work on Ziglyph!

I integrated Ziglyph in a C++ project, using a small layer cziglyph to export the functions I need.
https://github.com/jlfaucher/executor/tree/master/sandbox/jlf/trunk/interpreter/classes/support/m17n/cziglyph
https://github.com/jlfaucher/executor/blob/fe41555797b241d20cd99422e1998e6bfaf023e3/sandbox/jlf/trunk/CMakeLists.txt#L696

It works well, but I have a link problem when building in debug mode:

[ 62%] Linking CXX shared library bin\rexx.dll
   Creating library lib\rexx.lib and object lib\rexx.exp
cziglyph.lib(cziglyph.obj) : error LNK2019: unresolved external symbol NtClose referenced in function std.os.windows.CloseHandle
cziglyph.lib(cziglyph.obj) : error LNK2019: unresolved external symbol NtCreateFile referenced in function std.target.Arch.isWasm
cziglyph.lib(cziglyph.obj) : error LNK2019: unresolved external symbol NtLockFile referenced in function std.target.Arch.isWasm
cziglyph.lib(cziglyph.obj) : error LNK2019: unresolved external symbol NtWaitForKeyedEvent referenced in function std.meta.assumeSentinel.176
cziglyph.lib(cziglyph.obj) : error LNK2019: unresolved external symbol NtCreateKeyedEvent referenced in function std.fmt.formatValue.226
bin\rexx.dll : fatal error LNK1120: 5 unresolved externals

As a workaround, I always build cziglyph in release mode, and that's fine.
I'm curious to know if I am doing something wrong... I have very limited knowledge of Zig.

Could it be related to lib.bundle_compiler_rt = true?
https://github.com/jlfaucher/executor/blob/275cb0afa1fecd312c3bc8749a5e109b215b9c60/sandbox/jlf/trunk/interpreter/classes/support/m17n/cziglyph/build.zig#L30

Jean-Louis

How to reach optimal performance with stream?

As title. In the source code I see this:

while (try input_stream.readUntilDelimiterOrEof(&buf, '\n')) |raw| : (line_no += 1) {

This breaks the text down by lines. However, if a single line of input is very long, it will not fit in the [4096]u8 buffer. Also, if there are many newlines, this is inefficient because it calls the read function many times. Of course we could use a bufferedReader, but that feels like redundant work, i.e. we would be buffering on top of an already buffered Reader.

Is there any workaround to make this better?

Thank you!

XID_Start/XID_Continue categories?

Hi, just a short request. Could you please add category tests for the XID_Start and XID_Continue categories, so that I can use this library for a small lexer project I'm working on? It would be much appreciated.

It's entirely possible I'm missing something, and you already provide this.
