I have an implementation of this in https://github.com/JAicewizard/ttf-parser

Support reading all available code-points out of the cmap table (razrfalcon/ttf-parser)

Comments (36)

RazrFalcon commented on May 25, 2024

Can you explain why you need this?

from ttf-parser.

JAicewizard commented on May 25, 2024

I need to read all glyphs in a font and output some data about each glyph together with its code-point.
This means I would have to either iterate over all possible code-points or get them out of the cmap.
I was surprised that there was basically no library for this, so I looked into it, and with a day of work I managed to cover all the formats already supported by this library.

I thought I'd at least offer to give back to you so that any future people looking to do this have a library that does this for them.

RazrFalcon commented on May 25, 2024

Your implementation uses Vec, so this is already out of scope. It should be implemented via Iterator. I can write it, no problem.

The problem is that it's more complex than this. A font can have multiple cmap subtables at once, even with the same encoding, so I cannot guarantee char's uniqueness. Also, I'm not sure what to do with variation codepoints (subtable 14).

I have never seen a library that supports this either, so I guess it's a rather unusual task.

JAicewizard commented on May 25, 2024

For the variations I am not sure either. Maybe something similar to glyph_variation_index, where you can look up all the variations for a codepoint?

RazrFalcon commented on May 25, 2024

> Where you can look up all the variations for a codepoint?

Are you talking about a font or about Unicode?

JAicewizard commented on May 25, 2024

All the variations of a codepoint supported by the font.

RazrFalcon commented on May 25, 2024

In the subtable 14.

JAicewizard commented on May 25, 2024

Here is an implementation that just gets all the variation sequences.
It isn't very elegant with the Option on top of the Vec, but you said you would remove the Vecs anyway.
Would this be sufficient?

Also, what exactly would be the problem with multiple cmap tables at once? You have the same issue when mapping char->glyph, right?

RazrFalcon commented on May 25, 2024

> Would this be sufficient?

I don't know your use case.

> You have the same issue when mapping char->glyph, right?

I don't, because I'm using the first matched one.

Anyway, I don't mind implementing it, but we have to settle on the API and use an Iterator-based, zero-allocation implementation instead of Vec.
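
The "first matched one" rule can be pictured with a small model. This is only a sketch under loud assumptions: `ToySubtable` and `merge_first_match` are hypothetical names, each subtable is reduced to a list of (codepoint, glyph id) pairs, and nothing here is ttf-parser's actual code. It shows how giving priority to earlier subtables removes duplicates when several subtables map the same codepoint.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for one cmap subtable: (codepoint, glyph id) pairs.
type ToySubtable = Vec<(u32, u16)>;

/// Merge subtables so that the first subtable mapping a codepoint wins.
fn merge_first_match(subtables: &[ToySubtable]) -> BTreeMap<u32, u16> {
    let mut merged = BTreeMap::new();
    for subtable in subtables {
        for &(codepoint, glyph) in subtable {
            // `or_insert` keeps an existing entry, so earlier subtables take priority.
            merged.entry(codepoint).or_insert(glyph);
        }
    }
    merged
}
```

Note that this model needs a map to remember what it has already seen, which is exactly the kind of allocation the zero-allocation requirement rules out inside the library.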

JAicewizard commented on May 25, 2024

> I don't know your use case.

I was thinking more that, since this is your library, you should decide what is considered good enough. I personally am not dealing with variations, but if I had to in the future, I think this would be a good implementation for my use case.

As for the duplicate code-points, I am not sure it is an actual issue: with the current implementation the GlyphId is also returned, so you can get the proper glyph. If there are duplicates in the cmaps, it seems to me that that is a font issue. If you do want it to be unique, we can store the table's index and use a function like get_table_index(c: char) -> usize to get the table with the first occurrence. If they match, it's the first occurrence; if not, we skip it.

For the API I think returning an iterator would be ideal, although I wouldn't know how to implement that. Maybe store the current index of every loop we're in, so you can restart all the loops from that index again?

RazrFalcon commented on May 25, 2024

As for the implementation, I can write it for you. The problem is the API itself.

What about a method that converts a glyph_id into a codepoint? Something like Font::codepoint(glyph_id: GlyphId) -> Option<char>. In theory, it will be a bit slower, but it's easier to implement.
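
To make the glyph-to-codepoint direction concrete, here is a minimal sketch that assumes the (codepoint, glyph id) pairs are already available from somewhere. The `reverse_map` helper is hypothetical and not part of ttf-parser. It keeps a list per glyph, because several codepoints can point at the same glyph, which a single `Option<char>` result cannot express.

```rust
use std::collections::HashMap;

/// Build a reverse map from (codepoint, glyph id) pairs.
/// Several codepoints may share one glyph, so each glyph keeps a list.
fn reverse_map(pairs: &[(char, u16)]) -> HashMap<u16, Vec<char>> {
    let mut by_glyph: HashMap<u16, Vec<char>> = HashMap::new();
    for &(c, glyph) in pairs {
        // Codepoints are appended in the order they appear in `pairs`.
        by_glyph.entry(glyph).or_default().push(c);
    }
    by_glyph
}
```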

JAicewizard commented on May 25, 2024

If that is easy to implement for you it would certainly be good enough for me.

RazrFalcon commented on May 25, 2024

OK. I will look into it in a few days.

JAicewizard commented on May 25, 2024

I thought about this some more: what if, instead of returning an iterator, we call a callback function?
So instead of:

for x in Face::list_codepoints() {
    // code that handles/registers the codepoint
}

users would have to write:

Face::list_codepoints(|c: char, glyph: GlyphId| {
    // code that handles/registers the codepoint
});

It is basically the same code, and if you have an iterator or Vec you are probably going to write a loop that does the exact same thing anyway.
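
A minimal sketch of the callback style, under stated assumptions: `ToyCmap` is a hypothetical stand-in that already holds its pairs in memory, glyph ids are plain `u16` values rather than `GlyphId`, and `list_codepoints` is only the name proposed above, not an existing ttf-parser method. It also shows that a caller who really wants a Vec can rebuild one from the callback.

```rust
/// Hypothetical stand-in for a parsed cmap: (codepoint, glyph id) pairs.
struct ToyCmap {
    pairs: Vec<(char, u16)>,
}

impl ToyCmap {
    /// Callback style: the parser drives the loop and hands each pair to `f`.
    fn list_codepoints(&self, mut f: impl FnMut(char, u16)) {
        for &(c, glyph) in &self.pairs {
            f(c, glyph);
        }
    }
}

/// Recover a Vec from the callback API when one is needed.
fn collect_pairs(cmap: &ToyCmap) -> Vec<(char, u16)> {
    let mut out = Vec::new();
    cmap.list_codepoints(|c, glyph| out.push((c, glyph)));
    out
}
```

The library side stays allocation-free; any allocation happens in the caller's closure.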

With regard to duplicate glyphs, I think that most use cases for this would want to know every codepoint for a glyph.
The most obvious use case is a font viewer, where you have to show every codepoint, not just every glyph.

Sorry for the many edits, I forgot that Ctrl+Enter posts the comment on GitHub.

RazrFalcon commented on May 25, 2024

I'm still not sure how to implement this. Even the basic gid-to-char variant.
The method you proposed is completely incorrect. This is not how cmap works. Not to mention that CFF can have its own encoding.

JAicewizard commented on May 25, 2024

Why is simply iterating over all cmap entries incorrect? cmap is essentially a table mapping from a codepoint to a glyph ID.
If I am not mistaken (which I very well might be), simply iterating over all entries would yield all the available mappings.

RazrFalcon commented on May 25, 2024

cmap can have multiple encodings. Which one should be used? Again, duplicates, variation glyphs, etc. There is no single correct way to implement this. And ttf-parser is strictly following the spec. Any non-trivial stuff should be done manually.

JAicewizard commented on May 25, 2024

The idea behind this feature would be to simply iterate over the characters.
I understand that this is not something that can be done "correctly" in a strict sense.
I think the best solution for this (besides a fork, as it is now) would be for me to publish a crate that does this using Face::table_data() and link it from this issue for people to find.
It's up to you whether you want to keep this issue open until this feature makes it into ttf-parser.

RazrFalcon commented on May 25, 2024

I have no plans on implementing this feature, sorry.

JAicewizard commented on May 25, 2024

Since the code is coupled to a lot of internal types, I don't think it's possible to extract it to an external crate.
I don't completely understand why it is not exactly following the specification, but it's your crate and your decision, no need to say sorry.

RazrFalcon commented on May 25, 2024

Easy. Can you link another library that does this? FreeType doesn't seem to support it.

JAicewizard commented on May 25, 2024

No, that is why I wrote this and tried to get this in. No other libraries have this and I thought this would be a unique feature to this one.

ebraminio commented on May 25, 2024

Sorry for the drive-by comment, but does the requested feature need something like the hb_face_collect_unicodes API?

RazrFalcon commented on May 25, 2024

No other libraries support this for a reason.

RazrFalcon commented on May 25, 2024

@ebraminio hb_face_collect_unicodes collects codepoints only from a single subtable.

And it looks like it ignores variation glyphs, which is kinda pointless? You have to use collect_variation_selectors and collect_variation_unicodes instead.

JAicewizard commented on May 25, 2024

I don't know the implementation details, but yes, that is pretty much what this is. I don't know how much it matters whether or not it only covers one table.
And @RazrFalcon, see hb_face_collect_variation_selectors and hb_face_collect_variation_unicodes for that.

Edit: you already noticed, sorry, I was just typing my comment :)

RazrFalcon commented on May 25, 2024

I guess we can do this as cmap::Subtable method. Without any high-level API. Which also means no C API, in case you need one.

JAicewizard commented on May 25, 2024

What exactly would be the problem with doing this on the whole cmap? If you are afraid of duplicate characters mapping to different glyphs, an easy way to resolve that would be to call Face::glyph_index and see if it resolves to the same GlyphId. That way we guarantee that if you access that character later on, you will get the same glyph.

RazrFalcon commented on May 25, 2024

> to call Face::glyph_index

This will be absurdly slow. What's wrong with using a specific subtable?

JAicewizard commented on May 25, 2024

Nothing, I wanted to know what the objections would be against calling Face::glyph_index.
I completely understand that it would be an extremely slow operation for large fonts.
Thanks for the amount of time you've already put into this issue.

clouds56 commented on May 25, 2024

I'd like to have this feature too.

> I guess we can do this as cmap::Subtable method. Without any high-level API. Which also means no C API, in case you need one.

I think it's a good idea to do this at the subtable level. There's enough information in a subtable anyway, and we could collect the results into something like a Vec outside the crate.

FYI, I'm coming from this patch and would like to switch to ttf-parser, as rusttype did.

laurmaedje commented on May 25, 2024

I am also interested in this. My use case is indexing fonts to later quickly check whether the font has a specific character without loading it.
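
As an illustration of that indexing use case, here is a minimal sketch that assumes the codepoints have already been collected from the font somehow. The helper names `to_ranges` and `contains` are hypothetical. It compresses the codepoints into sorted inclusive ranges, which keeps the index small, and answers coverage queries with a binary search.

```rust
use std::cmp::Ordering;

/// Compress codepoints into sorted, inclusive (start, end) ranges.
fn to_ranges(mut codepoints: Vec<u32>) -> Vec<(u32, u32)> {
    codepoints.sort_unstable();
    codepoints.dedup();
    let mut ranges: Vec<(u32, u32)> = Vec::new();
    for cp in codepoints {
        if let Some(last) = ranges.last_mut() {
            // Extend the current range when the codepoint directly follows its end.
            if last.1 + 1 == cp {
                last.1 = cp;
                continue;
            }
        }
        ranges.push((cp, cp));
    }
    ranges
}

/// Check coverage with a binary search over the ranges.
fn contains(ranges: &[(u32, u32)], cp: u32) -> bool {
    ranges
        .binary_search_by(|&(start, end)| {
            if cp < start {
                Ordering::Greater // this range lies above `cp`
            } else if cp > end {
                Ordering::Less // this range lies below `cp`
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}
```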

I looked at this a little, and my approach would be along these lines:

1. Add iter methods for each format next to the parse methods. These would return either Option<impl Iterator<Item=(u32, u16)>> or just impl Iterator<Item=(u32, u16)>. I think the former would be simpler because the parsing code uses lots of early returns. For the latter, we would need a MaybeIterator enum that is empty in case we would return early. Generally, the iter methods would probably be quite similar to the parse methods, but I'm not sure there's a way around this duplication.
2. Then, add an iter method (or maybe a more expressively named codepoint_glyph_pairs) to Subtable, which would return Option<impl Iterator<Item=(u32, GlyphId)>> (this time the u16s are mapped to GlyphIds). Since each format's iter method returns a distinct type, this would need to return either a trait object with dynamic dispatch (which I guess is not possible for this library) or a large enum which is generic over each table's iterator and emulates dynamic dispatch.

If this sounds somewhat sensible I would maybe try and start implementing this. (Maybe not directly all formats, I fear it might get a bit difficult to express the more complex ones as iterators, I wish rust had stable generators ...)
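
The "large enum that emulates dynamic dispatch" idea can be sketched as follows. Everything here is an assumption made for illustration: `PairIter`, `ToySubtable`, `format0_pairs`, `format12_pairs` and `pairs` are hypothetical names, only two heavily simplified formats are modelled, and glyph ids are plain `u16` values.

```rust
/// One enum variant per subtable format; each wraps that format's iterator.
enum PairIter<A, B> {
    Format0(A),
    Format12(B),
}

impl<A, B> Iterator for PairIter<A, B>
where
    A: Iterator<Item = (u32, u16)>,
    B: Iterator<Item = (u32, u16)>,
{
    type Item = (u32, u16);

    fn next(&mut self) -> Option<Self::Item> {
        // A `match` replaces the vtable call a `dyn Iterator` would need.
        match self {
            PairIter::Format0(it) => it.next(),
            PairIter::Format12(it) => it.next(),
        }
    }
}

/// Toy "format 0": glyph ids indexed directly by codepoint, 0 meaning unmapped.
fn format0_pairs(glyphs: &[u8]) -> impl Iterator<Item = (u32, u16)> + '_ {
    glyphs.iter().enumerate().filter_map(|(cp, &g)| {
        if g != 0 { Some((cp as u32, u16::from(g))) } else { None }
    })
}

/// Toy "format 12": (start codepoint, end codepoint, start glyph id) groups.
fn format12_pairs(groups: &[(u32, u32, u16)]) -> impl Iterator<Item = (u32, u16)> + '_ {
    groups.iter().flat_map(|&(start, end, glyph)| {
        (start..=end).map(move |cp| (cp, glyph.wrapping_add((cp - start) as u16)))
    })
}

/// Hypothetical subtable type borrowing already-decoded data.
enum ToySubtable<'a> {
    Format0(&'a [u8]),
    Format12(&'a [(u32, u32, u16)]),
}

/// Single entry point that hides which format-specific iterator is in use.
fn pairs(subtable: ToySubtable<'_>) -> impl Iterator<Item = (u32, u16)> + '_ {
    match subtable {
        ToySubtable::Format0(glyphs) => PairIter::Format0(format0_pairs(glyphs)),
        ToySubtable::Format12(groups) => PairIter::Format12(format12_pairs(groups)),
    }
}
```

With more formats the enum needs one type parameter and one `match` arm per format, which is the boilerplate cost of avoiding a trait object.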

JAicewizard commented on May 25, 2024

Iterators are indeed difficult for the complex tables, which is why I proposed using a callback instead of an iterator.
For applications that are really just going to iterate over it once, this makes no difference. The only downside I can see is that .collect() would not be available, but just pushing every element to a Vec shouldn't be that hard.

RazrFalcon commented on May 25, 2024

@laurmaedje I think the bigger problem is that Iterator will be very slow. Too many unnecessary, indirect calls. The better solution is to use a callback. Like:

subtable.codepoints(|c| println!("{}", c));

It's not as nice as iterators, since you can't use filter() and stuff. But iterators will be way less efficient.

PS: there will be no duplicates, since we're working on the subtable level.

laurmaedje commented on May 25, 2024

Yeah, I guess you are both right, that's the better approach here. I can try implementing that!

The remaining question would be what the exact API is here. For one, whether the callback is FnMut(u32) or FnMut(u32, GlyphId). I think the latter would be a little more flexible and faster in case you need the glyph IDs, but the former would probably lead to less code duplication (no need to parse glyph indices, just codepoints) and would be a bit faster when you don't need them.

Also, what would happen when there is an error? Should codepoints return something indicating an error condition, or will the callback simply not be called?

RazrFalcon commented on May 25, 2024

I guess just u32 is good enough for now. And you can ignore subtable 14 for now.

On error, you should simply stop parsing. ttf-parser doesn't provide error reporting anyway. We assume that the font is valid.
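
A minimal sketch of "simply stop parsing", under stated assumptions: the input is a simplified group array (a big-endian u32 count followed by start codepoint, end codepoint and start glyph id triples, loosely modelled on cmap format 12 without its header), and the `read_u32` and `codepoints` helpers are hypothetical, not ttf-parser's implementation. Returning `Option<()>` is a choice made here so the early exit is observable; a caller can ignore it.

```rust
/// Read a big-endian u32 at `offset`, or None if the data is too short.
fn read_u32(data: &[u8], offset: usize) -> Option<u32> {
    let b = data.get(offset..offset.checked_add(4)?)?;
    Some(u32::from_be_bytes([b[0], b[1], b[2], b[3]]))
}

/// Walk the group array and pass every codepoint to `f`.
/// Returns None as soon as the data runs out; no error details are reported.
fn codepoints(data: &[u8], mut f: impl FnMut(u32)) -> Option<()> {
    let count = read_u32(data, 0)?;
    for i in 0..count as usize {
        let base = 4 + i * 12; // 4-byte count, then 12 bytes per group
        let start = read_u32(data, base)?;
        let end = read_u32(data, base + 4)?;
        // The start glyph id at `base + 8` is skipped: only codepoints are reported.
        for cp in start..=end {
            f(cp);
        }
    }
    Some(())
}
```

Codepoints from groups that were read completely before the truncation point are still delivered to the callback.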

support reading all available code-points out of the cmap table. about ttf-parser HOT 36 CLOSED