Comments (6)

simoncozens commented on June 22, 2024

Yeah, we talked about this in harfbuzz/youseedy#1 :-)

Hmm. Getting all the properties sounds like a good idea, but... it's very slow to parse the whole file, even with lxml: on my computer, 25 seconds with Unihan, 7 seconds without Unihan, compared to 0.7 seconds to parse all the text files.

But what if we don't read it as XML?

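from lxml import etree
import re

# Index each <char cp="..."> line by code point without parsing it as XML.
ucd = {}
with open("ucd.all.flat.xml", encoding="utf-8") as f:
    for line in f:
        m = re.search(r'char cp="([0-9A-F]+)"', line)
        if m:
            ucd[int(m[1], 16)] = line

def getucd(cp):
    # Parse only the requested line, on demand.
    return etree.fromstring(ucd[cp].strip()).attrib

print(getucd(0x600))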

1.5 seconds. Not bad.

from youseedee.

simoncozens commented on June 22, 2024

And if you get very friendly with the file layout, you don't even need regular expressions:

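from lxml import etree

# Index each <char cp="..."> line by code point, relying on the fixed layout
# of the file: "{6 spaces}<char cp="XXXX" age="
ucd = {}
with open("ucd.all.flat.xml", encoding="utf-8") as f:
    for line in f:
        if "char cp" in line:
            # The code point starts at column 16 and ends just before '" age'.
            cp = line[16:line.index('" age')]
            ucd[int(cp, 16)] = line

def getucd(cp):
    # Parse only the requested line, on demand.
    return etree.fromstring(ucd[cp].strip()).attrib

print(getucd(0x600))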
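
For anyone who wants to try the idea without downloading the file, here is a minimal sketch of the same "index raw lines, parse on lookup" approach. It is only an illustration under some assumptions: `SAMPLE` is a hand-written stand-in for ucd.all.flat.xml with just a few attributes, `build_index` and `lookup` are hypothetical names, and it uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs anywhere.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for ucd.all.flat.xml.
# Real entries carry many more attributes than these.
SAMPLE = '''<ucd>
 <repertoire>
      <char cp="0041" age="1.1" na="LATIN CAPITAL LETTER A" gc="Lu"/>
      <char cp="0600" age="4.0" na="ARABIC NUMBER SIGN" gc="Cf"/>
 </repertoire>
</ucd>
'''

def build_index(lines):
    """Map each code point to its raw, still-unparsed line."""
    index = {}
    for line in lines:
        if "char cp" in line:
            # Locate the hex digits between cp=" and the next quote,
            # so the sketch does not depend on exact indentation.
            start = line.index('cp="') + 4
            end = line.index('"', start)
            index[int(line[start:end], 16)] = line
    return index

def lookup(index, cp):
    """Parse a single indexed line on demand and return its attributes."""
    return dict(ET.fromstring(index[cp].strip()).attrib)

index = build_index(io.StringIO(SAMPLE))
props = lookup(index, 0x600)
print(props["na"], props["gc"])
```

The cost of XML parsing is paid only for the code points that are actually looked up; building the index is plain string scanning.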


behdad commented on June 22, 2024

Oof. Interesting. Something to pursue definitely.


behdad commented on June 22, 2024

Needs to handle groups as well.
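
A rough sketch of what group handling could look like with the same line-indexing trick. In the grouped format, a `<group>` element carries default attribute values and each `<char>` inside it overrides them where needed. Everything here is an assumption for illustration: `SAMPLE_GROUPED` is a made-up fragment (the grouping of these characters is invented), the layout is assumed to be one opening tag per line, `build_group_index` and `lookup_grouped` are hypothetical names, and the standard library parser stands in for lxml.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical fragment in the style of the grouped file; the grouping
# shown here is invented purely to demonstrate attribute overriding.
SAMPLE_GROUPED = '''<ucd>
 <repertoire>
  <group age="1.1" gc="Lu">
   <char cp="0041" na="LATIN CAPITAL LETTER A"/>
   <char cp="0042" na="LATIN CAPITAL LETTER B"/>
  </group>
  <group age="4.0" gc="Cf">
   <char cp="0600" na="ARABIC NUMBER SIGN"/>
   <char cp="0660" na="ARABIC-INDIC DIGIT ZERO" gc="Nd" age="1.1"/>
  </group>
 </repertoire>
</ucd>
'''

def build_group_index(lines):
    """Map each code point to (enclosing group line, char line), both unparsed."""
    index = {}
    group_line = None
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("<group"):
            group_line = stripped
        elif stripped.startswith("</group"):
            group_line = None
        elif stripped.startswith('<char cp="'):
            start = len('<char cp="')
            cp = stripped[start:stripped.index('"', start)]
            index[int(cp, 16)] = (group_line, stripped)
    return index

def lookup_grouped(index, cp):
    """Merge the group's defaults with the char's own attributes."""
    group_line, char_line = index[cp]
    props = {}
    if group_line is not None:
        # The opening <group ...> tag is unclosed on its own line,
        # so append a closing tag to make it parse in isolation.
        props.update(ET.fromstring(group_line + "</group>").attrib)
    # Attributes on the char itself take precedence over group defaults.
    props.update(ET.fromstring(char_line).attrib)
    return props

group_index = build_group_index(io.StringIO(SAMPLE_GROUPED))
print(lookup_grouped(group_index, 0x660))
```

Each entry holds a reference to the shared group line rather than a copy, so the index stays small even when a group contains many characters.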


behdad commented on June 22, 2024

I mean, the group file is much smaller than the flat file. But maybe not after compression. Have to check.
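
One quick way to check would be something like the sketch below, which reports the raw and gzip-compressed size of a blob of bytes. The helper name `raw_and_gzip_size` is hypothetical, gzip is just one possible compressor, and the file names in the trailing comment are assumptions about what the downloaded files are called.

```python
import gzip

def raw_and_gzip_size(data: bytes):
    """Return (raw size, gzip-compressed size) in bytes."""
    return len(data), len(gzip.compress(data))

# Small synthetic input, only to show the call; real numbers need the real files.
demo = b'<char cp="0041"/>\n' * 1000
raw, packed = raw_and_gzip_size(demo)
print(f"raw={raw} gzip={packed}")

# With the actual files on disk (names assumed), it might look like:
#   from pathlib import Path
#   for name in ("ucd.all.flat.xml", "ucd.all.grouped.xml"):
#       print(name, raw_and_gzip_size(Path(name).read_bytes()))
```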


behdad commented on June 22, 2024

I was going to download ucdxml and compare sizes but gave up on the front page this time...
https://twitter.com/behdadesfahbod/status/1312090058548154368

