han-eacc-ucs's Introduction

EACC/Unicode Ideograph Mappings

The kEACC field in Unihan 6.2 is out of date. Compared with the mappings in the latest MARC-8 Code Table from the Library of Congress (LoC), 8 of its mappings differ and 235 are missing.

This directory contains an updated table for Unihan derived from the LoC data.

The Source Data

  • Unihan_OtherMappings.txt 6.2 from the Unicode Consortium
  • “MARC-8 to Unicode XML mapping file” from the Library of Congress

The Mapping Table

loc-eacc-ucs.txt was generated from the LoC MARC-8 table with the loc.xslt XSLT script.

The Programs

loc.xslt
XSLT script that extracts the Han ideograph mappings from the LoC XML file. It handles the cases where an EACC code maps to both a Private Use Area (PUA) code point and U+3013. The output of this script is a file containing two tab-separated columns:
  1. The 3-byte EACC code as six hexadecimal digits
  2. The Unicode scalar value (USV) of the corresponding Unicode character
eacc-loc-unihan.lisp
Functions for reading the mapping tables and comparing their entries. This uses the CL-PPCRE library, which is easily installable via Quicklisp. It was tested with Clozure Common Lisp but should work with any conforming implementation.
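For readers without a Lisp environment, the two input formats can be read with a few lines of Python. This is only a minimal sketch, not a port of eacc-loc-unihan.lisp. The function names are hypothetical, and it assumes that the USV column of loc-eacc-ucs.txt is written in hexadecimal, that Unihan data lines have the usual form `U+XXXX<TAB>field<TAB>value`, and that lines starting with `#` are comments.

```python
def read_loc_eacc(lines):
    """Parse 'EACC<TAB>USV' lines into {EACC code: code point}.

    Assumption: both columns are hexadecimal strings.
    """
    table = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        fields = line.split("\t")
        if len(fields) < 2 or not fields[1]:
            continue  # skip malformed rows instead of failing
        table[fields[0].upper()] = int(fields[1], 16)
    return table


def read_unihan_keacc(lines):
    """Parse Unihan lines, keeping only kEACC, into {EACC code: code point}."""
    table = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) >= 3 and fields[1] == "kEACC":
            # fields[0] looks like 'U+F9B2'; drop the 'U+' prefix
            table[fields[2].upper()] = int(fields[0][2:], 16)
    return table


# Rows taken from the comparison output in this README; the
# 'kSomeOtherField' line is a made-up placeholder for a non-kEACC field.
loc_sample = ["4B5F58\t096F6\n", "215C32\t09038\n"]
unihan_sample = [
    "# comment line\n",
    "U+F9B2\tkEACC\t4B5F58\n",
    "U+F9B2\tkSomeOtherField\tVALUE\n",
]

loc = read_loc_eacc(loc_sample)
unihan = read_unihan_keacc(unihan_sample)
print(f"{loc['4B5F58']:04X} {unihan['4B5F58']:04X}")
```

With real files, pass an open file object instead of the sample lists, for example `read_loc_eacc(open("loc-eacc-ucs.txt", encoding="utf-8"))`.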

Comparing the Tables

Load eacc-loc-unihan.lisp into your Lisp image and switch to the EACC package.

EACC> (defvar *unihan* (read-unihan-eacc-mappings "Unihan_OtherMappings.txt"))
*UNIHAN*
EACC> (defvar *loc* (read-loc-eacc-mappings "loc-eacc-ucs.txt"))
*LOC*
EACC> (compare-entries *UNIHAN* *LOC*)
4B5F58	0F9B2	096F6
215C32	0FA25	09038
215061	0FA1D	07CBE
4B7421	0F9A9	056F9
4B4B3E	0F9AD	073B2
215F71	0FA1C	09756
4B333E	0F92E	051B7
214339	0FA12	06674
NIL

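The output of the call to compare-entries shows the 8 EACC codes whose Unihan mapping (second column, e.g., U+F9B2) differs from the mapping in the LoC table (third column, e.g., U+96F6). In all 8 cases the Unihan value lies in the CJK Compatibility Ideographs block.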
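The idea behind this first comparison can be sketched in Python as follows. This is an illustration of the check described above, not the implementation of compare-entries, whose internals are not shown here. The function names are hypothetical, and each table is assumed to be a dictionary from EACC code to code point.

```python
def differing_mappings(unihan, loc):
    """Return (eacc, unihan_cp, loc_cp) for codes present in both
    tables whose code points disagree."""
    rows = []
    for eacc, unihan_cp in unihan.items():
        loc_cp = loc.get(eacc)
        if loc_cp is not None and loc_cp != unihan_cp:
            rows.append((eacc, unihan_cp, loc_cp))
    return rows


def format_row(eacc, unihan_cp, loc_cp):
    """Format one row like the listing above: five hex digits per code point."""
    return f"{eacc}\t{unihan_cp:05X}\t{loc_cp:05X}"


# The first two entries come from the listing above; 'AAAAAA' is a
# made-up entry on which both tables agree, so it must not be reported.
unihan = {"4B5F58": 0xF9B2, "215C32": 0xFA25, "AAAAAA": 0x4E00}
loc = {"4B5F58": 0x96F6, "215C32": 0x9038, "AAAAAA": 0x4E00}

for row in differing_mappings(unihan, loc):
    print(format_row(*row))
```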

Comparing in the other direction shows the 235 characters that have mappings in the LoC table without a kEACC mapping in Unihan:

EACC> (compare-entries *LOC* *UNIHAN*)
4B3474		0537F
213F53		061F2
4B5361		089D2
214456		06813
;;; lots deleted
216053		0985E
216044		09818
3A284C		053A9
45564B		0865E
NIL
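The second comparison amounts to finding the EACC codes that appear in the LoC table but have no kEACC entry in Unihan. A minimal Python sketch of that check follows. It is not the repository's Lisp code: the function name is hypothetical, and the tables are again assumed to be dictionaries from EACC code to code point.

```python
def missing_from_unihan(loc, unihan):
    """Return (eacc, loc_cp) for every EACC code that the LoC table
    maps but that has no entry in the Unihan table."""
    return [(eacc, cp) for eacc, cp in loc.items() if eacc not in unihan]


# Rows taken from the listings in this README.
loc = {"4B3474": 0x537F, "213F53": 0x61F2, "4B5F58": 0x96F6}
unihan = {"4B5F58": 0xF9B2}

for eacc, cp in missing_from_unihan(loc, unihan):
    # The empty middle column stands for the absent Unihan mapping.
    print(f"{eacc}\t\t{cp:05X}")
```

Note that 4B5F58 is not reported here: it is present in both tables, so it belongs to the "different mapping" case rather than the "missing" case.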

License

The source code is in the public domain: do with it what you will.

Contributors

treerex
