cjkvi / cjkvi-ids


IDS data for CJK Unified Ideographs

Home Page: http://kanji-database.sourceforge.net/

cjkvi-ids's Introduction

IDS data

This is a collection of various IDS (Ideographic Description Sequence) data.

Description

An IDS (Ideographic Description Sequence) is a way to describe the structure of CJK Unified Ideographs.

An IDS consists of IDCs (Ideographic Description Characters), namely "⿰" (U+2FF0) through "⿻" (U+2FFB), and DCs (Description Characters), which are usually ideographs.
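As a rough illustration of this structure, the sketch below parses an IDS into a tree. It assumes only the twelve IDCs named above, of which ⿲ (U+2FF2) and ⿳ (U+2FF3) take three operands and the rest take two. The names `IDC_ARITY`, `Node`, and `parse_ids` are made up for this example and are not part of this repository.

```python
from dataclasses import dataclass
from typing import List, Union

# IDCs U+2FF0 .. U+2FFB. U+2FF2 and U+2FF3 take three operands; the rest take two.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3
IDC_ARITY["\u2FF3"] = 3


@dataclass
class Node:
    idc: str
    children: List[Union["Node", str]]


def parse_ids(ids: str) -> Union[Node, str]:
    """Parse one IDS string into a tree; a bare DC is returned as a str."""
    chars = list(ids)  # a Python str iterates by code point, so SIP characters are safe
    pos = 0

    def parse() -> Union[Node, str]:
        nonlocal pos
        if pos >= len(chars):
            raise ValueError("IDS ended before all operands were supplied")
        ch = chars[pos]
        pos += 1
        if ch in IDC_ARITY:
            return Node(ch, [parse() for _ in range(IDC_ARITY[ch])])
        return ch

    tree = parse()
    if pos != len(chars):
        raise ValueError("trailing characters after a complete IDS")
    return tree
```

This sketch treats every non-IDC code point as a DC, so entity references and bracketed source signs would have to be stripped or tokenized before calling it.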

IDS data is important information about ideographs, because an ideograph can often be identified from its IDS.

However, the same ideograph can be encoded with more than one IDS. Tools that normalize IDSes and identify ideographs from them are therefore important. The IDS tool is one such example.
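To make the identification idea concrete, here is a minimal sketch of a reverse index from IDS strings to characters. The function name `build_reverse_index` and the in-memory input shape are assumptions for this example. It matches IDS strings exactly, which is precisely why two differently written IDSes for one ideograph need normalizing before lookup.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_reverse_index(entries: Iterable[Tuple[str, List[str]]]) -> Dict[str, List[str]]:
    """Map each IDS string to the characters it describes (exact string match only)."""
    index: Dict[str, List[str]] = defaultdict(list)
    for char, ids_list in entries:
        for ids in ids_list:
            if char not in index[ids]:
                index[ids].append(char)
    return dict(index)


# Two IDSes that have both been used or proposed for U+5DEB in this repository's issues.
index = build_reverse_index([("巫", ["⿱一⿻丄从", "⿻工从"])])
print(index["⿻工从"])
```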

Also, IDSes use the full range of CJK ideographs, so fonts that cover all encoded ideographs (such as HanaMin or HanaMin AFDKO) should be used.

Encoding Policies

Compatibility ideographs whose IDSes differ from those of their corresponding unified ideographs may be used as DCs. When multiple compatibility ideographs have the same IDS, the one with the smaller character code is used. (e.g. ⻀,並,荒,冗,叟,切,巢,廾,戛,桒,甾,𤾡,舁,蕤,貫,黾)
The following non-ideographs may be used as DCs (for now): "αℓ△⺀⺄⺆⺈⺊⺌⺍⺶⺸⺻⺼〇〢キサ㇀㇉㇢㇞"
Encircled numerics ① to ⑳ represent unencoded DCs. The number denotes the stroke count of the DC, which is useful when calculating the total stroke count of an ideograph. This convention does not conform to Annex I of ISO/IEC 10646, so replace them with the wildcard character "？" (U+FF1F) if you need strict conformance with the UCS standard.
IDS data files whose names end in "-cdp.txt" use PUA characters from CDP (the "Chinese Document Processing lab" at Academia Sinica) as DCs. They are denoted by entity references such as "&CDP-xxxx;".

At the end of "ids-cdp.txt", the mappings between PUA DCs and CDP references are enumerated. For details of the usable PUA characters, refer to the article on CDP at GlyphWiki. The relationship between CDP hexadecimal numbers and Unicode BMP PUA code points is based on the EUDC code points defined by the Microsoft Big5 to PUA conversion table. The HanaMinAFDKO font supports these glyphs in the PUA.
The IDS of a compatibility ideograph may sometimes use compatibility ideographs as DCs, in order to clarify how its structure differs from that of the corresponding unified ideograph.
"G", "T", "J", "K", "V", etc. signs in brackets after an IDS indicate that the IDS is specific to the corresponding columns of the UCS code charts. "A" indicates AJ1-6 shapes, and "X" indicates a virtual shape that does not actually appear in the UCS specification but possibly matches that code point according to Annex S of the UCS. Some such shapes may appear in OS-equipped fonts such as MingLiu, MS-Mincho or SimSun, or in well-known dictionaries such as "Dai Kanwa Jiten". "O" indicates an obsolete shape that once appeared in an older edition of the UCS standard but no longer does.

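The policies above can be sketched in code as follows. This is only an illustration under stated assumptions: it assumes data lines are tab-separated as "U+XXXX", the character, then one or more IDS fields with an optional bracketed source sign, which is how entries are quoted in this repository's issues. All names here (`parse_entry`, `Variant`, `to_strict`, `unencoded_strokes`, `cdp_refs`) are hypothetical helpers, not part of the data set.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

CDP_ENTITY = re.compile(r"&CDP-[0-9A-Fa-f]+;")
TAGGED_IDS = re.compile(r"^(?P<ids>[^\[\]]+?)(?:\[(?P<tags>[A-Z]+)\])?$")


class Variant(NamedTuple):
    ids: str
    tags: Optional[str]  # e.g. "GTJ"; None when the IDS is not source-specific


def parse_entry(line: str) -> Tuple[int, str, List[Variant]]:
    """Split an assumed 'U+XXXX<TAB>char<TAB>IDS[tags]...' line into its parts."""
    codepoint, char, *fields = line.rstrip("\n").split("\t")
    variants = []
    for field in fields:
        m = TAGGED_IDS.match(field.strip())
        if m is None:
            raise ValueError(f"unrecognized IDS field: {field!r}")
        variants.append(Variant(m.group("ids"), m.group("tags")))
    return int(codepoint[2:], 16), char, variants


def to_strict(ids: str) -> str:
    """Replace encircled numerics U+2460..U+2473 with the wildcard U+FF1F."""
    return "".join("\uFF1F" if 0x2460 <= ord(c) <= 0x2473 else c for c in ids)


def unencoded_strokes(ids: str) -> int:
    """Sum the stroke counts carried by encircled numerics in an IDS."""
    return sum(ord(c) - 0x2460 + 1 for c in ids if 0x2460 <= ord(c) <= 0x2473)


def cdp_refs(ids: str) -> List[str]:
    """Return the CDP entity references used in an IDS."""
    return CDP_ENTITY.findall(ids)
```

For example, `to_strict("⿱⑤灬")` yields "⿱？灬" and `unencoded_strokes("⿱⑤灬")` yields 5. Fields carrying extra annotations after the brackets are rejected by this sketch rather than guessed at.
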
Licenses

'ids.txt' is derived from the CHISE project, and its license follows their terms. 'ids-ext-cde.txt' is not directly based on the CHISE project and is not restricted to the GPLv2 license.
All other data are distributed under GPLv2.

Author

kawabata


cjkvi-ids's Issues

IDS for U+43D5 䏕

U+43D5 䏕 ⿰月𡈼[G] ⿰⺼壬[T] ⿰月王[J] ⿰月壬[K]

should be changed to

U+43D5 䏕 ⿰月壬[GK] ⿰⺼𡈼[T] ⿰月王[J]

In addition, the ⿰月王 glyph for Japan looks very suspicious (according to IPA 文字情報基盤, the pronunciations of the two unified variants are different), and the TF glyph (⿰⺼𡈼) has displaced the more correct T4 glyph (⿰⺼壬), which is now at U+2F981.

Two decompositions for 肯 are the same

The entry for 肯 is

U+80AF	肯	⿱止月[GJK]	⿱止月[TV]

Both the [GJK] and the [TV] decompositions are the same, so it could be simplified to

U+80AF	肯	⿱止月
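A check like this could be automated. The sketch below merges IDS fields whose IDS part is identical and, following the simplification suggested in this issue, drops the source signs when only one distinct IDS remains. The function name and the "IDS[SIGNS]" field shape are assumptions for illustration.

```python
import re
from typing import Dict, List

FIELD = re.compile(r"^([^\[\]]+)\[([A-Z]+)\]$")


def merge_duplicate_variants(fields: List[str]) -> List[str]:
    """Merge tagged IDS fields that share the same IDS, keeping first-seen order."""
    tags_by_ids: Dict[str, str] = {}
    for field in fields:
        m = FIELD.match(field)
        ids, tags = (m.group(1), m.group(2)) if m else (field, "")
        tags_by_ids[ids] = tags_by_ids.get(ids, "") + tags
    if len(tags_by_ids) == 1:
        # Only one distinct IDS: the source signs carry no distinction here.
        return list(tags_by_ids)
    return [f"{ids}[{tags}]" if tags else ids for ids, tags in tags_by_ids.items()]


print(merge_duplicate_variants(["⿱止月[GJK]", "⿱止月[TV]"]))
```

Whether a single remaining IDS should always lose its source signs is a policy decision for the maintainers; the sketch simply mirrors what this issue proposes.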

U+2BF4A 𫽊

The glyph for U+2BF4A 𫽊 has been changed from

⿰扌⿱龸子

to

⿰扌学

in the newest Code Charts.

Amendments to IDS containing 虚/虛 for Extension B and up

U+25CA4 𥲤 ⿱竹虚
U+271FA 𧇺 ⿰光虛
U+27754 𧝔 ⿰衤虚[G] ⿰衤虛[T](control comparison)
U+27D06 𧴆 ⿰豸虛
U+28F0B 𨼋 ⿰阝虛
U+293E3 𩏣 ⿰虛韋
U+2955A 𩕚 ⿰虛頁
U+2B8DE 𫣞 ⿰亻虛

can be changed to:

U+25CA4 𥲤 ⿱竹虚[G] ⿰竹虚[TH]
U+271FA 𧇺 ⿰光虚[G] ⿰光虚[X]
U+27754 𧝔 ⿰衤虚[G] ⿰衤虛[T](control comparison)
U+27D06 𧴆 ⿰豸虚[G] ⿰豸虛[T]
U+28F0B 𨼋 ⿰阝虚[G] ⿰阝虛[TV]
U+293E3 𩏣 ⿰虚韋[G] ⿰虛韋[T]
U+2955A 𩕚 ⿰虛頁[T] ⿰虚頁[X]
U+2B8DE 𫣞 ⿰亻虛[V] ⿰亻虚[X]

No Japanese decomposition for 肌

The database contains:

U+808C  肌      ⿰月几[GTV]     ⿰⺼几[T]

The character is a Japanese 常用漢字, and should have a Japanese decomposition, which would likely be the former. There are other characters around this one that have the same kind of decompositions, where both sides have a T decomposition but no J. I wonder if they are typos, or if both decompositions actually are valid for T.
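Entries of this kind could be found mechanically. A minimal sketch, assuming each source sign is a single capital letter inside the trailing brackets of a field; `repeated_tags` and `all_tags` are hypothetical helper names:

```python
import re
from typing import List, Set

TAGS = re.compile(r"\[([A-Z]+)\]$")


def all_tags(fields: List[str]) -> Set[str]:
    """Collect every source sign used by the IDS fields of one entry."""
    found: Set[str] = set()
    for field in fields:
        m = TAGS.search(field)
        if m:
            found.update(m.group(1))
    return found


def repeated_tags(fields: List[str]) -> Set[str]:
    """Return source signs that appear on more than one IDS of one entry."""
    seen: Set[str] = set()
    repeated: Set[str] = set()
    for field in fields:
        m = TAGS.search(field)
        for tag in (m.group(1) if m else ""):
            (repeated if tag in seen else seen).add(tag)
    return repeated


fields = ["⿰月几[GTV]", "⿰⺼几[T]"]
print(repeated_tags(fields), "J" in all_tags(fields))
```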

𧿋 (U+27FCB)

𧿋 (U+27FCB) should be ⿰ 𧾷𡕒 instead of ⿰ 𧾷㔿.

Additional IDS for U+2170F 𡜏

U+2170F 𡜏 ⿰女𣧄

IDS for 䢄

By etymology, according to 説文解字, it should be a top/bottom character, not left right.
Suggested change:
⿱辝林 ⿱辝𣏟[X]

U+5DEB 巫

Wouldn't ⿻工从 be a better or more intuitive IDS for U+5DEB 巫 than ⿱一⿻丄从? For reference, Table I.1 of ISO/IEC 10646 Annex I uses ⿻从工 as the example IDS for U+2FFB ⿻, though I feel that ⿻工从 is more intuitive.

U+20CB9 𠲹

Add ⿰口忌

Additional IDS for U+22001 𢀁

U+22001 𢀁 ⿰𣧄犮

Based on the original sources of U+22001 (T5-3776) (http://dict.variants.moe.edu.tw/variants/rbt/word_attribute.rbt?quote_code=QzAzMDMx):

字彙:

重訂直音篇:

𢀁 Pronunciation 魚乞切
𣧄 Pronunciation 魚乙切 / 魚乞切.

U+293A9 𩎩

U+293A9 𩎩 ⿰韋⿻木𢎘
U+293A9 𩎩 ⿰韋&CDP-85C2;

should be

U+293A9 𩎩 ⿰韋𣎺

Amendment to IDS

U+7690 皐 ⿱白𠦂

is more correctly

U+7690 皐 ⿱白𠦂[GHV] ⿱白⿻十⿰二二[TJK]

This does not affect any character containing U+7690, except for the following.

U+22FCE 𢿎 ⿰皐攴

In the T-source glyph of this character, the last stroke on the left bends leftwards.

U+3828 㠨

U+3828 㠨 is listed as ⿱山欝 but unless I'm mistaken it appears that it should be ⿰山欝.

Consider using U+31E3 in U+3514 㔔, U+3AB3 㪳 & U+3AC8 㫈

The IDSes for U+3514 㔔, U+3AB3 㪳, and U+3AC8 㫈 use U+3007 〇 as a component, but U+3007 has numeric properties, and is suitable as a component only in shape.

Instead, consider replacing it with CJK Stroke U+31E3, which has no numeric properties, and corresponds to a syllable-final ng/ŋ sound that is related to the shape of U+11BC HANGUL JONGSEONG IEUNG (combining) or U+3147 ㅇ HANGUL LETTER IEUNG (compatibility).
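The difference in numeric properties can be checked with Python's standard `unicodedata` module. This is only a quick sketch; `numeric_value` is a helper name made up here, and the result depends on the Unicode version bundled with the Python interpreter.

```python
import unicodedata
from typing import Optional


def numeric_value(ch: str) -> Optional[float]:
    """Return the Unicode numeric value of ch, or None if it has none."""
    return unicodedata.numeric(ch, None)


for cp in (0x3007, 0x31E3):
    ch = chr(cp)
    print(f"U+{cp:04X}", unicodedata.category(ch), numeric_value(ch))
```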

Merge all 朩 entries into 木

朩 is an Idu character which is actually the cursive form of 木, so they are semantically equivalent from a CJK Unified Ideographs point of view. Also, characters with 木 in Taiwan are usually written as 朩 without the hook when placed at the bottom. Separating them causes difficulty in lookup.

Suggested Options:
(1) Merge all 朩 without the hook into 木.
(2) Merge all 朩 with or without the hook into 木.

I prefer Option (1), because Option (2) may confuse users in China.

If no change is made, then the following rules should be added:

a	b	c
U+409E	䂞	⿱石木 [G]
U+202D5	𠋕	⿰亻⿱⿰工几木 [G]

U+217FD

U+217fd 𡟽 should be ⿰女𡋲

U+205B8 𠖸

U+205B8	𠖸	⿰ 冫 ⿱ &CDP-8AFC;() 匕

is actually

U+205B8	𠖸	⿰冫仺

By etymology,
仺 = 倉
𠖸 = 滄

Therefore, the separation of strokes and the presence of the final 鉤 need not be counted as a different component.

Suggested IDS for U+226B9 𢚹

U+226B9 𢚹 ⿰忄𣅘

To follow the V glyph (and also more correct etymologically 𢚹 = 忄 + 𣅘)

Decomposition for 捗

It currently is

⿰扌步[GTK]	⿰扌歩[J]

But the 常用漢字表 explicitly says the form for that character doesn't match that of 歩, so it seems it should simply be

⿰扌步

(Although my electronic version of the 漢字源 dictionary, which is older than 2010, does show the extra stroke)

U+443D 䐽

U+443D  䐽  ⿰⺼⿱廿兩[T]  ⿰月⿱廿兩[K]  ⿰月㒼[X]

U+443D  䐽  ⿰月⿱㒼[G]  ⿰⺼⿱㒼[T]  ⿰月⿱廿兩[K]

Decomposition for U+5FF9

U+5FF9	忹	⿰忄王[G]	⿰忄壬[T]

should be

U+5FF9	忹	⿰忄王

How do you prefer users to quote your work?

I am a big fan of this library and I have been using it for personal purposes a lot, but I am considering using small parts of the data in a paper to illustrate how Chinese character evolution could be modeled. I could quote this library by just passing the github link, however, I'd prefer if you share how you like to be quoted explicitly, so that you get the credit you deserve. Furthermore, I'd suggest that you make a release of the data, and that you also submit this release to some open repository like Zenodo.org, so that we can quote you correctly, with a DOI for a particular version that is being used in research (non-profit, of course).

Decomposition for 雨

Hi. I'm not sure that this is the right place to report this. I was using the Atom factoids plugin (which uses the data from this repo) and noticed that 雨 is decomposed as ⿱一𠕒, and this looks very strange to me.
I'm not a specialist in this stuff, so probably it's correct.

Consider "⿱ 一八" as one unit?

The component would be &hkcs-comp-0001;
Glyph of component would be http://en.glyphwiki.org/glyph/hkcs_comp-0001.svg

IDS for 騺

U+9A3A 騺 ⿱執馬[GT] ⿱埶馬[K]

would now be

U+9A3A 騺 ⿱執馬

in the latest code charts.

𫋑 U+2B2D1

U+2B2D1	𫋑	⿰虫⿲王㐅王

should be changed to

U+2B2D1	𫋑	⿰虫班

Reference:

𫋑 has the semantic component 虫 and the phonetic component 班.

Suggested IDS for U+21D2C 𡴬

U+21D2C 𡴬 ⿰𡴤𡴤

Decomposition for 𧶚 (U+27D9A)

Additional decomposition ⿱𠦔貝, which is the semantic decomposition

Update 𤚌 (U+2468C)

U+2468C 𤚌 ⿸尸⿱⿰二⿺乚二牛

U+2468C 𤚌 ⿸㞑牛

Also refer to:

U+21CB6 𡲶 ⿸㞑出

Decompositions for U+66F1 to U+6707

Some of them decompose using 曰, some of them decompose using 日, some of them decompose using both. I think at least some of the ones using 日 should also have a decomposition using 曰 (at least 更, 最, 書, 曹, 曽, 替, for those I know)

Decomposition for 𡳲 U+21CF2

U+21CF2 | 𡳲 | ⿺尾⿱爫萬

should be

U+21CF2 | 𡳲 | ⿺尾⿱爫禺

Decomposition for 後

It currently is:

⿰彳⿱幺夊

I think there should at least be this variant:

⿰彳⿱幺夂[J]

Note there is precedent in characters like 夏, 憂, 致, 𢖻

Decomposition for 祈 and 祉

They are:

U+7948	祈	⿰礻斤[GT]	⿰示斤[JK]
U+7949	祉	⿰礻止[GT]	⿰示止[JK]

I think they should be:

U+7948	祈	⿰礻斤[GTJ]	⿰示斤[K]
U+7949	祉	⿰礻止[GTJ]	⿰示止[K]

The reason is that while there are 旧字体 with the 示 form, their form in the 常用漢字表 is with 礻.
Plus, the variants with 示 now have their own codepoints: U+FA4E (祈) and U+FA4D (祉)
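One practical consequence when processing this data: compatibility ideographs such as U+FA4E carry a canonical singleton decomposition, so Unicode normalization replaces them with the unified ideograph and the distinction discussed here disappears. A small sketch using Python's standard `unicodedata` module (`unified_counterpart` is a name invented for this example):

```python
import unicodedata


def unified_counterpart(ch: str) -> str:
    """Return what NFC normalization turns ch into."""
    return unicodedata.normalize("NFC", ch)


for cp in (0xFA4E, 0xFA4D):
    unified = unified_counterpart(chr(cp))
    print(f"U+{cp:04X} -> U+{ord(unified):04X}")
```

Any tool that reads or writes the IDS files should therefore avoid normalizing the text if it needs to keep compatibility ideographs intact.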

Decomposition for 嗅

It is

⿰口臭

but it should arguably be

⿰口臭

U+27627 𧘧

U+27627 𧘧 ⿰ 衤 ⿱ 丿 &CDP-85F1;()

should be

U+27627 𧘧 ⿰衤𤣫

where 𤣫 is a variant of 斗 and 𧘧 a variant of 𧘞

Decomposition for 𨤜 U+2891C

Additional decomposition ⿱𢍏豕 which is the logical decomposition following 豢 from 說文

Change to IDS for 𨖙

U+28599 𨖙    ⿺辶眾   ⿺辶衆

is more accurately

U+28599 𨖙    ⿺辶眾[T]    ⿺辶⿱⿱丿罒&CDP-8C76;[V] ⿺辶衆[X]

Change to IDS U+2378C 𣞌

U+2378C 𣞌    ⿰木⿱⿱白冖⿱太万

Should be

U+2378C 𣞌    ⿰木⿱⿱白冖⿱大方

This is evident from the T-source glyph (TF-5F2D). In T-source glyphs containing 太, the dot is always joined with 大. Also, for glyphs containing 方, the dot is always joined with 万. In this case, since the dot is joined with 万, it can be deduced that the correct decomposition is 大方.

This is also reflected in CNS11643's decomposition at
http://www.cns11643.gov.tw/AIDB/query_general_view.do?page=f&code=5f2d:
木白冖大亠&cdp-8b6c

IDS for 為

The Kang Xi radical code for 為 is for 灬, so at the very least, it seems the IDS for it should be:

⿱⑤灬

More generally, there seem to be a few characters with no decomposition that should at least be decomposed into some IDC + some number + their 部首, where both the number and the 部首 can be derived from the kRSKangXi information in the Unihan database (although in some cases kRSAdobe_Japan1_6 and/or kRSUnicode can have different values, and I'm not sure whether that's the case for the characters that currently have no decomposition).

I guess I could run a script to cross-check the Unihan database against the characters in the ids.txt file that have the same thing in both columns 2 and 3 (i.e. have no decomposition).
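The first half of such a script might look like the sketch below. It assumes data lines are tab-separated and start with "U+", and it only lists the entries whose second and third columns are identical; the cross-check against Unihan radical and stroke data is left out. The function name `undecomposed` is made up for this example.

```python
from typing import Iterable, Iterator, Tuple


def undecomposed(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (codepoint, char) for entries whose IDS is just the character itself."""
    for line in lines:
        if not line.startswith("U+"):
            continue  # skip comments, headers and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[1] == fields[2]:
            yield fields[0], fields[1]


if __name__ == "__main__":
    import sys

    # Usage (assumed file name from the README): python script.py ids.txt
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8") as f:
            for codepoint, char in undecomposed(f):
                print(codepoint, char)
```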

𧆪

IDS for 𧆪 should include 𠂞 (u2009e) according to etymology

Additional IDS for 𩇞 (U+291DE)

Additional IDS as:
| U+291DE | 𩇞 | ⿰青炁 |

𩇞 is a variant form of 靝 (天), and 炁 is a variant form of 氣 (qi4). 旡 (ji4) is the phonetic element of 炁; it is unrelated to the similar glyph 无, which is a simple form of 無.

𪼘 U+2AF18

U+2AF18 𪼘 ⿰王⿱𫇦亘

should be

U+2AF18 𪼘 ⿰王萱

𦺌 (U+26E8C)

U+26E8C | 𦺌 | ⿱艹⿰氵⿳穴人夂[U]⿱艹⿰氵⿳穴人又[GT]

change to

U+26E8C | 𦺌 | ⿱艹𣸈

Decomposition for U+27C9D 𧲝 and U+27C9E 𧲞

U+27C9D	𧲝	⿱䘙豚

should be

U+27C9D	𧲝	⿱衞豚[G]	⿱衛豚[T]

Additionally,

U+27C9E	𧲞	⿱䘙𧱔

should be

U+27C9E	𧲞	⿱衞𧱔[G]	⿱衛𧱔[T]

Edits: U+27C9E is 𧲞 instead of 𧲝, though the decomposition is correct.

Decomposition of U+2189B

U+2189B	𡢛	⿱⿲幺⿱丿幺幺𡚴[G]	⿱𪺕𡚴[T]

should be changed to

U+2189B	𡢛	⿱⿲幺⿱丿幺幺𡚴[G]	⿳丿⿲幺幺幺𡚴[T]

Decomposition for U+9269 鉩

U+9269	鉩	⿰金尒[GJK]  ⿰金𡭗[T]

should be

U+9269	鉩	⿰金尒

U+2C606

U+2C606	𬘆	⿰ 糹 ⿻ 申 ⿱ 𠔽 月 ⿰ 糹 ⑭

should be

U+2C606	𬘆	⿰ 糹𤲥

instead.

Reference:

U+2A9A7	𪦧	⿰ 女 𤲥

Additional IDS for 𠄚 (U+2011A)

Suggest adding the IDS for U+2011A 𠄚 ⿰𠀭丁[X]

𠄚 in Kangxi:

𠄚 in Jiyun:

It is likely that Kangxi mis-copied the form again. 𠀭 is an old form of 平, and based on the description in Kangxi and the arrangement of 𠄚 in Jiyun, 𠄚 is likely an old form of 𢆊, so the left-hand side should theoretically be 𠀭.

Update U+4C8A 䲊

U+4C8A 䲊 ⿲魚卩&CDP-8CA9;[G] ⿰魚隋[T]

U+4C8A 䲊 ⿰魚隋
