charlesloder / havarotjs
A Typescript package for getting syllabic data about Hebrew text with niqqud.
Home Page: https://www.npmjs.com/package/havarotjs
License: MIT License
The doc CI job always fails because typedoc was updated. Either downgrade typedoc or find a better pages plugin
When there is a word with a "waw with a holem" and a "holem waw" in the same word, the "waw with a holem" is incorrectly replaced with a "holem waw"
E.g. עֲוֹנוֹתֵינוּ
There are two ways to write a holem-waw:
Pattern | Word |
---|---|
(1) consonant + holem + waw | שָׁלֹום |
(2) consonant + waw + holem | שָׁלוֹם |
Additionally, instead of a holem (U+05B9), a holem haser for vav (U+05BA) can also be used for typographic reasons, meaning there are four possible patterns for encoding a holem-vav.
Pattern (2) is preferred because:
Because the holem haser for vav (U+05BA) is primarily used for typographic reasons, it will be best to convert all occurrences of U+05BA to U+05B9.
In order to semantically encode a holem-waw, all occurrences in each word of:
a waw preceding a holem, but no vowel preceding the waw will be swapped so that the holem precedes the waw.
Examples:
Because taamei can occur before a waw but do not need to occur before a waw, the taamei will be removed, the characters swapped, and then the strings rebuilt like the qametsQatan sanitation.
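The normalization described above could be sketched roughly as follows. This is a minimal sketch, not the package's actual implementation: the name `normalizeHolemWaw` and the character classes are assumptions for illustration, and any taamim on the waw are simply carried along by the regex rather than removed and rebuilt.

```typescript
// Hypothetical helper; not part of the havarotjs API.
const HOLEM = "\u05B9";
const HOLEM_HASER = /\u05BA/g;

// A consonant followed only by non-vowel marks (taamim, dagesh, meteg, shin/sin dots),
// then a waw (optionally carrying taamim), then a holem.
// A consonant that carries its own vowel never matches, so a consonantal
// "waw with a holem" is left alone.
const WAW_THEN_HOLEM =
  /([\u05D0-\u05EA][\u0591-\u05AF\u05BC\u05BD\u05C1\u05C2]*)\u05D5([\u0591-\u05AF\u05BD]*)\u05B9/g;

function normalizeHolemWaw(word: string): string {
  return word
    // U+05BA is treated as a typographic variant of U+05B9
    .replace(HOLEM_HASER, HOLEM)
    // move the holem in front of the waw; taamim from the waw stay before the holem
    .replace(WAW_THEN_HOLEM, "$1$2\u05B9\u05D5");
}
```

With this sketch, a word such as עֲוֹנוֹתֵינוּ would only have its second waw swapped, because the first waw follows a consonant that already has a vowel.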
A single shureq וּ fails with the error:
TypeError: Cannot read properties of undefined (reading 'hasTaamim')
at /Users/charlesloder/Documents/code/personal/havarot/dist/utils/syllabifier.js:297:70
at Array.filter (<anonymous>)
at setIsAccented (/Users/charlesloder/Documents/code/personal/havarot/dist/utils/syllabifier.js:297:42)
at /Users/charlesloder/Documents/code/personal/havarot/dist/utils/syllabifier.js:347:37
at Array.forEach (<anonymous>)
at syllabify (/Users/charlesloder/Documents/code/personal/havarot/dist/utils/syllabifier.js:347:15)
at get syllables [as syllables] (/Users/charlesloder/Documents/code/personal/havarot/dist/word.js:67:44)
at /Users/charlesloder/Documents/code/personal/havarot/dist/text.js:146:46
at Array.map (<anonymous>)
at get syllables [as syllables] (/Users/charlesloder/Documents/code/personal/havarot/dist/text.js:146:27)
The error is caused by
Need to check if arr[i] exists
I keep forgetting to update the changelog. Create some guard to ensure that it's updated. Maybe even a simple Y/n on the command line
The fix in #17 caused an error where a word with a final aleph, ס֣וֹא, would lose the aleph. This is a non-standard Hebrew spelling
When Latin characters are used next to the Divine Name (e.g. a comma), it creates issues.
See this issue for more context.
Will have to adjust this regex:
havarotjs/src/utils/divineName.ts
Line 1 in 7538f7d
To probably something like
const nonChars = /[^\u{05D0}-\u{05F4}]/gu;
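As a rough illustration of how such a pattern might be used, the sketch below strips everything outside U+05D0 to U+05F4 before comparing against the consonants of the Divine Name. The helper name `isDivineName` is hypothetical, and the regex is written without the `u` flag, which matches the same characters for this range. Note that the range also excludes niqqud, so the points are stripped along with Latin punctuation.

```typescript
// Same character range as the proposed pattern above
const nonChars = /[^\u05D0-\u05F4]/g;

// Hypothetical helper, named only for illustration
function isDivineName(word: string): boolean {
  // after stripping, only the consonants yod-he-waw-he should remain
  return word.replace(nonChars, "") === "\u05D9\u05D4\u05D5\u05D4";
}
```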
The various spellings of 'Jerusalem' do not sequence correctly.
The most uncommon spelling — יְרוּשָׁלַיִם like וִירוּשָׁלַ֨יִם֙ in Jer 26:18 — syllabifies fine ✅
The common spelling of יְרוּשָׁלִַ֗ם like in Josh 10:1 does syllabify correctly, but switches the hiriq and the patach in the final syllable 👎
See יְרוּשָׁלִָֽם in 2 Sam 14:23; the same issue as above 👎
The issue resides in how the Cluster sequences the Chars.
עָתְנִיאֵ֣ל
other names?
Hello Charles,
Thank you so much for your continued work on this fantastic library. I'm experiencing an issue with a Vav Holam shifting to the previous letter after initializing a Text object. Any suggestions on what may be going wrong? I'm using version 0.7.2 with Node for reference.
Word passed to Text() = א֑וֹר
Syllable returned = אֹ֑ור
Thank you!!
Add a check for q.q.
havarotjs/src/utils/qametsQatan.ts
Lines 98 to 104 in 6ca463a
const qametsReg = /\u{05B8}/u;
const qametsQatReg = /\u{05C7}/u;
const hatefQamRef = /\u{05B3}/u;
// if no qamets or has qamets qatan char, return
if (!qametsReg.test(word) || qametsQatReg.test(word)) {
return word;
}
The Divine Name יְהוָה causes the Error A Syllable shouldn't preceded a Cluster with a Mater.
This wasn't anticipated as the Divine Name does not follow typical rules.
The name can be written two ways:
יְהֹוָה with a holem, which produces no error
יְהוָה w/o a holem, which produces an error but is more typical.
Perhaps create a property Word.isDivineName?
Lines 21 to 28 in 6ca463a
There are a lot of variations on the form כל (e.g. וְכָל). Most issues are going to occur in non-Biblical texts where the use of the maqqef is less common.
I will need to identify these — hopefully systematically.
This text:
וְאֵ֗לֶּה שְׁמוֹת֙ בְּנֵ֣י יִשְׂרָאֵ֔ל הַבָּאִ֖ים מִצְרָ֑יְמָה אֵ֣ת יַעֲקֹ֔ב אִ֥ישׁ וּבֵית֖וֹ בָּֽאוּ׃ ברְאוּבֵ֣ן שִׁמְע֔וֹן לֵוִ֖י וִיהוּדָֽה׃ גיִשָּׂשכָ֥ר זְבוּלֻ֖ן וּבִנְיָמִֽן׃ דדָּ֥ן וְנַפְתָּלִ֖י גָּ֥ד וְאָשֵֽׁר׃ הוַֽיְהִ֗י כׇּל־נֶ֛פֶשׁ יֹצְאֵ֥י יֶֽרֶךְ־יַעֲקֹ֖ב שִׁבְעִ֣ים נָ֑פֶשׁ וְיוֹסֵ֖ף הָיָ֥ה בְמִצְרָֽיִם׃ ווַיָּ֤מׇת יוֹסֵף֙ וְכׇל־אֶחָ֔יו וְכֹ֖ל הַדּ֥וֹר הַהֽוּא׃ זוּבְנֵ֣י יִשְׂרָאֵ֗ל פָּר֧וּ וַֽיִּשְׁרְצ֛וּ וַיִּרְבּ֥וּ וַיַּֽעַצְמ֖וּ בִּמְאֹ֣ד מְאֹ֑ד וַתִּמָּלֵ֥א הָאָ֖רֶץ אֹתָֽם׃ {פ} חוַיָּ֥קׇם מֶֽלֶךְ־חָדָ֖שׁ עַל־מִצְרָ֑יִם אֲשֶׁ֥ר לֹֽא־יָדַ֖ע אֶת־יוֹסֵֽף׃ טוַיֹּ֖אמֶר אֶל־עַמּ֑וֹ הִנֵּ֗ה עַ֚ם בְּנֵ֣י יִשְׂרָאֵ֔ל רַ֥ב וְעָצ֖וּם מִמֶּֽנּוּ׃ יהָ֥בָה נִֽתְחַכְּמָ֖ה ל֑וֹ פֶּן־יִרְבֶּ֗ה וְהָיָ֞ה כִּֽי־תִקְרֶ֤אנָה מִלְחָמָה֙ וְנוֹסַ֤ף גַּם־הוּא֙ עַל־שֹׂ֣נְאֵ֔ינוּ וְנִלְחַם־בָּ֖נוּ וְעָלָ֥ה מִן־הָאָֽרֶץ׃
throws the error:
Error: Syllable should not precede a Cluster with a Mater
Figure out why
Words in the form of CǝCûC are being syllabified as 1 syllable, when they should be two syllables.
Words in the form of CǝCû (w/o the final consonant) are correctly 2 syllables, and so are words of the form CǝCVC.
There is likely a problem in the groupFinal logic.
Though the Syllable has useful properties, it should have linguistic properties of syllables as well. Suggestions:
Syllable.onset: string | null
Syllable.nucleus: string
Syllable.coda: string | null
The overwhelming majority of Hebrew syllables have an onset. Though the aleph or ayin may not be considered an onset in Modern Hebrew, they were in Biblical, and orthographically function like an onset.
The only syllable that won't have an onset is a word-initial shureq (e.g. וּמֶלֶךְ [u. 'mε. lεk])
In Biblical Hebrew, there are no medial consonants in the onset; that is, there are no consonant clusters (i.e. CCV or CCVC types). The only exception is for the numeral שְׁתַּיִם and its various forms.
Every syllable must have a nucleus (i.e. vowel). A vocal shewa is a nucleus
A coda is optional. A final qamets-he or qamets-aleph would not count as a coda; these would be of the syllable type CV, but a he with a mappiq would be a coda; it would be a syllable type of CVC.
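To make the suggested properties concrete, here is a minimal sketch of splitting a syllable's text into onset, nucleus, and coda. The interface and function names are assumptions, not the package's API. It assumes marks such as the dagesh are sequenced before the vowel, and it does not handle taamim or matres (so a final qamets-he would wrongly be given a coda here).

```typescript
// Hypothetical shape mirroring the suggested Syllable properties
interface SyllableStructure {
  onset: string | null;
  nucleus: string;
  coda: string | null;
}

// A run of vowel points, or a shureq (waw + dagesh with no vowel point of its own)
const NUCLEUS = /[\u05B0-\u05BB\u05C7]+|\u05D5\u05BC(?![\u05B0-\u05BB\u05C7])/;

function splitSyllable(text: string): SyllableStructure {
  const match = NUCLEUS.exec(text);
  if (!match) {
    throw new Error("Every syllable must have a nucleus");
  }
  const onset = text.slice(0, match.index);
  const coda = text.slice(match.index + match[0].length);
  // empty strings become null, e.g. the onset of a word-initial shureq
  return { onset: onset || null, nucleus: match[0], coda: coda || null };
}
```

Because the first vowel point found is taken as the nucleus, a silent shewa under a closing consonant ends up in the coda along with that consonant.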
In the word יִירָא֥וּךָ the aleph was being parsed as a quiesced aleph (i.e. the syllable as רָא֥) instead of as a consonant (i.e. as א֥וּ)
Currently, havarot syllabifies words according to Traditional (i.e. Sephardic) or Tiberian rules.
The ability to syllabify words according to general Modern Hebrew pronunciation would be beneficial, especially for augmenting with transliteration schemas that follow Modern Hebrew.
In issue #2, it is proposed to introduce more linguistic properties to syllables.
Modern Hebrew differs in its syllable properties
A medial property would need to be included:
Syllable.medial: string | null
Modern Hebrew allows for syllable types of CCV and CCVC.
E.g. גְּדֹולִים is realized as [gdo. 'lim]
For syllables beginning with א, ע, or ה, the onset can be realized as null.
Though, orthographically, they do function like an onset.
In Biblical Hebrew reading traditions, the shewa is often vocalic, but in Modern Hebrew it is often realized as a zero-vowel [Ø] (Coffin and Bolozky, A Reference Grammar of Modern Hebrew, 22), creating syllables of CCV or CCVC types (see above)
The most common times that a word-initial (maybe syllable-initial) shewa is realized as vocalic is when (1) its onset is a י, ל, מ, נ, or ר, or (2) when the second letter is א, ה, or ע.
Example of (1):
Example of (2):
A shewa preceded by a shewa is typically vocal as well, just like Tiberian, but not necessarily so
On v0.1.2, example:
const str = "מַשִׁיחַ";
const doc = new Text(str);
const res = doc.syllables.map((el) => el.text);
[
"מַ", // \u{5DE}\u{5B7} (mem, patach)
"ישִׁ",// \u{5D9}\u{5E9}\u{5C1}\u{5B4} (yod, shin, shin-dot, hiriq)
"חַ" // \u{5D7}\u{5B7} (chet, patach)
]
The yod should be after the shin cluster.
Something is wrong in the new syllabifier.ts logic. Need to add tests.
The Cluster.hasMetheg property needs to better determine between the use of U+05BD as a metheg or as a siluq.
Cluster even has a metheg. If no, return false
Clusters via this.next. false
Cluster has a metheg. If yes, then the second metheg is the siluq, and the current one is a metheg. Return true
true
This logic will have to be tweaked a bit
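One possible reading of the checks above, sketched under stated assumptions: a cluster's U+05BD counts as a metheg only if a later cluster in the same word also carries U+05BD (which would then be the siluq). The `MethegCluster` shape and the standalone `hasMetheg` function are illustrative stand-ins, not the package's Cluster class, and the sketch has the same need for tweaking noted above.

```typescript
// Hypothetical minimal shape; the real Cluster has many more members
interface MethegCluster {
  text: string;
  next: MethegCluster | null;
}

const METEG_CHAR = /\u05BD/;

function hasMetheg(cluster: MethegCluster): boolean {
  // no U+05BD on this cluster at all
  if (!METEG_CHAR.test(cluster.text)) {
    return false;
  }
  // look ahead: a later U+05BD is taken to be the siluq, so this one is a metheg
  let next = cluster.next;
  while (next) {
    if (METEG_CHAR.test(next.text)) {
      return true;
    }
    next = next.next;
  }
  // nothing later carries U+05BD, so this one is assumed to be the siluq
  return false;
}
```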
Similar to #74, a property called vowelName should exist that returns the unicode character name.
new Text("בְּאֶ֣רֶץ").clusters.map(c => c.vowelName)
// ["SHEVA", "SEGOL", "SEGOL", null]
Things to consider:
Is SHEVA a "vowel"? See especially the hasShewa property
On the Cluster object, add a property called isVocalSheva that returns a boolean indicating if the shewa is vocal or not.
new Text("בְּאֶ֣רֶץ").clusters.map(c => c.isVocalSheva)
// [true, false, false, false]
Things to consider:
null if there is no shewa?
Certain letters (שׁ, שׂ, ס, צ, נ, מ, ל, ו, י), when they have a shewa na' (i.e. vocal shewa), reject a dagesh chazaq (i.e. forte).
E.g. וַיְּהִי* becomes וַיְהִי
Should be syllabified as: ["וַ", "יְ", "הִי"], but instead get ["וַיְ", "הִי"].
Some may consider the first syllable (i.e. "וַ") as closed, but it will be considered open.
See: אֱלֹהֶ֑יךָ or רָקִ֖יעַ
A word like: וּלְזַמֵּ֖ר throws the Error "A Syllable shouldn't precede a Cluster"
A text like עַל־כָּל־הַשָּׂרִים֙ is only split into 2 Words. Should be 3
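For reference, the expected splitting can be sketched with a small tokenizer in which a maqqef (U+05BE) ends a word but stays attached to it. The function name `splitWords` is an assumption for illustration and is not how the package splits text internally.

```typescript
// Hypothetical tokenizer: runs of non-space, non-maqqef characters,
// each optionally followed by the maqqef that closes it
function splitWords(text: string): string[] {
  return text.match(/[^\s\u05BE]+\u05BE?/g) || [];
}
```

Applied to the text above, this sketch yields three words, the first two each ending with a maqqef.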
Acc. to GKC §20m there are instances when after the article and the interrogative מה that the dagesh chazaq (or forte) is omitted:
Very frequently in certain consonants with Šewâ mobile, since the absence of a strong vowel causes the strengthening to be less noticeable. This occurs principally in the case of ו and י (on יְ and יֵּ after the article, see § 35 b; on יְּ after מַה־, § 37 b); and in the sonants מ,[6] נ and ל; also in the sibilants, especially when a guttural follows (but note Is 629, מְאַסְפָיו, as ed. Mant. and Ginsb. correctly read, while Baer has מְאָֽסְ׳ with compensatory lengthening, and others even מְאָסְ׳; מִשְׁמַנֵּי Gn 2728, 39; מִשְׁלשׁ 38:24 for מִשְּׁ׳, הַֽשְׁלַבִּים 1 K 728; אֶֽשְֽׁקָה־ 1 K 1920 from נָשַׁק, הַֽשְׁפַתַּ֫יִם Ez 4043 and לַֽשְׁפַנִּים ψ 10418; מִשְׁתֵּים Jon 411, הַֽצְפַרְדְּעִים Ex 81 &c.);—and finally in the emphatic ק.[7]
Of the Begadkephath letters, ב occurs without Dageš in מִבְצִיר Ju 82; ג in מִגְבֽוּרָתָם Ez 3230; ד in נִדְחֵי Is 1112 56:8, ψ 1472 (not in Jer 4936), supposing that it is the Participle Niphʿal of נָדַח; lastly, ת in תִּתְצוּ Is 2210. Examples, עִוְרִים, וַיְהִי (so always the preformative יְ in the imperf. of verbs), מִלְמַ֫עְלָה, לַֽמְנַצֵּחַ, הִנְנִי, הַֽלֲלוּ, מִלְאוּ, כִּסְאִי, יִשְׂאוּ, יִקְחוּ, מַקְלוֹת, מִקְצֵה, &c. In correct MSS. the omission of the Dageš is indicated by the Rāphè stroke (§ 14) over the consonant. However, in these cases, we must assume at least a virtual strengthening of the consonant (Dageš forte implicitum, see § 22 c, end).
The second paragraph is likely beyond the scope of this package.
The first paragraph has three categories for when a dagesh chazaq may be lost, but the shewa should still be counted as a shewa naʿ (or shewa mobile/vocal):
Waltke & O'Connor §13.3d give a simplified explanation:
According to this, the shewa is a shewa nach not a shewa na' seemingly contra GKC.
GKC's references are ambiguous
see charlesLoder/hebrew-transliteration#14
...wip
In the forms with a metheg there is nothing to check. For the others, something like:
cluster.hasShewa and /י/.test(cluster.text) and /הַ/.test(cluster.prev.text)
should syllabify as:
["מִן־", "הַ", "יְ", "אֹ֗ר"]
That would limit it only to the article, but it would be a start.
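The check described above could look roughly like this. It is a sketch only: `ShewaCluster` is a hypothetical minimal shape standing in for the real Cluster, and the function name is made up for illustration.

```typescript
// Hypothetical minimal shape; only the members the check needs
interface ShewaCluster {
  text: string;
  hasShewa: boolean;
  prev: ShewaCluster | null;
}

// True when a yod with a shewa directly follows the article (he + patach),
// in which case the shewa would be treated as vocal despite the missing dagesh
function isYodShewaAfterArticle(cluster: ShewaCluster): boolean {
  return (
    cluster.hasShewa &&
    /\u05D9/.test(cluster.text) &&
    cluster.prev !== null &&
    /\u05D4\u05B7/.test(cluster.prev.text)
  );
}
```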
Maybe an option like strict and when false, allows for incorrect text.
Basically,
havarotjs/src/utils/syllabifier.ts
Lines 214 to 218 in 460869f
and
havarotjs/src/utils/syllabifier.ts
Lines 265 to 269 in 460869f
would need to be bypassed, as well as errors that occur from Cannot read properties of undefined (reading 'has<something>')
See original issue here
See twitter thread
The paseq is included when words are split (e.g. "עֵֽדֹתֶ֨יךָ ׀ "). Because the paseq is a word divider, it should not be included. Additionally, it interferes with adding accents to syllables, causing the final kaf in the example above to be marked as accented.
The paseq should be counted as its own word with a single syllable similar to how non-Hebrew words are handled.
Need to delete the interfaces for the classes as they are not being used correctly
See:
Currently, qamets qatan for the word כל is only recognized when a maqqef is present
This could be potentially fixed with adding
"^כָּל$",
"^כָל$",
For clarity that should be ^kaf+(dagesh)+qamets+lamed$
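A minimal sketch of that pattern, assuming the word has already been isolated: taamim are ignored for the match, either ordering of dagesh and qamets is accepted, and the qamets (U+05B8) is replaced with a qamets qatan (U+05C7). The name `kolToQametsQatan` is hypothetical, and prefixed forms such as וְכָל are not covered.

```typescript
const TAAMIM = /[\u0591-\u05AF\u05BD]/g;

// kaf + (dagesh) + qamets + lamed, accepting either order of dagesh and qamets
const KOL = /^\u05DB(?:\u05BC\u05B8|\u05B8\u05BC|\u05B8)\u05DC$/;

// Hypothetical helper, not the package's qametsQatan logic
function kolToQametsQatan(word: string): string {
  // test against the word without taamim, but keep them in the output
  return KOL.test(word.replace(TAAMIM, ""))
    ? word.replace(/\u05B8/, "\u05C7")
    : word;
}
```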
Single syllable words spelled with a holem waw lose the final letter.
Examples are: י֔וֹם and ע֜וֹד
The issue lies in holemWaw.ts
Currently, only text with niqqud is allowed
It may be good to have an option that makes it explicit that the user would like to syllabify words w/o niqqud — I'm not totally sure what that'll look like though
Hebrew רוּחַ
Overall, when strict is true everything should still be correct
The names of Hebrew characters should be standardized (e.g. shewa becomes sheva).
Also update the documentation
Like מָחֳרָת and יָרָבְעָם
See here
The Latin character after the Divine Name is dropped
Hebrew
כִּי אִם בְּתוֹרַת יְהוָה, חֶפְצוֹ; וּבְתוֹרָתוֹ יֶהְגֶּה, יוֹמָם וָלָיְלָה
Transliteration
kî ʾim bǝtôrat yhwh ḥepṣô; ûbǝtôrātô yehgê, yômām wālāyǝlâ
The way it checks for a schema and then sets options according to that isn't intuitive.
Instead, create premade syllabification schemas that can just be imported
Should throw an error if text without niqqud is passed in.
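A guard along these lines might be enough as a starting point. This is a sketch under assumptions: the function name and error message are made up, and "has niqqud" is approximated as "contains at least one vowel point, dagesh, or shin/sin dot".

```typescript
const HEBREW_CONSONANT = /[\u05D0-\u05EA]/;
// vowel points and dagesh (U+05B0 to U+05BC), shin/sin dots, qamets qatan
const NIQQUD = /[\u05B0-\u05BC\u05C1\u05C2\u05C7]/;

// Hypothetical guard; the message text is illustrative only
function assertHasNiqqud(text: string): void {
  if (HEBREW_CONSONANT.test(text) && !NIQQUD.test(text)) {
    throw new Error("Text must contain niqqud");
  }
}
```

Text with no Hebrew letters at all passes through, so Latin-only input would not trigger the error in this sketch.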
On the Cluster object, add a property that returns the unicode character.
Something like:
new Text("בְּאֶ֣רֶץ").clusters.map(c => c.vowel)
The first three should return the vowel characters of SHEVA, SEGOL, and SEGOL, and the final should return null.
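A sketch of how such a lookup might work on a cluster's text, together with the character-name variant. The standalone functions `vowel` and `vowelName` are assumptions for illustration (the proposal is for properties on Cluster); the names in the table are the Unicode character names with the "HEBREW POINT" prefix dropped.

```typescript
// Unicode names for the Hebrew vowel points, without the "HEBREW POINT " prefix
const VOWEL_NAMES: { [char: string]: string } = {
  "\u05B0": "SHEVA",
  "\u05B1": "HATAF SEGOL",
  "\u05B2": "HATAF PATAH",
  "\u05B3": "HATAF QAMATS",
  "\u05B4": "HIRIQ",
  "\u05B5": "TSERE",
  "\u05B6": "SEGOL",
  "\u05B7": "PATAH",
  "\u05B8": "QAMATS",
  "\u05B9": "HOLAM",
  "\u05BA": "HOLAM HASER FOR VAV",
  "\u05BB": "QUBUTS",
  "\u05C7": "QAMATS QATAN"
};

// Hypothetical helpers operating on a cluster's text
function vowel(clusterText: string): string | null {
  const match = /[\u05B0-\u05BB\u05C7]/.exec(clusterText);
  return match ? match[0] : null;
}

function vowelName(clusterText: string): string | null {
  const v = vowel(clusterText);
  return v ? VOWEL_NAMES[v] : null;
}
```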
I went simply w/ havarot at first, then found out that it was taken on npm. I should change everything for consistency
Some texts use a hyphen or double hyphen like a maqqef
The pipe character (e.g. אֲשֶׁר | אָֽנֹכִי) causes the error Cannot read properties of undefined (reading 'hasVowel').
Some texts use a pipe character instead of a paseq.
The pipe characters are separated into their own words, and when they are syllabified, all the Latin chars are removed and an empty array is used when trying to group clusters
In order to fix this, add a check to see if the Word is Hebrew or not. If not, just make a syllable like is done with the Divine Name
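The proposed check could be sketched as below. The names are hypothetical, and the real syllabifier is represented by a callback passed in as a parameter, since its actual signature is not shown here.

```typescript
const HEBREW_LETTER = /[\u05D0-\u05EA]/;

// `syllabifyHebrew` stands in for the real syllabifier (an assumption of this sketch)
function syllabifyWord(
  word: string,
  syllabifyHebrew: (w: string) => string[]
): string[] {
  // a word with no Hebrew letters (e.g. a lone pipe) becomes a single syllable
  if (!HEBREW_LETTER.test(word)) {
    return [word];
  }
  return syllabifyHebrew(word);
}
```

Because the check only looks for Hebrew letters, a lone paseq would take the same single-syllable path as a pipe in this sketch.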
Like a Cluster, the Syllable object should extend the Node object
The combination of Seghol-Yod (e.g. יךָ◌ֶ) is not counted as a mater.
This is used as a mater in Biblical Hebrew commonly.
Remove reduce() (slow), and clean up
havarotjs/src/utils/qametsQatan.ts
Lines 86 to 92 in 6ca463a
To something like
const sequenceSnippets = (arr: string[]) => {
  return arr.map((snippet) => sequence(snippet.normalize("NFKD")).flat().join(""));
};