
punjabi_bible's Issues

Canonical Psalm titles should use \d rather than \s ; acrostic headings should use \qa

Currently in the Psalms there are

  • 175 instances of \s - being used as non-canonical section headings
  • 137 instances of \s1 - being used improperly

The proper USFM tag for the 116 canonical Psalm titles is \d.
The 22 acrostic stanza headings in Psalm 119 should use the tag \qa.

btw. One canonical title appears to be missing: 116 + 22 = 138, but only 137 instances of \s1 were found.
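Counts like those above can be reproduced with a few lines of script. The following is only a minimal sketch, not the method used for this issue: the function name count_markers and the sample text are made up for illustration, and it counts only markers that begin a line.

```python
import re
from collections import Counter

def count_markers(usfm_text):
    """Count each USFM marker that starts a line (e.g. s, s1, d, qa)."""
    # A marker is a backslash, some lowercase letters and an optional level digit.
    markers = re.findall(r"^\\([a-z]+\d*)\b", usfm_text, flags=re.MULTILINE)
    return Counter(markers)

# Hypothetical fragment, not taken from the repo.
sample = "\\c 3\n\\s1 A Psalm of David\n\\q1 text\n\\s Heading\n"
counts = count_markers(sample)
print(counts["s"], counts["s1"], counts["d"])
```

Running it over the Psalms file before and after retagging would show whether the \s1 count has dropped to zero and the \d and \qa counts have risen accordingly.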

The chapter label tag \cl occurs only once in Psalm 1 after the chapter marker.

\c 1
\ms Book One
\mr Psalms 1-41
\cl Psalm 1

If you wish each Psalm to be properly labeled with the word Psalm rather than Chapter,
then you should use the chapter label tag before the marker for chapter 1, thus:

\cl Psalm
\c 1
\ms Book One
\mr Psalms 1-41

Refer to the USFM User Reference for details.
NB. Some Bible software apps may not support this feature.

Misplaced question mark in 2 Kings 1:6

2 Kings 1:6 reads:
\v 6 ਉਨ੍ਹਾਂ ਨੇ ਉਸ ਨੂੰ ਉੱਤਰ ਦਿੱਤਾ, “ਇੱਕ ਮਨੁੱਖ ਸਾਨੂੰ ਮਿਲਣ ਲਈ ਆਇਆ ਅਤੇ ਸਾਨੂੰ ਆਖਿਆ ਕਿ ਜਿਸ ਰਾਜੇ ਨੇ ਤੁਹਾਨੂੰ ਭੇਜਿਆ ਉਸ ਦੇ ਕੋਲ ਮੁੜ ਜਾਓ ਅਤੇ ਉਸ ਨੂੰ ਆਖੋ ਕਿ ਯਹੋਵਾਹ ਇਸ ਤਰ੍ਹਾਂ ਕਹਿੰਦਾ ਹੈ, ‘ਕੀ ਇਸਰਾਏਲ ਵਿੱਚ ਕੋਈ ਪਰਮੇਸ਼ੁਰ ਨਹੀਂ ਹੈ ਜੋ ਤੂੰ ਅਕਰੋਨ ਦੇ ਦੇਵਤੇ ਬਆਲ-ਜਬੂਬ ਕੋਲ ਪੁੱਛਣ ਲਈ ਭੇਜਦਾ ਹੈਂ’? ਇਸ ਲਈ ਜਿਸ ਬਿਸਤਰ ਉੱਤੇ ਤੂੰ ਪਿਆ ਹੈਂ, ਉਸ ਤੋਂ ਤੂੰ ਨਹੀਂ ਉੱਠੇਂਗਾ ਸਗੋਂ ਤੂੰ ਜ਼ਰੂਰ ਮਰੇਂਗਾ ।”

Shouldn't the question mark come before the closing single quotation mark? i.e.
\v 6 ਉਨ੍ਹਾਂ ਨੇ ਉਸ ਨੂੰ ਉੱਤਰ ਦਿੱਤਾ, “ਇੱਕ ਮਨੁੱਖ ਸਾਨੂੰ ਮਿਲਣ ਲਈ ਆਇਆ ਅਤੇ ਸਾਨੂੰ ਆਖਿਆ ਕਿ ਜਿਸ ਰਾਜੇ ਨੇ ਤੁਹਾਨੂੰ ਭੇਜਿਆ ਉਸ ਦੇ ਕੋਲ ਮੁੜ ਜਾਓ ਅਤੇ ਉਸ ਨੂੰ ਆਖੋ ਕਿ ਯਹੋਵਾਹ ਇਸ ਤਰ੍ਹਾਂ ਕਹਿੰਦਾ ਹੈ, ‘ਕੀ ਇਸਰਾਏਲ ਵਿੱਚ ਕੋਈ ਪਰਮੇਸ਼ੁਰ ਨਹੀਂ ਹੈ ਜੋ ਤੂੰ ਅਕਰੋਨ ਦੇ ਦੇਵਤੇ ਬਆਲ-ਜਬੂਬ ਕੋਲ ਪੁੱਛਣ ਲਈ ਭੇਜਦਾ ਹੈਂ ?’ ਇਸ ਲਈ ਜਿਸ ਬਿਸਤਰ ਉੱਤੇ ਤੂੰ ਪਿਆ ਹੈਂ, ਉਸ ਤੋਂ ਤੂੰ ਨਹੀਂ ਉੱਠੇਂਗਾ ਸਗੋਂ ਤੂੰ ਜ਼ਰੂਰ ਮਰੇਂਗਾ ।”

After all, in English, the quotation is a question:
'Is there no God in Israel that you send to inquire of Baal Zebub, the god of Ekron?'

Should there be a space before the Devanagari Danda?

In the Punjabi Bible
A search for the regexp \S\x{0964} gave 1547 hits.
A search for the regexp \s\x{0964} gave 21326 hits.
Here, those without a space are in a minority, being less than 6.8% of the total.

This prompts the question:

Should there be a space before the Devanagari Danda?

By comparison, in the Assamese Bible the results are quite the opposite!
Those with a space are in a minority, being less than 6.3% of the total.

A search for the regexp \S\x{0964} gave 27756 hits.
A search for the regexp \s\x{0964} gave 1855 hits.

NB. These results relate to my fork of the repo after my commits to the master branch.

What is the typographical standard in this matter for the various languages that use an Indic script?

NB. If some sort of space is required before the Danda, it's conceivable that it should be U+2008 PUNCTUATION SPACE rather than an ordinary space.
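For anyone wishing to repeat these counts, here is a minimal Python sketch. It assumes the text has already been read into a string; the names danda_spacing and minority_share are illustrative only. Python writes the code point as \u0964 where the searches above use \x{0964}.

```python
import re

DANDA = "\u0964"  # DEVANAGARI DANDA, also used with Gurmukhi

def danda_spacing(text):
    """Return (with_space, without_space) counts for dandas in text."""
    with_space = len(re.findall(r"\s" + DANDA, text))
    without_space = len(re.findall(r"\S" + DANDA, text))
    return with_space, without_space

def minority_share(minority, majority):
    """Percentage of the total made up by the smaller group."""
    return 100 * minority / (minority + majority)

# Tiny made-up sample: one danda with a space before it, one without.
print(danda_spacing("ਹੈ । ਹੈ।"))

# The figures reported above for the Punjabi and Assamese Bibles.
print(round(minority_share(1547, 21326), 2))
print(round(minority_share(1855, 27756), 2))
```

A danda at the very start of a file has no preceding character and is counted by neither pattern.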

Two verses that end with an English letter

Deuteronomy 1:8 ends with the letter i
\v 8 ਵੇਖੋ, ਮੈਂ ਇਸ ਦੇਸ਼ ਨੂੰ ਤੁਹਾਡੇ ਸਾਹਮਣੇ ਰੱਖ ਦਿੱਤਾ ਹੈ, ਜਿਸ ਦੇਸ਼ ਦੀ ਯਹੋਵਾਹ ਨੇ ਤੁਹਾਡੇ ਪਿਉ-ਦਾਦਿਆਂ ਨਾਲ ਅਰਥਾਤ ਅਬਰਾਹਾਮ, ਇਸਹਾਕ ਅਤੇ ਯਾਕੂਬ ਨਾਲ ਸਹੁੰ ਖਾਧੀ ਸੀ ਕਿ ਮੈਂ ਇਸ ਦੇਸ਼ ਨੂੰ ਤੁਹਾਨੂੰ ਅਤੇ ਤੁਹਾਡੇ ਬਾਅਦ ਤੁਹਾਡੇ ਵੰਸ਼ ਨੂੰ ਦਿਆਂਗਾ, ਇਸ ਲਈ ਜਾਓ ਅਤੇ ਇਸ ਦੇਸ਼ ਨੂੰ ਆਪਣੇ ਅਧੀਨ ਕਰ ਲਓ । " i

Jeremiah 31:40 ends with the letter s
\v 40 ਤਾਂ ਲੋਥਾਂ ਅਤੇ ਸੁਆਹ ਦੀ ਸਾਰੀ ਵਾਦੀ ਅਤੇ ਸਾਰੇ ਖੇਤ ਕਿਦਰੋਨ ਦੇ ਨਾਲੇ ਤੱਕ ਅਤੇ ਘੇੜੇ ਫਾਟਕ ਦੀ ਨੁੱਕਰ ਤੱਕ ਚੜ੍ਹਦੇ ਪਾਸੇ ਵੱਲ ਯਹੋਵਾਹ ਲਈ ਪਵਿੱਤਰ ਹੋਣਗੇ ਅਤੇ ਉਹ ਫਿਰ ਸਦਾ ਤੱਕ ਨਾ ਕਦੀ ਪੁੱਟਿਆ ਜਾਵੇਗਾ ਨਾ ਡੇਗਿਆ ਜਾਵੇਗਾ । s

These letters are superfluous.
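Stray letters of this kind can be looked for mechanically. The sketch below is one possible check, not the way these two were found: it flags any verse line that ends in a single ASCII letter standing on its own. The function name and the shortened sample verses are hypothetical.

```python
import re

def trailing_latin_letters(usfm_text):
    """Return (verse_number, letter) for verse lines ending in a lone ASCII letter."""
    hits = []
    for line in usfm_text.splitlines():
        # Verse marker, verse number, any text, then whitespace and one letter at the end.
        m = re.match(r"\\v (\d+) .*[ \t]([A-Za-z])[ \t]*$", line)
        if m:
            hits.append((int(m.group(1)), m.group(2)))
    return hits

# Shortened, made-up verse endings modelled on the two cases above.
sample = r'''\v 8 ਕਰ ਲਓ । " i
\v 9 ਠੀਕ ਹੈ ।
\v 40 ਡੇਗਿਆ ਜਾਵੇਗਾ । s'''
print(trailing_latin_letters(sample))
```

Each hit still needs a human look, since a check like this cannot tell a typo from an intentional letter.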

Lookalike characters keyed instead of the Devanagari Danda

I have discovered that the USFM files have numerous instances where a lookalike character has been keyed instead of the Devanagari Danda.

In the Punjabi Bible, a search for the regexp [\x{0A00}-\x{0AFF}]\s*\x6C gave 5172 hits.
These are where the lowercase letter l has been keyed just after a Gurmukhi character.
There were even 2 exclamation marks with no space just before a Gurmukhi letter.
A search for the whole word l gave 5240 hits, 68 more than the first search.
The extra hits are instances that follow a punctuation mark rather than a Gurmukhi character.
An improved search for l found a total of 5247 hits.
All of these can be safely replaced by a Danda.

It's also possible that many of the existing exclamation marks in the text may be miskeyed lookalikes.
Without understanding each context, this is not something that can be determined merely by counting.

  • There are 3022 hits to the regexp [\x{0A00}-\x{0AFF}]\s*\x21
  • There are 3045 exclamation marks in total.

The 16 hits for !! strongly suggest that these were miskeyed for U+0965 Devanagari Double Danda.
There are 18 hits for the regexp \x{0964}\s*\x{0964} that should also be replaced by the Double Danda ॥.

There were also 62 hits for the pattern ! l and 103 hits for the regexp !\s*\x{0964} that are further candidates for Double Danda. These will be taken care of by doing the replacements in the right order.
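To illustrate why the order matters, here is a minimal Python sketch of one possible ordering. It is an assumption on my part, not a tested procedure for the repo: it treats every lone ASCII l as a Danda and every !!, ! followed by a Danda, and doubled Danda as a Double Danda, which the counts above only suggest. The function name fix_dandas is made up.

```python
import re

DANDA = "\u0964"         # DEVANAGARI DANDA
DOUBLE_DANDA = "\u0965"  # DEVANAGARI DOUBLE DANDA

def fix_dandas(text):
    # Step 1: a lone "l" becomes a danda. The lookarounds skip Latin words
    # and USFM markers such as \li1.
    text = re.sub(r"(?<![A-Za-z\\])l(?![A-Za-z0-9])", DANDA, text)
    # Step 2: now that "! l" has become "! danda", collapse all the
    # double-danda candidates in a single pass.
    text = re.sub(r"!!|!\s*\u0964|\u0964\s*\u0964", DOUBLE_DANDA, text)
    return text

print(fix_dandas("ਹੈ ! l"))
```

Doing step 1 first means the "! l" cases need no rule of their own. Doing step 2 first would leave them behind as "! ।".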

Quotation marks

The attached text file provides a character frequency analysis of the 66 USFM files.

merged.usfm.character.frequency.txt

For this issue the following lines are of particular interest.

U+0022	"	1,462	QUOTATION MARK
U+0027	'	171	APOSTROPHE
U+2018	‘	15	LEFT SINGLE QUOTATION MARK
U+2019	’	15	RIGHT SINGLE QUOTATION MARK
U+201C	“	640	LEFT DOUBLE QUOTATION MARK
U+201D	”	638	RIGHT DOUBLE QUOTATION MARK

It's evident that this translation makes no use of continuation quotation marks.

It's apparent that not all the quotations make proper use of left and right quotation marks.
To achieve consistent punctuation of quotations in the translation,

  • Most of the U+0022 pairs will need to be replaced by U+201C and U+201D.
  • Most of the U+0027 pairs will need to be replaced by U+2018 and U+2019.
    NB. As 171 is not an even number, there must be at least one instance of U+0027 that has no corresponding mark at the other end of the quotation.

Notice also the difference of 2 between the counts of U+201C and U+201D.
With some ingenuity in method, I have traced this problem to John 16 where the marking of quotations does not fully reflect the text.
The translation team needs to revisit this chapter and make suitable corrections.
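One way to narrow down such an imbalance is to compare the counts chapter by chapter. The sketch below is not necessarily how John 16 was found. It is a simple illustration with a hypothetical function name, and it assumes each chapter starts with a \c line.

```python
import re

LEFT, RIGHT = "\u201C", "\u201D"  # left and right double quotation marks

def unbalanced_chapters(usfm_text):
    """Return {chapter: (opening, closing)} where the two counts differ."""
    result = {}
    # Splitting on the chapter marker yields [preamble, "1", body1, "2", body2, ...].
    parts = re.split(r"^\\c (\d+)", usfm_text, flags=re.MULTILINE)
    for chapter, body in zip(parts[1::2], parts[2::2]):
        opening, closing = body.count(LEFT), body.count(RIGHT)
        if opening != closing:
            result[int(chapter)] = (opening, closing)
    return result

# Made-up two-chapter fragment: chapter 2 has an unclosed quotation.
sample = "\\c 1\n\\v 1 \u201Ca\u201D\n\\c 2\n\\v 1 \u201Cb \u201Cc\u201D\n"
print(unbalanced_chapters(sample))
```

A quotation may legitimately run across a chapter boundary, so a mismatch is a pointer for review rather than proof of an error.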

A counted words list to help with proofreading

The attached tab delimited text file is a counted words list derived from the Punjabi Bible USFM files.

merged.usfm.words.count.txt

This may be of considerable help towards proofreading.
The counts are in the first field, the words in the second field.
The output is sorted on the words field, so those with similar spellings will be near to each other.
Browsing through the list may therefore bring to light any words with anomalous spelling.

The file can be dropped into Microsoft Excel™ for further analysis by re-sorting, filtering, etc.

Notes:

  • The file was output from a bespoke TextPipe filter using Count duplicate lines.
  • Hyphenated words were preserved.
  • All the Gurmukhi text was included, not just the verse text.
  • The collation algorithm (for the sort) is just how TextPipe works.
  • If Excel™ is used to sort, the words will not come out in the same order.

Some observations:

  • There are 21818 different words in total
  • The most common word is ਦੇ which is found 34846 times
  • The longest 2 words have 16 characters
  • There are 8695 hapax legomena (words with count=00001)
  • There are 712 hyphenated words of which 7 have more than 1 hyphen, namely:
00001	ਬਏਰ-ਲਹਈ-ਰੋਈ
00002	ਬਏਰ-ਲਹੀ-ਰੋਈ
00002	ਮਹੇਰ-ਸ਼ਲਾਲ-ਹਾਸ਼-ਬਜ਼
00001	ਅਟਰੋਥ-ਬੈਤ-ਯੋਆਬ
00001	ਆਬੇਲ-ਬੈਤ-ਮਆਕਾਹ
00001	ਏਲੋਨ-ਬੈਤ-ਹਨਾਨ

Should there be a space before a comma?

A search for the regexp [\x{0A00}-\x{0AFF}] , gave 249 hits.
A search for the regexp [\x{0A00}-\x{0AFF}], gave 37335 hits.

Those without a space before the comma are the majority.
Those with a space are only 0.66% of the total.

Which is preferred? @joshykurian

Widening the searches somewhat:

  • Regexp \s, gave 254 hits.
  • Regexp \S, gave 37855 hits

These counts include where the previous character was not in the Gurmukhi block.

The 5 extra instances of a space before a comma were all ", ," typos, with a comma both before and after the space:

\v 3 ਤਦ ਯਸਾਯਾਹ ਨਬੀ ਹਿਜ਼ਕੀਯਾਹ ਰਾਜਾ ਦੇ ਕੋਲ ਆਇਆ ਅਤੇ ਉਸ ਨੂੰ ਪੁੱਛਿਆ, ਇਨ੍ਹਾਂ ਮਨੁੱਖਾਂ ਨੇ ਕੀ ਆਖਿਆ ਅਤੇ ਉਹ ਕਿੱਥੋਂ ਤੇਰੇ ਕੋਲ ਆਏ ਹਨ ? ਹਿਜ਼ਕੀਯਾਹ ਨੇ ਅੱਗੋਂ ਉੱਤਰ ਦਿੱਤਾ, , ਉਹ ਇੱਕ ਦੂਰ ਦੇ ਦੇਸ ਤੋਂ ਮੇਰੇ ਕੋਲ ਆਏ, ਅਰਥਾਤ ਬਾਬਲ ਤੋਂ ।
\v 3 ਹੇ ਯਾਕੂਬ ਦੇ ਘਰਾਣੇ, ਮੇਰੀ ਸੁਣੋ, ਨਾਲੇ ਇਸਰਾਏਲ ਦੇ ਘਰਾਣੇ ਦੇ ਸਾਰੇ ਬਚੇ ਹੋਇਓ, ਤੁਸੀਂ ਜਿਨ੍ਹਾਂ ਨੂੰ ਮੈਂ ਜਨਮ ਤੋਂ ਸੰਭਾਲਿਆ ਅਤੇ ਕੁੱਖੋਂ ਹੀ ਚੁੱਕੀ ਫਿਰਦਾ ਰਿਹਾ, ,
\v 12 ਤੂੰ ਆਪਣੀਆਂ ਝਾੜਾ-ਫੂਕੀਆਂ ਵਿੱਚ, ਅਤੇ ਆਪਣੀਆਂ ਜਾਦੂਗਰੀਆਂ ਦੇ ਵਾਧੇ ਵਿੱਚ ਕਾਇਮ ਰਹਿ, , ਜਿਨ੍ਹਾਂ ਵਿੱਚ ਤੂੰ ਆਪਣੀ ਜੁਆਨੀ ਤੋਂ ਮਿਹਨਤ ਕੀਤੀ, ਸ਼ਾਇਦ ਤੈਨੂੰ ਲਾਭ ਹੋ ਸੱਕੇ, ਸ਼ਾਇਦ ਤੂੰ ਉਹਨਾਂ ਨੂੰ ਡਰਾ ਸਕੇਂ !
\v 6 ਆਪਣੀਆਂ ਅੱਖਾਂ ਅਕਾਸ਼ ਵੱਲ ਚੁੱਕੋ, ਅਤੇ ਹੇਠਾਂ ਧਰਤੀ ਉੱਤੇ ਨਿਗਾਹ ਮਾਰੋ, , ਅਕਾਸ਼ ਤਾਂ ਧੂੰਏਂ ਵਾਂਗੂੰ ਅਲੋਪ ਹੋ ਜਾਵੇਗਾ, ਅਤੇ ਧਰਤੀ ਕੱਪੜੇ ਵਾਂਗੂੰ ਪੁਰਾਣੀ ਪੈ ਜਾਵੇਗੀ, ਉਹ ਦੇ ਵਾਸੀ ਮੱਖੀਆਂ ਵਾਂਗੂੰ ਮਰ ਜਾਣਗੇ, ਪਰ ਮੇਰੀ ਮੁਕਤੀ ਸਦੀਪਕ ਹੋਵੇਗੀ, ਅਤੇ ਮੇਰਾ ਧਰਮ ਅਨੰਤ ਹੋਵੇਗਾ ।
\v 17 ਵੇਖੋ, ,ਮੈਂ ਉਹਨਾਂ ਲਈ ਤਲਵਾਰ, ਕਾਲ ਅਤੇ ਬਵਾ ਨੂੰ ਘੱਲਾਂਗਾ, ਸੈਨਾਂ ਦਾ ਯਹੋਵਾਹ ਇਸ ਤਰ੍ਹਾਂ ਆਖਦਾ ਹੈ, ਮੈਂ ਉਹਨਾਂ ਨੂੰ ਸੜੀਆਂ ਹੋਈਆਂ ਹਜੀਰਾਂ ਵਾਂਗੂੰ ਬਣਾਵਾਂਗਾ ਜਿਹੜੀਆਂ ਖਰਾਬ ਹੋਣ ਦੇ ਕਾਰਨ ਖਾਧੀਆਂ ਨਹੀਂ ਜਾਂਦੀਆਂ

These typos must be corrected as well.
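The three comma checks can be combined in one small report. This is a hedged sketch with a made-up function name, using the same \s, and \S, patterns as above plus a pattern for doubled commas.

```python
import re

def comma_report(text):
    """Count commas with and without a preceding space, and doubled commas."""
    return {
        "space_before": len(re.findall(r"\s,", text)),
        "no_space_before": len(re.findall(r"\S,", text)),
        "doubled": len(re.findall(r",\s*,", text)),
    }

# Made-up sample with two doubled commas and one further stray space before a comma.
print(comma_report("ਵੇਖੋ, ,ਮੈਂ ਦਿੱਤਾ, , ਉਹ ,"))
```

Fixing the doubled commas first would reduce the space-before-comma count, so it makes sense to rerun the report after each round of corrections.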

Preliminary tidy up

As a preliminary tidy up, I'd like to run the following 3 filters on the 66 USFM files:

  • Remove blanks from End of Line
  • Remove any blank lines
  • Remove multiple whitespace

Before I do, @joshykurian please advise whether the last one is acceptable.
For reference, there are 2926 instances of a double space, of which:

  • 1264 are after a Devanagari Danda.
  • 1430 are after a question mark
  • 230 are after a Gurmukhi character
  • 1 is after a paragraph tag \p
  • 1 is after a right single quotation mark

btw. There are 7050 instances of lines ending with a space.
Removing these at this stage will help me to deal with lines that visibly end with various punctuation marks.
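For reference, the three filters amount to something like the following Python sketch. It is not the filter that would actually be run, only an equivalent to make the proposal concrete. The function name tidy and the collapse_spaces switch are hypothetical, the switch being there so the third filter can be left off until it is agreed.

```python
import re

def tidy(usfm_text, collapse_spaces=True):
    """Strip trailing blanks, drop blank lines, optionally collapse repeated spaces."""
    lines = []
    for line in usfm_text.splitlines():
        line = line.rstrip()          # remove blanks from end of line
        if not line:
            continue                  # remove any blank lines
        if collapse_spaces:
            line = re.sub(r"[ \t]{2,}", " ", line)  # remove multiple whitespace
        lines.append(line)
    return "\n".join(lines) + "\n" if lines else ""

# Made-up fragment with a trailing blank, a blank line and a double space.
print(tidy("\\p  \n\n\\v 1 ਹੈ ।  ਹੈ ? \n"), end="")
```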

The non-use of Gurmukhi digits?

A search for the regexp [\x{0A66}-\x{0A6F}] gave no hits, showing that no use is made of Gurmukhi digits.

i.e. All scripture references in parallel passage markers and book introductions use ordinary Western digits for chapter and verse numbers and the numerical part of some book names.

Even so, this prompts the question:

Why not use the Gurmukhi digits in the Punjabi Bible?
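Should the answer be yes, the conversion itself is mechanical. The sketch below only shows the digit mapping on a reference string. Which parts of the files should be converted is an editorial decision that it does not attempt to make, and the function name is made up.

```python
# Map Western digits 0-9 onto Gurmukhi digits U+0A66 to U+0A6F.
WESTERN_TO_GURMUKHI = {ord(str(i)): chr(0x0A66 + i) for i in range(10)}

def to_gurmukhi_digits(text):
    """Replace every Western digit in text with the matching Gurmukhi digit."""
    return text.translate(WESTERN_TO_GURMUKHI)

print(to_gurmukhi_digits("1-41"))
```

Blindly converting whole files would also change the numbers after \c and \v, which software generally expects to be Western digits, so any real run would need to target only the displayed text.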

Marking proper names in Indic scripts?

Unlike Latin scripts and several other writing systems, Indic scripts do not have any equivalent to capitalising the first letter of a word to show that it's a proper name.

For new readers of the Bible in Indic languages, this may be somewhat daunting if they have no immediate means to recognise that a particular word is a proper name.

  • USFM 2.4 already provides the marker pair \pn_...\pn* for marking proper names.
  • USFM 3.0 documents a further pair \png_...\png* to mark geographic names.

Might it be worthwhile to make use of these markers in all the Bibles maintained by FreeBiblesIndia?

This would give app developers the chance to add features to toggle the display of all proper names in [say] a different text colour (as a user option).

Unless someone makes a start in such a direction, there will be no motivation for app developers to even consider such a notion.

Multiple hyphens

A search for the regexp --+ gave 39 hits, of which 17 are triple hyphens, the rest are doubles.

Suggestion:

  • Replace triple hyphens by U+2015 HORIZONTAL BAR
  • Replace double hyphens by U+2014 EM DASH

Any objections? @joshykurian
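If the suggestion is accepted, the replacement could look like this minimal Python sketch. The function name is illustrative, and runs of four or more hyphens are deliberately left alone, since the counts above mention only doubles and triples.

```python
import re

HORIZONTAL_BAR = "\u2015"
EM_DASH = "\u2014"

def replace_hyphen_runs(text):
    """Replace '---' with U+2015 and '--' with U+2014; leave other runs as they are."""
    def swap(match):
        length = len(match.group())
        if length == 3:
            return HORIZONTAL_BAR
        if length == 2:
            return EM_DASH
        return match.group()
    # Matching whole runs at once avoids turning '---' into a dash plus a hyphen.
    return re.sub(r"-{2,}", swap, text)

print(replace_hyphen_runs("ਕ--ਖ ਗ---ਘ ਚ-ਛ"))
```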
