
cytag's People

Contributors

dravidian-codemix, linguacelta, steveneale, vigneshwaranm


cytag's Issues

some rules in def handle_empty_lookup(token) not firing

A couple of rules here are not firing. The problem seems to be the use of == rather than in: token[0][-1:] is a string, and comparing a string to a list with == is always False. If you change this:

	elif token[0][-1:] == ["a", "â", "e", "ê", "i", "î", "o", "ô", "u", "û", "w", "ŵ", "y", "ŷ"]:
		readings = lookup_multiple_readings(["{}f".format(token[0])])
		reading_string += format_multireading_lookup(readings, token[0], token[1])
		if len(readings) > 0:
			count_readings = True
	elif token[0][-1:] == ["b", "c", "d", "f", "g", "h", "j", "l", "m", "n", "p", "r", "s", "t"] or token[0][-2:] == ["ch", "dd", "ff", "ng", "ll", "ph", "rh", "th"]:
		readings = lookup_multiple_readings(["{}r".format(token[0]), "{}l".format(token[0])])
		reading_string += format_multireading_lookup(readings, token[0], token[1])
		if len(readings) > 0:
			count_readings = True

to this

	elif token[0][-1:] in ["a", "â", "e", "ê", "i", "î", "o", "ô", "u", "û", "w", "ŵ", "y", "ŷ"]:
		readings = lookup_multiple_readings(["{}f".format(token[0])])
		reading_string += format_multireading_lookup(readings, token[0], token[1])
		if len(readings) > 0:
			count_readings = True
	elif token[0][-1:] in ["b", "c", "d", "f", "g", "h", "j", "l", "m", "n", "p", "r", "s", "t"] or token[0][-2:] in ["ch", "dd", "ff", "ng", "ll", "ph", "rh", "th"]:
		readings = lookup_multiple_readings(["{}r".format(token[0]), "{}l".format(token[0])])
		reading_string += format_multireading_lookup(readings, token[0], token[1])
		if len(readings) > 0:
			count_readings = True

it should work.
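For anyone wondering why the original comparisons silently fail rather than raising an error, here is a minimal demonstration (assuming token[0] is a str, as in the rules above):

```python
# Slicing a string yields a string, and a str never compares equal to a
# list with ==, so the original conditions are always False.
# `in` tests membership, which is what the rules intend.
word = "gorau"

print(word[-1:] == ["a", "e", "u"])  # False: str vs list is never equal
print(word[-1:] in ["a", "e", "u"])  # True: "u" is a member of the list
```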

I've also tried adding a rule to look for an -au ending where an unknown word ends with -e or -a (both are extremely common colloquially - the former is typically southern, the latter typically northern, e.g. "gorau" becomes "gore" or "gora"). It works well on my data - it might be worth adding to your postagger? (It has to go before the first rule mentioned above, of course, or it will be superseded by that rule and never fire.)

	elif token[0][-1:] in ["e", "a"]:
		readings = lookup_multiple_readings(["{}au".format(token[0][0:-1])])
		reading_string += format_multireading_lookup(readings, token[0], token[1])
		if len(readings) > 0:
			count_readings = True

Some interjections/fillers not in lexicon

Some of the fillers and interjections used as standard in the spoken transcriptions are not included in the lexicon, and are therefore tagged as unknown.

e.g. "yyy" (equivalent of English "uuh"):

yyy 50,4 yyy unk unk

No lemma provided for some masc. pl. nouns

A few masc. pl. nouns are not tagged with a lemma - these are the four present in my data (possible that others are affected, of course):

83333 newyddion 9897,7 E Egll
84691 ffermwyr 10057,7 E Egll
88274 gwelliannau 10479,5 E Egll
1880923 graddedigion 229789,21 E Egll

Word recognition affected by capitalization

Proper nouns which are uncapitalized are not recognized:

517 cymraeg 70,23 cymraeg unk unk
508 loegr 70,14 loegr unk unk

In addition, many words in all-caps or with initial caps are wrongly tagged as proper nouns:

217 AAAAAAAAAH 32,1 AAAAAAAAAH E Ep
219 CUSANAU 33,1 CUSANAU E Ep
2502 Good 311,2 Good E Ep
2503 Luck 311,3 Luck E Ep
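A possible workaround (a hedged sketch only - lookup_readings below is a hypothetical stand-in for CyTag's lexicon lookup) would be to retry the lookup with adjusted casing before falling back to the proper-noun assumption:

```python
# Sketch: case-aware fallback lookup. `lookup_readings` is a hypothetical
# callable standing in for whatever CyTag uses to query the lexicon.
def lookup_with_case_fallback(token, lookup_readings):
	readings = lookup_readings(token)
	if not readings and token != token.lower():
		# "CUSANAU" / "Good" -> try "cusanau" / "good" before tagging Ep
		readings = lookup_readings(token.lower())
	if not readings and token != token.capitalize():
		# "cymraeg" / "loegr" -> try "Cymraeg" / "Loegr" to catch
		# uncapitalized proper nouns
		readings = lookup_readings(token.capitalize())
	return readings
```

Only tokens that still have no readings after both retries would then be candidates for the proper-noun guess.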

Support for emoticons?

I wonder whether it would be practical/possible to support common multi-character emoticons, such as

:)
:-)
;)
<3

These are pretty common in computer-mediated communication, and being able to recognise them within a text (and disambiguate them from conventional use of the same marks as numerals/punctuation) could add significant value for certain types of analysis.
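As a rough sketch of how this might work (the emoticon list below is illustrative, not exhaustive), emoticons could be matched as single units before ordinary punctuation splitting, with longer patterns tried first so ":-)" wins over ":)":

```python
import re

# Illustrative emoticon inventory; a real list would be much longer.
EMOTICONS = [":-)", ":)", ";-)", ";)", ":-(", ":(", "<3"]

# Sort longest-first so multi-character variants match before their prefixes.
EMOTICON_RE = re.compile(
	"|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
)

def protect_emoticons(text):
	"""Return (start, end, emoticon) spans so later tokenisation can skip them."""
	return [(m.start(), m.end(), m.group()) for m in EMOTICON_RE.finditer(text)]

print(protect_emoticons("diolch :-) <3"))  # [(7, 10, ':-)'), (11, 13, '<3')]
```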

Some conjugated forms of prepositions missing

The full conjugation pattern is not present in the lexicon for some prepositions.

e.g. the prep. "gyda" is missing in forms such as "gydaf" (= "with me") and "gydan" (= "with" before the pronoun "nhw").

Some colloquial forms are also not included, despite being very common in informal writing - e.g. "trwy" -> "trwoch" (= "through you"; the standard form is "trwyddoch").

Mutations not checked on proper nouns

Only the forms of words as given are checked against the gazetteers, yet it is fairly common for proper nouns to undergo mutation. So, currently, "Cwmafan" is correctly tagged E:Ep, but "Nghwmafan" and "Gwmafan" are tagged "unk".
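One possible approach (a sketch only - the table below covers the common initial mutations, though the soft mutation of g-, which simply deletes it, is not handled) would be to generate candidate unmutated forms and check each against the gazetteers:

```python
# Sketch: map a mutated initial onto its possible radical (unmutated)
# initials, covering soft, nasal and aspirate mutation.
DEMUTATE = {
	"g": ["c", "g"],   # soft: c -> g (keep g itself as a candidate too)
	"b": ["p", "b"],   # soft: p -> b
	"d": ["t", "d"],   # soft: t -> d
	"f": ["b", "m"],   # soft: b -> f, m -> f
	"dd": ["d"],       # soft: d -> dd
	"l": ["ll"],       # soft: ll -> l
	"r": ["rh"],       # soft: rh -> r
	"ngh": ["c"],      # nasal: c -> ngh
	"mh": ["p"],       # nasal: p -> mh
	"nh": ["t"],       # nasal: t -> nh
	"ng": ["g"],       # nasal: g -> ng
	"m": ["b"],        # nasal: b -> m
	"n": ["d"],        # nasal: d -> n
	"ch": ["c"],       # aspirate: c -> ch
	"ph": ["p"],       # aspirate: p -> ph
	"th": ["t"],       # aspirate: t -> th
}

def demutated_candidates(word):
	"""Return the word plus possible unmutated base forms to try in lookup."""
	low = word.lower()
	candidates = {word}
	# Try longest prefixes first, so "ngh" beats "ng", which beats "n".
	for plen in (3, 2, 1):
		prefix = low[:plen]
		if prefix in DEMUTATE:
			for radical in DEMUTATE[prefix]:
				candidates.add((radical + low[plen:]).capitalize())
			break
	return candidates
```

So "Gwmafan" and "Nghwmafan" would both yield "Cwmafan" as a candidate to check against the gazetteer.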

Common colloquial spellings are not recognized

It is extremely common in informal writing for "au" and "ai" to be written as "a" or "e". There is currently no support for this in CyTag, so common forms like "pethe" for "pethau" ("things") are being tagged "unknown".
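A minimal sketch of candidate generation for this (both the -au and -ai restorations are tried, since a final -a or -e can stand for either ending):

```python
# Sketch: given an unknown word with a colloquial -e / -a ending, produce
# standard-spelling candidates to look up before giving up on the token.
def colloquial_candidates(word):
	if word[-1:] in ("e", "a"):
		stem = word[:-1]
		return [stem + "au", stem + "ai"]
	return []

print(colloquial_candidates("pethe"))  # ['pethau', 'pethai']
```

Only candidates that actually occur in the lexicon would be kept, so the spurious alternatives cost nothing.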

Words without accents are misidentified

It's fairly common for accents on vowels to be missed off, or placed where they don't belong, especially in informal texts. The tagger is currently strict about this, because the lexicon only includes words with the standard use of accents.

E.g. "swn y mor" is almost certain to mean "the sound of the sea", but the standard spelling would be "sŵn y môr". The tagger therefore understands "swn" to be a form of the verb "to be", and "mor" to be an adverb, meaning "so":

59 swn 6,2 bod B Bdibdyf1u
60 y 6,3 y YFB YFB
61 mor 6,4 mor Adf Adf
62 . 6,5 . Atd Atdt

The optimal tagging would be:

59 swn 6,2 sŵn E Egu
60 y 6,3 y YFB YFB
61 mor 6,4 môr E Egu
62 . 6,5 . Atd Atdt
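One way to handle this (a sketch, assuming the lexicon entries can be iterated at load time) would be to index them under accent-stripped keys and use that index as a last-resort lookup:

```python
import unicodedata

def strip_accents(word):
	"""Remove combining diacritics: 'sŵn' -> 'swn', 'môr' -> 'mor'."""
	decomposed = unicodedata.normalize("NFD", word)
	return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def build_accentless_index(lexicon_entries):
	"""Map accent-stripped spellings back to the standard lexicon entries."""
	index = {}
	for entry in lexicon_entries:
		index.setdefault(strip_accents(entry), []).append(entry)
	return index

index = build_accentless_index(["sŵn", "môr", "tŷ"])
print(index.get("swn"))  # ['sŵn']
```

An unknown "swn" could then fall back to the readings of "sŵn", letting the CG rules disambiguate it against the verbal reading in context.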

Non-Welsh words not recognized as such

There's a POS tag for non-Welsh words (Gwest), but it seems not to be used by CyTag:

2572 amazing 318,16 amazing unk unk
2709 sober 335,9 sober unk unk
443129 including 54540,33 including unk unk

Is there some way this could be implemented? It would be useful to separate English words (in particular) in order to look at code-switching in the data. It would also allow a separation of English words from non-standard/incorrect spelling of Welsh words.
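A rough sketch of how this could be implemented (ENGLISH_WORDS is a placeholder - a real implementation would load a wordlist or a spell-checker dictionary - and pairing the basic tag "Gw" with Gwest is an assumption about the tagset):

```python
# Placeholder English vocabulary; stands in for a full wordlist.
ENGLISH_WORDS = {"amazing", "sober", "including"}

def fallback_tag(token, welsh_readings):
	"""If the Welsh lexicon produced nothing, try tagging as a foreign word."""
	if welsh_readings:
		return welsh_readings
	if token.lower() in ENGLISH_WORDS:
		# "Gw Gwest" assumes Gwest sits under a basic Gw category.
		return ["{} Gw Gwest".format(token)]
	return ["{} unk unk".format(token)]

print(fallback_tag("amazing", []))  # ['amazing Gw Gwest']
```

This would also split genuinely English tokens off from the remaining "unk" pool, which would then be dominated by non-standard Welsh spellings.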

Sequence [A-Z][a-zA-Z]*!"[A-Za-z]+ in input file causes error

When input text file contains a sequence matching [A-Z][a-zA-Z]*!"[A-Za-z]+, CyTag fails with an error.

For clarity: this occurs when the following example sequences are present in the input file:

Afal!"a
A!"araith
Afal!"Arth

But it is not triggered when the following example sequences are present:

Afal! "a
A!" Araith
arth!"Aros

Command-line dump follows:

python3 CyTag.py -i TEST.txt -n output -f all

cy_postagger - A part-of-speech (POS) tagger for Welsh texts
------------------------------------------------------------

Producing readings...

From 1 tokens:
--- 1 tokens were given readings
------ 0 tokens only have a single reading pre-CG
--------- 0 of which were definite tags (punctuation, symbols etc.)
------ 0 tokens have multiple readings pre-CG
------ 1 tokens have no readings pre-CG
------ 1 tokens without readings were assumed to be proper nouns
--- 0 tokens are still without readings (marked as 'unknown')

Running VISL CG-3 over 1 tokens...

Mapping CG output tokens to CyTag output formats |################################| 1/1
Traceback (most recent call last):
  File "CyTag.py", line 109, in <module>
    process(arguments.input, output_name=arguments.name, directory=arguments.dir, component=arguments.component, output_format=arguments.format)
  File "CyTag.py", line 69, in process
    output = pos_tagger(input_text, output_name, directory, output_format)
  File "/Users/btt/Documents/PhD/CyTag/CyTag/src/cy_postagger.py", line 834, in pos_tagger
    cytag_output = map_cg(cg_output.strip(), mapping_bar)
  File "/Users/btt/Documents/PhD/CyTag/CyTag/src/cy_postagger.py", line 710, in map_cg
    mapped_output += process_cg_token(cg_readingcount, cg_readings[cg_readingcount-1], get_token_position(cg_readings[cg_readingcount-1][1]))
  File "/Users/btt/Documents/PhD/CyTag/CyTag/src/cy_postagger.py", line 666, in process_cg_token
    processed_token = process_double_reading(token_id, token, readings)
  File "/Users/btt/Documents/PhD/CyTag/CyTag/src/cy_postagger.py", line 571, in process_double_reading
    processed_readings = list_readings(readings)
  File "/Users/btt/Documents/PhD/CyTag/CyTag/src/cy_postagger.py", line 520, in list_readings
    if info[-2] == "+":
IndexError: list index out of range

Punctuation not always correctly interpreted if not followed by whitespace

If the end of a sentence, clause, etc. is marked with a full stop, comma, colon, or semicolon, and the author begins the next word immediately (without a whitespace character), CyTag does not split the words correctly.

Heb redeg heibio fan na ers blynydde.Bwys yr Undeb?

This text does not split before "Bwys" - you end up with an unknown word "blynydde.Bwys".

The same seems to happen with , : ; characters.

Obviously there could be issues here with text that contains hyperlinks, for example - splitting after a . could be problematic there. Perhaps there needs to be some way to recognise hyperlinks as such, and treat all other . followed by [a-zA-Z] as a word delimiter?
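A sketch of that idea (the URL pattern below is deliberately crude and only illustrative): protect anything that looks like a hyperlink, then insert a space after any remaining . , : ; that is followed directly by a letter.

```python
import re

# Crude, illustrative hyperlink pattern; real URL detection is harder.
URL_RE = re.compile(r"https?://\S+|www\.\S+")

# A . , : or ; immediately followed by a letter, outside any URL.
SPLIT_RE = re.compile(r"([.,:;])(?=[A-Za-z])")

def split_stuck_punctuation(text):
	parts = []
	last = 0
	for m in URL_RE.finditer(text):
		# Split only in the stretches between URLs...
		parts.append(SPLIT_RE.sub(r"\1 ", text[last:m.start()]))
		# ...and copy each URL through untouched.
		parts.append(m.group())
		last = m.end()
	parts.append(SPLIT_RE.sub(r"\1 ", text[last:]))
	return "".join(parts)

print(split_stuck_punctuation("blynydde.Bwys yr Undeb?"))
# blynydde. Bwys yr Undeb?
```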
