
cytag's People

Contributors

dravidian-codemix, linguacelta, steveneale, vigneshwaranm


cytag's Issues

some rules in def handle_empty_lookup(token) not firing

A couple of rules here are not firing. The problem seems to be the use of == rather than in: token[0][-1:] is a string, and comparing a string to a list with == is always False. If you change this:

	elif token[0][-1:] == ["a", "â", "e", "ê", "i", "î", "o", "ô", "u", "û", "w", "ŵ", "y", "ŷ"]:
		readings = lookup_multiple_readings(["{}f".format(token[0])])
		reading_string += format_multireading_lookup(readings, token[0], token[1])
		if len(readings) > 0:
			count_readings = True
	elif token[0][-1:] == ["b", "c", "d", "f", "g", "h", "j", "l", "m", "n", "p", "r", "s", "t"] or token[0][-2:] == ["ch", "dd", "ff", "ng", "ll", "ph", "rh", "th"]:
		readings = lookup_multiple_readings(["{}r".format(token[0]), "{}l".format(token[0])])
		reading_string += format_multireading_lookup(readings, token[0], token[1])
		if len(readings) > 0:
			count_readings = True

to this

	elif token[0][-1:] in ["a", "â", "e", "ê", "i", "î", "o", "ô", "u", "û", "w", "ŵ", "y", "ŷ"]:
		readings = lookup_multiple_readings(["{}f".format(token[0])])
		reading_string += format_multireading_lookup(readings, token[0], token[1])
		if len(readings) > 0:
			count_readings = True
	elif token[0][-1:] in ["b", "c", "d", "f", "g", "h", "j", "l", "m", "n", "p", "r", "s", "t"] or token[0][-2:] in ["ch", "dd", "ff", "ng", "ll", "ph", "rh", "th"]:
		readings = lookup_multiple_readings(["{}r".format(token[0]), "{}l".format(token[0])])
		reading_string += format_multireading_lookup(readings, token[0], token[1])
		if len(readings) > 0:
			count_readings = True

it should work.
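For anyone wondering why the original comparisons silently fail rather than raising an error, here is a minimal demonstration (assuming token[0] is a str, as in the rules above):

```python
# Slicing a string yields a string, and a str never compares equal to a
# list with ==, so the original conditions are always False.
# `in` tests membership, which is what the rules intend.
word = "gorau"

print(word[-1:] == ["a", "e", "u"])  # False: str vs list is never equal
print(word[-1:] in ["a", "e", "u"])  # True: "u" is a member of the list
```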

I've also tried adding a rule to look for an -au ending where an unknown word ends with -e or -a (both are extremely common colloquially - the former is typically southern, the latter typically northern, e.g. "gorau" becomes "gore" or "gora"). It works well on my data - it might be worth adding to your postagger? (It has to go before the first rule mentioned above, of course, or it will be superseded by that rule and never fire.)

	elif token[0][-1:] in ["e", "a"]:
		readings = lookup_multiple_readings(["{}au".format(token[0][0:-1])])
		reading_string += format_multireading_lookup(readings, token[0], token[1])
		if len(readings) > 0:
			count_readings = True

Some interjections/fillers not in lexicon

Some of the fillers and interjections used as standard in the spoken transcriptions are not included in the lexicon, and are therefore tagged as unknown.

e.g. "yyy" (equivalent of English "uuh"):

yyy 50,4 yyy unk unk

No lemma provided for some masc. pl. nouns

A few masc. pl. nouns are not tagged with a lemma - these are the four present in my data (possible that others are affected, of course):

83333 newyddion 9897,7 E Egll
84691 ffermwyr 10057,7 E Egll
88274 gwelliannau 10479,5 E Egll
1880923 graddedigion 229789,21 E Egll

Word recognition affected by capitalization

Proper nouns which are uncapitalized are not recognized:

517 cymraeg 70,23 cymraeg unk unk
508 loegr 70,14 loegr unk unk

In addition, many words in all-caps or with initial caps are wrongly tagged as proper nouns:

217 AAAAAAAAAH 32,1 AAAAAAAAAH E Ep
219 CUSANAU 33,1 CUSANAU E Ep
2502 Good 311,2 Good E Ep
2503 Luck 311,3 Luck E Ep
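A possible workaround (a hedged sketch only - lookup_readings below is a hypothetical stand-in for CyTag's lexicon lookup) would be to retry the lookup with adjusted casing before falling back to the proper-noun assumption:

```python
# Sketch: case-aware fallback lookup. `lookup_readings` is a hypothetical
# callable standing in for whatever CyTag uses to query the lexicon.
def lookup_with_case_fallback(token, lookup_readings):
	readings = lookup_readings(token)
	if not readings and token != token.lower():
		# "CUSANAU" / "Good" -> try "cusanau" / "good" before tagging Ep
		readings = lookup_readings(token.lower())
	if not readings and token != token.capitalize():
		# "cymraeg" / "loegr" -> try "Cymraeg" / "Loegr" to catch
		# uncapitalized proper nouns
		readings = lookup_readings(token.capitalize())
	return readings
```

Only tokens that still have no readings after both retries would then be candidates for the proper-noun guess.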

Support for emoticons?

I wonder whether it would be practical/possible to support common multi-character emoticons, such as

:)
:-)
;)
<3

These are pretty common in computer-mediated communication, and being able to recognise them within a text (and disambiguate them from conventional use of the same marks as numerals/punctuation) could add significant value for certain types of analysis.
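As a rough sketch of how this might work (the emoticon list below is illustrative, not exhaustive), emoticons could be matched as single units before ordinary punctuation splitting, with longer patterns tried first so ":-)" wins over ":)":

```python
import re

# Illustrative emoticon inventory; a real list would be much longer.
EMOTICONS = [":-)", ":)", ";-)", ";)", ":-(", ":(", "<3"]

# Sort longest-first so multi-character variants match before their prefixes.
EMOTICON_RE = re.compile(
	"|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
)

def protect_emoticons(text):
	"""Return (start, end, emoticon) spans so later tokenisation can skip them."""
	return [(m.start(), m.end(), m.group()) for m in EMOTICON_RE.finditer(text)]

print(protect_emoticons("diolch :-) <3"))  # [(7, 10, ':-)'), (11, 13, '<3')]
```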

Some conjugated forms of prepositions missing

The full conjugation pattern is not present in the lexicon for some prepositions.

e.g. the prep. "gyda" is missing in forms such as "gydaf" (= "with me") and "gydan" (= "with" before the pronoun "nhw").

Some colloquial forms are also not included, despite being very common in informal writing - e.g. "trwy" -> "trwoch" (= "through you"; the standard form is "trwyddoch").

Mutations not checked on proper nouns

Only the forms of words as given are checked against the gazetteers, yet it is fairly common for proper nouns to undergo mutation. So, currently, "Cwmafan" is correctly tagged E:Ep, but "Nghwmafan" and "Gwmafan" are tagged "unk".
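One possible approach (a sketch only - the table below covers the common initial mutations, though the soft mutation of g-, which simply deletes it, is not handled) would be to generate candidate unmutated forms and check each against the gazetteers:

```python
# Sketch: map a mutated initial onto its possible radical (unmutated)
# initials, covering soft, nasal and aspirate mutation.
DEMUTATE = {
	"g": ["c", "g"],   # soft: c -> g (keep g itself as a candidate too)
	"b": ["p", "b"],   # soft: p -> b
	"d": ["t", "d"],   # soft: t -> d
	"f": ["b", "m"],   # soft: b -> f, m -> f
	"dd": ["d"],       # soft: d -> dd
	"l": ["ll"],       # soft: ll -> l
	"r": ["rh"],       # soft: rh -> r
	"ngh": ["c"],      # nasal: c -> ngh
	"mh": ["p"],       # nasal: p -> mh
	"nh": ["t"],       # nasal: t -> nh
	"ng": ["g"],       # nasal: g -> ng
	"m": ["b"],        # nasal: b -> m
	"n": ["d"],        # nasal: d -> n
	"ch": ["c"],       # aspirate: c -> ch
	"ph": ["p"],       # aspirate: p -> ph
	"th": ["t"],       # aspirate: t -> th
}

def demutated_candidates(word):
	"""Return the word plus possible unmutated base forms to try in lookup."""
	low = word.lower()
	candidates = {word}
	# Try longest prefixes first, so "ngh" beats "ng", which beats "n".
	for plen in (3, 2, 1):
		prefix = low[:plen]
		if prefix in DEMUTATE:
			for radical in DEMUTATE[prefix]:
				candidates.add((radical + low[plen:]).capitalize())
			break
	return candidates
```

So "Gwmafan" and "Nghwmafan" would both yield "Cwmafan" as a candidate to check against the gazetteer.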

Common colloquial spellings are not recognized

It is extremely common in informal writing for "au" and "ai" to be written as "a" or "e". There is currently no support for this in CyTag, so common forms like "pethe" for "pethau" ("things") are being tagged "unknown".
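A minimal sketch of candidate generation for this (both the -au and -ai restorations are tried, since a final -a or -e can stand for either ending):

```python
# Sketch: given an unknown word with a colloquial -e / -a ending, produce
# standard-spelling candidates to look up before giving up on the token.
def colloquial_candidates(word):
	if word[-1:] in ("e", "a"):
		stem = word[:-1]
		return [stem + "au", stem + "ai"]
	return []

print(colloquial_candidates("pethe"))  # ['pethau', 'pethai']
```

Only candidates that actually occur in the lexicon would be kept, so the spurious alternatives cost nothing.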

Words without accents are misidentified

It's fairly common for accents on vowels to be missed off, or placed where they don't belong, especially in informal texts. The tagger is currently strict about this, because the lexicon only includes words with the standard use of accents.

E.g. "swn y mor" is almost certain to mean "the sound of the sea", but the standard spelling would be "sŵn y môr". The tagger therefore understands "swn" to be a form of the verb "to be", and "mor" to be an adverb, meaning "so":

59 swn 6,2 bod B Bdibdyf1u
60 y 6,3 y YFB YFB
61 mor 6,4 mor Adf Adf
62 . 6,5 . Atd Atdt

The optimal tagging would be:

59 swn 6,2 sŵn E Egu
60 y 6,3 y YFB YFB
61 mor 6,4 môr E Egu
62 . 6,5 . Atd Atdt
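One way to handle this (a sketch, assuming the lexicon entries can be iterated at load time) would be to index them under accent-stripped keys and use that index as a last-resort lookup:

```python
import unicodedata

def strip_accents(word):
	"""Remove combining diacritics: 'sŵn' -> 'swn', 'môr' -> 'mor'."""
	decomposed = unicodedata.normalize("NFD", word)
	return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def build_accentless_index(lexicon_entries):
	"""Map accent-stripped spellings back to the standard lexicon entries."""
	index = {}
	for entry in lexicon_entries:
		index.setdefault(strip_accents(entry), []).append(entry)
	return index

index = build_accentless_index(["sŵn", "môr", "tŷ"])
print(index.get("swn"))  # ['sŵn']
```

An unknown "swn" could then fall back to the readings of "sŵn", letting the CG rules disambiguate it against the verbal reading in context.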

Non-Welsh words not recognized as such

There's a POS tag for non-Welsh words (Gwest), but it seems not to be used by CyTag:

2572 amazing 318,16 amazing unk unk
2709 sober 335,9 sober unk unk
443129 including 54540,33 including unk unk

Is there some way this could be implemented? It would be useful to separate English words (in particular) in order to look at code-switching in the data. It would also allow a separation of English words from non-standard/incorrect spelling of Welsh words.
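A rough sketch of how this could be implemented (ENGLISH_WORDS is a placeholder - a real implementation would load a wordlist or a spell-checker dictionary - and pairing the basic tag "Gw" with Gwest is an assumption about the tagset):

```python
# Placeholder English vocabulary; stands in for a full wordlist.
ENGLISH_WORDS = {"amazing", "sober", "including"}

def fallback_tag(token, welsh_readings):
	"""If the Welsh lexicon produced nothing, try tagging as a foreign word."""
	if welsh_readings:
		return welsh_readings
	if token.lower() in ENGLISH_WORDS:
		# "Gw Gwest" assumes Gwest sits under a basic Gw category.
		return ["{} Gw Gwest".format(token)]
	return ["{} unk unk".format(token)]

print(fallback_tag("amazing", []))  # ['amazing Gw Gwest']
```

This would also split genuinely English tokens off from the remaining "unk" pool, which would then be dominated by non-standard Welsh spellings.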

Sequence [A-Z][a-zA-Z]*!"[A-Za-z]+ in input file causes error

When input text file contains a sequence matching [A-Z][a-zA-Z]*!"[A-Za-z]+, CyTag fails with an error.

For clarity: this occurs when the following example sequences are present in the input file:

Afal!"a
A!"araith
Afal!"Arth

But it is not triggered when the following example sequences are present:

Afal! "a
A!" Araith
arth!"Aros

Command-line dump follows:

python3 CyTag.py -i TEST.txt -n output -f all

cy_postagger - A part-of-speech (POS) tagger for Welsh texts
------------------------------------------------------------

Producing readings...

From 1 tokens:
--- 1 tokens were given readings
------ 0 tokens only have a single reading pre-CG
--------- 0 of which were definite tags (punctuation, symbols etc.)
------ 0 tokens have multiple readings pre-CG
------ 1 tokens have no readings pre-CG
------ 1 tokens without readings were assumed to be proper nouns
--- 0 tokens are still without readings (marked as 'unknown')

Running VISL CG-3 over 1 tokens...

Mapping CG output tokens to CyTag output formats |################################| 1/1
Traceback (most recent call last):
  File "CyTag.py", line 109, in <module>
    process(arguments.input, output_name=arguments.name, directory=arguments.dir, component=arguments.component, output_format=arguments.format)
  File "CyTag.py", line 69, in process
    output = pos_tagger(input_text, output_name, directory, output_format)
  File "/Users/btt/Documents/PhD/CyTag/CyTag/src/cy_postagger.py", line 834, in pos_tagger
    cytag_output = map_cg(cg_output.strip(), mapping_bar)
  File "/Users/btt/Documents/PhD/CyTag/CyTag/src/cy_postagger.py", line 710, in map_cg
    mapped_output += process_cg_token(cg_readingcount, cg_readings[cg_readingcount-1], get_token_position(cg_readings[cg_readingcount-1][1]))
  File "/Users/btt/Documents/PhD/CyTag/CyTag/src/cy_postagger.py", line 666, in process_cg_token
    processed_token = process_double_reading(token_id, token, readings)
  File "/Users/btt/Documents/PhD/CyTag/CyTag/src/cy_postagger.py", line 571, in process_double_reading
    processed_readings = list_readings(readings)
  File "/Users/btt/Documents/PhD/CyTag/CyTag/src/cy_postagger.py", line 520, in list_readings
    if info[-2] == "+":
IndexError: list index out of range

Punctuation not always correctly interpreted if not followed by whitespace

If the end of a sentence, clause, etc. is marked with a full stop, comma, colon, or semicolon, and the author begins the next word immediately (without a whitespace character), CyTag does not split the words correctly.

Heb redeg heibio fan na ers blynydde.Bwys yr Undeb?

This text does not split before "Bwys" - you end up with an unknown word "blynydde.Bwys".

The same seems to happen with , : ; characters.

Obviously there could be issues here with text that contains hyperlinks, for example - splitting after a . could be problematic there. Perhaps there needs to be some way to recognise hyperlinks as such, and treat all other . followed by [a-zA-Z] as a word delimiter?
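A sketch of that idea (the URL pattern below is deliberately crude and only illustrative): protect anything that looks like a hyperlink, then insert a space after any remaining . , : ; that is followed directly by a letter.

```python
import re

# Crude, illustrative hyperlink pattern; real URL detection is harder.
URL_RE = re.compile(r"https?://\S+|www\.\S+")

# A . , : or ; immediately followed by a letter, outside any URL.
SPLIT_RE = re.compile(r"([.,:;])(?=[A-Za-z])")

def split_stuck_punctuation(text):
	parts = []
	last = 0
	for m in URL_RE.finditer(text):
		# Split only in the stretches between URLs...
		parts.append(SPLIT_RE.sub(r"\1 ", text[last:m.start()]))
		# ...and copy each URL through untouched.
		parts.append(m.group())
		last = m.end()
	parts.append(SPLIT_RE.sub(r"\1 ", text[last:]))
	return "".join(parts)

print(split_stuck_punctuation("blynydde.Bwys yr Undeb?"))
# blynydde. Bwys yr Undeb?
```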
