Genusidator 🇩🇪 🇦🇹 🇨🇭

A learning aid to explain grammatical gender assignment in German nouns

Background

One of the biggest challenges for learners of German is accurately identifying the grammatical gender of German nouns (masculine, feminine, or neuter). Native speakers absorb the gender of each noun through language acquisition, whereas learners of German as a foreign language are not naturally exposed to noun gender during the formative period of language development. Moreover, the topic is generally not taught in German schools. As a result, even native German speakers tasked with teaching German to foreigners are rarely able to teach their students how to match nouns to their gender.

Rationale

This program aims to address the above limitation by automatically generating the rules governing grammatical gender assignment. In order to accomplish this task, the system relies on a combination of semantic taxonomic relationships and word morphology. One hopes that by understanding these rules, learners will be able to more confidently use entire categories of nouns with the correct grammatical gender.

Technology

Genusidator employs the following technologies:

  • The spaCy German transformer pipeline is used for grammatical class detection and lemmatization.
  • The DeepL API is used to output a US-English translation. The translation both furnishes semantic context and is required to generate the hypernym taxonomy. Make sure to supply your own DeepL API key, which can be obtained here. Earlier versions of this program used the Google Translate API, which proved much less reliable and accurate than DeepL.
  • German Compound Noun Splitter is used to split nominal compounds and output the base noun. Note that a dictionary object is required for morphological parsing. Any dictionary with one item per line will do. The present implementation employs the Free German Dictionary by Jan Schreiber. An abridged version of this resource is included in the repo.
  • NLTK and WordNet are used to generate the hypernym taxonomy for each noun.
  • Monosyllabicity is verified with Syllables, a package that estimates the number of syllables in English words. It works well for detecting monosyllabic German words; however, the string must first be transformed to remove the Umlauts and the Eszett.
  • Foreign borrowings are detected with the langdetect package.
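
The Umlaut and Eszett transformation mentioned above could look roughly like the sketch below. It is not taken from the Genusidator source: the replacement characters (plain base vowels and "ss") are an assumption, since the README only says the characters are removed, and `count_vowel_groups` is a hypothetical stand-in so that the example runs without third-party packages.

```python
import re

# Assumed mapping: the README does not say what replaces the Umlauts and
# the Eszett, so plain base vowels and "ss" are used here as one option.
_UMLAUT_MAP = str.maketrans({
    "ä": "a", "ö": "o", "ü": "u",
    "Ä": "A", "Ö": "O", "Ü": "U",
    "ß": "ss",
})


def normalize_german(word: str) -> str:
    """Replace Umlauts and the Eszett with ASCII characters."""
    return word.translate(_UMLAUT_MAP)


def count_vowel_groups(word: str) -> int:
    """Hypothetical stand-in for a syllable estimator: count runs of vowels."""
    return len(re.findall(r"[aeiouy]+", word.lower()))


def is_monosyllabic(word: str, estimator=count_vowel_groups) -> bool:
    """Normalize the word first, then ask the estimator for a syllable count."""
    return estimator(normalize_german(word)) == 1
```

With the Syllables package installed, its `estimate` function can be passed in place of the stand-in, for example `is_monosyllabic("Tür", estimator=syllables.estimate)`.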

Evaluation

To evaluate the system, a list of 102,444 German nouns was extracted from this list. After removing duplicates, 100,064 nouns remained. All lemmas were analyzed for grammatical gender with the spaCy pipeline, which successfully identified 90,623 nouns morphologically. The identified nouns represented the following grammatical classes:

  • 32,164 were masculine
  • 36,306 were feminine
  • 22,153 were neuter
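
As a quick check on these counts, the short sketch below sums them and computes each gender's share of the identified nouns. The variable names are illustrative only. The share of the largest class is what a classifier would score by always guessing that class on this full set. The README does not state which data split its accuracy figures were computed on, so this share is not necessarily the same number as a baseline reported on a held-out split.

```python
# Counts of identified nouns per gender, as reported in the README.
counts = {"masculine": 32164, "feminine": 36306, "neuter": 22153}

total = sum(counts.values())

# Share of each gender among the identified nouns.
shares = {gender: n / total for gender, n in counts.items()}

# Most frequent class: always guessing it gives the majority-class accuracy.
majority = max(counts, key=counts.get)

for gender, share in shares.items():
    print(f"{gender}: {share:.3f}")
print(f"total: {total}, majority class: {majority}")
```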

Four feature sets were extracted (semantic, morphological, etymological, and syllabic/phonological) and employed in training a multinomial logistic regression classifier. The baseline accuracy of the model is 0.396, which reflects the imbalanced ratio between the three genders. Below are the accuracy scores for each feature set, followed by the accuracy for all the features combined:

  • Semantic features: 0.419
  • Morphological features: 0.750
  • Etymological features: 0.405
  • Syllabic/phonological features: 0.405
  • All features combined: 0.752
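
To illustrate why morphological features carry so much of the signal, here is a minimal sketch of multinomial logistic regression (softmax regression) trained on noun suffixes. It is not the project's code: the suffix list, the six-noun toy data set, and all function names are hypothetical, and it is written in plain Python so that it runs without dependencies. In practice a library such as scikit-learn would normally be used.

```python
import math

GENDERS = ["masculine", "feminine", "neuter"]
SUFFIXES = ["ung", "heit", "chen", "ling"]  # hypothetical feature set
TOY_DATA = [  # hypothetical toy examples, not the evaluation data
    ("Zeitung", "feminine"), ("Freiheit", "feminine"),
    ("Mädchen", "neuter"), ("Häuschen", "neuter"),
    ("Lehrling", "masculine"), ("Frühling", "masculine"),
]


def suffix_features(noun, suffixes=SUFFIXES):
    """Bias term plus one indicator per suffix the noun ends with."""
    return [1.0] + [1.0 if noun.lower().endswith(s) else 0.0 for s in suffixes]


def softmax(scores):
    """Turn raw class scores into probabilities (shifted for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(weights, x):
    """One linear score per class."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def train(X, y, n_classes, epochs=200, lr=0.5):
    """Fit one weight vector per class by stochastic gradient descent."""
    weights = [[0.0] * len(X[0]) for _ in range(n_classes)]
    for _ in range(epochs):
        for x, label in zip(X, y):
            probs = softmax(class_scores(weights, x))
            for k in range(n_classes):
                # Gradient of the cross-entropy loss for class k.
                grad = probs[k] - (1.0 if k == label else 0.0)
                for j, v in enumerate(x):
                    weights[k][j] -= lr * grad * v
    return weights


def predict(weights, noun):
    """Return the gender with the highest score for the noun."""
    scores = class_scores(weights, suffix_features(noun))
    return GENDERS[max(range(len(GENDERS)), key=scores.__getitem__)]


X = [suffix_features(noun) for noun, _ in TOY_DATA]
y = [GENDERS.index(gender) for _, gender in TOY_DATA]
W = train(X, y, n_classes=len(GENDERS))
accuracy = sum(predict(W, n) == g for n, g in TOY_DATA) / len(TOY_DATA)
```

Because each toy suffix maps to exactly one gender, the model separates the training nouns cleanly. Real data is far noisier, which is consistent with the reported morphological accuracy sitting well below 1.0.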

References

The project was inspired by Der, Die, Das: The Secrets of German Gender by Constantin Vayenas (2019).
