Introduction

The Oslo-Bergen Tagger is a morphosyntactic tagger for Norwegian bokmål and nynorsk. For general information about the tagger, visit its home page: http://www.tekstlab.uio.no/obt-ny/.

Installation and usage

The tagger consists of three parts:

  • A multitagger (tokenizer, morphological analyzer, and compound analyzer)
  • A Constraint Grammar (CG) tagger
  • A statistical tagger (currently only for bokmål)

Watch the installation on YouTube (on OS X Yosemite)

The multitagger

The multitagger is currently distributed only in binary form. Compiled binaries for 32-bit Linux, 64-bit Linux, and 64-bit Mac OS X can be downloaded from our server at the Text Laboratory, University of Oslo (http://www.tekstlab.uio.no/mtag/linux32/mtag32, http://www.tekstlab.uio.no/mtag/linux64/mtag, and http://www.tekstlab.uio.no/mtag/osx64/mtag-osx64, respectively). Place the file in the bin directory, rename it to mtag if necessary, and make it executable. For example:

$ cd The-Oslo-Bergen-Tagger/bin
$ wget http://www.tekstlab.uio.no/mtag/osx64/mtag-osx64
$ mv mtag-osx64 mtag
$ chmod +x mtag
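
The download step can be scripted so that the right binary is chosen for the current platform. The following is a minimal sketch, not part of the distribution: the function name mtag_url and the mapping from uname output to the three URLs listed above are assumptions.

```shell
#!/bin/sh
# Sketch: pick the multitagger download URL for a given platform.
# mtag_url is a hypothetical helper; the URLs are the ones listed above,
# and the uname-to-URL mapping is an assumption.
mtag_url() {
    os=$1      # e.g. the output of `uname -s`
    arch=$2    # e.g. the output of `uname -m`
    base=http://www.tekstlab.uio.no/mtag
    case "$os/$arch" in
        Linux/x86_64)  echo "$base/linux64/mtag" ;;
        Linux/i?86)    echo "$base/linux32/mtag32" ;;
        Darwin/x86_64) echo "$base/osx64/mtag-osx64" ;;
        *) echo "no prebuilt mtag for $os/$arch" >&2; return 1 ;;
    esac
}

# Usage (run from the root of the distribution):
#   url=$(mtag_url "$(uname -s)" "$(uname -m)") &&
#     wget -O bin/mtag "$url" && chmod +x bin/mtag
```

Platforms without a prebuilt binary make the function return a non-zero status, so the download is skipped rather than fetching the wrong file.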

The Constraint Grammar tagger

  1. Check out VISL CG-3 from the Subversion repository at the University of Southern Denmark and install it. The working copy can be placed anywhere on your machine, since the build installs into a central location such as /usr/local/bin. Installation instructions for various platforms can be found at http://beta.visl.sdu.dk/cg3/chunked/installation.html.

  2. CG rules for morphological disambiguation of bokmål and nynorsk are found in the cg folder.

    bm_morf.cg and nn_morf.cg should be used when you only want to do CG tagging of bokmål and nynorsk, respectively. The CG tagger may leave some ambiguity, either because it is not confident enough to do complete disambiguation, or because there is genuine ambiguity in the material (such as nouns that can be analyzed as either masculine or feminine).

    bm_morf-prestat.cg should be used when you want to run statistical disambiguation after CG disambiguation in order to obtain completely disambiguated output (currently only available for bokmål). Fully disambiguated output is what many language technology applications need, but the statistical step may also remove ambiguity that is genuine in the text.
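
The choice between the three grammar files can be summarized as a small lookup. This is only a sketch of the rule described above: the function name grammar_for and its two arguments are hypothetical and not part of the tagger.

```shell
#!/bin/sh
# Sketch: map language and mode to the grammar file in the cg folder.
# grammar_for and its argument values are hypothetical names for illustration.
grammar_for() {
    lang=$1    # bm (bokmål) or nn (nynorsk)
    mode=$2    # cg-only, or prestat when statistical disambiguation follows
    case "$lang/$mode" in
        bm/cg-only) echo cg/bm_morf.cg ;;
        nn/cg-only) echo cg/nn_morf.cg ;;
        bm/prestat) echo cg/bm_morf-prestat.cg ;;
        *) echo "unsupported combination: $lang/$mode" >&2; return 1 ;;
    esac
}
```

The resulting path can then be handed to vislcg3 through its --grammar option, with multitagger output on standard input. The bundled shell scripts may pass further options that are not shown here.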

The statistical tagger

Clone the OBT-Stat git repository from GitHub into the root folder of the distribution (GitHub no longer serves the unauthenticated git:// protocol, so use the HTTPS URL):

$ git clone https://github.com/andrely/OBT-Stat.git
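
Once all three parts are in place, a quick check can confirm that nothing is missing before the tagger is run. The sketch below is an assumption about what such a check could look like: check_install is a hypothetical helper, and it assumes that CG-3 installed its vislcg3 binary on the PATH and that OBT-Stat was cloned under its default directory name.

```shell
#!/bin/sh
# Sketch: report which of the three parts are missing.
# check_install is a hypothetical helper, not part of the distribution.
check_install() {
    root=$1      # root folder of the distribution
    missing=0
    # The multitagger binary must exist and be executable.
    [ -x "$root/bin/mtag" ] || { echo "missing: executable bin/mtag"; missing=1; }
    # The CG tagger is assumed to be installed centrally as vislcg3.
    command -v vislcg3 >/dev/null 2>&1 || { echo "missing: vislcg3 on PATH"; missing=1; }
    # The statistical tagger is only needed for fully disambiguated output.
    [ -d "$root/OBT-Stat" ] || { echo "missing: OBT-Stat directory"; missing=1; }
    return "$missing"
}

# Usage (from the root of the distribution):
#   check_install . && echo "all parts found"
```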

Running the tagger

The included shell scripts run the entire process: multitagging, CG disambiguation, and, optionally, statistical disambiguation (bokmål only).

CG and statistical disambiguation, bokmål:

$ ./tag-bm.sh TEXTFILE > DISAMBIGUATED_OUTPUT_FILE

CG disambiguation only, bokmål:

$ ./tag-nostat-bm.sh TEXTFILE > DISAMBIGUATED_OUTPUT_FILE

CG disambiguation only, nynorsk:

$ ./tag-nostat-nn.sh TEXTFILE > DISAMBIGUATED_OUTPUT_FILE
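
Because each script takes a single text file and writes to standard output, tagging a whole directory only needs a loop. The following is a minimal sketch under stated assumptions: tag_all is a hypothetical helper, and the .txt input suffix and .tagged output suffix are chosen for illustration only.

```shell
#!/bin/sh
# Sketch: run one of the tagging scripts over every .txt file in a directory.
# tag_all and the .txt/.tagged suffixes are illustrative assumptions.
tag_all() {
    script=$1    # e.g. ./tag-bm.sh
    indir=$2     # directory containing the input text files
    outdir=$3    # directory for the disambiguated output
    mkdir -p "$outdir" || return 1
    for f in "$indir"/*.txt; do
        [ -e "$f" ] || continue    # the glob matched nothing
        name=$(basename "$f" .txt)
        "$script" "$f" > "$outdir/$name.tagged" || {
            echo "tagging failed: $f" >&2
            return 1
        }
    done
}

# Usage (from the root of the distribution):
#   tag_all ./tag-nostat-nn.sh texts tagged
```

The loop stops at the first file that fails, so a broken installation is noticed early instead of producing a directory of empty output files.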

Third-party software that uses the tagger

Clojure library by Aleksander Skjæveland Larsen

Contributors

kmelve, ljos, noklesta