Rakkyo

What is Rakkyo

This is the code used for the experiments in the NAACL 2019 paper: Shrinking Japanese Morphological Analyzers With Neural Networks and Semi-supervised Learning. Arseny Tolmachev, Daisuke Kawahara and Sadao Kurohashi. (pdf, bibtex)

Structure

This is a mixed Scala/Spark and Python/TensorFlow 1.x project.

Warning: the code is of research quality and pretty unstructured. Beware of dragons, dirty hacks and spaghetti.

You need to use Python 3.6+. Dependencies are TensorFlow 1.10+ (but not the 2.0 series), pyhocon, and matplotlib.

For compiling the preprocessing code you need JDK 1.8 and sbt 1.0+. For running it you need Spark 2.3.1 (standalone mode without Hadoop is OK).

Compiling preprocessing code

$ sbt assembly

The compiled jar will be at rakkyo/preproc/target/scala-2.11/preproc-assembly-0.1.0-SNAPSHOT.jar.

How to preprocess data

You need to prepare the training data (TFExamples) using Spark; after that you can train Rakkyo models.

Rakkyo uses a lot of training data. We used Apache Spark (with Hadoop as storage) for data preprocessing. It is possible to use Apache Spark without an HDFS cluster if you can mount the data at the same location on all computational nodes.

You will need Spark 2.3.1 to launch the Spark applications if you use the binary from Releases. You can change the Spark version in build.sbt and build the preprocessing tools for other Spark versions.

Making raw silver data with Juman++

First, we need to analyse some sentences using Juman++. The StreamLines2 class is a Spark application that launches an external command, one per partition (input file); the command should take its input from stdin and write its output to stdout. StreamLines2 writes one output file per input partition.

spark-submit.sh \
        --master $SPARK_URL \
        --class org.eiennohito.spark.StreamLines2 \
        local:/path/to/preproc-assembly-0.1.0-SNAPSHOT.jar \
        --input={raw sentences, can use globs or several paths here} \
        --output={files will be placed here} \
        --command=/path/to/jumanpp-launch.sh

Where jumanpp-launch.sh is a script like

#!/bin/bash
# set -x
set -e
set -o pipefail

# set niceness of self and children
renice -n 19 $$ >/dev/null 2>/dev/null || true

JUMANPP_BINARY=jumanpp
JUMANPP_MODEL=/path/to/jumanpp.model
JUMANPP_CONFIG=/path/to/jumanpp.conf


exec "$JUMANPP_BINARY" --model="$JUMANPP_MODEL" --config="$JUMANPP_CONFIG" -s1 --auto-nbest=2:8:15

Making dictionaries

The next step creates dictionaries that map characters and categorical strings to integers for the TFExamples.

spark-submit.sh \
    --master $SPARK_URL \
    --executor-memory 10g \
    --class org.eiennohito.spark.JumanppToTags \
    local:/path/to/preproc-assembly-0.1.0-SNAPSHOT.jar \
    --input=... \
    --output=... \
    --max-chars=40000

The last parameter specifies the maximum number of characters (treated as Unicode codepoints) in the character dictionary. In our experiments, a ~3B Japanese corpus had 18k unique characters in total.
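
To pick a sensible --max-chars for your own corpus, you can first count the unique codepoints in it. A quick local sketch follows; the real dictionary is built by the Spark job above, and the input file argument is a placeholder:

import collections
import sys

# Count unique Unicode codepoints in a file of raw sentences.
counts = collections.Counter()
with open(sys.argv[1], encoding="utf-8") as f:
    for line in f:
        counts.update(line.rstrip("\n"))

print("unique codepoints:", len(counts))
print("ten most frequent:", counts.most_common(10))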

Making TFExamples

spark-submit.sh \
        --master $SPARK_URL \
        --driver-memory 60g \
        --executor-memory 40g \
        --class org.eiennohito.spark.JumanppToXMorph \
        --conf spark.driver.maxResultSize=40G \
        local:/path/to/preproc-assembly-0.1.0-SNAPSHOT.jar \
        --input=... \
        --dicts=/path/to/output_of_previous_step \
        --output=... \
        --stats=/statistics/on/sampling/will_be_put_here \
        --unk-symbol-prob=0.001 --zen2-han-prob=0.05 --max-length=150 \
        --diff-dict=/path/to/20p.01.20k.dic \
        --sample-ratio=0.16667 --boundary-ratio=0.02 --output-files=500
  • Dicts are the dictionary files output by the previous step.
  • Stats is always a local path (relative to the Spark driver).
  • Input and Output can be on HDFS and are handled by Hadoop.
  • Diff Dict is a morpheme difficulty dictionary (computed from morpheme unigram frequencies); download it from http://lotus.kuee.kyoto-u.ac.jp/~arseny/rakkyo/20p.01.20k.dic.bz2 and uncompress it before use.
  • unk-symbol-prob is the probability of replacing a symbol (a single codepoint with the 特殊-記号 Jumandic POS) with the UNK character.
  • We also sample sentences proportionally to their difficulty (defined as the difficulty of the hardest morpheme in the sentence), assigning lower weight to sentences at both extremes of difficulty (very low and very high). The sample ratio controls the total final number of sentences, and the boundary ratio controls how many sentences with extreme difficulties are sampled; an illustrative sketch follows this list.
  • This program outputs .gz-compressed TFRecords.
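
The actual weighting lives in the Scala preprocessing code; the sketch below is only a Python illustration of the idea. The difficulty thresholds and the exact way sample-ratio and boundary-ratio enter the keep probability are assumptions, not the real formula:

import random

def sentence_difficulty(morpheme_difficulties):
    # Sentence difficulty = difficulty of its hardest morpheme.
    return max(morpheme_difficulties)

def keep_probability(difficulty, sample_ratio=0.16667, boundary_ratio=0.02,
                     lo=0.1, hi=0.9):
    # Hypothetical weighting: keep mid-difficulty sentences at the full
    # sample ratio, and extreme ones (below lo or above hi) at the much
    # smaller boundary ratio.
    if difficulty < lo or difficulty > hi:
        return boundary_ratio
    return sample_ratio

def sample(sentences):
    # sentences: iterable of (text, [per-morpheme difficulties]) pairs
    for text, difficulties in sentences:
        if random.random() < keep_probability(sentence_difficulty(difficulties)):
            yield text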

How to train models

You need to use configs. See the rakkyo-conf directory for examples.

Basic launch is:

python3 rakkyo.py config.conf snapshot_dir=/path/to/snapshots
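
Before a long training run, it can be useful to peek at a few of the preprocessed TFExamples. A minimal TF 1.x sketch; the shard path is a placeholder, and since it prints the feature names actually stored, you do not have to guess them:

import tensorflow as tf  # 1.x

# Read a few records from one .gz TFRecord shard produced by JumanppToXMorph.
path = "/path/to/tfexamples/part-00000.gz"
opts = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.GZIP)

for i, raw in enumerate(tf.python_io.tf_record_iterator(path, options=opts)):
    example = tf.train.Example.FromString(raw)
    print(sorted(example.features.feature.keys()))  # actual feature names
    if i >= 2:
        break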

How to do inference

TODO
