floret-demo-projects

Demos for floret vectors

Note: As specified in requirements.txt, all demos currently require a custom version of spaCy, feature/fasttext-bloom-vectors.

Demos currently use data from OSCAR for training vectors, streaming the unshuffled deduplicated corpora using datasets.

A demo for training and loading floret vectors:

A demo for training and comparing standard fasttext and floret vectors using QVEC:

Demos for training vectors and spaCy pipelines, with a focus on cases where floret vectors are expected to improve performance compared to standard fasttext vectors on a fixed vocabulary:

  • floret_ko_ud_demo: agglutinative languages with Korean UD

    With 1M (3.3G) tokenized training texts and 50K 300-dim vectors, ~800K keys for the standard vectors:

    Vectors                  TAG   POS   DEP UAS  DEP LAS
    none                     72.6  85.0  73.3     64.6
    standard (pruned)        78.2  89.7  78.9     73.1
    floret (minn 2, maxn 3)  83.2  94.2  83.4     80.7

    With 12G tokenized training texts and 50K 300-dim vectors (except for unpruned), ~1M keys for the standard vectors:

    Vectors                  TAG   POS   DEP UAS  DEP LAS  SPEED
    none                     72.6  85.0  73.3     64.6     15272
    standard (pruned)        78.9  89.9  79.0     73.5     14754
    standard (unpruned)      81.6  91.8  80.8     76.1     14200
    floret (minn 2, maxn 3)  83.6  94.3  83.5     80.7     13530
  • floret_fi_ud_demo: agglutinative languages with Finnish, UD_Finnish-TDT (syntax) and turku-ner-corpus (NER)

    With 13G tokenized training texts and 50K 300-dim vectors (except for unpruned), ~1M keys for the standard vectors:

    Vectors                  TAG   POS   MORPH  DEP UAS  DEP LAS  ENTS F  SPEED (syntax)
    none                     93.5  92.5  86.2   79.4     72.7     62.0    12693
    standard (pruned)        96.6  95.6  89.6   84.6     79.6     72.2    13407
    standard (unpruned)      97.0  96.0  90.9   84.6     80.0     72.2    13269
    floret (minn 4, maxn 5)  97.1  96.0  91.6   84.6     80.2     73.6    12044
  • floret_hu_ner_demo: agglutinative languages with Hungarian NER

    With 500K (1.5G) tokenized training texts and 50K 300-dim vectors, with ~650K unique keys for the standard vectors:

    Vectors                  P     R     F
    none                     93.7  93.5  93.6
    standard (pruned)        94.9  95.0  95.0
    floret (minn 5, maxn 6)  97.1  95.9  96.5
  • floret_en_noisy_ner_demo: noisy data with English NER for Twitter on emerging events

    With 500K (2.5G) tokenized training texts and 20K 300-dim vectors, with ~200K unique keys for the standard vectors:

    Vectors                  P     R     F
    none                     30.7  18.7  23.3
    standard (pruned)        34.5  25.6  29.4
    floret (minn 5, maxn 6)  39.9  23.3  29.4
  • floret_en_so_ner_demo: out-of-domain data with English NER for StackOverflow vs. GitHub

    With 500K (2.5G) tokenized training texts and 20K 300-dim vectors:

    Vectors                  F (in-domain)  F (out-of-domain)
    none                     55.6           37.6
    standard (pruned)        55.3           35.1
    floret (minn 5, maxn 6)  55.5           35.9

    In this case, the OSCAR texts are not particularly suitable training data for the vectors. It would probably be better to train vectors on texts from StackOverflow or a similar source instead.
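
The minn and maxn values in the tables above are the minimum and maximum character n-gram lengths used for subwords. As a rough illustration of how a small, fixed-size table can still cover word forms that were never seen in training, here is a minimal sketch that extracts the n-grams of a word and maps each string to a few rows of a 50K-row table. The function names, the use of blake2b as the hash, and hash_count=2 are assumptions made for this example; floret's own hash function and row indices differ.

```python
import hashlib
from typing import List


def char_ngrams(word: str, minn: int, maxn: int) -> List[str]:
    """Return the character n-grams of `word` with lengths minn..maxn.

    The word is wrapped in "<" and ">" boundary markers, as fastText does.
    """
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


def bucket_rows(text: str, bucket: int, hash_count: int) -> List[int]:
    """Map a string to `hash_count` row indices in a table of `bucket` rows.

    Illustrative only: salted blake2b stands in for the real hash function,
    so these indices will not match the rows floret would pick.
    """
    rows = []
    for seed in range(hash_count):
        digest = hashlib.blake2b(
            text.encode("utf8"), digest_size=8, salt=bytes([seed])
        ).digest()
        rows.append(int.from_bytes(digest, "little") % bucket)
    return rows


def rows_for_word(word: str, minn: int, maxn: int, bucket: int, hash_count: int) -> List[int]:
    """Collect the table rows for the full word and all of its n-grams."""
    rows = bucket_rows(f"<{word}>", bucket, hash_count)
    for ngram in char_ngrams(word, minn, maxn):
        rows.extend(bucket_rows(ngram, bucket, hash_count))
    return rows


if __name__ == "__main__":
    print(char_ngrams("cats", minn=2, maxn=3))
    print(len(rows_for_word("cats", minn=2, maxn=3, bucket=50_000, hash_count=2)))
```

Because every row index is computed from the string itself, any word, including a misspelling or an unseen inflected form, resolves to rows in the same table, and forms that share n-grams share some of those rows.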

Notes

To test a workflow quickly, set max_texts to a very small value like 100. Meaningful comparisons in the test cases need a much larger amount of training data for the vectors. The provided defaults are still quite small for typical vector training data, but they should show some results and train in a reasonable amount of time on a small number of threads.

For reference, 1M texts from the OSCAR datasets are about 5G for English, 4G for Hungarian, and 3G for Korean. fasttext does not support streamed input, so it is necessary to have the tokenized training data saved in a file. Outside of a demo, I'd often use tmpfs.
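
The step of capping a streamed corpus at max_texts and saving it to a file can be sketched as follows. This is a minimal example under stated assumptions, not the demos' actual script: the function name write_training_texts is hypothetical, records are assumed to be dicts with a "text" field, and whitespace splitting is only a placeholder for a real tokenizer.

```python
from itertools import islice
from pathlib import Path
from typing import Dict, Iterable


def write_training_texts(
    records: Iterable[Dict[str, str]], output_path: Path, max_texts: int
) -> int:
    """Read at most `max_texts` records and write one tokenized text per line.

    Whitespace splitting is a placeholder for a real tokenizer. Records whose
    text is empty after splitting are skipped. Returns the number of lines
    written.
    """
    written = 0
    with output_path.open("w", encoding="utf8") as file_:
        # islice stops pulling from the stream after max_texts records,
        # so the rest of the corpus is never downloaded or read.
        for record in islice(records, max_texts):
            tokens = record["text"].split()
            if not tokens:
                continue
            file_.write(" ".join(tokens) + "\n")
            written += 1
    return written


# Hypothetical usage with a streamed OSCAR corpus (needs the `datasets`
# package and network access; the config name is an assumption):
#
# from datasets import load_dataset
# dataset = load_dataset(
#     "oscar", "unshuffled_deduplicated_ko", split="train", streaming=True
# )
# write_training_texts(dataset, Path("ko_texts.txt"), max_texts=100)
```

Passing any iterable of records keeps the function easy to try on a small in-memory list before pointing it at a streamed dataset.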

Contributors

adrianeboyd
