vgherard / sbo

Utilities for training and evaluating text predictors based on Stupid Back-off N-gram models.

R 84.59% C++ 15.23% Makefile 0.18%

ngram-models natural-language-processing predictive-text sbo

sbo's Introduction

Welcome 👋

I'm a theoretical physicist, interested in Statistical Software, Machine Learning and Data Science.

sbo's People

Forkers

minghao2016 singulritarian7

sbo's Issues

Add build_dictionary() function

Build a dictionary from the training corpus, with either a fixed size or a fixed coverage of the training corpus.

It should produce an object of class dictionary.
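The two modes requested above can be sketched in C++ as follows. This is only an illustration under assumptions: the names ranked_words(), dict_fixed_size() and dict_fixed_coverage() are hypothetical and are not part of sbo, and words are ranked by raw frequency in the tokenized corpus.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using WordCount = std::pair<std::string, std::size_t>;

// Hypothetical helper: words sorted by decreasing frequency
// (ties broken alphabetically, so that the result is reproducible).
std::vector<WordCount> ranked_words(const std::vector<std::string>& tokens) {
    std::unordered_map<std::string, std::size_t> counts;
    for (const std::string& w : tokens) ++counts[w];
    std::vector<WordCount> ranked(counts.begin(), counts.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const WordCount& a, const WordCount& b) {
                  return a.second != b.second ? a.second > b.second
                                              : a.first < b.first;
              });
    return ranked;
}

// Fixed size: keep the `max_size` most frequent words.
std::vector<std::string> dict_fixed_size(const std::vector<std::string>& tokens,
                                         std::size_t max_size) {
    const std::vector<WordCount> ranked = ranked_words(tokens);
    std::vector<std::string> dict;
    for (std::size_t i = 0; i < ranked.size() && i < max_size; ++i)
        dict.push_back(ranked[i].first);
    return dict;
}

// Fixed coverage: keep the fewest top-ranked words whose occurrences make up
// at least a fraction `target` of all tokens.
std::vector<std::string> dict_fixed_coverage(const std::vector<std::string>& tokens,
                                             double target) {
    assert(target >= 0.0 && target <= 1.0);
    const std::vector<WordCount> ranked = ranked_words(tokens);
    std::vector<std::string> dict;
    std::size_t covered = 0;
    for (const WordCount& wc : ranked) {
        if (covered >= target * tokens.size()) break;
        dict.push_back(wc.first);
        covered += wc.second;
    }
    return dict;
}
```

Sharing the ranking step between the two modes keeps them consistent: both are prefixes of the same frequency-ranked word list, cut at different points.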

Implement PrefixCompletion C++ class through std::unordered_map

This could give an enormous speed boost to predict() and, consequently, to eval_sbo_predictor().
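A minimal sketch of what a hash-based lookup could look like is given below. It is not the actual PrefixCompletion class: PrefixTable and predict_next() are hypothetical names, prefixes are assumed to be stored as space-separated strings, and the back-off step simply drops the leftmost word of the context on a miss.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical table: prefix (space-separated words) -> ranked completions.
class PrefixTable {
public:
    void insert(const std::string& prefix,
                const std::vector<std::string>& completions) {
        assert(!completions.empty());
        table_[prefix] = completions;
    }
    // Average O(1) lookup; returns a pointer into the table (no copy),
    // or nullptr if the prefix is unknown.
    const std::vector<std::string>* find(const std::string& prefix) const {
        auto it = table_.find(prefix);
        return it == table_.end() ? nullptr : &it->second;
    }
private:
    std::unordered_map<std::string, std::vector<std::string>> table_;
};

// Try the longest prefix first; on a miss, drop the leftmost word and retry.
// The empty prefix "" plays the role of the unigram table.
std::vector<std::string> predict_next(const PrefixTable& table,
                                      std::vector<std::string> context) {
    for (;;) {
        std::string key;
        for (std::size_t i = 0; i < context.size(); ++i) {
            if (i > 0) key += ' ';
            key += context[i];
        }
        if (const std::vector<std::string>* hit = table.find(key)) return *hit;
        if (context.empty()) return {};
        context.erase(context.begin());
    }
}
```

The table is passed by const reference and looked up in place, so a call only copies the few completions it returns.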

Document `filtered` argument of `build_sbo_preds()`

The documentation currently does not explain how to exclude the "End-Of-Sentence" and "Unknown-Word" tokens from next-word predictions.

Also, it might be a good idea to add two dedicated arguments, filter_EOS and filter_UNK (each either TRUE or FALSE).

Implement reading from file of training corpus in `get_kgram_freqs_fast()`

As in title.

Eliminate global variables using the `.data$` pronoun instead

As in the title; cf. the tidyeval cheatsheet.

Using Run Length Encoding in `sbo_preds` objects.

For N-gram models with N >= 3, using Run Length Encoding for k-gram prefixes in sbo_preds objects could bring two benefits:

Reduce size of these objects.
Make the retrieval of k-gram prefixes more efficient.
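Both benefits can be illustrated with a short sketch, assuming that words are stored as integer codes and that the first-word column of a prefix table is sorted, so that each code forms a single run. The names Run, rle_encode(), rle_decode() and rle_range() are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Run = std::pair<int, std::size_t>;  // (word code, run length)

// Compress a sorted column of word codes into (code, length) runs.
std::vector<Run> rle_encode(const std::vector<int>& column) {
    assert(std::is_sorted(column.begin(), column.end()));
    std::vector<Run> runs;
    for (int code : column) {
        if (!runs.empty() && runs.back().first == code) ++runs.back().second;
        else runs.emplace_back(code, 1);
    }
    return runs;
}

// Expand the runs back into the original column.
std::vector<int> rle_decode(const std::vector<Run>& runs) {
    std::vector<int> column;
    for (const Run& r : runs) column.insert(column.end(), r.second, r.first);
    return column;
}

// Row range [begin, end) occupied by `code`, found by walking the runs only.
// Returns an empty range if the code does not occur.
std::pair<std::size_t, std::size_t> rle_range(const std::vector<Run>& runs,
                                              int code) {
    std::size_t begin = 0;
    for (const Run& r : runs) {
        if (r.first == code) return {begin, begin + r.second};
        begin += r.second;
    }
    return {begin, begin};
}
```

With rle_range(), retrieving all rows that share a first word touches one entry per distinct word rather than one per row, which is where the lookup gain would come from.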

Utilities to compute coverage fraction of k-grams in a corpus

It would be cool to add a function coverage(freqs, corpus) to compute the fraction of k-grams in corpus which are covered by kgram_freqs object freqs.
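A possible shape for such a function is sketched below in C++. It assumes that the known k-grams are available as a set of space-separated strings and that the corpus is already tokenized; the signature is hypothetical, and occurrences are counted with multiplicity (another reasonable definition would count distinct k-grams).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Fraction of the k-gram occurrences in `tokens` that appear in `known`.
double coverage(const std::unordered_set<std::string>& known,
                const std::vector<std::string>& tokens, std::size_t k) {
    assert(k >= 1);
    if (tokens.size() < k) return 0.0;  // no k-grams at all
    const std::size_t total = tokens.size() - k + 1;
    std::size_t covered = 0;
    for (std::size_t i = 0; i < total; ++i) {
        // Build the k-gram starting at position i as "w1 w2 ... wk".
        std::string kgram = tokens[i];
        for (std::size_t j = 1; j < k; ++j) kgram += ' ' + tokens[i + j];
        if (known.count(kgram) > 0) ++covered;
    }
    return static_cast<double>(covered) / total;
}
```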

Implement next-word prediction from out-of-memory kgram frequency table

This could be coupled with use of memoise() on the prediction method (i.e. caching next-word predictions) for efficiency in interactive use.
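The caching idea can be sketched independently of R's memoise(). The class below is a hypothetical wrapper, not part of sbo: it stores the result of a slow predictor (for instance, one that reads from disk) the first time a context is seen and serves later requests from memory.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class CachedPredictor {
public:
    using Predictor =
        std::function<std::vector<std::string>(const std::string&)>;

    explicit CachedPredictor(Predictor slow) : slow_(std::move(slow)) {
        assert(static_cast<bool>(slow_));  // an empty std::function is a bug
    }

    // Returns the cached predictions, calling the slow predictor on a miss.
    const std::vector<std::string>& operator()(const std::string& context) {
        auto it = cache_.find(context);
        if (it == cache_.end()) {
            ++misses_;
            it = cache_.emplace(context, slow_(context)).first;
        }
        return it->second;
    }

    std::size_t misses() const { return misses_; }

private:
    Predictor slow_;
    std::unordered_map<std::string, std::vector<std::string>> cache_;
    std::size_t misses_ = 0;
};
```

The cache here grows without bound; for long interactive sessions some eviction policy would be needed, which this sketch leaves out.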

Properly document `plot()` method for `word_coverage` class

NOTE: no visible global function definition for plot

From rhub_check() on Ubuntu Linux 16.04 LTS, R-release, GCC.

* checking R code for possible problems ... NOTE
plot.word_coverage: no visible global function definition for ‘plot’
Undefined global functions or variables:
  plot
Consider adding
  importFrom("graphics", "plot")
to your NAMESPACE file.

Vectorizing eval_sbo_preds()

In the current version, only the predict() part is implemented in C++ (whereas k-gram random sampling is performed in a slow lapply() loop).
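As a rough illustration of moving the sampling step to C++, the sketch below draws one k-gram from each sufficiently long sentence in a single pass. The sampling scheme (one uniformly chosen k-gram per sentence) and the name sample_kgrams() are assumptions made for the example and may differ from what eval_sbo_preds() actually does.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

using Sentence = std::vector<std::string>;

// Draw one k-gram, uniformly at random, from every sentence that has at
// least k tokens; shorter sentences are skipped.
std::vector<Sentence> sample_kgrams(const std::vector<Sentence>& sentences,
                                    std::size_t k, unsigned seed) {
    assert(k >= 1);
    std::mt19937 rng(seed);
    std::vector<Sentence> out;
    out.reserve(sentences.size());
    for (const Sentence& s : sentences) {
        if (s.size() < k) continue;
        // Valid start positions are 0 .. size - k (inclusive).
        std::uniform_int_distribution<std::size_t> start(0, s.size() - k);
        const std::size_t i = start(rng);
        out.emplace_back(s.begin() + i, s.begin() + i + k);
    }
    return out;
}
```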

Clean up Depends and Imports statements in DESCRIPTION

It doesn't seem necessary to Depend on dplyr and magrittr; it would probably be cleaner to use :: or to import only the relevant functions (e.g. the pipe).

Examples and vignettes should be modified accordingly.

`summary()` and `str()` methods for `sbo_preds` and `kgram_freqs` S3 classes

As in title. Correspondingly, the print() methods could be made a little less verbose.

PREPERROR: ‘unordered_map’ is not a member of ‘std’

From check_rhub() on Debian Linux, R-devel, GCC ASAN/UBSAN

#> In file included from /usr/include/c++/10/unordered_map:35,
#> from sbo.h:5,
#> from PrefixCompletion.cpp:1:
#> /usr/include/c++/10/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
#> 32 | #error This file requires compiler and library support \
#> | ^~~~~
#> In file included from PrefixCompletion.cpp:1:
#> sbo.h:14:26: error: ‘unordered_map’ is not a member of ‘std’
#> 14 | std::vector> freqs_;
#> | ^~~~~~~~~~~~~
#> sbo.h:14:26: note: ‘std::unordered_map’ is only available from C++11 onwards
#> sbo.h:14:26: error: ‘unordered_map’ is not a member of ‘std’
#> sbo.h:14:26: note: ‘std::unordered_map’ is only available from C++11 onwards
#> sbo.h:14:56: error: spurious ‘>>’, use ‘>’ to terminate a template argument list
#> 14 | std::vector> freqs_;
#> | ^~
#> sbo.h:14:56: error: template argument 1 is invalid
#> sbo.h:23:32: error: ‘unordered_map’ is not a member of ‘std’
#> 23 | const std::vector>& freqs() const
#> | ^~~~~~~~~~~~~
#> sbo.h:23:32: note: ‘std::unordered_map’ is only available from C++11 onwards
#> sbo.h:23:32: error: ‘unordered_map’ is not a member of ‘std’
#> sbo.h:23:32: note: ‘std::unordered_map’ is only available from C++11 onwards
#> sbo.h:23:62: error: spurious ‘>>’, use ‘>’ to terminate a template argument list
#> 23 | const std::vector>& freqs() const
#> | ^~
#> sbo.h:23:62: error: template argument 1 is invalid
#> sbo.h:35:50: error: ‘>>’ should be ‘> >’ within a nested template argument list
#> 35 | std::vector> pc;
#> | ^~
#> | > >
#> PrefixCompletion.cpp: In constructor ‘PrefixCompletion::PrefixCompletion(const List&)’:
#> PrefixCompletion.cpp:5:10: warning: extended initializer lists only available with ‘-std=c++11’ or ‘-std=gnu++11’
#> 5 | N{object.attr("N")}, L{object.attr("L")}, EOS{object.attr("EOS")}
#> | ^
#> PrefixCompletion.cpp:5:31: warning: extended initializer lists only available with ‘-std=c++11’ or ‘-std=gnu++11’
#> 5 | N{object.attr("N")}, L{object.attr("L")}, EOS{object.attr("EOS")}
#> | ^
#> PrefixCompletion.cpp:5:54: warning: extended initializer lists only available with ‘-std=c++11’ or ‘-std=gnu++11’
#> 5 | N{object.attr("N")}, L{object.attr("L")}, EOS{object.attr("EOS")}
#> | ^
#> PrefixCompletion.cpp:5:73: error: call of overloaded ‘basic_string()’ is ambiguous
#> 5 | N{object.attr("N")}, L{object.attr("L")}, EOS{object.attr("EOS")}
#> | ^
#> In file included from /usr/include/c++/10/string:55,
#> from /home/docker/R/Rcpp/include/Rcpp/macros/macros.h:25,
#> from /home/docker/R/Rcpp/include/Rcpp/r/headers.h:69,
#> from /home/docker/R/Rcpp/include/RcppCommon.h:29,
#> from /home/docker/R/Rcpp/include/Rcpp.h:27,
#> from sbo.h:1,
#> from PrefixCompletion.cpp:1:
#> /usr/include/c++/10/bits/basic_string.h:525:7: note: candidate: ‘std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _CharT*, const _Alloc&) [with _CharT = char; _Traits = std::char_traits<char>; _Alloc = std::allocator<char>]’
#> 525 | basic_string(const _CharT* __s, const _Alloc& __a = _Alloc())
#> | ^~~~~~~~~~~~
#> /usr/include/c++/10/bits/basic_string.h:448:7: note: candidate: ‘std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::basic_string(const std::__cxx11::basic_string<_CharT, _Traits, _Alloc>&) [with _CharT = char; _Traits = std::char_traits<char>; _Alloc = std::allocator<char>]’
#> 448 | basic_string(const basic_string& __str)
#> | ^~~~~~~~~~~~
#> /usr/include/c++/10/bits/basic_string.h:440:7: note: candidate: ‘std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _Alloc&) [with _CharT = char; _Traits = std::char_traits<char>; _Alloc = std::allocator<char>]’
#> 440 | basic_string(const _Alloc& __a) _GLIBCXX_NOEXCEPT
#> | ^~~~~~~~~~~~
#> PrefixCompletion.cpp:7:42: error: ‘>>’ should be ‘> >’ within a nested template argument list
#> 7 | dict = as>(object.attr("dict"));
#> | ^~
#> | > >
#> make: *** [/usr/local/lib/R/etc/Makeconf:177: PrefixCompletion.o] Error 1
#> ERROR: compilation failed for package ‘sbo’
#> * removing ‘/home/docker/R/sbo’
#> Warning message:
#> In i.p(...) :
#> installation of package ‘/tmp/RtmpHrNntD/file143bd854f6/sbo_0.4.0.9000.tar.gz’ had non-zero exit status
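All the errors above stem from the sources being compiled in pre-C++11 mode. In an R package the standard is usually requested with CXX_STD = CXX11 in src/Makevars (or SystemRequirements: C++11 in DESCRIPTION). As a complement, a header could fail early with a readable message, as in the sketch below; the function make_tables() and its element types are placeholders for illustration, not the declarations in sbo.h.

```cpp
// Fail early, with one readable message, when compiled in pre-C++11 mode.
// This guard must come before any C++11-only standard header is included.
#if __cplusplus < 201103L
#error "C++11 is required: std::unordered_map, '>>' in templates, brace initializers"
#endif

#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Uses the three C++11 features flagged in the log: std::unordered_map,
// '>>' closing a nested template argument list, and brace initialization.
std::vector<std::unordered_map<std::string, int>> make_tables(std::size_t n) {
    assert(n > 0);
    std::vector<std::unordered_map<std::string, int>> tables(n);
    std::string eos{"<EOS>"};
    tables[0].emplace(eos, 1);
    return tables;
}
```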

build_sbo_preds() directly from training corpus

The present UI only allows calling build_sbo_preds() on a previously obtained kgram_freqs object. This doesn't really make sense, since kgram_freqs objects are (from the text prediction POV) only intermediate objects.

Pruning the dictionary

More of a question about the usage guide than an issue:

The starter guide, under Evaluating next-word predictions, mentions pruning the dictionary to remove seldom-used words. Could somebody clarify the ranking (is it in terms of word frequency in the corpus?) and how to go about pruning an SBO dictionary according to rank?

Thanks. Remarkable package! Thanks for your great work on this!

Rethinking structure of `kgram_freqs` and `sbo_preds` object

From the UI point of view, components such as n, L or lambda would probably appear more naturally as attributes than list elements.
Maybe a simple list (rather than a matrix) would be a better fit for actual kgram_freqs and prediction tables. This could potentially help solve the first part of #10; it would also allow storing RLE-encoded word sequences and regular sequences in a single object.

V <- 2
sbo::kgram_freqs(corpus = "a b c d", N = 2, dict = max_size ~ V)
#> Warning in make_dict.formula(object = dict, .preprocess = identity, EOS = EOS, :
#> NAs introduced by coercion
#> Error in if (max_size < 0) {: missing value where TRUE/FALSE needed

Created on 2020-12-09 by the reprex package (v0.3.0)

Add package documentation entry

See the Roxygen Quick Reference

Define S3 class for output of `eval_sbo_predictor()`

The class should support methods for the following basic tasks:

Computing predictor accuracy, with and without "" as possible true completion, and the uncertainty of the estimate.
Computing recall on a limited set of words, and its uncertainty.
Plotting distribution of word-ranks of correct predictions.
...?
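The first task in the list, accuracy with its uncertainty, reduces to a binomial proportion and its standard error, sqrt(p (1 - p) / n). A minimal sketch follows, assuming the evaluation result has already been reduced to one correct/incorrect flag per prediction; the Accuracy struct and the accuracy() function are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Accuracy {
    double estimate;   // fraction of correct predictions
    double std_error;  // binomial standard error of the estimate
};

Accuracy accuracy(const std::vector<bool>& correct) {
    assert(!correct.empty());
    std::size_t hits = 0;
    for (bool c : correct) {
        if (c) ++hits;
    }
    const double n = static_cast<double>(correct.size());
    const double p = hits / n;
    return Accuracy{p, std::sqrt(p * (1.0 - p) / n)};
}
```

Recall on a restricted set of words could reuse the same function after filtering the flags to the predictions whose true completion belongs to that set.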

Tests for `get_kgram_freqs_fast()` failing on `rhub::platforms()[12,1]` (i.e. "macos-highsierra-release-cran")

All tests whose names start with "correct 1-gram" fail on this platform.

FAILURE (test-eval_sbo_predictor.R:19:9): output has the correct structure

From rhub_check() on Ubuntu Linux 16.04 LTS, R-release, GCC.
The same error is reported by check_win_oldrelease().

checking tests ...
Running ‘testthat.R’ [19s/36s]
ERROR
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
Reason: Skip test for updated data

── FAILURE (test-eval_sbo_predictor.R:19:9): output has the correct structure ──
lapply(p_eval, class) not identical to classes.
Component "preds": Lengths (1, 2) differ (string compare on first 1)

── Skipped tests ──────────────────────────────────────────────────────────────
● Skip test for updated data (1)

══ testthat results ═══════════════════════════════════════════════════════════
FAILURE (test-eval_sbo_predictor.R:19:9): output has the correct structure

[ FAIL 1 | WARN 0 | SKIP 1 | PASS 269 ]
Error: Test failures
Execution halted

Huge memory allocations from `predict.sbo_preds`

The current (C++) implementation of the predict.sbo_preds() method has two big issues:

Every call to predict() makes a copy of the entire k-gram prediction tables. This is memory-expensive and slow if predict() is called in a non-vectorized way (as would happen e.g. in interactive text prediction).
The look-up method in prediction tables is very slow and causes huge memory allocations/deallocations for large vector input, which considerably slows down model evaluations in eval_sbo_preds(). Maybe #8 could partially fix this?