sbo's Introduction

Welcome 👋

I'm a theoretical physicist, interested in Statistical Software, Machine Learning and Data Science.

sbo's Issues

Add build_dictionary() function

Build a dictionary from a training corpus, with either a fixed size or a fixed coverage of the corpus.

Should produce an object of class dictionary.
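The two modes described above can be sketched as follows. This is only an illustration in plain C++, not the package's implementation: the function names (`ranked_word_counts`, `dict_fixed_size`, `dict_fixed_coverage`) are hypothetical, words are assumed to be whitespace-separated, and "coverage" is assumed to mean the fraction of corpus tokens accounted for by the retained words.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

typedef std::pair<std::string, std::size_t> WordCount;

// Hypothetical helper: count whitespace-separated words and return them
// sorted by decreasing frequency (ties broken alphabetically).
std::vector<WordCount> ranked_word_counts(const std::vector<std::string>& corpus) {
    std::unordered_map<std::string, std::size_t> counts;
    for (const std::string& line : corpus) {
        std::istringstream iss(line);
        std::string word;
        while (iss >> word) ++counts[word];
    }
    std::vector<WordCount> ranked(counts.begin(), counts.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const WordCount& a, const WordCount& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;
              });
    return ranked;
}

// Fixed size: keep at most max_size of the most frequent words.
std::vector<std::string> dict_fixed_size(const std::vector<std::string>& corpus,
                                         std::size_t max_size) {
    const std::vector<WordCount> ranked = ranked_word_counts(corpus);
    std::vector<std::string> dict;
    for (std::size_t i = 0; i < ranked.size() && i < max_size; ++i)
        dict.push_back(ranked[i].first);
    return dict;
}

// Fixed coverage: keep top-ranked words until they account for at least
// a fraction `target` (between 0 and 1) of all tokens in the corpus.
std::vector<std::string> dict_fixed_coverage(const std::vector<std::string>& corpus,
                                             double target) {
    const std::vector<WordCount> ranked = ranked_word_counts(corpus);
    std::size_t total = 0;
    for (const WordCount& wc : ranked) total += wc.second;
    std::vector<std::string> dict;
    std::size_t covered = 0;
    for (const WordCount& wc : ranked) {
        if (static_cast<double>(covered) >= target * static_cast<double>(total)) break;
        dict.push_back(wc.first);
        covered += wc.second;
    }
    return dict;
}
```

For the corpus `{"a a a b", "a b c"}`, the word counts are a = 4, b = 2, c = 1, so a size-2 dictionary is `{a, b}` and a 50% coverage dictionary is `{a}`.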

Document `filtered` argument of `build_sbo_preds()`

The documentation currently does not explain how to exclude the "End-Of-Sentence" and "Unknown-Word" tokens from next-word predictions.

Also, it might be a good idea to add two dedicated arguments, filter_EOS and filter_UNK (each TRUE or FALSE).

Using Run Length Encoding in `sbo_preds` objects.

For N-gram models with N >= 3, using Run Length Encoding for k-gram prefixes in sbo_preds objects could bring two benefits:

  1. Reduce size of these objects.
  2. Make the retrieval of k-gram prefixes more efficient.
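To make the two benefits concrete, here is a minimal sketch of run-length encoding applied to one column of integer word codes. It assumes, hypothetically, that the column is sorted so that each prefix word occupies a single run; the names `Run`, `rle_encode`, `rle_decode` and `run_range` are illustrative and not part of the package.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// One run: a word code and the number of consecutive rows holding it.
typedef std::pair<int, std::size_t> Run;

// Compress a column of word codes into (value, length) runs.
std::vector<Run> rle_encode(const std::vector<int>& column) {
    std::vector<Run> runs;
    for (int value : column) {
        if (!runs.empty() && runs.back().first == value) ++runs.back().second;
        else runs.push_back(Run(value, 1));
    }
    return runs;
}

// Expand runs back into the original column.
std::vector<int> rle_decode(const std::vector<Run>& runs) {
    std::vector<int> column;
    for (const Run& run : runs)
        column.insert(column.end(), run.second, run.first);
    return column;
}

// Row range [first, last) holding `value`, found by scanning runs rather
// than rows. Returns an empty range (first == last) if the value is absent.
std::pair<std::size_t, std::size_t> run_range(const std::vector<Run>& runs, int value) {
    std::size_t start = 0;
    for (const Run& run : runs) {
        if (run.first == value) return std::make_pair(start, start + run.second);
        start += run.second;
    }
    return std::make_pair(start, start);
}
```

In this sketch the saving in size comes from storing one pair per run instead of one entry per row, and the look-up touches one element per distinct prefix word instead of one per row.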

NOTE: no visible global function definition for plot

From rhub_check() on Ubuntu Linux 16.04 LTS, R-release, GCC.

* checking R code for possible problems ... NOTE
plot.word_coverage: no visible global function definition for ‘plot’
Undefined global functions or variables:
  plot
Consider adding
  importFrom("graphics", "plot")
to your NAMESPACE file.

Vectorizing eval_sbo_preds()

In the current version, only the predict() part is implemented in C++ (whereas k-gram random sampling is performed in a slow lapply() loop).

PREPERROR: ‘unordered_map’ is not a member of ‘std’

From check_rhub() on Debian Linux, R-devel, GCC ASAN/UBSAN

In file included from /usr/include/c++/10/unordered_map:35,
from sbo.h:5,
from PrefixCompletion.cpp:1:
/usr/include/c++/10/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
32 | #error This file requires compiler and library support \
| ^~~~~
In file included from PrefixCompletion.cpp:1:
sbo.h:14:26: error: ‘unordered_map’ is not a member of ‘std’
14 | std::vector> freqs_;
| ^~~~~~~~~~~~~
sbo.h:14:26: note: ‘std::unordered_map’ is only available from C++11 onwards
sbo.h:14:26: error: ‘unordered_map’ is not a member of ‘std’
sbo.h:14:26: note: ‘std::unordered_map’ is only available from C++11 onwards
sbo.h:14:56: error: spurious ‘>>’, use ‘>’ to terminate a template argument list
14 | std::vector> freqs_;
| ^~
sbo.h:14:56: error: template argument 1 is invalid
sbo.h:23:32: error: ‘unordered_map’ is not a member of ‘std’
23 | const std::vector>& freqs() const
| ^~~~~~~~~~~~~
sbo.h:23:32: note: ‘std::unordered_map’ is only available from C++11 onwards
sbo.h:23:32: error: ‘unordered_map’ is not a member of ‘std’
sbo.h:23:32: note: ‘std::unordered_map’ is only available from C++11 onwards
sbo.h:23:62: error: spurious ‘>>’, use ‘>’ to terminate a template argument list
23 | const std::vector>& freqs() const
| ^~
sbo.h:23:62: error: template argument 1 is invalid
sbo.h:35:50: error: ‘>>’ should be ‘> >’ within a nested template argument list
35 | std::vector> pc;
| ^~
| > >
PrefixCompletion.cpp: In constructor ‘PrefixCompletion::PrefixCompletion(const List&)’:
PrefixCompletion.cpp:5:10: warning: extended initializer lists only available with ‘-std=c++11’ or ‘-std=gnu++11’
5 | N{object.attr("N")}, L{object.attr("L")}, EOS{object.attr("EOS")}
| ^
PrefixCompletion.cpp:5:31: warning: extended initializer lists only available with ‘-std=c++11’ or ‘-std=gnu++11’
5 | N{object.attr("N")}, L{object.attr("L")}, EOS{object.attr("EOS")}
| ^
PrefixCompletion.cpp:5:54: warning: extended initializer lists only available with ‘-std=c++11’ or ‘-std=gnu++11’
5 | N{object.attr("N")}, L{object.attr("L")}, EOS{object.attr("EOS")}
| ^
PrefixCompletion.cpp:5:73: error: call of overloaded ‘basic_string()’ is ambiguous
5 | N{object.attr("N")}, L{object.attr("L")}, EOS{object.attr("EOS")}
| ^
In file included from /usr/include/c++/10/string:55,
from /home/docker/R/Rcpp/include/Rcpp/macros/macros.h:25,
from /home/docker/R/Rcpp/include/Rcpp/r/headers.h:69,
from /home/docker/R/Rcpp/include/RcppCommon.h:29,
from /home/docker/R/Rcpp/include/Rcpp.h:27,
from sbo.h:1,
from PrefixCompletion.cpp:1:
/usr/include/c++/10/bits/basic_string.h:525:7: note: candidate: ‘std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _CharT*, const _Alloc&) [with _CharT = char; _Traits = std::char_traits; _Alloc = std::allocator]’
525 | basic_string(const _CharT* __s, const _Alloc& __a = _Alloc())
| ^~~~~~~~~~~~
/usr/include/c++/10/bits/basic_string.h:448:7: note: candidate: ‘std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::basic_string(const std::__cxx11::basic_string<_CharT, _Traits, _Alloc>&) [with _CharT = char; _Traits = std::char_traits; _Alloc = std::allocator]’
448 | basic_string(const basic_string& __str)
| ^~~~~~~~~~~~
/usr/include/c++/10/bits/basic_string.h:440:7: note: candidate: ‘std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _Alloc&) [with _CharT = char; _Traits = std::char_traits; _Alloc = std::allocator]’
440 | basic_string(const _Alloc& __a) _GLIBCXX_NOEXCEPT
| ^~~~~~~~~~~~
PrefixCompletion.cpp:7:42: error: ‘>>’ should be ‘> >’ within a nested template argument list
7 | dict = as>(object.attr("dict"));
| ^~
| > >
make: *** [/usr/local/lib/R/etc/Makeconf:177: PrefixCompletion.o] Error 1
ERROR: compilation failed for package ‘sbo’
* removing ‘/home/docker/R/sbo’
Warning message:
In i.p(...) :
installation of package ‘/tmp/RtmpHrNntD/file143bd854f6/sbo_0.4.0.9000.tar.gz’ had non-zero exit status

build_sbo_preds() directly from training corpus

The present UI only allows calling build_sbo_preds() on a previously obtained kgram_freqs object. This doesn't really make sense, since kgram_freqs objects are (from the text prediction point of view) only intermediate objects.

Pruning the dictionary

More of a question about the usage guide than an issue:

The starter guide, under Evaluating next-word predictions, mentions pruning the dictionary to remove seldom-used words. Could somebody clarify the ranking (is it in terms of word frequency in the corpus?) and how to go about pruning an SBO dictionary according to rank?

Thanks. Remarkable package! Thanks for your great work on this!

Rethinking structure of `kgram_freqs` and `sbo_preds` object

  1. From the UI point of view, components such as n, L or lambda would probably appear more naturally as attributes than as list elements.

  2. Maybe a simple list (rather than a matrix) would be a better fit for the actual kgram_freqs and prediction tables. This could potentially help solve the first part of #10; it would also allow storing RLE-encoded word sequences and regular sequences in a single object.

Changing data structure for `sbo_preds`

Simple lists (or tibbles) would be a better fit than matrices for sbo_preds objects.

This could potentially help solve the first part of #10; it would also allow storing RLE-encoded word sequences and regular sequences in a single object.

Formula `dict` argument of `kgram_freqs()` works only with literals

For instance:

V <- 2
sbo::kgram_freqs(corpus = "a b c d", N = 2, dict = max_size ~ V)
#> Warning in make_dict.formula(object = dict, .preprocess = identity, EOS = EOS, :
#> NAs introduced by coercion
#> Error in if (max_size < 0) {: missing value where TRUE/FALSE needed

Created on 2020-12-09 by the reprex package (v0.3.0)

Define S3 class for output of `eval_sbo_predictor()`

The class should support methods for the following basic tasks:

  • Computing predictor accuracy, with and without the End-Of-Sentence token as a possible true completion, and the uncertainty of the estimate.
  • Computing recall on a limited set of words, and its uncertainty.
  • Plotting distribution of word-ranks of correct predictions.
  • ...?
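As a rough illustration of the first task in the list above, the sketch below computes accuracy together with a binomial standard error, sqrt(p (1 - p) / n), optionally excluding test cases whose true completion is the End-Of-Sentence token. The `Outcome` and `Estimate` types, the `accuracy()` function and the "<EOS>" string are assumptions made for this example, and the binomial standard error is one possible choice of uncertainty, not necessarily the one the class would use.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record of one test case: the true completion and whether
// the predictor got it right.
struct Outcome {
    std::string truth;
    bool correct;
};

// Point estimate and its standard error.
struct Estimate {
    double value;
    double std_error;
};

// Accuracy over the test cases, optionally dropping cases whose true
// completion equals `eos` (assumed here to be the string "<EOS>").
Estimate accuracy(const std::vector<Outcome>& outcomes, bool drop_eos,
                  const std::string& eos = "<EOS>") {
    std::size_t n = 0;
    std::size_t hits = 0;
    for (const Outcome& o : outcomes) {
        if (drop_eos && o.truth == eos) continue;
        ++n;
        if (o.correct) ++hits;
    }
    if (n == 0) return Estimate{0.0, 0.0};  // no usable cases: report zeros
    const double p = static_cast<double>(hits) / static_cast<double>(n);
    return Estimate{p, std::sqrt(p * (1.0 - p) / static_cast<double>(n))};
}
```

With four test cases of which three are correct, the accuracy is 0.75; dropping one correctly predicted End-Of-Sentence case leaves two correct out of three.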

FAILURE (test-eval_sbo_predictor.R:19:9): output has the correct structure

From rhub_check() on Ubuntu Linux 16.04 LTS, R-release, GCC;
Same error from check_win_oldrelease()

  • checking tests ...
    Running ‘testthat.R’ [19s/36s]
    ERROR
    Running the tests in ‘tests/testthat.R’ failed.
    Last 13 lines of output:
    Reason: Skip test for updated data

    ── FAILURE (test-eval_sbo_predictor.R:19:9): output has the correct structure ──
    lapply(p_eval, class) not identical to classes.
    Component "preds": Lengths (1, 2) differ (string compare on first 1)

    ── Skipped tests ──────────────────────────────────────────────────────────────
    ● Skip test for updated data (1)

    ══ testthat results ═══════════════════════════════════════════════════════════
    FAILURE (test-eval_sbo_predictor.R:19:9): output has the correct structure

    [ FAIL 1 | WARN 0 | SKIP 1 | PASS 269 ]
    Error: Test failures
    Execution halted

Huge memory allocations from `predict.sbo_preds`

The current (C++) implementation of the predict.sbo_preds() method has two big issues:

  • Every call to predict() makes a copy of the entire set of k-gram prediction tables. This is memory-expensive and slow if predict() is called in a non-vectorized way (as would happen, e.g., in interactive text prediction).
  • The look-up method in prediction tables is very slow and causes huge memory allocations/deallocations for large vector input, which considerably slows down model evaluations in eval_sbo_preds(). Maybe #8 could partially fix this?
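One possible direction for both points, sketched in plain C++ under stated assumptions: the tables are stored once inside a predictor object, each call returns a reference into them instead of a copy, and prefixes are looked up in a hash map. The `Predictor` class, the `PredTable` layout (prefix words joined by single spaces, with the empty string as the zero-word prefix) and the back-off loop are hypothetical and do not describe the package's actual data structures.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical layout: tables[k] maps a k-word prefix to its ranked
// predictions; tables[0] holds the empty prefix "".
typedef std::unordered_map<std::string, std::vector<std::string>> PredTable;

class Predictor {
public:
    // The tables are moved in once, at construction time.
    explicit Predictor(std::vector<PredTable> tables) : tables_(std::move(tables)) {}

    // Try the longest available prefix first, then back off to shorter ones.
    // The result is a reference into the stored tables, so no table is
    // copied per call.
    const std::vector<std::string>& predict(const std::vector<std::string>& context) const {
        static const std::vector<std::string> none;
        if (tables_.empty()) return none;
        std::size_t k = std::min(context.size(), tables_.size() - 1);
        for (;;) {
            const PredTable& table = tables_[k];
            PredTable::const_iterator it = table.find(last_words(context, k));
            if (it != table.end()) return it->second;
            if (k == 0) break;
            --k;
        }
        return none;
    }

private:
    // Join the last k words of the context with single spaces.
    static std::string last_words(const std::vector<std::string>& context, std::size_t k) {
        std::string key;
        const std::size_t first = context.size() - k;
        for (std::size_t i = first; i < context.size(); ++i) {
            if (i > first) key += ' ';
            key += context[i];
        }
        return key;
    }

    std::vector<PredTable> tables_;
};
```

The hash-map look-up costs roughly constant time per prefix on average, at the price of building a string key for each attempted prefix length; whether that trade-off suits the package would need to be measured.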
