betaml.jl's Introduction

Beta Machine Learning Toolkit

Machine Learning made simple :-)

   

The Beta Machine Learning Toolkit is a package including many algorithms and utilities to implement machine learning workflows in Julia, Python, R and any other language with a Julia binding.


Currently the following models are available:

BetaML name | MLJ Interface | Category
PerceptronClassifier | PerceptronClassifier | Supervised classifier
KernelPerceptronClassifier | KernelPerceptronClassifier | Supervised classifier
PegasosClassifier | PegasosClassifier | Supervised classifier
DecisionTreeEstimator | DecisionTreeClassifier, DecisionTreeRegressor | Supervised regressor and classifier
RandomForestEstimator | RandomForestClassifier, RandomForestRegressor | Supervised regressor and classifier
NeuralNetworkEstimator | NeuralNetworkRegressor, MultitargetNeuralNetworkRegressor, NeuralNetworkClassifier | Supervised regressor and classifier
GaussianMixtureRegressor | GaussianMixtureRegressor, MultitargetGaussianMixtureRegressor | Supervised regressor
GaussianMixtureRegressor2 | (none) | Supervised regressor
KMeansClusterer | KMeansClusterer | Unsupervised hard clusterer
KMedoidsClusterer | KMedoidsClusterer | Unsupervised hard clusterer
GaussianMixtureClusterer | GaussianMixtureClusterer | Unsupervised soft clusterer
SimpleImputer | SimpleImputer | Unsupervised missing data imputer
GaussianMixtureImputer | GaussianMixtureImputer | Unsupervised missing data imputer
RandomForestImputer | RandomForestImputer | Unsupervised missing data imputer
GeneralImputer | GeneralImputer | Unsupervised missing data imputer
MinMaxScaler | (none) | Data transformer
StandardScaler | (none) | Data transformer
Scaler | (none) | Data transformer
PCAEncoder | (none) | Unsupervised dimensionality reduction transformer
AutoEncoder | AutoEncoderMLJ | Unsupervised non-linear dimensionality reduction
OneHotEncoder | (none) | Data transformer
OrdinalEncoder | (none) | Data transformer
ConfusionMatrix | (none) | Predictions assessment

Theoretical notes describing many of these algorithms are at the companion repository https://github.com/sylvaticus/MITx_6.86x.

All models are implemented entirely in Julia and are hosted in the repository itself (i.e. they are not wrappers around third-party models). If your favorite option or model is missing, you can try to implement it yourself and open a pull request to share it (see the section Contribute below), or request its implementation by opening an issue. Thanks to its JIT compiler, Julia is indeed in the sweet spot where we can easily write models in a high-level language and still have them run efficiently.

Documentation

Please refer to the package documentation or use Julia's inline help system (press the question mark ? and then, at the special help prompt help?>, type the module or function name). The package documentation is made of two distinct parts: an extensively commented tutorial that covers most of the library, and a reference manual covering the library's API.
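For example, once BetaML is installed you can query the documentation of a model or function directly from the REPL (a minimal illustration, here using OneHotEncoder):

julia> using BetaML

help?> OneHotEncoder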

If you are looking for introductory material on Julia, have a look at the book "Julia Quick Syntax Reference" (Apress, 2019) or the online course "Scientific Programming and Machine Learning in Julia".

While implemented in Julia, this package can easily be used from R or Python, employing JuliaCall and PyJulia respectively; see the relevant section in the documentation.

Examples

  • Using an Artificial Neural Network for multinomial categorisation

In this example we see how to train a neural network model to predict the species name (5th column) given the sepal and petal measurements (first 4 columns) in the famous iris flower dataset.

# Load Modules
using DelimitedFiles, Random
using Pipe, Plots, BetaML # Load BetaML and other auxiliary modules
Random.seed!(123);  # Fix the random seed (to obtain reproducible results).

# Load the data
iris     = readdlm(joinpath(dirname(Base.find_package("BetaML")),"..","test","data","iris.csv"),',',skipstart=1)
x        = convert(Array{Float64,2}, iris[:,1:4])
y        = convert(Array{String,1}, iris[:,5])
# Encode the categories (levels) of y using a separate column per each category (aka "one-hot" encoding) 
ohmod    = OneHotEncoder()
y_oh     = fit!(ohmod,y) 
# Split the data in training/testing sets
((xtrain,xtest),(ytrain,ytest),(ytrain_oh,ytest_oh)) = partition([x,y,y_oh],[0.8,0.2])
(ntrain, ntest) = size.([xtrain,xtest],1)

# Define the Artificial Neural Network model
l1   = DenseLayer(4,10,f=relu) # The activation function is `ReLU`
l2   = DenseLayer(10,3)        # The activation function is `identity` by default
l3   = VectorFunctionLayer(3,f=softmax) # Add a (parameterless) layer whose activation function (`softmax` in this case) is applied to all its nodes at once
mynn = NeuralNetworkEstimator(layers=[l1,l2,l3],loss=crossentropy,descr="Multinomial logistic regression Model Sepal", batch_size=2, epochs=200) # Build the NN and use cross-entropy as the error function. Switch to auto-tuning with `autotune=true`

# Train the model (using the ADAM optimizer by default)
res = fit!(mynn,fit!(Scaler(),xtrain),ytrain_oh) # Fit the model to the (scaled) data

# Obtain predictions and test them against the ground truth observations
ŷtrain         = @pipe predict(mynn,fit!(Scaler(),xtrain)) |> inverse_predict(ohmod,_)  # Note the scaling and reverse one-hot encoding functions
ŷtest          = @pipe predict(mynn,fit!(Scaler(),xtest))  |> inverse_predict(ohmod,_) 
train_accuracy = accuracy(ytrain,ŷtrain) # 0.975
test_accuracy  = accuracy(ytest,ŷtest)   # 0.96

# Analyse model performances
cm = ConfusionMatrix()
fit!(cm,ytest,ŷtest)
print(cm)
A ConfusionMatrix BetaMLModel (fitted)

-----------------------------------------------------------------

*** CONFUSION MATRIX ***

Scores actual (rows) vs predicted (columns):

4×4 Matrix{Any}:
 "Labels"       "virginica"    "versicolor"   "setosa"
 "virginica"   8              1              0
 "versicolor"  0             14              0
 "setosa"      0              0              7
Normalised scores actual (rows) vs predicted (columns):

4×4 Matrix{Any}:
 "Labels"       "virginica"   "versicolor"   "setosa"
 "virginica"   0.888889      0.111111       0.0
 "versicolor"  0.0           1.0            0.0
 "setosa"      0.0           0.0            1.0

 *** CONFUSION REPORT ***

- Accuracy:               0.9666666666666667
- Misclassification rate: 0.033333333333333326
- Number of classes:      3

  N Class      precision   recall  specificity  f1score  actual_count  predicted_count
                             TPR       TNR                 support                  

  1 virginica      1.000    0.889        1.000    0.941            9               8
  2 versicolor     0.933    1.000        0.938    0.966           14              15
  3 setosa         1.000    1.000        1.000    1.000            7               7

- Simple   avg.    0.978    0.963        0.979    0.969
- Weighted avg.    0.969    0.967        0.971    0.966
ϵ = info(mynn)["loss_per_epoch"]
plot(1:length(ϵ),ϵ, xlabel="epoch",ylabel="avg. error",legend=nothing,title="Avg. error per epoch on the Sepal dataset")
heatmap(info(cm)["categories"],info(cm)["categories"],info(cm)["normalised_scores"],c=cgrad([:white,:blue]),xlabel="Predicted",ylabel="Actual", title="Confusion Matrix")

  • Other examples

Further examples, covering more models and more advanced techniques to improve predictions, are provided in the documentation tutorial. Basic examples in Python and R are given here. Very "micro" examples of the usage of the various functions can also be found in the unit tests available in the test folder.

Limitations and alternative packages

The focus of the library is skewed toward user-friendliness rather than computational efficiency. While the code is (relatively) easy to read, it is not heavily optimised, and currently all models operate on the CPU and only on data that fits in the computer's memory. For very large datasets we suggest the specialised packages listed below:

Category | Packages
ML toolkits/pipelines | ScikitLearn.jl, AutoMLPipeline.jl, MLJ.jl
Neural Networks | Flux.jl, Knet
Decision Trees | DecisionTree.jl
Clustering | Clustering.jl, GaussianMixtures.jl
Missing imputation | Impute.jl, Mice.jl

TODO

Short term

  • Implement autotuning of GaussianMixtureClusterer using BIC or AIC
  • Add Silhouette method to check cluster validity
  • Implement PAM and/or variants for kmedoids

Mid/Long term

  • Add RNN support and improve convolutional layers speed
  • Reinforcement learning (Markov decision processes)
  • Standardize data sampling in training
  • Convert to GPU

Contribute

Contributions to the library are welcome. We are particularly interested in the areas covered in the "TODO" list above, but we are open to other areas as well. Please, however, consider that the focus is mostly didactic/research, so clear, easy-to-read (and well-documented) code and a simple API with reasonable defaults are more important than highly optimised algorithms. For the same reason, it is fine to use verbose names. Please open an issue to discuss your ideas, or directly make a well-documented pull request to the repository. While not required by any means, if you are customising BetaML and writing, for example, your own neural network layer type (by subclassing AbstractLayer), your own sampler (by subclassing AbstractDataSampler) or your own mixture component (by subclassing AbstractMixture), please consider giving it back to the community and opening a pull request to integrate it in BetaML.

Citations

If you use BetaML please cite it as:

  • Lobianco, A., (2021). BetaML: The Beta Machine Learning Toolkit, a self-contained repository of Machine Learning algorithms in Julia. Journal of Open Source Software, 6(60), 2849, https://doi.org/10.21105/joss.02849
@article{Lobianco2021,
  doi       = {10.21105/joss.02849},
  url       = {https://doi.org/10.21105/joss.02849},
  year      = {2021},
  publisher = {The Open Journal},
  volume    = {6},
  number    = {60},
  pages     = {2849},
  author    = {Antonello Lobianco},
  title     = {BetaML: The Beta Machine Learning Toolkit, a self-contained repository of Machine Learning algorithms in Julia},
  journal   = {Journal of Open Source Software}
}

Acknowledgements

The development of this package at the Bureau d'Economie Théorique et Appliquée (BETA, Nancy) was supported by the French National Research Agency through the Laboratory of Excellence ARBRE, a part of the “Investissements d'Avenir” Program (ANR 11 – LABX-0002-01).


betaml.jl's People

Contributors

arfon, github-actions[bot], pallharaldsson, rikhuijzer, roland-ka, sylvaticus


betaml.jl's Issues

MLJ Interface is not working anymore

The code

modelType = @load RandomForestClassifier pkg = "BetaML" verbosity=1
mod = modelType(
    n_trees = 2,
    max_depth = 10
)

is not working in the latest version of BetaML.

Corner case for KernelPerceptronClassifier: unique target class

If only one class is seen in the training data, the model fits okay, but prediction fails. I wonder if this is something that could be supported. I encountered this issue when doing cross-validation for a very small binary classification problem (crabs).

using MLJ

Model = @load KernelPerceptronClassifier

model = Model()

X = (x=rand(10), );

y = coerce(collect("aaaaaaaaaab"), Multiclass)[1:10];

julia> unique(y)
1-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> levels(y)
2-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)

# works fine:
mach = machine(model, X, y) |> fit!;

# problem:
julia> predict_mode(mach, X)
ERROR: BoundsError: attempt to access 0-element Vector{Matrix{Float64}} at index [1]
Stacktrace:
 [1] getindex
   @ ./array.jl:861 [inlined]
 [2] predict(x::Matrix{Float64}, xtrain::Vector{Matrix{Float64}}, ytrain::Vector{Vector{Int64}}, α::Vector{Vector{Int64}}, classes::Vector{Char}; K::typeof(BetaML.Utils.radialKernel))
   @ BetaML.Perceptron ~/.julia/packages/BetaML/AeLyL/src/Perceptron/Perceptron.jl:622
 [3] predict(model::BetaML.Perceptron.KernelPerceptronClassifier, fitresult::Tuple{NamedTuple{(:x, :y, , :classes, :K), Tuple{Vector{Matrix{Float64}}, Vector{Vector{Int64}}, Vector{Vector{Int64}}, Vector{Char}, typeof(BetaML.Utils.radialKernel)}}, Vector{Char}}, Xnew::NamedTuple{(:x,), Tuple{Vector{Float64}}})                                                  
   @ BetaML.Perceptron ~/.julia/packages/BetaML/AeLyL/src/Perceptron/Perceptron_MLJ.jl:137
 [4] predict_mode(m::BetaML.Perceptron.KernelPerceptronClassifier, fitresult::Tuple{NamedTuple{(:x, :y, , :classes, :K), Tuple{Vector{Matrix{Float64}}, Vector{Vector{Int64}}, Vector{Vector{Int64}}, Vector{Char}, typeof(BetaML.Utils.radialKernel)}}, Vector{Char}}, Xnew::NamedTuple{(:x,), Tuple{Vector{Float64}}})                                                  
   @ MLJBase ~/MLJ/MLJBase/src/interface/model_api.jl:11
 [5] predict_mode(mach::Machine{BetaML.Perceptron.KernelPerceptronClassifier, true}, Xraw::NamedTuple{(:x,), Tuple{Vector{Float64}}})                                                  
   @ MLJBase ~/MLJ/MLJBase/src/operations.jl:85
 [6] top-level scope
   @ REPL[39]:1

Improve oneHotEncoder stability when encoding integers representing categories

julia> oneHotEncoder([-1,1,1])
ERROR: BoundsError: attempt to access 1-element Vector{Int64} at index [-1]
Stacktrace:
 [1] setindex!
   @ ./array.jl:903 [inlined]
 [2] oneHotEncoderRow(x::Int64; d::Int64, factors::UnitRange{Int64}, count::Bool)
   @ BetaML.Utils ~/.julia/packages/BetaML/cpTAz/src/Utils/Processing.jl:64
 [3] oneHotEncoder(Y::Vector{Int64}; d::Int64, factors::UnitRange{Int64}, count::Bool)
   @ BetaML.Utils ~/.julia/packages/BetaML/cpTAz/src/Utils/Processing.jl:127
 [4] oneHotEncoder(Y::Vector{Int64})
   @ BetaML.Utils ~/.julia/packages/BetaML/cpTAz/src/Utils/Processing.jl:121
 [5] top-level scope
   @ REPL[5]:1

julia> oneHotEncoder([-1,1,1],factors=[-1,1])
ERROR: BoundsError: attempt to access 1-element Vector{Int64} at index [-1]
Stacktrace:
 [1] setindex!
   @ ./array.jl:903 [inlined]
 [2] oneHotEncoderRow(x::Int64; d::Int64, factors::UnitRange{Int64}, count::Bool)
   @ BetaML.Utils ~/.julia/packages/BetaML/cpTAz/src/Utils/Processing.jl:64
 [3] oneHotEncoder(Y::Vector{Int64}; d::Int64, factors::Vector{Int64}, count::Bool)
   @ BetaML.Utils ~/.julia/packages/BetaML/cpTAz/src/Utils/Processing.jl:127
 [4] top-level scope
   @ REPL[6]:1

Random Forest does not appear to work

Could be doing something I'm not supposed to, but I can't seem to get this to work.

Platform details:

Julia: v1.5.1
BetaML: v0.3.0

Minimum example:

import BetaML

BetaML.Trees.buildForest(rand(100), rand(100))

ERROR: BoundsError: attempt to access (100,)
  at index [2]
Stacktrace:
 [1] indexed_iterate at .\tuple.jl:81 [inlined]
 [2] buildForest(::Array{Float64,1}, ::Array{Float64,1}, ::Int64; maxDepth::Int64, minGain::Float64, minRecords::Int64, maxFeatures::Int64, splittingCriterion::String, forceClassification::Bool) at C:\[...]\.julia\packages\BetaML\w0Pyx\src\Trees.jl:430
 [3] buildForest(::Array{Float64,1}, ::Array{Float64,1}, ::Int64) at C:\[...]\.julia\packages\BetaML\w0Pyx\src\Trees.jl:429 (repeats 2 times)
 [4] top-level scope at REPL[157]:1

Error during precompilation (ERROR: LoadError: InitError: Evaluation into the closed module `Perceptron` ...)

I wanted to use BetaML.jl in a project, however when I try doing so I get the following error:

julia> using Foo
[ Info: Precompiling Foo [4817f03b-69bd-4595-9d0a-a711fd8a192f]
ERROR: LoadError: InitError: Evaluation into the closed module `Perceptron` breaks incremental compilation because the side effects will not be permanent. This is likely due to some other module mutating `Perceptron` with `eval` during precompilation - don't do this.
Stacktrace:
  [1] eval
    @ ./boot.jl:368 [inlined]
  [2] eval(x::Expr)
    @ BetaML.Perceptron ~/.julia/packages/BetaML/mqBvh/src/Perceptron/Perceptron.jl:19
  [3] metadata_pkg(T::Type; name::String, uuid::String, url::String, julia::Bool, license::String, is_wrapper::Bool, package_name::String, package_uuid::String, package_url::String, is_pure_julia::Bool, package_license::String)
    @ MLJModelInterface ~/.julia/packages/MLJModelInterface/wwFA9/src/metadata_utils.jl:54
  [4] #41
    @ ./broadcast.jl:1284 [inlined]
  [5] _broadcast_getindex_evalf
    @ ./broadcast.jl:670 [inlined]
  [6] _broadcast_getindex
    @ ./broadcast.jl:643 [inlined]
  [7] #29
    @ ./broadcast.jl:1075 [inlined]
  [8] macro expansion
    @ ./ntuple.jl:74 [inlined]
  [9] ntuple
    @ ./ntuple.jl:69 [inlined]
 [10] copy
    @ ./broadcast.jl:1075 [inlined]
 [11] materialize
    @ ./broadcast.jl:860 [inlined]
 [12] __init__()
    @ BetaML ~/.julia/packages/BetaML/mqBvh/src/BetaML.jl:63
 [13] _include_from_serialized(pkg::Base.PkgId, path::String, depmods::Vector{Any})
    @ Base ./loading.jl:831
 [14] _require_search_from_serialized(pkg::Base.PkgId, sourcepath::String, build_id::UInt64)
    @ Base ./loading.jl:1039
 [15] _require(pkg::Base.PkgId)
    @ Base ./loading.jl:1315
 [16] _require_prelocked(uuidkey::Base.PkgId)
    @ Base ./loading.jl:1200
 [17] macro expansion
    @ ./loading.jl:1180 [inlined]
 [18] macro expansion
    @ ./lock.jl:223 [inlined]
 [19] require(into::Module, mod::Symbol)
    @ Base ./loading.jl:1144
 [20] include
    @ ./Base.jl:419 [inlined]
 [21] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt64}}, source::Nothing)
    @ Base ./loading.jl:1554
 [22] top-level scope
    @ stdin:1
during initialization of module BetaML
in expression starting at /data_temp/picaud/Temp/Beta/Foo.jl/src/Foo.jl:1
in expression starting at stdin:1
ERROR: Failed to precompile Foo [4817f03b-69bd-4595-9d0a-a711fd8a192f] to /home/picaud/.julia/compiled/v1.8/Foo/jl_a1tr7Z.
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, keep_loaded_modules::Bool)
   @ Base ./loading.jl:1707
 [3] compilecache
   @ ./loading.jl:1651 [inlined]
 [4] _require(pkg::Base.PkgId)
   @ Base ./loading.jl:1337
 [5] _require_prelocked(uuidkey::Base.PkgId)
   @ Base ./loading.jl:1200
 [6] macro expansion
   @ ./loading.jl:1180 [inlined]
 [7] macro expansion
   @ ./lock.jl:223 [inlined]
 [8] require(into::Module, mod::Symbol)
   @ Base ./loading.jl:1144

The error is not present when I disable precompilation; the BetaML.jl "patch" is:

# function __init__()
#     MMI.metadata_pkg.(MLJ_INTERFACED_MODELS,
#         name       = "BetaML",
#         uuid       = "024491cd-cc6b-443e-8034-08ea7eb7db2b",     # see your Project.toml
#         url        = "https://github.com/sylvaticus/BetaML.jl",  # URL to your package repo
#         julia      = true,     # is it written entirely in Julia?
#         license    = "MIT",    # your package license
#         is_wrapper = false,    # does it wrap around some other package?
#     )
# end

Steps to reproduce:

Create a local package Foo (in /tmp/, for example):

cd /tmp

(@v1.8) pkg> generate Foo.jl
  Generating  project Foo:
    Foo.jl/Project.toml
    Foo.jl/src/Foo.jl

(@v1.8) pkg> activate ./Foo.jl/
  Activating project at `/tmp/Foo.jl`

(Foo) pkg> add BetaML

(Foo) pkg> activate 
  Activating project at `~/.julia/environments/v1.8`

(@v1.8) pkg> dev ./Foo.jl/
   Resolving package versions...

Then modify Foo.jl as follows:

module Foo

using BetaML # <---- here

greet() = print("Hello World!")

end # module Foo

Then from Julia type

julia> using Foo

and I (and maybe you) will get the error I mentioned at the beginning.


Thanks!

MLJ model docstrings

I notice that examples in the docstrings use the predict and fit from MLJModelInterface (which are not exported by MLJ, and not intended for use by the general MLJ user) rather than the machine-based fit!, predict, etc. methods exported by MLJ. In this respect, these model docstrings differ from all the other MLJ model docstrings, so I'd consider them "non-compliant".

I understand this is some work to correct. Still, it would be great, for uniformity, to have these changed.

BetaML v0.11.0 Gaussian Mixture Model not compatible with MLJ

I recently updated my packages and noticed that I couldn't create an MLJ machine with the Gaussian Mixture Model with BetaML v0.11.0. The older version v0.10.4 is working fine. I have not checked whether this is true for other models in BetaML.

Reproducible example:

julia> using MLJ

julia> GMM = MLJ.@load GaussianMixtureClusterer pkg=BetaML verbosity=0
BetaML.GMM.GaussianMixtureClusterer

julia> machine(GMM(), rand(100, 10))
ERROR: MethodError: no method matching machine(::BetaML.GMM.GaussianMixtureClusterer, ::Matrix{Float64})

Closest candidates are:
  machine(::Type{<:Model}, ::Any...; kwargs...)
   @ MLJBase ~/.julia/packages/MLJBase/mIaqI/src/machines.jl:336
  machine(::Static, ::Any...; cache, kwargs...)
   @ MLJBase ~/.julia/packages/MLJBase/mIaqI/src/machines.jl:340
  machine(::Union{Symbol, Model}, ::Any, ::AbstractNode, ::AbstractNode...; kwargs...)
   @ MLJBase ~/.julia/packages/MLJBase/mIaqI/src/machines.jl:359
  ...

Stacktrace:
 [1] top-level scope
   @ REPL[4]:1

julia> using Pkg

julia> Pkg.status()
Project MLJ_debug v0.1.0
Status `~/tmp/MLJ_debug/Project.toml`
  [024491cd] BetaML v0.11.0
  [add582a8] MLJ v0.20.2

WARNING: could not import Perceptron ...

During precompilation I encountered some warnings:

[ Info: Precompiling BetaML [024491cd-cc6b-443e-8034-08ea7eb7db2b]
WARNING: could not import Perceptron.KernelPerceptron into BetaML
WARNING: could not import Perceptron.KernelPerceptronHyperParametersSet into BetaML
WARNING: could not import Perceptron.Pegasos into BetaML
WARNING: could not import Perceptron.PegasosHyperParametersSet into BetaML

Maybe BetaML is importing names from Perceptron module that no longer exist?

Deprecation warning from ProgressMeter.jl

┌ BetaML [024491cd-cc6b-443e-8034-08ea7eb7db2b]
│ ┌ Warning: Progress(n::Integer, dt::Real, desc::AbstractString = "Progress: ", barlen = nothing, color::Symbol = :green, output::IO = stderr; offset::Integer = 0) is deprecated, use Progress(n; dt = dt, desc = desc, barlen = barlen, color = color, output = output, offset = offset) instead.
│ │ caller = ip:0x0
│ └ @ Core :-1

`target_scitype` for MultitargetNeuralNetworkRegressor is too broad

Current scitype:

 target_scitype =
     AbstractVecOrMat{<:Union{ScientificTypesBase.Continuous, ScientificTypesBase.Count}},

which allows a vector as target. But using a vector throws an error:

model = BetaML.Nn.MultitargetNeuralNetworkRegressor();
X, y = make_regression();         # y is vector here
mach = machine(model, X, y)
fit!(mach)
[ Info: Training machine(MultitargetNeuralNetworkRegressor(layers = nothing, ), ).
┌ Error: Problem fitting the machine machine(MultitargetNeuralNetworkRegressor(layers = nothing, ), ). 
└ @ MLJBase ~/.julia/packages/MLJBase/97P9U/src/machines.jl:682
[ Info: Running type checks... 
[ Info: Type checks okay. 
ERROR: The label should have multiple dimensions. Use `NeuralNetworkRegressor` for single-dimensional outputs.
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] fit(m::BetaML.Nn.MultitargetNeuralNetworkRegressor, verbosity::Int64, X::Tables.MatrixTable{Matrix{Float64}}, y::Vector{Float64})                                    
   @ BetaML.Nn ~/.julia/packages/BetaML/mWUwE/src/Nn/Nn_MLJ.jl:206
 [3] fit_only!(mach::Machine{BetaML.Nn.MultitargetNeuralNetworkRegressor, true}; rows::Nothing, verbosity::Int64, force::Bool, composite::Nothing)                                 
   @ MLJBase ~/.julia/packages/MLJBase/97P9U/src/machines.jl:680
 [4] fit_only!
   @ ~/.julia/packages/MLJBase/97P9U/src/machines.jl:606 [inlined]
 [5] #fit!#63
   @ ~/.julia/packages/MLJBase/97P9U/src/machines.jl:778 [inlined]
 [6] fit!(mach::Machine{BetaML.Nn.MultitargetNeuralNetworkRegressor, true})
   @ MLJBase ~/.julia/packages/MLJBase/97P9U/src/machines.jl:775
 [7] top-level scope
   @ REPL[31]:1

One might also want to support tabular y here, which is what other MLJ multitarget models support.

MLJ traits for GMMClusterer

The GMMClusterer is an unsupervised probabilistic model. However we can't check that programmatically because of JuliaAI/MLJModelInterface.jl#120

Is there any fix to make sure that both KMeans and GMMClusterer return a set of categorical values? Right now predict(Kmeans(), ...) will return a vector of categorical values whereas predict(GMMClusterer(), ...) will return a vector of distributions.
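A possible user-side workaround, sketched below under the assumption that K is the number of mixture components and that taking the mode of each predicted distribution is acceptable, is to collapse the distributions returned by predict into categorical labels:

using MLJ
GMM  = @load GMMClusterer pkg=BetaML verbosity=0
X    = (x1 = rand(50), x2 = rand(50))
mach = machine(GMM(K=3), X) |> fit!
yhat = mode.(predict(mach, X))   # mode of each distribution -> categorical labels, as for KMeans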

Avoid observation-by-observation construction of UnivariateFinite objects in MLJ interface

The overhead for constructing UnivariateFinite objects one at a time is very high. For this reason a UnivariateFiniteArray implementation of AbstractArray{<:UnivariateFinite} was developed. This includes optimised implementations of broadcasting pdf, and so forth.

I recommend that in the BetaML classifiers one constructs probabilistic predictions by applying the UnivariateFinite(...) constructor (which can construct arrays as well as singletons) to the full matrix of probabilities (with all observations in it). You can see examples of this in all the MLJ probabilistic classifier interfaces. I am copying the docstring for this constructor below:

cc @OkonSamuel


UnivariateFinite(support,
                 probs;
                 pool=nothing,
                 augmented=false,
                 ordered=false)

Construct a discrete univariate distribution whose finite support is
the elements of the vector support, and whose corresponding
probabilities are elements of the vector probs. Alternatively,
construct an abstract array of UnivariateFinite distributions by
choosing probs to be an array of one higher dimension than the array
generated.

Unless pool is specified, support should have type
AbstractVector{<:CategoricalValue} and all elements are assumed to
share the same categorical pool, which may be larger than support.

Important. All levels of the common pool have associated
probabilities, not just those in the specified support. However,
these probabilities are always zero (see example below).

If probs is a matrix, it should have a column for each class in
support (or one less, if augment=true). More generally, probs
will be an array whose size is of the form (n1, n2, ..., nk, c),
where c = length(support) (or one less, if augment=true) and the
constructor then returns an array of size (n1, n2, ..., nk).

using CategoricalArrays
v = categorical([:x, :x, :y, :x, :z])

julia> UnivariateFinite(classes(v), [0.2, 0.3, 0.5])
UnivariateFinite{Multiclass{3}}(x=>0.2, y=>0.3, z=>0.5)

julia> d = UnivariateFinite([v[1], v[end]], [0.1, 0.9])
UnivariateFinite{Multiclass{3}}(x=>0.1, z=>0.9)

julia> rand(d, 3)
3-element Array{Any,1}:
 CategoricalArrays.CategoricalValue{Symbol,UInt32} :z
 CategoricalArrays.CategoricalValue{Symbol,UInt32} :z
 CategoricalArrays.CategoricalValue{Symbol,UInt32} :z

julia> levels(d)
3-element Array{Symbol,1}:
 :x
 :y
 :z

julia> pdf(d, :y)
0.0

Specifying a pool

Alternatively, support may be a list of raw (non-categorical)
elements if pool is:

  • some CategoricalArray, CategoricalValue or CategoricalPool,
    such that support is a subset of levels(pool)

  • missing, in which case a new categorical pool is created which has
    support as its only levels.

In the last case, specify ordered=true if the pool is to be
considered ordered.

julia> UnivariateFinite([:x, :z], [0.1, 0.9], pool=missing, ordered=true)
UnivariateFinite{OrderedFactor{2}}(x=>0.1, z=>0.9)

julia> d = UnivariateFinite([:x, :z], [0.1, 0.9], pool=v) # v defined above
UnivariateFinite(x=>0.1, z=>0.9) (Multiclass{3} samples)

julia> pdf(d, :y) # allowed as `:y in levels(v)`
0.0

v = categorical([:x, :x, :y, :x, :z, :w])
probs = rand(100, 3)
probs = probs ./ sum(probs, dims=2)
julia> UnivariateFinite([:x, :y, :z], probs, pool=v)
100-element UnivariateFiniteVector{Multiclass{4},Symbol,UInt32,Float64}:
 UnivariateFinite{Multiclass{4}}(x=>0.194, y=>0.3, z=>0.505)
 UnivariateFinite{Multiclass{4}}(x=>0.727, y=>0.234, z=>0.0391)
 UnivariateFinite{Multiclass{4}}(x=>0.674, y=>0.00535, z=>0.321)
   ⋮
 UnivariateFinite{Multiclass{4}}(x=>0.292, y=>0.339, z=>0.369)

Probability augmentation

Unless augment=true, sums of elements along the last axis (row sums
in the case of a matrix) must be equal to one. Otherwise, such an
array is created by inserting appropriate elements ahead of those
provided, which means the provided probabilities are associated with
the classes c2, c3, ..., cn.


UnivariateFinite(prob_given_class; pool=nothing, ordered=false)

Construct a discrete univariate distribution whose finite support is
the set of keys of the provided dictionary, prob_given_class, and
whose values specify the corresponding probabilities.

The type requirements on the keys of the dictionary are the same as
the elements of support given above with this exception: if
non-categorical elements (raw labels) are used as keys, then
pool=... must be specified and cannot be missing.

If the values (probabilities) are arrays instead of scalars, then an
abstract array of UnivariateFinite elements is created, with the
same size as the array.

FYI: NNlib.jl; depend on it?

Hi,

So far I have only contributed activation functions to your project; I don't actually use it (or the others). I would like to help the one project where it makes the biggest impact, or one central place, and this may be it:

FluxML/NNlib.jl#224

The input scitypes for trees are incorrect

From here:

    input_scitype    = MMI.Table(MMI.Missing, MMI.Known),           # also ok: MMI.Table(Union{MMI.Missing, MMI.Known}),

What is written in the comment is correct. What is actually used is not:

julia> X = (; x=[missing, 1, 2])
(x = Union{Missing, Int64}[missing, 1, 2],)

julia> scitype(X) <: Table(Missing, Known)
false

julia> scitype(X) <: Table(Union{Missing, Known})
true

For more on the Table scitype constructor, see here.

All the tree scitypes need changing.

Sorry that I did not pick this up in my review.
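For reference, a minimal sketch of the corrected trait declaration, in line with what the code comment already says (the call shown here is illustrative, not the actual BetaML source):

import MLJModelInterface as MMI

# `DecisionTreeRegressor` stands for the BetaML MLJ model type; the same change
# would apply to all the other tree models.
MMI.metadata_model(DecisionTreeRegressor,
    input_scitype = MMI.Table(Union{MMI.Missing, MMI.Known}))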

Scaler() of vectors (instead of matrices) results in an error

julia> fit!(Scaler(),[1,10,100])
ERROR: BoundsError: attempt to access Tuple{Int64} at index [2]
Stacktrace:
 [1] indexed_iterate
   @ ./tuple.jl:88 [inlined]
 [2] _fit(m::StandardScaler, skip::Vector{Int64}, X::Vector{Int64}, cache::Bool)
   @ BetaML.Utils ~/.julia/dev/BetaML/src/Utils/Processing.jl:645
 [3] fit!(m::Scaler, x::Vector{Int64})
   @ BetaML.Utils ~/.julia/dev/BetaML/src/Utils/Processing.jl:860
 [4] top-level scope
   @ REPL[17]:1

Trouble interpolating feature names in a wrapped tree

What am I missing here?

using MLJ
import BetaML.Trees
import DataFrames as DF

table = OpenML.load(42638)
df = DF.select(DF.DataFrame(table), DF.Not(:cabin))

cleaner = FillImputer()
machc = machine(cleaner, df) |> fit!
dfc     =  transform(machc, df)

y, X = unpack(dfc, ==(:survived))

Tree = @load DecisionTreeClassifier pkg=BetaML
tree = Tree(max_depth=3)
mach = machine(tree, X, y) |> fit!

raw_tree = fitted_params(mach).fitresult[1]
wrapped_tree = Trees.wrap(raw_tree, (feature_names=DF.names(X),))

# 2 == female?
# ├─ 1 == 3?
# │  ├─ "1" => 0.5
# │  │  "0" => 0.5
# │  │
# │  └─ "1" => 0.9470588235294117
# │     "0" => 0.052941176470588235
#
# └─ 3 >= 7.0?
#    ├─ "1" => 0.16817359855334538
#    │  "0" => 0.8318264014466547
#
#    └─ "1" => 0.6666666666666666
#       "0" => 0.3333333333333333

cc @roland-KA

Allow `verbosity` to be any integer?

I was surprised, when re-running some ecosystem-wide integration tests, to get this message when training these models using the MLJ interface (MultitargetNeuralNetworkRegressor, NeuralNetworkRegressor):

Wrong verbosity level. Verbosity must be either 0, 10, 20, 30 or 40

I was probably using verbosity = -1 to suppress warnings.

I understand the MLJ spec is mostly silent on this, but in practice the rule has been: "With the exception of warnings, training should be silent if verbosity == 0. Lower values should suppress warnings", and I would add "any integer should be allowed".

Perhaps in the MLJ interface for the BetaML models one could map

<= 0 -> 0
1 -> 10
2 -> 20
3 -> 30
>= 5 -> 40

or similar?
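A sketch of such a mapping as a small helper function (hypothetical; not part of BetaML, and the treatment of values above 3 is just one possible reading of the proposal):

# map an arbitrary MLJ integer verbosity to the discrete levels BetaML accepts
function mlj_to_betaml_verbosity(v::Integer)
    v <= 0 && return 0
    v == 1 && return 10
    v == 2 && return 20
    v == 3 && return 30
    return 40            # anything higher maps to the most verbose level
end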

Separate into subpackages?

Specifically, I think separating the modules in this into subpackages (i.e. reexported as part of a larger overall BetaML package) would help a lot with discoverability; for instance, the problem I mentioned earlier of people in the stats community having lots of trouble finding the imputation methods here.

Error generating MLJ model registry

Running MLJModels.@update to update MLJ's model registry is running into this new error:

ERROR: LoadError: Bad `load_path` trait for BetaML.Imputation.BetaMLGMMImputer: BetaMLGMMImputer not a registered package. 
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] top-level scope
   @ ~/MLJ/MLJModels/src/registry/src/update.jl:122
 [3] eval
   @ ./boot.jl:373 [inlined]
 [4] eval(x::Expr)
   @ Base.MainInclude ./client.jl:453
 [5] _update(mod::Module, test_env_only::Bool)
   @ MLJModels.Registry ~/MLJ/MLJModels/src/registry/src/update.jl:153
 [6] var"@update"(__source__::LineNumberNode, __module__::Module)
   @ MLJModels.Registry ~/MLJ/MLJModels/src/registry/src/update.jl:24
in expression starting at REPL[4]:1

MLJ model `BetaMLGMMRegressor` predicting row vectors instead of column vectors

using MLJBase
using MLJModels
model = (@iload BetaMLGMMRegressor)()
X, y = make_regression();
mach = machine(model, X, y) |> fit!
yhat = predict(mach, X);

julia> l2(yhat, y)
ERROR: DimensionMismatch: Encountered two objects with sizes (100, 1) and (100,) which needed to match but don't. 
Stacktrace:
 [1] check_dimensions
   @ ~/.julia/packages/MLJBase/CtxrQ/src/utilities.jl:145 [inlined]
 [2] _check(measure::LPLoss{Int64}, yhat::Matrix{Float64}, y::Vector{Float64})
   @ MLJBase ~/.julia/packages/MLJBase/CtxrQ/src/measures/measures.jl:60
 [3] (::LPLoss{Int64})(::Matrix{Float64}, ::Vararg{Any})
   @ MLJBase ~/.julia/packages/MLJBase/CtxrQ/src/measures/measures.jl:126
 [4] top-level scope
   @ REPL[36]:1

initVariances! doesn't support mixed-type variances

As it is a generic (template-like) function, it is defined over a single eltype T of the mixtures vector.
It needs to be refactored to work with mixed cases (if one really needs different mixture types for the different classes).

"`findall` is ambiguous" error

While working with BetaML, DataFrames and Chain, I found that importing BetaML leads to ambiguity in findall when working with the @chain macro.

using Chain, DataFrames

import BetaML as BML

df = DataFrame(randn(100, 3), :auto)

# This works
transform(df, All() => ByRow((x...) -> sum(x)) => :y)

# This fails
@chain df begin
	transform(_, All() => ByRow((x...) -> sum(x)) => :y)
end

I am not sure what the correct solution would be. The error log suggests defining findall(::F, ::Array{T}) where {T, F<:Function} (a possible version of this method is sketched after the error log below), but I am not experienced in managing packages and therefore not sure whether one would have to keep other things in mind.

Here is the full error log:

LoadError: MethodError: findall(::Chain.var"#4#5", ::Vector{Any}) is ambiguous.

Candidates:
  findall(testf::Function, A)
    @ Base array.jl:2439
  findall(testf::F, A::AbstractArray) where F<:Function
    @ Base array.jl:2447
  findall(el::T, cont::Array{T}; returnTuple) where T
    @ BetaML.Utils ~/.julia/packages/BetaML/QcevM/src/Utils/Processing.jl:73

Possible fix, define
  findall(::F, ::Array{T}) where {T, F<:Function}
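A sketch of the disambiguating method suggested by the error message (hypothetical; it simply reproduces Base's behaviour for Function predicates, and being more specific than both conflicting methods it wins the dispatch):

Base.findall(testf::F, cont::Array{T}) where {T, F<:Function} =
    collect(first(p) for p in pairs(cont) if testf(last(p)))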

Problem with MLJ interface for KMedoidsClusterer

import BetaML
using MLJTestInterface

@testset "generic mlj interface test" begin
    f, s = MLJTestInterface.test(
        [BetaML.Bmlj.KMeansClusterer,],
        MLJTestInterface.make_regression()[1];
        mod=@__MODULE__,
        verbosity=0, # bump to debug
        throw=true, # set to true to debug (`false` in CI)
    )
    @test isempty(f)
end

# generic mlj interface test: Error During Test at REPL[11]:1
#   Got exception outside of a @test
#   UndefVarError: `fitresults` not defined
#   Stacktrace:
#     [1] attempt(f::MLJTestInterface.var"#9#10"{BetaML.Bmlj.KMeansClusterer, Tuple{@NamedTuple{Rm::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}, LStat::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}}}}, message::String; throw::Bool)

< parts omitted for clarity >
  
#   caused by: UndefVarError: `fitresults` not defined
#   Stacktrace:
#     [1] fitted_params(model::BetaML.Bmlj.KMeansClusterer, fitresult::@NamedTuple{classes::Vector{Int64}, centers::Matrix{Float64}, distanceFunction::BetaML.Bmlj.var"#13#15"})
#       @ BetaML.Bmlj ~/.julia/packages/BetaML/SPPMQ/src/Bmlj/Clustering_mlj.jl:175
#     [2] fitted_params(mach::MLJBase.Machine{BetaML.Bmlj.KMeansClusterer, true})
#       @ MLJBase ~/.julia/packages/MLJBase/mIaqI/src/machines.jl:820
#     [3] (::MLJTestInterface.var"#9#10"{BetaML.Bmlj.KMeansClusterer, Tuple{@NamedTuple{Rm::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}, LStat::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}}}})()
#       @ MLJTestInterface ~/.julia/packages/MLJTestInterface/6i2JH/src/attemptors.jl:85
#     [4] attempt(f::MLJTestInterface.var"#9#10"{BetaML.Bmlj.KMeansClusterer, Tuple{@NamedTuple{Rm::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}, LStat::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}}}}, message::String; throw::Bool)
#       @ MLJTestInterface ~/.julia/packages/MLJTestInterface/6i2JH/src/attemptors.jl:15
#     [5] #fitted_machine#8
#       @ ~/.julia/packages/MLJTestInterface/6i2JH/src/attemptors.jl:77 [inlined]
#     [6] fitted_machine
#       @ ~/.julia/packages/MLJTestInterface/6i2JH/src/attemptors.jl:75 [inlined]
#     [7] test(model_types::Vector{DataType}, data::@NamedTuple{Rm::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}, LStat::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}}; mod::Module, level::Int64, throw::Bool, verbosity::Int64)                                                       
#       @ MLJTestInterface ~/.julia/packages/MLJTestInterface/6i2JH/src/test.jl:202
#     [8] macro expansion
#       @ REPL[11]:2 [inlined]
#     [9] macro expansion
#       @ /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
#    [10] top-level scope
#       @ REPL[11]:2
#    [11] eval
#       @ Core ./boot.jl:385 [inlined]
#    [12] eval_user_input(ast::Any, backend::REPL.REPLBackend, mod::Module)
#       @ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:150                                                         
#    [13] repl_backend_loop(backend::REPL.REPLBackend, get_module::Function)
#       @ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:246
#    [14] start_repl_backend(backend::REPL.REPLBackend, consumer::Any; get_module::Function)
#       @ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:231
#    [15] run_repl(repl::AbstractREPL, consumer::Any; backend_on_current_task::Bool, backend::
# Any)                                                                                       
#       @ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:389
#    [16] run_repl(repl::AbstractREPL, consumer::Any)
#       @ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:375
#    [17] (::Base.var"#1013#1015"{Bool, Bool, Bool})(REPL::Module)
#       @ Base ./client.jl:432
#    [18] #invokelatest#2
#       @ Base ./essentials.jl:887 [inlined]
#    [19] invokelatest
#       @ Base ./essentials.jl:884 [inlined]
#    [20] run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_file::Bool, color_set::Bool)                                                                            
#       @ Base ./client.jl:416
#    [21] exec_options(opts::Base.JLOptions)
#       @ Base ./client.jl:333
#    [22] _start()
#       @ Base ./client.jl:552
# Test Summary:              | Error  Total  Time
# generic mlj interface test |     1      1  6.6s
# ERROR: Some tests did not pass: 0 passed, 0 failed, 1 errored, 0 broken.
 

Cosine distance

Is there not a typing error here?

"""Cosine distance"""
cosine_distance(x,y) = dot(x,y)/(norm(x)*norm(y))
"""

"""Cosine distance"""
cosine_distance(x,y) = dot(x,y)/(norm(x)*norm(y))
"""

I guess it should be:

"""Cosine distance"""
cosine_distance(x,y) = 1 - dot(x,y)/(norm(x)*norm(y))
"""

(if I have understood correctly what you meant by "cosine distance")
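A quick sanity check of the corrected definition (a standalone sketch, not BetaML code):

using LinearAlgebra

cosine_distance(x, y) = 1 - dot(x, y) / (norm(x) * norm(y))

cosine_distance([1, 0], [2, 0])    # ≈ 0.0  parallel vectors are maximally similar
cosine_distance([1, 0], [0, 1])    # ≈ 1.0  orthogonal vectors
cosine_distance([1, 0], [-1, 0])   # ≈ 2.0  opposite vectors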

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Rename/Alias `GeneralImputer` to `MICE`

The algorithm listed as GeneralImputer here is more widely-known as MICE (Multiple imputation by chained equations) in statistics. I'm not sure if the name used here is standard in ML, but the lack of a solid MICE implementation is a common complaint in the Julia statistics ecosystem, so I was very surprised to stumble across this pure-Julia implementation of MICE under a completely different name. Would it make sense to either rename or alias GeneralImputer to make this easier to discover?
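A minimal sketch of what an alias could look like on the user side (MICE is a hypothetical name, not currently defined in BetaML):

using BetaML

const MICE = GeneralImputer   # expose the model under the name the statistics community searches for
imputer = MICE()              # behaves exactly like GeneralImputer()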

Rename Nn to NN

Since you were renaming the package only two days ago, shouldn't NN also be used? It seems more appropriate as an acronym (like RBF).

I'm not sure it's advisable to have both (a new module in addition to the old). If you do change it (or even if you don't), could something like:

module Nn
    __init__() = error("deprecated: please do `using NN` instead")
end

work, or vice versa?

Add MLJ-compliant document strings

We are currently implementing detailed docstrings for all MLJ models, following a standard we have developed. See this issue: alan-turing-institute/MLJ.jl#913

@sylvaticus If it is helpful to you, @josephsdavid, who is helping us this summer as GSoD technical writer can prepare PRs for you to review. David is a working data scientist with some Julia knowledge. You will need to let me know soon if you would like this.

MLJ interface: fit should not mutate model fields

In MLJ, learned parameters are distinct from hyper-parameters. A "model" in MLJ is a container for hyper-parameters, and that is all.

For this reason, there should be no reason for MMI.fit to mutate model fields, and the original API forbade this (unfortunately, this rule seems to have disappeared from the docs: alan-turing-institute/MLJ.jl#755). Only clean! can mutate the fields, and only if they don't make sense. One exception is that fit may mutate an RNG.

So this is currently non-compliant:

using Pkg
Pkg.activate(temp=true)
Pkg.add("MLJBase")
Pkg.add(name="BetaML", rev="master")

using MLJBase
import BetaML

model = BetaML.Clustering.MissingImputator()
mixtures = deepcopy(model.mixtures)

X = [1 10.5;1.5 missing; 1.8 8; 1.7 15; 3.2 40; missing missing; 3.3 38;
     missing -2.3; 5.2 -2.4] |> MLJBase.table

mach = machine(model, X) |> fit!

julia> @assert model.mixtures == mixtures
ERROR: AssertionError: model.mixtures == mixtures
Stacktrace:
    [1] top-level scope at REPL[40]:1

Maybe MMI.fit can begin by creating a deepcopy of mixtures and p₀, in this and the related models.
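A minimal sketch of that proposal (the signature and field names follow the issue text; the body is only indicative and assumes import MLJModelInterface as MMI):

function MMI.fit(m::MissingImputator, verbosity, X)
    mixtures = deepcopy(m.mixtures)   # work on copies so the model's fields stay untouched
    p₀       = deepcopy(m.p₀)
    # ... run the EM algorithm using `mixtures` and `p₀`, then return (fitresult, cache, report) ...
end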

Scaler() of an Int matrix results in an error

julia> using BetaML

julia> fit!(Scaler(),[ 1 10 100; 2 20 200; 3 30 300])
ERROR: InexactError: Int64(-1.224744871391589)
Stacktrace:
  [1] Int64
    @ ./float.jl:900 [inlined]
  [2] convert
    @ ./number.jl:7 [inlined]
  [3] setindex!
    @ ./array.jl:971 [inlined]
  [4] macro expansion
    @ ./multidimensional.jl:932 [inlined]
  [5] macro expansion
    @ ./cartesian.jl:64 [inlined]
  [6] _unsafe_setindex!(::IndexLinear, ::Matrix{Int64}, ::Vector{Float64}, ::Base.Slice{Base.OneTo{Int64}}, ::Int64)
    @ Base ./multidimensional.jl:927
  [7] _setindex!
    @ ./multidimensional.jl:916 [inlined]
  [8] setindex!
    @ ./abstractarray.jl:1397 [inlined]
  [9] _fit(m::StandardScaler, skip::Vector{Int64}, X::Matrix{Int64}, cache::Bool)
    @ BetaML.Utils ~/.julia/dev/BetaML/src/Utils/Processing.jl:656
 [10] fit!(m::Scaler, x::Matrix{Int64})
    @ BetaML.Utils ~/.julia/dev/BetaML/src/Utils/Processing.jl:860
 [11] top-level scope
    @ REPL[15]:1

Example with GaussianMixtureClusterer

Can you please provide a full example with GaussianMixtureClusterer? I tried to instantiate the type but it is giving me an error saying m is not defined.

This code used to work:

using MLJ: @load

gmm = @load GMMClusterer pkg=BetaML verbosity=0

gmm(K=4)

Now I understand that the new model name is GaussianMixtureClusterer, but the construction is failing.
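For reference, a hedged sketch of what the construction might look like with the new model name (the hyper-parameter name below is an assumption; check the model's docstring for the actual keyword replacing the old K):

using MLJ
GMM = @load GaussianMixtureClusterer pkg=BetaML verbosity=0
gmm = GMM()                  # default hyper-parameters work in any case
gmm = GMM(n_classes = 4)     # assumed replacement for the old `K = 4` keyword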

MLJ interface for `KernelPerceptronClassifier` is not tracking all target levels

julia> using MLJ

julia> Model = @load KernelPerceptronClassifier
[ Info: For silent loading, specify `verbosity=0`. 
import BetaML ✔
BetaML.Perceptron.KernelPerceptronClassifier

julia> model = Model()
KernelPerceptronClassifier(
  K = BetaML.Utils.radialKernel, 
  maxEpochs = 100, 
  initialα = Int64[], 
  shuffle = false, 
  rng = Random._GLOBAL_RNG())

julia> X = (x=rand(10), );

julia> y = coerce(collect("abababababcc"), Multiclass)[1:10];

julia> unique(y)
2-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)

julia> levels(y)
3-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

julia> mach = machine(model, X, y) |> fit!;
[ Info: Training machine(KernelPerceptronClassifier(K = radialKernel, ), ).

julia> predict_mode(mach, X) |> levels
2-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)

That last result indicates a bug, as all levels in the pool of the training vector should be present in the pool of the predictions.

Curiously, in the other classifiers I looked at, the levels are indeed being tracked correctly. So perhaps have a look at, e.g., the BetaML DecisionTreeClassifier to see how this can be corrected.

This bug is causing a failure when the model is bagged in an ensemble using EnsembleModel because some classes are not present in some of the bagged observations, but are present in others.
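A sketch of the kind of fix used in the other classifiers (hypothetical variable names; the point is to pass the full training pool to the UnivariateFinite constructor so that declared-but-unseen levels are preserved):

# `probs` is the n × length(classes_seen) probability matrix from the core predictor,
# `classes_seen` are the classes observed during training, and `y_train` is the
# original categorical training target, which carries the complete pool.
MMI.UnivariateFinite(classes_seen, probs, pool = y_train)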
