mwydmuch / napkinxc

Extremely simple and fast extreme multi-class and multi-label classifiers.

License: MIT License

CMake 1.95% C++ 70.16% C 2.74% Shell 6.87% Python 18.28%

classification datasets extreme-classification hsm label-tree-classifiers machine-learning multi-class-classification multi-label-classification plt probabilistic-label-trees python xmlc

napkinxc's Issues

segmentation fault (possibly in kmeans)

Hi,

The following snippet gives segmentation fault on my machine:

from napkinxc.models import PLT
from napkinxc.datasets import load_dataset

trn_X, trn_Y = load_dataset('eurlex-4k', "train", verbose=1)
model = PLT('output/test', tree_type='hierarchicalKmeans',
            arity=32,
            seed=25,
            threads=4, verbose=1)
model.fit(trn_X, trn_Y)

The output is:

napkinXC 0.5.1 - train
  Model: output/test
    Type: plt
  Base models optimizer: liblinear
    Solver: L2R_LR_DUAL, eps: 0.1, cost: 10, max iter: 100, weights threshold: 0.1
  Tree type: hierarchicalKmeans, arity: 32, k-means eps: 0.0001, balanced: 1, weighted features: 0, max leaves: 100
  Threads: 4, memory limit: ~29G
  Seed: 25
Building tree ...
Computing labels' features matrix in 4 threads ...
Hierarchical K-Means clustering in 4 threads ...
[2]    20767 segmentation fault (core dumped)  python mwe.py

Would it be possible to fix this?

Other info:

napkinXC 0.5.1
Python 3.8.3
Ubuntu 16.04.7 LTS

string "amazontitles-3M" to "amazontitles-3m" in datasets.py

Hi,

the following code is giving an error:

from napkinxc.datasets import load_dataset

_ = load_dataset('AmazonTitles-3M', 'train')

ValueError: Dataset AmazonTitles-3M is not available

It should be easy to fix: just change the string amazontitles-3M to amazontitles-3m in the file python/napkinxc/datasets.py
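A more general guard against this class of bug is to normalise the requested name before looking it up. The sketch below is not the code in python/napkinxc/datasets.py; the `_DATASETS` registry and `resolve_dataset` helper are hypothetical names used only to show the idea of storing lower-case keys and lower-casing the user's input.

```python
# Hypothetical registry: the keys and values are illustrative only and do not
# mirror the real contents of python/napkinxc/datasets.py.
_DATASETS = {
    "eurlex-4k": "EURLex-4K",
    "wiki10-31k": "Wiki10-31K",
    "amazontitles-3m": "AmazonTitles-3M",
}


def resolve_dataset(name):
    """Look a dataset up case-insensitively; all registry keys are lower-case."""
    key = name.strip().lower()
    if key not in _DATASETS:
        raise ValueError(f"Dataset {name} is not available")
    return _DATASETS[key]
```

With every key stored in lower case, a mixed-case entry such as amazontitles-3M can no longer become unreachable.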

build failed

ub16hp@UB16HP:~/ub16_prj/napkinXML$ make
[ 6%] Building CXX object CMakeFiles/nxml.dir/src/main.cpp.o
In file included from /home/ub16hp/ub16_prj/napkinXML/src/main.cpp:8:0:
/home/ub16hp/ub16_prj/napkinXML/src/base.h: In member function ‘double Base::predictLoss(U*)’:
/home/ub16hp/ub16_prj/napkinXML/src/base.h:65:25: error: ‘pow’ is not a member of ‘std’
if(hingeLoss) val = std::pow(fmax(0, 1 - val), 2); // Hinge squared loss
^
/home/ub16hp/ub16_prj/napkinXML/src/base.h:65:49: error: there are no arguments to ‘fmax’ that depend on a template parameter, so a declaration of ‘fmax’ must be available [-fpermissive]
if(hingeLoss) val = std::pow(fmax(0, 1 - val), 2); // Hinge squared loss
^
/home/ub16hp/ub16_prj/napkinXML/src/base.h:65:49: note: (if you use ‘-fpermissive’, G++ will accept your code, but allowing the use of an undeclared name is deprecated)
/home/ub16hp/ub16_prj/napkinXML/src/base.h:66:32: error: there are no arguments to ‘exp’ that depend on a template parameter, so a declaration of ‘exp’ must be available [-fpermissive]
else val = log(1 + exp(-val)); // Log loss
^
/home/ub16hp/ub16_prj/napkinXML/src/base.h: In member function ‘double Base::predictProbability(U*)’:
/home/ub16hp/ub16_prj/napkinXML/src/base.h:74:50: error: there are no arguments to ‘exp’ that depend on a template parameter, so a declaration of ‘exp’ must be available [-fpermissive]
if(hingeLoss) val = 1.0 / (1.0 + exp(-2 * val)); // Probability for squared Hinge loss solver
^
/home/ub16hp/ub16_prj/napkinXML/src/base.h:75:37: error: there are no arguments to ‘exp’ that depend on a template parameter, so a declaration of ‘exp’ must be available [-fpermissive]
else val = 1.0 / (1.0 + exp(-val)); // Probability
^
CMakeFiles/nxml.dir/build.make:86: recipe for target 'CMakeFiles/nxml.dir/src/main.cpp.o' failed
make[2]: *** [CMakeFiles/nxml.dir/src/main.cpp.o] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/nxml.dir/all' failed
make[1]: *** [CMakeFiles/nxml.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2

pickling models

Hi,

I'm trying to use napkinXC under ray, which relies on pickle for data serialization.

It seems that napkinXC models cannot be pickled.

For instance,

import pickle
from napkinxc.models import PLT

model = PLT('/tmp/something/')
pickle.dump(model, open('/tmp/some-pickle.pkl', 'wb'))

gives:

TypeError: cannot pickle 'napkinxc._napkinxc.CPPModel' object.

Is there any workaround or any plan to support pickling for this issue?
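One possible workaround, sketched under assumptions, is to pickle only the constructor arguments and rebuild the native object on demand. `PicklableModel` below is a hypothetical wrapper, not part of napkinXC. It also assumes that re-creating the model with the same arguments (for example the same model directory) is enough to get back to a usable state, which would need to be verified for a trained PLT.

```python
class PicklableModel:
    """Hypothetical wrapper: pickles constructor arguments, not the native handle."""

    def __init__(self, factory, *args, **kwargs):
        self._factory = factory   # e.g. a model class; must itself be picklable
        self._args = args
        self._kwargs = kwargs
        self._model = None        # native object, created lazily and never pickled

    @property
    def model(self):
        # Build the underlying object on first access (also after unpickling).
        if self._model is None:
            self._model = self._factory(*self._args, **self._kwargs)
        return self._model

    def __getstate__(self):
        # Copy the instance dict but drop the handle that cannot be pickled.
        state = self.__dict__.copy()
        state["_model"] = None
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
```

Under these assumptions, usage would look like `wrapped = PicklableModel(PLT, '/tmp/something/')` followed by `wrapped.model.fit(...)`. Only the class reference and the path travel through pickle, so each worker re-creates its own model object.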

OOM/SegFault issues?

PLTs train extremely quickly using this implementation which is fantastic to see. However, I have run into a few issues when training on larger datasets:

There is no batching method by default, so very large matrices must be held in memory in order to train the model. I assume the way to avoid this in-memory issue is FitOnFile.
Even if the training data fits comfortably in memory, at larger sizes such as >1 million training data points with >10k labels, the Python kernel crashes, which I assume is due to an OOM or segfault error on the C++ side. It feels like there must be a memory leak somewhere, as the actual trees themselves never get that large, and I assume that internally the model trains in batches as outlined in the paper.
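To judge whether the inputs alone could explain an out-of-memory crash, a back-of-the-envelope estimate of the matrix sizes can help. The sketch below compares a dense float32 matrix with a CSR matrix (values, column indices, row pointers). The figure of 5 labels per row is an illustrative assumption, not a number from this report, and the helper names are made up for the example.

```python
def dense_bytes(n_rows, n_cols, value_bytes=4):
    """Bytes needed for a dense matrix (float32 by default)."""
    return n_rows * n_cols * value_bytes


def csr_bytes(n_rows, nnz, value_bytes=4, index_bytes=4):
    """Bytes needed for a CSR matrix: values + column indices + row pointers."""
    return nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes


def gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 2**30


n_rows, n_labels = 1_000_000, 10_000
labels_per_row = 5  # illustrative assumption

dense_y = dense_bytes(n_rows, n_labels)
sparse_y = csr_bytes(n_rows, n_rows * labels_per_row)
print(f"dense label matrix: {gib(dense_y):.2f} GiB")
print(f"CSR label matrix:   {gib(sparse_y):.3f} GiB")
```

At this scale a dense one-hot label matrix needs 40,000,000,000 bytes, while the CSR form needs about 44 MB, so it is worth checking that no step densifies the labels before suspecting a leak.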

a possible bug during kmeans initialization

Hi,

During kmeans initialization, the randomly generated row index can produce segfault.

napkinXC/src/models/kmeans.cpp

Line 54 in 6783031

std::uniform_int_distribution<int> dist(0, points);

napkinXC/src/models/kmeans.cpp

Line 56 in 6783031

 centroidsFeatures[i].add(pointsFeatures[dist(rng)]); // set centroid to this vector 

I suppose it should be:

 std::uniform_int_distribution<int> dist(0, points - 1);

Regards,
Han
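The off-by-one can be reproduced outside C++. `std::uniform_int_distribution<int>(a, b)` samples from the closed interval [a, b], and Python's `random.randint(a, b)` has the same inclusive semantics, so the sketch below uses it as a stand-in. The helper name and the tiny point count are made up for illustration.

```python
import random


def pick_rows(k, rng, upper):
    """Draw k row indices from the closed interval [0, upper]."""
    return [rng.randint(0, upper) for _ in range(k)]


rng = random.Random(25)
n_points = 3  # valid row indices are 0, 1 and 2

buggy = pick_rows(1000, rng, upper=n_points)      # can return 3: out of bounds
fixed = pick_rows(1000, rng, upper=n_points - 1)  # always a valid row index

print("largest index with upper = points:    ", max(buggy))
print("largest index with upper = points - 1:", max(fixed))
```

In Python an index equal to `n_points` would raise an IndexError. In C++ the same access on `pointsFeatures` is undefined behaviour, which is consistent with an intermittent segfault.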

Support for custom tree (tree_structure in python interface)

Hi,

Thanks for writing this software, which is very useful!

I'm currently experimenting with the effect of label trees and wish to load trees from file.

Is it possible to pass a string to the tree_structure parameter in models.PLT class, so that a custom tree can be loaded? It seems like the current Python interface does not support it.

If possible, I can make a pull request, and it would be nice if some instructions can be given, e.g., where and what to modify.

Cheers,
Han

Segmentation Fault

I'm using napkinXC on Linux with a custom dataset and am running into segfaults when trying to fit a PLT on the data. Attempting to print backtraces of the segfault results in unknown symbols as soon as the code in _napkin.cpp (calling fit on a CPPModel) is invoked, so I'm having trouble determining the true source of the issue.

My first best guess is that something went wrong with the installation of napkinXC.
The first thing I tried was installing napkinxc via pip by name, then via the .git link when that failed.
When that failed, I tried downloading the git repository and running setup.py, but that didn't work either.
The default C++ standard for gcc in the environment is C++14, but the environment also supports C++ as a non-default option.
The gcc version is 9.4.0, and the CMake version is 3.16.3.

My second best guess is that because of an issue with the custom dataset, the .cpp code attempts to access an out of bounds location and encounters a segfault.
For the input of .fit(), I'm using a numpy matrix of embedding vectors and a list of lists of numerical ground truth labels as input, which are generated from reading csv files.

X_data = pd.read_csv('embeddings_new_test.csv').to_numpy().astype(np.float32)
Y_data_str = [label.replace('[', '').replace(']','').replace(' ','').split(',') for label in pd.read_csv('labels_new_test.csv').to_string(header=False, index=False).split('\n')]
Y_data = []
for data_list in Y_data_str:
    Y_data.append([int(num) for num in data_list])

Sample embedding vector before np matrix conversion (dimensionality: (1, 44)):

1.1853809,1.8049561,-0.21211958,-4.1932855,-0.33534464,-2.9588652,-3.864022,-5.564808,1.8993871,4.2785244,4.9306583,3.9468246,-1.4078596,2.48531,1.8727794,0.7343951,-2.820231,0.28361112,2.3047895,2.7313123,1.7561926,4.286616,1.871469,-1.2939689,3.575691,1.7148826,2.4899118,-3.9518876,2.0022254,2.736418,-4.215009,-3.3079152,-1.2123864,-1.5709529,-0.20246193,-2.4258933,-2.386864,-2.19349,-2.4682508,1.5998758,-2.934224,-2.6331096,-3.2446184,2.9059627

Sample label list:
[110, 132, 143, 125, 167]

I'd be happy to provide any additional details if I happened to miss something important. Thanks for your time.
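To test the second guess before digging into the C++ side, the inputs can be checked in pure Python. The sketch below is only a set of generic sanity checks, not a statement of what napkinXC requires. `parse_labels` and `check_xy` are hypothetical helpers, and the rule that labels must be non-negative integers is an assumption. Separately, calling `faulthandler.enable()` from the standard library at the top of the script makes Python print its own traceback when the process receives a segfault.

```python
import ast
import math


def parse_labels(cell):
    """Parse a string such as '[110, 132, 143]' into a list of ints."""
    values = ast.literal_eval(cell.strip())
    return [int(v) for v in values]


def check_xy(X, Y):
    """Return a list of human-readable problems found in features X and labels Y."""
    problems = []
    if len(X) != len(Y):
        problems.append(f"row count mismatch: {len(X)} feature rows, {len(Y)} label rows")
    widths = {len(row) for row in X}
    if len(widths) > 1:
        problems.append(f"inconsistent feature dimensionality: {sorted(widths)}")
    for i, row in enumerate(X):
        if not all(math.isfinite(v) for v in row):
            problems.append(f"row {i}: non-finite feature value")
    for i, labels in enumerate(Y):
        if not labels:
            problems.append(f"row {i}: empty label list")
        if any(label < 0 for label in labels):
            problems.append(f"row {i}: negative label")
    return problems
```

Running `check_xy(X_data, Y_data)` before `fit` and printing the result would show whether, for example, a header row or a blank line in the CSV produced a different number of feature rows and label rows.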

pip install napkinxc failed

I've been trying to install napkinxc via pip and have repeatedly run into the below error. Any idea what could be wrong? Thanks.

(base) atl436user1:~ user.name$ pip install napkinxc
Collecting napkinxc
Using cached napkinxc-0.4.0.tar.gz (142 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing wheel metadata ... done
Requirement already satisfied: sklearn in ./anaconda3/lib/python3.7/site-packages/sklearn-0.0-py3.7.egg (from napkinxc) (0.0)
Requirement already satisfied: scipy in ./anaconda3/lib/python3.7/site-packages (from napkinxc) (1.3.1)
Requirement already satisfied: numpy in ./anaconda3/lib/python3.7/site-packages (from napkinxc) (1.18.1)
Requirement already satisfied: scikit-learn in ./anaconda3/lib/python3.7/site-packages (from sklearn->napkinxc) (0.22.1)
Requirement already satisfied: joblib>=0.11 in ./anaconda3/lib/python3.7/site-packages (from scikit-learn->sklearn->napkinxc) (0.14.1)
Building wheels for collected packages: napkinxc
Building wheel for napkinxc (PEP 517) ... error
ERROR: Command errored out with exit status 1:
command: /Users/alec.delany/anaconda3/bin/python /Users/user.name/anaconda3/lib/python3.7/site-packages/pip/_vendor/pep517/_in_process.py build_wheel /var/folders/p9/mx3qb0ms6_gf8jx7tq82xr14s6lmfx/T/tmpogk81v8q
cwd: /private/var/folders/p9/mx3qb0ms6_gf8jx7tq82xr14s6lmfx/T/pip-install-4bhkxsrq/napkinxc
Complete output (106 lines):
running bdist_wheel
running build
running build_py
-- downloading/updating pybind11
-- pybind11 directory found, pulling...
From https://github.com/pybind/pybind11

branch master -> FETCH_HEAD
--
Already on 'master'
-- Your branch is up to date with 'origin/master'.

-- pybind11 v2.6.0
-- Configuring done
-- Generating done
-- Build files have been written to: /private/var/folders/p9/mx3qb0ms6_gf8jx7tq82xr14s6lmfx/T/pip-install-4bhkxsrq/napkinxc/build
Scanning dependencies of target pynxc
[ 6%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/base.cpp.o
[ 9%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/args.cpp.o
[ 15%] Building C object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/blas/daxpy.c.o
[ 15%] Building C object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/blas/dnrm2.c.o
[ 15%] Building C object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/blas/ddot.c.o
[ 21%] Building C object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/blas/dscal.c.o
[ 21%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/data_reader.cpp.o
[ 31%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/data_readers/vw_reader.cpp.o
[ 31%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/data_readers/libsvm_reader.cpp.o
[ 31%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/liblinear/linear.cpp.o
[ 34%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/liblinear/tron.cpp.o
[ 37%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/log.cpp.o
[ 40%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/main.cpp.o
[ 43%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/measure.cpp.o
[ 46%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/misc.cpp.o
[ 50%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/model.cpp.o
[ 53%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/models/br.cpp.o
[ 56%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/models/extreme_text.cpp.o
[ 59%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/models/hsm.cpp.o
In file included from /private/var/folders/p9/mx3qb0ms6_gf8jx7tq82xr14s6lmfx/T/pip-install-4bhkxsrq/napkinxc/src/model.cpp:36:
/private/var/folders/p9/mx3qb0ms6_gf8jx7tq82xr14s6lmfx/T/pip-install-4bhkxsrq/napkinxc/src/models/online_plt.h:46:10: error: 'shared_timed_mutex' is unavailable: introduced in macOS 10.12
std::shared_timed_mutex treeMtx;
^
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/shared_mutex:205:58: note: 'shared_timed_mutex' has been explicitly marked unavailable here
class _LIBCPP_TYPE_VIS _LIBCPP_AVAILABILITY_SHARED_MUTEX shared_timed_mutex
^
[ 62%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/models/kmeans.cpp.o
1 error generated.
make[2]: *** [python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/model.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/all] Error 2
make: *** [all] Error 2

[cmake] configuring CMake project...

running build_py (cmake)

[cmake] building CMake project -> build

Traceback (most recent call last):
File "/Users/user.name/anaconda3/lib/python3.7/site-packages/pip/_vendor/pep517/_in_process.py", line 280, in <module>
main()
File "/Users/user.name/anaconda3/lib/python3.7/site-packages/pip/_vendor/pep517/_in_process.py", line 263, in main
json_out['return_val'] = hook(hook_input['kwargs'])
File "/Users/user.name/anaconda3/lib/python3.7/site-packages/pip/_vendor/pep517/_in_process.py", line 205, in build_wheel
metadata_directory)
File "/private/var/folders/p9/mx3qb0ms6_gf8jx7tq82xr14s6lmfx/T/pip-build-env-evty4816/overlay/lib/python3.7/site-packages/setuptools/build_meta.py", line 217, in build_wheel
wheel_directory, config_settings)
File "/private/var/folders/p9/mx3qb0ms6_gf8jx7tq82xr14s6lmfx/T/pip-build-env-evty4816/overlay/lib/python3.7/site-packages/setuptools/build_meta.py", line 202, in _build_with_temp_dir
self.run_setup()
File "/private/var/folders/p9/mx3qb0ms6_gf8jx7tq82xr14s6lmfx/T/pip-build-env-evty4816/overlay/lib/python3.7/site-packages/setuptools/build_meta.py", line 254, in run_setup
self).run_setup(setup_script=setup_script)
File "/private/var/folders/p9/mx3qb0ms6_gf8jx7tq82xr14s6lmfx/T/pip-build-env-evty4816/overlay/lib/python3.7/site-packages/setuptools/build_meta.py", line 145, in run_setup
exec(compile(code, file, 'exec'), locals())
File "setup.py", line 64, in <module>
include_package_data=True,
File "/private/var/folders/p9/mx3qb0ms6_gf8jx7tq82xr14s6lmfx/T/pip-build-env-evty4816/overlay/lib/python3.7/site-packages/cmaketools/__init__.py", line 98, in setup
_setup(setup_args)
File "/private/var/folders/p9/mx3qb0ms6_gf8jx7tq82xr14s6lmfx/T/pip-build-env-evty4816/overlay/lib/python3.7/site-packages/setuptools/__init__.py", line 153, in setup
return distutils.core.setup(**attrs)
File "/Users/user.name/anaconda3/lib/python3.7/distutils/core.py", line 148, in setup
dist.run_commands()
File "/Users/user.name/anaconda3/lib/python3.7/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/Users/user.name/anaconda3/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/private/var/folders/p9/mx3qb0ms6_gf8jx7tq82xr14s6lmfx/T/pip-build-env-evty4816/overlay/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 290, in run
self.run_command('build')
File "/Users/user.name/anaconda3/lib/python3.7/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/Users/user.name/anaconda3/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/Users/user.name/anaconda3/lib/python3.7/distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/Users/user.name/anaconda3/lib/python3.7/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/Users/user.name/anaconda3/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/private/var/folders/p9/mx3qb0ms6_gf8jx7tq82xr14s6lmfx/T/pip-build-env-evty4816/overlay/lib/python3.7/site-packages/cmaketools/cmakecommands.py", line 110, in run
self._run_cmake()
File "/private/var/folders/p9/mx3qb0ms6_gf8jx7tq82xr14s6lmfx/T/pip-build-env-evty4816/overlay/lib/python3.7/site-packages/cmaketools/cmakecommands.py", line 104, in _run_cmake
pkg_version=self.distribution.get_version(),
File "/private/var/folders/p9/mx3qb0ms6_gf8jx7tq82xr14s6lmfx/T/pip-build-env-evty4816/overlay/lib/python3.7/site-packages/cmaketools/cmakebuilder.py", line 349, in run
env=env,
File "/private/var/folders/p9/mx3qb0ms6_gf8jx7tq82xr14s6lmfx/T/pip-build-env-evty4816/overlay/lib/python3.7/site-packages/cmaketools/cmakeutil.py", line 169, in build
return sp.run(args, env=env).check_returncode()
File "/Users/user.name/anaconda3/lib/python3.7/subprocess.py", line 422, in check_returncode
self.stderr)
subprocess.CalledProcessError: Command '['cmake', '--build', 'build', '-j', '7', '--config', 'Release']' returned non-zero exit status 2.

ERROR: Failed building wheel for napkinxc
Failed to build napkinxc
ERROR: Could not build wheels for napkinxc which use PEP 517 and cannot be installed directly

Running python 3.7.1 on MacOS Catalina version 10.15.6.

C++ compilation error when building

I'm trying to install the latest version using pip install git+https://github.com/mwydmuch/napkinXC.git, which gives the following compilation error:

  /tmp/pip-req-build-hszug3a8/src/liblinear/linear.cpp: In function ‘void solve_l2r_lr_dual(const problem*, float*, float, float, float, int)’:
  /tmp/pip-req-build-hszug3a8/src/liblinear/linear.cpp:1335:29: error: no matching function for call to ‘max(float&, double)’
      Gmax = max(Gmax, fabs(gp));

Thanks for your time.

PS:

The full error message is:

  ERROR: Command errored out with exit status 1:
   command: /home/cloud-user/code/diverse-xml/.venv/bin/python3.8 /home/cloud-user/code/diverse-xml/.venv/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py build_wheel /tmp/tmphqbldjxt
       cwd: /tmp/pip-req-build-hszug3a8
  Complete output (149 lines):
  running bdist_wheel
  running build
  running build_py
  -- downloading/updating pybind11
  -- pybind11 directory found, pulling...
  From https://github.com/pybind/pybind11
   * branch            master     -> FETCH_HEAD
  --
  fatal: A branch named 'tag_v2.6.2' already exists.
  CMake Warning at GitUtils.cmake:251 (message):
    pybind11 some error happens.
  Call Stack (most recent call first):
    CMakeLists.txt:92 (git_clone)


  -- pybind11 v2.9.0 dev1
  -- Configuring done
  -- Generating done
  -- Build files have been written to: /tmp/pip-req-build-hszug3a8/build
  [  3%] Building C object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/liblinear/blas/axpy.c.o
  [  7%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/args.cpp.o
  [ 10%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/base.cpp.o
  [ 14%] Building C object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/liblinear/blas/dot.c.o
  [ 17%] Building C object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/liblinear/blas/nrm2.c.o
  [ 21%] Building C object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/liblinear/blas/scal.c.o
  [ 25%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/liblinear/linear.cpp.o
  [ 28%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/liblinear/tron.cpp.o
  [ 32%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/log.cpp.o
  [ 35%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/main.cpp.o
  [ 39%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/measure.cpp.o
  [ 42%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/misc.cpp.o
  /tmp/pip-req-build-hszug3a8/src/liblinear/linear.cpp: In function ‘void solve_l2r_lr_dual(const problem*, float*, float, float, float, int)’:
  /tmp/pip-req-build-hszug3a8/src/liblinear/linear.cpp:1335:29: error: no matching function for call to ‘max(float&, double)’
      Gmax = max(Gmax, fabs(gp));
                               ^
  /tmp/pip-req-build-hszug3a8/src/liblinear/linear.cpp:16:36: note: candidate: template<class T> T max(T, T)
   template <class T> static inline T max(T x,T y) { return (x>y)?x:y; }
                                      ^
  /tmp/pip-req-build-hszug3a8/src/liblinear/linear.cpp:16:36: note:   template argument deduction/substitution failed:
  /tmp/pip-req-build-hszug3a8/src/liblinear/linear.cpp:1335:29: note:   deduced conflicting types for parameter ‘T’ (‘float’ and ‘double’)
      Gmax = max(Gmax, fabs(gp));
                               ^
  /tmp/pip-req-build-hszug3a8/src/liblinear/linear.cpp: In function ‘float calc_max_p(const problem*, const parameter*)’:
  /tmp/pip-req-build-hszug3a8/src/liblinear/linear.cpp:2363:38: error: no matching function for call to ‘max(float&, double)’
     max_p = max(max_p, fabs(prob->y[i]));
                                        ^
  /tmp/pip-req-build-hszug3a8/src/liblinear/linear.cpp:16:36: note: candidate: template<class T> T max(T, T)
   template <class T> static inline T max(T x,T y) { return (x>y)?x:y; }
                                      ^
  /tmp/pip-req-build-hszug3a8/src/liblinear/linear.cpp:16:36: note:   template argument deduction/substitution failed:
  /tmp/pip-req-build-hszug3a8/src/liblinear/linear.cpp:2363:38: note:   deduced conflicting types for parameter ‘T’ (‘float’ and ‘double’)
     max_p = max(max_p, fabs(prob->y[i]));
                                        ^
  /tmp/pip-req-build-hszug3a8/src/liblinear/linear.cpp: In function ‘model* load_model(const char*)’:
  /tmp/pip-req-build-hszug3a8/src/liblinear/linear.cpp:3022:35: warning: format ‘%lf’ expects argument of type ‘double*’, but argument 3 has type ‘float*’ [-Wformat=]
    if (fscanf(_stream, _format, _var) != 1)\
                                     ^
  /tmp/pip-req-build-hszug3a8/src/liblinear/linear.cpp:3098:4: note: in expansion of macro ‘FSCANF’
      FSCANF(fp,"%lf",&bias);
      ^
  /tmp/pip-req-build-hszug3a8/src/liblinear/linear.cpp:3022:35: warning: format ‘%lf’ expects argument of type ‘double*’, but argument 3 has type ‘float*’ [-Wformat=]
    if (fscanf(_stream, _format, _var) != 1)\
                                     ^
  /tmp/pip-req-build-hszug3a8/src/liblinear/linear.cpp:3136:4: note: in expansion of macro ‘FSCANF’
      FSCANF(fp, "%lf ", &model_->w[i*nr_w+j]);
      ^
  [ 46%] Building CXX object python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/model.cpp.o
  python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/build.make:159: recipe for target 'python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/liblinear/linear.cpp.o' failed
  make[2]: *** [python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/liblinear/linear.cpp.o] Error 1
  make[2]: *** Waiting for unfinished jobs....
  In file included from /tmp/pip-req-build-hszug3a8/src/base.h:34:0,
                   from /tmp/pip-req-build-hszug3a8/src/base.cpp:27:
  /tmp/pip-req-build-hszug3a8/src/vector.h:216:25: warning: inline function ‘virtual Real AbstractVector::at(int) const’ used but never defined
       virtual inline Real at(int index) const = 0;
                           ^
  /tmp/pip-req-build-hszug3a8/src/vector.h:217:26: warning: inline function ‘virtual Real& AbstractVector::operator[](int)’ used but never defined
       virtual inline Real& operator[](int index) = 0;
                            ^
  In file included from /tmp/pip-req-build-hszug3a8/src/misc.h:35:0,
                   from /tmp/pip-req-build-hszug3a8/src/misc.cpp:30:
  /tmp/pip-req-build-hszug3a8/src/matrix.h: In instantiation of ‘void RMatrix<T>::appendRow(const U&, bool) [with U = std::vector<IVPair<float> >; T = SparseVector]’:
  /tmp/pip-req-build-hszug3a8/src/misc.cpp:96:35:   required from here
  /tmp/pip-req-build-hszug3a8/src/matrix.h:38:44: error: invalid initialization of non-const reference of type ‘SparseVector&’ from an rvalue of type ‘void’
           T& row = r.emplace_back(vec, sorted);
                                              ^
  python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/build.make:229: recipe for target 'python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/misc.cpp.o' failed
  make[2]: *** [python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/__/__/__/src/misc.cpp.o] Error 1
  CMakeFiles/Makefile2:145: recipe for target 'python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/all' failed
  make[1]: *** [python/napkinxc/_napkinxc/CMakeFiles/pynxc.dir/all] Error 2
  Makefile:135: recipe for target 'all' failed

feature dimension mismatch between train and test data

Hi,

There seems to be a bug in the data loading process.

For example:

from napkinxc.datasets import load_dataset
trn_X, _ = load_dataset('wiki10-31k', 'train')
tst_X, _ = load_dataset('wiki10-31k', 'test')
print('# of features of training data', trn_X.shape[1])
print('# of features of test data', tst_X.shape[1])

gives:

# of features of training data 101938
# of features of test data 101937

Cheers,
Han
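A plausible cause, stated here as an assumption rather than a confirmed diagnosis, is that the number of features is inferred separately for each split from the largest feature index that occurs in it. The sketch below reproduces that effect on made-up LibSVM-style lines. `infer_n_features` is a hypothetical helper, not napkinXC's loader.

```python
def infer_n_features(lines):
    """Return 1 + the largest feature index in lines of the form 'labels idx:val ...'."""
    largest = -1
    for line in lines:
        for token in line.split():
            if ":" in token:
                index = int(token.split(":", 1)[0])
                largest = max(largest, index)
    return largest + 1


# Made-up data: the highest feature index (4) occurs only in the training split.
train_lines = ["3,7 0:0.5 4:1.0", "1 2:0.3"]
test_lines = ["7 1:0.2 3:0.9"]

n_train = infer_n_features(train_lines)
n_test = infer_n_features(test_lines)
n_common = max(n_train, n_test)  # width both splits should share
print(n_train, n_test, n_common)
```

If this is what happens, a user-side stopgap is to rebuild the narrower matrix with an explicit shape, for example `csr_matrix((X.data, X.indices, X.indptr), shape=(X.shape[0], n_common))` from scipy.sparse, so that both splits share the same number of columns.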

Preparation of custom dataset for training

I'm trying to experiment with napkinXC on a custom XMLC dataset, but I'm unsure how to prepare the input text and labels, and how to decode the output. Currently, I have the following code to prepare text embeddings and one-hot encoded labels:

Label preparation

from sklearn.preprocessing import MultiLabelBinarizer
import ast
from tqdm.auto import tqdm
y = MultiLabelBinarizer()
subclasses = df['subclass_id'].to_list()
subclasses = [ast.literal_eval(subclass) for subclass in tqdm(subclasses)]
labels = y.fit_transform(subclasses)

Text embeddings

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-miniLM-l6-v2', device='cuda')
model.max_seq_length = 256
print("Max Sequence Length:", model.max_seq_length)
sentence_embeddings = model.encode(
    df['patent_text'].values,
    batch_size=512,
    show_progress_bar=True,
    convert_to_numpy=True,
    device='cuda',
)

Then I proceed to convert these vectors to csr_matrices via:

from scipy.sparse import csr_matrix
import numpy as np
X = csr_matrix(X_train.astype(np.float32))
Y = csr_matrix(y_train.astype(np.float32))
X_test = csr_matrix(X_test.astype(np.float32))
Y_test = csr_matrix(y_test.astype(np.float32))

Training

I follow the quickstart like so:

from napkinxc.models import PLT
from napkinxc.measures import precision_at_k
plt = PLT("USPC-model")
plt.fit(X, Y)
Y_pred = plt.predict(X_test, top_k=10)
print(precision_at_k(Y_test, Y_pred, k=10))

This code runs, but I'm not sure how to interpret the result. Y_pred returns a list of lists containing integers (e.g. [[2316, 1056, 1691, 1690, 2322, 1064, 2315, 2301, 1714, 2302]]) and I'm unable to decode this to the original labels. Am I doing the data preparation correctly? How should I go about decoding the output labels? Thank you.
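On decoding: if the integers in Y_pred are column indices of the label matrix passed to fit (an assumption worth verifying on a few training rows), they can be mapped back through the same index-to-label table that built the matrix. The sketch below shows the idea in plain Python with hypothetical helpers and made-up label strings. With scikit-learn's MultiLabelBinarizer, the corresponding table is its `classes_` attribute, so `y.classes_[i]` is the label for column `i`.

```python
def fit_label_index(label_sets):
    """Build a sorted class list and a label -> column index mapping."""
    classes = sorted({label for labels in label_sets for label in labels})
    index = {label: i for i, label in enumerate(classes)}
    return classes, index


def decode(predictions, classes):
    """Map each predicted column index back to its original label."""
    return [[classes[i] for i in row] for row in predictions]


# Made-up label strings, for illustration only.
train_labels = [["B01", "A47"], ["C07"], ["A47", "C07"]]
classes, index = fit_label_index(train_labels)

predicted_columns = [[2, 0], [1]]
print(decode(predicted_columns, classes))
```

The same list comprehension applied to `y.classes_` and `Y_pred` would turn the predicted indices into the original subclass ids.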