
pyxclib's Introduction

  • 👋 Hi, I'm @kunaldahiya
  • 👀 I'm interested in Machine learning, Deep learning, Extreme multi-label learning, Siamese networks, and Negative sampling, among others.
  • 🌱 I'm currently exploring Large Language Models for real-world applications such as Search, Ads, and Product recommendation.
  • 💞️ I'm looking to collaborate on Deep learning and Extreme multi-label learning problems with impact on real-world applications.
  • 📫 Drop me an email.

pyxclib's People

Contributors

anirudhb11, anshumitts, cairomo, deepaksaini119, kunaldahiya, nilesh2797, nsorros, shawank, shikharmn, sushantsondhi, yadav-sachin

pyxclib's Issues

Xf.txt and Yf.txt files

How were the Xf.txt and Yf.txt files in LF-AmazonTitles-131K generated? Do you know?

error when running sparse_bow_features_from_raw_data.py

df = np.zeros(vocabulary.len + 1) # +1 for OOV

When execution reaches text.py, it raises TypeError: unsupported operand type(s) for +: 'method-wrapper' and 'int'.
Changing vocabulary.len to vocabulary.len() produces another error.

run script:
python sparse_bow_features_from_raw_data.py trn.json.gz tst.json.gz train.txt test.txt
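
If it helps, a hypothetical one-line change (a guess, not verified against the repo): call the built-in len() on the vocabulary mapping instead of accessing a .len attribute, which resolves to a method wrapper and cannot be added to an int.

import numpy as np

vocabulary = {"token_a": 0, "token_b": 1}   # stand-in for the real vocabulary mapping
df = np.zeros(len(vocabulary) + 1)          # +1 for OOV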

can't convert df (dict) to np.array?

In utils/text.py, line 577, in _compute_idf:
df = np.array(df, dtype=self.dtype)

df is created as a dictionary during _create_vocab. Why is the dict object df being passed to _compute_idf to be turned into an np array? It raises the error "TypeError: float() argument must be a string or a number, not 'dict'". Is the goal to create a csr matrix from the df dict?

Would it be more valid to do df = np.array(list(df.values()), dtype=self.dtype)?
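
A possibly safer variant (a sketch; it assumes df maps tokens to document frequencies and vocabulary maps tokens to column indices, which may not match the actual internals): place each count at its vocabulary index instead of relying on dict insertion order.

import numpy as np

vocabulary = {"apple": 0, "banana": 1}   # assumed token -> column index mapping
df = {"apple": 3, "banana": 7}           # assumed token -> document frequency
df_arr = np.zeros(len(vocabulary) + 1, dtype=np.float32)  # +1 for OOV
for token, count in df.items():
    df_arr[vocabulary[token]] = count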

How to convert from Raw text data to the Parabel sparse matrix text format using Python?

I'm looking to train an XML classifier on the PubMed dataset. I found out after some research that Parabel is the fastest.

So far, I was able to make it work on the toy dataset EUR-Lex. Now, I'm trying to do the same with the above-mentioned dataset, so I need to convert it to the format required by the Parabel C++ code.

The format is explained here. I could generate it by writing custom code, but I want to know if there is already some script for this.

Thanks.
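
For what it's worth, a rough sketch of one possible route (untested; it assumes data_utils.write_data accepts a csr feature matrix and a csr label matrix and writes the sparse "labels ft:val" layout, which should be double-checked against the Parabel format description):

from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from xclib.data import data_utils

texts = ["first pubmed abstract ...", "second pubmed abstract ..."]  # raw documents
labels = [["D000001", "D000002"], ["D000003"]]                       # label lists per document

features = TfidfVectorizer().fit_transform(texts)                    # csr feature matrix
label_matrix = csr_matrix(MultiLabelBinarizer().fit_transform(labels))
data_utils.write_data("train.txt", features, label_matrix)           # sparse text output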

Might be a bug when calculating precision

In the precision method (link), the indices returned by _setup_metric are the indices of the top-k labels, with scores sorted in ascending order. But when calculating precision@k, we want to take the cumsum with the scores in eval_flags sorted in descending order. The same applies to psprecision, since it uses precision as a subroutine.

I just changed indices to indices[:, ::-1] in _get_top_k. I guess it is just a workaround.
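
A tiny illustration of the ordering point (a standalone sketch, independent of the library code): np.argsort returns ascending order, so reversing the columns puts rank 1 first.

import numpy as np

scores = np.array([[0.1, 0.9, 0.5]])
indices = np.argsort(scores, axis=-1)   # ascending: [[0 2 1]]
descending = indices[:, ::-1]           # descending: [[1 2 0]]
print(descending[:, :2])                # top-2 label indices: [[1 2]]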

Discrepancy in the Order of Returned Variables from baseline_test Function

Firstly, I apologize if my explanation is unclear as I'm not very experienced with using GitHub.

I noticed a small discrepancy in the return order of the variables from a function.

The function returns the values in the following order:

https://github.com/kennishida17/Learning-with-Holographic-Reduced-Representations/blob/ce3cb8fc4c63b16f41fc41d8788503ebe146c73c/lib/model.py#L175C1-L177C51

else:
        return total_loss/num_itr, total_f1/num_itr, total_pr/num_itr, total_rec/num_itr

However, when receiving the output of the function, it's assumed to be in this order:

https://github.com/kennishida17/Learning-with-Holographic-Reduced-Representations/blob/ce3cb8fc4c63b16f41fc41d8788503ebe146c73c/run_classifier.py#L222

val_loss, pr, rec, f1 = baseline_test(model, device, val_loader)

Proposed Change

I suggest adjusting the return order in the baseline_test function to match the order when receiving the output. The correct order should be:

return total_loss/num_itr, total_pr/num_itr, total_rec/num_itr, total_f1/num_itr

Precision_k does not work with X np.float32

Precision at k does not work when the probabilities matrix is numpy float32. It throws UnboundLocalError: local variable 'indices' referenced before assignment. It works fine with numpy float64. The error originates from

elif np.issubdtype(X.dtype, np.float):

and from the fact that np.float is shorthand for the Python float, which is a double; see https://stackoverflow.com/questions/16963956/difference-between-python-float-and-numpy-float32. This is why np.issubdtype(np.float32, np.float) returns False, whereas np.issubdtype(np.float64, np.float) returns True.

This can be solved easily by replacing the comparison with np.floating; see the hierarchy of scalar types in NumPy: https://numpy.org/doc/stable/reference/arrays.scalars.html#arrays-scalars
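
A quick check of the proposed comparison (sketch): np.floating is the abstract parent of all floating-point scalar types, so it matches both single and double precision.

import numpy as np

print(np.issubdtype(np.float32, np.floating))  # True
print(np.issubdtype(np.float64, np.floating))  # True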

To reproduce

from xclib.evaluation.xc_metrics import precision

import numpy as np
import scipy.sparse as sp

Y_pred_proba = np.random.randn(10,5).astype(np.float32)
Y_true = sp.csr_matrix(np.random.randn(10,5) > 1).astype(np.int32)

precision(Y_pred_proba, Y_true)

Methods retain_topk, rank do not work

Since NumPy has removed the deprecated dtype alias np.int, the rank method and any methods that invoke it are broken. This line uses np.int, which is at the root of this issue.

A basic fix seems to be to just change that to 'int', which numpy recommends, along with multiple other changes.

How should we proceed? I am willing to take this up.
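
For reference, a minimal illustration of the replacement (sketch): NumPy 1.24 removed the np.int alias, and either the builtin int or an explicit width works in dtype positions.

import numpy as np

x = np.arange(5, dtype=int)        # builtin int, the replacement NumPy suggests
y = np.arange(5, dtype=np.int64)   # explicit width, if a fixed size is needed
print(x.dtype, y.dtype)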

Error:

Traceback (most recent call last):
  File "path/projects/OrganicBERT/run_eval.py", line 61, in <module>
    meta_preds = retain_topk(sp.load_npz(f"{DUMP_DIR}/../preds_mat.npz"), k=1)
  File "path/miniconda3/envs/ogb/lib/python3.9/site-packages/xclib/utils/sparse.py", line 137, in retain_topk
    ranks = rank(X)
  File "path/miniconda3/envs/ogb/lib/python3.9/site-packages/xclib/utils/sparse.py", line 36, in rank
    ranks = _rank(X.data, X.indices, X.indptr)
  File "xclib/utils/_sparse.pyx", line 197, in xclib.utils._sparse._rank
  File "path/miniconda3/envs/ogb/lib/python3.9/site-packages/numpy/__init__.py", line 284, in __getattr__
    raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'int'

pip freeze:


colorama==0.4.6
coloredlogs==15.0.1
Cython==3.0.0
fbgemm-gpu==0.4.1
grpcio==1.57.0
hnswlib==0.7.0
huggingface-hub==0.16.4
humanfriendly==10.0
hydra-core==1.3.2
knack==0.10.1
lightning-utilities==0.9.0
numba==0.57.1
numpy==1.24.0
nvidia-ml-py==12.535.77
nvitop==1.2.0
oauthlib==3.2.2
omegaconf==2.3.0
onnxruntime==1.14.0
onnxruntime-gpu==1.14.0
pandas==2.0.3
pathspec==0.11.2
pathtools==0.1.2
Pillow==10.0.0
pkginfo==1.9.6
sacremoses==0.0.53
safetensors==0.3.2
scikit-learn==1.3.0
scipy==1.11.1
sentence-transformers==2.2.2
sentencepiece==0.1.99
sentry-sdk==1.29.2
setproctitle==1.3.2
tokenizers==0.12.1
torch==2.0.1
torchaudio==2.0.2
torchmetrics==1.0.3
xclib @ git+https://github.com/kunaldahiya/pyxclib.git@ae5410f10080742758cdd533f768e3fe5b4f4de3

Recall@1 does not return the same results as micro recall (on single-prediction classification)?

Hi,

First of all, thank you so much for this useful library!

I have been doing a few very simple tests to compare the output of this library with the output of scikit-learn's micro-averaged metrics (precision and recall). Based on my understanding, with k=1 the results of precision@1 and recall@1 should be the same as micro precision and micro recall, if the model predicts just one label.

This holds for precision@1, but recall@1 returns different results. Here is a minimal example:

import numpy as np
from scipy.sparse import csr_matrix
from xclib.evaluation import xc_metrics
from sklearn.metrics import precision_recall_fscore_support

gold_vecs = np.array([[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 0]])
pred_vecs = np.array([[0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 1, 0]])

print("sklearn micro:", precision_recall_fscore_support(y_true=gold_vecs, y_pred=pred_vecs, average='micro', zero_division=0))

print("xclib precision@1:", xc_metrics.precision(csr_matrix(pred_vecs), csr_matrix(gold_vecs), k=1)[0])
print("xclib recall@1:", xc_metrics.recall(csr_matrix(pred_vecs), csr_matrix(gold_vecs), k=1)[0])

this outputs:

sklearn micro: (0.6666666666666666, 0.4, 0.5, None)  # the first two values are precision and recall, respectively
xclib precision@1: 0.66666667
xclib recall@1: 0.5

By doing manual verification, I get:
3 fn, 1 fp, 2 tp
recall = 2/(2+3) = 0.4

Am I missing something?
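
One possible explanation (an assumption, not confirmed against the library source): recall@k appears to be averaged per instance rather than micro-averaged over pooled counts, which reproduces the 0.5 figure:

import numpy as np

gold = np.array([[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 0]])
top1 = np.array([2, 2, 2])  # predicted label index for each row
per_row = [gold[i, top1[i]] / gold[i].sum() for i in range(3)]  # [0.0, 0.5, 1.0]
print(np.mean(per_row))     # 0.5, matching xclib; pooled (micro) recall is 2/5 = 0.4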

Bug in recall function in xc_metrics.py

Line 51 in xc_metrics.py, predicted_labels = retain_topk(predicted_labels, k), should be predicted_labels = retain_topk(predicted_labels, k=k); otherwise k is always set to 5.

Some questions about the PSP@k and PSnDCG@k

While evaluating results with the pyxclib metrics, a few uncertainties surfaced:

  1. Could you clarify the definition of the propensity scores used for precision and nDCG? (A sketch of the usual model follows after this list.)
  2. What are the upper and lower limits of these metrics?
  3. Under what circumstances do propensity-scored metrics reach their maximum and minimum values?
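
For question 1, a sketch of the propensity model commonly used in extreme classification (Jain et al., 2016), which compute_inv_propesity appears to follow; A and B are dataset-dependent hyperparameters (e.g. 0.55 and 1.5), and the returned inverse propensities are always >= 1:

import numpy as np

def inv_propensity(label_freqs, num_points, A=0.55, B=1.5):
    # p_l = 1 / (1 + C * (N_l + B) ** (-A)),  with C = (log N - 1) * (B + 1) ** A,
    # N = number of training points, N_l = number of training points tagged with label l
    C = (np.log(num_points) - 1) * (B + 1) ** A
    return 1.0 + C * (label_freqs + B) ** (-A)   # 1 / p_l

print(inv_propensity(np.array([1, 10, 1000]), num_points=10000))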

Efficient utils.sparse.retain_topk

The current code of retain_topk does X[ranks > k] = 0.0 followed by X.eliminate_zeros().

The first operation seems very expensive, and in my experiments I've seen that doing something like X.data[np.where(ranks.data > k)[0]] = 0.0 is much more efficient than X[ranks > k] = 0.0. Can you please check this?
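
The two variants side by side (a sketch; it assumes ranks is a csr_matrix with exactly the same sparsity pattern as X, so ranks.data aligns element-wise with X.data):

def retain_topk_slow(X, ranks, k):
    # X, ranks: scipy.sparse csr_matrix with identical sparsity patterns
    X[ranks > k] = 0.0            # builds a sparse boolean mask, then fancy-indexes
    X.eliminate_zeros()

def retain_topk_fast(X, ranks, k):
    X.data[ranks.data > k] = 0.0  # edits X.data directly via the aligned ranks.data
    X.eliminate_zeros()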

Error running compute_features example

Hey, I was trying out this repo by running the compute_features() example in xclib/examples/compute_features.py.
Unfortunately, I keep running into this error:
TypeError: 'BoWFeatures' object is not iterable

Full Trace:

Traceback (most recent call last):
  File "/home/user/pyxclib/xclib/examples/compute_features.py", line 26, in <module>
    compute_features()
  File "/home/user/pyxclib/xclib/examples/compute_features.py", line 22, in compute_features
    print(obj.transform(obj).toarray())
  File "/home/user/pyxclib/xclib/utils/text.py", line 504, in transform
    X = self._compute_countf(raw_documents)
  File "/home/user/pyxclib/xclib/utils/text.py", line 535, in _compute_countf
    for doc in raw_documents:
TypeError: 'BoWFeatures' object is not iterable

Please help me out in debugging the issue.
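
A guess at the intended call (not verified against the example): transform iterates over raw documents, so it presumably expects the list of strings the features were built from, not the BoWFeatures object itself.

corpus = ["a raw training document", "another raw document"]  # hypothetical input corpus
print(obj.transform(corpus).toarray())                        # instead of obj.transform(obj)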

What is the feature value?

As we know, each data point looks like this:
label1,label2 ft1:ft1_val
How is the feature value obtained? Let us take LF-AmazonTitles-131K as an example.

PSP@k and PSnDCG@k

Evaluating the definition of PSP and PSnDCG (shown below), these metrics can assume values greater than 1.0 since propensity is between 0 and 1.

[image: definitions of PSP@k and PSnDCG@k]

So, do you happen to have any test cases for these metrics that would allow me to verify the correctness of my implementation?

I am confused about the data_utils.read_sparse_file function

When I run
true_labels = data_utils.read_sparse_file('Sandbox/Data/EUR-Lex/tst_X_Y.txt')
it raises the exception ValueError: buffer size must be a multiple of element size from sparse.py, line 257,
i.e., indices = np.frombuffer(ind, np.int64)
I see that ind is an array with typecode 'l'; it seems that each element in ind takes 4 bytes, according to https://docs.python.org/3/library/array.html.

So I am confused: why load from the buffer with dtype np.int64?

My machine runs a 64-bit OS.
Python version is 3.8.8.
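
A small check of the size mismatch (sketch): array('l') maps to the platform's C long, which is 8 bytes on most 64-bit Linux/macOS builds but only 4 bytes on 64-bit Windows, so np.frombuffer(..., np.int64) only lines up on the former.

from array import array
import numpy as np

ind = array('l', [1, 2, 3])
print(ind.itemsize)   # 8 on typical 64-bit Linux/macOS, 4 on Windows
dtype = np.int64 if ind.itemsize == 8 else np.int32
print(np.frombuffer(ind, dtype=dtype))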

ModuleNotFoundError: No module named 'xclib.utils._sparse'

Hi,
I tried to import data_utils to run the example that is in the repository, but it doesn't work:
from xclib.data import data_utils
The following error appears:

    from ..utils.sparse import ll_to_sparse, expand_indptr, _read_file, _read_file_safe
  File "/home/joao/Downloads/xclib/xclib/utils/sparse.py", line 1, in <module>
    from ._sparse import _rank, read_file, read_file_safe, _topk
ModuleNotFoundError: No module named 'xclib.utils._sparse'

Bug in _get_topk

Hi, thanks for making this useful tool! Looks great.

Here is a minimal working example of how I can produce the bug:

import numpy as np
import xclib.evaluation.xc_metrics as xc_metrics

scores = np.array([[0.1, 0.2, 0.99, 0.2]])
labels = np.array([[0, 0, 1, 1]])

inv_propen = xc_metrics.compute_inv_propesity(labels, 0.55, 1.5)
acc = xc_metrics.Metrics(true_labels=labels, inv_psp=inv_propen)
args = acc.eval(scores, 1)
print(xc_metrics.format(*args))

What I get is

  File "tmptest.py", line 9, in <module>
    args = acc.eval(scores, 1)
  File "/home/fs01/wz346/.local/lib/python3.7/site-packages/xclib-0.96-py3.7-linux-x86_64.egg/xclib/evaluation/xc_metrics.py", line 446, in eval
    self.inv_psp, k=K)
  File "/home/fs01/wz346/.local/lib/python3.7/site-packages/xclib-0.96-py3.7-linux-x86_64.egg/xclib/evaluation/xc_metrics.py", line 162, in _setup_metric
    num_labels, k)
  File "/home/fs01/wz346/.local/lib/python3.7/site-packages/xclib-0.96-py3.7-linux-x86_64.egg/xclib/evaluation/xc_metrics.py", line 118, in _get_topk
    return indices
UnboundLocalError: local variable 'indices' referenced before assignment

Looking at it closely, it seems like the if condition here fails to capture the np.ndarray type, so the indices never got created

    elif type(X) == np.ndarray:
        if np.issubdtype(X.dtype, np.integer):
            warnings.warn("Assuming indices are sorted.")
            indices = X[:, :k]
        elif np.issubdtype(X.dtype, np.float):
            _indices = np.argpartition(X, -k)[:, -k:]
            _scores = np.take_along_axis(
                X, _indices, axis=-1
            )
            indices = np.argsort(_scores, axis=-1)
            indices = np.take_along_axis(_indices, indices, axis=1)

where X is created here in the setup_metric():

    if inv_psp is not None:
        ps_indices = _get_topk(
            true_labels.dot(
                sp.spdiags(inv_psp, diags=0,
                           m=num_labels, n=num_labels)),
            num_labels, k)

Any ideas on how I may fix it? Thanks for any help you may be able to provide.

typo/why are train features written twice?

In pyxclib/xclib/examples/sparse_bow_features_from_raw_data.py, lines 62 and 63 are both data_utils.write_data(trn_ofname, trn_features, trn_labels). Is that correct, or did you mean for one of them to write the test features and test labels?
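
Presumably (just a guess; the tst_* names are assumed from the script's arguments) the second call was meant to write the test split:

data_utils.write_data(trn_ofname, trn_features, trn_labels)
data_utils.write_data(tst_ofname, tst_features, tst_labels)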

Threshold based Evaluation Metrics

While using the fast_evaluate examples, we can specify K and it will output the metrics.

This internally calls the top-k method to build the metrics report.

import scipy.sparse as sp
from xclib.utils.sparse import topk

Y_pred = pred_labels
Y_pred = Y_pred.tocsr()
Y_pred.sort_indices()
pad_indx = Y_pred.shape[1]
print(pad_indx)

indices_, values_ = topk(
    Y_pred, 6, pad_indx, 0, return_values=True, use_cython=False)

print(indices_[0])  # [2967 2970 2963 2976 2977 1866]
print(values_[0])   # [0.8342234 0.56523454 0.20331156 0.19142145 0.15245992 0.13709748]

In most cases, we will not know the number of labels, so we will set a threshold based on relevance.

In this case, a threshold of 0.50 would keep 2967 and 2970.

Do we have a way to set a threshold and then calculate the metrics?
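
Not a built-in option as far as I can tell, but one workaround (a sketch continuing from the snippet above, where Y_pred holds the prediction scores) is to zero out scores below the threshold on the csr matrix before computing the metrics:

threshold = 0.50
Y_thr = Y_pred.copy()                     # csr_matrix of prediction scores
Y_thr.data[Y_thr.data < threshold] = 0.0  # drop low-relevance entries in place
Y_thr.eliminate_zeros()                   # only scores >= 0.50 remain (e.g. 2967, 2970)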

Update for installation requirements?

Hi There,
I found your toolkit for XMLC-Tasks really useful. However, I had some trouble with the installation process:

python3 setup.py install --user

One issue is that it seems to require Cython and NumPy, which is not documented. After installing these manually and running the above installation command, I get the following warning:

SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.

This is followed by version incompatibility issues with numpy and numba. I am not much of a pythoneer, but I think it might be better to do the installation with something like

pip install .[dev]

Also, it might be safer to wrap this into some kind of virtual environment. Personally, I prefer conda to manage the requirements.

I would offer to open a PR for the README-Instructions, but I would kindly ask for your opinion on these issues first.
Thank you very much!
Best
Maximilian

"zero_based" not explicitly set in read_data function when calling load_svmlight_file

features, labels = load_svmlight_file(f, n_features=num_feat, multilabel=True)

When calling load_svmlight_file(f, n_features=num_feat, multilabel=True) in read_data(filename, header=True, dtype='float32', zero_based=True), the zero_based flag is not explicitly set. So when there is no 0 index in the file, load_svmlight_file assumes the file is not zero-based and offsets all the indices by 1 (by default, zero_based is set to 'auto' in load_svmlight_file).

I think load_svmlight_file(f, n_features=num_feat, multilabel=True, zero_based=zero_based) should fix this.
