yzhao062 / pyod

A Comprehensive and Scalable Python Library for Outlier Detection (Anomaly Detection)

Home Page: http://pyod.readthedocs.io

License: BSD 2-Clause "Simplified" License

Python 86.63% Jupyter Notebook 13.37%
outlier-detection anomaly-detection outlier-ensembles outliers anomaly python machine-learning data-mining unsupervised-learning python3 fraud-detection autoencoder neural-networks deep-learning data-science data-analysis novelty-detection out-of-distribution-detection

pyod's Introduction

Python Outlier Detection (PyOD)

Deployment & Documentation & Stats & License

[Badges: PyPI version, Anaconda version, Documentation status, GitHub stars, GitHub forks, Downloads, testing, Coverage Status, Maintainability, License, Benchmark]


Read Me First

Welcome to PyOD, a versatile Python library for detecting anomalies in multivariate data. Whether you're tackling a small-scale project or working with large datasets, PyOD offers a range of algorithms to suit your needs.


About PyOD

PyOD, established in 2017, has become a go-to Python library for detecting anomalous/outlying objects in multivariate data. This exciting yet challenging field is commonly referred to as Outlier Detection or Anomaly Detection.

PyOD includes more than 50 detection algorithms, from classical LOF (SIGMOD 2000) to the cutting-edge ECOD and DIF (TKDE 2022 and 2023). Since 2017, PyOD has been successfully used in numerous academic research projects and commercial products, with more than 17 million downloads. It is also well acknowledged by the machine learning community with various dedicated posts/tutorials, including Analytics Vidhya, KDnuggets, and Towards Data Science.

PyOD is featured for:

  • Unified, User-Friendly Interface across various algorithms.
  • Wide Range of Models, from classic techniques to the latest deep learning methods.
  • High Performance & Efficiency, leveraging numba and joblib for JIT compilation and parallel processing.
  • Fast Training & Prediction, achieved through the SUOD framework1.

Outlier Detection with 5 Lines of Code:

# Example: Training an ECOD detector
from pyod.models.ecod import ECOD
clf = ECOD()
clf.fit(X_train)
y_train_scores = clf.decision_scores_  # Outlier scores for training data
y_test_scores = clf.decision_function(X_test)  # Outlier scores for test data

Selecting the Right Algorithm: Unsure where to start? Consider these robust and interpretable options:

  • ECOD: Example of using ECOD for outlier detection
  • Isolation Forest: Example of using Isolation Forest for outlier detection

Alternatively, explore MetaOD for a data-driven approach.

Citing PyOD:

The PyOD paper is published in the Journal of Machine Learning Research (JMLR) (MLOSS track). If you use PyOD in a scientific publication, we would appreciate citations to the following paper:

@article{zhao2019pyod,
    author  = {Zhao, Yue and Nasrullah, Zain and Li, Zheng},
    title   = {PyOD: A Python Toolbox for Scalable Outlier Detection},
    journal = {Journal of Machine Learning Research},
    year    = {2019},
    volume  = {20},
    number  = {96},
    pages   = {1-7},
    url     = {http://jmlr.org/papers/v20/19-011.html}
}

or:

Zhao, Y., Nasrullah, Z. and Li, Z., 2019. PyOD: A Python Toolbox for Scalable Outlier Detection. Journal of machine learning research (JMLR), 20(96), pp.1-7.

For a broader perspective on anomaly detection, see our NeurIPS papers ADBench: Anomaly Detection Benchmark Paper & ADGym: Design Choices for Deep Anomaly Detection:

@article{han2022adbench,
    title={Adbench: Anomaly detection benchmark},
    author={Han, Songqiao and Hu, Xiyang and Huang, Hailiang and Jiang, Minqi and Zhao, Yue},
    journal={Advances in Neural Information Processing Systems},
    volume={35},
    pages={32142--32159},
    year={2022}
}

@article{jiang2023adgym,
    title={ADGym: Design Choices for Deep Anomaly Detection},
    author={Jiang, Minqi and Hou, Chaochuan and Zheng, Ao and Han, Songqiao and Huang, Hailiang and Wen, Qingsong and Hu, Xiyang and Zhao, Yue},
    journal={Advances in Neural Information Processing Systems},
    volume={36},
    year={2023}
}


Installation

PyOD is designed for easy installation using either pip or conda. We recommend using the latest version of PyOD due to frequent updates and enhancements:

pip install pyod            # normal install
pip install --upgrade pyod  # or update if needed
conda install -c conda-forge pyod

Alternatively, you can clone the repository and install from source:

git clone https://github.com/yzhao062/pyod.git
cd pyod
pip install .

Required Dependencies:

  • Python 3.8 or higher
  • joblib
  • matplotlib
  • numpy>=1.19
  • numba>=0.51
  • scipy>=1.5.1
  • scikit_learn>=0.22.0

Optional Dependencies (see details below):

  • combo (optional, required for models/combination.py and FeatureBagging)
  • keras/tensorflow (optional, required for AutoEncoder, and other deep learning models)
  • suod (optional, required for running SUOD model)
  • xgboost (optional, required for XGBOD)
  • pythresh (optional, required for thresholding)

API Cheatsheet & Reference

The full API Reference is available at PyOD Documentation. Below is a quick cheatsheet for all detectors:

  • fit(X): Fit the detector. The parameter y is ignored in unsupervised methods.
  • decision_function(X): Predict raw anomaly scores for X using the fitted detector.
  • predict(X): Determine whether a sample is an outlier or not as binary labels using the fitted detector.
  • predict_proba(X): Estimate the probability of a sample being an outlier using the fitted detector.
  • predict_confidence(X): Assess the model's confidence on a per-sample basis (applicable in predict and predict_proba)2.

Key Attributes of a fitted model:

  • decision_scores_: Outlier scores of the training data; higher scores typically indicate more abnormal behavior, so outliers tend to receive higher scores.
  • labels_: Binary labels of the training data, where 0 indicates inliers and 1 indicates outliers/anomalies.
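
Putting the cheatsheet together, the following minimal sketch (toy random data and a kNN detector, chosen purely for illustration) exercises each method and attribute listed above:

import numpy as np
from pyod.models.knn import KNN

rng = np.random.RandomState(42)
X_train = rng.randn(200, 2)   # toy training data
X_test = rng.randn(100, 2)    # toy test data

clf = KNN()
clf.fit(X_train)                                # y is ignored for unsupervised detectors

train_scores = clf.decision_scores_             # raw outlier scores of the training data
train_labels = clf.labels_                      # binary labels of the training data (0/1)

test_scores = clf.decision_function(X_test)     # raw outlier scores for new data
test_labels = clf.predict(X_test)               # binary outlier labels for new data
test_proba = clf.predict_proba(X_test)          # outlier probability estimates
test_labels, test_conf = clf.predict(X_test, return_confidence=True)  # labels plus per-sample confidence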

ADBench Benchmark and Datasets

We have released ADBench: Anomaly Detection Benchmark3, the most comprehensive anomaly detection benchmark to date (45 pages). The fully open-sourced ADBench compares 30 anomaly detection algorithms on 57 benchmark datasets.

The organization of ADBench is provided below:

[Figure: organization of ADBench (benchmark-fig)]

For a simpler visualization, we compare selected models via compare_all_models.py.

[Figure: comparison of all implemented models (Comparison_of_All)]


Model Save & Load

PyOD takes a similar approach to scikit-learn regarding model persistence. See the scikit-learn model persistence documentation for clarification.

In short, we recommend using joblib or pickle for saving and loading PyOD models. See "examples/save_load_model_example.py" for an example; it is as simple as the snippet below:

from joblib import dump, load

# save the model
dump(clf, 'clf.joblib')
# load the model
clf = load('clf.joblib')

It is known that saving and loading neural network models poses challenges. Check #328 and #88 for temporary workarounds.


Fast Train with SUOD

Fast training and prediction: it is possible to train and predict with a large number of detection models in PyOD by leveraging the SUOD framework4. See the SUOD Paper and the SUOD example.

from pyod.models.suod import SUOD
from pyod.models.lof import LOF
from pyod.models.copod import COPOD
from pyod.models.iforest import IForest

# initialize a group of outlier detectors for acceleration
detector_list = [LOF(n_neighbors=15), LOF(n_neighbors=20),
                 LOF(n_neighbors=25), LOF(n_neighbors=35),
                 COPOD(), IForest(n_estimators=100),
                 IForest(n_estimators=200)]

# decide the number of parallel processes and the combination method;
# clf can then be used as any other outlier detection model
clf = SUOD(base_estimators=detector_list, n_jobs=2, combination='average',
           verbose=False)
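
Once constructed, the SUOD ensemble is used like any other PyOD detector. A minimal end-to-end sketch on toy data (random arrays, purely for illustration) might look like this:

import numpy as np
from pyod.models.suod import SUOD
from pyod.models.lof import LOF
from pyod.models.copod import COPOD
from pyod.models.iforest import IForest

rng = np.random.RandomState(0)
X_train = rng.randn(500, 5)   # toy training data
X_test = rng.randn(200, 5)    # toy test data

clf = SUOD(base_estimators=[LOF(n_neighbors=15), COPOD(), IForest(n_estimators=50)],
           n_jobs=2, combination='average', verbose=False)
clf.fit(X_train)

y_train_scores = clf.decision_scores_            # combined scores on the training data
y_test_scores = clf.decision_function(X_test)    # combined scores on new data
y_test_labels = clf.predict(X_test)              # binary outlier labels (0/1)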

Thresholding Outlier Scores

A more data-driven approach can be taken when setting the contamination level. By using a thresholding method, guessing an arbitrary value can be replaced with tested techniques for separating inliers from outliers. Refer to PyThresh for a more in-depth look at thresholding.

from pyod.models.knn import KNN
from pyod.models.thresholds import FILTER

# Set the outlier detection and thresholding methods
clf = KNN(contamination=FILTER())
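
As with other detectors, the threshold is applied when the model is fitted. A short sketch on toy data (random values, assuming the optional pythresh dependency is installed):

import numpy as np
from pyod.models.knn import KNN
from pyod.models.thresholds import FILTER   # requires the optional pythresh dependency

rng = np.random.RandomState(0)
X_train = rng.randn(300, 4)   # toy data for illustration

clf = KNN(contamination=FILTER())   # threshold chosen by FILTER instead of a fixed contamination rate
clf.fit(X_train)

labels = clf.labels_            # binary labels derived from the data-driven threshold
scores = clf.decision_scores_   # raw outlier scores are unaffected by the threshold choice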

Implemented Algorithms

The PyOD toolkit consists of four major functional groups:

(i) Individual Detection Algorithms:

Type Abbr Algorithm Year Ref
Probabilistic ECOD Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions 2022 5
Probabilistic ABOD Angle-Based Outlier Detection 2008 6
Probabilistic FastABOD Fast Angle-Based Outlier Detection using approximation 2008 7
Probabilistic COPOD COPOD: Copula-Based Outlier Detection 2020 8
Probabilistic MAD Median Absolute Deviation (MAD) 1993 9
Probabilistic SOS Stochastic Outlier Selection 2012 10
Probabilistic QMCD Quasi-Monte Carlo Discrepancy outlier detection 2001 11
Probabilistic KDE Outlier Detection with Kernel Density Functions 2007 12

Probabilistic Sampling Rapid distance-based outlier detection via sampling 2013 13
Probabilistic GMM Probabilistic Mixture Modeling for Outlier Analysis 14 [Ch.2]

Linear Model PCA Principal Component Analysis (the sum of weighted projected distances to the eigenvector hyperplanes) 2003 15
Linear Model KPCA Kernel Principal Component Analysis 2007 16
Linear Model MCD Minimum Covariance Determinant (use the mahalanobis distances as the outlier scores) 1999 1718
Linear Model CD Use Cook's distance for outlier detection 1977 19
Linear Model OCSVM One-Class Support Vector Machines 2001 20
Linear Model LMDD Deviation-based Outlier Detection (LMDD) 1996 21
Proximity-Based LOF Local Outlier Factor 2000 22
Proximity-Based COF Connectivity-Based Outlier Factor 2002 23
Proximity-Based (Incremental) COF Memory Efficient Connectivity-Based Outlier Factor (slower but reduce storage complexity) 2002 24
Proximity-Based CBLOF Clustering-Based Local Outlier Factor 2003 25
Proximity-Based LOCI LOCI: Fast outlier detection using the local correlation integral 2003 26
Proximity-Based HBOS Histogram-based Outlier Score 2012 27
Proximity-Based kNN k Nearest Neighbors (use the distance to the kth nearest neighbor as the outlier score) 2000 28
Proximity-Based AvgKNN Average kNN (use the average distance to k nearest neighbors as the outlier score) 2002 29
Proximity-Based MedKNN Median kNN (use the median distance to k nearest neighbors as the outlier score) 2002 30
Proximity-Based SOD Subspace Outlier Detection 2009 31
Proximity-Based ROD Rotation-based Outlier Detection 2020 32
Outlier Ensembles IForest Isolation Forest 2008 33
Outlier Ensembles INNE Isolation-based Anomaly Detection Using Nearest-Neighbor Ensembles 2018 34
Outlier Ensembles DIF Deep Isolation Forest for Anomaly Detection 2023 35
Outlier Ensembles FB Feature Bagging 2005 36
Outlier Ensembles LSCP LSCP: Locally Selective Combination of Parallel Outlier Ensembles 2019 37
Outlier Ensembles XGBOD Extreme Boosting Based Outlier Detection (Supervised) 2018 38
Outlier Ensembles LODA Lightweight On-line Detector of Anomalies 2016 39

Outlier Ensembles SUOD SUOD: Accelerating Large-scale Unsupervised Heterogeneous Outlier Detection (Acceleration) 2021 40
Neural Networks AutoEncoder Fully connected AutoEncoder (use reconstruction error as the outlier score) 41 [Ch.3]

Neural Networks VAE Variational AutoEncoder (use reconstruction error as the outlier score) 2013 42
Neural Networks Beta-VAE Variational AutoEncoder (customizable loss term by varying gamma and capacity) 2018 43
Neural Networks SO_GAAL Single-Objective Generative Adversarial Active Learning 2019 44
Neural Networks MO_GAAL Multiple-Objective Generative Adversarial Active Learning 2019 45
Neural Networks DeepSVDD Deep One-Class Classification 2018 46
Neural Networks AnoGAN Anomaly Detection with Generative Adversarial Networks 2017 47
Neural Networks ALAD Adversarially learned anomaly detection 2018 48
Graph-based R-Graph Outlier detection by R-graph 2017 49
Graph-based LUNAR LUNAR: Unifying Local Outlier Detection Methods via Graph Neural Networks 2022 50

(ii) Outlier Ensembles & Outlier Detector Combination Frameworks:

Type Abbr Algorithm Year Ref
Outlier Ensembles FB Feature Bagging 2005 51
Outlier Ensembles LSCP LSCP: Locally Selective Combination of Parallel Outlier Ensembles 2019 52
Outlier Ensembles XGBOD Extreme Boosting Based Outlier Detection (Supervised) 2018 53
Outlier Ensembles LODA Lightweight On-line Detector of Anomalies 2016 54
Outlier Ensembles SUOD SUOD: Accelerating Large-scale Unsupervised Heterogeneous Outlier Detection (Acceleration) 2021 55
Outlier Ensembles INNE Isolation-based Anomaly Detection Using Nearest-Neighbor Ensembles 2018 56
Combination Average Simple combination by averaging the scores 2015 57
Combination Weighted Average Simple combination by averaging the scores with detector weights 2015 58
Combination Maximization Simple combination by taking the maximum scores 2015 59
Combination AOM Average of Maximum 2015 60
Combination MOA Maximization of Average 2015 61
Combination Median Simple combination by taking the median of the scores 2015 62
Combination majority Vote Simple combination by taking the majority vote of the labels (weights can be used) 2015 63

(iii) Outlier Detection Score Thresholding Methods:

Type Abbr Algorithm Documentation
Kernel-Based AUCP Area Under Curve Percentage AUCP
Statistical Moment-Based BOOT Bootstrapping BOOT
Normality-Based CHAU Chauvenet's Criterion CHAU
Linear Model CLF Trained Linear Classifier CLF
Cluster-Based CLUST Clustering Based CLUST
Kernel-Based CPD Change Point Detection CPD
Transformation-Based DECOMP Decomposition DECOMP
Normality-Based DSN Distance Shift from Normal DSN
Curve-Based EB Elliptical Boundary EB
Kernel-Based FGD Fixed Gradient Descent FGD
Filter-Based FILTER Filtering Based FILTER
Curve-Based FWFM Full Width at Full Minimum FWFM
Statistical Test-Based GESD Generalized Extreme Studentized Deviate GESD
Filter-Based HIST Histogram Based HIST
Quantile-Based IQR Inter-Quartile Region IQR
Statistical Moment-Based KARCH Karcher mean (Riemannian Center of Mass) KARCH
Statistical Moment-Based MAD Median Absolute Deviation MAD
Statistical Test-Based MCST Monte Carlo Shapiro Tests MCST
Ensembles-Based META Meta-model Trained Classifier META
Transformation-Based MOLL Friedrichs' Mollifier MOLL
Statistical Test-Based MTT Modified Thompson Tau Test MTT
Linear Model OCSVM One-Class Support Vector Machine OCSVM
Quantile-Based QMCD Quasi-Monte Carlo Discrepancy QMCD
Linear Model REGR Regression Based REGR
Neural Networks VAE Variational Autoencoder VAE
Curve-Based WIND Topological Winding Number WIND
Transformation-Based YJ Yeo-Johnson Transformation YJ
Normality-Based ZSCORE Z-score ZSCORE

(iv) Utility Functions:

Type Name Function Documentation
Data generate_data Synthesized data generation; normal data is generated by a multivariate Gaussian and outliers are generated by a uniform distribution generate_data
Data generate_data_clusters Synthesized data generation in clusters; more complex data patterns can be created with multiple clusters generate_data_clusters
Stat wpearsonr Calculate the weighted Pearson correlation of two samples wpearsonr
Utility get_label_n Turn raw outlier scores into binary labels by assigning 1 to the top n outlier scores get_label_n
Utility precision_n_scores Calculate precision @ rank n precision_n_scores
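
A short sketch of these utilities in action (the exact signatures and return orders have varied slightly across PyOD versions, so treat the call patterns below as assumptions and check the documentation of your installed version):

from pyod.models.knn import KNN
from pyod.utils.data import generate_data
from pyod.utils.utility import get_label_n, precision_n_scores

# synthetic data: inliers from a multivariate Gaussian, 10% outliers from a uniform distribution
X, y = generate_data(n_train=300, train_only=True, contamination=0.1)

clf = KNN()
clf.fit(X)
scores = clf.decision_scores_

# turn raw scores into binary labels; by default n is taken from the number of outliers in y
labels = get_label_n(y, scores)

# precision @ rank n against the ground-truth labels
print(precision_n_scores(y, scores))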

Quick Start for Outlier Detection

PyOD has been well acknowledged by the machine learning community with a few featured posts and tutorials.

Analytics Vidhya: An Awesome Tutorial to Learn Outlier Detection in Python using PyOD Library

KDnuggets: Intuitive Visualization of Outlier Detection Methods, An Overview of Outlier Detection Methods from PyOD

Towards Data Science: Anomaly Detection for Dummies

Computer Vision News (March 2019): Python Open Source Toolbox for Outlier Detection

"examples/knn_example.py" demonstrates the basic API of using kNN detector. It is noted that the API across all other algorithms are consistent/similar.

More detailed instructions for running examples can be found in the examples directory.

  1. Initialize a kNN detector, fit the model, and make the prediction.

    from pyod.models.knn import KNN   # kNN detector
    
    # train kNN detector
    clf_name = 'KNN'
    clf = KNN()
    clf.fit(X_train)
    
    # get the prediction label and outlier scores of the training data
    y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
    y_train_scores = clf.decision_scores_  # raw outlier scores
    
    # get the prediction on the test data
    y_test_pred = clf.predict(X_test)  # outlier labels (0 or 1)
    y_test_scores = clf.decision_function(X_test)  # outlier scores
    
    # it is possible to get the prediction confidence as well
    y_test_pred, y_test_pred_confidence = clf.predict(X_test, return_confidence=True)  # outlier labels (0 or 1) and confidence in the range of [0,1]
  2. Evaluate the prediction by ROC and Precision @ Rank n (p@n).

    from pyod.utils.data import evaluate_print
    
    # evaluate and print the results
    print("\nOn Training Data:")
    evaluate_print(clf_name, y_train, y_train_scores)
    print("\nOn Test Data:")
    evaluate_print(clf_name, y_test, y_test_scores)
  3. See sample outputs on both training and test data.

    On Training Data:
    KNN ROC:1.0, precision @ rank n:1.0
    
    On Test Data:
    KNN ROC:0.9989, precision @ rank n:0.9
  4. Generate the visualization with the visualize function included in all examples.

    visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred,
              y_test_pred, show_figure=True, save_figure=False)

Visualization (knn_figure):

kNN example figure


Reference


  1. Zhao, Y., Hu, X., Cheng, C., Wang, C., Wan, C., Wang, W., Yang, J., Bai, H., Li, Z., Xiao, C., Wang, Y., Qiao, Z., Sun, J. and Akoglu, L. (2021). SUOD: Accelerating Large-scale Unsupervised Heterogeneous Outlier Detection. Conference on Machine Learning and Systems (MLSys).

  2. Perini, L., Vercruyssen, V., Davis, J. Quantifying the confidence of anomaly detectors in their example-wise predictions. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD), 2020.

  3. Han, S., Hu, X., Huang, H., Jiang, M. and Zhao, Y., 2022. ADBench: Anomaly Detection Benchmark. arXiv preprint arXiv:2206.09426.

  4. Zhao, Y., Hu, X., Cheng, C., Wang, C., Wan, C., Wang, W., Yang, J., Bai, H., Li, Z., Xiao, C., Wang, Y., Qiao, Z., Sun, J. and Akoglu, L. (2021). SUOD: Accelerating Large-scale Unsupervised Heterogeneous Outlier Detection. Conference on Machine Learning and Systems (MLSys).

  5. Li, Z., Zhao, Y., Hu, X., Botta, N., Ionescu, C. and Chen, H. G. ECOD: Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2022.

  6. Kriegel, H.P. and Zimek, A., 2008, August. Angle-based outlier detection in high-dimensional data. In KDD '08, pp. 444-452. ACM.

  7. Kriegel, H.P. and Zimek, A., 2008, August. Angle-based outlier detection in high-dimensional data. In KDD '08, pp. 444-452. ACM.

  8. Li, Z., Zhao, Y., Botta, N., Ionescu, C. and Hu, X. COPOD: Copula-Based Outlier Detection. IEEE International Conference on Data Mining (ICDM), 2020.

  9. Iglewicz, B. and Hoaglin, D.C., 1993. How to detect and handle outliers (Vol. 16). Asq Press.

  10. Janssens, J.H.M., Huszár, F., Postma, E.O. and van den Herik, H.J., 2012. Stochastic outlier selection. Technical report TiCC TR 2012-001, Tilburg University, Tilburg Center for Cognition and Communication, Tilburg, The Netherlands.

  11. Fang, K.T. and Ma, C.X., 2001. Wrap-around L2-discrepancy of random sampling, Latin hypercube and uniform designs. Journal of complexity, 17(4), pp.608-624.

  12. Latecki, L.J., Lazarevic, A. and Pokrajac, D., 2007, July. Outlier detection with kernel density functions. In International Workshop on Machine Learning and Data Mining in Pattern Recognition (pp. 61-75). Springer, Berlin, Heidelberg.

  13. Sugiyama, M. and Borgwardt, K., 2013. Rapid distance-based outlier detection via sampling. Advances in neural information processing systems, 26.

  14. Aggarwal, C.C., 2015. Outlier analysis. In Data mining (pp. 237-263). Springer, Cham.

  15. Shyu, M.L., Chen, S.C., Sarinnapakorn, K. and Chang, L., 2003. A novel anomaly detection scheme based on principal component classifier. MIAMI UNIV CORAL GABLES FL DEPT OF ELECTRICAL AND COMPUTER ENGINEERING.

  16. Hoffmann, H., 2007. Kernel PCA for novelty detection. Pattern recognition, 40(3), pp.863-874.

  17. Hardin, J. and Rocke, D.M., 2004. Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator. Computational Statistics & Data Analysis, 44(4), pp.625-638.

  18. Rousseeuw, P.J. and Driessen, K.V., 1999. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41(3), pp.212-223.

  19. Cook, R.D., 1977. Detection of influential observation in linear regression. Technometrics, 19(1), pp.15-18.

  20. Scholkopf, B., Platt, J.C., Shawe-Taylor, J., Smola, A.J. and Williamson, R.C., 2001. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), pp.1443-1471.

  21. Arning, A., Agrawal, R. and Raghavan, P., 1996, August. A Linear Method for Deviation Detection in Large Databases. In KDD (Vol. 1141, No. 50, pp. 972-981).

  22. Breunig, M.M., Kriegel, H.P., Ng, R.T. and Sander, J., 2000, May. LOF: identifying density-based local outliers. ACM Sigmod Record, 29(2), pp. 93-104.

  23. Tang, J., Chen, Z., Fu, A.W.C. and Cheung, D.W., 2002, May. Enhancing effectiveness of outlier detections for low density patterns. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 535-548. Springer, Berlin, Heidelberg.

  24. Tang, J., Chen, Z., Fu, A.W.C. and Cheung, D.W., 2002, May. Enhancing effectiveness of outlier detections for low density patterns. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 535-548. Springer, Berlin, Heidelberg.

  25. He, Z., Xu, X. and Deng, S., 2003. Discovering cluster-based local outliers. Pattern Recognition Letters, 24(9-10), pp.1641-1650.

  26. Papadimitriou, S., Kitagawa, H., Gibbons, P.B. and Faloutsos, C., 2003, March. LOCI: Fast outlier detection using the local correlation integral. In ICDE '03, pp. 315-326. IEEE.

  27. Goldstein, M. and Dengel, A., 2012. Histogram-based outlier score (hbos): A fast unsupervised anomaly detection algorithm. In KI-2012: Poster and Demo Track, pp.59-63.

  28. Ramaswamy, S., Rastogi, R. and Shim, K., 2000, May. Efficient algorithms for mining outliers from large data sets. ACM Sigmod Record, 29(2), pp. 427-438.

  29. Angiulli, F. and Pizzuti, C., 2002, August. Fast outlier detection in high dimensional spaces. In European Conference on Principles of Data Mining and Knowledge Discovery pp. 15-27.

  30. Angiulli, F. and Pizzuti, C., 2002, August. Fast outlier detection in high dimensional spaces. In European Conference on Principles of Data Mining and Knowledge Discovery pp. 15-27.

  31. Kriegel, H.P., Kröger, P., Schubert, E. and Zimek, A., 2009, April. Outlier detection in axis-parallel subspaces of high dimensional data. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 831-838. Springer, Berlin, Heidelberg.

  32. Almardeny, Y., Boujnah, N. and Cleary, F., 2020. A Novel Outlier Detection Method for Multivariate Data. IEEE Transactions on Knowledge and Data Engineering.

  33. Liu, F.T., Ting, K.M. and Zhou, Z.H., 2008, December. Isolation forest. In International Conference on Data Mining, pp. 413-422. IEEE.

  34. Bandaragoda, T. R., Ting, K. M., Albrecht, D., Liu, F. T., Zhu, Y., and Wells, J. R., 2018, Isolation-based anomaly detection using nearest-neighbor ensembles. Computational Intelligence, 34(4), pp. 968-998.

  35. Xu, H., Pang, G., Wang, Y., Wang, Y., 2023. Deep isolation forest for anomaly detection. IEEE Transactions on Knowledge and Data Engineering.

  36. Lazarevic, A. and Kumar, V., 2005, August. Feature bagging for outlier detection. In KDD '05. 2005.

  37. Zhao, Y., Nasrullah, Z., Hryniewicki, M.K. and Li, Z., 2019, May. LSCP: Locally selective combination in parallel outlier ensembles. In Proceedings of the 2019 SIAM International Conference on Data Mining (SDM), pp. 585-593. Society for Industrial and Applied Mathematics.

  38. Zhao, Y. and Hryniewicki, M.K. XGBOD: Improving Supervised Outlier Detection with Unsupervised Representation Learning. IEEE International Joint Conference on Neural Networks, 2018.

  39. Pevný, T., 2016. Loda: Lightweight on-line detector of anomalies. Machine Learning, 102(2), pp.275-304.

  40. Zhao, Y., Hu, X., Cheng, C., Wang, C., Wan, C., Wang, W., Yang, J., Bai, H., Li, Z., Xiao, C., Wang, Y., Qiao, Z., Sun, J. and Akoglu, L. (2021). SUOD: Accelerating Large-scale Unsupervised Heterogeneous Outlier Detection. Conference on Machine Learning and Systems (MLSys).

  41. Aggarwal, C.C., 2015. Outlier analysis. In Data mining (pp. 237-263). Springer, Cham.

  42. Kingma, D.P. and Welling, M., 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

  43. Burgess, Christopher P., et al. "Understanding disentangling in beta-VAE." arXiv preprint arXiv:1804.03599 (2018).

  44. Liu, Y., Li, Z., Zhou, C., Jiang, Y., Sun, J., Wang, M. and He, X., 2019. Generative adversarial active learning for unsupervised outlier detection. IEEE Transactions on Knowledge and Data Engineering.

  45. Liu, Y., Li, Z., Zhou, C., Jiang, Y., Sun, J., Wang, M. and He, X., 2019. Generative adversarial active learning for unsupervised outlier detection. IEEE Transactions on Knowledge and Data Engineering.

  46. Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S.A., Binder, A., Müller, E. and Kloft, M., 2018, July. Deep one-class classification. In International conference on machine learning (pp. 4393-4402). PMLR.

  47. Schlegl, T., Seeböck, P., Waldstein, S.M., Schmidt-Erfurth, U. and Langs, G., 2017, June. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International conference on information processing in medical imaging (pp. 146-157). Springer, Cham.

  48. Zenati, H., Romain, M., Foo, C.S., Lecouat, B. and Chandrasekhar, V., 2018, November. Adversarially learned anomaly detection. In 2018 IEEE International conference on data mining (ICDM) (pp. 727-736). IEEE.

  49. You, C., Robinson, D.P. and Vidal, R., 2017. Provable self-representation based outlier detection in a union of subspaces. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  50. Goodge, A., Hooi, B., Ng, S.K. and Ng, W.S., 2022, June. Lunar: Unifying local outlier detection methods via graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence.

  51. Lazarevic, A. and Kumar, V., 2005, August. Feature bagging for outlier detection. In KDD '05. 2005.

  52. Zhao, Y., Nasrullah, Z., Hryniewicki, M.K. and Li, Z., 2019, May. LSCP: Locally selective combination in parallel outlier ensembles. In Proceedings of the 2019 SIAM International Conference on Data Mining (SDM), pp. 585-593. Society for Industrial and Applied Mathematics.

  53. Zhao, Y. and Hryniewicki, M.K. XGBOD: Improving Supervised Outlier Detection with Unsupervised Representation Learning. IEEE International Joint Conference on Neural Networks, 2018.

  54. Pevný, T., 2016. Loda: Lightweight on-line detector of anomalies. Machine Learning, 102(2), pp.275-304.

  55. Zhao, Y., Hu, X., Cheng, C., Wang, C., Wan, C., Wang, W., Yang, J., Bai, H., Li, Z., Xiao, C., Wang, Y., Qiao, Z., Sun, J. and Akoglu, L. (2021). SUOD: Accelerating Large-scale Unsupervised Heterogeneous Outlier Detection. Conference on Machine Learning and Systems (MLSys).

  56. Bandaragoda, T. R., Ting, K. M., Albrecht, D., Liu, F. T., Zhu, Y., and Wells, J. R., 2018, Isolation-based anomaly detection using nearest-neighbor ensembles. Computational Intelligence, 34(4), pp. 968-998.

  57. Aggarwal, C.C. and Sathe, S., 2015. Theoretical foundations and algorithms for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1), pp.24-47.

  58. Aggarwal, C.C. and Sathe, S., 2015. Theoretical foundations and algorithms for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1), pp.24-47.

  59. Aggarwal, C.C. and Sathe, S., 2015. Theoretical foundations and algorithms for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1), pp.24-47.

  60. Aggarwal, C.C. and Sathe, S., 2015. Theoretical foundations and algorithms for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1), pp.24-47.

  61. Aggarwal, C.C. and Sathe, S., 2015. Theoretical foundations and algorithms for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1), pp.24-47.

  62. Aggarwal, C.C. and Sathe, S., 2015. Theoretical foundations and algorithms for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1), pp.24-47.

  63. Aggarwal, C.C. and Sathe, S., 2015. Theoretical foundations and algorithms for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1), pp.24-47.

pyod's People

Contributors

agoodge, akarazeev, bflammers, dependabot[bot], drewnow, durgeshsamariya, edgarakopyan, forestsking, frizzodavide, ftorrresd, gian21391, ingonader, john-almardeny, kulikdm, lambertsbennett, lorenzo-perini, lorgoc, lucew, mbongaerts, quentin62, rlshuhart, roelbouman, shangwen777, tam17aki, ucabvas, winstonll, xhan97, xuhongzuo, yzhao062, zainnasrullah


pyod's Issues

Outlier score highly correlated with overall distance to the origin

I calculated the distance of each data point to the origin using np.linalg.norm(x), where x is a single multivariate sample, and then normalized these values to [0, 1]; I call this the 'global_score'. When I compare this global score to the scores from different methods, it turns out to be highly correlated (0.99) with PCA, AutoEncoder, CBLOF, and KNN. So it seems all these methods are essentially measuring the overall distance of a sample rather than detecting anomalies relative to multiple clusters.
I was very troubled by this and hope you can confirm whether this is true and, if it is, what the reason is.

Thanks
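
For context, the comparison described above can be reproduced with a short sketch like the one below (synthetic data and a kNN detector chosen for illustration; this assumes generate_data(train_only=True) returns (X, y) as in recent versions):

import numpy as np
from pyod.models.knn import KNN
from pyod.utils.data import generate_data

X, y = generate_data(n_train=500, train_only=True, contamination=0.1)

# "global score": Euclidean distance of each sample to the origin, scaled to [0, 1]
global_score = np.linalg.norm(X, axis=1)
global_score = (global_score - global_score.min()) / (global_score.max() - global_score.min())

clf = KNN()
clf.fit(X)

# correlation between the distance-to-origin baseline and the detector's scores
print(np.corrcoef(global_score, clf.decision_scores_)[0, 1])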

Generate Synthetic Data in Clusters

Adding a new feature: generating artificial data in clusters.
This adds a utility function to generate synthesized data in clusters.
The generated data can involve the low-density pattern problem and global outliers, which are considered difficult cases for outlier detection algorithms.
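
This feature is now available as generate_data_clusters (listed in the utility table earlier on this page). A minimal usage sketch follows; the parameter names and return order are assumptions based on the current documentation, so check help(generate_data_clusters) for your installed version:

from pyod.utils.data import generate_data_clusters

# NOTE: return order assumed to be X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = generate_data_clusters(
    n_train=500, n_test=200, n_clusters=3, n_features=2,
    contamination=0.1, random_state=42)

print(X_train.shape, y_train.mean())   # roughly 10% of training samples are labeled as outliers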

Installing Pyod broke my TensorFlow installation

Ubuntu 16.04

Traceback (most recent call last):
  File "features_2_3_rot_unet_1.py", line 3, in <module>
    import tensorflow as tf
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/__init__.py", line 22, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/__init__.py", line 81, in <module>
    from tensorflow.python import keras
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/keras/__init__.py", line 24, in <module>
    from tensorflow.python.keras import activations
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/keras/activations/__init__.py", line 22, in <module>
    from tensorflow.python.keras._impl.keras.activations import elu
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/keras/_impl/keras/__init__.py", line 21, in <module>
    from tensorflow.python.keras._impl.keras import activations
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/keras/_impl/keras/activations.py", line 23, in <module>
    from tensorflow.python.keras._impl.keras import backend as K
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/keras/_impl/keras/backend.py", line 36, in <module>
    from tensorflow.python.layers import base as tf_base_layers
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/layers/base.py", line 25, in <module>
    from tensorflow.python.keras.engine import base_layer
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/keras/engine/__init__.py", line 23, in <module>
    from tensorflow.python.keras.engine.base_layer import InputSpec
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/keras/engine/base_layer.py", line 33, in <module>
    from tensorflow.python.keras import backend
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/keras/backend/__init__.py", line 22, in <module>
    from tensorflow.python.keras._impl.keras.backend import abs
ImportError: cannot import name 'abs'

CBLOF predict error

Hi,
When I try to use CBLOF to predict one or two or any small number of samples, sometimes it fails, as in the example below:

clf_name = 'CBLOF'
clf = CBLOF(alpha=0.7, beta=2, check_estimator=False, n_clusters=6)
clf.fit(a[0:336])
print([a[338]])
clf.predict([a[338]])

Output:

[array([0.21751617])]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-173-5342692feffe> in <module>()
      3 clf.fit(a[0:336])
      4 
----> 5 clf.predict([a[338]])

/usr/local/lib/python3.5/dist-packages/pyod/models/base.py in predict(self, X)
    125         check_is_fitted(self, ['decision_scores_', 'threshold_', 'labels_'])
    126 
--> 127         pred_score = self.decision_function(X)
    128         return (pred_score > self.threshold_).astype('int').ravel()
    129 

/usr/local/lib/python3.5/dist-packages/pyod/models/cblof.py in decision_function(self, X)
    179         X = check_array(X)
    180         labels = self.clustering_estimator_.predict(X)
--> 181         return self._decision_function(X, labels)
    182 
    183     def _validate_estimator(self, default=None):

/usr/local/lib/python3.5/dist-packages/pyod/models/cblof.py in _decision_function(self, X, labels)
    281 
    282         scores[large_indices] = pairwise_distances_no_broadcast(
--> 283             X[large_indices, :], large_centers)
    284 
    285         if self.use_weights:

/usr/local/lib/python3.5/dist-packages/pyod/utils/stat_models.py in pairwise_distances_no_broadcast(X, Y)
     36     :rtype: array of shape (n_samples,)
     37     """
---> 38     X = check_array(X)
     39     Y = check_array(Y)
     40     assert_allclose(X.shape, Y.shape)

/usr/local/lib/python3.5/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    580                              " minimum of %d is required%s."
    581                              % (n_samples, shape_repr, ensure_min_samples,
--> 582                                 context))
    583 
    584     if ensure_min_features > 0 and array.ndim == 2:

ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required.

But when I try to predict ensuring that one of them is not an anomaly, then it works in all the cases:

pred = clf.predict([clf.cluster_centers_[clf.large_cluster_labels_[0]],a[338]])
print (pred)

Output:

[0 1]

Thanks for your help

KNN Mahalanobis distance error

Hi,

When I use the Mahalanobis metric for KNN I always get the error "Must provide either V or VI for Mahalanobis distance" even when I provide V with metric_params. The same request works with sklearn.neighbors.


from pyod.models.knn import KNN  
from pyod.utils.data import generate_data
from sklearn.neighbors import NearestNeighbors
import numpy as np

contamination = 0.1  
n_train = 200  
n_test = 100 

X_train, y_train, X_test, y_test = generate_data(n_train=n_train, n_test=n_test, contamination=contamination)

#Doesn't work (Must provide either V or VI for Mahalanobis distance)
clf = KNN(algorithm='brute', metric='mahalanobis', metric_params={'V': np.cov(X_train)})
clf.fit(X_train)

#Works
nn = NearestNeighbors(algorithm='brute', metric='mahalanobis', metric_params={'V': np.cov(X_train)})
nn.fit(X_train)

Problem in CBLOF when the number of clusters is large and the train set has too many repeated values

Hi

If the train set has many repeated values and a large number of clusters is used, then some clusters will have the same value for the center. So, when computing self.cluster_sizes_ = np.bincount(clf.cluster_labels_), the result is an array smaller than the number of clusters, which generates an error and makes it impossible to set the large and small clusters. This could be avoided by changing self.cluster_sizes_ = np.bincount(clf.cluster_labels_) to self.cluster_sizes_ = np.bincount(clf.cluster_labels_, minlength=n_clusters); a standalone illustration of this follows after the traceback below. This issue is limiting my code's flexibility and I want to know if it is worth getting fixed.

Example of code:

import numpy as np
from pyod.models.cblof import CBLOF
from pyod.utils.data import generate_data
x = [[ 0.30244003],  [0.01218177],[-0.50835109], [-0.36951435],[ 0.97274482], [-0.68325119], 
     [0.0], [0.0], [0.08], [0.0], [0.0], [ 0.0],[ 0.0], [ 0.0],[0.09], [0.0],[ 0.0], [0.0],
     [0.0], [ 0.0],[-20.29518778], [0.0],[ 0.0], [0.0],[ 0.0], [ 0.0],
     [0.0], [ 8.38548823], [0.0], [ 0.0]]
test = generate_data(train_only=True)
clf_name = 'CBLOF'
clf = CBLOF(alpha=0.1, n_clusters=15, beta=10, check_estimator=False)
try:
    clf.fit(x)
except Exception as ex:
    print(str(ex))
    print("\n Cluster centers: " + str(clf.cluster_centers_))
    print("\n Cluster sizes: " + str(clf.cluster_sizes_))
    print('\n Supposed to be the cluster size: ' + str(np.bincount(clf.cluster_labels_, minlength=15)))
    print("\n Large clusters: " + str(clf.large_cluster_labels_))
    print("\n Small clusters: " + str(clf.small_cluster_labels_))

Output:

index 11 is out of bounds for axis 0 with size 11

 Cluster centers: [[ 0.00000000e+00]
 [-2.02951878e+01]
 [ 8.38548823e+00]
 [ 9.72744820e-01]
 [-5.08351090e-01]
 [ 3.02440030e-01]
 [-6.83251190e-01]
 [-3.69514350e-01]
 [ 8.00000000e-02]
 [ 1.21817700e-02]
 [ 9.00000000e-02]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 8.00000000e-02]
 [ 0.00000000e+00]]

 Cluster sizes: [20  1  1  1  1  1  1  1  1  1  1]

 Supposed to be the cluster size: [20  1  1  1  1  1  1  1  1  1  1  0  0  0  0]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-39-1b14a2099b96> in <module>()
     18 try:
---> 19     clf.fit(x)
     20 except Exception as ex:

/usr/local/lib/python3.5/dist-packages/pyod/models/cblof.py in fit(self, X, y)
    168         self._set_cluster_centers(X, n_features)
--> 169         self._set_small_large_clusters(n_samples)
    170 

/usr/local/lib/python3.5/dist-packages/pyod/models/cblof.py in _set_small_large_clusters(self, n_samples)
    251 
--> 252             if size_clusters[sorted_cluster_indices[i]] / size_clusters[
    253                 sorted_cluster_indices[i - 1]] >= self.beta:

IndexError: index 11 is out of bounds for axis 0 with size 11

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
<ipython-input-39-1b14a2099b96> in <module>()
     23     print("\n Cluster sizes: " + str(clf.cluster_sizes_))
     24     print('\n Supposed to be the cluster size: ' + str(np.bincount(clf.cluster_labels_, minlength=15)))
---> 25     print("\n Large clusters: " + str(clf.large_cluster_labels_))
     26     print("\n Small clusters: " + str(clf.small_cluster_labels_))
     27 

AttributeError: 'CBLOF' object has no attribute 'large_cluster_labels_'

Thanks for your help,
Giovanna
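
As mentioned above, a standalone illustration of the proposed fix: np.bincount pads its result to a fixed length when minlength is given, so clusters that received no samples still get a (zero) entry:

import numpy as np

# cluster labels produced with n_clusters=15, but only 11 clusters end up non-empty
cluster_labels = np.array([0] * 20 + [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

print(np.bincount(cluster_labels))                 # length 11: indexing clusters 11..14 fails
print(np.bincount(cluster_labels, minlength=15))   # length 15: empty clusters get a zero count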

Documentation / Implementation difference in Autoencoder

While exploring the AutoEncoder in pyod, I've noticed a discrepancy between the generated docs and the implementation.
While the docs (https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.auto_encoder) state that hidden_neurons defaults to a list ([64, 32, 32, 64]), the implementation assigns None as the default: https://github.com/yzhao062/pyod/blob/development/pyod/models/auto_encoder.py#L126
While this isn't a problem by itself, instantiating an AutoEncoder that way resulted in a TypeError on my side:

         # Verify the network design is valid
>       if not self.hidden_neurons == self.hidden_neurons[::-1]:
E       TypeError: 'NoneType' object is not subscriptable

It might be worth a try to change the default for hidden_neurons to the list mentioned in the docs.
And by the way: Thanks for this framework, it is really a breeze to work with!
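
Until the default is changed, passing hidden_neurons explicitly is a simple workaround. A minimal sketch on toy data (requires the optional keras/tensorflow dependency; parameter names follow the documented keras-based implementation, and epochs is reduced just to keep the run short):

import numpy as np
from pyod.models.auto_encoder import AutoEncoder   # needs keras + a backend such as tensorflow

X_train = np.random.randn(500, 64)   # toy data with 64 features

# pass the architecture explicitly instead of relying on the (currently None) default
clf = AutoEncoder(hidden_neurons=[64, 32, 32, 64], epochs=10, verbose=0)
clf.fit(X_train)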

n_jobs ignored

Hi, I'm using XGBOD with n_jobs = -1 and it's no different than using it with n_jobs = 1...

Instructions on setting up Keras and Tensorflow for AutoEncoder in PyOD

It is nice that PyOD includes some neural network based models, such as AutoEncoder. However, you may find that after pip install pyod, AutoEncoder models do not run. This is expected, since I do not want PyOD to rely on too many packages, and not everyone needs to run AutoEncoder.

If you have tensorflow-gpu installed, keras will automatically run on the GPU.
If you want to run AutoEncoder, please first install keras plus a backend library, e.g., tensorflow. Either of the following should do the installation for you:

  • pip install keras tensorflow or pip install keras tensorflow-gpu
  • conda install keras tensorflow or conda install keras tensorflow-gpu

You need tensorflow-gpu if your device has a GPU and you want to leverage it.

After keras and tensorflow are installed, you are ready to run auto_encoder_example.py.

Here are some potential error messages you may encounter:

1. ModuleNotFoundError: No module named 'theano'

In this case, you should set the keras backend to the one you want to use, e.g., TensorFlow.
Go to $HOME/.keras/keras.json and change the "backend" to "tensorflow".

2. ModuleNotFoundError: No module named 'error'

In this case, you need to install keras and tensorflow with conda, which can either be done in the GUI or simply use "conda install keras" and "conda install tensorflow"

LSCP with multiple LOF testing error: range parameter must be finite

I am running the following code:

clf_name = 'LSCP_LOF'

other parameters:

from pyod.models.lof import LOF
from pyod.models.lscp import LSCP

lof_list = [LOF(n_neighbors=5), LOF(n_neighbors=10), LOF(n_neighbors=20), LOF(n_neighbors=30), LOF(n_neighbors=40), LOF(n_neighbors=50), LOF(n_neighbors=75)]

clf = LSCP(lof_list)
#clf = LOF(n_neighbors=5, contamination=outliers_fraction)
clf.fit(X_train)

and got the following error; however, when fitting directly with the LOF method, it runs fine:

ValueError Traceback (most recent call last)
in ()
12 clf = LSCP(lof_list)
13 #clf = LOF(n_neighbors=5, contamination=outliers_fraction)
---> 14 clf.fit(X_train)
15
16 # get the prediction label and outlier scores of the training data

~/proj/myPylib/lib/python3.6/site-packages/pyod/models/lscp.py in fit(self, X, y)
171
172 # set decision scores and threshold
--> 173 self.decision_scores_ = self._get_decision_scores(X)
174 self._process_decision_scores()
175

~/proj/myPylib/lib/python3.6/site-packages/pyod/models/lscp.py in _get_decision_scores(self, X)
273 pred_scores_ens[i,] = np.mean(
274 test_scores_norm[
--> 275 i, self._get_competent_detectors(pearson_corr_scores)])
276
277 return pred_scores_ens

~/proj/myPylib/lib/python3.6/site-packages/pyod/models/lscp.py in _get_competent_detectors(self, scores)
355 "classifiers, reducing n_bins to n_clf.")
356 self.n_bins = self.n_clf
--> 357 hist, bin_edges = np.histogram(scores, bins=self.n_bins)
358
359 # find n_selected largest bins

/opt/anaconda3/lib/python3.6/site-packages/numpy/lib/function_base.py in histogram(a, bins, range, normed, weights, density)
668 if not np.all(np.isfinite([first_edge, last_edge])):
669 raise ValueError(
--> 670 'range parameter must be finite.')
671 if first_edge == last_edge:
672 first_edge -= 0.5

ValueError: range parameter must be finite.

Thanks

SOS: overflow encountered in multiply beta[i] = beta[i] * 2.0

I am running the following code:

from pyod.models.sos import SOS

clf_name = 'SOS'
clf = SOS()
clf.fit(X_train)

and got the following warning:
RuntimeWarning: overflow encountered in multiply
beta[i] = beta[i] * 2.0
/opt/anaconda3/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use arr[tuple(seq)] instead of arr[seq]. In the future this will be interpreted as an array index, arr[np.array(seq)], which will result either in an error or a different result.
return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
~/proj/myPylib/lib/python3.6/site-packages/pyod/models/base.py:336: RuntimeWarning: invalid value encountered in greater
self.labels_ = (self.decision_scores_ > self.threshold_).astype(

data.zip
I have uploaded the data for X_train here.

My samples have duplicates and when I remove the duplicates the error does not occur. However I need to retain the duplicates.

The KNN example is incorrect

The KNN example is incorrect
There is no get_outliers_inliers in pyod

from pyod.utils.data import generate_data
from pyod.utils.data import get_outliers_inliers
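
If get_outliers_inliers is not available in the installed version, the same split can be done directly with NumPy boolean indexing on the labels returned by generate_data. A small sketch (assuming generate_data(train_only=True) returns (X, y)):

from pyod.utils.data import generate_data

X, y = generate_data(n_train=300, train_only=True, contamination=0.1)

# split the samples by their ground-truth labels (1 = outlier, 0 = inlier)
X_outliers = X[y == 1]
X_inliers = X[y == 0]
print(X_outliers.shape, X_inliers.shape)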

XGBOD and LSCP missing from install

I installed PyOD using:

pip install pyod
pip install --upgrade pyod

However, LSCP and XGBOD are not installed. All of the other models in the repo can be successfully imported into a jupyter notebook. Attempting to import LSCP and XGBOD both yield a "ModuleNotFoundError: No module named" error.

Breadth-First Approach in FeatureBagging

May I ask a question about the implemented approach for "combination" in "feature_bagging.py"?

IMHO, the idea of using "maximization" is not a precise reflection of the original paper (lazarevic2005feature). The authors describe a breadth-first search procedure there; arguably the numeric differences might be small.
However, please consider a generic toy example as a counter-example:

|------| Alg1 | Alg2 |
| Obs1 | 10.0 | 2.0  |
| Obs2 |  9.0 | 3.0  |
| Obs3 |  8.0 | 4.0  |

Maximization would return the order Obs1 (score: 10), Obs2 (score: 9), Obs3 (score: 8), whereas breadth-first search would return Obs1 (rank 1 in Alg1), Obs3 (rank 1 in Alg2), and then Obs2; a small sketch of both orderings follows below.

Many thanks.
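
To make the difference concrete, here is a naive sketch of the two combination strategies on the toy scores above (my own illustration, not PyOD's implementation):

import numpy as np

# toy scores from the example: rows = observations, columns = detectors
scores = np.array([[10.0, 2.0],    # Obs1
                   [ 9.0, 3.0],    # Obs2
                   [ 8.0, 4.0]])   # Obs3

# maximization: rank observations by their maximum score across detectors
max_order = np.argsort(-scores.max(axis=1))
print("maximization order:", max_order + 1)        # Obs1, Obs2, Obs3

# breadth-first: take the top-ranked observation of each detector in turn,
# then the second-ranked of each detector, skipping observations already taken
ranked_per_detector = np.argsort(-scores, axis=0)  # per-column ranking, best first
bfs_order = []
for rank in range(scores.shape[0]):
    for det in range(scores.shape[1]):
        obs = ranked_per_detector[rank, det]
        if obs not in bfs_order:
            bfs_order.append(obs)
print("breadth-first order:", np.array(bfs_order) + 1)   # Obs1, Obs3, Obs2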

Correct handling of LOF proba predictions

Hi,

thanks for the great library.
When evaluating whether it is usable for my work, I stumbled across a potential issue.
My workflow looks as follows:

  1. Train the LOF detector on a training dataset.
  2. Provide raw scores and outlier probabilities for this set
  3. Deploy the model to generate outlier probabilities on new data

I'm not quite sure how to correctly perform step 2. Executing lof.predict_proba(train) calls lof.decision_function(train), which delegates to the sklearn implementation. sklearn explicitly states that this function is only supposed to handle new data (https://github.com/scikit-learn/scikit-learn/blob/f0ab589f/sklearn/neighbors/lof.py#L233), which is violated here.

Thanks for your help
Alex

Request to add an article resource - Outlier Detection using PyOD

Hi,
I have written an article on Outlier Detection using PyOD on Analytics Vidhya Blog -
https://www.analyticsvidhya.com/blog/2019/02/outlier-detection-python-pyod/

In the article, I have tried to explain the need for outlier detection, how we can use pyod for it, and have also applied pyod to a real-world data set.
Please consider including it in your resources section on GitHub. I believe it would be really helpful for people who want to get started with pyod.

Thanks

tensorflow pip installation fails with travis-ci python 3.7

matrix:
  include:
    - python: "3.7"
      dist: xenial
      sudo: true

Error message:
Collecting tensorflow (from -r requirements_travis.txt (line 8))
Could not find a version that satisfies the requirement tensorflow (from -r requirements_travis.txt (line 8))(from versions: )
No matching distribution found for tensorflow (from -r requirements_travis.txt (line 8))
The command "pip install -r requirements_travis.txt" failed and exited with 1 during .

Wait until pip fixes tensorflow installation under python 3.7.

Slow installation due to the underlying dependencies

It is noted that PyOD depends on a few libraries, including:

  • keras
  • matplotlib (optional, required for running examples)
  • nose
  • numpy>=1.13
  • numba>=0.35
  • scipy>=0.19.1
  • scikit_learn>=0.19.1
  • tensorflow (optional, required if calling AutoEncoder, other backend also works)

It is getting more serious as we have started introducing deep learning models into PyOD, which are implemented in Keras (and, of course, with some backend library, e.g., TensorFlow).

In addition, to improve efficiency, we have started using JIT compilation in PyOD, specifically Numba, to accelerate execution; it uses the LLVM compiler to overcome the overhead of Python.

In the long run, I am also interested in bringing GPU support to PyOD, which could be done through CUDA programming. However, it will clearly make installation and maintenance messier due to the added complexity.

Therefore, I would like to gather some ideas regarding comprehensiveness vs. efficiency vs. complexity for the development of PyOD. What is your opinion? Is the current installation too cumbersome for you?

LOCI fails on MacOS with Python 2.7 (caused by np.count_nonzero)

It is noted running LOCI model on MacOS with Python 2.7 may fail. One potential cause is the following code, as np.count_nonzero returns int instead of array.
I am currently investigating how to fix it. Please stay tuned.

 def _get_alpha_n(self, dist_matrix, indices, r):
        """Computes the alpha neighbourhood points.
        
        Parameters
        ----------
        dist_matrix : array-like, shape (n_samples, n_features)
            The distance matrix w.r.t. to the training samples.
        
        indices : int
            Subsetting index
        
        r : int
            Neighbourhood radius
            
        Returns
        -------
        alpha_n : array, shape (n_alpha, )
            Returns the alpha neighbourhood points.       
        """

        if type(indices) is int:
            alpha_n = np.count_nonzero(
                dist_matrix[indices, :] < (r * self._alpha))
            return alpha_n
        else:
            alpha_n = np.count_nonzero(
                dist_matrix[indices, :] < (r * self._alpha), axis=1)
            return alpha_n

The error message looks like below:

(test27) bash-3.2$ python loci_example.py
/anaconda2/envs/test27/lib/python2.7/site-packages/pyod/models/loci.py:199: RuntimeWarning: divide by zero encountered in double_scalars
outlier_scores[p_ix] = mdef/sigma_mdef
/Users/zhaoy9/.local/lib/python2.7/site-packages/numpy/core/_methods.py:101: RuntimeWarning: invalid value encountered in subtract
x = asanyarray(arr - arrmean)
On Training Data:
Traceback (most recent call last):
File "loci_example.py", line 133, in
evaluate_print(clf_name, y_train, y_train_scores)
File "/anaconda2/envs/test27/lib/python2.7/site-packages/pyod/utils/data.py", line 159, in evaluate_print
roc=np.round(roc_auc_score(y, y_pred), decimals=4),
File "/anaconda2/envs/test27/lib/python2.7/site-packages/sklearn/metrics/ranking.py", line 356, in roc_auc_score
sample_weight=sample_weight)
File "/anaconda2/envs/test27/lib/python2.7/site-packages/sklearn/metrics/base.py", line 77, in _average_binary_score
return binary_metric(y_true, y_score, sample_weight=sample_weight)
File "/anaconda2/envs/test27/lib/python2.7/site-packages/sklearn/metrics/ranking.py", line 328, in _binary_roc_auc_score
sample_weight=sample_weight)
File "/anaconda2/envs/test27/lib/python2.7/site-packages/sklearn/metrics/ranking.py", line 618, in roc_curve
y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)
File "/anaconda2/envs/test27/lib/python2.7/site-packages/sklearn/metrics/ranking.py", line 403, in _binary_clf_curve
assert_all_finite(y_score)
File "/anaconda2/envs/test27/lib/python2.7/site-packages/sklearn/utils/validation.py", line 68, in assert_all_finite
_assert_all_finite(X.data if sp.issparse(X) else X, allow_nan)
File "/anaconda2/envs/test27/lib/python2.7/site-packages/sklearn/utils/validation.py", line 56, in _assert_all_finite
raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

AUC score & precision score are different - why are they not the same?

from pyod.utils.data import evaluate_print

# evaluate and print the results
print("\nOn Training Data:")
evaluate_print(clf_name, y_true, y_scores)

On Training Data:
KNN ROC:0.9352, precision @ rank n:0.568

from sklearn import metrics
print("Accuracy Score",round(metrics.accuracy_score(y_true, y_pred),2))
print("Precision Score",round(metrics.precision_score(y_true, y_pred),2))
print("Recall Score",round(metrics.recall_score(y_true, y_pred),2))
print("F1 Score",round(metrics.f1_score(y_true, y_pred),2))
print("Roc Auc score",round(metrics.roc_auc_score(y_true, y_pred),2))

Accuracy Score 0.92
Precision Score 0.55
Recall Score 0.59
F1 Score 0.57
Roc Auc score 0.77

specifying categorical features in Python Outlier Detection (PyOD)

How do I specify the categorical features in PyOD when using Histogram-based Outlier Detection (HBOS) for anomaly detection?
I've read that HBOS can be used for anomaly detection when there are categorical features involved. I found its Python implementation here:
https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.hbos
But I can't figure out how I should pass the positions or names of the categorical features of my dataset while training the model.
The code I've tried:

from pyod.models.hbos import HBOS

clf = HBOS(n_bins=10, alpha=0.1, tol=0.5, contamination=0.1)
clf.fit(train_df)
train_pred = clf.labels_

There is no parameter for specifying categorical features while training.

pyod fails to install using pip

When attempting to install without nose, I receive the following error:

(PyVi) Michael:PyVi michael$ pip install pyod
Collecting pyod==0.5.0 (from -r requirements.txt (line 18))
  Using cached https://files.pythonhosted.org/packages/c9/8c/6774fa2e7ae6fe9c2c648114d15ba584f950002377480e14183a0999af30/pyod-0.5.0.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/j4/_68f6f3j4d51_2smq2mh5hyh0000gn/T/pip-install-gjdzzane/pyod/setup.py", line 2, in <module>
        from pyod import __version__
      File "/private/var/folders/j4/_68f6f3j4d51_2smq2mh5hyh0000gn/T/pip-install-gjdzzane/pyod/pyod/__init__.py", line 4, in <module>
        from . import models
      File "/private/var/folders/j4/_68f6f3j4d51_2smq2mh5hyh0000gn/T/pip-install-gjdzzane/pyod/pyod/models/__init__.py", line 2, in <module>
        from .abod import ABOD
      File "/private/var/folders/j4/_68f6f3j4d51_2smq2mh5hyh0000gn/T/pip-install-gjdzzane/pyod/pyod/models/abod.py", line 17, in <module>
        from .base import BaseDetector
      File "/private/var/folders/j4/_68f6f3j4d51_2smq2mh5hyh0000gn/T/pip-install-gjdzzane/pyod/pyod/models/base.py", line 27, in <module>
        from ..utils.utility import precision_n_scores
      File "/private/var/folders/j4/_68f6f3j4d51_2smq2mh5hyh0000gn/T/pip-install-gjdzzane/pyod/pyod/utils/__init__.py", line 2, in <module>
        from .utility import check_parameter
      File "/private/var/folders/j4/_68f6f3j4d51_2smq2mh5hyh0000gn/T/pip-install-gjdzzane/pyod/pyod/utils/utility.py", line 18, in <module>
        from sklearn.utils.testing import assert_equal
      File "/Users/michael/anaconda3/envs/PyVi/lib/python3.6/site-packages/sklearn/utils/testing.py", line 49, in <module>
        from nose.tools import raises
    ModuleNotFoundError: No module named 'nose'
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/j4/_68f6f3j4d51_2smq2mh5hyh0000gn/T/pip-install-gjdzzane/pyod/

func:`pyod.utils.data.visualize` does not exist

Is this function pyod.utils.data.visualize deprecated? I cannot import this function.

import sys
import pyod
In[]: sys.version
Out[]: '3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]'
In[]: pyod.utils.data.visualize(clf_name, X_train, X_test, y_train_pred, y_test_pred, show_figure=True, save_figure=False)

Traceback (most recent call last):

  File "<ipython-input-9-1628666df63a>", line 2, in <module>
    pyod.utils.data.visualize(clf_name,

AttributeError: module 'pyod.utils.data' has no attribute 'visualize'
 (py36) E:\MyNutshell>pip show pyod                                   
Name: pyod                                                           
Version: 0.5.6                                                       
Summary: A Python Outlier Detection (Anomaly Detection) Toolbox      
Home-page: https://github.com/yzhao062/Pyod                          
Author: Yue Zhao                                                     
Author-email: [email protected]                                 
License: UNKNOWN             
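If I remember the current package layout correctly, recent PyOD releases ship this helper as pyod.utils.example.visualize rather than pyod.utils.data.visualize, and the argument list may differ between versions. A minimal sketch assuming that module path and the signature used in knn_example.py:

from pyod.utils.example import visualize  # assumed location in recent releases

visualize(clf_name, X_train, y_train, X_test, y_test,
          y_train_pred, y_test_pred, show_figure=True, save_figure=False)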

Merge with kenchi

Hi,

I am currently developing an anomaly detection package called kenchi and would like to merge this code into your package.
https://github.com/HazureChi/kenchi

There are three points that I can contribute to pyod.

The first is the implementation of One-time sampling.
https://github.com/HazureChi/kenchi/blob/master/kenchi/outlier_detection/distance_based.py

Sugiyama, M., and Borgwardt, K., "Rapid distance-based outlier detection via sampling," Advances in NIPS, pp. 467-475, 2013.

The second is the implementation of evaluation metrics for outlier detection.
https://github.com/HazureChi/kenchi/blob/master/kenchi/metrics.py

Lee, W. S, and Liu, B., "Learning with positive and unlabeled examples using weighted Logistic Regression," In Proceedings of ICML, pp. 448-455, 2003.

Goix, N., "How to evaluate the quality of unsupervised anomaly detection algorithms?" In ICML Anomaly Detection Workshop, 2016.

The last is the implementation of a function that loads and returns various datasets.
https://github.com/HazureChi/kenchi/blob/master/kenchi/datasets/base.py

If you agree, I would actively like to contribute to pyod in the future.

Thanks.

intended clf.predict_proba usage

I'm trying to make sense of the predict_proba function.

What I want to achieve: get class probabilities for generating metrics like ROC curves, calibration curves, precision, accuracy, etc. with scikit-learn tools. As I am working on a binary classification task, I thought I could use predict_proba for this.

The documentation describes it as "predict the probability of a sample being outlier" that returns:
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1].
which is what I am currently looking for. What I don't understand is that an ndarray of shape (n_observations, 2) is returned.
If I compare the output of clf.predict() and clf.predict_proba() side by side, I see a high value in the first column of the predict_proba array all the time:

0 -> [0.86014439 0.13985561]
0 -> [0.96943563 0.03056437]
0 -> [0.88716599 0.11283401]
0 -> [0.87912382 0.12087618]
0 -> [0.9686196   0.0313804]
0 -> [0.87921815 0.12078185]
1 -> [0.83279906 0.16720094]
0 -> [0.87921815 0.12078185]
0 -> [0.86137304 0.13862696]
0 -> [0.98987502 0.01012498]

Might the first column be read as "how confident is the classifier that the predicted class is correct"? It would be great if you could help me out on this one.

By the way: Thanks for building such a great Python module!
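If I read the PyOD base detector correctly, the two columns follow scikit-learn's convention: column 0 is the probability of being an inlier and column 1 the probability of being an outlier, so score-based metrics should use the second column. A minimal sketch:

from sklearn.metrics import roc_auc_score

proba = clf.predict_proba(X_test)   # shape (n_samples, 2)
outlier_proba = proba[:, 1]         # probability of being an outlier

# feed the outlier column, not the hard labels, into score-based metrics
print(roc_auc_score(y_test, outlier_proba))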

I am trying to run RandomizedSearchCV on ABOD, but surprisingly it does not work

Here is my code:

import numpy as np
from pyod.models.abod import ABOD
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
from sklearn.metrics import make_scorer, precision_score, recall_score

# ABOD's neighbour-count parameter is 'n_neighbors', not 'neighbours'
param_grid = {'n_neighbors': list(range(1, 5, 1)),
              'contamination': np.linspace(0.01, 0.05, 5)}

skf = StratifiedKFold(n_splits=10)
folds = list(skf.split(X.toarray(), y_true))
clf = ABOD()
scoring = make_scorer(precision_score)
search = RandomizedSearchCV(estimator=clf, param_distributions=param_grid, scoring=scoring, cv=folds)
search.fit(X.toarray(), y_true)
y_pred = search.predict(X.toarray())
print('Best parameters: %0.10f' % search.best_params_["contamination"],
      'Precision score: %0.3f' % precision_score(y_true, y_pred),
      'Recall score: %0.3f' % recall_score(y_true, y_pred))

Best parameters:0.0100000000 Precision score: 0.000 Recall score: 0.000

kNN visualization (interpretation)

The visualization produced by knn_example.py for the "Test Set Prediction" shows two false positives, i.e., 12 outlier findings instead of 10 as in the "Test Set Ground Truth" chart. Isn't this somewhat inconsistent with the result printed to console that ROC_AUC = 1?

If so, I think the inconsistency arises because the predicted labels for the chart are based on y_test_pred = clf.predict(X_test). I think that means the test labels are being predicted by comparison with the distance threshold clf.threshold_ obtained when fitting clf to the training data. In contrast, the ROC_AUC value is based on a fixed contamination rate (10%).

It would only make sense to use clf.threshold_ for this purpose if the kNN distance for any point x_i in the test set were being computed over the distances from x_i to each of the 200 training points, not the distances to the other 99 test points. But then, the ROC_AUC curve ought to be based on those same labels, and it isn't, is it? I think it's currently being generated from a set of labels that re-applies the 10% contamination assumption, ignoring clf.threshold_.

(I can't quite follow whether the kNN distances for the points in the test set are being computed vs. the training set or vs. the other points in the test set. Can you clarify this for me? I have to guess that it's the former; if it's the latter, then it would seem really weird to be applying clf.threshold_ from a training set of a different size.)

Is it even appropriate to apply the kNN model from training data directly to the test set? I would have thought this use of kNN is intended for an entire data set all by itself, although one could perhaps study a training set to make a reasonable judgment about appropriate values of k and the contamination rate.

Thanks!
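To make the two quantities being compared concrete, here is a minimal sketch (assuming the standard knn_example.py variable names): the hard labels come from the training-set threshold, while ROC AUC is threshold-free and uses the raw scores.

from sklearn.metrics import roc_auc_score

# hard 0/1 labels: thresholded at clf.threshold_ learned on the training set
y_test_pred = clf.predict(X_test)

# continuous outlier scores: no threshold involved
y_test_scores = clf.decision_function(X_test)

# ROC AUC uses the scores, so it can be 1.0 even if the labels show "false positives"
print(roc_auc_score(y_test, y_test_scores))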

IForest: FutureWarning: behaviour="old" is deprecated

Hi,

Thanks for a great library!

When declaring a new IForest object, Sklearn throws the following warning:

FutureWarning: behaviour="old" is deprecated and will be removed in version 0.22. Please use behaviour="new", which makes the decision_function change to match other anomaly detection algorithm API.
FutureWarning)

This new behavior in sklearn's iforest is about where the threshold is set between anomalies and normal observations. See documentation on behaviour argument and offset_:

behaviour : str, default='old'
Behaviour of the decision_function which can be either 'old' or
'new'. Passing behaviour='new' makes the decision_function
change to match other anomaly detection algorithm API which will be
the default behaviour in the future. As explained in details in the
offset_ attribute documentation, the decision_function becomes
dependent on the contamination parameter, in such a way that 0 becomes
its natural threshold to detect outliers.

offset_ : float
Offset used to define the decision function from the raw scores.
We have the relation: decision_function = score_samples - offset_.
Assuming behaviour == 'new', offset_ is defined as follows.
When the contamination parameter is set to "auto", the offset is equal
to -0.5 as the scores of inliers are close to 0 and the scores of
outliers are close to -1. When a contamination parameter different
than "auto" is provided, the offset is defined in such a way we obtain
the expected number of outliers (samples with decision function < 0)
in training.
Assuming the behaviour parameter is set to 'old', we always have
offset_ = -0.5, making the decision function independent from the
contamination parameter.

I think a simple fix would be to add the argument behaviour="new" in the call to sklearn.ensemble.IsolationForest.
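For reference, a minimal sketch of what the proposed call would look like against scikit-learn directly (assuming a scikit-learn version around 0.20-0.21, where the behaviour flag still exists; it was deprecated in 0.22 and later removed):

from sklearn.ensemble import IsolationForest

# with behaviour='new', decision_function < 0 marks outliers,
# consistent with the other sklearn anomaly detection estimators
iso = IsolationForest(behaviour='new', contamination=0.1, random_state=42)
iso.fit(X_train)
labels = (iso.decision_function(X_test) < 0).astype(int)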

If n_samples is large, certain outlier models' error rate is 200% higher

Hi YZhao,

I am writing about one possible issue: in the example notebook that compares all models, if n_samples is changed to a large number, for example 10**5 or larger, certain models' OD results are totally wrong.
Note: I noticed there is a similar issue, "Problem in CBLOF when the number of clusters is big and the train set has too many repeated values" #53, but my finding seems different, so I am posting it as well.

Here is the issue:

  1. With the default n_samples = 200 and outlier_fraction = 0.25, the number of ground-truth outlier points is 50. After changing n_samples to 10**5, the number of ground-truth outlier points should be 25000.
    However, the following models report far more outliers than the ground-truth count:
    Feature Bagging: 35259
    Local Outlier Factor (LOF): 36144
    Locally Selective Combination (LSCP): 37276
    (screen capture omitted)

I guess it might be related to the dataset type. Does it mean the simulated data is similar to the "glass" or "optdigits" sample data? Why did the other estimators not show such a high error rate?

Looking forward to your kind response!

Last but not least, I learned about PyOD from Zhihu. It is an excellent tool; in particular, I got access to more OD resource links from your GitHub. Your work is awesome!

Recently, I have begun trying some of its models in one of my general OD automation tools (which uses Docker and Airflow as the platform). Dataset type and dataset quantity are two of the points that need to be considered.

WangYong
[email protected]

Is it possible to make CBLOF ignore contamination parameter?

CBLOF's parameters seem useless. Basically the only thing that matters when using the method is the contamination parameter: if I set it to 0.3, it will flag 30% of the points as anomalies, no matter how normal they are or whether they belong to a big cluster. From what I understood about the method, it should be able to decide what is and is not an anomaly based only on the parameters alpha and beta, so why is this happening?
Is there a way to ignore contamination?
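As far as I can tell, contamination in PyOD only sets the threshold used to turn decision scores into 0/1 labels; it does not change the scores themselves. One way to sidestep it is to work from decision_scores_ directly and pick your own cutoff, as in this sketch (the 95th-percentile cutoff is a hypothetical choice, not anything CBLOF prescribes):

import numpy as np
from pyod.models.cblof import CBLOF

clf = CBLOF(alpha=0.9, beta=5)           # contamination left at its default
clf.fit(X_train)

scores = clf.decision_scores_            # raw CBLOF outlier scores
cutoff = np.percentile(scores, 95)       # hypothetical user-chosen cutoff
labels = (scores > cutoff).astype(int)   # 1 = anomaly under this custom threshold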

ValueError: continuous format is not supported

Hey there,

Following the KNN example, I'm getting this error:

ValueError                                Traceback (most recent call last)
<ipython-input-252-21e0f0751702> in <module>()
      2 # evaluate and print the results
      3 print("\nOn Training Data:")
----> 4 evaluate_print(clf_name, y_train, y_train_scores)
      5 print("\nOn Test Data:")
      6 evaluate_print(clf_name, y_test, y_test_scores)
------
    157     print('{clf_name} ROC:{roc}, precision @ rank n:{prn}'.format(
    158         clf_name=clf_name,
--> 159         roc=np.round(roc_auc_score(y, y_pred), decimals=4),
    160         prn=np.round(precision_n_scores(y, y_pred), decimals=4)))

any suggestions?
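This error usually means the ground-truth array passed to evaluate_print is continuous rather than binary: roc_auc_score, which evaluate_print calls internally, only accepts binary labels as y_true. A minimal sketch of a fix, under the assumption that y_train holds continuous values that encode outliers as nonzero:

import numpy as np
from pyod.utils.data import evaluate_print

# roc_auc_score needs binary ground truth; binarize y_train first
# (the "nonzero means outlier" rule below is an assumption about your data)
y_train_binary = (np.asarray(y_train) != 0).astype(int)

print("\nOn Training Data:")
evaluate_print(clf_name, y_train_binary, y_train_scores)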
