yzhao062 / pyod

A Comprehensive and Scalable Python Library for Outlier Detection (Anomaly Detection)

Home Page: http://pyod.readthedocs.io

License: BSD 2-Clause "Simplified" License

Python 86.63% Jupyter Notebook 13.37%
outlier-detection anomaly-detection outlier-ensembles outliers anomaly python machine-learning data-mining unsupervised-learning python3 fraud-detection autoencoder neural-networks deep-learning data-science data-analysis novelty-detection out-of-distribution-detection

pyod's Introduction

Python Outlier Detection (PyOD)

Deployment & Documentation & Stats & License

[Badges: PyPI version, Anaconda version, Documentation status, GitHub stars, GitHub forks, Downloads, testing, Coverage Status, Maintainability, License, Benchmark]


Read Me First

Welcome to PyOD, a versatile Python library for detecting anomalies in multivariate data. Whether you're tackling a small-scale project or working with large datasets, PyOD offers a range of algorithms to suit your needs.


About PyOD

PyOD, established in 2017, has become a go-to Python library for detecting anomalous/outlying objects in multivariate data. This exciting yet challenging field is commonly referred to as Outlier Detection or Anomaly Detection.

PyOD includes more than 50 detection algorithms, from classical LOF (SIGMOD 2000) to the cutting-edge ECOD and DIF (TKDE 2022 and 2023). Since 2017, PyOD has been successfully used in numerous academic research projects and commercial products, with more than 17 million downloads. It is also well acknowledged by the machine learning community with various dedicated posts/tutorials, including Analytics Vidhya, KDnuggets, and Towards Data Science.

PyOD is featured for:

  • Unified, User-Friendly Interface across various algorithms.
  • Wide Range of Models, from classic techniques to the latest deep learning methods.
  • High Performance & Efficiency, leveraging numba and joblib for JIT compilation and parallel processing.
  • Fast Training & Prediction, achieved through the SUOD framework1.

Outlier Detection with 5 Lines of Code:

# Example: Training an ECOD detector
from pyod.models.ecod import ECOD
clf = ECOD()
clf.fit(X_train)
y_train_scores = clf.decision_scores_  # Outlier scores for training data
y_test_scores = clf.decision_function(X_test)  # Outlier scores for test data

Selecting the Right Algorithm: Unsure where to start? Consider these robust and interpretable options:

  • ECOD: Example of using ECOD for outlier detection
  • Isolation Forest: Example of using Isolation Forest for outlier detection

Alternatively, explore MetaOD for a data-driven approach.

Citing PyOD:

The PyOD paper is published in the Journal of Machine Learning Research (JMLR) (MLOSS track). If you use PyOD in a scientific publication, we would appreciate citations to the following paper:

@article{zhao2019pyod,
    author  = {Zhao, Yue and Nasrullah, Zain and Li, Zheng},
    title   = {PyOD: A Python Toolbox for Scalable Outlier Detection},
    journal = {Journal of Machine Learning Research},
    year    = {2019},
    volume  = {20},
    number  = {96},
    pages   = {1-7},
    url     = {http://jmlr.org/papers/v20/19-011.html}
}

or:

Zhao, Y., Nasrullah, Z. and Li, Z., 2019. PyOD: A Python Toolbox for Scalable Outlier Detection. Journal of machine learning research (JMLR), 20(96), pp.1-7.

For a broader perspective on anomaly detection, see our NeurIPS papers ADBench: Anomaly Detection Benchmark Paper & ADGym: Design Choices for Deep Anomaly Detection:

@article{han2022adbench,
    title={Adbench: Anomaly detection benchmark},
    author={Han, Songqiao and Hu, Xiyang and Huang, Hailiang and Jiang, Minqi and Zhao, Yue},
    journal={Advances in Neural Information Processing Systems},
    volume={35},
    pages={32142--32159},
    year={2022}
}

@article{jiang2023adgym,
    title={ADGym: Design Choices for Deep Anomaly Detection},
    author={Jiang, Minqi and Hou, Chaochuan and Zheng, Ao and Han, Songqiao and Huang, Hailiang and Wen, Qingsong and Hu, Xiyang and Zhao, Yue},
    journal={Advances in Neural Information Processing Systems},
    volume={36},
    year={2023}
}


Installation

PyOD is designed for easy installation using either pip or conda. We recommend using the latest version of PyOD due to frequent updates and enhancements:

pip install pyod            # normal install
pip install --upgrade pyod  # or update if needed
conda install -c conda-forge pyod

Alternatively, you can clone the repository and install from source:

git clone https://github.com/yzhao062/pyod.git
cd pyod
pip install .

Required Dependencies:

  • Python 3.8 or higher
  • joblib
  • matplotlib
  • numpy>=1.19
  • numba>=0.51
  • scipy>=1.5.1
  • scikit_learn>=0.22.0

Optional Dependencies (see details below):

  • combo (optional, required for models/combination.py and FeatureBagging)
  • keras/tensorflow (optional, required for AutoEncoder, and other deep learning models)
  • suod (optional, required for running SUOD model)
  • xgboost (optional, required for XGBOD)
  • pythresh (optional, required for thresholding)

API Cheatsheet & Reference

The full API Reference is available at PyOD Documentation. Below is a quick cheatsheet for all detectors:

  • fit(X): Fit the detector. The parameter y is ignored in unsupervised methods.
  • decision_function(X): Predict raw anomaly scores for X using the fitted detector.
  • predict(X): Determine whether a sample is an outlier or not as binary labels using the fitted detector.
  • predict_proba(X): Estimate the probability of a sample being an outlier using the fitted detector.
  • predict_confidence(X): Assess the model's confidence on a per-sample basis (applicable in predict and predict_proba)2.

Key Attributes of a fitted model:

  • decision_scores_: Outlier scores of the training data; higher scores typically indicate more abnormal behavior, so outliers tend to receive higher scores.
  • labels_: Binary labels of the training data, where 0 indicates inliers and 1 indicates outliers/anomalies.
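
Putting the cheatsheet together, the following minimal sketch (toy random data and a kNN detector, chosen purely for illustration) exercises each method and attribute listed above:

import numpy as np
from pyod.models.knn import KNN

rng = np.random.RandomState(42)
X_train = rng.randn(200, 2)   # toy training data
X_test = rng.randn(100, 2)    # toy test data

clf = KNN()
clf.fit(X_train)                                # y is ignored for unsupervised detectors

train_scores = clf.decision_scores_             # raw outlier scores of the training data
train_labels = clf.labels_                      # binary labels of the training data (0/1)

test_scores = clf.decision_function(X_test)     # raw outlier scores for new data
test_labels = clf.predict(X_test)               # binary outlier labels for new data
test_proba = clf.predict_proba(X_test)          # outlier probability estimates
test_labels, test_conf = clf.predict(X_test, return_confidence=True)  # labels plus per-sample confidence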

ADBench Benchmark and Datasets

We have released ADBench: Anomaly Detection Benchmark3, the most comprehensive anomaly detection benchmark to date (45 pages). The fully open-sourced ADBench compares 30 anomaly detection algorithms on 57 benchmark datasets.

The organization of ADBench is provided below:

[Figure: organization of ADBench (benchmark-fig)]

For a simpler visualization, we compare selected models via compare_all_models.py.

[Figure: comparison of all implemented models (Comparison_of_All)]


Model Save & Load

PyOD takes a similar approach to scikit-learn regarding model persistence. See the scikit-learn model persistence documentation for clarification.

In short, we recommend using joblib or pickle for saving and loading PyOD models. See "examples/save_load_model_example.py" for an example; it is as simple as the snippet below:

from joblib import dump, load

# save the model
dump(clf, 'clf.joblib')
# load the model
clf = load('clf.joblib')

It is known that saving and loading neural network models poses challenges. Check #328 and #88 for temporary workarounds.


Fast Train with SUOD

Fast training and prediction: it is possible to train and predict with a large number of detection models in PyOD by leveraging the SUOD framework4. See the SUOD Paper and the SUOD example.

from pyod.models.suod import SUOD
from pyod.models.lof import LOF
from pyod.models.copod import COPOD
from pyod.models.iforest import IForest

# initialize a group of outlier detectors for acceleration
detector_list = [LOF(n_neighbors=15), LOF(n_neighbors=20),
                 LOF(n_neighbors=25), LOF(n_neighbors=35),
                 COPOD(), IForest(n_estimators=100),
                 IForest(n_estimators=200)]

# decide the number of parallel processes and the combination method;
# clf can then be used as any other outlier detection model
clf = SUOD(base_estimators=detector_list, n_jobs=2, combination='average',
           verbose=False)
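
Once constructed, the SUOD ensemble is used like any other PyOD detector. A minimal end-to-end sketch on toy data (random arrays, purely for illustration) might look like this:

import numpy as np
from pyod.models.suod import SUOD
from pyod.models.lof import LOF
from pyod.models.copod import COPOD
from pyod.models.iforest import IForest

rng = np.random.RandomState(0)
X_train = rng.randn(500, 5)   # toy training data
X_test = rng.randn(200, 5)    # toy test data

clf = SUOD(base_estimators=[LOF(n_neighbors=15), COPOD(), IForest(n_estimators=50)],
           n_jobs=2, combination='average', verbose=False)
clf.fit(X_train)

y_train_scores = clf.decision_scores_            # combined scores on the training data
y_test_scores = clf.decision_function(X_test)    # combined scores on new data
y_test_labels = clf.predict(X_test)              # binary outlier labels (0/1)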

Thresholding Outlier Scores

A more data-driven approach can be taken when setting the contamination level. By using a thresholding method, guessing an arbitrary value can be replaced with tested techniques for separating inliers from outliers. Refer to PyThresh for a more in-depth look at thresholding.

from pyod.models.knn import KNN
from pyod.models.thresholds import FILTER

# Set the outlier detection and thresholding methods
clf = KNN(contamination=FILTER())
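
As with other detectors, the threshold is applied when the model is fitted. A short sketch on toy data (random values, assuming the optional pythresh dependency is installed):

import numpy as np
from pyod.models.knn import KNN
from pyod.models.thresholds import FILTER   # requires the optional pythresh dependency

rng = np.random.RandomState(0)
X_train = rng.randn(300, 4)   # toy data for illustration

clf = KNN(contamination=FILTER())   # threshold chosen by FILTER instead of a fixed contamination rate
clf.fit(X_train)

labels = clf.labels_            # binary labels derived from the data-driven threshold
scores = clf.decision_scores_   # raw outlier scores are unaffected by the threshold choice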

Implemented Algorithms

The PyOD toolkit consists of four major functional groups:

(i) Individual Detection Algorithms:

Type Abbr Algorithm Year Ref
Probabilistic ECOD Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions 2022 5
Probabilistic ABOD Angle-Based Outlier Detection 2008 6
Probabilistic FastABOD Fast Angle-Based Outlier Detection using approximation 2008 7
Probabilistic COPOD COPOD: Copula-Based Outlier Detection 2020 8
Probabilistic MAD Median Absolute Deviation (MAD) 1993 9
Probabilistic SOS Stochastic Outlier Selection 2012 10
Probabilistic QMCD Quasi-Monte Carlo Discrepancy outlier detection 2001 11
Probabilistic KDE Outlier Detection with Kernel Density Functions 2007 12

Probabilistic Sampling Rapid distance-based outlier detection via sampling 2013 13
Probabilistic GMM Probabilistic Mixture Modeling for Outlier Analysis 14 [Ch.2]

Linear Model PCA Principal Component Analysis (the sum of weighted projected distances to the eigenvector hyperplanes) 2003 15
Linear Model KPCA Kernel Principal Component Analysis 2007 16
Linear Model MCD Minimum Covariance Determinant (use the mahalanobis distances as the outlier scores) 1999 1718
Linear Model CD Use Cook's distance for outlier detection 1977 19
Linear Model OCSVM One-Class Support Vector Machines 2001 20
Linear Model LMDD Deviation-based Outlier Detection (LMDD) 1996 21
Proximity-Based LOF Local Outlier Factor 2000 22
Proximity-Based COF Connectivity-Based Outlier Factor 2002 23
Proximity-Based (Incremental) COF Memory Efficient Connectivity-Based Outlier Factor (slower but reduce storage complexity) 2002 24
Proximity-Based CBLOF Clustering-Based Local Outlier Factor 2003 25
Proximity-Based LOCI LOCI: Fast outlier detection using the local correlation integral 2003 26
Proximity-Based HBOS Histogram-based Outlier Score 2012 27
Proximity-Based kNN k Nearest Neighbors (use the distance to the kth nearest neighbor as the outlier score) 2000 28
Proximity-Based AvgKNN Average kNN (use the average distance to k nearest neighbors as the outlier score) 2002 29
Proximity-Based MedKNN Median kNN (use the median distance to k nearest neighbors as the outlier score) 2002 30
Proximity-Based SOD Subspace Outlier Detection 2009 31
Proximity-Based ROD Rotation-based Outlier Detection 2020 32
Outlier Ensembles IForest Isolation Forest 2008 33
Outlier Ensembles INNE Isolation-based Anomaly Detection Using Nearest-Neighbor Ensembles 2018 34
Outlier Ensembles DIF Deep Isolation Forest for Anomaly Detection 2023 35
Outlier Ensembles FB Feature Bagging 2005 36
Outlier Ensembles LSCP LSCP: Locally Selective Combination of Parallel Outlier Ensembles 2019 37
Outlier Ensembles XGBOD Extreme Boosting Based Outlier Detection (Supervised) 2018 38
Outlier Ensembles LODA Lightweight On-line Detector of Anomalies 2016 39

Outlier Ensembles SUOD SUOD: Accelerating Large-scale Unsupervised Heterogeneous Outlier Detection (Acceleration) 2021 40
Neural Networks AutoEncoder Fully connected AutoEncoder (use reconstruction error as the outlier score) 41 [Ch.3]

Neural Networks VAE Variational AutoEncoder (use reconstruction error as the outlier score) 2013 42
Neural Networks Beta-VAE Variational AutoEncoder (customizable loss term by varying gamma and capacity) 2018 43
Neural Networks SO_GAAL Single-Objective Generative Adversarial Active Learning 2019 44
Neural Networks MO_GAAL Multiple-Objective Generative Adversarial Active Learning 2019 45
Neural Networks DeepSVDD Deep One-Class Classification 2018 46
Neural Networks AnoGAN Anomaly Detection with Generative Adversarial Networks 2017 47
Neural Networks ALAD Adversarially learned anomaly detection 2018 48
Graph-based R-Graph Outlier detection by R-graph 2017 49
Graph-based LUNAR LUNAR: Unifying Local Outlier Detection Methods via Graph Neural Networks 2022 50

(ii) Outlier Ensembles & Outlier Detector Combination Frameworks:

Type Abbr Algorithm Year Ref
Outlier Ensembles FB Feature Bagging 2005 51
Outlier Ensembles LSCP LSCP: Locally Selective Combination of Parallel Outlier Ensembles 2019 52
Outlier Ensembles XGBOD Extreme Boosting Based Outlier Detection (Supervised) 2018 53
Outlier Ensembles LODA Lightweight On-line Detector of Anomalies 2016 54
Outlier Ensembles SUOD SUOD: Accelerating Large-scale Unsupervised Heterogeneous Outlier Detection (Acceleration) 2021 55
Outlier Ensembles INNE Isolation-based Anomaly Detection Using Nearest-Neighbor Ensembles 2018 56
Combination Average Simple combination by averaging the scores 2015 57
Combination Weighted Average Simple combination by averaging the scores with detector weights 2015 58
Combination Maximization Simple combination by taking the maximum scores 2015 59
Combination AOM Average of Maximum 2015 60
Combination MOA Maximization of Average 2015 61
Combination Median Simple combination by taking the median of the scores 2015 62
Combination majority Vote Simple combination by taking the majority vote of the labels (weights can be used) 2015 63

(iii) Outlier Detection Score Thresholding Methods:

Type Abbr Algorithm Documentation
Kernel-Based AUCP Area Under Curve Percentage AUCP
Statistical Moment-Based BOOT Bootstrapping BOOT
Normality-Based CHAU Chauvenet's Criterion CHAU
Linear Model CLF Trained Linear Classifier CLF
Cluster-Based CLUST Clustering Based CLUST
Kernel-Based CPD Change Point Detection CPD
Transformation-Based DECOMP Decomposition DECOMP
Normality-Based DSN Distance Shift from Normal DSN
Curve-Based EB Elliptical Boundary EB
Kernel-Based FGD Fixed Gradient Descent FGD
Filter-Based FILTER Filtering Based FILTER
Curve-Based FWFM Full Width at Full Minimum FWFM
Statistical Test-Based GESD Generalized Extreme Studentized Deviate GESD
Filter-Based HIST Histogram Based HIST
Quantile-Based IQR Inter-Quartile Region IQR
Statistical Moment-Based KARCH Karcher mean (Riemannian Center of Mass) KARCH
Statistical Moment-Based MAD Median Absolute Deviation MAD
Statistical Test-Based MCST Monte Carlo Shapiro Tests MCST
Ensembles-Based META Meta-model Trained Classifier META
Transformation-Based MOLL Friedrichs' Mollifier MOLL
Statistical Test-Based MTT Modified Thompson Tau Test MTT
Linear Model OCSVM One-Class Support Vector Machine OCSVM
Quantile-Based QMCD Quasi-Monte Carlo Discrepancy QMCD
Linear Model REGR Regression Based REGR
Neural Networks VAE Variational Autoencoder VAE
Curve-Based WIND Topological Winding Number WIND
Transformation-Based YJ Yeo-Johnson Transformation YJ
Normality-Based ZSCORE Z-score ZSCORE

(iv) Utility Functions:

Type Name Function Documentation
Data generate_data Synthesized data generation; normal data is generated by a multivariate Gaussian and outliers are generated by a uniform distribution generate_data
Data generate_data_clusters Synthesized data generation in clusters; more complex data patterns can be created with multiple clusters generate_data_clusters
Stat wpearsonr Calculate the weighted Pearson correlation of two samples wpearsonr
Utility get_label_n Turn raw outlier scores into binary labels by assigning 1 to the top n outlier scores get_label_n
Utility precision_n_scores Calculate precision @ rank n precision_n_scores
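
A short sketch of these utilities in action (the exact signatures and return orders have varied slightly across PyOD versions, so treat the call patterns below as assumptions and check the documentation of your installed version):

from pyod.models.knn import KNN
from pyod.utils.data import generate_data
from pyod.utils.utility import get_label_n, precision_n_scores

# synthetic data: inliers from a multivariate Gaussian, 10% outliers from a uniform distribution
X, y = generate_data(n_train=300, train_only=True, contamination=0.1)

clf = KNN()
clf.fit(X)
scores = clf.decision_scores_

# turn raw scores into binary labels; by default n is taken from the number of outliers in y
labels = get_label_n(y, scores)

# precision @ rank n against the ground-truth labels
print(precision_n_scores(y, scores))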

Quick Start for Outlier Detection

PyOD has been well acknowledged by the machine learning community with a few featured posts and tutorials.

Analytics Vidhya: An Awesome Tutorial to Learn Outlier Detection in Python using PyOD Library

KDnuggets: Intuitive Visualization of Outlier Detection Methods, An Overview of Outlier Detection Methods from PyOD

Towards Data Science: Anomaly Detection for Dummies

Computer Vision News (March 2019): Python Open Source Toolbox for Outlier Detection

"examples/knn_example.py" demonstrates the basic API of using kNN detector. It is noted that the API across all other algorithms are consistent/similar.

More detailed instructions for running examples can be found in the examples directory.

  1. Initialize a kNN detector, fit the model, and make the prediction.

    from pyod.models.knn import KNN   # kNN detector
    
    # train kNN detector
    clf_name = 'KNN'
    clf = KNN()
    clf.fit(X_train)
    
    # get the prediction label and outlier scores of the training data
    y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
    y_train_scores = clf.decision_scores_  # raw outlier scores
    
    # get the prediction on the test data
    y_test_pred = clf.predict(X_test)  # outlier labels (0 or 1)
    y_test_scores = clf.decision_function(X_test)  # outlier scores
    
    # it is possible to get the prediction confidence as well
    y_test_pred, y_test_pred_confidence = clf.predict(X_test, return_confidence=True)  # outlier labels (0 or 1) and confidence in the range of [0,1]
  2. Evaluate the prediction by ROC and Precision @ Rank n (p@n).

    from pyod.utils.data import evaluate_print
    
    # evaluate and print the results
    print("\nOn Training Data:")
    evaluate_print(clf_name, y_train, y_train_scores)
    print("\nOn Test Data:")
    evaluate_print(clf_name, y_test, y_test_scores)
  3. See sample outputs on both training and test data.

    On Training Data:
    KNN ROC:1.0, precision @ rank n:1.0
    
    On Test Data:
    KNN ROC:0.9989, precision @ rank n:0.9
  4. Generate the visualization with the visualize function included in all examples.

    visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred,
              y_test_pred, show_figure=True, save_figure=False)

Visualization (knn_figure):

kNN example figure


Reference


  1. Zhao, Y., Hu, X., Cheng, C., Wang, C., Wan, C., Wang, W., Yang, J., Bai, H., Li, Z., Xiao, C., Wang, Y., Qiao, Z., Sun, J. and Akoglu, L. (2021). SUOD: Accelerating Large-scale Unsupervised Heterogeneous Outlier Detection. Conference on Machine Learning and Systems (MLSys).

  2. Perini, L., Vercruyssen, V., Davis, J. Quantifying the confidence of anomaly detectors in their example-wise predictions. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD), 2020.

  3. Han, S., Hu, X., Huang, H., Jiang, M. and Zhao, Y., 2022. ADBench: Anomaly Detection Benchmark. arXiv preprint arXiv:2206.09426.

  4. Zhao, Y., Hu, X., Cheng, C., Wang, C., Wan, C., Wang, W., Yang, J., Bai, H., Li, Z., Xiao, C., Wang, Y., Qiao, Z., Sun, J. and Akoglu, L. (2021). SUOD: Accelerating Large-scale Unsupervised Heterogeneous Outlier Detection. Conference on Machine Learning and Systems (MLSys).

  5. Li, Z., Zhao, Y., Hu, X., Botta, N., Ionescu, C. and Chen, H. G. ECOD: Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2022.

  6. Kriegel, H.P. and Zimek, A., 2008, August. Angle-based outlier detection in high-dimensional data. In KDD '08, pp. 444-452. ACM.

  7. Kriegel, H.P. and Zimek, A., 2008, August. Angle-based outlier detection in high-dimensional data. In KDD '08, pp. 444-452. ACM.

  8. Li, Z., Zhao, Y., Botta, N., Ionescu, C. and Hu, X. COPOD: Copula-Based Outlier Detection. IEEE International Conference on Data Mining (ICDM), 2020.

  9. Iglewicz, B. and Hoaglin, D.C., 1993. How to detect and handle outliers (Vol. 16). Asq Press.

  10. Janssens, J.H.M., Huszár, F., Postma, E.O. and van den Herik, H.J., 2012. Stochastic outlier selection. Technical report TiCC TR 2012-001, Tilburg University, Tilburg Center for Cognition and Communication, Tilburg, The Netherlands.

  11. Fang, K.T. and Ma, C.X., 2001. Wrap-around L2-discrepancy of random sampling, Latin hypercube and uniform designs. Journal of complexity, 17(4), pp.608-624.

  12. Latecki, L.J., Lazarevic, A. and Pokrajac, D., 2007, July. Outlier detection with kernel density functions. In International Workshop on Machine Learning and Data Mining in Pattern Recognition (pp. 61-75). Springer, Berlin, Heidelberg.

  13. Sugiyama, M. and Borgwardt, K., 2013. Rapid distance-based outlier detection via sampling. Advances in neural information processing systems, 26.

  14. Aggarwal, C.C., 2015. Outlier analysis. In Data mining (pp. 237-263). Springer, Cham.

  15. Shyu, M.L., Chen, S.C., Sarinnapakorn, K. and Chang, L., 2003. A novel anomaly detection scheme based on principal component classifier. MIAMI UNIV CORAL GABLES FL DEPT OF ELECTRICAL AND COMPUTER ENGINEERING.

  16. Hoffmann, H., 2007. Kernel PCA for novelty detection. Pattern recognition, 40(3), pp.863-874.

  17. Hardin, J. and Rocke, D.M., 2004. Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator. Computational Statistics & Data Analysis, 44(4), pp.625-638.

  18. Rousseeuw, P.J. and Driessen, K.V., 1999. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41(3), pp.212-223.

  19. Cook, R.D., 1977. Detection of influential observation in linear regression. Technometrics, 19(1), pp.15-18.

  20. Scholkopf, B., Platt, J.C., Shawe-Taylor, J., Smola, A.J. and Williamson, R.C., 2001. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), pp.1443-1471.

  21. Arning, A., Agrawal, R. and Raghavan, P., 1996, August. A Linear Method for Deviation Detection in Large Databases. In KDD (Vol. 1141, No. 50, pp. 972-981).

  22. Breunig, M.M., Kriegel, H.P., Ng, R.T. and Sander, J., 2000, May. LOF: identifying density-based local outliers. ACM Sigmod Record, 29(2), pp. 93-104.

  23. Tang, J., Chen, Z., Fu, A.W.C. and Cheung, D.W., 2002, May. Enhancing effectiveness of outlier detections for low density patterns. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 535-548. Springer, Berlin, Heidelberg.

  24. Tang, J., Chen, Z., Fu, A.W.C. and Cheung, D.W., 2002, May. Enhancing effectiveness of outlier detections for low density patterns. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 535-548. Springer, Berlin, Heidelberg.

  25. He, Z., Xu, X. and Deng, S., 2003. Discovering cluster-based local outliers. Pattern Recognition Letters, 24(9-10), pp.1641-1650.

  26. Papadimitriou, S., Kitagawa, H., Gibbons, P.B. and Faloutsos, C., 2003, March. LOCI: Fast outlier detection using the local correlation integral. In ICDE '03, pp. 315-326. IEEE.

  27. Goldstein, M. and Dengel, A., 2012. Histogram-based outlier score (hbos): A fast unsupervised anomaly detection algorithm. In KI-2012: Poster and Demo Track, pp.59-63.

  28. Ramaswamy, S., Rastogi, R. and Shim, K., 2000, May. Efficient algorithms for mining outliers from large data sets. ACM Sigmod Record, 29(2), pp. 427-438.

  29. Angiulli, F. and Pizzuti, C., 2002, August. Fast outlier detection in high dimensional spaces. In European Conference on Principles of Data Mining and Knowledge Discovery pp. 15-27.

  30. Angiulli, F. and Pizzuti, C., 2002, August. Fast outlier detection in high dimensional spaces. In European Conference on Principles of Data Mining and Knowledge Discovery pp. 15-27.

  31. Kriegel, H.P., Kröger, P., Schubert, E. and Zimek, A., 2009, April. Outlier detection in axis-parallel subspaces of high dimensional data. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 831-838. Springer, Berlin, Heidelberg.

  32. Almardeny, Y., Boujnah, N. and Cleary, F., 2020. A Novel Outlier Detection Method for Multivariate Data. IEEE Transactions on Knowledge and Data Engineering.

  33. Liu, F.T., Ting, K.M. and Zhou, Z.H., 2008, December. Isolation forest. In International Conference on Data Mining, pp. 413-422. IEEE.

  34. Bandaragoda, T. R., Ting, K. M., Albrecht, D., Liu, F. T., Zhu, Y., and Wells, J. R., 2018, Isolation-based anomaly detection using nearest-neighbor ensembles. Computational Intelligence, 34(4), pp. 968-998.

  35. Xu, H., Pang, G., Wang, Y., Wang, Y., 2023. Deep isolation forest for anomaly detection. IEEE Transactions on Knowledge and Data Engineering.

  36. Lazarevic, A. and Kumar, V., 2005, August. Feature bagging for outlier detection. In KDD '05. 2005.

  37. Zhao, Y., Nasrullah, Z., Hryniewicki, M.K. and Li, Z., 2019, May. LSCP: Locally selective combination in parallel outlier ensembles. In Proceedings of the 2019 SIAM International Conference on Data Mining (SDM), pp. 585-593. Society for Industrial and Applied Mathematics.

  38. Zhao, Y. and Hryniewicki, M.K. XGBOD: Improving Supervised Outlier Detection with Unsupervised Representation Learning. IEEE International Joint Conference on Neural Networks, 2018.

  39. Pevný, T., 2016. Loda: Lightweight on-line detector of anomalies. Machine Learning, 102(2), pp.275-304.

  40. Zhao, Y., Hu, X., Cheng, C., Wang, C., Wan, C., Wang, W., Yang, J., Bai, H., Li, Z., Xiao, C., Wang, Y., Qiao, Z., Sun, J. and Akoglu, L. (2021). SUOD: Accelerating Large-scale Unsupervised Heterogeneous Outlier Detection. Conference on Machine Learning and Systems (MLSys).

  41. Aggarwal, C.C., 2015. Outlier analysis. In Data mining (pp. 237-263). Springer, Cham.

  42. Kingma, D.P. and Welling, M., 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

  43. Burgess, Christopher P., et al. "Understanding disentangling in beta-VAE." arXiv preprint arXiv:1804.03599 (2018).

  44. Liu, Y., Li, Z., Zhou, C., Jiang, Y., Sun, J., Wang, M. and He, X., 2019. Generative adversarial active learning for unsupervised outlier detection. IEEE Transactions on Knowledge and Data Engineering.

  45. Liu, Y., Li, Z., Zhou, C., Jiang, Y., Sun, J., Wang, M. and He, X., 2019. Generative adversarial active learning for unsupervised outlier detection. IEEE Transactions on Knowledge and Data Engineering.

  46. Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S.A., Binder, A., Müller, E. and Kloft, M., 2018, July. Deep one-class classification. In International conference on machine learning (pp. 4393-4402). PMLR.

  47. Schlegl, T., Seeböck, P., Waldstein, S.M., Schmidt-Erfurth, U. and Langs, G., 2017, June. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International conference on information processing in medical imaging (pp. 146-157). Springer, Cham.

  48. Zenati, H., Romain, M., Foo, C.S., Lecouat, B. and Chandrasekhar, V., 2018, November. Adversarially learned anomaly detection. In 2018 IEEE International conference on data mining (ICDM) (pp. 727-736). IEEE.

  49. You, C., Robinson, D.P. and Vidal, R., 2017. Provable self-representation based outlier detection in a union of subspaces. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  50. Goodge, A., Hooi, B., Ng, S.K. and Ng, W.S., 2022, June. Lunar: Unifying local outlier detection methods via graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence.

  51. Lazarevic, A. and Kumar, V., 2005, August. Feature bagging for outlier detection. In KDD '05. 2005.

  52. Zhao, Y., Nasrullah, Z., Hryniewicki, M.K. and Li, Z., 2019, May. LSCP: Locally selective combination in parallel outlier ensembles. In Proceedings of the 2019 SIAM International Conference on Data Mining (SDM), pp. 585-593. Society for Industrial and Applied Mathematics.

  53. Zhao, Y. and Hryniewicki, M.K. XGBOD: Improving Supervised Outlier Detection with Unsupervised Representation Learning. IEEE International Joint Conference on Neural Networks, 2018.

  54. Pevný, T., 2016. Loda: Lightweight on-line detector of anomalies. Machine Learning, 102(2), pp.275-304.

  55. Zhao, Y., Hu, X., Cheng, C., Wang, C., Wan, C., Wang, W., Yang, J., Bai, H., Li, Z., Xiao, C., Wang, Y., Qiao, Z., Sun, J. and Akoglu, L. (2021). SUOD: Accelerating Large-scale Unsupervised Heterogeneous Outlier Detection. Conference on Machine Learning and Systems (MLSys).

  56. Bandaragoda, T. R., Ting, K. M., Albrecht, D., Liu, F. T., Zhu, Y., and Wells, J. R., 2018, Isolation-based anomaly detection using nearest-neighbor ensembles. Computational Intelligence, 34(4), pp. 968-998.

  57. Aggarwal, C.C. and Sathe, S., 2015. Theoretical foundations and algorithms for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1), pp.24-47.

  58. Aggarwal, C.C. and Sathe, S., 2015. Theoretical foundations and algorithms for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1), pp.24-47.

  59. Aggarwal, C.C. and Sathe, S., 2015. Theoretical foundations and algorithms for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1), pp.24-47.

  60. Aggarwal, C.C. and Sathe, S., 2015. Theoretical foundations and algorithms for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1), pp.24-47.

  61. Aggarwal, C.C. and Sathe, S., 2015. Theoretical foundations and algorithms for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1), pp.24-47.

  62. Aggarwal, C.C. and Sathe, S., 2015. Theoretical foundations and algorithms for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1), pp.24-47.

  63. Aggarwal, C.C. and Sathe, S., 2015. Theoretical foundations and algorithms for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1), pp.24-47.

pyod's People

Contributors

agoodge, akarazeev, bflammers, dependabot[bot], drewnow, durgeshsamariya, edgarakopyan, forestsking, frizzodavide, ftorrresd, gian21391, ingonader, john-almardeny, kulikdm, lambertsbennett, lorenzo-perini, lorgoc, lucew, mbongaerts, quentin62, rlshuhart, roelbouman, shangwen777, tam17aki, ucabvas, winstonll, xhan97, xuhongzuo, yzhao062, zainnasrullah


pyod's Issues

Outlier score highly correlated with overall distance to the origin

I calculated the distance of each data point to the origin using np.linalg.norm(x), where x is a single multivariate sample, and then normalized these values to [0, 1]; I call this the 'global_score'. When I compare this global score to the scores from different methods, it turns out to be highly correlated (0.99) with PCA, AutoEncoder, CBLOF, and KNN. So it seems all these methods are essentially measuring the overall distance of a sample rather than detecting anomalies relative to multiple clusters.
I was very troubled by this and hope you can confirm whether this is true and, if it is, what the reason is.

Thanks
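
For context, the comparison described above can be reproduced with a short sketch like the one below (synthetic data and a kNN detector chosen for illustration; this assumes generate_data(train_only=True) returns (X, y) as in recent versions):

import numpy as np
from pyod.models.knn import KNN
from pyod.utils.data import generate_data

X, y = generate_data(n_train=500, train_only=True, contamination=0.1)

# "global score": Euclidean distance of each sample to the origin, scaled to [0, 1]
global_score = np.linalg.norm(X, axis=1)
global_score = (global_score - global_score.min()) / (global_score.max() - global_score.min())

clf = KNN()
clf.fit(X)

# correlation between the distance-to-origin baseline and the detector's scores
print(np.corrcoef(global_score, clf.decision_scores_)[0, 1])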

Generate Synthetic Data in Clusters

Adding a new feature: generating artificial data in clusters.
This adds a utility function to generate synthesized data in clusters.
The generated data can involve the low-density pattern problem and global outliers, which are considered difficult cases for outlier detection algorithms.
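
This feature is now available as generate_data_clusters (listed in the utility table earlier on this page). A minimal usage sketch follows; the parameter names and return order are assumptions based on the current documentation, so check help(generate_data_clusters) for your installed version:

from pyod.utils.data import generate_data_clusters

# NOTE: return order assumed to be X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = generate_data_clusters(
    n_train=500, n_test=200, n_clusters=3, n_features=2,
    contamination=0.1, random_state=42)

print(X_train.shape, y_train.mean())   # roughly 10% of training samples are labeled as outliers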

Installing Pyod broke my TensorFlow installation

Ubuntu 16.04

Traceback (most recent call last):
  File "features_2_3_rot_unet_1.py", line 3, in <module>
    import tensorflow as tf
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/__init__.py", line 22, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/__init__.py", line 81, in <module>
    from tensorflow.python import keras
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/keras/__init__.py", line 24, in <module>
    from tensorflow.python.keras import activations
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/keras/activations/__init__.py", line 22, in <module>
    from tensorflow.python.keras._impl.keras.activations import elu
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/keras/_impl/keras/__init__.py", line 21, in <module>
    from tensorflow.python.keras._impl.keras import activations
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/keras/_impl/keras/activations.py", line 23, in <module>
    from tensorflow.python.keras._impl.keras import backend as K
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/keras/_impl/keras/backend.py", line 36, in <module>
    from tensorflow.python.layers import base as tf_base_layers
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/layers/base.py", line 25, in <module>
    from tensorflow.python.keras.engine import base_layer
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/keras/engine/__init__.py", line 23, in <module>
    from tensorflow.python.keras.engine.base_layer import InputSpec
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/keras/engine/base_layer.py", line 33, in <module>
    from tensorflow.python.keras import backend
  File "/home/user/.virtualenvs/tensorflow1.5/lib/python3.5/site-packages/tensorflow/python/keras/backend/__init__.py", line 22, in <module>
    from tensorflow.python.keras._impl.keras.backend import abs
ImportError: cannot import name 'abs'

CBLOF predict error

Hi,
When I try to use CBLOF to predict one or two or any small number of samples, sometimes it fails, as in the example below:

clf_name = 'CBLOF'
clf = CBLOF(alpha=0.7, beta=2, check_estimator=False, n_clusters=6)
clf.fit(a[0:336])
print([a[338]])
clf.predict([a[338]])

Output:

[array([0.21751617])]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-173-5342692feffe> in <module>()
      3 clf.fit(a[0:336])
      4 
----> 5 clf.predict([a[338]])

/usr/local/lib/python3.5/dist-packages/pyod/models/base.py in predict(self, X)
    125         check_is_fitted(self, ['decision_scores_', 'threshold_', 'labels_'])
    126 
--> 127         pred_score = self.decision_function(X)
    128         return (pred_score > self.threshold_).astype('int').ravel()
    129 

/usr/local/lib/python3.5/dist-packages/pyod/models/cblof.py in decision_function(self, X)
    179         X = check_array(X)
    180         labels = self.clustering_estimator_.predict(X)
--> 181         return self._decision_function(X, labels)
    182 
    183     def _validate_estimator(self, default=None):

/usr/local/lib/python3.5/dist-packages/pyod/models/cblof.py in _decision_function(self, X, labels)
    281 
    282         scores[large_indices] = pairwise_distances_no_broadcast(
--> 283             X[large_indices, :], large_centers)
    284 
    285         if self.use_weights:

/usr/local/lib/python3.5/dist-packages/pyod/utils/stat_models.py in pairwise_distances_no_broadcast(X, Y)
     36     :rtype: array of shape (n_samples,)
     37     """
---> 38     X = check_array(X)
     39     Y = check_array(Y)
     40     assert_allclose(X.shape, Y.shape)

/usr/local/lib/python3.5/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    580                              " minimum of %d is required%s."
    581                              % (n_samples, shape_repr, ensure_min_samples,
--> 582                                 context))
    583 
    584     if ensure_min_features > 0 and array.ndim == 2:

ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required.

But when I try to predict ensuring that one of them is not an anomaly, then it works in all the cases:

pred = clf.predict([clf.cluster_centers_[clf.large_cluster_labels_[0]],a[338]])
print (pred)

Output:

[0 1]

Thanks for your help

KNN Mahalanobis distance error

Hi,

When I use the Mahalanobis metric for KNN I always get the error "Must provide either V or VI for Mahalanobis distance" even when I provide V with metric_params. The same request works with sklearn.neighbors.


from pyod.models.knn import KNN  
from pyod.utils.data import generate_data
from sklearn.neighbors import NearestNeighbors
import numpy as np

contamination = 0.1  
n_train = 200  
n_test = 100 

X_train, y_train, X_test, y_test = generate_data(n_train=n_train, n_test=n_test, contamination=contamination)

#Doesn't work (Must provide either V or VI for Mahalanobis distance)
clf = KNN(algorithm='brute', metric='mahalanobis', metric_params={'V': np.cov(X_train)})
clf.fit(X_train)

#Works
nn = NearestNeighbors(algorithm='brute', metric='mahalanobis', metric_params={'V': np.cov(X_train)})
nn.fit(X_train)

Problem in CBLOF when the number of clusters is large and the train set has too many repeated values

Hi

If the train set has many repeated values and a large number of clusters is used, then some clusters will have the same value for the center. So, when computing self.cluster_sizes_ = np.bincount(clf.cluster_labels_), the result is an array smaller than the number of clusters, which generates an error and makes it impossible to set the large and small clusters. This could be avoided by changing self.cluster_sizes_ = np.bincount(clf.cluster_labels_) to self.cluster_sizes_ = np.bincount(clf.cluster_labels_, minlength=n_clusters); a standalone illustration of this follows after the traceback below. This issue is limiting my code's flexibility and I want to know if it is worth getting fixed.

Example of code:

import numpy as np
from pyod.models.cblof import CBLOF
from pyod.utils.data import generate_data
x = [[ 0.30244003],  [0.01218177],[-0.50835109], [-0.36951435],[ 0.97274482], [-0.68325119], 
     [0.0], [0.0], [0.08], [0.0], [0.0], [ 0.0],[ 0.0], [ 0.0],[0.09], [0.0],[ 0.0], [0.0],
     [0.0], [ 0.0],[-20.29518778], [0.0],[ 0.0], [0.0],[ 0.0], [ 0.0],
     [0.0], [ 8.38548823], [0.0], [ 0.0]]
test = generate_data(train_only=True)
clf_name = 'CBLOF'
clf = CBLOF(alpha=0.1, n_clusters=15, beta=10, check_estimator=False)
try:
    clf.fit(x)
except Exception as ex:
    print(str(ex))
    print("\n Cluster centers: " + str(clf.cluster_centers_))
    print("\n Cluster sizes: " + str(clf.cluster_sizes_))
    print('\n Supposed to be the cluster size: ' + str(np.bincount(clf.cluster_labels_, minlength=15)))
    print("\n Large clusters: " + str(clf.large_cluster_labels_))
    print("\n Small clusters: " + str(clf.small_cluster_labels_))

Output:

index 11 is out of bounds for axis 0 with size 11

 Cluster centers: [[ 0.00000000e+00]
 [-2.02951878e+01]
 [ 8.38548823e+00]
 [ 9.72744820e-01]
 [-5.08351090e-01]
 [ 3.02440030e-01]
 [-6.83251190e-01]
 [-3.69514350e-01]
 [ 8.00000000e-02]
 [ 1.21817700e-02]
 [ 9.00000000e-02]
 [ 0.00000000e+00]
 [ 0.00000000e+00]
 [ 8.00000000e-02]
 [ 0.00000000e+00]]

 Cluster sizes: [20  1  1  1  1  1  1  1  1  1  1]

 Supposed to be the cluster size: [20  1  1  1  1  1  1  1  1  1  1  0  0  0  0]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-39-1b14a2099b96> in <module>()
     18 try:
---> 19     clf.fit(x)
     20 except Exception as ex:

/usr/local/lib/python3.5/dist-packages/pyod/models/cblof.py in fit(self, X, y)
    168         self._set_cluster_centers(X, n_features)
--> 169         self._set_small_large_clusters(n_samples)
    170 

/usr/local/lib/python3.5/dist-packages/pyod/models/cblof.py in _set_small_large_clusters(self, n_samples)
    251 
--> 252             if size_clusters[sorted_cluster_indices[i]] / size_clusters[
    253                 sorted_cluster_indices[i - 1]] >= self.beta:

IndexError: index 11 is out of bounds for axis 0 with size 11

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
<ipython-input-39-1b14a2099b96> in <module>()
     23     print("\n Cluster sizes: " + str(clf.cluster_sizes_))
     24     print('\n Supposed to be the cluster size: ' + str(np.bincount(clf.cluster_labels_, minlength=15)))
---> 25     print("\n Large clusters: " + str(clf.large_cluster_labels_))
     26     print("\n Small clusters: " + str(clf.small_cluster_labels_))
     27 

AttributeError: 'CBLOF' object has no attribute 'large_cluster_labels_'

Thanks for your help,
Giovanna
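
As mentioned above, a standalone illustration of the proposed fix: np.bincount pads its result to a fixed length when minlength is given, so clusters that received no samples still get a (zero) entry:

import numpy as np

# cluster labels produced with n_clusters=15, but only 11 clusters end up non-empty
cluster_labels = np.array([0] * 20 + [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

print(np.bincount(cluster_labels))                 # length 11: indexing clusters 11..14 fails
print(np.bincount(cluster_labels, minlength=15))   # length 15: empty clusters get a zero count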

Documentation / Implementation difference in Autoencoder

While exploring the AutoEncoder in pyod, I've noticed a discrepancy between the generated docs and the implementation.
While the docs (https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.auto_encoder) state that hidden_neurons defaults to a list ([64, 32, 32, 64]), the implementation assigns None as the default: https://github.com/yzhao062/pyod/blob/development/pyod/models/auto_encoder.py#L126
While this isn't a problem by itself, instantiating an AutoEncoder that way resulted in a TypeError on my side:

         # Verify the network design is valid
>       if not self.hidden_neurons == self.hidden_neurons[::-1]:
E       TypeError: 'NoneType' object is not subscriptable

It might be worth a try to change the default for hidden_neurons to the list mentioned in the docs.
And by the way: Thanks for this framework, it is really a breeze to work with!
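
Until the default is changed, passing hidden_neurons explicitly is a simple workaround. A minimal sketch on toy data (requires the optional keras/tensorflow dependency; parameter names follow the documented keras-based implementation, and epochs is reduced just to keep the run short):

import numpy as np
from pyod.models.auto_encoder import AutoEncoder   # needs keras + a backend such as tensorflow

X_train = np.random.randn(500, 64)   # toy data with 64 features

# pass the architecture explicitly instead of relying on the (currently None) default
clf = AutoEncoder(hidden_neurons=[64, 32, 32, 64], epochs=10, verbose=0)
clf.fit(X_train)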

n_jobs ignored

Hi, I'm using XGBOD with n_jobs = -1 and it's no different than using it with n_jobs = 1...

Instructions on setting up Keras and Tensorflow for AutoEncoder in PyOD

It is nice that PyOD includes some neural network based models, such as AutoEncoder. However, you may find that after pip install pyod, AutoEncoder models do not run. This is expected, since I do not want PyOD to rely on too many packages, and not everyone needs to run AutoEncoder.

If you have tensorflow-gpu installed, keras will automatically run on the GPU.
If you want to run AutoEncoder, please first install keras plus a backend library, e.g., tensorflow. Either of the following should do the installation for you:

  • pip install keras tensorflow or pip install keras tensorflow-gpu
  • conda install keras tensorflow or conda install keras tensorflow-gpu

You need tensorflow-gpu if your device has a GPU and you want to leverage it.

After keras and tensorflow are installed, you are ready to run auto_encoder_example.py.

Here are some potential error messages you may encounter:

1. ModuleNotFoundError: No module named 'theano'

In this case, you should set the keras backend to the one you want to use, e.g., TensorFlow.
Go to $HOME/.keras/keras.json and change the "backend" to "tensorflow".

2. ModuleNotFoundError: No module named 'error'

In this case, you need to install keras and tensorflow with conda, which can either be done in the GUI or simply use "conda install keras" and "conda install tensorflow"

LSCP with multiple LOF testing error: range parameter must be finite

I am running the following code:

clf_name = 'LSCP_LOF'

other parameters:

from pyod.models.lof import LOF
from pyod.models.lscp import LSCP

lof_list = [LOF(n_neighbors=5), LOF(n_neighbors=10), LOF(n_neighbors=20), LOF(n_neighbors=30), LOF(n_neighbors=40), LOF(n_neighbors=50), LOF(n_neighbors=75)]

clf = LSCP(lof_list)
#clf = LOF(n_neighbors=5, contamination=outliers_fraction)
clf.fit(X_train)

and got the following error; however, when fitting directly with the LOF method, it runs fine:

ValueError Traceback (most recent call last)
in ()
12 clf = LSCP(lof_list)
13 #clf = LOF(n_neighbors=5, contamination=outliers_fraction)
---> 14 clf.fit(X_train)
15
16 # get the prediction label and outlier scores of the training data

~/proj/myPylib/lib/python3.6/site-packages/pyod/models/lscp.py in fit(self, X, y)
171
172 # set decision scores and threshold
--> 173 self.decision_scores_ = self._get_decision_scores(X)
174 self._process_decision_scores()
175

~/proj/myPylib/lib/python3.6/site-packages/pyod/models/lscp.py in _get_decision_scores(self, X)
273 pred_scores_ens[i,] = np.mean(
274 test_scores_norm[
--> 275 i, self._get_competent_detectors(pearson_corr_scores)])
276
277 return pred_scores_ens

~/proj/myPylib/lib/python3.6/site-packages/pyod/models/lscp.py in _get_competent_detectors(self, scores)
355 "classifiers, reducing n_bins to n_clf.")
356 self.n_bins = self.n_clf
--> 357 hist, bin_edges = np.histogram(scores, bins=self.n_bins)
358
359 # find n_selected largest bins

/opt/anaconda3/lib/python3.6/site-packages/numpy/lib/function_base.py in histogram(a, bins, range, normed, weights, density)
668 if not np.all(np.isfinite([first_edge, last_edge])):
669 raise ValueError(
--> 670 'range parameter must be finite.')
671 if first_edge == last_edge:
672 first_edge -= 0.5

ValueError: range parameter must be finite.

Thanks

SOS: overflow encountered in multiply beta[i] = beta[i] * 2.0

I am running the following code:

from pyod.models.sos import SOS

clf_name = 'SOS'
clf = SOS()
clf.fit(X_train)

and got the following warning:
RuntimeWarning: overflow encountered in multiply
beta[i] = beta[i] * 2.0
/opt/anaconda3/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use arr[tuple(seq)] instead of arr[seq]. In the future this will be interpreted as an array index, arr[np.array(seq)], which will result either in an error or a different result.
return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
~/proj/myPylib/lib/python3.6/site-packages/pyod/models/base.py:336: RuntimeWarning: invalid value encountered in greater
self.labels_ = (self.decision_scores_ > self.threshold_).astype(

data.zip
I have uploaded the data for X_train here.

My samples have duplicates and when I remove the duplicates the error does not occur. However I need to retain the duplicates.

The KNN example is incorrect

The KNN example is incorrect
There is no get_outliers_inliers in pyod

from pyod.utils.data import generate_data
from pyod.utils.data import get_outliers_inliers
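
If get_outliers_inliers is not available in the installed version, the same split can be done directly with NumPy boolean indexing on the labels returned by generate_data. A small sketch (assuming generate_data(train_only=True) returns (X, y)):

from pyod.utils.data import generate_data

X, y = generate_data(n_train=300, train_only=True, contamination=0.1)

# split the samples by their ground-truth labels (1 = outlier, 0 = inlier)
X_outliers = X[y == 1]
X_inliers = X[y == 0]
print(X_outliers.shape, X_inliers.shape)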

XGBOD and LSCP missing from install

I installed PyOD using:

pip install pyod
pip install --upgrade pyod

However, LSCP and XGBOD are not installed. All of the other models in the repo can be successfully imported into a jupyter notebook. Attempting to import LSCP and XGBOD both yield a "ModuleNotFoundError: No module named" error.

Breadth-First Approach in FeatureBagging

May I ask a question about the implemented approach for "combination" in "feature_bagging.py"?

IMHO, the idea of using "maximization" is not a precise reflection of the original paper (lazarevic2005feature). The authors describe a breadth-first search procedure there; arguably the numeric differences might be small.
However, please consider a generic toy example as a counter-example:

|------| Alg1 | Alg2 |
| Obs1 | 10.0 | 2.0  |
| Obs2 |  9.0 | 3.0  |
| Obs3 |  8.0 | 4.0  |

Maximization would return the order Obs1 (score: 10), Obs2 (score: 9), Obs3 (score: 8), whereas breadth-first search would return Obs1 (rank 1 in Alg1), Obs3 (rank 1 in Alg2), and then Obs2; a small sketch of both orderings follows below.

Many thanks.
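
To make the difference concrete, here is a naive sketch of the two combination strategies on the toy scores above (my own illustration, not PyOD's implementation):

import numpy as np

# toy scores from the example: rows = observations, columns = detectors
scores = np.array([[10.0, 2.0],    # Obs1
                   [ 9.0, 3.0],    # Obs2
                   [ 8.0, 4.0]])   # Obs3

# maximization: rank observations by their maximum score across detectors
max_order = np.argsort(-scores.max(axis=1))
print("maximization order:", max_order + 1)        # Obs1, Obs2, Obs3

# breadth-first: take the top-ranked observation of each detector in turn,
# then the second-ranked of each detector, skipping observations already taken
ranked_per_detector = np.argsort(-scores, axis=0)  # per-column ranking, best first
bfs_order = []
for rank in range(scores.shape[0]):
    for det in range(scores.shape[1]):
        obs = ranked_per_detector[rank, det]
        if obs not in bfs_order:
            bfs_order.append(obs)
print("breadth-first order:", np.array(bfs_order) + 1)   # Obs1, Obs3, Obs2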

Correct handling of LOF proba predictions

Hi,

thanks for the great library.
When evaluating whether it is usable for my work, I stumbled across a potential issue.
My workflow looks as follows:

  1. Train the LOF detector on a training dataset.
  2. Provide raw scores and outlier probabilities for this set
  3. Deploy the model to generate outlier probabilities on new data

I'm not quite sure how to correctly perform step 2. Executing lof.predict_proba(train) calls lof.decision_function(train), which delegates to the sklearn implementation. sklearn explicitly states that this function is only supposed to handle new data (https://github.com/scikit-learn/scikit-learn/blob/f0ab589f/sklearn/neighbors/lof.py#L233), which is violated here.

Thanks for your help
Alex

Request to add an article resource - Outlier Detection using PyOD

Hi,
I have written an article on Outlier Detection using PyOD on Analytics Vidhya Blog -
https://www.analyticsvidhya.com/blog/2019/02/outlier-detection-python-pyod/

In the article, I have tried to explain the need for outlier detection, how we can use pyod for it, and have also applied pyod to a real-world data set.
Please consider including it in your resources section on GitHub. I believe it would be really helpful for people who want to get started with pyod.

Thanks

tensorflow pip installation fails with travis-ci python 3.7

matrix:
  include:
    - python: "3.7"
      dist: xenial
      sudo: true

Error message:
Collecting tensorflow (from -r requirements_travis.txt (line 8))
Could not find a version that satisfies the requirement tensorflow (from -r requirements_travis.txt (line 8))(from versions: )
No matching distribution found for tensorflow (from -r requirements_travis.txt (line 8))
The command "pip install -r requirements_travis.txt" failed and exited with 1 during .

Wait until pip fixes tensorflow installation under python 3.7.

Slow installation due to the underlying dependencies

It is noted that PyOD depends on a few libraries, including:

  • keras
  • matplotlib (optional, required for running examples)
  • nose
  • numpy>=1.13
  • numba>=0.35
  • scipy>=0.19.1
  • scikit_learn>=0.19.1
  • tensorflow (optional, required if calling AutoEncoder, other backend also works)

It is getting more serious as we have started introducing deep learning models into PyOD, which are implemented in Keras (and, of course, with some backend library, e.g., TensorFlow).

In addition, to improve efficiency, we have started using JIT compilation in PyOD, specifically Numba, to accelerate execution; it uses the LLVM compiler to overcome the overhead of Python.

In the long run, I am also interested in bringing GPU support to PyOD, which could be done through CUDA programming. However, it will clearly make installation and maintenance messier due to the added complexity.

Therefore, I would like to gather some ideas regarding comprehensiveness vs. efficiency vs. complexity for the development of PyOD. What is your opinion? Is the current installation too cumbersome for you?

LOCI fails on MacOS with Python 2.7 (caused by np.count_nonzero)

It is noted running LOCI model on MacOS with Python 2.7 may fail. One potential cause is the following code, as np.count_nonzero returns int instead of array.
I am currently investigating how to fix it. Please stay tuned.

 def _get_alpha_n(self, dist_matrix, indices, r):
        """Computes the alpha neighbourhood points.
        
        Parameters
        ----------
        dist_matrix : array-like, shape (n_samples, n_features)
            The distance matrix w.r.t. to the training samples.
        
        indices : int
            Subsetting index
        
        r : int
            Neighbourhood radius
            
        Returns
        -------
        alpha_n : array, shape (n_alpha, )
            Returns the alpha neighbourhood points.       
        """

        if type(indices) is int:
            alpha_n = np.count_nonzero(
                dist_matrix[indices, :] < (r * self._alpha))
            return alpha_n
        else:
            alpha_n = np.count_nonzero(
                dist_matrix[indices, :] < (r * self._alpha), axis=1)
            return alpha_n

The error message looks like below:

(test27) bash-3.2$ python loci_example.py
/anaconda2/envs/test27/lib/python2.7/site-packages/pyod/models/loci.py:199: RuntimeWarning: divide by zero encountered in double_scalars
outlier_scores[p_ix] = mdef/sigma_mdef
/Users/zhaoy9/.local/lib/python2.7/site-packages/numpy/core/_methods.py:101: RuntimeWarning: invalid value encountered in subtract
x = asanyarray(arr - arrmean)
On Training Data:
Traceback (most recent call last):
File "loci_example.py", line 133, in
evaluate_print(clf_name, y_train, y_train_scores)
File "/anaconda2/envs/test27/lib/python2.7/site-packages/pyod/utils/data.py", line 159, in evaluate_print
roc=np.round(roc_auc_score(y, y_pred), decimals=4),
File "/anaconda2/envs/test27/lib/python2.7/site-packages/sklearn/metrics/ranking.py", line 356, in roc_auc_score
sample_weight=sample_weight)
File "/anaconda2/envs/test27/lib/python2.7/site-packages/sklearn/metrics/base.py", line 77, in _average_binary_score
return binary_metric(y_true, y_score, sample_weight=sample_weight)
File "/anaconda2/envs/test27/lib/python2.7/site-packages/sklearn/metrics/ranking.py", line 328, in _binary_roc_auc_score
sample_weight=sample_weight)
File "/anaconda2/envs/test27/lib/python2.7/site-packages/sklearn/metrics/ranking.py", line 618, in roc_curve
y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)
File "/anaconda2/envs/test27/lib/python2.7/site-packages/sklearn/metrics/ranking.py", line 403, in _binary_clf_curve
assert_all_finite(y_score)
File "/anaconda2/envs/test27/lib/python2.7/site-packages/sklearn/utils/validation.py", line 68, in assert_all_finite
_assert_all_finite(X.data if sp.issparse(X) else X, allow_nan)
File "/anaconda2/envs/test27/lib/python2.7/site-packages/sklearn/utils/validation.py", line 56, in _assert_all_finite
raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

AUC score & precision score are different - why are they not the same?

from pyod.utils.data import evaluate_print

# evaluate and print the results
print("\nOn Training Data:")
evaluate_print(clf_name, y_true, y_scores)

On Training Data:
KNN ROC:0.9352, precision @ rank n:0.568

from sklearn import metrics
print("Accuracy Score",round(metrics.accuracy_score(y_true, y_pred),2))
print("Precision Score",round(metrics.precision_score(y_true, y_pred),2))
print("Recall Score",round(metrics.recall_score(y_true, y_pred),2))
print("F1 Score",round(metrics.f1_score(y_true, y_pred),2))
print("Roc Auc score",round(metrics.roc_auc_score(y_true, y_pred),2))

Accuracy Score 0.92
Precision Score 0.55
Recall Score 0.59
F1 Score 0.57
Roc Auc score 0.77

specifying categorical features in Python Outlier Detection (PyOD)

How do I specify the categorical features in PyOD when using Histogram-based Outlier Detection (HBOS) for anomaly detection?
I've read that HBOS can be used for anomaly detection when there are categorical features involved. I found its Python implementation here:
https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.hbos
But I can't figure out how I should pass the positions or names of the categorical features of my dataset while training the model.
The code I've tried:

from pyod.models.hbos import HBOS

clf = HBOS(n_bins=10, alpha=0.1, tol=0.5, contamination=0.1)
clf.fit(train_df)
train_pred = clf.labels_

There is no parameter for specifying categorical features while training.

pyod fails to install using pip

When attempting to install without nose, I receive the following error:

(PyVi) Michael:PyVi michael$ pip install pyod
Collecting pyod==0.5.0 (from -r requirements.txt (line 18))
  Using cached https://files.pythonhosted.org/packages/c9/8c/6774fa2e7ae6fe9c2c648114d15ba584f950002377480e14183a0999af30/pyod-0.5.0.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/j4/_68f6f3j4d51_2smq2mh5hyh0000gn/T/pip-install-gjdzzane/pyod/setup.py", line 2, in <module>
        from pyod import __version__
      File "/private/var/folders/j4/_68f6f3j4d51_2smq2mh5hyh0000gn/T/pip-install-gjdzzane/pyod/pyod/__init__.py", line 4, in <module>
        from . import models
      File "/private/var/folders/j4/_68f6f3j4d51_2smq2mh5hyh0000gn/T/pip-install-gjdzzane/pyod/pyod/models/__init__.py", line 2, in <module>
        from .abod import ABOD
      File "/private/var/folders/j4/_68f6f3j4d51_2smq2mh5hyh0000gn/T/pip-install-gjdzzane/pyod/pyod/models/abod.py", line 17, in <module>
        from .base import BaseDetector
      File "/private/var/folders/j4/_68f6f3j4d51_2smq2mh5hyh0000gn/T/pip-install-gjdzzane/pyod/pyod/models/base.py", line 27, in <module>
        from ..utils.utility import precision_n_scores
      File "/private/var/folders/j4/_68f6f3j4d51_2smq2mh5hyh0000gn/T/pip-install-gjdzzane/pyod/pyod/utils/__init__.py", line 2, in <module>
        from .utility import check_parameter
      File "/private/var/folders/j4/_68f6f3j4d51_2smq2mh5hyh0000gn/T/pip-install-gjdzzane/pyod/pyod/utils/utility.py", line 18, in <module>
        from sklearn.utils.testing import assert_equal
      File "/Users/michael/anaconda3/envs/PyVi/lib/python3.6/site-packages/sklearn/utils/testing.py", line 49, in <module>
        from nose.tools import raises
    ModuleNotFoundError: No module named 'nose'
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/j4/_68f6f3j4d51_2smq2mh5hyh0000gn/T/pip-install-gjdzzane/pyod/

func:`pyod.utils.data.visualize` does not exist

Is this function pyod.utils.data.visualize deprecated? I cannot import this function.

import sys
import pyod
In[]: sys.version
Out[]: '3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]'
In[]: pyod.utils.data.visualize(clf_name, X_train, X_test, y_train_pred, y_test_pred, show_figure=True, save_figure=False)

Traceback (most recent call last):

  File "<ipython-input-9-1628666df63a>", line 2, in <module>
    pyod.utils.data.visualize(clf_name,

AttributeError: module 'pyod.utils.data' has no attribute 'visualize'
 (py36) E:\MyNutshell>pip show pyod                                   
Name: pyod                                                           
Version: 0.5.6                                                       
Summary: A Python Outlier Detection (Anomaly Detection) Toolbox      
Home-page: https://github.com/yzhao062/Pyod                          
Author: Yue Zhao                                                     
Author-email: [email protected]                                 
License: UNKNOWN             
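If I remember the current package layout correctly, recent PyOD releases ship this helper as pyod.utils.example.visualize rather than pyod.utils.data.visualize, and the argument list may differ between versions. A minimal sketch assuming that module path and the signature used in knn_example.py:

from pyod.utils.example import visualize  # assumed location in recent releases

visualize(clf_name, X_train, y_train, X_test, y_test,
          y_train_pred, y_test_pred, show_figure=True, save_figure=False)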

Merge with kenchi

Hi,

I am currently developing an anomaly detection package called kenchi and would like to merge this code into your package.
https://github.com/HazureChi/kenchi

There are three points that I can contribute to pyod.

The first is the implementation of One-time sampling.
https://github.com/HazureChi/kenchi/blob/master/kenchi/outlier_detection/distance_based.py

Sugiyama, M., and Borgwardt, K., "Rapid distance-based outlier detection via sampling," Advances in NIPS, pp. 467-475, 2013.

The second is the implementation of evaluation metrics for outlier detection.
https://github.com/HazureChi/kenchi/blob/master/kenchi/metrics.py

Lee, W. S, and Liu, B., "Learning with positive and unlabeled examples using weighted Logistic Regression," In Proceedings of ICML, pp. 448-455, 2003.

Goix, N., "How to evaluate the quality of unsupervised anomaly detection algorithms?" In ICML Anomaly Detection Workshop, 2016.

The last is the implementation of a function that loads and returns various datasets.
https://github.com/HazureChi/kenchi/blob/master/kenchi/datasets/base.py

If you agree, I would actively like to contribute to pyod in the future.

Thanks.

intended clf.predict_proba usage

I'm trying to make sense of the predict_proba function.

What I want to achieve: get class probabilities for generating metrics like ROC curves, calibration curves, precision, accuracy, etc. with scikit-learn tools. As I am working on a binary classification task, I thought I could use predict_proba for this.

The documentation describes it as "predict the probability of a sample being outlier" that returns:
For each observation, tells whether or not it should be considered as an outlier according to the fitted model. Return the outlier probability, ranging in [0,1].
which is what I am currently looking for. What I don't understand is that an ndarray of shape (n_observations, 2) is returned.
If I compare the output of clf.predict() and clf.predict_proba() side by side, I see a high value in the first column of the predict_proba array all the time:

0 -> [0.86014439 0.13985561]
0 -> [0.96943563 0.03056437]
0 -> [0.88716599 0.11283401]
0 -> [0.87912382 0.12087618]
0 -> [0.9686196   0.0313804]
0 -> [0.87921815 0.12078185]
1 -> [0.83279906 0.16720094]
0 -> [0.87921815 0.12078185]
0 -> [0.86137304 0.13862696]
0 -> [0.98987502 0.01012498]

Might the first column be read as "how confident is the classifier that the predicted class is correct"? It would be great if you could help me out on this one.

By the way: Thanks for building such a great Python module!
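If I read the PyOD base detector correctly, the two columns follow scikit-learn's convention: column 0 is the probability of being an inlier and column 1 the probability of being an outlier, so score-based metrics should use the second column. A minimal sketch:

from sklearn.metrics import roc_auc_score

proba = clf.predict_proba(X_test)   # shape (n_samples, 2)
outlier_proba = proba[:, 1]         # probability of being an outlier

# feed the outlier column, not the hard labels, into score-based metrics
print(roc_auc_score(y_test, outlier_proba))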

I am trying to run RandomizedSearchCV on ABOD, but surprisingly it does not work

Here is my code:

import numpy as np
from pyod.models.abod import ABOD
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
from sklearn.metrics import make_scorer, precision_score, recall_score

# ABOD's neighbour-count parameter is 'n_neighbors', not 'neighbours'
param_grid = {'n_neighbors': list(range(1, 5, 1)),
              'contamination': np.linspace(0.01, 0.05, 5)}

skf = StratifiedKFold(n_splits=10)
folds = list(skf.split(X.toarray(), y_true))
clf = ABOD()
scoring = make_scorer(precision_score)
search = RandomizedSearchCV(estimator=clf, param_distributions=param_grid, scoring=scoring, cv=folds)
search.fit(X.toarray(), y_true)
y_pred = search.predict(X.toarray())
print('Best parameters: %0.10f' % search.best_params_["contamination"],
      'Precision score: %0.3f' % precision_score(y_true, y_pred),
      'Recall score: %0.3f' % recall_score(y_true, y_pred))

Best parameters:0.0100000000 Precision score: 0.000 Recall score: 0.000

kNN visualization (interpretation)

The visualization produced by knn_example.py for the "Test Set Prediction" shows two false positives, i.e., 12 outlier findings instead of 10 as in the "Test Set Ground Truth" chart. Isn't this somewhat inconsistent with the result printed to console that ROC_AUC = 1?

If so, I think the inconsistency arises because the predicted labels for the chart are based on y_test_pred = clf.predict(X_test). I think that means the test labels are being predicted by comparison with the distance threshold clf.threshold_ obtained when fitting clf to the training data. In contrast, the ROC_AUC value is based on a fixed contamination rate (10%).

It would only make sense to use clf.threshold_ for this purpose if the kNN distance for any point x_i in the test set were being computed over the distances from x_i to each of the 200 training points, not the distances to the other 99 test points. But then, the ROC_AUC curve ought to be based on those same labels, and it isn't, is it? I think it's currently being generated from a set of labels that re-applies the 10% contamination assumption, ignoring clf.threshold_.

(I can't quite follow whether the kNN distances for the points in the test set are being computed vs. the training set or vs. the other points in the test set. Can you clarify this for me? I have to guess that it's the former; if it's the latter, then it would seem really weird to be applying clf.threshold_ from a training set of a different size.)

Is it even appropriate to apply the kNN model from training data directly to the test set? I would have thought this use of kNN is intended for an entire data set all by itself, although one could perhaps study a training set to make a reasonable judgment about appropriate values of k and the contamination rate.

Thanks!
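To make the two quantities being compared concrete, here is a minimal sketch (assuming the standard knn_example.py variable names): the hard labels come from the training-set threshold, while ROC AUC is threshold-free and uses the raw scores.

from sklearn.metrics import roc_auc_score

# hard 0/1 labels: thresholded at clf.threshold_ learned on the training set
y_test_pred = clf.predict(X_test)

# continuous outlier scores: no threshold involved
y_test_scores = clf.decision_function(X_test)

# ROC AUC uses the scores, so it can be 1.0 even if the labels show "false positives"
print(roc_auc_score(y_test, y_test_scores))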

IForest: FutureWarning: behaviour="old" is deprecated

Hi,

Thanks for a great library!

When declaring a new IForest object, Sklearn throws the following warning:

FutureWarning: behaviour="old" is deprecated and will be removed in version 0.22. Please use behaviour="new", which makes the decision_function change to match other anomaly detection algorithm API.
FutureWarning)

This new behavior in sklearn's iforest is about where the threshold is set between anomalies and normal observations. See documentation on behaviour argument and offset_:

behaviour : str, default='old'
Behaviour of the decision_function which can be either 'old' or
'new'. Passing behaviour='new' makes the decision_function
change to match other anomaly detection algorithm API which will be
the default behaviour in the future. As explained in details in the
offset_ attribute documentation, the decision_function becomes
dependent on the contamination parameter, in such a way that 0 becomes
its natural threshold to detect outliers.

offset_ : float
Offset used to define the decision function from the raw scores.
We have the relation: decision_function = score_samples - offset_.
Assuming behaviour == 'new', offset_ is defined as follows.
When the contamination parameter is set to "auto", the offset is equal
to -0.5 as the scores of inliers are close to 0 and the scores of
outliers are close to -1. When a contamination parameter different
than "auto" is provided, the offset is defined in such a way we obtain
the expected number of outliers (samples with decision function < 0)
in training.
Assuming the behaviour parameter is set to 'old', we always have
offset_ = -0.5, making the decision function independent from the
contamination parameter.

I think a simple fix would be to add the argument behaviour="new" in the call to sklearn.ensemble.IsolationForest.
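For reference, a minimal sketch of what the proposed call would look like against scikit-learn directly (assuming a scikit-learn version around 0.20-0.21, where the behaviour flag still exists; it was deprecated in 0.22 and later removed):

from sklearn.ensemble import IsolationForest

# with behaviour='new', decision_function < 0 marks outliers,
# consistent with the other sklearn anomaly detection estimators
iso = IsolationForest(behaviour='new', contamination=0.1, random_state=42)
iso.fit(X_train)
labels = (iso.decision_function(X_test) < 0).astype(int)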

If n_samples is large, certain outlier models' error rate is 200% higher

Hi YZhao,

I am writing about one possible issue: in the example notebook that compares all models, if n_samples is changed to a large number, for example 10**5 or larger, certain models' OD results are totally wrong.
Note: I noticed there is a similar issue, "Problem in CBLOF when the number of clusters is big and the train set has too many repeated values" #53, but my finding seems different, so I am posting it as well.

Here is the issue:

  1. With the default n_samples = 200 and outlier_fraction = 0.25, the number of ground-truth outlier points is 50. After changing n_samples to 10**5, the number of ground-truth outlier points should be 25000.
    However, the following models report far more outliers than the ground-truth count:
    Feature Bagging: 35259
    Local Outlier Factor (LOF): 36144
    Locally Selective Combination (LSCP): 37276
    (screen capture omitted)

I guess it might be related to the dataset type. Does it mean the simulated data is similar to the "glass" or "optdigits" sample data? Why did the other estimators not show such a high error rate?

Looking forward to your kind response!

Last but not least, I learned about PyOD from Zhihu. It is an excellent tool; in particular, I got access to more OD resource links from your GitHub. Your work is awesome!

Recently, I have begun trying some of its models in one of my general OD automation tools (which uses Docker and Airflow as the platform). Dataset type and dataset quantity are two of the points that need to be considered.

WangYong
[email protected]

Is it possible to make CBLOF ignore contamination parameter?

CBLOF's parameters seem useless. Basically the only thing that matters when using the method is the contamination parameter: if I set it to 0.3, it will flag 30% of the points as anomalies, no matter how normal they are or whether they belong to a big cluster. From what I understood about the method, it should be able to decide what is and is not an anomaly based only on the parameters alpha and beta, so why is this happening?
Is there a way to ignore contamination?
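As far as I can tell, contamination in PyOD only sets the threshold used to turn decision scores into 0/1 labels; it does not change the scores themselves. One way to sidestep it is to work from decision_scores_ directly and pick your own cutoff, as in this sketch (the 95th-percentile cutoff is a hypothetical choice, not anything CBLOF prescribes):

import numpy as np
from pyod.models.cblof import CBLOF

clf = CBLOF(alpha=0.9, beta=5)           # contamination left at its default
clf.fit(X_train)

scores = clf.decision_scores_            # raw CBLOF outlier scores
cutoff = np.percentile(scores, 95)       # hypothetical user-chosen cutoff
labels = (scores > cutoff).astype(int)   # 1 = anomaly under this custom threshold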

ValueError: continuous format is not supported

Hey there,

Following the KNN example, I'm getting this error:

ValueError                                Traceback (most recent call last)
<ipython-input-252-21e0f0751702> in <module>()
      2 # evaluate and print the results
      3 print("\nOn Training Data:")
----> 4 evaluate_print(clf_name, y_train, y_train_scores)
      5 print("\nOn Test Data:")
      6 evaluate_print(clf_name, y_test, y_test_scores)
------
    157     print('{clf_name} ROC:{roc}, precision @ rank n:{prn}'.format(
    158         clf_name=clf_name,
--> 159         roc=np.round(roc_auc_score(y, y_pred), decimals=4),
    160         prn=np.round(precision_n_scores(y, y_pred), decimals=4)))

any suggestions?
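This error usually means the ground-truth array passed to evaluate_print is continuous rather than binary: roc_auc_score, which evaluate_print calls internally, only accepts binary labels as y_true. A minimal sketch of a fix, under the assumption that y_train holds continuous values that encode outliers as nonzero:

import numpy as np
from pyod.utils.data import evaluate_print

# roc_auc_score needs binary ground truth; binarize y_train first
# (the "nonzero means outlier" rule below is an assumption about your data)
y_train_binary = (np.asarray(y_train) != 0).astype(int)

print("\nOn Training Data:")
evaluate_print(clf_name, y_train_binary, y_train_scores)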
