Fraud detection is the set of activities undertaken to detect and block attempts to obtain money or property through false pretenses. Fraud is an expensive and complicated problem. To detect and investigate it effectively, you need to see connections (between people, accounts, transactions, and dates) and understand complex sequences of events. That means analyzing a lot of data. Fraud detection is prevalent across banking, insurance, medical, government, and public sectors, as well as in law enforcement agencies.
Advantages of Visualizations in Fraud Detection:
The detection of fraud schemes requires an investigation of a vast amount of data that stems from many different anti-fraud systems with varying types of data. The auditors have to combine all the data and use statistical methods to uncover suspicious claims, which is time-consuming and inefficient in most cases.
Visualizations, on the other hand, make it quicker to identify relationships and significant structures and to detect suspicious patterns that may be hidden in large volumes of data. Beyond visual exploration, interacting with the data gives a deeper understanding of how the dependencies within it change over time.
One of the most challenging tasks when using visualization for fraud detection is the sheer amount of data that is usually obtained by auditing systems. First, the auditor has to retrieve the data from the auditing system. Visualizing such a large amount of data is the next challenge: the data needs a meaningful arrangement to create a human-readable representation. Providing suitable styling should enable users to identify different types of entities and relations.
Since there exist a lot of different types of fraud schemes, it is clear that there is no unique solution that can detect all of them. Thus, a visualization meant to fight against fraud has to be adaptive to the needs of each auditor.
First, it must not be limited to a specific amount or type of data, since the volume of the investigated data grows exponentially and comes from different sources. In some cases, it is also necessary to support and visualize time-dependent data.
A sophisticated visualization should also provide the means for arranging the elements in multiple ways on the screen, i.e., using arrangements that reveal clusters or others that highlight hierarchical structures. Additionally, more sophisticated graph analysis algorithms should be supported for the detection of fraud schemes, e.g., cycle detection, or shortest paths.
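As a rough illustration of the two graph algorithms mentioned above, the following sketch runs cycle detection and a shortest-path search on a small directed transfer graph. It is not taken from this repository: the function names (`build_graph`, `has_cycle`, `shortest_path`) and the account labels are hypothetical, and only the Python standard library is used.

```python
from collections import deque


def build_graph(transfers):
    """Build a directed adjacency list from (sender, receiver) pairs."""
    graph = {}
    for sender, receiver in transfers:
        graph.setdefault(sender, set()).add(receiver)
        graph.setdefault(receiver, set())
    return graph


def has_cycle(graph):
    """Return True if the directed graph contains a cycle (three-color DFS)."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:  # back edge: funds return to the path
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


def shortest_path(graph, source, target):
    """Return the path with the fewest hops from source to target, or None."""
    if source not in graph or target not in graph:
        return None
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in graph[node]:
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical transfers: A -> B -> C -> A forms a loop, and C also pays D.
graph = build_graph([("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")])
print(has_cycle(graph))
print(shortest_path(graph, "A", "D"))
```

The recursive search is fine for small examples; a graph with very long chains would need an iterative version to stay within Python's recursion limit.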
Regarding the representation of the elements of the visualization, an auditor should be able to customize the look and feel of the graph elements as needed and to display additional properties of the graph elements. Finally, interaction is one of the essential operations when visualizing fraud data, since it allows the auditor to explore the dataset.
Fraud detection techniques fall into two broad groups: statistical data analysis and artificial intelligence.
Statistical data analysis techniques include:
- calculating statistical parameters
- regression analysis
- probability distributions and models
- data matching
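To make the first item concrete, here is a minimal sketch of how statistical parameters can be used to flag unusual values. It computes the mean and sample standard deviation of a list of amounts and reports the positions whose z-score exceeds a threshold. The function name `flag_outliers`, the threshold, and the sample amounts are all assumptions made for illustration.

```python
from statistics import mean, stdev


def flag_outliers(amounts, threshold=3.0):
    """Return indices of values whose absolute z-score exceeds the threshold.

    Requires at least two values, because the sample standard deviation
    is undefined for a single observation.
    """
    mu = mean(amounts)
    sigma = stdev(amounts)
    if sigma == 0:  # all values identical, nothing stands out
        return []
    return [i for i, x in enumerate(amounts) if abs(x - mu) / sigma > threshold]


# Hypothetical transaction amounts; the last one is far from the rest.
amounts = [10, 12, 11, 9, 10, 11, 500]
print(flag_outliers(amounts, threshold=2.0))
```

In a sample this small a single extreme value inflates the standard deviation, so the usual threshold of 3 would flag nothing here; that is why the example uses 2.0.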
AI techniques used to detect fraud include:
- Data mining classifies, groups and segments data to search through millions of transactions to find patterns and detect fraud.
- Neural networks learn suspicious-looking patterns and use them to detect similar cases later.
- Machine learning automatically identifies characteristics found in fraud.
- Pattern recognition detects classes, clusters and patterns of suspicious behavior.
Cryptocurrency fraud analysts look at huge volumes of historical data spanning long time periods. Our main idea is to comprehensively examine and visualize the available data related to fraud detection in the Ethereum network.
Our suggested steps to visualize data:
- Downloading and collecting data
- Data cleaning
- Data statistics and distribution
- Comparing different features of data between fraud and non-fraud classes
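The steps above can be sketched with pandas on a tiny in-memory table. This is only an illustration under stated assumptions: the column names `Address`, `FLAG`, and `total_ether_received` are hypothetical stand-ins, the data is invented, and the cleaning choices (dropping repeated addresses, filling gaps with the median) are one option among many. The notebooks contain the actual analysis.

```python
import pandas as pd


def feature_columns(df, label="FLAG"):
    """Numeric columns other than the class label."""
    numeric = df.select_dtypes(include="number").columns
    return [col for col in numeric if col != label]


def clean(df, key="Address", label="FLAG"):
    """Drop repeated addresses and fill missing feature values with the median."""
    df = df.drop_duplicates(subset=key).copy()
    cols = feature_columns(df, label)
    df[cols] = df[cols].fillna(df[cols].median())
    return df


def class_counts(df, label="FLAG"):
    """Number of rows per class, as a plain dictionary."""
    return df[label].value_counts().to_dict()


def compare_by_class(df, label="FLAG"):
    """Mean of every numeric feature, computed separately per class."""
    return df.groupby(label)[feature_columns(df, label)].mean()


# Invented sample: one duplicated address and one missing value.
raw = pd.DataFrame(
    {
        "Address": ["0xa", "0xb", "0xb", "0xc", "0xd"],
        "FLAG": [0, 1, 1, 0, 1],  # assumed encoding: 1 = fraud, 0 = non-fraud
        "total_ether_received": [10.0, 2.0, 2.0, None, 4.0],
    }
)

cleaned = clean(raw)
print(class_counts(cleaned))
print(compare_by_class(cleaned))
```

Grouping by the label keeps the comparison in one table, which is also a convenient input for the per-class plots shown later.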
We will use two datasets in this report.
We will analyze these two datasets both individually and in combination.
| Dataset | Number of Features | Total Cases | Fraud Cases | Non-Fraud Cases |
|---|---|---|---|---|
| Ethereum Fraud Detection Dataset | 37 | 9816 | 2179 | 7637 |
| Ethereum Fraud Dataset | 31 | 12146 | 5150 | 6996 |
| Merged Dataset | 17 | 20302 | 5675 | 14627 |
Table1. Datasets Overview
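One plausible way to combine two tables with different feature sets is to keep only the columns they share, stack the rows, and drop repeated addresses. The sketch below shows that idea with pandas. It is an assumption about how such a merge could be done, not a description of how the merged dataset in this repository was built, and the column names and helper `merge_on_common_features` are hypothetical.

```python
import pandas as pd


def merge_on_common_features(first, second, key="Address"):
    """Stack two tables on their shared columns and drop repeated keys."""
    common = [col for col in first.columns if col in second.columns]
    merged = pd.concat([first[common], second[common]], ignore_index=True)
    # Keep the first occurrence of each address.
    return merged.drop_duplicates(subset=key).reset_index(drop=True)


# Invented sample tables that overlap on one address and two columns.
first = pd.DataFrame(
    {"Address": ["0x1", "0x2"], "FLAG": [0, 1], "only_in_first": [1, 2]}
)
second = pd.DataFrame(
    {"Address": ["0x2", "0x3"], "FLAG": [1, 0], "only_in_second": [3, 4]}
)

merged = merge_on_common_features(first, second)
print(list(merged.columns))
print(len(merged))
```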
- Python >= 3.5
- pandas >= 0.24.2
- matplotlib >= 3.0.3
- seaborn >= 0.9.1
- numpy >= 1.18.5
- notebook >= 5.7.4
- Run `pip install -r requirements.txt` or `pip3 install -r requirements.txt`
| Notebook | GitHub Viewer | NB Viewer | Google Colab |
|---|---|---|---|
| Ethereum Fraud Detection Dataset | Link | Link | Link |
| Ethereum Fraud Dataset | Link | Link | Link |
| Merged Dataset | Link | Link | Link |
Table2. Notebooks
Only a few examples are shown here. The full set of visualizations and all of the code can be found in the notebooks.
Fig1. Data Distribution Pie Diagram
Fig2. Most Received Token Type Pie Diagram (Fraud Cases)
Fig3. Received Transactions Different Statistics Comparing
Fig4. Features Correlation Diagram
Fig5. Features Distribution Diagram
If you use this repo in your work, please cite it using the following metadata:
Haghighi, S., & Ramezani, F. (2022). Ethereum Fraud Detection Models (Version 1.0) [Computer software]. https://github.com/sepandhaghighi/Ethereum-Fraud-Detection-Models
@software{Haghighi_Ethereum_Fraud_Detection_2022,
  author = {Haghighi, Sepand and Ramezani, Farzad},
  license = {MIT},
  month = {10},
  title = {{Ethereum Fraud Detection Models}},
  url = {https://github.com/sepandhaghighi/Ethereum-Fraud-Detection-Models},
  version = {1.0},
  year = {2022}
}