
Comparative study of frequent pattern mining algorithms on Adult Census Data

Goal

Implement and compare the Apriori and FP Growth frequent pattern mining algorithms. Introduce improvements to the Apriori algorithm and compare the results with the original algorithm.

Dataset Description:

Dataset used: UCI Adult Census Dataset (Source: https://archive.ics.uci.edu/ml/datasets/Adult)

Attributes: Total number of attributes – 15

• Age: continuous numeric data.

• Work-class: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

• Fnlwgt (final sampling weight): continuous data.

• Education: Level of education. Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

• Education-num: continuous data.

• Marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

• Occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

• Relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

• Race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

• Sex: Female, Male.

• Capital-gain: continuous data.

• Capital-loss: continuous data.

• Hours-per-week: continuous data.

• Native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

• Income: income categorized as greater than 50K or less than or equal to 50K (>50K, <=50K).

Missing Values – Yes, represented by “?”.

Irrelevant Attributes – Yes, two attributes (‘fnlwgt’ and ‘edunum’) do not provide any significant information.

Number of rows: 32560

Data Pre-Processing:

Data Head:

The data was pre-processed for frequent pattern mining. As can be seen, the data contains a few irrelevant attributes, fnlwgt and Edunum; these were dropped from the data set, leaving 13 attributes. It also has missing values represented by a question mark, so rows containing missing values were dropped, leaving 30161 rows. Certain attributes are continuous numeric data and must be quantized into bins before the algorithms can be applied. Such attributes were converted to categorical data as follows:

Age: 0–25 -> Young; 25–40 -> Middle-aged; 40 and above -> Old

Capital Gain: 0 -> No Gain; more than 0 -> Gain

Capital Loss: 0 -> No Loss; more than 0 -> Loss

Hours per Week: 0–5 -> Less; 5–20 -> Medium; 20–60 -> Reasonable; 60 and above -> High
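The binning above can be sketched as plain Python functions (a minimal illustration; treating each boundary value as belonging to the upper bin is an assumption, since the original ranges overlap at their endpoints):

```python
def bin_age(age):
    # 0-25 -> Young, 25-40 -> Middle-aged, 40 and above -> Old
    if age < 25:
        return "Young"
    if age < 40:
        return "Middle-aged"
    return "Old"

def bin_hours(hours):
    # 0-5 -> Less, 5-20 -> Medium, 20-60 -> Reasonable, 60 and above -> High
    if hours < 5:
        return "Less"
    if hours < 20:
        return "Medium"
    if hours < 60:
        return "Reasonable"
    return "High"

def bin_capital(value, kind):
    # kind is "Gain" or "Loss": 0 -> "No Gain"/"No Loss", above 0 -> "Gain"/"Loss"
    return kind if value > 0 else f"No {kind}"
```

After this step every attribute is categorical, so each row can be treated as a transaction of items.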

Data head after pre-processing:

Apriori Frequent Pattern Mining Overview:

The Apriori algorithm uses prior knowledge of frequent itemsets to generate frequent patterns in a data set. It mainly consists of two steps: join and prune. In the join step, the candidate k-itemsets Ck are generated by joining the frequent (k-1)-itemsets L(k-1) with themselves. In the prune step, the data is scanned to count each candidate, and every (k-1)-subset of each candidate is checked against L(k-1); candidates with a subset not present in L(k-1) are removed from Ck.
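The join/prune loop described above can be sketched as a minimal, self-contained implementation (transactions are sets of items; support is a fraction of the number of transactions):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori sketch: returns all frequent itemsets as frozensets."""
    n = len(transactions)
    # L1: frequent 1-itemsets from a first scan of the data.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s for s, c in counts.items() if c / n >= min_support}
    frequent = set(L)
    k = 2
    while L:
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step: drop candidates having an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Count the surviving candidates with one scan of the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        L = {c for c, cnt in counts.items() if cnt / n >= min_support}
        frequent |= L
        k += 1
    return frequent
```

For example, with transactions `[{"a","b","c"}, {"a","b"}, {"a","c"}, {"b","c"}]` and `min_support=0.5`, each single item and each pair is frequent, but the triple `{"a","b","c"}` appears only once and is pruned by its support count.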

FP Growth Frequent Pattern Mining Overview:

FP Growth adopts a divide-and-conquer strategy to mine frequent patterns. It first builds an FP-tree, a compact prefix-tree representation of the frequent items in the transactions, and then mines frequent patterns from this tree without generating candidate itemsets.
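The first phase, building the FP-tree, can be sketched as follows (a minimal illustration of tree construction only, not the full recursive mining step):

```python
class FPNode:
    """One node of the FP-tree: an item, its count, and child links."""
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Pass 1: count items and keep only those meeting the support count.
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    # Pass 2: insert each transaction, items ordered by descending frequency,
    # so that common prefixes share paths and the tree stays compact.
    root = FPNode(None, None)
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
    return root, frequent
```

Frequent patterns are then mined by recursively extracting each item's conditional pattern base from the tree; that step is omitted here for brevity.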

Improvement on Apriori Frequent Pattern Mining Algorithm using Sampling:

The sampling improvement runs Apriori on a sample of the dataset instead of scanning the entire dataset again and again; the full dataset is scanned only once, while the sample is generated. Since the algorithm is applied to a reduced dataset, computation time drops drastically. With an optimal sampling factor and support count, all globally frequent patterns can be generated, giving high efficiency as well as accuracy. In most cases, however, sampling trades some accuracy for efficiency, since the algorithm sees only a subset of the larger dataset. To increase accuracy, a threshold lower than the minimum support threshold is often applied to the sample. This improvement is most beneficial to applications where efficiency is critical.
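A minimal sketch of the sampling variation (the 10% threshold reduction is an illustrative choice for the lowered support, not a value from this study):

```python
import random

def sampled_apriori(transactions, min_support, sampling_factor,
                    apriori_fn, seed=0):
    # Single scan of the full dataset: draw a random sample of transactions.
    rng = random.Random(seed)
    sample_size = max(1, int(len(transactions) * sampling_factor))
    sample = rng.sample(transactions, sample_size)
    # Run Apriori on the sample with a slightly lowered support threshold
    # to reduce the chance of missing globally frequent patterns.
    lowered_support = 0.9 * min_support  # illustrative reduction
    return apriori_fn(sample, lowered_support)
```

Patterns found on the sample can optionally be verified against the full dataset with one extra scan, at the cost of some of the time saved.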

Improvement on Apriori Frequent Pattern Mining Algorithm using Transaction Reduction:

A transaction that contains no frequent k-itemset cannot contain any frequent (k+1)-itemset, so it is useless in subsequent scans and can be removed, reducing the number of transactions. This improves the computation and overall efficiency of the Apriori algorithm.
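This pruning rule can be sketched as a small helper applied between iterations of Apriori (a hypothetical function name for illustration):

```python
def reduce_transactions(transactions, frequent_itemsets):
    # Keep only transactions containing at least one current frequent
    # itemset; the rest cannot support any larger frequent itemset.
    return [t for t in transactions
            if any(itemset <= t for itemset in frequent_itemsets)]
```

Each later scan then iterates over a shrinking transaction list, which is where the efficiency gain comes from.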

Comparison of the first three algorithms:

Execution time at varying support thresholds (SF = sampling factor):

Support (%)    Apriori    Sampled Apriori                               FP Growth
                          SF = 0.3   SF = 0.4   SF = 0.5   SF = 0.6
80              47.006       1.508      2.008      2.518      3.026       31.290
70             115.461       1.525      2.021     39.082     42.076       43.117
60             429.883       1.513      2.073     97.932     98.143       45.932
50             833.279      95.084      97.54    549.787    748.642       47.364
40            1294.586    1013.177    1033.06   1041.036   1162.078       68.771

OBSERVATIONS:

• It was observed that FP Growth is the fastest of all the algorithms.

• It was also observed that the sampling variation improves the efficiency of the Apriori algorithm by a huge amount, as the execution time drops drastically.

• From the outputs shown in previous sections, it was observed that the sampling variation misses a few of the global frequent patterns; its accuracy is therefore lower than that of the original Apriori algorithm. This implies that the sampling variation should be used where time is critical or where patterns must be mined frequently. Accuracy is thus compromised to increase efficiency.

• It was also observed that as the sampling factor increases, the time taken by the algorithm also increases.
