The data-mining-algorithms's intro from gsm1011

- HOWTO Compile and Run this program. 
  Compile this project by typing "make" in the current working directory.
  Run the program by typing:
  ./AssoRuleMiner p2eqbindata.txt 0.8 0.3 150 3
  which corresponds to:
  ./AssoRuleMiner datafile minSup minConf g k
  datafile is the input data file (p2eqbindata.txt is provided in the
  current directory), minSup and minConf must both be within (0,1], g
  is the total number of genes (columns) to process, and k is the
  number of top association rules to print, ranked by sup*conf.
  
  WARNING: Please don't run this program with a very low minSup. It
  might consume a large amount of resources and, in the worst case,
  crash your system.
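
  The argument ranges above can be checked before any mining starts.
  The following is only a sketch under assumptions: the MinerArgs
  struct and the parseArgs function are hypothetical names chosen for
  illustration and are not taken from the project sources.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical container for the five command-line arguments.
struct MinerArgs {
    std::string datafile;
    double minSup;
    double minConf;
    int g;  // number of genes (columns) to process
    int k;  // number of top rules to print
};

// Parses and validates: ./AssoRuleMiner datafile minSup minConf g k
// Throws std::invalid_argument when a value is missing or out of range.
MinerArgs parseArgs(int argc, const char* const argv[]) {
    if (argc != 6) {
        throw std::invalid_argument(
            "usage: ./AssoRuleMiner datafile minSup minConf g k");
    }
    MinerArgs a;
    a.datafile = argv[1];
    a.minSup = std::stod(argv[2]);
    a.minConf = std::stod(argv[3]);
    a.g = std::stoi(argv[4]);
    a.k = std::stoi(argv[5]);
    // Both thresholds must lie in the half-open interval (0,1].
    if (!(a.minSup > 0.0 && a.minSup <= 1.0)) {
        throw std::invalid_argument("minSup must be within (0,1]");
    }
    if (!(a.minConf > 0.0 && a.minConf <= 1.0)) {
        throw std::invalid_argument("minConf must be within (0,1]");
    }
    if (a.g <= 0 || a.k <= 0) {
        throw std::invalid_argument("g and k must be positive");
    }
    return a;
}
```

  Rejecting a minSup of 0 up front also guards against the resource
  blow-up described in the warning above.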

- Format of p2ItemMap.txt.
  This file contains the mapping of the original discretized data to
  unique ids. Each row represents a transaction of the original data.
  Columns are separated by ",", and each item is composed of the
  original data and the mapped unique id, which are separated by a
  space. Please see the sample below:

  c 0,b 4,b 8,b 12,n 16
  c 0,c 5,b 8,b 12,p 17
  a 1,c 5,c 9,b 12,n 16	
  b 2,c 5,c 9,b 12,p 17
  a 1,b 4,b 8,a 13,n 16
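
  A row in this format can be split back into (original value, id)
  pairs as sketched below. The function name parseItemMapLine is a
  hypothetical helper for illustration, not part of the project.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Splits one row such as "c 0,b 4,b 8,b 12,n 16" into
// (original value, unique id) pairs. Malformed cells are skipped.
std::vector<std::pair<std::string, int>> parseItemMapLine(
        const std::string& line) {
    std::vector<std::pair<std::string, int>> items;
    std::istringstream row(line);
    std::string field;
    while (std::getline(row, field, ',')) {   // columns are comma separated
        std::istringstream cell(field);
        std::string value;
        int id = 0;
        if (cell >> value >> id) {            // value and id are space separated
            items.emplace_back(value, id);
        }
    }
    return items;
}
```

  Stream extraction skips surrounding whitespace, so a stray tab at
  the end of a row does not affect the result.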

- Format of p2FreqItemsets.txt.
  This file contains all the frequent itemsets generated by the
  APRIORI algorithm. Each line is a frequent itemset in the format
  freqset:support. The output follows the level-order traversal of
  the hash tree. Example content of this file is as follows:

  4,8:0.33871
  4,13:0.306452
  4,8,16:0.209677
  6,10,13:0.209677
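
  Reading such a line back is straightforward. The sketch below
  assumes the freqset:support layout shown above; the FreqItemset
  struct and parseFreqItemsetLine function are hypothetical names
  used only for this example.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one line of p2FreqItemsets.txt.
struct FreqItemset {
    std::vector<int> items;  // mapped unique item ids
    double support;          // fraction of transactions containing the set
};

// Parses a line such as "4,8,16:0.209677".
// Returns false if the ':' separator or the item list is missing.
// std::stoi / std::stod throw on non-numeric tokens.
bool parseFreqItemsetLine(const std::string& line, FreqItemset& out) {
    const std::string::size_type colon = line.rfind(':');
    if (colon == std::string::npos) {
        return false;
    }
    out.items.clear();
    std::istringstream ids(line.substr(0, colon));
    std::string token;
    while (std::getline(ids, token, ',')) {
        out.items.push_back(std::stoi(token));
    }
    if (out.items.empty()) {
        return false;
    }
    out.support = std::stod(line.substr(colon + 1));
    return true;
}
```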

- Format of the top k Association rule output. 
  The top k association rules are selected according to sup*conf and
  are printed at the end of the program execution. The format of each
  association rule is:
  [anteset]-->[conset] [supxy] [supx] [conf] [supxy*conf]
  Here supxy is the support of the whole rule, supx is the support of
  the antecedent, and conf = supxy/supx.
  Below are some examples of such output:

  222-->232 0.903226 0.903226 1 0.903226
  122-->222,232 0.887097 0.887097 1 0.887097
  122,222-->232 0.887097 0.887097 1 0.887097
  122,228-->222,230 0.870968 0.870968 1 0.870968
  76,228,230-->232 0.870968 0.870968 1 0.870968
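
  The ranking can be sketched as follows, assuming the standard
  definition conf = supxy/supx. The Rule struct and topK function are
  hypothetical names for illustration and do not come from the
  project's AssoRule class.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical rule record: antecedent --> consequent.
struct Rule {
    std::string ante;
    std::string cons;
    double supxy;  // support of antecedent and consequent together
    double supx;   // support of the antecedent alone
    double conf() const { return supxy / supx; }
    double score() const { return supxy * conf(); }  // ranking key sup*conf
};

// Returns the k highest-scoring rules in descending order of sup*conf.
std::vector<Rule> topK(std::vector<Rule> rules, std::size_t k) {
    k = std::min(k, rules.size());
    const auto middle =
        rules.begin() + static_cast<std::vector<Rule>::difference_type>(k);
    std::partial_sort(rules.begin(), middle, rules.end(),
                      [](const Rule& a, const Rule& b) {
                          return a.score() > b.score();
                      });
    rules.erase(middle, rules.end());
    return rules;
}
```

  std::partial_sort only orders the first k elements, which is enough
  when k is small compared to the number of generated rules.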

- Files. 
  README - This file. 

  Makefile - The project organization file.

  defs.h - Defines commonly used macros and/or functions, such as the
  hash function and the itoa function used for converting an integer
  to a string.
  
  Item.h - Definition and implementation of the Item class.
  The Item class represents a single item of the gene data, and it is
  also the element that is combined to form itemsets.

  Itemset.h, Itemset.cpp - Definition and implementation of the Itemset
  class. The Itemset class represents the frequent itemsets that we
  need to generate with the APRIORI algorithm. Generally, it is a
  composition of a group of items. A join method is provided for this
  class, as well as a method that generates association rules from
  frequent itemsets.

  HashTree.h, HashTree.cpp (deprecated) - Definition and
  implementation of the HashNode and HashTree classes. The HashNode
  class is a container of frequent itemsets, which are generated by
  joining, scanning and pruning. The frequent itemsets are stored as a
  hash map within the HashNode class. The HashTree class iteratively
  produces frequent itemsets level by level, one level per itemset
  length. HashNodes are organized into a HashTree, and a map data
  structure is used to facilitate the search for a specific node and
  for an itemset within a node.

  DataSet.h, DataSet.cpp - Definition and implementation of the DataSet
  class. This class is responsible for loading data from a file,
  mapping items from discretized data to unique integer IDs, running
  the APRIORI algorithm over the mapped gene data sets, and finally
  saving all the results to files or printing them to the screen.

  AssoRule.h, AssoRule.cpp - Definition and implementation of the
  AssoRule class. This class represents the association rules we are
  supposed to generate from the frequent itemsets. The format of the
  output is: [anteset]-->[conset] [supxy] [supx] [conf] [supxy*conf].
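
  The join step mentioned for the Itemset class can be illustrated
  with the textbook Apriori join: two sorted itemsets of length k
  that agree on their first k-1 items are merged into one candidate
  of length k+1. This is a sketch of the general technique, not the
  project's actual Itemset join method; joinItemsets is a hypothetical
  free function operating on plain vectors of item ids.

```cpp
#include <cstddef>
#include <vector>

// Textbook Apriori join. Both inputs must be sorted ascending.
// On success, writes the (k+1)-item candidate to `out` and returns true.
bool joinItemsets(const std::vector<int>& a,
                  const std::vector<int>& b,
                  std::vector<int>& out) {
    if (a.empty() || a.size() != b.size()) {
        return false;
    }
    const std::size_t k = a.size();
    // The first k-1 items must be identical.
    for (std::size_t i = 0; i + 1 < k; ++i) {
        if (a[i] != b[i]) {
            return false;
        }
    }
    // Requiring a's last item to be smaller avoids duplicate candidates
    // and keeps the result sorted.
    if (a[k - 1] >= b[k - 1]) {
        return false;
    }
    out = a;
    out.push_back(b[k - 1]);
    return true;
}
```

  For example, joining 4,8 with 4,16 yields the candidate 4,8,16,
  which is then kept only if its support reaches minSup.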

- Abbreviations of HashTree output (deprecated): 
  To output the content of the hash tree (level-order tree traversal),
  you need to enable a switch in the Makefile, which is
  "MACROS += -DDEBUG_APRIORI_TRAVERSAL".

  CN - Create Node. 
  -> - Parent of node.
  :  - Separator.
  II - Insert Itemset. 
  NN - New Node.
  ND - NoDe.
  VN - Visit Node. 
  NC - Number of Children.
  NFI - Number of Frequent Itemsets.
  FIS - Frequent ItemSets.

- Documentation. 
  You can use doxygen to generate an API reference for this project.
  You need to install doxygen and dot in order to generate the
  documentation. To generate it, use "doxygen Doxygen".

- About Debugging of project. 
  In this project, I used a lot of conditional compilation macros for
  the purpose of debugging. You can enable a debugging feature by
  removing the "#" in front of its line in the Makefile. I hope it
  will be useful.
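
  As an illustration of this pattern, the sketch below shows how a
  switch such as -DDEBUG_APRIORI_TRAVERSAL can gate debug output at
  compile time. The TRAVERSAL_LOG macro, the two functions and the
  message layout are assumptions made for this example and are not
  taken from the project sources.

```cpp
#include <iostream>

// When the Makefile passes -DDEBUG_APRIORI_TRAVERSAL, the macro prints
// to stderr; otherwise it compiles to nothing.
#ifdef DEBUG_APRIORI_TRAVERSAL
#define TRAVERSAL_LOG(expr) (std::cerr << expr << '\n')
#else
#define TRAVERSAL_LOG(expr) ((void)0)
#endif

// Reports whether the traversal switch was enabled at compile time.
constexpr bool traversalLoggingEnabled() {
#ifdef DEBUG_APRIORI_TRAVERSAL
    return true;
#else
    return false;
#endif
}

// Hypothetical call site: logs a node visit only in debug builds.
void visitNode(int nodeId, int numChildren) {
    TRAVERSAL_LOG("VN " << nodeId << " NC " << numChildren);
    (void)nodeId;       // silence unused-parameter warnings when
    (void)numChildren;  // logging is compiled out
}
```

  Because the disabled macro expands to an empty expression, release
  builds pay no runtime cost for the debug statements.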

- Proof of Correctness. 
  This APRIORI program has been verified by setting the support to the
  smallest value close to 0 (with small input) so that the hash tree
  generates all the transactions within the dataset. Be careful NOT to
  do this with a large amount of data, as it will consume all the
  memory and may even halt your system.

- Copyright Notice. 
  This is free software, so you can change and redistribute it. When
  doing so, please keep the header lines in each file, or contact the
  author through [email protected]. You can also check out the source
  online with:
  svn co http://fall-2010.googlecode.com/svn/fall-2010/data_mining/proj2 proj2