Github repo: https://github.com/MandarGogate/Association-Rule-Mining-Hadoop-Python

A project by: Mandar Gogate, Abhishek Srivastava. Aman Ahuja

Introduction

In this work, we demonstrate association rule mining on the Hadoop distributed environment. We implement the FP­tree association rule mining algorithm on Hadoop mapreduce framework to demonstrate data mining on distributed systems. As a case study, we use this algorithm to extract association rules from the USA deaths dataset, to determine the relationships between different factors related to deaths of the people in the United States.
A case study on mining association rules between different factors related to deaths of people in the United States

Background and Motivation

Association rule mining is the method for discovering association rules between various parameters in the dataset. This is widely useful in systems such as e­commerce and supermarkets, where the association between the purchases of different products by the customers can be useful in marketing.
The process of mining association rules mainly involves the following two steps:

  1. A minimum support threshold is applied to find all frequent itemsets in a database.
  2. A minimum confidence constraint is applied to these frequent item­sets in order to form rules.

Apriori Algorithm

One of the most widely used algorithms in the field of association rule mining is the Apriori algorithm, which uses a breadth­first strategy to discover association rules between the parameters. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. One of the key limitations of this algorithm is costly wasting of time to hold a vast number of candidate sets with much frequent item sets, low minimum support or large itemsets.

FP­Tree

The Frequent Pattern (FP) Tree growth algorithm overcomes the disadvantages of Apriori algorithm by limiting the number of database scans needs for association rule discovery. The number of database scans needed in FP­Tree algorithm is reduced to 2.

In the first scan, it counts the number of occurrences of each item in the dataset to find frequent 1­itemsets. These counts are then sorted in the decreasing order, so that they can be used for the construction of FP­tree in the second scan of the database. Items in each instance that do not meet minimum coverage threshold are discarded.

After the FP­tree is constructed, it then recursively grows the patterns by traversing the FP­tree.

Parallelizing FP­tree growth algorithm:

In large datasets, constructing FP­tree can be cumbersome and time­consuming. To efficiently mine association rules on large datasets, we need to use distributed computing platforms so that the process can be simultaneously split across multiple processing nodes.

We implemented the FP­tree algorithm on Hadoop map­reduce platform to efficiently mine association rules on a large dataset. The overall process is divided into two map­reduce phases. These two map­reduce phases divide the entire algorithm implementation into, intuitively, frequent pattern set generation and rule generation. Next we give an elaborate description of the process.

Map-­reduce­ 1

Map: In the map step of the first map­reduce phase, we construct an FP­tree on the datanodes for each exclusive set of 10,000 transactions, generated in a sequential manner,i.e the first 10000 form the first set, next 10000 the second set and so on. We use a local threshold, so that redundant itemsets are removed early on which optimizes the process further, reducing the following reducer’s work in the process.

Just the give a brief idea regarding the implenetation of the mapper, we create an FPTree class maintaining a header and a mapper table in the process, for in order indexing of the attributes. During the tree construction, the longest common prefix is found for each transaction and thus added to it. Kindly note that FpGrowth is the method actually responsible for each transaction addition to the tree.

Reduce: The reducer serves to combine the partial frequency counts of patterns and writes the summations out to the Hadoop file systems.

The next map­reduce does class association rule mining on the derived frequent item sets from the first phase of the algorithm implementation.

Map-­reduce ­2

Map: This mapper generates all sub patterns for each given frequent item set, by removing only one element. The sets now formed become the base for the rule mining process executed by the following Reducer.

Reduce: The second phase reducer combines the patterns passed on by the aforementioned mapper, generating one antecedent rules with confidence and support greater than the specified threshold.

These rules are then processed to extract maximum semantic meaning with respect to the dataset.

Dataset

To demonstrate our work, we use the USA deaths dataset (https://www.kaggle.com/cdc/mortality) , which contains information about every death that occurred in the United States in the year 2014. These details include information about the causes of the death, and the demographic background of the deceased. The original dataset contains details about the death of nearly 2.6 million people who died in the year 2014. Each record contains 38 attributes related to the deceased.

Data Preprocessing

We preprocessed the dataset, so that the redundant attributes were removed, and only the important attributes were left for each record. In total, 11 attributes were dropped, and the remaining 17 attributes were used in the association rule mining process.

These were:

●  Resident status (state resident or not)
●  Education (years of education using 2003 code)
●  Month of death
●  Sex (male/female)
●  Age (divided in 12 bins)
●  Marital status
●  Day of week of death
●  Injury at work
●  Manner of death (accident/suicide/natural/disease)
●  Method of disposition (burial/cremation)
●  Autopsy performed (yes/no)
●  Activity Code (While engaged in sports activity / While engaged in leisure activity)
●  Place of Injury (Home/Institution)
●  Type of disease
●  Race (white/black)
●  Bridged race
●  Hispanic Origin

After filtering out the attributes, we then discretized all the attribute values. For example, the resident status of 0 or 1 was converted to A0 and A1 respectively for the rule mining process.

The dataset size was reduced from 270 MB to 156 MB after the preprocessing.

Experimental Results: Quantitative Results:

We now present the quantitative results of the experiments conducted the discussed dataset. The time required to for the algorithm to run on a single cluster Hadoop machine was nearly 2 hours. This was significantly reduced to nearly 50 mins, when the algorithm was run on multi­cluster setup consisting of 3 machines.

Visualisation of Results:

Percentage of suicide cases as per education level

Suicides by age

Ordered causes of deaths

  1. Atherosclerotic heart disease
  2. Malignant neoplasm: Bronchus or lung
  3. Unspecified dementia
  4. Acute myocardial infarction
  5. Chronic obstructive pulmonary disease
  6. Alzheimer disease
  7. Stroke, not specified as haemorrhage or infarction
  8. Atherosclerotic cardiovascular disease
  9. Congestive heart failure

4 Comments

Anonymous · January 25, 2017 at 8:17 pm

Can you please tell me what kind of dataset did you use? Can you please send me a snapshot of the data set you used and the rules that you got. Your help will be greatly appreciated.

    Mandar Gogate · January 25, 2017 at 9:04 pm

    We used Death in the United States dataset

kristin · June 12, 2017 at 7:38 pm

Could you please give some introduction for running the sample code?
I cannot find hadoop-streaming-0.20.2-cdh3u2.jar so used the newest version.
But still some errors there.

Leave a Reply

%d bloggers like this: