About Quantitative CBA

Quantitative CBA (QCBA) is a postprocessing algorithm for association rule classification algorithm CBA, which implements a number of optimization steps to improve handling of quantitative (numerical) attributes. The viable properties of these rule lists that make CBA classification models most comprehensible among all association rule classification algorithms, such as one-rule classification and crisp rules, are retained. The postprocessing is conceptually fast, because it is performed on a relatively small number of rules that passed the pruning steps, and can be adapted also for multi-rule classification algorithms. Benchmarks show about 50% decrease in the total size of the model as measured by the total number of conditions in all rules. Model accuracy generally remains on the same level as for CBA with QCBA even providing small improvement over CBA on 11 of the 22 datasets involved in our benchmark.

What this tutorial shows

This tutorial has three parts:

Part 1

Shows how to build CBA and QCBA model using the arc and qCBA packages. To learn how to setup QCBA refer to the Quantitative CBA github homepage.

Part 2

Visually demonstrates all the optimization steps in QCBA:

Refitting rules to value grid. Literals originally aligned to borders of the discretized regions are refit to finer grid.
Attribute pruning. Remove redundant attributes from rules.
Trimming. Literals in discovered rules are trimmed so that they do not contain regions not covered by data.
Extension. Ranges of literals in the body of each rule are extended, escaping from the coarse hypercubic created by discretization.
Data coverage pruning. Remove some of the newly redundant rules
Default rule overlap pruning. Some rules that classify into the same class as the default rule in the end of the classifier can be removed.

Part 3

Introduces advanced options:

Speeding up execution on large datasets
Invoking directly Java-based QCBA backend
Experimental multi-rule classification
Generating ROC curve and computing AUC

Part 1: how to build CBA and QCBA model

Library

The QCBA implementation is in Java. It is also available via an R package wrapper, which we will use for this example.

library(qCBA)

To learn how to install QCBA refer to the Quantitative CBA github homepage.

Data

Let’s look at the humtemp sample data bundled with the arc package, which we will be using throughout this tutorial.

There are two explanatory attributes (Temperature and Humidity). The target attribute is preference (e.g. subjective comfort level).

attach(humtemp)

The first few rows of the data:

head(humtemp)

##   Temperature Humidity Class
## 1          45       33     2
## 2          27       29     3
## 3          40       48     2
## 4          40       65     1
## 5          38       82     1
## 6          37       30     3

And a scatter plot:

plot(Humidity,Temperature,pch=as.character(Class))

Building a CBA classifier

Association rule learning requires discretized data. In this case, we perform simple equidistant binning.

  #custom discretization
  data_raw <- humtemp
  data_discr <- humtemp
  temp_breaks <- seq(from=15,to=45,by=5)
  hum_breaks <- c(0,40,60,80,100)
  temp_unique_vals <- setdiff(unique(Temperature),temp_breaks)
  hum_unique_vals <- setdiff(unique(Humidity),hum_breaks)
  data_discr[,1] <- cut(Temperature,breaks=temp_breaks)
  data_discr[,2] <- cut(Humidity,breaks=hum_breaks)
  #change interval syntax from (15,20] to (15;20], which is required by QCBA R package
  data_discr[,1] <- as.factor(unlist(lapply(data_discr[,1], function(x) {gsub(",", ";", x)})))
  data_discr[,2] <- as.factor(unlist(lapply(data_discr[,2], function(x) {gsub(",", ";", x)})))
  
  data_discr[,3] <- as.factor(Class)
  head(data_discr)

##   Temperature Humidity Class
## 1     (40;45]   (0;40]     2
## 2     (25;30]   (0;40]     3
## 3     (35;40]  (40;60]     2
## 4     (35;40]  (60;80]     1
## 5     (35;40] (80;100]     1
## 6     (35;40]   (0;40]     3

The discretization splits the data space into rectangular regions. Given that we have two attributes, the discovered rule can only correspond to a rectangular region with borders aligned to the grid. If we had more than two attributes, the discovered rule would delimit a hypercube.

plotGrid <- function(plotFineGrid=TRUE, plotDiscrGrid=TRUE)
{
  if (plotDiscrGrid)
  {

    for (i in temp_breaks[-1])
  {
    abline(h=i, lty=2)
  }
  for (i in hum_breaks[-1])
  {
    abline(v=i, lty=2)
  }
  }
  if (plotFineGrid)
  {
    for (i in temp_unique_vals[-1])
    {
      abline(h=i, lty=3, col="grey")
    }
    for (i in hum_unique_vals[-1])
    {
      abline(v=i, lty=3, col="grey")
    }  
  }
}

plot(Humidity,Temperature,pch=as.character(Class))
plotGrid(FALSE)

The next step is mining of association rules. The rule mining is constrained to rules that have values of the Class attribute in the consequent.

sink("/dev/null")
classAtt="Class"
appearance <- getAppearance(data_discr, classAtt)
txns <- as(data_discr, "transactions")
rules <- apriori(txns, parameter = list(confidence = 0.5, support= 3/nrow(data_discr), minlen=1, maxlen=3), appearance=appearance)
interestingRule <- inspect(rules)[5,] #will use this later
sink()

inspect(rules)

##     lhs                                        rhs       support    confidence
## [1] {Humidity=(80;100]}                     => {Class=1} 0.11111111 0.8000000 
## [2] {Temperature=(15;20]}                   => {Class=2} 0.11111111 0.5714286 
## [3] {Temperature=(30;35]}                   => {Class=4} 0.13888889 0.6250000 
## [4] {Temperature=(25;30]}                   => {Class=4} 0.13888889 0.5000000 
## [5] {Temperature=(25;30], Humidity=(40;60]} => {Class=4} 0.08333333 0.6000000 
##     coverage  lift     count
## [1] 0.1388889 3.600000 4    
## [2] 0.1944444 2.057143 4    
## [3] 0.2222222 2.045455 5    
## [4] 0.2777778 1.636364 5    
## [5] 0.1388889 1.963636 3

The rules can be visualized in the feature space as rectangular regions. Class 1 is coded as red, Class 2 as green, Class 3 as black, and Class 4 as blue region.

interesting_rule <- 5
#as.character(Class)
plot(Humidity,Temperature,pch=as.character(Class),main="Discovered asociation rules",cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)

plotGrid(FALSE)

plotHumTempRule<- function(rules, ruleIndex)
{
  if (typeof(rules)=="S4")
  {
    # rules is a arules rule model
    # sink: inspect also sends rules to the standard output
    sink("/dev/null")
    r <- inspect(rules)[ruleIndex,]
    sink()
    rule <- paste(unlist(r$lhs[1]),collapse='')
    rhs <- paste(unlist(r$rhs[1]),collapse='')
  }
  else
  {
    # rules is a list of rules output by qCBA
    rule <- rules[ruleIndex,1]
    #rule <- rules$rules[ruleIndex]
    rhs <- regmatches(rule,regexec("\\{Class=.*\\}",rule))
  }
  #get color
  if (rhs == "{Class=1}")
  {
    border = "red"
    col=rgb(1.0,0.2,0.2,alpha=0.3)
  }
  else if (rhs == "{Class=2}")
  {
    border = "green"
    col=rgb(0,1,0,alpha=0.3)
  }
  else if (rhs == "{Class=3}")
  {
    border = "black"
    col=rgb(0.4,0.4,0.4,alpha=0.3)
  }
  else if (rhs == "{Class=4}")
  {
    border = "blue"
    col=rgb(0,0,1,alpha=0.3)
  }

  
  temp_coordinates<-unlist(regmatches(rule,regexec("Temperature=.([0-9]+);([0-9]+).",rule)))
  if (length(temp_coordinates)==0)
  {
    #if the temperature literal is missing in the rule, use the following coordinates
    temp_coordinates=c(0,0,50)
  }
  hum_coordinates<-unlist(regmatches(rule,regexec("Humidity=.([0-9]+);([0-9]+).",rule)))
  if (length(hum_coordinates)==0)
  {
    #if the humidity literal is missing in the rule, use the following coordinates
    hum_coordinates=c(0,0,100)
  }
  m <- rect(hum_coordinates[2], temp_coordinates[2], hum_coordinates[3], temp_coordinates[3],border=border,col=col)
}

plotRules <- function(rules)
{
  if (typeof(rules)=="S4") #for arules/cba
  {
    rule_count <- length(rules)
  }
  else #for qcba
  {
    rule_count <- nrow(rules)
  }
  for (i in 1:rule_count)
  {
    plotHumTempRule(rules,i)
  }
}
plotRules(rules)

Out of the two discovered rules, we will create a CBA classifier. This means that the rules will be:

sorted according to confidence support and length
subject to simultaneous data coverage pruning & default rule pruning:

data coverage pruning: the algorithm iterates through the rules in the sort order removing any rule which does not correctly cover any instances. If the rule correctly covers at least one instance, it is retained and the instances removed.
default rule pruning: the algorithm iterates through the rules in the sort order, and cuts off the list once keeping the current rule would result in worse accuracy of the model than if a default rule was inserted at the place of the current rule and the rules below it were removed.

default rule is inserted at the bottom of the list. Default rule is a rule with empty antecedent and consequent predicting a majority class left in the training data once the previous rules in the classifier were applied.

classAtt="Class"
  appearance <- getAppearance(data_discr, classAtt)
  # Note that we are calling `cba_manual()` instead of cba() because we want - for demonstration purposes - to construct the classifier from a externally-generated rule list.
  rmCBA <- cba_manual(data_raw,  rules, txns, appearance$rhs, classAtt, cutp = list(),pruning_options=list(default_rule_pruning=FALSE))

Explanation of the key settings:

cba_manual instead of cba: The reason why we invoke cba_manual from the arc package is that cba_manual allows to supply custom list of rules from which the CBA model will be built. An alternative option would be to directly invoke the cba function, which would also perform association rule learning as well as data discretization: rmCBA_auto <- cba(humtemp, classAtt="Class")
cutp = list(): cutp specifies list of cutpoints used to discretize data. For demonstration purposes we performed discretization of data as part of preprocessing, therefore no cutpoints are specified.
default_rule_pruning=FALSE: While default rule pruning is a part of the CBA process, we obtained slightly better results if it is disabled. This outputs more rules to be later optimized by QCBA. Within QCBA, rules will be first optimized and then subject to default rule pruning.

In this case, CBA did not remove any rule, but added a default rule to the end. This ensures that the rule list covers every possible instance.

inspect(rmCBA@rules)

##     lhs                                        rhs       support    confidence
## [1] {Humidity=(80;100]}                     => {Class=1} 0.11111111 0.8000000 
## [2] {Temperature=(30;35]}                   => {Class=4} 0.13888889 0.6250000 
## [3] {Temperature=(25;30], Humidity=(40;60]} => {Class=4} 0.08333333 0.6000000 
## [4] {Temperature=(15;20]}                   => {Class=2} 0.11111111 0.5714286 
## [5] {Temperature=(25;30]}                   => {Class=4} 0.13888889 0.5000000 
## [6] {}                                      => {Class=2} 0.27777778 0.2777778 
##     coverage  lift     count lhs_length orderedConf orderedSupp cumulativeConf
## [1] 0.1388889 3.600000  4    1          0.8000000   4           0.8000000     
## [2] 0.2222222 2.045455  5    1          0.7142857   5           0.7500000     
## [3] 0.1388889 1.963636  3    2          0.6000000   3           0.7058824     
## [4] 0.1944444 2.057143  4    1          0.5000000   3           0.6521739     
## [5] 0.2777778 1.636364  5    1          0.5000000   2           0.6296296     
## [6] 1.0000000 1.000000 10    0          0.5555556   5           0.6111111

The green background on the plot is associated with the default rule classifying to Class 2.

plot(Humidity,Temperature,pch=as.character(Class),main="CBA model",cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)
plotGrid(FALSE)
plotRules(rmCBA@rules)

The accuracy of this model on training data:

  prediction_cba<-predict(rmCBA,data_discr,discretize=FALSE)
  acc_cba <- CBARuleModelAccuracy(prediction_cba, data_discr[[classAtt]])
  print(paste("Accuracy (CBA):",acc_cba))

## [1] "Accuracy (CBA): 0.611111111111111"

Postprocessing CBA output with Quantitative CBA (QCBA)

QCBA postprocesses CBA classifier created over discretized numeric attributes. QCBA requires on the input also the original raw undiscretized (continuous) data.

Basic QCBA model

To build a model, qcba needs a cba model and raw (undiscretized) data. Note that number of additional parameters can be also specified - these were left to their default values.

rmqCBA <- qcba(cbaRuleModel=rmCBA,datadf=data_raw)

To make CBA models more compact, QCBA performs the following optimizations that are enabled by default. The short call above can be expanded to:

trim_literal_boundaries <- TRUE #will use this variable later
rmqCBA <- qcba(cbaRuleModel=rmCBA,datadf=data_raw, extendType="numericOnly",trim_literal_boundaries = trim_literal_boundaries, postpruning = "cba", attributePruning = TRUE,  defaultRuleOverlapPruning="transactionBased",  createHistorySlot=TRUE,loglevel = "WARNING")
print(rmqCBA@rules)

To make CBA models more compact, QCBA performs the following optimizations that relate to the parameter settings:

Refitting rules to value grid. Literals originally aligned to borders of the discretized regions are refit to finer grid. This is the only optimization which cannot be disabled.
Attribute pruning (attributePruning). Remove redundant attributes from rules.
Trimming (trim_literal_boundaries). Literals in discovered rules are trimmed so that they do not contain regions not covered by data.
Extension (extendType). Ranges of literals in the body of each rule are extended, escaping from the coarse hypercubic created by discretization. Currently, only extension on numerical attributes is supported.
Data coverage pruning (postpruning). Removes some of the rules, which were made redundant by the previous QCBA optimizations, by performing another iteration of rule pruning, possible values are cba (CBA’s data coverage pruning) and greedy for its faster greedy variant.
Default rule overlap pruning (defaultRuleOverlapPruning). Some rules that classify into the same class as the default rule in the end of the classifier can be removed. This parameter has three possible values: transactionBased, rangeBased and none.

Finally, the createHistorySlot parameter is useful for debugging and visualization of the extension process. By default it is set to false. If set to true, the instance of the qCBARuleModel class created by rmqCBA() will contain a history slot with the chronological list of accepted rule extensions that were created on each rule during the extension process.

As we can notice, the number of rules on the output decreased. After the rule boundaries have been extended, postpruning removed two rules from the CBA model. The default rule was recomputed, and now classifies to green (Class 2).

plot(Humidity,Temperature,pch=as.character(Class),main="QCBA model",cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)
plotGrid(FALSE)
plotRules(rmqCBA@rules)

The accuracy of this model on training data:

  prediction_qcba<-predict(rmqCBA,data_raw,discretize=FALSE)
  acc_qcba <- CBARuleModelAccuracy(prediction_qcba, data_raw[[classAtt]])
  print(paste("Accuracy (QCBA):",acc_cba))

## [1] "Accuracy (QCBA): 0.611111111111111"

On the training data, QCBA provides the same accuracy as CBA but with fewer rules.

Part 2: explaining optimization steps in QCBA

The steps taken by QCBA to create the model shown above:

Refit the data to finer grid: this grid has steps corresponding to all unique attribute values appearing in the training data.
Attribute pruning: Remove redundant attributes from rules. Attribute is considered redundant if its removal does not decrease rule confidence.
Trimming (optional): Value ranges (intervals) in rule conditions are adjusted so that they do not contain regions not covered by data.
Extension: the ranges of literals in the body of each rule are extended to increase coverage.
Postpruning: second iteration of pruning, which may remove some rules made redundant using the previous steps.
Default rule overlap pruning: Default rule overlap iterates through all rules classifying into the same class as the default rule. These rules all overlap with the default rule and are thus candidates for pruning. They can be removed only if their removal will not change the classification of the instances (or regions – second type of default overlap pruning) they cover by rules that have a lower priority.

All steps will be demonstrated in detail in the following. Unlike other optimizations, attribute pruning is demonstrated on the Iris dataset (HumTemp with only 2 predictors is not suitable).

First, we will setup some code for visualization of the progress of QCBA.

plotHumTempRuleDelta<- function(rules, ruleIndex2,ruleIndex1, border = "red",col=rgb(0,0,1,alpha=0.3) )
{
  rule1 <- rules$rules[ruleIndex1]
  rule2 <- rules$rules[ruleIndex2]
  
  temp_coordinates1 <- unlist(regmatches(rule1,regexec("Temperature=.([0-9]+);([0-9]+).",rule1)))
  temp_coordinates2 <- unlist(regmatches(rule2,regexec("Temperature=.([0-9]+);([0-9]+).",rule2)))
  hum_coordinates1 <- unlist(regmatches(rule1,regexec("Humidity=.([0-9]+);([0-9]+).",rule1)))
  hum_coordinates2 <- unlist(regmatches(rule2,regexec("Humidity=.([0-9]+);([0-9]+).",rule2)))
  
  r1_bottom <- temp_coordinates1[2]
  r1_top <- temp_coordinates1[3] 
  r1_left <- hum_coordinates1[2]
  r1_right <- hum_coordinates1[3] 
  r2_bottom <- temp_coordinates2[2]
  r2_top <- temp_coordinates2[3] 
  r2_left <- hum_coordinates2[2]
  r2_right <- hum_coordinates2[3]  
  
  if (r2_right>r1_right)
  {
    r_right<- r2_right
    r_left <- r1_right
    r_top <- r1_top
    r_bottom <- r1_bottom
  }
  else if(r2_left>r1_left)
  {
    r_right<- r2_left
    r_left <- r1_left
    r_top <- r1_top
    r_bottom <- r1_bottom
  }
  else if  (r2_top>r1_top)
  {
    r_right<- r1_right
    r_left <- r1_left
    r_top <- r2_top
    r_bottom <- r1_top
  }  
  else if  (r2_bottom<r1_bottom)
  {
    r_right<- r1_right
    r_left <- r1_left
    r_top <- r1_bottom
    r_bottom <- r2_bottom
  }    
  m <- rect(r_left,r_bottom,r_right,r_top,border=border,col=col)
}

plotRuleInHistory <- function(extendHistory,i,seedRuleConf)
{
  titles=c("Rule is refit to the finer grid","Rule is trimmed", rep("Rule is extended",nrow(extendHistory)-3), "Final rule: no other extend was succcessful")
  curRuleConf <- extendHistory[i,5]
    if (seedRuleConf > curRuleConf  && i>1){
      titles[i] <- paste(titles[i]," (Conditional accept)")
    }
  
  plot(Humidity,Temperature,pch=as.character(Class), main=titles[i],cex.lab=0.01, cex.axis=1.5, cex.main=1.5, xlab="", ylab="", ylim=c(20,35),xlim=c(27,65),cex.sub=1.3, sub=paste(extendHistory[i,3], "\n","Supp:",round(extendHistory[i,4],2)," Conf:",round(curRuleConf,2)))
  plotGrid(TRUE,FALSE)
  plotHumTempRule(extendHistory[3],i)
  if (seedRuleConf > curRuleConf && i>1)
  {
    plotHumTempRuleDelta(extendHistory,i-1,i,border="red",col=rgb(1,0,0,alpha=0.5))
  }  
  
}
inspected_rule_RID <- "2"
extendHistory <- rmqCBA@history[rmqCBA@history$RID==inspected_rule_RID,]
base_rule_in_history <- 1
if (trim_literal_boundaries == TRUE)
{
# the base confidence will be taken from the trimmed rule, which is the second rule in history
 base_rule_in_history <- 2
}
seedRuleConf <- rmqCBA@history[rmqCBA@history$RID==inspected_rule_RID,][base_rule_in_history,5]

Refit: The finer grid

CBA rules stick to a grid that corresponds to results of discretization. The grid used by QCBA corresponds to all unique values appearing in the training data.

interestingRuleAsText <-  paste(paste(unlist(interestingRule$lhs[1]),collapse=''),paste(unlist(interestingRule$rhs[1]),collapse=''),sep=" => ")
plot(Humidity,Temperature,pch=as.character(Class),main="CBA-generated rule on original grid", sub=paste(interestingRuleAsText, "Supp:",round(interestingRule$support,2)," Conf:",round(interestingRule$confidence,2)),cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.15)
plotGrid(FALSE)
plotHumTempRule(rmCBA@rules,3)

The same rule from the CBA classifier plotted on the finer grid.

plot(Humidity,Temperature,pch=as.character(Class), main="The finer grid",cex.lab=1.5, cex.axis=1.5,  xlab="", ylab="", cex.main=1.5, cex.sub=1.15)
plotGrid()
plotHumTempRule(rmCBA@rules,3)

Let’s zoom in and look at how QCBA refit the rule:

plotRuleInHistory(extendHistory,1,seedRuleConf)

Trimming

Rule is shaved of any boundaries that are not backed by correctly classified instances.

plotRuleInHistory(extendHistory,2,seedRuleConf)

Extension – the complete process

The process proceeds as follows:

1. Rule is refit to finer grid. Boundaries shrink, confidence and support is unaffected.
1. Rule is trimmed: it is shaved of one misclassified data point, confidence rises from 0.6 to 0.75
3.-9. Rule coverage is extended. Confidence and support is unaffected.
1. Left boundary on Humidity conditionally extends - one additional misclassified and one correctly classified data point is covered. Confidence drops to 0.67.
1. Left boundary on Humidity conditionally extends - one additional correctly classified point is covered. Confidence increases to 0.71, but is still below 0.75.
1. Left boundary on Humidity extends - one additional correctly classified point is covered. Confidence returns 0.75.
1. Left boundary on Humidity extends. Confidence and support is unaffected. Final rule.

Note that QCBA improved confidence of rule 2 from initial value of 0.6 to 0.75 and support from 0.08 to 0.17.

for (i in 1:nrow(extendHistory)) {
  plotRuleInHistory(extendHistory,i,seedRuleConf)
}

Postpruning

As a result of the extension process, the data matched by the individual rules change. Within postprocessing, the extended rules are resorted and pruned again using CBA data coverage pruning. Note that the default rule at the end is automatically updated.

Result of the extension process:

rmqCBA <- qcba(cbaRuleModel=rmCBA,datadf=data_raw, extendType="numericOnly",trim_literal_boundaries = TRUE, postpruning = "none", defaultRuleOverlapPruning="noPruning", createHistorySlot=TRUE,loglevel = "WARNING")
print(rmqCBA@rules)

##                                                 rules   support confidence
## 1                     {Humidity=[82;95]} => {Class=1} 0.1111111  0.8000000
## 2 {Temperature=[22;31],Humidity=[33;53]} => {Class=4} 0.1666667  0.7500000
## 3                  {Temperature=[31;34]} => {Class=4} 0.1388889  0.6250000
## 4                  {Temperature=[17;21]} => {Class=2} 0.1388889  0.6250000
## 5                  {Temperature=[27;29]} => {Class=4} 0.1388889  0.6250000
## 6                                     {} => {Class=4} 0.3055556  0.3055556
##   condition_count orderedConf orderedSupp
## 1               1   0.8000000           4
## 2               2   0.7500000           6
## 3               1   0.6666667           4
## 4               1   0.5714286           4
## 5               1   0.0000000           0
## 6               0   0.1111111           1

Result of postpruning:

rmqCBA_pruned <- qcba(cbaRuleModel=rmCBA,datadf=data_raw, extendType="numericOnly",trim_literal_boundaries = TRUE, postpruning = "cba",  defaultRuleOverlapPruning="noPruning", createHistorySlot=TRUE,loglevel = "WARNING")
print(rmqCBA_pruned@rules)

##                                                 rules   support confidence
## 1                     {Humidity=[82;95]} => {Class=1} 0.1111111  0.8000000
## 2 {Temperature=[22;31],Humidity=[33;53]} => {Class=4} 0.1666667  0.7500000
## 3                  {Temperature=[31;34]} => {Class=4} 0.1388889  0.6250000
## 4                                     {} => {Class=2} 0.2777778  0.2777778
##   condition_count orderedConf orderedSupp
## 1               1   0.8000000           4
## 2               2   0.7500000           6
## 3               1   0.6666667           4
## 4               0   0.4705882           8

Last two rules with non-empty antecedent were replaced using data coverage pruning by the default rule:

rmqCBA@rules[4:5,1]

## [1] "{Temperature=[17;21]} => {Class=2}" "{Temperature=[27;29]} => {Class=4}"

with default rule:

rmqCBA_pruned@rules[4,1]

## [1] "{} => {Class=2}"

This replacement results in same training set error with fewer rules.

Training set error for the original rule list:

  prediction_qcba<-predict(rmqCBA,data_raw,discretize=FALSE)
  acc_qcba <- CBARuleModelAccuracy(prediction_qcba, data_raw[[classAtt]])
  print(paste("Accuracy (QCBA - without postpruning):",acc_cba))

## [1] "Accuracy (QCBA - without postpruning): 0.611111111111111"

Error for the pruned rule list:

  prediction_qcba<-predict(rmqCBA_pruned,data_raw,discretize=FALSE)
  acc_qcba <- CBARuleModelAccuracy(prediction_qcba, data_raw[[classAtt]])
  print(paste("Accuracy (QCBA with postpruning):",acc_cba))

## [1] "Accuracy (QCBA with postpruning): 0.611111111111111"

Default rule overlap pruning

Default rule overlap iterates through all rules classifying into the same class as the default rule. These rules all overlap with the default rule and are thus candidates for pruning. However, they can be removed only if their removal will not change the classification of the instances (or regions – see below) they cover by rules that are below them. Recall that for prediction CBA uses the first rule in the rule list. Therefore, the algorithm checks whether the region (transactions) matched by the antecedent of the candidate overlaps with region matched by any of the remaining rules (potential clashing rule) classifying instances into a different class that are below the candidate in the rule list. If there are no such rules, the candidate can be removed (pruned).

There are two ways how to check the overlap between the candidate and a potential clashing rule. 1. Check the regions matched by the antecedents of the candidate and the potential clashing rule by analyzing the boundaries of the rules. 2. Check overlap in transactions covered by the antecedent of the candidate rule and the antecedent of the potential clashing rule in the training data.

Default rule overlap (transaction-based)

Let’s us create a different classifier by lowering the minimum support threshold and disabling trimming.

# we will use support of 1 instance
trim_literal_boundaries <- FALSE
supp <- 1
sink("/dev/null")
rules <- apriori(txns, parameter = list(confidence = 0.5, support= supp/nrow(data_discr), minlen=1, maxlen=3), appearance=appearance)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime    support minlen
##         0.5    0.1    1 none FALSE            TRUE       5 0.02777778      1
##  maxlen target  ext
##       3  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 1 
## 
## set item appearances ...[4 item(s)] done [0.00s].
## set transactions ...[14 item(s), 36 transaction(s)] done [0.00s].
## sorting and recoding items ... [14 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [31 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rmCBA <- cba_manual(data_raw,  rules, txns, appearance$rhs, classAtt, cutp= list(), pruning_options=list(default_rule_pruning=FALSE))
rmqCBA <- qcba(cbaRuleModel=rmCBA,datadf=data_raw, extendType="numericOnly",trim_literal_boundaries = trim_literal_boundaries, postpruning = "cba", defaultRuleOverlapPruning="noPruning", createHistorySlot=TRUE,loglevel = "WARNING")
  sink()
print(rmqCBA@rules)

##                                                  rules    support confidence
## 1  {Temperature=[21;45],Humidity=[82;95]} => {Class=1} 0.11111111  1.0000000
## 2  {Temperature=[34;45],Humidity=[33;58]} => {Class=2} 0.08333334  1.0000000
## 3  {Temperature=[29;34],Humidity=[29;48]} => {Class=4} 0.08333334  1.0000000
## 4  {Temperature=[19;26],Humidity=[29;53]} => {Class=2} 0.08333334  1.0000000
## 5  {Temperature=[37;45],Humidity=[53;95]} => {Class=1} 0.05555556  1.0000000
## 6  {Temperature=[34;40],Humidity=[29;47]} => {Class=3} 0.02777778  1.0000000
## 7  {Temperature=[17;24],Humidity=[82;95]} => {Class=2} 0.02777778  1.0000000
## 8  {Temperature=[29;34],Humidity=[29;68]} => {Class=4} 0.16666667  0.8571429
## 9  {Temperature=[22;31],Humidity=[33;70]} => {Class=4} 0.19444445  0.7000000
## 10                  {Temperature=[17;21]} => {Class=2} 0.13888890  0.6250000
## 11                                     {} => {Class=3} 0.19444445  0.1944444
##    condition_count orderedConf orderedSupp
## 1                2   1.0000000           4
## 2                2   1.0000000           3
## 3                2   1.0000000           3
## 4                2   1.0000000           3
## 5                2   1.0000000           1
## 6                2   1.0000000           1
## 7                2   1.0000000           1
## 8                2   0.7500000           3
## 9                2   0.6250000           5
## 10               1   0.4000000           2
## 11               0   0.6666667           2

By activating default rule overlap pruning, we can reduce the number of rules in the classifier by 1.

sink("/dev/null")
  rmqCBA_dro <- qcba(cbaRuleModel=rmCBA,datadf=data_raw, extendType="numericOnly",trim_literal_boundaries = trim_literal_boundaries, postpruning = "cba", defaultRuleOverlapPruning="transactionBased", createHistorySlot=FALSE,loglevel = "WARNING")
sink()
print(rmqCBA_dro@rules)

##                                                  rules    support confidence
## 1  {Temperature=[21;45],Humidity=[82;95]} => {Class=1} 0.11111111  1.0000000
## 2  {Temperature=[34;45],Humidity=[33;58]} => {Class=2} 0.08333334  1.0000000
## 3  {Temperature=[29;34],Humidity=[29;48]} => {Class=4} 0.08333334  1.0000000
## 4  {Temperature=[19;26],Humidity=[29;53]} => {Class=2} 0.08333334  1.0000000
## 5  {Temperature=[37;45],Humidity=[53;95]} => {Class=1} 0.05555556  1.0000000
## 6  {Temperature=[17;24],Humidity=[82;95]} => {Class=2} 0.02777778  1.0000000
## 7  {Temperature=[29;34],Humidity=[29;68]} => {Class=4} 0.16666667  0.8571429
## 8  {Temperature=[22;31],Humidity=[33;70]} => {Class=4} 0.19444445  0.7000000
## 9                   {Temperature=[17;21]} => {Class=2} 0.13888890  0.6250000
## 10                                     {} => {Class=3} 0.19444445  0.1944444
##    condition_count orderedConf orderedSupp
## 1                2       1.000           4
## 2                2       1.000           3
## 3                2       1.000           3
## 4                2       1.000           3
## 5                2       1.000           1
## 6                2       1.000           1
## 7                2       0.750           3
## 8                2       0.625           5
## 9                1       0.400           2
## 10               0       0.750           3

Rule #6 in the original classifier is not needed. Rule #6:

print(rmqCBA@rules[6,1])

## [1] "{Temperature=[34;40],Humidity=[29;47]} => {Class=3}"

This rule assigns into Class 3 - the same class as the default rule in the end of the classifier. Let’s look at rules between #6 and the default rule #11:

print(rmqCBA@rules[6:11,1])

## [1] "{Temperature=[34;40],Humidity=[29;47]} => {Class=3}"
## [2] "{Temperature=[17;24],Humidity=[82;95]} => {Class=2}"
## [3] "{Temperature=[29;34],Humidity=[29;68]} => {Class=4}"
## [4] "{Temperature=[22;31],Humidity=[33;70]} => {Class=4}"
## [5] "{Temperature=[17;21]} => {Class=2}"                 
## [6] "{} => {Class=3}"

There is no rule below #6 that would prevent the training instances covered by #6 from being classified by the default rule (which has the same class as #6). Rule #6 is drawn in grey in the following figure, the remaining rules below it are drawn in blue and green.

plot(Humidity,Temperature,pch=as.character(Class), main="Rules #6 to #11 (default rule pruning OFF)",cex.lab=1.5, cex.axis=1.5,  xlab="", ylab="", cex.main=1.5, cex.sub=1.15)
plotGrid()
for (i in 6:11) {
plotHumTempRule(rmqCBA@rules,i)
}

When Default rule overlap pruning is activated, rules such as #6 are removed and the area left for classification to the default rule, which comes as last (default rule is plotted in grey).

plot(Humidity,Temperature,pch=as.character(Class), main="Rules #6 to #10 (default rule pruning ON)",cex.lab=1.5, cex.axis=1.5,  xlab="", ylab="", cex.main=1.5, cex.sub=1.15)
plotGrid()
for (i in 6:10) {
plotHumTempRule(rmqCBA_dro@rules,i)
}

Default rule overlap (range-based)

Range-based default rule pruning looks at rules below each rule that classifies to the default class and inspects whether the rules below it can divert instances from being classified by the default rule. The difference from the transaction-based method is that it is inspected whether the regions covered by the rules do not overlap rather than training transactions. Range-based pruning thus guarantees a solution that generalizes beyond the training data. It is perfectly safe to remove rules detected as redundant by range-based default rule overlap.

The disadvantage is that range based pruning is not very uneffective. In realistically sized rule lists spanning mutliple dimensions, the odds that there will be a clashing rule for any candidate rule is very high and thus no rules will be removed. An alternative way is to judge overlap between two rules based on whether they have any transactions in common.

In our toy case, range-based pruning does not remove any rule. Rule #6 is not removed, because it shares a boundary on Temperature (34) with rule #8.

plot(Humidity,Temperature,pch=as.character(Class), main="R #6 and #8 intersect (range-based pruning is ineffective)",cex.lab=1.5, cex.axis=1.5,  xlab="", ylab="", cex.main=1.5, cex.sub=1.15)
plotGrid()
plotHumTempRule(rmqCBA@rules,6)
plotHumTempRule(rmqCBA@rules,8)

Attribute pruning

Attribute pruning is an optimization step in QCBA which considers removing all attribute-value pairs (literals) in all rules in the classifier. The removal is confirmed if a rule created without the literal has at least the accuracy of the original rule. The optimization is performed in a greedy manner, processing all rules in the sort order. Attributes are processed in arbitrary order.

To demonstrate attribute pruning, we need to use dataset with higher number of attributes than humtemp dataset has and also different base rule classifier than CBA, as attribute pruning is ineffective on CBA according to our experiments. We will use CPAR from arulesCBA as the baseline classifier and the iris dataset.

For initial demonstration, we will turn off all optimizations in QCBA that follow after attribute pruning (i.e. trimming, postpruning, default rule pruning).

Prepare data

set.seed(54)
allData <- datasets::iris[sample(nrow(datasets::iris)),]
trainFold <- allData[1:100,]
testFold <- allData[101:nrow(datasets::iris),]
classAtt <- "Species"
discrModel <- discrNumeric(trainFold, classAtt)
train_disc <- as.data.frame(lapply(discrModel$Disc.data, as.factor))
cutPoints <- discrModel$cutp
test_disc <- applyCuts(testFold, cutPoints, infinite_bounds=TRUE, labels=TRUE)
y_true <-testFold[[classAtt]]

Learn and evaluate CPAR model

rmBASE <- CPAR(train_disc, formula=as.formula(paste(classAtt,"~ .")))
predictionBASE <- predict(rmBASE,test_disc) # CPAR (arulesCBA) predict function

Postprocess with QCBA with attribute pruning disabled

baseModel_arc <- arulesCBA2arcCBAModel(rmBASE, cutPoints,  trainFold, classAtt)
rmQCBA_NOATTPR <- qcba(cbaRuleModel=baseModel_arc,datadf=trainFold,attributePruning = FALSE)
predictionQCBA_NOATTPR <- predict(rmQCBA_NOATTPR,testFold)

Postprocess with QCBA with attribute pruning enabled

baseModel_arc <- arulesCBA2arcCBAModel(rmBASE, cutPoints,  trainFold, classAtt)
rmQCBA <- qcba(cbaRuleModel=baseModel_arc,datadf=trainFold,attributePruning = TRUE)
predictionQCBA <- predict(rmQCBA,testFold)

Compare the three models

  print(paste0("CPAR: Number of rules: ",length(rmBASE$rules),", total conditions:",sum(rmBASE$rules@lhs@data), ", accuracy on test data: ",round(CBARuleModelAccuracy(predictionBASE, y_true),2)))

## [1] "CPAR: Number of rules: 8, total conditions:10, accuracy on test data: 0.98"

  print(paste0("QCBA (NO ATT PR.): Number of rules: ",nrow(rmQCBA_NOATTPR@rules),", total conditions:",sum(rmQCBA_NOATTPR@rules$condition_count), ", accuracy on test data: ",round(CBARuleModelAccuracy(predictionQCBA_NOATTPR, y_true),2)))

## [1] "QCBA (NO ATT PR.): Number of rules: 3, total conditions:3, accuracy on test data: 1"

  print(paste0("QCBA (ATT PR.): Number of rules: ",nrow(rmQCBA@rules),", total conditions:",sum(rmQCBA@rules$condition_count), ", accuracy on test data: ",round(CBARuleModelAccuracy(predictionQCBA, y_true),2)))

## [1] "QCBA (ATT PR.): Number of rules: 3, total conditions:2, accuracy on test data: 1"

The QCBA postprocessing with attribute pruning enabled results in a model that is more accurate and smaller than a QCBA model with attribute pruning disabled.

It should be noted that attribute pruning is performed as a second step in QCBA after refitting. The result of QCBA model with attribute pruning enabled and without it can thus differ substantially, also, as a consequence, affecting the range of literals.

Part 3: Advanced options

Speeding up conditional accept

The default value of the minCondImprovement parameter is -1. This parameter value ensures exhaustive search for extensions, but can be slow on large datasets or datasets with many distinct values. The closer this parameter value will be to 0, the faster the execution will generally be.

rmCBAiris <- cba(trainFold, classAtt="Species")

## Using automatic threshold detection

## Running apriori with setting: confidence =  0.5 , support =  0 , minlen =  2 , maxlen =  3 , MAX_RULE_LEN =  5

## Rule count:  89  Iteration:  1

## Increasing maxlen to:   4

## Running apriori with setting: confidence =  0.5 , support =  0 , minlen =  2 , maxlen =  4 , MAX_RULE_LEN =  5

## Rule count:  394  Iteration:  2

## Increasing maxlen to:   5

## Running apriori with setting: confidence =  0.5 , support =  0 , minlen =  2 , maxlen =  5 , MAX_RULE_LEN =  5

## Rule count:  1006  Iteration:  3

## Target rule count satisfied:   1000

## Removing excess discovered rules

## Rule learning took: 0.02  seconds

## Original rules:  1000

## Rules after data coverage pruning: 8

## Performing default rule pruning.

## Final rule list size:  6

## Pruning took: 0.08  seconds

start.time <- Sys.time()
for (i in 1:100)
{
 rmqCBA <- qcba(cbaRuleModel=rmCBAiris,datadf=trainFold,extendType="numericOnly", minCondImprovement=-1, postpruning="cba", defaultRuleOverlapPruning="noPruning")
}
end.time <- Sys.time()
message (paste("100 executions took:", round(end.time - start.time,2), " seconds"))

## 100 executions took: 7.14  seconds

start.time <- Sys.time()
for (i in 1:100)
{
 rmqCBA <- qcba(cbaRuleModel=rmCBAiris,datadf=trainFold,extendType="numericOnly", minCondImprovement=0.0, postpruning="cba", defaultRuleOverlapPruning="noPruning")
}
end.time <- Sys.time()
message (paste("100 executions took:", round(end.time - start.time,2), " seconds"))

## 100 executions took: 6.37  seconds

Improvement in execution time gained on this small dataset by changing the minCondImprovement from -1 to 0.00 is about 10%.

Invoking directly Java-based QCBA backend

The CBA model is passed by the R code to a Java QCBA.jar file, which performs the QCBA model learning. The QCBA model is then returned to R.

humtemp dataset example

Let’s create “debug” folder, where we will store the models.

basePath <-  tempdir()
dir.create(file.path(basePath, "debug"), showWarnings = FALSE)
rulesPath <-paste(basePath,"debug","humtemp.arules",sep=.Platform$file.sep)
write.csv(as(rmCBA@rules,"data.frame"), rulesPath, row.names=TRUE,quote = TRUE)
outputDataPath <- paste(basePath,"debug",'humtemp.csv', sep=.Platform$file.sep)
write.csv(humtemp,file=outputDataPath,row.names=FALSE)

Let’s create the configuration file for QCBA.

extendType="numericOnly"
trimLiteralBoundaries = FALSE
attributePruning =TRUE
defaultRuleOverlapPruning = "rangeBased" #noPruning,transactionBased,rangeBased
postpruning = "cba" #none,cba,greedy
dataTypes <- paste(rmCBA@attTypes, collapse = ',') 
classAtt <- colnames(humtemp)[length(humtemp)] #last attribute
outputPath <- paste(basePath,"debug",'humtemp-qcba.arules', sep=.Platform$file.sep)

x=paste('<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">',
        "<properties>\n",
        "<entry key=\"Method\">extend</entry>\n",
        "<entry key=\"RulesPath\">", rulesPath, "</entry>\n",
        "<entry key=\"TrainDataPath\">", outputDataPath,"</entry>\n",
        "<entry key=\"ExtendType\">",extendType,"</entry>\n",
        "<entry key=\"AttributePruning\">",attributePruning, "</entry>\n",
        "<entry key=\"Trimming\">",trimLiteralBoundaries, "</entry>\n",
        "<entry key=\"DefaultRuleOverlapPruning\">",defaultRuleOverlapPruning, "</entry>\n",
        "<entry key=\"Postpruning\">",postpruning, "</entry>\n",
        "<entry key=\"DataTypes\">", dataTypes,'</entry>\n',
        "<entry key=\"TargetAttribute\">", classAtt,'</entry>\n',
        "<entry key=\"OutputPath\">", outputPath,'</entry>\n',
        "</properties>", sep="")

        qcbaFilePath <-paste(basePath,"debug","humtemp-conf.xml",sep=.Platform$file.sep)
        write(x, file = qcbaFilePath,
              ncolumns = 1,
              append = FALSE, sep = ",")
print(paste("Run as: ", "java -jar QCBA.jar ", qcbaFilePath, sep=""))

## [1] "Run as: java -jar QCBA.jar /tmp/RtmpTjWPSi/debug/humtemp-conf.xml"

print(paste("QCBA model will be written to:",outputPath))

## [1] "QCBA model will be written to: /tmp/RtmpTjWPSi/debug/humtemp-qcba.arules"

iris dataset example

Let’s create “debug” folder, where we will store the rmCBA model.

        basePath <-  tempdir()
        dir.create(file.path(basePath, "debug"), showWarnings = FALSE)
        rulesPath <-paste(basePath,"debug","iris.arules",sep=.Platform$file.sep)
        write.csv(as(rmCBAiris@rules,"data.frame"), rulesPath, row.names=TRUE,quote = TRUE)
        trainDataPath <- paste(basePath,"debug",'iris-train.csv', sep=.Platform$file.sep)
        testDataPath <- paste(basePath,"debug",'iris-test.csv', sep=.Platform$file.sep)
        write.csv(trainFold,file=trainDataPath,row.names=FALSE)
        write.csv(testFold,file=testDataPath,row.names=FALSE)