Quantitative CBA (QCBA) is a postprocessing algorithm for association rule classification algorithm CBA, which implements a number of optimization steps to improve handling of quantitative (numerical) attributes. The viable properties of these rule lists that make CBA classification models most comprehensible among all association rule classification algorithms, such as one-rule classification and crisp rules, are retained. The postprocessing is conceptually fast, because it is performed on a relatively small number of rules that passed the pruning steps, and can be adapted also for multi-rule classification algorithms. Benchmarks show about 50% decrease in the total size of the model as measured by the total number of conditions in all rules. Model accuracy generally remains on the same level as for CBA with QCBA even providing small improvement over CBA on 11 of the 22 datasets involved in our benchmark.
This tutorial has three parts:
Shows how to build CBA and QCBA model using the arc and qCBA packages. To learn how to setup QCBA refer to the Quantitative CBA github homepage.
Visually demonstrates all the optimization steps in QCBA:
Introduces advanced options:
The QCBA implementation is in Java. It is also available via an R package wrapper, which we will use for this example.
library(qCBA)
To learn how to install QCBA refer to the Quantitative CBA github homepage.
Let’s look at the humtemp sample data bundled with the arc package, which we will be using throughout this tutorial.
There are two explanatory attributes (Temperature and Humidity). The target attribute is preference (e.g. subjective comfort level).
attach(humtemp)
The first few rows of the data:
head(humtemp)
## Temperature Humidity Class
## 1 45 33 2
## 2 27 29 3
## 3 40 48 2
## 4 40 65 1
## 5 38 82 1
## 6 37 30 3
And a scatter plot:
plot(Humidity,Temperature,pch=as.character(Class))
Association rule learning requires discretized data. In this case, we perform simple equidistant binning.
#custom discretization
data_raw <- humtemp
data_discr <- humtemp
temp_breaks <- seq(from=15,to=45,by=5)
hum_breaks <- c(0,40,60,80,100)
temp_unique_vals <- setdiff(unique(Temperature),temp_breaks)
hum_unique_vals <- setdiff(unique(Humidity),hum_breaks)
data_discr[,1] <- cut(Temperature,breaks=temp_breaks)
data_discr[,2] <- cut(Humidity,breaks=hum_breaks)
#change interval syntax from (15,20] to (15;20], which is required by QCBA R package
data_discr[,1] <- as.factor(unlist(lapply(data_discr[,1], function(x) {gsub(",", ";", x)})))
data_discr[,2] <- as.factor(unlist(lapply(data_discr[,2], function(x) {gsub(",", ";", x)})))
data_discr[,3] <- as.factor(Class)
head(data_discr)
## Temperature Humidity Class
## 1 (40;45] (0;40] 2
## 2 (25;30] (0;40] 3
## 3 (35;40] (40;60] 2
## 4 (35;40] (60;80] 1
## 5 (35;40] (80;100] 1
## 6 (35;40] (0;40] 3
The discretization splits the data space into rectangular regions. Given that we have two attributes, the discovered rule can only correspond to a rectangular region with borders aligned to the grid. If we had more than two attributes, the discovered rule would delimit a hypercube.
plotGrid <- function(plotFineGrid=TRUE, plotDiscrGrid=TRUE)
{
if (plotDiscrGrid)
{
for (i in temp_breaks[-1])
{
abline(h=i, lty=2)
}
for (i in hum_breaks[-1])
{
abline(v=i, lty=2)
}
}
if (plotFineGrid)
{
for (i in temp_unique_vals[-1])
{
abline(h=i, lty=3, col="grey")
}
for (i in hum_unique_vals[-1])
{
abline(v=i, lty=3, col="grey")
}
}
}
plot(Humidity,Temperature,pch=as.character(Class))
plotGrid(FALSE)
The next step is mining of association rules. The rule mining is constrained to rules that have values of the Class attribute in the consequent.
sink("/dev/null")
classAtt="Class"
appearance <- getAppearance(data_discr, classAtt)
txns <- as(data_discr, "transactions")
rules <- apriori(txns, parameter = list(confidence = 0.5, support= 3/nrow(data_discr), minlen=1, maxlen=3), appearance=appearance)
interestingRule <- inspect(rules)[5,] #will use this later
sink()
inspect(rules)
## lhs rhs support confidence
## [1] {Humidity=(80;100]} => {Class=1} 0.11111111 0.8000000
## [2] {Temperature=(15;20]} => {Class=2} 0.11111111 0.5714286
## [3] {Temperature=(30;35]} => {Class=4} 0.13888889 0.6250000
## [4] {Temperature=(25;30]} => {Class=4} 0.13888889 0.5000000
## [5] {Temperature=(25;30], Humidity=(40;60]} => {Class=4} 0.08333333 0.6000000
## coverage lift count
## [1] 0.1388889 3.600000 4
## [2] 0.1944444 2.057143 4
## [3] 0.2222222 2.045455 5
## [4] 0.2777778 1.636364 5
## [5] 0.1388889 1.963636 3
The rules can be visualized in the feature space as rectangular regions. Class 1 is coded as red, Class 2 as green, Class 3 as black, and Class 4 as blue region.
interesting_rule <- 5
#as.character(Class)
plot(Humidity,Temperature,pch=as.character(Class),main="Discovered asociation rules",cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)
plotGrid(FALSE)
plotHumTempRule<- function(rules, ruleIndex)
{
if (typeof(rules)=="S4")
{
# rules is a arules rule model
# sink: inspect also sends rules to the standard output
sink("/dev/null")
r <- inspect(rules)[ruleIndex,]
sink()
rule <- paste(unlist(r$lhs[1]),collapse='')
rhs <- paste(unlist(r$rhs[1]),collapse='')
}
else
{
# rules is a list of rules output by qCBA
rule <- rules[ruleIndex,1]
#rule <- rules$rules[ruleIndex]
rhs <- regmatches(rule,regexec("\\{Class=.*\\}",rule))
}
#get color
if (rhs == "{Class=1}")
{
border = "red"
col=rgb(1.0,0.2,0.2,alpha=0.3)
}
else if (rhs == "{Class=2}")
{
border = "green"
col=rgb(0,1,0,alpha=0.3)
}
else if (rhs == "{Class=3}")
{
border = "black"
col=rgb(0.4,0.4,0.4,alpha=0.3)
}
else if (rhs == "{Class=4}")
{
border = "blue"
col=rgb(0,0,1,alpha=0.3)
}
temp_coordinates<-unlist(regmatches(rule,regexec("Temperature=.([0-9]+);([0-9]+).",rule)))
if (length(temp_coordinates)==0)
{
#if the temperature literal is missing in the rule, use the following coordinates
temp_coordinates=c(0,0,50)
}
hum_coordinates<-unlist(regmatches(rule,regexec("Humidity=.([0-9]+);([0-9]+).",rule)))
if (length(hum_coordinates)==0)
{
#if the humidity literal is missing in the rule, use the following coordinates
hum_coordinates=c(0,0,100)
}
m <- rect(hum_coordinates[2], temp_coordinates[2], hum_coordinates[3], temp_coordinates[3],border=border,col=col)
}
plotRules <- function(rules)
{
if (typeof(rules)=="S4") #for arules/cba
{
rule_count <- length(rules)
}
else #for qcba
{
rule_count <- nrow(rules)
}
for (i in 1:rule_count)
{
plotHumTempRule(rules,i)
}
}
plotRules(rules)
Out of the two discovered rules, we will create a CBA classifier. This means that the rules will be:
classAtt="Class"
appearance <- getAppearance(data_discr, classAtt)
# Note that we are calling `cba_manual()` instead of cba() because we want - for demonstration purposes - to construct the classifier from a externally-generated rule list.
rmCBA <- cba_manual(data_raw, rules, txns, appearance$rhs, classAtt, cutp = list(),pruning_options=list(default_rule_pruning=FALSE))
Explanation of the key settings:
rmCBA_auto <- cba(humtemp, classAtt="Class")
In this case, CBA did not remove any rule, but added a default rule to the end. This ensures that the rule list covers every possible instance.
inspect(rmCBA@rules)
## lhs rhs support confidence
## [1] {Humidity=(80;100]} => {Class=1} 0.11111111 0.8000000
## [2] {Temperature=(30;35]} => {Class=4} 0.13888889 0.6250000
## [3] {Temperature=(25;30], Humidity=(40;60]} => {Class=4} 0.08333333 0.6000000
## [4] {Temperature=(15;20]} => {Class=2} 0.11111111 0.5714286
## [5] {Temperature=(25;30]} => {Class=4} 0.13888889 0.5000000
## [6] {} => {Class=2} 0.27777778 0.2777778
## coverage lift count lhs_length orderedConf orderedSupp cumulativeConf
## [1] 0.1388889 3.600000 4 1 0.8000000 4 0.8000000
## [2] 0.2222222 2.045455 5 1 0.7142857 5 0.7500000
## [3] 0.1388889 1.963636 3 2 0.6000000 3 0.7058824
## [4] 0.1944444 2.057143 4 1 0.5000000 3 0.6521739
## [5] 0.2777778 1.636364 5 1 0.5000000 2 0.6296296
## [6] 1.0000000 1.000000 10 0 0.5555556 5 0.6111111
The green background on the plot is associated with the default rule classifying to Class 2.
plot(Humidity,Temperature,pch=as.character(Class),main="CBA model",cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)
plotGrid(FALSE)
plotRules(rmCBA@rules)
The accuracy of this model on training data:
prediction_cba<-predict(rmCBA,data_discr,discretize=FALSE)
acc_cba <- CBARuleModelAccuracy(prediction_cba, data_discr[[classAtt]])
print(paste("Accuracy (CBA):",acc_cba))
## [1] "Accuracy (CBA): 0.611111111111111"
QCBA postprocesses CBA classifier created over discretized numeric attributes. QCBA requires on the input also the original raw undiscretized (continuous) data.
To build a model, qcba
needs a cba model and raw
(undiscretized) data. Note that number of additional parameters can be
also specified - these were left to their default values.
rmqCBA <- qcba(cbaRuleModel=rmCBA,datadf=data_raw)
To make CBA models more compact, QCBA performs the following optimizations that are enabled by default. The short call above can be expanded to:
trim_literal_boundaries <- TRUE #will use this variable later
rmqCBA <- qcba(cbaRuleModel=rmCBA,datadf=data_raw, extendType="numericOnly",trim_literal_boundaries = trim_literal_boundaries, postpruning = "cba", attributePruning = TRUE, defaultRuleOverlapPruning="transactionBased", createHistorySlot=TRUE,loglevel = "WARNING")
print(rmqCBA@rules)
To make CBA models more compact, QCBA performs the following optimizations that relate to the parameter settings:
Finally, the createHistorySlot parameter is useful for debugging and visualization of the extension process. By default it is set to false. If set to true, the instance of the qCBARuleModel class created by rmqCBA() will contain a history slot with the chronological list of accepted rule extensions that were created on each rule during the extension process.
As we can notice, the number of rules on the output decreased. After the rule boundaries have been extended, postpruning removed two rules from the CBA model. The default rule was recomputed, and now classifies to green (Class 2).
plot(Humidity,Temperature,pch=as.character(Class),main="QCBA model",cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)
plotGrid(FALSE)
plotRules(rmqCBA@rules)
The accuracy of this model on training data:
prediction_qcba<-predict(rmqCBA,data_raw,discretize=FALSE)
acc_qcba <- CBARuleModelAccuracy(prediction_qcba, data_raw[[classAtt]])
print(paste("Accuracy (QCBA):",acc_cba))
## [1] "Accuracy (QCBA): 0.611111111111111"
On the training data, QCBA provides the same accuracy as CBA but with fewer rules.
The steps taken by QCBA to create the model shown above:
All steps will be demonstrated in detail in the following. Unlike other optimizations, attribute pruning is demonstrated on the Iris dataset (HumTemp with only 2 predictors is not suitable).
First, we will setup some code for visualization of the progress of QCBA.
plotHumTempRuleDelta<- function(rules, ruleIndex2,ruleIndex1, border = "red",col=rgb(0,0,1,alpha=0.3) )
{
rule1 <- rules$rules[ruleIndex1]
rule2 <- rules$rules[ruleIndex2]
temp_coordinates1 <- unlist(regmatches(rule1,regexec("Temperature=.([0-9]+);([0-9]+).",rule1)))
temp_coordinates2 <- unlist(regmatches(rule2,regexec("Temperature=.([0-9]+);([0-9]+).",rule2)))
hum_coordinates1 <- unlist(regmatches(rule1,regexec("Humidity=.([0-9]+);([0-9]+).",rule1)))
hum_coordinates2 <- unlist(regmatches(rule2,regexec("Humidity=.([0-9]+);([0-9]+).",rule2)))
r1_bottom <- temp_coordinates1[2]
r1_top <- temp_coordinates1[3]
r1_left <- hum_coordinates1[2]
r1_right <- hum_coordinates1[3]
r2_bottom <- temp_coordinates2[2]
r2_top <- temp_coordinates2[3]
r2_left <- hum_coordinates2[2]
r2_right <- hum_coordinates2[3]
if (r2_right>r1_right)
{
r_right<- r2_right
r_left <- r1_right
r_top <- r1_top
r_bottom <- r1_bottom
}
else if(r2_left>r1_left)
{
r_right<- r2_left
r_left <- r1_left
r_top <- r1_top
r_bottom <- r1_bottom
}
else if (r2_top>r1_top)
{
r_right<- r1_right
r_left <- r1_left
r_top <- r2_top
r_bottom <- r1_top
}
else if (r2_bottom<r1_bottom)
{
r_right<- r1_right
r_left <- r1_left
r_top <- r1_bottom
r_bottom <- r2_bottom
}
m <- rect(r_left,r_bottom,r_right,r_top,border=border,col=col)
}
plotRuleInHistory <- function(extendHistory,i,seedRuleConf)
{
titles=c("Rule is refit to the finer grid","Rule is trimmed", rep("Rule is extended",nrow(extendHistory)-3), "Final rule: no other extend was succcessful")
curRuleConf <- extendHistory[i,5]
if (seedRuleConf > curRuleConf && i>1){
titles[i] <- paste(titles[i]," (Conditional accept)")
}
plot(Humidity,Temperature,pch=as.character(Class), main=titles[i],cex.lab=0.01, cex.axis=1.5, cex.main=1.5, xlab="", ylab="", ylim=c(20,35),xlim=c(27,65),cex.sub=1.3, sub=paste(extendHistory[i,3], "\n","Supp:",round(extendHistory[i,4],2)," Conf:",round(curRuleConf,2)))
plotGrid(TRUE,FALSE)
plotHumTempRule(extendHistory[3],i)
if (seedRuleConf > curRuleConf && i>1)
{
plotHumTempRuleDelta(extendHistory,i-1,i,border="red",col=rgb(1,0,0,alpha=0.5))
}
}
inspected_rule_RID <- "2"
extendHistory <- rmqCBA@history[rmqCBA@history$RID==inspected_rule_RID,]
base_rule_in_history <- 1
if (trim_literal_boundaries == TRUE)
{
# the base confidence will be taken from the trimmed rule, which is the second rule in history
base_rule_in_history <- 2
}
seedRuleConf <- rmqCBA@history[rmqCBA@history$RID==inspected_rule_RID,][base_rule_in_history,5]
CBA rules stick to a grid that corresponds to results of discretization. The grid used by QCBA corresponds to all unique values appearing in the training data.
interestingRuleAsText <- paste(paste(unlist(interestingRule$lhs[1]),collapse=''),paste(unlist(interestingRule$rhs[1]),collapse=''),sep=" => ")
plot(Humidity,Temperature,pch=as.character(Class),main="CBA-generated rule on original grid", sub=paste(interestingRuleAsText, "Supp:",round(interestingRule$support,2)," Conf:",round(interestingRule$confidence,2)),cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.15)
plotGrid(FALSE)
plotHumTempRule(rmCBA@rules,3)
The same rule from the CBA classifier plotted on the finer grid.
plot(Humidity,Temperature,pch=as.character(Class), main="The finer grid",cex.lab=1.5, cex.axis=1.5, xlab="", ylab="", cex.main=1.5, cex.sub=1.15)
plotGrid()
plotHumTempRule(rmCBA@rules,3)
Let’s zoom in and look at how QCBA refit the rule:
plotRuleInHistory(extendHistory,1,seedRuleConf)
Rule is shaved of any boundaries that are not backed by correctly classified instances.
plotRuleInHistory(extendHistory,2,seedRuleConf)
The process proceeds as follows:
Note that QCBA improved confidence of rule 2 from initial value of 0.6 to 0.75 and support from 0.08 to 0.17.
for (i in 1:nrow(extendHistory)) {
plotRuleInHistory(extendHistory,i,seedRuleConf)
}
As a result of the extension process, the data matched by the individual rules change. Within postprocessing, the extended rules are resorted and pruned again using CBA data coverage pruning. Note that the default rule at the end is automatically updated.
Result of the extension process:
rmqCBA <- qcba(cbaRuleModel=rmCBA,datadf=data_raw, extendType="numericOnly",trim_literal_boundaries = TRUE, postpruning = "none", defaultRuleOverlapPruning="noPruning", createHistorySlot=TRUE,loglevel = "WARNING")
print(rmqCBA@rules)
## rules support confidence
## 1 {Humidity=[82;95]} => {Class=1} 0.1111111 0.8000000
## 2 {Temperature=[22;31],Humidity=[33;53]} => {Class=4} 0.1666667 0.7500000
## 3 {Temperature=[31;34]} => {Class=4} 0.1388889 0.6250000
## 4 {Temperature=[17;21]} => {Class=2} 0.1388889 0.6250000
## 5 {Temperature=[27;29]} => {Class=4} 0.1388889 0.6250000
## 6 {} => {Class=4} 0.3055556 0.3055556
## condition_count orderedConf orderedSupp
## 1 1 0.8000000 4
## 2 2 0.7500000 6
## 3 1 0.6666667 4
## 4 1 0.5714286 4
## 5 1 0.0000000 0
## 6 0 0.1111111 1
Result of postpruning:
rmqCBA_pruned <- qcba(cbaRuleModel=rmCBA,datadf=data_raw, extendType="numericOnly",trim_literal_boundaries = TRUE, postpruning = "cba", defaultRuleOverlapPruning="noPruning", createHistorySlot=TRUE,loglevel = "WARNING")
print(rmqCBA_pruned@rules)
## rules support confidence
## 1 {Humidity=[82;95]} => {Class=1} 0.1111111 0.8000000
## 2 {Temperature=[22;31],Humidity=[33;53]} => {Class=4} 0.1666667 0.7500000
## 3 {Temperature=[31;34]} => {Class=4} 0.1388889 0.6250000
## 4 {} => {Class=2} 0.2777778 0.2777778
## condition_count orderedConf orderedSupp
## 1 1 0.8000000 4
## 2 2 0.7500000 6
## 3 1 0.6666667 4
## 4 0 0.4705882 8
Last two rules with non-empty antecedent were replaced using data coverage pruning by the default rule:
rmqCBA@rules[4:5,1]
## [1] "{Temperature=[17;21]} => {Class=2}" "{Temperature=[27;29]} => {Class=4}"
with default rule:
rmqCBA_pruned@rules[4,1]
## [1] "{} => {Class=2}"
This replacement results in same training set error with fewer rules.
Training set error for the original rule list:
prediction_qcba<-predict(rmqCBA,data_raw,discretize=FALSE)
acc_qcba <- CBARuleModelAccuracy(prediction_qcba, data_raw[[classAtt]])
print(paste("Accuracy (QCBA - without postpruning):",acc_cba))
## [1] "Accuracy (QCBA - without postpruning): 0.611111111111111"
Error for the pruned rule list:
prediction_qcba<-predict(rmqCBA_pruned,data_raw,discretize=FALSE)
acc_qcba <- CBARuleModelAccuracy(prediction_qcba, data_raw[[classAtt]])
print(paste("Accuracy (QCBA with postpruning):",acc_cba))
## [1] "Accuracy (QCBA with postpruning): 0.611111111111111"
Default rule overlap iterates through all rules classifying into the same class as the default rule. These rules all overlap with the default rule and are thus candidates for pruning. However, they can be removed only if their removal will not change the classification of the instances (or regions – see below) they cover by rules that are below them. Recall that for prediction CBA uses the first rule in the rule list. Therefore, the algorithm checks whether the region (transactions) matched by the antecedent of the candidate overlaps with region matched by any of the remaining rules (potential clashing rule) classifying instances into a different class that are below the candidate in the rule list. If there are no such rules, the candidate can be removed (pruned).
There are two ways how to check the overlap between the candidate and a potential clashing rule. 1. Check the regions matched by the antecedents of the candidate and the potential clashing rule by analyzing the boundaries of the rules. 2. Check overlap in transactions covered by the antecedent of the candidate rule and the antecedent of the potential clashing rule in the training data.
Let’s us create a different classifier by lowering the minimum support threshold and disabling trimming.
# we will use support of 1 instance
trim_literal_boundaries <- FALSE
supp <- 1
sink("/dev/null")
rules <- apriori(txns, parameter = list(confidence = 0.5, support= supp/nrow(data_discr), minlen=1, maxlen=3), appearance=appearance)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.02777778 1
## maxlen target ext
## 3 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1
##
## set item appearances ...[4 item(s)] done [0.00s].
## set transactions ...[14 item(s), 36 transaction(s)] done [0.00s].
## sorting and recoding items ... [14 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [31 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rmCBA <- cba_manual(data_raw, rules, txns, appearance$rhs, classAtt, cutp= list(), pruning_options=list(default_rule_pruning=FALSE))
rmqCBA <- qcba(cbaRuleModel=rmCBA,datadf=data_raw, extendType="numericOnly",trim_literal_boundaries = trim_literal_boundaries, postpruning = "cba", defaultRuleOverlapPruning="noPruning", createHistorySlot=TRUE,loglevel = "WARNING")
sink()
print(rmqCBA@rules)
## rules support confidence
## 1 {Temperature=[21;45],Humidity=[82;95]} => {Class=1} 0.11111111 1.0000000
## 2 {Temperature=[34;45],Humidity=[33;58]} => {Class=2} 0.08333334 1.0000000
## 3 {Temperature=[29;34],Humidity=[29;48]} => {Class=4} 0.08333334 1.0000000
## 4 {Temperature=[19;26],Humidity=[29;53]} => {Class=2} 0.08333334 1.0000000
## 5 {Temperature=[37;45],Humidity=[53;95]} => {Class=1} 0.05555556 1.0000000
## 6 {Temperature=[34;40],Humidity=[29;47]} => {Class=3} 0.02777778 1.0000000
## 7 {Temperature=[17;24],Humidity=[82;95]} => {Class=2} 0.02777778 1.0000000
## 8 {Temperature=[29;34],Humidity=[29;68]} => {Class=4} 0.16666667 0.8571429
## 9 {Temperature=[22;31],Humidity=[33;70]} => {Class=4} 0.19444445 0.7000000
## 10 {Temperature=[17;21]} => {Class=2} 0.13888890 0.6250000
## 11 {} => {Class=3} 0.19444445 0.1944444
## condition_count orderedConf orderedSupp
## 1 2 1.0000000 4
## 2 2 1.0000000 3
## 3 2 1.0000000 3
## 4 2 1.0000000 3
## 5 2 1.0000000 1
## 6 2 1.0000000 1
## 7 2 1.0000000 1
## 8 2 0.7500000 3
## 9 2 0.6250000 5
## 10 1 0.4000000 2
## 11 0 0.6666667 2
By activating default rule overlap pruning, we can reduce the number of rules in the classifier by 1.
sink("/dev/null")
rmqCBA_dro <- qcba(cbaRuleModel=rmCBA,datadf=data_raw, extendType="numericOnly",trim_literal_boundaries = trim_literal_boundaries, postpruning = "cba", defaultRuleOverlapPruning="transactionBased", createHistorySlot=FALSE,loglevel = "WARNING")
sink()
print(rmqCBA_dro@rules)
## rules support confidence
## 1 {Temperature=[21;45],Humidity=[82;95]} => {Class=1} 0.11111111 1.0000000
## 2 {Temperature=[34;45],Humidity=[33;58]} => {Class=2} 0.08333334 1.0000000
## 3 {Temperature=[29;34],Humidity=[29;48]} => {Class=4} 0.08333334 1.0000000
## 4 {Temperature=[19;26],Humidity=[29;53]} => {Class=2} 0.08333334 1.0000000
## 5 {Temperature=[37;45],Humidity=[53;95]} => {Class=1} 0.05555556 1.0000000
## 6 {Temperature=[17;24],Humidity=[82;95]} => {Class=2} 0.02777778 1.0000000
## 7 {Temperature=[29;34],Humidity=[29;68]} => {Class=4} 0.16666667 0.8571429
## 8 {Temperature=[22;31],Humidity=[33;70]} => {Class=4} 0.19444445 0.7000000
## 9 {Temperature=[17;21]} => {Class=2} 0.13888890 0.6250000
## 10 {} => {Class=3} 0.19444445 0.1944444
## condition_count orderedConf orderedSupp
## 1 2 1.000 4
## 2 2 1.000 3
## 3 2 1.000 3
## 4 2 1.000 3
## 5 2 1.000 1
## 6 2 1.000 1
## 7 2 0.750 3
## 8 2 0.625 5
## 9 1 0.400 2
## 10 0 0.750 3
Rule #6 in the original classifier is not needed. Rule #6:
print(rmqCBA@rules[6,1])
## [1] "{Temperature=[34;40],Humidity=[29;47]} => {Class=3}"
This rule assigns into Class 3 - the same class as the default rule in the end of the classifier. Let’s look at rules between #6 and the default rule #11:
print(rmqCBA@rules[6:11,1])
## [1] "{Temperature=[34;40],Humidity=[29;47]} => {Class=3}"
## [2] "{Temperature=[17;24],Humidity=[82;95]} => {Class=2}"
## [3] "{Temperature=[29;34],Humidity=[29;68]} => {Class=4}"
## [4] "{Temperature=[22;31],Humidity=[33;70]} => {Class=4}"
## [5] "{Temperature=[17;21]} => {Class=2}"
## [6] "{} => {Class=3}"
There is no rule below #6 that would prevent the training instances covered by #6 from being classified by the default rule (which has the same class as #6). Rule #6 is drawn in grey in the following figure, the remaining rules below it are drawn in blue and green.
plot(Humidity,Temperature,pch=as.character(Class), main="Rules #6 to #11 (default rule pruning OFF)",cex.lab=1.5, cex.axis=1.5, xlab="", ylab="", cex.main=1.5, cex.sub=1.15)
plotGrid()
for (i in 6:11) {
plotHumTempRule(rmqCBA@rules,i)
}
When Default rule overlap pruning is activated, rules such as #6 are removed and the area left for classification to the default rule, which comes as last (default rule is plotted in grey).
plot(Humidity,Temperature,pch=as.character(Class), main="Rules #6 to #10 (default rule pruning ON)",cex.lab=1.5, cex.axis=1.5, xlab="", ylab="", cex.main=1.5, cex.sub=1.15)
plotGrid()
for (i in 6:10) {
plotHumTempRule(rmqCBA_dro@rules,i)
}
Range-based default rule pruning looks at rules below each rule that classifies to the default class and inspects whether the rules below it can divert instances from being classified by the default rule. The difference from the transaction-based method is that it is inspected whether the regions covered by the rules do not overlap rather than training transactions. Range-based pruning thus guarantees a solution that generalizes beyond the training data. It is perfectly safe to remove rules detected as redundant by range-based default rule overlap.
The disadvantage is that range based pruning is not very uneffective. In realistically sized rule lists spanning mutliple dimensions, the odds that there will be a clashing rule for any candidate rule is very high and thus no rules will be removed. An alternative way is to judge overlap between two rules based on whether they have any transactions in common.
In our toy case, range-based pruning does not remove any rule. Rule #6 is not removed, because it shares a boundary on Temperature (34) with rule #8.
plot(Humidity,Temperature,pch=as.character(Class), main="R #6 and #8 intersect (range-based pruning is ineffective)",cex.lab=1.5, cex.axis=1.5, xlab="", ylab="", cex.main=1.5, cex.sub=1.15)
plotGrid()
plotHumTempRule(rmqCBA@rules,6)
plotHumTempRule(rmqCBA@rules,8)
Attribute pruning is an optimization step in QCBA which considers removing all attribute-value pairs (literals) in all rules in the classifier. The removal is confirmed if a rule created without the literal has at least the accuracy of the original rule. The optimization is performed in a greedy manner, processing all rules in the sort order. Attributes are processed in arbitrary order.
To demonstrate attribute pruning, we need to use dataset with higher
number of attributes than humtemp
dataset has and also
different base rule classifier than CBA, as attribute pruning is
ineffective on CBA according to our experiments. We will use
CPAR from arulesCBA as the baseline classifier and the
iris dataset.
For initial demonstration, we will turn off all optimizations in QCBA that follow after attribute pruning (i.e. trimming, postpruning, default rule pruning).
set.seed(54)
allData <- datasets::iris[sample(nrow(datasets::iris)),]
trainFold <- allData[1:100,]
testFold <- allData[101:nrow(datasets::iris),]
classAtt <- "Species"
discrModel <- discrNumeric(trainFold, classAtt)
train_disc <- as.data.frame(lapply(discrModel$Disc.data, as.factor))
cutPoints <- discrModel$cutp
test_disc <- applyCuts(testFold, cutPoints, infinite_bounds=TRUE, labels=TRUE)
y_true <-testFold[[classAtt]]
rmBASE <- CPAR(train_disc, formula=as.formula(paste(classAtt,"~ .")))
predictionBASE <- predict(rmBASE,test_disc) # CPAR (arulesCBA) predict function
baseModel_arc <- arulesCBA2arcCBAModel(rmBASE, cutPoints, trainFold, classAtt)
rmQCBA_NOATTPR <- qcba(cbaRuleModel=baseModel_arc,datadf=trainFold,attributePruning = FALSE)
predictionQCBA_NOATTPR <- predict(rmQCBA_NOATTPR,testFold)
baseModel_arc <- arulesCBA2arcCBAModel(rmBASE, cutPoints, trainFold, classAtt)
rmQCBA <- qcba(cbaRuleModel=baseModel_arc,datadf=trainFold,attributePruning = TRUE)
predictionQCBA <- predict(rmQCBA,testFold)
print(paste0("CPAR: Number of rules: ",length(rmBASE$rules),", total conditions:",sum(rmBASE$rules@lhs@data), ", accuracy on test data: ",round(CBARuleModelAccuracy(predictionBASE, y_true),2)))
## [1] "CPAR: Number of rules: 8, total conditions:10, accuracy on test data: 0.98"
print(paste0("QCBA (NO ATT PR.): Number of rules: ",nrow(rmQCBA_NOATTPR@rules),", total conditions:",sum(rmQCBA_NOATTPR@rules$condition_count), ", accuracy on test data: ",round(CBARuleModelAccuracy(predictionQCBA_NOATTPR, y_true),2)))
## [1] "QCBA (NO ATT PR.): Number of rules: 3, total conditions:3, accuracy on test data: 1"
print(paste0("QCBA (ATT PR.): Number of rules: ",nrow(rmQCBA@rules),", total conditions:",sum(rmQCBA@rules$condition_count), ", accuracy on test data: ",round(CBARuleModelAccuracy(predictionQCBA, y_true),2)))
## [1] "QCBA (ATT PR.): Number of rules: 3, total conditions:2, accuracy on test data: 1"
The QCBA postprocessing with attribute pruning enabled results in a model that is more accurate and smaller than a QCBA model with attribute pruning disabled.
It should be noted that attribute pruning is performed as a second step in QCBA after refitting. The result of QCBA model with attribute pruning enabled and without it can thus differ substantially, also, as a consequence, affecting the range of literals.
The default value of the minCondImprovement parameter is -1. This parameter value ensures exhaustive search for extensions, but can be slow on large datasets or datasets with many distinct values. The closer this parameter value will be to 0, the faster the execution will generally be.
rmCBAiris <- cba(trainFold, classAtt="Species")
## Using automatic threshold detection
## Running apriori with setting: confidence = 0.5 , support = 0 , minlen = 2 , maxlen = 3 , MAX_RULE_LEN = 5
## Rule count: 89 Iteration: 1
## Increasing maxlen to: 4
## Running apriori with setting: confidence = 0.5 , support = 0 , minlen = 2 , maxlen = 4 , MAX_RULE_LEN = 5
## Rule count: 394 Iteration: 2
## Increasing maxlen to: 5
## Running apriori with setting: confidence = 0.5 , support = 0 , minlen = 2 , maxlen = 5 , MAX_RULE_LEN = 5
## Rule count: 1006 Iteration: 3
## Target rule count satisfied: 1000
## Removing excess discovered rules
## Rule learning took: 0.02 seconds
## Original rules: 1000
## Rules after data coverage pruning: 8
## Performing default rule pruning.
## Final rule list size: 6
## Pruning took: 0.08 seconds
start.time <- Sys.time()
for (i in 1:100)
{
rmqCBA <- qcba(cbaRuleModel=rmCBAiris,datadf=trainFold,extendType="numericOnly", minCondImprovement=-1, postpruning="cba", defaultRuleOverlapPruning="noPruning")
}
end.time <- Sys.time()
message (paste("100 executions took:", round(end.time - start.time,2), " seconds"))
## 100 executions took: 7.14 seconds
start.time <- Sys.time()
for (i in 1:100)
{
rmqCBA <- qcba(cbaRuleModel=rmCBAiris,datadf=trainFold,extendType="numericOnly", minCondImprovement=0.0, postpruning="cba", defaultRuleOverlapPruning="noPruning")
}
end.time <- Sys.time()
message (paste("100 executions took:", round(end.time - start.time,2), " seconds"))
## 100 executions took: 6.37 seconds
Improvement in execution time gained on this small dataset by changing the minCondImprovement from -1 to 0.00 is about 10%.
The CBA model is passed by the R code to a Java QCBA.jar
file, which performs the QCBA model learning. The QCBA model is then
returned to R.
basePath <- tempdir()
dir.create(file.path(basePath, "debug"), showWarnings = FALSE)
rulesPath <-paste(basePath,"debug","humtemp.arules",sep=.Platform$file.sep)
write.csv(as(rmCBA@rules,"data.frame"), rulesPath, row.names=TRUE,quote = TRUE)
outputDataPath <- paste(basePath,"debug",'humtemp.csv', sep=.Platform$file.sep)
write.csv(humtemp,file=outputDataPath,row.names=FALSE)
Let’s create the configuration file for QCBA.
extendType="numericOnly"
trimLiteralBoundaries = FALSE
attributePruning =TRUE
defaultRuleOverlapPruning = "rangeBased" #noPruning,transactionBased,rangeBased
postpruning = "cba" #none,cba,greedy
dataTypes <- paste(rmCBA@attTypes, collapse = ',')
classAtt <- colnames(humtemp)[length(humtemp)] #last attribute
outputPath <- paste(basePath,"debug",'humtemp-qcba.arules', sep=.Platform$file.sep)
x=paste('<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">',
"<properties>\n",
"<entry key=\"Method\">extend</entry>\n",
"<entry key=\"RulesPath\">", rulesPath, "</entry>\n",
"<entry key=\"TrainDataPath\">", outputDataPath,"</entry>\n",
"<entry key=\"ExtendType\">",extendType,"</entry>\n",
"<entry key=\"AttributePruning\">",attributePruning, "</entry>\n",
"<entry key=\"Trimming\">",trimLiteralBoundaries, "</entry>\n",
"<entry key=\"DefaultRuleOverlapPruning\">",defaultRuleOverlapPruning, "</entry>\n",
"<entry key=\"Postpruning\">",postpruning, "</entry>\n",
"<entry key=\"DataTypes\">", dataTypes,'</entry>\n',
"<entry key=\"TargetAttribute\">", classAtt,'</entry>\n',
"<entry key=\"OutputPath\">", outputPath,'</entry>\n',
"</properties>", sep="")
qcbaFilePath <-paste(basePath,"debug","humtemp-conf.xml",sep=.Platform$file.sep)
write(x, file = qcbaFilePath,
ncolumns = 1,
append = FALSE, sep = ",")
print(paste("Run as: ", "java -jar QCBA.jar ", qcbaFilePath, sep=""))
## [1] "Run as: java -jar QCBA.jar /tmp/RtmpTjWPSi/debug/humtemp-conf.xml"
print(paste("QCBA model will be written to:",outputPath))
## [1] "QCBA model will be written to: /tmp/RtmpTjWPSi/debug/humtemp-qcba.arules"
basePath <- tempdir()
dir.create(file.path(basePath, "debug"), showWarnings = FALSE)
rulesPath <-paste(basePath,"debug","iris.arules",sep=.Platform$file.sep)
write.csv(as(rmCBAiris@rules,"data.frame"), rulesPath, row.names=TRUE,quote = TRUE)
trainDataPath <- paste(basePath,"debug",'iris-train.csv', sep=.Platform$file.sep)
testDataPath <- paste(basePath,"debug",'iris-test.csv', sep=.Platform$file.sep)
write.csv(trainFold,file=trainDataPath,row.names=FALSE)
write.csv(testFold,file=testDataPath,row.names=FALSE)
Let’s create the configuration file for QCBA.
extendType="numericOnly"
trimLiteralBoundaries = FALSE
attributePruning =TRUE
defaultRuleOverlapPruning = "noPruning" #noPruning,transactionBased,rangeBased
postpruning = "cba" #none,cba,greedy
dataTypes <- paste(rmCBAiris@attTypes, collapse = ',')
classAtt <- colnames(trainFold)[length(trainFold)] #last attribute
qcbaModelPath <- paste(basePath,"debug",'iris-qcba.arules', sep=.Platform$file.sep)
x=paste('<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">',
"<properties>\n",
"<entry key=\"Method\">extend</entry>\n",
"<entry key=\"RulesPath\">", rulesPath, "</entry>\n",
"<entry key=\"TrainDataPath\">", trainDataPath,"</entry>\n",
"<entry key=\"ExtendType\">",extendType,"</entry>\n",
"<entry key=\"AttributePruning\">",attributePruning, "</entry>\n",
"<entry key=\"Trimming\">",trimLiteralBoundaries, "</entry>\n",
"<entry key=\"DefaultRuleOverlapPruning\">",defaultRuleOverlapPruning, "</entry>\n",
"<entry key=\"Postpruning\">",postpruning, "</entry>\n",
"<entry key=\"DataTypes\">", dataTypes,'</entry>\n',
"<entry key=\"TargetAttribute\">", classAtt,'</entry>\n',
"<entry key=\"OutputPath\">", qcbaModelPath,'</entry>\n',
"</properties>", sep="")
qcbaFilePath <-paste(basePath,"debug","iris-conf.xml",sep=.Platform$file.sep)
write(x, file = qcbaFilePath,
ncolumns = 1,
append = FALSE, sep = ",")
print(paste("Run as: ", "java -jar QCBA.jar ", qcbaFilePath, sep=""))
## [1] "Run as: java -jar QCBA.jar /tmp/RtmpTjWPSi/debug/iris-conf.xml"
print(paste("QCBA model will be written to:",qcbaModelPath))
## [1] "QCBA model will be written to: /tmp/RtmpTjWPSi/debug/iris-qcba.arules"
Configuration for learning QCBA rule set (mutliple rules used for classification) model
annotate =TRUE
qcbaModelPath <- paste(basePath,"debug",'iris-qcba-mixture.xml', sep=.Platform$file.sep)
x=paste('<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">',
"<properties>\n",
"<entry key=\"Method\">extend</entry>\n",
"<entry key=\"Annotate\">", annotate, "</entry>\n",
"<entry key=\"RulesPath\">", rulesPath, "</entry>\n",
"<entry key=\"TrainDataPath\">", trainDataPath,"</entry>\n",
"<entry key=\"ExtendType\">",extendType,"</entry>\n",
"<entry key=\"AttributePruning\">",attributePruning, "</entry>\n",
"<entry key=\"Trimming\">",trimLiteralBoundaries, "</entry>\n",
"<entry key=\"DefaultRuleOverlapPruning\">",defaultRuleOverlapPruning, "</entry>\n",
"<entry key=\"Postpruning\">",postpruning, "</entry>\n",
"<entry key=\"DataTypes\">", dataTypes,'</entry>\n',
"<entry key=\"TargetAttribute\">", classAtt,'</entry>\n',
"<entry key=\"OutputPath\">", qcbaModelPath,'</entry>\n',
"</properties>", sep="")
qcbaFilePath <-paste(basePath,"debug","iris-conf-mixture.xml",sep=.Platform$file.sep)
write(x, file = qcbaFilePath,
ncolumns = 1,
append = FALSE, sep = ",")
print(paste("Run as: ", "java -jar QCBA.jar ", qcbaFilePath, sep=""))
## [1] "Run as: java -jar QCBA.jar /tmp/RtmpTjWPSi/debug/iris-conf-mixture.xml"
print(paste("QCBA model will be written to:",qcbaModelPath))
## [1] "QCBA model will be written to: /tmp/RtmpTjWPSi/debug/iris-qcba-mixture.xml"
Configuration for applying QCBA rule set model
testingType ="mixture"
outputPath <- paste(basePath,"debug",'iris-qcba-mixture-predict.csv', sep=.Platform$file.sep)
qcbaFilePath <-paste(basePath,"debug","iris-conf-mixture-predict.xml",sep=.Platform$file.sep)
x=paste('<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">',
"<properties>\n",
"<entry key=\"Method\">test</entry>\n",
"<entry key=\"RulesPath\">", qcbaModelPath, "</entry>\n",
"<entry key=\"TestDataPath\">", testDataPath,"</entry>\n",
"<entry key=\"TestingType\">",testingType,"</entry>\n",
"<entry key=\"DataTypes\">", dataTypes,'</entry>\n',
"<entry key=\"TargetAttribute\">", classAtt,'</entry>\n',
"<entry key=\"OutputPath\">", outputPath,'</entry>\n',
"</properties>", sep="")
write(x, file = qcbaFilePath,
ncolumns = 1,
append = FALSE, sep = ",")
print(paste("Run as: ", "java -jar QCBA.jar ", qcbaFilePath, sep=""))
## [1] "Run as: java -jar QCBA.jar /tmp/RtmpTjWPSi/debug/iris-conf-mixture-predict.xml"
print(paste("QCBA prediction will be written to:",outputPath))
## [1] "QCBA prediction will be written to: /tmp/RtmpTjWPSi/debug/iris-qcba-mixture-predict.csv"
The first example uses the iris dataset.
set.seed(7)
library(ROCR)
library(qCBA)
twoClassIris<-datasets::iris[1:100,]
twoClassIris <- twoClassIris[sample(nrow(twoClassIris)),]
#twoClassIris$Species<-as.factor(as.character(iris$Species))
trainFold <- twoClassIris[1:75,]
testFold <- twoClassIris[76:nrow(twoClassIris),]
classAtt <- "Species"
rmCBA <- cba(trainFold, classAtt=classAtt)
rmqCBA <- qcba(cbaRuleModel=rmCBA, datadf=trainFold)
confidencesQCBA <- predict(rmqCBA,testFold,outputConfidenceScores=TRUE,positiveClass="versicolor")
#it is importat that the first level is different from positiveClass specified in the line above
roc_pred <- ROCR::prediction(confidencesQCBA, factor(testFold[[classAtt]]))
roc_qcba = ROCR::performance(roc_pred, "tpr", "fpr")
plot(roc_qcba, lwd=2, colorize=TRUE)
lines(x=c(0, 1), y=c(0, 1), col="black", lwd=1)
auc = ROCR::performance(roc_pred, "auc")
auc = unlist(auc@y.values)
auc
## [1] 0.9583333
The second example is on the adult dataset, which is more difficult than iris. In this way, we obtain varied confidence scores for the ROC plot. First, get results from CBA:
library(ROCR)
set.seed(101)
classitems <- c("income=small","income=large")
adult <- read.table('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
sep = ',', fill = F, strip.white = T, col.names = c('age', 'workclass', 'fnlwgt', 'educatoin',
'educatoin_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex',
'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income'))
split = sample(c(TRUE, FALSE), nrow(adult), replace=TRUE, prob=c(0.75, 0.25))
trainFold <- adult[split,]
testFold <- adult[!split,]
classAtt <- "income"
positiveClass<-">50K"
rmCBA <- cba(trainFold, classAtt, list(target_rule_count = 1000))
## Warning: Column(s) 2, 4, 6, 7, 8, 9, 10, 14 not logical or factor. Applying
## default discretization (see '? discretizeDF').
confidence_scores_cba <- predict(rmCBA, testFold, outputConfidenceScores=TRUE,positiveClass=positiveClass)
pred_cba <- ROCR::prediction(confidence_scores_cba, factor(testFold[[classAtt]]))
roc_cba <- ROCR::performance(pred_cba, "tpr", "fpr")
ROCR::plot(roc_cba, lwd=2, colorize=TRUE)
lines(x=c(0, 1), y=c(0, 1), col="black", lwd=1)
auc_cba <- ROCR::performance(pred_cba, "auc")
auc_cba <- unlist(auc_cba@y.values)
auc_cba
## [1] 0.8849119
Then, for QCBA:
rmqCBA <- qcba(cbaRuleModel=rmCBA, datadf=trainFold)
confidencesQCBA <- predict(rmqCBA,testFold,outputConfidenceScores=TRUE,positiveClass=positiveClass)
roc_pred <- ROCR::prediction(confidencesQCBA, factor(testFold[[classAtt]]))
roc_qcba = ROCR::performance(roc_pred, "tpr", "fpr")
plot(roc_qcba, lwd=2, colorize=TRUE)
lines(x=c(0, 1), y=c(0, 1), col="black", lwd=1)
auc_qcba = ROCR::performance(roc_pred, "auc")
auc_qcba = unlist(auc_qcba@y.values)
auc_qcba
## [1] 0.8514297