Market Basket Analysis in MarkLogic NoSQL – Part 2, association rules

Share on FacebookShare on Google+Email this to someoneTweet about this on TwitterShare on LinkedIn

2016-01-22: As a colleague of mine pointed out, I am using the eval function in my original posting of the frequent itemset generation and that is not the most optimal. I have updated the examples and code not to use it anymore.

In my previous post I described how to generate a frequent itemset using MarkLogic server built in functionality as a first step in doing a Market Basket Analysis.

The second step is about generating rules from our frequent itemsets that support our minimum support and confidence thresholds.

During this post I will walk through how to generate these rules using MarkLogic server and JavaScript.

Based on the function created in the previous post our frequent itemsets result is looking like below.

{"1":[
  {"itemSet":["18th century schooner"], "support":6},
  {"itemSet":["1900s Vintage Bi-Plane"], "support":7},
  {"itemSet":["1903 Ford Model A"], "support":2},
  {"itemSet":["1928 British Royal Navy Airplane"], "support":6},
  {"itemSet":["Collectable Wooden Train"], "support":2}], 
"2":[
  {"itemSet":["18th century schooner", "1900s Vintage Bi-Plane"], "support":4},
  {"itemSet":["18th century schooner", "1903 Ford Model A"], "support":2},
  {"itemSet":["18th century schooner", "1928 British Royal Navy Airplane"], "support":4},
  {"itemSet":["1900s Vintage Bi-Plane", "1903 Ford Model A"], "support":2},
  {"itemSet":["1900s Vintage Bi-Plane", "1928 British Royal Navy Airplane"], "support":4},
  {"itemSet":["1900s Vintage Bi-Plane", "Collectable Wooden Train"], "support":2}], 
"3":[
  {"itemSet":["18th century schooner", "1900s Vintage Bi-Plane", "1903 Ford Model A"], "support":2}, 
  {"itemSet":["18th century schooner", "1900s Vintage Bi-Plane", "1928 British Royal Navy Airplane"], "support":2}]
}

Out of this we want to create rules that look like the following

{"lhs":["18th century schooner"], "rhs":["1900s Vintage Bi-Plane"], "support":0.444444444444444, "confidence":0.666666666666667}

This rule can also be described as:
[”18th century schooner”] => [”1900s Vintage Bi-Plane”] – support = 44%, confidence = 66%

Meaning that persons that bought ”18th century schooner” also bought “1900s Vintage Bi-Plane” and that 44% of our baskets contains both products and 66% of all baskets that contain “18th century schooner” also contain ”1900s Vintage Bi-Plane”.

The rule generation is basically done by taking each itemset and from that generate all possible subsets.

For example if we have this itemset :

[”18th century schooner”, ”1900s Vintage Bi-Plane”, ”1928 British Royal Navy Airplane”]

We can then identify the following subsets:

  • [”18th century schooner”]
  • [”1900s Vintage Bi-Plane”]
  • [”1928 British Royal Navy Airplane”]
  • [”18th century schooner”, ”1900s Vintage Bi-Plane”]
  • [”18th century schooner”, ”1928 British Royal Navy Airplane”]
  • [”1900s Vintage Bi-Plane”, ”1928 British Royal Navy Airplane”]

Based on the subsets we can generate the following rules that consist of all items in the itemset:

  • [”18th century schooner”] => [”1900s Vintage Bi-Plane”, ”1928 British Royal Navy Airplane”]
  • [”1900s Vintage Bi-Plane”] => [”18th century schooner”, ”1928 British Royal Navy Airplane”]
  • [”1928 British Royal Navy Airplane”] => [”18th century schooner”, ”1900s Vintage Bi-Plane”]
  • [”18th century schooner”, ”1900s Vintage Bi-Plane”] => [”1928 British Royal Navy Airplane”]
  • [”18th century schooner”, ”1928 British Royal Navy Airplane”] => [”1900s Vintage Bi-Plane”]
  • [”1900s Vintage Bi-Plane”, ”1928 British Royal Navy Airplane”] => [”18th century schooner”]

Now we can calculate support and confidence for each rule using the values from our frequent itemsets. Support is calculated as number of transactions that consist left and right part of rule divided by total number of baskets, confidence is calculated as number of baskets that consist of left and right part of rule divided by number of baskets containing left part.

So by using the previous created frequent itemsets we can calculate the following support and confidence numbers for our rules.

  • [”18th century schooner”] => [”1900s Vintage Bi-Plane”, ”1928 British Royal Navy Airplane”] – support = 2/9 = 22%, confidence = 2/6 = 33%
  • [”1900s Vintage Bi-Plane”] => [”18th century schooner”, ”1928 British Royal Navy Airplane”] – support = 2/9 = 22%, confidence = 2/7 = 29%
  • [”1928 British Royal Navy Airplane”] => [”18th century schooner”, ”1900s Vintage Bi-Plane”] – support = 2/9 = 22%, confidence = 2/6 = 33%
  • [”18th century schooner”, ”1900s Vintage Bi-Plane”] => [”1928 British Royal Navy Airplane”] – support = 2/9 = 22%, confidence = 2/4 = 50%
  • [”18th century schooner”, ”1928 British Royal Navy Airplane”] => [”1900s Vintage Bi-Plane”]- support = 2/9 = 22%, confidence = 2/4 = 50%
  • [”1900s Vintage Bi-Plane”, ”1928 British Royal Navy Airplane”] => [”18th century schooner”]- support = 2/9 = 22%, confidence = 2/4 = 50%

So now we have our Market Basket Analysis, lets se how it can be done using MarkLogic NoSQL Server and JavaScript. A lot of the code used in this step is based on an existing JavaScript Apriori implementation, https://github.com/seratch/apriori.js. All code is in a Query Console workspace that can be downloaded here and can be imported into Query Console.

I have chosen to use a function to generate our association rules, getAssociationRules, which will take the frequent itemset created earlier, total number of transactions and a minimum confidence threshold as parameters.

The full function is below.

function getAssociationRules(frequentItemSets, totTrans, minConfidence) {
  /* internal helper functions */
  
  /* returns the itemset from f */
  var extractItemSet = function (f) {
    return f.itemSet;
  };
  /* Generates all possible subsets, including the itemset, of a itemset
     ex array = ["I1", "I2", "I5"] then
    allSubSets = [["I1"], ["I2"], ["I5"], ["I1", "I2"], ["I1", "I5"], ["I2", "I5"], ["I1", "I2", "I5"]] */
  var toAllSubSets = function (array) {
    var op = function (n, sourceArray, currentArray, allSubSets) {
      if (n === 0) {
        if (currentArray.length > 0) {
          allSubSets[allSubSets.length] = currentArray;
        }
      } else {
        for (var j = 0; j < sourceArray.length; j++) {
          var nextN = n - 1, nextArray = sourceArray.slice(j + 1), updatedCurrentSubSet = currentArray.concat([sourceArray[j]]);
          op(nextN, nextArray, updatedCurrentSubSet, allSubSets);
        }
      }
    };
    
    var allSubSets = [];
    array.sort();
    for (var i = 1; i < array.length; i++) { op(i, array, [], allSubSets); } allSubSets.push(array); return allSubSets; }; /* Returns a array which contains the elements that is in arrayA but not in arrayB ex. arrayB = ["I1"], arrayA = ["I1", "I2", "I5"] diffArray = ["I2", "I5"] */ var getDiffArray = function (arrayA, arrayB) { var diffArray = []; arrayA.forEach(function (e) { if (arrayB.indexOf(e) === -1) diffArray.push(e); }); return diffArray; }; /* create and save the rules, if they apply to the confidence threshold */ var saveAssociationRuleFound = function (subsetItemSet) { var diffItemSet = getDiffArray(currentItemSet, subsetItemSet); if (diffItemSet.length > 0) {
      var itemSupport = frequencies[currentItemSet.toString()] / totTrans;
      var subsetSupport = frequencies[subsetItemSet.toString()] / totTrans;
      var confidence = itemSupport / subsetSupport;
      if (!isNaN(confidence) && confidence >= minConfidence) {
        associationRules.push({"lhs":subsetItemSet,"rhs":diffItemSet, "support":itemSupport, "confidence":confidence});
      }
    }
  };
  
  /*
    for a itemset, generate all possible subsets, and for each subset 
    generate and add rules to associationRules
  */
  var saveAllAssociationRulesFound = function (itemSet) {
    currentItemSet = itemSet;
    toAllSubSets(currentItemSet).forEach(saveAssociationRuleFound);
  };
  
  /* main logic */
  var frequencies = {};
  var currentItemSet;
  var associationRules = [];
  /* 
    Generate a objects with the frequencies of all itemsets 
    Key is the itemset as a string.
    Ex ["18th century schooner", "1900s Vintage Bi-Plane", "1903 Ford Model A"] 
        =>
       "18th century schooner,1900s Vintage Bi-Plane,1903 Ford Model A"
  */
  for (var itemSets in freqItemsSets) {
    for (var i = 0; i< freqItemsSets[itemSets].length; i++) {         
      frequencies[freqItemsSets[itemSets][i].itemSet.toString()] = freqItemsSets[itemSets][i].support;
    };
  };
  
  /* 
     Walk through all itemsets and get the rules for them.
     itemsets are grouped by the number of items they have.
     1,2,3 etc 
   */
  for (var k in freqItemsSets) {
    /* get the all itemsets that has the same number of items */
    var itemSets = freqItemsSets[k].map(extractItemSet);
    if (itemSets.length === 0 || itemSets[0].length <= 1) {
      continue;
    }
    /* for each itemset add the rules into associationRules */
    itemSets.forEach(saveAllAssociationRulesFound);
  };
  return associationRules;
}

Lets go through the different parts of it. There is a lot of functions within variables in the beginning and the main logic starts in the end.

The first thing we need to do is to store the support figure for each itemset in a simpler structure, so we can easily access it when we need it. Which is every time we add a rule.

The following code loops through our itemsets, created by the getFreqItemSets function described in my previous post, and gets the support value. It stores it in an object where the string version of the itemset is the key.

var frequencies = {};
for (var itemSets in freqItemsSets) {
    for (var i = 0; i< freqItemsSets[itemSets].length; i++) {         
      frequencies[freqItemsSets[itemSets][i].itemSet.toString()] = freqItemsSets[itemSets][i].support;
    };
  };

For example {”itemSet”:[”18th century schooner”, ”1900s Vintage Bi-Plane”, ”1928 British Royal Navy Airplane”], ”support”:2} would be stored as {”18th century schooner, 1900s Vintage Bi-Plane,1928 British Royal Navy Airplane”:2}.

Next step is to loop through our frequent itemsets and start generate rules.

  for (var k in freqItemsSets) {
    /* get the all itemsets that has the same number of items */
    var itemSets = freqItemsSets[k].map(extractItemSet);
    if (itemSets.length === 0 || itemSets[0].length <= 1) {
      continue;
    }
    /* for each itemset add the rules into associationRules */
    itemSets.forEach(saveAllAssociationRulesFound);
  };

This code loops through all itemsets groupings (“1”, “2”, “3” etc) and for each group gets the itemsets and then loop through them in order to generate rules.

freqItemsSets[k].map(extractItemSet) means that for each element in freqItemsSets[k] apply the function extractItemSet, that returns the itemset, the result is then stored in a new array, itemSets. So for the first interaction itemSets will consist of all 1-way itemsets, next 2-way and so on. We also have a if statement that skips 1-ways, there is no rule to be generated for one item itemsets, and empty itemsets.

Next step is itemSets.forEach(saveAllAssociationRulesFound) which means that for each element, itemset, in itemSets apply the function saveAllAssociationRulesFound, which is shown below.

var saveAllAssociationRulesFound = function (itemSet) {
    currentItemSet = itemSet;
    toAllSubSets(currentItemSet).forEach(saveAssociationRuleFound);
  };

The function will first store the current itemset in currentItemSet, then it will generate all possible subsets, as shown earlier in this post, by calling the function toAllSubSets using the currentItemSet as parameter.

Then there is a .forEach(saveAssociationRuleFound) added to the function call. What is happening is that toAllSubSets will return an array, containing all subsets found in currentItemSet, and then for each element, subset, in the array call the saveAssociationRuleFound function.

  var saveAssociationRuleFound = function (subsetItemSet) {
    var diffItemSet = getDiffArray(currentItemSet, subsetItemSet); 
    if (diffItemSet.length > 0) {
      var itemSupport = frequencies[currentItemSet.toString()] / totTrans;
      var subsetSupport = frequencies[subsetItemSet.toString()] / totTrans;
      var confidence = itemSupport / subsetSupport;
      if (!isNaN(confidence) && confidence >= minConfidence) {
        associationRules.push({"lhs":subsetItemSet,"rhs":diffItemSet, "support":itemSupport, "confidence":confidence});
      }
    }
  };

The saveAssociationRuleFound function is generating the rule; it creates a itemset, diffItemSet, that consists of all items that exists in currentItemSet, that we set in saveAllAssociationRulesFound, but not in subsetItemSet. Then it calculate the support and confidence numbers using the frequencies variable we created earlier and if the confidence is equal or greater to our threshold we save it in our associationRules variable.

That is it!

So if we call our new function after creating the frequent itemsets using the following code.

var field = "productName";
var collQuery = cts.collectionQuery(["apriori"]);
var transLength = cts.estimate(collQuery);
var minTransSupp = transLength * 0.22;
var minItems = 1;
var maxItems = 5;
var minConf = 0.01;

/* Get the frequent itemsets */
var freqItemsSets = getFreqItemSets(field, collQuery, minTransSupp, minItems, maxItems);
/* Get the associationRules */
var associationRules = getAssociationRules(freqItemsSets, transLength, minConf);
associationRules;

We get our Market Basket Analysis, done on non-relational data, without going through the steps to convert into  a tabular representation.

Association rules

A next step could be to wrap this into a REST extension, which is something I will do for my R package, or to save the result as a document into MarkLogic Server for later use.

I would like to finish this post with a challenge,  can this be done in a even more efficient way using MarkLogic Server built in features?

/Mats Stellwall

Share on FacebookShare on Google+Email this to someoneTweet about this on TwitterShare on LinkedIn