Market Basket Analysis in MarkLogic NoSQL – Part 1, frequent itemsets


2016-01-22: As a colleague of mine pointed out, the original version of this post used the eval function, which is not optimal. I have updated the example and the code so that it is no longer used.

I usually write this blog in Swedish, but for this topic I think there is a broader audience. At least you can always dream…

During the past months I have been involved in a private side project that aims to create an interface, an R package, for MarkLogic. The purpose of the package is to enable a more tabular view of the data stored in MarkLogic, and also a more native R way to access and work with the data. The package and progress can be found at https://github.com/mstellwa/rfml.

As a part of that, I started to look into how easy it would be to enable support for analyses and algorithms that are not natively supported in MarkLogic Server, and Market Basket Analysis is one example.

But before going into the details, let's do a quick recap of MarkLogic NoSQL Server.

One of the advantages of MarkLogic Server is the possibility to ingest data as-is; it does not enforce a schema or structure. Each document (one document could equal a row in a database table) has its own structure. Everything is indexed and can be accessed using search and/or a query; a search is a query and a query is a search.

Documents are primarily stored as XML or JSON, which are hierarchical formats. This type of format can be a challenge for some types of analysis, since many existing tools and algorithms expect tabular formatted data. A tabular view can be achieved in MarkLogic Server using range indexes in combination with something called views, more information about that at http://docs.marklogic.com/guide/sql. However, it would be easier if the analysis could be done on the data as it is stored.

One example of a document that could be interesting to use for Market Basket Analysis is this simplified order document in JSON format.

{"customerOrder":{
"orderNumber":"T100",
"lineItems":{
"lineItem":[{
"orderLineNumber":"1",
"productNo":"1",
"productName":"18th century schooner",
"quantityOrdered":"49",
"priceEach":"34.47",
"lineTotal":"1689.03"},
{"orderLineNumber":"2",
"productNo":"2",
"productName":"1900s Vintage Bi-Plane",
"quantityOrdered":"20",
"priceEach":"33.00",
"lineTotal":"660"},
{"orderLineNumber":"3",
"productNo":"5",
"productName":"1903 Ford Model A",
"quantityOrdered":"33",
"priceEach":"36.45",
"lineTotal":"1202.85" ]}}}

The above document describes one order. It has multiple levels, and in MarkLogic it is stored as-is, the way it is shown above. Since each document has a structure of its own, we can have documents with different numbers of items in them. In a relational world we would have to divide it into multiple tables; if we stored it in one table we would need a lot of columns to cover the maximum number of items an order can have.

So how can I do a Market Basket Analysis on this type of data without creating a more tabular way of storing my data?

There are a number of features in MarkLogic Server that will make this rather easy, if you are prepared to let go of your traditional ways and are willing to try something new.

A Market Basket Analysis, which is a form of association rule mining, is basically about finding associations or relationships among data items, which in this case are products.

The output is often a number of association rules, for example:
Milk => Bread [support = 2%, confidence = 60%]

This rule tells us that people who buy milk also buy bread. The support and confidence measures tell us that 2% of all our baskets contain both milk and bread, and that 60% of the baskets that contain milk also contain bread. A rule is usually considered interesting if it satisfies both a minimum support threshold and a minimum confidence threshold. The user of the analysis usually sets these thresholds as input parameters.
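
To make the two measures concrete, here is a minimal sketch in JavaScript with made-up basket counts (the numbers are hypothetical and only chosen to match the 2% and 60% in the example rule):

// hypothetical counts, only to illustrate how the measures are calculated
var totalBaskets = 1000;   // all baskets in the data set
var milkBaskets = 33;      // baskets that contain milk
var milkAndBread = 20;     // baskets that contain both milk and bread

var support = milkAndBread / totalBaskets;   // 0.02  -> 2%
var confidence = milkAndBread / milkBaskets; // ~0.61 -> ~60%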

Association rule mining is done in two steps:

  1. Find all frequent itemsets. Identify all item combinations that exist in our data set and occur at least as frequently as our minimum support threshold.
  2. Generate rules from our itemsets that satisfy our minimum support and confidence thresholds.

In this post I am going to focus on how to do the first step; the second step will be covered in the next post.

For this I am going to do an implementation that is similar to the Apriori algorithm, and I have also created example order documents that I am using in this and the next post. You can access the example data here; click on the Raw button to download it. All the code is in a Query Console workspace that can be downloaded here and imported into Query Console.

You need a database that you can load the data into; in this example I am using one called advanced-analytics. To load the data you can simply use the mlcp tool (download here) and the following command; replace the parameter values with the correct ones for your MarkLogic installation. I am also using a collection, apriori, in order to organize my documents:

/mlcp-8.0-4/bin/mlcp.sh import -host localhost -port 8000 -database advanced-analytics -username admin -password admin -mode local -input_file_path ./apriori_data -output_collections apriori

If we, after the load, go to Query Console, http://[MarkLogic server ip]:8000/qconsole/, choose our database and then click Explore, we can see that there are nine documents loaded. Clicking on one of the documents will show its structure.

Loaded document
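
If you prefer to verify the document count with a query instead of the Explore button, a quick sketch (assuming the apriori collection from the mlcp command above) is to run the following in Query Console:

// estimate the number of documents in the collection we just loaded
cts.estimate(cts.collectionQuery(["apriori"]));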

In order to identify our frequent itemsets we need to be able to search for all combinations of products that exists in our data.

Now, there are multiple ways to do this, and if I were using a relational database I would probably have to do a lot of self joins. However, in MarkLogic Server there is a much simpler way. By using a range index we can get the combinations without having to select documents and do a lot of self joins.

A range index is basically a list of all unique values and the identifiers of the documents that have those values.

The quickest way to create a range index for this is to go to the administration page, http://[MarkLogic server ip]:8001, click on Databases, click on your database, click on Element Range Indexes and then click Add. Choose string as scalar type, productName as localname, set range value positions to true and then click Ok.

Range Index settings

We now have an index that consists of the unique productName values and the document IDs.

There are a number of built-in functions that can be used to get index values; in this case we are interested in getting 1 to n-way combinations. For single values we could use cts.values, but since I also want 2-, 3-, 4-way (and so on) combinations, cts.valueTuples is a better fit.

cts.valueTuples returns value co-occurrence tuples (that is, tuples of values, each of which appear in the same document) from a range index.
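
As a point of comparison, a minimal sketch that lists the single productName values together with their frequencies (assuming the range index created above) could look like this:

// get all unique productName values from the range index
var values = cts.values([cts.elementReference(xs.QName("productName"))]);

var valueFreq = [];
for (var value of values) {
  // cts.frequency gives the number of documents the value occurs in
  valueFreq.push({"value": value, "support": cts.frequency(value)});
}
valueFreq;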

In order to get all 2-way itemsets we can write the following JavaScript query:

cts.valueTuples([cts.elementReference(xs.QName("productName")), cts.elementReference(xs.QName("productName"))], "ordered");

2-way itemsets

We can then apply the function cts.frequency to the result to get the number of documents each itemset occurs in.

The following example gets all 2-way combinations, looks up the frequency for each and adds the result to a variable that is then returned.

var itemSet = cts.valueTuples([cts.elementReference(xs.QName("productName")), cts.elementReference(xs.QName("productName"))], "ordered");

var itemSetFreq = [];
/* look up the number of documents each tuple occurs in */
for (var item of itemSet) {
  var freq = cts.frequency(item);
  itemSetFreq.push({"itemSet": item, "support": freq});
}
itemSetFreq;

Itemsets with frequencies

The Apriori algorithm walks through and identifies all n-way item combinations in a data set, starting from 1 and continuing until there are no more combinations that satisfy the minimum support threshold. In order to be efficient it uses the Apriori property, which states that all subsets of a frequent itemset must also be frequent.

For example, if we have the itemset {I1, I2, I3}, then {I1, I2}, {I1, I3} and {I2, I3} must also be frequent in order for {I1, I2, I3} to be frequent.

Now, this is based on the assumption that we need to calculate the frequency for each itemset by scanning through the data. With MarkLogic Server and range indexes we do not have that limitation.

So in my implementation I am going to loop through all the n-way itemsets I am interested in, get the frequency for each, and then only keep the ones that satisfy my minimum support threshold. I am also going to use parameters to set the minimum and maximum number of items.

The following code consists of a function, getFreqItemSets, that returns a variable containing the frequent itemsets and their frequencies. Since we are going from minlen items to maxlen items in our combinations, it builds an array of cts.elementReference objects for each itemset size and passes it to cts.valueTuples. It also uses a cts.query, whereQuery, to limit the scope of the data used.

At the end there is just a call to the function; I am using a collection query to limit it to the data loaded earlier, which was added to the collection apriori.

/* The function returns frequent itemsets, based on itemField, that occur in at least a fraction supp of the transactions and consist of a minimum of minlen and a maximum of maxlen items. whereQuery is used to limit the data used and must be a cts.query */
function getFreqItemSets(itemField, whereQuery, supp, minlen, maxlen) {
  /* calculate the minimum support count */
  var trans = cts.estimate(whereQuery);
  var minTransSupp = trans * supp;
  var frequentItemSets = {};
  var itemSetSize = 1;

  /* for minlen to maxlen */
  for (var i = minlen; i <= maxlen; i++) {
    /* build one element reference per item in the i-way combination */
    var elementRefs = [];
    for (var j = 1; j <= i; j++) {
      elementRefs.push(cts.elementReference(xs.QName(itemField)));
    }
    /* get all i-way value co-occurrences within the scope of whereQuery */
    var itemSet = cts.valueTuples(elementRefs, "ordered", whereQuery);
    var itemSetFreq = [];
    for (var item of itemSet) {
      var freq = cts.frequency(item);
      /* keep only the itemsets that satisfy the minimum support count */
      if (freq >= minTransSupp) {
        itemSetFreq.push({"itemSet": item, "support": freq});
      }
    }
    if (itemSetFreq.length > 0) {
      frequentItemSets[itemSetSize] = itemSetFreq;
      itemSetSize += 1;
    }
  }

  return frequentItemSets;
}

var collQuery = cts.collectionQuery(["apriori"]);
getFreqItemSets("productName", collQuery, 0.22, 1, 5);

Frequent itemsets function

In the result we can see all frequent itemsets that satisfy our minimum support threshold. It is also notable that even though we used 5 as the maximum number of items, we only got up to 3-way itemsets; this is because there were no 4- or 5-way itemsets that satisfied the threshold.
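
As a small usage sketch (assuming the getFreqItemSets function above has been evaluated in the same query), the returned object can be post-processed with plain JavaScript, for example to look at the largest itemsets sorted by support:

var result = getFreqItemSets("productName", cts.collectionQuery(["apriori"]), 0.22, 1, 5);

// the keys are added in increasing itemset size, so the last key holds
// the largest itemsets that satisfied the minimum support threshold
var sizes = Object.keys(result);
var largest = result[sizes[sizes.length - 1]];

// sort descending on support to see the strongest itemsets first
largest.sort(function (a, b) { return b.support - a.support; });
largest;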

Now we have a function that gives us all frequent itemsets, which we can use to identify all association rules. That will be the topic of my next post.

/Mats Stellwall
