Boost Business of retail supermarket

The goal of this exercice is to decipher relevant knowledges from Baskets DataSet and boost the business by positionning the items.

# get the structure of the dataset
str(basket)
## 'data.frame':    38765 obs. of  3 variables:
##  $ Member_number  : int  1808 2552 2300 1187 3037 4941 4501 3803 2762 4119 ...
##  $ Date           : Factor w/ 728 levels "01-01-2014","01-01-2015",..: 494 98 450 288 4 316 178 552 462 268 ...
##  $ itemDescription: Factor w/ 167 levels "abrasive cleaner",..: 156 165 109 102 165 122 102 112 165 156 ...

Data Pre-processing

The dataset is a dataframe with 3 columns: Member_number, Date, and Products description. The Member_number column must be considered as factor, like the id of member. The Date must be a date type. The itemDascription is already as.factor

convert the type of variables

library(tibble)

basket$Member_number <- as.factor(basket$Member_number)
basket$Date <- as.Date(basket$Date, "%d-%m-%Y")
# In the case if we have the transaction date as 2010-12-01 08:26:00
# We can extract time from the column and store in another variable:
# TransTime<- format(basket$Date,"%H:%M:%S")
# This line create an other column with Date YMD.

# convert the dataframe as a tibble
basket <- tibble::as.tibble(basket)

# review the structure and summary
str(basket)
## Classes 'tbl_df', 'tbl' and 'data.frame':    38765 obs. of  3 variables:
##  $ Member_number  : Factor w/ 3898 levels "1000","1001",..: 787 1505 1264 181 1982 3839 3407 2732 1712 3037 ...
##  $ Date           : Date, format: "2015-07-21" "2015-01-05" ...
##  $ itemDescription: Factor w/ 167 levels "abrasive cleaner",..: 156 165 109 102 165 122 102 112 165 156 ...
summary(basket)
##  Member_number        Date                    itemDescription 
##  3180   :   36   Min.   :2014-01-01   whole milk      : 2502  
##  2051   :   33   1st Qu.:2014-07-12   other vegetables: 1898  
##  3050   :   33   Median :2015-01-21   rolls/buns      : 1716  
##  3737   :   33   Mean   :2015-01-09   soda            : 1514  
##  2271   :   31   3rd Qu.:2015-07-10   yogurt          : 1334  
##  2433   :   31   Max.   :2015-12-30   root vegetables : 1071  
##  (Other):38568                        (Other)         :28730
basket

Now is better formated. We can observe:

  • The Member_number 3180 has the biggest transactions (36).

  • The Whole milk is the most sold with 2502 transactions.

  • The sampling data was done during two years, between 2014-01-01 and 2015-12-30.

Before to start the exploration, It is important to check if the dataset countain empty cells or non available data.

Check for any empty or NA cell

# Ckeck for Non Available (NA) cell
all(is.na(basket))
## [1] FALSE
# FALSE
# otherwise we can  use #complete.cases(data) will return a logical vector indicating which rows have no missing values. 
#Then use the vector to get only rows that are complete using basket[,].
basket <- basket[complete.cases(basket), ]

basket
# get a glimpse of the data
tibble::glimpse(basket)
## Observations: 38,765
## Variables: 3
## $ Member_number   <fct> 1808, 2552, 2300, 1187, 3037, 4941, 4501, 3803...
## $ Date            <date> 2015-07-21, 2015-01-05, 2015-09-19, 2015-12-1...
## $ itemDescription <fct> tropical fruit, whole milk, pip fruit, other v...

Context of this dataset

The dataset has three variables that describe the transaction of multiple costumers that bought mutiple items recurrently during 2 years.

Exploratory Data Analysis

Please, find the exploratory steps in Drafting section.

Market Basket Analysis

This dataset describes the transactions done in supermarket during two years. The Market Basket Analysis (MBA) uses Association Rule to extract knowledge about the items, and costumers. The goal of this exercice is to use this data to boost the business. The idea is to change for example the layout of the physical store or rather an online store. For example, put particular product with bad sold rate near to a product with a good rate of sold to boost the business.

What we need to do is to group data by Member_number, and Date. We need this grouping and apply a function on it and store the output in another dataframe.

The following lines of code will combine all items from one Member_number and Date and combine all items from one transaction as one row, with each item, separated by ,.

library(plyr)

transaction_list <- plyr::ddply(basket,c("Member_number","Date"),
                       function(df1)paste(df1$itemDescription,
                       collapse = ","))
colnames(transaction_list) <- c("Member_number","Date","Baskets")
transaction_list

Well, at all we have 14,963 transactions. Each transaction is composed by a set of items viewed in the column Baskets. Each set of items is named a Basket or Couffin. In the following steps we need only the informations in Baskets column. We will save it in a file as csv format (comma separate Values).

# Back-Up the transaction List with Member number and Dates.
transaction_list_bkp <- transaction_list

#set column Member_number of dataframe transaction_list 
transaction_list$Member_number <- NULL
#set column Date of dataframe transaction_list
transaction_list$Date <- NULL
#Rename column to Baskets
colnames(transaction_list) <- c("Baskets")
#Show Dataframe transactionData
transaction_list

Data Mining using arules algorithme

library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
# Write the Baskets list to a file
write.csv(transaction_list,"transactions_list.csv", quote = FALSE, row.names = TRUE)
# load the Baskets list as a basket format using arules package
tr_list <- read.transactions('transactions_list.csv', format = 'basket', sep=',')
## Warning in asMethod(object): removing duplicated items in transactions
tr_list
## transactions in sparse format with
##  14964 transactions (rows) and
##  15131 items (columns)
summary(tr_list)
## transactions as itemMatrix in sparse format with
##  14964 rows (elements/itemsets/transactions) and
##  15131 columns (items) and a density of 0.0002339455 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2363             1827             1646             1453 
##           yogurt          (Other) 
##             1285            44396 
## 
## element (itemset/transaction) length distribution:
## sizes
##     1     2     3     4     5     6     7     8     9    10    11 
##     1   205 10012  2727  1273   338   179   113    96    19     1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    3.00    3.54    4.00   11.00 
## 
## includes extended item information - examples:
##   labels
## 1      1
## 2     10
## 3    100
  • There are 14964 transactions and 15131 purchased items (One item can be purchased multiple times). Each transaction is a collection of items.

  • The Density is 0.0002339455 tells the percentage of non-zero cells in a sparse matrix. It is the total number of items that are purchased and divided by a possible number of items in the matrix.

  • The multiplication of the three values is the total of products were purchased.

round(14964 * 15131 * 0.0002339455)
## [1] 52970
  • The most frequent items are whole milk (2363), other vegetables (1827), rolls/buns (1646), soda (1453), yogurt (1285), other (44396)

  • The element length distribution is formulated as number of products / number of transactions. For example, there are 205 transactions with only 2 items, and 2727 transactions with 4 items. We can get directly the element length distribution by summary(tr_list)@lengths.

summary(tr_list)@lengths
## sizes
##     1     2     3     4     5     6     7     8     9    10    11 
##     1   205 10012  2727  1273   338   179   113    96    19     1

Plot the most 10 frequent purchased items

library(RColorBrewer)
#par(mfrow=c(2,1))
arules::itemFrequencyPlot(tr_list,topN=10,type="absolute",col=brewer.pal(8,'Pastel2'), main="Absolute Product Frequency Plot")

arules::itemFrequencyPlot(tr_list,topN=10,type="relative",col=brewer.pal(8,'Pastel2'), main="Relative Product Frequency Plot")

The option absolute plots numeric frequencies of each item independently. The relative option plots how many times these products have appeared as compared to others.

  • Theses plots shows the 10 most sold items. We can view more by changing the argument topN.

  • Whole milk and Other vegetables are in the top.

  • To boost the business of citrus fruit, we need for example to put it in the way of the whole milk or in the way of other vegetables.

Alternative way to extract the most sold items

library(magrittr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:arules':
## 
##     intersect, recode, setdiff, setequal, union
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
# How many transactions by product, the data will be arranged by descending
basket_plot_products <- basket %>% group_by(itemDescription) %>% dplyr::summarise(N_operations=n()) %>% arrange(desc(N_operations))


## plot the histogram of the best 50th sold  products
ggplot2::ggplot(basket_plot_products[1:10,], aes(x = reorder(itemDescription, -N_operations), y = N_operations)) +      
                                             #aes_string(x = 'itemDescription',  y = 'N_operations')  ## without ordering
 ggplot2::labs(title = "Overview of the first 10th best solds products during 2014-2015",
               #fill = "",
               x= "The products",
              #colour= Date,
               y = "The number of sales by product") +    
      ggplot2::theme(legend.title = element_text( colour="black",
                                                size=11,
                                                face="bold"),
                   title = element_text( size = 12,
                                         face = 'bold'
                   ),
                   text = element_text(size = 10,
                                       face= 'bold'
                   ),
                   axis.text.x=element_text(angle=45, hjust=1),
                   legend.position = "right",
                   legend.direction = "vertical"
    ) +
 ggplot2::geom_bar(stat="identity", na.rm=TRUE)

Applying of APRIORI algorithme

The next step is to mine the rules using the APRIORI algorithm. The function apriori() is from arules package.

We can set and optimize the parameter of the minimum support of 0.0001, the minimum confidence of 0.8, maximum of 10 items (maxlen).

#Min Support as 0.0001, confidence as 0.8 and maximum of 10 products.
association.rules <- arules::apriori(tr_list, parameter = list(supp=0.0001, conf=0.8))  #, maxlen=10
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   1e-04      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 1 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[15131 item(s), 14964 transaction(s)] done [0.02s].
## sorting and recoding items ... [165 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [647 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].

Definition

  • Itemset: Collection of one or more items. K-item-set means a set of k items.

  • Support Count: Frequency of occurrence of an item-set

  • Support(s): Fraction of transactions that contain the item-set

Summary of the associated rules

# summary rules of min Support as 0.0001, confidence as 0.8 and maximum of 10 products.
summary(association.rules)
## set of 647 rules
## 
## rule length distribution (lhs + rhs):sizes
##   3   4   5 
## 135 438  74 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   4.000   3.906   4.000   5.000 
## 
## summary of quality measures:
##     support            confidence          lift             count      
##  Min.   :0.0001337   Min.   :0.8000   Min.   :  5.066   Min.   :2.000  
##  1st Qu.:0.0001337   1st Qu.:1.0000   1st Qu.:  8.191   1st Qu.:2.000  
##  Median :0.0001337   Median :1.0000   Median : 11.645   Median :2.000  
##  Mean   :0.0001373   Mean   :0.9988   Mean   : 19.286   Mean   :2.054  
##  3rd Qu.:0.0001337   3rd Qu.:1.0000   3rd Qu.: 20.387   3rd Qu.:2.000  
##  Max.   :0.0003341   Max.   :1.0000   Max.   :364.976   Max.   :5.000  
## 
## mining info:
##     data ntransactions support confidence
##  tr_list         14964   1e-04        0.8
  • The total number of rules is: 647

  • Distribution of rule length: A length of 4 items has the most rules: 438 and a length of 5 items have the lowest number of rules: 74.

Inspect the top 10 rules

inspect(association.rules[1:10])
##      lhs                                rhs                support     
## [1]  {domestic eggs,rubbing alcohol} => {frankfurter}      0.0001336541
## [2]  {frankfurter,rubbing alcohol}   => {domestic eggs}    0.0001336541
## [3]  {bottled water,cookware}        => {canned beer}      0.0001336541
## [4]  {soap,tropical fruit}           => {whole milk}       0.0001336541
## [5]  {soap,whole milk}               => {tropical fruit}   0.0001336541
## [6]  {domestic eggs,skin care}       => {other vegetables} 0.0001336541
## [7]  {frankfurter,potato products}   => {other vegetables} 0.0001336541
## [8]  {ice cream,prosecco}            => {other vegetables} 0.0001336541
## [9]  {prosecco,waffles}              => {sausage}          0.0001336541
## [10] {prosecco,waffles}              => {other vegetables} 0.0001336541
##      confidence lift      count
## [1]  1          26.484956 2    
## [2]  1          26.962162 2    
## [3]  1          21.316239 2    
## [4]  1           6.332628 2    
## [5]  1          14.757396 2    
## [6]  1           8.190476 2    
## [7]  1           8.190476 2    
## [8]  1           8.190476 2    
## [9]  1          16.571429 2    
## [10] 1           8.190476 2

Interpretation

  • 100% of the customers who bought ‘domestic eggs,rubbing alcohol’ also bought ‘rankfurter’.

  • 100% of the customers who bought ‘prosecco,waffles’ also bought ‘sausage’.

Extract rules from given products

For example, we would like to know what costumer buy before buying canned beer.

beer.association.rules <- apriori(tr_list, parameter = list(supp=0.0001, conf=0.8),appearance = list(default="lhs",rhs="canned beer"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   1e-04      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 1 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[15131 item(s), 14964 transaction(s)] done [0.02s].
## sorting and recoding items ... [165 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [13 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].

Inspect the top 10 rules

# lhs= canned beer because you want to find out the probability of that in how many customers buy canned beer along with other items
inspect(head(beer.association.rules))
##     lhs                   rhs                support confidence     lift count
## [1] {bottled water,                                                           
##      cookware}         => {canned beer} 0.0001336541          1 21.31624     2
## [2] {cat food,                                                                
##      dishes}           => {canned beer} 0.0001336541          1 21.31624     2
## [3] {ham,                                                                     
##      salty snack}      => {canned beer} 0.0001336541          1 21.31624     2
## [4] {sausage,                                                                 
##      whole milk,                                                              
##      zwieback}         => {canned beer} 0.0001336541          1 21.31624     2
## [5] {brown bread,                                                             
##      cake bar,                                                                
##      sausage}          => {canned beer} 0.0001336541          1 21.31624     2
## [6] {chicken,                                                                 
##      hygiene articles,                                                        
##      whole milk}       => {canned beer} 0.0001336541          1 21.31624     2

Interpretation

  • Canned beer (RHS) was Bought after all sub-baskets (with two items) in the LHS column.

Visualizing Association Rules

A straight-forward visualization of association rules is to use a scatter plot using plot() of the arulesViz package. It uses Support and Confidence on the axes.

Visualize the rules

scatter plot with 647 rules

library(arulesViz)
## Loading required package: grid
# Filter rules with confidence greater than 0.4 or 40%
subRules<-association.rules[quality(association.rules)$confidence>0.4]
#Plot SubRules
plot(subRules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

The above plot shows that rules with high lift have low support. We can use the following options:

Two-key Plot

plot(subRules,method="two-key plot")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

+ The two-key plot uses support and confidence on x and y-axis respectively.

  • It uses order for coloring. The order is the number of products in the rule.

Interactive visualisation of the Rules

#top10subRules <- head(subRules, n = 10, by = "confidence")
plot(subRules[1:20], method = "graph",  engine = "htmlwidget", control = list(type = "items"))
## Warning: Unknown control parameters: type
## Available control parameters (with default values):
## itemCol   =  #CBD2FC
## nodeCol   =  c("#EE0000", "#EE0303", "#EE0606", "#EE0909", "#EE0C0C", "#EE0F0F", "#EE1212", "#EE1515", "#EE1818", "#EE1B1B", "#EE1E1E", "#EE2222", "#EE2525", "#EE2828", "#EE2B2B", "#EE2E2E", "#EE3131", "#EE3434", "#EE3737", "#EE3A3A", "#EE3D3D", "#EE4040", "#EE4444", "#EE4747", "#EE4A4A", "#EE4D4D", "#EE5050", "#EE5353", "#EE5656", "#EE5959", "#EE5C5C", "#EE5F5F", "#EE6262", "#EE6666", "#EE6969", "#EE6C6C", "#EE6F6F", "#EE7272", "#EE7575", "#EE7878", "#EE7B7B", "#EE7E7E", "#EE8181", "#EE8484", "#EE8888", "#EE8B8B",  "#EE8E8E", "#EE9191", "#EE9494", "#EE9797", "#EE9999", "#EE9B9B", "#EE9D9D", "#EE9F9F", "#EEA0A0", "#EEA2A2", "#EEA4A4", "#EEA5A5", "#EEA7A7", "#EEA9A9", "#EEABAB", "#EEACAC", "#EEAEAE", "#EEB0B0", "#EEB1B1", "#EEB3B3", "#EEB5B5", "#EEB7B7", "#EEB8B8", "#EEBABA", "#EEBCBC", "#EEBDBD", "#EEBFBF", "#EEC1C1", "#EEC3C3", "#EEC4C4", "#EEC6C6", "#EEC8C8", "#EEC9C9", "#EECBCB", "#EECDCD", "#EECFCF", "#EED0D0", "#EED2D2", "#EED4D4", "#EED5D5", "#EED7D7", "#EED9D9", "#EEDBDB", "#EEDCDC", "#EEDEDE", "#EEE0E0",  "#EEE1E1", "#EEE3E3", "#EEE5E5", "#EEE7E7", "#EEE8E8", "#EEEAEA", "#EEECEC", "#EEEEEE")
## precision     =  3
## igraphLayout  =  layout_nicely
## interactive   =  TRUE
## engine    =  visNetwork
## max   =  100
## selection_menu    =  TRUE
## degree_highlight  =  1
## verbose   =  FALSE

The Arrows pointing from products to rule vertices indicate LHS products and an arrow from a rule to an product indicates the RHS.

Interprettaion + The graph shows taht all rules or itineraries focus or whome milk and Vegetables.

  • The graph shows also the people who buy tea and frozens vegetables also buy cat food. We should place these in an aisle together.

Individual Rule Representation

The Parallel Coordinates Plot is useful to visualized which items were purchased with an other one.

# Filter top 10 rules with highest lift
#subRules2<-head(subRules, n=16, by="lift")
plot(subRules[1:10], method="paracoord", control = list(reorder = TRUE))

Interpretation

Conclusion