The goal of this exercice is to decipher relevant knowledges from Baskets DataSet and boost the business by positionning the items.
# get the structure of the dataset
str(basket)
## 'data.frame': 38765 obs. of 3 variables:
## $ Member_number : int 1808 2552 2300 1187 3037 4941 4501 3803 2762 4119 ...
## $ Date : Factor w/ 728 levels "01-01-2014","01-01-2015",..: 494 98 450 288 4 316 178 552 462 268 ...
## $ itemDescription: Factor w/ 167 levels "abrasive cleaner",..: 156 165 109 102 165 122 102 112 165 156 ...
The dataset is a dataframe with 3 columns: Member_number, Date, and Products description. The Member_number
column must be considered as factor, like the id of member. The Date
must be a date type. The itemDascription
is already as.factor
library(tibble)
basket$Member_number <- as.factor(basket$Member_number)
basket$Date <- as.Date(basket$Date, "%d-%m-%Y")
# In the case if we have the transaction date as 2010-12-01 08:26:00
# We can extract time from the column and store in another variable:
# TransTime<- format(basket$Date,"%H:%M:%S")
# This line create an other column with Date YMD.
# convert the dataframe as a tibble
basket <- tibble::as.tibble(basket)
# review the structure and summary
str(basket)
## Classes 'tbl_df', 'tbl' and 'data.frame': 38765 obs. of 3 variables:
## $ Member_number : Factor w/ 3898 levels "1000","1001",..: 787 1505 1264 181 1982 3839 3407 2732 1712 3037 ...
## $ Date : Date, format: "2015-07-21" "2015-01-05" ...
## $ itemDescription: Factor w/ 167 levels "abrasive cleaner",..: 156 165 109 102 165 122 102 112 165 156 ...
summary(basket)
## Member_number Date itemDescription
## 3180 : 36 Min. :2014-01-01 whole milk : 2502
## 2051 : 33 1st Qu.:2014-07-12 other vegetables: 1898
## 3050 : 33 Median :2015-01-21 rolls/buns : 1716
## 3737 : 33 Mean :2015-01-09 soda : 1514
## 2271 : 31 3rd Qu.:2015-07-10 yogurt : 1334
## 2433 : 31 Max. :2015-12-30 root vegetables : 1071
## (Other):38568 (Other) :28730
basket
Now is better formated. We can observe:
The Member_number 3180
has the biggest transactions (36).
The Whole milk
is the most sold with 2502 transactions.
The sampling data was done during two years, between 2014-01-01 and 2015-12-30.
Before to start the exploration, It is important to check if the dataset countain empty cells or non available data.
# Ckeck for Non Available (NA) cell
all(is.na(basket))
## [1] FALSE
# FALSE
# otherwise we can use #complete.cases(data) will return a logical vector indicating which rows have no missing values.
#Then use the vector to get only rows that are complete using basket[,].
basket <- basket[complete.cases(basket), ]
basket
# get a glimpse of the data
tibble::glimpse(basket)
## Observations: 38,765
## Variables: 3
## $ Member_number <fct> 1808, 2552, 2300, 1187, 3037, 4941, 4501, 3803...
## $ Date <date> 2015-07-21, 2015-01-05, 2015-09-19, 2015-12-1...
## $ itemDescription <fct> tropical fruit, whole milk, pip fruit, other v...
The dataset has three variables that describe the transaction
of multiple costumers
that bought mutiple items
recurrently during 2 years.
We need to determine how many costumer
.
We need to determine how many items
available in this dataset.
We need to know if is there multiple items that purchased together (forming a basket or couffin) for each costumer.
Please, find the exploratory steps in Drafting section.
This dataset describes the transactions done in supermarket during two years. The Market Basket Analysis (MBA) uses Association Rule
to extract knowledge about the items, and costumers. The goal of this exercice is to use this data to boost the business. The idea is to change for example the layout of the physical store or rather an online store. For example, put particular product with bad sold rate near to a product with a good rate of sold to boost the business.
What we need to do is to group data by Member_number
, and Date
. We need this grouping and apply a function on it and store the output in another dataframe.
The following lines of code will combine all items from one Member_number
and Date
and combine all items from one transaction as one row, with each item, separated by ,
.
library(plyr)
transaction_list <- plyr::ddply(basket,c("Member_number","Date"),
function(df1)paste(df1$itemDescription,
collapse = ","))
colnames(transaction_list) <- c("Member_number","Date","Baskets")
transaction_list
Well, at all we have 14,963
transactions. Each transaction is composed by a set of items viewed in the column Baskets
. Each set of items is named a Basket
or Couffin
. In the following steps we need only the informations in Baskets
column. We will save it in a file as csv format (comma separate Values).
# Back-Up the transaction List with Member number and Dates.
transaction_list_bkp <- transaction_list
#set column Member_number of dataframe transaction_list
transaction_list$Member_number <- NULL
#set column Date of dataframe transaction_list
transaction_list$Date <- NULL
#Rename column to Baskets
colnames(transaction_list) <- c("Baskets")
#Show Dataframe transactionData
transaction_list
arules
algorithmelibrary(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
# Write the Baskets list to a file
write.csv(transaction_list,"transactions_list.csv", quote = FALSE, row.names = TRUE)
# load the Baskets list as a basket format using arules package
tr_list <- read.transactions('transactions_list.csv', format = 'basket', sep=',')
## Warning in asMethod(object): removing duplicated items in transactions
tr_list
## transactions in sparse format with
## 14964 transactions (rows) and
## 15131 items (columns)
summary(tr_list)
## transactions as itemMatrix in sparse format with
## 14964 rows (elements/itemsets/transactions) and
## 15131 columns (items) and a density of 0.0002339455
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2363 1827 1646 1453
## yogurt (Other)
## 1285 44396
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11
## 1 205 10012 2727 1273 338 179 113 96 19 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 3.00 3.54 4.00 11.00
##
## includes extended item information - examples:
## labels
## 1 1
## 2 10
## 3 100
There are 14964
transactions and 15131
purchased items (One item can be purchased multiple times). Each transaction is a collection of items.
The Density is 0.0002339455
tells the percentage of non-zero cells in a sparse matrix. It is the total number of items that are purchased and divided by a possible number of items in the matrix.
The multiplication of the three values is the total of products were purchased.
round(14964 * 15131 * 0.0002339455)
## [1] 52970
The most frequent items are whole milk (2363), other vegetables (1827), rolls/buns (1646), soda (1453), yogurt (1285), other (44396)
The element length distribution is formulated as number of products
/ number of transactions
. For example, there are 205 transactions with only 2 items, and 2727 transactions with 4 items. We can get directly the element length distribution by summary(tr_list)@lengths
.
summary(tr_list)@lengths
## sizes
## 1 2 3 4 5 6 7 8 9 10 11
## 1 205 10012 2727 1273 338 179 113 96 19 1
library(RColorBrewer)
#par(mfrow=c(2,1))
arules::itemFrequencyPlot(tr_list,topN=10,type="absolute",col=brewer.pal(8,'Pastel2'), main="Absolute Product Frequency Plot")
arules::itemFrequencyPlot(tr_list,topN=10,type="relative",col=brewer.pal(8,'Pastel2'), main="Relative Product Frequency Plot")
The option absolute
plots numeric frequencies of each item independently. The relative
option plots how many times these products have appeared as compared to others.
Theses plots shows the 10 most sold items. We can view more by changing the argument topN
.
Whole milk and Other vegetables are in the top.
To boost the business of citrus fruit, we need for example to put it in the way of the whole milk or in the way of other vegetables.
library(magrittr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:arules':
##
## intersect, recode, setdiff, setequal, union
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
# How many transactions by product, the data will be arranged by descending
basket_plot_products <- basket %>% group_by(itemDescription) %>% dplyr::summarise(N_operations=n()) %>% arrange(desc(N_operations))
## plot the histogram of the best 50th sold products
ggplot2::ggplot(basket_plot_products[1:10,], aes(x = reorder(itemDescription, -N_operations), y = N_operations)) +
#aes_string(x = 'itemDescription', y = 'N_operations') ## without ordering
ggplot2::labs(title = "Overview of the first 10th best solds products during 2014-2015",
#fill = "",
x= "The products",
#colour= Date,
y = "The number of sales by product") +
ggplot2::theme(legend.title = element_text( colour="black",
size=11,
face="bold"),
title = element_text( size = 12,
face = 'bold'
),
text = element_text(size = 10,
face= 'bold'
),
axis.text.x=element_text(angle=45, hjust=1),
legend.position = "right",
legend.direction = "vertical"
) +
ggplot2::geom_bar(stat="identity", na.rm=TRUE)
APRIORI
algorithmeThe next step is to mine the rules using the APRIORI
algorithm. The function apriori()
is from arules
package.
We can set and optimize the parameter of the minimum support
of 0.0001, the minimum confidence
of 0.8, maximum of 10 items
(maxlen).
#Min Support as 0.0001, confidence as 0.8 and maximum of 10 products.
association.rules <- arules::apriori(tr_list, parameter = list(supp=0.0001, conf=0.8)) #, maxlen=10
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 1e-04 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[15131 item(s), 14964 transaction(s)] done [0.02s].
## sorting and recoding items ... [165 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [647 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
Definition
Itemset: Collection of one or more items. K-item-set means a set of k items.
Support Count: Frequency of occurrence of an item-set
Support(s): Fraction of transactions that contain the item-set
# summary rules of min Support as 0.0001, confidence as 0.8 and maximum of 10 products.
summary(association.rules)
## set of 647 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5
## 135 438 74
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 3.906 4.000 5.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.0001337 Min. :0.8000 Min. : 5.066 Min. :2.000
## 1st Qu.:0.0001337 1st Qu.:1.0000 1st Qu.: 8.191 1st Qu.:2.000
## Median :0.0001337 Median :1.0000 Median : 11.645 Median :2.000
## Mean :0.0001373 Mean :0.9988 Mean : 19.286 Mean :2.054
## 3rd Qu.:0.0001337 3rd Qu.:1.0000 3rd Qu.: 20.387 3rd Qu.:2.000
## Max. :0.0003341 Max. :1.0000 Max. :364.976 Max. :5.000
##
## mining info:
## data ntransactions support confidence
## tr_list 14964 1e-04 0.8
The total number of rules is: 647
Distribution of rule length: A length of 4 items has the most rules: 438 and a length of 5 items have the lowest number of rules: 74.
inspect(association.rules[1:10])
## lhs rhs support
## [1] {domestic eggs,rubbing alcohol} => {frankfurter} 0.0001336541
## [2] {frankfurter,rubbing alcohol} => {domestic eggs} 0.0001336541
## [3] {bottled water,cookware} => {canned beer} 0.0001336541
## [4] {soap,tropical fruit} => {whole milk} 0.0001336541
## [5] {soap,whole milk} => {tropical fruit} 0.0001336541
## [6] {domestic eggs,skin care} => {other vegetables} 0.0001336541
## [7] {frankfurter,potato products} => {other vegetables} 0.0001336541
## [8] {ice cream,prosecco} => {other vegetables} 0.0001336541
## [9] {prosecco,waffles} => {sausage} 0.0001336541
## [10] {prosecco,waffles} => {other vegetables} 0.0001336541
## confidence lift count
## [1] 1 26.484956 2
## [2] 1 26.962162 2
## [3] 1 21.316239 2
## [4] 1 6.332628 2
## [5] 1 14.757396 2
## [6] 1 8.190476 2
## [7] 1 8.190476 2
## [8] 1 8.190476 2
## [9] 1 16.571429 2
## [10] 1 8.190476 2
Interpretation
100% of the customers who bought ‘domestic eggs,rubbing alcohol’ also bought ‘rankfurter’.
100% of the customers who bought ‘prosecco,waffles’ also bought ‘sausage’.
For example, we would like to know what costumer buy before buying canned beer
.
beer.association.rules <- apriori(tr_list, parameter = list(supp=0.0001, conf=0.8),appearance = list(default="lhs",rhs="canned beer"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 1e-04 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[15131 item(s), 14964 transaction(s)] done [0.02s].
## sorting and recoding items ... [165 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [13 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
# lhs= canned beer because you want to find out the probability of that in how many customers buy canned beer along with other items
inspect(head(beer.association.rules))
## lhs rhs support confidence lift count
## [1] {bottled water,
## cookware} => {canned beer} 0.0001336541 1 21.31624 2
## [2] {cat food,
## dishes} => {canned beer} 0.0001336541 1 21.31624 2
## [3] {ham,
## salty snack} => {canned beer} 0.0001336541 1 21.31624 2
## [4] {sausage,
## whole milk,
## zwieback} => {canned beer} 0.0001336541 1 21.31624 2
## [5] {brown bread,
## cake bar,
## sausage} => {canned beer} 0.0001336541 1 21.31624 2
## [6] {chicken,
## hygiene articles,
## whole milk} => {canned beer} 0.0001336541 1 21.31624 2
Interpretation
A straight-forward visualization of association rules is to use a scatter plot using plot() of the arulesViz
package. It uses Support
and Confidence
on the axes.
library(arulesViz)
## Loading required package: grid
# Filter rules with confidence greater than 0.4 or 40%
subRules<-association.rules[quality(association.rules)$confidence>0.4]
#Plot SubRules
plot(subRules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
The above plot shows that rules with high lift have low support. We can use the following options:
plot(subRules,method="two-key plot")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
+ The two-key plot uses support and confidence on x and y-axis respectively.
#top10subRules <- head(subRules, n = 10, by = "confidence")
plot(subRules[1:20], method = "graph", engine = "htmlwidget", control = list(type = "items"))
## Warning: Unknown control parameters: type
## Available control parameters (with default values):
## itemCol = #CBD2FC
## nodeCol = c("#EE0000", "#EE0303", "#EE0606", "#EE0909", "#EE0C0C", "#EE0F0F", "#EE1212", "#EE1515", "#EE1818", "#EE1B1B", "#EE1E1E", "#EE2222", "#EE2525", "#EE2828", "#EE2B2B", "#EE2E2E", "#EE3131", "#EE3434", "#EE3737", "#EE3A3A", "#EE3D3D", "#EE4040", "#EE4444", "#EE4747", "#EE4A4A", "#EE4D4D", "#EE5050", "#EE5353", "#EE5656", "#EE5959", "#EE5C5C", "#EE5F5F", "#EE6262", "#EE6666", "#EE6969", "#EE6C6C", "#EE6F6F", "#EE7272", "#EE7575", "#EE7878", "#EE7B7B", "#EE7E7E", "#EE8181", "#EE8484", "#EE8888", "#EE8B8B", "#EE8E8E", "#EE9191", "#EE9494", "#EE9797", "#EE9999", "#EE9B9B", "#EE9D9D", "#EE9F9F", "#EEA0A0", "#EEA2A2", "#EEA4A4", "#EEA5A5", "#EEA7A7", "#EEA9A9", "#EEABAB", "#EEACAC", "#EEAEAE", "#EEB0B0", "#EEB1B1", "#EEB3B3", "#EEB5B5", "#EEB7B7", "#EEB8B8", "#EEBABA", "#EEBCBC", "#EEBDBD", "#EEBFBF", "#EEC1C1", "#EEC3C3", "#EEC4C4", "#EEC6C6", "#EEC8C8", "#EEC9C9", "#EECBCB", "#EECDCD", "#EECFCF", "#EED0D0", "#EED2D2", "#EED4D4", "#EED5D5", "#EED7D7", "#EED9D9", "#EEDBDB", "#EEDCDC", "#EEDEDE", "#EEE0E0", "#EEE1E1", "#EEE3E3", "#EEE5E5", "#EEE7E7", "#EEE8E8", "#EEEAEA", "#EEECEC", "#EEEEEE")
## precision = 3
## igraphLayout = layout_nicely
## interactive = TRUE
## engine = visNetwork
## max = 100
## selection_menu = TRUE
## degree_highlight = 1
## verbose = FALSE
The Arrows pointing from products to rule vertices indicate LHS
products and an arrow from a rule to an product indicates the RHS
.
Interprettaion + The graph shows taht all rules or itineraries focus or whome milk
and Vegetables
.
tea
and frozens vegetables
also buy cat food
. We should place these in an aisle together.The Parallel Coordinates Plot
is useful to visualized which items were purchased with an other one.
# Filter top 10 rules with highest lift
#subRules2<-head(subRules, n=16, by="lift")
plot(subRules[1:10], method="paracoord", control = list(reorder = TRUE))
Interpretation
The plot shows the itinerary of the costumers during the shopping.
The top 10 rules show us that when costumers have whole milk
and soap
, It will be highly likely that they buy tropical fruit
.
Each arrow corresponds to a rule.
There are 4 rules that coverge to other vegetables
.
The visualizing graphs and rules, help us to make decisions for the positioning of items in the supermarket.
Now, we would place Whole Milk
and Vegetables
beside each other; tea
and frozens vegetables
also buy cat food
along side too.