The goal of this exercice is to decipher relevant knowledge from Basket DataSet.
# get the structure of the dataset
str(basket)
## 'data.frame': 38765 obs. of 3 variables:
## $ Member_number : int 1808 2552 2300 1187 3037 4941 4501 3803 2762 4119 ...
## $ Date : Factor w/ 728 levels "01-01-2014","01-01-2015",..: 494 98 450 288 4 316 178 552 462 268 ...
## $ itemDescription: Factor w/ 167 levels "abrasive cleaner",..: 156 165 109 102 165 122 102 112 165 156 ...
summary(basket)
## Member_number Date itemDescription
## Min. :1000 21-01-2015: 96 whole milk : 2502
## 1st Qu.:2002 21-07-2015: 93 other vegetables: 1898
## Median :3005 08-08-2015: 92 rolls/buns : 1716
## Mean :3004 29-11-2015: 92 soda : 1514
## 3rd Qu.:4007 30-04-2015: 91 yogurt : 1334
## Max. :5000 26-03-2015: 88 root vegetables : 1071
## (Other) :38213 (Other) :28730
The dataset is a dataframe with 3 columns: Member_number, Date, and Products description. The Member_number
column must be considered as factor, like the id of member. The Date
must be a date type. The itemDascription
is already as.factor
library(magrittr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
basket$Member_number <- as.factor(basket$Member_number)
basket$Date <- as.Date(basket$Date, "%d-%m-%Y")
# review the structure and summary
str(basket)
## 'data.frame': 38765 obs. of 3 variables:
## $ Member_number : Factor w/ 3898 levels "1000","1001",..: 787 1505 1264 181 1982 3839 3407 2732 1712 3037 ...
## $ Date : Date, format: "2015-07-21" "2015-01-05" ...
## $ itemDescription: Factor w/ 167 levels "abrasive cleaner",..: 156 165 109 102 165 122 102 112 165 156 ...
summary(basket)
## Member_number Date itemDescription
## 3180 : 36 Min. :2014-01-01 whole milk : 2502
## 2051 : 33 1st Qu.:2014-07-12 other vegetables: 1898
## 3050 : 33 Median :2015-01-21 rolls/buns : 1716
## 3737 : 33 Mean :2015-01-09 soda : 1514
## 2271 : 31 3rd Qu.:2015-07-10 yogurt : 1334
## 2433 : 31 Max. :2015-12-30 root vegetables : 1071
## (Other):38568 (Other) :28730
Now is better formated. We can observe:
The Member 3180
has the biggest operations (36).
The Whole milk
product is the most sold with 2502 operations.
The sampling data was done during two years, between 2014-01-01 and 2015-12-30.
Before to start the exploration, It is important to check if the dataset countain empty cells or non available data.
# Ckeck for Non Available (NA) cell
all(is.na(basket))
## [1] FALSE
There is an intersting package named funModeling
that make eeasy the exploration of Data analysis. We will use it.
library(funModeling)
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
## funModeling v.1.6.8 :)
## Examples and tutorials at livebook.datascienceheroes.com
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble 1.4.2 ✔ purrr 0.2.5
## ✔ tidyr 0.8.2 ✔ stringr 1.3.1
## ✔ readr 1.2.1 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::extract() masks magrittr::extract()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::set_names() masks magrittr::set_names()
## ✖ Hmisc::src() masks dplyr::src()
## ✖ Hmisc::summarize() masks dplyr::summarize()
library(Hmisc)
tibble::glimpse(basket)
## Observations: 38,765
## Variables: 3
## $ Member_number <fct> 1808, 2552, 2300, 1187, 3037, 4941, 4501, 3803...
## $ Date <date> 2015-07-21, 2015-01-05, 2015-09-19, 2015-12-1...
## $ itemDescription <fct> tropical fruit, whole milk, pip fruit, other v...
glimpse
function returns informations like str
.
funModeling::df_status(basket)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 Member_number 0 0 0 0 0 0 factor 3898
## 2 Date 0 0 0 0 0 0 Date 728
## 3 itemDescription 0 0 0 0 0 0 factor 167
df_statut
allows us to detect abnormal values and if is there non available or empty cells.
#freq(basket)
library(Hmisc)
Hmisc::describe(basket)
## basket
##
## 3 Variables 38765 Observations
## ---------------------------------------------------------------------------
## Member_number
## n missing distinct
## 38765 0 3898
##
## lowest : 1000 1001 1002 1003 1004, highest: 4996 4997 4998 4999 5000
## ---------------------------------------------------------------------------
## Date
## n missing distinct
## 38765 0 728
##
## lowest : 2014-01-01 2014-01-02 2014-01-03 2014-01-04 2014-01-05
## highest: 2015-12-26 2015-12-27 2015-12-28 2015-12-29 2015-12-30
## ---------------------------------------------------------------------------
## itemDescription
## n missing distinct
## 38765 0 167
##
## lowest : abrasive cleaner artif. sweetener baby cosmetics bags baking powder
## highest: white bread white wine whole milk yogurt zwieback
## ---------------------------------------------------------------------------
At this step it is importante to know the number of Members and the number of products.
library(magrittr)
# How many member
base::table(basket$Member_number) %>% length()
## [1] 3898
# 3898
# How many operations
length(basket$Member_number)
## [1] 38765
# 38765
# How many products
base::table(basket$itemDescription) %>% length()
## [1] 167
# 167
For all we have 3898 members and 167 products, and 38765 operations. This means that several members made several operations. How many operation per Members? And How many operations per product?
library(dplyr)
# How many transactions by Member, the data will be arranged by descending
basket %>% group_by(Member_number) %>% arrange(Member_number) %>% dplyr::summarise(N_operation=n()) %>% arrange(desc(N_operation))
#basket_plot_operations <- basket %>% group_by(Member_number, Date) %>% summarise(N_operation=n()) %>% arrange(desc(N_operation))
#ggplot2::ggplot(basket_plot_operations[1:100,], aes_string(x = 'Date', y = 'N_operation')) +
# ggplot2::geom_bar(stat="identity", na.rm=TRUE)
This table confirm that the Member 3180
ahs the biggest frequency of operations 36
. We can see the first ten important Members which got the biggest numbers of operations.
# How many Operations were done by product during 2 Years. The data will be arranged by descending
basket %>% group_by(itemDescription) %>% dplyr::summarise(N_operations=n()) %>% arrange(desc(N_operations))
This table shows arranges by descending the most important sold products.
We can visualize the first 50th best solds as following:
# How many transactions by product, the data will be arranged by descending
basket_plot_products <- basket %>% group_by(itemDescription) %>% dplyr::summarise(N_operations=n()) %>% arrange(desc(N_operations))
## plot the histogram of the best 50th sold products
ggplot2::ggplot(basket_plot_products[1:50,], aes(x = reorder(itemDescription, -N_operations), y = N_operations)) +
#aes_string(x = 'itemDescription', y = 'N_operations') ## without ordering
ggplot2::labs(title = "Overview of the first 50th best solds products during 2014-2015",
#fill = "",
x= "The products",
#colour= Date,
y = "The number of sales by product") +
ggplot2::theme(legend.title = element_text( colour="black",
size=11,
face="bold"),
title = element_text( size = 12,
face = 'bold'
),
text = element_text(size = 10,
face= 'bold'
),
axis.text.x=element_text(angle=45, hjust=1),
legend.position = "right",
legend.direction = "vertical"
) +
ggplot2::geom_bar(stat="identity", na.rm=TRUE)
# Monitoring the operations number were done per days, for all products during 2 years. The data will be arranged by descending
basket_plot_date <- basket %>% group_by(Date) %>% dplyr::summarise(N_operations=n()) # %>% arrange(desc(N_operations))
attach(basket_plot_date )
basket_plot_date %>%
ggplot(aes(x = Date, y = N_operations)) +
geom_point(color = "darkorchid4") +
geom_smooth(method = 'loess')+
labs(title = "Total Operations per day during the all period (2 years)",
subtitle = "Each value is the sum of all operations for all Members",
y = "Number of Operations",
x = "Date") + theme_bw(base_size = 15)
#basket_plot_date$months <- as.factor(months(basket_plot_date$Date))
basket_plot_date$months <- as.factor(format(as.Date(basket_plot_date$Date), "%y-%m"))
#par(mfrow=c(2,1))
plot( basket_plot_date$Date, basket_plot_date$N_operations, ylab = "N° Transactions", xlab = "Date")
fit_day <- lm(N_operations ~ Date,data= basket_plot_date)
lines(basket_plot_date$Date, fit_day$fitted.values)
plot(basket_plot_date$months, basket_plot_date$N_operations, ylab = "N° Transactions", xlab = "Months")
fit_month <- lm(N_operations ~ months,data= basket_plot_date)
summary(fit_day)
##
## Call:
## lm(formula = N_operations ~ Date, data = basket_plot_date)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.109 -8.641 -0.864 8.467 42.505
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.394e+02 3.668e+01 -3.800 0.000157 ***
## Date 1.172e-02 2.232e-03 5.252 1.98e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.68 on 726 degrees of freedom
## Multiple R-squared: 0.0366, Adjusted R-squared: 0.03527
## F-statistic: 27.58 on 1 and 726 DF, p-value: 1.981e-07
summary(fit_month)
##
## Call:
## lm(formula = N_operations ~ months, data = basket_plot_date)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.548 -8.917 -0.483 7.982 38.032
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.2581 2.2490 21.903 < 2e-16 ***
## months14-02 2.0634 3.2646 0.632 0.52756
## months14-03 -3.7419 3.1805 -1.177 0.23978
## months14-04 2.7753 3.2069 0.865 0.38711
## months14-05 2.8387 3.1805 0.893 0.37241
## months14-06 3.0753 3.2069 0.959 0.33791
## months14-07 1.5806 3.1805 0.497 0.61936
## months14-08 1.5484 3.1805 0.487 0.62653
## months14-09 -0.1914 3.2069 -0.060 0.95242
## months14-10 2.0645 3.1805 0.649 0.51647
## months14-11 -0.2914 3.2069 -0.091 0.92762
## months14-12 -0.1581 3.2069 -0.049 0.96070
## months15-01 8.7097 3.1805 2.738 0.00633 **
## months15-02 6.4562 3.2646 1.978 0.04836 *
## months15-03 6.2903 3.1805 1.978 0.04834 *
## months15-04 7.3753 3.2069 2.300 0.02175 *
## months15-05 8.5806 3.1805 2.698 0.00715 **
## months15-06 7.2086 3.2069 2.248 0.02489 *
## months15-07 6.3548 3.1805 1.998 0.04609 *
## months15-08 12.7097 3.1805 3.996 7.12e-05 ***
## months15-09 3.6419 3.2069 1.136 0.25649
## months15-10 4.6129 3.1805 1.450 0.14740
## months15-11 10.2419 3.2069 3.194 0.00147 **
## months15-12 1.9419 3.2069 0.606 0.54501
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.52 on 704 degrees of freedom
## Multiple R-squared: 0.08919, Adjusted R-squared: 0.05943
## F-statistic: 2.997 on 23 and 704 DF, p-value: 3.837e-06
fit_day$coeff
## (Intercept) Date
## -139.3909717 0.0117213
fit_month$coefficients
## (Intercept) months14-02 months14-03 months14-04 months14-05 months14-06
## 49.2580645 2.0633641 -3.7419355 2.7752688 2.8387097 3.0752688
## months14-07 months14-08 months14-09 months14-10 months14-11 months14-12
## 1.5806452 1.5483871 -0.1913978 2.0645161 -0.2913978 -0.1580645
## months15-01 months15-02 months15-03 months15-04 months15-05 months15-06
## 8.7096774 6.4562212 6.2903226 7.3752688 8.5806452 7.2086022
## months15-07 months15-08 months15-09 months15-10 months15-11 months15-12
## 6.3548387 12.7096774 3.6419355 4.6129032 10.2419355 1.9419355
# library(tidyverse)
# basket <- read.csv("basket_bkp.csv", header = TRUE, sep = ",")
# basket$X <- NULL
# basket$Member_number <-as.factor(basket$Member_number)
# basket$Date <- as.Date(basket$Date, "%d-%m-%Y")
#
# basket.new <- mutate(basket, variable = rownames(basket)) %>%
# tidyr::gather(basket, Date, itemDescription)
## Determine the most 10 sold products
basket_products <- basket %>% group_by(itemDescription) %>% dplyr::summarise(N_operations=n()) %>% arrange(desc(N_operations))
basket_products10 <- as.data.frame(basket_products[1:10,])
basket %>% filter(itemDescription =="whole milk")
## Regroup all sold products in December 2015 (add index for each operation)
#basket %>% #mutate(index = as.factor(seq.int(nrow(basket)))) %>%
#mutate(Month = format(Date, "%B"), Year = format(Date, "%Y")) %>% arrange(desc(Date)) %>% filter(Month == "December", #Year == "2015") %>% dplyr::mutate(N = dplyr::summarise( n()))
## add a new column with Month_Year
basket.Year_Month <- basket %>% mutate(Year_Month = substr(basket$Date, 1,7))
# basket <- read.csv("basket_bkp.csv", header = TRUE, sep = ",")
# basket$X <- NULL
# basket %>% mutate(Member_number = as.factor(Member_number))
# basket$Date <- as.Date(basket$Date, "%d-%m-%Y")
#basket$Year_Month <- zoo::as.yearmon(basket$Year_Month)
basket.Year_Month%>% dplyr::group_by(Year_Month) %>%
dplyr::summarise(N_Operations=n())
# Plot a subset of the data
#attach(Operations_Per_Month)
#ggplot(data = Operations_Per_Month, aes_string(x = zoo::as.yearmon(Year_Month), y = N_operations)) +
# geom_line(color = "#FC4E07")
basket %>% mutate(index = as.factor(seq.int(nrow(basket))))
basket_products <- basket %>% group_by(itemDescription) %>% dplyr::summarise(N_operations=n()) %>% arrange(desc(N_operations))
basket_products[1:10,]
The idea is to associate products per basket for each Member.
## group operations by Member and filter it by date
basket %>% group_by(Member_number) %>% arrange(Member_number) %>% filter(Date=="2015-05-27")
When We explore manually the table, we can see that some products sold together in the same date for the same Member. The response is: YES INITIATIVE TO PREDICT RULES BETWEEN PRODUCTS SOLD IN THE SAME TIME (BASKET: a group of products) IN SUPERMARKET FOR EXAMPLE.
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following object is masked from 'package:purrr':
##
## compact
## The following objects are masked from 'package:Hmisc':
##
## is.discrete, summarize
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
#ddply(dataframe, variables_to_be_used_to_split_data_frame, function_to_be_applied)
transactionData_collapse_product <- plyr::ddply(basket,c("Member_number", "Date"),
function(basket)paste(basket$itemDescription,
collapse = ","))
transactionData_collapse_product
# remove Customer ID as we do not need it
transactionData_collapse_product$Member_number <- NULL
# remove Date as we do not need it
transactionData_collapse_product$Date <- NULL
colnames(transactionData_collapse_product) <- c("Products")
write.csv(transactionData_collapse_product, file = "transaction.csv", quote = FALSE, row.names = FALSE)
transactionData_collapse_product
Load transaction.csv
to R, we will use read.transactions() function from Arules
package this time.
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
##
## expand
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
trans_basket <- read.transactions('transaction.csv', format = 'basket', sep=',') #, rm.duplicates=TRUE
## Warning in asMethod(object): removing duplicated items in transactions
str(trans_basket)
## Formal class 'transactions' [package "arules"] with 3 slots
## ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
## .. .. ..@ i : int [1:38007] 115 104 128 165 130 132 165 166 107 138 ...
## .. .. ..@ p : int [1:14965] 0 1 4 8 10 12 14 17 19 22 ...
## .. .. ..@ Dim : int [1:2] 168 14964
## .. .. ..@ Dimnames:List of 2
## .. .. .. ..$ : NULL
## .. .. .. ..$ : NULL
## .. .. ..@ factors : list()
## ..@ itemInfo :'data.frame': 168 obs. of 1 variable:
## .. ..$ labels: chr [1:168] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "bags" ...
## ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
trans_basket
## transactions in sparse format with
## 14964 transactions (rows) and
## 168 items (columns)
summary(trans_basket) # call in summary to check our data
## transactions as itemMatrix in sparse format with
## 14964 rows (elements/itemsets/transactions) and
## 168 columns (items) and a density of 0.01511843
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2363 1827 1646 1453
## yogurt (Other)
## 1285 29433
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10
## 206 10012 2727 1273 338 179 113 96 19 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 2.00 2.54 3.00 10.00
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
We can explore the data and make the frequency plot using the itemFrequencyPlot() function in the arules
package. For this plot, we will check the top 10 products with absolute & relative frequency. The two plots give the same bihavior.
par(mfrow=c(1,2))
itemFrequencyPlot(trans_basket, topN=10, type='absolute', main="Absolute Product Frequency Plot")
itemFrequencyPlot(trans_basket, topN=10, type='relative', main="Relative Product Frequency Plot")
We can start by mining the data for overall association rules.
Refering to our item frequency plots, let’s try to set minimum support at 1 in 1,000 operations and minimum confidence at 70 percent. Also we set the maximum number of Products to be associated as two
basket_rules <- arules::apriori(trans_basket, parameter = list(supp = 0.01, conf = 0.8)) #, maxlen=5
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 149
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[168 item(s), 14964 transaction(s)] done [0.00s].
## sorting and recoding items ... [64 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules <- sort(basket_rules, by='confidence', decreasing = TRUE)
summary(rules)
## set of 0 rules