The goal of this exercice is to decipher relevant knowledge from Basket DataSet.

Summarise basket object

# get the structure of the dataset
str(basket)
## 'data.frame':    38765 obs. of  3 variables:
##  $ Member_number  : int  1808 2552 2300 1187 3037 4941 4501 3803 2762 4119 ...
##  $ Date           : Factor w/ 728 levels "01-01-2014","01-01-2015",..: 494 98 450 288 4 316 178 552 462 268 ...
##  $ itemDescription: Factor w/ 167 levels "abrasive cleaner",..: 156 165 109 102 165 122 102 112 165 156 ...
summary(basket)
##  Member_number          Date               itemDescription 
##  Min.   :1000   21-01-2015:   96   whole milk      : 2502  
##  1st Qu.:2002   21-07-2015:   93   other vegetables: 1898  
##  Median :3005   08-08-2015:   92   rolls/buns      : 1716  
##  Mean   :3004   29-11-2015:   92   soda            : 1514  
##  3rd Qu.:4007   30-04-2015:   91   yogurt          : 1334  
##  Max.   :5000   26-03-2015:   88   root vegetables : 1071  
##                 (Other)   :38213   (Other)         :28730

The dataset is a dataframe with 3 columns: Member_number, Date, and Products description. The Member_number column must be considered as factor, like the id of member. The Date must be a date type. The itemDascription is already as.factor

convert the type of variables

library(magrittr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
basket$Member_number <- as.factor(basket$Member_number)
basket$Date <- as.Date(basket$Date, "%d-%m-%Y")
# review the structure and summary
str(basket)
## 'data.frame':    38765 obs. of  3 variables:
##  $ Member_number  : Factor w/ 3898 levels "1000","1001",..: 787 1505 1264 181 1982 3839 3407 2732 1712 3037 ...
##  $ Date           : Date, format: "2015-07-21" "2015-01-05" ...
##  $ itemDescription: Factor w/ 167 levels "abrasive cleaner",..: 156 165 109 102 165 122 102 112 165 156 ...
summary(basket)
##  Member_number        Date                    itemDescription 
##  3180   :   36   Min.   :2014-01-01   whole milk      : 2502  
##  2051   :   33   1st Qu.:2014-07-12   other vegetables: 1898  
##  3050   :   33   Median :2015-01-21   rolls/buns      : 1716  
##  3737   :   33   Mean   :2015-01-09   soda            : 1514  
##  2271   :   31   3rd Qu.:2015-07-10   yogurt          : 1334  
##  2433   :   31   Max.   :2015-12-30   root vegetables : 1071  
##  (Other):38568                        (Other)         :28730

Now is better formated. We can observe:

Before to start the exploration, It is important to check if the dataset countain empty cells or non available data.

check for any empty or NA cell

# Ckeck for Non Available (NA) cell
all(is.na(basket))
## [1] FALSE

There is an intersting package named funModeling that make eeasy the exploration of Data analysis. We will use it.

library(funModeling) 
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## funModeling v.1.6.8 :)
## Examples and tutorials at livebook.datascienceheroes.com
library(tidyverse) 
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble  1.4.2     ✔ purrr   0.2.5
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.2.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::extract()   masks magrittr::extract()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ purrr::set_names() masks magrittr::set_names()
## ✖ Hmisc::src()       masks dplyr::src()
## ✖ Hmisc::summarize() masks dplyr::summarize()
library(Hmisc)

tibble::glimpse(basket)
## Observations: 38,765
## Variables: 3
## $ Member_number   <fct> 1808, 2552, 2300, 1187, 3037, 4941, 4501, 3803...
## $ Date            <date> 2015-07-21, 2015-01-05, 2015-09-19, 2015-12-1...
## $ itemDescription <fct> tropical fruit, whole milk, pip fruit, other v...

glimpse function returns informations like str.

 funModeling::df_status(basket)
##          variable q_zeros p_zeros q_na p_na q_inf p_inf   type unique
## 1   Member_number       0       0    0    0     0     0 factor   3898
## 2            Date       0       0    0    0     0     0   Date    728
## 3 itemDescription       0       0    0    0     0     0 factor    167

df_statut allows us to detect abnormal values and if is there non available or empty cells.

 #freq(basket) 
library(Hmisc)
Hmisc::describe(basket)
## basket 
## 
##  3  Variables      38765  Observations
## ---------------------------------------------------------------------------
## Member_number 
##        n  missing distinct 
##    38765        0     3898 
## 
## lowest : 1000 1001 1002 1003 1004, highest: 4996 4997 4998 4999 5000
## ---------------------------------------------------------------------------
## Date 
##        n  missing distinct 
##    38765        0      728 
## 
## lowest : 2014-01-01 2014-01-02 2014-01-03 2014-01-04 2014-01-05
## highest: 2015-12-26 2015-12-27 2015-12-28 2015-12-29 2015-12-30
## ---------------------------------------------------------------------------
## itemDescription 
##        n  missing distinct 
##    38765        0      167 
## 
## lowest : abrasive cleaner artif. sweetener baby cosmetics   bags             baking powder   
## highest: white bread      white wine       whole milk       yogurt           zwieback        
## ---------------------------------------------------------------------------

Filter and screen the dataframe

At this step it is importante to know the number of Members and the number of products.

library(magrittr)
# How many member
base::table(basket$Member_number) %>% length()
## [1] 3898
# 3898
# How many operations
length(basket$Member_number)
## [1] 38765
# 38765

# How many products
base::table(basket$itemDescription) %>% length()
## [1] 167
# 167

For all we have 3898 members and 167 products, and 38765 operations. This means that several members made several operations. How many operation per Members? And How many operations per product?

How many operations per Member

library(dplyr)
# How many transactions by Member, the data will be arranged by descending
basket %>% group_by(Member_number) %>% arrange(Member_number) %>% dplyr::summarise(N_operation=n()) %>% arrange(desc(N_operation))
#basket_plot_operations <- basket %>% group_by(Member_number, Date) %>% summarise(N_operation=n()) %>% arrange(desc(N_operation))

#ggplot2::ggplot(basket_plot_operations[1:100,], aes_string(x = 'Date', y = 'N_operation')) +
 # ggplot2::geom_bar(stat="identity", na.rm=TRUE)

This table confirm that the Member 3180 ahs the biggest frequency of operations 36. We can see the first ten important Members which got the biggest numbers of operations.

# How many Operations were done by product during 2 Years. The data will be arranged by descending
basket %>% group_by(itemDescription) %>% dplyr::summarise(N_operations=n()) %>% arrange(desc(N_operations))

This table shows arranges by descending the most important sold products.

We can visualize the first 50th best solds as following:

# How many transactions by product, the data will be arranged by descending
basket_plot_products <- basket %>% group_by(itemDescription) %>% dplyr::summarise(N_operations=n()) %>% arrange(desc(N_operations))


## plot the histogram of the best 50th sold  products
ggplot2::ggplot(basket_plot_products[1:50,], aes(x = reorder(itemDescription, -N_operations), y = N_operations)) +      
                                             #aes_string(x = 'itemDescription',  y = 'N_operations')  ## without ordering
 ggplot2::labs(title = "Overview of the first 50th best solds products during 2014-2015",
               #fill = "",
               x= "The products",
              #colour= Date,
               y = "The number of sales by product") +    
      ggplot2::theme(legend.title = element_text( colour="black",
                                                size=11,
                                                face="bold"),
                   title = element_text( size = 12,
                                         face = 'bold'
                   ),
                   text = element_text(size = 10,
                                       face= 'bold'
                   ),
                   axis.text.x=element_text(angle=45, hjust=1),
                   legend.position = "right",
                   legend.direction = "vertical"
    ) +
 ggplot2::geom_bar(stat="identity", na.rm=TRUE)

# Monitoring the operations number  were done per days, for all products during 2 years. The data will be arranged by descending
 basket_plot_date <-  basket %>% group_by(Date) %>% dplyr::summarise(N_operations=n()) # %>% arrange(desc(N_operations))
attach(basket_plot_date )
basket_plot_date %>%
ggplot(aes(x = Date, y = N_operations)) +
      geom_point(color = "darkorchid4") +
      geom_smooth(method = 'loess')+
      labs(title = "Total Operations per day during the all period (2 years)",
           subtitle = "Each value is the sum of all operations for all Members",
           y = "Number of Operations",
           x = "Date") + theme_bw(base_size = 15)

#basket_plot_date$months <- as.factor(months(basket_plot_date$Date))
basket_plot_date$months <- as.factor(format(as.Date(basket_plot_date$Date), "%y-%m"))
#par(mfrow=c(2,1))
plot( basket_plot_date$Date, basket_plot_date$N_operations, ylab = "N° Transactions", xlab = "Date")
fit_day <- lm(N_operations ~ Date,data= basket_plot_date)
lines(basket_plot_date$Date, fit_day$fitted.values)

plot(basket_plot_date$months, basket_plot_date$N_operations, ylab = "N° Transactions", xlab = "Months")

fit_month <- lm(N_operations ~ months,data= basket_plot_date)

summary(fit_day)
## 
## Call:
## lm(formula = N_operations ~ Date, data = basket_plot_date)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.109  -8.641  -0.864   8.467  42.505 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.394e+02  3.668e+01  -3.800 0.000157 ***
## Date         1.172e-02  2.232e-03   5.252 1.98e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.68 on 726 degrees of freedom
## Multiple R-squared:  0.0366, Adjusted R-squared:  0.03527 
## F-statistic: 27.58 on 1 and 726 DF,  p-value: 1.981e-07
summary(fit_month)
## 
## Call:
## lm(formula = N_operations ~ months, data = basket_plot_date)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.548  -8.917  -0.483   7.982  38.032 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  49.2581     2.2490  21.903  < 2e-16 ***
## months14-02   2.0634     3.2646   0.632  0.52756    
## months14-03  -3.7419     3.1805  -1.177  0.23978    
## months14-04   2.7753     3.2069   0.865  0.38711    
## months14-05   2.8387     3.1805   0.893  0.37241    
## months14-06   3.0753     3.2069   0.959  0.33791    
## months14-07   1.5806     3.1805   0.497  0.61936    
## months14-08   1.5484     3.1805   0.487  0.62653    
## months14-09  -0.1914     3.2069  -0.060  0.95242    
## months14-10   2.0645     3.1805   0.649  0.51647    
## months14-11  -0.2914     3.2069  -0.091  0.92762    
## months14-12  -0.1581     3.2069  -0.049  0.96070    
## months15-01   8.7097     3.1805   2.738  0.00633 ** 
## months15-02   6.4562     3.2646   1.978  0.04836 *  
## months15-03   6.2903     3.1805   1.978  0.04834 *  
## months15-04   7.3753     3.2069   2.300  0.02175 *  
## months15-05   8.5806     3.1805   2.698  0.00715 ** 
## months15-06   7.2086     3.2069   2.248  0.02489 *  
## months15-07   6.3548     3.1805   1.998  0.04609 *  
## months15-08  12.7097     3.1805   3.996 7.12e-05 ***
## months15-09   3.6419     3.2069   1.136  0.25649    
## months15-10   4.6129     3.1805   1.450  0.14740    
## months15-11  10.2419     3.2069   3.194  0.00147 ** 
## months15-12   1.9419     3.2069   0.606  0.54501    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.52 on 704 degrees of freedom
## Multiple R-squared:  0.08919,    Adjusted R-squared:  0.05943 
## F-statistic: 2.997 on 23 and 704 DF,  p-value: 3.837e-06
fit_day$coeff
##  (Intercept)         Date 
## -139.3909717    0.0117213
fit_month$coefficients
## (Intercept) months14-02 months14-03 months14-04 months14-05 months14-06 
##  49.2580645   2.0633641  -3.7419355   2.7752688   2.8387097   3.0752688 
## months14-07 months14-08 months14-09 months14-10 months14-11 months14-12 
##   1.5806452   1.5483871  -0.1913978   2.0645161  -0.2913978  -0.1580645 
## months15-01 months15-02 months15-03 months15-04 months15-05 months15-06 
##   8.7096774   6.4562212   6.2903226   7.3752688   8.5806452   7.2086022 
## months15-07 months15-08 months15-09 months15-10 months15-11 months15-12 
##   6.3548387  12.7096774   3.6419355   4.6129032  10.2419355   1.9419355
# library(tidyverse)
#  basket <- read.csv("basket_bkp.csv", header = TRUE, sep = ",")
# basket$X <- NULL
#  basket$Member_number <-as.factor(basket$Member_number)
# basket$Date <- as.Date(basket$Date, "%d-%m-%Y")
# 
# basket.new <- mutate(basket, variable = rownames(basket)) %>% 
#  tidyr::gather(basket, Date, itemDescription)

Regroupe all Operations by Month (December 2015)

## Determine the most 10 sold products
basket_products <- basket %>% group_by(itemDescription) %>% dplyr::summarise(N_operations=n()) %>% arrange(desc(N_operations))
basket_products10 <- as.data.frame(basket_products[1:10,])

basket %>% filter(itemDescription =="whole milk")
## Regroup all sold products in December 2015 (add index for each operation)
 #basket %>% #mutate(index = as.factor(seq.int(nrow(basket)))) %>%
 #mutate(Month = format(Date, "%B"), Year = format(Date, "%Y")) %>% arrange(desc(Date)) %>% filter(Month == "December", #Year == "2015") %>% dplyr::mutate(N = dplyr::summarise( n()))
  
 ## add a new column with Month_Year
 basket.Year_Month <- basket %>% mutate(Year_Month = substr(basket$Date, 1,7)) 
 
 # basket <- read.csv("basket_bkp.csv", header = TRUE, sep = ",")
 # basket$X <- NULL
 # basket %>% mutate(Member_number = as.factor(Member_number))
 # basket$Date <- as.Date(basket$Date, "%d-%m-%Y")
  
  #basket$Year_Month <- zoo::as.yearmon(basket$Year_Month)

basket.Year_Month%>% dplyr::group_by(Year_Month) %>%
  dplyr::summarise(N_Operations=n()) 
# Plot a subset of the data
#attach(Operations_Per_Month)
#ggplot(data =  Operations_Per_Month, aes_string(x = zoo::as.yearmon(Year_Month), y = N_operations)) + 
 # geom_line(color = "#FC4E07")

add index for each operation

basket %>% mutate(index = as.factor(seq.int(nrow(basket)))) 

Filter the 10 best sold products

basket_products <- basket %>% group_by(itemDescription) %>% dplyr::summarise(N_operations=n()) %>% arrange(desc(N_operations))
basket_products[1:10,]

Explore if all operations for each Member were done in the same day

The idea is to associate products per basket for each Member.

## group operations by Member and filter it by date
basket %>% group_by(Member_number) %>% arrange(Member_number) %>% filter(Date=="2015-05-27")

When We explore manually the table, we can see that some products sold together in the same date for the same Member. The response is: YES INITIATIVE TO PREDICT RULES BETWEEN PRODUCTS SOLD IN THE SAME TIME (BASKET: a group of products) IN SUPERMARKET FOR EXAMPLE.

combine all products per one Member

library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following object is masked from 'package:purrr':
## 
##     compact
## The following objects are masked from 'package:Hmisc':
## 
##     is.discrete, summarize
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
#ddply(dataframe, variables_to_be_used_to_split_data_frame, function_to_be_applied)
transactionData_collapse_product <- plyr::ddply(basket,c("Member_number", "Date"),
                       function(basket)paste(basket$itemDescription,
                       collapse = ","))
transactionData_collapse_product
# remove Customer ID as we do not need it
transactionData_collapse_product$Member_number <- NULL
# remove Date as we do not need it
transactionData_collapse_product$Date <- NULL
colnames(transactionData_collapse_product) <- c("Products")
write.csv(transactionData_collapse_product, file = "transaction.csv", quote = FALSE, row.names = FALSE)
transactionData_collapse_product

Predict a rule between Product in each basket

Load transaction.csv to R, we will use read.transactions() function from Arules package this time.

library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
## 
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
trans_basket <- read.transactions('transaction.csv', format = 'basket', sep=',')  #, rm.duplicates=TRUE
## Warning in asMethod(object): removing duplicated items in transactions
str(trans_basket)
## Formal class 'transactions' [package "arules"] with 3 slots
##   ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
##   .. .. ..@ i       : int [1:38007] 115 104 128 165 130 132 165 166 107 138 ...
##   .. .. ..@ p       : int [1:14965] 0 1 4 8 10 12 14 17 19 22 ...
##   .. .. ..@ Dim     : int [1:2] 168 14964
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : NULL
##   .. .. .. ..$ : NULL
##   .. .. ..@ factors : list()
##   ..@ itemInfo   :'data.frame':  168 obs. of  1 variable:
##   .. ..$ labels: chr [1:168] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "bags" ...
##   ..@ itemsetInfo:'data.frame':  0 obs. of  0 variables
trans_basket
## transactions in sparse format with
##  14964 transactions (rows) and
##  168 items (columns)
summary(trans_basket) # call in summary to check our data
## transactions as itemMatrix in sparse format with
##  14964 rows (elements/itemsets/transactions) and
##  168 columns (items) and a density of 0.01511843 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2363             1827             1646             1453 
##           yogurt          (Other) 
##             1285            29433 
## 
## element (itemset/transaction) length distribution:
## sizes
##     1     2     3     4     5     6     7     8     9    10 
##   206 10012  2727  1273   338   179   113    96    19     1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00    2.00    2.54    3.00   10.00 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

We can explore the data and make the frequency plot using the itemFrequencyPlot() function in the arules package. For this plot, we will check the top 10 products with absolute & relative frequency. The two plots give the same bihavior.

par(mfrow=c(1,2))
itemFrequencyPlot(trans_basket, topN=10, type='absolute', main="Absolute Product Frequency Plot")
itemFrequencyPlot(trans_basket, topN=10, type='relative', main="Relative Product Frequency Plot")

Modeling and Evaluation

We can start by mining the data for overall association rules.

Refering to our item frequency plots, let’s try to set minimum support at 1 in 1,000 operations and minimum confidence at 70 percent. Also we set the maximum number of Products to be associated as two

basket_rules <- arules::apriori(trans_basket, parameter = list(supp = 0.01, conf = 0.8))  #, maxlen=5
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 149 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[168 item(s), 14964 transaction(s)] done [0.00s].
## sorting and recoding items ... [64 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
rules <- sort(basket_rules, by='confidence', decreasing = TRUE)
summary(rules)
## set of 0 rules

certificates

Intermediate R programming for Data Analyst

Dataquest certificate

Dataquest certificate