1 Introduction

This is an extensive Exploratory Data Analysis for the Recruit Restaurant Visitor Forecasting competition using the powers of tidy R and ggplot2. We have also translated the main R code to Python.

The aim of this challenge is to predict the future numbers of restaurant visitors. This makes it a Time Series Forecasting problem. The data was collected from Japanese restaurants. As we will see, the data set is small and easily accessible without requiring much memory or computing power. Therefore, this competition is particularly suited for beginners.

The data comes in the shape of 8 relational files which are derived from two separate Japanese websites that collect user information: “Hot Pepper Gourmet (hpg): similar to Yelp” (search and reserve) and “AirREGI / Restaurant Board (air): similar to Square” (reservation control and cash register). The training data covers the time range of Jan 2016 through most of Apr 2017, while the test set covers the last week of Apr plus May 2017. The test data “intentionally spans a holiday week in Japan called the ‘Golden Week’”. The data description further notes that “There are days in the test set where the restaurants were closed and had no visitors. These are ignored in scoring. The training set omits days where the restaurants were closed.”

These are the individual files:

  • air_visit_data.csv: historical visitor numbers for the air restaurants

  • air_reserve.csv / hpg_reserve.csv: reservations made through the air and hpg systems

  • air_store_info.csv / hpg_store_info.csv: details on the air and hpg restaurants, including genre, area, and coordinates

  • store_id_relation.csv: connects the air and hpg ids of restaurants listed in both systems

  • date_info.csv: essentially the holiday flags for each calendar date

  • sample_submission.csv: the test set, i.e. the store ids and dates for which we need to forecast visitor numbers

Thank you all for the upvotes, critical feedback, and continued support!

2 Preparations

2.1 Set up the Python and R environments and load packages

2.1.1 Configure R to use Python 3

# Set python environment and version in RStudio ;-)
reticulate::use_python("/usr/bin/python3.10", required = TRUE)
reticulate::py_config()
## python:         /usr/bin/python3.10
## libpython:      /usr/lib/python3.10/config-3.10-x86_64-linux-gnu/libpython3.10.so
## pythonhome:     //usr://usr
## version:        3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]
## numpy:          /home/kirus/.local/lib/python3.10/site-packages/numpy
## numpy_version:  1.26.0
## 
## NOTE: Python version was forced by use_python() function
# check module availability
#reticulate::py_module_available("numpy")
#reticulate::py_module_available("matplotlib")
#reticulate::py_module_available("seaborn")

2.1.2 Set Language

#Sys.setlocale("LC_ALL","English")
Sys.setenv(LANG = "en_US.UTF-8")
Sys.setenv("LANGUAGE"="En")
import locale
locale.setlocale(locale.LC_ALL,'en_US.utf8')
## 'en_US.utf8'

2.1.3 Check R library paths

.libPaths()
## [1] "/home/kirus/R/x86_64-pc-linux-gnu-library/4.3"
## [2] "/usr/local/lib/R/site-library"                
## [3] "/usr/lib/R/site-library"                      
## [4] "/usr/lib/R/library"

2.1.4 Check the R environment path used by Python

import os
print(os.environ['R_HOME'])
## /usr/lib/R
# os.environ['R_HOME'] = "/home/kirus/R/x86_64-pc-linux-gnu-library/4.3"
# print(os.environ['R_HOME'])
  • We need to set the R library paths when R is called from Python (see the sketch below).

  • Two paths will be used:

    • /usr/lib/R/library for packages installed with the root profile when installing base R,

    • /home/kirus/R/x86_64-pc-linux-gnu-library/4.3 for packages installed by the user without the root profile.
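
For example, here is a minimal sketch (reusing the two paths above) of how both library locations could be made visible to the embedded R session via rpy2:

import rpy2.robjects as ro

user_lib = "/home/kirus/R/x86_64-pc-linux-gnu-library/4.3"  # user-level packages
system_lib = "/usr/lib/R/library"                           # base R packages

# Prepend both locations to the embedded R session's library search path
ro.r('.libPaths(c("{}", "{}", .libPaths()))'.format(user_lib, system_lib))
print(ro.r('.libPaths()'))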

2.1.5 Check which R version and library paths are used by `rpy2`

import rpy2.rinterface
#rpy2.rinterface.set_initoptions((b'rpy2', b'--no-save', b'--no-restore', b'--quiet'))
from rpy2.robjects.packages import importr
## /home/kirus/.local/lib/python3.10/site-packages/rpy2/rinterface_lib/embedded.py:276: UserWarning: R was initialized outside of rpy2 (R_NilValue != NULL). Trying to use it nevertheless.
##   warnings.warn(msg)
## R was initialized outside of rpy2 (R_NilValue != NULL). Trying to use it nevertheless.
base = importr('base', lib_loc= '/usr/lib/R/library/')
print(base.version)
print(base._libPaths())

2.2 Load libraries

We load a range of libraries for general data wrangling and general visualisation together with more specialised tools.

# general visualisation
library('ggplot2') # visualisation
library('scales') # visualisation
library('grid') # visualisation
library('gridExtra') # visualisation
library('RColorBrewer') # visualisation
library('corrplot') # visualisation

# general data manipulation
library('dplyr') # data manipulation
library('readr') # input/output
library('data.table') # data manipulation
library('DT') # display table
library('tibble') # data wrangling
library('tidyr') # data wrangling
library('stringr') # string manipulation
library('forcats') # factor manipulation

# specific visualisation
library('ggrepel') # visualisation
library('ggridges') # visualisation
library('ggExtra') # visualisation
library('ggforce') # visualisation
library('viridis') # visualisation

# specific data manipulation
library('lazyeval') # data wrangling
library('broom') # data wrangling
library('purrr') # data wrangling

# Date plus forecast
library('lubridate') # date and time
library('timeDate') # date and time
library('tseries') # time series analysis
library('forecast') # time series analysis
library('prophet') # time series analysis
library('timetk') # time series analysis

# Maps / geospatial
library('geosphere') # geospatial locations
library('leaflet') # maps
library('leaflet.extras') # maps
library('maps') # maps

library(reticulate) # interface to python
library(kableExtra) # rendering table with scrollX
#pip3 install matplotlib
#pip3 install seaborn
#pip3 install scikit-learn
#pip3 install rpy2
# pip install git+https://github.com/h2oai/datatable
# pip install ipywidgets==7.0.1

from rpy2 import robjects
from rpy2.robjects.packages import importr, data
import rpy2.robjects.packages as rpackages
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
import matplotlib
from matplotlib import pyplot as plt, dates
#import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import locale
from statsmodels.nonparametric.smoothers_lowess import lowess
from matplotlib.gridspec import GridSpec
import matplotlib.dates as mdates
from matplotlib.ticker import FormatStrFormatter
import folium
from folium.plugins import MarkerCluster
locale.setlocale(locale.LC_ALL,'en_US.utf8')
## 'en_US.utf8'
from statsmodels.regression.linear_model import OLS
from statsmodels.tools import add_constant
from datetime import timedelta
from pmdarima import auto_arima
from statsmodels.tsa.seasonal import seasonal_decompose

#from IPython.display import display, HTML # displays html table in R Markdown
#import ipywidgets 
#import qgrid  # display dataframe like DT::datatable in Rmarkdown document
#import datatable as dt
#from sklearn.model_selection import train_test_split
from collections import Counter
from datetime import datetime
from datetime import date
from prophet import Prophet
from prophet.plot import plot
import plotly
from statsmodels.tsa.holtwinters import ExponentialSmoothing
import pytimetk as tk
from pytimetk import augment_timeseries_signature, get_timeseries_signature
import joypy

2.3 Plot the iris dataset

iris %>%
  ggplot() +
  aes(x = Petal.Length, y = Petal.Width, color = Species, alpha = 0.8) +
  geom_point()

#datasets::iris
## Load iris from python
#iris = sns.load_dataset('iris')

# load iris from R
datasets = rpackages.importr('datasets', 
            lib_loc= '/usr/lib/R/library/')
iris = data(datasets).fetch('iris')['iris']

# convert R dataframe to pandas df 
# (https://rpy2.github.io/doc/latest/html/generated_rst/pandas.html)
with (ro.default_converter + pandas2ri.converter).context():
  iris = ro.conversion.get_conversion().rpy2py(iris)
  
  
#sns.set_style("darkgrid")
plt.style.use('ggplot')
#fi = sns.FacetGrid(iris, hue ="Species", 
#              height = 6).map(plt.scatter, 
#                              'Petal.Length', 
#                              'Petal.Width').add_legend()

fi=sns.relplot(x="Petal.Length", 
               y="Petal.Width", 
               data=iris, alpha=0.8, 
               hue='Species') 
fi.fig.set_figheight(4)
fi.fig.set_figwidth(8)
plt.show()

2.4 Helper functions

We use the multiplot function, courtesy of R Cookbooks, to create multi-panel plots. We also make use of a brief helper function to compute binomial confidence intervals; a Python sketch of such a helper is given at the end of this section.

# Define multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

We use matplotlib to create subplots. The function multiplot now works with *args to accept an arbitrary number of plots, and the cols argument determines the number of columns in the layout. If layout is not provided, it is calculated from cols and the number of plots.

The example at the end demonstrates how to use the multiplot function with two plots (y1 and y2) arranged in two columns.



import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

def multiplot(*args, plotlist=None, file=None, cols=1, layout=None):
    # Collect plots from *args and plotlist
    plots = list(args) + (plotlist if plotlist else [])
    num_plots = len(plots)

    # Number of rows needed, calculated from the number of columns
    rows = (num_plots - 1) // cols + 1

    # If layout is None, fill the grid row by row using 'cols'
    if layout is None:
        layout = [(i // cols, i % cols) for i in range(num_plots)]

    if num_plots == 1:
        plt.subplot2grid((rows, cols), (0, 0))
        plt.plot(plots[0])

    else:
        fig = plt.figure()
        gs = GridSpec(rows, cols)

        for i, plot in enumerate(plots):
            ax = fig.add_subplot(gs[layout[i]])
            ax.plot(plot)

    if file:
        plt.savefig(file)
    else:
        plt.show()

# Example usage
#import numpy as np

#x = np.linspace(0, 10, 100)
#y1 = np.sin(x)
#y2 = np.cos(x)

#multiplot(y1, y2, cols=2)

Another version uses add_gridspec to create the layout for multiple plots:

def multiplot(*args, plotlist=None, file=None, cols=1, layout=None):
    plots = list(args) + (plotlist if plotlist else [])
    num_plots = len(plots)

    # Number of rows needed for the grid
    rows = (num_plots - 1) // cols + 1

    fig = plt.figure()
    gs = fig.add_gridspec(rows, cols)

    if num_plots == 1:
        # A single plot spans the full first row
        ax = fig.add_subplot(gs[0, :])
        ax.plot(plots[0])

    else:
        # Fill the grid row by row
        for i, plot in enumerate(plots):
            ax = fig.add_subplot(gs[i // cols, i % cols])
            ax.plot(plot)

    if file:
        plt.savefig(file)
    else:
        plt.show()
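
The binomial confidence interval helper mentioned at the start of this section is not reproduced above. A minimal Python sketch, assuming the simple normal-approximation (Wald) interval is good enough for exploration purposes, could look like this:

import numpy as np

def binom_ci(successes, total, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p = successes / total
    se = np.sqrt(p * (1 - p) / total)
    return max(p - z * se, 0.0), min(p + z * se, 1.0)

# Example (once the data below is loaded): interval for the holiday fraction
#print(binom_ci(holidays['holiday_flg'].sum(), len(holidays)))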

2.5 Load Data

air_visits <- data.table::fread(str_c("data/air_visit_data.csv"))
air_reserve <- data.table::fread(str_c('data/air_reserve.csv'))
air_store <- data.table::fread(str_c('data/air_store_info.csv'))
holidays <- data.table::fread(str_c('data/date_info.csv'))
hpg_reserve <- data.table::fread(str_c('data/hpg_reserve.csv'))
hpg_store <- data.table::fread(str_c('data/hpg_store_info.csv'))
store_ids <- data.table::fread(str_c('data/store_id_relation.csv'))
test <- data.table::fread(str_c('data/sample_submission.csv'))

In this Python code, we use the pandas library to read the CSV files. Each read_csv call reads a CSV file into a DataFrame. The variables (air_visits, air_reserve, etc.) will be DataFrames that we can use for data manipulation and analysis.

air_visits = pd.read_csv("data/air_visit_data.csv")
air_reserve = pd.read_csv("data/air_reserve.csv")
air_store = pd.read_csv("data/air_store_info.csv")
holidays = pd.read_csv("data/date_info.csv")
hpg_reserve = pd.read_csv("data/hpg_reserve.csv")
hpg_store = pd.read_csv("data/hpg_store_info.csv")
store_ids = pd.read_csv("data/store_id_relation.csv")
test = pd.read_csv("data/sample_submission.csv")

3 Overview: File structure and content

As a first step let’s have an overview of the data sets.

3.1 Air visits

air_visits %>%
  head(10) %>%
  DT::datatable()

We find that this file contains the visitor numbers for each visit_date and air_store_id. The date feature should be transformed into a time-series format. There are 829 different stores, which makes this a small data set:

air_visits %>%
  distinct(air_store_id) %>%
  nrow()
## [1] 829

The installed qgrid and ipywidgets versions are not compatible and have an issue rendering an HTML table the way DT::datatable() does.

#air_visits = pd.DataFrame(air_visits)

# Show the DataFrame using qgrid in Python
#qgrid_widget = qgrid.show_grid(air_visits, show_toolbar=True)
#qgrid_widget.head(10)

#html = air_visits.head(10).to_html()
#display(HTML(html))
#print(html)
air_visits.head(10)
##            air_store_id  visit_date  visitors
## 0  air_ba937bf13d40fb24  2016-01-13        25
## 1  air_ba937bf13d40fb24  2016-01-14        32
## 2  air_ba937bf13d40fb24  2016-01-15        29
## 3  air_ba937bf13d40fb24  2016-01-16        22
## 4  air_ba937bf13d40fb24  2016-01-18         6
## 5  air_ba937bf13d40fb24  2016-01-19         9
## 6  air_ba937bf13d40fb24  2016-01-20        31
## 7  air_ba937bf13d40fb24  2016-01-21        21
## 8  air_ba937bf13d40fb24  2016-01-22        18
## 9  air_ba937bf13d40fb24  2016-01-23        26
air_visits['air_store_id'].nunique()
## 829

3.2 Air Reserve

air_reserve %>%
  head(10) %>%
  DT::datatable()

We find that the air reservations include the date and time of the reservation, as well as those of the visit. We have reservation numbers for 314 air stores:

air_reserve %>% distinct(air_store_id) %>% nrow()
## [1] 314
air_reserve.head(10)
##            air_store_id  ... reserve_visitors
## 0  air_877f79706adbfb06  ...                1
## 1  air_db4b38ebe7a7ceff  ...                3
## 2  air_db4b38ebe7a7ceff  ...                6
## 3  air_877f79706adbfb06  ...                2
## 4  air_db80363d35f10926  ...                5
## 5  air_db80363d35f10926  ...                2
## 6  air_db80363d35f10926  ...                4
## 7  air_3bb99a1fe0583897  ...                2
## 8  air_3bb99a1fe0583897  ...                2
## 9  air_2b8b29ddfd35018e  ...                2
## 
## [10 rows x 4 columns]
air_reserve['air_store_id'].nunique()
## 314

3.3 HPG Reserve

  hpg_reserve %>%
  head(10) %>%
  DT::datatable()

The hpg reservations file follows the same structure as the corresponding air file. We have reservation numbers for 13325 hpg stores:

hpg_reserve %>% distinct(hpg_store_id) %>% nrow()
## [1] 13325
hpg_reserve.head(10)
##            hpg_store_id  ... reserve_visitors
## 0  hpg_c63f6f42e088e50f  ...                1
## 1  hpg_dac72789163a3f47  ...                3
## 2  hpg_c8e24dcf51ca1eb5  ...                2
## 3  hpg_24bb207e5fd49d4a  ...                5
## 4  hpg_25291c542ebb3bc2  ...               13
## 5  hpg_28bdf7a336ec6a7b  ...                2
## 6  hpg_2a01a042bca04ad9  ...                2
## 7  hpg_2a84dd9f4c140b82  ...                2
## 8  hpg_2ad179871696901f  ...                2
## 9  hpg_2c1d989eedb0ff83  ...                6
## 
## [10 rows x 4 columns]
hpg_reserve['hpg_store_id'].nunique()
## 13325

3.4 Air Store

air_store %>%
  head(10) %>%
  datatable()

We find that the air_store info includes the name of the particular cuisine along with the name of the area.

air_store.head(10)
##            air_store_id  air_genre_name  ...   latitude   longitude
## 0  air_0f0cdeee6c9bf3d7  Italian/French  ...  34.695124  135.197853
## 1  air_7cc17a324ae5c7dc  Italian/French  ...  34.695124  135.197853
## 2  air_fee8dcf4d619598e  Italian/French  ...  34.695124  135.197853
## 3  air_a17f0778617c76e2  Italian/French  ...  34.695124  135.197853
## 4  air_83db5aff8f50478e  Italian/French  ...  35.658068  139.751599
## 5  air_99c3eae84130c1cb  Italian/French  ...  35.658068  139.751599
## 6  air_f183a514cb8ff4fa  Italian/French  ...  35.658068  139.751599
## 7  air_6b9fa44a9cf504a1  Italian/French  ...  35.658068  139.751599
## 8  air_0919d54f0c9a24b8  Italian/French  ...  35.658068  139.751599
## 9  air_2c6c79d597e48096  Italian/French  ...  35.658068  139.751599
## 
## [10 rows x 5 columns]

3.5 HPG Store

hpg_store %>%
  head(10) %>%
  datatable()

Again, the hpg_store info follows the same structure as the air info. Here the genre_name includes the word style. It’s worth checking whether the same is true for the air data or whether it just refers to the specific “Japanese style”. There are 4690 different hpg_store_ids, which are significantly fewer than we have reservation data for.
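
As a quick check, we can look for the word “style” in both genre columns (a small sketch using pandas string methods):

# Does any air genre contain the word "style", as the hpg genres do?
air_store['air_genre_name'].str.contains('style', case=False).any()
# How many hpg genres contain it?
hpg_store['hpg_genre_name'].str.contains('style', case=False).sum()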

hpg_store.head(10)
##            hpg_store_id  hpg_genre_name  ...   latitude   longitude
## 0  hpg_6622b62385aec8bf  Japanese style  ...  35.643675  139.668221
## 1  hpg_e9e068dd49c5fa00  Japanese style  ...  35.643675  139.668221
## 2  hpg_2976f7acb4b3a3bc  Japanese style  ...  35.643675  139.668221
## 3  hpg_e51a522e098f024c  Japanese style  ...  35.643675  139.668221
## 4  hpg_e3d0e1519894f275  Japanese style  ...  35.643675  139.668221
## 5  hpg_530cd91db13b938e  Japanese style  ...  35.643675  139.668221
## 6  hpg_02457b318e186fa4  Japanese style  ...  35.643675  139.668221
## 7  hpg_0cb3c2c490020a29  Japanese style  ...  35.643675  139.668221
## 8  hpg_3efe9b08c887fe9a  Japanese style  ...  35.643675  139.668221
## 9  hpg_765e8d3ba261dc1c  Japanese style  ...  35.643675  139.668221
## 
## [10 rows x 5 columns]

3.6 Holidays

holidays %>%
  head(10) %>%
  datatable()

We called the date_info file holidays, because that’s essentially the information it contains. Holidays are encoded as binary flags in integer format. This should become a logical binary feature for exploration purposes.

holidays.head(10)
##   calendar_date day_of_week  holiday_flg
## 0    2016-01-01      Friday            1
## 1    2016-01-02    Saturday            1
## 2    2016-01-03      Sunday            1
## 3    2016-01-04      Monday            0
## 4    2016-01-05     Tuesday            0
## 5    2016-01-06   Wednesday            0
## 6    2016-01-07    Thursday            0
## 7    2016-01-08      Friday            0
## 8    2016-01-09    Saturday            0
## 9    2016-01-10      Sunday            0

3.7 Store IDs

store_ids %>%
  head(10) %>%
  datatable()

This is a relational file that connects the air and hpg ids. There are only 150 pairs, which is less than 20% of all air stores.

store_ids.head(10)
##            air_store_id          hpg_store_id
## 0  air_63b13c56b7201bd9  hpg_4bc649e72e2a239a
## 1  air_a24bf50c3e90d583  hpg_c34b496d0305a809
## 2  air_c7f78b4f3cba33ff  hpg_cd8ae0d9bbd58ff9
## 3  air_947eb2cae4f3e8f2  hpg_de24ea49dc25d6b8
## 4  air_965b2e0cf4119003  hpg_653238a84804d8e7
## 5  air_a38f25e3399d1b25  hpg_50378da9ffb9b6cd
## 6  air_3c938075889fc059  hpg_349b1b92f98b175e
## 7  air_68301bcb11e2f389  hpg_2c09f3abb2220659
## 8  air_5f6fa1b897fe80d5  hpg_40aff6385800ebb1
## 9  air_00a91d42b08b08d9  hpg_fbe603376b5980fc
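
A quick verification of the number of id pairs and their share of the 829 air stores (a small sketch):

n_pairs = len(store_ids)
n_air = air_visits['air_store_id'].nunique()
print(n_pairs, round(n_pairs / n_air, 3))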

3.8 Test data

test %>%
  head(10) %>%
  datatable()

As noted in the data description, the id of the final submission file is a concatenation of the air_id and the date.

test.head(10)
##                                 id  visitors
## 0  air_00a91d42b08b08d9_2017-04-23         0
## 1  air_00a91d42b08b08d9_2017-04-24         0
## 2  air_00a91d42b08b08d9_2017-04-25         0
## 3  air_00a91d42b08b08d9_2017-04-26         0
## 4  air_00a91d42b08b08d9_2017-04-27         0
## 5  air_00a91d42b08b08d9_2017-04-28         0
## 6  air_00a91d42b08b08d9_2017-04-29         0
## 7  air_00a91d42b08b08d9_2017-04-30         0
## 8  air_00a91d42b08b08d9_2017-05-01         0
## 9  air_00a91d42b08b08d9_2017-05-02         0
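
Since each id is just the air_store_id and the date joined by an underscore, we can split it back apart (a small sketch):

# Split the submission id into its store id and date components
id_parts = test['id'].str.rsplit('_', n=1, expand=True)
id_parts.columns = ['air_store_id', 'visit_date']
id_parts.head()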

3.9 Missing values

sum(is.na(air_visits))
## [1] 0
sum(is.na(air_reserve))
## [1] 0
sum(is.na(hpg_reserve))
## [1] 0
sum(is.na(air_store))
## [1] 0
sum(is.na(hpg_store))
## [1] 0
sum(is.na(holidays))
## [1] 0
sum(is.na(store_ids))
## [1] 0
sum(is.na(test))
## [1] 0

There are no missing values in our data. Excellent.

In Python, you can use the isna() method from the pandas library to check for missing values in a DataFrame, and then use sum() to count them. Here’s the equivalent code:

air_visits.isna().sum().sum()
## 0
air_reserve.isna().sum().sum()
## 0
hpg_reserve.isna().sum().sum()
## 0
air_store.isna().sum().sum()
## 0
hpg_store.isna().sum().sum()
## 0
holidays.isna().sum().sum()
## 0
test.isna().sum().sum()
## 0
  • test.isna() returns a DataFrame of the same shape as test with True where the original DataFrame has missing values, and False otherwise.

  • .sum() is used twice. The first sum() computes the count of missing values for each column, and the second sum() adds up those counts to get the total number of missing values in the entire DataFrame.

  • The final result is the equivalent of sum(is.na(test)) in R.

3.10 Reformatting features

We change the formatting of the date/time features and also reformat a few features to logical and factor variables for exploration purposes.

air_visits <- air_visits %>%
  mutate(visit_date = ymd(visit_date))

air_reserve <- air_reserve %>%
  mutate(visit_datetime = as.POSIXct(visit_datetime,format="%Y-%m-%dT%H:%MZ"), #ymd_hms(visit_datetime),
         reserve_datetime = as.POSIXct(reserve_datetime,format="%Y-%m-%dT%H:%MZ")) #ymd_hms(reserve_datetime)

hpg_reserve <- hpg_reserve %>%
  mutate(visit_datetime = as.POSIXct(visit_datetime,format="%Y-%m-%dT%H:%MZ"),
         reserve_datetime = as.POSIXct(reserve_datetime,format="%Y-%m-%dT%H:%MZ"))

air_store <- air_store %>%
  mutate(air_genre_name = as.factor(air_genre_name),
         air_area_name = as.factor(air_area_name))

hpg_store <- hpg_store %>%
  mutate(hpg_genre_name = as.factor(hpg_genre_name),
         hpg_area_name = as.factor(hpg_area_name))

holidays <- holidays %>%
  mutate(holiday_flg = as.logical(holiday_flg),
         date = ymd(calendar_date),
         calendar_date = as.character(calendar_date))
# Convert visit_date column to datetime format
air_visits['visit_date'] = pd.to_datetime(air_visits['visit_date'])

# Convert visit_datetime and reserve_datetime columns to datetime format
air_reserve['visit_datetime'] = pd.to_datetime(air_reserve['visit_datetime'])
air_reserve['reserve_datetime'] = pd.to_datetime(air_reserve['reserve_datetime'])

hpg_reserve['visit_datetime'] = pd.to_datetime(hpg_reserve['visit_datetime'])
hpg_reserve['reserve_datetime'] = pd.to_datetime(hpg_reserve['reserve_datetime'])

# Convert categorical columns to factors (categorical variables in pandas)
air_store['air_genre_name'] = air_store['air_genre_name'].astype('category')
air_store['air_area_name'] = air_store['air_area_name'].astype('category')

hpg_store['hpg_genre_name'] = hpg_store['hpg_genre_name'].astype('category')
hpg_store['hpg_area_name'] = hpg_store['hpg_area_name'].astype('category')

# Convert holiday_flg to boolean, date to datetime, and calendar_date to string
holidays['holiday_flg'] = holidays['holiday_flg'].astype(bool)
holidays['date'] = pd.to_datetime(holidays['calendar_date'])
holidays['calendar_date'] = holidays['calendar_date'].astype(str)
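
As a quick sanity check (a small sketch), we can confirm that the reformatting produced the intended dtypes:

# Datetimes, categoricals, and the boolean holiday flag
print(air_visits.dtypes)
print(air_store.dtypes)
print(holidays.dtypes)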

4 Individual features

Here we have a first look at the distributions of the features in our individual data files before combining them for a more detailed analysis. This initial visualisation will be the foundation on which we build our analysis.

4.1 Air Visits

We start with the number of visits to the air restaurants. Here we plot the total number of visitors per day over the full training time range together with the median visitors per day of the week and month of the year:

Sys.setenv(LANG = "en_US.UTF-8")

p1 <- air_visits %>%
  group_by(visit_date) %>%
  summarise(all_visitors = sum(visitors)) %>%
  ggplot(aes(visit_date,all_visitors)) +
  geom_line(col = "blue") +
  labs(y = "All visitors", x = "Date")

p2 <- air_visits %>%
  ggplot(aes(visitors)) +
  geom_vline(xintercept = 20, color = "orange") +
  geom_histogram(fill = "blue", bins = 30) +
  scale_x_log10()

p3 <- air_visits %>%
  mutate(wday = wday(visit_date, label = TRUE, week_start = 1)) %>%
  group_by(wday) %>%
  summarise(visits = median(visitors)) %>%
  ggplot(aes(wday, visits, fill = wday)) +
  geom_col() +
  theme(legend.position = "none", axis.text.x  = element_text(angle=45, hjust=1, vjust=0.9)) +
  labs(x = "Day of the week", y = "Median visitors") +
  scale_fill_hue()
  
p4 <- air_visits %>%
  mutate(month = month(visit_date, label = TRUE)) %>%
  group_by(month) %>%
  summarise(visits = median(visitors)) %>%
  ggplot(aes(month, visits, fill = month)) +
  geom_col() +
  theme(legend.position = "none", axis.text.x  = element_text(angle=45, hjust=1, vjust=0.9)) +
  labs(x = "Month", y = "Median visitors")

layout <- matrix(c(1,1,1,1,2,3,4,4),2,4,byrow=TRUE)
multiplot(p1, p2, p3, p4, layout=layout)

We find:

  • There is an interesting long-term step structure in the overall time series. This might be related to new restaurants being added to the data base. In addition, we already see a periodic pattern that most likely corresponds to a weekly cycle.

  • The number of guests per visit per restaurant per day peaks at around 20 (the orange line). The distribution extends up to 100 and, in rare cases, beyond.

  • Friday and the weekend appear to be the most popular days, which is to be expected. Monday and Tuesday have the lowest average visitor numbers.

  • Also during the year there is a certain amount of variation. Dec appears to be the most popular month for restaurant visits. The period of Mar - May is consistently busy.

#locale.setlocale(locale.LC_ALL,'en_US.utf8')
# Set the language to English
#plt.rcParams['axes.formatter.use_locale'] = False
#plt.rcParams['pgf.rcfonts'] = False
#plt.rcParams['pdf.fonttype'] = 42

p1_data = air_visits.groupby('visit_date', observed=True)['visitors'].sum().reset_index()

p3_data = air_visits.assign(wday=air_visits['visit_date'].dt.day_name())\
.groupby('wday')['visitors'].median().reset_index()

p4_data = air_visits.assign(month=air_visits['visit_date'].dt.strftime('%b'))\
.groupby('month')['visitors'].median().reset_index()


plt.style.use('ggplot')
fig = plt.figure(figsize=(11, 6))
gs = fig.add_gridspec(2, 3)

# First row
ax1 = fig.add_subplot(gs[0, :])
sns.lineplot(data=p1_data, x='visit_date', y='visitors', color='blue', ax=ax1)
ax1.set_xlabel('Date')
ax1.set_ylabel('All visitors')

# Second row
ax2 = fig.add_subplot(gs[1, 0])
plt.axvline(x=20, color='orange')
plt.xscale('log')
sns.histplot(data=air_visits, x='visitors', bins=30, color='blue', kde=True, ax=ax2)

ax3 = fig.add_subplot(gs[1, 1])
order = [ "Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday"]
p3 = sns.barplot(data=p3_data, x='wday', y='visitors', hue='wday', 
                  legend=False, ax=ax3, order=order) #palette='viridis',
# Abbreviate the day labels after plotting, otherwise seaborn overwrites them
p3 = ax3.set_xticklabels(('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'))
p3 = plt.xticks(rotation=45)
#ax3.set_xticklabels(ax3.get_xticklabels(), rotation=45)
ax3.set_xlabel('Days')

ax4 = fig.add_subplot(gs[1, 2])
order = [ "Jan", "Feb", "Mar", "Apr", "May", "Jun","Jul",
"Aug", "Sep", "Oct", "Nov", "Dec"]
p4 = sns.barplot(data=p4_data, x='month', y='visitors', hue='month',
         legend=False, ax=ax4, order=order, palette='viridis')
p4 = plt.xticks(rotation=45)
#ax4.set_xticklabels(ax4.get_xticklabels(), rotation=45)
ax4.set_xlabel('Months')

plt.tight_layout()
plt.show()

  • In this code, we use add_gridspec to create a 2x3 grid of subplots.

  • Then, we manually add subplots to specific positions using add_subplot.

  • This way, you have full control over the layout of the plots.

We will be forecasting for the last week of April plus May 2017, so let’s look at this time range in our 2016 training data:

air_visits %>%
  filter(visit_date > ymd("2016-04-15") & visit_date < ymd("2016-06-15")) %>%
  group_by(visit_date) %>%
  summarise(all_visitors = sum(visitors)) %>%
  ggplot(aes(visit_date,all_visitors)) +
  geom_line() +
  geom_smooth(method = "loess", color = "blue", span = 1/7) +
  labs(y = "All visitors", x = "Date")
## `geom_smooth()` using formula = 'y ~ x'

Here, the black line is the data and the blue line corresponds to a smoothing fit with a corresponding grey confidence area. We see again the weekly period and also the impact of the aforementioned Golden Week, which in 2016 happened between Apr 29 and May 5.

Seaborn doesn’t have a direct equivalent to the geom_smooth function with the loess method. However, we can achieve a similar effect using the seaborn.lineplot function combined with a lowess smoother from the statsmodels library.

from statsmodels.nonparametric.smoothers_lowess import lowess
import matplotlib.dates as mdates
locale.setlocale(locale.LC_ALL,'en_US.utf8')
## 'en_US.utf8'
plt.style.use('ggplot')

# Assuming air_visits is your DataFrame
subset_data = air_visits[(air_visits['visit_date'] > '2016-04-15') &
                        (air_visits['visit_date'] < '2016-06-15')]
# Calculate the total visitors for each visit_date
agg_data = subset_data.groupby('visit_date')['visitors'].sum().reset_index()
#agg_data.visit_date = agg_data.visit_date.dt.strftime('%d %b')

# build the figure
fig, ax = plt.subplots()

# Line plot of the data
p = sns.lineplot(data = agg_data, x='visit_date', y='visitors',
                 color='black', label='Data')
#p= sns.regplot(data=agg_data, x='visit_date', y='visitors',
#                color='black', lowess=True, ax=plt.gca())

# set dates format
p = ax.set(xticklabels = agg_data['visit_date'].dt.strftime('%d %b'))

# Shaded band (illustrative +/-20% around the data), standing in for
# the grey confidence area of geom_smooth()
p = plt.fill_between(agg_data.visit_date, agg_data.visitors*0.8,
                        agg_data.visitors*1.2, 
                        alpha=0.25, label='LOWESS uncertainty',
                        color = "grey")


# put the labels at 45deg since they tend to be too long
#p = plt.xticks(rotation=45)
fig.autofmt_xdate() 

# Ensure that the x-axis tick labels are well spaced
p = ax.xaxis.set_major_locator(mdates.AutoDateLocator())

## Calculate the lowess smoothed line
smoothed = lowess(agg_data['visitors'], 
           agg_data['visit_date'].astype('datetime64[s]'),  frac=1/7)
## Plot the smoothed line
p = plt.plot(agg_data['visit_date'], smoothed[:, 1],
    color='blue', label='Lowess Smoother')

# Set labels and legend
p = plt.xlabel('Date')
p = plt.ylabel('All visitors')
plt.legend("")

# Show the plot
plt.show(p)

4.2 Air Reservations

Let’s see how our reservations data compares to the actual visitor numbers. We start with the air restaurants and visualise their visitor volume through reservations for each day, alongside the hours of these visits and the time between making a reservation and visiting the restaurant:

foo <- air_reserve %>%
  mutate(reserve_date = date(reserve_datetime),
         reserve_hour = hour(reserve_datetime),
         reserve_wday = wday(reserve_datetime, label = TRUE, week_start = 1),
         visit_date = date(visit_datetime),
         visit_hour = hour(visit_datetime),
         visit_wday = wday(visit_datetime, label = TRUE, week_start = 1),
         diff_hour = time_length(visit_datetime - reserve_datetime, unit = "hour"),
         diff_day = time_length(visit_datetime - reserve_datetime, unit = "day")
         )

p1 <- foo %>%
  group_by(visit_date) %>%
  summarise(all_visitors = sum(reserve_visitors)) %>%
  ggplot(aes(visit_date, all_visitors)) +
  geom_line() +
  labs(x = "Air visit date")

p2 <- foo %>%
  group_by(visit_hour) %>%
  summarise(all_visitors = sum(reserve_visitors)) %>%
  ggplot(aes(visit_hour, all_visitors)) +
  geom_col(fill = "blue")

p3 <- foo %>%
  filter(diff_hour < 24*5) %>%
  group_by(diff_hour) %>%
  summarise(all_visitors = sum(reserve_visitors)) %>%
  ggplot(aes(diff_hour, all_visitors)) +
  geom_col(fill = "blue") +
  labs(x = "Time from reservation to visit (hours)")

layout <- matrix(c(1,1,2,3),2,2, byrow=TRUE)
multiplot(p1, p2, p3, layout=layout)

# Assuming 'air_reserve' is your DataFrame
foo = air_reserve.copy()
foo['reserve_date'] = foo['reserve_datetime'].dt.date
foo['reserve_hour'] = foo['reserve_datetime'].dt.hour
foo['reserve_wday'] = foo['reserve_datetime'].dt.day_name()
foo['visit_date'] = foo['visit_datetime'].dt.date
foo['visit_hour'] = foo['visit_datetime'].dt.hour
foo['visit_wday'] = foo['visit_datetime'].dt.day_name()
foo['diff_hour'] = (foo['visit_datetime'] - \
                   foo['reserve_datetime']).dt.total_seconds() / 3600
foo['diff_day'] = (foo['visit_datetime'] - \
                   foo['reserve_datetime']).dt.days

p1 = foo.groupby('visit_date')['reserve_visitors'].sum().reset_index()
p2 = foo.groupby('visit_hour')['reserve_visitors'].sum().reset_index()
p3 = foo[foo['diff_hour'] < 24*5]\
    .groupby('diff_hour')['reserve_visitors'].sum().reset_index()

# Multiplot layout
fig = plt.figure(figsize=(9, 5))
gs = GridSpec(2, 2, figure=fig)
plt.style.use('ggplot')

ax1 = fig.add_subplot(gs[0, :])
ax2 = fig.add_subplot(gs[1, 0])
ax3 = fig.add_subplot(gs[1, 1])

p1.plot(x='visit_date', y='reserve_visitors', kind='line', ax=ax1, color = 'black')
p1= ax1.set_xlabel('air visit date')
p1= plt.legend("")

p2.plot(x='visit_hour', y='reserve_visitors', kind='bar', color='blue', ax=ax2)
p2= ax2.set_xticklabels(ax2.get_xticklabels(), rotation=0)
p2= ax2.xaxis.set_major_locator(mdates.AutoDateLocator())
p2= plt.legend("")

p3.plot(x='diff_hour', y='reserve_visitors', kind='bar', color='blue', ax=ax3)
p3= ax3.set_xlabel('Time from reservation to visit (hours)')
p3= ax3.set_xticklabels(ax3.get_xticklabels(), rotation=0)
p3= ax3.xaxis.set_major_locator(mdates.AutoDateLocator())
p3= ax3.xaxis.set_major_formatter(FormatStrFormatter('%.0f'))
p3= plt.legend("")

plt.tight_layout()
plt.show()

4.3 HPG Reservations

In the same style as above, here are the hpg reservations:

foo <- hpg_reserve %>%
  mutate(reserve_date = date(reserve_datetime),
         reserve_hour = hour(reserve_datetime),
         visit_date = date(visit_datetime),
         visit_hour = hour(visit_datetime),
         diff_hour = time_length(visit_datetime - reserve_datetime, unit = "hour"),
         diff_day = time_length(visit_datetime - reserve_datetime, unit = "day")
         )

p1 <- foo %>%
  group_by(visit_date) %>%
  summarise(all_visitors = sum(reserve_visitors)) %>%
  ggplot(aes(visit_date, all_visitors)) +
  geom_line() +
  labs(x = "HPG visit date")

p2 <- foo %>%
  group_by(visit_hour) %>%
  summarise(all_visitors = sum(reserve_visitors)) %>%
  ggplot(aes(visit_hour, all_visitors)) +
  geom_col(fill = "red")

p3 <- foo %>%
  filter(diff_hour < 24*5) %>%
  group_by(diff_hour) %>%
  summarise(all_visitors = sum(reserve_visitors)) %>%
  ggplot(aes(diff_hour, all_visitors)) +
  geom_col(fill = "red") +
  labs(x = "Time from reservation to visit (hours)")

layout <- matrix(c(1,1,2,3),2,2,byrow=TRUE)
multiplot(p1, p2, p3, layout=layout)

We find:

  • Here the visits after reservation follow a more orderly pattern, with a clear spike in Dec 2016. As above for the air data, we also see reservation visits dropping off as we get closer to the end of the time frame.

  • Again, most reservations are for dinner, and we see another nice 24-hour pattern for making these reservations. It’s worth noting that here the last few hours before the visit don’t see more volume than the 24 or 48 hours before. This is in stark contrast to the air data.

foo = hpg_reserve.copy()
foo['reserve_date'] = foo['reserve_datetime'].dt.date
foo['reserve_hour'] = foo['reserve_datetime'].dt.hour
foo['visit_date'] = foo['visit_datetime'].dt.date
foo['visit_hour'] = foo['visit_datetime'].dt.hour
foo['diff_hour'] = (foo['visit_datetime'] - foo['reserve_datetime']).dt.total_seconds() / 3600
foo['diff_day'] = (foo['visit_datetime'] - foo['reserve_datetime']).dt.days

p1 = foo.groupby('visit_date')['reserve_visitors'].sum().reset_index()
p2 = foo.groupby('visit_hour')['reserve_visitors'].sum().reset_index()
p3 = foo[foo['diff_hour'] < 24*5].groupby('diff_hour')['reserve_visitors'].sum().reset_index()

# Multiplot layout
fig = plt.figure(figsize=(9, 5))
gs = GridSpec(2, 2, figure=fig)
plt.style.use('ggplot')

ax1 = fig.add_subplot(gs[0, :])
ax2 = fig.add_subplot(gs[1, 0])
ax3 = fig.add_subplot(gs[1, 1])

p1.plot(x='visit_date', y='reserve_visitors', kind='line', color='black', ax=ax1)
p1= ax1.set_xlabel('HPG visit date')
#p1= ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45)
p1= ax1.xaxis.set_major_locator(mdates.AutoDateLocator())
p1= plt.legend("")

p2.plot(x='visit_hour', y='reserve_visitors', kind='bar', color='red', ax=ax2)
p2= ax2.set_xticklabels(ax2.get_xticklabels(), rotation=0)
p2= ax2.xaxis.set_major_locator(mdates.AutoDateLocator())
p2= plt.legend("")

p3.plot(x='diff_hour', y='reserve_visitors', kind='bar', color='red', ax=ax3)
p3= ax3.set_xlabel('Time from reservation to visit (hours)')
p3= ax3.set_xticklabels(ax3.get_xticklabels(), rotation=0)
p3= ax3.xaxis.set_major_locator(mdates.AutoDateLocator())
p3= ax3.xaxis.set_major_formatter(FormatStrFormatter('%.0f'))
p3= plt.legend("")


plt.tight_layout()
plt.show()

4.4 Air Store

After visualising the temporal aspects, let’s now look at the spatial information. Note that, according to the data description, the latitude and longitude are those of the area to which the store belongs. This is meant to discourage us from identifying specific restaurants. I would be surprised if nobody tried anyway, though.

This is a fully interactive and zoomable map of all the air restaurants. Click on the clusters to break them up into smaller clusters and ultimately into the individual restaurants, which are labelled by their genre. The map nicely visualises the fact that many restaurants share common coordinates, since those coordinates refer to the area of the restaurant. Click on the single markers to see their air_store_id. The map is powered by the leaflet package, which includes a variety of cool tools for interactive maps. Have fun exploring!

leaflet(air_store) %>%
  addTiles() %>%
  addProviderTiles("CartoDB.Positron") %>%
  addMarkers(~longitude, ~latitude,
             popup = ~air_store_id, label = ~air_genre_name,
             clusterOptions = markerClusterOptions())

In Python, you can achieve a similar map using the folium library.

import warnings
warnings.filterwarnings('ignore')

# Create a map centered at a specific location
m = folium.Map(location=[air_store['latitude'].median(),
               air_store['longitude'].median()],
               zoom_start=6, width=1500, height=800)

# Add tiles (you can choose different providers)
folium.TileLayer(tiles='CartoDB positron').add_to(m)
## <folium.raster_layers.TileLayer object at 0x7ff5ca4a1ae0>
# Create a MarkerCluster object
marker_cluster = MarkerCluster().add_to(m)

## Add markers from the air_store dataset
# for index, row in air_store.iterrows():
#     folium.Marker(location=[row['latitude'], row['longitude']],
#                   popup=row['air_store_id'],
#                   icon=None,
#                   tooltip=row['air_genre_name']).add_to(marker_cluster)

# Define a function to add markers
def add_marker(row):
    folium.Marker(location=[row['latitude'], row['longitude']],
                  popup=row['air_store_id'],
                  icon=None,
                  tooltip=row['air_genre_name']).add_to(marker_cluster)

# Apply the function to each row of the DataFrame
air_store.apply(add_marker, axis=1)
## 0      None
## 1      None
## 2      None
## 3      None
## 4      None
##        ... 
## 824    None
## 825    None
## 826    None
## 827    None
## 828    None
## Length: 829, dtype: object
# Display the map
m

Next, we plot the numbers of different types of cuisine (or air_genre_names) alongside the areas with the most air restaurants:

p1 <- air_store %>%
  group_by(air_genre_name) %>%
  count() %>%
  ggplot(aes(reorder(air_genre_name, n, FUN = min), n, fill = air_genre_name)) +
  geom_col() +
  coord_flip() +
  theme(legend.position = "none") +
  labs(x = "Type of cuisine (air_genre_name)", y = "Number of air restaurants")
  
p2 <- air_store %>%
  group_by(air_area_name) %>%
  count() %>%
  ungroup() %>%
  top_n(15,n) %>%
  ggplot(aes(reorder(air_area_name, n, FUN = min) ,n, fill = air_area_name)) +
  geom_col() +
  theme(legend.position = "none") +
  coord_flip() +
  labs(x = "Top 15 areas (air_area_name)", y = "Number of air restaurants")

layout <- matrix(c(1,2),2,1,byrow=TRUE)
multiplot(p1, p2, layout=layout)

We find:

  • There are lots of Izakaya gastropubs in our data, followed by cafes. We don’t have many karaoke places in the air data set, and also only a few that describe themselves as generically International or Asian. I have to admit, I’m kind of intrigued by creative cuisine.

  • Fukuoka has the largest number of air restaurants per area, followed by many Tokyo areas.

p1_data = air_store['air_genre_name'].value_counts().reset_index()
p1_data.columns = ['air_genre_name', 'count']
p1_data = p1_data.sort_values(by=['count'], ascending=False).reset_index(drop=True)

p2_data = air_store['air_area_name'].value_counts().nlargest(15).reset_index()
p2_data.columns = ['air_area_name', 'count']
p2_data = p2_data.sort_values(['count'], ascending=False).reset_index(drop=True)


# Multiplot layout
fig = plt.figure(figsize=(10, 5))
gs = GridSpec(2, 2, figure=fig)
plt.style.use('ggplot')

ax1 = fig.add_subplot(gs[0, :])
ax2 = fig.add_subplot(gs[1, :])

# Plot 1
p1= sns.barplot(data=p1_data, x='count', y='air_genre_name', 
                 order=p1_data['air_genre_name'],
                 palette='dark', ax=ax1, legend=False)

p1= ax1.set_xlabel('Number of air restaurants')
p1= ax1.set_ylabel("Type of cuisine")
#p1= ax1.set_title("Number of air restaurants by cuisine type")
#p1= ax1.set_xticklabels(ax1.get_xticklabels(), rotation=0)
#p1= ax1.yaxis.set_major_locator(mdates.AutoDateLocator())
#p1= plt.legend("")

# Plot 2
p2= sns.barplot(data=p2_data, x='count', y='air_area_name', 
                order=p2_data['air_area_name'],palette='bright',
                ax=ax2, legend=False, width=0.8)
p2= ax2.set_xlabel("Number of air restaurants")
p2= ax2.set_ylabel("Top 15 areas")
#p2= ax2.set_title("Top 15 areas with most air restaurants")
#p2= ax2.set_xticklabels(ax2.get_xticklabels(), rotation=0)
#p2= ax2.yaxis.set_major_locator(mdates.AutoDateLocator())
p2= plt.legend("")
plt.tight_layout()
plt.show()

4.5 HPG Store

In the same way as for the air stores above, we also create an interactive map for the different hpg restaurants:

leaflet(hpg_store) %>%
  addTiles() %>%
  addProviderTiles("CartoDB.Positron") %>%
  addMarkers(~longitude, ~latitude,
             popup = ~hpg_store_id, label = ~hpg_genre_name,
              clusterOptions = markerClusterOptions())
# Create a map centered at a specific location
m = folium.Map(location=[hpg_store['latitude'].median(), 
                         hpg_store['longitude'].median()],
                         zoom_start=7, 
                         width=1500, height=800)

# Add tiles (you can choose different providers)
folium.TileLayer(tiles='CartoDB positron').add_to(m)
## <folium.raster_layers.TileLayer object at 0x7ff5f9c829b0>
# Create a MarkerCluster object
marker_cluster = MarkerCluster().add_to(m)

## Also working
# hpg_store.apply(lambda row:folium.Marker(
#                   location=[row['latitude'], row['longitude']],                        
#                   popup=row['hpg_store_id'],icon=None,
#                   tooltip=row['hpg_genre_name']).add_to(marker_cluster), 
#                   axis=1)

# Define a function to add markers
def add_marker(row):
    folium.Marker(location=[row['latitude'], row['longitude']],
                  popup=row['hpg_store_id'],
                  icon=None,
                  tooltip=row['hpg_genre_name']).add_to(marker_cluster)

# Apply the function to each row of the DataFrame
hpg_store.apply(add_marker, axis=1)
## 0       None
## 1       None
## 2       None
## 3       None
## 4       None
##         ... 
## 4685    None
## 4686    None
## 4687    None
## 4688    None
## 4689    None
## Length: 4690, dtype: object
# Display the map
m

Here is the breakdown of genre and area for the hpg restaurants:

p1 <- hpg_store %>%
  group_by(hpg_genre_name) %>%
  count() %>%
  ggplot(aes(reorder(hpg_genre_name, n, FUN = min), n, fill = hpg_genre_name)) +
  geom_col() +
  coord_flip() +
  theme(legend.position = "none") +
  labs(x = "Type of cuisine (hpg_genre_name)", y = "Number of hpg restaurants")
  
p2 <- hpg_store %>%
  mutate(area = str_sub(hpg_area_name, 1, 20)) %>%
  group_by(area) %>%
  count() %>%
  ungroup() %>%
  top_n(15,n) %>%
  ggplot(aes(reorder(area, n, FUN = min) ,n, fill = area)) +
  geom_col() +
  theme(legend.position = "none") +
  coord_flip() +
  labs(x = "Top 15 areas (hpg_area_name)", y = "Number of hpg restaurants")

layout <- matrix(c(1,2),1,2,byrow=TRUE)
multiplot(p1, p2, layout=layout)

Here we have truncated the hpg_area_name to 20 characters to make the plot more readable.

We find:

  • The hpg description contains a larger variety of genres than in the air data. Here, Japanese style appears to contain many more places that are categorised more specifically in the air data. The same applies to International cuisine.

  • In the top 15 areas we again find Tokyo and Osaka to be prominently represented.

p1_data = hpg_store['hpg_genre_name'].value_counts().reset_index()
p1_data.columns = ['hpg_genre_name', 'count']
p1_data = p1_data.sort_values(by=['count'], ascending=False).reset_index(drop=True)

# Truncate area names to 20 characters, as in the R plot above
p2_data = hpg_store['hpg_area_name'].str.slice(0, 20)\
          .value_counts().nlargest(15).reset_index()
p2_data.columns = ['hpg_area_name', 'count']
p2_data = p2_data.sort_values(['count'], ascending=False).reset_index(drop=True)


# Multiplot layout
fig = plt.figure(figsize=(12, 8))
gs = GridSpec(2, 2, figure=fig)
plt.style.use('ggplot')

ax1 = fig.add_subplot(gs[:, 0])
ax2 = fig.add_subplot(gs[:, 1])

# Plot 1
p1= sns.barplot(data=p1_data, x='count', y='hpg_genre_name', order=p1_data['hpg_genre_name'],
                palette='dark', ax=ax1, legend=False)
p1= ax1.set_xlabel('Number of hpg restaurants')
p1= ax1.set_ylabel("Type of cuisine")
#p1= ax1.set_title("Number of hpg restaurants by cuisine type")
#p1= ax1.set_xticklabels(ax1.get_xticklabels(), rotation=0)
#p1= ax1.yaxis.set_major_locator(mdates.AutoDateLocator())
#p1= plt.legend("")

# Plot 2
p2= sns.barplot(data=p2_data, x='count', y='hpg_area_name', order=p2_data['hpg_area_name'],
                palette='bright', ax=ax2, legend=False, width=0.8)
p2= ax2.set_xlabel("Number of hpg restaurants")
p2= ax2.set_ylabel("Top 15 areas")
#p2= ax2.set_title("Top 10 areas with most hpg restaurants")
#p2= ax2.set_xticklabels(ax2.get_xticklabels(), rotation=0)
#p2= ax2.yaxis.set_major_locator(mdates.AutoDateLocator())
p2= plt.legend("")
plt.tight_layout()
plt.show()

4.6 Holidays

Let’s have a quick look at the holidays. We’ll plot how many there are in total and also how they are distributed during our prediction time range in 2017 and the corresponding time in 2016:

foo <- holidays %>%
  mutate(wday = wday(date, week_start = 1))

p1 <- foo %>%
  ggplot(aes(holiday_flg, fill = holiday_flg)) +
  geom_bar() +
  theme(legend.position = "none")
  
p2 <- foo %>%
  filter(date > ymd("2016-04-15") & date < ymd("2016-06-01")) %>%
  ggplot(aes(date, holiday_flg, color = holiday_flg)) +
  geom_point(size = 2) +
  theme(legend.position = "none") +
  labs(x = "2016 date")

p3 <- foo %>%
  filter(date > ymd("2017-04-15") & date < ymd("2017-06-01")) %>%
  ggplot(aes(date, holiday_flg, color = holiday_flg)) +
  geom_point(size = 2) +
  theme(legend.position = "none") +
  labs(x = "2017 date")

layout <- matrix(c(1,1,2,3),2,2,byrow=FALSE)
multiplot(p1, p2, p3, layout=layout)

We find:

  • The same days were holidays in late Apr / May in 2016 as in 2017.

  • There are about 7% holidays in our data:

holidays %>% summarise(frac = mean(holiday_flg))
##         frac
## 1 0.06769826
# Convert 'date' column to datetime if it's not already
holidays['date'] = pd.to_datetime(holidays['date'])

# Plot 1
p1_data = holidays['holiday_flg'].value_counts().reset_index()
p1_data.columns = ['holiday_flg', 'count']
p2_data = holidays[(holidays['date'] > '2016-04-15') & (holidays['date'] < '2016-06-01')]
p2_data['holiday_flg'] = p2_data['holiday_flg'].astype('category')
p3_data = holidays[(holidays['date'] > '2017-04-15') & (holidays['date'] < '2017-06-01')]


# Multiplot layout
fig = plt.figure(figsize=(6, 4))
gs = GridSpec(2, 2, figure=fig)
plt.style.use('ggplot')
locale.setlocale(locale.LC_ALL,'en_US.utf8')
## 'en_US.utf8'
ax1 = fig.add_subplot(gs[:, 0])
ax2 = fig.add_subplot(gs[0, 1])
ax3 = fig.add_subplot(gs[1, 1])


# Plot1
p1= sns.barplot(data=p1_data, x='holiday_flg', y='count', palette='viridis',  
            ax=ax1, legend=False)
p1= ax1.set_xlabel('Holiday Flag')
#p1= ax1.set_title("Distribution of Holiday Flag")
p1= plt.legend("")

# Plot 2
p2= sns.scatterplot(data=p2_data, x='date', y='holiday_flg', hue='holiday_flg', 
                     palette='viridis', ax=ax2, legend=False)
p2.set_yticks([1.0, 0.0], ["True","False"])
p2= ax2.set_xlabel('2016')
p2= ax2.set_title("2016")
#p2 = ax2.set(xticklabels = p2_data['date'].dt.strftime('%d %b'))
p2= fig.autofmt_xdate() 
p2= ax2.set_ylabel('')
p2= plt.legend("")
p2= ax2.xaxis.set_major_locator(mdates.AutoDateLocator())

# Plot 3
p3= sns.scatterplot(data=p3_data, x='date', y='holiday_flg', 
     hue='holiday_flg', palette='viridis', ax=ax3, legend=False)
p3.set_yticks([1.0, 0.0], ["True","False"])
p3 = ax3.set(xticklabels = p3_data['date'].dt.strftime('%d %b'))
p3= fig.autofmt_xdate()
p3= ax3.set_ylabel('')
p3= ax3.xaxis.set_major_locator(mdates.AutoDateLocator())
p3= plt.xlabel("2017")
#p3= plt.title("Holiday Flag in 2017")
p3= plt.legend("")


plt.tight_layout( pad=0.4, w_pad=0.5, h_pad=1)
plt.show()

frac = holidays['holiday_flg'].mean()
print(frac.round(3))
## 0.068

4.7 Test data set

The following plot visualizes the time range of the train vs test data sets:

foo <- air_visits %>%
  rename(date = visit_date) %>%
  distinct(date) %>%
  mutate(dset = "train")

bar <- test %>%
  separate(id, c("foo", "bar", "date"), sep = "_") %>%
  mutate(date = ymd(date)) %>%
  distinct(date) %>%
  mutate(dset = "test")

foo <- foo %>%
  bind_rows(bar) %>%
  mutate(year = year(date))
year(foo$date) <- 2017

foo %>%
  filter(!is.na(date)) %>%
  mutate(year = fct_relevel(as.factor(year), c("2017","2016"))) %>%
  ggplot(aes(date, year, color = dset)) +
  geom_point(shape = "|", size = 10) +
  scale_x_date(date_labels = "%B", date_breaks = "1 month") +
  #scale_y_reverse() +
  theme(legend.position = "bottom", axis.text.x  = element_text(angle=45, hjust=1, vjust=0.9)) +
  labs(color = "Data set") +
  guides(color = guide_legend(override.aes = list(size = 4, pch = 15)))

# Renaming and selecting distinct dates for 'foo'
foo = air_visits.rename(columns={'visit_date': 'date'}).drop_duplicates(subset=['date'])
foo['dset'] = 'train'

# Processing 'bar'
test['date'] = pd.to_datetime(test['id'].str.split('_').str[-1])
bar = test[['date']].drop_duplicates()
bar['dset'] = 'test'

# Combining 'foo' and 'bar'
foo = pd.concat([foo, bar], ignore_index=True)

# Setting 'year' column
foo['year'] = foo['date'].dt.strftime('%Y').astype('category')

# Filtering out NA dates
foo = foo.dropna(subset=['date'])

# Adding a 'Months' column
foo['Months'] = foo['date'].dt.strftime('%B')

# Set up the plot
plt.figure(figsize=(8, 4))
#p= sns.set(rc={'figure.figsize':(4,2)})
plt.style.use('ggplot')

# Create a scatter plot using Seaborn
p= sns.scatterplot(data=foo, x='Months', y= 'year' , hue='dset',
                   #palette={'train': 'blue', 'test': 'red'}, 
                   marker='*',
                   s=250)
# Format the tick labels to display the month name
#p.invert_yaxis()
p= plt.xticks(rotation=15)
p= plt.tick_params(axis='x', pad=-5, labelsize=8)
plt.margins(y=1)
plt.legend(bbox_to_anchor=(0.99, 0.01), loc='lower right', borderaxespad=0)
plt.show(p)

5 Feature relations

After looking at every data set individually, let’s get to the real fun and start combining them. This will tell us something about the relations between the various features and how these relations might affect the visitor numbers. Any signal we find will need to be interpreted in the context of the individual feature distributions, which is why studying those was one of our first steps.

5.1 Visitors per genre

Our first plot of the multi-feature space deals with the average number of air restaurant visitors broken down by type of cuisine; i.e. the air_genre_name. We use a facet plot to distinguish the time series for the 14 categories. Note the logarithmic y-axis:

foo <- air_visits %>%
  left_join(air_store, by = "air_store_id")

foo %>%
  group_by(visit_date, air_genre_name) %>%
  summarise(mean_visitors = mean(visitors), .groups = "keep") %>%
  ungroup() %>%
  ggplot(aes(visit_date, mean_visitors, color = air_genre_name)) +
  geom_line() +
  labs(y = "Average number of visitors to 'air' restaurants", x = "Date") +
  theme(legend.position = "none") +
  scale_y_log10() +
  facet_wrap(~ air_genre_name)

We find:

  • The mean values range between 10 and 100 visitors per genre per day. Within each category, the long-term trend looks reasonably stable. There is an upward trend for Creative Cuisine and Okonomiyaki et al., while the popularity of Asian food has been declining since late 2016.

  • The low-count time series like Karaoke or Asian (see Fig. 6) are understandably more noisy than the genres with higher numbers of visitors. Still, Asian restaurants appear to be very popular despite (or because of?) their rarity.

# Merge 'air_visits' and 'air_store' on the column 'air_store_id'
foo = pd.merge(air_visits, air_store, on='air_store_id', how='left')

#Grouping and summarizing data
summary_data = foo.groupby(['visit_date', 'air_genre_name'],
                           observed=True)['visitors'].mean().reset_index()

fig = plt.figure(figsize = (4,2))
plt.style.use('ggplot')

# Set up the FacetGrid
g = sns.FacetGrid(summary_data, col="air_genre_name",
                  col_wrap=4, height=2, aspect= 1,
                  sharex=True)

# Create line plots for each genre
p = g.map_dataframe(sns.lineplot, x='visit_date', 
                y='visitors', hue='air_genre_name')
                
# Resize facet titles
p= g.set_titles(col_template="{col_name}", 
               size = 8, backgroundcolor='#DDDDDD', pad=2
           # bbox={'facecolor': '#DDDDDD', 'alpha': 0, 'pad': 0}
            )
p= g.tick_params(axis='x', pad=0, labelsize= 7)
p= g.set_xlabels(fontsize=7, label="Date")
p= g.set_ylabels(fontsize=8 )#,label="Average number of visitors")
# Set labels and title
#g.set_axis_labels(x_var="Date",y_var= "Average number of visitors")
#g.add_legend(title="gendre")

for ax in g.axes.flat:
    for label in ax.get_xticklabels():
        label.set_rotation(45)

# Apply logarithmic scale to y-axis
p= g.set(yscale="log")
# Adjust the layout
p= g.tight_layout()

# Show the plot
plt.show(p)

In all of the curves we see the familiar weekly variation, so let’s look in more detail at the mean visitor numbers per week day and genre. We also add to this a series of ridgeline plots via the ggridges package. Ridgeline plots allow for a quick comparison of semi-overlapping (density) curves. Here we show the distribution of visitors per day for each genre. Through a little bit of ggplot magic, the y-axis labels in between those two plots refer to both of them:

p1 <- foo %>%
  mutate(wday = wday(visit_date, label = TRUE, week_start = 1)) %>%
  group_by(wday, air_genre_name) %>%
  summarise(mean_visitors = mean(visitors)) %>%
  ggplot(aes(air_genre_name, mean_visitors, color = wday)) +
  geom_point(size = 4) +
  theme(legend.position = "left", axis.text.y = element_blank(),
        plot.title = element_text(size = 14)) +
  coord_flip() +
  labs(x = "") +
  scale_x_discrete(position = "top") +
  ggtitle("air_genre_name") +
  scale_color_hue()
## `summarise()` has grouped output by 'wday'. You can override using the
## `.groups` argument.
p2 <- foo %>%
  ggplot(aes(visitors, air_genre_name, fill = air_genre_name)) +
  geom_density_ridges(bandwidth = 0.1) +
  scale_x_log10() +
  theme(legend.position = "none") +
  labs(y = "") +
  scale_fill_cyclical(values = c("blue", "red"))

layout <- matrix(c(1,1,2,2,2),1,5,byrow=TRUE)
multiplot(p1, p2, layout=layout)

Here each colour corresponds to a day of the week. Reddish colours mark the weekend, while the cooler colours mark the middle of the week. Monday is a dark orange.

We find:

  • The biggest difference between weekend and weekdays exists for the Karaoke bars, which rule the weekend. A similar trend, although with a considerably smaller gap, can be seen for the International cuisine.

  • No genre really goes against the trend of busier weekends. The smallest variations are in the generic Other category, the Japanese food, and also the Korean cuisine which is the only category where Fridays are the busiest days. General Bars/Cocktail are notably unpopular overall.

  • The density curves confirm the impression we got from the week-day distribution: the Asian restaurants rarely have fewer than 10 visitors per day and the Karaoke places show a very broad distribution due to the strong impact of the weekends. Note the logarithmic x-axis.

import joypy  # joypy.joyplot is used for the ridgeline plot further below

# Extracting weekday
foo['wday'] = foo['visit_date'].dt.day_name()

# Data Processing for p1
p1_data = foo.groupby(['wday', 'air_genre_name'], observed=True)\
           .agg(mean_visitors=('visitors', 'mean')).reset_index()

# Data Processing for p2 (Assuming 'foo' DataFrame is available)
p2_data = foo.copy()
# Apply necessary transformations to 'p2_data' DataFrame

# Create a 2x2 grid
fig = plt.figure(figsize=(12, 8))
gs = GridSpec(1,2 , figure=fig)
ax1 = fig.add_subplot(gs[0])
ax2 = fig.add_subplot(gs[1])

p1= sns.scatterplot(data=p1_data,
                y='air_genre_name',
                x='mean_visitors',
                hue='wday',
                size=4,
                #palette='viridis', 
                ax=ax1)
# Set the figure size using rcParams
p1= ax1.set_title('air_genre_name', fontsize=8)
p1= ax1.set_xlabel('')
p1= ax1.set_ylabel('')
p1= ax1.legend(title='wday', loc='center right')

## Plot 2 (Top Right)
p2= sns.kdeplot(
            data=p2_data,
            x='visitors',
            hue='air_genre_name',
            fill=True,
            common_norm=False,
            levels=30,
            #label="genre",
            ax=ax2)
p2= ax2.set_xlabel('')
p2= ax2.set_ylabel('')
p2= ax2.set_title('')
p2= ax2.legend(loc='lower right')
p2= ax2.set_xlim(-50, 200)
#p2= ax2.get_legend().remove()

# Adjust layout
plt.tight_layout()
# Show the plots
plt.show()

# Data Processing for p2 (Assuming 'foo' DataFrame is available)
p2_data = foo.copy()

fig = plt.figure(figsize=(10, 8))

# Plot 2 (Bottom)
pal = sns.cubehelix_palette(10, rot=-.25, light=.7)
g = sns.FacetGrid(p2_data, 
                 row="air_genre_name", 
                 hue="air_genre_name", #palette=pal,
                 aspect=15, height=.5
                 )
p= g.map(sns.kdeplot, 'visitors',
      bw_adjust=.5, clip_on=False,
      fill=True, alpha=1, linewidth=.5)
      
p= g.map(sns.kdeplot, 'visitors', clip_on=False, color="black", lw=2, bw_adjust=.5)

# Define and use a simple function to label the plot in axes coordinates
def label(x, color, label):
    ax = plt.gca()
    ax.text(0, .2, label,  color=color, #fontweight="bold",
            ha="right", va="center", transform=ax.transAxes)

p= g.map(label, "visitors")

p= g.set(xlim=(-50, 1000))
p= g.set_titles("")
p= g.set(yticks=[], ylabel="")
p= g.despine(bottom=True, left=True)
p= g.fig.subplots_adjust(hspace=-.25)

# Adjust layout
plt.tight_layout()
# Show the plots
plt.show(p)

fig, axes = plt.subplots(1, 1, figsize=(10, 4))

fig, axes = joypy.joyplot(
    p2_data,
    by='air_genre_name',
    column='visitors',
    linewidth=1,
    colormap=plt.cm.viridis,
    grid="both", #'x', 'y'
    xlabels=True,
    legend=False,
    fade=True
)
p= axes[-1].set_xlim(0, 500)
# Show the plot
plt.show(p)

5.2 The impact of holidays

We will study the influence of holidays on our visitor numbers by comparing the statistics for days flagged as holidays vs those without a holiday flag:

foo <- air_visits %>%
  mutate(calendar_date = as.character(visit_date)) %>%
  left_join(holidays, by = "calendar_date")

p1 <- foo %>%
  ggplot(aes(holiday_flg, visitors, color = holiday_flg)) +
  geom_boxplot() +
  scale_y_log10() +
  theme(legend.position = "none")

p2 <- foo %>%
  mutate(wday = wday(date, label = TRUE, week_start = 1)) %>%
  group_by(wday, holiday_flg) %>%
  summarise(mean_visitors = mean(visitors), .groups = "keep") %>%
  ggplot(aes(wday, mean_visitors, color = holiday_flg)) +
  geom_point(size = 4) +
  theme(legend.position = "none") +
  labs(y = "Average number of visitors")

layout <- matrix(c(1,2),1,2,byrow=TRUE)
multiplot(p1, p2, layout=layout)

We find:

  • Overall, holidays don’t have any impact on the average visitor numbers (left panel). As so often, more information is hidden in the details.

  • While a weekend holiday has little impact on the visitor numbers, and even decreases them slightly, there is a much more pronounced effect for the weekdays; especially Monday and Tuesday (right panel).

from matplotlib import pyplot as plt, dates

# Data Processing
foo = air_visits.copy()
#foo['calendar_date'] = foo['visit_date'].dt.strftime('%Y-%m-%d')
foo['calendar_date'] = foo['visit_date'].astype(str)
#foo = foo.merge(holidays, left_on='calendar_date', right_on='date', how='left')
foo = pd.merge(foo, holidays, left_on='calendar_date', 
                how='left', right_on = 'calendar_date')

# Data Processing for p1
p1_data = foo.copy()
p1_data['holiday_flg'] = p1_data['holiday_flg'].astype(bool)  # Convert to boolean
p1_data['wday'] = p1_data['visit_date'].dt.day_name()  # Extract weekday

# Data Processing for Plot 2
p2_data = foo.copy()
order = ["Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday", "Sunday"]
p2_data['wday'] = pd.Categorical(p2_data['visit_date'].dt.day_name(),
                                 categories= order, ordered=True)
# Extract weekday as a number (0-6)
p2_data['wdayn'] = p2_data['date'].dt.dayofweek    
p2_data = p2_data.groupby(['wday', 'holiday_flg', 'wdayn'], 
          observed=True).agg(mean_visitors=('visitors', 'mean')).reset_index()
p2_data['wdayn'] = pd.Categorical(p2_data['wdayn'], 
                  categories=[0,1,2,3,4,5,6], ordered=True)#Sort weekdays
#p2_data['wdayn'] = p2_data['wdayn'].dt.day_name()

# Create a 1x2 grid
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
#locale.setlocale(locale.LC_ALL,'en_US.utf8')
plt.style.use('ggplot')

# Plot 1 (p1)
flierprops = dict(marker='o', markerfacecolor='r', markersize=10,
                  linestyle='none', markeredgecolor='b')
sns.boxplot(data=p1_data, x='holiday_flg', y='visitors',
              hue='holiday_flg', 
              #flierprops=flierprops,
              fill=False,
              #dodge=False, 
              showfliers = True,
              ax=axes[0])
axes[0].set_yscale('log')
#axes[0].set_title('Boxplot of Visitors on Holidays')
#axes[0].set_xlabel('Holiday')
axes[0].set_ylabel('Visitors (log scale)')
axes[0].legend('')

# Plot 2 (p2)
sns.scatterplot(data=p2_data, x='wday', y='mean_visitors', 
                hue='holiday_flg', 
                ax=axes[1], s=50)
#axes[1].set_title('Average Visitors by Weekday')
axes[1].set_xlabel('wday')
axes[1].set_ylabel('Average Visitors')
axes[1].xaxis.set_ticks(order)
axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=25)
#axes[1].xaxis.set_major_formatter(dates.DateFormatter('%a'))
axes[1].legend('')

# Adjust layout
plt.tight_layout()
# Show the plots
plt.show()

5.3 Restaurants per area and the effect on visitor numbers

Our next exploration follows from a simple thought: if gastropubs are popular and we own the only gastropub in the area then we can expect lots of customers. If there are twelve other gastropubs in our street then, try as we might, some of those customers will venture into other establishments. Economists tell us that we can ultimately expect a convergence towards an equilibrium between supply and demand. But for snapshots like our data set, and for relatively localised areas, there might still be merit in investigating restaurant clustering. Therefore, let’s study the number of restaurants of a certain genre per area and their impact on visitor numbers.

We begin with an overview plot of the frequency of certain genres per area for the two data sets of air and hpg stores. This could well have been part of the previous chapter, but I only just thought about it ;-) . The following count plots show which genres exist in which areas (names truncated). The size of the dots is proportional to the number of cases:

air_store %>%
  mutate(area = str_sub(air_area_name, 1, 12)) %>%
  ggplot(aes(area, air_genre_name)) +
  geom_count(colour = "blue") +
  theme(legend.position = "bottom", axis.text.x  = element_text(angle=45, hjust=1, vjust=0.9))

We find:

  • Some areas have lots of restaurants and much variety, whereas others contain only a single air restaurant. Large parts of the parameter space are empty.

  • Similarly, some cuisines like Izakaya or Cafe are pretty ubiquitous, whereas others can only be found in a few areas. Note that the only 2 Karaoke bars in the air sample are in Hokkaido Sapporo-shi Minami 3 Jonishi, whereas the only 2 International cuisine restaurants as well as the only two Asian places can be found in Tokyo-to Shibuya-ku Shibuya.

The same kind of plot for the hpg data looks similar, albeit busier due to the larger number of genres:

hpg_store %>%
  mutate(area = str_sub(hpg_area_name, 1, 10)) %>%
  ggplot(aes(area, hpg_genre_name)) +
  geom_count(colour = "red") +
  theme(legend.position = "bottom", axis.text.x  = element_text(angle=45, hjust=1, vjust=0.9))

We find:

  • Here, too, there are busy areas and those with only a few restaurants. Unsurprisingly, Tokyo features prominently among the areas with a lot of culinary diversity.

  • Japanese style and International cuisine are popular pretty much everywhere. Amusement bars and Udon/Soba places are rare, as are Shanghai food or Dim Sum.

# Create a new column 'area' by extracting the first 12 characters from 'air_area_name'
air_store['area'] = air_store['air_area_name'].str[:12]
#Count the frequency of each combination of 'area' and 'air_genre_name'
counts = air_store.groupby(['area', 'air_genre_name'], 
                    observed=True).size().reset_index(name='count')
# Plotting
fig = plt.figure(figsize=(8, 4))
p= sns.scatterplot(data=counts, x='area', y='air_genre_name',
                 size='count', color='blue', legend=True)
# Rotate x-axis labels for better visibility
p= plt.xticks(rotation=45, ha='right')
# Set labels and title
p= plt.xlabel('Area')
p= plt.ylabel('Air Genre Name')
p= plt.tick_params(axis='x', pad=0, labelsize=7)
# Adjust legend position
p= plt.legend(bbox_to_anchor = (1,1),loc='upper left')

# Show the plot
plt.show(p)

# Create a new column 'area' by extracting the first 10 characters from 'hpg_area_name'
hpg_store['area'] = hpg_store['hpg_area_name'].str[:10]
# Count the frequency of each combination of 'area' and 'hpg_genre_name'
counts = hpg_store.groupby(['area', 'hpg_genre_name'], observed=True).size().reset_index(name='count')
# Plotting
fig= plt.figure(figsize=(8, 4))
p= sns.scatterplot(data=counts, x='area', y='hpg_genre_name',
                 size='count', color='red', legend=True)
# Rotate x-axis labels for better visibility
p= plt.xticks(rotation=45, ha='right')
# Set labels and title
p= plt.xlabel('Area')
p= plt.ylabel('HPG Genre Name')
p= plt.tick_params(axis='x', pad=0, labelsize=7)
p= plt.tick_params(axis='y', pad=0, labelsize=7)
# Adjust legend position
p= plt.legend(bbox_to_anchor = (1,1),loc='upper left')

# Show the plot
plt.show(p)

The count plots tell us that there is a distribution of how many restaurants of a certain genre can be found per area. Here we look at these distributions in detail via boxplots with overlaid jitter plots. The genres are ordered by decreasing mean cases per area, i.e. the mean of a horizontal sequence of dots in a count plot. Then we overlay the individual data points and assign each dot a random jitter to visually separate otherwise overlapping data. Here, the y axis (i.e. occurrences per area) corresponds to the size of the dots in the count plots above. We’re using single plots here, instead of panels, because these plots are quite detailed. Note the logarithmic y-axes.

We start with the air data:

air_store %>%
  group_by(air_genre_name, air_area_name) %>%
  count() %>%
  ggplot(aes(reorder(air_genre_name, n, FUN = mean), n)) +
  geom_boxplot() +
  geom_jitter(color = "blue") +
  scale_y_log10() +
  coord_flip() +
  labs(x = "Air genre", y = "Occurences per air area")

We find:

  • Only a few genres have medians of more than 2 restaurants per area. Examples are Italian/French restaurants or Bar/Cocktail places, which are more likely to be found in groups of more than 2 per area.

  • For the majority of genres the distribution is firmly clustered around 2 cases per area with a bit of scatter towards higher numbers. Cafes have the highest number with 26 occurrences in a single area (Fukuoka-ken Fukuoka-shi Daimyō).

  • Curiously, the minimum here is 2, not 1. This means that there is no air restaurant that is the only one of its genre in any area. You can see that also on the map in Fig. 5: whenever there is a group of 2 restaurants at the same area coordinates they have the same genre (but different air_store_ids). Similarly, there are no single occurrences of a genre in any larger group per area. (There exist odd numbers of occurrences though, e.g. 3 or 5 of the same genre per area.) I’m not quite sure what to make of that, since it seems too perfectly matched to be true. Could it have something to do with the way the air data have been selected? Might it be a bug in assigning genre names? I checked a few examples of 2 restaurants in one spot and they have different visitor numbers on the same day, e.g. for this pair of Dining Bars, indicating that we don’t simply have a problem of single entries being duplicated here:

air_store %>%
  filter(air_store_id == "air_b5598d12d1b84890" | air_store_id == "air_bbe1c1a47e09f161")
##            air_store_id air_genre_name                 air_area_name latitude
## 1: air_bbe1c1a47e09f161     Dining bar Tōkyō-to Setagaya-ku Kitazawa 35.66266
## 2: air_b5598d12d1b84890     Dining bar Tōkyō-to Setagaya-ku Kitazawa 35.66266
##    longitude
## 1:  139.6683
## 2:  139.6683
air_visits %>%
  filter(air_store_id == "air_b5598d12d1b84890" | air_store_id == "air_bbe1c1a47e09f161") %>%
  arrange(visit_date) %>%
  head(10) %>%
  DT::datatable()
# Group by 'air_genre_name' and 'air_area_name', and count the frequency
counts = air_store.groupby(['air_genre_name', 'air_area_name'], 
          observed=True).size().reset_index(name='count')

# Reorder 'air_genre_name' by count
order = counts.groupby('air_genre_name', 
        observed=True)['count'].mean().sort_values().index

# Plotting
fig= plt.figure(figsize=(8, 6))
p= sns.boxplot(data=counts, y='air_genre_name', x='count', 
   order=order, hue='air_genre_name' ,palette='viridis')
p= sns.stripplot(data=counts, y='air_genre_name', x='count',
    order=order, color='blue', size=3)
# Set labels and title
p= plt.xlabel('Occurences per air area')
p= plt.ylabel('Air genre')
p= plt.xscale('log')

# Show the plot
plt.show(p)

air_store[air_store['air_store_id'].isin(["air_b5598d12d1b84890", 
                                      "air_bbe1c1a47e09f161"])]
##              air_store_id air_genre_name  ...   longitude          area
## 202  air_bbe1c1a47e09f161     Dining bar  ...  139.668268  Tōkyō-to Set
## 203  air_b5598d12d1b84890     Dining bar  ...  139.668268  Tōkyō-to Set
## 
## [2 rows x 6 columns]
selected_visits = air_visits[air_visits['air_store_id'].isin(["air_b5598d12d1b84890", 
                                                            "air_bbe1c1a47e09f161"])]
selected_visits.sort_values(by='visit_date').head(10)
##                 air_store_id visit_date  visitors
## 239978  air_bbe1c1a47e09f161 2016-07-01         7
## 250021  air_b5598d12d1b84890 2016-07-01         5
## 250022  air_b5598d12d1b84890 2016-07-02        14
## 250023  air_b5598d12d1b84890 2016-07-03         6
## 250024  air_b5598d12d1b84890 2016-07-04         6
## 250025  air_b5598d12d1b84890 2016-07-05         5
## 250026  air_b5598d12d1b84890 2016-07-06         5
## 239979  air_bbe1c1a47e09f161 2016-07-07         1
## 250027  air_b5598d12d1b84890 2016-07-08        10
## 239980  air_bbe1c1a47e09f161 2016-07-08         7
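
As a quick cross-check of the “minimum of 2” observation above, here is a small additional sketch (not part of the original analysis; it assumes the air_store data frame loaded earlier) that counts the smallest group size per genre/area combination:

# Sanity check (sketch): group sizes per (genre, area) combination in the air data
air_genre_area = air_store.groupby(['air_genre_name', 'air_area_name'],
                                   observed=True).size()
print(air_genre_area.min())          # expected to be 2, never 1
print((air_genre_area == 1).sum())   # expected number of single occurrences: 0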

Now we look at the same distribution for the HPG restaurants:

foobar <- hpg_store %>%
  group_by(hpg_genre_name, hpg_area_name) %>%
  count()

foobar %>%
  ggplot(aes(reorder(hpg_genre_name, n, FUN = mean), n)) +
  geom_boxplot() +
  geom_jitter(color = "red") +
  scale_y_log10() +
  coord_flip() +
  labs(x = "hpg genre", y = "Cases per hpg area")

We find:

  • Here we clearly have a minimum of 1 restaurant per genre and area, and also much more variety in the median number of cases due to the higher overall numbers.

  • The most extreme genre is Japanese style for which the median is just above 10 restaurants per area. Alongside this, there are a number of other genres for which the lower box hinge is not touching the minimum of 1 case per area.

foobar = hpg_store.groupby(['hpg_genre_name', 'hpg_area_name'], 
                      observed=True).size().reset_index(name='n')

# Reorder 'hpg_genre_name' by count
#order = counts.groupby('hpg_genre_name', 
#observed=True)['n'].mean().sort_values().index
order = foobar.groupby('hpg_genre_name', 
        observed=True)['n'].mean().sort_values(ascending=False).index


fig= plt.figure(figsize=(8, 6))

# Create the boxplot with jitter
sns.boxplot(data=foobar, y='hpg_genre_name', x='n', 
                       order=order, color='lightgrey')
sns.stripplot(data=foobar, y='hpg_genre_name', x='n',
                    order=order, color='red', alpha=0.7)

# Set the count axis (x) to log scale, matching the R plot
plt.xscale('log')

# Set labels and title
plt.xlabel('Cases per hpg area')
plt.ylabel('hpg genre')
plt.tick_params(axis='y', pad=0, labelsize=7)
# Rotate x-axis labels
#plt.xticks(rotation=45)

# Show the plot
plt.show()

Using the information on the number of genres in each area we can now proceed to quantify the clustering, or crowdedness, of our data set and relate it to the visitor numbers. The next plot first shows the overall distribution of the air and hpg data points from the last two plots (i.e. cases of the same genre per area).

In addition, we estimate the mean visitor numbers for each clustering case. For this, we first take the mean of the log1p-transformed visitor numbers (per genre and area) and then compute the mean and standard deviation of these numbers for each case, i.e. each number of occurrences of the same genre in an area. (log1p means log(x+1) and is intended to make the visitor number distribution more normal; see Fig. 1.)
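
As a small aside (an addition, not part of the original analysis), here is a minimal illustration of the log1p transform and its exact inverse expm1, assuming numpy is available as np as in the other Python chunks:

# log1p(x) = log(x + 1); expm1 is its exact inverse, even at x = 0
visitors_example = np.array([0, 1, 9, 99])
print(np.log1p(visitors_example))             # [0.  0.6931  2.3026  4.6052] (rounded)
print(np.expm1(np.log1p(visitors_example)))   # recovers [ 0.  1.  9. 99.]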

foo <- air_visits %>%
  left_join(air_store, by = "air_store_id")

bar <- air_store %>%
  group_by(air_genre_name, air_area_name) %>%
  count()

foobar <- hpg_store %>%
  group_by(hpg_genre_name, hpg_area_name) %>%
  count()

p1 <- bar %>%
  ggplot(aes(n)) +
  geom_histogram(fill = "blue", binwidth = 1) +
  labs(x = "Air genres per area")

p2 <- foobar %>%
  ggplot(aes(n)) +
  geom_histogram(fill = "red", binwidth = 1) +
  labs(x = "HPG genres per area")

p3 <- foo %>%
  group_by(air_genre_name, air_area_name) %>%
  summarise(mean_log_visit = mean(log1p(visitors)), .groups = "keep") %>%
  left_join(bar, by = c("air_genre_name","air_area_name")) %>%
  group_by(n) %>%
  summarise(mean_mlv = mean(mean_log_visit),
            sd_mlv = sd(mean_log_visit)) %>%
  replace_na(list(sd_mlv = 0)) %>%
  ggplot(aes(n, mean_mlv)) +
  geom_point(color = "blue", size = 4) +
  geom_errorbar(aes(ymin = mean_mlv - sd_mlv, ymax = mean_mlv + sd_mlv), 
                width = 0.5, linewidth = 0.7, color = "blue") +
  labs(x = "Cases of identical Air genres per area", 
       y = "Mean +/- SD of\n mean log1p visitors")

layout <- matrix(c(1,2,3,3),2,2,byrow=TRUE)
multiplot(p1, p2, p3, layout=layout)

# Compute 'foo'
foo = air_visits.merge(air_store, on='air_store_id', how='left')

# Compute 'bar'
bar = air_store.groupby(['air_genre_name', 'air_area_name'], 
                            observed=True).size().reset_index(name='n')

# Compute 'foobar' (Assuming 'hpg_store' DataFrame is available)
foobar = hpg_store.groupby(['hpg_genre_name', 'hpg_area_name'], 
                              observed=True).size().reset_index(name='n')

p3_data = (
    foo.groupby(['air_genre_name', 'air_area_name'], observed=True)
    .agg(mean_log_visit=('visitors', lambda s: np.log1p(s).mean()))
    .reset_index()
    .merge(bar, left_on=['air_genre_name', 'air_area_name'], 
                right_on=['air_genre_name', 'air_area_name'], how='left')
    .groupby('n', observed=True)
    .agg(
        mean_mlv=('mean_log_visit', 'mean'),
        sd_mlv=('mean_log_visit', 'std')
    )
    .fillna(0)
    .reset_index()
)
# Multiplot layout
fig = plt.figure(figsize=(10, 5))
gs = GridSpec(2, 2, figure=fig)
plt.style.use('ggplot')

# Assign each plot to a specific position
ax1 = fig.add_subplot(gs[0, 0])
ax2 = fig.add_subplot(gs[0, 1])
ax3 = fig.add_subplot(gs[1, :])

# Plot p1 and p2 side by side
sns.histplot(data=bar, x='n', color='blue', binwidth=1, ax=ax1)
ax1.set(xlabel='Air genres per area', ylabel='Count')
sns.histplot(data=foobar, x='n', color='red', binwidth=1, ax=ax2)
ax2.set(xlabel='HPG genres per area', ylabel='Count')
# Plot p3
p3 = sns.scatterplot(data=p3_data, x='n', y='mean_mlv', color='blue',
      s=40, ax=ax3)
p3.errorbar(p3_data['n'], p3_data['mean_mlv'], yerr=p3_data['sd_mlv'],
              fmt='o', color='blue', markersize=5, capsize=8)
p3.set(xlabel='Cases of identical Air genres per area', 
       ylabel='Mean +/- SD of\n mean log1p visitors')

# Adjust layout
plt.tight_layout()
# Show the plots
plt.show()

We find:

  • Upper panels: The number of cases of identical genres in the same area drops quickly, possibly exponentially, from the maxima of 2 and 1 for the air and hpg data sets, respectively. There is a longer tail towards larger case numbers, with a possible 2nd peak around 8 for air and about 13 for hpg.

  • Lower panel: The log1p means of visitor numbers show relatively large spread and are perfectly consistent in the range from 2 cases to 9. From 10 cases upward we simply don’t have the numbers to measure a spread: there are two data points with 11 cases per area and otherwise it’s just a single measurement. However, it is noteworthy that the scatter among all the data points > 9 cases is pretty low, and that they lie notably higher than the means of the points <= 9. Now, that could simply mean that those high visitor numbers are not being brought down by small (less busy?) areas, but that in itself is an interesting result.

  • Note that the scatter plot in the lower panel treats all the clustering/crowding cases together, regardless of genre. We can of course only draw this plot for the air data, for which we have visitor numbers.

We will try another method of quantifying the distinctiveness of a restaurant by the distance to its neighbouring restaurants in the feature engineering section.

5.4 Reservations vs Visits

Next we will turn our attention to the reservation numbers in the air_reserve and hpg_reserve data sets. We have seen their time series and distributions back in Sections 4.2 and 4.3; now we will compare the reservation numbers to the actual visitor numbers.

For this, we compute the sum of reserve_visitors per day (i.e. the number of people reservations were made for) for each restaurant id and then join these summaries to the air_visits data. In order to include the hpg reservations we need to use the store_ids data to join the hpg_store_ids from the hpg_reserve file to the corresponding air_store_ids:

foo <- air_reserve %>%
  mutate(visit_date = date(visit_datetime)) %>%
  group_by(air_store_id,visit_date) %>%
  summarise(reserve_visitors_air = sum(reserve_visitors), .groups = "keep")
  
bar <- hpg_reserve %>%
  mutate(visit_date = date(visit_datetime)) %>%
  group_by(hpg_store_id,visit_date) %>%
  summarise(reserve_visitors_hpg = sum(reserve_visitors), .groups = "keep") %>%
  inner_join(store_ids, by = "hpg_store_id")

all_reserve <- air_visits %>%
  inner_join(foo, by = c("air_store_id", "visit_date")) %>%
  inner_join(bar, by = c("air_store_id", "visit_date")) %>%
  mutate(reserve_visitors = reserve_visitors_air + reserve_visitors_hpg)

Now we will plot the total reserve_visitor numbers against the actual visitor numbers for the air restaurants. We use a scatter plot to which we are adding marginal histograms via the ggMarginal function of the ggExtra package. The grey line shows reserve_visitors == visitors and the blue line is a linear fit:

p <- all_reserve %>%
  filter(reserve_visitors < 120) %>%
  ggplot(aes(reserve_visitors, visitors)) +
  geom_point(color = "black", alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, color = "grey60") +
  geom_smooth(method = "lm", color = "blue")
ggMarginal(p, type="histogram", fill = "blue", bins=50)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

We find:

  • The histograms show that the reserve_visitors and visitors numbers peak below ~20 and are largely confined to the range below 100.

  • The scatter points fall largely above the line of identity, indicating that there were more visitors that day than had reserved a table. This is not surprising, since a certain number of people will always be walk-in customers.

  • A notable fraction of the points is below the line, which probably indicates that some people made a reservation but changed their mind and didn’t go. That kind of effect is probably to be expected and taking it into account will be one of the challenges in this competition.

  • The linear fit suggests a trend in which larger numbers of reserve_visitors are more likely to overestimate the eventual visitor numbers. This is not surprising either, since I can imagine that it is more likely that (a) a large reservation is cancelled than (b) a large group of people walks into a restaurant without a reservation.

# Process 'air_reserve' data
foo = (
    air_reserve.assign(visit_date=air_reserve['visit_datetime'].dt.date)
    .groupby(['air_store_id', 'visit_date'], observed=True)
    .agg(reserve_visitors_air=('reserve_visitors', 'sum'))
    .reset_index()
)

# Process 'hpg_reserve' data
bar = (
    hpg_reserve.assign(visit_date=hpg_reserve['visit_datetime'].dt.date)
    .groupby(['hpg_store_id', 'visit_date'], observed=True)
    .agg(reserve_visitors_hpg=('reserve_visitors', 'sum'))
    .reset_index()
    .merge(store_ids, left_on='hpg_store_id', 
    right_on='hpg_store_id', how='inner')
)

air_visits['visit_date'] = air_visits['visit_date'].dt.date
# Combine reservation data with 'air_visits'
all_reserve = (
    air_visits.merge(foo, left_on=['air_store_id', 'visit_date'],
    right_on=['air_store_id', 'visit_date'], how='inner')
    .merge(bar, left_on=['air_store_id', 'visit_date'], 
     right_on=['air_store_id', 'visit_date'], how='inner')
    .assign(reserve_visitors=lambda x: x['reserve_visitors_air'] + x['reserve_visitors_hpg'])
)
# Filter the data
filtered_data = all_reserve[all_reserve['reserve_visitors'] < 120]

fig = plt.figure(figsize=(10, 6))
plt.style.use('ggplot')
# Create the scatterplot with marginal histograms
g = sns.jointplot(x='reserve_visitors', y='visitors',
    data=filtered_data, color='blue', alpha=0.2, kind='scatter')
# Add the identity line (visitors == reserve_visitors), matching the grey
# geom_abline in the R plot
p= g.ax_joint.plot([0, 120], [0, 120], color='grey')

# Add a linear regression line
p= sns.regplot(x='reserve_visitors', y='visitors', 
            data=filtered_data, ax=g.ax_joint, line_kws={'color': 'blue'})

# Add marginal histograms
p= plt.subplots_adjust(top=0.9)
p= g.fig.suptitle('Scatterplot with Marginal Histograms')

plt.show(p)
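
To put a rough number on the trend suggested by the blue regression line, one could fit a simple linear model to the same data. This is only a sketch, reusing the filtered_data frame from the block above and assuming numpy is imported as np:

# Simple least-squares fit of visitors ~ reserve_visitors (sketch)
slope, intercept = np.polyfit(filtered_data['reserve_visitors'],
                              filtered_data['visitors'], deg=1)
print(f"visitors ~ {slope:.2f} * reserve_visitors + {intercept:.1f}")
# If the fitted slope is below 1 and the intercept positive, reservations tend to
# overestimate the actual visitors for large parties and underestimate them for small ones.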

Now we will break down the discrepancy between visitors - reserve_visitors over time, look at the overall histograms, and visualise the air_reserve vs hpg_reserve numbers separately. Here, the time series for air (blue) and hpg (red) are offset vertically by 150 and -250 (see the solid black baselines):

p1 <- all_reserve %>%
  ggplot(aes(visitors - reserve_visitors)) +
  geom_histogram(binwidth = 5, fill = "black") +
  coord_flip() +
  labs(x = "")

p2 <- all_reserve %>%
  ggplot(aes(visitors - reserve_visitors_air)) +
  geom_histogram(binwidth = 5, fill = "blue") +
  coord_flip() +
  labs(x = "")

p3 <- all_reserve %>%
  ggplot(aes(visitors - reserve_visitors_hpg)) +
  geom_histogram(binwidth = 5, fill = "red") +
  coord_flip() +
  labs(x = "")

p4 <- all_reserve %>%
  ggplot(aes(visit_date, visitors - reserve_visitors)) +
  geom_hline(yintercept = c(150, 0, -250)) +
  geom_line() +
  geom_line(aes(visit_date, visitors - reserve_visitors_air + 150),
            color = "blue") +
  geom_line(aes(visit_date, visitors - reserve_visitors_hpg - 250),
            color = "red") +
  ggtitle("Visitors - Reserved: all (black), air (blue), hpg (red)")

layout <- matrix(c(4,4,2,4,4,1,4,4,3),3,3,byrow=TRUE)
multiplot(p1, p2, p3, p4, layout=layout)

We find:

  • The time series show significant scatter throughout the training time range. While the air (blue) and hpg (red) curves are predominantly above the baseline (i.e. more visitors than reservations), combining the two data sets brings the mean of the distribution closer to the zero line. This can also be seen in the corresponding histograms on the right side.

  • We see the gap where there are no air reservations (compare Sect. 4.2). We could only look at the hpg reservations here (for which this gap does not exist, Sect. 4.3) but it appears safe to assume that they would follow the same trend and can be used as a proxy for the air reservations. Feel free to check this assumption for the gap.

  • The (flipped) histograms in the 3 right panels are roughly aligned with the time series in the left panel for convenience of interpretation. They demonstrate how much the distributions are skewed towards larger visitor numbers than reserve_visitor numbers. We might see a mix here between two distributions: a (probably normal) spread due to cancellations plus a tail from walk-in visitors, which should follow a Poisson distribution; a toy simulation of this mixture follows below.
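
To make that mixture idea a bit more concrete, here is a purely illustrative toy simulation (all numbers are invented, not fitted to the data): a roughly normal cancellation scatter around zero plus a Poisson-distributed number of walk-in customers.

# Toy simulation of the hypothesised mixture (sketch; parameters are made up)
rng = np.random.default_rng(42)
cancellation_scatter = rng.normal(loc=0, scale=5, size=5000)   # no-shows / partial cancellations
walk_ins = rng.poisson(lam=8, size=5000)                       # extra walk-in customers
simulated_diff = cancellation_scatter + walk_ins               # plays the role of visitors - reserve_visitors

plt.figure(figsize=(6, 3))
plt.hist(simulated_diff, bins=50, color='grey')
plt.xlabel('simulated visitors - reserve_visitors')
plt.show()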

Finally, let’s have a quick look at the impact of holidays on the discrepancy between reservations and visitors. We’ll be using overlapping density plots:

all_reserve %>%
  mutate(date = visit_date) %>%
  left_join(holidays, by = "date") %>%
  ggplot(aes(visitors - reserve_visitors, fill = holiday_flg)) +
  geom_density(alpha = 0.5)

We find:

  • There are somewhat higher numbers of visitors compared to reservations on a holiday. The peaks are almost identical, but we see small yet clear differences towards larger numbers.

import matplotlib.dates as mdates  # date-axis formatting used below (may already be imported earlier)

# Sort DataFrame by visit_date
all_reserve = all_reserve.sort_values(by='visit_date')

# Plot 1
fig= plt.figure(figsize=(9, 6))
gs = GridSpec(3, 3, figure=fig)
plt.style.use('ggplot')


ax1 = plt.subplot(gs[0, 2])
ax1.hist(all_reserve['visitors'] - all_reserve['reserve_visitors'], 
        bins=range(-50, 60, 5), color='black', orientation='horizontal')
ax1.set_xlabel("")
ax1.invert_yaxis()
# Set size and rotation for x-axis tick labels
ax1.tick_params(axis='x', labelsize=8, rotation=0) 
ax1.tick_params(axis='y', labelsize=8, rotation=0)  

# Plot 2
ax2 = plt.subplot(gs[1, 2])
ax2.hist(all_reserve['visitors'] - all_reserve['reserve_visitors_air'],
         bins=range(-50, 60, 5), color='blue', orientation='horizontal')
ax2.set_xlabel("")
ax2.invert_yaxis()
# Set size and rotation for x-axis tick labels
ax2.tick_params(axis='x', labelsize=8, rotation=0)
ax2.tick_params(axis='y', labelsize=8, rotation=0)  

# Plot 3
ax3 = plt.subplot(gs[2, 2])
ax3.hist(all_reserve['visitors'] - all_reserve['reserve_visitors_hpg'],
         bins=range(-50, 60, 5), color='red', orientation='horizontal')
ax3.set_xlabel("")
ax3.invert_yaxis()
# Set size and rotation for x-axis tick labels
ax3.tick_params(axis='x', labelsize=8, rotation=0)  
ax3.tick_params(axis='y', labelsize=8, rotation=0)
ax3.set(xlabel='count') # ylabel='Count'

# Plot 4
ax4 = plt.subplot(gs[0:3, 0:2])
#ax4.axhline(y=[-300, 0, 300], color=['red', 'black', 'blue'], linestyle='--')
ax4.plot(all_reserve['visit_date'], 
        all_reserve['visitors'] - all_reserve['reserve_visitors'], color='black')
ax4.plot(all_reserve['visit_date'], 
        all_reserve['visitors'] - all_reserve['reserve_visitors_air'] + 150, color='blue')
ax4.plot(all_reserve['visit_date'], 
         all_reserve['visitors'] - all_reserve['reserve_visitors_hpg'] - 250, color='red')
ax4.set_title("Visitors - Reserved: all (black), air (blue), hpg (red)", fontsize=10)
# Format the x-axis date labels
ax4.xaxis.set_major_locator(mdates.MonthLocator(interval=3))
ax4.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
# Set size and rotation for x-axis tick labels
ax4.tick_params(axis='x', labelsize=8, rotation=0)
ax4.set(xlabel='visit_date', ylabel= "visitors - reserve_visitors")


# Adjust layout
plt.tight_layout()

# Show the plot
plt.show()

# Convert visit_date to datetime if needed
all_reserve['visit_date'] = pd.to_datetime(all_reserve['visit_date'])
# Rename 'visit_date' column before merging
all_reserve['date'] = all_reserve['visit_date']

# Left join with holidays
reserve_holy = all_reserve.merge(holidays, on='date', how='left')

# Create the density plot
#sns.set(style="whitegrid")
fig = plt.figure(figsize=(10, 6))
plt.style.use('ggplot')

sns.histplot(data=reserve_holy, x=(reserve_holy['visitors'] - reserve_holy['reserve_visitors']),
             kde=True, stat='density', common_norm=False,
             hue='holiday_flg', fill=True, alpha=0.5, palette='Set1')

plt.title("Density Plot of Visitors - Reserve Visitors")
plt.xlabel("Visitors - Reserve Visitors")
plt.ylabel("Density")
#plt.legend(loc='lower left')

plt.show()

6 Feature Engineering

The next step is to derive new features from the existing ones. The purpose of these new features is to provide additional predictive power for our goal of forecasting visitor numbers. Those new features can be as simple as deriving the day of the week or the month from a date column; something that we have already used in Fig. 1 of our visual exploration. Or they can take a more complex shape from the interplay of several related variables. This section collects and studies these new features.

My personal preference is to collect all engineered features into a single code block, so that we don’t have to search for them in different places of the kernel. Often, this also makes the computation of these features easier as we can re-use certain intermediate transformations. As we expand our analysis, we will come back to this code block and it will grow as our feature space grows:

air_visits <- air_visits %>%
  mutate(wday = wday(visit_date, label=TRUE, week_start = 1),
         wday = fct_relevel(wday, c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
         month = month(visit_date, label=TRUE))

air_reserve <- air_reserve %>%
  mutate(reserve_date = date(reserve_datetime),
         reserve_hour = hour(reserve_datetime),
         reserve_wday = wday(reserve_datetime, label = TRUE, week_start = 1),
         reserve_wday = fct_relevel(reserve_wday, c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
         visit_date = date(visit_datetime),
         visit_hour = hour(visit_datetime),
         visit_wday = wday(visit_datetime, label = TRUE, week_start = 1),
         visit_wday = fct_relevel(visit_wday, c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
         diff_hour = time_length(visit_datetime - reserve_datetime, unit = "hour"),
         diff_day = time_length(visit_datetime - reserve_datetime, unit = "day"))

hpg_reserve <- hpg_reserve %>%
  mutate(reserve_date = date(reserve_datetime),
         reserve_hour = hour(reserve_datetime),
         reserve_wday = wday(reserve_datetime, label = TRUE, week_start = 1),
         reserve_wday = fct_relevel(reserve_wday, c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
         visit_date = date(visit_datetime),
         visit_hour = hour(visit_datetime),
         visit_wday = wday(visit_datetime, label = TRUE, week_start = 1),
         visit_wday = fct_relevel(visit_wday, c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
         diff_hour = time_length(visit_datetime - reserve_datetime, unit = "hour"),
         diff_day = time_length(visit_datetime - reserve_datetime, unit = "day"))

# count stores in area
air_count <- air_store %>%
  group_by(air_area_name) %>%
  summarise(air_count = n())

hpg_count <- hpg_store %>%
  group_by(hpg_area_name) %>%
  summarise(hpg_count = n())

# distances
med_coord_air <- air_store %>%
  summarise_at(vars(longitude:latitude), median)
med_coord_hpg <- hpg_store %>%
  summarise_at(vars(longitude:latitude), median)

air_coords <- air_store %>%
  select(longitude, latitude)
hpg_coords <- hpg_store %>%
  select(longitude, latitude)

air_store$dist <- distCosine(air_coords, med_coord_air)/1e3
hpg_store$dist <- distCosine(hpg_coords, med_coord_hpg)/1e3

# apply counts, dist; add prefecture
air_store <- air_store %>%
  mutate(dist_group = as.integer(case_when(
    dist < 80 ~ 1,
    dist < 300 ~ 2,
    dist < 500 ~ 3,
    dist < 750 ~ 4,
    TRUE ~ 5))) %>%
  left_join(air_count, by = "air_area_name") %>%
  mutate(prefecture = str_extract(air_area_name,"[^\\s]+\\s" ))
  #separate(air_area_name, c("prefecture"), sep = " ", remove = FALSE)

hpg_store <- hpg_store %>%
  mutate(dist_group = as.integer(case_when(
    dist < 80 ~ 1,
    dist < 300 ~ 2,
    dist < 500 ~ 3,
    dist < 750 ~ 4,
    TRUE ~ 5))) %>%
  left_join(hpg_count, by = "hpg_area_name") %>%
  mutate(prefecture = str_extract(hpg_area_name,"[^\\s]+\\s" ))
 # separate(hpg_area_name, c("prefecture", "City", "Place"), sep = " ", remove = FALSE)
from pandas.api.types import CategoricalDtype

# Convert visit_date to datetime
air_visits['visit_date'] = pd.to_datetime(air_visits['visit_date'])

# Extract weekday
air_visits['wday'] = air_visits['visit_date'].dt.day_name()

# Reorder weekdays
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
air_visits['wday'] = air_visits['wday'].astype(CategoricalDtype(categories=weekday_order, ordered=True))

# Extract month
air_visits['month'] = air_visits['visit_date'].dt.month_name()

# Convert visit_datetime to datetime
air_reserve['reserve_datetime'] = pd.to_datetime(air_reserve['reserve_datetime'])
air_reserve['visit_datetime'] = pd.to_datetime(air_reserve['visit_datetime'])

# Extract various date and time components
air_reserve['reserve_date'] = air_reserve['reserve_datetime'].dt.date
air_reserve['reserve_hour'] = air_reserve['reserve_datetime'].dt.hour
air_reserve['reserve_wday'] = air_reserve['reserve_datetime'].dt.day_name()
air_reserve['reserve_wday'] = air_reserve['reserve_wday'].astype(\
                              CategoricalDtype(categories=weekday_order, ordered=True))

air_reserve['visit_date'] = air_reserve['visit_datetime'].dt.date
air_reserve['visit_hour'] = air_reserve['visit_datetime'].dt.hour
air_reserve['visit_wday'] = air_reserve['visit_datetime'].dt.day_name()
air_reserve['visit_wday'] = air_reserve['visit_wday'].astype(\
                            CategoricalDtype(categories=weekday_order, ordered=True))

# Calculate time differences
air_reserve['diff_hour'] = (air_reserve['visit_datetime'] - \
                           air_reserve['reserve_datetime']).dt.total_seconds() / 3600
#air_reserve['diff_day'] = (air_reserve['visit_datetime'].dt.date -
#air_reserve['reserve_datetime'].dt.date)#.dt.days
air_reserve['diff_day'] = (air_reserve['visit_datetime'] - \
                          air_reserve['reserve_datetime']).apply(lambda x: x.total_seconds() / 86400)
# Count stores in area
air_count = air_store.groupby('air_area_name', 
            observed=True).size().reset_index(name='air_count')

hpg_count = hpg_store.groupby('hpg_area_name', 
            observed=True).size().reset_index(name='hpg_count')

# Calculate distances
med_coord_air = air_store[['longitude', 'latitude']].median()

med_coord_hpg = hpg_store[['longitude', 'latitude']].median()

air_coords = air_store[['longitude', 'latitude']]
hpg_coords = hpg_store[['longitude', 'latitude']]

# Great-circle (haversine) distance in km from the median coordinates, mirroring
# geosphere::distCosine()/1e3 in the R code; a Euclidean distance in degrees
# would not match the km-based dist_group thresholds below
def haversine_km(lon, lat, lon0, lat0):
    lon, lat, lon0, lat0 = map(np.radians, [lon, lat, lon0, lat0])
    a = (np.sin((lat - lat0) / 2)**2
         + np.cos(lat0) * np.cos(lat) * np.sin((lon - lon0) / 2)**2)
    return 2 * 6371 * np.arcsin(np.sqrt(a))

air_store['dist'] = haversine_km(air_coords['longitude'], air_coords['latitude'],
                                 med_coord_air['longitude'], med_coord_air['latitude'])
hpg_store['dist'] = haversine_km(hpg_coords['longitude'], hpg_coords['latitude'],
                                 med_coord_hpg['longitude'], med_coord_hpg['latitude'])

# Apply counts, dist; add prefecture
def dist_group(x):
    if x < 80:
        return 1
    elif x < 300:
        return 2
    elif x < 500:
        return 3
    elif x < 750:
        return 4
    else:
        return 5

air_store['dist_group'] = air_store['dist'].apply(dist_group)
air_store = pd.merge(air_store, air_count, 
           left_on='air_area_name', right_on='air_area_name',
           how='left')
# Split 'air_area_name' into 'prefecture' and 'area'
split_data = air_store['air_area_name'].str.split(' ', n=1, expand=True)
# Assign the first part to 'prefecture' and the rest to 'area'
air_store['prefecture'] = split_data[0]
air_store['area'] = split_data[1]
# If there was only one part (no space), assign an empty string to 'area'
air_store['area'].fillna("", inplace=True)



hpg_store['dist_group'] = hpg_store['dist'].apply(dist_group)
hpg_store = pd.merge(hpg_store, hpg_count, 
                    left_on='hpg_area_name', 
                    right_on='hpg_area_name', 
                    how='left')

# Split 'hpg_area_name' into 'prefecture' and 'area'
split_data = hpg_store['hpg_area_name'].str.split(' ', n=1, expand=True)
# Assign the first part to 'prefecture' and the rest to 'area'
hpg_store['prefecture'] = split_data[0]
hpg_store['area'] = split_data[1]
# If there was only one part (no space), assign an empty string to 'area'
hpg_store['area'].fillna("", inplace=True)

6.1 Days of the week & months of the year

We start simple, with the week day (wday) and month features derived from the visit_date. We already looked at the total visitors per week day and month in Fig. 1. Now we will study the log1p transformed visitor numbers. We compute their mean and standard deviation and plot them alongside the (ridgeline) density distributions:

p1 <- air_visits %>%
  group_by(wday) %>%
  summarise(mean_log_visitors = mean(log1p(visitors)),
            sd_log_visitors = sd(log1p(visitors))) %>%
  ggplot(aes(wday, mean_log_visitors, color = wday)) +
  geom_point(size = 4) +
  geom_errorbar(aes(ymin = mean_log_visitors - sd_log_visitors,
                    ymax = mean_log_visitors + sd_log_visitors,
                    color = wday), width = 0.5, linewidth = 0.7) +
  theme(legend.position = "none")

p2 <- air_visits %>%
  mutate(visitors = log1p(visitors)) %>%
  ggplot(aes(visitors, wday, fill = wday)) +
  geom_density_ridges(bandwidth = 0.1) +
  scale_x_log10() +
  theme(legend.position = "none") +
  labs(x = "log1p(visitors)", y = "")

p3 <- air_visits %>%
  group_by(month) %>%
  summarise(mean_log_visitors = mean(log1p(visitors)),
            sd_log_visitors = sd(log1p(visitors))) %>%
  ggplot(aes(month, mean_log_visitors, color = month)) +
  geom_point(size = 4) +
  geom_errorbar(aes(ymin = mean_log_visitors - sd_log_visitors,
                    ymax = mean_log_visitors + sd_log_visitors,
                    color = month), width = 0.5, size = 0.7) +
  theme(legend.position = "none")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
p4 <- air_visits %>%
  mutate(visitors = log1p(visitors)) %>%
  ggplot(aes(visitors, month, fill = month)) +
  geom_density_ridges(bandwidth = 0.1) +
  scale_x_log10() +
  theme(legend.position = "none") +
  labs(x = "log1p(visitors)", y = "")

layout <- matrix(c(1,2,3,4),2,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, layout=layout)

We find:

  • The differences in mean log(visitor) numbers between the days of the week are much smaller than the corresponding uncertainties. This is confirmed by the density plots that show a strong overlap. Still, there is a trend toward higher visitor numbers on the weekend that might be useful.

  • The months are even more similar, including the slightly increased numbers in December. Overall, there is not much difference.

# Compute log1p of visitors
air_visits['log1p_visitors'] = air_visits['visitors'].apply(lambda x: np.log1p(x))
# Plot 1 data
p1_data = air_visits.groupby('wday', observed=True).agg(mean_log_visitors=('log1p_visitors', 'mean'), 
                                        sd_log_visitors=('log1p_visitors', 'std')).reset_index()

p3_data = air_visits.groupby('month', observed=True).agg(mean_log_visitors=('log1p_visitors', 'mean'), 
                                          sd_log_visitors=('log1p_visitors', 'std')).reset_index()


# Create a 1x2 grid
fig, axes = plt.subplots(1, 2, figsize=(10, 6))
plt.style.use('ggplot')

# Define custom order for weekdays and months
custom_wday_order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
custom_month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", 
                      "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Plot 1
p1 = sns.scatterplot(data=p1_data, x='wday', y='mean_log_visitors', 
                     hue='wday', #palette='viridis', 
                     ax=axes[0])
p1= p1.errorbar(p1_data['wday'], p1_data['mean_log_visitors'],
            yerr=p1_data['sd_log_visitors'], fmt='o', 
            color='black',
            markersize=5,capsize=8)
# Set custom order for x-axis labels
#p1.xaxis.set_ticks(custom_wday_order)            
p1= axes[0].set_xticklabels(custom_wday_order)  
p1= axes[0].legend('')



# Plot 3
p3 = sns.scatterplot(data=p3_data, x='month', y='mean_log_visitors', 
                      hue='month', #palette='viridis',
                      ax=axes[1]) 
p3= p3.errorbar(p3_data['month'], p3_data['mean_log_visitors'], 
            yerr=p3_data['sd_log_visitors'], 
            fmt='o', 
            color='black',
            #palette='viridis',
            markersize=5,capsize=8)
# Set custom order for x-axis labels
#p3= p3.xaxis.set_ticks(custom_month_order)
p3= axes[1].set_xticklabels(custom_month_order, rotation=45)  
p3= axes[1].legend('')

# Adjust layout
plt.tight_layout()
# Show the plots
plt.show()

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Define a function to compute log1p of visitors
def compute_log1p_visitors(visitors):
    return np.log1p(visitors)

# Each FacetGrid below creates its own figure, so no shared subplot grid is needed
plt.style.use('ggplot')

# Plot 2
p2_data = air_visits.copy()
p2_data['log1p_visitors'] = compute_log1p_visitors(p2_data['visitors'])

p2 = sns.FacetGrid(p2_data, row='wday', hue="wday", palette='viridis',
                 aspect=15, height=.5)
p2= p2.map(sns.kdeplot, 'log1p_visitors', bw_adjust=.5, 
        clip_on=False, fill=True, alpha=1, linewidth=.5)
p2= p2.map(sns.kdeplot, 'log1p_visitors', clip_on=False, color="black", lw=2, bw_adjust=.5)
p2= p2.set(xlim=(0, 10))

# Define and use a simple function to label the plot in axes coordinates
def label(x, color, label):
    ax = plt.gca()
    ax.text(0, .2, label,  color=color, ha="right", va="center", transform=ax.transAxes)
p2= p2.map(label, 'log1p_visitors')

p2= p2.set_titles("")
p2= p2.set(yticks=[], ylabel="")
p2= p2.despine(bottom=True, left=True)
p2= p2.fig.subplots_adjust(hspace=-.25)

# Plot 4
p4_data = air_visits.copy()
p4_data['log1p_visitors'] = compute_log1p_visitors(p4_data['visitors'])

p4 = sns.FacetGrid(p4_data, row='month', hue="month", palette='viridis',
                 aspect=15, height=.5)
p4= p4.map(sns.kdeplot, 'log1p_visitors', bw_adjust=.5, 
       clip_on=False, fill=True, alpha=1,
       linewidth=.5)
p4= p4.map(sns.kdeplot, 'log1p_visitors', clip_on=False, 
       color="black", lw=2, bw_adjust=.5)
p4= p4.set(xlim=(0, 10))

p4= p4.map(label, 'log1p_visitors')

p4= p4.set_titles("")
p4= p4.set(yticks=[], ylabel="")
p4= p4.despine(bottom=True, left=True)
p4= p4.fig.subplots_adjust(hspace=-.25)

# Adjust layout
plt.tight_layout()

# Show the plots
plt.show()

Here is a breakdown by air_genre_name:

air_visits %>%
  left_join(air_store, by = "air_store_id") %>%
  group_by(wday, air_genre_name) %>%
  summarise(mean_log_visitors = mean(log1p(visitors)),
            sd_log_visitors = sd(log1p(visitors)),
            .groups = "keep") %>%
  ggplot(aes(wday, mean_log_visitors, color = wday)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_log_visitors - sd_log_visitors,
                    ymax = mean_log_visitors + sd_log_visitors,
                    color = wday), width = 0.5, size = 0.7) +
  theme(legend.position = "none", axis.text.x  = element_text(angle=45, hjust=1, vjust=0.9)) +
  facet_wrap(~ air_genre_name)

We find:

  • There is a stronger impact of the weekend vs weekdays for genres such as “Karaoke” or “International Cuisine”, while others show very little change over the days of the week (e.g. “Japanese food”).

  • Overall, given the size of the error bars, none of these weekday differences is statistically significant for any of the genres.

What about the statistics of visitors vs reservations? Does it depend on the day of the week? We use boxplots to find out:

all_reserve %>%
  mutate(wday = wday(visit_date, label=TRUE, week_start = 1),
         wday = fct_relevel(wday, c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))) %>%
  ggplot(aes(wday, visitors - reserve_visitors, fill = wday)) +
  geom_boxplot() +
  theme(legend.position = "none")

We find that there is no significant difference, although we see a slight trend for more visitors compared to reservations on the weekend, especially Sunday.
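
A quick numerical summary could complement the boxplots. The following sketch (an addition; it assumes the all_reserve frame built in Sect. 5.4) prints the median of visitors - reserve_visitors per weekday:

# Median difference between actual visitors and reservations per weekday (sketch)
tmp = all_reserve.copy()
tmp['wday'] = pd.to_datetime(tmp['visit_date']).dt.day_name()
tmp['diff'] = tmp['visitors'] - tmp['reserve_visitors']
print(tmp.groupby('wday')['diff'].median())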

from matplotlib import pyplot as plt, dates
#locale.setlocale(locale.LC_ALL,'en_US.utf8')
# Compute log1p of visitors
air_visits['log1p_visitors'] = np.log1p(air_visits['visitors'])

# Merge air_visits and air_store
merged_data = pd.merge(air_visits, air_store, on='air_store_id', how='left')

# Group by wday and air_genre_name, compute mean and standard deviation
summary_data = merged_data.groupby(['wday', 'air_genre_name'], observed=True).agg(
    mean_log_visitors=('log1p_visitors', 'mean'),
    sd_log_visitors=('log1p_visitors', 'std')
).reset_index()


fig = plt.figure(figsize = (12,8))
plt.style.use('ggplot')

g = sns.FacetGrid(summary_data, col="air_genre_name", col_wrap=4,
                  sharex=True, sharey=True, height=5)
p = g.map_dataframe(sns.scatterplot, x='wday', y='mean_log_visitors',
                hue='wday', palette='viridis', s=50, alpha=0.7)

# Add error bars
for ax, (_, data) in zip(g.axes.flat, summary_data.groupby('air_genre_name')):
    #sns.lineplot(data=data, x='wday', y='mean_log_visitors', ax=ax, color='black')
    p= ax.errorbar(data['wday'], data['mean_log_visitors'], yerr=data['sd_log_visitors'], fmt='none', color='black', capsize=5)
    p= ax.legend('')

# Resize facet titles
p= g.set_titles(col_template="{col_name}", 
               size = 8, backgroundcolor='#DDDDDD', pad=2
           # bbox={'facecolor': '#DDDDDD', 'alpha': 0, 'pad': 0}
            )
p= g.tick_params(axis='x', pad=0, labelsize= 6)
p= g.set_xlabels(fontsize=6, label="Date")
p= g.set_ylabels(fontsize=8 )#,label="Average number of visitors")

for ax in g.axes.flat:
    for label in ax.get_xticklabels():
        label.set_rotation(25)
        #ax.xaxis.set_major_formatter(dates.DateFormatter('%a'))

# Apply logarithmic scale to y-axis
#p= g.set(yscale="log")

# Set x-axis labels
p= g.set_axis_labels("Weekday", "Mean Log Visitors")
#p= g.add_legend(title="weekday")

# Adjust layout
p= plt.tight_layout()
# Show the plot
plt.show(p)

# Convert visit_date to datetime and extract the weekday
all_reserve['visit_date'] = pd.to_datetime(all_reserve['visit_date'])
all_reserve['wday'] = all_reserve['visit_date'].dt.day_name()

# Reorder the weekdays
weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
all_reserve['wday'] = pd.Categorical(all_reserve['wday'], categories=weekday_order, ordered=True)

#Create a new column for the difference between visitors and reserve_visitors
all_reserve['visitors_minus_reserve'] = all_reserve['visitors'] - all_reserve['reserve_visitors']

# Create the boxplot
fig = plt.figure(figsize=(10, 6))
plt.style.use('ggplot')

p= sns.boxplot(data=all_reserve, x='wday', y='visitors_minus_reserve', 
             hue='wday', palette='viridis')
p= plt.xlabel('Weekday')
p= plt.ylabel('Visitors - Reserve Visitors')
p= plt.tick_params(axis='x', pad=0, labelsize= 8)
#plt.set_major_formatter(dates.DateFormatter('%a'))
#plt.title('Difference between Visitors and Reserve Visitors by Weekday')

# Remove legend
#plt.legend().remove()
# Show the plot
plt.show(p)

6.2 Restaurants per area - air/hpg_count

This feature will address the question of how many restaurants can be found in a certain area and whether larger numbers of restaurants per area affect the average visitor numbers.

In order to compute it, we simply count the number of air/hpg restaurants for the corresponding areas. We show the resulting distributions at the top of the next plot. Then we estimate the mean log1p-transformed visitor count per restaurant, and compute the mean and standard deviation of this number for each area count group:

p1 <- air_count %>%
  ggplot(aes(air_count)) +
  geom_histogram(binwidth = 2, fill = "blue")

p2 <- hpg_count %>%
  ggplot(aes(hpg_count)) +
  geom_histogram(binwidth = 5, fill = "red")

p3 <- air_visits %>%
  left_join(air_store, by = "air_store_id") %>%
  group_by(air_store_id, air_count) %>%
  summarise(mean_store_visit = mean(log1p(visitors)), .groups = "keep") %>%
  group_by(air_count) %>%
  summarise(mean_log_visitors = mean(mean_store_visit),
            sd_log_visitors = sd(mean_store_visit)) %>%
  ggplot(aes(air_count, mean_log_visitors)) +
  geom_point(size = 4, color = "blue") +
  geom_errorbar(aes(ymin = mean_log_visitors - sd_log_visitors,
                    ymax = mean_log_visitors + sd_log_visitors),
                    color = "blue", width = 0.5, size = 0.7) +
  geom_smooth(method = "lm", color = "black") +
  labs(x = "Air restaurants per area")

layout <- matrix(c(1,2,3,3),2,2,byrow=TRUE)
multiplot(p1, p2, p3, layout=layout)
## `geom_smooth()` using formula = 'y ~ x'

We find:

  • Most areas in the air data set have only a few restaurants, with the distribution peaking at 2. This goes back to the earlier observation that no single air_genre can ever be found alone in an air_area; there are always at least 2 of the same type. Still odd. The air data also has a tentative 2nd peak around 16-20. The hpg data, with larger overall numbers, also peaks at low counts and has a few other, smaller peaks throughout its decline.

  • The mean log1p visitor numbers per area have large uncertainties and are all consistent with each other. There is a slight trend, shown by the black linear regression, toward lower average numbers for larger areas but it is quite weak.

# Multiplot layout
fig = plt.figure(figsize=(9, 5))
gs = GridSpec(2, 2, figure=fig)
plt.style.use('ggplot')

ax1 = fig.add_subplot(gs[0, 0])
ax2 = fig.add_subplot(gs[0, 1])
ax3 = fig.add_subplot(gs[1, :])

# Plot 1
p1 = sns.histplot(data=air_count, x='air_count', 
                  bins=range(0, max(air_count['air_count'])+2, 2), 
                  color='blue',
                  ax=ax1
                  )
p1= ax1.set_xlabel('Air Count')
p1= ax1.set_title('Air Count Histogram')

# Plot 2
p2 = sns.histplot(data=hpg_count, x='hpg_count', 
                 bins=range(0, max(hpg_count['hpg_count'])+6, 5), 
                 color='red',
                 ax=ax2)
p2= ax2.set_xlabel('HPG Count')
p2= ax2.set_title('HPG Count Histogram')

# Preprocessing for Plot 3
air_visits_processed = air_visits.merge(air_store, on='air_store_id', how='left')

store_visits = air_visits_processed.groupby(['air_store_id', 'air_count'],
               observed=True)['visitors'].apply(lambda x: np.log1p(x).mean()).reset_index()  # log1p to match the R code
area_visits = store_visits.groupby('air_count', 
              observed=True)['visitors'].agg(['mean', 'std']).reset_index()


# Plot 3
p3 = sns.scatterplot(data=area_visits, x='air_count', y='mean',
                     color='blue',
                     ax=ax3)
p3= p3.errorbar(x=area_visits['air_count'], y=area_visits['mean'], 
            yerr=area_visits['std'], fmt='o',
            color='blue', capsize=5)
p3= sns.regplot(data=area_visits, x='air_count', y='mean', scatter=False, color='black', ax=ax3)

# Set labels and title for Plot 3
p3= ax3.set_xlabel('Air Restaurants per Area')
p3= ax3.set_ylabel('Mean Log Visitors')
p3= ax3.set_title('Mean Log Visitors vs Air Restaurants per Area')

# Adjust layout
plt.tight_layout()
# Show the plots
plt.show()

6.3 Distance from the busiest area

Whenever we have spatial coordinates in our data we automatically also have the distances between these coordinates. As a first approximation we use the linear distance between two points (i.e. as the crow flies).

To compute these distances we are using the distCosine function of the geosphere package for spherical trigonometry. This method gives us the shortest distance between two points on a spherical earth. For the purpose of this localised analysis we choose to ignore ellipsoidal distortion of the earth’s shape.

As the single reference point for all distances we choose the median latitude and median longitude. Our analysis could be extended to investigate all the pairwise distances between restaurants with the same methodology. We won’t perform this study in our kernel, but I encourage anyone who wants to try it to see whether additional insight can be gained this way.

Here, we plot the histograms of the distances from the medians for all the restaurants in the air and hpg data. These distributions will suggest a grouping into 5 different bins with ranges of increasing distance. We then compute the mean log1p visitor numbers for the 5 groups and compare them:
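The dist_group feature used in the map below is assumed to have been derived earlier. A rough Python sketch of the same idea, with a hand-written spherical-law-of-cosines distance standing in for geosphere::distCosine and a simple cut into 5 bins (the exact binning of the original may differ):

import numpy as np
import pandas as pd

def dist_cosine(lon1, lat1, lon2, lat2, r=6378137):
    # great-circle distance in metres on a spherical earth,
    # analogous to geosphere::distCosine
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    cos_angle = (np.sin(lat1) * np.sin(lat2) +
                 np.cos(lat1) * np.cos(lat2) * np.cos(lon2 - lon1))
    return r * np.arccos(np.clip(cos_angle, -1, 1))

# median coordinates as the single reference point
med_coord_air = air_store[['longitude', 'latitude']].median()

air_store['dist'] = dist_cosine(air_store['longitude'], air_store['latitude'],
                                med_coord_air['longitude'], med_coord_air['latitude'])
# group the distances into 5 bins of increasing distance (labels 1-5)
air_store['dist_group'] = pd.cut(air_store['dist'], 5, labels=False).astype(int) + 1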

foo <- air_store %>% select(latitude, longitude, dist_group) %>% mutate(dset = "air")
bar <- hpg_store %>% select(latitude, longitude, dist_group) %>% mutate(dset = "hpg")

leaflet(foo) %>%
  addTiles() %>%
  addProviderTiles("CartoDB.Positron") %>%
  addScaleBar() %>%
  addCircleMarkers(~longitude, ~latitude, group = "AIR",
                   color = "blue", fillOpacity = 0.5, radius = 3*foo$dist_group) %>%
  addCircleMarkers(lng = bar$longitude, lat = bar$latitude, group = "HPG",
                   color = "red", fillOpacity = 0.5, radius = 3*bar$dist_group) %>%
  addCircleMarkers(lng = med_coord_air$longitude, lat = med_coord_air$latitude, group = "Centre",
                   color = "darkgreen", fillOpacity = 1) %>%
  addLayersControl(
    overlayGroups = c("AIR", "HPG", "Centre"),
    options = layersControlOptions(collapsed = FALSE)
  )
foo = air_store[['latitude', 'longitude', 'dist_group']].assign(dset='air')
bar = hpg_store[['latitude', 'longitude', 'dist_group']].assign(dset='hpg')
# Create a base map
m = folium.Map(location=[bar['latitude'].median(),
                         bar['longitude'].median()], 
                         zoom_start=5
              )

# Add tiles
folium.TileLayer('CartoDB positron').add_to(m)
## <folium.raster_layers.TileLayer object at 0x7ff5e0604700>
# Create a MarkerCluster object
marker_cluster = MarkerCluster().add_to(m)

#folium.LayerControl().add_to(m)
                  
# Define a function to add markers for hpg_store
def add_marker(row):
    folium.Marker(location=[row['latitude'], row['longitude']],
                  #popup=row['hpg_store_id'],
                  icon=None
                  #tooltip=row['hpg_genre_name']
                  ).add_to(marker_cluster)

# Add markers for hpg_store
bar.apply(add_marker, axis=1)
## 0       None
## 1       None
## 2       None
## 3       None
## 4       None
##         ... 
## 4685    None
## 4686    None
## 4687    None
## 4688    None
## 4689    None
## Length: 4690, dtype: object
# Display the map
m
## <folium.folium.Map object at 0x7ff5e0a26cb0>
Foo = air_store[['latitude', 'longitude', 'dist_group']].assign(dset='air')
Bar = hpg_store[['latitude', 'longitude', 'dist_group']].assign(dset='hpg')

FooBar = pd.concat([Foo, Bar])

# Create a base map
m = folium.Map(location=[FooBar['latitude'].median(),
                         FooBar['longitude'].median()], 
                         zoom_start=5
              )
              
FooBar = FooBar.to_records(index=False)
FooBar = list(FooBar)

# point_layer name list
all_gp = []
for x in range(len(FooBar)):
    pg = FooBar[x][3]
    all_gp.append(pg)

# Create point_layer object
unique_gp = list(set(all_gp))
vlist = []
for i,k in enumerate(unique_gp):
    locals()[f'point_layer{i}'] = folium.FeatureGroup(name=k)
    vlist.append(locals()[f'point_layer{i}'])

# Creating list for point_layer
pl_group = []
for n in all_gp:
    for v in vlist: 
        if n == vars(v)['layer_name']:
            pl_group.append(v)

for (latitude,longitude,dist_group, dset),pg in zip(FooBar[0:5], pl_group[0:5]):
    folium.CircleMarker(location=[latitude, longitude], radius=10,
        popup=str(dset) + " Lat: " + str(latitude) + " , Long: " + str(longitude),
        tooltip=str(dset) + " Lat: " + str(latitude) + " , Long: " + str(longitude),
        fill=True,  # Set fill to True
        color='red',
        fill_opacity=1.0).add_to(pg)
## <folium.vector_layers.CircleMarker object at 0x7ff5dd0d3700>
## <folium.vector_layers.CircleMarker object at 0x7ff5dd0d15a0>
## <folium.vector_layers.CircleMarker object at 0x7ff5dd0d2710>
## <folium.vector_layers.CircleMarker object at 0x7ff5dd120220>
## <folium.vector_layers.CircleMarker object at 0x7ff5dd122440>
pg.add_to(m)
## <folium.map.FeatureGroup object at 0x7ff5dd0d36d0>
m.add_child(folium.LayerControl(collapsed=False))  
## <folium.folium.Map object at 0x7ff5dd0d3970>
#m.save("Map1.html")
m
## <folium.folium.Map object at 0x7ff5dd0d3970>

6.4 Prefectures

This is a straightforward feature where we simply define the prefecture as the first part of the area_name. Here we make use of the (generally) consistent formatting of the area name. This gives us a relatively small number of distinct high-level areas:
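A minimal Python sketch of this extraction, assuming the space-separated air_area_name/hpg_area_name columns (the prefecture columns are assumed to already exist in the plots below):

# prefecture = first token of the area name, e.g. "Tōkyō-to Shibuya-ku ..." -> "Tōkyō-to"
air_store['prefecture'] = air_store['air_area_name'].str.split(' ').str[0]
hpg_store['prefecture'] = hpg_store['hpg_area_name'].str.split(' ').str[0]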

p1 <- air_store %>%
  ggplot(aes(prefecture)) +
  geom_bar(fill = "blue") +
  coord_flip() +
  ggtitle("air prefectures - # restaurants") +
  labs(x = "")

p2 <- hpg_store %>%
  ggplot(aes(prefecture)) +
  geom_bar(fill = "red") +
  coord_flip() +
  ggtitle("hpg prefectures - # restaurants") +
  labs(x = "")

p3 <- air_visits %>%
  left_join(air_store, by = "air_store_id") %>%
  group_by(air_store_id, prefecture) %>%
  summarise(mean_store_visit = mean(log1p(visitors))) %>%
  group_by(prefecture) %>%
  summarise(mean_log_visitors = mean(mean_store_visit),
            sd_log_visitors = sd(mean_store_visit)) %>%
  ggplot(aes(prefecture, mean_log_visitors)) +
  geom_point(size = 4, color = "blue") +
  geom_errorbar(aes(ymin = mean_log_visitors - sd_log_visitors,
                    ymax = mean_log_visitors + sd_log_visitors),
                    color = "blue", width = 0.5, size = 0.7) +
  labs(x = "prefecture") +
  theme(axis.text.x  = element_text(angle=15, hjust=1, vjust=0.9))
## `summarise()` has grouped output by 'air_store_id'. You can override using the
## `.groups` argument.
layout <- matrix(c(1,2,1,2,1,2,3,3,3,3),5,2,byrow=TRUE)
multiplot(p1, p2, p3, layout=layout)

We find:

  • There are 9 prefectures in the air data and 13 entries in the hpg data. In both data sets “Tokyo” is clearly the most frequent entry.

  • Once more, we don’t see significant differences in the mean log1p visitor numbers for the 9 different air prefectures. Any possible variation is smaller than the size of the error bars.

# Create a 2x2 grid
fig = plt.figure(figsize=(8, 5))
grid = GridSpec(2, 2, figure=fig)
plt.style.use('ggplot')

# Plot 1 (p1)
ax1 = fig.add_subplot(grid[0, 0])
air_store['prefecture'].value_counts().plot(kind='barh', color='blue', ax=ax1)
## <Axes: ylabel='prefecture'>
p1= ax1.set_title('air prefectures - restaurants', fontsize= 10)
p1= ax1.set_xlabel('count')
p1= ax1.tick_params(axis='y', rotation=0, labelsize= 6)
p1= ax1.set_ylabel('')

# Plot 2 (p2)
ax2 = fig.add_subplot(grid[0, 1])
p2= hpg_store['prefecture'].value_counts().plot(kind='barh', color='red', ax=ax2)
p2= ax2.set_title('hpg prefectures - restaurants', fontsize= 10)
p2= ax2.set_xlabel('count')
p2= ax2.tick_params(axis='y', rotation=0, labelsize= 6)
p2= ax2.set_ylabel('')

# Plot 3 (p3)
p3_data = (
    air_visits
    .merge(air_store, on='air_store_id')
    .groupby(['air_store_id', 'prefecture'])['visitors']
    .apply(lambda x: np.log1p(x).mean())   # mean log1p visits per store, as in the R code
    .groupby('prefecture')
    .agg(mean_store_visit='mean', sd_log_visitors='std')
    .reset_index()
)

ax3 = fig.add_subplot(grid[1, :])
p3= ax3.errorbar(
    p3_data['prefecture'], p3_data['mean_store_visit'], yerr=p3_data['sd_log_visitors'],
    color='blue', marker='o', linestyle='-', capsize=5
)
p3= ax3.set_title('Mean Store Visits by Prefecture', fontsize= 10)
p3= ax3.set_xlabel('Prefecture', fontsize= 8)
p3= ax3.set_ylabel('Mean Store Visit', fontsize= 7)
p3= ax3.tick_params(axis='x', rotation=15, labelsize= 7)  # Customize x-axis ticks

# Adjust layout
fig.tight_layout()
# Show the plots
plt.show()

7 Time series parameters

After engineering new features based on the geographical or culinary properties of the different restaurants, we will now look directly at the time series of their visitor numbers. With only 829 time series, one for each air restaurant, it is actually feasible to look at each of them individually, and there are kernels that plot small or large fractions of them to get a feeling for the data.

Here, we will take a different approach and look at the meta parameters of each time series. Those features are the mean, standard deviation, slope, and slope error of the time series. Each parameter is computed for the log1p-transformed data. In the next code block, we define an extraction function and then run it for each restaurant. Due to the small size of the data set this takes almost no time:

foo <- air_visits %>%
  left_join(air_store, by = "air_store_id") %>%
  group_by(air_store_id, air_genre_name) %>%
  summarise(mean_log_visits = mean(log1p(visitors)),
            sd_log_visits = sd(log1p(visitors)),
            .groups = "keep") %>%
  ungroup()

params_ts1 <- function(rownr){
  bar <- air_visits %>%
    filter(air_store_id == foo$air_store_id[rownr])
  slope <- summary(lm(visitors ~ visit_date, data = bar))$coef[2]
  slope_err <- summary(lm(visitors ~ visit_date, data = bar))$coef[4]
  
  foobar <- tibble(
    air_store_id = foo$air_store_id[rownr],
    slope = slope,
    slope_err = slope_err
  )
  
  return(foobar)
}

params <- params_ts1(1)
for (i in seq(2,nrow(foo))){
  params <- bind_rows(params, params_ts1(i))
}

ts_params <- foo %>%
  left_join(params, by = "air_store_id")

This Python code performs the same operations as the provided R code. It defines a function params_ts1 to calculate the slope and standard error for each air_store_id, then iterates through the foo DataFrame to compute these parameters. Finally, it merges the parameters with foo to obtain ts_params.

import pandas as pd
from statsmodels.regression.linear_model import OLS
from statsmodels.tools import add_constant

# Merge air_visits with air_store
merged_df = pd.merge(air_visits, air_store, on='air_store_id')
# Group by air_store_id and air_genre_name and compute the required statistics
foo = merged_df.groupby(['air_store_id', 'air_genre_name'],
                    observed=True).agg(mean_log_visits=('visitors',
                    lambda x: np.mean(np.log1p(x))),
                    sd_log_visits=('visitors',
                    lambda x: np.std(np.log1p(x), ddof=1))).reset_index()  # ddof=1 matches R's sd()


# Define function to calculate slope and standard error
def params_ts1(air_store_id):
    bar = air_visits[air_visits['air_store_id'] == air_store_id]
    X = add_constant(range(len(bar['visit_date'])))
    model = OLS(bar['visitors'], X)
    results = model.fit()
    slope = results.params[1]
    slope_err = results.bse[1]
    
    foobar = pd.DataFrame({
        'air_store_id': [air_store_id],
        'slope': [slope],
        'slope_err': [slope_err]
    })
    
    return foobar

# Calculate parameters for all rows in foo
params = params_ts1(foo['air_store_id'].iloc[0])
for i in range(1, len(foo)):
    params = pd.concat([params, params_ts1(foo['air_store_id'].iloc[i])], 
                       ignore_index=True)

# Merge parameters with foo
ts_params = foo.merge(params, on='air_store_id', how='left')

Now we will plot the parameters and their one-dimensional and two-dimensional distributions:

p1 <- ts_params %>%
  ggplot(aes(mean_log_visits)) +
  geom_histogram(bins = 50, fill = "blue")

p2 <- ts_params %>%
  ggplot(aes(sd_log_visits)) +
  geom_histogram(bins = 50, fill = "blue")

p3 <- ts_params %>%
  filter(slope < 0.5) %>%
  ggplot(aes(slope)) +
  geom_histogram(bins = 50, fill = "blue") +
  labs(x = "Slope < 0.5")

p4 <- ts_params %>%
  ggplot((aes(mean_log_visits, sd_log_visits))) +
  geom_point(size = 2, color = "blue")

p5 <- ts_params %>%
  ggplot((aes(slope, slope_err))) +
  geom_point(size = 2, color = "blue")

layout <- matrix(c(1,1,2,2,3,3,4,4,4,5,5,5),2,6,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, layout=layout)

We find:

  • The distributions of the mean and standard deviation for the log1p-transformed visitor counts are relatively broad. I don’t think we can assume normality for either of them, although the distributions are relatively symmetric. The shape of the mean vs sd cloud in the lower left panel supports the idea that there is a relation between the two features, with larger average visitor counts having a smaller variance.

  • The slope distribution is narrow and centred on zero. The slope vs slope error plot in the lower right panel shows that there are a few outliers with large slope errors and mostly high absolute slope values:

# Define plots
def plot_p1(ax):
    ax.hist(ts_params['mean_log_visits'], bins=50, color='blue')
    ax.set_title('Mean Log Visits', fontsize=10)
    ax.set_ylabel('Count', fontsize=10)

def plot_p2(ax):
    ax.hist(ts_params['sd_log_visits'], bins=50, color='blue')
    ax.set_title('Standard Deviation Log Visits', fontsize=10)

def plot_p3(ax):
    filtered_data = ts_params[ts_params['slope'] < 0.5]
    ax.hist(filtered_data['slope'], bins=50, color='blue')
    ax.set_title('Slope < 0.5', fontsize= 10)

def plot_p4(ax):
    ax.scatter(ts_params['mean_log_visits'], 
               ts_params['sd_log_visits'], color='blue', s=2)
    ax.set_xlabel('Mean Log Visits', fontsize=6)
    ax.set_ylabel('Sd_Log_Visits', fontsize=10)
    ax.tick_params(axis='y', rotation=0, labelsize= 7) 

def plot_p5(ax):
    ax.scatter(ts_params['slope'], ts_params['slope_err'], color='blue', s=2)
    ax.set_xlabel('Slope', fontsize=6)
    ax.set_ylabel('Slope Error', fontsize=6)
    ax.tick_params(axis='y', rotation=0, labelsize= 7) 
    ax.set_ylim(0, 1)

# Create a 2x6 grid
fig = plt.figure(figsize=(8, 4))
gs = GridSpec(2, 6)

# Plot each subplot
plot_p1(fig.add_subplot(gs[0, :2]))
plot_p2(fig.add_subplot(gs[0, 2:4]))
plot_p3(fig.add_subplot(gs[0, 4:]))

plot_p4(fig.add_subplot(gs[1, :3]))
plot_p5(fig.add_subplot(gs[1, 3:]))

# Adjust layout
plt.tight_layout()

# Show the plots
plt.show()

ts_params %>%
  filter(abs(slope) > 0.25) %>%
  select(air_store_id, air_genre_name, slope, slope_err) %>%
  DT::datatable(options = list(scrollX=TRUE))
ts_params[abs(ts_params['slope']) > 0.25][['air_store_id', 'air_genre_name', 'slope', 'slope_err']]
##              air_store_id                air_genre_name     slope  slope_err
## 71   air_1c0b150f9e696a5f  Okonomiyaki/Monja/Teppanyaki -0.299005   0.219136
## 224  air_4b55d8aea1d2b395                   Cafe/Sweets  0.419028   0.056423
## 246  air_52a08ef3efdb4bb0                  Bar/Cocktail  0.383160   0.166805
## 379  air_789103bf53b8096b                       Izakaya  0.434146   0.254186
## 402  air_8110d68cc869b85e                       Izakaya  0.456303   0.117346
## 447  air_8e492076a1179383                Italian/French  0.489384   0.356846
## 453  air_900d755ebd2f7bbd                Italian/French  3.237594   0.879957
## 482  air_965b2e0cf4119003                       Izakaya  0.322585   0.031688
## 510  air_9c6787aa03a45586                   Cafe/Sweets  0.988855   0.183597
## 555  air_a9a380530c1e121f                   Cafe/Sweets -0.254878   0.179283

Finally, we will take a comprehensive look at the relation between mean log1p visitors and the time series slope using a facet zoom view, as implemented in the ggforce package. Here, we will see the full space on the right side and the zoomed-in version on the left side. The zoom region is indicated on the right side with a darker shading. This view is perfect for situations like this, where a few extreme points would make it difficult to focus on the properties of the majority data cluster. The mean_log_visits and slope are plotted coloured by genre and with associated errorbars in grey:

ts_params %>%
  ggplot(aes(mean_log_visits, slope, color = air_genre_name)) +
  geom_errorbarh(aes(xmin = mean_log_visits - sd_log_visits,
                    xmax = mean_log_visits + sd_log_visits),
                    color = "grey70", size = 0.7) +
  geom_errorbar(aes(ymin = slope - slope_err,
                    ymax = slope + slope_err),
                    color = "grey70", size = 0.7) +
  geom_point() +
  theme(legend.position = "bottom") +
  guides(color = guide_legend(nrow = 3, override.aes = list(size = 4))) +
  labs(color = "") +
  facet_zoom(y = (slope < 0.05 & slope > -0.1))

We find:

  • There’s not a lot of genre clustering, although the three points with the lowest mean log1p visits are all dining bars, for what it’s worth. We find the slope outliers again in the overview plot and notice that all of them have large mean values.

  • Both panels show an overall trend for larger slopes to be associated with higher means. This may not be surprising, since we use the absolute slope value, which has more range for high visitor counts than for low ones. Feel free to plot the slope normalised by mean to see whether the effect persists (a quick sketch follows below).

The extreme values in this plot can now be studied in detail, and we will choose some of them as examples in our next section.
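Following up on the normalisation suggestion above, a quick sketch (assuming the ts_params table from the previous code block):

# slope divided by the mean log1p visits, to remove the scale dependence
slope_norm = ts_params['slope'] / ts_params['mean_log_visits']

plt.scatter(ts_params['mean_log_visits'], slope_norm, s=4, color='blue')
plt.xlabel('Mean log1p visits')
plt.ylabel('Slope / mean log1p visits')
plt.show()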

# Filter data for the zoom panel
filtered_data = ts_params[(ts_params['slope'] < 0.05) & (ts_params['slope'] > -0.1)]

# Create the scatter plot with error bars
#fig = plt.figure(figsize=(8, 6))
fig, axes = plt.subplots(1, 2, figsize=(10, 6))

# Overview panel: plot the full data, as in the ggforce overview
p1= sns.scatterplot(x='mean_log_visits', y='slope', hue='air_genre_name', data=ts_params, ax=axes[0])
p1= p1.errorbar(x=ts_params['mean_log_visits'], y=ts_params['slope'], 
             xerr=ts_params['sd_log_visits'], yerr=ts_params['slope_err'], 
             fmt='none', color='grey', alpha=0.7, elinewidth=0.5)
# Set legend properties
p1= plt.title('Scatter Plot with Error Bars')
p1=axes[0].get_legend().remove()
#plt.legend(title='Air Genre Name', loc='upper left', bbox_to_anchor=(1, 1), ncol=1, fontsize='small')

# Add facet zoom
#ax = plt.gca()
#axins = ax.inset_axes([0.6, 0.6, 0.35, 0.35])
p2= sns.scatterplot(x='mean_log_visits', y='slope', hue='air_genre_name', data=filtered_data, ax=axes[1])
p2=axes[1].set_xlim(-0, 4)
p2=axes[1].set_ylim(-0.1, 0.05)

# Set labels and title
p2= plt.xlabel('Mean Log Visits')
p2= plt.ylabel('Slope')
p2= plt.title('Facet Zoom')
p2= axes[1].get_legend().remove()

plt.show()

8 Forecasting methods and examples

Finally, we will turn our focus to time-series forecasting methods. We have already learnt a lot about our data set and its features. The following sections will introduce basic forecasting methods.

8.1 ARIMA / auto.arima

A popular method for forecasting is the autoregressive integrated moving average model, short ARIMA model. This kind of model consists of three building blocks which are parametrised by the three indices p, d, q as ARIMA(p, d, q) (a small illustrative sketch follows the list below):

  • auto-regressive / p: we are using past data to compute a regression model for future data. The parameter p indicates the range of lags; e.g. ARIMA(3,0,0) includes t-1, t-2, and t-3 values in the regression to compute the value at t.

  • integrated / d: this is a differencing parameter, which gives us the number of times we subtract the previous value of the time series from the current one. Differencing stabilises the mean and removes (seasonal) trends. This is necessary since computing the lags (e.g. the difference between time t and time t-1) is most meaningful if large-scale trends are removed. A time series whose mean, variance (or amount of variability), and autocovariance are time-invariant (i.e. don’t change from day to day) is called stationary.

  • moving average / q: this parameter gives us the number of previous error terms to include in the regression error of the model.
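To make the role of p, d, and q a bit more concrete, here is a small illustrative sketch (not part of the original analysis) that fits a fixed-order ARIMA to a toy series with statsmodels; the order (2, 1, 1) is arbitrary and only meant to show where the three indices enter:

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# toy series: a weekly pattern plus noise, purely illustrative
rng = np.random.default_rng(0)
y = pd.Series(5 + np.tile([0, 0, 0, 1, 2, 2, 1], 20) + rng.normal(0, 0.3, 140))

# d = 1: one round of differencing stabilises the mean of a trending series
y_diff = y.diff().dropna()

# p = 2 autoregressive lags, d = 1 difference, q = 1 moving-average term
fit = ARIMA(y, order=(2, 1, 1)).fit()
print(fit.params)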

Here we will be using the auto.arima tool which estimates the necessary ARIMA parameters for each individual time series. In order to feed our data to auto.arima we need to turn them into a time-series object using the ts tool. We will also add a step for cleaning and outlier removal via the tsclean function of the forecast package. We have already seen that our data contain a strong weekly cycle, which will be one of the pre-set parameters of our model. We will include this knowledge when transforming our data. Let’s set everything up step by step, with comments and explanations, and then turn it into a function. Unhide the code to see how it is implemented.

We use the first air_store_id (“air_ba937bf13d40fb24”) as an example.

air_id = "air_ba937bf13d40fb24"

pred_len <- test %>%
  separate(id, c("air", "store_id", "date"), sep = "_") %>%
  distinct(date) %>%
  nrow()
air_id = "air_ba937bf13d40fb24"

# Separate the 'id' column into three columns
test[['air', 'store_id', 'date']] = test['id'].str.split('_', expand=True)

# Get distinct dates and calculate the number of rows
pred_len = test['date'].nunique()

We choose to predict for the last 39 days of our training sample, which matches the length of the test period. Here we compute the upper end of our training dates and subtract our “prediction length” from this value to define the start of our validation sample on Mar 14th. We also create a data set of all visit_dates in preparation for many time series having gaps.

max_date <- max(air_visits$visit_date)
split_date <- max_date - pred_len
all_visits <- tibble(visit_date = seq(min(air_visits$visit_date),
                                      max(air_visits$visit_date), 1))
from datetime import timedelta
# Find the maximum visit date
max_date = air_visits['visit_date'].max()

# Calculate the split date
split_date = max_date - timedelta(days=pred_len)

# Create a DataFrame with a sequence of dates
all_visits = pd.DataFrame({'visit_date': pd.date_range(start=min(air_visits['visit_date']),
                                                       end=max(air_visits['visit_date']))})

Next, we extract the time series for the specific air_store_id. We transform the visitor counts by log1p and join the data set of all visit_dates. This gives us a number of NAs, which we fill with the overall median. The median might not be the best choice here, but it’s a sensible starting point. Most time series prediction tools require a sequential time series without gaps, which is what we create in this step.

foo <- air_visits %>%
  filter(air_store_id == air_id)

visits <- foo %>%
  right_join(all_visits, by = "visit_date") %>%
  mutate(visitors = log1p(visitors)) %>%
  replace_na(list(visitors = median(log1p(foo$visitors)))) %>%
  rownames_to_column()
# Assuming you have DataFrames 'air_visits', 'all_visits' available

#air_id = "air_900d755ebd2f7bbd"
# Filter air_visits for a specific air_store_id (assuming air_id is a variable)
foo = air_visits[air_visits['air_store_id'] == air_id] 

# Join onto all_visits so every date in the range is kept (the R code's right_join)
visits = pd.merge(all_visits, foo, on='visit_date', how='left')

# Apply log transformation to visitors column
visits['visitors'] = np.log1p(visits['visitors'])

# Replace NA values with the median of log-transformed visitors from 'foo'
median_visitors = np.median(np.log1p(foo['visitors']))
visits['visitors'].fillna(median_visitors, inplace=True)

# Reset row indices
visits = visits.reset_index(drop=True)

Using this new time series, we now split the data into training and validation sets.

visits_train <- visits %>% filter(visit_date <= split_date)
visits_valid <- visits %>% filter(visit_date > split_date)
# Filter visits for training and validation sets based on visit_date
visits_train = visits[visits['visit_date'] <= split_date]
visits_valid = visits[visits['visit_date'] > split_date]

Now comes the fitting part. As said before, we use the ts function to create a time series object and the tsclean tool to remove outliers. We also add the weekly frequency. Setting the stepwise and approximation parameters to FALSE means that the tool performs a more thorough and precise search over all model parameters. This increases the computing time, but for our small data set this doesn’t matter much.

arima.fit <- auto.arima(tsclean(ts(visits_train$visitors, frequency = 7)),
                        stepwise = FALSE, approximation = FALSE)
from pmdarima import auto_arima
from statsmodels.tsa.seasonal import seasonal_decompose

# Assuming 'visits_train' DataFrame is available

# Perform seasonal decomposition of the time series
# (note: in place of R's tsclean, we will fit the ARIMA to the smooth trend component)
result = seasonal_decompose(visits_train['visitors'], model='multiplicative', period=7)

# Extract the trend component (you may need to adjust the index based on the result)
trend = result.trend.dropna()

visits_train.set_index('visit_date', inplace=True)
# Fit ARIMA model
arima_fit = auto_arima(trend.to_numpy(), seasonal=True, m=7, stepwise=False, trace=False) 
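Note that the Python block above stands in for R’s tsclean by fitting the ARIMA to the smooth trend component of a seasonal decomposition. If we instead wanted a rough tsclean-like cleaning step that keeps the full series, a purely illustrative sketch could look like this (the thresholding rule here is an assumption, not what forecast::tsclean actually does internally):

def rough_tsclean(series):
    # interpolate gaps, then clip points that sit far outside the typical range
    s = series.interpolate(limit_direction='both')
    med = s.median()
    mad = (s - med).abs().median()
    lower, upper = med - 3 * 1.4826 * mad, med + 3 * 1.4826 * mad
    return s.clip(lower=lower, upper=upper)

# e.g.: cleaned = rough_tsclean(visits_train['visitors'])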

Using the fitted ARIMA model we will forecast for our “prediction length”. We include confidence intervals.

arima_visits <- arima.fit %>% forecast(h = pred_len, level = c(50,95))
arima_visits %>%
  autoplot +
  geom_line(aes(as.integer(rowname)/7, visitors), data = visits_valid, color = "grey40") +
  labs(x = "Time [weeks]", y = "log1p visitors vs auto.arima predictions")

We find that the first days of the forecast fit quite well, but then our prediction is not able to capture the larger spikes. Still, it’s a useful starting point to compare other methods to.


# Generate forecasts
arima_forecast, conf_int = arima_fit.predict(n_periods=pred_len, return_conf_int=True)

# Convert the forecasts to a DataFrame with appropriate dates

forecast_dates = pd.date_range(start=split_date + pd.Timedelta(days=1), periods=pred_len)

arima_forecast_df = pd.DataFrame({
    'date': forecast_dates,
    'forecast': arima_forecast
})

# Optionally, add confidence intervals
arima_forecast_df['lower_conf_int'] = conf_int[:, 0]
arima_forecast_df['upper_conf_int'] = conf_int[:, 1]
# Plot the points and the forecasts
x_axis = np.arange(visits_train.shape[0] + arima_forecast.shape[0])
x_dates = visits['visit_date']

p= plt.plot(x_dates[x_axis[:visits_train.shape[0]]].to_numpy(), 
            visits_train['visitors'].to_numpy(), 
            alpha=0.75, color='black')
p= plt.plot(arima_forecast_df['date'].to_numpy(), 
            arima_forecast_df['forecast'].to_numpy(), 
            alpha=0.75, color='blue')  # Forecasts
p= plt.plot(visits_valid['visit_date'].to_numpy(), 
               visits_valid['visitors'],
               alpha=0.4, color='grey')  # valid data
#p= plt.scatter(visits_valid['visit_date'].to_numpy(), 
#               visits_valid['visitors'],
#               alpha=0.4, color='grey', marker='x')
p= plt.fill_between(arima_forecast_df['date'].to_numpy(),
                 arima_forecast_df['lower_conf_int'], 
                 arima_forecast_df['upper_conf_int'], 
                 alpha=0.2, color='b')
#p= plt.title("ARIMA forecasts")
p= plt.xlabel("dates")
p=plt.ylabel("log1p visitors vs auto.arima predictions")

plt.show(p)
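To put “fits quite well” into a number, we can score the forecast against the validation window with the RMSE on the log1p scale, which is essentially the competition’s RMSLE. A small sketch, assuming the arima_forecast_df and visits_valid objects created above:

# align forecast and validation data on the date and score on the log1p scale
eval_df = arima_forecast_df.merge(visits_valid[['visit_date', 'visitors']],
                                  left_on='date', right_on='visit_date', how='inner')
rmsle = np.sqrt(np.mean((eval_df['forecast'] - eval_df['visitors']) ** 2))
print(f"validation RMSLE (log1p scale): {rmsle:.3f}")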

Now we turn this procedure into a function, including the plotting part.

plot_auto_arima_air_id <- function(air_id){

  pred_len <- test %>%
    separate(id, c("air", "store_id", "date"), sep = "_") %>%
    distinct(date) %>%
    nrow()

  max_date <- max(air_visits$visit_date)
  split_date <- max_date - pred_len
  all_visits <- tibble(visit_date = seq(min(air_visits$visit_date), 
                                        max(air_visits$visit_date), 1))
  
  foo <- air_visits %>%
    filter(air_store_id == air_id)

  visits <- foo %>%
    right_join(all_visits, by = "visit_date") %>%
    mutate(visitors = log1p(visitors)) %>%
    replace_na(list(visitors = median(log1p(foo$visitors)))) %>%
    rownames_to_column()
  
  visits_train <- visits %>% filter(visit_date <= split_date)
  visits_valid <- visits %>% filter(visit_date > split_date)

  arima.fit <- auto.arima(tsclean(ts(visits_train$visitors, frequency = 7)),
                          stepwise = FALSE, approximation = FALSE)

  arima_visits <- arima.fit %>% forecast(h = pred_len, level = c(50,95))

  arima_visits %>%
    autoplot +
    geom_line(aes(as.integer(rowname)/7, visitors), data = visits_valid, color = "grey40") +
    labs(x = "Time [weeks]", y = "log1p visitors vs forecast")
}

def plot_auto_arima_air_id(air_id, ax):
    # note: the ax argument is currently unused; each call creates its own figure below
  
    # Separate the 'id' column into three columns
    test[['air', 'store_id', 'date']] = test['id'].str.split('_', expand=True)
    # Get distinct dates and calculate the number of rows
    pred_len = test['date'].nunique()
    # Find the maximum visit date
    max_date = air_visits['visit_date'].max()
  
    # Calculate the split date
    split_date = max_date - timedelta(days=pred_len)
  
    # Create a DataFrame with a sequence of dates
    all_visits = pd.DataFrame({'visit_date': pd.date_range(start=min(air_visits['visit_date']),
                                                            end=max(air_visits['visit_date']))})
    
    # Filter air_visits for a specific air_store_id (assuming air_id is a variable)
    foo = air_visits[air_visits['air_store_id'] == air_id]
  
    # Join onto all_visits so every date in the range is kept (the R code's right_join)
    visits = pd.merge(all_visits, foo, on='visit_date', how='left')
  
    # Apply log transformation to visitors column
    visits['visitors'] = np.log1p(visits['visitors'])
  
    # Replace NA values with the median of log-transformed visitors from 'foo'
    median_visitors = np.median(np.log1p(foo['visitors']))
    visits['visitors'].fillna(median_visitors, inplace=True)
  
    # Reset row indices
    visits = visits.reset_index(drop=True)
    # Filter visits for training and validation sets based on visit_date
    visits_train = visits[visits['visit_date'] <= split_date]
    visits_valid = visits[visits['visit_date'] > split_date]
    
    
    # Perform seasonal decomposition of time series
    result = seasonal_decompose(visits_train['visitors'], model='multiplicative', period=7)
  
    # Extract the trend component (you may need to adjust the index based on the result)
    trend = result.trend.dropna()
    visits_train.set_index('visit_date', inplace=True)
    
    # Fit ARIMA model
    arima_fit = auto_arima(trend.to_numpy(), seasonal=True, m=7, stepwise=False, trace=False)
    
    # Generate forecasts
    arima_forecast, conf_int = arima_fit.predict(n_periods=pred_len, return_conf_int=True)
  
    # Convert the forecasts to a DataFrame with appropriate dates
    forecast_dates = pd.date_range(start=split_date + pd.Timedelta(days=1), periods=pred_len)
  
    arima_forecast_df = pd.DataFrame({
      'date': forecast_dates,
      'forecast': arima_forecast
    })
  
    # Optionally, add confidence intervals
    arima_forecast_df['lower_conf_int'] = conf_int[:, 0]
    arima_forecast_df['upper_conf_int'] = conf_int[:, 1]
    
    # Plot the points and the forecasts
    x_axis = np.arange(visits_train.shape[0] + arima_forecast.shape[0])
    x_dates = visits['visit_date']
    
    fig, ax = plt.subplots(figsize=(8, 8))
    
    ax.plot(x_dates[x_axis[:visits_train.shape[0]]].to_numpy(), 
              visits_train['visitors'].to_numpy(), 
              alpha=0.75, color='black')
    ax.plot(arima_forecast_df['date'].to_numpy(), 
              arima_forecast_df['forecast'].to_numpy(), 
              alpha=0.75, color='blue')  # Forecasts
    ax.plot(visits_valid['visit_date'].to_numpy(), 
                 visits_valid['visitors'],
                 alpha=0.4, color='grey')  # valid data
    #ax.scatter(visits_valid['visit_date'].to_numpy(), 
    #               visits_valid['visitors'],
    #               alpha=0.4, color='grey', marker='x',
    #               ax=["ax" + str(axe)])
    ax.fill_between(arima_forecast_df['date'].to_numpy(),
                   arima_forecast_df['lower_conf_int'], 
                   arima_forecast_df['upper_conf_int'], 
                   alpha=0.2, color='b')
    #ax.plt.title("ARIMA forecasts")
    ax.tick_params(axis='x', rotation=45, labelsize= 7) 
    plt.xlabel("dates")
    plt.ylabel("log1p visitors vs auto.arima predictions")
    plt.show()
    

And we apply this function to a few time series, including two of the slope outliers from the previous section:

ranked_store <- air_visits %>%
                group_by(air_store_id) %>%
                summarize(N = sum(visitors)) %>%
                arrange(desc(N))

p1 <- plot_auto_arima_air_id("air_f3f9824b7d70c3cf")
p2 <- plot_auto_arima_air_id("air_8e4360a64dbd4c50")
#p3 <- plot_auto_arima_air_id("air_399904bdb7685ca0")
#p4 <- plot_auto_arima_air_id("air_f26f36ec4dc5adb0")

#layout <- matrix(c(1,2,3,4),2,2,byrow=TRUE)
layout <- matrix(c(1,2),1,2,byrow=TRUE)
multiplot(p1, p2, layout=layout)

We find:

  • The two time series in the upper panels are reasonably complete, but we see that the long gaps (and our median filling) lead to problems in the predictions in the upper left panel, where we lose the weekly periodicity. The upper right panel retains this periodicity and the predictions for the first days are relatively decent, but then we quickly under-predict the amplitude of the variations.

  • The lower panels include two of the outliers from our time-series parameter space above; and here we see cases where things go really wrong. These kinds of peculiar time series could lead to bad performance for any otherwise decent forecasting algorithm if they contain a large enough fraction of visits in the test data set.

Overall, the results are not great, but given that it’s a fully automatic forecast (assuming only weekly periodicities) the auto.arima tool gives us a first baseline to compare other methods to.

For a more detailed exploration of ARIMA models take a look at this Kernel.

## Rank stores to select the most important ones
ranked_store = air_visits.groupby('air_store_id', 
                observed=True)['visitors'].sum().sort_values(ascending=False)

fig = plt.figure(figsize=(8, 4))
gs = GridSpec(2, 1)
Air_id = ["air_f3f9824b7d70c3cf","air_8e4360a64dbd4c50"]
#,"air_1c0b150f9e696a5f", "air_820d1919cbecaa0a"]
j=0
for g in gs:
  #print(g)
  plot_auto_arima_air_id(air_id= Air_id[j], ax= g)
  j = j+1

8.2 Prophet

The prophet forecasting tool is open-source software developed by Facebook. It is available for both R and Python.

Prophet utilises an additive regression model which decomposes a time series into (i) a (piecewise) linear/logistic trend, (ii) a yearly seasonal component, (iii) a weekly seasonal component, and (iv) an optional list of important days (such as holidays, special events, …). It claims to be “robust to missing data, shifts in the trend, and large outliers”. Especially the missing-data functionality could be useful in this competition.

Let’s again explore the tool step by step. We will build on our work in the ARIMA section and won’t repeat any explanations that can be found there.

We will again create a training and validation set for the same periods as above (i.e. before/after Mar 14th). The only differences to our ARIMA approach are that:

  • We don’t need to replace NA values because prophet knows how to handle those.

  • Prophet expects a data frame with two columns: ds for the dates and y for the time series variable.

air_id = "air_ba937bf13d40fb24"

pred_len <- test %>%
  separate(id, c("air", "store_id", "date"), sep = "_") %>%
  distinct(date) %>%
  nrow()

max_date <- max(air_visits$visit_date)
split_date <- max_date - pred_len
all_visits <- tibble(visit_date = seq(min(air_visits$visit_date), 
                                      max(air_visits$visit_date), 1))

foo <- air_visits %>%
  filter(air_store_id == air_id)

visits <- foo %>%
  right_join(all_visits, by = "visit_date") %>%
  mutate(visitors = log1p(visitors)) %>%
  rownames_to_column() %>%
  select(y = visitors,
         ds = visit_date)

visits_train <- visits %>% filter(ds <= split_date)
visits_valid <- visits %>% filter(ds > split_date)
air_id = "air_ba937bf13d40fb24"

# Calculate prediction length
pred_len = test['id'].str.split('_', expand=True)[2].nunique()

# Find max date
max_date = air_visits['visit_date'].max()

# Calculate split date
split_date = max_date - timedelta(days=pred_len)

# Create a DataFrame with all visit dates
all_visits = pd.DataFrame({
             'visit_date': pd.date_range(start=air_visits['visit_date'].min(),
              end=max_date)})

# Filter air_visits for the specified air_id
foo = air_visits[air_visits['air_store_id'] == air_id]

# Merge with all_visits to ensure all dates are included
visits = all_visits.merge(foo, on='visit_date', how='left')

# Apply log transformation
visits['visitors'] = visits['visitors'].apply(lambda x: np.log1p(x))

# Split into train and validation sets
visits_train = visits[visits['visit_date'] <= split_date]
visits_valid = visits[visits['visit_date'] > split_date]

Here we fit the prophet model and make the forecast:

  • the parameter changepoint.prior.scale adjusts the trend flexibility. Increasing this parameter makes the fit more flexible, but also increases the forecast uncertainties and makes it more likely to overfit to noise. The changepoints in the data are detected automatically unless specified by hand using the changepoints argument (which we don’t do here).

  • the parameter yearly.seasonality has to be enabled/disabled explicitly and allows prophet to notice large-scale cycles. We have barely a year of data here, which is definitely insufficient to find yearly cycles and probably not enough to identify variations on the time scales of months. Feel free to test the performance of this parameter.

proph <- prophet(visits_train, changepoint.prior.scale=0.5,
                yearly.seasonality=FALSE, daily.seasonality = FALSE)

future <- make_future_dataframe(proph, periods = pred_len)

fcast <- predict(proph, future)

plot(proph, fcast)

from prophet import Prophet
from prophet.plot import plot

# Create a Prophet instance
proph = Prophet(changepoint_prior_scale=0.5,
                yearly_seasonality=False,
                daily_seasonality=False)

# Prepare the DataFrame
visits_train = visits_train.rename(columns={'visit_date': 'ds', 'visitors': 'y'})

# Fit the model
proph.fit(visits_train)
## <prophet.forecaster.Prophet object at 0x7ff5c9921300>
## 
## 12:31:12 - cmdstanpy - INFO - Chain [1] start processing
## 12:31:12 - cmdstanpy - INFO - Chain [1] done processing
# Create a DataFrame for future dates
future = proph.make_future_dataframe(periods=pred_len)

# Generate forecasts
fcast = proph.predict(future)

fcast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
##             ds      yhat  yhat_lower  yhat_upper
## 389 2017-04-18  2.619568    2.080136    3.131940
## 390 2017-04-19  2.874313    2.354824    3.440354
## 391 2017-04-20  2.671990    2.154216    3.209783
## 392 2017-04-21  3.216012    2.637781    3.758921
## 393 2017-04-22  2.995465    2.478023    3.541450
plt.style.use('ggplot')
#fig = proph.plot(fcast)

# Plot the value with line and scatter
#plt.plot(fcast['ds'], fcast['yhat'], label='Value', color='red')
p= plt.scatter(fcast['ds'], fcast['yhat'], color='black', alpha=0.5, s =4)

# Plot the confidence intervals
p= plt.fill_between(fcast['ds'], 
                    fcast['yhat_lower'], fcast['yhat_upper'], 
                    color='blue', alpha=0.05) #, label='Confidence Interval'

# Add labels and legend
p= plt.xlabel('Date')
p= plt.ylabel('yhat')
p= plt.tick_params(axis='x', rotation=10, labelsize= 7)
#plt.legend()

# Show the plot
plt.show(p)

The observed data are plotted as black points and the fitted model, plus forecast, as a blue line. In light blue we see the corresponding uncertainties.

Prophet offers a decomposition plot, where we can inspect the additive components of the model: trend, yearly seasonality (if included), and weekly cycles:

prophet_plot_components(proph, fcast)

from prophet.plot import plot_components
# Generate the components plot
#fig = plot_components(proph, fcast)

# Create a 2x2 grid
fig = plt.figure(figsize=(6, 4))
gs = GridSpec(2, 1)

ax1 = fig.add_subplot(gs[0, 0])
ax2 = fig.add_subplot(gs[1, 0])

p= ax1.plot(fcast['ds'], fcast['trend'], color = 'blue')
p= ax1.set_xlabel('date', fontsize=6)
p= ax1.set_ylabel('Trend', fontsize=6)
p= ax1.tick_params(axis='x', rotation=10, labelsize= 7) 

fcast['ds'] = pd.to_datetime(fcast['ds'])
fcast['weekday'] = fcast['ds'].dt.day_name()
weekly = fcast.groupby(['weekday'])['weekly'].mean().reindex(['Sunday',
         'Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'])

p= ax2.plot(weekly, color = 'blue')
p= ax2.set_xlabel('date', fontsize=6)
p= ax2.set_ylabel('Weekly', fontsize=6)
p= ax2.tick_params(axis='x', rotation=10, labelsize= 7)

# Adjust layout
plt.tight_layout()
plt.show(p)

We find:

  • Prophet detects a weekly variation pattern which is similar to what we had found before, in that Fri/Sat are more popular than the rest of the week. The difference, however, is that here Sun has a very low average visitor count. This might be due to the properties of this particular restaurant, a “Dining Bar” in “Tokyo”. In Fig. 23 we see that “Dining Bars” in general are not as busy on Sundays, although not to such a large extent as we see in this graph. The difference in visitors on Sat vs Sun might be an interesting feature of an eventual model.

  • The long-term trend is different from the average behaviors displayed in Fig. 11, which were more likely to rise toward the end of 2016. Here we see a peak in mid 2016.

  • Note that this is one specific time series. I’ve checked other time series as well and many of them showed the expected Fri/Sat/Sun vs Mon-Thu variations, along with a long-term peak around Dec 2016. Feel free to fork this Kernel and check for yourself.
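As a quick sanity check of the additive structure, a small sketch (assuming the fcast data frame from above and the default additive seasonality):

# with purely additive components, yhat is just the trend plus the summed seasonal terms
check = np.allclose(fcast['trend'] + fcast['additive_terms'], fcast['yhat'])
print("yhat == trend + additive_terms:", check)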

So: This is a good start, but we will of course use our trusted ggplot2 to give us more control over the plotting parameters:

fcast %>%
  as_tibble() %>%
  mutate(ds = date(ds)) %>%
  ggplot(aes(ds, yhat)) + 
  geom_ribbon(aes(x = ds, ymin = yhat_lower, ymax = yhat_upper), fill = "light blue", na.rm = TRUE) +
  geom_line(colour = "blue", na.rm = TRUE) +
  geom_line(data = visits_train, aes(ds, y), colour = "black", na.rm = TRUE) +
  geom_line(data = visits_valid, aes(ds, y), colour = "grey50", na.rm = TRUE)

That doesn’t look too bad. We plot our training visitor counts in black, the validation set in grey, and the forecast plus uncertainties in blue and light blue, again.

# Convert 'ds' column to datetime
fcast['ds'] = pd.to_datetime(fcast['ds'])

# Plotting
p= plt.figure(figsize=(10, 6))

# Forecast line
p= plt.plot(fcast['ds'], fcast['yhat'], color='blue', label='Forecast')

# Confidence interval ribbon
p= plt.fill_between(fcast['ds'], fcast['yhat_lower'], fcast['yhat_upper'], 
                    color='lightblue', alpha=0.5)

# Training data
p= plt.plot(visits_train['ds'], visits_train['y'], 
            color='black', label='Training Data')

# Validation data
p= plt.plot(visits_valid['visit_date'], visits_valid['visitors'],
            color='grey', label='Validation Data')

p= plt.tick_params(axis='x', rotation=10, labelsize= 7)
# Add labels and title
p= plt.xlabel('Date')
p= plt.ylabel('yhat')
p= plt.title('Prophet Forecast with Confidence Intervals')

# Add legend
#p= plt.legend()

# Show the plot
plt.show(p)

Let’s turn this into another function:

plot_prophet_air_id <- function(air_id){
  
  pred_len <- test %>%
    separate(id, c("air", "store_id", "date"), sep = "_") %>%
    distinct(date) %>%
    nrow()

  max_date <- max(air_visits$visit_date)
  split_date <- max_date - pred_len
  all_visits <- tibble(visit_date = seq(min(air_visits$visit_date), 
                                        max(air_visits$visit_date), 1))

  foo <- air_visits %>%
    filter(air_store_id == air_id)

  visits <- foo %>%
    right_join(all_visits, by = "visit_date") %>%
    mutate(visitors = log1p(visitors)) %>%
    rownames_to_column() %>%
    select(y = visitors,
          ds = visit_date)

  visits_train <- visits %>% filter(ds <= split_date)
  visits_valid <- visits %>% filter(ds > split_date)
  
  proph <- prophet(visits_train, changepoint.prior.scale=0.5,
                   yearly.seasonality=FALSE, 
                   daily.seasonality = FALSE,
                   weekly.seasonality=TRUE)
  future <- make_future_dataframe(proph, periods = pred_len)
  fcast <- predict(proph, future)
  
  p <- fcast %>%
    as_tibble() %>%
    mutate(ds = date(ds)) %>%
    ggplot(aes(ds, yhat)) +
    geom_ribbon(aes(x = ds, ymin = yhat_lower, 
                    ymax = yhat_upper), fill = "light blue") +
    geom_line(colour = "blue", na.rm = TRUE) +
    geom_line(data = visits_train, aes(ds, y), colour = "black", na.rm = TRUE) +
    geom_line(data = visits_valid, aes(ds, y), colour = "grey50", na.rm=TRUE) +
    labs(title = str_c("Prophet for ", air_id))
  
  return(p)
}  

def plot_prophet_air_id(air_id, ax):
  
  # Separate the 'id' column into three columns
    test[['air', 'store_id', 'date']] = test['id'].str.split('_', expand=True)
  # Get distinct dates and calculate the number of rows
    pred_len = test['date'].nunique()
    # Find the maximum visit date
    max_date = pd.to_datetime(air_visits['visit_date'].max())

    # Calculate the split date
    split_date = max_date - timedelta(days=pred_len)

    # Create a DataFrame with a sequence of dates
    all_visits = pd.DataFrame(
      {'visit_date': pd.date_range(start=min(air_visits['visit_date']),
                                  end=max(air_visits['visit_date']))}
    )
    
    # Filter air_visits for the specified air_id
    foo = air_visits[air_visits['air_store_id'] == air_id]

    #foo['visit_date'] = pd.to_datetime(foo['visit_date'])
     
    # Merge with all_visits to ensure all dates are included
    visits = foo.merge(all_visits, on='visit_date', how='right')

    # Apply log transformation
    visits['visitors'] = visits['visitors'].apply(lambda x: np.log1p(x))
    visits['y'] = visits['visitors']
    visits['ds'] = visits['visit_date']
    #visits = visits[["ds", "y"]]

    # Split into train and validation sets
    visits_train = visits[visits['ds'] <= split_date]
    visits_valid = visits[visits['ds'] > split_date]
    
    # Create a Prophet instance
    proph = Prophet(changepoint_prior_scale=0.5,
                yearly_seasonality=False,
                daily_seasonality=False)
    # Fit the model
    proph.fit(visits_train)

    # Create a DataFrame for future dates
    future = proph.make_future_dataframe(periods=pred_len)

    # Generate forecasts
    fcast = proph.predict(future)
    
    # Convert 'ds' column to datetime
    fcast['ds'] = pd.to_datetime(fcast['ds'])

    # set plot
    fig, ax = plt.subplots(figsize=(8, 4))
    plt.style.use('ggplot')
    ax = plt.gca()

    # Forecast line
    ax.plot(fcast['ds'], fcast['yhat'], 
            color='blue', label='Forecast')

    # Training data
    ax.plot(visits_train['ds'], visits_train['y'],
           color='black', label='Training Data')

    # Validation data
    ax.plot(visits_valid['visit_date'], visits_valid['visitors'],
            color='grey', label='Validation Data')

    # Confidence interval ribbon
    ax.fill_between(fcast['ds'], fcast['yhat_lower'],
                    fcast['yhat_upper'], color='lightblue',
                    alpha=0.5)
                    
    # Add labels and title
    ax.tick_params(axis='x', rotation=10, labelsize= 7)
    ax.set_xlabel('Date')
    ax.set_ylabel('yhat')
    #return(ax)
    plt.show()
p1 <- plot_prophet_air_id("air_f3f9824b7d70c3cf")
p2 <- plot_prophet_air_id("air_8e4360a64dbd4c50")
p3 <- plot_prophet_air_id("air_1c0b150f9e696a5f")
## n.changepoints greater than number of observations. Using 9
p4 <- plot_prophet_air_id("air_820d1919cbecaa0a")

layout <- matrix(c(1,2,3,4),2,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, layout=layout)

That doesn’t look bad either. It looks like we’re overestimating the trend component, which could probably use less flexibility, for at least three of these. The fourth time series has too little data for prophet to be able to do much. Here we could discard the trend component completely or simply take the median.
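If we wanted to test the “less flexibility” idea, changepoint_prior_scale is the knob to turn. A minimal sketch on the single-store example (assuming the visits_train data frame with ds/y columns and pred_len from above are still in scope):

# a stiffer trend: much smaller changepoint_prior_scale than the 0.5 used above
proph_stiff = Prophet(changepoint_prior_scale=0.05,
                      yearly_seasonality=False,
                      daily_seasonality=False)
proph_stiff.fit(visits_train[['ds', 'y']])
fcast_stiff = proph_stiff.predict(proph_stiff.make_future_dataframe(periods=pred_len))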

## WAY 1
# Create a 4x1 grid
fig = plt.figure(figsize=(6, 3))
gs = GridSpec(4, 1)
Air_id = ["air_f3f9824b7d70c3cf","air_8e4360a64dbd4c50",
          "air_1c0b150f9e696a5f", "air_820d1919cbecaa0a"]
j=0
for g in gs:
  #print(g)
  plot_prophet_air_id(air_id= Air_id[j], ax= g)
  j = j+1
## 12:31:37 - cmdstanpy - INFO - Chain [1] start processing
## 12:31:37 - cmdstanpy - INFO - Chain [1] done processing
## 12:31:38 - cmdstanpy - INFO - Chain [1] start processing
## 12:31:38 - cmdstanpy - INFO - Chain [1] done processing
## 12:31:40 - cmdstanpy - INFO - Chain [1] start processing
## 12:31:40 - cmdstanpy - INFO - Chain [1] done processing
## 12:31:41 - cmdstanpy - INFO - Chain [1] start processing
## 12:31:41 - cmdstanpy - INFO - Chain [1] done processing

  #print(j)


# ## WAY 2
# fig, axes = plt.subplots(4, 1, figsize=(12, 4))
# Air_id = ["air_f3f9824b7d70c3cf","air_8e4360a64dbd4c50","air_1c0b150f9e696a5f", "air_820d1919cbecaa0a"]
# j=0
# for x in axes:
#   print(x)
#   plot_prophet_air_id(air_id= Air_id[j], ax= x)
#   j = j+1
#   print(j)
# plt.tight_layout()                         
# plt.show()
#   
# #WAY 3
# fig = plt.figure(figsize=(12, 4))
# gs = GridSpec(2, 2)
# 
# p=plot_prophet_air_id(air_id = "air_ba937bf13d40fb24",   #air_f3f9824b7d70c3cf
#                         ax = fig.add_subplot(gs[0, 0]))
# 
# p=plot_prophet_air_id(air_id = "air_8e4360a64dbd4c50",
#                           ax= fig.add_subplot(gs[0,1]))
# 
# p= plot_prophet_air_id(air_id = "air_1c0b150f9e696a5f",
#                         ax= fig.add_subplot(gs[1,0]))
# 
# p= plot_prophet_air_id(air_id = "air_820d1919cbecaa0a",
#                           ax= fig.add_subplot(gs[1,1]))
# 
# plt.tight_layout()
# plt.show(p)

Remember that prophet also gives us the possibility to include holidays and other special events in our forecast. For our task, this could be very useful in taking the Golden Week into account. For more insights from a tailored analysis of the Golden Week’s impact have a look at this kernel.

Since the holiday data is readily available in this competition, we only need to rename it to the format prophet expects in order to include it in our forecast function. Let’s modify that function to see whether holidays make a difference when predicting for the 2016 Golden Week. The only other thing we need to do is to truncate our air_visits data at May 31 2016 in an intermediate step:

plot_prophet_air_id_holiday <- function(air_id, use_hday){
  
  air_visits_cut <- air_visits %>%
    filter(visit_date <= ymd("20160531"))
  
  hday <- holidays %>%
    filter(holiday_flg == TRUE) %>%
    mutate(holiday = "holiday") %>%
    select(ds = date, holiday)
  
  pred_len <- test %>%
    separate(id, c("air", "store_id", "date"), sep = "_") %>%
    distinct(date) %>%
    nrow()

  max_date <- max(air_visits_cut$visit_date)
  split_date <- max_date - pred_len
  all_visits <- tibble(
               visit_date = seq(min(air_visits_cut$visit_date),
                                max(air_visits_cut$visit_date), 1)
               )

  foo <- air_visits_cut %>%
    filter(air_store_id == air_id)

  visits <- foo %>%
    right_join(all_visits, by = "visit_date") %>%
    mutate(visitors = log1p(visitors)) %>%
    rownames_to_column() %>%
    select(y = visitors,
          ds = visit_date)

  visits_train <- visits %>% filter(ds <= split_date)
  visits_valid <- visits %>% filter(ds > split_date)
  
  if (use_hday == TRUE){
    proph <- prophet(visits_train,
                     changepoint.prior.scale=0.5,
                     yearly.seasonality=FALSE,
                     daily.seasonality=FALSE,
                     holidays = hday)
    ptitle = "Prophet (w/ holidays) for "
  } else {
     proph <- prophet(visits_train,
                     changepoint.prior.scale=0.5,
                     yearly.seasonality=FALSE,
                     daily.seasonality = FALSE)
    ptitle = "Prophet for "
  }
  
  future <- make_future_dataframe(proph, periods = pred_len)
  fcast <- predict(proph, future)
  
  p <- fcast %>%
    as_tibble() %>%
    mutate(ds = date(ds)) %>%
    ggplot(aes(ds, yhat)) +
    geom_ribbon(aes(x = ds, ymin = yhat_lower, ymax = yhat_upper), fill = "light blue", na.rm = TRUE) +
    geom_line(colour = "blue", na.rm = TRUE) +
    geom_line(data = visits_train, aes(ds, y), colour = "black", na.rm = TRUE) +
    geom_line(data = visits_valid, aes(ds, y), colour = "grey50", na.rm = TRUE) +
    labs(title = str_c(ptitle, air_id))
  
  return(p)
}  

And then we take a well behaved time series and compare the predictions with and without holidays:

p1 <- plot_prophet_air_id_holiday("air_5c817ef28f236bdf", TRUE)
p2 <- plot_prophet_air_id_holiday("air_5c817ef28f236bdf", FALSE)

layout <- matrix(c(1,2),2,1,byrow=TRUE)
multiplot(p1, p2, layout=layout)

We find:

  • There is a subtle improvement in fitting the Golden Week visitors when including holidays. The performance of this component might improve if there are more holidays included in the training set; but I think it’s difficult to validate this.
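For the Python function below we take a shortcut and use prophet’s built-in Japanese holiday calendar. If we wanted to mirror the R approach exactly, the competition’s holiday table could be reshaped into the two-column format prophet expects, roughly like this (assuming the holidays data frame with date and holiday_flg columns used in the R code is also available in Python):

# prophet expects a data frame with columns 'ds' (date) and 'holiday' (label)
hday = (holidays[holidays['holiday_flg'] == 1]
        .assign(holiday='holiday')
        .rename(columns={'date': 'ds'})[['ds', 'holiday']])

proph = Prophet(changepoint_prior_scale=0.5, yearly_seasonality=False,
                daily_seasonality=False, holidays=hday)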

def plot_prophet_air_id_holiday(air_id, use_hday, ax):
  
    # Separate the 'id' column into three columns
    test[['air', 'store_id', 'date']] = test['id'].str.split('_', expand=True)
  # Get distinct dates and calculate the number of rows
    pred_len = test['date'].nunique()
    # Find the maximum visit date
    max_date = pd.to_datetime(air_visits['visit_date'].max())

    # Calculate the split date
    split_date = max_date - timedelta(days=pred_len)

    # Create a DataFrame with a sequence of dates
    all_visits = pd.DataFrame({'visit_date': pd.date_range(start=min(air_visits['visit_date']),
                                                            end=max(air_visits['visit_date']))})
    
    # Filter air_visits for the specified air_id
    foo = air_visits[air_visits['air_store_id'] == air_id]

    # Merge with all_visits to ensure all dates are included
    visits = foo.merge(all_visits, on='visit_date', how='right')
    #visits = foo.merge(all_visits, on='visit_date', how='right')

    # Apply log transformation
    visits['visitors'] = visits['visitors'].apply(lambda x: np.log1p(x))
    visits['y'] = visits['visitors']
    visits['ds'] = visits['visit_date']
    #visits = visits[["ds", "y"]]

    # Split into train and validation sets
    visits_train = visits[visits['ds'] <= split_date]
    visits_valid = visits[visits['ds'] > split_date]
    
    if use_hday == True:
      # Create a Prophet instance
      proph = Prophet(changepoint_prior_scale=0.5,
                yearly_seasonality=False,
                daily_seasonality=False)      
      # Add holidays (note: unlike the R version, we use prophet's built-in Japanese
      # holiday calendar here and keep the full air_visits date range)
      proph.add_country_holidays(country_name='JP')
      ptitle = "Prophet (w/ holidays) for "

    else:
        # Create a Prophet instance
      proph = Prophet(changepoint_prior_scale=0.5,
                yearly_seasonality=False,
                daily_seasonality=False)
      ptitle = "Prophet for"
      
    # Fit the model
    proph.fit(visits_train)

    # Create a DataFrame for future dates
    future = proph.make_future_dataframe(periods=pred_len)

    # Generate forecasts
    fcast = proph.predict(future)
    
    # Convert 'ds' column to datetime
    fcast['ds'] = pd.to_datetime(fcast['ds'])

    # set plot
    fig, ax = plt.subplots(figsize=(8, 4))
    plt.style.use('ggplot')

    # Confidence interval ribbon
    ax.fill_between(fcast['ds'], fcast['yhat_lower'], 
                    fcast['yhat_upper'], color='lightblue',
                    alpha=0.5)

    # Forecast line
    ax.plot(fcast['ds'], fcast['yhat'], 
            color='blue', label='Forecast')

    # Training data
    ax.plot(visits_train['ds'], visits_train['y'], 
           color='black', label='Training Data')

    # Validation data
    ax.plot(visits_valid['visit_date'], visits_valid['visitors'],
            color='grey', label='Validation Data')

    # Add labels and title
    ax.tick_params(axis='x', rotation=10, labelsize= 7)
    ax.set_xlabel('Date')
    ax.set_ylabel('yhat')
    ax.set_title(ptitle + air_id)
    return(ax)
 
# Create a 2x1 grid (two rows, one column)
fig = plt.figure(figsize=(12, 4))
gs = GridSpec(2, 1)

#Add plots to the grid
p= plot_prophet_air_id_holiday(air_id = "air_f3f9824b7d70c3cf", use_hday=True,
                        ax = fig.add_subplot(gs[0]))
## 12:31:49 - cmdstanpy - INFO - Chain [1] start processing
## 12:31:49 - cmdstanpy - INFO - Chain [1] done processing
p= plot_prophet_air_id_holiday(air_id = "air_f3f9824b7d70c3cf", use_hday=False,
                         ax= fig.add_subplot(gs[1]))
## 12:31:51 - cmdstanpy - INFO - Chain [1] start processing
## 12:31:51 - cmdstanpy - INFO - Chain [1] done processing
                         
plt.show()
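One way to gauge how much the holiday term actually contributes is Prophet’s component plot. A minimal sketch, assuming the fitted model proph and its forecast fcast are made available (e.g. returned) by the function above:

# Sketch: inspect the fitted components. With add_country_holidays('JP')
# the component plot gains a holidays panel and the forecast frame a
# 'holidays' column with the combined holiday effect.
fig_comp = proph.plot_components(fcast)
print(fcast[['ds', 'holidays']].tail())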

8.3 Holt-Winters

A more traditional approach to time series filtering and forecasting is the Holt-Winters algorithm, as implemented in the stats package. This is an exponential smoothing method that uses exponentially weighted moving averages of the level, trend, and seasonal components to forecast the series. Below, after a short illustrative sketch of these recursions, we define a default seasonal model in a fitting and plotting function:
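For intuition, here is a bare-bones sketch of the additive Holt-Winters recursions with fixed smoothing parameters alpha, beta, gamma (an illustration only; stats::HoltWinters additionally optimises these parameters and offers multiplicative variants):

# Illustrative sketch of additive Holt-Winters: level l, trend b and
# seasonal indices s with period m are updated by exponential smoothing.
import numpy as np

def holt_winters_additive(y, m=7, alpha=0.3, beta=0.05, gamma=0.2, horizon=14):
    y = np.asarray(y, dtype=float)
    l = y[:m].mean()
    b = (y[m:2 * m].mean() - y[:m].mean()) / m
    s = list(y[:m] - y[:m].mean())          # initial seasonal indices
    for t in range(m, len(y)):
        season = s[t - m]
        l_prev = l
        l = alpha * (y[t] - season) + (1 - alpha) * (l + b)
        b = beta * (l - l_prev) + (1 - beta) * b
        s.append(gamma * (y[t] - l) + (1 - gamma) * season)
    # h-step-ahead forecast: level + h * trend + matching seasonal index
    return [l + h * b + s[len(y) - m + (h - 1) % m] for h in range(1, horizon + 1)]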

plot_hw_air_id <- function(air_id){

  pred_len <- test %>%
    separate(id, c("air", "store_id", "date"), sep = "_") %>%
    distinct(date) %>%
    nrow()

  max_date <- max(air_visits$visit_date)
  split_date <- max_date - pred_len
  all_visits <- tibble(visit_date = seq(min(air_visits$visit_date),
                                        max(air_visits$visit_date), 1))

  foo <- air_visits %>%
    filter(air_store_id == air_id)

  visits <- foo %>%
    right_join(all_visits, by = "visit_date") %>%
    mutate(visitors = log1p(visitors)) %>%
    replace_na(list(visitors = median(log1p(foo$visitors)))) %>%
    rownames_to_column()

  visits_train <- visits %>% filter(visit_date <= split_date)
  visits_valid <- visits %>% filter(visit_date > split_date)

  hw.fit <- HoltWinters(tsclean(ts(visits_train$visitors, frequency = 7)))

  hw_visits <- predict(hw.fit, n.ahead = pred_len,
                       prediction.interval = T, level = 0.95) %>%
    as_tibble() %>%
    bind_cols(visits_valid)

  visits_train %>%
    ggplot(aes(visit_date, visitors)) +
    geom_line() +
    geom_ribbon(data = hw_visits, 
                aes(x = visit_date, ymin = lwr, 
                    ymax = upr), fill = "light blue") +
    geom_line(data = hw_visits, aes(visit_date, visitors), color = "grey60") +
    geom_line(data = hw_visits, aes(visit_date, fit), color = "blue") +
    labs(x = "Time [weeks]", y = "log1p visitors vs predictions") +
    ggtitle("HoltWinters")
}

We then apply this function to our example time series:

plot_hw_air_id("air_ba937bf13d40fb24")

from statsmodels.tsa.holtwinters import ExponentialSmoothing

def plot_hw_air_id(air_id, ax):
    
    # Separate the 'id' column into three columns
    test[['air','store_id','date']] = test['id'].str.split('_', expand=True)
    # Get distinct dates and calculate the number of rows
    pred_len = test['date'].nunique()
    
    max_date = air_visits['visit_date'].max()
    split_date = max_date - pd.DateOffset(days=pred_len)
    
    all_visits = pd.DataFrame(
      {'visit_date': pd.date_range(start=air_visits['visit_date'].min(), 
      end=max_date)})
    
    foo = air_visits[air_visits['air_store_id'] == air_id].copy()

    #visits = pd.merge(foo, all_visits, on='visit_date', how='right')
    visits = foo.merge(all_visits, on='visit_date', how='right')
    
    visits['visitors'] = visits['visitors'].apply(lambda x: np.log1p(x))
    visits['visitors'] = visits['visitors'].fillna(visits['visitors'].median())
    
    visits_train = visits[visits['visit_date'] <= split_date]
    visits_valid = visits[visits['visit_date'] > split_date]
    
    hw_model = ExponentialSmoothing(visits_train['visitors'],
               trend='add', seasonal='add', seasonal_periods=7)
               
    hw_fit = hw_model.fit()
    
    
    hw_forecast = hw_fit.forecast(len(visits_valid['visit_date']))
    
    hw_visits = hw_fit.predict(len(visits_valid['visit_date']))
    
    # Plot on the Axes passed in via `ax` (so the grid layouts below work)
    plt.style.use('ggplot')
    
    ax.plot(visits_train['visit_date'], 
              visits_train['visitors'], label='Training Data', 
              color= 'black')
              
    ax.plot(visits_valid['visit_date'], 
             visits_valid['visitors'], 
             label='Validation Data', color='grey')
             
    ax.plot(visits_valid['visit_date'], hw_forecast, 
             label='Holt-Winters Forecast', color='blue')
    
#    plt.plot(visits_valid['visit_date'], hw_visits, 
#             label='Holt-Winters Forecast', color='red')
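    # NOTE: unlike the R version, which shades HoltWinters' 95% prediction
    # interval, the band below is only a rough +/-5% ribbon around the forecast.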
             
    ax.fill_between(visits_valid['visit_date'],
                    hw_forecast - hw_forecast * 0.05, #hw_fit.bse,
                    hw_forecast + hw_forecast * 0.05, #hw_fit.bse, 
                    color='lightblue')
    
    ax.tick_params(axis='x', rotation=10, labelsize= 7)
    ax.set_xlabel('Time [weeks]')
    ax.set_ylabel('log1p visitors vs predictions')
    ax.set_title('Holt-Winters Forecasting')
    plt.legend()
    return(ax)


fig = plt.figure(figsize=(12, 4))
gs = GridSpec(1, 1)

p = plot_hw_air_id("air_ba937bf13d40fb24", ax = fig.add_subplot(gs[0]))
plt.show()

We also plot predictions for the same example time series as for prophet above:

p1 <- plot_hw_air_id("air_f3f9824b7d70c3cf")
## Warning in HoltWinters(tsclean(ts(visits_train$visitors, frequency = 7))):
## optimization difficulties: ERROR: ABNORMAL_TERMINATION_IN_LNSRCH
p2 <- plot_hw_air_id("air_8e4360a64dbd4c50")
p3 <- plot_hw_air_id("air_1c0b150f9e696a5f")
## Warning in HoltWinters(tsclean(ts(visits_train$visitors, frequency = 7))):
## optimization difficulties: ERROR: ABNORMAL_TERMINATION_IN_LNSRCH
p4 <- plot_hw_air_id("air_820d1919cbecaa0a")

layout <- matrix(c(1,2,3,4),2,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, layout=layout)

We find:

  • While the algorithm performs reasonably well when given sufficient data, it breaks down, similar to auto.arima, for special cases of missing data. We might get better performance by discarding the missing values instead of replacing them with the median (see the sketch after the Python plotting code below). However, the first impression looks less promising than the prophet approach.

  • Feel free to fork this kernel and experiment with modified additive and exponential models to tweak the performance of the Holt-Winters method.

# Create a 2x2 grid
fig = plt.figure(figsize=(12, 4))
gs = GridSpec(2, 2)

#Add plots to the grid
p1= plot_hw_air_id(air_id = "air_820d1919cbecaa0a", 
                        ax = fig.add_subplot(gs[0,0]))

p2= plot_hw_air_id(air_id = "air_8e4360a64dbd4c50",
                         ax= fig.add_subplot(gs[0,1]))
                         
p3= plot_hw_air_id(air_id = "air_f3f9824b7d70c3cf",
                         ax= fig.add_subplot(gs[1,0]))

p4= plot_hw_air_id(air_id = "air_1c0b150f9e696a5f",
                         ax= fig.add_subplot(gs[1,1]))
                         
plt.show()
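As a follow-up to the first point above, a minimal sketch of the dropping-instead-of-imputing variant, assuming visits_train and visits_valid have been built as inside plot_hw_air_id (note that removing days makes the 7-day seasonality only approximate):

# Sketch: drop missing days instead of median-imputing them before fitting
from statsmodels.tsa.holtwinters import ExponentialSmoothing

visits_train_nona = visits_train.dropna(subset=['visitors'])

hw_fit = ExponentialSmoothing(visits_train_nona['visitors'],
                              trend='add', seasonal='add',
                              seasonal_periods=7).fit()
hw_forecast = hw_fit.forecast(len(visits_valid))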

8.4 timetk

The timetk package (R, Python) is a recently released toolkit for time series analysis. It integrates well into the tidyverse ecosystem and is specifically designed to work with tidy “tibble” data frames. Here we briefly describe the timetk approach and how we can apply it to our data.

First, we will create the train and validation data frames in the same way as for the other prediction tools:

air_id = "air_ba937bf13d40fb24"

pred_len <- test %>%
  separate(id, c("air", "store_id", "date"), sep = "_") %>%
  distinct(date) %>%
  nrow()

max_date <- max(air_visits$visit_date)
split_date <- max_date - pred_len
all_visits <- tibble(visit_date = seq(min(air_visits$visit_date),
                                      max(air_visits$visit_date), 1))

foo <- air_visits %>%
  filter(air_store_id == air_id)

visits <- foo %>%
  right_join(all_visits, by = "visit_date") %>%
  mutate(visitors = log1p(visitors)) %>%
  rownames_to_column() %>%
  select(y = visitors,
         ds = visit_date)

visits_train <- visits %>% filter(ds <= split_date)
visits_valid <- visits %>% filter(ds > split_date)
air_id = "air_ba937bf13d40fb24"

# Separate id column into "air", "store_id", "date"
test[['air', 'store_id', 'date']] = test['id'].str.split('_', expand=True)

# Get distinct dates
distinct_dates = test['date'].drop_duplicates()

# Calculate prediction length
pred_len = len(distinct_dates)

# Get max date from air_visits
max_date = air_visits['visit_date'].max()

# Calculate split date
split_date = max_date - pd.DateOffset(days=pred_len)

# Generate a sequence of dates
all_visits = \
pd.DataFrame({'visit_date': pd.date_range(start=air_visits['visit_date'].min(),
                                                       end=max_date)})

# Filter air_visits for the specific air_id
foo = air_visits[air_visits['air_store_id'] == air_id].copy()

# Right join with all_visits
visits = all_visits.merge(foo, on='visit_date', how='right')

# Apply log transformation to visitors
visits['visitors'] = np.log1p(visits['visitors'])

# Set 'visit_date' as index
visits.set_index('visit_date', inplace=True)

# Reset rownames to a new column 'ds'
visits = visits.rename_axis('ds').reset_index()

# Filter visits for training and validation sets
visits_train = visits[visits['ds'] <= split_date]
visits_valid = visits[visits['ds'] > split_date]

Then, we use the tk_augment_timeseries_signature tool to augment our data frames with time series characteristics. This means that we will add comprehensive time series properties that have been extracted from the date. Those new features include for instance the month, day and week of the year, half and quarter of the year; down to minutes and seconds for date-time data. Here we show a glimpse of the augmented training data:

visits_train_aug <- visits_train %>%
  tk_augment_timeseries_signature()
## tk_augment_timeseries_signature(): Using the following .date_var variable: ds
visits_valid_aug <- visits_valid %>%
  .$ds %>%
  tk_get_timeseries_signature()

visits_train_aug %>%
  head(100) %>%
   knitr::kable() %>%   kableExtra::kable_styling() %>%
  kableExtra::scroll_box(width = "600px", height = "600px")
y ds index.num diff year year.iso half quarter month month.xts month.lbl day hour minute second hour12 am.pm wday wday.xts wday.lbl mday qday yday mweek week week.iso week2 week3 week4 mday7
NA 2016-01-01 1451606400 NA 2016 2015 1 1 1 0 January 1 0 0 0 0 1 6 5 Friday 1 1 1 1 1 53 1 1 1 1
NA 2016-01-02 1451692800 86400 2016 2015 1 1 1 0 January 2 0 0 0 0 1 7 6 Saturday 2 2 2 1 1 53 1 1 1 1
NA 2016-01-03 1451779200 86400 2016 2015 1 1 1 0 January 3 0 0 0 0 1 1 0 Sunday 3 3 3 2 1 53 1 1 1 1
NA 2016-01-04 1451865600 86400 2016 2016 1 1 1 0 January 4 0 0 0 0 1 2 1 Monday 4 4 4 2 1 1 1 1 1 1
NA 2016-01-05 1451952000 86400 2016 2016 1 1 1 0 January 5 0 0 0 0 1 3 2 Tuesday 5 5 5 2 1 1 1 1 1 1
NA 2016-01-06 1452038400 86400 2016 2016 1 1 1 0 January 6 0 0 0 0 1 4 3 Wednesday 6 6 6 2 1 1 1 1 1 1
NA 2016-01-07 1452124800 86400 2016 2016 1 1 1 0 January 7 0 0 0 0 1 5 4 Thursday 7 7 7 2 1 1 1 1 1 2
NA 2016-01-08 1452211200 86400 2016 2016 1 1 1 0 January 8 0 0 0 0 1 6 5 Friday 8 8 8 2 2 1 0 2 2 2
NA 2016-01-09 1452297600 86400 2016 2016 1 1 1 0 January 9 0 0 0 0 1 7 6 Saturday 9 9 9 2 2 1 0 2 2 2
NA 2016-01-10 1452384000 86400 2016 2016 1 1 1 0 January 10 0 0 0 0 1 1 0 Sunday 10 10 10 3 2 1 0 2 2 2
NA 2016-01-11 1452470400 86400 2016 2016 1 1 1 0 January 11 0 0 0 0 1 2 1 Monday 11 11 11 3 2 2 0 2 2 2
NA 2016-01-12 1452556800 86400 2016 2016 1 1 1 0 January 12 0 0 0 0 1 3 2 Tuesday 12 12 12 3 2 2 0 2 2 2
3.2580965 2016-01-13 1452643200 86400 2016 2016 1 1 1 0 January 13 0 0 0 0 1 4 3 Wednesday 13 13 13 3 2 2 0 2 2 2
3.4965076 2016-01-14 1452729600 86400 2016 2016 1 1 1 0 January 14 0 0 0 0 1 5 4 Thursday 14 14 14 3 2 2 0 2 2 3
3.4011974 2016-01-15 1452816000 86400 2016 2016 1 1 1 0 January 15 0 0 0 0 1 6 5 Friday 15 15 15 3 3 2 1 0 3 3
3.1354942 2016-01-16 1452902400 86400 2016 2016 1 1 1 0 January 16 0 0 0 0 1 7 6 Saturday 16 16 16 3 3 2 1 0 3 3
NA 2016-01-17 1452988800 86400 2016 2016 1 1 1 0 January 17 0 0 0 0 1 1 0 Sunday 17 17 17 4 3 2 1 0 3 3
1.9459101 2016-01-18 1453075200 86400 2016 2016 1 1 1 0 January 18 0 0 0 0 1 2 1 Monday 18 18 18 4 3 3 1 0 3 3
2.3025851 2016-01-19 1453161600 86400 2016 2016 1 1 1 0 January 19 0 0 0 0 1 3 2 Tuesday 19 19 19 4 3 3 1 0 3 3
3.4657359 2016-01-20 1453248000 86400 2016 2016 1 1 1 0 January 20 0 0 0 0 1 4 3 Wednesday 20 20 20 4 3 3 1 0 3 3
3.0910425 2016-01-21 1453334400 86400 2016 2016 1 1 1 0 January 21 0 0 0 0 1 5 4 Thursday 21 21 21 4 3 3 1 0 3 4
2.9444390 2016-01-22 1453420800 86400 2016 2016 1 1 1 0 January 22 0 0 0 0 1 6 5 Friday 22 22 22 4 4 3 0 1 0 4
3.2958369 2016-01-23 1453507200 86400 2016 2016 1 1 1 0 January 23 0 0 0 0 1 7 6 Saturday 23 23 23 4 4 3 0 1 0 4
NA 2016-01-24 1453593600 86400 2016 2016 1 1 1 0 January 24 0 0 0 0 1 1 0 Sunday 24 24 24 5 4 3 0 1 0 4
3.0910425 2016-01-25 1453680000 86400 2016 2016 1 1 1 0 January 25 0 0 0 0 1 2 1 Monday 25 25 25 5 4 4 0 1 0 4
2.4849066 2016-01-26 1453766400 86400 2016 2016 1 1 1 0 January 26 0 0 0 0 1 3 2 Tuesday 26 26 26 5 4 4 0 1 0 4
3.2188758 2016-01-27 1453852800 86400 2016 2016 1 1 1 0 January 27 0 0 0 0 1 4 3 Wednesday 27 27 27 5 4 4 0 1 0 4
3.0910425 2016-01-28 1453939200 86400 2016 2016 1 1 1 0 January 28 0 0 0 0 1 5 4 Thursday 28 28 28 5 4 4 0 1 0 5
3.2958369 2016-01-29 1454025600 86400 2016 2016 1 1 1 0 January 29 0 0 0 0 1 6 5 Friday 29 29 29 5 5 4 1 2 1 5
1.9459101 2016-01-30 1454112000 86400 2016 2016 1 1 1 0 January 30 0 0 0 0 1 7 6 Saturday 30 30 30 5 5 4 1 2 1 5
NA 2016-01-31 1454198400 86400 2016 2016 1 1 1 0 January 31 0 0 0 0 1 1 0 Sunday 31 31 31 6 5 4 1 2 1 5
NA 2016-02-01 1454284800 86400 2016 2016 1 1 2 1 February 1 0 0 0 0 1 2 1 Monday 1 32 32 1 5 5 1 2 1 1
NA 2016-02-02 1454371200 86400 2016 2016 1 1 2 1 February 2 0 0 0 0 1 3 2 Tuesday 2 33 33 1 5 5 1 2 1 1
2.9444390 2016-02-03 1454457600 86400 2016 2016 1 1 2 1 February 3 0 0 0 0 1 4 3 Wednesday 3 34 34 1 5 5 1 2 1 1
2.5649494 2016-02-04 1454544000 86400 2016 2016 1 1 2 1 February 4 0 0 0 0 1 5 4 Thursday 4 35 35 1 5 5 1 2 1 1
3.8286414 2016-02-05 1454630400 86400 2016 2016 1 1 2 1 February 5 0 0 0 0 1 6 5 Friday 5 36 36 1 6 5 0 0 2 1
2.7725887 2016-02-06 1454716800 86400 2016 2016 1 1 2 1 February 6 0 0 0 0 1 7 6 Saturday 6 37 37 1 6 5 0 0 2 1
NA 2016-02-07 1454803200 86400 2016 2016 1 1 2 1 February 7 0 0 0 0 1 1 0 Sunday 7 38 38 2 6 5 0 0 2 2
2.9957323 2016-02-08 1454889600 86400 2016 2016 1 1 2 1 February 8 0 0 0 0 1 2 1 Monday 8 39 39 2 6 6 0 0 2 2
2.7725887 2016-02-09 1454976000 86400 2016 2016 1 1 2 1 February 9 0 0 0 0 1 3 2 Tuesday 9 40 40 2 6 6 0 0 2 2
3.4965076 2016-02-10 1455062400 86400 2016 2016 1 1 2 1 February 10 0 0 0 0 1 4 3 Wednesday 10 41 41 2 6 6 0 0 2 2
1.3862944 2016-02-11 1455148800 86400 2016 2016 1 1 2 1 February 11 0 0 0 0 1 5 4 Thursday 11 42 42 2 6 6 0 0 2 2
3.2958369 2016-02-12 1455235200 86400 2016 2016 1 1 2 1 February 12 0 0 0 0 1 6 5 Friday 12 43 43 2 7 6 1 1 3 2
2.1972246 2016-02-13 1455321600 86400 2016 2016 1 1 2 1 February 13 0 0 0 0 1 7 6 Saturday 13 44 44 2 7 6 1 1 3 2
NA 2016-02-14 1455408000 86400 2016 2016 1 1 2 1 February 14 0 0 0 0 1 1 0 Sunday 14 45 45 3 7 6 1 1 3 3
2.7080502 2016-02-15 1455494400 86400 2016 2016 1 1 2 1 February 15 0 0 0 0 1 2 1 Monday 15 46 46 3 7 7 1 1 3 3
2.7725887 2016-02-16 1455580800 86400 2016 2016 1 1 2 1 February 16 0 0 0 0 1 3 2 Tuesday 16 47 47 3 7 7 1 1 3 3
2.8903718 2016-02-17 1455667200 86400 2016 2016 1 1 2 1 February 17 0 0 0 0 1 4 3 Wednesday 17 48 48 3 7 7 1 1 3 3
3.1354942 2016-02-18 1455753600 86400 2016 2016 1 1 2 1 February 18 0 0 0 0 1 5 4 Thursday 18 49 49 3 7 7 1 1 3 3
3.7841896 2016-02-19 1455840000 86400 2016 2016 1 1 2 1 February 19 0 0 0 0 1 6 5 Friday 19 50 50 3 8 7 0 2 0 3
3.0445224 2016-02-20 1455926400 86400 2016 2016 1 1 2 1 February 20 0 0 0 0 1 7 6 Saturday 20 51 51 3 8 7 0 2 0 3
NA 2016-02-21 1456012800 86400 2016 2016 1 1 2 1 February 21 0 0 0 0 1 1 0 Sunday 21 52 52 4 8 7 0 2 0 4
2.0794415 2016-02-22 1456099200 86400 2016 2016 1 1 2 1 February 22 0 0 0 0 1 2 1 Monday 22 53 53 4 8 8 0 2 0 4
2.8332133 2016-02-23 1456185600 86400 2016 2016 1 1 2 1 February 23 0 0 0 0 1 3 2 Tuesday 23 54 54 4 8 8 0 2 0 4
3.0910425 2016-02-24 1456272000 86400 2016 2016 1 1 2 1 February 24 0 0 0 0 1 4 3 Wednesday 24 55 55 4 8 8 0 2 0 4
3.0910425 2016-02-25 1456358400 86400 2016 2016 1 1 2 1 February 25 0 0 0 0 1 5 4 Thursday 25 56 56 4 8 8 0 2 0 4
3.4965076 2016-02-26 1456444800 86400 2016 2016 1 1 2 1 February 26 0 0 0 0 1 6 5 Friday 26 57 57 4 9 8 1 0 1 4
3.1780538 2016-02-27 1456531200 86400 2016 2016 1 1 2 1 February 27 0 0 0 0 1 7 6 Saturday 27 58 58 4 9 8 1 0 1 4
NA 2016-02-28 1456617600 86400 2016 2016 1 1 2 1 February 28 0 0 0 0 1 1 0 Sunday 28 59 59 5 9 8 1 0 1 5
3.1354942 2016-02-29 1456704000 86400 2016 2016 1 1 2 1 February 29 0 0 0 0 1 2 1 Monday 29 60 60 5 9 9 1 0 1 5
2.9957323 2016-03-01 1456790400 86400 2016 2016 1 1 3 2 March 1 0 0 0 0 1 3 2 Tuesday 1 61 61 1 9 9 1 0 1 1
3.0910425 2016-03-02 1456876800 86400 2016 2016 1 1 3 2 March 2 0 0 0 0 1 4 3 Wednesday 2 62 62 1 9 9 1 0 1 1
3.0445224 2016-03-03 1456963200 86400 2016 2016 1 1 3 2 March 3 0 0 0 0 1 5 4 Thursday 3 63 63 1 9 9 1 0 1 1
3.6375862 2016-03-04 1457049600 86400 2016 2016 1 1 3 2 March 4 0 0 0 0 1 6 5 Friday 4 64 64 1 10 9 0 1 2 1
2.6390573 2016-03-05 1457136000 86400 2016 2016 1 1 3 2 March 5 0 0 0 0 1 7 6 Saturday 5 65 65 1 10 9 0 1 2 1
NA 2016-03-06 1457222400 86400 2016 2016 1 1 3 2 March 6 0 0 0 0 1 1 0 Sunday 6 66 66 2 10 9 0 1 2 1
2.4849066 2016-03-07 1457308800 86400 2016 2016 1 1 3 2 March 7 0 0 0 0 1 2 1 Monday 7 67 67 2 10 10 0 1 2 2
2.1972246 2016-03-08 1457395200 86400 2016 2016 1 1 3 2 March 8 0 0 0 0 1 3 2 Tuesday 8 68 68 2 10 10 0 1 2 2
3.1780538 2016-03-09 1457481600 86400 2016 2016 1 1 3 2 March 9 0 0 0 0 1 4 3 Wednesday 9 69 69 2 10 10 0 1 2 2
3.4657359 2016-03-10 1457568000 86400 2016 2016 1 1 3 2 March 10 0 0 0 0 1 5 4 Thursday 10 70 70 2 10 10 0 1 2 2
3.6375862 2016-03-11 1457654400 86400 2016 2016 1 1 3 2 March 11 0 0 0 0 1 6 5 Friday 11 71 71 2 11 10 1 2 3 2
3.2580965 2016-03-12 1457740800 86400 2016 2016 1 1 3 2 March 12 0 0 0 0 1 7 6 Saturday 12 72 72 2 11 10 1 2 3 2
NA 2016-03-13 1457827200 86400 2016 2016 1 1 3 2 March 13 0 0 0 0 1 1 0 Sunday 13 73 73 3 11 10 1 2 3 2
2.3978953 2016-03-14 1457913600 86400 2016 2016 1 1 3 2 March 14 0 0 0 0 1 2 1 Monday 14 74 74 3 11 11 1 2 3 3
3.0445224 2016-03-15 1458000000 86400 2016 2016 1 1 3 2 March 15 0 0 0 0 1 3 2 Tuesday 15 75 75 3 11 11 1 2 3 3
3.2580965 2016-03-16 1458086400 86400 2016 2016 1 1 3 2 March 16 0 0 0 0 1 4 3 Wednesday 16 76 76 3 11 11 1 2 3 3
3.3322045 2016-03-17 1458172800 86400 2016 2016 1 1 3 2 March 17 0 0 0 0 1 5 4 Thursday 17 77 77 3 11 11 1 2 3 3
3.6888795 2016-03-18 1458259200 86400 2016 2016 1 1 3 2 March 18 0 0 0 0 1 6 5 Friday 18 78 78 3 12 11 0 0 0 3
3.4965076 2016-03-19 1458345600 86400 2016 2016 1 1 3 2 March 19 0 0 0 0 1 7 6 Saturday 19 79 79 3 12 11 0 0 0 3
NA 2016-03-20 1458432000 86400 2016 2016 1 1 3 2 March 20 0 0 0 0 1 1 0 Sunday 20 80 80 4 12 11 0 0 0 3
NA 2016-03-21 1458518400 86400 2016 2016 1 1 3 2 March 21 0 0 0 0 1 2 1 Monday 21 81 81 4 12 12 0 0 0 4
2.8903718 2016-03-22 1458604800 86400 2016 2016 1 1 3 2 March 22 0 0 0 0 1 3 2 Tuesday 22 82 82 4 12 12 0 0 0 4
3.4011974 2016-03-23 1458691200 86400 2016 2016 1 1 3 2 March 23 0 0 0 0 1 4 3 Wednesday 23 83 83 4 12 12 0 0 0 4
3.2580965 2016-03-24 1458777600 86400 2016 2016 1 1 3 2 March 24 0 0 0 0 1 5 4 Thursday 24 84 84 4 12 12 0 0 0 4
3.4339872 2016-03-25 1458864000 86400 2016 2016 1 1 3 2 March 25 0 0 0 0 1 6 5 Friday 25 85 85 4 13 12 1 1 1 4
3.7376696 2016-03-26 1458950400 86400 2016 2016 1 1 3 2 March 26 0 0 0 0 1 7 6 Saturday 26 86 86 4 13 12 1 1 1 4
0.6931472 2016-03-27 1459036800 86400 2016 2016 1 1 3 2 March 27 0 0 0 0 1 1 0 Sunday 27 87 87 5 13 12 1 1 1 4
2.0794415 2016-03-28 1459123200 86400 2016 2016 1 1 3 2 March 28 0 0 0 0 1 2 1 Monday 28 88 88 5 13 13 1 1 1 5
2.9957323 2016-03-29 1459209600 86400 2016 2016 1 1 3 2 March 29 0 0 0 0 1 3 2 Tuesday 29 89 89 5 13 13 1 1 1 5
3.2580965 2016-03-30 1459296000 86400 2016 2016 1 1 3 2 March 30 0 0 0 0 1 4 3 Wednesday 30 90 90 5 13 13 1 1 1 5
3.8066625 2016-03-31 1459382400 86400 2016 2016 1 1 3 2 March 31 0 0 0 0 1 5 4 Thursday 31 91 91 5 13 13 1 1 1 5
3.6109179 2016-04-01 1459468800 86400 2016 2016 1 2 4 3 April 1 0 0 0 0 1 6 5 Friday 1 1 92 1 14 13 0 2 2 1
2.8332133 2016-04-02 1459555200 86400 2016 2016 1 2 4 3 April 2 0 0 0 0 1 7 6 Saturday 2 2 93 1 14 13 0 2 2 1
NA 2016-04-03 1459641600 86400 2016 2016 1 2 4 3 April 3 0 0 0 0 1 1 0 Sunday 3 3 94 2 14 13 0 2 2 1
3.0445224 2016-04-04 1459728000 86400 2016 2016 1 2 4 3 April 4 0 0 0 0 1 2 1 Monday 4 4 95 2 14 14 0 2 2 1
3.1354942 2016-04-05 1459814400 86400 2016 2016 1 2 4 3 April 5 0 0 0 0 1 3 2 Tuesday 5 5 96 2 14 14 0 2 2 1
2.6390573 2016-04-06 1459900800 86400 2016 2016 1 2 4 3 April 6 0 0 0 0 1 4 3 Wednesday 6 6 97 2 14 14 0 2 2 1
2.5649494 2016-04-07 1459987200 86400 2016 2016 1 2 4 3 April 7 0 0 0 0 1 5 4 Thursday 7 7 98 2 14 14 0 2 2 2
3.9120230 2016-04-08 1460073600 86400 2016 2016 1 2 4 3 April 8 0 0 0 0 1 6 5 Friday 8 8 99 2 15 14 1 0 3 2
2.9957323 2016-04-09 1460160000 86400 2016 2016 1 2 4 3 April 9 0 0 0 0 1 7 6 Saturday 9 9 100 2 15 14 1 0 3 2
from pytimetk import augment_timeseries_signature, get_timeseries_signature

# Augment visits_train
visits_train_aug = augment_timeseries_signature(data= visits_train,
                                                date_column = 'ds')

# Get timeseries signature for visits_valid
visits_valid_ds = visits_valid['ds']
visits_valid_aug = get_timeseries_signature(visits_valid_ds)

# Display information about visits_train_aug
print(visits_train_aug.info())
## <class 'pandas.core.frame.DataFrame'>
## Index: 355 entries, 0 to 354
## Data columns (total 35 columns):
##  #   Column           Non-Null Count  Dtype         
## ---  ------           --------------  -----         
##  0   ds               355 non-null    datetime64[ns]
##  1   air_store_id     355 non-null    object        
##  2   visitors         355 non-null    float64       
##  3   wday             355 non-null    category      
##  4   month            355 non-null    object        
##  5   log1p_visitors   355 non-null    float64       
##  6   ds_index_num     355 non-null    int32         
##  7   ds_year          355 non-null    int16         
##  8   ds_year_iso      355 non-null    UInt32        
##  9   ds_yearstart     355 non-null    uint8         
##  10  ds_yearend       355 non-null    uint8         
##  11  ds_leapyear      355 non-null    uint8         
##  12  ds_half          355 non-null    int8          
##  13  ds_quarter       355 non-null    int8          
##  14  ds_quarteryear   355 non-null    object        
##  15  ds_quarterstart  355 non-null    uint8         
##  16  ds_quarterend    355 non-null    uint8         
##  17  ds_month         355 non-null    int8          
##  18  ds_month_lbl     355 non-null    object        
##  19  ds_monthstart    355 non-null    uint8         
##  20  ds_monthend      355 non-null    uint8         
##  21  ds_yweek         355 non-null    UInt32        
##  22  ds_mweek         355 non-null    int8          
##  23  ds_wday          355 non-null    int8          
##  24  ds_wday_lbl      355 non-null    object        
##  25  ds_mday          355 non-null    int8          
##  26  ds_qday          355 non-null    int8          
##  27  ds_yday          355 non-null    int16         
##  28  ds_weekend       355 non-null    int8          
##  29  ds_hour          355 non-null    int8          
##  30  ds_minute        355 non-null    int8          
##  31  ds_second        355 non-null    int8          
##  32  ds_msecond       355 non-null    int8          
##  33  ds_nsecond       355 non-null    int8          
##  34  ds_am_pm         355 non-null    object        
## dtypes: UInt32(2), category(1), datetime64[ns](1), float64(2), int16(2), int32(1), int8(13), object(6), uint8(7)
## memory usage: 41.6+ KB
## None
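The signature columns above are essentially calendar features derived from the date. For intuition, a handful of them can be reproduced by hand with plain pandas .dt accessors (a sketch only; the tk_* functions generate many more columns automatically):

# Sketch: a few of the calendar features, derived directly from the date
sig = pd.DataFrame({'ds': visits_train['ds']})
sig['year'] = sig['ds'].dt.year
sig['quarter'] = sig['ds'].dt.quarter
sig['month'] = sig['ds'].dt.month
sig['yweek'] = sig['ds'].dt.isocalendar().week
sig['wday'] = sig['ds'].dt.dayofweek          # 0 = Monday
sig['weekend'] = (sig['ds'].dt.dayofweek >= 5).astype(int)
print(sig.head())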

Now, the idea behind timetk is to use these new features to make predictions with a regression or classification approach, using standard tools such as linear/logistic regression or (boosted/ensembled) trees. For this example, we will use a simple linear model. This approach can easily be extended to more sophisticated methods.

fit_lm <- lm(y ~ ., data = select(visits_train_aug, 
                                  -c(ds, diff, wday.xts, wday.lbl, year.iso)))

summary(fit_lm)
## 
## Call:
## lm(formula = y ~ ., data = select(visits_train_aug, -c(ds, diff, 
##     wday.xts, wday.lbl, year.iso)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0924 -0.2096  0.0643  0.2882  1.0873 
## 
## Coefficients: (12 not defined because of singularities)
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.659e+04  1.403e+05   0.546  0.58557    
## index.num     1.220e-06  2.254e-06   0.541  0.58889    
## year         -3.887e+01  7.123e+01  -0.546  0.58567    
## half          6.010e-01  5.602e-01   1.073  0.28417    
## quarter      -3.646e-01  2.840e-01  -1.284  0.20005    
## month        -7.352e+00  5.894e+00  -1.247  0.21317    
## month.xts            NA         NA      NA       NA    
## month.lbl.L          NA         NA      NA       NA    
## month.lbl.Q  -6.574e-01  2.001e-01  -3.285  0.00113 ** 
## month.lbl.C   4.938e-01  3.758e-01   1.314  0.18979    
## month.lbl^4   2.385e-01  1.058e-01   2.254  0.02486 *  
## month.lbl^5  -3.988e-01  3.442e-01  -1.159  0.24747    
## month.lbl^6   1.135e-01  1.218e-01   0.932  0.35218    
## month.lbl^7   2.203e-01  1.502e-01   1.466  0.14360    
## month.lbl^8   2.317e-01  1.776e-01   1.305  0.19289    
## month.lbl^9          NA         NA      NA       NA    
## month.lbl^10  1.974e-01  1.918e-01   1.029  0.30416    
## month.lbl^11         NA         NA      NA       NA    
## day          -2.619e-01  2.038e-01  -1.285  0.19980    
## hour                 NA         NA      NA       NA    
## minute               NA         NA      NA       NA    
## second               NA         NA      NA       NA    
## hour12               NA         NA      NA       NA    
## am.pm                NA         NA      NA       NA    
## wday          3.051e-01  3.689e-02   8.270 3.26e-15 ***
## mday                 NA         NA      NA       NA    
## qday                 NA         NA      NA       NA    
## yday                 NA         NA      NA       NA    
## mweek         1.741e-01  1.975e-01   0.882  0.37858    
## week         -4.851e-03  8.652e-02  -0.056  0.95532    
## week.iso      9.643e-01  1.730e-01   5.575 5.13e-08 ***
## week2        -6.493e-03  5.556e-02  -0.117  0.90702    
## week3        -6.410e-05  3.117e-02  -0.002  0.99836    
## week4         3.292e-03  2.465e-02   0.134  0.89384    
## mday7        -2.572e-02  8.649e-02  -0.297  0.76637    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4566 on 332 degrees of freedom
##   (84 observations deleted due to missingness)
## Multiple R-squared:  0.4351, Adjusted R-squared:  0.3977 
## F-statistic: 11.62 on 22 and 332 DF,  p-value: < 2.2e-16
import statsmodels.api as sm
from sklearn import linear_model

X = visits_train_aug.select_dtypes(['number'])
X = X.drop(columns=['visitors', 'ds_index_num', 'log1p_visitors']) 
X = X.astype('int')
y = visits_train_aug['visitors']

# Add a constant (intercept) term to the independent variables
X = sm.add_constant(X, prepend=True)

# Fit the linear model
model = sm.OLS(y, X)
fit_lm = model.fit()

# Print the summary of the linear regression model
#fit_lm.params
print(fit_lm.summary())
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:               visitors   R-squared:                       0.422
## Model:                            OLS   Adj. R-squared:                  0.399
## Method:                 Least Squares   F-statistic:                     17.76
## Date:                Fri, 17 Nov 2023   Prob (F-statistic):           6.71e-33
## Time:                        12:32:06   Log-Likelihood:                -217.45
## No. Observations:                 355   AIC:                             464.9
## Df Residuals:                     340   BIC:                             523.0
## Df Model:                          14                                         
## Covariance Type:            nonrobust                                         
## ===================================================================================
##                       coef    std err          t      P>|t|      [0.025      0.975]
## -----------------------------------------------------------------------------------
## const            2.683e-05   4.02e-05      0.668      0.504   -5.22e-05       0.000
## ds_year            -0.0093      0.002     -3.828      0.000      -0.014      -0.005
## ds_year_iso        -0.0093      0.002     -3.828      0.000      -0.014      -0.005
## ds_yearstart     2.887e-12   7.48e-13      3.858      0.000    1.42e-12    4.36e-12
## ds_yearend      -2.923e-13   7.56e-14     -3.865      0.000   -4.41e-13   -1.44e-13
## ds_leapyear         0.0635      0.080      0.794      0.427      -0.094       0.221
## ds_half            -0.3212      0.135     -2.387      0.018      -0.586      -0.057
## ds_quarter         39.0370     10.159      3.843      0.000      19.055      59.019
## ds_quarterstart     0.5356      0.316      1.694      0.091      -0.086       1.157
## ds_quarterend       0.2227      0.314      0.710      0.478      -0.395       0.840
## ds_month            1.3683      1.376      0.994      0.321      -1.338       4.075
## ds_monthstart      -0.0287      0.174     -0.165      0.869      -0.372       0.314
## ds_monthend         0.0133      0.171      0.078      0.938      -0.323       0.349
## ds_yweek           -0.1011      0.020     -4.943      0.000      -0.141      -0.061
## ds_mweek            0.0527      0.087      0.607      0.544      -0.118       0.224
## ds_wday             0.1247      0.015      8.465      0.000       0.096       0.154
## ds_mday             0.0386      0.047      0.824      0.410      -0.054       0.131
## ds_qday             0.4273      0.111      3.847      0.000       0.209       0.646
## ds_yday            -0.4566      0.108     -4.212      0.000      -0.670      -0.243
## ds_weekend         -1.9430      0.165    -11.796      0.000      -2.267      -1.619
## ds_hour                  0          0        nan        nan           0           0
## ds_minute                0          0        nan        nan           0           0
## ds_second                0          0        nan        nan           0           0
## ds_msecond               0          0        nan        nan           0           0
## ds_nsecond               0          0        nan        nan           0           0
## ==============================================================================
## Omnibus:                       69.811   Durbin-Watson:                   2.215
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):              132.706
## Skew:                          -1.065   Prob(JB):                     1.52e-29
## Kurtosis:                       5.106   Cond. No.                          inf
## ==============================================================================
## 
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The smallest eigenvalue is      0. This might indicate that there are
## strong multicollinearity problems or that the design matrix is singular.
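The singularity warning is expected here: for daily data the sub-daily columns (ds_hour, ds_minute, ds_second, …) are all zero, and ds_year/ds_year_iso are nearly identical (note their identical coefficients above). A small sketch, not part of the original, that drops the zero-variance columns before fitting:

# Sketch: keep the intercept but drop all-zero (zero-variance) columns,
# which removes the exactly singular part of the design matrix.
keep = (X.nunique() > 1) | (X.columns == 'const')
X_reduced = X.loc[:, keep]
fit_lm_reduced = sm.OLS(y, X_reduced).fit()
print(fit_lm_reduced.rsquared)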

Then we use the standard predict tool to apply the model to our prediction range. Make sure to include the same features as in the training data frame.

pred <- predict(fit_lm, newdata = select(visits_valid_aug, 
                                         -c(index, diff, wday.xts,
                                            wday.lbl, year.iso)))

pred_tk <- tibble(
    date  = visits_valid$ds,
    value = pred
    )
V = visits_valid_aug.select_dtypes(['number'])
V = V.drop(columns=['ds_index_num']) 
V = V.astype('int')

## add intercept
#V = sm.add_constant(V, prepend=True)
V.insert(0,'const', 1.0)
V.info()
## <class 'pandas.core.frame.DataFrame'>
## Index: 36 entries, 355 to 390
## Data columns (total 25 columns):
##  #   Column           Non-Null Count  Dtype  
## ---  ------           --------------  -----  
##  0   const            36 non-null     float64
##  1   ds_year          36 non-null     int64  
##  2   ds_year_iso      36 non-null     int64  
##  3   ds_yearstart     36 non-null     int64  
##  4   ds_yearend       36 non-null     int64  
##  5   ds_leapyear      36 non-null     int64  
##  6   ds_half          36 non-null     int64  
##  7   ds_quarter       36 non-null     int64  
##  8   ds_quarterstart  36 non-null     int64  
##  9   ds_quarterend    36 non-null     int64  
##  10  ds_month         36 non-null     int64  
##  11  ds_monthstart    36 non-null     int64  
##  12  ds_monthend      36 non-null     int64  
##  13  ds_yweek         36 non-null     int64  
##  14  ds_mweek         36 non-null     int64  
##  15  ds_wday          36 non-null     int64  
##  16  ds_mday          36 non-null     int64  
##  17  ds_qday          36 non-null     int64  
##  18  ds_yday          36 non-null     int64  
##  19  ds_weekend       36 non-null     int64  
##  20  ds_hour          36 non-null     int64  
##  21  ds_minute        36 non-null     int64  
##  22  ds_second        36 non-null     int64  
##  23  ds_msecond       36 non-null     int64  
##  24  ds_nsecond       36 non-null     int64  
## dtypes: float64(1), int64(24)
## memory usage: 7.3 KB
pred = fit_lm.predict(V)

pred_tk = pd.DataFrame({
    'date': visits_valid['ds'],
    'value': pred
})
pred_tk.head()
##           date     value
## 355 2017-03-15  2.934311
## 356 2017-03-16  3.068310
## 357 2017-03-17  3.202309
## 358 2017-03-18  3.336308
## 359 2017-03-19  1.527270

X.info()
## <class 'pandas.core.frame.DataFrame'>
## Index: 355 entries, 0 to 354
## Data columns (total 25 columns):
##  #   Column           Non-Null Count  Dtype  
## ---  ------           --------------  -----  
##  0   const            355 non-null    float64
##  1   ds_year          355 non-null    int64  
##  2   ds_year_iso      355 non-null    int64  
##  3   ds_yearstart     355 non-null    int64  
##  4   ds_yearend       355 non-null    int64  
##  5   ds_leapyear      355 non-null    int64  
##  6   ds_half          355 non-null    int64  
##  7   ds_quarter       355 non-null    int64  
##  8   ds_quarterstart  355 non-null    int64  
##  9   ds_quarterend    355 non-null    int64  
##  10  ds_month         355 non-null    int64  
##  11  ds_monthstart    355 non-null    int64  
##  12  ds_monthend      355 non-null    int64  
##  13  ds_yweek         355 non-null    int64  
##  14  ds_mweek         355 non-null    int64  
##  15  ds_wday          355 non-null    int64  
##  16  ds_mday          355 non-null    int64  
##  17  ds_qday          355 non-null    int64  
##  18  ds_yday          355 non-null    int64  
##  19  ds_weekend       355 non-null    int64  
##  20  ds_hour          355 non-null    int64  
##  21  ds_minute        355 non-null    int64  
##  22  ds_second        355 non-null    int64  
##  23  ds_msecond       355 non-null    int64  
##  24  ds_nsecond       355 non-null    int64  
## dtypes: float64(1), int64(24)
## memory usage: 72.1 KB
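Regarding the note above about using the same features: statsmodels applies the fitted coefficients to the supplied matrix positionally, so it is worth checking that V carries exactly the same columns, in the same order, as X. A small defensive check, not part of the original:

# Defensive check: the validation matrix must match the training matrix
# column for column, otherwise coefficients are applied to the wrong features.
assert list(V.columns) == list(X.columns), "feature mismatch between X and V"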

As before, we combine the prediction with a plotting method in a helper function:

plot_tk_lm_air_id <- function(air_id){
  
  pred_len <- test %>%
    separate(id, c("air", "store_id", "date"), sep = "_") %>%
    distinct(date) %>%
    nrow()

  max_date <- max(air_visits$visit_date)
  split_date <- max_date - pred_len
  all_visits <- tibble(visit_date = seq(min(air_visits$visit_date),
                                        max(air_visits$visit_date), 1))

  foo <- air_visits %>%
    filter(air_store_id == air_id)

  visits <- foo %>%
    right_join(all_visits, by = "visit_date") %>%
    mutate(visitors = log1p(visitors)) %>%
    rownames_to_column() %>%
    select(y = visitors,
          ds = visit_date)

  visits_train <- visits %>% filter(ds <= split_date)
  visits_valid <- visits %>% filter(ds > split_date)
  
  # augment train with ts info
  visits_train_aug <- visits_train %>%
    tk_augment_timeseries_signature()
  
  # fit lm
  fit_lm <- lm(y ~ ., data = select(visits_train_aug, 
                                    -c(ds, diff, wday.xts,
                                       wday.lbl, year.iso)))
  
  # augment valid with ts info
  visits_valid_aug <- visits_valid %>%
    .$ds %>%
    tk_get_timeseries_signature()
  
  # predict from lm
  pred <- predict(fit_lm, newdata = select(visits_valid_aug,
                                           -c(index, diff, wday.xts,
                                              wday.lbl, year.iso)))

  pred_tk <- tibble(
      date  = visits_valid$ds,
      y_pred = pred
      )
  
  # plot
  p <- pred_tk %>%
    ggplot(aes(date, y_pred)) +
    geom_line(data = visits_train, aes(ds, y), colour = "black", na.rm = TRUE) +
    geom_line(data = visits_valid, aes(ds, y), colour = "grey50", na.rm=TRUE) +
    geom_line(colour = "blue", na.rm=TRUE) +
    labs(title = str_c("timetk for ", air_id))
    
  return(p)
}
plot_tk_lm_air_id("air_ba937bf13d40fb24")
## tk_augment_timeseries_signature(): Using the following .date_var variable: ds


def plot_tk_lm_air_id(air_id, ax):
    # Separate id column into "air", "store_id", "date"
    test[['air', 'store_id', 'date']] = test['id'].str.split('_', expand=True)
    
    # Get distinct dates
    distinct_dates = test['date'].drop_duplicates()
    
    # Calculate prediction length
    pred_len = len(distinct_dates)
    
    # Get max date from air_visits
    max_date = air_visits['visit_date'].max()
    
    # Calculate split date
    split_date = max_date - pd.DateOffset(days=pred_len)
    
    # Generate a sequence of dates
    all_visits = \
    pd.DataFrame({'visit_date': pd.date_range(start=air_visits['visit_date'].min(),
    end=max_date)})
    
    # Filter air_visits for the specific air_id
    foo = air_visits[air_visits['air_store_id'] == air_id].copy()
    
    # Right join with all_visits
    visits = all_visits.merge(foo, on='visit_date', how='right')
    
    # Apply log transformation to visitors
    visits['visitors'] = np.log1p(visits['visitors'])
    
    # Set 'visit_date' as index
    visits.set_index('visit_date', inplace=True)
    
    # Reset rownames to a new column 'ds'
    visits = visits.rename_axis('ds').reset_index()
    
    # Filter visits for training and validation sets
    visits_train = visits[visits['ds'] <= split_date]
    visits_valid = visits[visits['ds'] > split_date]
    
    # Augment visits_train
    visits_train_aug = augment_timeseries_signature(data= visits_train,
                                                    date_column = 'ds')
    
  
    # Get timeseries signature for visits_valid
    visits_valid_ds = visits_valid['ds']
    visits_valid_aug = get_timeseries_signature(visits_valid_ds)
    
    X = visits_train_aug.select_dtypes(['number'])
    X = X.drop(columns=['visitors' ,'log1p_visitors','ds_index_num']) #'ds_year','ds_year_iso'
    X = X.astype('int')
    y = visits_train_aug['visitors']
    
    # Add a constant (intercept) term to the independent variables
    X = sm.add_constant(X, prepend=True)#.to_numpy()
    # Fit the linear model
    model = sm.OLS(y, X)
    fit_lm = model.fit()
 
    V = visits_valid_aug.select_dtypes(['number'])
    V = V.drop(columns=['ds_index_num']) #'ds_year','ds_year_iso'
    V = V.astype('int')
    V= V.reset_index(drop=True)
    #V = sm.add_constant(V, prepend=True)
    V.insert(0,'const', 1.0)

    pred = fit_lm.predict(V)

    # Align by position: pred carries V's reset index (0..n-1) while
    # visits_valid keeps its original row labels, so use .to_numpy()
    pred_tk = pd.DataFrame({
        'date': visits_valid['ds'].to_numpy(),
        'value': pred.to_numpy()
    })
    
    #pred_tk = pred_tk.reset_index(drop=True)

    # Plot on the Axes passed in via `ax` (so the grid layout below works)
    plt.style.use('ggplot')
      
    ax.plot(pred_tk['date'], 
                pred_tk['value'] , label='Predicted Data', 
                color= 'blue')
                
    ax.plot(visits_train['ds'], 
               visits_train['visitors'], 
               label='training Data', color='black')
               
    ax.plot(visits_valid['ds'], 
              visits_valid['visitors'],  
               label='validate Data', color='grey')
  
  #    ax.fill_between(visits_valid['visit_date'],
  #                    hw_forecast - hw_forecast * 0.05, #hw_fit.bse,
  #                    hw_forecast + hw_forecast * 0.05, #hw_fit.bse, 
  #                    color='lightblue')
      
    ax.tick_params(axis='x', rotation=10, labelsize= 7)
    ax.set_xlabel('Date')
    ax.set_ylabel('Visitors')
    ax.set_title('timetk for '+ str(air_id))
    plt.legend()
    return(ax)
# Create a 1x1 grid
fig = plt.figure(figsize=(12, 4))
gs = GridSpec(1, 1)

#Add plots to the grid
p= plot_tk_lm_air_id(air_id = "air_ba937bf13d40fb24", 
                     ax = fig.add_subplot(gs[0]))
                        
plt.show()

We find:

  • The result looks quite decent; especially considering that we only used a simple linear model for our prediction. More elaborate methods are likely to give a better result (see the gradient-boosting sketch at the end of this section).

  • Be aware that this is not technically a time-series method. Instead we use the timetk functionality to extract time series characteristics of our data and use them as extra features in a regression-style approach. Thus, this methodology could be considered primarily as (automated) feature engineering. I suggest experimenting with different feature selections.
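As hinted above, swapping the linear model for a tree ensemble on the same signature features is straightforward. A minimal sketch with scikit-learn, reusing the X, y and V matrices built earlier (an addition, not part of the original kernel):

# Sketch: gradient-boosted trees on the same timetk signature features
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                max_depth=3, random_state=0)
gbr.fit(X.drop(columns=['const']), y)
pred_gbr = gbr.predict(V.drop(columns=['const']))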

9 Weather data

Detailed weather data, from the official website of the Japanese Meteorological Agency, has been extracted and provided as a dataset by Hunter McGushion. Many thanks!

The data comes in the shape of a weather_stations.csv file with the meta data and location of all the weather stations as well as individual files with detailed time series of meteorological data in a 1-1-16_5-31-17_Weather_Translated.zip folder. Part of his extensive work is that he has already identified the nearest (active) weather stations for each store (region) location and added this information to the modified files air/hpg_store_info_with_nearest_active_station.csv.

Here we will give a brief overview of the weather stations and the parameters they provide. For more information see Hunter’s exploration kernel here and most likely others that will appear shortly.

9.1 Weather Stations

Let’s look at the weather_stations.csv overview data first. When reading the data in the Kernels environment remember to use the more detailed directory paths.

weather_stations <- as_tibble(fread(str_c("data/weather_stations.csv")))
weather_stations %>%
  head(100) %>%
   knitr::kable() %>%   kableExtra::kable_styling() %>%
  kableExtra::scroll_box(width = "600px", height = "600px") 
id prefecture first_name second_name latitude longitude altitude date_terminated
aichi__ai-xi-kana__isaai aichi ai-xi-kana isaai 35.21667 136.6983 5.0 NA
aichi__aki-kana__azo aichi aki-kana azo 35.07833 137.4133 613.0 NA
aichi__centrair-kana__centrair aichi centrair-kana centrair 34.85833 136.8050 4.0 NA
aichi__chapel-mountain-kana__chaw-yama aichi chapel-mountain-kana chaw-yama 35.22000 137.6600 1216.0 NA
aichi__gamagori-kana__gamagori aichi gamagori-kana gamagori 34.84500 137.2167 55.0 NA
aichi__horai-kanahorai__2002-12-24 aichi horai-kana horai 34.93167 137.5750 81.0 2002-12-24
aichi__ichinomiya-kana__ichinomiya aichi ichinomiya-kana ichinomiya 35.29833 136.8533 11.0 NA
aichi__ikki-kana__ishiki aichi ikki-kana ishiki 34.81167 137.0283 12.0 NA
aichi__inake-kana__inab aichi inake-kana inab 35.21167 137.5067 505.0 NA
aichi__iragaka-kana__iraco aichi iragaka-kana iraco 34.62833 137.0933 6.2 NA
aichi__kanie-kana__kanye aichi kanie-kana kanye 35.13667 136.7917 2.0 NA
aichi__kayama-kanadekiyama__2005-12-05 aichi kayama-kana dekiyama 35.09500 137.4167 946.0 2005-12-05
aichi__kohara-kana__obara aichi kohara-kana obara 35.22833 137.2850 290.0 NA
aichi__minamichita-kana__minamichita aichi minamichita-kana minamichita 34.74000 136.9383 16.0 NA
aichi__nagoya-kana__nagoya aichi nagoya-kana nagoya 35.16667 136.9650 51.1 NA
aichi__okazaki-kana__okazaki aichi okazaki-kana okazaki 34.91833 137.1933 47.0 NA
aichi__oku-kana__oobu aichi oku-kana oobu 34.99500 136.9433 32.0 NA
aichi__shingo-castle__shinshiro aichi shingo-castle shinshiro 34.90667 137.5183 53.0 NA
aichi__tahara-kana__tahala aichi tahara-kana tahala 34.62667 137.2217 45.0 NA
aichi__tokai-kanaNONE__2012-10-17 aichi tokai-kana NONE 35.02333 136.9017 10.0 2012-10-17
aichi__toyama-kanatoyoyama__2005-02-17 aichi toyama-kana toyoyama 35.25500 136.9233 14.0 2005-02-17
aichi__toyohashi-kana__toyohashi aichi toyohashi-kana toyohashi 34.75000 137.3417 3.0 NA
aichi__toyota-kana__toyota aichi toyota-kana toyota 35.13167 137.1767 75.0 NA
aichi__writer-katakana__tsukodee aichi writer-katakana tsukodee 34.97667 137.4250 532.0 NA
akita__NONE__uno-tay akita NONE uno-tay 38.96000 140.5283 335.0 NA
akita__ai-no-kana__aniayi akita ai-no-kana aniayi 39.99333 140.4033 120.0 NA
akita__akita-kana__akita akita akita-kana akita 39.71667 140.0983 6.3 NA
akita__armored-field-kana__yoroi-bata akita armored-field-kana yoroi-bata 39.77500 140.6650 281.0 NA
akita__fujisato-kana__fujisato akita fujisato-kana fujisato 40.32000 140.2933 68.0 NA
akita__fujiwara-kana__fujiwara akita fujiwara-kana fujiwara 40.35333 140.7800 280.0 NA
akita__gokojima-kana__gojoume akita gokojima-kana gojoume 39.93833 140.1150 6.0 NA
akita__hachimantai-kaa__hachimantai akita hachimantai-kaa hachimantai 40.01333 140.8017 578.0 NA
akita__hachimori-kana__hachimori akita hachimori-kana hachimori 40.41333 139.9483 34.0 NA
akita__higase-gen-kana__higa-shinsei akita higase-gen-kana higa-shinsei 39.17833 140.6483 191.0 NA
akita__hinoki-inside-kana__hinokinai akita hinoki-inside-kana hinokinai 39.81167 140.5850 255.0 NA
akita__hitachiuchi-kana__hitachini akita hitachiuchi-kana hitachini 39.90333 140.4500 210.0 NA
akita__honjo-cana__honjo akita honjo-cana honjo 39.36000 140.0550 11.0 NA
akita__iwami-sanchi-kana__iwamisannai akita iwami-sanchi-kana iwamisannai 39.70667 140.2867 41.0 NA
akita__jinja-kana__jimba akita jinja-kana jimba 40.40333 140.6083 176.0 NA
akita__kaga-kana__katsuno akita kaga-kana katsuno 40.21500 140.7867 123.0 NA
akita__kakunoda-kana__kakunodate akita kakunoda-kana kakunodate 39.60333 140.5567 56.0 NA
akita__keumuyama-kanaholohayama__2005-04-20 akita keumuyama-kana holohayama 39.36833 140.3167 340.0 2005-04-20
akita__mikiyama-sanamitsumoriyama__1994-10-05 akita mikiyama-sana mitsumoriyama 39.22167 140.7167 570.0 1994-10-05
akita__moriyoshiyama-kanamoriyoshiyama__2009-11-02 akita moriyoshiyama-kana moriyoshiyama 40.00500 140.5183 800.0 2009-11-02
akita__mt-ukaidoyamaubaidoyama__1983-09-12 akita mt-ukaidoyama ubaidoyama 39.09333 140.3350 690.0 1983-09-12
akita__nakahara-kana__nikaho akita nakahara-kana nikaho 39.25500 139.9133 7.0 NA
akita__noritakana__nibetsu akita noritakana nibetsu 39.80000 140.2167 179.0 NA
akita__noshiro-kana__nosiro akita noshiro-kana nosiro 40.19833 140.0317 6.0 NA
akita__odate-kana__odate akita odate-kana odate 40.25167 140.5050 49.0 NA
akita__oga-kana__oga akita oga-kana oga 39.91167 139.9000 20.0 NA
akita__oga-makayama-kana__ogachinesan akita oga-makayama-kana ogachinesan 39.93833 139.7817 84.0 NA
akita__oga-motoyama-kanaogahonzan__1984-11-01 akita oga-motoyama-kana ogahonzan 39.89167 139.7667 440.0 1984-11-01
akita__ogata-kana__ogata akita ogata-kana ogata 40.00000 139.9500 3.0 NA
akita__okae-kana__oomagari akita okae-kana oomagari 39.49000 140.4950 30.0 NA
akita__sasako-kana__ginego akita sasako-kana ginego 39.10333 140.2933 200.0 NA
akita__shingakami-kana__wakigami akita shingakami-kana wakigami 40.19167 140.3717 84.0 NA
akita__sugizawa-mountain-kanasugisawayama__1983-09-20 akita sugizawa-mountain-kana sugisawayama 40.36500 140.8100 670.0 1983-09-20
akita__taiheiyama-kanatai-haisan__1984-10-31 akita taiheiyama-kana tai-haisan 39.78333 140.2500 600.0 1984-10-31
akita__taishoji-kana__daisyouji akita taishoji-kana daisyouji 39.52667 140.2333 20.0 NA
akita__takanosu-kana__takanos akita takanosu-kana takanos 40.22667 140.3717 29.0 NA
akita__tashiro-mountain-canatasilodake__2005-04-20 akita tashiro-mountain-cana tasilodake 40.41167 140.4000 800.0 2005-04-20
akita__tazawako-kana__tazawaco akita tazawako-kana tazawaco 39.69833 140.7317 230.0 NA
akita__tazawako-takahara-kana__tazawa-cocogen akita tazawako-takahara-kana tazawa-cocogen 39.77833 140.7617 652.0 NA
akita__tohru-kana__higashi-yuri akita tohru-kana higashi-yuri 39.30500 140.2883 117.0 NA
akita__yajima-kana__yashima akita yajima-kana yashima 39.23500 140.1367 46.0 NA
akita__yokote-kana__NONE akita yokote-kana NONE 39.32000 140.5550 59.0 NA
akita__yuuse-kana__NONE akita yuuse-kana NONE 40.12000 140.8400 214.0 NA
akita__yuwa-kana__yuwa akita yuwa-kana yuwa 39.61500 140.2183 91.0 NA
akita__yuzawa-kana__yuzawa akita yuzawa-kana yuzawa 39.18667 140.4633 74.0 NA
aomori__acid-kettle-kana__sukayu aomori acid-kettle-kana sukayu 40.64833 140.8483 890.0 NA
aomori__aomori-kana__aomori aomori aomori-kana aomori 40.82167 140.7683 2.8 NA
aomori__aomori-otani-kana__aomori-ootani aomori aomori-otani-kana aomori-ootani 40.73333 140.6883 198.0 NA
aomori__asahina-mountain-kanaasahinada-dake__2009-11-02 aomori asahina-mountain-kana asahinada-dake 40.32500 141.0483 710.0 2009-11-02
aomori__fukaura-kana__fukaura aomori fukaura-kana fukaura 40.64500 139.9317 66.1 NA
aomori__goshogawara-kana__goshogawara aomori goshogawara-kana goshogawara 40.80833 140.4583 9.0 NA
aomori__hachinohe-kana__hachinoho aomori hachinohe-kana hachinoho 40.52667 141.5217 27.1 NA
aomori__hakkoda-mountain-kanahakko-dasan__2009-11-02 aomori hakkoda-mountain-kana hakko-dasan 40.67500 140.8583 1310.0 2009-11-02
aomori__hikawa-kana__nurkawa aomori hikawa-kana nurkawa 40.51500 140.7833 404.0 NA
aomori__hirosaki-kana__hirosaki aomori hirosaki-kana hirosaki 40.61167 140.4550 30.0 NA
aomori__holiday-kana__yasumiya aomori holiday-kana yasumiya 40.42667 140.8983 414.0 NA
aomori__ichiura-kana__siura aomori ichiura-kana siura 41.05667 140.3467 20.0 NA
aomori__ikari-kaseki-kana__ikarigaseki aomori ikari-kaseki-kana ikarigaseki 40.48167 140.6200 135.0 NA
aomori__imabari-kana__imabeets aomori imabari-kana imabeets 41.18000 140.4817 30.0 NA
aomori__kanazawa-kana__ajigasawa aomori kanazawa-kana ajigasawa 40.77667 140.2050 40.0 NA
aomori__kanita-kana__kanita aomori kanita-kana kanita 41.04500 140.6333 5.0 NA
aomori__kuroishi-kana__kuroishi aomori kuroishi-kana kuroishi 40.66667 140.5850 30.0 NA
aomori__masaruyama-kanakenasiyama__1983-09-12 aomori masaruyama-kana kenasiyama 40.55167 140.7533 940.0 1983-09-12
aomori__misawa-kana__misawa aomori misawa-kana misawa 40.67500 141.3750 39.0 NA
aomori__mito-kana__san-noghe aomori mito-kana san-noghe 40.38333 141.2567 60.0 NA
aomori__mt-katakana__NONE aomori mt-katakana NONE 40.62833 140.2633 438.0 NA
aomori__mt-sorakuyamasoradayama__1994-10-26 aomori mt-sorakuyama soradayama 40.43833 140.7083 585.0 1994-10-26
aomori__mutsukana__mutsu aomori mutsukana mutsu 41.28333 141.2100 2.9 NA
aomori__nagoya-ping-kanachow-kei-daira__1977-09-01 aomori nagoya-ping-kana chow-kei-daira 40.61833 139.9967 242.0 1977-09-01
aomori__nobeyama-kana__north-hegi aomori nobeyama-kana north-hegi 40.88500 141.1600 14.0 NA
aomori__odanosawa-kana__odanosawa aomori odanosawa-kana odanosawa 41.23500 141.3967 6.0 NA
aomori__oma-kana__ooma aomori oma-kana ooma 41.52667 140.9117 14.0 NA
aomori__owani-kana__owani aomori owani-kana owani 40.53000 140.5550 63.0 NA
aomori__rokkasho-kana__roccasso aomori rokkasho-kana roccasso 40.88500 141.2717 80.0 NA
aomori__shichido-kana__NONE aomori shichido-kana NONE 40.70833 141.1283 57.0 NA
aomori__shiomon-mori-kanashi-hei-mori__1996-10-14 aomori shiomon-mori-kana shi-hei-mori 40.55167 140.2067 645.0 1996-10-14
weather_stations = pd.read_csv("data/weather_stations.csv")

weather_stations.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 1663 entries, 0 to 1662
## Data columns (total 8 columns):
##  #   Column           Non-Null Count  Dtype  
## ---  ------           --------------  -----  
##  0   id               1663 non-null   object 
##  1   prefecture       1663 non-null   object 
##  2   first_name       1663 non-null   object 
##  3   second_name      1663 non-null   object 
##  4   latitude         1663 non-null   float64
##  5   longitude        1663 non-null   float64
##  6   altitude         1663 non-null   float64
##  7   date_terminated  360 non-null    object 
## dtypes: float64(3), object(5)
## memory usage: 104.1+ KB
weather_stations.describe()
##           latitude    longitude     altitude
## count  1663.000000  1663.000000  1663.000000
## mean     36.670306   137.158422   287.321888
## std       4.038955     4.410313   389.432230
## min      24.055000   122.978333     0.000000
## 25%      34.313333   133.614167    25.050000
## 50%      35.858333   138.108333   126.000000
## 75%      39.177500   140.636667   406.000000
## max      45.520000   153.983333  3775.100000

We find:

  • There are about 1700 weather stations. Each station has a 3-part id consisting of prefecture, first_name, and second_name; very similar to our restaurant data. There is also altitude data, which might be useful for removing stations at significant elevation.

  • The most important pieces of information are the longitude and latitude coordinates of the stations, as well as the date at which a station was terminated. Needless to say, we only want to use stations for which we have weather data during our training and prediction date range.

Those are the termination dates:

weather_stations %>%
  filter(!is.na(date_terminated)) %>%
  mutate(date_terminated = ymd(date_terminated)) %>%
  ggplot(aes(date_terminated)) +
  geom_histogram(bins = 30, fill = "blue") +
  ggtitle("Weather station termination date")

# Keep only stations with a termination date (without overwriting the full table)
terminated = weather_stations[weather_stations['date_terminated'].notnull()].copy()

# Convert date_terminated to datetime
terminated['date_terminated'] = pd.to_datetime(terminated['date_terminated'])

# Plot histogram
p= plt.figure(figsize=(10, 6))
plt.style.use('ggplot')
p= plt.hist(terminated['date_terminated'], bins=30, color='blue',
                            edgecolor='black')
p= plt.title("Weather station termination date")
p= plt.xlabel("Date")
p= plt.ylabel("Frequency")
plt.show()

We see that a significant number of stations will only have historical data. Nevertheless, I agree that including all of them in the overview table is useful for getting a feeling for the wider context of the data set.
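As a sketch of how one might act on that, keep only stations that were never terminated or were terminated after the start of the training range (re-reading the full CSV so the filtering above is not affected; the cut-off date is an assumption):

# Sketch: stations usable for the Jan 2016 - May 2017 range
stations_all = pd.read_csv("data/weather_stations.csv")
active_stations = stations_all[
    stations_all['date_terminated'].isnull()
    | (pd.to_datetime(stations_all['date_terminated']) >= '2016-01-01')
]
print(len(active_stations))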

Here is another interactive leaflet map with all the currently active stations indicated as blue markers. In addition, the air and hpg locations are shown as purple (air) and red (hpg) circles. Each marker/circle is labeled with their respective id. This map illustrates which stations are close to restaurants and allows us to verify the identification of “nearest stations” that has already been included in the larger data set:

foobar <- weather_stations %>%
  filter(!is.na(date_terminated))

foo <- air_store %>% select(air_store_id, latitude, longitude) %>%
        mutate(dset = "air")
bar <- hpg_store %>% select(hpg_store_id, latitude, longitude) %>% 
        mutate(dset = "hpg")

leaflet(foobar) %>%
  addTiles() %>%
  addProviderTiles("CartoDB.Positron") %>%
  addMarkers(~longitude, ~latitude, popup = ~id, label = ~id, group = "Weather") %>%
  addCircleMarkers(~foo$longitude, ~foo$latitude, group = "AIR",
                   color = "purple", fillOpacity = 0.5, label = ~foo$air_store_id) %>%
  addCircleMarkers(lng = bar$longitude, lat = bar$latitude, group = "HPG",
                   color = "red", fillOpacity = 0.5, label = ~bar$hpg_store_id)
foobar = weather_stations[weather_stations['date_terminated'].notnull()]
foo = air_store[["air_store_id", "latitude", "longitude"]].copy()
foo["dset"] = "air"
bar = hpg_store[["hpg_store_id", "latitude", "longitude"]].copy()
bar["dset"] = "hpg"

# Create a folium map
m = folium.Map(location=[foobar['latitude'].mean(), 
               foobar['longitude'].mean()],zoom_start=5)

# Add tiles
folium.TileLayer('CartoDB positron').add_to(m)
## <folium.raster_layers.TileLayer object at 0x7ff5dc8cf130>
# Create marker clusters
marker_cluster = MarkerCluster(name='Weather').add_to(m)
air_cluster = MarkerCluster(name='AIR').add_to(m)
hpg_cluster = MarkerCluster(name='HPG').add_to(m)

# Add markers to clusters
# for _, row in foobar.iterrows():
#     folium.Marker(location=[row['latitude'], row['longitude']],
#                   popup=row['id'],
#                   icon=None,
#                   tooltip=row['id']).add_to(marker_cluster)

# Define a function to add markers
def add_marker(row):
    folium.Marker(location=[row['latitude'], row['longitude']],
                  popup=row['id'],
                  icon=None,
                  tooltip=row['id']).add_to(marker_cluster)

# Apply the function to each row of the DataFrame
foobar.apply(add_marker, axis=1)
## 5       None
## 11      None
## 19      None
## 20      None
## 41      None
##         ... 
## 1644    None
## 1647    None
## 1648    None
## 1651    None
## 1655    None
## Length: 360, dtype: object
# for _, row in foo.iterrows():
#     folium.CircleMarker(location=[row['latitude'], row['longitude']],
#                         radius=10,
#                         color='purple',
#                         fill=True,
#                         fill_color='purple',
#                         fill_opacity=0.5,
#                         popup=row['air_store_id']).add_to(air_cluster)

# Define a function to add circle markers; colour and target cluster are
# parameters so the same helper serves both the air and the hpg stores
def add_CircleMarker(row, name_id, color, cluster):
    folium.CircleMarker(location=[row['latitude'], row['longitude']],
                        radius=10,
                        color=color,
                        fill=True,
                        fill_color=color,
                        fill_opacity=0.5,
                        popup=row[name_id]).add_to(cluster)

# Apply the function to each row of the DataFrame
foo.apply(add_CircleMarker, name_id='air_store_id', color='purple',
          cluster=air_cluster, axis=1)
## 0      None
## 1      None
## 2      None
## 3      None
## 4      None
##        ... 
## 824    None
## 825    None
## 826    None
## 827    None
## 828    None
## Length: 829, dtype: object
# for _, row in bar.iterrows():
#     folium.CircleMarker(location=[row['latitude'], row['longitude']],
#                         radius=10,
#                         color='red',
#                         fill=True,
#                         fill_color='red',
#                         fill_opacity=0.5,
#                         popup=row['hpg_store_id']).add_to(hpg_cluster)

bar.apply(add_CircleMarker, name_id='hpg_store_id', color='red',
          cluster=hpg_cluster, axis=1)
## 0       None
## 1       None
## 2       None
## 3       None
## 4       None
##         ... 
## 4685    None
## 4686    None
## 4687    None
## 4688    None
## 4689    None
## Length: 4690, dtype: object
# Show the map
m
## <folium.folium.Map object at 0x7ff5d1600b50>
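
As a cross-check of those precomputed assignments, here is a minimal sketch of how a nearest active station could be derived from the coordinates alone. It uses a simple haversine distance; the helper names (haversine_km, nearest_station, air_nearest) are ours, and we assume the station id column is called id, as in the leaflet labels above. This is only an illustration, not necessarily the method used to build the provided “nearest station” file.

import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometres; all inputs in degrees
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

# Restrict to stations that are still active (no termination date)
active = weather_stations[weather_stations['date_terminated'].isnull()]

def nearest_station(lat, lon):
    d = haversine_km(lat, lon, active['latitude'].values, active['longitude'].values)
    return active['id'].values[d.argmin()]

# Nearest active station for each air restaurant
air_nearest = [nearest_station(lat, lon)
               for lat, lon in zip(air_store['latitude'], air_store['longitude'])]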

9.2 Weather time series

After getting a spatial overview of all the stations, we will now look at an example of the weather data they provide. We pick the station tokyo__tokyo-kana__tonokyo.csv because it has data for most, but not all, features (see below). Remember that the individual weather data files are in the folder 1-1-16_5-31-17_Weather.

weather_data <- as_tibble(fread("data/tokyo__tokyo-kana__tonokyo.csv"))
summary(weather_data)
##  calendar_date        avg_temperature high_temperature low_temperature
##  Min.   :2016-01-01   Min.   : 0.70   Min.   : 3.80    Min.   :-2.60  
##  1st Qu.:2016-05-09   1st Qu.: 8.00   1st Qu.:12.80    1st Qu.: 3.70  
##  Median :2016-09-15   Median :14.60   Median :19.40    Median :10.20  
##  Mean   :2016-09-15   Mean   :14.94   Mean   :19.56    Mean   :11.08  
##  3rd Qu.:2017-01-22   3rd Qu.:21.50   3rd Qu.:26.10    3rd Qu.:18.00  
##  Max.   :2017-05-31   Max.   :31.90   Max.   :37.70    Max.   :26.00  
##                                                                       
##  precipitation     hours_sunlight   solar_radiation deepest_snowfall
##  Min.   :  0.000   Min.   : 0.000   Min.   : 1.11   Min.   :0.00    
##  1st Qu.:  0.000   1st Qu.: 1.200   1st Qu.: 7.77   1st Qu.:0.75    
##  Median :  1.000   Median : 5.900   Median :12.12   Median :2.00    
##  Mean   :  7.750   Mean   : 5.548   Mean   :13.38   Mean   :2.50    
##  3rd Qu.:  7.625   3rd Qu.: 9.000   3rd Qu.:18.18   3rd Qu.:3.75    
##  Max.   :106.500   Max.   :13.400   Max.   :29.87   Max.   :6.00    
##  NA's   :249                                        NA's   :513     
##  total_snowfall avg_wind_speed avg_vapor_pressure avg_local_pressure
##  Min.   :6      Min.   :1.20   Min.   : 2.00      Min.   : 983.1    
##  1st Qu.:6      1st Qu.:2.30   1st Qu.: 6.00      1st Qu.:1006.6    
##  Median :6      Median :2.80   Median :10.90      Median :1011.6    
##  Mean   :6      Mean   :2.87   Mean   :13.06      Mean   :1011.4    
##  3rd Qu.:6      3rd Qu.:3.30   3rd Qu.:19.10      3rd Qu.:1016.1    
##  Max.   :6      Max.   :7.20   Max.   :32.60      Max.   :1029.2    
##  NA's   :516                   NA's   :1                            
##   avg_humidity    avg_sea_pressure  cloud_cover    
##  Min.   : 28.00   Min.   : 985.8   Min.   : 0.000  
##  1st Qu.: 53.00   1st Qu.:1009.5   1st Qu.: 4.000  
##  Median : 66.00   Median :1014.5   Median : 7.500  
##  Mean   : 66.16   Mean   :1014.3   Mean   : 6.721  
##  3rd Qu.: 79.00   3rd Qu.:1019.0   3rd Qu.:10.000  
##  Max.   :100.00   Max.   :1032.1   Max.   :10.000  
##  NA's   :1
weather_data = pd.read_csv("data/tokyo__tokyo-kana__tonokyo.csv")
weather_data.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 517 entries, 0 to 516
## Data columns (total 15 columns):
##  #   Column              Non-Null Count  Dtype  
## ---  ------              --------------  -----  
##  0   calendar_date       517 non-null    object 
##  1   avg_temperature     517 non-null    float64
##  2   high_temperature    517 non-null    float64
##  3   low_temperature     517 non-null    float64
##  4   precipitation       268 non-null    float64
##  5   hours_sunlight      517 non-null    float64
##  6   solar_radiation     517 non-null    float64
##  7   deepest_snowfall    4 non-null      float64
##  8   total_snowfall      1 non-null      float64
##  9   avg_wind_speed      517 non-null    float64
##  10  avg_vapor_pressure  516 non-null    float64
##  11  avg_local_pressure  517 non-null    float64
##  12  avg_humidity        516 non-null    float64
##  13  avg_sea_pressure    517 non-null    float64
##  14  cloud_cover         517 non-null    float64
## dtypes: float64(14), object(1)
## memory usage: 60.7+ KB
#weather_data.describe()

We find:

  • There is daily data for the 517 days that cover our train and test sets. This data includes information such as temperature, rain, snow, air pressure, and even sunlight hours or cloud cover.

  • We see that some features consist mostly of NAs. In fact, this specific weather station is one of the more complete ones, and you will find many more missing values in other stations (e.g. miyagi__shiogama-kana__shioogama.csv). Some stations appear to have essentially complete feature data (e.g. hokkaido_ishikari__sapporo-katakana__satporo.csv). This heterogeneity might make it necessary to focus on the features common to all stations in a first modelling approach (the sketch below estimates per-feature completeness across all station files).
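
To see which features are usable across the board, one can sketch a completeness check over all station files. The folder path in the glob pattern below is an assumption; adjust it to wherever the 1-1-16_5-31-17_Weather csv files were extracted.

import glob
import pandas as pd

# Average fraction of non-missing values per weather feature across all stations
files = glob.glob("data/1-1-16_5-31-17_Weather/*.csv")
completeness = (pd.concat((pd.read_csv(f).notna().mean() for f in files), axis=1)
                  .mean(axis=1)
                  .sort_values(ascending=False))
print(completeness)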

We will do a little bit of formatting to add some date features like month or day of the week:

weather_data <- weather_data %>%
  mutate(date = ymd(calendar_date),
         wday = wday(date, label = TRUE, abbr = TRUE, week_start = 1),
         month = month(date, label = TRUE, abbr = TRUE),
         week = week(date)) %>%
  select(-calendar_date)
# Convert 'calendar_date' to datetime and create additional columns
weather_data['date'] = pd.to_datetime(weather_data['calendar_date'])
weather_data['wday'] = weather_data['date'].dt.day_name().str[:3]
weather_data['month'] = weather_data['date'].dt.strftime('%b')
weather_data['week'] = weather_data['date'].dt.isocalendar().week

# Drop the original 'calendar_date' column
weather_data = weather_data.drop(columns=['calendar_date'])

# Display the modified DataFrame
weather_data.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 517 entries, 0 to 516
## Data columns (total 18 columns):
##  #   Column              Non-Null Count  Dtype         
## ---  ------              --------------  -----         
##  0   avg_temperature     517 non-null    float64       
##  1   high_temperature    517 non-null    float64       
##  2   low_temperature     517 non-null    float64       
##  3   precipitation       268 non-null    float64       
##  4   hours_sunlight      517 non-null    float64       
##  5   solar_radiation     517 non-null    float64       
##  6   deepest_snowfall    4 non-null      float64       
##  7   total_snowfall      1 non-null      float64       
##  8   avg_wind_speed      517 non-null    float64       
##  9   avg_vapor_pressure  516 non-null    float64       
##  10  avg_local_pressure  517 non-null    float64       
##  11  avg_humidity        516 non-null    float64       
##  12  avg_sea_pressure    517 non-null    float64       
##  13  cloud_cover         517 non-null    float64       
##  14  date                517 non-null    datetime64[ns]
##  15  wday                517 non-null    object        
##  16  month               517 non-null    object        
##  17  week                517 non-null    UInt32        
## dtypes: UInt32(1), datetime64[ns](1), float64(14), object(2)
## memory usage: 71.3+ KB
#weather_data.info()
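
One subtle difference in this translation: lubridate::week() counts complete 7-day blocks since January 1st, while pandas' isocalendar().week follows ISO-8601, so the two week columns can disagree around the turn of the year. If exact parity with the R column matters, the following one-liner (adding a hypothetical week_r column) reproduces lubridate's definition:

# lubridate::week(): number of complete 7-day periods since Jan 1, plus one
weather_data['week_r'] = (weather_data['date'].dt.dayofyear - 1) // 7 + 1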

Then we plot the monthly statistics for high temperature and average humidity. This is a perfect job for ridgeline plots and we add a bit of styling following the great CRAN vignette of the ggridges package (which was partly inspired by this blogpost by Austin Wehrwein):

#library('viridis')
p1 <- weather_data %>%
  ggplot(aes(x = high_temperature, y = fct_rev(month), fill = after_stat(x))) +
  geom_density_ridges_gradient(scale = 3, 
                               rel_min_height = 0.01, 
                               gradient_lwd = 1., bandwidth = 1.4) +
  scale_fill_viridis(name = "T_max [°C]", option = "C") +
  ggtitle("Maximum temperature at station \ntokyo tokyo-kana tonokyo in 2016/17") +
  labs(x = "High temperature", y = "") +
  theme_ridges(font_size = 13, grid = TRUE) +
  theme(legend.position = "none") +
  theme(axis.title.y = element_blank())

p2 <- weather_data %>%
  ggplot(aes(x = avg_humidity, y = fct_rev(month), fill = after_stat(x))) +
  geom_density_ridges_gradient(scale = 3, 
                               rel_min_height = 0.01,
                               gradient_lwd = 1., bandwidth = 4) +
  ggtitle("Average humidity at station \ntokyo tokyo-kana tonokyo in 2016/17") +
  labs(x = "Humidity", y = "", fill = "Humidity") +
  theme_ridges(font_size = 13, grid = TRUE) +
  theme(legend.position = "none") +
  theme(axis.title.y = element_blank())

layout <- matrix(c(1,2),1,2,byrow=TRUE)
multiplot(p1, p2, layout=layout)

import joypy
from pandas.api.types import CategoricalDtype

# Assuming weather_data is a DataFrame

# Define the order of months
month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Convert 'month' to a categorical variable with the specified order
weather_data['month'] = weather_data['month'].astype(
    CategoricalDtype(categories=month_order, ordered=True))

# Plotting joyplot for maximum temperature and average humidity
#fig, axes = plt.subplots(1, 1, figsize=(8, 4))
fig = plt.figure(figsize=(10, 4))
plt.style.use('ggplot')
# gs = GridSpec(2,1 , figure=fig)
# axes1 = fig.add_subplot(gs[0,0])
# axes2 = fig.add_subplot(gs[1,0])

# Joyplot for Maximum Temperature
p1= joypy.joyplot(
    weather_data,
    by='month',
    column='high_temperature',
    colormap=plt.cm.viridis,
    grid="both", #'x', 'y'
    linewidth=1,
    legend=False,
    fade=True,
    linecolor ="white"#,
#    ax=axes1
)
p1= plt.title('Maximum Temperature at station\n Tokyo Tokyo-Kana Tonokyo in 2016/17', size= 8)

plt.show()

fig = plt.figure(figsize=(10, 4))
#plt.style.use('ggplot')

# Joyplot for Average Humidity
p2= joypy.joyplot(
    weather_data,
    by='month',
    column='avg_humidity',
    colormap=plt.cm.viridis,
    grid= "both",
    linewidth=1,
    xlabels=True,
    legend=False,
    fade=True,
    linecolor = "white"#,
#    ax=ax2
)
p2= plt.title('Average Humidity at station\n Tokyo Tokyo-Kana Tonokyo in 2016/17', size= 8)

plt.show()

We find that summers in Tokyo are quite a bit hotter but also more humid. There is a range of about 20 degrees Celsius between winter and summer, and a spread of roughly 10 degrees within a typical month. Humidity varies more from day to day within a month, but the difference between winter and summer is still pronounced.
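
To put rough numbers on these impressions, a quick monthly aggregation of the same weather_data frame works well (the monthly name is ours):

# Monthly means of maximum temperature and average humidity
monthly = (weather_data
           .groupby('month', observed=True)[['high_temperature', 'avg_humidity']]
           .mean()
           .round(1))
print(monthly)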

Other weather features include precipitation, total and deepest snowfall (all three mostly missing for this station), hours of sunlight, average wind speed, various pressure measurements, cloud cover, and solar radiation. Here we plot some of these features over the months of the year:

p1 <- weather_data %>%
  ggplot(aes(fct_rev(month), hours_sunlight, fill = fct_rev(month))) +
  geom_boxplot() +
  coord_flip() +
  theme(legend.position = "none") +
  labs(x = "Month")

p2 <- weather_data %>%
  ggplot(aes(fct_rev(month), cloud_cover, fill = fct_rev(month))) +
  geom_boxplot() +
  coord_flip() +
  theme(legend.position = "none") +
  labs(x = "Month")

p3 <- weather_data %>%
  ggplot(aes(fct_rev(month), precipitation, fill = fct_rev(month))) +
  geom_boxplot() +
  coord_flip() +
  theme(legend.position = "none") +
  scale_y_log10() +
  labs(x = "Month")

p4 <- weather_data %>%
  ggplot(aes(fct_rev(month), avg_local_pressure, fill = fct_rev(month))) +
  geom_boxplot() +
  coord_flip() +
  theme(legend.position = "none") +
  scale_y_log10() +
  labs(x = "Month")

layout <- matrix(c(1,2,3,4),2,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, layout=layout)
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 354 rows containing non-finite values (`stat_boxplot()`).

from plotnine import ggplot, aes, geom_boxplot, coord_flip, theme, labs, scale_y_log10

# Plotting boxplots for various weather features
p1 = ggplot(weather_data, aes(x='month', y='hours_sunlight', fill='month')) + \
     geom_boxplot(na_rm=True) + \
     coord_flip() + \
     theme(legend_position="none") + \
     labs(x="Month")

p2 = ggplot(weather_data, aes(x='month', y='cloud_cover', fill='month')) + \
     geom_boxplot(na_rm=True) + \
     coord_flip() + \
     theme(legend_position="none") + \
     labs(x="Month")

p3 = ggplot(weather_data, aes(x='month', y='precipitation', fill='month')) + \
     geom_boxplot(na_rm=True) + \
     coord_flip() + \
     theme(legend_position="none") + \
     scale_y_log10() + \
     labs(x="Month")

p4 = ggplot(weather_data, aes(x='month', y='avg_local_pressure', fill='month')) + \
     geom_boxplot(na_rm=True) + \
     coord_flip() + \
     theme(legend_position="none") + \
     scale_y_log10() + \
     labs(x="Month")

# There is no direct multiplot() equivalent for plotnine figures here,
# so we display the four plots one after another
p1
## <Figure Size: (640 x 480)>

p2
## <Figure Size: (640 x 480)>

p3
## <Figure Size: (640 x 480)>

p4
## <Figure Size: (640 x 480)>

We find:

  • Interestingly, there are fewer sunlight hours in Sep than in Dec, despite December's significantly shorter days (upper left). This is explained by the cloud cover plot (upper right): winter is considerably less cloudy than the rest of the year. Cloud cover appears to be measured on a relative scale from 0 to 10.

  • Overall precipitation appears lowest in Feb, although the variance within each month is large and the boxes largely overlap. Remember that the precipitation feature has many missing values, and that the zero-precipitation days are dropped by the log scale (which is what the warning above refers to); the sketch after this list summarises precipitation while keeping the dry days.

  • The average local pressure clearly drops during the summer, with Aug having the lowest pressure values. This is consistent with the tendency for higher precipitation during that month.
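
Since the log scale silently drops both the missing and the zero-precipitation days, here is a small sketch that summarises precipitation per month while keeping the dry days visible; the column names median_wet_days and dry_day_share are ours:

# Median precipitation on wet days plus the share of completely dry days per month
prec = weather_data.dropna(subset=['precipitation'])
summary = prec.groupby('month', observed=True)['precipitation'].agg(
    median_wet_days=lambda s: s[s > 0].median(),
    dry_day_share=lambda s: (s == 0).mean(),
)
print(summary.round(2))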