Introduction to Data Warehousing and Business Intelligence
The Online Retail dataset is a transactional data set that contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based, registered non-store online retailer. The company mainly sells unique all-occasion gifts, and most of its customers have been observed to be wholesalers.
The main aim is to segment the customer base on RFM, which stands for Recency, Frequency and Monetary value, so that the company can target its customers effectively and efficiently.
Recency captures how recently a customer has made a purchase, usually expressed as the number of days since the customer's last purchase.
Frequency measures how often a customer purchases, which also helps in understanding how long and how intensively a customer uses a product. The value drawn from a product is directly related to customer engagement.
Monetary value describes how much money a customer spends, i.e. the total amount spent by the customer on purchases.
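As a purely illustrative sketch (the customer IDs, dates and amounts below are hypothetical and not drawn from the Online Retail dataset), the three measures can be computed from a small transaction table as follows, assuming dplyr is available:
library(dplyr)
# toy transactions for two hypothetical customers
toy <- data.frame(CustomerID = c(1, 1, 1, 2),
                  InvoiceDate = as.Date(c("2011-11-01", "2011-11-20", "2011-12-05", "2011-10-15")),
                  Amount = c(50, 20, 30, 200))
snapshot <- as.Date("2011-12-09")  # reference date for Recency
toy %>%
  group_by(CustomerID) %>%
  summarise(Recency = as.numeric(snapshot - max(InvoiceDate)),  # days since last purchase
            Frequency = n(),                                    # number of transactions
            Monetary = sum(Amount))                             # total spend
Here customer 1 would get Recency 4, Frequency 3 and Monetary 100, while customer 2 would get Recency 55, Frequency 1 and Monetary 200.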
About the dataset
Attribute Information:
InvoiceNo: invoice number. Nominal; a 6-digit integral number uniquely assigned to each transaction. If the code starts with the letter 'c', it indicates a cancellation.
StockCode: product (item) code. Nominal; a 5-digit integral number uniquely assigned to each distinct product.
Description: product (item) name. Nominal.
Quantity: the quantities of each product (item) per transaction. Numeric.
InvoiceDate: invoice date and time. Numeric; the date and time at which each transaction was generated.
UnitPrice: unit price. Numeric; product price per unit in sterling.
CustomerID: customer number. Nominal; a 5-digit integral number uniquely assigned to each customer.
Country: country name. Nominal; the name of the country where each customer resides.
Task 1: Data Understanding
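The listings that follow call functions from several add-on packages (dplyr, lubridate, factoextra, ggplot2, gridExtra, cluster and clValid). The library() calls do not appear in the original listing, so the set-up below is a minimal sketch assuming these packages are installed:
# packages assumed by the later code chunks
library(dplyr)      # mutate, group_by, summarise, distinct
library(lubridate)  # dmy_hm, wday
library(factoextra) # fviz_nbclust, fviz_cluster
library(ggplot2)    # box plots
library(gridExtra)  # grid.arrange
library(cluster)    # silhouette
library(clValid)    # dunn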
The Code
retail<-read.csv("C:/Users/Rohan/Downloads/Online Retail.csv")
head(retail)
## InvoiceNo StockCode Description Quantity
## 1 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
## 2 536365 71053 WHITE METAL LANTERN 6
## 3 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
## 4 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
## 5 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
## 6 536365 22752 SET 7 BABUSHKA NESTING BOXES 2
## InvoiceDate UnitPrice CustomerID Country
## 1 01-12-2010 08:26 2.55 17850 United Kingdom
## 2 01-12-2010 08:26 3.39 17850 United Kingdom
## 3 01-12-2010 08:26 2.75 17850 United Kingdom
## 4 01-12-2010 08:26 3.39 17850 United Kingdom
## 5 01-12-2010 08:26 3.39 17850 United Kingdom
## 6 01-12-2010 08:26 7.65 17850 United Kingdom
dim(retail)
## [1] 541909 8
str(retail)
## 'data.frame': 541909 obs. of 8 variables:
## $ InvoiceNo : Factor w/ 25900 levels "536365","536366",..: 1 1 1 1 1 1 1 2 2 3 ...
## $ StockCode : Factor w/ 4070 levels "10002","10080",..: 3538 2795 3045 2986 2985 1663 801 1548 1547 3306 ...
## $ Description: Factor w/ 4224 levels ""," 4 PURPLE FLOCK DINNER CANDLES",..: 4027 4035 932 1959 2980 3235 1573 1698 1695 259 ...
## $ Quantity : int 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDate: Factor w/ 23260 levels "1/11/2011 10:01",..: 198 198 198 198 198 198 198 199 199 200 ...
## $ UnitPrice : num 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : int 17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ...
## $ Country : Factor w/ 38 levels "Australia","Austria",..: 36 36 36 36 36 36 36 36 36 36 ...
Discussion
From the above output, it can be seen that the dataset contains 541909 observations of 8 variables. The variable names match the attribute names listed above.
Task 2: Performing RFM segmentation
Before performing RFM segmentation, the data has to be cleaned. It is therefore first checked whether the dataset contains any missing or null values.
The Code
colSums(is.na(retail))
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
## 0 0 0 0 0 0
## CustomerID Country
## 135080 0
any(is.null(retail))
## [1] FALSE
retail = na.omit(retail)
dim(retail)
## [1] 406829 8
# flag non-positive quantities and unit prices as missing
retail <- mutate(retail, Quantity = replace(Quantity, Quantity <= 0, NA),
                 UnitPrice = replace(UnitPrice, UnitPrice <= 0, NA))
Duplicate Values:
dim(unique(retail))[1]
## [1] 401574
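The discussion below also mentions deleting the duplicate rows, although that step is not shown in the listing; a minimal sketch of it, assuming dplyr's distinct(), is:
# keep only one copy of each fully duplicated row
retail <- distinct(retail)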
Discussion
From the result, it can be seen that there are 135080 missing values in the dataset, all in CustomerID, while no null values are found. As CustomerID is the key attribute for performing the RFM analysis, all rows with missing values are omitted from the dataset; after omission, the dataset contains 406829 observations. Non-positive values are also replaced with NA and duplicate rows are removed, after which the dataset contains 401574 unique rows.
Preparation of data
For the RFM analysis, after cleaning the dataset, the InvoiceDate attribute is split into separate date, time, month, year, hour-of-day and day-of-week components. To do this, the attribute is first converted to character.
The Code
retail$InvoiceDate <- as.character(retail$InvoiceDate)
# separate date and time components of invoice date
retail$date <- sapply(retail$InvoiceDate, FUN = function(x) {strsplit(x, split = '[ ]')[[1]][1]})
retail$time <- sapply(retail$InvoiceDate, FUN = function(x) {strsplit(x, split = '[ ]')[[1]][2]})
# create month, year and hour of day variables
retail$month <- sapply(retail$date, FUN = function(x) {strsplit(x, split = '[-]')[[1]][2]})
retail$year <- sapply(retail$date, FUN = function(x) {strsplit(x, split = '[-]')[[1]][3]})
retail$hourOfDay <- sapply(retail$time, FUN = function(x) {strsplit(x, split = '[:]')[[1]][1]})
# parse the day-month-year hour:minute timestamps with lubridate's dmy_hm()
retail$InvoiceDate <- dmy_hm(retail$InvoiceDate)
retail = mutate(retail, TotalSales = Quantity*UnitPrice)
retail$dayOfWeek <- wday(retail$InvoiceDate,label = TRUE)
retail$Country <- as.factor(retail$Country)
retail$month <- as.factor(retail$month)
retail$year <- as.factor(retail$year)
levels(retail$year) <- c(2010,2011)
retail$hourOfDay <- as.factor(retail$hourOfDay)
retail$dayOfWeek <- as.factor(retail$dayOfWeek)
The Process
Several steps are needed to perform the RFM analysis. First, the most recent purchase date for each CustomerID is found in order to obtain the Recency data. Then the number of transactions made by each customer up to the reference date is counted to obtain the Frequency data. The Monetary data is represented by the sum of each customer's total sales.
The code for calculating RFM (Recency, Frequency and Monetary)
max_date <- max(retail$InvoiceDate, na.rm = TRUE)
retail = mutate(retail, Diff = difftime(max_date, InvoiceDate, units = "days"))
retail$Diff <- floor(retail$Diff)
RFM <- summarise(group_by(retail,CustomerID),Frequency = n(), Monetary = sum(TotalSales), Recency = min(Diff))
RFM$Recency <- as.numeric(RFM$Recency)
RFM$Monetary[is.na(RFM$Monetary)] <- 0
summary(RFM)
## CustomerID Frequency Monetary Recency
## Min. :12346 Min. : 1.00 Min. : 1.0 Min. : 0.00
## 1st Qu.:13828 1st Qu.: 14.00 1st Qu.: 256.4 1st Qu.: 12.00
## Median :15300 Median : 33.00 Median : 541.5 Median : 48.00
## Mean :15297 Mean : 65.88 Mean : 1447.1 Mean : 91.29
## 3rd Qu.:16767 3rd Qu.: 72.00 3rd Qu.: 1226.4 3rd Qu.:154.00
## Max. :18287 Max. :4460.00 Max. :192588.6 Max. :352.00
The output
head(RFM,10)
## # A tibble: 10 x 4
## CustomerID Frequency Monetary Recency
## <int> <int> <dbl> <dbl>
## 1 12346 1 77184. 316
## 2 12347 76 1770. 30
## 3 12348 26 1430. 66
## 4 12349 73 1758. 9
## 5 12352 62 1210. 63
## 6 12353 4 89 194
## 7 12354 58 1079. 223
## 8 12356 38 2330. 13
## 9 12359 105 2877. 48
## 10 12360 129 2662. 43
Task 3: Customer Segmentation with K-means
For the K-means clustering, two methods are used to choose the number of clusters: the elbow method and the silhouette method. First of all, the data is scaled, which is essential before performing distance-based clustering.
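The scaled matrix RFM_scaled used in the next chunks is not defined in the listing; a minimal sketch of the scaling step, assuming the three RFM variables are standardised with base R's scale(), is:
# standardise Frequency, Monetary and Recency so no single variable dominates the distances
RFM_scaled <- scale(RFM[, c("Frequency", "Monetary", "Recency")])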
Coding for the Elbow method
fviz_nbclust(RFM_scaled, kmeans, method = "wss") + geom_vline(xintercept = 3, linetype = 2)
Discussion
From the above figure, it can be seen that the curve starts to bend at 3 clusters. Therefore, K = 3 can be regarded as the optimal number of clusters by this method.
Silhouette Method
The quality of a clustering can be measured with the average silhouette approach, which indicates how well each object lies within its cluster. A high average silhouette width indicates a good clustering. The average silhouette method computes the average silhouette of the observations for different values of k, and the optimal number of clusters is the value of k that maximises the average silhouette over the range of candidate values.
The silhouette function in the cluster package can be used to compute the average silhouette width. The following code applies this approach for up to 10 clusters.
The code
fviz_nbclust(RFM_scaled, kmeans, method = "silhouette")
The Output
Discussion
From this graph, it can be seen that K = 4 gives the highest average silhouette width, with K = 3 a close second. Therefore, the K-means clustering is visualised for both K = 3 and K = 4 in order to get a clearer and better understanding.
The coding and the Output
k3 <- kmeans(RFM_scaled, centers = 3, nstart = 25)
k4 <- kmeans(RFM_scaled, centers = 4, nstart = 25)
fviz_cluster(k3, geom = "point", data = RFM_scaled, pointsize = 0.2) + ggtitle("k = 3")
fviz_cluster(k4, geom = "point", data = RFM_scaled, pointsize = 0.2) + ggtitle("k = 4")
Discussion
It can be seen that some overlapping takes place at K = 4. Therefore, it can be concluded that K = 3 gives the better-separated clusters and is taken as the optimal value of K. The summary statistics of the clusters are represented visually using box plots.
The Code
res <- cbind(RFM, ClusterId = k3$cluster)
res <- as.data.frame(res)
a <- ggplot(res, aes(x = ClusterId, y = Frequency, group = ClusterId, fill = as.factor(ClusterId))) +
geom_boxplot(show.legend = FALSE) + theme_minimal() + scale_fill_brewer(palette = "Set2")
b <- ggplot(res, aes(x = ClusterId, y = Monetary, group = ClusterId, fill = as.factor(ClusterId))) +
geom_boxplot(show.legend = FALSE) + theme_minimal() + scale_fill_brewer(palette = "Set2")
c <- ggplot(res, aes(x = ClusterId, y = Recency, group = ClusterId, fill = as.factor(ClusterId))) +
geom_boxplot(show.legend = FALSE) + theme_minimal() + scale_fill_brewer(palette = "Set2")
grid.arrange(a,b,c, ncol = 3)
The Output
Task 4: Review of Results
The customers in cluster 1 are observed to have the highest transaction amounts, to buy frequently and to be the most recent buyers compared with the other customers. From the point of view of the business, this is the most important segment. The customers in the second cluster have medium transaction amounts compared with the other customers.
The customers in cluster 3 are identified as those with the lowest transaction amounts, who are neither frequent nor recent buyers. These customers can be considered the least important from the point of view of the business.
Task 5: Data Mart Design
The Coding and the Output
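The objects euclidian_dist and hc3 used below are not defined in the listing; a minimal sketch, assuming a Euclidean distance matrix on the scaled RFM data and Ward-linkage hierarchical clustering, is:
# Euclidean distances between customers in the scaled RFM space
euclidian_dist <- dist(RFM_scaled, method = "euclidean")
# agglomerative hierarchical clustering with Ward's method
hc3 <- hclust(euclidian_dist, method = "ward.D2")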
dunn_km = dunn(clusters = k3$cluster, Data = RFM_scaled)
dunn_km
## [1] 0.0003457535
memb_ward = cutree(hc3, k = 3)
dunn_ward <- dunn(clusters = memb_ward, Data = RFM_scaled)
dunn_ward
## [1] 0.001797323
sil_k3 <- silhouette(k3$cluster, euclidian_dist)
summary(sil_k3)
## Silhouette of 3480 units in 3 clusters from silhouette.default(x = k3$cluster, dist = euclidian_dist) :
## Cluster sizes and average silhouette widths:
## 1017 2452 11
## 0.56177664 0.57127184 0.07252183
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.4021 0.5110 0.6455 0.5669 0.6893 0.7217
sil_hc <- silhouette(memb_ward, euclidian_dist)
summary(sil_hc)
## Silhouette of 3480 units in 3 clusters from silhouette.default(x = memb_ward, dist = euclidian_dist) :
## Cluster sizes and average silhouette widths:
## 2206 1268 6
## 0.5592346 0.4545712 0.2045036
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2356 0.4575 0.6028 0.5205 0.6499 0.6857
#K-means Clustering results
aggregate(res,by = list(res$ClusterId),FUN = mean)
## Group.1 Frequency Monetary Recency ClusterId
## 1 1 26.69322 495.5639 223.22124 1
## 2 2 74.99878 1472.5307 36.82871 2
## 3 3 1655.36364 83753.2727 35.09091 3
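The data frame res1 used below is not defined in the listing; it is assumed to attach the Ward cluster labels to the RFM table in the same way res does for K-means. A minimal sketch:
# pair each customer's RFM values with its hierarchical (Ward) cluster label
res1 <- as.data.frame(cbind(RFM, ClusterId = memb_ward))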
#Hierarchical clustering results
aggregate(res1,by = list(res1$ClusterId),FUN = mean)
## Group.1 Frequency Monetary Recency ClusterId
## 1 1 81.41659 1698.8343 28.99728 1
## 2 2 28.70268 520.1902 200.09937 2
## 3 3 2208.83333 104781.7217 2.00000 3