"A company can grow rapidly when it understands the personality and behavior of its customers, because it can then offer better services and benefits to customers who have the potential to become loyal. Here we process historical marketing campaign data to improve campaign performance and target the right customers, so that they transact on the company's platform. Using the insight from this data, our focus is to create a cluster prediction model that makes it easier for the company to make decisions."
import pandas as pd

df = pd.read_csv("marketing_campaign_data.csv")
df.head(12)
# Conversion rate: campaign responses per web visit in the last month
df['conversion_rate'] = df['Response'] / df['NumWebVisitsMonth']
df
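One caveat with a ratio like this: a customer with zero web visits causes a division by zero, which pandas turns into inf or NaN rather than an error. A minimal sketch of a guarded version follows; the `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data, illustrative only: the last two customers have no web visits.
toy = pd.DataFrame({
    "Response": [1, 0, 1, 0],
    "NumWebVisitsMonth": [4, 5, 0, 0],
})

# Replace zero visits with NaN so the division yields NaN instead of inf.
visits = toy["NumWebVisitsMonth"].replace(0, np.nan)
toy["conversion_rate"] = toy["Response"] / visits

print(toy)
```

With this approach every undefined ratio ends up as NaN, so a single `fillna` later on is enough to handle them.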
# Age group from birth year: 'Lansia' (elderly), 'Dewasa' (adult), 'Remaja' (teenager)
def kelompok_usia(x):
    if x['Year_Birth'] <= 1954:
        kelompok = 'Lansia'
    elif x['Year_Birth'] >= 1955 and x['Year_Birth'] <= 1993:
        kelompok = 'Dewasa'
    else:
        kelompok = 'Remaja'
    return kelompok
df['grup_umur'] = df.apply(kelompok_usia, axis=1)
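The row-wise `apply` above works, but the same thresholds can also be evaluated column-wise. A minimal sketch using `np.select` on a made-up frame (the `toy` data is an assumption for illustration; the thresholds are the ones in `kelompok_usia`):

```python
import numpy as np
import pandas as pd

# Toy birth years, illustrative only.
toy = pd.DataFrame({"Year_Birth": [1950, 1954, 1955, 1980, 1993, 1994]})

# Same thresholds as kelompok_usia, evaluated on the whole column at once.
conditions = [
    toy["Year_Birth"] <= 1954,
    toy["Year_Birth"].between(1955, 1993),  # inclusive on both ends
]
toy["grup_umur"] = np.select(conditions, ["Lansia", "Dewasa"], default="Remaja")

print(toy)
```

On a dataset of this size the difference in speed is small; the main benefit is that the boundaries are listed in one place.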
# Income group: 'Kaya' (wealthy) at or above the threshold, otherwise 'Biasa aja' (ordinary)
def kesejahteraan_masyakat(x):
    if x['Income'] >= 5.174150e+07:
        kelompok = 'Kaya'
    else:
        kelompok = 'Biasa aja'
    return kelompok
df['grup_income'] = df.apply(kesejahteraan_masyakat, axis=1)
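# Note: besides the four purchase channels, this sum also includes NumWebVisitsMonth
df['Total_Purchases'] = df['NumDealsPurchases'] + df['NumWebPurchases'] + df['NumCatalogPurchases'] + df['NumStorePurchases'] + df['NumWebVisitsMonth']
# jumlah_anak: number of children at home (kids plus teenagers)
df['jumlah_anak'] = df['Kidhome'] + df['Teenhome']
# total_pembelian: total amount spent across all product categories
df['total_pembelian'] = df['MntCoke'] + df['MntFruits'] + df['MntMeatProducts'] + df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds']
# Total_Transaksi: income minus total spending
df['Total_Transaksi'] = df['Income'] - df['total_pembelian']
# total_acc_cmp: number of campaigns (1 to 5) that the customer accepted
df['total_acc_cmp'] = df['AcceptedCmp2'] + df['AcceptedCmp1'] + df['AcceptedCmp5'] + df['AcceptedCmp3'] + df['AcceptedCmp4']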
The boxplot above shows some outliers, but they lie close to the rest of the data rather than far beyond the whiskers. This suggests that the outliers are still within a reasonable range of values for the data.
- The following variables are approximately normally distributed: Total_Transaksi, NumWebVisitsMonth, NumStorePurchases, NumWebPurchases, NumDealsPurchases, Recency, Year_Birth
- The following variables are positively skewed: MntCoke, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds, conversion_rate
- The following variables are bimodal or have more than one mode: total_acc_cmp, jumlah_anak, Kidhome, Teenhome
There is a visible relationship between customer age group and conversion rate: adults tend to have a higher conversion rate than teenagers and the elderly. A plausible explanation is that adults are in their active years, with a higher income than teenagers and more activity than the elderly.
The graph above shows the relationship between conversion rate and the number of children. It can be seen that people with no children tend to have a higher conversion rate than people with one or more children.
According to the results, Income has 24 null values, conversion_rate has 11, and Total_Transaksi has 24.
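Counts like these are typically read off `isna().sum()`. The call used in the original notebook is not shown in this write-up, so the following is only a minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Toy data, illustrative only.
toy = pd.DataFrame({
    "Income": [50_000_000.0, np.nan, 72_000_000.0],
    "conversion_rate": [0.25, 0.0, np.nan],
    "Recency": [10, 20, 30],
})

# Number of missing values per column, keeping only columns that have any.
null_counts = toy.isna().sum()
print(null_counts[null_counts > 0])
```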
We handle missing values as follows:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['Income'] = df['Income'].fillna(df['Income'].mean())
df['conversion_rate'] = df['conversion_rate'].fillna(0)
df['Total_Transaksi'] = df['Total_Transaksi'].fillna(df['Total_Transaksi'].mean())
There are no duplicates in our data:
df.duplicated().sum()
Next, we remove the columns that are not needed.
df.drop(columns = ['Unnamed: 0','ID', 'Kidhome', 'Teenhome','Z_CostContact', 'Z_Revenue','Dt_Customer'], inplace=True)
df['Education'] = df['Education'].map({'S3' : 4, 'S2' : 3, 'S1':2, 'D3':1, 'SMA':0})
df['grup_income'] = df['grup_income'].map({'Kaya':1, 'Biasa aja':0})
df['grup_umur'] = df['grup_umur'].map({'Dewasa' : 1, 'Lansia': 0, 'Remaja':2})
df['Marital_Status'] = df['Marital_Status'].map({'Single' : 0, 'Couple' : 1})
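One thing to watch with `map` and a dictionary: any category that is missing from the dictionary silently becomes NaN. A small sketch of how to check for this, using a made-up series (the label "Diploma" is hypothetical and only there to trigger the case):

```python
import pandas as pd

# Toy data: "Diploma" is a made-up label that the mapping does not cover.
education = pd.Series(["S1", "SMA", "S3", "Diploma"])
order = {"SMA": 0, "D3": 1, "S1": 2, "S2": 3, "S3": 4}

encoded = education.map(order)

# Original labels that were not covered by the mapping.
unmapped = education[encoded.isna()].unique()
print(encoded.tolist())
print(unmapped)
```

If `unmapped` is empty, the encoding covered every category in the column.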
from sklearn.preprocessing import StandardScaler, MinMaxScaler
scd = StandardScaler()
y_fit = scd.fit_transform(df.astype(float))
y_fit
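For reference, `StandardScaler` rescales each column to zero mean and unit variance, that is z = (x - mean) / std with the population standard deviation. A minimal check on a small made-up array:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix, illustrative only: two features on very different scales.
X = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0], [40.0, 7.0]])

scaled = StandardScaler().fit_transform(X)
# Manual version: population standard deviation (ddof=0), as StandardScaler uses.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

This matters for k-means because the algorithm is distance-based, so unscaled features with large values would dominate the clustering.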
cluster = df[['Recency', 'Total_Purchases', 'total_pembelian']].copy()
cluster.columns = ['Recency','Frequency','Monetary']
features = ['Recency','Frequency','Monetary']
cluster.describe(include='all')
Next, we plot the distribution of each RFM feature.
import matplotlib.pyplot as plt
import seaborn as sns

cols = cluster.columns
# Density plot for each RFM feature
plt.figure(figsize=(15, 20))
for i in range(len(cols)):
    plt.subplot(6, 2, i+1)
    sns.kdeplot(x=cluster[cols[i]])
    plt.tight_layout()
cols = cluster.columns
# Boxplot for each RFM feature
plt.figure(figsize=(10, 15))
for i in range(len(cols)):
    plt.subplot(4, 4, i+1)
    sns.boxplot(y=cluster[cols[i]], orient='v')
    plt.tight_layout()
Looks like we have a few outliers. Time to handle them.
for col in cols:
    high_cut = cluster[col].quantile(q=0.99)
    low_cut = cluster[col].quantile(q=0.01)
    cluster.loc[cluster[col] > high_cut, col] = high_cut
    cluster.loc[cluster[col] < low_cut, col] = low_cut
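The loop above caps each feature at its 1st and 99th percentiles. The same idea can be sketched with `Series.clip`; the `toy` frame below is made up, with one extreme value added to show the effect:

```python
import pandas as pd

# Toy data, illustrative only: the values 1..100 plus one extreme value.
toy = pd.DataFrame({"Monetary": list(range(1, 101)) + [10_000]})

low_cut = toy["Monetary"].quantile(0.01)
high_cut = toy["Monetary"].quantile(0.99)

# Values below low_cut or above high_cut are replaced by the cut-off itself.
capped = toy["Monetary"].clip(lower=low_cut, upper=high_cut)

print(toy["Monetary"].max(), capped.max())
```

Capping keeps every row in the data, which is usually preferable to dropping rows when the clusters are later used to describe all customers.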
It turns out that there are still some outliers in the Monetary feature. Let's handle them with a log transformation.
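import numpy as np

tf_log = cluster.copy()
# Log-transform Monetary to reduce the right skew (np.log requires strictly positive values)
tf_log['Monetary'] = np.log(cluster['Monetary'])
plt.figure(figsize=(5, 5))
sns.kdeplot(x=tf_log['Monetary'])
plt.tight_layout()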
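A note on the transformation: `np.log` returns -inf for a value of zero, which would break later steps such as scaling or clustering. If the Monetary column can contain zeros (this write-up does not say whether it does), `np.log1p` is a common alternative. A small sketch on made-up values:

```python
import numpy as np

# Toy spending values, illustrative only; the first customer spent nothing.
monetary = np.array([0.0, 5_000.0, 50_000.0, 2_500_000.0])

with np.errstate(divide="ignore"):
    plain_log = np.log(monetary)    # the zero entry becomes -inf
shifted_log = np.log1p(monetary)    # log(1 + x), finite for every x >= 0

print(plain_log)
print(shifted_log)
```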
from matplotlib.patches import Ellipse
from sklearn.cluster import KMeans

# Elbow method: fit k-means for k = 1..10 and record the inertia of each model
inertia = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, max_iter=300, n_init=10, random_state=42)
    kmeans.fit(y_fit)
    inertia.append(kmeans.inertia_)
sns.lineplot(x=range(1, 11), y=inertia, color='purple')
sns.scatterplot(x=range(1, 11), y=inertia, s=50, color='blue')
# Mark the elbow at k = 4
circle = Ellipse((4, 45000), width=0.3, height=2000, color='red', fill=False, linewidth=2)
plt.gca().add_patch(circle)
# plt.gca().autoscale_view()
plt.show()
The inertia curve flattens from k = 4 to k = 5, meaning that adding a fifth cluster reduces inertia only slightly. Therefore, n_clusters = 4 is chosen for the k-means clustering model.
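The step that fits the final model and creates the Labels column used later is not shown in this write-up. A minimal sketch of what such a step could look like follows. The synthetic `toy` table, its value ranges, and the choice to cluster the three scaled RFM columns are all assumptions for illustration, not the original notebook's code.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic RFM-like table, illustrative only.
toy = pd.DataFrame({
    "Recency": rng.integers(0, 100, size=200),
    "Frequency": rng.integers(1, 40, size=200),
    "Monetary": rng.integers(10_000, 2_000_000, size=200),
})

# Scale the features, fit k-means with the chosen k, and store the cluster labels.
scaled = StandardScaler().fit_transform(toy)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
toy["Labels"] = kmeans.fit_predict(scaled)

print(toy["Labels"].value_counts().sort_index())
```

Fixing `random_state` keeps the label numbering stable between runs, which matters when the labels are interpreted one by one afterwards.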
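from yellowbrick.cluster import SilhouetteVisualizer

n_cluster = [4, 5, 6, 8, 9, 10]
fig, ax = plt.subplots(2, 3, figsize=(15, 8))
# One silhouette plot per candidate k, filling the 2x3 grid row by row
for i, axis in zip(n_cluster, ax.flatten()):
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=10, max_iter=100, random_state=42)
    visualizer = SilhouetteVisualizer(kmeans, colors='yellowbrick', ax=axis)
    # y_fit is the standardized array produced by StandardScaler above
    visualizer.fit(y_fit)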
The best silhouette plot is the one in the lower right, with an average score of about 0.6, so the model performance indicated by the silhouette score is also better there. In general, a silhouette value that approaches 1 indicates that the points in a cluster fit that cluster well and are well separated from the other clusters.
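The average silhouette score can also be computed directly, without a plot. For a single point it is (b - a) / max(a, b), where a is the mean distance to points in its own cluster and b is the mean distance to points in the nearest other cluster. A minimal sketch with scikit-learn's `silhouette_score` on synthetic blobs (the data and the range of k are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four blobs, illustrative only.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Average silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

Comparing these numbers alongside the elbow plot gives a second opinion on the number of clusters.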
# Compare the PCA scatter plot with the previous scatter plot
import random

random.shuffle(colors)  # shuffles the list in place and returns None, so call it before plotting
sns.pairplot(data=df_pca, hue='Labels', diag_kind='kde', palette=colors)
plt.tight_layout()
c = ['#957DAD', '#E0BBe4', '#B7D3DF', '#CDE8E6']

def dist_feats(features):
    # One horizontal bar chart per feature: the median of each cluster,
    # with the overall median drawn as a dashed vertical line
    plt.figure(figsize=[len(features)*5, 3])
    i = 1
    for feats in features:
        ax = plt.subplot(1, len(features), i)
        ax.vlines(cluster[feats].median(), ymin=-0.5, ymax=3, color='black', linewidth=1, linestyle='--')
        dfg = cluster.groupby('Labels')
        x = dfg[feats].median().index
        y = dfg[feats].median().values
        ax.barh(x, y, color=c)
        plt.title(feats)
        i = i+1

dist_feats(features)
• R, Recency: The lower the value of recency, the more recently the customer made a purchase.
• F, Total_Purchases: The higher the value of frequency, the more often the customer makes a purchase.
• M, total_pembelian: The higher the value of monetary, the more money the customer spends on purchases.
• Label 0 = high R, with F and M below the median.
• Label 1 = high F and M, with R below the median.
• Label 2 = low F, M, and R.
• Label 3 = high F, M, and R.
• Cluster 0: Most Loyal Customers:
Customers in this cluster last interacted with the business about 74 days ago, with low shopping frequency and low spending.
• Cluster 1: New Customers:
Customers in this cluster last interacted with the business about 22 days ago, with high shopping frequency and significant spending.
• Cluster 2: Impactful Customers:
Customers in this cluster last interacted with the business about 24 days ago, with low shopping frequency and low spending.
• Cluster 3: Passive Customers:
Customers in this cluster last interacted with the business about 73 days ago, with high shopping frequency and significant spending.
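The high and low patterns listed above come from comparing each cluster's median with the overall median. A minimal sketch of that comparison on a made-up table (the `toy` values are illustrative only, chosen so that the four labels reproduce the patterns described above):

```python
import pandas as pd

# Toy RFM table with cluster labels, illustrative only.
toy = pd.DataFrame({
    "Recency":   [80, 75, 20, 25, 22, 18, 70, 78],
    "Frequency": [5, 6, 25, 27, 4, 6, 26, 24],
    "Monetary":  [40, 50, 900, 950, 45, 55, 980, 1000],
    "Labels":    [0, 0, 1, 1, 2, 2, 3, 3],
})
features = ["Recency", "Frequency", "Monetary"]

cluster_medians = toy.groupby("Labels")[features].median()
overall_medians = toy[features].median()

# True where a cluster's median is above the overall median of that feature.
above = cluster_medians.gt(overall_medians, axis="columns")
pattern = above.apply(lambda col: col.map({True: "high", False: "low"}))

print(cluster_medians)
print(pattern)
```

A table like `pattern` makes it easier to check that each cluster's written description agrees with its medians.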
Cluster 3 & Cluster 1: These clusters are good targets for retargeting because of their high shopping frequency and spending. Marketing strategies can focus on offering exclusive deals or purchase bonuses to increase customer loyalty in these groups.
Cluster_0 = cluster[cluster['Labels'] == 0]['Monetary'].sum()
Cluster_1 = cluster[cluster['Labels'] == 1]['Monetary'].sum()
Cluster_2 = cluster[cluster['Labels'] == 2]['Monetary'].sum()
Cluster_3 = cluster[cluster['Labels'] == 3]['Monetary'].sum()
total_spent = Cluster_0 + Cluster_1 + Cluster_2 + Cluster_3
potential_impact_cluster_3 = (Cluster_3 / total_spent) * 100
potential_impact_cluster_1 = (Cluster_1 / total_spent) * 100
print('Total Spent of Cluster 0: Rp', Cluster_0)
print('Total Spent of Cluster 1: Rp', Cluster_1)
print('Total Spent of Cluster 2: Rp', Cluster_2)
print('Total Spent of Cluster 3: Rp', Cluster_3)
print('Total Spent: Rp', total_spent)
print('Potential Impact of Cluster 3: {:.2f}%'.format(potential_impact_cluster_3))
print('Potential Impact of Cluster 1: {:.2f}%'.format(potential_impact_cluster_1))
output:
Total Spent of Cluster 0: Rp 88453000
Total Spent of Cluster 1: Rp 557668000
Total Spent of Cluster 2: Rp 80137000
Total Spent of Cluster 3: Rp 626855000
Total Spent: Rp 1353113000
Potential Impact of Cluster 3: 46.33%
Potential Impact of Cluster 1: 41.21%
"Focusing our retargeting efforts on Cluster 3 and Cluster 1 could yield significant returns. These clusters account for around Rp 626.9 million (Cluster 3) and Rp 557.7 million (Cluster 1) of total spending, which translates to potential impact rates of 46.3% and 41.2% respectively. In essence, prioritizing these clusters presents a promising avenue for boosting revenue and customer engagement."
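The shares follow directly from the printed totals. A quick check of the arithmetic, with the per-cluster totals copied from the output above:

```python
# Per-cluster totals in rupiah, copied from the printed output.
totals = {0: 88_453_000, 1: 557_668_000, 2: 80_137_000, 3: 626_855_000}
total_spent = sum(totals.values())

for label in (3, 1):
    share = totals[label] / total_spent * 100
    print(f"Cluster {label}: Rp {totals[label] / 1e6:.1f} million, {share:.2f}% of total")
```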