pandas User Data Analysis 2

Tags: data analysis ... 01 1997 unactive df user pandas 0.000

user_analysis

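Part 1: Data type handling

Loading the data

Field meanings:
    user_id: user ID
    order_dt: purchase date
    order_product: number of products purchased
    order_amount: purchase amount

Inspecting the data

Check the data type of each column
Check whether the data contains missing values
Convert order_dt to a datetime type
View the descriptive statistics of the data
    Compute the average number of products per order across all users
    Compute the average amount spent per order across all users
    Add a column to the source data that holds the month: astype("datetime64[M]")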
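The inspection steps listed above can be sketched on a few rows before running them on the full file. This is a minimal sketch: the `sample` frame is a hypothetical stand-in for CDNOW_master.txt, and the month column is built with `dt.to_period("M")`, one possible alternative to the `astype("datetime64[M]")` route named in the list.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of CDNOW_master.txt
sample = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "order_dt": [19970101, 19970112, 19970112, 19970330],
    "order_product": [1, 1, 5, 2],
    "order_amount": [11.77, 12.00, 77.00, 20.76],
})

print(sample.dtypes)        # order_dt is still an integer column at this point
print(sample.isna().sum())  # number of missing values per column

# Parse the YYYYMMDD integers into real datetimes
sample["order_dt"] = pd.to_datetime(sample["order_dt"], format="%Y%m%d")

# Truncate each date to the first day of its month
sample["month"] = sample["order_dt"].dt.to_period("M").dt.to_timestamp()
print(sample)
```

Checking `dtypes` and `isna().sum()` first makes it clear that order_dt needs converting and whether any rows need cleaning before the statistics are computed.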

In [ ]:

# Load the data and name the columns
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# The file has no header row; its columns are separated by runs of whitespace
df = pd.read_csv("./CDNOW_master.txt", header=None,
                 sep=r"\s+", names=["user_id", "order_dt", "order_product", "order_amount"])
df.head()

Out[ ]:

	user_id	order_dt	order_product	order_amount
0	1	19970101	1	11.770
1	2	19970112	1	12.000
2	2	19970112	5	77.000
3	3	19970102	2	20.760
4	3	19970330	2	20.760

In [ ]:

# Convert order_dt to a datetime type, parsing it as YYYYMMDD
df["order_dt"] = pd.to_datetime(df["order_dt"], format="%Y%m%d")

In [ ]:

# Add the month column (each date truncated to the first day of its month)
df["month"] = df["order_dt"].values.astype("datetime64[M]")
df.head(20)

Out[ ]:

	user_id	order_dt	order_product	order_amount	month
0	1	1997-01-01	1	11.770	1997-01-01
1	2	1997-01-12	1	12.000	1997-01-01
2	2	1997-01-12	5	77.000	1997-01-01
3	3	1997-01-02	2	20.760	1997-01-01
4	3	1997-03-30	2	20.760	1997-03-01
5	3	1997-04-02	2	19.540	1997-04-01
6	3	1997-11-15	5	57.450	1997-11-01
7	3	1997-11-25	4	20.960	1997-11-01
8	3	1998-05-28	1	16.990	1998-05-01
9	4	1997-01-01	2	29.330	1997-01-01
10	4	1997-01-18	2	29.730	1997-01-01
11	4	1997-08-02	1	14.960	1997-08-01
12	4	1997-12-12	2	26.480	1997-12-01
13	5	1997-01-01	2	29.330	1997-01-01
14	5	1997-01-14	1	13.970	1997-01-01
15	5	1997-02-04	3	38.900	1997-02-01
16	5	1997-04-11	3	45.550	1997-04-01
17	5	1997-05-31	3	38.710	1997-05-01
18	5	1997-06-16	2	26.140	1997-06-01
19	5	1997-07-22	2	28.140	1997-07-01

In [ ]:

# Average number of products per order across all users: 2.410040
# Average amount spent per order across all users: 35.893648
df.describe()[["order_product", "order_amount"]]

Out[ ]:

	order_product	order_amount
count	69659.000	69659.000
mean	2.410	35.894
std	2.334	36.282
min	1.000	0.000
25%	1.000	14.490
50%	2.000	25.980
75%	3.000	43.700
max	99.000	1286.010

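Part 2: Monthly analysis

Total amount spent by users each month

Plot it as a line chart

Number of products purchased by all users each month

Total number of purchases by all users each month

Number of distinct purchasing users each month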

In [ ]:

# Total amount spent by users each month, plotted as a line chart
df.groupby(by="month")["order_amount"].sum().plot()

Out[ ]:

<AxesSubplot: xlabel='month'>

In [ ]:

# Number of products purchased by all users each month
df.groupby(by="month")["order_product"].sum().plot()

Out[ ]:

<AxesSubplot: xlabel='month'>

In [ ]:

# Total number of purchases (order records) by all users each month
df.groupby(by="month")["user_id"].count()

Out[ ]:

month
1997-01-01     8928
1997-02-01    11272
1997-03-01    11598
1997-04-01     3781
1997-05-01     2895
1997-06-01     3054
1997-07-01     2942
1997-08-01     2320
1997-09-01     2296
1997-10-01     2562
1997-11-01     2750
1997-12-01     2504
1998-01-01     2032
1998-02-01     2026
1998-03-01     2793
1998-04-01     1878
1998-05-01     1985
1998-06-01     2043
Name: user_id, dtype: int64

In [ ]:

# Number of distinct users who purchased each month
df.groupby(by="month")["user_id"].nunique()

Out[ ]:

month
1997-01-01    7846
1997-02-01    9633
1997-03-01    9524
1997-04-01    2822
1997-05-01    2214
1997-06-01    2339
1997-07-01    2180
1997-08-01    1772
1997-09-01    1739
1997-10-01    1839
1997-11-01    2028
1997-12-01    1864
1998-01-01    1537
1998-02-01    1551
1998-03-01    2060
1998-04-01    1437
1998-05-01    1488
1998-06-01    1506
Name: user_id, dtype: int64
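The four monthly figures above can also be collected in a single table with named aggregation. This is a minimal sketch on hypothetical rows; the frame `orders` and the output column names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical orders that already carry a month column
orders = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 3],
    "month": pd.to_datetime(["1997-01-01", "1997-01-01", "1997-01-01",
                             "1997-01-01", "1997-03-01"]),
    "order_product": [1, 1, 5, 2, 2],
    "order_amount": [11.77, 12.00, 77.00, 20.76, 20.76],
})

monthly = orders.groupby("month").agg(
    total_amount=("order_amount", "sum"),     # money spent in the month
    total_products=("order_product", "sum"),  # products bought in the month
    order_count=("user_id", "count"),         # order records, repeats included
    buyer_count=("user_id", "nunique"),       # distinct users
)
print(monthly)
```

Placing `order_count` next to `buyer_count` shows the difference between counting order records and counting distinct users in the same month.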

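Part 3: Per-user spending analysis

Descriptive statistics of each user's total spending and number of purchases

Scatter plot of each user's spending against number of purchases

Histogram of each user's total spending (users whose total is under 1000)

Histogram of each user's total number of products purchased (users whose total is under 100)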

In [ ]:

# Total amount spent by each user
df.groupby(by="user_id")["order_amount"].sum()

Out[ ]:

user_id
1        11.770
2        89.000
3       156.460
4       100.500
5       385.610
          ...  
23566    36.000
23567    20.970
23568   121.700
23569    25.740
23570    94.080
Name: order_amount, Length: 23570, dtype: float64

In [ ]:

# Number of purchases made by each user
df.groupby(by="user_id")["order_amount"].count()

Out[ ]:

user_id
1         1
2         2
3         6
4         4
5        11
         ..
23566     1
23567     1
23568     3
23569     1
23570     2
Name: order_amount, Length: 23570, dtype: int64
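The section heading asks for descriptive statistics of the two per-user series, which the cells above compute separately. A minimal sketch of one way to get them, on hypothetical rows (the `orders` frame and the column names `total_amount` and `order_count` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical order records
orders = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 3, 3],
    "order_amount": [11.77, 12.00, 77.00, 20.76, 20.76, 19.54],
})

# One row per user: total spending and number of purchases
user_stats = orders.groupby("user_id")["order_amount"].agg(
    total_amount="sum",
    order_count="count",
)

# count, mean, std, min, quartiles and max for both columns at once
print(user_stats.describe())
```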

In [ ]:

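# Scatter plot of each user's spending against number of purchases
# Total amount spent by each user
money = df.groupby(by="user_id")["order_amount"].sum()
# Number of purchases made by each user
times = df.groupby(by="user_id")["order_product"].count()
# Plot number of purchases (x) against total spending (y)
plt.scatter(times, money)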

Out[ ]:

<matplotlib.collections.PathCollection at 0x25588bbaed0>

In [ ]:

# Histogram of each user's total spending (users whose total is under 1000)
# Select the column before summing so that non-numeric columns are never aggregated
user_amount = df.groupby(by="user_id")["order_amount"].sum()
user_amount[user_amount < 1000].hist()

Out[ ]:

<AxesSubplot: >

In [ ]:

# Histogram of each user's total number of products purchased (users whose total is under 100)
user_products = df.groupby(by="user_id")["order_product"].sum()
user_products[user_products < 100].hist()

Out[ ]:

<AxesSubplot: >
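To read the numbers behind such a histogram without a plot, the bin counts can be computed directly. This is a minimal sketch on hypothetical per-user totals; the bin count and range are illustrative choices, not values from the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical order records
orders = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 4],
    "order_amount": [10.0, 30.0, 50.0, 1200.0, 95.0],
})

totals = orders.groupby("user_id")["order_amount"].sum()
capped = totals[totals < 1000]  # drop the extreme spender, as in the plots

# Four equal-width bins between 0 and 100
counts, edges = np.histogram(capped, bins=4, range=(0, 100))
print(counts)
print(edges)
```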

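Part 4: Purchase behavior analysis

Distribution of users' first purchase month, with user counts

Plot it as a line chart

Distribution of users' last purchase month, with user counts

Plot it as a line chart

Share of new and returning customers

A user with a single purchase is a new customer
A user with multiple purchases is a returning customer
    Find each user's first and last purchase dates
    agg(['func1', 'func2']): apply the listed aggregations to the grouped result
    Compute the share of new and returning customers

User segmentation

Build a table rfm holding each user's total purchase quantity, total spending, and most recent purchase date
RFM model design
    R is the time elapsed since the customer's most recent transaction
        / np.timedelta64(1, "D"): converts the timedelta to a plain number of days
    F is the total quantity of products the customer purchased. A larger F means the customer trades more often; a smaller F means the customer is less active.
    M is the amount the customer spent. A larger M means a more valuable customer; a smaller M means a less valuable one.
    Add R, F and M to the rfm table
Segment users by value into:
    "important value customer"
    "important retention customer"
    "important win-back customer"
    "important development customer"
    "general value customer"
    "general retention customer"
    "general win-back customer"
    "general development customer"
        Use the existing segmentation function rfm_func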
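The R, F and M definitions above can be sketched on a handful of rows. This is a minimal sketch with hypothetical data: it builds the table with `groupby(...).agg(...)` and takes R from `.dt.days`, which is an alternative to dividing by `np.timedelta64(1, "D")`; the names `orders`, `reference_date` and `rfm_sketch` are illustrative.

```python
import pandas as pd

# Hypothetical order records
orders = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "order_dt": pd.to_datetime(["1997-01-01", "1997-01-12", "1997-02-01", "1997-03-30"]),
    "order_product": [1, 1, 5, 2],
    "order_amount": [11.77, 12.00, 77.00, 20.76],
})

# Treat the latest date in the data as "today"
reference_date = orders["order_dt"].max()

rfm_sketch = orders.groupby("user_id").agg(
    last_order=("order_dt", "max"),
    F=("order_product", "sum"),   # total quantity purchased
    M=("order_amount", "sum"),    # total amount spent
)

# R: whole days between the reference date and the user's last order
rfm_sketch["R"] = (reference_date - rfm_sketch["last_order"]).dt.days
rfm_sketch = rfm_sketch.drop(columns="last_order")[["R", "F", "M"]]
print(rfm_sketch)
```

Under this definition a small R means a recent purchase, so R runs in the opposite direction to F and M.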

In [ ]:

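# Number of users by month of first purchase, plotted as a line chart
# sort_index() puts the months in chronological order before plotting
first_con = df.groupby(by="user_id")["month"].min().value_counts().sort_index().plot()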

In [ ]:

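# Number of users by month of last purchase, plotted as a line chart
# sort_index() puts the months in chronological order before plotting
df.groupby(by="user_id")["month"].max().value_counts().sort_index().plot()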

Out[ ]:

<AxesSubplot: >
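The counts behind these two line charts can be inspected as plain tables. This is a minimal sketch on hypothetical rows; `orders`, `per_user`, `first_counts` and `last_counts` are illustrative names.

```python
import pandas as pd

# Hypothetical order records with a month column
orders = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 3],
    "month": pd.to_datetime(["1997-01-01", "1997-01-01", "1997-03-01",
                             "1997-02-01", "1997-03-01"]),
})

# First and last purchase month of every user
per_user = orders.groupby("user_id")["month"].agg(first_month="min", last_month="max")

# value_counts() orders by frequency, so sort by month for a time axis
first_counts = per_user["first_month"].value_counts().sort_index()
last_counts = per_user["last_month"].value_counts().sort_index()
print(first_counts)
print(last_counts)
```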

In [ ]:

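# Share of new and returning customers
# A user with a single purchase is new; a user with multiple purchases is returning
# How can we tell whether a user purchased only once? Compare the user's purchase dates:
# if the first and last purchase dates are equal, the user purchased only once (new); otherwise the user is returning
# Note: several orders placed on the same day count as a single purchase under this rule

new_old_con_df = df.groupby(by="user_id")["order_dt"].agg(["min","max"])
new_old = new_old_con_df["min"] == new_old_con_df["max"]
new = new_old.value_counts()[True]
old = new_old.value_counts()[False]
new_proportion = new / (new + old)
old_proportion = old / (new + old)

"Returning customers: {:.2f}%".format(old_proportion*100),"New customers: {:.2f}%".format(new_proportion*100)

Out[ ]:

('Returning customers: 48.86%', 'New customers: 51.14%')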
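The same shares can be read off in one step, because the mean of a boolean Series is the fraction of True values. This is a minimal sketch on hypothetical rows; `orders`, `span`, `is_new` and `new_share` are illustrative names.

```python
import pandas as pd

# Hypothetical order records: users 1 and 4 bought once, users 2 and 3 twice
orders = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 3, 4],
    "order_dt": pd.to_datetime(["1997-01-01", "1997-01-12", "1997-02-03",
                                "1997-01-02", "1997-03-30", "1997-03-05"]),
})

span = orders.groupby("user_id")["order_dt"].agg(["min", "max"])
is_new = span["min"] == span["max"]  # True when first and last date coincide

new_share = float(is_new.mean())     # fraction of True values
shares = is_new.value_counts(normalize=True)
print(f"New customers: {new_share:.2%}, returning customers: {1 - new_share:.2%}")
```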

In [ ]:

# Use a pivot table to build rfm: each user's total purchase quantity, total spending, and most recent purchase date
rfm = df.pivot_table(index="user_id", aggfunc={"order_product":"sum", "order_amount": "sum", "order_dt":"max"})

In [ ]:

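# R is the time elapsed since the user's most recent transaction
# R = latest date in df - date of each user's last transaction
# Dividing by np.timedelta64(1, "D") turns the timedelta into a plain number of days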
today = df["order_dt"].max()
rfm["R"] = (today - df.groupby(by="user_id")["order_dt"].max()) / np.timedelta64(1,"D")

In [ ]:

# Drop the order_dt column
rfm.drop("order_dt", axis=1, inplace=True)

In [ ]:

# Rename the columns to M, F and R
rfm.columns = ["M", "F", "R"]
rfm

Out[ ]:

	M	F	R
user_id
1	11.770	1	545.000
2	89.000	6	534.000
3	156.460	16	33.000
4	100.500	7	200.000
5	385.610	29	178.000
...	...	...	...
23566	36.000	2	462.000
23567	20.970	1	462.000
23568	121.700	6	434.000
23569	25.740	2	462.000
23570	94.080	5	461.000

23570 rows × 3 columns

In [ ]:

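# RFM model
def rfm_func(x):
    # x holds one user's M, F and R after the column means have been subtracted,
    # so "1" marks a value at or above the mean and "0" a value below it.
    # R counts days since the last purchase, so "1" for R means less recent than average.
    level = x.map(lambda v: "1" if v >= 0 else "0")
    label = level.R + level.F + level.M
    d = {
        "111": "important value customer",
        "011": "important retention customer",
        "101": "important win-back customer",
        "001": "important development customer",
        "110": "general value customer",
        "010": "general retention customer",
        "100": "general win-back customer",
        "000": "general development customer"
    }
    result = d[label]
    return result

In [ ]:

# Center each column on its mean, then store the label returned by rfm_func in a new label column
rfm["label"] = rfm.apply(lambda x: x - x.mean()).apply(rfm_func, axis=1)
rfm.head()

Out[ ]:

	M	F	R	label
user_id
1	11.770	1	545.000	general win-back customer
2	89.000	6	534.000	general win-back customer
3	156.460	16	33.000	important retention customer
4	100.500	7	200.000	general development customer
5	385.610	29	178.000	important retention customer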
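The labelling can also be done without a row-wise apply. This is a minimal sketch that follows the same convention as rfm_func ("1" when a value is at or above its column mean), run on four hypothetical rows; `rfm_small` and `segments` are illustrative names, and because the means come from these four rows only, the labels need not match the full-data output.

```python
import pandas as pd

# Hypothetical RFM rows
rfm_small = pd.DataFrame(
    {"R": [545.0, 33.0, 200.0, 178.0],
     "F": [1, 16, 7, 29],
     "M": [11.77, 156.46, 100.50, 385.61]},
    index=pd.Index([1, 3, 4, 5], name="user_id"),
)

segments = {
    "111": "important value customer",
    "011": "important retention customer",
    "101": "important win-back customer",
    "001": "important development customer",
    "110": "general value customer",
    "010": "general retention customer",
    "100": "general win-back customer",
    "000": "general development customer",
}

# Compare every column with its own mean in one vectorized step
flags = (rfm_small[["R", "F", "M"]] >= rfm_small[["R", "F", "M"]].mean()).astype(int).astype(str)
codes = flags["R"] + flags["F"] + flags["M"]  # e.g. "011"
rfm_small["label"] = codes.map(segments)
print(rfm_small)
```

Building the three-character code column first makes it easy to inspect which of R, F and M pushed a user into a segment.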

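Part 5: User life cycle

Splitting users into active users and other users

Count each user's purchases in each month
Record whether each user purchased in each month: 1 if the user purchased, otherwise 0
    Key point: the difference between DataFrame.apply and DataFrame.applymap
        applymap: returns a DataFrame of the same shape
            Applies a function to every element of the DataFrame
        apply: returns a Series when the function reduces each row or column to a single value
            Applies a function to each row or each column of the DataFrame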
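The difference between the two methods is easiest to see on a tiny frame. This is a minimal sketch with hypothetical counts; in pandas 2.1 and later `DataFrame.map` is the preferred name for `applymap`, so the sketch picks whichever of the two is available.

```python
import pandas as pd

# Hypothetical purchase counts: rows are users, columns are months
counts = pd.DataFrame({"1997-01": [1, 0, 2], "1997-02": [0, 0, 3]}, index=[1, 2, 3])

# Use DataFrame.map where it exists (pandas 2.1+), otherwise fall back to applymap
elementwise = getattr(pd.DataFrame, "map", None) or pd.DataFrame.applymap

# Element-wise: the function sees one cell at a time, the shape is preserved
flags = elementwise(counts, lambda x: 1 if x >= 1 else 0)

# Row- or column-wise: the function sees a whole Series and may reduce it
col_totals = counts.apply(lambda col: col.sum(), axis=0)  # one value per month
row_totals = counts.apply(lambda row: row.sum(), axis=1)  # one value per user

print(flags)
print(col_totals)
print(row_totals)
```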

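Classifying each user in each month as:

unreg: prospective user (a user with no purchase in the first two months and a first purchase in the third month is unreg for those first two months).
unactive: a month without a purchase that falls after the user's first purchase; the user is inactive in that month.
new: a user who makes their first purchase in the current month is a new user for that month
active: a user who purchases in consecutive months is an active user in those months
return: the first month in which a user purchases again after a gap of n months; the user is a returning customer in that month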
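The five definitions above amount to a small state machine over one user's monthly 0/1 purchase record. This is a minimal sketch of that reading; the function name `monthly_status` and the sample sequences are hypothetical.

```python
def monthly_status(purchased):
    """Label each month of one user's 0/1 purchase sequence."""
    status = []
    for bought in purchased:
        previous = status[-1] if status else None
        if not bought:
            # Before any purchase the user is unreg, afterwards unactive
            status.append("unreg" if previous in (None, "unreg") else "unactive")
        elif previous in (None, "unreg"):
            status.append("new")      # first purchase ever
        elif previous == "unactive":
            status.append("return")   # purchase after a gap
        else:
            status.append("active")   # purchased last month as well
    return status


example = monthly_status([0, 0, 1, 1, 0, 1])
print(example)
# ['unreg', 'unreg', 'new', 'active', 'unactive', 'return']
```

In the sample, the user waits two months, buys for the first time in month three, buys again in month four, skips month five, and comes back in month six.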

In [ ]:

# Count each user's purchases in each month with a pivot table; var: user_month_count_df
user_month_count_df = df.pivot_table(index="user_id",values="order_dt",aggfunc="count", columns="month").fillna(value=0)
user_month_count_df

Out[ ]:

month	1997-01-01	1997-02-01	1997-03-01	1997-04-01	1997-05-01	1997-06-01	1997-07-01	1997-08-01	1997-09-01	1997-10-01	1997-11-01	1997-12-01	1998-01-01	1998-02-01	1998-03-01	1998-04-01	1998-05-01	1998-06-01
user_id
1	1.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
2	2.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
3	1.000	0.000	1.000	1.000	0.000	0.000	0.000	0.000	0.000	0.000	2.000	0.000	0.000	0.000	0.000	0.000	1.000	0.000
4	2.000	0.000	0.000	0.000	0.000	0.000	0.000	1.000	0.000	0.000	0.000	1.000	0.000	0.000	0.000	0.000	0.000	0.000
5	2.000	1.000	0.000	1.000	1.000	1.000	1.000	0.000	1.000	0.000	0.000	2.000	1.000	0.000	0.000	0.000	0.000	0.000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
23566	0.000	0.000	1.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
23567	0.000	0.000	1.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
23568	0.000	0.000	1.000	2.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
23569	0.000	0.000	1.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
23570	0.000	0.000	2.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000

23570 rows × 18 columns

In [ ]:

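# Record whether each user purchased in each month: 1 if purchased, otherwise 0; var: df_purchase
# (from pandas 2.1 on, DataFrame.map is the preferred name for applymap)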
df_purchase = user_month_count_df.applymap(lambda x : 1 if x >=1 else 0 )
df_purchase

Out[ ]:

month	1997-01-01	1997-02-01	1997-03-01	1997-04-01	1997-05-01	1997-06-01	1997-07-01	1997-08-01	1997-09-01	1997-10-01	1997-11-01	1997-12-01	1998-01-01	1998-02-01	1998-03-01	1998-04-01	1998-05-01	1998-06-01
user_id
1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
2	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3	1	0	1	1	0	0	0	0	0	0	1	0	0	0	0	0	1	0
4	1	0	0	0	0	0	0	1	0	0	0	1	0	0	0	0	0	0
5	1	1	0	1	1	1	1	0	1	0	0	1	1	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
23566	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
23567	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
23568	0	0	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0
23569	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
23570	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

23570 rows × 18 columns
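The count-then-flag steps can also be written without an element-wise function. This is a minimal sketch on hypothetical rows: `fill_value=0` fills the empty cells inside `pivot_table`, and a vectorized comparison produces the 0/1 table; `orders`, `counts` and `purchased` are illustrative names.

```python
import pandas as pd

# Hypothetical order records
orders = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "order_dt": pd.to_datetime(["1997-01-03", "1997-01-20", "1997-02-11",
                                "1997-01-05", "1997-03-09"]),
})
orders["month"] = orders["order_dt"].dt.to_period("M").dt.to_timestamp()

# Purchases per user per month; months without orders become 0
counts = orders.pivot_table(index="user_id", columns="month",
                            values="order_dt", aggfunc="count", fill_value=0)

# 1 where the user bought at least once in the month, otherwise 0
purchased = (counts > 0).astype(int)
print(counts)
print(purchased)
```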

In [ ]:

# User life-cycle model: label each month of one user's purchase record
def active_status(data):
    status = []
    for i in range(len(data)):
        # No purchase this month
        if data.iloc[i] == 0:
            if len(status) > 0:
                if status[i-1] == "unreg":
                    status.append("unreg")
                else:
                    status.append("unactive")
            else:
                status.append("unreg")

        # Purchase this month
        else:
            if len(status) == 0:
                status.append("new")
            else:
                if status[i-1] == "unactive":
                    status.append("return")
                elif status[i-1] == "unreg":
                    status.append("new")
                else:
                    status.append("active")
    return status

In [ ]:

# Replace the 0/1 values in df_purchase with new, unactive, ...; var: df_purchase_new
df_purchase_new = df_purchase.apply(active_status,axis=1)
df_purchase_new

Out[ ]:

user_id
1        [new, unactive, unactive, unactive, unactive, ...
2        [new, unactive, unactive, unactive, unactive, ...
3        [new, unactive, return, active, unactive, unac...
4        [new, unactive, unactive, unactive, unactive, ...
5        [new, active, unactive, return, active, active...
                               ...                        
23566    [unreg, unreg, new, unactive, unactive, unacti...
23567    [unreg, unreg, new, unactive, unactive, unacti...
23568    [unreg, unreg, new, active, unactive, unactive...
23569    [unreg, unreg, new, unactive, unactive, unacti...
23570    [unreg, unreg, new, unactive, unactive, unacti...
Length: 23570, dtype: object

In [ ]:

# Convert the values of df_purchase_new to a list, then turn that list into a DataFrame
# that reuses the index and columns of df_purchase
# var: df_purchase_new1
df_purchase_new1 = pd.DataFrame(data=df_purchase_new.to_list(),index=df_purchase.index, columns=df_purchase.columns)
df_purchase_new1.head()

Out[ ]:

month	1997-01-01	1997-02-01	1997-03-01	1997-04-01	1997-05-01	1997-06-01	1997-07-01	1997-08-01	1997-09-01	1997-10-01	1997-11-01	1997-12-01	1998-01-01	1998-02-01	1998-03-01	1998-04-01	1998-05-01	1998-06-01
user_id
1	new	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive
2	new	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive	unactive
3	new	unactive	return	active	unactive	unactive	unactive	unactive	unactive	unactive	return	unactive	unactive	unactive	unactive	unactive	return	unactive
4	new	unactive	unactive	unactive	unactive	unactive	unactive	return	unactive	unactive	unactive	return	unactive	unactive	unactive	unactive	unactive	unactive
5	new	active	unactive	return	active	active	active	unactive	return	unactive	unactive	return	active	unactive	unactive	unactive	unactive	unactive

In [ ]:

# Count the users in each status for every month; var: purchase_status_ct
purchase_status_ct = df_purchase_new1.apply(lambda x : x.value_counts(),axis=0).fillna(0)
purchase_status_ct.head()

Out[ ]:

month	1997-01-01	1997-02-01	1997-03-01	1997-04-01	1997-05-01	1997-06-01	1997-07-01	1997-08-01	1997-09-01	1997-10-01	1997-11-01	1997-12-01	1998-01-01	1998-02-01	1998-03-01	1998-04-01	1998-05-01	1998-06-01
active	0.000	1157.000	1681.000	1773.000	852.000	747.000	746.000	604.000	528.000	532.000	624.000	632.000	512.000	472.000	571.000	518.000	459.000	446.000
new	7846.000	8476.000	7248.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
return	0.000	0.000	595.000	1049.000	1362.000	1592.000	1434.000	1168.000	1211.000	1307.000	1404.000	1232.000	1025.000	1079.000	1489.000	919.000	1029.000	1060.000
unactive	0.000	6689.000	14046.000	20748.000	21356.000	21231.000	21390.000	21798.000	21831.000	21731.000	21542.000	21706.000	22033.000	22019.000	21510.000	22133.000	22082.000	22064.000
unreg	15724.000	7248.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000

In [ ]:

# Transpose so that each row is a month
t_purchase_status_ct = purchase_status_ct.T
t_purchase_status_ct

Out[ ]:

	active	new	return	unactive	unreg
month
1997-01-01	0.000	7846.000	0.000	0.000	15724.000
1997-02-01	1157.000	8476.000	0.000	6689.000	7248.000
1997-03-01	1681.000	7248.000	595.000	14046.000	0.000
1997-04-01	1773.000	0.000	1049.000	20748.000	0.000
1997-05-01	852.000	0.000	1362.000	21356.000	0.000
1997-06-01	747.000	0.000	1592.000	21231.000	0.000
1997-07-01	746.000	0.000	1434.000	21390.000	0.000
1997-08-01	604.000	0.000	1168.000	21798.000	0.000
1997-09-01	528.000	0.000	1211.000	21831.000	0.000
1997-10-01	532.000	0.000	1307.000	21731.000	0.000
1997-11-01	624.000	0.000	1404.000	21542.000	0.000
1997-12-01	632.000	0.000	1232.000	21706.000	0.000
1998-01-01	512.000	0.000	1025.000	22033.000	0.000
1998-02-01	472.000	0.000	1079.000	22019.000	0.000
1998-03-01	571.000	0.000	1489.000	21510.000	0.000
1998-04-01	518.000	0.000	919.000	22133.000	0.000
1998-05-01	459.000	0.000	1029.000	22082.000	0.000
1998-06-01	446.000	0.000	1060.000	22064.000	0.000
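A table of monthly status counts like this one is often easier to compare across months once each row is turned into shares. This is a minimal sketch on hypothetical counts; `status_counts` and `status_share` are illustrative names and the numbers are not taken from the dataset.

```python
import pandas as pd

# Hypothetical status counts: one row per month, one column per status
status_counts = pd.DataFrame(
    {"active": [0, 2], "new": [3, 1], "unreg": [1, 1]},
    index=pd.to_datetime(["1997-01-01", "1997-02-01"]),
)

# Divide every row by its own total so that each month sums to 1
status_share = status_counts.div(status_counts.sum(axis=1), axis=0)
print(status_share)
```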

Source: https://www.cnblogs.com/thankcat/p/17098782.html
