标签：count dates AARRR 留存 user behavior 2.1 id

一点一点学：

学习来源：https://zhuanlan.zhihu.com/p/285676746

数据来源：https://tianchi.aliyun.com/dataset/dataDetail?dataId=649&userId=1

先从1亿数据中选择10万数据进行分析：

数据清洗

方法一：python

先把时间戳转化为标准时间格式：

或者通过下面方式转化：

看一下数据集里的时间区间是否在2017年11月25日至2017年12月3日

删除不符合的时间

查看数据集有99956个数据，并且去重后有99956个数据，说明数据没有重复值

没有空值：

有983位用户，商品总数为64440件，商品类目总数为3128，行为类型为pv、fav、cart、buy4种。

补充：

#求某列有多少种不同的数（相当于去重后count()）：
#df.name.nunique()
#对所有列分别求：
#df.nunique()
#unique()是以数组形式（numpy.ndarray）返回列的所有唯一值（特征的所有唯一值）
#nunique() Return number of unique elements in the object.即返回的是唯一值的个数

因为要计算各天及各时间段，所以要先提取时间,只保留日期的年月日:

需要排除同一天用户多次登录的情况，根据id和日期进行去重：

找到每个用户第一天登陆的日期：

源数据字段较多，求留存率暂时用不上，为了简化，只取一部分：

两个表进行拼接：

计算与第一次登陆相差天数：

先设置展示留存率的格式

转为数字格式

留存率计算for循环，使用索引

或者使用.iloc函数

方法二：mysql

创建user_behavior同时把时间戳格式转为年月日时分秒格式

create table user_behavior
select user_id,item_id,behavior,FROM_UNIXTIME(timestamps,'%Y-%m-%d %H:%i:%S') as times
from userbehavior1

检查时间范围：

DELETE from user_behavior
where times<'2017-11-25 00:00:00' or times>'2017-12-04 00:00:00';

SELECT max(times),min(times) from user_behavior;

查看记录数：

SELECT count(user_id),
count(item_id),
count(category_id),
count(behavior),
count(times)
from user_behavior;

查看每个字段有多少个不同的值：

SELECT count(DISTINCT user_id) as 用户总数,
count(DISTINCT item_id) as 商品总数,
count(DISTINCT category_id) as 商品类目总数,
count(DISTINCT behavior) as 行为数 from user_behavior;

增加一列数据date,只包含年月日：

#ALTER table user_behavior add column dates varchar(20);
#UPDATE user_behavior SET dates=date(times);
select * from user_behavior
limit 5

计算留存率：

SELECT *,
concat(round(100*t.次日留存/t.新用户数,2),'%') as 次日留存率,
concat(round(100*t.三日留存/t.新用户数,2),'%') as 三日留存率,
concat(round(100*t.七日留存/t.新用户数,2),'%') as 七日留存率
from
(SELECT a.dates,
count(DISTINCT b.user_id)as 新用户数,
count(DISTINCT case when DATE_ADD(b.`首次出现日期`,INTERVAL 1 day)=c.dates then c.user_id else null end) as 次日留存,
count(DISTINCT case when DATE_ADD(b.`首次出现日期`,INTERVAL 3 day)=c.dates then c.user_id else null end) as 三日留存,
count(DISTINCT case when DATE_ADD(b.`首次出现日期`,INTERVAL 7 day)=c.dates then c.user_id else null end) as 七日留存
from
(SELECT dates from user_behavior
GROUP BY dates)a
LEFT JOIN
(SELECT user_id,min(dates) as 首次出现日期 from user_behavior
group by user_id) b
on a.dates=b.`首次出现日期`
LEFT JOIN user_behavior c
on b.user_id=c.user_id
GROUP BY a.dates)t
;

留存率mysql计算方法似乎有别的办法？根据python的计算方法类推得到：

select min_day,
sum(case when 天数=1 then 1 else 0 end)/count(distinct user_id) as '次日留存率',
sum(case when 天数=2 then 1 else 0 end)/count(distinct user_id) as '2日留存率',
sum(case when 天数=3 then 1 else 0 end)/count(distinct user_id) as '3日留存率',
sum(case when 天数=4 then 1 else 0 end)/count(distinct user_id) as '4日留存率',
sum(case when 天数=5 then 1 else 0 end)/count(distinct user_id) as '5日留存率',
sum(case when 天数=6 then 1 else 0 end)/count(distinct user_id) as '6日留存率',
sum(case when 天数=7 then 1 else 0 end)/count(distinct user_id) as '7日留存率'

from(
　　select t1.user_id,min_day,normal_day,datediff(normal_day,min_day) as '天数'
　　from
　　　　(
　　　　　　(select distinct user_id, dates as normal_day
　　　　　　from user_behavior) t1
　　　　　　#t1需要排除同一天多次登录的情况
　　　　left join
　　　　　　#t2为每个用户最早登陆日期
　　　　　　(select user_id,min(dates) as min_day
　　　　　　from user_behavior
　　　　　　group by user_id) t2
　　　　on t1.user_id=t2.user_id
　　　　)
　　) t4
group by min_day

思路就是先把排除同一天多次登录的情况的用户时间表找出来，然后找出每个用户最早登陆日期，两张表拼接时间作差作为间隔，然后利用留存率公式计算。

参考：https://www.jianshu.com/p/d76ecbaa97d4

标签：count,dates,AARRR,留存,user,behavior,2.1,id
From： https://www.cnblogs.com/djbwxh/p/16625680.html

AARRR：2.1

数据清洗

方法二：mysql

相关文章

赞助商

阅读排行