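Introduction to pandas.cut()
Continuous values often need to be discretized, or split into "bins", for analysis. Suppose we have data on a group of people in a study and want to sort them into discrete age bins: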
import pandas as pd

ages = [2, 67, 40, 32, 4, 15, 82, 99, 26, 30, 11, 18, 53, 66]
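The goal is to split these ages into the groups 12 and under, 12 to 19, 19 to 61, and over 61. The cut function in pandas handles this kind of binning. We start with its simplest form, which means "compute 3 equal-width bins from the minimum and maximum of the ages list":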
cats = pd.cut(ages, bins=3)
cats
[(1.903, 34.333], (66.667, 99.0], (34.333, 66.667], (1.903, 34.333], (1.903, 34.333], ..., (1.903, 34.333], (1.903, 34.333], (1.903, 34.333], (34.333, 66.667], (34.333, 66.667]]
Length: 14
Categories (3, interval[float64, right]): [(1.903, 34.333] < (34.333, 66.667] < (66.667, 99.0]]
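Looking at the output, pandas returns a special Categorical object that describes the bins computed by pandas.cut. The output has three parts. The first is a list whose entries correspond one-to-one with the elements of ages: 2 falls in (1.903, 34.333], 67 falls in (66.667, 99.0], and so on. The second is the length of the binned list. The third lists the properties of the Categorical object: three equal-width categories, intervals closed on the right, and the range of each one.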
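To see where an edge such as 1.903 comes from, the sketch below asks cut for the edges it used through its retbins option and compares them with a manual computation. The manual part is an assumption for illustration: it relies on the documented behavior that the range is extended by 0.1% to include the extreme values, and it is not meant as a description of the exact implementation.

```python
import numpy as np
import pandas as pd

ages = [2, 67, 40, 32, 4, 15, 82, 99, 26, 30, 11, 18, 53, 66]

# retbins=True also returns the array of bin edges that cut() used.
cats, edges = pd.cut(ages, bins=3, retbins=True)

# Manual reconstruction (an assumption about the mechanics, for illustration):
# evenly spaced edges between min and max, with the lowest edge nudged down
# by 0.1% of the range so that the minimum value falls inside the first bin.
lo, hi = min(ages), max(ages)
manual = np.linspace(lo, hi, 4)
manual[0] -= 0.001 * (hi - lo)

print(edges)
print(manual)
```

The nudge matters because the bins are open on the left: without it, the minimum age 2 would sit exactly on the excluded left edge of the first bin.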
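We can treat this Categorical object like an array of strings naming the bins. Internally it holds a categories array that specifies the distinct category names, and a codes attribute that labels each value in ages with the index of its category: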
cats.codes
array([0, 2, 1, 0, 0, 0, 2, 2, 0, 0, 0, 0, 1, 1], dtype=int8)
This array shows that, in ages, 2 was assigned to bin 0, i.e. (1.903, 34.333], 67 was assigned to bin 2, i.e. (66.667, 99.0], and so on.
cats.categories
IntervalIndex([(1.903, 34.333], (34.333, 66.667], (66.667, 99.0]], dtype='interval[float64, right]')
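As a small sketch of how codes and categories fit together, indexing categories with codes recovers the interval for each age. The name intervals is ours, and the approach assumes no value fell outside the bins (such values get the code -1, which this indexing would not handle correctly).

```python
import pandas as pd

ages = [2, 67, 40, 32, 4, 15, 82, 99, 26, 30, 11, 18, 53, 66]
cats = pd.cut(ages, bins=3)

# Indexing categories with the integer codes recovers one interval per age.
intervals = cats.categories[cats.codes]

for age, code, interval in zip(ages, cats.codes, intervals):
    print(age, code, interval)
```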
pd.value_counts(cats)
(1.903, 34.333] 8
(34.333, 66.667] 3
(66.667, 99.0] 3
dtype: int64
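Note that pd.value_counts(cats) counts how many values fall into each bin of the pandas.cut result.
By default the bins are open on the left and closed on the right (the left edge is excluded, the right edge is included). Pass right=False to change which side is closed: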
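Newer pandas releases deprecate the top-level pd.value_counts (to our knowledge since version 2.1) in favor of the Series method. A minimal equivalent, assuming we also want the counts listed in bin order rather than by frequency, might look like this:

```python
import pandas as pd

ages = [2, 67, 40, 32, 4, 15, 82, 99, 26, 30, 11, 18, 53, 66]
cats = pd.cut(ages, bins=3)

# Wrap the Categorical in a Series, count, then order the rows by bin.
counts = pd.Series(cats).value_counts().sort_index()
print(counts)
```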
cats = pd.cut(ages, bins=3, right=False)
cats
[[2.0, 34.333), [66.667, 99.097), [34.333, 66.667), [2.0, 34.333), [2.0, 34.333), ..., [2.0, 34.333), [2.0, 34.333), [2.0, 34.333), [34.333, 66.667), [34.333, 66.667)]
Length: 14
Categories (3, interval[float64, left]): [[2.0, 34.333) < [34.333, 66.667) < [66.667, 99.097)]
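The effect of right is easiest to see on a value that sits exactly on an edge. The sketch below uses a made-up edge list, [0, 12, 19], and the single value 12:

```python
import pandas as pd

edges = [0, 12, 19]

right_closed = pd.cut([12], bins=edges)              # default: right=True
left_closed = pd.cut([12], bins=edges, right=False)

print(right_closed[0])  # 12 belongs to the bin that ends at 12
print(left_closed[0])   # 12 belongs to the bin that starts at 12
```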
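The bins parameter of cut()
In the previous section we passed the integer 3, which produced 3 equal-width bins. In general, when bins is an integer n, cut() divides the range between the minimum and maximum into n equal-width bins. bins can also be a list of edges such as [a, b, c], in which case cut() assigns values to (a, b] and (b, c], as shown below: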
bins = [0, 12, 19, 61, 100]
cats = pd.cut(ages, bins=bins)
cats
[(0, 12], (61, 100], (19, 61], (19, 61], (0, 12], ..., (19, 61], (0, 12], (12, 19], (19, 61], (61, 100]]
Length: 14
Categories (4, interval[int64, right]): [(0, 12] < (12, 19] < (19, 61] < (61, 100]]
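With explicit edges, values that fall outside every bin are not assigned to one. The sketch below uses three made-up values to show this, along with the include_lowest option, which closes the first bin on the left as well:

```python
import pandas as pd

bins = [0, 12, 19, 61, 100]
values = [0, 5, 150]  # 0 sits on the open left edge, 150 is past the last edge

plain = pd.cut(values, bins=bins)
print(plain)
print(plain.codes)        # -1 marks a value that landed in no bin

with_lowest = pd.cut(values, bins=bins, include_lowest=True)
print(with_lowest)
print(with_lowest.codes)
```

Unassigned values show up as NaN in the result, so it is worth checking for them when the edges are chosen by hand.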
The labels and precision parameters of cut()
labels lets us name the bins ourselves by passing a list or array of names:
labels=['Teens', 'Youth', 'Adult', 'Older']
bins = [0, 12, 19, 61, 100]
cats = pd.cut(ages, bins=bins, labels=labels)
cats
['Teens', 'Older', 'Adult', 'Adult', 'Teens', ..., 'Adult', 'Teens', 'Youth', 'Adult', 'Older']
Length: 14
Categories (4, object): ['Teens' < 'Youth' < 'Adult' < 'Older']
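Two related behaviors are worth a quick sketch: passing labels=False returns plain integer bin indices, and the number of labels has to match the number of bins. The three-item label list below is deliberately wrong and only serves to trigger the error.

```python
import pandas as pd

ages = [2, 67, 40, 32, 4, 15, 82, 99, 26, 30, 11, 18, 53, 66]
bins = [0, 12, 19, 61, 100]

# labels=False returns integer bin indices instead of a Categorical.
idx = pd.cut(ages, bins=bins, labels=False)
print(idx)

# Five edges define four bins, so a list of three labels is rejected.
try:
    pd.cut(ages, bins=bins, labels=['a', 'b', 'c'])
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)
```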
precision sets the number of decimal places used when storing and displaying the bin edges, as shown below:
cats = pd.cut(ages, bins=3, precision=2)
cats
[(1.9, 34.33], (66.67, 99.0], (34.33, 66.67], (1.9, 34.33], (1.9, 34.33], ..., (1.9, 34.33], (1.9, 34.33], (1.9, 34.33], (34.33, 66.67], (34.33, 66.67]]
Length: 14
Categories (3, interval[float64, right]): [(1.9, 34.33] < (34.33, 66.67] < (66.67, 99.0]]
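A short check, under the assumption that we only care about this particular data set, shows that changing precision alters how the edges are written but not which bin each age lands in:

```python
import pandas as pd

ages = [2, 67, 40, 32, 4, 15, 82, 99, 26, 30, 11, 18, 53, 66]

default = pd.cut(ages, bins=3)               # precision defaults to 3
coarse = pd.cut(ages, bins=3, precision=2)

# The first bin is written with different rounding in each result.
print(default.categories[0], coarse.categories[0])

# The bin index assigned to every age is the same in both results.
same_assignment = list(default.codes) == list(coarse.codes)
print(same_assignment)
```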
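Introduction to pandas.qcut()
qcut() is a closely related function that bins data by sample quantiles, i.e. equal-frequency binning (each bin holds roughly the same number of values), whereas cut() does equal-width binning. An example: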
import numpy as np

# The values are random, so each run prints different numbers.
data = pd.DataFrame(np.random.rand(12), columns=['number'])
data
      number
0   0.240733
1   0.143113
2   0.395232
3   0.845771
4   0.714535
5   0.561660
6   0.596787
7   0.921728
8   0.397136
9   0.422436
10  0.322760
11  0.588082
data['cut_group'] = pd.qcut(data['number'], q=4)  # split data into 4 quantile-based bins
data
      number       cut_group
0   0.240733  (0.142, 0.377]
1   0.143113  (0.142, 0.377]
2   0.395232  (0.377, 0.492]
3   0.845771  (0.626, 0.922]
4   0.714535  (0.626, 0.922]
5   0.561660  (0.492, 0.626]
6   0.596787  (0.492, 0.626]
7   0.921728  (0.626, 0.922]
8   0.397136  (0.377, 0.492]
9   0.422436  (0.377, 0.492]
10  0.322760  (0.142, 0.377]
11  0.588082  (0.492, 0.626]
pd.value_counts(data['cut_group'])
(0.142, 0.377] 3
(0.377, 0.492] 3
(0.492, 0.626] 3
(0.626, 0.922] 3
Name: cut_group, dtype: int64
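As the counts show, qcut() split data into 4 bins that each hold the same number of values. As with cut, we can also pass our own quantiles, but the resulting bins then no longer have this property: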
data['cut_group'] = pd.qcut(data['number'], q=[0, 0.3, 0.6, 1])
data
      number       cut_group
0   0.240733  (0.142, 0.396]
1   0.143113  (0.142, 0.396]
2   0.395232  (0.142, 0.396]
3   0.845771  (0.578, 0.922]
4   0.714535  (0.578, 0.922]
5   0.561660  (0.396, 0.578]
6   0.596787  (0.578, 0.922]
7   0.921728  (0.578, 0.922]
8   0.397136  (0.396, 0.578]
9   0.422436  (0.396, 0.578]
10  0.322760  (0.142, 0.396]
11  0.588082  (0.578, 0.922]
pd.value_counts(data['cut_group'])
(0.578, 0.922] 5
(0.142, 0.396] 4
(0.396, 0.578] 3
Name: cut_group, dtype: int64
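To make the contrast between the two functions concrete, the sketch below hard-codes the twelve values printed above (so the result is reproducible) and bins them both ways. The names numbers, by_quantile and by_width are ours.

```python
import pandas as pd

# The twelve values from the example above, hard-coded for reproducibility.
numbers = pd.Series([0.240733, 0.143113, 0.395232, 0.845771, 0.714535, 0.561660,
                     0.596787, 0.921728, 0.397136, 0.422436, 0.322760, 0.588082])

# qcut: the edges are the sample quantiles, so here every bin gets 3 values.
by_quantile, q_edges = pd.qcut(numbers, q=4, retbins=True)
quantiles = numbers.quantile([0, 0.25, 0.5, 0.75, 1]).to_numpy()
print(q_edges)
print(quantiles)
qcut_counts = by_quantile.value_counts().sort_index().tolist()
print(qcut_counts)

# cut: the edges are equally spaced, so the counts follow the data's spread.
by_width = pd.cut(numbers, bins=4)
cut_counts = by_width.value_counts().sort_index().tolist()
print(cut_counts)
```

On this data qcut yields four bins of three values each, while cut yields bins of unequal size because the values are not spread evenly across the range.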
From: https://www.cnblogs.com/ToryRegulus/p/17134837.html