首页 > 其他分享 >Pandas 2.2 中文官方教程和指南(十二)

Pandas 2.2 中文官方教程和指南(十二)

时间:2024-04-24 12:01:33浏览次数:39  
标签:教程 slice self 00 Pandas two 2.2 pandas Out

原文:pandas.pydata.org/docs/

MultiIndex / 高级索引

原文:pandas.pydata.org/docs/user_guide/advanced.html

本节涵盖了使用 MultiIndex 进行索引和其他高级索引功能。

查看数据索引和选择以获取一般索引文档。

警告

在设置操作中返回副本还是引用可能取决于上下文。有时这被称为chained assignment,应该避免。请参阅返回视图与副本。

查看食谱以获取一些��级策略。

层次化索引(MultiIndex)

层次化/多级索引非常令人兴奋,因为它为一些相当复杂的数据分析和操作打开了大门,特别是用于处理更高维数据。本质上,它使您能够在较低维数据结构(如Series(1d)和DataFrame(2d))中存储和操作具有任意数量维度的数据。

在本节中,我们将展示“层次化”索引的确切含义以及它如何与上述和之前章节中描述的所有 pandas 索引功能集成。稍后,在讨论分组和数据透视和重塑时,我们将展示非平凡的应用程序,以说明它如何帮助构建数据进行分析。

查看食谱以获取一些高级策略。

创建一个 MultiIndex(层次化索引)对象

MultiIndex对象是标准Index对象的分层类比,通常在 pandas 对象中存储轴标签。您可以将MultiIndex视为元组数组,其中每个元组都是唯一的。可以从数组列表(使用MultiIndex.from_arrays())、元组数组(使用MultiIndex.from_tuples())、可迭代的交叉集(使用MultiIndex.from_product())或DataFrame(使用MultiIndex.from_frame())创建MultiIndex。当传递元组列表给Index构造函数时,它将尝试返回MultiIndex。以下示例演示了初始化 MultiIndexes 的不同方法。

In [1]: arrays = [
 ...:    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
 ...:    ["one", "two", "one", "two", "one", "two", "one", "two"],
 ...: ]
 ...: 

In [2]: tuples = list(zip(*arrays))

In [3]: tuples
Out[3]: 
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [4]: index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

In [5]: index
Out[5]: 
MultiIndex([('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')],
 names=['first', 'second'])

In [6]: s = pd.Series(np.random.randn(8), index=index)

In [7]: s
Out[7]: 
first  second
bar    one       0.469112
 two      -0.282863
baz    one      -1.509059
 two      -1.135632
foo    one       1.212112
 two      -0.173215
qux    one       0.119209
 two      -1.044236
dtype: float64 

当你想要两个可迭代元素的每个配对时,可以更容易地使用MultiIndex.from_product()方法:

In [8]: iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]

In [9]: pd.MultiIndex.from_product(iterables, names=["first", "second"])
Out[9]: 
MultiIndex([('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')],
 names=['first', 'second']) 

你还可以直接从DataFrame构建MultiIndex,使用方法MultiIndex.from_frame()。这是与MultiIndex.to_frame()互补的方法。

In [10]: df = pd.DataFrame(
 ....:    [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
 ....:    columns=["first", "second"],
 ....: )
 ....: 

In [11]: pd.MultiIndex.from_frame(df)
Out[11]: 
MultiIndex([('bar', 'one'),
 ('bar', 'two'),
 ('foo', 'one'),
 ('foo', 'two')],
 names=['first', 'second']) 

作为一种便利,你可以直接将数组列表传递给SeriesDataFrame以自动构建MultiIndex

In [12]: arrays = [
 ....:    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
 ....:    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
 ....: ]
 ....: 

In [13]: s = pd.Series(np.random.randn(8), index=arrays)

In [14]: s
Out[14]: 
bar  one   -0.861849
 two   -2.104569
baz  one   -0.494929
 two    1.071804
foo  one    0.721555
 two   -0.706771
qux  one   -1.039575
 two    0.271860
dtype: float64

In [15]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

In [16]: df
Out[16]: 
 0         1         2         3
bar one -0.424972  0.567020  0.276232 -1.087401
 two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
 two -0.370647 -1.157892 -1.344312  0.844885
foo one  1.075770 -0.109050  1.643563 -1.469388
 two  0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524  0.413738  0.276662 -0.472035
 two -0.013960 -0.362543 -0.006154 -0.923061 

所有MultiIndex构造函数都接受一个names参数,用于存储级别本身的字符串名称。如果没有提供名称,将分配None

In [17]: df.index.names
Out[17]: FrozenList([None, None]) 

这个索引可以支持 pandas 对象的任何轴,并且索引的级别数量由你决定:

In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index)

In [19]: df
Out[19]: 
first        bar                 baz  ...       foo       qux 
second       one       two       one  ...       two       one       two
A       0.895717  0.805244 -1.206412  ...  1.340309 -1.170299 -0.226169
B       0.410835  0.813850  0.132003  ... -1.187678  1.130127 -1.436737
C      -1.413681  1.607920  1.024180  ... -2.211372  0.974466 -2.006747

[3 rows x 8 columns]

In [20]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
Out[20]: 
first              bar                 baz                 foo 
second             one       two       one       two       one       two
first second 
bar   one    -0.410001 -0.078638  0.545952 -1.219217 -1.226825  0.769804
 two    -1.281247 -0.727707 -0.121306 -0.097883  0.695775  0.341734
baz   one     0.959726 -1.110336 -0.619976  0.149748 -0.732339  0.687738
 two     0.176444  0.403310 -0.154951  0.301624 -2.179861 -1.369849
foo   one    -0.954208  1.462696 -1.743161 -0.826591 -0.345352  1.314232
 two     0.690579  0.995761  2.396780  0.014871  3.357427 -0.317441 

我们“稀疏化”了索引的较高级别,以使控制台输出更加舒适。请注意,可以使用pandas.set_options()中的multi_sparse选项来控制索引的显示方式:

In [21]: with pd.option_context("display.multi_sparse", False):
 ....:    df
 ....: 

值得记住的是,没有什么可以阻止你在轴上使用元组作为原子标签:

In [22]: pd.Series(np.random.randn(8), index=tuples)
Out[22]: 
(bar, one)   -1.236269
(bar, two)    0.896171
(baz, one)   -0.487602
(baz, two)   -0.082240
(foo, one)   -2.182937
(foo, two)    0.380396
(qux, one)    0.084844
(qux, two)    0.432390
dtype: float64 

MultiIndex的重要性在于它可以让你执行分组、选择和重塑操作,如下文和文档的后续部分所述。正如你将在后面的章节中看到的,你可能会发现自己在处理具有分层索引数据时,而不需要显式地创建MultiIndex。然而,在从文件加载数据时,你可能希望在准备数据集时自己生成MultiIndex

重建级别标签

方法get_level_values()将返回特定级别上每个位置的标签向量:

In [23]: index.get_level_values(0)
Out[23]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [24]: index.get_level_values("second")
Out[24]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second') 

使用MultiIndex在轴上进行基本索引

分层索引的一个重要特点是,你可以通过标识数据中的子组的“部分”标签来选择数据。部分选择会在结果中以与在常规 DataFrame 中选择列完全类似的方式“删除”分层索引的级别:

In [25]: df["bar"]
Out[25]: 
second       one       two
A       0.895717  0.805244
B       0.410835  0.813850
C      -1.413681  1.607920

In [26]: df["bar", "one"]
Out[26]: 
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

In [27]: df["bar"]["one"]
Out[27]: 
A    0.895717
B    0.410835
C   -1.413681
Name: one, dtype: float64

In [28]: s["qux"]
Out[28]: 
one   -1.039575
two    0.271860
dtype: float64 

参见使用分层索引进行交叉选择以了解如何在更深层次上进行选择。

定义级别

MultiIndex保留索引的所有定义级别,即使它们实际上没有被使用。在对索引进行切片时,你可能会注意到这一点。例如:

In [29]: df.columns.levels  # original MultiIndex
Out[29]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

In [30]: df[["foo","qux"]].columns.levels  # sliced
Out[30]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']]) 

这样做是为了避免重新计算级别以使切片高度高效。如果你只想看到已使用的级别,可以使用get_level_values() 方法。

In [31]: df[["foo", "qux"]].columns.to_numpy()
Out[31]: 
array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
 dtype=object)

# for a specific level
In [32]: df[["foo", "qux"]].columns.get_level_values(0)
Out[32]: Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first') 

要重建仅使用的级别的MultiIndex,可以使用remove_unused_levels() 方法。

In [33]: new_mi = df[["foo", "qux"]].columns.remove_unused_levels()

In [34]: new_mi.levels
Out[34]: FrozenList([['foo', 'qux'], ['one', 'two']]) 

数据对齐和使用reindex

在轴上具有MultiIndex的不同索引对象之间的操作将按照你的期望进行;数据对齐将与元组索引的索引相同:

In [35]: s + s[:-2]
Out[35]: 
bar  one   -1.723698
 two   -4.209138
baz  one   -0.989859
 two    2.143608
foo  one    1.443110
 two   -1.413542
qux  one         NaN
 two         NaN
dtype: float64

In [36]: s + s[::2]
Out[36]: 
bar  one   -1.723698
 two         NaN
baz  one   -0.989859
 two         NaN
foo  one    1.443110
 two         NaN
qux  one   -2.079150
 two         NaN
dtype: float64 

Series/DataFramesreindex() 方法可以使用另一个MultiIndex,甚至是元组的列表或数组来调用:

In [37]: s.reindex(index[:3])
Out[37]: 
first  second
bar    one      -0.861849
 two      -2.104569
baz    one      -0.494929
dtype: float64

In [38]: s.reindex([("foo", "two"), ("bar", "one"), ("qux", "one"), ("baz", "one")])
Out[38]: 
foo  two   -0.706771
bar  one   -0.861849
qux  one   -1.039575
baz  one   -0.494929
dtype: float64 
```  ## 高级索引与分层索引

在使用`.loc`进行高级索引时,将`MultiIndex`语法上整合在一起有点具有挑战性,但我们已经尽最大努力去做到。一般来说,MultiIndex 键采用元组的形式。例如,以下操作会按照你的期望进行:

```py
In [39]: df = df.T

In [40]: df
Out[40]: 
 A         B         C
first second 
bar   one     0.895717  0.410835 -1.413681
 two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
 two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
 two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
 two    -0.226169 -1.436737 -2.006747

In [41]: df.loc[("bar", "two")]
Out[41]: 
A    0.805244
B    0.813850
C    1.607920
Name: (bar, two), dtype: float64 

请注意,在这个例子中,df.loc['bar', 'two']也可以工作,但这种简写符号通常会导致歧义。

如果你还想使用.loc索引特定列,你必须像这样使用元组:

In [42]: df.loc[("bar", "two"), "A"]
Out[42]: 0.8052440253863785 

你不必通过仅传递元组的第一个元素来指定MultiIndex的所有级别。例如,你可以使用“部分”索引来获取所有第一个级别中包含bar的元素,如下所示:

In [43]: df.loc["bar"]
Out[43]: 
 A         B         C
second 
one     0.895717  0.410835 -1.413681
two     0.805244  0.813850  1.607920 

这是略微冗长的符号df.loc[('bar',),](在这个例子中等同于df.loc['bar',])的快捷方式。

“部分”切片也非常有效。

In [44]: df.loc["baz":"foo"]
Out[44]: 
 A         B         C
first second 
baz   one    -1.206412  0.132003  1.024180
 two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
 two     1.340309 -1.187678 -2.211372 

你可以通过提供元组切片来选择值的“范围”。

In [45]: df.loc[("baz", "two"):("qux", "one")]
Out[45]: 
 A         B         C
first second 
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
 two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466

In [46]: df.loc[("baz", "two"):"foo"]
Out[46]: 
 A         B         C
first second 
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
 two     1.340309 -1.187678 -2.211372 

传递标签或元组列表类似于重新索引:

In [47]: df.loc[[("bar", "two"), ("qux", "one")]]
Out[47]: 
 A         B         C
first second 
bar   two     0.805244  0.813850  1.607920
qux   one    -1.170299  1.130127  0.974466 

注意

在 pandas 中,元组和列表在索引时并非相同对待。元组被解释为一个多级键,而列表用于指定多个键。换句话说,元组水平移动(遍历级别),列表垂直移动(扫描级别)。

重要的是,元组列表索引多个完整的MultiIndex键,而列表元组引用一个级别内的多个值:

In [48]: s = pd.Series(
 ....:    [1, 2, 3, 4, 5, 6],
 ....:    index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]),
 ....: )
 ....: 

In [49]: s.loc[[("A", "c"), ("B", "d")]]  # list of tuples
Out[49]: 
A  c    1
B  d    5
dtype: int64

In [50]: s.loc[(["A", "B"], ["c", "d"])]  # tuple of lists
Out[50]: 
A  c    1
 d    2
B  c    4
 d    5
dtype: int64 

使用切片器

你可以通过提供多个索引器来切片MultiIndex

你可以像通过标签索引一样提供任何选择器,参见按标签选择,包括切片、标签列表、标签和布尔索引器。

你可以使用slice(None)来选择级别的所有内容。你不需要指定所有更深层的级别,它们将被隐含为slice(None)

与往常一样,切片器的两侧都包含在内,因为这是标签索引。

警告

.loc指定器中应指定所有轴,即索引的索引器。有一些模糊的情况,传递的索引器可能被误解为索引两个轴,而不是例如行的MultiIndex

您应该这样做:

df.loc[(slice("A1", "A3"), ...), :]  # noqa: E999 

您不应该这样做:

df.loc[(slice("A1", "A3"), ...)]  # noqa: E999 
In [51]: def mklbl(prefix, n):
 ....:    return ["%s%s" % (prefix, i) for i in range(n)]
 ....: 

In [52]: miindex = pd.MultiIndex.from_product(
 ....:    [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)]
 ....: )
 ....: 

In [53]: micolumns = pd.MultiIndex.from_tuples(
 ....:    [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")], names=["lvl0", "lvl1"]
 ....: )
 ....: 

In [54]: dfmi = (
 ....:    pd.DataFrame(
 ....:        np.arange(len(miindex) * len(micolumns)).reshape(
 ....:            (len(miindex), len(micolumns))
 ....:        ),
 ....:        index=miindex,
 ....:        columns=micolumns,
 ....:    )
 ....:    .sort_index()
 ....:    .sort_index(axis=1)
 ....: )
 ....: 

In [55]: dfmi
Out[55]: 
lvl0           a         b 
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
 D1    5    4    7    6
 C1 D0    9    8   11   10
 D1   13   12   15   14
 C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  237  236  239  238
 C2 D0  241  240  243  242
 D1  245  244  247  246
 C3 D0  249  248  251  250
 D1  253  252  255  254

[64 rows x 4 columns] 

使用切片、列表和标签进行基本的多重索引切片。

In [56]: dfmi.loc[(slice("A1", "A3"), slice(None), ["C1", "C3"]), :]
Out[56]: 
lvl0           a         b 
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
 D1   77   76   79   78
 C3 D0   89   88   91   90
 D1   93   92   95   94
 B1 C1 D0  105  104  107  106
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
 B1 C1 D0  233  232  235  234
 D1  237  236  239  238
 C3 D0  249  248  251  250
 D1  253  252  255  254

[24 rows x 4 columns] 

您可以使用pandas.IndexSlice来使用:实现更自然的语法,而不是使用slice(None)

In [57]: idx = pd.IndexSlice

In [58]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
Out[58]: 
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
 D1   12   14
 C3 D0   24   26
 D1   28   30
 B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
 B1 C1 D0  232  234
 D1  236  238
 C3 D0  248  250
 D1  252  254

[32 rows x 2 columns] 

使用这种方法可以在同一时间在多个轴上执行相当复杂的选择。

In [59]: dfmi.loc["A1", (slice(None), "foo")]
Out[59]: 
lvl0        a    b
lvl1      foo  foo
B0 C0 D0   64   66
 D1   68   70
 C1 D0   72   74
 D1   76   78
 C2 D0   80   82
...       ...  ...
B1 C1 D1  108  110
 C2 D0  112  114
 D1  116  118
 C3 D0  120  122
 D1  124  126

[16 rows x 2 columns]

In [60]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
Out[60]: 
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
 D1   12   14
 C3 D0   24   26
 D1   28   30
 B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
 B1 C1 D0  232  234
 D1  236  238
 C3 D0  248  250
 D1  252  254

[32 rows x 2 columns] 

使用布尔索引器,您可以提供与相关的选择。

In [61]: mask = dfmi[("a", "foo")] > 200

In [62]: dfmi.loc[idx[mask, :, ["C1", "C3"]], idx[:, "foo"]]
Out[62]: 
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
 C3 D0  216  218
 D1  220  222
 B1 C1 D0  232  234
 D1  236  238
 C3 D0  248  250
 D1  252  254 

您还可以指定.locaxis参数来解释单个轴上传递的切片器。

In [63]: dfmi.loc(axis=0)[:, :, ["C1", "C3"]]
Out[63]: 
lvl0           a         b 
lvl1         bar  foo  bah  foo
A0 B0 C1 D0    9    8   11   10
 D1   13   12   15   14
 C3 D0   25   24   27   26
 D1   29   28   31   30
 B1 C1 D0   41   40   43   42
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
 B1 C1 D0  233  232  235  234
 D1  237  236  239  238
 C3 D0  249  248  251  250
 D1  253  252  255  254

[32 rows x 4 columns] 

此外,您可以使用以下方法设置值。

In [64]: df2 = dfmi.copy()

In [65]: df2.loc(axis=0)[:, :, ["C1", "C3"]] = -10

In [66]: df2
Out[66]: 
lvl0           a         b 
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
 D1    5    4    7    6
 C1 D0  -10  -10  -10  -10
 D1  -10  -10  -10  -10
 C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  -10  -10  -10  -10
 C2 D0  241  240  243  242
 D1  245  244  247  246
 C3 D0  -10  -10  -10  -10
 D1  -10  -10  -10  -10

[64 rows x 4 columns] 

您还可以使用可对齐对象的右侧。

In [67]: df2 = dfmi.copy()

In [68]: df2.loc[idx[:, :, ["C1", "C3"]], :] = df2 * 1000

In [69]: df2
Out[69]: 
lvl0              a               b 
lvl1            bar     foo     bah     foo
A0 B0 C0 D0       1       0       3       2
 D1       5       4       7       6
 C1 D0    9000    8000   11000   10000
 D1   13000   12000   15000   14000
 C2 D0      17      16      19      18
...             ...     ...     ...     ...
A3 B1 C1 D1  237000  236000  239000  238000
 C2 D0     241     240     243     242
 D1     245     244     247     246
 C3 D0  249000  248000  251000  250000
 D1  253000  252000  255000  254000

[64 rows x 4 columns] 
```  ### 交叉分析

`DataFrame`的`xs()`方法另外接受一个级别参数,使得在`MultiIndex`的特定级别上选择数据更容易。

```py
In [70]: df
Out[70]: 
 A         B         C
first second 
bar   one     0.895717  0.410835 -1.413681
 two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
 two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
 two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
 two    -0.226169 -1.436737 -2.006747

In [71]: df.xs("one", level="second")
Out[71]: 
 A         B         C
first 
bar    0.895717  0.410835 -1.413681
baz   -1.206412  0.132003  1.024180
foo    1.431256 -0.076467  0.875906
qux   -1.170299  1.130127  0.974466 
# using the slicers
In [72]: df.loc[(slice(None), "one"), :]
Out[72]: 
 A         B         C
first second 
bar   one     0.895717  0.410835 -1.413681
baz   one    -1.206412  0.132003  1.024180
foo   one     1.431256 -0.076467  0.875906
qux   one    -1.170299  1.130127  0.974466 

您还可以通过提供轴参数使用xs在列上进行选择。

In [73]: df = df.T

In [74]: df.xs("one", level="second", axis=1)
Out[74]: 
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466 
# using the slicers
In [75]: df.loc[:, (slice(None), "one")]
Out[75]: 
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466 

xs还允许使用多个键进行选择。

In [76]: df.xs(("one", "bar"), level=("second", "first"), axis=1)
Out[76]: 
first        bar
second       one
A       0.895717
B       0.410835
C      -1.413681 
# using the slicers
In [77]: df.loc[:, ("bar", "one")]
Out[77]: 
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64 

您可以将drop_level=False传递给xs以保留所选的级别。

In [78]: df.xs("one", level="second", axis=1, drop_level=False)
Out[78]: 
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466 

将上述与使用drop_level=True(默认值)的结果进行比较。

In [79]: df.xs("one", level="second", axis=1, drop_level=True)
Out[79]: 
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466 
```  ### 高级重新索引和对齐

在 pandas 对象的`reindex()`和`align()`方法中使用参数`level`对跨级别广播值很有用。例如:

```py
In [80]: midx = pd.MultiIndex(
 ....:    levels=[["zero", "one"], ["x", "y"]], codes=[[1, 1, 0, 0], [1, 0, 1, 0]]
 ....: )
 ....: 

In [81]: df = pd.DataFrame(np.random.randn(4, 2), index=midx)

In [82]: df
Out[82]: 
 0         1
one  y  1.519970 -0.493662
 x  0.600178  0.274230
zero y  0.132885 -0.023688
 x  2.410179  1.450520

In [83]: df2 = df.groupby(level=0).mean()

In [84]: df2
Out[84]: 
 0         1
one   1.060074 -0.109716
zero  1.271532  0.713416

In [85]: df2.reindex(df.index, level=0)
Out[85]: 
 0         1
one  y  1.060074 -0.109716
 x  1.060074 -0.109716
zero y  1.271532  0.713416
 x  1.271532  0.713416

# aligning
In [86]: df_aligned, df2_aligned = df.align(df2, level=0)

In [87]: df_aligned
Out[87]: 
 0         1
one  y  1.519970 -0.493662
 x  0.600178  0.274230
zero y  0.132885 -0.023688
 x  2.410179  1.450520

In [88]: df2_aligned
Out[88]: 
 0         1
one  y  1.060074 -0.109716
 x  1.060074 -0.109716
zero y  1.271532  0.713416
 x  1.271532  0.713416 

使用swaplevel交换级别

swaplevel()方法可以交换两个级别的顺序:

In [89]: df[:5]
Out[89]: 
 0         1
one  y  1.519970 -0.493662
 x  0.600178  0.274230
zero y  0.132885 -0.023688
 x  2.410179  1.450520

In [90]: df[:5].swaplevel(0, 1, axis=0)
Out[90]: 
 0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520 

使用reorder_levels重新排序级别

reorder_levels()方法推广了swaplevel方法,允许您在一步中对分层索引级别进行排列:

In [91]: df[:5].reorder_levels([1, 0], axis=0)
Out[91]: 
 0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520 
```  ### 重命名`Index`或`MultiIndex`的名称

`rename()`方法用于重命名`MultiIndex`的标签,通常用于重命名`DataFrame`的列。`rename`的`columns`参数允许指定一个包含您希望重命名的列的字典。

```py
In [92]: df.rename(columns={0: "col0", 1: "col1"})
Out[92]: 
 col0      col1
one  y  1.519970 -0.493662
 x  0.600178  0.274230
zero y  0.132885 -0.023688
 x  2.410179  1.450520 

这种方法也可以用来重命名DataFrame主索引的特定标签。

In [93]: df.rename(index={"one": "two", "y": "z"})
Out[93]: 
 0         1
two  z  1.519970 -0.493662
 x  0.600178  0.274230
zero z  0.132885 -0.023688
 x  2.410179  1.450520 

rename_axis() 方法用于重命名 IndexMultiIndex 的名称。特别是,可以指定 MultiIndex 级别的名称,如果稍后使用 reset_index() 将值从 MultiIndex 移动到列中,则这很有用。

In [94]: df.rename_axis(index=["abc", "def"])
Out[94]: 
 0         1
abc  def 
one  y    1.519970 -0.493662
 x    0.600178  0.274230
zero y    0.132885 -0.023688
 x    2.410179  1.450520 

请注意,DataFrame 的列是一个索引,因此使用 rename_axiscolumns 参数将更改该索引的名称。

In [95]: df.rename_axis(columns="Cols").columns
Out[95]: RangeIndex(start=0, stop=2, step=1, name='Cols') 

renamerename_axis 都支持指定字典、Series 或映射函数,将标签/名称映射到新值。

当直接使用 Index 对象而不是通过 DataFrame 进行操作时,可以使用 Index.set_names() 来更改名称。

In [96]: mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])

In [97]: mi.names
Out[97]: FrozenList(['x', 'y'])

In [98]: mi2 = mi.rename("new name", level=0)

In [99]: mi2
Out[99]: 
MultiIndex([(1, 'a'),
 (1, 'b'),
 (2, 'a'),
 (2, 'b')],
 names=['new name', 'y']) 

无法通过级别设置 MultiIndex 的名称。

In [100]: mi.levels[0].name = "name via level"
---------------------------------------------------------------------------
RuntimeError  Traceback (most recent call last)
Cell In[100], line 1
----> 1 mi.levels[0].name = "name via level"

File ~/work/pandas/pandas/pandas/core/indexes/base.py:1690, in Index.name(self, value)
  1686 @name.setter
  1687 def name(self, value: Hashable) -> None:
  1688     if self._no_setting_name:
  1689         # Used in MultiIndex.levels to avoid silently ignoring name updates.
-> 1690         raise RuntimeError(
  1691             "Cannot set name on a level of a MultiIndex. Use "
  1692             "'MultiIndex.set_names' instead."
  1693         )
  1694     maybe_extract_name(value, None, type(self))
  1695     self._name = value

RuntimeError: Cannot set name on a level of a MultiIndex. Use 'MultiIndex.set_names' instead. 

使用 Index.set_names() 替代。

MultiIndex 进行排序

要有效地对 MultiIndex 对象进行索引和切片,它们需要被排序。与任何索引一样,您可以使用 sort_index()

In [101]: import random

In [102]: random.shuffle(tuples)

In [103]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))

In [104]: s
Out[104]: 
qux  two    0.206053
bar  one   -0.251905
foo  one   -2.213588
qux  one    1.063327
foo  two    1.266143
baz  two    0.299368
bar  two   -0.863838
baz  one    0.408204
dtype: float64

In [105]: s.sort_index()
Out[105]: 
bar  one   -0.251905
 two   -0.863838
baz  one    0.408204
 two    0.299368
foo  one   -2.213588
 two    1.266143
qux  one    1.063327
 two    0.206053
dtype: float64

In [106]: s.sort_index(level=0)
Out[106]: 
bar  one   -0.251905
 two   -0.863838
baz  one    0.408204
 two    0.299368
foo  one   -2.213588
 two    1.266143
qux  one    1.063327
 two    0.206053
dtype: float64

In [107]: s.sort_index(level=1)
Out[107]: 
bar  one   -0.251905
baz  one    0.408204
foo  one   -2.213588
qux  one    1.063327
bar  two   -0.863838
baz  two    0.299368
foo  two    1.266143
qux  two    0.206053
dtype: float64 

如果 MultiIndex 的级别已命名,还可以向 sort_index 传递级别名称。

In [108]: s.index = s.index.set_names(["L1", "L2"])

In [109]: s.sort_index(level="L1")
Out[109]: 
L1   L2 
bar  one   -0.251905
 two   -0.863838
baz  one    0.408204
 two    0.299368
foo  one   -2.213588
 two    1.266143
qux  one    1.063327
 two    0.206053
dtype: float64

In [110]: s.sort_index(level="L2")
Out[110]: 
L1   L2 
bar  one   -0.251905
baz  one    0.408204
foo  one   -2.213588
qux  one    1.063327
bar  two   -0.863838
baz  two    0.299368
foo  two    1.266143
qux  two    0.206053
dtype: float64 

在更高维度的对象上,如果它们具有 MultiIndex,则可以按级别对任何其他轴进行排序:

In [111]: df.T.sort_index(level=1, axis=1)
Out[111]: 
 one      zero       one      zero
 x         x         y         y
0  0.600178  2.410179  1.519970  0.132885
1  0.274230  1.450520 -0.493662 -0.023688 

即使数据未排序,索引也会起作用,但效率会相当低(并显示 PerformanceWarning)。它还将返回数据的副本而不是视��:

In [112]: dfm = pd.DataFrame(
 .....:    {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": np.random.rand(4)}
 .....: )
 .....: 

In [113]: dfm = dfm.set_index(["jim", "joe"])

In [114]: dfm
Out[114]: 
 jolie
jim joe 
0   x    0.490671
 x    0.120248
1   z    0.537020
 y    0.110968

In [115]: dfm.loc[(1, 'z')]
Out[115]: 
 jolie
jim joe 
1   z    0.53702 

此外,如果尝试对未完全按字典顺序排序的内容进行索引,可能会引发:

In [116]: dfm.loc[(0, 'y'):(1, 'z')]
---------------------------------------------------------------------------
UnsortedIndexError  Traceback (most recent call last)
Cell In[116], line 1
----> 1 dfm.loc[(0, 'y'):(1, 'z')]

File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
  1189 maybe_callable = com.apply_if_callable(key, self.obj)
  1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
  1409 if isinstance(key, slice):
  1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
  1412 elif com.is_bool_indexer(key):
  1413     return self._getbool_axis(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
  1440     return obj.copy(deep=False)
  1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
  1445 if isinstance(indexer, slice):
  1446     return self.obj._slice(indexer, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
  6618 def slice_indexer(
  6619     self,
  6620     start: Hashable | None = None,
  6621     end: Hashable | None = None,
  6622     step: int | None = None,
  6623 ) -> slice:
  6624  """
  6625 Compute the slice indexer for input labels and step.
  6626  
 (...)
  6660 slice(1, 3, None)
  6661 """
-> 6662     start_slice, end_slice = self.slice_locs(start, end, step=step)
  6664     # return a slice
  6665     if not is_scalar(start_slice):

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2904, in MultiIndex.slice_locs(self, start, end, step)
  2852  """
  2853 For an ordered MultiIndex, compute the slice locations for input
  2854 labels.
 (...)
  2900 sequence of such.
  2901 """
  2902 # This function adds nothing to its parent implementation (the magic
  2903 # happens in get_slice_bound method), but it adds meaningful doc.
-> 2904 return super().slice_locs(start, end, step)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step)
  6877 start_slice = None
  6878 if start is not None:
-> 6879     start_slice = self.get_slice_bound(start, "left")
  6880 if start_slice is None:
  6881     start_slice = 0

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2848, in MultiIndex.get_slice_bound(self, label, side)
  2846 if not isinstance(label, tuple):
  2847     label = (label,)
-> 2848 return self._partial_tup_index(label, side=side)

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2908, in MultiIndex._partial_tup_index(self, tup, side)
  2906 def _partial_tup_index(self, tup: tuple, side: Literal["left", "right"] = "left"):
  2907     if len(tup) > self._lexsort_depth:
-> 2908         raise UnsortedIndexError(
  2909             f"Key length ({len(tup)}) was greater than MultiIndex lexsort depth "
  2910             f"({self._lexsort_depth})"
  2911         )
  2913     n = len(tup)
  2914     start, end = 0, len(self)

UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)' 

is_monotonic_increasing() 方法在 MultiIndex 上显示索引是否已排序:

In [117]: dfm.index.is_monotonic_increasing
Out[117]: False 
In [118]: dfm = dfm.sort_index()

In [119]: dfm
Out[119]: 
 jolie
jim joe 
0   x    0.490671
 x    0.120248
1   y    0.110968
 z    0.537020

In [120]: dfm.index.is_monotonic_increasing
Out[120]: True 

现在选择功能按预期工作。

In [121]: dfm.loc[(0, "y"):(1, "z")]
Out[121]: 
 jolie
jim joe 
1   y    0.110968
 z    0.537020 

取方法

与 NumPy ndarrays 类似,pandas 的 IndexSeriesDataFrame 也提供了 take() 方法,该方法检索给定索引处给定轴上的元素。给定的索引必须是整数索引位置的列表或 ndarray。take 还将接受负整数作为相对于对象末尾的位置。

In [122]: index = pd.Index(np.random.randint(0, 1000, 10))

In [123]: index
Out[123]: Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')

In [124]: positions = [0, 9, 3]

In [125]: index[positions]
Out[125]: Index([214, 329, 567], dtype='int64')

In [126]: index.take(positions)
Out[126]: Index([214, 329, 567], dtype='int64')

In [127]: ser = pd.Series(np.random.randn(10))

In [128]: ser.iloc[positions]
Out[128]: 
0   -0.179666
9    1.824375
3    0.392149
dtype: float64

In [129]: ser.take(positions)
Out[129]: 
0   -0.179666
9    1.824375
3    0.392149
dtype: float64 

对于 DataFrames,给定的索引应该是指定行或列位置的 1d 列表或 ndarray。

In [130]: frm = pd.DataFrame(np.random.randn(5, 3))

In [131]: frm.take([1, 4, 3])
Out[131]: 
 0         1         2
1 -1.237881  0.106854 -1.276829
4  0.629675 -1.425966  1.857704
3  0.979542 -1.633678  0.615855

In [132]: frm.take([0, 2], axis=1)
Out[132]: 
 0         2
0  0.595974  0.601544
1 -1.237881 -1.276829
2 -0.767101  1.499591
3  0.979542  0.615855
4  0.629675  1.857704 

需要注意的是,pandas 对象上的 take 方法不适用于布尔索引,并且可能返回意外结果。

In [133]: arr = np.random.randn(10)

In [134]: arr.take([False, False, True, True])
Out[134]: array([-1.1935, -1.1935,  0.6775,  0.6775])

In [135]: arr[[0, 1]]
Out[135]: array([-1.1935,  0.6775])

In [136]: ser = pd.Series(np.random.randn(10))

In [137]: ser.take([False, False, True, True])
Out[137]: 
0    0.233141
0    0.233141
1   -0.223540
1   -0.223540
dtype: float64

In [138]: ser.iloc[[0, 1]]
Out[138]: 
0    0.233141
1   -0.223540
dtype: float64 

最后,关于性能的一点小提示,因为 take 方法处理范围较窄的输入,它可以提供比花式索引快得多的性能。

In [139]: arr = np.random.randn(10000, 5)

In [140]: indexer = np.arange(10000)

In [141]: random.shuffle(indexer)

In [142]: %timeit arr[indexer]
 .....: %timeit arr.take(indexer, axis=0)
 .....: 
257 us +- 4.44 us per loop (mean +- std. dev. of 7 runs, 1,000 loops each)
79.7 us +- 1.15 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each) 
In [143]: ser = pd.Series(arr[:, 0])

In [144]: %timeit ser.iloc[indexer]
 .....: %timeit ser.take(indexer)
 .....: 
144 us +- 3.69 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
129 us +- 2 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each) 

索引类型

我们在前面的章节中已经广泛讨论了MultiIndex。关于DatetimeIndexPeriodIndex的文档在这里显示 timeseries.html#timeseries-overview,关于TimedeltaIndex的文档在这里找到 timedeltas.html#timedeltas-index。

在接下来的子章节中,我们将重点介绍一些其他索引类型。

CategoricalIndex

CategoricalIndex是一种支持具有重复索引的索引类型。这是围绕Categorical的容器,允许高效地索引和存储具有大量重复元素的索引。

In [145]: from pandas.api.types import CategoricalDtype

In [146]: df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")})

In [147]: df["B"] = df["B"].astype(CategoricalDtype(list("cab")))

In [148]: df
Out[148]: 
 A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a

In [149]: df.dtypes
Out[149]: 
A       int64
B    category
dtype: object

In [150]: df["B"].cat.categories
Out[150]: Index(['c', 'a', 'b'], dtype='object') 

设置索引将创建一个CategoricalIndex

In [151]: df2 = df.set_index("B")

In [152]: df2.index
Out[152]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B') 

使用__getitem__/.iloc/.loc进行索引类似于具有重复项的Index。索引器必须在类别中,否则操作将引发KeyError

In [153]: df2.loc["a"]
Out[153]: 
 A
B 
a  0
a  1
a  5 

在索引后,CategoricalIndex保留的:

In [154]: df2.loc["a"].index
Out[154]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B') 

对索引进行排序将按照类别的顺序排序(回想我们使用CategoricalDtype(list('cab'))创建了索引,所以排序顺序是cab)。

In [155]: df2.sort_index()
Out[155]: 
 A
B 
c  4
a  0
a  1
a  5
b  2
b  3 

对索引进行分组操作将保留索引的性质。

In [156]: df2.groupby(level=0, observed=True).sum()
Out[156]: 
 A
B 
c  4
a  6
b  5

In [157]: df2.groupby(level=0, observed=True).sum().index
Out[157]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B') 

重新索引操作将根据传递的索引器的类型返回一个结果索引。传递列表将返回一个普通的Index;使用Categorical进行索引将返回一个CategoricalIndex,根据传递的Categorical dtype 的类别进行索引。这允许任意索引这些,即使值不在类别中,类似于如何重新索引任何pandas 索引。

In [158]: df3 = pd.DataFrame(
 .....:    {"A": np.arange(3), "B": pd.Series(list("abc")).astype("category")}
 .....: )
 .....: 

In [159]: df3 = df3.set_index("B")

In [160]: df3
Out[160]: 
 A
B 
a  0
b  1
c  2 
In [161]: df3.reindex(["a", "e"])
Out[161]: 
 A
B 
a  0.0
e  NaN

In [162]: df3.reindex(["a", "e"]).index
Out[162]: Index(['a', 'e'], dtype='object', name='B')

In [163]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe")))
Out[163]: 
 A
B 
a  0.0
e  NaN

In [164]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe"))).index
Out[164]: CategoricalIndex(['a', 'e'], categories=['a', 'b', 'e'], ordered=False, dtype='category', name='B') 

警告

CategoricalIndex进行重塑和比较操作必须具有相同的类别,否则将引发TypeError

In [165]: df4 = pd.DataFrame({"A": np.arange(2), "B": list("ba")})

In [166]: df4["B"] = df4["B"].astype(CategoricalDtype(list("ab")))

In [167]: df4 = df4.set_index("B")

In [168]: df4.index
Out[168]: CategoricalIndex(['b', 'a'], categories=['a', 'b'], ordered=False, dtype='category', name='B')

In [169]: df5 = pd.DataFrame({"A": np.arange(2), "B": list("bc")})

In [170]: df5["B"] = df5["B"].astype(CategoricalDtype(list("bc")))

In [171]: df5 = df5.set_index("B")

In [172]: df5.index
Out[172]: CategoricalIndex(['b', 'c'], categories=['b', 'c'], ordered=False, dtype='category', name='B') 
In [173]: pd.concat([df4, df5])
Out[173]: 
 A
B 
b  0
a  1
b  0
c  1 
```  ### RangeIndex

`RangeIndex`是`DataFrame`和`Series`对象的默认索引的子类。`RangeIndex`是`Index`的优化版本,可以表示一个单调有序集合。这类似于 Python 的[range 类型](https://docs.python.org/3/library/stdtypes.html#typesseq-range)。`RangeIndex`将始终具有`int64` dtype。

```py
In [174]: idx = pd.RangeIndex(5)

In [175]: idx
Out[175]: RangeIndex(start=0, stop=5, step=1) 

RangeIndex是所有DataFrameSeries对象的默认索引:

In [176]: ser = pd.Series([1, 2, 3])

In [177]: ser.index
Out[177]: RangeIndex(start=0, stop=3, step=1)

In [178]: df = pd.DataFrame([[1, 2], [3, 4]])

In [179]: df.index
Out[179]: RangeIndex(start=0, stop=2, step=1)

In [180]: df.columns
Out[180]: RangeIndex(start=0, stop=2, step=1) 

RangeIndex的行为类似于具有int64 dtype 的Index,对RangeIndex的操作,其结果无法由RangeIndex表示,但应具有整数 dtype,将转换为具有int64Index。例如:

In [181]: idx[[0, 2]]
Out[181]: Index([0, 2], dtype='int64') 
```  ### 区间索引

`IntervalIndex`以及其自己的 dtype,`IntervalDtype`以及`Interval`标量类型,为 pandas 提供了区间表示的一流支持。

`IntervalIndex`允许一些独特的索引,并且也被用作`cut()`和`qcut()`中的返回类型。

#### 使用`IntervalIndex`进行索引

`IntervalIndex`可以作为`Series`和`DataFrame`的索引使用。

```py
In [182]: df = pd.DataFrame(
 .....:    {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
 .....: )
 .....: 

In [183]: df
Out[183]: 
 A
(0, 1]  1
(1, 2]  2
(2, 3]  3
(3, 4]  4 

通过.loc进行基于标签的索引,沿着区间的边缘工作如你所期望的,选择特定的区间。

In [184]: df.loc[2]
Out[184]: 
A    2
Name: (1, 2], dtype: int64

In [185]: df.loc[[2, 3]]
Out[185]: 
 A
(1, 2]  2
(2, 3]  3 

���果选择一个在区间内的标签,这也将选择该区间。

In [186]: df.loc[2.5]
Out[186]: 
A    3
Name: (2, 3], dtype: int64

In [187]: df.loc[[2.5, 3.5]]
Out[187]: 
 A
(2, 3]  3
(3, 4]  4 

使用Interval进行选择将仅返回精确匹配。

In [188]: df.loc[pd.Interval(1, 2)]
Out[188]: 
A    2
Name: (1, 2], dtype: int64 

尝试选择不完全包含在IntervalIndex中的Interval将引发KeyError

In [189]: df.loc[pd.Interval(0.5, 2.5)]
---------------------------------------------------------------------------
KeyError  Traceback (most recent call last)
Cell In[189], line 1
----> 1 df.loc[pd.Interval(0.5, 2.5)]

File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
  1189 maybe_callable = com.apply_if_callable(key, self.obj)
  1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1431, in _LocIndexer._getitem_axis(self, key, axis)
  1429 # fall thru to straight lookup
  1430 self._validate_key(key, axis)
-> 1431 return self._get_label(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1381, in _LocIndexer._get_label(self, label, axis)
  1379 def _get_label(self, label, axis: AxisInt):
  1380     # GH#5567 this will fail if the label is not present in the axis.
-> 1381     return self.obj.xs(label, axis=axis)

File ~/work/pandas/pandas/pandas/core/generic.py:4301, in NDFrame.xs(self, key, axis, level, drop_level)
  4299             new_index = index[loc]
  4300 else:
-> 4301     loc = index.get_loc(key)
  4303     if isinstance(loc, np.ndarray):
  4304         if loc.dtype == np.bool_:

File ~/work/pandas/pandas/pandas/core/indexes/interval.py:678, in IntervalIndex.get_loc(self, key)
  676 matches = mask.sum()
  677 if matches == 0:
--> 678     raise KeyError(key)
  679 if matches == 1:
  680     return mask.argmax()

KeyError: Interval(0.5, 2.5, closed='right') 

可以使用overlaps()方法创建一个布尔索引器,选择与给定Interval重叠的所有Intervals

In [190]: idxr = df.index.overlaps(pd.Interval(0.5, 2.5))

In [191]: idxr
Out[191]: array([ True,  True,  True, False])

In [192]: df[idxr]
Out[192]: 
 A
(0, 1]  1
(1, 2]  2
(2, 3]  3 

使用cutqcut对数据进行分箱

cut()qcut()都返回一个Categorical对象,它们创建的区间被存储为其.categories属性中的IntervalIndex

In [193]: c = pd.cut(range(4), bins=2)

In [194]: c
Out[194]: 
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]

In [195]: c.categories
Out[195]: IntervalIndex([(-0.003, 1.5], (1.5, 3.0]], dtype='interval[float64, right]') 

cut()还接受一个IntervalIndex作为其bins参数,这使得一种有用的 pandas 习惯成为可能。首先,我们使用一些数据和bins设置为一个固定数字调用cut(),以生成区间。然后,我们将.categories的值作为后续调用cut()bins参数传递,提供新的数据,这些数据将被分到相同的区间中。

In [196]: pd.cut([0, 3, 5, 1], bins=c.categories)
Out[196]: 
[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]] 

任何落在所有区间之外的值将被赋予一个NaN值。

生成区间的范围

如果我们需要定期频率的区间,我们可以使用interval_range()函数使用startendperiods的各种组合创建一个IntervalIndexinterval_range的默认频率是数值区间为 1,日期时间区间为日历日:

In [197]: pd.interval_range(start=0, end=5)
Out[197]: IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]], dtype='interval[int64, right]')

In [198]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4)
Out[198]: 
IntervalIndex([(2017-01-01 00:00:00, 2017-01-02 00:00:00],
 (2017-01-02 00:00:00, 2017-01-03 00:00:00],
 (2017-01-03 00:00:00, 2017-01-04 00:00:00],
 (2017-01-04 00:00:00, 2017-01-05 00:00:00]],
 dtype='interval[datetime64[ns], right]')

In [199]: pd.interval_range(end=pd.Timedelta("3 days"), periods=3)
Out[199]: 
IntervalIndex([(0 days 00:00:00, 1 days 00:00:00],
 (1 days 00:00:00, 2 days 00:00:00],
 (2 days 00:00:00, 3 days 00:00:00]],
 dtype='interval[timedelta64[ns], right]') 

freq参数可用于指定非默认频率,并且可以利用各种频率别名与类似于日期时间的间隔:

In [200]: pd.interval_range(start=0, periods=5, freq=1.5)
Out[200]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]], dtype='interval[float64, right]')

In [201]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4, freq="W")
Out[201]: 
IntervalIndex([(2017-01-01 00:00:00, 2017-01-08 00:00:00],
 (2017-01-08 00:00:00, 2017-01-15 00:00:00],
 (2017-01-15 00:00:00, 2017-01-22 00:00:00],
 (2017-01-22 00:00:00, 2017-01-29 00:00:00]],
 dtype='interval[datetime64[ns], right]')

In [202]: pd.interval_range(start=pd.Timedelta("0 days"), periods=3, freq="9h")
Out[202]: 
IntervalIndex([(0 days 00:00:00, 0 days 09:00:00],
 (0 days 09:00:00, 0 days 18:00:00],
 (0 days 18:00:00, 1 days 03:00:00]],
 dtype='interval[timedelta64[ns], right]') 

此外,closed参数可用于指定区间的哪一侧是闭合的。默认情况下,区间在右侧是闭合的。

In [203]: pd.interval_range(start=0, end=4, closed="both")
Out[203]: IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]], dtype='interval[int64, both]')

In [204]: pd.interval_range(start=0, end=4, closed="neither")
Out[204]: IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)], dtype='interval[int64, neither]') 

指定startendperiods将生成从startend的一系列均匀间隔的区间,包括IntervalIndex中的periods个元素:

In [205]: pd.interval_range(start=0, end=6, periods=4)
Out[205]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]], dtype='interval[float64, right]')

In [206]: pd.interval_range(pd.Timestamp("2018-01-01"), pd.Timestamp("2018-02-28"), periods=3)
Out[206]: 
IntervalIndex([(2018-01-01 00:00:00, 2018-01-20 08:00:00],
 (2018-01-20 08:00:00, 2018-02-08 16:00:00],
 (2018-02-08 16:00:00, 2018-02-28 00:00:00]],
 dtype='interval[datetime64[ns], right]') 

杂项索引常见问题解答

整数索引

使用整数轴标签进行基于标签的索引是一个棘手的问题。在邮件列表和科学 Python 社区的各个成员中已经广泛讨论过这个问题。在 pandas 中,我们的一般观点是标签比整数位置更重要。因此,只有使用整数轴索引时,才能使用标签为基础的索引,例如.loc等标准工具。以下代码将生成异常:

In [207]: s = pd.Series(range(5))

In [208]: s[-1]
---------------------------------------------------------------------------
ValueError  Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/range.py:413, in RangeIndex.get_loc(self, key)
  412 try:
--> 413     return self._range.index(new_key)
  414 except ValueError as err:

ValueError: -1 is not in range

The above exception was the direct cause of the following exception:

KeyError  Traceback (most recent call last)
Cell In[208], line 1
----> 1 s[-1]

File ~/work/pandas/pandas/pandas/core/series.py:1121, in Series.__getitem__(self, key)
  1118     return self._values[key]
  1120 elif key_is_scalar:
-> 1121     return self._get_value(key)
  1123 # Convert generator to list before going through hashable part
  1124 # (We will iterate through the generator there to check for slices)
  1125 if is_iterator(key):

File ~/work/pandas/pandas/pandas/core/series.py:1237, in Series._get_value(self, label, takeable)
  1234     return self._values[label]
  1236 # Similar to Index.get_value, but we do not fall back to positional
-> 1237 loc = self.index.get_loc(label)
  1239 if is_integer(loc):
  1240     return self._values[loc]

File ~/work/pandas/pandas/pandas/core/indexes/range.py:415, in RangeIndex.get_loc(self, key)
  413         return self._range.index(new_key)
  414     except ValueError as err:
--> 415         raise KeyError(key) from err
  416 if isinstance(key, Hashable):
  417     raise KeyError(key)

KeyError: -1

In [209]: df = pd.DataFrame(np.random.randn(5, 4))

In [210]: df
Out[210]: 
 0         1         2         3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703  1.347213 -0.243487  0.514704
2  1.162969 -0.287725 -0.179734  0.993962
3 -0.212673  0.909872 -0.733333 -0.349893
4  0.456434 -0.306735  0.553396  0.166221

In [211]: df.loc[-2:]
Out[211]: 
 0         1         2         3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703  1.347213 -0.243487  0.514704
2  1.162969 -0.287725 -0.179734  0.993962
3 -0.212673  0.909872 -0.733333 -0.349893
4  0.456434 -0.306735  0.553396  0.166221 

这个故意的决定是为了防止歧义和微妙的错误(许多用户报告在 API 更改为停止“回退”到基于位置的索引时发现错误)。

非单调索引需要精确匹配

如果SeriesDataFrame的索引是单调递增或递减的,那么基于标签的切片的边界可以超出索引范围,就像切片索引普通的 Python list一样。可以使用is_monotonic_increasing()is_monotonic_decreasing()属性测试索引的单调性。

In [212]: df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=["data"], data=list(range(5)))

In [213]: df.index.is_monotonic_increasing
Out[213]: True

# no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4:
In [214]: df.loc[0:4, :]
Out[214]: 
 data
2     0
3     1
3     2
4     3

# slice is are outside the index, so empty DataFrame is returned
In [215]: df.loc[13:15, :]
Out[215]: 
Empty DataFrame
Columns: [data]
Index: [] 

另一方面,如果索引不是单调的,那么切片的两个边界必须是索引的唯一成员。

In [216]: df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=["data"], data=list(range(6)))

In [217]: df.index.is_monotonic_increasing
Out[217]: False

# OK because 2 and 4 are in the index
In [218]: df.loc[2:4, :]
Out[218]: 
 data
2     0
3     1
1     2
4     3 
 # 0 is not in the index
In [219]: df.loc[0:4, :]
---------------------------------------------------------------------------
KeyError  Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
  3804 try:
-> 3805     return self._engine.get_loc(casted_key)
  3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:191, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:234, in pandas._libs.index.IndexEngine._get_loc_duplicates()

File index.pyx:242, in pandas._libs.index.IndexEngine._maybe_get_bool_indexer()

File index.pyx:134, in pandas._libs.index._unpack_bool_indexer()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError  Traceback (most recent call last)
Cell In[219], line 1
----> 1 df.loc[0:4, :]

File ~/work/pandas/pandas/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
  1182     if self._is_scalar_access(key):
  1183         return self.obj._get_value(*key, takeable=self._takeable)
-> 1184     return self._getitem_tuple(key)
  1185 else:
  1186     # we by definition only have the 0th axis
  1187     axis = self.axis or 0

File ~/work/pandas/pandas/pandas/core/indexing.py:1377, in _LocIndexer._getitem_tuple(self, tup)
  1374 if self._multi_take_opportunity(tup):
  1375     return self._multi_take(tup)
-> 1377 return self._getitem_tuple_same_dim(tup)

File ~/work/pandas/pandas/pandas/core/indexing.py:1020, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
  1017 if com.is_null_slice(key):
  1018     continue
-> 1020 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
  1021 # We should never have retval.ndim < self.ndim, as that should
  1022 #  be handled by the _getitem_lowerdim call above.
  1023 assert retval.ndim == self.ndim

File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
  1409 if isinstance(key, slice):
  1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
  1412 elif com.is_bool_indexer(key):
  1413     return self._getbool_axis(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
  1440     return obj.copy(deep=False)
  1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
  1445 if isinstance(indexer, slice):
  1446     return self.obj._slice(indexer, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
  6618 def slice_indexer(
  6619     self,
  6620     start: Hashable | None = None,
  6621     end: Hashable | None = None,
  6622     step: int | None = None,
  6623 ) -> slice:
  6624  """
  6625 Compute the slice indexer for input labels and step.
  6626  
 (...)
  6660 slice(1, 3, None)
  6661 """
-> 6662     start_slice, end_slice = self.slice_locs(start, end, step=step)
  6664     # return a slice
  6665     if not is_scalar(start_slice):

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step)
  6877 start_slice = None
  6878 if start is not None:
-> 6879     start_slice = self.get_slice_bound(start, "left")
  6880 if start_slice is None:
  6881     start_slice = 0

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6804, in Index.get_slice_bound(self, label, side)
  6801         return self._searchsorted_monotonic(label, side)
  6802     except ValueError:
  6803         # raise the original KeyError
-> 6804         raise err
  6806 if isinstance(slc, np.ndarray):
  6807     # get_loc may return a boolean array, which
  6808     # is OK as long as they are representable by a slice.
  6809     assert is_bool_dtype(slc.dtype)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6798, in Index.get_slice_bound(self, label, side)
  6796 # we need to look up the label
  6797 try:
-> 6798     slc = self.get_loc(label)
  6799 except KeyError as err:
  6800     try:

File ~/work/pandas/pandas/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
  3807     if isinstance(casted_key, slice) or (
  3808         isinstance(casted_key, abc.Iterable)
  3809         and any(isinstance(x, slice) for x in casted_key)
  3810     ):
  3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
  3813 except TypeError:
  3814     # If we have a listlike key, _check_indexing_error will raise
  3815     #  InvalidIndexError. Otherwise we fall through and re-raise
  3816     #  the TypeError.
  3817     self._check_indexing_error(key)

KeyError: 0

 # 3 is not a unique label
In [220]: df.loc[2:3, :]
---------------------------------------------------------------------------
KeyError  Traceback (most recent call last)
Cell In[220], line 1
----> 1 df.loc[2:3, :]

File ~/work/pandas/pandas/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
  1182     if self._is_scalar_access(key):
  1183         return self.obj._get_value(*key, takeable=self._takeable)
-> 1184     return self._getitem_tuple(key)
  1185 else:
  1186     # we by definition only have the 0th axis
  1187     axis = self.axis or 0

File ~/work/pandas/pandas/pandas/core/indexing.py:1377, in _LocIndexer._getitem_tuple(self, tup)
  1374 if self._multi_take_opportunity(tup):
  1375     return self._multi_take(tup)
-> 1377 return self._getitem_tuple_same_dim(tup)

File ~/work/pandas/pandas/pandas/core/indexing.py:1020, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
  1017 if com.is_null_slice(key):
  1018     continue
-> 1020 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
  1021 # We should never have retval.ndim < self.ndim, as that should
  1022 #  be handled by the _getitem_lowerdim call above.
  1023 assert retval.ndim == self.ndim

File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
  1409 if isinstance(key, slice):
  1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
  1412 elif com.is_bool_indexer(key):
  1413     return self._getbool_axis(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
  1440     return obj.copy(deep=False)
  1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
  1445 if isinstance(indexer, slice):
  1446     return self.obj._slice(indexer, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
  6618 def slice_indexer(
  6619     self,
  6620     start: Hashable | None = None,
  6621     end: Hashable | None = None,
  6622     step: int | None = None,
  6623 ) -> slice:
  6624  """
  6625 Compute the slice indexer for input labels and step.
  6626  
 (...)
  6660 slice(1, 3, None)
  6661 """
-> 6662     start_slice, end_slice = self.slice_locs(start, end, step=step)
  6664     # return a slice
  6665     if not is_scalar(start_slice):

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6885, in Index.slice_locs(self, start, end, step)
  6883 end_slice = None
  6884 if end is not None:
-> 6885     end_slice = self.get_slice_bound(end, "right")
  6886 if end_slice is None:
  6887     end_slice = len(self)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6812, in Index.get_slice_bound(self, label, side)
  6810     slc = lib.maybe_booleans_to_slice(slc.view("u1"))
  6811     if isinstance(slc, np.ndarray):
-> 6812         raise KeyError(
  6813             f"Cannot get {side} slice bound for non-unique "
  6814             f"label: {repr(original_label)}"
  6815         )
  6817 if isinstance(slc, slice):
  6818     if side == "left":

KeyError: 'Cannot get right slice bound for non-unique label: 3' 

Index.is_monotonic_increasingIndex.is_monotonic_decreasing只检查索引是否弱单调。要检查严格单调性,可以将其中一个与is_unique()属性结合使用。

In [221]: weakly_monotonic = pd.Index(["a", "b", "c", "c"])

In [222]: weakly_monotonic
Out[222]: Index(['a', 'b', 'c', 'c'], dtype='object')

In [223]: weakly_monotonic.is_monotonic_increasing
Out[223]: True

In [224]: weakly_monotonic.is_monotonic_increasing & weakly_monotonic.is_unique
Out[224]: False 

端点是包含的

与标准的 Python 序列切片相比,其中切片的端点不包含在内,pandas 中基于标签的切片是包含的。这样做的主要原因是往往不容易确定索引中特定标签后的“后继”或下一个元素。例如,考虑以下Series

In [225]: s = pd.Series(np.random.randn(6), index=list("abcdef"))

In [226]: s
Out[226]: 
a   -0.101684
b   -0.734907
c   -0.130121
d   -0.476046
e    0.759104
f    0.213379
dtype: float64 

假设我们希望从ce进行切片,使用整数可以这样实现:

In [227]: s[2:5]
Out[227]: 
c   -0.130121
d   -0.476046
e    0.759104
dtype: float64 

然而,如果你只有ce,确定索引中的下一个元素可能会有些复杂。例如,以下情况不起作用:

In [228]: s.loc['c':'e' + 1]
---------------------------------------------------------------------------
TypeError  Traceback (most recent call last)
Cell In[228], line 1
----> 1 s.loc['c':'e' + 1]

TypeError: can only concatenate str (not "int") to str 

一个非常常见的用例是将时间序列限制在两个特定日期开始和结束。为了实现这一点,我们做出了设计选择,使基于标签的切片包括两个端点:

In [229]: s.loc["c":"e"]
Out[229]: 
c   -0.130121
d   -0.476046
e    0.759104
dtype: float64 

这绝对是一种“实用性胜过纯粹性”的事情,但如果您期望基于标签的切片的行为与标准的 Python 整数切片完全相同,那么这是需要注意的事项。

索引可能会改变底层 Series 的 dtype

不同的索引操作可能会改变Series的 dtype。

In [230]: series1 = pd.Series([1, 2, 3])

In [231]: series1.dtype
Out[231]: dtype('int64')

In [232]: res = series1.reindex([0, 4])

In [233]: res.dtype
Out[233]: dtype('float64')

In [234]: res
Out[234]: 
0    1.0
4    NaN
dtype: float64 
In [235]: series2 = pd.Series([True])

In [236]: series2.dtype
Out[236]: dtype('bool')

In [237]: res = series2.reindex_like(series1)

In [238]: res.dtype
Out[238]: dtype('O')

In [239]: res
Out[239]: 
0    True
1     NaN
2     NaN
dtype: object 

这是因为上述(重新)索引操作会悄悄地插入NaNs,并且dtype会相应地更改。这可能会在使用numpyufuncs(如numpy.logical_and)时造成一些问题。

请参阅GH 2388以获取更详细的讨论。

层次化索引(MultiIndex)

层次化/多级索引非常令人兴奋,因为它为一些相当复杂的数据分析和操作打开了大门,特别是在处理更高维数据时。本质上,它使您能够在较低维数据结构(如Series(1d)和DataFrame(2d))中存储和操作具有任意数量维度的数据。

在本节中,我们将展示“层次化”索引的确切含义以及它如何与上述和之前章节中描述的所有 pandas 索引功能集成。稍后,在讨论分组和数据透视和重塑时,我们将展示非平凡的应用程序,以说明它如何帮助结构化数据进行分析。

请参阅食谱以获取一些高级策略。

创建一个 MultiIndex(层次化索引)对象

MultiIndex对象是标准Index对象的分层类比,通常在 pandas 对象中存储轴标签。您可以将MultiIndex视为一个元组数组,其中每个元组都是唯一的。可以从数组列表(使用MultiIndex.from_arrays())、元组数组(使用MultiIndex.from_tuples())、可迭代的交叉集(使用MultiIndex.from_product())或DataFrame(使用MultiIndex.from_frame())创建MultiIndex。当传递元组列表给Index构造函数时,构造函数将尝试返回MultiIndex。以下示例演示了初始化 MultiIndexes 的不同方法。

In [1]: arrays = [
 ...:    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
 ...:    ["one", "two", "one", "two", "one", "two", "one", "two"],
 ...: ]
 ...: 

In [2]: tuples = list(zip(*arrays))

In [3]: tuples
Out[3]: 
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [4]: index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

In [5]: index
Out[5]: 
MultiIndex([('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')],
 names=['first', 'second'])

In [6]: s = pd.Series(np.random.randn(8), index=index)

In [7]: s
Out[7]: 
first  second
bar    one       0.469112
 two      -0.282863
baz    one      -1.509059
 two      -1.135632
foo    one       1.212112
 two      -0.173215
qux    one       0.119209
 two      -1.044236
dtype: float64 

当您想要两个可迭代元素的每个配对时,可以更容易地使用MultiIndex.from_product()方法:

In [8]: iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]

In [9]: pd.MultiIndex.from_product(iterables, names=["first", "second"])
Out[9]: 
MultiIndex([('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')],
 names=['first', 'second']) 

您还可以直接从 DataFrame 构建 MultiIndex,使用方法MultiIndex.from_frame()。这是与MultiIndex.to_frame()相辅相成的方法。

In [10]: df = pd.DataFrame(
 ....:    [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
 ....:    columns=["first", "second"],
 ....: )
 ....: 

In [11]: pd.MultiIndex.from_frame(df)
Out[11]: 
MultiIndex([('bar', 'one'),
 ('bar', 'two'),
 ('foo', 'one'),
 ('foo', 'two')],
 names=['first', 'second']) 

为了方便起见,您可以直接将数组列表传递给 SeriesDataFrame,以自动构建MultiIndex

In [12]: arrays = [
 ....:    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
 ....:    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
 ....: ]
 ....: 

In [13]: s = pd.Series(np.random.randn(8), index=arrays)

In [14]: s
Out[14]: 
bar  one   -0.861849
 two   -2.104569
baz  one   -0.494929
 two    1.071804
foo  one    0.721555
 two   -0.706771
qux  one   -1.039575
 two    0.271860
dtype: float64

In [15]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

In [16]: df
Out[16]: 
 0         1         2         3
bar one -0.424972  0.567020  0.276232 -1.087401
 two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
 two -0.370647 -1.157892 -1.344312  0.844885
foo one  1.075770 -0.109050  1.643563 -1.469388
 two  0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524  0.413738  0.276662 -0.472035
 two -0.013960 -0.362543 -0.006154 -0.923061 

所有MultiIndex构造函数都接受一个names参数,用于存储级别本身的字符串名称。如果未提供名称,则将分配None

In [17]: df.index.names
Out[17]: FrozenList([None, None]) 

这个索引可以支持 pandas 对象的任何轴,并且索引的级别数量由您决定:

In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index)

In [19]: df
Out[19]: 
first        bar                 baz  ...       foo       qux 
second       one       two       one  ...       two       one       two
A       0.895717  0.805244 -1.206412  ...  1.340309 -1.170299 -0.226169
B       0.410835  0.813850  0.132003  ... -1.187678  1.130127 -1.436737
C      -1.413681  1.607920  1.024180  ... -2.211372  0.974466 -2.006747

[3 rows x 8 columns]

In [20]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
Out[20]: 
first              bar                 baz                 foo 
second             one       two       one       two       one       two
first second 
bar   one    -0.410001 -0.078638  0.545952 -1.219217 -1.226825  0.769804
 two    -1.281247 -0.727707 -0.121306 -0.097883  0.695775  0.341734
baz   one     0.959726 -1.110336 -0.619976  0.149748 -0.732339  0.687738
 two     0.176444  0.403310 -0.154951  0.301624 -2.179861 -1.369849
foo   one    -0.954208  1.462696 -1.743161 -0.826591 -0.345352  1.314232
 two     0.690579  0.995761  2.396780  0.014871  3.357427 -0.317441 

我们“稀疏化”了索引的较高级别,使控制台输出更加易于阅读。请注意,索引的显示方式可以使用 pandas.set_options() 中的 multi_sparse 选项进行控制:

In [21]: with pd.option_context("display.multi_sparse", False):
 ....:    df
 ....: 

值得记住的是,没有任何限制您在轴上使用元组作为原子标签:

In [22]: pd.Series(np.random.randn(8), index=tuples)
Out[22]: 
(bar, one)   -1.236269
(bar, two)    0.896171
(baz, one)   -0.487602
(baz, two)   -0.082240
(foo, one)   -2.182937
(foo, two)    0.380396
(qux, one)    0.084844
(qux, two)    0.432390
dtype: float64 

MultiIndex很重要的原因是它可以让您执行分组、选择和重塑操作,我们将在下面和文档的后续部分中描述。正如您将在后面的部分中看到的,您可能会发现自己在不显式创建MultiIndex的情况下使用分层索引数据。但是,在从文件加载数据时,您可能希望在准备数据集时生成自己的MultiIndex

重建级别标签

方法get_level_values()将返回特定级别上每个位置的标签向量:

In [23]: index.get_level_values(0)
Out[23]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [24]: index.get_level_values("second")
Out[24]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second') 

在具有 MultiIndex 的轴上进行基本索引

分层索引的一个重要特点是,您可以通过标识数据中的子组的“部分”标签来选择数据。部分选择在结果中以与在常规 DataFrame 中选择列完全类似的方式“删除”分层索引的级别:

In [25]: df["bar"]
Out[25]: 
second       one       two
A       0.895717  0.805244
B       0.410835  0.813850
C      -1.413681  1.607920

In [26]: df["bar", "one"]
Out[26]: 
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

In [27]: df["bar"]["one"]
Out[27]: 
A    0.895717
B    0.410835
C   -1.413681
Name: one, dtype: float64

In [28]: s["qux"]
Out[28]: 
one   -1.039575
two    0.271860
dtype: float64 

查看具有分层索引的交叉部分以了解如何在更深层次上进行选择。

定义的级别

MultiIndex保留索引的所有定义级别,即使它们实际上没有被使用。在对索引进行切片时,您可能会注意到这一点。例如:

In [29]: df.columns.levels  # original MultiIndex
Out[29]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

In [30]: df[["foo","qux"]].columns.levels  # sliced
Out[30]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']]) 

这样做是为了避免重新计算级别以使切片高度高效。如果您只想看到已使用的级别,可以使用get_level_values()方法。

In [31]: df[["foo", "qux"]].columns.to_numpy()
Out[31]: 
array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
 dtype=object)

# for a specific level
In [32]: df[["foo", "qux"]].columns.get_level_values(0)
Out[32]: Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first') 

要仅使用已使用的级别重建MultiIndex,可以使用remove_unused_levels()方法。

In [33]: new_mi = df[["foo", "qux"]].columns.remove_unused_levels()

In [34]: new_mi.levels
Out[34]: FrozenList([['foo', 'qux'], ['one', 'two']]) 

数据对齐和使用reindex

在轴上具有MultiIndex的不同索引对象之间的操作将按预期进行;数据对齐将与元组索引的索引相同:

In [35]: s + s[:-2]
Out[35]: 
bar  one   -1.723698
 two   -4.209138
baz  one   -0.989859
 two    2.143608
foo  one    1.443110
 two   -1.413542
qux  one         NaN
 two         NaN
dtype: float64

In [36]: s + s[::2]
Out[36]: 
bar  one   -1.723698
 two         NaN
baz  one   -0.989859
 two         NaN
foo  one    1.443110
 two         NaN
qux  one   -2.079150
 two         NaN
dtype: float64 

Series/DataFramesreindex()方法可以使用另一个MultiIndex,甚至是元组的列表或数组来调用:

In [37]: s.reindex(index[:3])
Out[37]: 
first  second
bar    one      -0.861849
 two      -2.104569
baz    one      -0.494929
dtype: float64

In [38]: s.reindex([("foo", "two"), ("bar", "one"), ("qux", "one"), ("baz", "one")])
Out[38]: 
foo  two   -0.706771
bar  one   -0.861849
qux  one   -1.039575
baz  one   -0.494929
dtype: float64 

创建一个 MultiIndex(分层索引)对象

MultiIndex对象是标准Index对象的分层类比,通常在 pandas 对象中存储轴标签。您可以将MultiIndex视为元组数组,其中每个元组都是唯一的。可以从数组列表(使用MultiIndex.from_arrays())、元组数组(使用MultiIndex.from_tuples())、可迭代的交叉集(使用MultiIndex.from_product())或DataFrame(使用MultiIndex.from_frame())创建MultiIndex。当传递元组列表给Index构造函数时,该构造函数将尝试返回MultiIndex。以下示例演示了初始化 MultiIndexes 的不同方法。

In [1]: arrays = [
 ...:    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
 ...:    ["one", "two", "one", "two", "one", "two", "one", "two"],
 ...: ]
 ...: 

In [2]: tuples = list(zip(*arrays))

In [3]: tuples
Out[3]: 
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [4]: index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

In [5]: index
Out[5]: 
MultiIndex([('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')],
 names=['first', 'second'])

In [6]: s = pd.Series(np.random.randn(8), index=index)

In [7]: s
Out[7]: 
first  second
bar    one       0.469112
 two      -0.282863
baz    one      -1.509059
 two      -1.135632
foo    one       1.212112
 two      -0.173215
qux    one       0.119209
 two      -1.044236
dtype: float64 

当您想要两个可迭代元素的每个配对时,可以更容易地使用MultiIndex.from_product()方法:

In [8]: iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]

In [9]: pd.MultiIndex.from_product(iterables, names=["first", "second"])
Out[9]: 
MultiIndex([('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')],
 names=['first', 'second']) 

您还可以直接从DataFrame构造一个MultiIndex,使用方法MultiIndex.from_frame()。这是与MultiIndex.to_frame()互补的方法。

In [10]: df = pd.DataFrame(
 ....:    [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
 ....:    columns=["first", "second"],
 ....: )
 ....: 

In [11]: pd.MultiIndex.from_frame(df)
Out[11]: 
MultiIndex([('bar', 'one'),
 ('bar', 'two'),
 ('foo', 'one'),
 ('foo', 'two')],
 names=['first', 'second']) 

作为一种便利,您可以直接将数组列表传递给SeriesDataFrame,以自动构造一个MultiIndex

In [12]: arrays = [
 ....:    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
 ....:    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
 ....: ]
 ....: 

In [13]: s = pd.Series(np.random.randn(8), index=arrays)

In [14]: s
Out[14]: 
bar  one   -0.861849
 two   -2.104569
baz  one   -0.494929
 two    1.071804
foo  one    0.721555
 two   -0.706771
qux  one   -1.039575
 two    0.271860
dtype: float64

In [15]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

In [16]: df
Out[16]: 
 0         1         2         3
bar one -0.424972  0.567020  0.276232 -1.087401
 two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
 two -0.370647 -1.157892 -1.344312  0.844885
foo one  1.075770 -0.109050  1.643563 -1.469388
 two  0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524  0.413738  0.276662 -0.472035
 two -0.013960 -0.362543 -0.006154 -0.923061 

所有的MultiIndex构造函数都接受一个names参数,用于存储级别本身的字符串名称。如果没有提供名称,则将分配None

In [17]: df.index.names
Out[17]: FrozenList([None, None]) 

这个索引可以支持 pandas 对象的任何轴,并且索引的级别数量由您决定:

In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index)

In [19]: df
Out[19]: 
first        bar                 baz  ...       foo       qux 
second       one       two       one  ...       two       one       two
A       0.895717  0.805244 -1.206412  ...  1.340309 -1.170299 -0.226169
B       0.410835  0.813850  0.132003  ... -1.187678  1.130127 -1.436737
C      -1.413681  1.607920  1.024180  ... -2.211372  0.974466 -2.006747

[3 rows x 8 columns]

In [20]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
Out[20]: 
first              bar                 baz                 foo 
second             one       two       one       two       one       two
first second 
bar   one    -0.410001 -0.078638  0.545952 -1.219217 -1.226825  0.769804
 two    -1.281247 -0.727707 -0.121306 -0.097883  0.695775  0.341734
baz   one     0.959726 -1.110336 -0.619976  0.149748 -0.732339  0.687738
 two     0.176444  0.403310 -0.154951  0.301624 -2.179861 -1.369849
foo   one    -0.954208  1.462696 -1.743161 -0.826591 -0.345352  1.314232
 two     0.690579  0.995761  2.396780  0.014871  3.357427 -0.317441 

我们“稀疏化”了索引的较高级别,以使控制台输出对眼睛更加友好。请注意,可以使用pandas.set_options()中的multi_sparse选项来控制索引的显示方式:

In [21]: with pd.option_context("display.multi_sparse", False):
 ....:    df
 ....: 

值得记住的是,没有任何限制阻止您在轴上使用元组作为原子标签:

In [22]: pd.Series(np.random.randn(8), index=tuples)
Out[22]: 
(bar, one)   -1.236269
(bar, two)    0.896171
(baz, one)   -0.487602
(baz, two)   -0.082240
(foo, one)   -2.182937
(foo, two)    0.380396
(qux, one)    0.084844
(qux, two)    0.432390
dtype: float64 

MultiIndex很重要的原因是它可以让您执行分组、选择和重塑操作,正如我们将在下面和文档的后续部分中描述的那样。正如您将在后面的部分中看到的,您可能会发现自己在不显式创建MultiIndex的情况下使用分层索引数据。然而,在从文件加载数据时,您可能希望在准备数据集时生成自己的MultiIndex

重建级别标签

方法get_level_values()将返回特定级别位置的每个标签的向量:

In [23]: index.get_level_values(0)
Out[23]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [24]: index.get_level_values("second")
Out[24]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second') 

使用MultiIndex在轴上进行基本索引

分层索引的一个重要特点是,您可以通过标识数据中的子组的“部分”标签来选择数据。部分选择会在结果中以与在常规 DataFrame 中选择列完全类似的方式“删除”分层索引的级别:

In [25]: df["bar"]
Out[25]: 
second       one       two
A       0.895717  0.805244
B       0.410835  0.813850
C      -1.413681  1.607920

In [26]: df["bar", "one"]
Out[26]: 
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

In [27]: df["bar"]["one"]
Out[27]: 
A    0.895717
B    0.410835
C   -1.413681
Name: one, dtype: float64

In [28]: s["qux"]
Out[28]: 
one   -1.039575
two    0.271860
dtype: float64 

查看具有分层索引的交叉部分以了解如何在更深层次上进行选择。

定义的级别

MultiIndex保留索引的所有定义级别,即使它们实际上没有被使用。在切片索引时,您可能会注意到这一点。例如:

In [29]: df.columns.levels  # original MultiIndex
Out[29]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

In [30]: df[["foo","qux"]].columns.levels  # sliced
Out[30]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']]) 

这样做是为了避免重新计算级别,以使切片高度高效。如果您只想看到已使用的级别,可以使用get_level_values()方法。

In [31]: df[["foo", "qux"]].columns.to_numpy()
Out[31]: 
array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
 dtype=object)

# for a specific level
In [32]: df[["foo", "qux"]].columns.get_level_values(0)
Out[32]: Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first') 

要仅使用已使用的级别重建MultiIndex,可以使用remove_unused_levels()方法。

In [33]: new_mi = df[["foo", "qux"]].columns.remove_unused_levels()

In [34]: new_mi.levels
Out[34]: FrozenList([['foo', 'qux'], ['one', 'two']]) 

数据对齐和使用reindex

在轴上具有MultiIndex的不��索引对象之间的操作将按您的预期工作;数据对齐将与元组索引的索引相同:

In [35]: s + s[:-2]
Out[35]: 
bar  one   -1.723698
 two   -4.209138
baz  one   -0.989859
 two    2.143608
foo  one    1.443110
 two   -1.413542
qux  one         NaN
 two         NaN
dtype: float64

In [36]: s + s[::2]
Out[36]: 
bar  one   -1.723698
 two         NaN
baz  one   -0.989859
 two         NaN
foo  one    1.443110
 two         NaN
qux  one   -2.079150
 two         NaN
dtype: float64 

Series/DataFramesreindex()方法可以与另一个MultiIndex,甚至是元组的列表或数组一起调用:

In [37]: s.reindex(index[:3])
Out[37]: 
first  second
bar    one      -0.861849
 two      -2.104569
baz    one      -0.494929
dtype: float64

In [38]: s.reindex([("foo", "two"), ("bar", "one"), ("qux", "one"), ("baz", "one")])
Out[38]: 
foo  two   -0.706771
bar  one   -0.861849
qux  one   -1.039575
baz  one   -0.494929
dtype: float64 

使用分层索引进行高级索引

在使用.loc进行高级索引时,将MultiIndex在语法上整合在一起有点具有挑战性,但我们已经尽力做到了。一般来说,MultiIndex 键采用元组的形式。例如,以下操作会按您的预期工作:

In [39]: df = df.T

In [40]: df
Out[40]: 
 A         B         C
first second 
bar   one     0.895717  0.410835 -1.413681
 two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
 two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
 two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
 two    -0.226169 -1.436737 -2.006747

In [41]: df.loc[("bar", "two")]
Out[41]: 
A    0.805244
B    0.813850
C    1.607920
Name: (bar, two), dtype: float64 

请注意,在这个例子中,df.loc['bar', 'two']也可以工作,但这种简写符号在一般情况下可能会导致歧义。

如果您还想使用.loc索引特定列,必须像这样使用元组:

In [42]: df.loc[("bar", "two"), "A"]
Out[42]: 0.8052440253863785 

你不必通过仅传递元组的第一个元素来指定MultiIndex的所有级别。例如,你可以使用“部分”索引来获取所有第一级别中带有bar的元素如下:

In [43]: df.loc["bar"]
Out[43]: 
 A         B         C
second 
one     0.895717  0.410835 -1.413681
two     0.805244  0.813850  1.607920 

这是略微更冗长的符号df.loc[('bar',),]的快捷方式(在这个例子中等同于df.loc['bar',])。

“部分���切片也非常有效。

In [44]: df.loc["baz":"foo"]
Out[44]: 
 A         B         C
first second 
baz   one    -1.206412  0.132003  1.024180
 two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
 two     1.340309 -1.187678 -2.211372 

通过提供元组切片来对“范围”值进行切片。

In [45]: df.loc[("baz", "two"):("qux", "one")]
Out[45]: 
 A         B         C
first second 
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
 two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466

In [46]: df.loc[("baz", "two"):"foo"]
Out[46]: 
 A         B         C
first second 
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
 two     1.340309 -1.187678 -2.211372 

传递标签列表或元组类似于重新索引:

In [47]: df.loc[[("bar", "two"), ("qux", "one")]]
Out[47]: 
 A         B         C
first second 
bar   two     0.805244  0.813850  1.607920
qux   one    -1.170299  1.130127  0.974466 

注意

需要注意的是,在 pandas 中,元组和列表在索引时并非被处理相同。元组被解释为一个多级键,而列表用于指定多个键。换句话说,元组水平移动(遍历级别),列表垂直移动(扫描级别)。

重要的是,元组列表索引多个完整的MultiIndex键,而列表元组引用一个级别内的多个值:

In [48]: s = pd.Series(
 ....:    [1, 2, 3, 4, 5, 6],
 ....:    index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]),
 ....: )
 ....: 

In [49]: s.loc[[("A", "c"), ("B", "d")]]  # list of tuples
Out[49]: 
A  c    1
B  d    5
dtype: int64

In [50]: s.loc[(["A", "B"], ["c", "d"])]  # tuple of lists
Out[50]: 
A  c    1
 d    2
B  c    4
 d    5
dtype: int64 

使用切片器

通过提供多个索引器可以对MultiIndex进行切片。

你可以像通过标签索引一样提供任何选择器,参见按标签选择,包括切片、标签列表、标签和布尔索引器。

你可以使用slice(None)来选择级别的所有内容。你不需要指定所有更深层次的级别,它们将被隐含为slice(None)

与标签索引一样,切片器的两侧都包括在内。

警告

你应该在.loc指定器中指定所有轴,即索引的索引器。有一些模糊的情况,传递的索引器可能被误解为同时索引两个轴,而不是例如为行的MultiIndex

你应该这样做:

df.loc[(slice("A1", "A3"), ...), :]  # noqa: E999 

不应该这样做:

df.loc[(slice("A1", "A3"), ...)]  # noqa: E999 
In [51]: def mklbl(prefix, n):
 ....:    return ["%s%s" % (prefix, i) for i in range(n)]
 ....: 

In [52]: miindex = pd.MultiIndex.from_product(
 ....:    [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)]
 ....: )
 ....: 

In [53]: micolumns = pd.MultiIndex.from_tuples(
 ....:    [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")], names=["lvl0", "lvl1"]
 ....: )
 ....: 

In [54]: dfmi = (
 ....:    pd.DataFrame(
 ....:        np.arange(len(miindex) * len(micolumns)).reshape(
 ....:            (len(miindex), len(micolumns))
 ....:        ),
 ....:        index=miindex,
 ....:        columns=micolumns,
 ....:    )
 ....:    .sort_index()
 ....:    .sort_index(axis=1)
 ....: )
 ....: 

In [55]: dfmi
Out[55]: 
lvl0           a         b 
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
 D1    5    4    7    6
 C1 D0    9    8   11   10
 D1   13   12   15   14
 C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  237  236  239  238
 C2 D0  241  240  243  242
 D1  245  244  247  246
 C3 D0  249  248  251  250
 D1  253  252  255  254

[64 rows x 4 columns] 

使用切片、列表和标签进行基本的 MultiIndex 切片。

In [56]: dfmi.loc[(slice("A1", "A3"), slice(None), ["C1", "C3"]), :]
Out[56]: 
lvl0           a         b 
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
 D1   77   76   79   78
 C3 D0   89   88   91   90
 D1   93   92   95   94
 B1 C1 D0  105  104  107  106
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
 B1 C1 D0  233  232  235  234
 D1  237  236  239  238
 C3 D0  249  248  251  250
 D1  253  252  255  254

[24 rows x 4 columns] 

你可以使用pandas.IndexSlice来使用:更自然的语法,而不是使用slice(None)

In [57]: idx = pd.IndexSlice

In [58]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
Out[58]: 
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
 D1   12   14
 C3 D0   24   26
 D1   28   30
 B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
 B1 C1 D0  232  234
 D1  236  238
 C3 D0  248  250
 D1  252  254

[32 rows x 2 columns] 

可以同时在多个轴上使用此方法执行相当复杂的选择。

In [59]: dfmi.loc["A1", (slice(None), "foo")]
Out[59]: 
lvl0        a    b
lvl1      foo  foo
B0 C0 D0   64   66
 D1   68   70
 C1 D0   72   74
 D1   76   78
 C2 D0   80   82
...       ...  ...
B1 C1 D1  108  110
 C2 D0  112  114
 D1  116  118
 C3 D0  120  122
 D1  124  126

[16 rows x 2 columns]

In [60]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
Out[60]: 
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
 D1   12   14
 C3 D0   24   26
 D1   28   30
 B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
 B1 C1 D0  232  234
 D1  236  238
 C3 D0  248  250
 D1  252  254

[32 rows x 2 columns] 

使用布尔索引器可以提供与相关的选择。

In [61]: mask = dfmi[("a", "foo")] > 200

In [62]: dfmi.loc[idx[mask, :, ["C1", "C3"]], idx[:, "foo"]]
Out[62]: 
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
 C3 D0  216  218
 D1  220  222
 B1 C1 D0  232  234
 D1  236  238
 C3 D0  248  250
 D1  252  254 

你还可以在.loc中指定axis参数,以在单个轴上解释传递的切片器。

In [63]: dfmi.loc(axis=0)[:, :, ["C1", "C3"]]
Out[63]: 
lvl0           a         b 
lvl1         bar  foo  bah  foo
A0 B0 C1 D0    9    8   11   10
 D1   13   12   15   14
 C3 D0   25   24   27   26
 D1   29   28   31   30
 B1 C1 D0   41   40   43   42
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
 B1 C1 D0  233  232  235  234
 D1  237  236  239  238
 C3 D0  249  248  251  250
 D1  253  252  255  254

[32 rows x 4 columns] 

此外,你可以使用以下方法设置值。

In [64]: df2 = dfmi.copy()

In [65]: df2.loc(axis=0)[:, :, ["C1", "C3"]] = -10

In [66]: df2
Out[66]: 
lvl0           a         b 
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
 D1    5    4    7    6
 C1 D0  -10  -10  -10  -10
 D1  -10  -10  -10  -10
 C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  -10  -10  -10  -10
 C2 D0  241  240  243  242
 D1  245  244  247  246
 C3 D0  -10  -10  -10  -10
 D1  -10  -10  -10  -10

[64 rows x 4 columns] 

你也可以使用可对齐对象的右侧。

In [67]: df2 = dfmi.copy()

In [68]: df2.loc[idx[:, :, ["C1", "C3"]], :] = df2 * 1000

In [69]: df2
Out[69]: 
lvl0              a               b 
lvl1            bar     foo     bah     foo
A0 B0 C0 D0       1       0       3       2
 D1       5       4       7       6
 C1 D0    9000    8000   11000   10000
 D1   13000   12000   15000   14000
 C2 D0      17      16      19      18
...             ...     ...     ...     ...
A3 B1 C1 D1  237000  236000  239000  238000
 C2 D0     241     240     243     242
 D1     245     244     247     246
 C3 D0  249000  248000  251000  250000
 D1  253000  252000  255000  254000

[64 rows x 4 columns] 
```  ### 横截面

`DataFrame`的`xs()`方法另外接受一个级别参数,以便更轻松地选择`MultiIndex`的特定级别的数据。

```py
In [70]: df
Out[70]: 
 A         B         C
first second 
bar   one     0.895717  0.410835 -1.413681
 two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
 two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
 two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
 two    -0.226169 -1.436737 -2.006747

In [71]: df.xs("one", level="second")
Out[71]: 
 A         B         C
first 
bar    0.895717  0.410835 -1.413681
baz   -1.206412  0.132003  1.024180
foo    1.431256 -0.076467  0.875906
qux   -1.170299  1.130127  0.974466 
# using the slicers
In [72]: df.loc[(slice(None), "one"), :]
Out[72]: 
 A         B         C
first second 
bar   one     0.895717  0.410835 -1.413681
baz   one    -1.206412  0.132003  1.024180
foo   one     1.431256 -0.076467  0.875906
qux   one    -1.170299  1.130127  0.974466 

你也可以使用xs在列上进行选择,通过提供轴参数。

In [73]: df = df.T

In [74]: df.xs("one", level="second", axis=1)
Out[74]: 
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466 
# using the slicers
In [75]: df.loc[:, (slice(None), "one")]
Out[75]: 
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466 

xs还允许使用多个键进行选择。

In [76]: df.xs(("one", "bar"), level=("second", "first"), axis=1)
Out[76]: 
first        bar
second       one
A       0.895717
B       0.410835
C      -1.413681 
# using the slicers
In [77]: df.loc[:, ("bar", "one")]
Out[77]: 
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64 

你可以向xs传递drop_level=False以保留所选的级别。

In [78]: df.xs("one", level="second", axis=1, drop_level=False)
Out[78]: 
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466 

将上述与使用drop_level=True(默认值)的结果进行比较。

In [79]: df.xs("one", level="second", axis=1, drop_level=True)
Out[79]: 
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466 
```  ### 高级重新索引和对齐

在 pandas 对象的`reindex()`和`align()`方法中使用参数`level`可以实现跨级别广播值。例如:

```py
In [80]: midx = pd.MultiIndex(
 ....:    levels=[["zero", "one"], ["x", "y"]], codes=[[1, 1, 0, 0], [1, 0, 1, 0]]
 ....: )
 ....: 

In [81]: df = pd.DataFrame(np.random.randn(4, 2), index=midx)

In [82]: df
Out[82]: 
 0         1
one  y  1.519970 -0.493662
 x  0.600178  0.274230
zero y  0.132885 -0.023688
 x  2.410179  1.450520

In [83]: df2 = df.groupby(level=0).mean()

In [84]: df2
Out[84]: 
 0         1
one   1.060074 -0.109716
zero  1.271532  0.713416

In [85]: df2.reindex(df.index, level=0)
Out[85]: 
 0         1
one  y  1.060074 -0.109716
 x  1.060074 -0.109716
zero y  1.271532  0.713416
 x  1.271532  0.713416

# aligning
In [86]: df_aligned, df2_aligned = df.align(df2, level=0)

In [87]: df_aligned
Out[87]: 
 0         1
one  y  1.519970 -0.493662
 x  0.600178  0.274230
zero y  0.132885 -0.023688
 x  2.410179  1.450520

In [88]: df2_aligned
Out[88]: 
 0         1
one  y  1.060074 -0.109716
 x  1.060074 -0.109716
zero y  1.271532  0.713416
 x  1.271532  0.713416 

使用swaplevel交换级别

swaplevel()方法可以交换两个级别的顺序:

In [89]: df[:5]
Out[89]: 
 0         1
one  y  1.519970 -0.493662
 x  0.600178  0.274230
zero y  0.132885 -0.023688
 x  2.410179  1.450520

In [90]: df[:5].swaplevel(0, 1, axis=0)
Out[90]: 
 0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520 

使用reorder_levels重新排序级别

reorder_levels()方法推广了swaplevel方法,允许您一次性对分层索引级别进行排列:

In [91]: df[:5].reorder_levels([1, 0], axis=0)
Out[91]: 
 0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520 
```  ### 重命名`Index`或`MultiIndex`的名称

`rename()`方法用于重命名`MultiIndex`的标签,通常用于重命名`DataFrame`的列。`rename`的`columns`参数允许指定一个包含您希望重命名的列的字典。

```py
In [92]: df.rename(columns={0: "col0", 1: "col1"})
Out[92]: 
 col0      col1
one  y  1.519970 -0.493662
 x  0.600178  0.274230
zero y  0.132885 -0.023688
 x  2.410179  1.450520 

该方法还可用于重命名DataFrame的主索引的特定标签。

In [93]: df.rename(index={"one": "two", "y": "z"})
Out[93]: 
 0         1
two  z  1.519970 -0.493662
 x  0.600178  0.274230
zero z  0.132885 -0.023688
 x  2.410179  1.450520 

rename_axis()方法用于重命名IndexMultiIndex的名称。特别是,可以指定MultiIndex级别的名称,如果稍后使用reset_index()将值从MultiIndex移动到列中,则这是有用的。

In [94]: df.rename_axis(index=["abc", "def"])
Out[94]: 
 0         1
abc  def 
one  y    1.519970 -0.493662
 x    0.600178  0.274230
zero y    0.132885 -0.023688
 x    2.410179  1.450520 

请注意,DataFrame的列是一个索引,因此使用带有columns参数的rename_axis将更改该索引的名称。

In [95]: df.rename_axis(columns="Cols").columns
Out[95]: RangeIndex(start=0, stop=2, step=1, name='Cols') 

renamerename_axis都支持指定字典、Series或映射函数来将标签/名称映射到新值。

在直接使用Index对象而不是通过DataFrame时,可以使用Index.set_names()来更改名称。

In [96]: mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])

In [97]: mi.names
Out[97]: FrozenList(['x', 'y'])

In [98]: mi2 = mi.rename("new name", level=0)

In [99]: mi2
Out[99]: 
MultiIndex([(1, 'a'),
 (1, 'b'),
 (2, 'a'),
 (2, 'b')],
 names=['new name', 'y']) 

您无法通过级别设置MultiIndex的名称。

In [100]: mi.levels[0].name = "name via level"
---------------------------------------------------------------------------
RuntimeError  Traceback (most recent call last)
Cell In[100], line 1
----> 1 mi.levels[0].name = "name via level"

File ~/work/pandas/pandas/pandas/core/indexes/base.py:1690, in Index.name(self, value)
  1686 @name.setter
  1687 def name(self, value: Hashable) -> None:
  1688     if self._no_setting_name:
  1689         # Used in MultiIndex.levels to avoid silently ignoring name updates.
-> 1690         raise RuntimeError(
  1691             "Cannot set name on a level of a MultiIndex. Use "
  1692             "'MultiIndex.set_names' instead."
  1693         )
  1694     maybe_extract_name(value, None, type(self))
  1695     self._name = value

RuntimeError: Cannot set name on a level of a MultiIndex. Use 'MultiIndex.set_names' instead. 

使用Index.set_names()代替。 ### 使用切片器

通过提供多个索引器,可以对MultiIndex进行切片。

您可以提供任何选择器,就像您正在按标签进行索引一样,请参阅按标签选择,包括切片、标签列表、标签和布尔索引器。

您可以使用slice(None)来选择级别的所有内容。您不需要指定所有更深层次的级别,它们将被隐含为slice(None)

通常情况下,切片器的两侧都包含在内,因为这是标签索引。

警告

您应该在.loc指定器中指定所有轴,即索引的索引器。有一些模棱两可的情况,传递的索引器可能被误解为对两个轴进行索引,而不是例如对行的MultiIndex进行索引。

你应该这样做:

df.loc[(slice("A1", "A3"), ...), :]  # noqa: E999 

不应该这样做:

df.loc[(slice("A1", "A3"), ...)]  # noqa: E999 
In [51]: def mklbl(prefix, n):
 ....:    return ["%s%s" % (prefix, i) for i in range(n)]
 ....: 

In [52]: miindex = pd.MultiIndex.from_product(
 ....:    [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)]
 ....: )
 ....: 

In [53]: micolumns = pd.MultiIndex.from_tuples(
 ....:    [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")], names=["lvl0", "lvl1"]
 ....: )
 ....: 

In [54]: dfmi = (
 ....:    pd.DataFrame(
 ....:        np.arange(len(miindex) * len(micolumns)).reshape(
 ....:            (len(miindex), len(micolumns))
 ....:        ),
 ....:        index=miindex,
 ....:        columns=micolumns,
 ....:    )
 ....:    .sort_index()
 ....:    .sort_index(axis=1)
 ....: )
 ....: 

In [55]: dfmi
Out[55]: 
lvl0           a         b 
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
 D1    5    4    7    6
 C1 D0    9    8   11   10
 D1   13   12   15   14
 C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  237  236  239  238
 C2 D0  241  240  243  242
 D1  245  244  247  246
 C3 D0  249  248  251  250
 D1  253  252  255  254

[64 rows x 4 columns] 

使用切片、列表和标签进行基本的 MultiIndex 切片。

In [56]: dfmi.loc[(slice("A1", "A3"), slice(None), ["C1", "C3"]), :]
Out[56]: 
lvl0           a         b 
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
 D1   77   76   79   78
 C3 D0   89   88   91   90
 D1   93   92   95   94
 B1 C1 D0  105  104  107  106
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
 B1 C1 D0  233  232  235  234
 D1  237  236  239  238
 C3 D0  249  248  251  250
 D1  253  252  255  254

[24 rows x 4 columns] 

你可以使用pandas.IndexSlice来使用:更自然的语法,而不是使用slice(None)

In [57]: idx = pd.IndexSlice

In [58]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
Out[58]: 
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
 D1   12   14
 C3 D0   24   26
 D1   28   30
 B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
 B1 C1 D0  232  234
 D1  236  238
 C3 D0  248  250
 D1  252  254

[32 rows x 2 columns] 

使用此方法可以在同一时间在多个轴上执行相当复杂的选择。

In [59]: dfmi.loc["A1", (slice(None), "foo")]
Out[59]: 
lvl0        a    b
lvl1      foo  foo
B0 C0 D0   64   66
 D1   68   70
 C1 D0   72   74
 D1   76   78
 C2 D0   80   82
...       ...  ...
B1 C1 D1  108  110
 C2 D0  112  114
 D1  116  118
 C3 D0  120  122
 D1  124  126

[16 rows x 2 columns]

In [60]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
Out[60]: 
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
 D1   12   14
 C3 D0   24   26
 D1   28   30
 B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
 B1 C1 D0  232  234
 D1  236  238
 C3 D0  248  250
 D1  252  254

[32 rows x 2 columns] 

使用布尔索引器可以提供与相关的选择。

In [61]: mask = dfmi[("a", "foo")] > 200

In [62]: dfmi.loc[idx[mask, :, ["C1", "C3"]], idx[:, "foo"]]
Out[62]: 
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
 C3 D0  216  218
 D1  220  222
 B1 C1 D0  232  234
 D1  236  238
 C3 D0  248  250
 D1  252  254 

您还可以在.loc中指定axis参数,以在单个轴上解释传递的切片器。

In [63]: dfmi.loc(axis=0)[:, :, ["C1", "C3"]]
Out[63]: 
lvl0           a         b 
lvl1         bar  foo  bah  foo
A0 B0 C1 D0    9    8   11   10
 D1   13   12   15   14
 C3 D0   25   24   27   26
 D1   29   28   31   30
 B1 C1 D0   41   40   43   42
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
 B1 C1 D0  233  232  235  234
 D1  237  236  239  238
 C3 D0  249  248  251  250
 D1  253  252  255  254

[32 rows x 4 columns] 

此外,您可以使用以下方法设置值。

In [64]: df2 = dfmi.copy()

In [65]: df2.loc(axis=0)[:, :, ["C1", "C3"]] = -10

In [66]: df2
Out[66]: 
lvl0           a         b 
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
 D1    5    4    7    6
 C1 D0  -10  -10  -10  -10
 D1  -10  -10  -10  -10
 C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  -10  -10  -10  -10
 C2 D0  241  240  243  242
 D1  245  244  247  246
 C3 D0  -10  -10  -10  -10
 D1  -10  -10  -10  -10

[64 rows x 4 columns] 

你也可以使用可对齐对象的右侧。

In [67]: df2 = dfmi.copy()

In [68]: df2.loc[idx[:, :, ["C1", "C3"]], :] = df2 * 1000

In [69]: df2
Out[69]: 
lvl0              a               b 
lvl1            bar     foo     bah     foo
A0 B0 C0 D0       1       0       3       2
 D1       5       4       7       6
 C1 D0    9000    8000   11000   10000
 D1   13000   12000   15000   14000
 C2 D0      17      16      19      18
...             ...     ...     ...     ...
A3 B1 C1 D1  237000  236000  239000  238000
 C2 D0     241     240     243     242
 D1     245     244     247     246
 C3 D0  249000  248000  251000  250000
 D1  253000  252000  255000  254000

[64 rows x 4 columns] 

交叉部分

DataFramexs()方法另外接受一个级别参数,使得在MultiIndex的特定级别选择数据更容易。

In [70]: df
Out[70]: 
 A         B         C
first second 
bar   one     0.895717  0.410835 -1.413681
 two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
 two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
 two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
 two    -0.226169 -1.436737 -2.006747

In [71]: df.xs("one", level="second")
Out[71]: 
 A         B         C
first 
bar    0.895717  0.410835 -1.413681
baz   -1.206412  0.132003  1.024180
foo    1.431256 -0.076467  0.875906
qux   -1.170299  1.130127  0.974466 
# using the slicers
In [72]: df.loc[(slice(None), "one"), :]
Out[72]: 
 A         B         C
first second 
bar   one     0.895717  0.410835 -1.413681
baz   one    -1.206412  0.132003  1.024180
foo   one     1.431256 -0.076467  0.875906
qux   one    -1.170299  1.130127  0.974466 

你也可以通过提供轴参数在列上使用xs进行选择。

In [73]: df = df.T

In [74]: df.xs("one", level="second", axis=1)
Out[74]: 
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466 
# using the slicers
In [75]: df.loc[:, (slice(None), "one")]
Out[75]: 
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466 

xs还允许使用多个键进行选择。

In [76]: df.xs(("one", "bar"), level=("second", "first"), axis=1)
Out[76]: 
first        bar
second       one
A       0.895717
B       0.410835
C      -1.413681 
# using the slicers
In [77]: df.loc[:, ("bar", "one")]
Out[77]: 
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64 

你可以传递drop_level=Falsexs以保留所选的级别。

In [78]: df.xs("one", level="second", axis=1, drop_level=False)
Out[78]: 
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466 

将上述与使用drop_level=True(默认值)的结果进行比较。

In [79]: df.xs("one", level="second", axis=1, drop_level=True)
Out[79]: 
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466 

高级重新索引和对齐

在 pandas 对象的reindex()align()方法中使用参数level对值进行广播是很有用的。例如:

In [80]: midx = pd.MultiIndex(
 ....:    levels=[["zero", "one"], ["x", "y"]], codes=[[1, 1, 0, 0], [1, 0, 1, 0]]
 ....: )
 ....: 

In [81]: df = pd.DataFrame(np.random.randn(4, 2), index=midx)

In [82]: df
Out[82]: 
 0         1
one  y  1.519970 -0.493662
 x  0.600178  0.274230
zero y  0.132885 -0.023688
 x  2.410179  1.450520

In [83]: df2 = df.groupby(level=0).mean()

In [84]: df2
Out[84]: 
 0         1
one   1.060074 -0.109716
zero  1.271532  0.713416

In [85]: df2.reindex(df.index, level=0)
Out[85]: 
 0         1
one  y  1.060074 -0.109716
 x  1.060074 -0.109716
zero y  1.271532  0.713416
 x  1.271532  0.713416

# aligning
In [86]: df_aligned, df2_aligned = df.align(df2, level=0)

In [87]: df_aligned
Out[87]: 
 0         1
one  y  1.519970 -0.493662
 x  0.600178  0.274230
zero y  0.132885 -0.023688
 x  2.410179  1.450520

In [88]: df2_aligned
Out[88]: 
 0         1
one  y  1.060074 -0.109716
 x  1.060074 -0.109716
zero y  1.271532  0.713416
 x  1.271532  0.713416 

使用swaplevel交换级别

swaplevel()方法可以交换两个级别的顺序:

In [89]: df[:5]
Out[89]: 
 0         1
one  y  1.519970 -0.493662
 x  0.600178  0.274230
zero y  0.132885 -0.023688
 x  2.410179  1.450520

In [90]: df[:5].swaplevel(0, 1, axis=0)
Out[90]: 
 0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520 

使用reorder_levels重新排序级别

reorder_levels()方法推广了swaplevel方法,允许您一次性对分层索引级别进行排列:

In [91]: df[:5].reorder_levels([1, 0], axis=0)
Out[91]: 
 0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520 

重命名IndexMultiIndex的名称

rename()方法用于重命名MultiIndex的标签,通常用于重命名DataFrame的列。renamecolumns参数允许指定一个字典,其中只包含您希望重命名的列。

In [92]: df.rename(columns={0: "col0", 1: "col1"})
Out[92]: 
 col0      col1
one  y  1.519970 -0.493662
 x  0.600178  0.274230
zero y  0.132885 -0.023688
 x  2.410179  1.450520 

这种方法也可以用于重命名DataFrame主索引的特定标签。

In [93]: df.rename(index={"one": "two", "y": "z"})
Out[93]: 
 0         1
two  z  1.519970 -0.493662
 x  0.600178  0.274230
zero z  0.132885 -0.023688
 x  2.410179  1.450520 

rename_axis()方法用于重命名IndexMultiIndex的名称。特别是,可以指定MultiIndex级别的名称,这在稍后使用reset_index()将值从MultiIndex移动到列时非常有用。

In [94]: df.rename_axis(index=["abc", "def"])
Out[94]: 
 0         1
abc  def 
one  y    1.519970 -0.493662
 x    0.600178  0.274230
zero y    0.132885 -0.023688
 x    2.410179  1.450520 

注意,DataFrame 的列是一个索引,因此使用 rename_axis 时使用 columns 参数会更改该索引的名称。

In [95]: df.rename_axis(columns="Cols").columns
Out[95]: RangeIndex(start=0, stop=2, step=1, name='Cols') 

renamerename_axis都支持指定字典、Series或映射函数来将标签/名称映射到新值。

当直接使用Index对象而不是通过DataFrame时,可以使用Index.set_names()来更改名称。

In [96]: mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])

In [97]: mi.names
Out[97]: FrozenList(['x', 'y'])

In [98]: mi2 = mi.rename("new name", level=0)

In [99]: mi2
Out[99]: 
MultiIndex([(1, 'a'),
 (1, 'b'),
 (2, 'a'),
 (2, 'b')],
 names=['new name', 'y']) 

不能通过级别设置 MultiIndex 的名称。

In [100]: mi.levels[0].name = "name via level"
---------------------------------------------------------------------------
RuntimeError  Traceback (most recent call last)
Cell In[100], line 1
----> 1 mi.levels[0].name = "name via level"

File ~/work/pandas/pandas/pandas/core/indexes/base.py:1690, in Index.name(self, value)
  1686 @name.setter
  1687 def name(self, value: Hashable) -> None:
  1688     if self._no_setting_name:
  1689         # Used in MultiIndex.levels to avoid silently ignoring name updates.
-> 1690         raise RuntimeError(
  1691             "Cannot set name on a level of a MultiIndex. Use "
  1692             "'MultiIndex.set_names' instead."
  1693         )
  1694     maybe_extract_name(value, None, type(self))
  1695     self._name = value

RuntimeError: Cannot set name on a level of a MultiIndex. Use 'MultiIndex.set_names' instead. 

而是使用 Index.set_names()

MultiIndex进行排序

为了有效地对MultiIndex对象进行索引和切片,它们需要被排序。与任何索引一样,您可以使用sort_index()

In [101]: import random

In [102]: random.shuffle(tuples)

In [103]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))

In [104]: s
Out[104]: 
qux  two    0.206053
bar  one   -0.251905
foo  one   -2.213588
qux  one    1.063327
foo  two    1.266143
baz  two    0.299368
bar  two   -0.863838
baz  one    0.408204
dtype: float64

In [105]: s.sort_index()
Out[105]: 
bar  one   -0.251905
 two   -0.863838
baz  one    0.408204
 two    0.299368
foo  one   -2.213588
 two    1.266143
qux  one    1.063327
 two    0.206053
dtype: float64

In [106]: s.sort_index(level=0)
Out[106]: 
bar  one   -0.251905
 two   -0.863838
baz  one    0.408204
 two    0.299368
foo  one   -2.213588
 two    1.266143
qux  one    1.063327
 two    0.206053
dtype: float64

In [107]: s.sort_index(level=1)
Out[107]: 
bar  one   -0.251905
baz  one    0.408204
foo  one   -2.213588
qux  one    1.063327
bar  two   -0.863838
baz  two    0.299368
foo  two    1.266143
qux  two    0.206053
dtype: float64 

如果MultiIndex的级别已命名,还可以将级别名称传递给sort_index

In [108]: s.index = s.index.set_names(["L1", "L2"])

In [109]: s.sort_index(level="L1")
Out[109]: 
L1   L2 
bar  one   -0.251905
 two   -0.863838
baz  one    0.408204
 two    0.299368
foo  one   -2.213588
 two    1.266143
qux  one    1.063327
 two    0.206053
dtype: float64

In [110]: s.sort_index(level="L2")
Out[110]: 
L1   L2 
bar  one   -0.251905
baz  one    0.408204
foo  one   -2.213588
qux  one    1.063327
bar  two   -0.863838
baz  two    0.299368
foo  two    1.266143
qux  two    0.206053
dtype: float64 

在更高维度的对象上,如果它们具有MultiIndex,则可以按级别对任何其他轴进行排序:

In [111]: df.T.sort_index(level=1, axis=1)
Out[111]: 
 one      zero       one      zero
 x         x         y         y
0  0.600178  2.410179  1.519970  0.132885
1  0.274230  1.450520 -0.493662 -0.023688 

即使数据未排序,索引也会起作用,但效率会非常低(并显示PerformanceWarning)。它还将返回数据的副本而不是视图:

In [112]: dfm = pd.DataFrame(
 .....:    {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": np.random.rand(4)}
 .....: )
 .....: 

In [113]: dfm = dfm.set_index(["jim", "joe"])

In [114]: dfm
Out[114]: 
 jolie
jim joe 
0   x    0.490671
 x    0.120248
1   z    0.537020
 y    0.110968

In [115]: dfm.loc[(1, 'z')]
Out[115]: 
 jolie
jim joe 
1   z    0.53702 

此外,如果尝试索引未完全按字典排序的内容,可能会引发:

In [116]: dfm.loc[(0, 'y'):(1, 'z')]
---------------------------------------------------------------------------
UnsortedIndexError  Traceback (most recent call last)
Cell In[116], line 1
----> 1 dfm.loc[(0, 'y'):(1, 'z')]

File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
  1189 maybe_callable = com.apply_if_callable(key, self.obj)
  1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
  1409 if isinstance(key, slice):
  1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
  1412 elif com.is_bool_indexer(key):
  1413     return self._getbool_axis(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
  1440     return obj.copy(deep=False)
  1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
  1445 if isinstance(indexer, slice):
  1446     return self.obj._slice(indexer, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
  6618 def slice_indexer(
  6619     self,
  6620     start: Hashable | None = None,
  6621     end: Hashable | None = None,
  6622     step: int | None = None,
  6623 ) -> slice:
  6624  """
  6625 Compute the slice indexer for input labels and step.
  6626  
 (...)
  6660 slice(1, 3, None)
  6661 """
-> 6662     start_slice, end_slice = self.slice_locs(start, end, step=step)
  6664     # return a slice
  6665     if not is_scalar(start_slice):

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2904, in MultiIndex.slice_locs(self, start, end, step)
  2852  """
  2853 For an ordered MultiIndex, compute the slice locations for input
  2854 labels.
 (...)
  2900 sequence of such.
  2901 """
  2902 # This function adds nothing to its parent implementation (the magic
  2903 # happens in get_slice_bound method), but it adds meaningful doc.
-> 2904 return super().slice_locs(start, end, step)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step)
  6877 start_slice = None
  6878 if start is not None:
-> 6879     start_slice = self.get_slice_bound(start, "left")
  6880 if start_slice is None:
  6881     start_slice = 0

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2848, in MultiIndex.get_slice_bound(self, label, side)
  2846 if not isinstance(label, tuple):
  2847     label = (label,)
-> 2848 return self._partial_tup_index(label, side=side)

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2908, in MultiIndex._partial_tup_index(self, tup, side)
  2906 def _partial_tup_index(self, tup: tuple, side: Literal["left", "right"] = "left"):
  2907     if len(tup) > self._lexsort_depth:
-> 2908         raise UnsortedIndexError(
  2909             f"Key length ({len(tup)}) was greater than MultiIndex lexsort depth "
  2910             f"({self._lexsort_depth})"
  2911         )
  2913     n = len(tup)
  2914     start, end = 0, len(self)

UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)' 

MultiIndex上的is_monotonic_increasing()方法显示索引是否已排序:

In [117]: dfm.index.is_monotonic_increasing
Out[117]: False 
In [118]: dfm = dfm.sort_index()

In [119]: dfm
Out[119]: 
 jolie
jim joe 
0   x    0.490671
 x    0.120248
1   y    0.110968
 z    0.537020

In [120]: dfm.index.is_monotonic_increasing
Out[120]: True 

现在选择按预期运行。

In [121]: dfm.loc[(0, "y"):(1, "z")]
Out[121]: 
 jolie
jim joe 
1   y    0.110968
 z    0.537020 

采取方法

与 NumPy ndarrays 类似,pandas 的IndexSeriesDataFrame还提供了take()方法,该方法在给定轴上以给定索引检索元素。给定的索引必须是整数索引位置的列表或 ndarray。take还将接受负整数作为相对于对象末尾的位置。

In [122]: index = pd.Index(np.random.randint(0, 1000, 10))

In [123]: index
Out[123]: Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')

In [124]: positions = [0, 9, 3]

In [125]: index[positions]
Out[125]: Index([214, 329, 567], dtype='int64')

In [126]: index.take(positions)
Out[126]: Index([214, 329, 567], dtype='int64')

In [127]: ser = pd.Series(np.random.randn(10))

In [128]: ser.iloc[positions]
Out[128]: 
0   -0.179666
9    1.824375
3    0.392149
dtype: float64

In [129]: ser.take(positions)
Out[129]: 
0   -0.179666
9    1.824375
3    0.392149
dtype: float64 

对于 DataFrame,给定的索引应该是指定行或列位置的一维列表或 ndarray。

In [130]: frm = pd.DataFrame(np.random.randn(5, 3))

In [131]: frm.take([1, 4, 3])
Out[131]: 
 0         1         2
1 -1.237881  0.106854 -1.276829
4  0.629675 -1.425966  1.857704
3  0.979542 -1.633678  0.615855

In [132]: frm.take([0, 2], axis=1)
Out[132]: 
 0         2
0  0.595974  0.601544
1 -1.237881 -1.276829
2 -0.767101  1.499591
3  0.979542  0.615855
4  0.629675  1.857704 

重要的是要注意,pandas 对象上的take方法不适用于布尔索引,可能会返回意外的结果。

In [133]: arr = np.random.randn(10)

In [134]: arr.take([False, False, True, True])
Out[134]: array([-1.1935, -1.1935,  0.6775,  0.6775])

In [135]: arr[[0, 1]]
Out[135]: array([-1.1935,  0.6775])

In [136]: ser = pd.Series(np.random.randn(10))

In [137]: ser.take([False, False, True, True])
Out[137]: 
0    0.233141
0    0.233141
1   -0.223540
1   -0.223540
dtype: float64

In [138]: ser.iloc[[0, 1]]
Out[138]: 
0    0.233141
1   -0.223540
dtype: float64 

最后,关于性能的一个小提示,因为take方法处理的输入范围较窄,所以它的性能可能比花式索引快得多。

In [139]: arr = np.random.randn(10000, 5)

In [140]: indexer = np.arange(10000)

In [141]: random.shuffle(indexer)

In [142]: %timeit arr[indexer]
 .....: %timeit arr.take(indexer, axis=0)
 .....: 
257 us +- 4.44 us per loop (mean +- std. dev. of 7 runs, 1,000 loops each)
79.7 us +- 1.15 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each) 
In [143]: ser = pd.Series(arr[:, 0])

In [144]: %timeit ser.iloc[indexer]
 .....: %timeit ser.take(indexer)
 .....: 
144 us +- 3.69 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
129 us +- 2 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each) 

索引类型

我们在前面的章节中已经广泛讨论了 MultiIndex。关于 DatetimeIndexPeriodIndex 的文档在 这里,关于 TimedeltaIndex 的文档在 这里。

在接下来的子节中,我们将突出一些其他索引类型。

CategoricalIndex

CategoricalIndex 是一种支持重复索引的索引类型。这是一个围绕着 Categorical 的容器,允许高效地索引和存储具有大量重复元素的索引。

In [145]: from pandas.api.types import CategoricalDtype

In [146]: df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")})

In [147]: df["B"] = df["B"].astype(CategoricalDtype(list("cab")))

In [148]: df
Out[148]: 
 A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a

In [149]: df.dtypes
Out[149]: 
A       int64
B    category
dtype: object

In [150]: df["B"].cat.categories
Out[150]: Index(['c', 'a', 'b'], dtype='object') 

设置索引将创建一个 CategoricalIndex

In [151]: df2 = df.set_index("B")

In [152]: df2.index
Out[152]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B') 

使用 __getitem__/.iloc/.loc 进行索引类似于具有重复项的 Index。索引器必须位于类别中,否则操作将引发 KeyError

In [153]: df2.loc["a"]
Out[153]: 
 A
B 
a  0
a  1
a  5 

索引后,CategoricalIndex 被保留

In [154]: df2.loc["a"].index
Out[154]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B') 

对索引进行排序将按照类别的顺序排序(回想一下,我们使用 CategoricalDtype(list('cab')) 创建了索引,因此排序顺序是 cab)。

In [155]: df2.sort_index()
Out[155]: 
 A
B 
c  4
a  0
a  1
a  5
b  2
b  3 

对索引的分组操作也将保留索引的特性。

In [156]: df2.groupby(level=0, observed=True).sum()
Out[156]: 
 A
B 
c  4
a  6
b  5

In [157]: df2.groupby(level=0, observed=True).sum().index
Out[157]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B') 

重新索引操作将返回基于传递的索引器类型的结果索引。传递列表将返回一个普通的 Index;使用 Categorical 进行索引将返回一个 CategoricalIndex,根据传递的 Categorical dtype 的类别进行索引。这允许任意索引这些,即使值不在类别中,类似于如何重新索引任何 pandas 索引。

In [158]: df3 = pd.DataFrame(
 .....:    {"A": np.arange(3), "B": pd.Series(list("abc")).astype("category")}
 .....: )
 .....: 

In [159]: df3 = df3.set_index("B")

In [160]: df3
Out[160]: 
 A
B 
a  0
b  1
c  2 
In [161]: df3.reindex(["a", "e"])
Out[161]: 
 A
B 
a  0.0
e  NaN

In [162]: df3.reindex(["a", "e"]).index
Out[162]: Index(['a', 'e'], dtype='object', name='B')

In [163]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe")))
Out[163]: 
 A
B 
a  0.0
e  NaN

In [164]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe"))).index
Out[164]: CategoricalIndex(['a', 'e'], categories=['a', 'b', 'e'], ordered=False, dtype='category', name='B') 

警告

CategoricalIndex 进行的重塑和比较操作必须具有相同的类别,否则将引发 TypeError

In [165]: df4 = pd.DataFrame({"A": np.arange(2), "B": list("ba")})

In [166]: df4["B"] = df4["B"].astype(CategoricalDtype(list("ab")))

In [167]: df4 = df4.set_index("B")

In [168]: df4.index
Out[168]: CategoricalIndex(['b', 'a'], categories=['a', 'b'], ordered=False, dtype='category', name='B')

In [169]: df5 = pd.DataFrame({"A": np.arange(2), "B": list("bc")})

In [170]: df5["B"] = df5["B"].astype(CategoricalDtype(list("bc")))

In [171]: df5 = df5.set_index("B")

In [172]: df5.index
Out[172]: CategoricalIndex(['b', 'c'], categories=['b', 'c'], ordered=False, dtype='category', name='B') 
In [173]: pd.concat([df4, df5])
Out[173]: 
 A
B 
b  0
a  1
b  0
c  1 
```  ### RangeIndex

`RangeIndex` 是 `Index` 的子类,为所有 `DataFrame` 和 `Series` 对象提供默认索引。`RangeIndex` 是 `Index` 的优化版本,可以表示单调有序集合。这类似于 Python 的 [range 类型](https://docs.python.org/3/library/stdtypes.html#typesseq-range)。`RangeIndex` 将始终具有 `int64` dtype。

```py
In [174]: idx = pd.RangeIndex(5)

In [175]: idx
Out[175]: RangeIndex(start=0, stop=5, step=1) 

RangeIndex 是所有 DataFrameSeries 对象的默认索引:

In [176]: ser = pd.Series([1, 2, 3])

In [177]: ser.index
Out[177]: RangeIndex(start=0, stop=3, step=1)

In [178]: df = pd.DataFrame([[1, 2], [3, 4]])

In [179]: df.index
Out[179]: RangeIndex(start=0, stop=2, step=1)

In [180]: df.columns
Out[180]: RangeIndex(start=0, stop=2, step=1) 

一个 RangeIndex 的行为类似于一个具有 int64 数据类型的 Index,对 RangeIndex 进行操作,其结果无法由 RangeIndex 表示,但应具有整数数据类型,并将转换为具有 int64Index。例如:

In [181]: idx[[0, 2]]
Out[181]: Index([0, 2], dtype='int64') 
```  ### 区间索引

`IntervalIndex` 以及其自身的数据类型 `IntervalDtype` 以及 `Interval` 标量类型,为 pandas 提供了区间表示的一流支持。

`IntervalIndex` 允许一些独特的索引,并且还用作 `cut()` 和 `qcut()` 中的类别的返回类型。

#### 使用 `IntervalIndex` 进行索引

`IntervalIndex` 可以作为索引用于 `Series` 和 `DataFrame`。

```py
In [182]: df = pd.DataFrame(
 .....:    {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
 .....: )
 .....: 

In [183]: df
Out[183]: 
 A
(0, 1]  1
(1, 2]  2
(2, 3]  3
(3, 4]  4 

通过 .loc 进行基于标签的索引,沿着区间的边缘工作方式如您所期望的那样,选择特定的区间。

In [184]: df.loc[2]
Out[184]: 
A    2
Name: (1, 2], dtype: int64

In [185]: df.loc[[2, 3]]
Out[185]: 
 A
(1, 2]  2
(2, 3]  3 

如果您选择一个 包含在 区间内的标签,这也将选择该区间。

In [186]: df.loc[2.5]
Out[186]: 
A    3
Name: (2, 3], dtype: int64

In [187]: df.loc[[2.5, 3.5]]
Out[187]: 
 A
(2, 3]  3
(3, 4]  4 

使用区间进行选择将仅返回精确匹配。

In [188]: df.loc[pd.Interval(1, 2)]
Out[188]: 
A    2
Name: (1, 2], dtype: int64 

尝试选择不完全包含在 IntervalIndex 中的 Interval 将引发 KeyError

In [189]: df.loc[pd.Interval(0.5, 2.5)]
---------------------------------------------------------------------------
KeyError  Traceback (most recent call last)
Cell In[189], line 1
----> 1 df.loc[pd.Interval(0.5, 2.5)]

File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
  1189 maybe_callable = com.apply_if_callable(key, self.obj)
  1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1431, in _LocIndexer._getitem_axis(self, key, axis)
  1429 # fall thru to straight lookup
  1430 self._validate_key(key, axis)
-> 1431 return self._get_label(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1381, in _LocIndexer._get_label(self, label, axis)
  1379 def _get_label(self, label, axis: AxisInt):
  1380     # GH#5567 this will fail if the label is not present in the axis.
-> 1381     return self.obj.xs(label, axis=axis)

File ~/work/pandas/pandas/pandas/core/generic.py:4301, in NDFrame.xs(self, key, axis, level, drop_level)
  4299             new_index = index[loc]
  4300 else:
-> 4301     loc = index.get_loc(key)
  4303     if isinstance(loc, np.ndarray):
  4304         if loc.dtype == np.bool_:

File ~/work/pandas/pandas/pandas/core/indexes/interval.py:678, in IntervalIndex.get_loc(self, key)
  676 matches = mask.sum()
  677 if matches == 0:
--> 678     raise KeyError(key)
  679 if matches == 1:
  680     return mask.argmax()

KeyError: Interval(0.5, 2.5, closed='right') 

使用 overlaps() 方法可以选择与给定 Interval 重叠的所有 Intervals,从而创建一个布尔型索引器。

In [190]: idxr = df.index.overlaps(pd.Interval(0.5, 2.5))

In [191]: idxr
Out[191]: array([ True,  True,  True, False])

In [192]: df[idxr]
Out[192]: 
 A
(0, 1]  1
(1, 2]  2
(2, 3]  3 

使用 cutqcut 进行数据分箱

cut()qcut() 都返回一个 Categorical 对象,它们创建的区间存储在其 .categories 属性中作为一个 IntervalIndex

In [193]: c = pd.cut(range(4), bins=2)

In [194]: c
Out[194]: 
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]

In [195]: c.categories
Out[195]: IntervalIndex([(-0.003, 1.5], (1.5, 3.0]], dtype='interval[float64, right]') 

cut() 还接受一个 IntervalIndex 作为其 bins 参数,这使得一种有用的 pandas 习语成为可能。首先,我们调用 cut() 与一些数据和 bins 设置为固定数量,以生成区间。然后,我们将 .categories 的值作为 bins 参数传递给后续调用 cut() 的新数据,供新数据分配到相同的区间中。

In [196]: pd.cut([0, 3, 5, 1], bins=c.categories)
Out[196]: 
[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]] 

任何落在所有区间之外的值将被分配一个 NaN 值。

生成区间范围

如果我们需要按照常规频率创建区间,则可以使用 interval_range() 函数,使用 startendperiods 的各种组合来创建一个 IntervalIndexinterval_range 的默认频率为数字区间为 1,日期时间区间为日历日:

In [197]: pd.interval_range(start=0, end=5)
Out[197]: IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]], dtype='interval[int64, right]')

In [198]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4)
Out[198]: 
IntervalIndex([(2017-01-01 00:00:00, 2017-01-02 00:00:00],
 (2017-01-02 00:00:00, 2017-01-03 00:00:00],
 (2017-01-03 00:00:00, 2017-01-04 00:00:00],
 (2017-01-04 00:00:00, 2017-01-05 00:00:00]],
 dtype='interval[datetime64[ns], right]')

In [199]: pd.interval_range(end=pd.Timedelta("3 days"), periods=3)
Out[199]: 
IntervalIndex([(0 days 00:00:00, 1 days 00:00:00],
 (1 days 00:00:00, 2 days 00:00:00],
 (2 days 00:00:00, 3 days 00:00:00]],
 dtype='interval[timedelta64[ns], right]') 

freq参数可用于指定非默认频率,并且可以利用各种频率别名与类似于日期时间的间隔:

In [200]: pd.interval_range(start=0, periods=5, freq=1.5)
Out[200]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]], dtype='interval[float64, right]')

In [201]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4, freq="W")
Out[201]: 
IntervalIndex([(2017-01-01 00:00:00, 2017-01-08 00:00:00],
 (2017-01-08 00:00:00, 2017-01-15 00:00:00],
 (2017-01-15 00:00:00, 2017-01-22 00:00:00],
 (2017-01-22 00:00:00, 2017-01-29 00:00:00]],
 dtype='interval[datetime64[ns], right]')

In [202]: pd.interval_range(start=pd.Timedelta("0 days"), periods=3, freq="9h")
Out[202]: 
IntervalIndex([(0 days 00:00:00, 0 days 09:00:00],
 (0 days 09:00:00, 0 days 18:00:00],
 (0 days 18:00:00, 1 days 03:00:00]],
 dtype='interval[timedelta64[ns], right]') 

此外,closed参数可用于指定间隔关闭的哪一侧。默认情况下,间隔在右侧关闭。

In [203]: pd.interval_range(start=0, end=4, closed="both")
Out[203]: IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]], dtype='interval[int64, both]')

In [204]: pd.interval_range(start=0, end=4, closed="neither")
Out[204]: IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)], dtype='interval[int64, neither]') 

指定startendperiods将生成从startend的一系列均匀间隔,包括periods个元素在生成的IntervalIndex中:

In [205]: pd.interval_range(start=0, end=6, periods=4)
Out[205]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]], dtype='interval[float64, right]')

In [206]: pd.interval_range(pd.Timestamp("2018-01-01"), pd.Timestamp("2018-02-28"), periods=3)
Out[206]: 
IntervalIndex([(2018-01-01 00:00:00, 2018-01-20 08:00:00],
 (2018-01-20 08:00:00, 2018-02-08 16:00:00],
 (2018-02-08 16:00:00, 2018-02-28 00:00:00]],
 dtype='interval[datetime64[ns], right]') 
```  ### CategoricalIndex

`CategoricalIndex`是一种支持具有重复项索引的索引类型。这是围绕`Categorical`的容器,允许高效地索引和存储具有大量重复元素的索引。

```py
In [145]: from pandas.api.types import CategoricalDtype

In [146]: df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")})

In [147]: df["B"] = df["B"].astype(CategoricalDtype(list("cab")))

In [148]: df
Out[148]: 
 A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a

In [149]: df.dtypes
Out[149]: 
A       int64
B    category
dtype: object

In [150]: df["B"].cat.categories
Out[150]: Index(['c', 'a', 'b'], dtype='object') 

设置索引将创建一个CategoricalIndex

In [151]: df2 = df.set_index("B")

In [152]: df2.index
Out[152]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B') 

使用__getitem__/.iloc/.loc进行索引操作与具有重复项的Index类似。索引器必须在类别中,否则操作将引发KeyError

In [153]: df2.loc["a"]
Out[153]: 
 A
B 
a  0
a  1
a  5 

在索引后,CategoricalIndex保留

In [154]: df2.loc["a"].index
Out[154]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B') 

对索引进行排序将按照类别的顺序进行排序(回想我们使用CategoricalDtype(list('cab'))创建了索引,所以排序顺序是cab)。

In [155]: df2.sort_index()
Out[155]: 
 A
B 
c  4
a  0
a  1
a  5
b  2
b  3 

对索引进行分组操作将保留索引的性质。

In [156]: df2.groupby(level=0, observed=True).sum()
Out[156]: 
 A
B 
c  4
a  6
b  5

In [157]: df2.groupby(level=0, observed=True).sum().index
Out[157]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B') 

重新索引操作将根据传递的索引器类型返回结果索引。传递列表将返回普通的Index;使用Categorical进行索引将返回一个CategoricalIndex,根据传递的Categorical数据类型的类别进行索引。这允许任意索引这些值,即使值不在类别中,类似于如何重新索引任何pandas 索引。

In [158]: df3 = pd.DataFrame(
 .....:    {"A": np.arange(3), "B": pd.Series(list("abc")).astype("category")}
 .....: )
 .....: 

In [159]: df3 = df3.set_index("B")

In [160]: df3
Out[160]: 
 A
B 
a  0
b  1
c  2 
In [161]: df3.reindex(["a", "e"])
Out[161]: 
 A
B 
a  0.0
e  NaN

In [162]: df3.reindex(["a", "e"]).index
Out[162]: Index(['a', 'e'], dtype='object', name='B')

In [163]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe")))
Out[163]: 
 A
B 
a  0.0
e  NaN

In [164]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe"))).index
Out[164]: CategoricalIndex(['a', 'e'], categories=['a', 'b', 'e'], ordered=False, dtype='category', name='B') 

警告

CategoricalIndex进行重塑和比较操作必须具有相同的类别,否则将引发TypeError

In [165]: df4 = pd.DataFrame({"A": np.arange(2), "B": list("ba")})

In [166]: df4["B"] = df4["B"].astype(CategoricalDtype(list("ab")))

In [167]: df4 = df4.set_index("B")

In [168]: df4.index
Out[168]: CategoricalIndex(['b', 'a'], categories=['a', 'b'], ordered=False, dtype='category', name='B')

In [169]: df5 = pd.DataFrame({"A": np.arange(2), "B": list("bc")})

In [170]: df5["B"] = df5["B"].astype(CategoricalDtype(list("bc")))

In [171]: df5 = df5.set_index("B")

In [172]: df5.index
Out[172]: CategoricalIndex(['b', 'c'], categories=['b', 'c'], ordered=False, dtype='category', name='B') 
In [173]: pd.concat([df4, df5])
Out[173]: 
 A
B 
b  0
a  1
b  0
c  1 

RangeIndex

RangeIndexDataFrameSeries对象的默认索引的子类。RangeIndexIndex的优化版本,可以表示单调有序集合。这类似于 Python 的range 类型RangeIndex始终具有int64数据类型。

In [174]: idx = pd.RangeIndex(5)

In [175]: idx
Out[175]: RangeIndex(start=0, stop=5, step=1) 

RangeIndex是所有DataFrameSeries对象的默认索引:

In [176]: ser = pd.Series([1, 2, 3])

In [177]: ser.index
Out[177]: RangeIndex(start=0, stop=3, step=1)

In [178]: df = pd.DataFrame([[1, 2], [3, 4]])

In [179]: df.index
Out[179]: RangeIndex(start=0, stop=2, step=1)

In [180]: df.columns
Out[180]: RangeIndex(start=0, stop=2, step=1) 

一个 RangeIndex 的行为类似于具有 int64 dtype 的 Index,对 RangeIndex 的操作,其结果不能由 RangeIndex 表示,但应该具有整数 dtype,将转换为具有 int64Index。例如:

In [181]: idx[[0, 2]]
Out[181]: Index([0, 2], dtype='int64') 

区间索引

IntervalIndex 以及它自己的 dtype,IntervalDtype 以及 Interval 标量类型,允许 pandas 对区间表示法提供一流支持。

IntervalIndex 允许一些独特的索引,并且也被用作 cut()qcut() 中的类别的返回类型。

使用 IntervalIndex 进行索引

IntervalIndex 可以在 SeriesDataFrame 中作为索引使用。

In [182]: df = pd.DataFrame(
 .....:    {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
 .....: )
 .....: 

In [183]: df
Out[183]: 
 A
(0, 1]  1
(1, 2]  2
(2, 3]  3
(3, 4]  4 

通过 .loc 进行基于标签的索引,沿着区间的边缘工作如你所期望的那样,选择特定的区间。

In [184]: df.loc[2]
Out[184]: 
A    2
Name: (1, 2], dtype: int64

In [185]: df.loc[[2, 3]]
Out[185]: 
 A
(1, 2]  2
(2, 3]  3 

如果选择一个 包含 在区间内的标签,这也将选择该区间。

In [186]: df.loc[2.5]
Out[186]: 
A    3
Name: (2, 3], dtype: int64

In [187]: df.loc[[2.5, 3.5]]
Out[187]: 
 A
(2, 3]  3
(3, 4]  4 

使用 Interval 进行选择将只返回精确匹配。

In [188]: df.loc[pd.Interval(1, 2)]
Out[188]: 
A    2
Name: (1, 2], dtype: int64 

尝试选择一个不完全包含在 IntervalIndex 中的 Interval 将引发一个 KeyError

In [189]: df.loc[pd.Interval(0.5, 2.5)]
---------------------------------------------------------------------------
KeyError  Traceback (most recent call last)
Cell In[189], line 1
----> 1 df.loc[pd.Interval(0.5, 2.5)]

File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
  1189 maybe_callable = com.apply_if_callable(key, self.obj)
  1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1431, in _LocIndexer._getitem_axis(self, key, axis)
  1429 # fall thru to straight lookup
  1430 self._validate_key(key, axis)
-> 1431 return self._get_label(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1381, in _LocIndexer._get_label(self, label, axis)
  1379 def _get_label(self, label, axis: AxisInt):
  1380     # GH#5567 this will fail if the label is not present in the axis.
-> 1381     return self.obj.xs(label, axis=axis)

File ~/work/pandas/pandas/pandas/core/generic.py:4301, in NDFrame.xs(self, key, axis, level, drop_level)
  4299             new_index = index[loc]
  4300 else:
-> 4301     loc = index.get_loc(key)
  4303     if isinstance(loc, np.ndarray):
  4304         if loc.dtype == np.bool_:

File ~/work/pandas/pandas/pandas/core/indexes/interval.py:678, in IntervalIndex.get_loc(self, key)
  676 matches = mask.sum()
  677 if matches == 0:
--> 678     raise KeyError(key)
  679 if matches == 1:
  680     return mask.argmax()

KeyError: Interval(0.5, 2.5, closed='right') 

可以使用 overlaps() 方法创建一个布尔索引器,选择与给定 Interval 重叠的所有 Intervals

In [190]: idxr = df.index.overlaps(pd.Interval(0.5, 2.5))

In [191]: idxr
Out[191]: array([ True,  True,  True, False])

In [192]: df[idxr]
Out[192]: 
 A
(0, 1]  1
(1, 2]  2
(2, 3]  3 

使用 cutqcut 对数据进行分箱

cut()qcut() 都返回一个 Categorical 对象,它们创建的区间存储在其 .categories 属性中作为一个 IntervalIndex

In [193]: c = pd.cut(range(4), bins=2)

In [194]: c
Out[194]: 
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]

In [195]: c.categories
Out[195]: IntervalIndex([(-0.003, 1.5], (1.5, 3.0]], dtype='interval[float64, right]') 

cut() 还接受 IntervalIndex 作为其 bins 参数,这使得 pandas 中有一个有用的习惯用法。首先,我们使用一些数据和将 bins 设置为一个固定数字来调用 cut(),以生成区间。然后,我们将.categories 的值作为后续对 cut() 的调用中的 bins 参数传递,提供新的数据,这些数据将被分配到相同的区间中。

In [196]: pd.cut([0, 3, 5, 1], bins=c.categories)
Out[196]: 
[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]] 

所有落在所有区间之外的值将被分配一个 NaN 值。

生成区间范围

如果我们需要定期频率的区间,我们可以使用 interval_range() 函数,使用各种组合的 startendperiods 来创建一个 IntervalIndex。对于 interval_range 的默认频率是数值区间为 1,日期时间区间为日历日:

In [197]: pd.interval_range(start=0, end=5)
Out[197]: IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]], dtype='interval[int64, right]')

In [198]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4)
Out[198]: 
IntervalIndex([(2017-01-01 00:00:00, 2017-01-02 00:00:00],
 (2017-01-02 00:00:00, 2017-01-03 00:00:00],
 (2017-01-03 00:00:00, 2017-01-04 00:00:00],
 (2017-01-04 00:00:00, 2017-01-05 00:00:00]],
 dtype='interval[datetime64[ns], right]')

In [199]: pd.interval_range(end=pd.Timedelta("3 days"), periods=3)
Out[199]: 
IntervalIndex([(0 days 00:00:00, 1 days 00:00:00],
 (1 days 00:00:00, 2 days 00:00:00],
 (2 days 00:00:00, 3 days 00:00:00]],
 dtype='interval[timedelta64[ns], right]') 

freq 参数可以用于指定非默认频率,并且可以利用各种频率别名,以及类似于日期时间的间隔:

In [200]: pd.interval_range(start=0, periods=5, freq=1.5)
Out[200]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]], dtype='interval[float64, right]')

In [201]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4, freq="W")
Out[201]: 
IntervalIndex([(2017-01-01 00:00:00, 2017-01-08 00:00:00],
 (2017-01-08 00:00:00, 2017-01-15 00:00:00],
 (2017-01-15 00:00:00, 2017-01-22 00:00:00],
 (2017-01-22 00:00:00, 2017-01-29 00:00:00]],
 dtype='interval[datetime64[ns], right]')

In [202]: pd.interval_range(start=pd.Timedelta("0 days"), periods=3, freq="9h")
Out[202]: 
IntervalIndex([(0 days 00:00:00, 0 days 09:00:00],
 (0 days 09:00:00, 0 days 18:00:00],
 (0 days 18:00:00, 1 days 03:00:00]],
 dtype='interval[timedelta64[ns], right]') 

此外,closed 参数可用于指定区间关闭的哪一侧(或哪些侧)。区间默认在右侧关闭。

In [203]: pd.interval_range(start=0, end=4, closed="both")
Out[203]: IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]], dtype='interval[int64, both]')

In [204]: pd.interval_range(start=0, end=4, closed="neither")
Out[204]: IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)], dtype='interval[int64, neither]') 

指定 startendperiods 将从 startend(包括 end)生成一系列等间隔的区间,结果为 IntervalIndex 中的 periods 个元素:

In [205]: pd.interval_range(start=0, end=6, periods=4)
Out[205]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]], dtype='interval[float64, right]')

In [206]: pd.interval_range(pd.Timestamp("2018-01-01"), pd.Timestamp("2018-02-28"), periods=3)
Out[206]: 
IntervalIndex([(2018-01-01 00:00:00, 2018-01-20 08:00:00],
 (2018-01-20 08:00:00, 2018-02-08 16:00:00],
 (2018-02-08 16:00:00, 2018-02-28 00:00:00]],
 dtype='interval[datetime64[ns], right]') 

使用 IntervalIndex 进行索引

IntervalIndex 可以在 SeriesDataFrame 中用作索引。

In [182]: df = pd.DataFrame(
 .....:    {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
 .....: )
 .....: 

In [183]: df
Out[183]: 
 A
(0, 1]  1
(1, 2]  2
(2, 3]  3
(3, 4]  4 

基于标签的索引通过 .loc 在区间的边缘按预期工作,选择特定的区间。

In [184]: df.loc[2]
Out[184]: 
A    2
Name: (1, 2], dtype: int64

In [185]: df.loc[[2, 3]]
Out[185]: 
 A
(1, 2]  2
(2, 3]  3 

如果选择的标签包含在区间内,这也将选择该区间。

In [186]: df.loc[2.5]
Out[186]: 
A    3
Name: (2, 3], dtype: int64

In [187]: df.loc[[2.5, 3.5]]
Out[187]: 
 A
(2, 3]  3
(3, 4]  4 

使用 Interval 进行选择将只返回精确匹配。

In [188]: df.loc[pd.Interval(1, 2)]
Out[188]: 
A    2
Name: (1, 2], dtype: int64 

尝试选择不完全包含在 IntervalIndex 中的 Interval 将引发 KeyError

In [189]: df.loc[pd.Interval(0.5, 2.5)]
---------------------------------------------------------------------------
KeyError  Traceback (most recent call last)
Cell In[189], line 1
----> 1 df.loc[pd.Interval(0.5, 2.5)]

File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
  1189 maybe_callable = com.apply_if_callable(key, self.obj)
  1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1431, in _LocIndexer._getitem_axis(self, key, axis)
  1429 # fall thru to straight lookup
  1430 self._validate_key(key, axis)
-> 1431 return self._get_label(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1381, in _LocIndexer._get_label(self, label, axis)
  1379 def _get_label(self, label, axis: AxisInt):
  1380     # GH#5567 this will fail if the label is not present in the axis.
-> 1381     return self.obj.xs(label, axis=axis)

File ~/work/pandas/pandas/pandas/core/generic.py:4301, in NDFrame.xs(self, key, axis, level, drop_level)
  4299             new_index = index[loc]
  4300 else:
-> 4301     loc = index.get_loc(key)
  4303     if isinstance(loc, np.ndarray):
  4304         if loc.dtype == np.bool_:

File ~/work/pandas/pandas/pandas/core/indexes/interval.py:678, in IntervalIndex.get_loc(self, key)
  676 matches = mask.sum()
  677 if matches == 0:
--> 678     raise KeyError(key)
  679 if matches == 1:
  680     return mask.argmax()

KeyError: Interval(0.5, 2.5, closed='right') 

选择与给定 Interval 重叠的所有 Intervals 可以使用 overlaps() 方法创建布尔索引器。

In [190]: idxr = df.index.overlaps(pd.Interval(0.5, 2.5))

In [191]: idxr
Out[191]: array([ True,  True,  True, False])

In [192]: df[idxr]
Out[192]: 
 A
(0, 1]  1
(1, 2]  2
(2, 3]  3 

使用 cutqcut 对数据进行分箱

cut()qcut() 都返回一个 Categorical 对象,并且它们创建的区间被存储为其 .categories 属性中的 IntervalIndex

In [193]: c = pd.cut(range(4), bins=2)

In [194]: c
Out[194]: 
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]

In [195]: c.categories
Out[195]: IntervalIndex([(-0.003, 1.5], (1.5, 3.0]], dtype='interval[float64, right]') 

cut() 还接受一个 IntervalIndex 作为其 bins 参数,这使得一个有用的 pandas 习惯用法成为可能。首先,我们用一些数据和 bins 设置为一个固定数目来调用 cut(),以生成区间。然后,我们将 .categories 的值作为 bins 参数传递给后续调用 cut(),提供将被分箱到相同区间的新数据。

In [196]: pd.cut([0, 3, 5, 1], bins=c.categories)
Out[196]: 
[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]] 

所有落在所有区间之外的值都将被分配一个 NaN 值。

生成区间范围

如果我们需要定期的间隔,我们可以使用 interval_range() 函数使用各种组合的 startendperiods 创建 IntervalIndex。对于数值区间,默认频率为 1,对于类似于日期时间的区间,默认为日历日:

In [197]: pd.interval_range(start=0, end=5)
Out[197]: IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]], dtype='interval[int64, right]')

In [198]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4)
Out[198]: 
IntervalIndex([(2017-01-01 00:00:00, 2017-01-02 00:00:00],
 (2017-01-02 00:00:00, 2017-01-03 00:00:00],
 (2017-01-03 00:00:00, 2017-01-04 00:00:00],
 (2017-01-04 00:00:00, 2017-01-05 00:00:00]],
 dtype='interval[datetime64[ns], right]')

In [199]: pd.interval_range(end=pd.Timedelta("3 days"), periods=3)
Out[199]: 
IntervalIndex([(0 days 00:00:00, 1 days 00:00:00],
 (1 days 00:00:00, 2 days 00:00:00],
 (2 days 00:00:00, 3 days 00:00:00]],
 dtype='interval[timedelta64[ns], right]') 

freq 参数可以用于指定非默认频率,并且可以利用各种频率别名,以及类似于日期时间的间隔:

In [200]: pd.interval_range(start=0, periods=5, freq=1.5)
Out[200]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]], dtype='interval[float64, right]')

In [201]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4, freq="W")
Out[201]: 
IntervalIndex([(2017-01-01 00:00:00, 2017-01-08 00:00:00],
 (2017-01-08 00:00:00, 2017-01-15 00:00:00],
 (2017-01-15 00:00:00, 2017-01-22 00:00:00],
 (2017-01-22 00:00:00, 2017-01-29 00:00:00]],
 dtype='interval[datetime64[ns], right]')

In [202]: pd.interval_range(start=pd.Timedelta("0 days"), periods=3, freq="9h")
Out[202]: 
IntervalIndex([(0 days 00:00:00, 0 days 09:00:00],
 (0 days 09:00:00, 0 days 18:00:00],
 (0 days 18:00:00, 1 days 03:00:00]],
 dtype='interval[timedelta64[ns], right]') 

此外,closed 参数可用于指定区间关闭的哪一侧(或哪些侧)。区间默认在右侧关闭。

In [203]: pd.interval_range(start=0, end=4, closed="both")
Out[203]: IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]], dtype='interval[int64, both]')

In [204]: pd.interval_range(start=0, end=4, closed="neither")
Out[204]: IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)], dtype='interval[int64, neither]') 

指定 startendperiods 将从 startend 生成一系列均匀间隔的区间,结果将包含 IntervalIndex 中的 periods 个元素:

In [205]: pd.interval_range(start=0, end=6, periods=4)
Out[205]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]], dtype='interval[float64, right]')

In [206]: pd.interval_range(pd.Timestamp("2018-01-01"), pd.Timestamp("2018-02-28"), periods=3)
Out[206]: 
IntervalIndex([(2018-01-01 00:00:00, 2018-01-20 08:00:00],
 (2018-01-20 08:00:00, 2018-02-08 16:00:00],
 (2018-02-08 16:00:00, 2018-02-28 00:00:00]],
 dtype='interval[datetime64[ns], right]') 

杂项索引 FAQ

整数索引

使用整数轴标签进行基于标签的索引是一个棘手的问题。它在邮件列表和科学 Python 社区的各个成员之间被广泛讨论。在 pandas 中,我们的一般观点是标签比整数位置更重要。因此,只有使用标准工具如 .loc 进行基于标签的索引。以下代码将生成异常:

In [207]: s = pd.Series(range(5))

In [208]: s[-1]
---------------------------------------------------------------------------
ValueError  Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/range.py:413, in RangeIndex.get_loc(self, key)
  412 try:
--> 413     return self._range.index(new_key)
  414 except ValueError as err:

ValueError: -1 is not in range

The above exception was the direct cause of the following exception:

KeyError  Traceback (most recent call last)
Cell In[208], line 1
----> 1 s[-1]

File ~/work/pandas/pandas/pandas/core/series.py:1121, in Series.__getitem__(self, key)
  1118     return self._values[key]
  1120 elif key_is_scalar:
-> 1121     return self._get_value(key)
  1123 # Convert generator to list before going through hashable part
  1124 # (We will iterate through the generator there to check for slices)
  1125 if is_iterator(key):

File ~/work/pandas/pandas/pandas/core/series.py:1237, in Series._get_value(self, label, takeable)
  1234     return self._values[label]
  1236 # Similar to Index.get_value, but we do not fall back to positional
-> 1237 loc = self.index.get_loc(label)
  1239 if is_integer(loc):
  1240     return self._values[loc]

File ~/work/pandas/pandas/pandas/core/indexes/range.py:415, in RangeIndex.get_loc(self, key)
  413         return self._range.index(new_key)
  414     except ValueError as err:
--> 415         raise KeyError(key) from err
  416 if isinstance(key, Hashable):
  417     raise KeyError(key)

KeyError: -1

In [209]: df = pd.DataFrame(np.random.randn(5, 4))

In [210]: df
Out[210]: 
 0         1         2         3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703  1.347213 -0.243487  0.514704
2  1.162969 -0.287725 -0.179734  0.993962
3 -0.212673  0.909872 -0.733333 -0.349893
4  0.456434 -0.306735  0.553396  0.166221

In [211]: df.loc[-2:]
Out[211]: 
 0         1         2         3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703  1.347213 -0.243487  0.514704
2  1.162969 -0.287725 -0.179734  0.993962
3 -0.212673  0.909872 -0.733333 -0.349893
4  0.456434 -0.306735  0.553396  0.166221 

这个刻意的决定是为了防止歧义和微妙的错误(许多用户报告在 API 更改为停止“回退”到基于位置的索引时发现错误)。

非单调索引需要精确匹配

如果 SeriesDataFrame 的索引单调递增或递减,则标签的边界可以超出索引的范围,就像对普通 Python list 进行切片索引一样。可以使用 is_monotonic_increasing()is_monotonic_decreasing() 属性测试索引的单调性。

In [212]: df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=["data"], data=list(range(5)))

In [213]: df.index.is_monotonic_increasing
Out[213]: True

# no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4:
In [214]: df.loc[0:4, :]
Out[214]: 
 data
2     0
3     1
3     2
4     3

# slice is are outside the index, so empty DataFrame is returned
In [215]: df.loc[13:15, :]
Out[215]: 
Empty DataFrame
Columns: [data]
Index: [] 

另一方面,如果索引不是单调的,那么切片的两个边界都必须是索引的唯一成员。

In [216]: df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=["data"], data=list(range(6)))

In [217]: df.index.is_monotonic_increasing
Out[217]: False

# OK because 2 and 4 are in the index
In [218]: df.loc[2:4, :]
Out[218]: 
 data
2     0
3     1
1     2
4     3 
 # 0 is not in the index
In [219]: df.loc[0:4, :]
---------------------------------------------------------------------------
KeyError  Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
  3804 try:
-> 3805     return self._engine.get_loc(casted_key)
  3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:191, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:234, in pandas._libs.index.IndexEngine._get_loc_duplicates()

File index.pyx:242, in pandas._libs.index.IndexEngine._maybe_get_bool_indexer()

File index.pyx:134, in pandas._libs.index._unpack_bool_indexer()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError  Traceback (most recent call last)
Cell In[219], line 1
----> 1 df.loc[0:4, :]

File ~/work/pandas/pandas/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
  1182     if self._is_scalar_access(key):
  1183         return self.obj._get_value(*key, takeable=self._takeable)
-> 1184     return self._getitem_tuple(key)
  1185 else:
  1186     # we by definition only have the 0th axis
  1187     axis = self.axis or 0

File ~/work/pandas/pandas/pandas/core/indexing.py:1377, in _LocIndexer._getitem_tuple(self, tup)
  1374 if self._multi_take_opportunity(tup):
  1375     return self._multi_take(tup)
-> 1377 return self._getitem_tuple_same_dim(tup)

File ~/work/pandas/pandas/pandas/core/indexing.py:1020, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
  1017 if com.is_null_slice(key):
  1018     continue
-> 1020 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
  1021 # We should never have retval.ndim < self.ndim, as that should
  1022 #  be handled by the _getitem_lowerdim call above.
  1023 assert retval.ndim == self.ndim

File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
  1409 if isinstance(key, slice):
  1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
  1412 elif com.is_bool_indexer(key):
  1413     return self._getbool_axis(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
  1440     return obj.copy(deep=False)
  1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
  1445 if isinstance(indexer, slice):
  1446     return self.obj._slice(indexer, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
  6618 def slice_indexer(
  6619     self,
  6620     start: Hashable | None = None,
  6621     end: Hashable | None = None,
  6622     step: int | None = None,
  6623 ) -> slice:
  6624  """
  6625 Compute the slice indexer for input labels and step.
  6626  
 (...)
  6660 slice(1, 3, None)
  6661 """
-> 6662     start_slice, end_slice = self.slice_locs(start, end, step=step)
  6664     # return a slice
  6665     if not is_scalar(start_slice):

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step)
  6877 start_slice = None
  6878 if start is not None:
-> 6879     start_slice = self.get_slice_bound(start, "left")
  6880 if start_slice is None:
  6881     start_slice = 0

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6804, in Index.get_slice_bound(self, label, side)
  6801         return self._searchsorted_monotonic(label, side)
  6802     except ValueError:
  6803         # raise the original KeyError
-> 6804         raise err
  6806 if isinstance(slc, np.ndarray):
  6807     # get_loc may return a boolean array, which
  6808     # is OK as long as they are representable by a slice.
  6809     assert is_bool_dtype(slc.dtype)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6798, in Index.get_slice_bound(self, label, side)
  6796 # we need to look up the label
  6797 try:
-> 6798     slc = self.get_loc(label)
  6799 except KeyError as err:
  6800     try:

File ~/work/pandas/pandas/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
  3807     if isinstance(casted_key, slice) or (
  3808         isinstance(casted_key, abc.Iterable)
  3809         and any(isinstance(x, slice) for x in casted_key)
  3810     ):
  3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
  3813 except TypeError:
  3814     # If we have a listlike key, _check_indexing_error will raise
  3815     #  InvalidIndexError. Otherwise we fall through and re-raise
  3816     #  the TypeError.
  3817     self._check_indexing_error(key)

KeyError: 0

 # 3 is not a unique label
In [220]: df.loc[2:3, :]
---------------------------------------------------------------------------
KeyError  Traceback (most recent call last)
Cell In[220], line 1
----> 1 df.loc[2:3, :]

File ~/work/pandas/pandas/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
  1182     if self._is_scalar_access(key):
  1183         return self.obj._get_value(*key, takeable=self._takeable)
-> 1184     return self._getitem_tuple(key)
  1185 else:
  1186     # we by definition only have the 0th axis
  1187     axis = self.axis or 0

File ~/work/pandas/pandas/pandas/core/indexing.py:1377, in _LocIndexer._getitem_tuple(self, tup)
  1374 if self._multi_take_opportunity(tup):
  1375     return self._multi_take(tup)
-> 1377 return self._getitem_tuple_same_dim(tup)

File ~/work/pandas/pandas/pandas/core/indexing.py:1020, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
  1017 if com.is_null_slice(key):
  1018     continue
-> 1020 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
  1021 # We should never have retval.ndim < self.ndim, as that should
  1022 #  be handled by the _getitem_lowerdim call above.
  1023 assert retval.ndim == self.ndim

File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
  1409 if isinstance(key, slice):
  1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
  1412 elif com.is_bool_indexer(key):
  1413     return self._getbool_axis(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
  1440     return obj.copy(deep=False)
  1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
  1445 if isinstance(indexer, slice):
  1446     return self.obj._slice(indexer, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
  6618 def slice_indexer(
  6619     self,
  6620     start: Hashable | None = None,
  6621     end: Hashable | None = None,
  6622     step: int | None = None,
  6623 ) -> slice:
  6624  """
  6625 Compute the slice indexer for input labels and step.
  6626  
 (...)
  6660 slice(1, 3, None)
  6661 """
-> 6662     start_slice, end_slice = self.slice_locs(start, end, step=step)
  6664     # return a slice
  6665     if not is_scalar(start_slice):

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6885, in Index.slice_locs(self, start, end, step)
  6883 end_slice = None
  6884 if end is not None:
-> 6885     end_slice = self.get_slice_bound(end, "right")
  6886 if end_slice is None:
  6887     end_slice = len(self)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6812, in Index.get_slice_bound(self, label, side)
  6810     slc = lib.maybe_booleans_to_slice(slc.view("u1"))
  6811     if isinstance(slc, np.ndarray):
-> 6812         raise KeyError(
  6813             f"Cannot get {side} slice bound for non-unique "
  6814             f"label: {repr(original_label)}"
  6815         )
  6817 if isinstance(slc, slice):
  6818     if side == "left":

KeyError: 'Cannot get right slice bound for non-unique label: 3' 

Index.is_monotonic_increasingIndex.is_monotonic_decreasing 只检查索引是否弱单调。要检查严格单调性,可以将其中一个与 is_unique() 属性结合使用。

In [221]: weakly_monotonic = pd.Index(["a", "b", "c", "c"])

In [222]: weakly_monotonic
Out[222]: Index(['a', 'b', 'c', 'c'], dtype='object')

In [223]: weakly_monotonic.is_monotonic_increasing
Out[223]: True

In [224]: weakly_monotonic.is_monotonic_increasing & weakly_monotonic.is_unique
Out[224]: False 

端点是包含的

与标准 Python 序列切片相比,在 pandas 中基于标签的切片是包含的。这样做的主要原因是通常不可能轻松确定索引中特定标签后的“后继”或下一个元素。例如,请考虑以下 Series

In [225]: s = pd.Series(np.random.randn(6), index=list("abcdef"))

In [226]: s
Out[226]: 
a   -0.101684
b   -0.734907
c   -0.130121
d   -0.476046
e    0.759104
f    0.213379
dtype: float64 

假设我们希望从 ce 进行切片,使用整数可以这样完成:

In [227]: s[2:5]
Out[227]: 
c   -0.130121
d   -0.476046
e    0.759104
dtype: float64 

然而,如果你只有 ce,确定索引中的下一个元素可能有些复杂。例如,以下方法不起作用:

In [228]: s.loc['c':'e' + 1]
---------------------------------------------------------------------------
TypeError  Traceback (most recent call last)
Cell In[228], line 1
----> 1 s.loc['c':'e' + 1]

TypeError: can only concatenate str (not "int") to str 

一个非常常见的用例是将时间序列限制在两个特定日期开始和结束。为了实现这个目标,我们做出了设计选择,使基于标签的切片包含两个端点:

In [229]: s.loc["c":"e"]
Out[229]: 
c   -0.130121
d   -0.476046
e    0.759104
dtype: float64 

这绝对是一个“实用性胜于纯粹性”的事情,但如果你期望基于标签的切片的行为与标准 Python 整数切片的行为完全相同,这是需要注意的事情。

索引可能会改变底层 Series 的 dtype

不同的索引操作可能会更改Seriesdtype

In [230]: series1 = pd.Series([1, 2, 3])

In [231]: series1.dtype
Out[231]: dtype('int64')

In [232]: res = series1.reindex([0, 4])

In [233]: res.dtype
Out[233]: dtype('float64')

In [234]: res
Out[234]: 
0    1.0
4    NaN
dtype: float64 
In [235]: series2 = pd.Series([True])

In [236]: series2.dtype
Out[236]: dtype('bool')

In [237]: res = series2.reindex_like(series1)

In [238]: res.dtype
Out[238]: dtype('O')

In [239]: res
Out[239]: 
0    True
1     NaN
2     NaN
dtype: object 

这是因为上述重新索引操作会静默地插入NaNs,并相应地更改dtype。这在使用numpyufuncs(如numpy.logical_and)时可能会导致一些问题。

参见GH 2388以获取更详细的讨论。

整数索引

具有整数轴标签的基于标签的索引是一个棘手的问题。在邮件列表和科学 Python 社区的各个成员中已经进行了大量讨论。在 pandas 中,我们的一般观点是标签比整数位置更重要。因此,只有具有整数轴索引的情况下,才可以使用标准工具(如.loc)进行基于标签的索引。以下代码将生成异常:

In [207]: s = pd.Series(range(5))

In [208]: s[-1]
---------------------------------------------------------------------------
ValueError  Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/range.py:413, in RangeIndex.get_loc(self, key)
  412 try:
--> 413     return self._range.index(new_key)
  414 except ValueError as err:

ValueError: -1 is not in range

The above exception was the direct cause of the following exception:

KeyError  Traceback (most recent call last)
Cell In[208], line 1
----> 1 s[-1]

File ~/work/pandas/pandas/pandas/core/series.py:1121, in Series.__getitem__(self, key)
  1118     return self._values[key]
  1120 elif key_is_scalar:
-> 1121     return self._get_value(key)
  1123 # Convert generator to list before going through hashable part
  1124 # (We will iterate through the generator there to check for slices)
  1125 if is_iterator(key):

File ~/work/pandas/pandas/pandas/core/series.py:1237, in Series._get_value(self, label, takeable)
  1234     return self._values[label]
  1236 # Similar to Index.get_value, but we do not fall back to positional
-> 1237 loc = self.index.get_loc(label)
  1239 if is_integer(loc):
  1240     return self._values[loc]

File ~/work/pandas/pandas/pandas/core/indexes/range.py:415, in RangeIndex.get_loc(self, key)
  413         return self._range.index(new_key)
  414     except ValueError as err:
--> 415         raise KeyError(key) from err
  416 if isinstance(key, Hashable):
  417     raise KeyError(key)

KeyError: -1

In [209]: df = pd.DataFrame(np.random.randn(5, 4))

In [210]: df
Out[210]: 
 0         1         2         3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703  1.347213 -0.243487  0.514704
2  1.162969 -0.287725 -0.179734  0.993962
3 -0.212673  0.909872 -0.733333 -0.349893
4  0.456434 -0.306735  0.553396  0.166221

In [211]: df.loc[-2:]
Out[211]: 
 0         1         2         3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703  1.347213 -0.243487  0.514704
2  1.162969 -0.287725 -0.179734  0.993962
3 -0.212673  0.909872 -0.733333 -0.349893
4  0.456434 -0.306735  0.553396  0.166221 

这个故意的决定是为了防止歧义和细微的错误(许多用户报告在停止“回退”到基于位置的索引时进行 API 更改时发现错误)。

非单调索引需要精确匹配

如果SeriesDataFrame的索引是单调递增或递减的,则基于标签的切片的边界可以超出索引的范围,就像切片索引正常的 Python list一样。可以使用is_monotonic_increasing()is_monotonic_decreasing()属性来测试索引的单调性。

In [212]: df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=["data"], data=list(range(5)))

In [213]: df.index.is_monotonic_increasing
Out[213]: True

# no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4:
In [214]: df.loc[0:4, :]
Out[214]: 
 data
2     0
3     1
3     2
4     3

# slice is are outside the index, so empty DataFrame is returned
In [215]: df.loc[13:15, :]
Out[215]: 
Empty DataFrame
Columns: [data]
Index: [] 

另一方面,如果索引不是单调的,则切片边界必须是索引的唯一成员。

In [216]: df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=["data"], data=list(range(6)))

In [217]: df.index.is_monotonic_increasing
Out[217]: False

# OK because 2 and 4 are in the index
In [218]: df.loc[2:4, :]
Out[218]: 
 data
2     0
3     1
1     2
4     3 
 # 0 is not in the index
In [219]: df.loc[0:4, :]
---------------------------------------------------------------------------
KeyError  Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
  3804 try:
-> 3805     return self._engine.get_loc(casted_key)
  3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:191, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:234, in pandas._libs.index.IndexEngine._get_loc_duplicates()

File index.pyx:242, in pandas._libs.index.IndexEngine._maybe_get_bool_indexer()

File index.pyx:134, in pandas._libs.index._unpack_bool_indexer()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError  Traceback (most recent call last)
Cell In[219], line 1
----> 1 df.loc[0:4, :]

File ~/work/pandas/pandas/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
  1182     if self._is_scalar_access(key):
  1183         return self.obj._get_value(*key, takeable=self._takeable)
-> 1184     return self._getitem_tuple(key)
  1185 else:
  1186     # we by definition only have the 0th axis
  1187     axis = self.axis or 0

File ~/work/pandas/pandas/pandas/core/indexing.py:1377, in _LocIndexer._getitem_tuple(self, tup)
  1374 if self._multi_take_opportunity(tup):
  1375     return self._multi_take(tup)
-> 1377 return self._getitem_tuple_same_dim(tup)

File ~/work/pandas/pandas/pandas/core/indexing.py:1020, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
  1017 if com.is_null_slice(key):
  1018     continue
-> 1020 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
  1021 # We should never have retval.ndim < self.ndim, as that should
  1022 #  be handled by the _getitem_lowerdim call above.
  1023 assert retval.ndim == self.ndim

File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
  1409 if isinstance(key, slice):
  1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
  1412 elif com.is_bool_indexer(key):
  1413     return self._getbool_axis(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
  1440     return obj.copy(deep=False)
  1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
  1445 if isinstance(indexer, slice):
  1446     return self.obj._slice(indexer, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
  6618 def slice_indexer(
  6619     self,
  6620     start: Hashable | None = None,
  6621     end: Hashable | None = None,
  6622     step: int | None = None,
  6623 ) -> slice:
  6624  """
  6625 Compute the slice indexer for input labels and step.
  6626  
 (...)
  6660 slice(1, 3, None)
  6661 """
-> 6662     start_slice, end_slice = self.slice_locs(start, end, step=step)
  6664     # return a slice
  6665     if not is_scalar(start_slice):

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step)
  6877 start_slice = None
  6878 if start is not None:
-> 6879     start_slice = self.get_slice_bound(start, "left")
  6880 if start_slice is None:
  6881     start_slice = 0

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6804, in Index.get_slice_bound(self, label, side)
  6801         return self._searchsorted_monotonic(label, side)
  6802     except ValueError:
  6803         # raise the original KeyError
-> 6804         raise err
  6806 if isinstance(slc, np.ndarray):
  6807     # get_loc may return a boolean array, which
  6808     # is OK as long as they are representable by a slice.
  6809     assert is_bool_dtype(slc.dtype)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6798, in Index.get_slice_bound(self, label, side)
  6796 # we need to look up the label
  6797 try:
-> 6798     slc = self.get_loc(label)
  6799 except KeyError as err:
  6800     try:

File ~/work/pandas/pandas/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
  3807     if isinstance(casted_key, slice) or (
  3808         isinstance(casted_key, abc.Iterable)
  3809         and any(isinstance(x, slice) for x in casted_key)
  3810     ):
  3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
  3813 except TypeError:
  3814     # If we have a listlike key, _check_indexing_error will raise
  3815     #  InvalidIndexError. Otherwise we fall through and re-raise
  3816     #  the TypeError.
  3817     self._check_indexing_error(key)

KeyError: 0

 # 3 is not a unique label
In [220]: df.loc[2:3, :]
---------------------------------------------------------------------------
KeyError  Traceback (most recent call last)
Cell In[220], line 1
----> 1 df.loc[2:3, :]

File ~/work/pandas/pandas/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
  1182     if self._is_scalar_access(key):
  1183         return self.obj._get_value(*key, takeable=self._takeable)
-> 1184     return self._getitem_tuple(key)
  1185 else:
  1186     # we by definition only have the 0th axis
  1187     axis = self.axis or 0

File ~/work/pandas/pandas/pandas/core/indexing.py:1377, in _LocIndexer._getitem_tuple(self, tup)
  1374 if self._multi_take_opportunity(tup):
  1375     return self._multi_take(tup)
-> 1377 return self._getitem_tuple_same_dim(tup)

File ~/work/pandas/pandas/pandas/core/indexing.py:1020, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
  1017 if com.is_null_slice(key):
  1018     continue
-> 1020 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
  1021 # We should never have retval.ndim < self.ndim, as that should
  1022 #  be handled by the _getitem_lowerdim call above.
  1023 assert retval.ndim == self.ndim

File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
  1409 if isinstance(key, slice):
  1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
  1412 elif com.is_bool_indexer(key):
  1413     return self._getbool_axis(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
  1440     return obj.copy(deep=False)
  1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
  1445 if isinstance(indexer, slice):
  1446     return self.obj._slice(indexer, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
  6618 def slice_indexer(
  6619     self,
  6620     start: Hashable | None = None,
  6621     end: Hashable | None = None,
  6622     step: int | None = None,
  6623 ) -> slice:
  6624  """
  6625 Compute the slice indexer for input labels and step.
  6626  
 (...)
  6660 slice(1, 3, None)
  6661 """
-> 6662     start_slice, end_slice = self.slice_locs(start, end, step=step)
  6664     # return a slice
  6665     if not is_scalar(start_slice):

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6885, in Index.slice_locs(self, start, end, step)
  6883 end_slice = None
  6884 if end is not None:
-> 6885     end_slice = self.get_slice_bound(end, "right")
  6886 if end_slice is None:
  6887     end_slice = len(self)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6812, in Index.get_slice_bound(self, label, side)
  6810     slc = lib.maybe_booleans_to_slice(slc.view("u1"))
  6811     if isinstance(slc, np.ndarray):
-> 6812         raise KeyError(
  6813             f"Cannot get {side} slice bound for non-unique "
  6814             f"label: {repr(original_label)}"
  6815         )
  6817 if isinstance(slc, slice):
  6818     if side == "left":

KeyError: 'Cannot get right slice bound for non-unique label: 3' 

Index.is_monotonic_increasingIndex.is_monotonic_decreasing仅检查索引是否是弱单调的。要检查严格的单调性,可以将其中一个与is_unique()属性结合使用。

In [221]: weakly_monotonic = pd.Index(["a", "b", "c", "c"])

In [222]: weakly_monotonic
Out[222]: Index(['a', 'b', 'c', 'c'], dtype='object')

In [223]: weakly_monotonic.is_monotonic_increasing
Out[223]: True

In [224]: weakly_monotonic.is_monotonic_increasing & weakly_monotonic.is_unique
Out[224]: False 

端点是包含的

与标准的 Python 序列切片相比,在 pandas 中,基于标签的切片是包含的。主要原因是通常很难确定索引中特定标签后面的“后继”或下一个元素。例如,考虑以下 Series

In [225]: s = pd.Series(np.random.randn(6), index=list("abcdef"))

In [226]: s
Out[226]: 
a   -0.101684
b   -0.734907
c   -0.130121
d   -0.476046
e    0.759104
f    0.213379
dtype: float64 

假设我们希望从ce进行切片,使用整数可以这样完成:

In [227]: s[2:5]
Out[227]: 
c   -0.130121
d   -0.476046
e    0.759104
dtype: float64 

但是,如果你只有ce,确定索引中的下一个元素可能会有些复杂。例如,以下内容不起作用:

In [228]: s.loc['c':'e' + 1]
---------------------------------------------------------------------------
TypeError  Traceback (most recent call last)
Cell In[228], line 1
----> 1 s.loc['c':'e' + 1]

TypeError: can only concatenate str (not "int") to str 

一个非常常见的用例是将时间序列限制在两个特定日期开始和结束。为了实现这一点,我们做出了设计选择,使基于标签的切片包括两个端点:

In [229]: s.loc["c":"e"]
Out[229]: 
c   -0.130121
d   -0.476046
e    0.759104
dtype: float64 

这绝对是“实用性胜过纯粹性”的一种情况,但如果你期望基于标签的切片行为与标准的 Python 整数切片完全相同,则需要注意这一点。

索引可能会改变底层 Series 的数据类型

不同的索引操作可能会改变Series的数据类型。

In [230]: series1 = pd.Series([1, 2, 3])

In [231]: series1.dtype
Out[231]: dtype('int64')

In [232]: res = series1.reindex([0, 4])

In [233]: res.dtype
Out[233]: dtype('float64')

In [234]: res
Out[234]: 
0    1.0
4    NaN
dtype: float64 
In [235]: series2 = pd.Series([True])

In [236]: series2.dtype
Out[236]: dtype('bool')

In [237]: res = series2.reindex_like(series1)

In [238]: res.dtype
Out[238]: dtype('O')

In [239]: res
Out[239]: 
0    True
1     NaN
2     NaN
dtype: object 

这是因为上述(重新)索引操作会默默地插入NaNs,并相应地更改dtype。这可能会在使用numpyufuncs(如numpy.logical_and)时造成一些问题。

参见GH 2388以获取更详细的讨论。

标签:教程,slice,self,00,Pandas,two,2.2,pandas,Out
From: https://www.cnblogs.com/apachecn/p/18154754

相关文章

  • Pandas 2.2 中文官方教程和指南(十六)
    原文:pandas.pydata.org/docs/处理缺失数据原文:pandas.pydata.org/docs/user_guide/missing_data.html被视为“缺失”的值pandas使用不同的标记值来表示缺失值(也称为NA),具体取决于数据类型。numpy.nan适用于NumPy数据类型。使用NumPy数据类型的缺点是原始数据类型将......
  • Pandas 2.2 中文官方教程和指南(十三)
    原文:pandas.pydata.org/docs/写时复制(CoW)原文:pandas.pydata.org/docs/user_guide/copy_on_write.html注意写时复制将成为pandas3.0的默认设置。我们建议现在就启用它以从所有改进中受益。写时复制首次引入于版本1.5.0。从版本2.0开始,大部分通过CoW可能实现和支持......
  • Pandas 2.2 中文官方教程和指南(十七)
    原文:pandas.pydata.org/docs/重复标签原文:pandas.pydata.org/docs/user_guide/duplicates.htmlIndex对象不需要是唯一的;你可以有重复的行或列标签。这一点可能一开始会有点困惑。如果你熟悉SQL,你会知道行标签类似于表上的主键,你绝不希望在SQL表中有重复项。但pandas的......
  • Pandas 2.2 中文官方教程和指南(六)
    原文:pandas.pydata.org/docs/与Stata的比较原文:pandas.pydata.org/docs/getting_started/comparison/comparison_with_stata.html对于可能来自Stata的潜在用户,本页面旨在演示如何在pandas中执行不同的Stata操作。如果您是pandas的新手,您可能首先想通过阅读10分......
  • Pandas 2.2 中文官方教程和指南(七)
    原文:pandas.pydata.org/docs/社区教程原文:pandas.pydata.org/docs/getting_started/tutorials.html这是社区提供的许多pandas教程的指南,主要面向新用户。由JuliaEvans撰写的pandascookbook这本2015年的cookbook(由JuliaEvans撰写)的目标是为您提供一些具体的示......
  • Pandas 2.2 中文官方教程和指南(三)
    原文:pandas.pydata.org/docs/如何操作文本数据原文:pandas.pydata.org/docs/getting_started/intro_tutorials/10_text_data.html将所有名称字符改为小写。In[4]:titanic["Name"].str.lower()Out[4]:0braund,mr.owenharris1......
  • Pandas 2.2 中文官方教程和指南(十八)
    原文:pandas.pydata.org/docs/可空整数数据类型原文:pandas.pydata.org/docs/user_guide/integer_na.html注意IntegerArray目前处于实验阶段。其API或实现可能会在没有警告的情况下发生变化。使用pandas.NA作为缺失值。在处理缺失数据中,我们看到pandas主要使用NaN来表......
  • Pandas 2.2 中文官方教程和指南(二十四)
    原文:pandas.pydata.org/docs/扩展到大型数据集原文:pandas.pydata.org/docs/user_guide/scale.htmlpandas提供了用于内存分析的数据结构,这使得使用pandas分析大于内存数据集的数据集有些棘手。即使是占用相当大内存的数据集也变得难以处理,因为一些pandas操作需要进行中......
  • Pandas 2.2 中文官方教程和指南(二十三)
    原文:pandas.pydata.org/docs/提高性能原文:pandas.pydata.org/docs/user_guide/enhancingperf.html在本教程的这一部分中,我们将研究如何加速在pandas的DataFrame上操作的某些函数,使用Cython、Numba和pandas.eval()。通常,使用Cython和Numba可以比使用pandas.eval()提......
  • Pandas 2.2 中文官方教程和指南(二)
    原文:pandas.pydata.org/docs/如何在pandas中创建图表?原文:pandas.pydata.org/docs/getting_started/intro_tutorials/04_plotting.htmlIn[1]:importpandasaspdIn[2]:importmatplotlib.pyplotasplt本教程使用的数据:空气质量数据本教程使用关于(NO_2)的......