
Pandas 2.2 Official Tutorial and Guide (Part 13)


Original: pandas.pydata.org/docs/

Copy-on-Write (CoW)

Original: pandas.pydata.org/docs/user_guide/copy_on_write.html

Note

Copy-on-Write will become the default in pandas 3.0. We recommend turning it on now to benefit from all improvements.

Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0, most of the optimizations that become possible through CoW are implemented and supported. All possible optimizations are supported starting from pandas 2.1.

CoW will be enabled by default in version 3.0.

CoW will lead to more predictable behavior, since it is not possible to update more than one object with one statement, e.g. indexing operations or methods won't have side effects. Additionally, through delaying copies as long as possible, the average performance and memory usage will improve.

Previous behavior

pandas indexing behavior is tricky to understand. Some operations return views while others return copies. Depending on the result of the operation, mutating one object might accidentally mutate another:

In [1]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

In [2]: subset = df["foo"]

In [3]: subset.iloc[0] = 100

In [4]: df
Out[4]: 
 foo  bar
0  100    4
1    2    5
2    3    6 

Mutating subset, e.g. updating its values, also updates df. The exact behavior is hard to predict. Copy-on-Write solves the problem of accidentally modifying more than one object by explicitly disallowing it. With CoW enabled, df is unchanged:

In [5]: pd.options.mode.copy_on_write = True

In [6]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

In [7]: subset = df["foo"]

In [8]: subset.iloc[0] = 100

In [9]: df
Out[9]: 
 foo  bar
0    1    4
1    2    5
2    3    6 

The following sections will explain what this means and how it impacts existing applications.

Migrating to Copy-on-Write

Copy-on-Write will be the default and only mode in pandas 3.0. This means that users need to migrate their code to be compliant with CoW rules.

The default mode of pandas will raise warnings for certain cases that will actively change behavior and thus change user-intended behavior.

We added another mode, e.g.

pd.options.mode.copy_on_write = "warn" 

that will warn for every operation that would change behavior with CoW. We expect this mode to be very noisy, since many cases that we don't expect will influence users will also emit a warning. We recommend checking this mode and analyzing the warnings, but it is not necessary to address all of them. The first two items of the following list are the only cases that need to be addressed to make existing code work with CoW.
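As a sketch of how such an audit can look (the warning category and message printed here are assumptions; they vary by pandas version):

import warnings

import pandas as pd

pd.options.mode.copy_on_write = "warn"

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
subset = df["foo"]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Mutating a Series that shares data with df behaves differently
    # under CoW, so warning mode reports it.
    subset.iloc[0] = 100

for w in caught:
    print(w.category.__name__, str(w.message)[:80])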

The next few items describe the user-visible changes:

Chained assignment will never work

loc should be used as an alternative. Check the chained assignment section for more details.

Accessing the underlying array of a pandas object will return a read-only view

In [10]: ser = pd.Series([1, 2, 3])

In [11]: ser.to_numpy()
Out[11]: array([1, 2, 3]) 

This example returns a NumPy array that is a view of the Series object. This view can be modified and thus also modify the pandas object, which is not compliant with CoW rules. The returned array is set to non-writeable to protect against this behavior. Creating a copy of this array allows modification. You can also make the array writeable again if you don't care about the pandas object anymore.

See the section about read-only NumPy arrays for more details.
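For instance, a minimal sketch of the copy-based route (assuming CoW is enabled):

ser = pd.Series([1, 2, 3])

arr = ser.to_numpy()
print(arr.flags.writeable)  # False -- the returned view is read-only

arr = ser.to_numpy().copy()  # an independent, writable copy
arr[0] = 100                 # fine; ser is unaffected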

Only one pandas object is updated at once

The following code snippet updates both df and subset without CoW:

In [12]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

In [13]: subset = df["foo"]

In [14]: subset.iloc[0] = 100

In [15]: df
Out[15]: 
 foo  bar
0    1    4
1    2    5
2    3    6 

This is no longer possible with CoW, because the CoW rules explicitly forbid it. This includes updating a single column as a Series and relying on the change propagating back to the parent DataFrame. If this behavior is necessary, the statement can be rewritten into a single statement with loc or iloc. DataFrame.where() is another suitable alternative for this case.
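For example, both rewrites below touch df directly instead of going through an intermediate subset (a sketch; the condition and replacement values are arbitrary):

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

# Single-statement update with loc instead of subset.iloc[0] = 100:
df.loc[0, "foo"] = 100

# Conditional update with DataFrame.where(); entries where the
# condition is False are replaced by `other`:
df["foo"] = df["foo"].where(df["bar"] != 5, other=200)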

Updating a column selected from a DataFrame with an inplace method will also no longer work:

In [16]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

In [17]: df["foo"].replace(1, 5, inplace=True)

In [18]: df
Out[18]: 
 foo  bar
0    1    4
1    2    5
2    3    6 

This is another form of chained assignment. It can generally be rewritten in 2 different forms:

In [19]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

In [20]: df.replace({"foo": {1: 5}}, inplace=True)

In [21]: df
Out[21]: 
 foo  bar
0    5    4
1    2    5
2    3    6 

Another alternative is to not use inplace:

In [22]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

In [23]: df["foo"] = df["foo"].replace(1, 5)

In [24]: df
Out[24]: 
 foo  bar
0    5    4
1    2    5
2    3    6 

Constructors now copy NumPy arrays by default

The Series and DataFrame constructors now copy NumPy arrays by default when not otherwise specified. This was changed to avoid mutating a pandas object when the NumPy array is changed inplace outside of pandas. You can set copy=False to avoid this copy.
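A minimal sketch of the difference (assuming CoW is enabled):

import numpy as np
import pandas as pd

arr = np.array([1, 2, 3])

ser = pd.Series(arr)  # copies the array by default
arr[0] = 100
print(ser[0])  # 1 -- the Series is unaffected by the outside mutation

ser_shared = pd.Series(arr, copy=False)  # opt back into sharing
print(np.shares_memory(arr, ser_shared.to_numpy()))  # True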

Description

CoW means that any DataFrame or Series derived from another in any way always behaves as a copy. As a consequence, we can only change the values of an object by modifying the object itself. CoW disallows updating a DataFrame or a Series that shares data with another DataFrame or Series object inplace.

This avoids side effects when modifying values and hence, most methods can avoid actually copying the data and only trigger a copy when necessary.

The following example will operate inplace under CoW:

In [25]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

In [26]: df.iloc[0, 0] = 100

In [27]: df
Out[27]: 
 foo  bar
0  100    4
1    2    5
2    3    6 

The object df does not share any data with any other object, and hence no copy is triggered when the values are updated. In contrast, the following operation triggers a copy of the data under CoW:

In [28]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

In [29]: df2 = df.reset_index(drop=True)

In [30]: df2.iloc[0, 0] = 100

In [31]: df
Out[31]: 
 foo  bar
0    1    4
1    2    5
2    3    6

In [32]: df2
Out[32]: 
 foo  bar
0  100    4
1    2    5
2    3    6 

reset_index returns a lazy copy with CoW, while it copies the data without CoW. Since both objects df and df2 share the same data, a copy is triggered when df2 is modified. The object df still has the same values as initially, while df2 was modified.

If the object df isn't needed anymore after performing the reset_index operation, you can emulate an inplace-like operation by assigning the output of reset_index to the same variable:

In [33]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

In [34]: df = df.reset_index(drop=True)

In [35]: df.iloc[0, 0] = 100

In [36]: df
Out[36]: 
 foo  bar
0  100    4
1    2    5
2    3    6 

reset_index的结果被重新分配时,初始对象立即超出范围,因此df不与任何其他对象共享数据。在修改对象时不需要复制。这通常适用于写时复制优化中列出的所有方法。

Previously, when operating on views, both the view and the parent object were modified:

In [37]: with pd.option_context("mode.copy_on_write", False):
 ....:    df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
 ....:    view = df[:]
 ....:    df.iloc[0, 0] = 100
 ....: 

In [38]: df
Out[38]: 
 foo  bar
0  100    4
1    2    5
2    3    6

In [39]: view
Out[39]: 
 foo  bar
0  100    4
1    2    5
2    3    6 

df更改时触发拷贝,以避免突变view

In [40]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

In [41]: view = df[:]

In [42]: df.iloc[0, 0] = 100

In [43]: df
Out[43]: 
 foo  bar
0  100    4
1    2    5
2    3    6

In [44]: view
Out[44]: 
 foo  bar
0    1    4
1    2    5
2    3    6 

Chained assignment

Chained assignment refers to a technique where an object is updated through two subsequent indexing operations, e.g.

In [45]: with pd.option_context("mode.copy_on_write", False):
 ....:    df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
 ....:    df["foo"][df["bar"] > 5] = 100
 ....:    df
 ....: 

The column foo is updated where the column bar is greater than 5. This violates the CoW principles though, because it would have to modify the view df["foo"] and df in one step. Hence, chained assignment will consistently never work and raises a ChainedAssignmentError warning with CoW enabled:

In [46]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

In [47]: df["foo"][df["bar"] > 5] = 100 

With Copy-on-Write, this can be achieved by using loc:

In [48]: df.loc[df["bar"] > 5, "foo"] = 100 

Read-only NumPy arrays

Accessing the underlying NumPy array of a DataFrame will return a read-only array if the array shares data with the initial DataFrame:

The array is a copy if the initial DataFrame consists of more than one array:

In [49]: df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})

In [50]: df.to_numpy()
Out[50]: 
array([[1. , 1.5],
 [2. , 2.5]]) 

The array shares data with the DataFrame if the DataFrame consists of only one NumPy array:

In [51]: df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

In [52]: df.to_numpy()
Out[52]: 
array([[1, 3],
 [2, 4]]) 

This array is read-only, which means that it can't be modified inplace:

In [53]: arr = df.to_numpy()

In [54]: arr[0, 0] = 100
---------------------------------------------------------------------------
ValueError  Traceback (most recent call last)
Cell In[54], line 1
----> 1 arr[0, 0] = 100

ValueError: assignment destination is read-only 

The same holds true for a Series, since a Series always consists of a single array.

There are two potential solutions to this:

  • Trigger a copy manually if you want to avoid updating a DataFrame that shares memory with your array (a sketch follows this list).

  • Make the array writeable. This is a more performant solution but circumvents Copy-on-Write rules, so it should be used with caution.
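A minimal sketch of the first option; the copy no longer shares memory with df, so mutating it cannot touch the DataFrame:

arr = df.to_numpy().copy()  # independent of df
arr[0, 0] = 100             # safe; df is unaffected

The session below demonstrates the second option, making the array writeable again.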

In [55]: arr = df.to_numpy()

In [56]: arr.flags.writeable = True

In [57]: arr[0, 0] = 100

In [58]: arr
Out[58]: 
array([[100,   3],
 [  2,   4]]) 

Patterns to avoid

No defensive copy will be performed if two objects share the same data while you are modifying one object inplace:

In [59]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

In [60]: df2 = df.reset_index(drop=True)

In [61]: df2.iloc[0, 0] = 100 

This creates two objects that share data, and thus the setitem operation will trigger a copy. This is not necessary if the initial object df isn't needed anymore. Simply reassigning to the same variable will invalidate the reference that is held by the object:

In [62]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

In [63]: df = df.reset_index(drop=True)

In [64]: df.iloc[0, 0] = 100 

No copy is necessary in this example. Creating multiple references keeps unnecessary references alive and thus will hurt performance under Copy-on-Write.

Copy-on-Write optimizations

A new lazy copy mechanism defers the copy until the object in question is modified, and only if that object shares data with another object. This mechanism was added to methods that don't require a copy of the underlying data. Popular examples are DataFrame.drop() with axis=1 and DataFrame.rename().

These methods return views when Copy-on-Write is enabled, which provides a significant performance improvement compared to the regular execution.
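As a sketch of this lazy-copy behavior (whether two objects share a given buffer can depend on the pandas version and block layout):

import numpy as np
import pandas as pd

pd.options.mode.copy_on_write = True

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4.0, 5.0, 6.0]})
dropped = df.drop(columns="bar")  # no data is copied up front

# The result still shares the "foo" buffer with df ...
print(np.shares_memory(df["foo"].to_numpy(), dropped["foo"].to_numpy()))  # True

# ... and a copy is only triggered once one side is modified.
dropped.iloc[0, 0] = 100
print(df["foo"].tolist())  # [1, 2, 3] -- df is unchanged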

写时拷贝可以通过配置选项copy_on_write启用。该选项可以通过以下任一方式 __ 全局 __ 启用:

In [65]: pd.set_option("mode.copy_on_write", True)

In [66]: pd.options.mode.copy_on_write = True 
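The option can also be set for a limited scope through a context manager, as some of the earlier snippets on this page do:

with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"foo": [1, 2, 3]})
    subset = df["foo"]
    subset.iloc[0] = 100  # df stays unchanged inside this block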


Merge, join, concatenate and compare

Original: pandas.pydata.org/docs/user_guide/merging.html

pandas provides various methods for combining and comparing Series or DataFrame objects:

  • concat(): Merge multiple Series or DataFrame objects along a shared index or column

  • DataFrame.join(): Merge multiple DataFrame objects along the columns

  • DataFrame.combine_first(): Update missing values with non-missing values in the same location

  • merge(): Combine two Series or DataFrame objects with SQL-style joining

  • merge_ordered(): Combine two Series or DataFrame objects along an ordered axis

  • merge_asof(): Combine two Series or DataFrame objects by near instead of exact matching keys

  • Series.compare() and DataFrame.compare(): Show differences in values between two Series or DataFrame objects

concat()

The concat() function concatenates an arbitrary amount of Series or DataFrame objects along an axis while performing optional set logic (union or intersection) of the indexes on the other axes. Like numpy.concatenate, concat() takes a list or dict of homogeneously-typed objects and concatenates them:

In [1]: df1 = pd.DataFrame(
 ...:    {
 ...:        "A": ["A0", "A1", "A2", "A3"],
 ...:        "B": ["B0", "B1", "B2", "B3"],
 ...:        "C": ["C0", "C1", "C2", "C3"],
 ...:        "D": ["D0", "D1", "D2", "D3"],
 ...:    },
 ...:    index=[0, 1, 2, 3],
 ...: )
 ...: 

In [2]: df2 = pd.DataFrame(
 ...:    {
 ...:        "A": ["A4", "A5", "A6", "A7"],
 ...:        "B": ["B4", "B5", "B6", "B7"],
 ...:        "C": ["C4", "C5", "C6", "C7"],
 ...:        "D": ["D4", "D5", "D6", "D7"],
 ...:    },
 ...:    index=[4, 5, 6, 7],
 ...: )
 ...: 

In [3]: df3 = pd.DataFrame(
 ...:    {
 ...:        "A": ["A8", "A9", "A10", "A11"],
 ...:        "B": ["B8", "B9", "B10", "B11"],
 ...:        "C": ["C8", "C9", "C10", "C11"],
 ...:        "D": ["D8", "D9", "D10", "D11"],
 ...:    },
 ...:    index=[8, 9, 10, 11],
 ...: )
 ...: 

In [4]: frames = [df1, df2, df3]

In [5]: result = pd.concat(frames)

In [6]: result
Out[6]: 
 A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11 

../_images/merging_concat_basic.png

Note

concat() makes a full copy of the data, and iteratively reusing concat() can create unnecessary copies. Collect all DataFrame or Series objects in a list before using concat():

frames = [process_your_file(f) for f in files]
result = pd.concat(frames) 

Note

When concatenating DataFrames with named axes, pandas will attempt to preserve these index/column names whenever possible. In the case where all inputs share a common name, this name will be assigned to the result. When the input names do not all agree, the result will be unnamed. The same is true for a MultiIndex, but the logic is applied separately on a level-by-level basis.
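For example (a small sketch; the index names id and other are arbitrary):

a = pd.DataFrame({"x": [1, 2]}, index=pd.Index([0, 1], name="id"))
b = pd.DataFrame({"x": [3, 4]}, index=pd.Index([2, 3], name="id"))
print(pd.concat([a, b]).index.name)  # 'id' -- the common name is kept

b.index.name = "other"
print(pd.concat([a, b]).index.name)  # None -- the names disagree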

Joining logic of the resulting axis

The join keyword specifies how to handle axis values that don't exist in the first DataFrame.

join='outer' takes the union of all axis values:

In [7]: df4 = pd.DataFrame(
 ...:    {
 ...:        "B": ["B2", "B3", "B6", "B7"],
 ...:        "D": ["D2", "D3", "D6", "D7"],
 ...:        "F": ["F2", "F3", "F6", "F7"],
 ...:    },
 ...:    index=[2, 3, 6, 7],
 ...: )
 ...: 

In [8]: result = pd.concat([df1, df4], axis=1)

In [9]: result
Out[9]: 
 A    B    C    D    B    D    F
0   A0   B0   C0   D0  NaN  NaN  NaN
1   A1   B1   C1   D1  NaN  NaN  NaN
2   A2   B2   C2   D2   B2   D2   F2
3   A3   B3   C3   D3   B3   D3   F3
6  NaN  NaN  NaN  NaN   B6   D6   F6
7  NaN  NaN  NaN  NaN   B7   D7   F7 

../_images/merging_concat_axis1.png

join='inner' takes the intersection of the axis values:

In [10]: result = pd.concat([df1, df4], axis=1, join="inner")

In [11]: result
Out[11]: 
 A   B   C   D   B   D   F
2  A2  B2  C2  D2  B2  D2  F2
3  A3  B3  C3  D3  B3  D3  F3 

../_images/merging_concat_axis1_inner.png

To perform an effective “left” join using the exact index from the original DataFrame, the result can be reindexed:

In [12]: result = pd.concat([df1, df4], axis=1).reindex(df1.index)

In [13]: result
Out[13]: 
 A   B   C   D    B    D    F
0  A0  B0  C0  D0  NaN  NaN  NaN
1  A1  B1  C1  D1  NaN  NaN  NaN
2  A2  B2  C2  D2   B2   D2   F2
3  A3  B3  C3  D3   B3   D3   F3 

../_images/merging_concat_axis1_join_axes.png

Ignoring indexes on the concatenation axis

For DataFrame objects which don't have a meaningful index, the ignore_index argument ignores overlapping indexes:

In [14]: result = pd.concat([df1, df4], ignore_index=True, sort=False)

In [15]: result
Out[15]: 
 A   B    C   D    F
0   A0  B0   C0  D0  NaN
1   A1  B1   C1  D1  NaN
2   A2  B2   C2  D2  NaN
3   A3  B3   C3  D3  NaN
4  NaN  B2  NaN  D2   F2
5  NaN  B3  NaN  D3   F3
6  NaN  B6  NaN  D6   F6
7  NaN  B7  NaN  D7   F7 

../_images/merging_concat_ignore_index.png

Concatenating Series and DataFrame together

You can concatenate a mix of Series and DataFrame objects. The Series will be transformed to a DataFrame with the column name being the name of the Series:

In [16]: s1 = pd.Series(["X0", "X1", "X2", "X3"], name="X")

In [17]: result = pd.concat([df1, s1], axis=1)

In [18]: result
Out[18]: 
 A   B   C   D   X
0  A0  B0  C0  D0  X0
1  A1  B1  C1  D1  X1
2  A2  B2  C2  D2  X2
3  A3  B3  C3  D3  X3 

../_images/merging_concat_mixed_ndim.png

Unnamed Series will be numbered consecutively:

In [19]: s2 = pd.Series(["_0", "_1", "_2", "_3"])

In [20]: result = pd.concat([df1, s2, s2, s2], axis=1)

In [21]: result
Out[21]: 
 A   B   C   D   0   1   2
0  A0  B0  C0  D0  _0  _0  _0
1  A1  B1  C1  D1  _1  _1  _1
2  A2  B2  C2  D2  _2  _2  _2
3  A3  B3  C3  D3  _3  _3  _3 

../_images/merging_concat_unnamed_series.png

Passing ignore_index=True will drop all name references:

In [22]: result = pd.concat([df1, s1], axis=1, ignore_index=True)

In [23]: result
Out[23]: 
 0   1   2   3   4
0  A0  B0  C0  D0  X0
1  A1  B1  C1  D1  X1
2  A2  B2  C2  D2  X2
3  A3  B3  C3  D3  X3 

../_images/merging_concat_series_ignore_index.png

Resulting keys

The keys argument adds another axis level to the resulting index or columns (creating a MultiIndex), associating specific keys with each original DataFrame:

In [24]: result = pd.concat(frames, keys=["x", "y", "z"])

In [25]: result
Out[25]: 
 A    B    C    D
x 0    A0   B0   C0   D0
 1    A1   B1   C1   D1
 2    A2   B2   C2   D2
 3    A3   B3   C3   D3
y 4    A4   B4   C4   D4
 5    A5   B5   C5   D5
 6    A6   B6   C6   D6
 7    A7   B7   C7   D7
z 8    A8   B8   C8   D8
 9    A9   B9   C9   D9
 10  A10  B10  C10  D10
 11  A11  B11  C11  D11

In [26]: result.loc["y"]
Out[26]: 
 A   B   C   D
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7 

../_images/merging_concat_keys.png

The keys argument can override the column names when creating a new DataFrame based on existing Series:

In [27]: s3 = pd.Series([0, 1, 2, 3], name="foo")

In [28]: s4 = pd.Series([0, 1, 2, 3])

In [29]: s5 = pd.Series([0, 1, 4, 5])

In [30]: pd.concat([s3, s4, s5], axis=1)
Out[30]: 
 foo  0  1
0    0  0  0
1    1  1  1
2    2  2  4
3    3  3  5

In [31]: pd.concat([s3, s4, s5], axis=1, keys=["red", "blue", "yellow"])
Out[31]: 
 red  blue  yellow
0    0     0       0
1    1     1       1
2    2     2       4
3    3     3       5 

You can also pass a dict to concat(), in which case the dict keys will be used for the keys argument unless another keys argument is specified:

In [32]: pieces = {"x": df1, "y": df2, "z": df3}

In [33]: result = pd.concat(pieces)

In [34]: result
Out[34]: 
 A    B    C    D
x 0    A0   B0   C0   D0
 1    A1   B1   C1   D1
 2    A2   B2   C2   D2
 3    A3   B3   C3   D3
y 4    A4   B4   C4   D4
 5    A5   B5   C5   D5
 6    A6   B6   C6   D6
 7    A7   B7   C7   D7
z 8    A8   B8   C8   D8
 9    A9   B9   C9   D9
 10  A10  B10  C10  D10
 11  A11  B11  C11  D11 

../_images/merging_concat_dict.png

In [35]: result = pd.concat(pieces, keys=["z", "y"])

In [36]: result
Out[36]: 
 A    B    C    D
z 8    A8   B8   C8   D8
 9    A9   B9   C9   D9
 10  A10  B10  C10  D10
 11  A11  B11  C11  D11
y 4    A4   B4   C4   D4
 5    A5   B5   C5   D5
 6    A6   B6   C6   D6
 7    A7   B7   C7   D7 

../_images/merging_concat_dict_keys.png

The MultiIndex created has levels that are constructed from the passed keys and the index of the DataFrame pieces:

In [37]: result.index.levels
Out[37]: FrozenList([['z', 'y'], [4, 5, 6, 7, 8, 9, 10, 11]]) 

The levels argument allows specifying the resulting levels associated with the keys:

In [38]: result = pd.concat(
 ....:    pieces, keys=["x", "y", "z"], levels=[["z", "y", "x", "w"]], names=["group_key"]
 ....: )
 ....: 

In [39]: result
Out[39]: 
 A    B    C    D
group_key 
x         0    A0   B0   C0   D0
 1    A1   B1   C1   D1
 2    A2   B2   C2   D2
 3    A3   B3   C3   D3
y         4    A4   B4   C4   D4
 5    A5   B5   C5   D5
 6    A6   B6   C6   D6
 7    A7   B7   C7   D7
z         8    A8   B8   C8   D8
 9    A9   B9   C9   D9
 10  A10  B10  C10  D10
 11  A11  B11  C11  D11 

../_images/merging_concat_dict_keys_names.png

In [40]: result.index.levels
Out[40]: FrozenList([['z', 'y', 'x', 'w'], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]]) 

Appending rows to a DataFrame

If you have a Series that you want to append as a single row to a DataFrame, you can convert the row into a DataFrame and use concat():

In [41]: s2 = pd.Series(["X0", "X1", "X2", "X3"], index=["A", "B", "C", "D"])

In [42]: result = pd.concat([df1, s2.to_frame().T], ignore_index=True)

In [43]: result
Out[43]: 
 A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  X0  X1  X2  X3 

../_images/merging_append_series_as_row.png

merge()

merge() performs join operations similar to relational databases like SQL. Users who are familiar with SQL but new to pandas can reference a comparison with SQL.

Merge types

merge() implements common SQL-style joining operations:

  • one-to-one: joining two DataFrame objects on their indexes, which must contain unique values.

  • many-to-one: joining a unique index to one or more columns in a different DataFrame.

  • many-to-many: joining columns on columns.

Note

When joining columns on columns, potentially a many-to-many join, any indexes on the passed DataFrame objects will be discarded.

For a many-to-many join, if a key combination appears more than once in both tables, the DataFrame will have the Cartesian product of the associated data.
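A small sketch of that blow-up with a repeated key (the column names lval and rval are arbitrary):

left = pd.DataFrame({"key": ["K0", "K0"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["K0", "K0"], "rval": [3, 4]})
print(pd.merge(left, right, on="key"))  # 2 x 2 = 4 rows for key "K0"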

In [44]: left = pd.DataFrame(
 ....:    {
 ....:        "key": ["K0", "K1", "K2", "K3"],
 ....:        "A": ["A0", "A1", "A2", "A3"],
 ....:        "B": ["B0", "B1", "B2", "B3"],
 ....:    }
 ....: )
 ....: 

In [45]: right = pd.DataFrame(
 ....:    {
 ....:        "key": ["K0", "K1", "K2", "K3"],
 ....:        "C": ["C0", "C1", "C2", "C3"],
 ....:        "D": ["D0", "D1", "D2", "D3"],
 ....:    }
 ....: )
 ....: 

In [46]: result = pd.merge(left, right, on="key")

In [47]: result
Out[47]: 
 key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
3  K3  A3  B3  C3  D3 

../_images/merging_merge_on_key.png

The how argument to merge() specifies which keys are included in the resulting table. If a key combination does not appear in either the left or right tables, the values in the joined table will be NA. Here is a summary of the how options and their SQL equivalent names:

| Merge method | SQL join name | Description |
| --- | --- | --- |
| left | LEFT OUTER JOIN | Use keys from the left frame only |
| right | RIGHT OUTER JOIN | Use keys from the right frame only |
| outer | FULL OUTER JOIN | Use the union of keys from both frames |
| inner | INNER JOIN | Use the intersection of keys from both frames |
| cross | CROSS JOIN | Create the cartesian product of rows from both frames |

In [48]: left = pd.DataFrame(
 ....:   {
 ....:      "key1": ["K0", "K0", "K1", "K2"],
 ....:      "key2": ["K0", "K1", "K0", "K1"],
 ....:      "A": ["A0", "A1", "A2", "A3"],
 ....:      "B": ["B0", "B1", "B2", "B3"],
 ....:   }
 ....: )
 ....: 

In [49]: right = pd.DataFrame(
 ....:   {
 ....:      "key1": ["K0", "K1", "K1", "K2"],
 ....:      "key2": ["K0", "K0", "K0", "K0"],
 ....:      "C": ["C0", "C1", "C2", "C3"],
 ....:      "D": ["D0", "D1", "D2", "D3"],
 ....:   }
 ....: )
 ....: 

In [50]: result = pd.merge(left, right, how="left", on=["key1", "key2"])

In [51]: result
Out[51]: 
 key1 key2   A   B    C    D
0   K0   K0  A0  B0   C0   D0
1   K0   K1  A1  B1  NaN  NaN
2   K1   K0  A2  B2   C1   D1
3   K1   K0  A2  B2   C2   D2
4   K2   K1  A3  B3  NaN  NaN 

../_images/merging_merge_on_key_left.png

In [52]: result = pd.merge(left, right, how="right", on=["key1", "key2"])

In [53]: result
Out[53]: 
 key1 key2    A    B   C   D
0   K0   K0   A0   B0  C0  D0
1   K1   K0   A2   B2  C1  D1
2   K1   K0   A2   B2  C2  D2
3   K2   K0  NaN  NaN  C3  D3 

../_images/merging_merge_on_key_right.png

In [54]: result = pd.merge(left, right, how="outer", on=["key1", "key2"])

In [55]: result
Out[55]: 
 key1 key2    A    B    C    D
0   K0   K0   A0   B0   C0   D0
1   K0   K1   A1   B1  NaN  NaN
2   K1   K0   A2   B2   C1   D1
3   K1   K0   A2   B2   C2   D2
4   K2   K0  NaN  NaN   C3   D3
5   K2   K1   A3   B3  NaN  NaN 

../_images/merging_merge_on_key_outer.png

In [56]: result = pd.merge(left, right, how="inner", on=["key1", "key2"])

In [57]: result
Out[57]: 
 key1 key2   A   B   C   D
0   K0   K0  A0  B0  C0  D0
1   K1   K0  A2  B2  C1  D1
2   K1   K0  A2  B2  C2  D2 

../_images/merging_merge_on_key_inner.png

In [58]: result = pd.merge(left, right, how="cross")

In [59]: result
Out[59]: 
 key1_x key2_x   A   B key1_y key2_y   C   D
0      K0     K0  A0  B0     K0     K0  C0  D0
1      K0     K0  A0  B0     K1     K0  C1  D1
2      K0     K0  A0  B0     K1     K0  C2  D2
3      K0     K0  A0  B0     K2     K0  C3  D3
4      K0     K1  A1  B1     K0     K0  C0  D0
..    ...    ...  ..  ..    ...    ...  ..  ..
11     K1     K0  A2  B2     K2     K0  C3  D3
12     K2     K1  A3  B3     K0     K0  C0  D0
13     K2     K1  A3  B3     K1     K0  C1  D1
14     K2     K1  A3  B3     K1     K0  C2  D2
15     K2     K1  A3  B3     K2     K0  C3  D3

[16 rows x 8 columns] 

../_images/merging_merge_cross.png

You can merge a Series and a DataFrame with a MultiIndex if the names of the MultiIndex correspond to columns of the DataFrame. Transform the Series to a DataFrame using Series.reset_index() before merging:

In [60]: df = pd.DataFrame({"Let": ["A", "B", "C"], "Num": [1, 2, 3]})

In [61]: df
Out[61]: 
 Let  Num
0   A    1
1   B    2
2   C    3

In [62]: ser = pd.Series(
 ....:    ["a", "b", "c", "d", "e", "f"],
 ....:    index=pd.MultiIndex.from_arrays(
 ....:        [["A", "B", "C"] * 2, [1, 2, 3, 4, 5, 6]], names=["Let", "Num"]
 ....:    ),
 ....: )
 ....: 

In [63]: ser
Out[63]: 
Let  Num
A    1      a
B    2      b
C    3      c
A    4      d
B    5      e
C    6      f
dtype: object

In [64]: pd.merge(df, ser.reset_index(), on=["Let", "Num"])
Out[64]: 
 Let  Num  0
0   A    1  a
1   B    2  b
2   C    3  c 

Performing an outer join with duplicate join keys in a DataFrame:

In [65]: left = pd.DataFrame({"A": [1, 2], "B": [2, 2]})

In [66]: right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

In [67]: result = pd.merge(left, right, on="B", how="outer")

In [68]: result
Out[68]: 
 A_x  B  A_y
0    1  2    4
1    1  2    5
2    1  2    6
3    2  2    4
4    2  2    5
5    2  2    6 

../_images/merging_merge_on_key_dup.png

Warning

Merging on duplicate keys significantly increases the dimensions of the result and can cause a memory overflow.

Merge key uniqueness

The validate argument checks the uniqueness of the merge keys. Key uniqueness is checked before the merge operation, which can protect against memory overflows and unexpected key duplication:

In [69]: left = pd.DataFrame({"A": [1, 2], "B": [1, 2]})

In [70]: right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

In [71]: result = pd.merge(left, right, on="B", how="outer", validate="one_to_one")
---------------------------------------------------------------------------
MergeError  Traceback (most recent call last)
Cell In[71], line 1
----> 1 result = pd.merge(left, right, on="B", how="outer", validate="one_to_one")

File ~/work/pandas/pandas/pandas/core/reshape/merge.py:170, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
  155     return _cross_merge(
  156         left_df,
  157         right_df,
   (...)
  167         copy=copy,
  168     )
  169 else:
--> 170     op = _MergeOperation(
  171         left_df,
  172         right_df,
  173         how=how,
  174         on=on,
  175         left_on=left_on,
  176         right_on=right_on,
  177         left_index=left_index,
  178         right_index=right_index,
  179         sort=sort,
  180         suffixes=suffixes,
  181         indicator=indicator,
  182         validate=validate,
  183     )
  184     return op.get_result(copy=copy)

File ~/work/pandas/pandas/pandas/core/reshape/merge.py:813, in _MergeOperation.__init__(self, left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, indicator, validate)
  809 # If argument passed to validate,
  810 # check if columns specified as unique
  811 # are in fact unique.
  812 if validate is not None:
--> 813     self._validate_validate_kwd(validate)

File ~/work/pandas/pandas/pandas/core/reshape/merge.py:1657, in _MergeOperation._validate_validate_kwd(self, validate)
  1653         raise MergeError(
  1654             "Merge keys are not unique in left dataset; not a one-to-one merge"
  1655         )
  1656     if not right_unique:
-> 1657         raise MergeError(
  1658             "Merge keys are not unique in right dataset; not a one-to-one merge"
  1659         )
  1661 elif validate in ["one_to_many", "1:m"]:
  1662     if not left_unique:

MergeError: Merge keys are not unique in right dataset; not a one-to-one merge 

If the user is aware of the duplicates in the right DataFrame but wants to ensure there are no duplicates in the left DataFrame, one can use the validate='one_to_many' argument instead, which will not raise an exception:

In [72]: pd.merge(left, right, on="B", how="outer", validate="one_to_many")
Out[72]: 
 A_x  B  A_y
0    1  1  NaN
1    2  2  4.0
2    2  2  5.0
3    2  2  6.0 
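
Besides one_to_one and one_to_many, validate also accepts many_to_one ("m:1") and many_to_many ("m:m"). A sketch of the many-to-one check, which only requires the right-hand keys to be unique:

left = pd.DataFrame({"B": [1, 2, 2]})  # duplicate keys are fine on the left
right = pd.DataFrame({"B": [1, 2]})    # keys must be unique on the right
pd.merge(left, right, on="B", validate="many_to_one")  # no MergeError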
Merge result indicator

merge() accepts the argument indicator. If True, a Categorical-type column called _merge will be added to the output object that takes on the values:

| Observation Origin | _merge value |
| --- | --- |
| Merge key only in 'left' frame | left_only |
| Merge key only in 'right' frame | right_only |
| Merge key in both frames | both |

In [73]: df1 = pd.DataFrame({"col1": [0, 1], "col_left": ["a", "b"]})

In [74]: df2 = pd.DataFrame({"col1": [1, 2, 2], "col_right": [2, 2, 2]})

In [75]: pd.merge(df1, df2, on="col1", how="outer", indicator=True)
Out[75]: 
 col1 col_left  col_right      _merge
0     0        a        NaN   left_only
1     1        b        2.0        both
2     2      NaN        2.0  right_only
3     2      NaN        2.0  right_only 

A string argument to indicator will use that value as the name of the indicator column:

In [76]: pd.merge(df1, df2, on="col1", how="outer", indicator="indicator_column")
Out[76]: 
 col1 col_left  col_right indicator_column
0     0        a        NaN        left_only
1     1        b        2.0             both
2     2      NaN        2.0       right_only
3     2      NaN        2.0       right_only 

Overlapping value columns

The merge suffixes argument takes a tuple of strings to append to overlapping column names in the input DataFrames to disambiguate the result columns:

In [77]: left = pd.DataFrame({"k": ["K0", "K1", "K2"], "v": [1, 2, 3]})

In [78]: right = pd.DataFrame({"k": ["K0", "K0", "K3"], "v": [4, 5, 6]})

In [79]: result = pd.merge(left, right, on="k")

In [80]: result
Out[80]: 
 k  v_x  v_y
0  K0    1    4
1  K0    1    5 

../_images/merging_merge_overlapped.png

In [81]: result = pd.merge(left, right, on="k", suffixes=("_l", "_r"))

In [82]: result
Out[82]: 
 k  v_l  v_r
0  K0    1    4
1  K0    1    5 

../_images/merging_merge_overlapped_suffix.png

DataFrame.join()

DataFrame.join() combines the columns of multiple, potentially differently-indexed DataFrames into a single result DataFrame:

In [83]: left = pd.DataFrame(
 ....:    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"]
 ....: )
 ....: 

In [84]: right = pd.DataFrame(
 ....:    {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"]
 ....: )
 ....: 

In [85]: result = left.join(right)

In [86]: result
Out[86]: 
 A   B    C    D
K0  A0  B0   C0   D0
K1  A1  B1  NaN  NaN
K2  A2  B2   C2   D2 

../_images/merging_join.png

In [87]: result = left.join(right, how="outer")

In [88]: result
Out[88]: 
 A    B    C    D
K0   A0   B0   C0   D0
K1   A1   B1  NaN  NaN
K2   A2   B2   C2   D2
K3  NaN  NaN   C3   D3 

../_images/merging_join_outer.png

In [89]: result = left.join(right, how="inner")

In [90]: result
Out[90]: 
 A   B   C   D
K0  A0  B0  C0  D0
K2  A2  B2  C2  D2 

../_images/merging_join_inner.png

DataFrame.join() takes an optional on argument, which may be a column name or multiple column names on which the passed DataFrame is to be aligned:

In [91]: left = pd.DataFrame(
 ....:    {
 ....:        "A": ["A0", "A1", "A2", "A3"],
 ....:        "B": ["B0", "B1", "B2", "B3"],
 ....:        "key": ["K0", "K1", "K0", "K1"],
 ....:    }
 ....: )
 ....: 

In [92]: right = pd.DataFrame({"C": ["C0", "C1"], "D": ["D0", "D1"]}, index=["K0", "K1"])

In [93]: result = left.join(right, on="key")

In [94]: result
Out[94]: 
 A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K0  C0  D0
3  A3  B3  K1  C1  D1 

../_images/merging_join_key_columns.png

In [95]: result = pd.merge(
 ....:    left, right, left_on="key", right_index=True, how="left", sort=False
 ....: )
 ....: 

In [96]: result
Out[96]: 
 A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K0  C0  D0
3  A3  B3  K1  C1  D1 

../_images/merging_merge_key_columns.png

To join on multiple keys, the passed DataFrame must have a MultiIndex:

In [97]: left = pd.DataFrame(
 ....:    {
 ....:        "A": ["A0", "A1", "A2", "A3"],
 ....:        "B": ["B0", "B1", "B2", "B3"],
 ....:        "key1": ["K0", "K0", "K1", "K2"],
 ....:        "key2": ["K0", "K1", "K0", "K1"],
 ....:    }
 ....: )
 ....: 

In [98]: index = pd.MultiIndex.from_tuples(
 ....:    [("K0", "K0"), ("K1", "K0"), ("K2", "K0"), ("K2", "K1")]
 ....: )
 ....: 

In [99]: right = pd.DataFrame(
 ....:    {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]}, index=index
 ....: )
 ....: 

In [100]: result = left.join(right, on=["key1", "key2"])

In [101]: result
Out[101]: 
 A   B key1 key2    C    D
0  A0  B0   K0   K0   C0   D0
1  A1  B1   K0   K1  NaN  NaN
2  A2  B2   K1   K0   C1   D1
3  A3  B3   K2   K1   C3   D3 

../_images/merging_join_multikeys.png

The default behavior of DataFrame.join is to perform a left join, which uses only the keys found in the calling DataFrame. Other join types can be specified with how:

In [102]: result = left.join(right, on=["key1", "key2"], how="inner")

In [103]: result
Out[103]: 
 A   B key1 key2   C   D
0  A0  B0   K0   K0  C0  D0
2  A2  B2   K1   K0  C1  D1
3  A3  B3   K2   K1  C3  D3 

../_images/merging_join_multikeys_inner.png

Joining a single Index to a MultiIndex

You can join a DataFrame with an Index to a DataFrame with a MultiIndex on a level. The name of the Index will be matched against the level name of the MultiIndex:

In [104]: left = pd.DataFrame(
 .....:    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]},
 .....:    index=pd.Index(["K0", "K1", "K2"], name="key"),
 .....: )
 .....: 

In [105]: index = pd.MultiIndex.from_tuples(
 .....:    [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2"), ("K2", "Y3")],
 .....:    names=["key", "Y"],
 .....: )
 .....: 

In [106]: right = pd.DataFrame(
 .....:    {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]},
 .....:    index=index,
 .....: )
 .....: 

In [107]: result = left.join(right, how="inner")

In [108]: result
Out[108]: 
 A   B   C   D
key Y 
K0  Y0  A0  B0  C0  D0
K1  Y1  A1  B1  C1  D1
K2  Y2  A2  B2  C2  D2
 Y3  A2  B2  C3  D3 

../_images/merging_join_multiindex_inner.png

Joining with two MultiIndexes

The MultiIndex of the input argument must be completely used in the join, and must be a subset of the index in the left argument:

In [109]: leftindex = pd.MultiIndex.from_product(
 .....:    [list("abc"), list("xy"), [1, 2]], names=["abc", "xy", "num"]
 .....: )
 .....: 

In [110]: left = pd.DataFrame({"v1": range(12)}, index=leftindex)

In [111]: left
Out[111]: 
 v1
abc xy num 
a   x  1     0
 2     1
 y  1     2
 2     3
b   x  1     4
 2     5
 y  1     6
 2     7
c   x  1     8
 2     9
 y  1    10
 2    11

In [112]: rightindex = pd.MultiIndex.from_product(
 .....:    [list("abc"), list("xy")], names=["abc", "xy"]
 .....: )
 .....: 

In [113]: right = pd.DataFrame({"v2": [100 * i for i in range(1, 7)]}, index=rightindex)

In [114]: right
Out[114]: 
 v2
abc xy 
a   x   100
 y   200
b   x   300
 y   400
c   x   500
 y   600

In [115]: left.join(right, on=["abc", "xy"], how="inner")
Out[115]: 
 v1   v2
abc xy num 
a   x  1     0  100
 2     1  100
 y  1     2  200
 2     3  200
b   x  1     4  300
 2     5  300
 y  1     6  400
 2     7  400
c   x  1     8  500
 2     9  500
 y  1    10  600
 2    11  600 
In [116]: leftindex = pd.MultiIndex.from_tuples(
 .....:    [("K0", "X0"), ("K0", "X1"), ("K1", "X2")], names=["key", "X"]
 .....: )
 .....: 

In [117]: left = pd.DataFrame(
 .....:    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=leftindex
 .....: )
 .....: 

In [118]: rightindex = pd.MultiIndex.from_tuples(
 .....:    [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2"), ("K2", "Y3")], names=["key", "Y"]
 .....: )
 .....: 

In [119]: right = pd.DataFrame(
 .....:    {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]}, index=rightindex
 .....: )
 .....: 

In [120]: result = pd.merge(
 .....:    left.reset_index(), right.reset_index(), on=["key"], how="inner"
 .....: ).set_index(["key", "X", "Y"])
 .....: 

In [121]: result
Out[121]: 
 A   B   C   D
key X  Y 
K0  X0 Y0  A0  B0  C0  D0
 X1 Y0  A1  B1  C0  D0
K1  X2 Y1  A2  B2  C1  D1 

../_images/merging_merge_two_multiindex.png

Merging on a combination of columns and index levels

Strings passed as the on, left_on, and right_on parameters may refer to either column names or index level names. This enables merging DataFrame instances on a combination of index levels and columns without resetting indexes:

In [122]: left_index = pd.Index(["K0", "K0", "K1", "K2"], name="key1")

In [123]: left = pd.DataFrame(
 .....:    {
 .....:        "A": ["A0", "A1", "A2", "A3"],
 .....:        "B": ["B0", "B1", "B2", "B3"],
 .....:        "key2": ["K0", "K1", "K0", "K1"],
 .....:    },
 .....:    index=left_index,
 .....: )
 .....: 

In [124]: right_index = pd.Index(["K0", "K1", "K2", "K2"], name="key1")

In [125]: right = pd.DataFrame(
 .....:    {
 .....:        "C": ["C0", "C1", "C2", "C3"],
 .....:        "D": ["D0", "D1", "D2", "D3"],
 .....:        "key2": ["K0", "K0", "K0", "K1"],
 .....:    },
 .....:    index=right_index,
 .....: )
 .....: 

In [126]: result = left.merge(right, on=["key1", "key2"])

In [127]: result
Out[127]: 
 A   B key2   C   D
key1 
K0    A0  B0   K0  C0  D0
K1    A2  B2   K0  C1  D1
K2    A3  B3   K1  C3  D3 

../_images/merge_on_index_and_column.png

Note

When DataFrames are merged on a string that matches an index level in both arguments, the index level is preserved as an index level in the resulting DataFrame.

Note

When DataFrames are merged using only some of the levels of a MultiIndex, the extra levels will be dropped from the resulting merge. To preserve those levels, use DataFrame.reset_index() on those level names to move them from index levels to columns prior to the merge.

Joining multiple DataFrames

A list or tuple of DataFrames can also be passed to join() to join them together on their indexes:

In [128]: right2 = pd.DataFrame({"v": [7, 8, 9]}, index=["K1", "K1", "K2"])

In [129]: result = left.join([right, right2]) 

../_images/merging_join_multi_df.png

DataFrame.combine_first()

DataFrame.combine_first() updates missing values in one DataFrame with non-missing values from another DataFrame in the corresponding location:

In [130]: df1 = pd.DataFrame(
 .....:    [[np.nan, 3.0, 5.0], [-4.6, np.nan, np.nan], [np.nan, 7.0, np.nan]]
 .....: )
 .....: 

In [131]: df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5.0, 1.6, 4]], index=[1, 2])

In [132]: result = df1.combine_first(df2)

In [133]: result
Out[133]: 
 0    1    2
0  NaN  3.0  5.0
1 -4.6  NaN -8.2
2 -5.0  7.0  4.0 

../_images/merging_combine_first.png

merge_ordered()

merge_ordered() combines ordered data, such as numeric or time series data, with optional filling of missing data using fill_method:

In [134]: left = pd.DataFrame(
 .....:    {"k": ["K0", "K1", "K1", "K2"], "lv": [1, 2, 3, 4], "s": ["a", "b", "c", "d"]}
 .....: )
 .....: 

In [135]: right = pd.DataFrame({"k": ["K1", "K2", "K4"], "rv": [1, 2, 3]})

In [136]: pd.merge_ordered(left, right, fill_method="ffill", left_by="s")
Out[136]: 
 k   lv  s   rv
0   K0  1.0  a  NaN
1   K1  1.0  a  1.0
2   K2  1.0  a  2.0
3   K4  1.0  a  3.0
4   K1  2.0  b  1.0
5   K2  2.0  b  2.0
6   K4  2.0  b  3.0
7   K1  3.0  c  1.0
8   K2  3.0  c  2.0
9   K4  3.0  c  3.0
10  K1  NaN  d  1.0
11  K2  4.0  d  2.0
12  K4  4.0  d  3.0 

merge_asof()

merge_asof() is similar to an ordered left join, except that matching is on the nearest key rather than equal keys. For each row in the left DataFrame, the last row in the right DataFrame is selected where the on key is less than the left's key. Both DataFrames must be sorted by the key.

Optionally, merge_asof() can perform a group-wise merge by matching on the by key in addition to finding the nearest match on the on key:

In [137]: trades = pd.DataFrame(
 .....:    {
 .....:        "time": pd.to_datetime(
 .....:            [
 .....:                "20160525 13:30:00.023",
 .....:                "20160525 13:30:00.038",
 .....:                "20160525 13:30:00.048",
 .....:                "20160525 13:30:00.048",
 .....:                "20160525 13:30:00.048",
 .....:            ]
 .....:        ),
 .....:        "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
 .....:        "price": [51.95, 51.95, 720.77, 720.92, 98.00],
 .....:        "quantity": [75, 155, 100, 100, 100],
 .....:    },
 .....:    columns=["time", "ticker", "price", "quantity"],
 .....: )
 .....: 

In [138]: quotes = pd.DataFrame(
 .....:    {
 .....:        "time": pd.to_datetime(
 .....:            [
 .....:                "20160525 13:30:00.023",
 .....:                "20160525 13:30:00.023",
 .....:                "20160525 13:30:00.030",
 .....:                "20160525 13:30:00.041",
 .....:                "20160525 13:30:00.048",
 .....:                "20160525 13:30:00.049",
 .....:                "20160525 13:30:00.072",
 .....:                "20160525 13:30:00.075",
 .....:            ]
 .....:        ),
 .....:        "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", "GOOG", "AAPL", "GOOG", "MSFT"],
 .....:        "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99, 720.50, 52.01],
 .....:        "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01, 720.88, 52.03],
 .....:    },
 .....:    columns=["time", "ticker", "bid", "ask"],
 .....: )
 .....: 

In [139]: trades
Out[139]: 
 time ticker   price  quantity
0 2016-05-25 13:30:00.023   MSFT   51.95        75
1 2016-05-25 13:30:00.038   MSFT   51.95       155
2 2016-05-25 13:30:00.048   GOOG  720.77       100
3 2016-05-25 13:30:00.048   GOOG  720.92       100
4 2016-05-25 13:30:00.048   AAPL   98.00       100

In [140]: quotes
Out[140]: 
 time ticker     bid     ask
0 2016-05-25 13:30:00.023   GOOG  720.50  720.93
1 2016-05-25 13:30:00.023   MSFT   51.95   51.96
2 2016-05-25 13:30:00.030   MSFT   51.97   51.98
3 2016-05-25 13:30:00.041   MSFT   51.99   52.00
4 2016-05-25 13:30:00.048   GOOG  720.50  720.93
5 2016-05-25 13:30:00.049   AAPL   97.99   98.01
6 2016-05-25 13:30:00.072   GOOG  720.50  720.88
7 2016-05-25 13:30:00.075   MSFT   52.01   52.03

In [141]: pd.merge_asof(trades, quotes, on="time", by="ticker")
Out[141]: 
 time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155   51.97   51.98
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN 

merge_asof() within 2ms between the quote time and the trade time:

In [142]: pd.merge_asof(trades, quotes, on="time", by="ticker", tolerance=pd.Timedelta("2ms"))
Out[142]: 
 time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155     NaN     NaN
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN 

merge_asof() within 10ms between the quote time and the trade time, excluding exact matches on time. Note that though we exclude the exact matches (of the quotes), prior quotes do propagate to that point in time:

In [143]: pd.merge_asof(
 .....:    trades,
 .....:    quotes,
 .....:    on="time",
 .....:    by="ticker",
 .....:    tolerance=pd.Timedelta("10ms"),
 .....:    allow_exact_matches=False,
 .....: )
 .....: 
Out[143]: 
 time ticker   price  quantity    bid    ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75    NaN    NaN
1 2016-05-25 13:30:00.038   MSFT   51.95       155  51.97  51.98
2 2016-05-25 13:30:00.048   GOOG  720.77       100    NaN    NaN
3 2016-05-25 13:30:00.048   GOOG  720.92       100    NaN    NaN
4 2016-05-25 13:30:00.048   AAPL   98.00       100    NaN    NaN 

compare()

The Series.compare() and DataFrame.compare() methods allow you to compare two DataFrame or Series objects, respectively, and summarize their differences:

In [144]: df = pd.DataFrame(
 .....:    {
 .....:        "col1": ["a", "a", "b", "b", "a"],
 .....:        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
 .....:        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
 .....:    },
 .....:    columns=["col1", "col2", "col3"],
 .....: )
 .....: 

In [145]: df
Out[145]: 
 col1  col2  col3
0    a   1.0   1.0
1    a   2.0   2.0
2    b   3.0   3.0
3    b   NaN   4.0
4    a   5.0   5.0

In [146]: df2 = df.copy()

In [147]: df2.loc[0, "col1"] = "c"

In [148]: df2.loc[2, "col3"] = 4.0

In [149]: df2
Out[149]: 
 col1  col2  col3
0    c   1.0   1.0
1    a   2.0   2.0
2    b   3.0   4.0
3    b   NaN   4.0
4    a   5.0   5.0

In [150]: df.compare(df2)
Out[150]: 
 col1       col3 
 self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0 

By default, if two corresponding values are equal, they will be shown as NaN. Furthermore, if all values in an entire row/column are equal, that row/column will be omitted from the result. The remaining differences will be aligned on columns.

Stack the differences on rows:

In [151]: df.compare(df2, align_axis=0)
Out[151]: 
 col1  col3
0 self     a   NaN
 other    c   NaN
2 self   NaN   3.0
 other  NaN   4.0 

Keep all original rows and columns with keep_shape=True:

In [152]: df.compare(df2, keep_shape=True)
Out[152]: 
 col1       col2       col3 
 self other self other self other
0    a     c  NaN   NaN  NaN   NaN
1  NaN   NaN  NaN   NaN  NaN   NaN
2  NaN   NaN  NaN   NaN  3.0   4.0
3  NaN   NaN  NaN   NaN  NaN   NaN
4  NaN   NaN  NaN   NaN  NaN   NaN 

Keep all the original values even if they are equal:

In [153]: df.compare(df2, keep_shape=True, keep_equal=True)
Out[153]: 
 col1       col2       col3 
 self other self other self other
0    a     c  1.0   1.0  1.0   1.0
1    a     a  2.0   2.0  2.0   2.0
2    b     b  3.0   3.0  3.0   4.0
3    b     b  NaN   NaN  4.0   4.0
4    a     a  5.0   5.0  5.0   5.0 
```  ## `concat()`

`concat()`函数沿着一个轴连接任意数量的`Series`或`DataFrame`对象,同时在其他轴上执行可选的集合逻辑(并集或交集)索引。像`numpy.concatenate`一样,`concat()`接受同类型对象的列表或字典,并将它们连接起来。

```py
In [1]: df1 = pd.DataFrame(
 ...:    {
 ...:        "A": ["A0", "A1", "A2", "A3"],
 ...:        "B": ["B0", "B1", "B2", "B3"],
 ...:        "C": ["C0", "C1", "C2", "C3"],
 ...:        "D": ["D0", "D1", "D2", "D3"],
 ...:    },
 ...:    index=[0, 1, 2, 3],
 ...: )
 ...: 

In [2]: df2 = pd.DataFrame(
 ...:    {
 ...:        "A": ["A4", "A5", "A6", "A7"],
 ...:        "B": ["B4", "B5", "B6", "B7"],
 ...:        "C": ["C4", "C5", "C6", "C7"],
 ...:        "D": ["D4", "D5", "D6", "D7"],
 ...:    },
 ...:    index=[4, 5, 6, 7],
 ...: )
 ...: 

In [3]: df3 = pd.DataFrame(
 ...:    {
 ...:        "A": ["A8", "A9", "A10", "A11"],
 ...:        "B": ["B8", "B9", "B10", "B11"],
 ...:        "C": ["C8", "C9", "C10", "C11"],
 ...:        "D": ["D8", "D9", "D10", "D11"],
 ...:    },
 ...:    index=[8, 9, 10, 11],
 ...: )
 ...: 

In [4]: frames = [df1, df2, df3]

In [5]: result = pd.concat(frames)

In [6]: result
Out[6]: 
 A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11 

../_images/merging_concat_basic.png

注意

concat()会对数据进行完全复制,并且反复使用concat()可能会创建不必要的副本。在使用concat()之前,先将所有DataFrameSeries对象收集到一个列表中。

frames = [process_your_file(f) for f in files]
result = pd.concat(frames) 

注意

当连接具有命名轴的DataFrame时,pandas 会尽可能保留这些索引/列名称。在所有输入共享一个公共名称的情况下,该名称将分配给结果。当输入名称不完全一致时,结果将不具有名称。对于MultiIndex也是如此,但逻辑是逐级别分别应用的。

结果轴的连接逻辑

join关键字指定如何处理第一个DataFrame中不存在的轴值。

join='outer'取所有轴值的并集

In [7]: df4 = pd.DataFrame(
 ...:    {
 ...:        "B": ["B2", "B3", "B6", "B7"],
 ...:        "D": ["D2", "D3", "D6", "D7"],
 ...:        "F": ["F2", "F3", "F6", "F7"],
 ...:    },
 ...:    index=[2, 3, 6, 7],
 ...: )
 ...: 

In [8]: result = pd.concat([df1, df4], axis=1)

In [9]: result
Out[9]: 
 A    B    C    D    B    D    F
0   A0   B0   C0   D0  NaN  NaN  NaN
1   A1   B1   C1   D1  NaN  NaN  NaN
2   A2   B2   C2   D2   B2   D2   F2
3   A3   B3   C3   D3   B3   D3   F3
6  NaN  NaN  NaN  NaN   B6   D6   F6
7  NaN  NaN  NaN  NaN   B7   D7   F7 

../_images/merging_concat_axis1.png

join='inner'取轴值的交集

In [10]: result = pd.concat([df1, df4], axis=1, join="inner")

In [11]: result
Out[11]: 
 A   B   C   D   B   D   F
2  A2  B2  C2  D2  B2  D2  F2
3  A3  B3  C3  D3  B3  D3  F3 

../_images/merging_concat_axis1_inner.png

要使用原始DataFrame精确索引执行有效的“左连接”,结果可以重新索引。

In [12]: result = pd.concat([df1, df4], axis=1).reindex(df1.index)

In [13]: result
Out[13]: 
 A   B   C   D    B    D    F
0  A0  B0  C0  D0  NaN  NaN  NaN
1  A1  B1  C1  D1  NaN  NaN  NaN
2  A2  B2  C2  D2   B2   D2   F2
3  A3  B3  C3  D3   B3   D3   F3 

../_images/merging_concat_axis1_join_axes.png ### 在连接轴上忽略索引

对于没有有意义索引的DataFrame对象,ignore_index会忽略重叠的索引。

In [14]: result = pd.concat([df1, df4], ignore_index=True, sort=False)

In [15]: result
Out[15]: 
 A   B    C   D    F
0   A0  B0   C0  D0  NaN
1   A1  B1   C1  D1  NaN
2   A2  B2   C2  D2  NaN
3   A3  B3   C3  D3  NaN
4  NaN  B2  NaN  D2   F2
5  NaN  B3  NaN  D3   F3
6  NaN  B6  NaN  D6   F6
7  NaN  B7  NaN  D7   F7 

../_images/merging_concat_ignore_index.png ### 将SeriesDataFrame连接在一起

你可以连接一组SeriesDataFrame对象。Series将被转换为DataFrame,列名为Series的名称。

In [16]: s1 = pd.Series(["X0", "X1", "X2", "X3"], name="X")

In [17]: result = pd.concat([df1, s1], axis=1)

In [18]: result
Out[18]: 
 A   B   C   D   X
0  A0  B0  C0  D0  X0
1  A1  B1  C1  D1  X1
2  A2  B2  C2  D2  X2
3  A3  B3  C3  D3  X3 

../_images/merging_concat_mixed_ndim.png

未命名的Series将按顺序编号。

In [19]: s2 = pd.Series(["_0", "_1", "_2", "_3"])

In [20]: result = pd.concat([df1, s2, s2, s2], axis=1)

In [21]: result
Out[21]: 
 A   B   C   D   0   1   2
0  A0  B0  C0  D0  _0  _0  _0
1  A1  B1  C1  D1  _1  _1  _1
2  A2  B2  C2  D2  _2  _2  _2
3  A3  B3  C3  D3  _3  _3  _3 

../_images/merging_concat_unnamed_series.png

ignore_index=True将删除所有名称引用。

In [22]: result = pd.concat([df1, s1], axis=1, ignore_index=True)

In [23]: result
Out[23]: 
 0   1   2   3   4
0  A0  B0  C0  D0  X0
1  A1  B1  C1  D1  X1
2  A2  B2  C2  D2  X2
3  A3  B3  C3  D3  X3 

../_images/merging_concat_series_ignore_index.png

结果的keys

keys参数会为结果的索引或列添加另一个轴级别(创建一个MultiIndex),将特定键与每个原始DataFrame关联起来。

In [24]: result = pd.concat(frames, keys=["x", "y", "z"])

In [25]: result
Out[25]: 
 A    B    C    D
x 0    A0   B0   C0   D0
 1    A1   B1   C1   D1
 2    A2   B2   C2   D2
 3    A3   B3   C3   D3
y 4    A4   B4   C4   D4
 5    A5   B5   C5   D5
 6    A6   B6   C6   D6
 7    A7   B7   C7   D7
z 8    A8   B8   C8   D8
 9    A9   B9   C9   D9
 10  A10  B10  C10  D10
 11  A11  B11  C11  D11

In [26]: result.loc["y"]
Out[26]: 
 A   B   C   D
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7 

../_images/merging_concat_keys.png

keys参数可以在基于现有Series创建新DataFrame时覆盖列名。

In [27]: s3 = pd.Series([0, 1, 2, 3], name="foo")

In [28]: s4 = pd.Series([0, 1, 2, 3])

In [29]: s5 = pd.Series([0, 1, 4, 5])

In [30]: pd.concat([s3, s4, s5], axis=1)
Out[30]: 
 foo  0  1
0    0  0  0
1    1  1  1
2    2  2  4
3    3  3  5

In [31]: pd.concat([s3, s4, s5], axis=1, keys=["red", "blue", "yellow"])
Out[31]: 
 red  blue  yellow
0    0     0       0
1    1     1       1
2    2     2       4
3    3     3       5 

你也可以向concat()传递一个字典,此时字典键将用于keys参数,除非指定了其他keys参数:

In [32]: pieces = {"x": df1, "y": df2, "z": df3}

In [33]: result = pd.concat(pieces)

In [34]: result
Out[34]: 
 A    B    C    D
x 0    A0   B0   C0   D0
 1    A1   B1   C1   D1
 2    A2   B2   C2   D2
 3    A3   B3   C3   D3
y 4    A4   B4   C4   D4
 5    A5   B5   C5   D5
 6    A6   B6   C6   D6
 7    A7   B7   C7   D7
z 8    A8   B8   C8   D8
 9    A9   B9   C9   D9
 10  A10  B10  C10  D10
 11  A11  B11  C11  D11 

../_images/merging_concat_dict.png

In [35]: result = pd.concat(pieces, keys=["z", "y"])

In [36]: result
Out[36]: 
 A    B    C    D
z 8    A8   B8   C8   D8
 9    A9   B9   C9   D9
 10  A10  B10  C10  D10
 11  A11  B11  C11  D11
y 4    A4   B4   C4   D4
 5    A5   B5   C5   D5
 6    A6   B6   C6   D6
 7    A7   B7   C7   D7 

../_images/merging_concat_dict_keys.png

创建的MultiIndex具有从传递的键和DataFrame片段的索引构建的级别:

In [37]: result.index.levels
Out[37]: FrozenList([['z', 'y'], [4, 5, 6, 7, 8, 9, 10, 11]]) 

levels参数允许指定与keys关联的结果级别

In [38]: result = pd.concat(
 ....:    pieces, keys=["x", "y", "z"], levels=[["z", "y", "x", "w"]], names=["group_key"]
 ....: )
 ....: 

In [39]: result
Out[39]: 
 A    B    C    D
group_key 
x         0    A0   B0   C0   D0
 1    A1   B1   C1   D1
 2    A2   B2   C2   D2
 3    A3   B3   C3   D3
y         4    A4   B4   C4   D4
 5    A5   B5   C5   D5
 6    A6   B6   C6   D6
 7    A7   B7   C7   D7
z         8    A8   B8   C8   D8
 9    A9   B9   C9   D9
 10  A10  B10  C10  D10
 11  A11  B11  C11  D11 

../_images/merging_concat_dict_keys_names.png

In [40]: result.index.levels
Out[40]: FrozenList([['z', 'y', 'x', 'w'], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]]) 

将行附加到DataFrame

如果您有一个想要附加为单行到DataFrameSeries,您可以将行转换为DataFrame并使用concat()

In [41]: s2 = pd.Series(["X0", "X1", "X2", "X3"], index=["A", "B", "C", "D"])

In [42]: result = pd.concat([df1, s2.to_frame().T], ignore_index=True)

In [43]: result
Out[43]: 
 A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  X0  X1  X2  X3 

../_images/merging_append_series_as_row.png

结果轴的连接逻辑

join 关键字指定如何处理第一个DataFrame中不存在的轴值。

join='outer'取所有轴值的并集

In [7]: df4 = pd.DataFrame(
 ...:    {
 ...:        "B": ["B2", "B3", "B6", "B7"],
 ...:        "D": ["D2", "D3", "D6", "D7"],
 ...:        "F": ["F2", "F3", "F6", "F7"],
 ...:    },
 ...:    index=[2, 3, 6, 7],
 ...: )
 ...: 

In [8]: result = pd.concat([df1, df4], axis=1)

In [9]: result
Out[9]: 
 A    B    C    D    B    D    F
0   A0   B0   C0   D0  NaN  NaN  NaN
1   A1   B1   C1   D1  NaN  NaN  NaN
2   A2   B2   C2   D2   B2   D2   F2
3   A3   B3   C3   D3   B3   D3   F3
6  NaN  NaN  NaN  NaN   B6   D6   F6
7  NaN  NaN  NaN  NaN   B7   D7   F7 

../_images/merging_concat_axis1.png

join='inner'取轴值的交集

In [10]: result = pd.concat([df1, df4], axis=1, join="inner")

In [11]: result
Out[11]: 
 A   B   C   D   B   D   F
2  A2  B2  C2  D2  B2  D2  F2
3  A3  B3  C3  D3  B3  D3  F3 

../_images/merging_concat_axis1_inner.png

要使用原始DataFrame确切索引执行有效的“左”连接,结果可以重新索引。

In [12]: result = pd.concat([df1, df4], axis=1).reindex(df1.index)

In [13]: result
Out[13]: 
 A   B   C   D    B    D    F
0  A0  B0  C0  D0  NaN  NaN  NaN
1  A1  B1  C1  D1  NaN  NaN  NaN
2  A2  B2  C2  D2   B2   D2   F2
3  A3  B3  C3  D3   B3   D3   F3 

../_images/merging_concat_axis1_join_axes.png ### 在连接轴上忽略索引

对于没有有意义索引的DataFrame对象,ignore_index会忽略重叠的索引。

In [14]: result = pd.concat([df1, df4], ignore_index=True, sort=False)

In [15]: result
Out[15]: 
 A   B    C   D    F
0   A0  B0   C0  D0  NaN
1   A1  B1   C1  D1  NaN
2   A2  B2   C2  D2  NaN
3   A3  B3   C3  D3  NaN
4  NaN  B2  NaN  D2   F2
5  NaN  B3  NaN  D3   F3
6  NaN  B6  NaN  D6   F6
7  NaN  B7  NaN  D7   F7 

../_images/merging_concat_ignore_index.png ### 将SeriesDataFrame连接在一起

您可以连接一组SeriesDataFrame对象。Series将转换为具有列名的DataFrame,列名为Series的名称。

In [16]: s1 = pd.Series(["X0", "X1", "X2", "X3"], name="X")

In [17]: result = pd.concat([df1, s1], axis=1)

In [18]: result
Out[18]: 
 A   B   C   D   X
0  A0  B0  C0  D0  X0
1  A1  B1  C1  D1  X1
2  A2  B2  C2  D2  X2
3  A3  B3  C3  D3  X3 

../_images/merging_concat_mixed_ndim.png

未命名的Series将按顺序编号。

In [19]: s2 = pd.Series(["_0", "_1", "_2", "_3"])

In [20]: result = pd.concat([df1, s2, s2, s2], axis=1)

In [21]: result
Out[21]: 
 A   B   C   D   0   1   2
0  A0  B0  C0  D0  _0  _0  _0
1  A1  B1  C1  D1  _1  _1  _1
2  A2  B2  C2  D2  _2  _2  _2
3  A3  B3  C3  D3  _3  _3  _3 

../_images/merging_concat_unnamed_series.png

ignore_index=True将删除所有名称引用。

In [22]: result = pd.concat([df1, s1], axis=1, ignore_index=True)

In [23]: result
Out[23]: 
 0   1   2   3   4
0  A0  B0  C0  D0  X0
1  A1  B1  C1  D1  X1
2  A2  B2  C2  D2  X2
3  A3  B3  C3  D3  X3 

../_images/merging_concat_series_ignore_index.png

结果keys

keys参数将另一个轴级别添加到结果索引或列(创建一个MultiIndex),将特定键与每个原始DataFrame关联起来。

In [24]: result = pd.concat(frames, keys=["x", "y", "z"])

In [25]: result
Out[25]: 
 A    B    C    D
x 0    A0   B0   C0   D0
 1    A1   B1   C1   D1
 2    A2   B2   C2   D2
 3    A3   B3   C3   D3
y 4    A4   B4   C4   D4
 5    A5   B5   C5   D5
 6    A6   B6   C6   D6
 7    A7   B7   C7   D7
z 8    A8   B8   C8   D8
 9    A9   B9   C9   D9
 10  A10  B10  C10  D10
 11  A11  B11  C11  D11

In [26]: result.loc["y"]
Out[26]: 
 A   B   C   D
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7 

../_images/merging_concat_keys.png

keys参数可以在基于现有Series创建新DataFrame时覆盖列名。

In [27]: s3 = pd.Series([0, 1, 2, 3], name="foo")

In [28]: s4 = pd.Series([0, 1, 2, 3])

In [29]: s5 = pd.Series([0, 1, 4, 5])

In [30]: pd.concat([s3, s4, s5], axis=1)
Out[30]: 
 foo  0  1
0    0  0  0
1    1  1  1
2    2  2  4
3    3  3  5

In [31]: pd.concat([s3, s4, s5], axis=1, keys=["red", "blue", "yellow"])
Out[31]: 
 red  blue  yellow
0    0     0       0
1    1     1       1
2    2     2       4
3    3     3       5 

你也可以将字典传递给concat(),在这种情况下,除非指定了其他keys参数,否则字典键将用于keys参数:

In [32]: pieces = {"x": df1, "y": df2, "z": df3}

In [33]: result = pd.concat(pieces)

In [34]: result
Out[34]: 
 A    B    C    D
x 0    A0   B0   C0   D0
 1    A1   B1   C1   D1
 2    A2   B2   C2   D2
 3    A3   B3   C3   D3
y 4    A4   B4   C4   D4
 5    A5   B5   C5   D5
 6    A6   B6   C6   D6
 7    A7   B7   C7   D7
z 8    A8   B8   C8   D8
 9    A9   B9   C9   D9
 10  A10  B10  C10  D10
 11  A11  B11  C11  D11 

../_images/merging_concat_dict.png

In [35]: result = pd.concat(pieces, keys=["z", "y"])

In [36]: result
Out[36]: 
 A    B    C    D
z 8    A8   B8   C8   D8
 9    A9   B9   C9   D9
 10  A10  B10  C10  D10
 11  A11  B11  C11  D11
y 4    A4   B4   C4   D4
 5    A5   B5   C5   D5
 6    A6   B6   C6   D6
 7    A7   B7   C7   D7 

../_images/merging_concat_dict_keys.png

创建的MultiIndex具有从传递的键和DataFrame片段的索引构造的级别:

In [37]: result.index.levels
Out[37]: FrozenList([['z', 'y'], [4, 5, 6, 7, 8, 9, 10, 11]]) 

levels参数允许指定与keys关联的结果级别

In [38]: result = pd.concat(
 ....:    pieces, keys=["x", "y", "z"], levels=[["z", "y", "x", "w"]], names=["group_key"]
 ....: )
 ....: 

In [39]: result
Out[39]: 
 A    B    C    D
group_key 
x         0    A0   B0   C0   D0
 1    A1   B1   C1   D1
 2    A2   B2   C2   D2
 3    A3   B3   C3   D3
y         4    A4   B4   C4   D4
 5    A5   B5   C5   D5
 6    A6   B6   C6   D6
 7    A7   B7   C7   D7
z         8    A8   B8   C8   D8
 9    A9   B9   C9   D9
 10  A10  B10  C10  D10
 11  A11  B11  C11  D11 

../_images/merging_concat_dict_keys_names.png

In [40]: result.index.levels
Out[40]: FrozenList([['z', 'y', 'x', 'w'], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]]) 

### Appending rows to a DataFrame

If you have a `Series` that you want to append as a single row to a `DataFrame`, you can convert the row into a `DataFrame` and use `concat()`:

In [41]: s2 = pd.Series(["X0", "X1", "X2", "X3"], index=["A", "B", "C", "D"])

In [42]: result = pd.concat([df1, s2.to_frame().T], ignore_index=True)

In [43]: result
Out[43]: 
 A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  X0  X1  X2  X3 

../_images/merging_append_series_as_row.png

## merge()

merge()执行类似于关系数据库(如 SQL)的连接操作。熟悉 SQL 但是对 pandas 新手的用户可以参考与 SQL 的比较。

### Merge types

merge()实现常见的 SQL 风格连接操作。

  • one-to-one: joining two `DataFrame` objects on their indexes, which must contain unique values.

  • many-to-one: joining a unique index to one or more columns in a different `DataFrame`.

  • many-to-many: joining columns on columns.

Note

When joining columns on columns (potentially a many-to-many join), any indexes on the passed `DataFrame` objects will be discarded; the sketch after this note illustrates this.

For a many-to-many join, if a key combination appears more than once in both tables, the `DataFrame` will have the Cartesian product of the associated data.
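
A minimal sketch of the first point, using small hypothetical frames that are not part of the original examples: merging column-on-column produces a fresh `RangeIndex`, discarding whatever row labels the inputs carried.

```py
import pandas as pd

# Hypothetical frames with custom row labels (not from the original docs).
left = pd.DataFrame({"key": ["K0", "K1"], "A": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"key": ["K0", "K1"], "B": [3, 4]}, index=["x", "y"])

# Column-on-column merge: the "a"/"b" and "x"/"y" labels are discarded
# and the result gets a fresh RangeIndex.
merged = pd.merge(left, right, on="key")
print(merged.index)  # RangeIndex(start=0, stop=2, step=1)
```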

In [44]: left = pd.DataFrame(
 ....:    {
 ....:        "key": ["K0", "K1", "K2", "K3"],
 ....:        "A": ["A0", "A1", "A2", "A3"],
 ....:        "B": ["B0", "B1", "B2", "B3"],
 ....:    }
 ....: )
 ....: 

In [45]: right = pd.DataFrame(
 ....:    {
 ....:        "key": ["K0", "K1", "K2", "K3"],
 ....:        "C": ["C0", "C1", "C2", "C3"],
 ....:        "D": ["D0", "D1", "D2", "D3"],
 ....:    }
 ....: )
 ....: 

In [46]: result = pd.merge(left, right, on="key")

In [47]: result
Out[47]: 
 key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
3  K3  A3  B3  C3  D3 

../_images/merging_merge_on_key.png

merge()how参数指定了包含在结果表中的键。如果一个键组合不存在于左表或右表中,那么连接表中的值将为NA。以下是how选项及其 SQL 等效名称的摘要:

| Merge method | SQL join name | Description |
| --- | --- | --- |
| `left` | `LEFT OUTER JOIN` | Use keys from the left frame only |
| `right` | `RIGHT OUTER JOIN` | Use keys from the right frame only |
| `outer` | `FULL OUTER JOIN` | Use the union of keys from both frames |
| `inner` | `INNER JOIN` | Use the intersection of keys from both frames |
| `cross` | `CROSS JOIN` | Create the Cartesian product of the rows of both frames |

In [48]: left = pd.DataFrame(
 ....:   {
 ....:      "key1": ["K0", "K0", "K1", "K2"],
 ....:      "key2": ["K0", "K1", "K0", "K1"],
 ....:      "A": ["A0", "A1", "A2", "A3"],
 ....:      "B": ["B0", "B1", "B2", "B3"],
 ....:   }
 ....: )
 ....: 

In [49]: right = pd.DataFrame(
 ....:   {
 ....:      "key1": ["K0", "K1", "K1", "K2"],
 ....:      "key2": ["K0", "K0", "K0", "K0"],
 ....:      "C": ["C0", "C1", "C2", "C3"],
 ....:      "D": ["D0", "D1", "D2", "D3"],
 ....:   }
 ....: )
 ....: 

In [50]: result = pd.merge(left, right, how="left", on=["key1", "key2"])

In [51]: result
Out[51]: 
 key1 key2   A   B    C    D
0   K0   K0  A0  B0   C0   D0
1   K0   K1  A1  B1  NaN  NaN
2   K1   K0  A2  B2   C1   D1
3   K1   K0  A2  B2   C2   D2
4   K2   K1  A3  B3  NaN  NaN 

../_images/merging_merge_on_key_left.png

In [52]: result = pd.merge(left, right, how="right", on=["key1", "key2"])

In [53]: result
Out[53]: 
 key1 key2    A    B   C   D
0   K0   K0   A0   B0  C0  D0
1   K1   K0   A2   B2  C1  D1
2   K1   K0   A2   B2  C2  D2
3   K2   K0  NaN  NaN  C3  D3 

../_images/merging_merge_on_key_right.png

In [54]: result = pd.merge(left, right, how="outer", on=["key1", "key2"])

In [55]: result
Out[55]: 
 key1 key2    A    B    C    D
0   K0   K0   A0   B0   C0   D0
1   K0   K1   A1   B1  NaN  NaN
2   K1   K0   A2   B2   C1   D1
3   K1   K0   A2   B2   C2   D2
4   K2   K0  NaN  NaN   C3   D3
5   K2   K1   A3   B3  NaN  NaN 

../_images/merging_merge_on_key_outer.png

In [56]: result = pd.merge(left, right, how="inner", on=["key1", "key2"])

In [57]: result
Out[57]: 
 key1 key2   A   B   C   D
0   K0   K0  A0  B0  C0  D0
1   K1   K0  A2  B2  C1  D1
2   K1   K0  A2  B2  C2  D2 

../_images/merging_merge_on_key_inner.png

In [58]: result = pd.merge(left, right, how="cross")

In [59]: result
Out[59]: 
 key1_x key2_x   A   B key1_y key2_y   C   D
0      K0     K0  A0  B0     K0     K0  C0  D0
1      K0     K0  A0  B0     K1     K0  C1  D1
2      K0     K0  A0  B0     K1     K0  C2  D2
3      K0     K0  A0  B0     K2     K0  C3  D3
4      K0     K1  A1  B1     K0     K0  C0  D0
..    ...    ...  ..  ..    ...    ...  ..  ..
11     K1     K0  A2  B2     K2     K0  C3  D3
12     K2     K1  A3  B3     K0     K0  C0  D0
13     K2     K1  A3  B3     K1     K0  C1  D1
14     K2     K1  A3  B3     K1     K0  C2  D2
15     K2     K1  A3  B3     K2     K0  C3  D3

[16 rows x 8 columns] 

../_images/merging_merge_cross.png

如果MultiIndex的名称与DataFrame的列对应,则可以使用Series.reset_index()Series转换为DataFrame,然后再进行合并。

In [60]: df = pd.DataFrame({"Let": ["A", "B", "C"], "Num": [1, 2, 3]})

In [61]: df
Out[61]: 
 Let  Num
0   A    1
1   B    2
2   C    3

In [62]: ser = pd.Series(
 ....:    ["a", "b", "c", "d", "e", "f"],
 ....:    index=pd.MultiIndex.from_arrays(
 ....:        [["A", "B", "C"] * 2, [1, 2, 3, 4, 5, 6]], names=["Let", "Num"]
 ....:    ),
 ....: )
 ....: 

In [63]: ser
Out[63]: 
Let  Num
A    1      a
B    2      b
C    3      c
A    4      d
B    5      e
C    6      f
dtype: object

In [64]: pd.merge(df, ser.reset_index(), on=["Let", "Num"])
Out[64]: 
 Let  Num  0
0   A    1  a
1   B    2  b
2   C    3  c 

DataFrame中具有重复连接键执行外连接

In [65]: left = pd.DataFrame({"A": [1, 2], "B": [2, 2]})

In [66]: right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

In [67]: result = pd.merge(left, right, on="B", how="outer")

In [68]: result
Out[68]: 
 A_x  B  A_y
0    1  2    4
1    1  2    5
2    1  2    6
3    2  2    4
4    2  2    5
5    2  2    6 

../_images/merging_merge_on_key_dup.png

Warning

Merging on duplicate keys significantly increases the dimensions of the result and can cause a memory overflow, as the sketch below illustrates.
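
A small sketch (with hypothetical frames, separate from the example above) of how quickly the result can grow: `n` matching rows on the left and `m` on the right yield `n * m` rows in the output.

```py
import pandas as pd

# Hypothetical frames: 3 left rows and 4 right rows all share the key 2.
left = pd.DataFrame({"B": [2] * 3, "x": range(3)})
right = pd.DataFrame({"B": [2] * 4, "y": range(4)})

# Every left row matches every right row, so the result has 3 * 4 rows.
result = pd.merge(left, right, on="B", how="outer")
print(len(result))  # 12
```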

### Merge key uniqueness

The `validate` argument checks the uniqueness of merge keys. Checking key uniqueness before the merge operation can prevent memory overflows and unexpected key duplication.

In [69]: left = pd.DataFrame({"A": [1, 2], "B": [1, 2]})

In [70]: right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

In [71]: result = pd.merge(left, right, on="B", how="outer", validate="one_to_one")
---------------------------------------------------------------------------
MergeError  Traceback (most recent call last)
Cell In[71], line 1
----> 1 result = pd.merge(left, right, on="B", how="outer", validate="one_to_one")

File ~/work/pandas/pandas/pandas/core/reshape/merge.py:170, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
  155     return _cross_merge(
  156         left_df,
  157         right_df,
   (...)
  167         copy=copy,
  168     )
  169 else:
--> 170     op = _MergeOperation(
  171         left_df,
  172         right_df,
  173         how=how,
  174         on=on,
  175         left_on=left_on,
  176         right_on=right_on,
  177         left_index=left_index,
  178         right_index=right_index,
  179         sort=sort,
  180         suffixes=suffixes,
  181         indicator=indicator,
  182         validate=validate,
  183     )
  184     return op.get_result(copy=copy)

File ~/work/pandas/pandas/pandas/core/reshape/merge.py:813, in _MergeOperation.__init__(self, left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, indicator, validate)
  809 # If argument passed to validate,
  810 # check if columns specified as unique
  811 # are in fact unique.
  812 if validate is not None:
--> 813     self._validate_validate_kwd(validate)

File ~/work/pandas/pandas/pandas/core/reshape/merge.py:1657, in _MergeOperation._validate_validate_kwd(self, validate)
  1653         raise MergeError(
  1654             "Merge keys are not unique in left dataset; not a one-to-one merge"
  1655         )
  1656     if not right_unique:
-> 1657         raise MergeError(
  1658             "Merge keys are not unique in right dataset; not a one-to-one merge"
  1659         )
  1661 elif validate in ["one_to_many", "1:m"]:
  1662     if not left_unique:

MergeError: Merge keys are not unique in right dataset; not a one-to-one merge 

If the user is aware of the duplicates in the right `DataFrame` but wants to ensure there are no duplicates in the left `DataFrame`, one can use the `validate='one_to_many'` argument instead, which will not raise an exception.

In [72]: pd.merge(left, right, on="B", how="outer", validate="one_to_many")
Out[72]: 
 A_x  B  A_y
0    1  1  NaN
1    2  2  4.0
2    2  2  5.0
3    2  2  6.0 

### Merge result indicator

`merge()` accepts the argument `indicator`. If `True`, a Categorical-type column called `_merge` will be added to the output object that takes on the values:

> | Observation origin | `_merge` value |
> | --- | --- |
> | Merge key only in `'left'` frame | `left_only` |
> | Merge key only in `'right'` frame | `right_only` |
> | Merge key in both frames | `both` |

In [73]: df1 = pd.DataFrame({"col1": [0, 1], "col_left": ["a", "b"]})

In [74]: df2 = pd.DataFrame({"col1": [1, 2, 2], "col_right": [2, 2, 2]})

In [75]: pd.merge(df1, df2, on="col1", how="outer", indicator=True)
Out[75]: 
 col1 col_left  col_right      _merge
0     0        a        NaN   left_only
1     1        b        2.0        both
2     2      NaN        2.0  right_only
3     2      NaN        2.0  right_only 

A string argument to `indicator` will be used as the name of the indicator column.

In [76]: pd.merge(df1, df2, on="col1", how="outer", indicator="indicator_column")
Out[76]: 
 col1 col_left  col_right indicator_column
0     0        a        NaN        left_only
1     1        b        2.0             both
2     2      NaN        2.0       right_only
3     2      NaN        2.0       right_only 

### Overlapping value columns

The merge `suffixes` argument takes a tuple of strings to append to overlapping column names in the input `DataFrame`s to disambiguate the result columns:

In [77]: left = pd.DataFrame({"k": ["K0", "K1", "K2"], "v": [1, 2, 3]})

In [78]: right = pd.DataFrame({"k": ["K0", "K0", "K3"], "v": [4, 5, 6]})

In [79]: result = pd.merge(left, right, on="k")

In [80]: result
Out[80]: 
 k  v_x  v_y
0  K0    1    4
1  K0    1    5 

../_images/merging_merge_overlapped.png

In [81]: result = pd.merge(left, right, on="k", suffixes=("_l", "_r"))

In [82]: result
Out[82]: 
 k  v_l  v_r
0  K0    1    4
1  K0    1    5 

../_images/merging_merge_overlapped_suffix.png

## DataFrame.join()

`DataFrame.join()` combines the columns of multiple, potentially differently-indexed `DataFrame`s into a single result `DataFrame`.

In [83]: left = pd.DataFrame(
 ....:    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"]
 ....: )
 ....: 

In [84]: right = pd.DataFrame(
 ....:    {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"]
 ....: )
 ....: 

In [85]: result = left.join(right)

In [86]: result
Out[86]: 
 A   B    C    D
K0  A0  B0   C0   D0
K1  A1  B1  NaN  NaN
K2  A2  B2   C2   D2 

../_images/merging_join.png

In [87]: result = left.join(right, how="outer")

In [88]: result
Out[88]: 
 A    B    C    D
K0   A0   B0   C0   D0
K1   A1   B1  NaN  NaN
K2   A2   B2   C2   D2
K3  NaN  NaN   C3   D3 

../_images/merging_join_outer.png

In [89]: result = left.join(right, how="inner")

In [90]: result
Out[90]: 
 A   B   C   D
K0  A0  B0  C0  D0
K2  A2  B2  C2  D2 

../_images/merging_join_inner.png

`DataFrame.join()` takes an optional `on` argument, which may be a column name or multiple column names that the passed `DataFrame` is to be aligned on.

In [91]: left = pd.DataFrame(
 ....:    {
 ....:        "A": ["A0", "A1", "A2", "A3"],
 ....:        "B": ["B0", "B1", "B2", "B3"],
 ....:        "key": ["K0", "K1", "K0", "K1"],
 ....:    }
 ....: )
 ....: 

In [92]: right = pd.DataFrame({"C": ["C0", "C1"], "D": ["D0", "D1"]}, index=["K0", "K1"])

In [93]: result = left.join(right, on="key")

In [94]: result
Out[94]: 
 A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K0  C0  D0
3  A3  B3  K1  C1  D1 

../_images/merging_join_key_columns.png

In [95]: result = pd.merge(
 ....:    left, right, left_on="key", right_index=True, how="left", sort=False
 ....: )
 ....: 

In [96]: result
Out[96]: 
 A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K0  C0  D0
3  A3  B3  K1  C1  D1 

../_images/merging_merge_key_columns.png

To join on multiple keys, the passed `DataFrame` must have a `MultiIndex`:

In [97]: left = pd.DataFrame(
 ....:    {
 ....:        "A": ["A0", "A1", "A2", "A3"],
 ....:        "B": ["B0", "B1", "B2", "B3"],
 ....:        "key1": ["K0", "K0", "K1", "K2"],
 ....:        "key2": ["K0", "K1", "K0", "K1"],
 ....:    }
 ....: )
 ....: 

In [98]: index = pd.MultiIndex.from_tuples(
 ....:    [("K0", "K0"), ("K1", "K0"), ("K2", "K0"), ("K2", "K1")]
 ....: )
 ....: 

In [99]: right = pd.DataFrame(
 ....:    {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]}, index=index
 ....: )
 ....: 

In [100]: result = left.join(right, on=["key1", "key2"])

In [101]: result
Out[101]: 
 A   B key1 key2    C    D
0  A0  B0   K0   K0   C0   D0
1  A1  B1   K0   K1  NaN  NaN
2  A2  B2   K1   K0   C1   D1
3  A3  B3   K2   K1   C3   D3 

../_images/merging_join_multikeys.png

The default behavior of `DataFrame.join` is to perform a left join, using only the keys found in the calling `DataFrame`. Other join types can be specified with `how`.

In [102]: result = left.join(right, on=["key1", "key2"], how="inner")

In [103]: result
Out[103]: 
 A   B key1 key2   C   D
0  A0  B0   K0   K0  C0  D0
2  A2  B2   K1   K0  C1  D1
3  A3  B3   K2   K1  C3  D3 

../_images/merging_join_multikeys_inner.png

### Joining a single Index to a MultiIndex

You can join a `DataFrame` with an `Index` to a `DataFrame` with a `MultiIndex` on a level. The `name` of the `Index` will be matched against the level name of the `MultiIndex`.

In [104]: left = pd.DataFrame(
 .....:    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]},
 .....:    index=pd.Index(["K0", "K1", "K2"], name="key"),
 .....: )
 .....: 

In [105]: index = pd.MultiIndex.from_tuples(
 .....:    [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2"), ("K2", "Y3")],
 .....:    names=["key", "Y"],
 .....: )
 .....: 

In [106]: right = pd.DataFrame(
 .....:    {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]},
 .....:    index=index,
 .....: )
 .....: 

In [107]: result = left.join(right, how="inner")

In [108]: result
Out[108]: 
 A   B   C   D
key Y 
K0  Y0  A0  B0  C0  D0
K1  Y1  A1  B1  C1  D1
K2  Y2  A2  B2  C2  D2
 Y3  A2  B2  C3  D3 

../_images/merging_join_multiindex_inner.png

### Joining with two MultiIndexes

The `MultiIndex` of the input argument must be completely used in the join, and must be a subset of the indices in the left argument.

In [109]: leftindex = pd.MultiIndex.from_product(
 .....:    [list("abc"), list("xy"), [1, 2]], names=["abc", "xy", "num"]
 .....: )
 .....: 

In [110]: left = pd.DataFrame({"v1": range(12)}, index=leftindex)

In [111]: left
Out[111]: 
 v1
abc xy num 
a   x  1     0
 2     1
 y  1     2
 2     3
b   x  1     4
 2     5
 y  1     6
 2     7
c   x  1     8
 2     9
 y  1    10
 2    11

In [112]: rightindex = pd.MultiIndex.from_product(
 .....:    [list("abc"), list("xy")], names=["abc", "xy"]
 .....: )
 .....: 

In [113]: right = pd.DataFrame({"v2": [100 * i for i in range(1, 7)]}, index=rightindex)

In [114]: right
Out[114]: 
 v2
abc xy 
a   x   100
 y   200
b   x   300
 y   400
c   x   500
 y   600

In [115]: left.join(right, on=["abc", "xy"], how="inner")
Out[115]: 
 v1   v2
abc xy num 
a   x  1     0  100
 2     1  100
 y  1     2  200
 2     3  200
b   x  1     4  300
 2     5  300
 y  1     6  400
 2     7  400
c   x  1     8  500
 2     9  500
 y  1    10  600
 2    11  600 
In [116]: leftindex = pd.MultiIndex.from_tuples(
 .....:    [("K0", "X0"), ("K0", "X1"), ("K1", "X2")], names=["key", "X"]
 .....: )
 .....: 

In [117]: left = pd.DataFrame(
 .....:    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=leftindex
 .....: )
 .....: 

In [118]: rightindex = pd.MultiIndex.from_tuples(
 .....:    [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2"), ("K2", "Y3")], names=["key", "Y"]
 .....: )
 .....: 

In [119]: right = pd.DataFrame(
 .....:    {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]}, index=rightindex
 .....: )
 .....: 

In [120]: result = pd.merge(
 .....:    left.reset_index(), right.reset_index(), on=["key"], how="inner"
 .....: ).set_index(["key", "X", "Y"])
 .....: 

In [121]: result
Out[121]: 
 A   B   C   D
key X  Y 
K0  X0 Y0  A0  B0  C0  D0
 X1 Y0  A1  B1  C0  D0
K1  X2 Y1  A2  B2  C1  D1 

../_images/merging_merge_two_multiindex.png

### Merging on a combination of columns and index levels

Strings passed as the `on`, `left_on`, and `right_on` parameters may refer to either column names or index level names. This enables merging `DataFrame` instances on a combination of index levels and columns without resetting indexes.

In [122]: left_index = pd.Index(["K0", "K0", "K1", "K2"], name="key1")

In [123]: left = pd.DataFrame(
 .....:    {
 .....:        "A": ["A0", "A1", "A2", "A3"],
 .....:        "B": ["B0", "B1", "B2", "B3"],
 .....:        "key2": ["K0", "K1", "K0", "K1"],
 .....:    },
 .....:    index=left_index,
 .....: )
 .....: 

In [124]: right_index = pd.Index(["K0", "K1", "K2", "K2"], name="key1")

In [125]: right = pd.DataFrame(
 .....:    {
 .....:        "C": ["C0", "C1", "C2", "C3"],
 .....:        "D": ["D0", "D1", "D2", "D3"],
 .....:        "key2": ["K0", "K0", "K0", "K1"],
 .....:    },
 .....:    index=right_index,
 .....: )
 .....: 

In [126]: result = left.merge(right, on=["key1", "key2"])

In [127]: result
Out[127]: 
 A   B key2   C   D
key1 
K0    A0  B0   K0  C0  D0
K1    A2  B2   K0  C1  D1
K2    A3  B3   K1  C3  D3 

../_images/merge_on_index_and_column.png

Note

DataFrame在两个参数中匹配索引级别的字符串上进行连接时,索引级别将保留为结果DataFrame中的索引级别。

Note

When `DataFrame`s are joined using only some of the levels of a `MultiIndex`, the extra levels will be dropped from the resulting join. To preserve those levels, use `DataFrame.reset_index()` on those level names to move them to columns prior to the join, as sketched below.
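
A minimal sketch of the workaround, again with hypothetical frames: move the extra level into a column, join, then restore it.

```py
import pandas as pd

# Hypothetical frames: `left` has an extra "X" level that a plain join
# on "key" would drop from the result.
idx = pd.MultiIndex.from_tuples([("K0", "X0"), ("K1", "X1")], names=["key", "X"])
left = pd.DataFrame({"A": [1, 2]}, index=idx)
right = pd.DataFrame({"B": [3, 4]}, index=pd.Index(["K0", "K1"], name="key"))

# Move "X" into a column, join on the remaining "key" index, restore "X".
result = left.reset_index("X").join(right).set_index("X", append=True)
print(result.index.names)  # ['key', 'X']
```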

### Joining multiple DataFrames

A list or tuple of `DataFrame`s can also be passed to `join()` to join them together on their indexes.

In [128]: right2 = pd.DataFrame({"v": [7, 8, 9]}, index=["K1", "K1", "K2"])

In [129]: result = left.join([right, right2]) 
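
No output is shown for the statement above; here is a self-contained sketch with small hypothetical frames illustrating what joining a list of frames produces:

```py
import pandas as pd

# Hypothetical frames, independent of the objects defined above.
left = pd.DataFrame({"A": [1, 2, 3]}, index=["K0", "K1", "K2"])
right = pd.DataFrame({"B": [4, 5]}, index=["K0", "K2"])
right2 = pd.DataFrame({"C": [6, 7]}, index=["K1", "K2"])

# Every frame in the list is aligned on the calling frame's index.
print(left.join([right, right2]))
#     A    B    C
# K0  1  4.0  NaN
# K1  2  NaN  6.0
# K2  3  5.0  7.0
```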

../_images/merging_join_multi_df.png

### DataFrame.combine_first()

`DataFrame.combine_first()` updates missing values in one `DataFrame` with the non-missing values in the corresponding locations of another `DataFrame`.

In [130]: df1 = pd.DataFrame(
 .....:    [[np.nan, 3.0, 5.0], [-4.6, np.nan, np.nan], [np.nan, 7.0, np.nan]]
 .....: )
 .....: 

In [131]: df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5.0, 1.6, 4]], index=[1, 2])

In [132]: result = df1.combine_first(df2)

In [133]: result
Out[133]: 
 0    1    2
0  NaN  3.0  5.0
1 -4.6  NaN -8.2
2 -5.0  7.0  4.0 

../_images/merging_combine_first.png

## merge_ordered()

`merge_ordered()` combines ordered data, such as numeric or time series data, with optional filling of missing data using `fill_method`.

In [134]: left = pd.DataFrame(
 .....:    {"k": ["K0", "K1", "K1", "K2"], "lv": [1, 2, 3, 4], "s": ["a", "b", "c", "d"]}
 .....: )
 .....: 

In [135]: right = pd.DataFrame({"k": ["K1", "K2", "K4"], "rv": [1, 2, 3]})

In [136]: pd.merge_ordered(left, right, fill_method="ffill", left_by="s")
Out[136]: 
 k   lv  s   rv
0   K0  1.0  a  NaN
1   K1  1.0  a  1.0
2   K2  1.0  a  2.0
3   K4  1.0  a  3.0
4   K1  2.0  b  1.0
5   K2  2.0  b  2.0
6   K4  2.0  b  3.0
7   K1  3.0  c  1.0
8   K2  3.0  c  2.0
9   K4  3.0  c  3.0
10  K1  NaN  d  1.0
11  K2  4.0  d  2.0
12  K4  4.0  d  3.0 

## merge_asof()

`merge_asof()` is similar to an ordered left join, except that matches are on the nearest key rather than equal keys. For each row in the `left` `DataFrame`, the last row in the `right` `DataFrame` is selected where the `on` key is less than the left's key. Both `DataFrame`s must be sorted by the key.

Optionally, `merge_asof()` can perform a group-wise merge by matching the `by` key in addition to the nearest match on the `on` key.

In [137]: trades = pd.DataFrame(
 .....:    {
 .....:        "time": pd.to_datetime(
 .....:            [
 .....:                "20160525 13:30:00.023",
 .....:                "20160525 13:30:00.038",
 .....:                "20160525 13:30:00.048",
 .....:                "20160525 13:30:00.048",
 .....:                "20160525 13:30:00.048",
 .....:            ]
 .....:        ),
 .....:        "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
 .....:        "price": [51.95, 51.95, 720.77, 720.92, 98.00],
 .....:        "quantity": [75, 155, 100, 100, 100],
 .....:    },
 .....:    columns=["time", "ticker", "price", "quantity"],
 .....: )
 .....: 

In [138]: quotes = pd.DataFrame(
 .....:    {
 .....:        "time": pd.to_datetime(
 .....:            [
 .....:                "20160525 13:30:00.023",
 .....:                "20160525 13:30:00.023",
 .....:                "20160525 13:30:00.030",
 .....:                "20160525 13:30:00.041",
 .....:                "20160525 13:30:00.048",
 .....:                "20160525 13:30:00.049",
 .....:                "20160525 13:30:00.072",
 .....:                "20160525 13:30:00.075",
 .....:            ]
 .....:        ),
 .....:        "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", "GOOG", "AAPL", "GOOG", "MSFT"],
 .....:        "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99, 720.50, 52.01],
 .....:        "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01, 720.88, 52.03],
 .....:    },
 .....:    columns=["time", "ticker", "bid", "ask"],
 .....: )
 .....: 

In [139]: trades
Out[139]: 
 time ticker   price  quantity
0 2016-05-25 13:30:00.023   MSFT   51.95        75
1 2016-05-25 13:30:00.038   MSFT   51.95       155
2 2016-05-25 13:30:00.048   GOOG  720.77       100
3 2016-05-25 13:30:00.048   GOOG  720.92       100
4 2016-05-25 13:30:00.048   AAPL   98.00       100

In [140]: quotes
Out[140]: 
 time ticker     bid     ask
0 2016-05-25 13:30:00.023   GOOG  720.50  720.93
1 2016-05-25 13:30:00.023   MSFT   51.95   51.96
2 2016-05-25 13:30:00.030   MSFT   51.97   51.98
3 2016-05-25 13:30:00.041   MSFT   51.99   52.00
4 2016-05-25 13:30:00.048   GOOG  720.50  720.93
5 2016-05-25 13:30:00.049   AAPL   97.99   98.01
6 2016-05-25 13:30:00.072   GOOG  720.50  720.88
7 2016-05-25 13:30:00.075   MSFT   52.01   52.03

In [141]: pd.merge_asof(trades, quotes, on="time", by="ticker")
Out[141]: 
 time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155   51.97   51.98
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN 

`merge_asof()` within `2ms` between the quote time and the trade time:

In [142]: pd.merge_asof(trades, quotes, on="time", by="ticker", tolerance=pd.Timedelta("2ms"))
Out[142]: 
 time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155     NaN     NaN
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN 

`merge_asof()` within `10ms` between the quote time and the trade time, excluding exact matches on time. Note that though we exclude the exact matches (of the quotes), prior quotes do propagate to that point in time:

In [143]: pd.merge_asof(
 .....:    trades,
 .....:    quotes,
 .....:    on="time",
 .....:    by="ticker",
 .....:    tolerance=pd.Timedelta("10ms"),
 .....:    allow_exact_matches=False,
 .....: )
 .....: 
Out[143]: 
 time ticker   price  quantity    bid    ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75    NaN    NaN
1 2016-05-25 13:30:00.038   MSFT   51.95       155  51.97  51.98
2 2016-05-25 13:30:00.048   GOOG  720.77       100    NaN    NaN
3 2016-05-25 13:30:00.048   GOOG  720.92       100    NaN    NaN
4 2016-05-25 13:30:00.048   AAPL   98.00       100    NaN    NaN 
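
`merge_asof()` also accepts a `direction` argument that is not used in the examples above: `"backward"` (the default) takes the last key at or before the left key, `"forward"` the first key at or after it, and `"nearest"` whichever is closest. A minimal sketch with hypothetical frames:

```py
import pandas as pd

# Hypothetical sorted frames.
left = pd.DataFrame({"t": [1, 5, 10], "v": ["a", "b", "c"]})
right = pd.DataFrame({"t": [2, 6, 11], "w": [10, 20, 30]})

# direction="forward" matches each left key to the first right key
# greater than or equal to it.
print(pd.merge_asof(left, right, on="t", direction="forward"))
#     t  v   w
# 0   1  a  10
# 1   5  b  20
# 2  10  c  30
```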

## compare()

The `Series.compare()` and `DataFrame.compare()` methods allow you to compare two `DataFrame` or `Series` objects, respectively, and summarize their differences.

In [144]: df = pd.DataFrame(
 .....:    {
 .....:        "col1": ["a", "a", "b", "b", "a"],
 .....:        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
 .....:        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
 .....:    },
 .....:    columns=["col1", "col2", "col3"],
 .....: )
 .....: 

In [145]: df
Out[145]: 
 col1  col2  col3
0    a   1.0   1.0
1    a   2.0   2.0
2    b   3.0   3.0
3    b   NaN   4.0
4    a   5.0   5.0

In [146]: df2 = df.copy()

In [147]: df2.loc[0, "col1"] = "c"

In [148]: df2.loc[2, "col3"] = 4.0

In [149]: df2
Out[149]: 
 col1  col2  col3
0    c   1.0   1.0
1    a   2.0   2.0
2    b   3.0   4.0
3    b   NaN   4.0
4    a   5.0   5.0

In [150]: df.compare(df2)
Out[150]: 
 col1       col3 
 self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0 

By default, if two corresponding values are equal, they will be shown as `NaN`. Furthermore, if all values in an entire row/column are equal, that row/column will be omitted from the result. The remaining differences will be aligned on columns.

Stack the differences on rows:

In [151]: df.compare(df2, align_axis=0)
Out[151]: 
 col1  col3
0 self     a   NaN
 other    c   NaN
2 self   NaN   3.0
 other  NaN   4.0 

Keep all original rows and columns with `keep_shape=True`:

In [152]: df.compare(df2, keep_shape=True)
Out[152]: 
 col1       col2       col3 
 self other self other self other
0    a     c  NaN   NaN  NaN   NaN
1  NaN   NaN  NaN   NaN  NaN   NaN
2  NaN   NaN  NaN   NaN  3.0   4.0
3  NaN   NaN  NaN   NaN  NaN   NaN
4  NaN   NaN  NaN   NaN  NaN   NaN 

Keep all the original values even if they are equal:

In [153]: df.compare(df2, keep_shape=True, keep_equal=True)
Out[153]: 
 col1       col2       col3 
 self other self other self other
0    a     c  1.0   1.0  1.0   1.0
1    a     a  2.0   2.0  2.0   2.0
2    b     b  3.0   3.0  3.0   4.0
3    b     b  NaN   NaN  4.0   4.0
4    a     a  5.0   5.0  5.0   5.0 
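
The `"self"`/`"other"` labels can be renamed via the `result_names` parameter (available since pandas 1.5); a short sketch with hypothetical frames:

```py
import pandas as pd

# Hypothetical frames differing in one value.
a = pd.DataFrame({"x": [1, 2, 3]})
b = pd.DataFrame({"x": [1, 9, 3]})

# Relabel the comparison columns from self/other to before/after.
print(a.compare(b, result_names=("before", "after")))
#         x
#    before after
# 1     2.0   9.0
```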
