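Source: pandas.pydata.org/docs/

Scaling to large datasets

Source: pandas.pydata.org/docs/user_guide/scale.html

pandas provides data structures for in-memory analytics, which makes using pandas to analyze datasets that are larger than memory somewhat tricky. Even datasets that are a sizable fraction of memory become unwieldy, as some pandas operations need to make intermediate copies.

This document provides a few recommendations for scaling your analysis to larger datasets. It complements Enhancing performance, which focuses on speeding up analysis for datasets that fit in memory.

Load less data

Suppose our raw dataset on disk has many columns.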

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: def make_timeseries(start="2000-01-01", end="2000-12-31", freq="1D", seed=None):
 ...:    index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
 ...:    n = len(index)
 ...:    state = np.random.RandomState(seed)
 ...:    columns = {
 ...:        "name": state.choice(["Alice", "Bob", "Charlie"], size=n),
 ...:        "id": state.poisson(1000, size=n),
 ...:        "x": state.rand(n) * 2 - 1,
 ...:        "y": state.rand(n) * 2 - 1,
 ...:    }
 ...:    df = pd.DataFrame(columns, index=index, columns=sorted(columns))
 ...:    if df.index[-1] == end:
 ...:        df = df.iloc[:-1]
 ...:    return df
 ...: 

In [4]: timeseries = [
 ...:    make_timeseries(freq="1min", seed=i).rename(columns=lambda x: f"{x}_{i}")
 ...:    for i in range(10)
 ...: ]
 ...: 

In [5]: ts_wide = pd.concat(timeseries, axis=1)

In [6]: ts_wide.head()
Out[6]: 
 id_0 name_0       x_0  ...   name_9       x_9       y_9
timestamp                                   ... 
2000-01-01 00:00:00   977  Alice -0.821225  ...  Charlie -0.957208 -0.757508
2000-01-01 00:01:00  1018    Bob -0.219182  ...    Alice -0.414445 -0.100298
2000-01-01 00:02:00   927  Alice  0.660908  ...  Charlie -0.325838  0.581859
2000-01-01 00:03:00   997    Bob -0.852458  ...      Bob  0.992033 -0.686692
2000-01-01 00:04:00   965    Bob  0.717283  ...  Charlie -0.924556 -0.184161

[5 rows x 40 columns]

In [7]: ts_wide.to_parquet("timeseries_wide.parquet")

To load the columns we want, we have two options. Option 1 loads all the data and then filters down to what we need.

In [8]: columns = ["id_0", "name_0", "x_0", "y_0"]

In [9]: pd.read_parquet("timeseries_wide.parquet")[columns]
Out[9]: 
 id_0 name_0       x_0       y_0
timestamp 
2000-01-01 00:00:00   977  Alice -0.821225  0.906222
2000-01-01 00:01:00  1018    Bob -0.219182  0.350855
2000-01-01 00:02:00   927  Alice  0.660908 -0.798511
2000-01-01 00:03:00   997    Bob -0.852458  0.735260
2000-01-01 00:04:00   965    Bob  0.717283  0.393391
...                   ...    ...       ...       ...
2000-12-30 23:56:00  1037    Bob -0.814321  0.612836
2000-12-30 23:57:00   980    Bob  0.232195 -0.618828
2000-12-30 23:58:00   965  Alice -0.231131  0.026310
2000-12-30 23:59:00   984  Alice  0.942819  0.853128
2000-12-31 00:00:00  1003  Alice  0.201125 -0.136655

[525601 rows x 4 columns]

Option 2 loads only the columns we request.

In [10]: pd.read_parquet("timeseries_wide.parquet", columns=columns)
Out[10]: 
 id_0 name_0       x_0       y_0
timestamp 
2000-01-01 00:00:00   977  Alice -0.821225  0.906222
2000-01-01 00:01:00  1018    Bob -0.219182  0.350855
2000-01-01 00:02:00   927  Alice  0.660908 -0.798511
2000-01-01 00:03:00   997    Bob -0.852458  0.735260
2000-01-01 00:04:00   965    Bob  0.717283  0.393391
...                   ...    ...       ...       ...
2000-12-30 23:56:00  1037    Bob -0.814321  0.612836
2000-12-30 23:57:00   980    Bob  0.232195 -0.618828
2000-12-30 23:58:00   965  Alice -0.231131  0.026310
2000-12-30 23:59:00   984  Alice  0.942819  0.853128
2000-12-31 00:00:00  1003  Alice  0.201125 -0.136655

[525601 rows x 4 columns]

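If we were to measure the memory usage of the two calls, we would see that specifying columns uses about 1/10 of the memory in this case.

With pandas.read_csv(), you can specify usecols to limit the columns read into memory. Not all file formats that can be read by pandas provide an option to read a subset of columns.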
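As a minimal sketch of the usecols option, the snippet below reads a small in-memory CSV and keeps two of its four columns. The CSV text and its column names are made up for illustration; they are not part of the dataset used above.

```python
import io

import pandas as pd

# Hypothetical CSV content with four columns; only two are needed.
csv_text = "id,name,x,y\n1,Alice,0.1,0.2\n2,Bob,0.3,0.4\n"

# usecols restricts parsing to the listed columns, so the others
# are never materialized in the resulting DataFrame.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["id", "x"])

print(subset.columns.tolist())
print(subset.shape)
```

The same call works with a file path in place of the StringIO object.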

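Use efficient datatypes

The default pandas data types are not the most memory efficient. This is especially true for text data columns with relatively few unique values (commonly referred to as "low-cardinality" data). By using more efficient data types, you can store larger datasets in memory.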

In [11]: ts = make_timeseries(freq="30s", seed=0)

In [12]: ts.to_parquet("timeseries.parquet")

In [13]: ts = pd.read_parquet("timeseries.parquet")

In [14]: ts
Out[14]: 
 id     name         x         y
timestamp 
2000-01-01 00:00:00  1041    Alice  0.889987  0.281011
2000-01-01 00:00:30   988      Bob -0.455299  0.488153
2000-01-01 00:01:00  1018    Alice  0.096061  0.580473
2000-01-01 00:01:30   992      Bob  0.142482  0.041665
2000-01-01 00:02:00   960      Bob -0.036235  0.802159
...                   ...      ...       ...       ...
2000-12-30 23:58:00  1022    Alice  0.266191  0.875579
2000-12-30 23:58:30   974    Alice -0.009826  0.413686
2000-12-30 23:59:00  1028  Charlie  0.307108 -0.656789
2000-12-30 23:59:30  1002    Alice  0.202602  0.541335
2000-12-31 00:00:00   987    Alice  0.200832  0.615972

[1051201 rows x 4 columns]

Now, let's inspect the data types and memory usage to see where we should focus our attention.

In [15]: ts.dtypes
Out[15]: 
id        int64
name     object
x       float64
y       float64
dtype: object

In [16]: ts.memory_usage(deep=True)  # memory usage in bytes
Out[16]: 
Index     8409608
id        8409608
name     65176434
x         8409608
y         8409608
dtype: int64

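The name column takes up much more memory than any other column. It has only a few unique values, so it is a good candidate for conversion to a pandas.Categorical. With a pandas.Categorical, we store each unique name once and use space-efficient integers to record which name is used in each row.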

In [17]: ts2 = ts.copy()

In [18]: ts2["name"] = ts2["name"].astype("category")

In [19]: ts2.memory_usage(deep=True)
Out[19]: 
Index    8409608
id       8409608
name     1051495
x        8409608
y        8409608
dtype: int64
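To make the "store each unique name once" idea concrete, here is a small sketch on a made-up five-element series (not the ts data above). It shows the two pieces a categorical keeps: the list of categories and the integer code for each row.

```python
import pandas as pd

# A tiny, hypothetical low-cardinality column.
names = pd.Series(["Alice", "Bob", "Alice", "Charlie", "Bob"])
cat = names.astype("category")

# Each unique value is stored once ...
categories = cat.cat.categories.tolist()
# ... and every row holds only a small integer pointing at it.
codes = cat.cat.codes.tolist()

print(categories)
print(codes)
print(cat.cat.codes.dtype)
```

With only three categories the codes fit in an 8-bit integer, which is where the memory saving comes from.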

We can go a bit further and downcast the numeric columns to their smallest types using pandas.to_numeric().

In [20]: ts2["id"] = pd.to_numeric(ts2["id"], downcast="unsigned")

In [21]: ts2[["x", "y"]] = ts2[["x", "y"]].apply(pd.to_numeric, downcast="float")

In [22]: ts2.dtypes
Out[22]: 
id        uint16
name    category
x        float32
y        float32
dtype: object

In [23]: ts2.memory_usage(deep=True)
Out[23]: 
Index    8409608
id       2102402
name     1051495
x        4204804
y        4204804
dtype: int64

In [24]: reduction = ts2.memory_usage(deep=True).sum() / ts.memory_usage(deep=True).sum()

In [25]: print(f"{reduction:0.2f}")
0.20

In all, we have reduced the in-memory footprint of this dataset to 1/5 of its original size.
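One trade-off worth keeping in mind is that float32 carries roughly 7 significant decimal digits, so downcasting floats saves memory at the cost of precision. The sketch below uses made-up values to show the resulting dtype and that the stored value stays close to, but is not guaranteed to equal, the original.

```python
import numpy as np
import pandas as pd

# Hypothetical float64 values with more digits than float32 can hold.
values = pd.Series([0.123456789012, 1000.5])

# downcast="float" picks the smallest float dtype pandas considers
# close enough, which is float32 at the smallest.
small = pd.to_numeric(values, downcast="float")

# Compare as Python floats to avoid implicit float32 promotion.
error = abs(float(small.iloc[0]) - float(values.iloc[0]))

print(small.dtype)
print(error < 1e-7)
```

Whether that rounding matters depends on the analysis; for aggregations over many rows it can accumulate.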

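See Categorical data for more on pandas.Categorical, and dtypes for an overview of all of pandas' dtypes.

Use chunking

Some workloads can be achieved with chunking by splitting a large problem into a bunch of small problems. For example, converting an individual CSV file into a Parquet file and repeating that for each file in a directory. As long as each chunk fits in memory, you can work with datasets that are much larger than memory.

Note

Chunking works well when the operation you are performing requires zero or minimal coordination between chunks. For more complicated workflows, you are better off using other libraries.

Suppose we have an even larger "logical dataset" on disk that is a directory of parquet files. Each file in the directory represents a different year of the entire dataset.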

In [26]: import pathlib

In [27]: N = 12

In [28]: starts = [f"20{i:>02d}-01-01" for i in range(N)]

In [29]: ends = [f"20{i:>02d}-12-13" for i in range(N)]

In [30]: pathlib.Path("data/timeseries").mkdir(parents=True, exist_ok=True)

In [31]: for i, (start, end) in enumerate(zip(starts, ends)):
 ....:    ts = make_timeseries(start=start, end=end, freq="1min", seed=i)
 ....:    ts.to_parquet(f"data/timeseries/ts-{i:0>2d}.parquet")
 ....:

data
└── timeseries
    ├── ts-00.parquet
    ├── ts-01.parquet
    ├── ts-02.parquet
    ├── ts-03.parquet
    ├── ts-04.parquet
    ├── ts-05.parquet
    ├── ts-06.parquet
    ├── ts-07.parquet
    ├── ts-08.parquet
    ├── ts-09.parquet
    ├── ts-10.parquet
    └── ts-11.parquet

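Now we will implement an out-of-core pandas.Series.value_counts(). The peak memory usage of this workflow is the single largest chunk, plus a small series storing the unique value counts up to that point. As long as each individual file fits in memory, this will work for arbitrarily large datasets.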

In [32]: %%time
 ....: files = pathlib.Path("data/timeseries/").glob("ts*.parquet")
 ....: counts = pd.Series(dtype=int)
 ....: for path in files:
 ....:    df = pd.read_parquet(path)
 ....:    counts = counts.add(df["name"].value_counts(), fill_value=0)
 ....: counts.astype(int)
 ....: 
CPU times: user 760 ms, sys: 26.1 ms, total: 786 ms
Wall time: 559 ms
Out[32]: 
name
Alice      1994645
Bob        1993692
Charlie    1994875
dtype: int64

Some readers, like pandas.read_csv(), offer parameters to control the chunksize when reading a single file.
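A minimal sketch of the chunksize parameter follows, assuming a tiny in-memory CSV invented for illustration. With chunksize set, pandas.read_csv() returns an iterator of DataFrames, and the same accumulate-as-you-go pattern as above applies.

```python
import io

import pandas as pd

# Hypothetical data: six rows, read two rows at a time.
names = ["Alice", "Bob", "Alice", "Charlie", "Bob", "Alice"]
csv_text = "name,x\n" + "\n".join(f"{name},{i}" for i, name in enumerate(names))

counts = None
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    chunk_counts = chunk["name"].value_counts()
    # fill_value=0 treats names missing from one side as a count of zero.
    counts = chunk_counts if counts is None else counts.add(chunk_counts, fill_value=0)

# Alignment with fill_value produces floats, so cast back to integers.
counts = counts.astype("int64").sort_index()
print(counts.to_dict())  # {'Alice': 3, 'Bob': 2, 'Charlie': 1}
```

Only one two-row chunk and the running counts are held in memory at any time.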

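Manual chunking is a reasonable option for workflows that do not require very sophisticated operations. Some operations, like pandas.DataFrame.groupby(), are much harder to do chunkwise. In these cases, you may be better off switching to a different library that implements these out-of-core algorithms for you.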
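To illustrate why groupby needs extra coordination, consider a per-group mean: averaging the per-chunk means would be wrong when group sizes differ, so each chunk has to contribute a sum and a count that are combined at the end. The sketch below uses two made-up chunks and is one possible approach, not a general out-of-core groupby.

```python
import pandas as pd

# Two hypothetical chunks of the same logical dataset.
chunks = [
    pd.DataFrame({"name": ["Alice", "Bob"], "x": [1.0, 2.0]}),
    pd.DataFrame({"name": ["Alice", "Charlie"], "x": [3.0, 4.0]}),
]

totals = None
for chunk in chunks:
    # Per-chunk partial aggregates that can be safely added together.
    part = chunk.groupby("name")["x"].agg(["sum", "count"])
    totals = part if totals is None else totals.add(part, fill_value=0)

# The global mean is derived only after all chunks are combined.
means = totals["sum"] / totals["count"]
print(means.to_dict())  # {'Alice': 2.0, 'Bob': 2.0, 'Charlie': 4.0}
```

Aggregations that do not decompose this way (a median, for example) are where a dedicated library becomes the better choice.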

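Use other libraries

There are other libraries that provide APIs similar to pandas and work nicely with the pandas DataFrame. They can scale the processing and analysis of large datasets through parallel runtimes, distributed memory, clustering, and so on. You can find more information on the ecosystem page.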

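Sparse data structures

Source: pandas.pydata.org/docs/user_guide/sparse.html

pandas provides data structures for efficiently storing sparse data. These are not necessarily sparse in the typical "mostly 0" sense. Rather, you can view these objects as being "compressed", where any data matching a specific value (NaN / missing value, though any value can be chosen, including 0) is omitted. The compressed values are not actually stored in the array.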

In [1]: arr = np.random.randn(10)

In [2]: arr[2:-2] = np.nan

In [3]: ts = pd.Series(pd.arrays.SparseArray(arr))

In [4]: ts
Out[4]: 
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
dtype: Sparse[float64, nan]

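Notice the dtype, Sparse[float64, nan]. The nan means that elements in the array that are nan are not actually stored; only the non-nan elements are. Those non-nan elements have a float64 dtype.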
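The fill value does not have to be NaN. As a small sketch with made-up integer data, the example below compresses away zeros instead and inspects what is actually stored.

```python
import pandas as pd

# Integer data where 0 is the value being "compressed away".
arr = pd.arrays.SparseArray([0, 0, 1, 0, 2], fill_value=0)

print(arr.fill_value)  # the omitted value
print(arr.sp_values)   # only the values that differ from fill_value
print(arr.npoints)     # number of stored values
print(arr.density)     # stored values divided by total length
```

Here two of the five elements are stored, so the density is 0.4.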

The sparse objects exist for memory efficiency reasons. Suppose you had a large, mostly NA DataFrame:

In [5]: df = pd.DataFrame(np.random.randn(10000, 4))

In [6]: df.iloc[:9998] = np.nan

In [7]: sdf = df.astype(pd.SparseDtype("float", np.nan))

In [8]: sdf.head()
Out[8]: 
 0    1    2    3
0  NaN  NaN  NaN  NaN
1  NaN  NaN  NaN  NaN
2  NaN  NaN  NaN  NaN
3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN

In [9]: sdf.dtypes
Out[9]: 
0    Sparse[float64, nan]
1    Sparse[float64, nan]
2    Sparse[float64, nan]
3    Sparse[float64, nan]
dtype: object

In [10]: sdf.sparse.density
Out[10]: 0.0002

As you can see, the density (the fraction of values that have not been "compressed") is extremely low. This sparse object takes up much less memory on disk (pickled) and in the Python interpreter.

In [11]: 'dense : {:0.2f} kB'.format(df.memory_usage().sum() / 1e3)
Out[11]: 'dense : 320.13 kB'

In [12]: 'sparse: {:0.2f} kB'.format(sdf.memory_usage().sum() / 1e3)
Out[12]: 'sparse: 0.22 kB'

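Functionally, their behavior should be nearly identical to their dense counterparts.

SparseArray

arrays.SparseArray is an ExtensionArray for storing an array of sparse values (see dtypes for more on extension arrays). It is a 1-dimensional ndarray-like object storing only values distinct from the fill_value: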

In [13]: arr = np.random.randn(10)

In [14]: arr[2:5] = np.nan

In [15]: arr[7:8] = np.nan

In [16]: sparr = pd.arrays.SparseArray(arr)

In [17]: sparr
Out[17]: 
[-1.9556635297215477, -1.6588664275960427, nan, nan, nan, 1.1589328886422277, 0.14529711373305043, nan, 0.6060271905134522, 1.3342113401317768]
Fill: nan
IntIndex
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)

A sparse array can be converted to a regular (dense) ndarray with numpy.asarray():

In [18]: np.asarray(sparr)
Out[18]: 
array([-1.9557, -1.6589,     nan,     nan,     nan,  1.1589,  0.1453,
 nan,  0.606 ,  1.3342]) 

Sparse dtype

The SparseArray.dtype property stores two pieces of information:

1. The dtype of the non-sparse values
2. The scalar fill value

In [19]: sparr.dtype
Out[19]: Sparse[float64, nan]

A SparseDtype may be constructed by passing only a dtype:

In [20]: pd.SparseDtype(np.dtype('datetime64[ns]'))
Out[20]: Sparse[datetime64[ns], numpy.datetime64('NaT')]

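In this case a default fill value is used (for NumPy dtypes this is often the "missing" value for that dtype). To override this default, an explicit fill value may be passed instead: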

In [21]: pd.SparseDtype(np.dtype('datetime64[ns]'),
 ....:               fill_value=pd.Timestamp('2017-01-01'))
 ....: 
Out[21]: Sparse[datetime64[ns], Timestamp('2017-01-01 00:00:00')]
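As a quick check of the default fill values, the sketch below constructs a SparseDtype for a few common types without passing fill_value. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# No fill_value given, so pandas picks a default per dtype.
int_dtype = pd.SparseDtype(int)
float_dtype = pd.SparseDtype(float)
bool_dtype = pd.SparseDtype(bool)

print(int_dtype.fill_value)    # 0 for integers
print(float_dtype.fill_value)  # nan for floats
print(bool_dtype.fill_value)   # False for booleans
```

Integers and booleans have no missing-value marker of their own, which is why their defaults are 0 and False rather than NaN.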

Finally, the string alias 'Sparse[dtype]' may be used to specify a sparse dtype in many places:

In [22]: pd.array([1, 0, 0, 2], dtype='Sparse[int]')
Out[22]: 
[1, 0, 0, 2]
Fill: 0
IntIndex
Indices: array([0, 3], dtype=int32) 
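
Sparse accessor

pandas provides a .sparse accessor, similar to .str for string data, .cat for categorical data, and .dt for datetime-like data. This namespace provides attributes and methods that are specific to sparse data.
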
In [23]: s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")

In [24]: s.sparse.density
Out[24]: 0.5

In [25]: s.sparse.fill_value
Out[25]: 0

This accessor is available only on data with SparseDtype, and on the Series class itself for creating a Series with sparse data from a scipy COO matrix.
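The restriction to SparseDtype can be seen directly: accessing .sparse on a dense Series raises AttributeError. The sketch below, on a made-up series, shows that failure and then the accessor working after conversion.

```python
import pandas as pd

dense = pd.Series([0, 0, 1, 2])

# The accessor validates the dtype and refuses dense data.
try:
    dense.sparse
    accessor_available = True
except AttributeError:
    accessor_available = False

# After converting to a sparse dtype, the accessor is usable.
sparse = dense.astype("Sparse[int]")
stored = sparse.sparse.npoints       # values that differ from fill_value
back = sparse.sparse.to_dense()      # dense Series again

print(accessor_available)
print(stored)
print(back.tolist())
```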

A .sparse accessor has been added for DataFrame as well. See Sparse accessor for more.

Sparse calculation

You can apply NumPy ufuncs to arrays.SparseArray and get an arrays.SparseArray as a result.

In [26]: arr = pd.arrays.SparseArray([1., np.nan, np.nan, -2., np.nan])

In [27]: np.abs(arr)
Out[27]: 
[1.0, nan, nan, 2.0, nan]
Fill: nan
IntIndex
Indices: array([0, 3], dtype=int32)

The ufunc is also applied to fill_value. This is needed to get the correct dense result.

In [28]: arr = pd.arrays.SparseArray([1., -1, -1, -2., -1], fill_value=-1)

In [29]: np.abs(arr)
Out[29]: 
[1, 1, 1, 2.0, 1]
Fill: 1
IntIndex
Indices: array([3], dtype=int32)

In [30]: np.abs(arr).to_dense()
Out[30]: array([1., 1., 1., 2., 1.])

Conversion

To convert data from sparse to dense, use the .sparse accessor:

In [31]: sdf.sparse.to_dense()
Out[31]: 
 0         1         2         3
0          NaN       NaN       NaN       NaN
1          NaN       NaN       NaN       NaN
2          NaN       NaN       NaN       NaN
3          NaN       NaN       NaN       NaN
4          NaN       NaN       NaN       NaN
...        ...       ...       ...       ...
9995       NaN       NaN       NaN       NaN
9996       NaN       NaN       NaN       NaN
9997       NaN       NaN       NaN       NaN
9998  0.509184 -0.774928 -1.369894 -0.382141
9999  0.280249 -1.648493  1.490865 -0.890819

[10000 rows x 4 columns]

To go from dense to sparse, use DataFrame.astype() with a SparseDtype.

In [32]: dense = pd.DataFrame({"A": [1, 0, 0, 1]})

In [33]: dtype = pd.SparseDtype(int, fill_value=0)

In [34]: dense.astype(dtype)
Out[34]: 
 A
0  1
1  0
2  0
3  1 

Interaction with scipy.sparse

Use DataFrame.sparse.from_spmatrix() to create a DataFrame with sparse values from a sparse matrix.

In [35]: from scipy.sparse import csr_matrix

In [36]: arr = np.random.random(size=(1000, 5))

In [37]: arr[arr < .9] = 0

In [38]: sp_arr = csr_matrix(arr)

In [39]: sp_arr
Out[39]: 
<1000x5 sparse matrix of type '<class 'numpy.float64'>'
 with 517 stored elements in Compressed Sparse Row format>

In [40]: sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)

In [41]: sdf.head()
Out[41]: 
 0  1  2         3  4
0   0.95638  0  0         0  0
1         0  0  0         0  0
2         0  0  0         0  0
3         0  0  0         0  0
4  0.999552  0  0  0.956153  0

In [42]: sdf.dtypes
Out[42]: 
0    Sparse[float64, 0]
1    Sparse[float64, 0]
2    Sparse[float64, 0]
3    Sparse[float64, 0]
4    Sparse[float64, 0]
dtype: object

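All sparse formats are supported, but matrices that are not in COOrdinate format will be converted, copying data as needed. To convert back to a sparse SciPy matrix in COO format, you can use the DataFrame.sparse.to_coo() method: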

In [43]: sdf.sparse.to_coo()
Out[43]: 
<1000x5 sparse matrix of type '<class 'numpy.float64'>'
 with 517 stored elements in COOrdinate format>

Series.sparse.to_coo() is implemented for transforming a Series with sparse values indexed by a MultiIndex to a scipy.sparse.coo_matrix.

The method requires a MultiIndex with two or more levels.

In [44]: s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])

In [45]: s.index = pd.MultiIndex.from_tuples(
 ....:    [
 ....:        (1, 2, "a", 0),
 ....:        (1, 2, "a", 1),
 ....:        (1, 1, "b", 0),
 ....:        (1, 1, "b", 1),
 ....:        (2, 1, "b", 0),
 ....:        (2, 1, "b", 1),
 ....:    ],
 ....:    names=["A", "B", "C", "D"],
 ....: )
 ....: 

In [46]: ss = s.astype('Sparse')

In [47]: ss
Out[47]: 
A  B  C  D
1  2  a  0    3.0
 1    NaN
 1  b  0    1.0
 1    3.0
2  1  b  0    NaN
 1    NaN
dtype: Sparse[float64, nan]

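In the example below, we transform the Series to a sparse representation of a 2-d array by specifying that the first and second MultiIndex levels define labels for the rows and the third and fourth levels define labels for the columns. We also specify that the column and row labels should be sorted in the final sparse representation.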

In [48]: A, rows, columns = ss.sparse.to_coo(
 ....:    row_levels=["A", "B"], column_levels=["C", "D"], sort_labels=True
 ....: )
 ....: 

In [49]: A
Out[49]: 
<3x4 sparse matrix of type '<class 'numpy.float64'>'
 with 3 stored elements in COOrdinate format>

In [50]: A.todense()
Out[50]: 
matrix([[0., 0., 1., 3.],
 [3., 0., 0., 0.],
 [0., 0., 0., 0.]])

In [51]: rows
Out[51]: [(1, 1), (1, 2), (2, 1)]

In [52]: columns
Out[52]: [('a', 0), ('a', 1), ('b', 0), ('b', 1)]

Specifying different row and column labels (and not sorting them) yields a different sparse matrix:

In [53]: A, rows, columns = ss.sparse.to_coo(
 ....:    row_levels=["A", "B", "C"], column_levels=["D"], sort_labels=False
 ....: )
 ....: 

In [54]: A
Out[54]: 
<3x2 sparse matrix of type '<class 'numpy.float64'>'
 with 3 stored elements in COOrdinate format>

In [55]: A.todense()
Out[55]: 
matrix([[3., 0.],
 [1., 3.],
 [0., 0.]])

In [56]: rows
Out[56]: [(1, 2, 'a'), (1, 1, 'b'), (2, 1, 'b')]

In [57]: columns
Out[57]: [(0,), (1,)]

A convenience method, Series.sparse.from_coo(), is implemented for creating a Series with sparse values from a scipy.sparse.coo_matrix.

In [58]: from scipy import sparse

In [59]: A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4))

In [60]: A
Out[60]: 
<3x4 sparse matrix of type '<class 'numpy.float64'>'
 with 3 stored elements in COOrdinate format>

In [61]: A.todense()
Out[61]: 
matrix([[0., 0., 1., 2.],
 [3., 0., 0., 0.],
 [0., 0., 0., 0.]])

The default behaviour (with dense_index=False) simply returns a Series containing only the non-null entries.

In [62]: ss = pd.Series.sparse.from_coo(A)

In [63]: ss
Out[63]: 
0  2    1.0
 3    2.0
1  0    3.0
dtype: Sparse[float64, nan]

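Specifying dense_index=True will result in an index that is the Cartesian product of the row and column coordinates of the matrix. Note that this will consume a significant amount of memory (relative to dense_index=False) if the sparse matrix is large (and sparse) enough.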

In [64]: ss_dense = pd.Series.sparse.from_coo(A, dense_index=True)

In [65]: ss_dense
Out[65]: 
1  0    3.0
 2    NaN
 3    NaN
0  0    NaN
 2    1.0
 3    2.0
 0    NaN
 2    1.0
 3    2.0
dtype: Sparse[float64, nan] 

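Frequently Asked Questions (FAQ)

Source: pandas.pydata.org/docs/user_guide/gotchas.html

DataFrame memory usage

The memory usage of a DataFrame (including the index) is shown when calling info(). A configuration option, display.memory_usage (see the list of options), specifies whether the DataFrame memory usage will be displayed when invoking the info() method.

For example, the memory usage of the DataFrame below is shown when calling info():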

In [1]: dtypes = [
 ...:    "int64",
 ...:    "float64",
 ...:    "datetime64[ns]",
 ...:    "timedelta64[ns]",
 ...:    "complex128",
 ...:    "object",
 ...:    "bool",
 ...: ]
 ...: 

In [2]: n = 5000

In [3]: data = {t: np.random.randint(100, size=n).astype(t) for t in dtypes}

In [4]: df = pd.DataFrame(data)

In [5]: df["categorical"] = df["object"].astype("category")

In [6]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   int64            5000 non-null   int64 
 1   float64          5000 non-null   float64 
 2   datetime64[ns]   5000 non-null   datetime64[ns] 
 3   timedelta64[ns]  5000 non-null   timedelta64[ns]
 4   complex128       5000 non-null   complex128 
 5   object           5000 non-null   object 
 6   bool             5000 non-null   bool 
 7   categorical      5000 non-null   category 
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
memory usage: 288.2+ KB

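The + symbol indicates that the true memory usage could be higher, because pandas does not count the memory used by values in columns with dtype=object.

Passing memory_usage='deep' will enable a more accurate memory usage report, accounting for the full usage of the contained objects. This is optional because this deeper introspection can be expensive.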

In [7]: df.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   int64            5000 non-null   int64 
 1   float64          5000 non-null   float64 
 2   datetime64[ns]   5000 non-null   datetime64[ns] 
 3   timedelta64[ns]  5000 non-null   timedelta64[ns]
 4   complex128       5000 non-null   complex128 
 5   object           5000 non-null   object 
 6   bool             5000 non-null   bool 
 7   categorical      5000 non-null   category 
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
memory usage: 424.7 KB

By default the display option is set to True, but it can be explicitly overridden by passing the memory_usage argument when calling info().
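As a minimal sketch of how the display option and the explicit argument interact, the summary can be written to a buffer and inspected. The small frame and the variable names here are illustrative assumptions, not part of the original example.

```python
import io

import pandas as pd

# Small illustrative frame; any DataFrame would do.
frame = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Temporarily switch the display option off: info() omits the memory line.
with pd.option_context("display.memory_usage", False):
    quiet = io.StringIO()
    frame.info(buf=quiet)

    # An explicit memory_usage argument overrides the option for this call.
    forced = io.StringIO()
    frame.info(buf=forced, memory_usage="deep")

print("memory usage" in quiet.getvalue())
print("memory usage" in forced.getvalue())
```

Using option_context keeps the change local, so the global setting is restored when the block exits.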

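The memory usage of each column can be found by calling the memory_usage() method. This returns a Series whose index holds the column names and whose values are the memory usage of each column in bytes. For the DataFrame above, the memory usage of each column and the total memory usage can be found with the memory_usage() method: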

In [8]: df.memory_usage()
Out[8]: 
Index                128
int64              40000
float64            40000
datetime64[ns]     40000
timedelta64[ns]    40000
complex128         80000
object             40000
bool                5000
categorical         9968
dtype: int64

# total memory usage of dataframe
In [9]: df.memory_usage().sum()
Out[9]: 295096

By default the memory usage of the DataFrame index is included in the returned Series. It can be suppressed by passing the index=False argument:

In [10]: df.memory_usage(index=False)
Out[10]: 
int64              40000
float64            40000
datetime64[ns]     40000
timedelta64[ns]    40000
complex128         80000
object             40000
bool                5000
categorical         9968
dtype: int64

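The memory usage displayed by the info() method uses the memory_usage() method to determine the memory usage of a DataFrame, while also formatting the output in human-readable units (base-2 representation; i.e. 1KB = 1024 bytes).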
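The relationship between the shallow and deep figures can be reproduced per column with memory_usage(deep=True). The following is a sketch on a smaller, hypothetical two-column frame rather than the eight-column frame above:

```python
import numpy as np
import pandas as pd

n = 5000
frame = pd.DataFrame(
    {
        "int64": np.arange(n, dtype="int64"),
        "object": np.arange(n).astype(object),
    }
)

shallow = frame.memory_usage()        # object column: pointers only
deep = frame.memory_usage(deep=True)  # also counts the Python objects

# info() reports binary kilobytes: 1 KB = 1024 bytes.
shallow_kb = shallow.sum() / 1024
deep_kb = deep.sum() / 1024
print(f"shallow: {shallow_kb:.1f} KB, deep: {deep_kb:.1f} KB")
```

Only the object column changes between the two reports; fixed-width dtypes such as int64 are already fully accounted for in the shallow figure.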

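See also Categorical memory usage.

Using if/truth statements with pandas

pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if statement or when using the boolean operations and, or, and not. It is not clear what the result of the following code should be: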

>>> if pd.Series([False, True, False]):
...     pass

Should it be True because it is not zero-length, or False because there are False values? It is unclear, so pandas raises a ValueError instead:

In [11]: if pd.Series([False, True, False]):
 ....:    print("I was true")
 ....: 
---------------------------------------------------------------------------
ValueError  Traceback (most recent call last)
<ipython-input-11-5c782b38cd2f> in ?()
----> 1 if pd.Series([False, True, False]):
  2     print("I was true")

~/work/pandas/pandas/pandas/core/generic.py in ?(self)
  1575     @final
  1576     def __nonzero__(self) -> NoReturn:
-> 1577         raise ValueError(
  1578             f"The truth value of a {type(self).__name__} is ambiguous. "
  1579             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
  1580         )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

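You need to state explicitly what you want to do with the pandas object, for example by using any(), all(), or the empty property. Alternatively, you might want to check whether the pandas object is None: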

In [12]: if pd.Series([False, True, False]) is not None:
 ....:    print("I was not None")
 ....: 
I was not None

Below is how to check whether any of the values are True:

In [13]: if pd.Series([False, True, False]).any():
 ....:    print("I am any")
 ....: 
I am any
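The explicit alternatives named in the error message can be compared side by side. This is a small sketch; the variable names are illustrative only.

```python
import pandas as pd

flags = pd.Series([False, True, False])

try:
    bool(flags)            # the same conversion an `if` statement performs
except ValueError as exc:
    message = str(exc)

has_any = flags.any()      # True: at least one element is True
all_true = flags.all()     # False: not every element is True
is_empty = flags.empty     # False: the Series holds three elements

print(message)
```

Note that empty is a property, so it is read without parentheses.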

Bitwise boolean

Operators such as == and != work element-wise: comparing a Series with a scalar returns a boolean Series.

In [14]: s = pd.Series(range(5))

In [15]: s == 4
Out[15]: 
0    False
1    False
2    False
3    False
4     True
dtype: bool

See boolean comparisons for more examples.
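Because and, or, and not trigger the ambiguous truth-value error described earlier, element-wise conditions are combined with the operators &, |, and ~ instead. A short sketch, reusing the Series from the example above:

```python
import pandas as pd

s = pd.Series(range(5))

# Parentheses are required: & and | bind more tightly than == and <.
either = (s == 4) | (s < 1)
both = (s > 1) & (s != 3)
negated = ~(s == 4)

print(s[either].tolist())   # [0, 4]
print(s[both].tolist())     # [2, 4]
print(s[negated].tolist())  # [0, 1, 2, 3]
```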

Using the `in` operator

Using the Python in operator on a Series tests for membership in the index, not membership among the values.

In [16]: s = pd.Series(range(5), index=list("abcde"))

In [17]: 2 in s
Out[17]: False

In [18]: 'b' in s
Out[18]: True

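If this behavior is surprising, keep in mind that using in on a Python dictionary tests keys, not values, and a Series is dict-like. To test for membership in the values, use the method isin():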

In [19]: s.isin([2])
Out[19]: 
a    False
b    False
c     True
d    False
e    False
dtype: bool

In [20]: s.isin([2]).any()
Out[20]: True
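The same distinction between labels and values holds for a DataFrame, where in checks the column names. A brief sketch with a hypothetical two-column frame:

```python
import pandas as pd

s = pd.Series(range(5), index=list("abcde"))
frame = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

label_in_series = "b" in s            # True: "b" is an index label
value_in_series = 2 in s.to_numpy()   # True: checks the values instead

column_in_frame = "a" in frame        # True: "a" is a column name
number_in_frame = 5 in frame          # False: 5 is not a column name
value_in_frame = frame.isin([5]).any().any()  # True: 5 occurs in column "b"
```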

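For a DataFrame, likewise, in applies to the column axis, testing for membership in the list of column names.

Mutating with User Defined Function (UDF) methods

This section applies to pandas methods that take a UDF. In particular, the methods DataFrame.apply(), DataFrame.aggregate(), DataFrame.transform(), and DataFrame.filter().

It is a general rule in programming that one should not mutate a container while it is being iterated over. Mutation will invalidate the iterator, causing unexpected behavior. Consider the example: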

In [21]: values = [0, 1, 2, 3, 4, 5]

In [22]: n_removed = 0

In [23]: for k, value in enumerate(values):
 ....:    idx = k - n_removed
 ....:    if value % 2 == 1:
 ....:        del values[idx]
 ....:        n_removed += 1
 ....:    else:
 ....:        values[idx] = value + 1
 ....: 

In [24]: values
Out[24]: [1, 4, 5]

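One would probably have expected the result to be [1, 3, 5]. When using a pandas method that takes a UDF, pandas is often iterating internally over the DataFrame or other pandas object. Therefore, if the UDF mutates (changes) the DataFrame, unexpected behavior can arise.

Here is a similar example with DataFrame.apply():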

In [25]: def f(s):
 ....:    s.pop("a")
 ....:    return s
 ....: 

In [26]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

In [27]: df.apply(f, axis="columns")
---------------------------------------------------------------------------
KeyError  Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
  3804 try:
-> 3805     return self._engine.get_loc(casted_key)
  3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'a'

The above exception was the direct cause of the following exception:

KeyError  Traceback (most recent call last)
Cell In[27], line 1
----> 1 df.apply(f, axis="columns")

File ~/work/pandas/pandas/pandas/core/frame.py:10374, in DataFrame.apply(self, func, axis, raw, result_type, args, by_row, engine, engine_kwargs, **kwargs)
  10360 from pandas.core.apply import frame_apply
  10362 op = frame_apply(
  10363     self,
  10364     func=func,
   (...)
  10372     kwargs=kwargs,
  10373 )
> 10374 return op.apply().__finalize__(self, method="apply")

File ~/work/pandas/pandas/pandas/core/apply.py:916, in FrameApply.apply(self)
  913 elif self.raw:
  914     return self.apply_raw(engine=self.engine, engine_kwargs=self.engine_kwargs)
--> 916 return self.apply_standard()

File ~/work/pandas/pandas/pandas/core/apply.py:1063, in FrameApply.apply_standard(self)
  1061 def apply_standard(self):
  1062     if self.engine == "python":
-> 1063         results, res_index = self.apply_series_generator()
  1064     else:
  1065         results, res_index = self.apply_series_numba()

File ~/work/pandas/pandas/pandas/core/apply.py:1081, in FrameApply.apply_series_generator(self)
  1078 with option_context("mode.chained_assignment", None):
  1079     for i, v in enumerate(series_gen):
  1080         # ignore SettingWithCopy here in case the user mutates
-> 1081         results[i] = self.func(v, *self.args, **self.kwargs)
  1082         if isinstance(results[i], ABCSeries):
  1083             # If we have a view on v, we need to make a copy because
  1084             #  series_generator will swap out the underlying data
  1085             results[i] = results[i].copy(deep=False)

Cell In[25], line 2, in f(s)
  1 def f(s):
----> 2     s.pop("a")
  3     return s

File ~/work/pandas/pandas/pandas/core/series.py:5391, in Series.pop(self, item)
  5366 def pop(self, item: Hashable) -> Any:
  5367  """
  5368 Return item and drops from series. Raise KeyError if not found.
  5369  
 (...)
  5389 dtype: int64
  5390 """
-> 5391     return super().pop(item=item)

File ~/work/pandas/pandas/pandas/core/generic.py:947, in NDFrame.pop(self, item)
  946 def pop(self, item: Hashable) -> Series | Any:
--> 947     result = self[item]
  948     del self[item]
  950     return result

File ~/work/pandas/pandas/pandas/core/series.py:1121, in Series.__getitem__(self, key)
  1118     return self._values[key]
  1120 elif key_is_scalar:
-> 1121     return self._get_value(key)
  1123 # Convert generator to list before going through hashable part
  1124 # (We will iterate through the generator there to check for slices)
  1125 if is_iterator(key):

File ~/work/pandas/pandas/pandas/core/series.py:1237, in Series._get_value(self, label, takeable)
  1234     return self._values[label]
  1236 # Similar to Index.get_value, but we do not fall back to positional
-> 1237 loc = self.index.get_loc(label)
  1239 if is_integer(loc):
  1240     return self._values[loc]

File ~/work/pandas/pandas/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
  3807     if isinstance(casted_key, slice) or (
  3808         isinstance(casted_key, abc.Iterable)
  3809         and any(isinstance(x, slice) for x in casted_key)
  3810     ):
  3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
  3813 except TypeError:
  3814     # If we have a listlike key, _check_indexing_error will raise
  3815     #  InvalidIndexError. Otherwise we fall through and re-raise
  3816     #  the TypeError.
  3817     self._check_indexing_error(key)

KeyError: 'a'

To resolve this issue, one can make a copy so that the mutation does not apply to the container being iterated over.

In [28]: values = [0, 1, 2, 3, 4, 5]

In [29]: n_removed = 0

In [30]: for k, value in enumerate(values.copy()):
 ....:    idx = k - n_removed
 ....:    if value % 2 == 1:
 ....:        del values[idx]
 ....:        n_removed += 1
 ....:    else:
 ....:        values[idx] = value + 1
 ....: 

In [31]: values
Out[31]: [1, 3, 5]

In [32]: def f(s):
 ....:    s = s.copy()
 ....:    s.pop("a")
 ....:    return s
 ....: 

In [33]: df = pd.DataFrame({"a": [1, 2, 3], 'b': [4, 5, 6]})

In [34]: df.apply(f, axis="columns")
Out[34]: 
 b
0  4
1  5
2  6
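An alternative to copying inside the UDF is to avoid mutation altogether by calling a method that returns a new object. The sketch below uses Series.drop for this; the function name drop_a is hypothetical.

```python
import pandas as pd

def drop_a(row):
    # Series.drop returns a new Series and leaves `row` untouched.
    return row.drop("a")

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
result = df.apply(drop_a, axis="columns")

print(list(result.columns))  # ['b']
print(list(df.columns))      # ['a', 'b']
```

The original frame keeps both columns because the UDF never modifies the object it receives.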

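Missing value representation for NumPy types

`np.nan` as the `NA` representation for NumPy types

For lack of NA (missing) support from the ground up in NumPy and Python in general, NA could have been represented in one of two ways:

A masked array solution: an array of data and an array of boolean values indicating whether a value is there or is missing.
Using a special sentinel value, bit pattern, or set of sentinel values to denote NA across the dtypes.

The special value np.nan (Not-A-Number) was chosen as the NA value for NumPy types, and there are API functions like DataFrame.isna() and DataFrame.notna() which can be used across the dtypes to detect NA values. However, this choice has the downside of coercing missing integer data to float types, as shown in Support for integer NA.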
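As a small sketch of detection across dtypes, isna() and notna() report missing entries whether the underlying marker is np.nan, None, or pd.NaT. The column names below are illustrative assumptions.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {
        "number": [1.5, np.nan],
        "label": ["text", None],
        "when": [pd.Timestamp("2000-01-01"), pd.NaT],
    }
)

missing = frame.isna()    # boolean frame, True where a value is missing
present = frame.notna()   # element-wise complement of isna()

print(missing.sum().to_dict())
```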

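`NA` type promotions for NumPy types

When introducing NAs into an existing Series or DataFrame via reindex() or some other means, boolean and integer types will be promoted to a different dtype in order to store the NAs. The promotions are summarized in this table:

Typeclass	Promotion dtype for storing NAs
`floating`	no change
`object`	no change
`integer`	cast to `float64`
`boolean`	cast to `object`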
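The promotions in the table can be observed directly by reindexing with a label that does not exist. A minimal sketch with three hypothetical two-element Series:

```python
import pandas as pd

ints = pd.Series([1, 2], index=["a", "b"])
bools = pd.Series([True, False], index=["a", "b"])
floats = pd.Series([1.0, 2.0], index=["a", "b"])

# Reindexing with an unknown label introduces a missing value.
new_index = ["a", "b", "c"]
print(ints.reindex(new_index).dtype)    # float64
print(bools.reindex(new_index).dtype)   # object
print(floats.reindex(new_index).dtype)  # float64
```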

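Support for integer `NA`

In the absence of high-performance NA support being built into NumPy from the ground up, the primary casualty is the ability to represent NAs in integer arrays. For example: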

In [35]: s = pd.Series([1, 2, 3, 4, 5], index=list("abcde"))

In [36]: s
Out[36]: 
a    1
b    2
c    3
d    4
e    5
dtype: int64

In [37]: s.dtype
Out[37]: dtype('int64')

In [38]: s2 = s.reindex(["a", "b", "c", "f", "u"])

In [39]: s2
Out[39]: 
a    1.0
b    2.0
c    3.0
f    NaN
u    NaN
dtype: float64

In [40]: s2.dtype
Out[40]: dtype('float64')

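This trade-off is made largely for memory and performance reasons, and also so that the resulting Series continues to be "numeric".

If you need to represent integers with possibly missing values, use one of the nullable-integer extension dtypes provided by pandas or pyarrow: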

Int8Dtype
Int16Dtype
Int32Dtype
Int64Dtype
ArrowDtype

In [41]: s_int = pd.Series([1, 2, 3, 4, 5], index=list("abcde"), dtype=pd.Int64Dtype())

In [42]: s_int
Out[42]: 
a    1
b    2
c    3
d    4
e    5
dtype: Int64

In [43]: s_int.dtype
Out[43]: Int64Dtype()

In [44]: s2_int = s_int.reindex(["a", "b", "c", "f", "u"])

In [45]: s2_int
Out[45]: 
a       1
b       2
c       3
f    <NA>
u    <NA>
dtype: Int64

In [46]: s2_int.dtype
Out[46]: Int64Dtype()

In [47]: s_int_pa = pd.Series([1, 2, None], dtype="int64[pyarrow]")

In [48]: s_int_pa
Out[48]: 
0       1
1       2
2    <NA>
dtype: int64[pyarrow]

See Nullable integer data type and PyArrow Functionality for more.
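When a Series has already been promoted to float64, it can be cast to a nullable integer dtype afterwards. The following sketch assumes every non-missing value is a whole number, which the cast to Int64 requires:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=list("abcde"))

# A plain reindex promotes int64 to float64 in order to hold NaN.
s2 = s.reindex(["a", "b", "c", "f", "u"])

# Casting to the nullable dtype restores integers; NaN becomes pd.NA.
s2_nullable = s2.astype("Int64")

print(s2_nullable.dtype)         # Int64
print(s2_nullable.isna().sum())  # 2
print(s2_nullable.sum())         # 6 (missing values are skipped)
```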

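Why not make NumPy like R?

Many people have suggested that NumPy should simply emulate the NA support present in the more domain-specific statistical programming language R. Part of the reason this has not happened is the NumPy type hierarchy:

Typeclass	Dtypes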
`numpy.floating`	`float16, float32, float64, float128`
`numpy.integer`	`int8, int16, int32, int64`
`numpy.unsignedinteger`	`uint8, uint16, uint32, uint64`
`numpy.object_`	`object_`
`numpy.bool_`	`bool_`
`numpy.character`	`bytes_, str_`

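In contrast, the R language has only a handful of built-in data types: integer, numeric (floating-point), character, and boolean. NA types are implemented by reserving special bit patterns for each type to be used as the missing value. While doing this with the full NumPy type hierarchy would be possible, it would be a more substantial trade-off (especially for the 8- and 16-bit data types) and a larger implementation undertaking.

However, R NA semantics are now available by using masked NumPy types such as Int64Dtype, or PyArrow types (ArrowDtype).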
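A short sketch of what these semantics look like with a masked integer array: the missing entry is tracked separately from the data, and it propagates through comparisons and arithmetic instead of silently becoming False or a float. The variable names are illustrative.

```python
import pandas as pd

values = pd.array([1, None, 3], dtype="Int64")

# The missing entry is tracked by a mask, not by a sentinel value.
mask = values.isna()        # [False, True, False]

# Comparisons propagate the missing value instead of returning False.
equal_one = values == 1     # boolean dtype: [True, <NA>, False]

# Arithmetic keeps the integer dtype and propagates NA.
plus_one = values + 1       # Int64 dtype: [2, <NA>, 4]

print(equal_one)
print(plus_one)
```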

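Differences with NumPy

For Series and DataFrame objects, var() normalizes by N-1 to produce unbiased estimates of the population variance, while NumPy's numpy.var() normalizes by N, which measures the variance of the sample. Note that cov() normalizes by N-1 in both pandas and NumPy.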
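The two normalizations can be reconciled through the ddof argument, which both libraries accept. A small worked sketch on four hypothetical values, whose squared deviations from the mean of 2.5 sum to 5.0:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

pandas_var = s.var()                        # 5.0 / (N - 1) = 5.0 / 3
numpy_var = np.var(s.to_numpy())            # 5.0 / N = 1.25
numpy_ddof1 = np.var(s.to_numpy(), ddof=1)  # matches the pandas default
pandas_ddof0 = s.var(ddof=0)                # matches the NumPy default

print(pandas_var, numpy_var)
```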

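Thread-safety

pandas is not 100% thread safe. The known issues relate to the copy() method. If you are doing a lot of copying of DataFrame objects shared among threads, we recommend holding locks inside the threads where the data copying occurs.

See this link for more information.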
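One way to follow this recommendation is to guard each copy with a threading.Lock, as in the sketch below. The names shared, copy_lock, and worker are hypothetical, and the example only illustrates the locking pattern rather than reproducing a race.

```python
import threading

import pandas as pd

shared = pd.DataFrame({"a": range(1000)})
copy_lock = threading.Lock()   # hypothetical name; guards copies of `shared`
totals = []

def worker():
    # Hold the lock only while the copy is being made.
    with copy_lock:
        local = shared.copy()
    # Work on the private copy happens outside the lock.
    totals.append(int(local["a"].sum()))

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(totals)
```

Keeping the locked region limited to the copy itself lets the threads do their remaining work in parallel on private data.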

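Byte-ordering issues

Occasionally you may have to deal with data that were created on a machine with a different byte order than the one on which you are running Python. A common symptom of this issue is an error like: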

Traceback
    ...
ValueError: Big-endian buffer not supported on little-endian compiler

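To deal with this issue, convert the underlying NumPy array to the native system byte order before passing it to the Series or DataFrame constructors, using something similar to the following: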

In [49]: x = np.array(list(range(10)), ">i4")  # big endian

In [50]: newx = x.byteswap().view(x.dtype.newbyteorder())  # force native byteorder

In [51]: s = pd.Series(newx)

See the NumPy documentation on byte order for more details.
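An equivalent conversion, shown here as an alternative sketch rather than the documented approach, is to let astype perform the swap by requesting a dtype in native byte order. The check with dtype.isnative confirms the result.

```python
import numpy as np
import pandas as pd

x = np.array(list(range(10)), ">i4")   # big endian, as in the example above

# "=" requests the machine's native byte order; astype converts the data.
native = x.astype(x.dtype.newbyteorder("="))

s = pd.Series(native)
print(native.dtype.isnative)   # True
print(s.tolist())
```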

From： https://www.cnblogs.com/apachecn/p/18154713

Pandas 2.2 Chinese Official Tutorials and Guides (Part 24)
