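Proportionately split a dataframe with multiple target columns

Tags: python pandas dataframe numpy data-analysis

I have a dataframe with 30 rows and 10 columns. Five of the columns are input features and the other five are output/target columns. The target columns contain classes represented as 0, 1, 2. I want to split the dataset into train and test such that, in the train set, for each output column, the proportion of class 1 is between 0.15 and 0.3. (I don't care about the distribution of classes in the test set.)

Additional context: I am trying to balance the output classes in a multi-class, multi-output dataset. My understanding is that this would be an optimization problem with 25(?) degrees of freedom. So, given any input dataset, I would be able to create a subset of it that serves as my training data and has the desired class balance (i.e. class 1 between 0.15 and 0.3 for each output column).

I create the dataframe with this: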

import pandas as pd
import numpy as np 
from sklearn.model_selection import train_test_split

np.random.seed(42)
data = pd.DataFrame({
    'A': np.random.rand(30),
    'B': np.random.rand(30),
    'C': np.random.rand(30),
    'D': np.random.rand(30),
    'E': np.random.rand(30),
    'F': np.random.choice([0, 1, 2], 30),
    'G': np.random.choice([0, 1, 2], 30),
    'H': np.random.choice([0, 1, 2], 30),
    'I': np.random.choice([0, 1, 2], 30),
    'J': np.random.choice([0, 1, 2], 30)
})

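My current silly/naive solution to this problem uses two separate functions. I have a helper function that checks whether the proportion of class 1 in each column is within my desired range, and a second function that keeps splitting until that check passes: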

def check_proportions(df, cols, min_prop = 0.15, max_prop = 0.3, class_category = 1):
    for col in cols:
        prop = (df[col] == class_category).mean()
        if not (min_prop <= prop <= max_prop):
            return False
    return True

def proportionately_split_data(data, target_cols, min_prop = 0.15, max_prop = 0.3):
    while True:
        random_state = np.random.randint(100_000)
        train_df, test_df = train_test_split(data, test_size = 0.3, random_state = random_state)
        if check_proportions(train_df, target_cols, min_prop, max_prop):
            return train_df, test_df

Finally, I run the code using

target_cols = ["F", "G", "H", "I", "J"]

train, test = proportionately_split_data(data, target_cols)

My concern with the current "solution" is that it is probabilistic rather than deterministic. If none of the random states passed to train_test_split happens to produce data with the desired proportions, I can see this getting stuck in an infinite loop. Any help would be much appreciated!

I apologize for not providing this earlier. For a minimal working example, the input (data) could be

A     B     C     D     E     OUTPUT_1  OUTPUT_2  OUTPUT_3  OUTPUT_4  OUTPUT_5
5.65  3.56  0.94  9.23  6.43  0         1         1         0         1
7.43  3.95  1.24  7.22  2.66  0         0         0         1         2
9.31  2.42  2.91  2.64  6.28  2         1         2         2         0
8.19  5.12  1.32  3.12  8.41  1         2         0         1         2
9.35  1.92  3.12  4.13  3.14  0         1         1         0         1
8.43  9.72  7.23  8.29  9.18  1         0         0         2         2
4.32  2.12  3.84  9.42  8.19  0         0         0         0         0
3.92  3.91  2.90  8.19  8.41  2         2         2         2         1
7.89  1.92  4.12  8.19  7.28  1         1         2         0         2
5.21  2.42  3.10  0.31  1.31  2         0         1         1         0

which has 10 rows and 10 columns, and an expected output (train set) could be

A     B     C     D     E     OUTPUT_1  OUTPUT_2  OUTPUT_3  OUTPUT_4  OUTPUT_5
5.65  3.56  0.94  9.23  6.43  0         1         1         0         1
7.43  3.95  1.24  7.22  2.66  0         0         0         1         2
9.31  2.42  2.91  2.64  6.28  2         1         2         2         0
8.19  5.12  1.32  3.12  8.41  1         2         0         1         2
8.43  9.72  7.23  8.29  9.18  1         0         0         2         2
3.92  3.91  2.90  8.19  8.41  2         2         2         2         1
5.21  2.42  3.10  0.31  1.31  2         0         1         1         0

where each output column in the train set has at least 2 (>= 0.15 * number of rows in the input data) and at most 3 (<= 0.3 * number of rows in the input data) instances of class 1. I guess I also didn't clarify that the proportion is in relation to the number of examples (or rows) in the input dataset. My test set would be the remaining rows in the input dataset.
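
The count bounds quoted for the 10-row example can be reproduced with a few lines of arithmetic. This is only a sketch: the helper name class1_count_bounds is made up for illustration, and it assumes the lower bound is rounded up and the upper bound is rounded down to whole rows.

```python
import math

def class1_count_bounds(n_rows, min_prop=0.15, max_prop=0.3):
    """Return (lowest, highest) allowed number of class-1 rows (hypothetical helper)."""
    # Round first so floating-point noise in the product cannot push a bound across an integer.
    lowest = math.ceil(round(min_prop * n_rows, 9))
    highest = math.floor(round(max_prop * n_rows, 9))
    return lowest, highest

# OUTPUT_1..OUTPUT_5 values of the seven rows in the example train set
train_outputs = [
    (0, 1, 1, 0, 1),
    (0, 0, 0, 1, 2),
    (2, 1, 2, 2, 0),
    (1, 2, 0, 1, 2),
    (1, 0, 0, 2, 2),
    (2, 2, 2, 2, 1),
    (2, 0, 1, 1, 0),
]

lowest, highest = class1_count_bounds(10)
# zip(*rows) transposes the rows into columns, one tuple per output column
class1_counts = [sum(v == 1 for v in column) for column in zip(*train_outputs)]
print(lowest, highest)  # 2 3
print(class1_counts)    # [2, 2, 2, 3, 2]
```

Every count falls between 2 and 3, so the example train set meets the stated requirement.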

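Your concern is valid: your current solution can get stuck in an infinite loop if it never finds a split that satisfies your condition, and it is not guaranteed to find a solution even when one exists.

Rather than repeatedly splitting the data at random, you could try a more systematic approach. Here is one that uses the following steps:

Compute the class 1 target count for each target column: based on the desired proportion and the training set size, determine how many class 1 samples are needed in each target column.
Group the samples by target column: create groups of samples according to their value in the target column.
Draw samples from each group: draw the required number of samples from each group to meet the class 1 target count, while also making sure enough samples are selected from the other classes to reach the desired training set size.
Combine the train and test sets: put the selected samples into the train set and the remaining samples into the test set.

Here is how you could implement this: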
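
One minimal way to rule out the infinite loop, shown here only as a sketch, is to cap the number of attempts and walk through the seeds in a fixed order so the search is reproducible. The name bounded_split, the max_tries parameter, and the use of DataFrame.sample in place of train_test_split are assumptions made for this illustration; check_proportions is a condensed copy of the helper from the question. It still does not guarantee that a valid split is found.

```python
import numpy as np
import pandas as pd

def check_proportions(df, cols, min_prop=0.15, max_prop=0.3, class_category=1):
    # True only if every column's class share lies inside [min_prop, max_prop]
    return all(min_prop <= (df[c] == class_category).mean() <= max_prop for c in cols)

def bounded_split(data, target_cols, test_size=0.3, max_tries=1000):
    """Try seeds 0..max_tries-1 in order; return (train, test) or None."""
    n_train = len(data) - int(round(len(data) * test_size))
    for seed in range(max_tries):
        train_df = data.sample(n=n_train, random_state=seed)
        if check_proportions(train_df, target_cols):
            # Assumes the index labels are unique, as with the default RangeIndex
            return train_df, data.drop(train_df.index)
    return None  # give up instead of looping forever

# Small synthetic frame, for illustration only
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(0, 3, size=(30, 5)), columns=list("FGHIJ"))
result = bounded_split(demo, list("FGHIJ"))
if result is None:
    print("no valid split found within max_tries")
else:
    train, test = result
    print(len(train), len(test))
```

Returning None lets the caller decide what to do when the budget runs out, for example relaxing the bounds or switching to a different search.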

import pandas as pd
import numpy as np

def stratified_split(data, target_cols, min_prop=0.15, max_prop=0.3):
    n_train = int(len(data) * (1 - 0.3))  # Calculate the size of the training set
    train_indices = []

    for col in target_cols:
        # Calculate the target count for class 1 in the current target column
        target_count = int(n_train * (min_prop + max_prop) / 2) 

        # Group indices by class for the current target column
        class_indices = {
            class_val: data.index[data[col] == class_val].tolist()
            for class_val in data[col].unique()
        }

        # Sample from class 1
        selected_indices = np.random.choice(
            class_indices[1], size=min(target_count, len(class_indices[1])), replace=False
        ).tolist()

        # Sample from other classes to reach the desired training set size
        remaining_count = n_train - len(selected_indices)
        for class_val, indices in class_indices.items():
            if class_val != 1:
                selected_indices.extend(
                    np.random.choice(
                        indices,
                        size=min(remaining_count, len(indices)),
                        replace=False,
                    )
                )
                remaining_count = n_train - len(selected_indices)
                if remaining_count == 0:
                    break

        train_indices.extend(selected_indices)

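    # Columns are sampled independently, so the same row can be picked more than
    # once; keep each label once (the union may then be larger than n_train)
    train_indices = list(dict.fromkeys(train_indices))

    # Create the train and test sets (the selections are index labels, so use .loc)
    train_df = data.loc[train_indices]
    test_df = data.drop(train_indices)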

    return train_df, test_df

target_cols = ["F", "G", "H", "I", "J"]

train, test = stratified_split(data.copy(), target_cols)

# Verify that the proportions are within the desired range
for col in target_cols:
    prop = (train[col] == 1).mean()
    print(f"Column {col}: Proportion of class 1 = {prop:.2f}")

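This code goes through each target column and tries to pick the right number of samples from each class to meet your condition. It first draws the target number of samples from class 1, then draws the remainder from the other classes. Because each column is sampled on its own, however, the result is not guaranteed to satisfy the condition for every column at once, so verify the output (for example with your check_proportions helper) before relying on it.

Note that this solution assumes each class in each target column has enough samples to meet your condition. If a class has too few samples, the required proportion may be unattainable. In that case you may need to collect more data or relax the proportion requirement.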
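
As a rough sketch of handling all target columns jointly, one option is a swap-based local search: start from one seeded random split and keep swapping a training row with a test row whenever that does not increase the total violation of the bounds. This is a different technique from the per-column sampling described in the steps. The names violation and repair_split and the parameters max_iter and seed are invented for this illustration. It is a heuristic, not an exact solver, so it returns None when it cannot reach a valid split within its budget, for example when class 1 is too rare in some column.

```python
import numpy as np
import pandas as pd

def violation(counts, n_train, min_prop, max_prop):
    """Total distance of the per-column class-1 counts from the allowed range."""
    low, high = min_prop * n_train, max_prop * n_train
    return float(np.sum(np.maximum(low - counts, 0) + np.maximum(counts - high, 0)))

def repair_split(data, target_cols, n_train, min_prop=0.15, max_prop=0.3,
                 max_iter=10_000, seed=0):
    rng = np.random.default_rng(seed)
    # 1 where a cell equals class 1, else 0 (shape: rows x target columns)
    is_one = (data[target_cols] == 1).to_numpy().astype(int)
    order = rng.permutation(len(data))
    train_pos, test_pos = list(order[:n_train]), list(order[n_train:])
    counts = is_one[train_pos].sum(axis=0)
    score = violation(counts, n_train, min_prop, max_prop)
    for _ in range(max_iter):
        if score == 0:
            break
        i, j = rng.integers(len(train_pos)), rng.integers(len(test_pos))
        # Counts after moving train row i out and test row j in
        new_counts = counts - is_one[train_pos[i]] + is_one[test_pos[j]]
        new_score = violation(new_counts, n_train, min_prop, max_prop)
        if new_score <= score:  # accept improving or sideways moves
            train_pos[i], test_pos[j] = test_pos[j], train_pos[i]
            counts, score = new_counts, new_score
    if score > 0:
        return None  # no valid split found within max_iter swaps
    return data.iloc[train_pos], data.iloc[test_pos]

# Small synthetic frame, for illustration only
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.integers(0, 3, size=(30, 5)), columns=list("FGHIJ"))
result = repair_split(demo, list("FGHIJ"), n_train=21)
if result is None:
    print("no valid split found")
else:
    train, test = result
    print((train == 1).mean().round(2).to_dict())
```

A None result does not prove that no valid split exists; an exact answer would need an exhaustive or integer-programming search.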

From： 78634138
