I have a csv file with 720,000 rows and 10 columns; the columns relevant to the question are
['timestamp_utc', 'looted_by__name', 'item_id', 'quantity']
The file is a log of items people loot off the ground in a game. The problem is that sometimes the ground bugs out and the same person loots the same item twice, producing two different rows (which can be up to 5 rows apart) that differ slightly in the
'timestamp_utc'
column but are otherwise identical in
['looted_by__name', 'item_id', 'quantity']
An example:
2024-06-23T11:40:43.2187312Z,Georgeeto,T4_SOUL,2
2024-06-23T11:40:43.4588316Z,Georgeeto,T4_SOUL,2
In this example, 2024-06-23T11:40:43.2187312Z is the 'timestamp_utc', 'Georgeeto' is the 'looted_by__name', 'T4_SOUL' is the 'item_id', and '2' is the 'quantity'.
What I am trying to do here is check whether
['looted_by__name', 'item_id', 'quantity']
are equal between two rows and, if they are, subtract the two rows' timestamps; if the difference is less than 0.5 seconds, I copy both corrupted rows into a
Corrupted.csv
file and write only one of them into a
Clean.csv
file.
The way I am doing it is as follows:
import pandas as pd
import time
from datetime import datetime

start_time = time.time()
combined_df_3 = pd.read_csv("Proccesing/combined_file_refined.csv", delimiter=',', usecols=['timestamp_utc', 'looted_by__name', 'item_id', 'quantity'])
combined_df_4 = pd.read_csv("Proccesing/combined_file_refined.csv", delimiter=',', usecols=['timestamp_utc', 'looted_by__name', 'item_id', 'quantity'])
bugged_item_df = pd.DataFrame()
clean_item_df = pd.DataFrame()
bugged_item_list = []
clean_item_list = []
date_format = '%Y-%m-%dT%H:%M:%S.%f'
for index1, row1 in combined_df_3.iterrows():
    n = 0
    time_stamp_1 = datetime.strptime(row1['timestamp_utc'][:26], date_format)
    name_1 = row1['looted_by__name']
    item_id_1 = row1['item_id']
    quantity_1 = row1['quantity']
    for index2, row2 in combined_df_4.iterrows():
        print(str(n))
        n += 1
        if n > 5:
            break
        time_stamp_2 = datetime.strptime(row2['timestamp_utc'][:26], date_format)
        name_2 = row2['looted_by__name']
        item_id_2 = row2['item_id']
        quantity_2 = row2['quantity']
        if time_stamp_1 == time_stamp_2 and name_1 == name_2 and item_id_1 == item_id_2 and quantity_1 == quantity_2:
            break  # same row: get out of the inner for loop here
        elif name_1 == name_2 and item_id_1 == item_id_2 and quantity_1 == quantity_2:
            if time_stamp_1 > time_stamp_2:
                date_diff = abs(time_stamp_1 - time_stamp_2)
                date_diff_sec = date_diff.total_seconds()
            elif time_stamp_1 < time_stamp_2:
                date_diff = abs(time_stamp_2 - time_stamp_1)
                date_diff_sec = date_diff.total_seconds()
            if date_diff_sec < 0.5:
                bugged_item_df = bugged_item_df._append(row1, ignore_index=True)
                bugged_item_df = bugged_item_df._append(row2, ignore_index=True)  # add both rows to the corrupted csv and keep one of them out of the final csv
            elif date_diff_sec > 0.5:
                pass  # write the line into the clean csv normally
        else:
            pass  # write the line into the clean csv normally

bugged_item_df.to_csv("test.csv", index=False)
clean_item_df.to_csv('test2.csv', index=False)
end_time = time.time()
execution_time = end_time - start_time
print(f"Execution time: {execution_time} seconds")
My approach "technically" works, but it takes roughly 6-13 hours to chew through the whole file, so I am asking whether there is a way to optimize it to run faster.
Note: the code is not finished yet, but you can get the idea from it.
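An aside on the `[:26]` slice used when parsing: these .NET-style timestamps carry seven fractional-second digits plus a trailing `Z`, while `strptime`'s `%f` accepts at most six digits, so the string has to be truncated to 26 characters first. A minimal sketch using the sample row from above:

```python
from datetime import datetime

ts = "2024-06-23T11:40:43.2187312Z"  # 7 fractional digits plus 'Z'
# %f parses at most 6 fractional digits, so slice off the 7th digit and the 'Z'
parsed = datetime.strptime(ts[:26], "%Y-%m-%dT%H:%M:%S.%f")
print(parsed.microsecond)  # 218731
```

(`pd.to_datetime`, used in the vectorized answer below, handles the full 7-digit form directly, at the cost of rounding to microsecond/nanosecond precision internally.)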
Update: thanks to AKZ's suggestion (I love you) I was able to cut the time from 13.4 hours to 32 minutes, and I realized the code I posted also had a mistake in the for loop, so I went with the following:
import time
import pandas as pd
from datetime import datetime

# organizing the rows
df = pd.read_csv("proccesing/combined_file_refined.csv", delimiter=',', usecols=['timestamp_utc', 'looted_by__name', 'item_id', 'quantity'])
df = df.groupby(['looted_by__name', 'timestamp_utc']).sum().reset_index()
df.to_csv("test.csv", index=False)
bugged_item_df = pd.DataFrame()
clean_item_df = pd.DataFrame()
df1 = pd.read_csv("test.csv", delimiter=',', usecols=['timestamp_utc', 'looted_by__name', 'item_id', 'quantity'])
date_format = '%Y-%m-%dT%H:%M:%S.%f'
n = 0
num_of_runs = 0
start_time = time.time()
for index1, row1 in df.iterrows():
    num_of_runs += 1
    n += 1
    try:
        row2 = df1.iloc[n]
    except IndexError:
        clean_item_df = clean_item_df._append(row1, ignore_index=True)
        break
    time_stamp_1 = datetime.strptime(row1['timestamp_utc'][:26], date_format)
    name_1 = row1['looted_by__name']
    item_id_1 = row1['item_id']
    quantity_1 = row1['quantity']
    time_stamp_2 = datetime.strptime(row2['timestamp_utc'][:26], date_format)
    name_2 = row2['looted_by__name']
    item_id_2 = row2['item_id']
    quantity_2 = row2['quantity']
    if name_1 != name_2 or item_id_1 != item_id_2 or quantity_1 != quantity_2:
        # add row 1 to the clean df
        clean_item_df = clean_item_df._append(row1, ignore_index=True)
        continue
    elif time_stamp_1 > time_stamp_2:
        date_diff_1 = abs(time_stamp_1 - time_stamp_2)
        date_diff_sec_1 = date_diff_1.total_seconds()
        if date_diff_sec_1 < 0.5:
            # do not add row 1 to the clean df; add row 1 and row 2 to the bugged list
            bugged_item_df = bugged_item_df._append(row1, ignore_index=True)
            bugged_item_df = bugged_item_df._append(row2, ignore_index=True)
        elif date_diff_sec_1 > 0.5:
            # add row 1 to the clean df
            clean_item_df = clean_item_df._append(row1, ignore_index=True)
            continue
    elif time_stamp_1 < time_stamp_2:
        date_diff_2 = abs(time_stamp_2 - time_stamp_1)
        date_diff_sec_2 = date_diff_2.total_seconds()
        if date_diff_sec_2 < 0.5:
            # do not add row 1 to the clean df; add row 1 and row 2 to the bugged list
            bugged_item_df = bugged_item_df._append(row1, ignore_index=True)
            bugged_item_df = bugged_item_df._append(row2, ignore_index=True)
        elif date_diff_sec_2 > 0.5:
            # add row 1 to the clean df
            clean_item_df = clean_item_df._append(row1, ignore_index=True)
            continue

bugged_item_df.to_csv("bugged.csv", index=False)
clean_item_df.to_csv("clean.csv", index=False)
end_time = time.time()
execution_time = end_time - start_time
print(f"Execution time: {execution_time} seconds")
If anyone has a better answer than mine, please post it; I would appreciate it.
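A side note on the `_append` calls in the snippets here: `DataFrame.append` was removed in pandas 2.0, and `_append` is a private method that merely still happens to work. The supported (and much faster) pattern is to collect rows in a plain Python list and build the DataFrame once at the end; a minimal sketch, with the row values taken from the example above:

```python
import pandas as pd

# collect plain dicts instead of calling _append once per row
rows = []
rows.append({'looted_by__name': 'Georgeeto', 'item_id': 'T4_SOUL', 'quantity': 2})
rows.append({'looted_by__name': 'Georgeeto', 'item_id': 'T4_SOUL', 'quantity': 2})

# one constructor call at the end instead of a full copy per append
bugged_item_df = pd.DataFrame(rows)
print(len(bugged_item_df))  # 2
```

Each `_append` copies the whole accumulated DataFrame, so appending in a loop is quadratic; deferring construction keeps it linear.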
Update 2: I edited the code again after realizing I could drop the bugged rows faster; it now finishes in under 60 seconds.
import time
import pandas as pd
from datetime import datetime

# organizing the rows
combined_df_3 = pd.read_csv("proccesing/combined_file_refined.csv", delimiter=',', usecols=['timestamp_utc', 'looted_by__name', 'item_id', 'quantity'])
combined_df_3 = combined_df_3.groupby(['looted_by__name', 'timestamp_utc']).sum().reset_index()
combined_df_3.to_csv("proccesing/combined_file_orgnized.csv", index=False)
bugged_item_df = pd.DataFrame()
bugged_item_2df = pd.DataFrame()
combined_df_4 = pd.read_csv("proccesing/combined_file_orgnized.csv", delimiter=',', usecols=['timestamp_utc', 'looted_by__name', 'item_id', 'quantity'])
date_format = '%Y-%m-%dT%H:%M:%S.%f'
num_of_runs = 0
for index1, row1 in combined_df_3.iterrows():
    num_of_runs += 1
    try:
        row2 = combined_df_4.iloc[num_of_runs]
    except IndexError:
        break
    time_stamp_1 = datetime.strptime(row1['timestamp_utc'][:26], date_format)
    name_1 = row1['looted_by__name']
    item_id_1 = row1['item_id']
    quantity_1 = row1['quantity']
    time_stamp_2 = datetime.strptime(row2['timestamp_utc'][:26], date_format)
    name_2 = row2['looted_by__name']
    item_id_2 = row2['item_id']
    quantity_2 = row2['quantity']
    if name_1 != name_2 or item_id_1 != item_id_2 or quantity_1 != quantity_2:
        continue
    elif time_stamp_1 > time_stamp_2:
        date_diff_1 = abs(time_stamp_1 - time_stamp_2)
        date_diff_sec_1 = date_diff_1.total_seconds()
        if date_diff_sec_1 < 0.5:
            # do not add row 1 to the clean df; add row 1 and row 2 to the bugged list
            bugged_item_df = bugged_item_df._append(row1, ignore_index=True)
            bugged_item_df = bugged_item_df._append(row2, ignore_index=True)
            bugged_item_2df = bugged_item_2df._append(row1, ignore_index=True)
    elif time_stamp_1 < time_stamp_2:
        date_diff_2 = abs(time_stamp_2 - time_stamp_1)
        date_diff_sec_2 = date_diff_2.total_seconds()
        if date_diff_sec_2 < 0.5:
            # do not add row 1 to the clean df; add row 1 and row 2 to the bugged list
            bugged_item_df = bugged_item_df._append(row1, ignore_index=True)
            bugged_item_df = bugged_item_df._append(row2, ignore_index=True)
            bugged_item_2df = bugged_item_2df._append(row1, ignore_index=True)

bugged_item_df.to_csv("bugged.csv", index=False)
clean_item_df = combined_df_3.merge(bugged_item_2df, on=['timestamp_utc', 'looted_by__name', 'item_id', 'quantity'], how='left', indicator=True).query('_merge == "left_only"').drop('_merge', axis=1)
clean_item_df.to_csv("clean.csv", index=False)
If anyone knows how to get it below 30 seconds, feel free to add another approach.
Instead of iterating, you can use Pandas' vectorized operations to do this much faster. Here is how:
import pandas as pd

# Read the CSV file
df = pd.read_csv("Proccesing/combined_file_refined.csv", usecols=['timestamp_utc', 'looted_by__name', 'item_id', 'quantity'])

# Convert the timestamp column to datetime objects
df['timestamp_utc'] = pd.to_datetime(df['timestamp_utc'])

# Sort the DataFrame by 'looted_by__name', 'item_id' and 'quantity'
df = df.sort_values(['looted_by__name', 'item_id', 'quantity', 'timestamp_utc'])

# Compute the time difference to the previous row within each group
df['time_diff'] = df.groupby(['looted_by__name', 'item_id', 'quantity'])['timestamp_utc'].diff().dt.total_seconds()

# Find the rows whose time difference is less than 0.5 seconds
bugged_rows = df[df['time_diff'] < 0.5]

# Drop the bugged rows from the original DataFrame to get the clean data
clean_df = df[~df.index.isin(bugged_rows.index)]

# Save the bugged and clean rows to CSV files
bugged_rows.to_csv("Corrupted.csv", index=False)
clean_df.to_csv("Clean.csv", index=False)
This code first sorts the DataFrame by the relevant columns. It then uses the diff() function to compute the time difference between consecutive rows within the same group. Finally, it filters rows on that time difference to identify the bugged rows and build the clean DataFrame.
This approach should be much faster than the iterative one, especially on large datasets, because it leverages Pandas' vectorized operations.
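One caveat, as a hedged addendum: `diff()` looks only backwards, so `bugged_rows` above contains just the second row of each near-duplicate pair, whereas the question wants both rows in Corrupted.csv. A sketch of one way to flag both rows, combining the gap to the previous row with the gap to the next row (`diff(-1)`) inside each group; the three sample rows below are illustrative:

```python
import pandas as pd

# two near-duplicate rows (0.24 s apart) plus one clean row
df = pd.DataFrame({
    'timestamp_utc': pd.to_datetime([
        '2024-06-23T11:40:43.2187312Z',
        '2024-06-23T11:40:43.4588316Z',
        '2024-06-23T11:41:10.0000000Z',
    ]),
    'looted_by__name': ['Georgeeto'] * 3,
    'item_id': ['T4_SOUL'] * 3,
    'quantity': [2, 2, 2],
})

df = df.sort_values(['looted_by__name', 'item_id', 'quantity', 'timestamp_utc'])
grp = df.groupby(['looted_by__name', 'item_id', 'quantity'])['timestamp_utc']
# gap to the previous row in the group (flags the second row of a pair)...
gap_prev = grp.diff().dt.total_seconds()
# ...and gap to the next row (flags the first row of a pair)
gap_next = grp.diff(-1).dt.total_seconds().abs()
bugged_mask = (gap_prev < 0.5) | (gap_next < 0.5)

print(df[bugged_mask])   # both rows of the 11:40:43 pair
print(df[~bugged_mask])  # only the 11:41:10 row survives
```

`df[bugged_mask]` would then go to Corrupted.csv and `df[~bugged_mask]` to Clean.csv; to additionally keep one row per pair in the clean file, as the question asks, you could drop duplicates on the three key columns within the bugged set and concatenate the survivors back.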
Tags: python, pandas, dataframe, csv, analytics From: 78770763