High-performance way to read the last rows of many CSV files?

Posted: 2024-07-29 08:20:18 | Views: 18

Tags: python pandas dataframe

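In Python, I want to loop over a list of more than 1,000 CSV files, each with about 100k rows, and read only the last 100 or so rows of each. Loading them with pandas takes too much time and memory, because pd.read_csv loads the entire file into the DataFrame. What is a fast, efficient way to do this in Python?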

import os
import numpy as np
import pandas as pd

directory = "/tmp/path/to/csv/"
out = []

for filename in os.listdir(directory):
    file_path = os.path.join(directory, filename)
    df = pd.read_csv(file_path)  # loads the whole file just to inspect one row
    if np.all(df.iloc[-1] < 100):
        out.append(filename)

Here are several efficient ways to read the last few rows of a CSV file in Python:

1. Using the tail command with the csv module

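This is probably the simplest approach and often the fastest, especially for large files, because it hands the file handling to the Unix tail utility, which for regular files typically works backward from the end instead of reading from the start. It requires tail to be installed, so it is not available on a stock Windows system: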

import os
import csv
import subprocess

def read_last_n_lines(file_path, n):
    """Read the last n lines of a file with the tail command."""
    try:
        # Ask tail for the last n lines
        result = subprocess.run(['tail', f'-n{n}', file_path], stdout=subprocess.PIPE, text=True, check=True)
    except FileNotFoundError:
        # Raised when the tail executable itself is not installed
        print(f"tail command not found; cannot read {file_path}")
        return []
    except subprocess.CalledProcessError:
        # tail exits with a non-zero status when the file is missing or unreadable
        print(f"Could not read file: {file_path}")
        return []
    # Parse the captured lines as CSV rows
    return list(csv.reader(result.stdout.splitlines()))

directory = "/tmp/path/to/csv/"
out = []

for filename in os.listdir(directory):
    file_path = os.path.join(directory, filename)
    last_100_rows = read_last_n_lines(file_path, 100)

    # Process the data in last_100_rows, for example:
    if last_100_rows:
        last_row = last_100_rows[-1]
        # Check whether every value in last_row is below 100
        if all(float(x) < 100 for x in last_row):
            out.append(filename)
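
Because the tail approach depends on an external program, it may help to detect whether tail exists and fall back to pure Python otherwise. The following is a minimal sketch under that assumption; the function name tail_lines and the UTF-8 encoding of the fallback are illustrative choices, not part of the original answer.

```python
import shutil
import subprocess
from collections import deque

def tail_lines(file_path, n):
    """Return the last n lines of file_path as strings without line endings."""
    if shutil.which('tail') is not None:
        # tail is on PATH: let it fetch the last n lines
        result = subprocess.run(['tail', '-n', str(n), file_path],
                                capture_output=True, text=True, check=True)
        return result.stdout.splitlines()
    # Fallback (assumes UTF-8 text): stream the file and keep only the last n lines
    with open(file_path, 'r', encoding='utf-8', newline='') as f:
        return [line.rstrip('\r\n') for line in deque(f, maxlen=n)]
```

The fallback is slower on large files because it scans every line, but it keeps the script usable on systems without tail.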

2. Seeking backward from the end of the file

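This approach is also very efficient because it never reads the whole file into memory. For large files it is much faster than pandas, and unlike the deque approach it does not have to scan the file from the beginning: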

import os
import csv

def read_last_n_lines(file_path, n, chunk_size=8192):
    """Read the last n lines of a file by reading blocks backward from the end."""
    if n <= 0:
        return []
    with open(file_path, 'rb') as f:
        # Binary mode allows arbitrary seeks and keeps any line ending intact
        f.seek(0, os.SEEK_END)
        position = f.tell()
        data = b''
        # Collect blocks until more than n newlines are buffered (so the first
        # kept line is complete) or the start of the file is reached
        while position > 0 and data.count(b'\n') <= n:
            step = min(chunk_size, position)
            position -= step
            f.seek(position)
            data = f.read(step) + data
        return data.splitlines(keepends=True)[-n:]

directory = "/tmp/path/to/csv/"
out = []

for filename in os.listdir(directory):
    file_path = os.path.join(directory, filename)
    last_100_rows = read_last_n_lines(file_path, 100)

    # Decode the final line, parse it with csv.reader, and check its values
    if last_100_rows:
        last_row = next(csv.reader([line.decode('utf-8') for line in last_100_rows[-1:]]))
        if all(float(x) < 100 for x in last_row):
            out.append(filename)
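
Since the check in the question only looks at the final row, reading a single line from the end may be enough. The sketch below illustrates that idea; the name read_last_line, the chunk_size parameter, and the UTF-8 decoding are assumptions made for illustration, and it assumes no quoted field contains an embedded newline.

```python
import os

def read_last_line(file_path, chunk_size=4096):
    """Return the last non-empty line of a file as text, or '' if there is none."""
    with open(file_path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        data = b''
        while position > 0:
            # Step one block toward the start and prepend it to the buffer
            step = min(chunk_size, position)
            position -= step
            f.seek(position)
            data = f.read(step) + data
            # Stop once a newline precedes the final line's content
            if b'\n' in data.rstrip(b'\r\n'):
                break
        lines = data.rstrip(b'\r\n').splitlines()
        return lines[-1].decode('utf-8') if lines else ''
```

For a typical CSV this touches only the final block of each file, so the cost per file stays roughly constant regardless of file size.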

3. Using collections.deque

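This approach works well for smaller files: deque(f, maxlen=n) still iterates over every line, but only the last n lines are kept in memory. For very large files, the tail approach is likely to be faster: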

import os
from collections import deque

def read_last_n_lines(file_path, n):
    """Read the last n lines of a file with a bounded deque."""
    with open(file_path, 'r') as f:
        # The deque drops older lines as it fills, so memory stays bounded by n
        return list(deque(f, maxlen=n))

directory = "/tmp/path/to/csv/"
out = []

for filename in os.listdir(directory):
    file_path = os.path.join(directory, filename)
    last_100_rows = read_last_n_lines(file_path, 100)

    # Process the data in last_100_rows, for example:
    if last_100_rows:
        # Strip the line ending before splitting (assumes no quoted commas)
        last_row = last_100_rows[-1].strip().split(',')
        # Check whether every value in last_row is below 100
        if all(float(x) < 100 for x in last_row):
            out.append(filename)
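
To see why a bounded deque keeps only the tail, here is a small self-contained demonstration. The in-memory sample is hypothetical and stands in for a file on disk.

```python
import io
from collections import deque

# Hypothetical in-memory CSV standing in for a real file
sample = io.StringIO("a,b\n1,2\n3,4\n5,6\n")

# With maxlen=2 the deque discards the oldest line each time a new one arrives
last_two = list(deque(sample, maxlen=2))
print(last_two)  # ['3,4\n', '5,6\n']
```

Every line still passes through the deque, which is why this method reads the full file even though it stores only n lines.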

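All of these approaches avoid loading an entire CSV file into memory, which improves performance, especially when processing many large files. The tail and seek-based versions also avoid reading most of each file, whereas the deque version still scans every line. Which one fits best depends on your specific requirements and file sizes.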
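
All three loops call float() on every field of the last row, which raises ValueError if that row is blank, is a header, or contains a text column. A small helper can make the check tolerant of such rows. This is a sketch only; the name row_all_below and the choice to treat unparseable rows as failing are assumptions, not part of the original answer.

```python
def row_all_below(row, limit=100.0):
    """Return True if every field in row parses as a number below limit."""
    if not row:
        return False  # an empty row (for example a blank trailing line) fails
    try:
        return all(float(value) < limit for value in row)
    except ValueError:
        return False  # a non-numeric field, such as a header or text column
```

In any of the loops above, the condition all(float(x) < 100 for x in last_row) could then be replaced with row_all_below(last_row).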

From： 78805058
