Python series: Quickly reading form data from PDFs with Python, and handling common errors

Tags: reading, python, forms, content, Python, error, PDF, pdf, page


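Quickly reading form data from PDFs with Python

PDF forms are a common data-collection tool for gathering information from users or customers. Reading form data programmatically retrieves that information accurately and avoids manual entry and transcription, which saves time and labor and lowers the risk of data-entry errors. This article looks at how to read PDF form data quickly with Python.

Installing a Python PDF library

Many Python libraries can handle PDFs. This article uses Spire.PDF for Python, which supports creating and reading many kinds of PDF form fields, including text boxes, list boxes, drop-down lists (combo boxes), check boxes, and radio buttons. It also supports many other operations on PDF documents, such as merging PDFs, splitting PDFs, and converting PDFs to Word, Excel, and other formats.

You can install Spire.PDF for Python from PyPI by running the following command in a terminal: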

pip install Spire.PDF
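
To confirm that the install succeeded before running the examples, one option is to ask the standard library which version is installed. This is a minimal sketch: the helper name `installed_version` is made up for illustration, and the distribution name is simply the one used in the pip command above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("Spire.PDF")
    print(version if version else "Spire.PDF is not installed in this environment")
```

Returning `None` instead of raising keeps the check easy to use at the top of a script.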

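Reading PDF form data with Python

When reading form data from a PDF document, you can read the data of many form fields in one pass, or read only one specific field. Both scenarios are covered below.

1. Reading the data of several kinds of PDF form fields at once

To read several kinds of form fields in one pass, you iterate over the fields, check the type of each one, and read its data accordingly. The following steps show how to get the names and values of text boxes, list boxes, drop-down lists (combo boxes), radio buttons, and check boxes in a PDF:

Create a PdfDocument instance.
Load the PDF document with the PdfDocument.LoadFromFile() method.
Get the document's form collection through the PdfDocument.Form property.
Create a list to hold the extracted form data.
Loop over every field in the form collection; for each one, check its type and read the matching information.
- For a text box (PdfTextBoxFieldWidget), get its name and value and add them to the list.
- For a list box (PdfListBoxWidgetFieldWidget), get its name, the value of the selected item, and all of its items, and add them to the list.
- For a drop-down list (PdfComboBoxWidgetFieldWidget), get its name, the value of the selected item, and all of its items, and add them to the list.
- For a radio button (PdfRadioButtonListFieldWidget), get its name and the value of the selected item, and add them to the list.
- For a check box (PdfCheckBoxWidgetFieldWidget), get its name and state (checked or unchecked), and add them to the list.
Use the open function to create a text file and write the list's contents to it.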

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()
# Load the PDF document
doc.LoadFromFile("Form.pdf")

# List that collects the extracted field names and values
content = []

# Get the form collection from the document
form = doc.Form
formWidget = PdfFormWidget(form)

# Iterate over every field in the form
if formWidget.FieldsWidget.Count > 0:
    for i in range(formWidget.FieldsWidget.Count):
        field = formWidget.FieldsWidget.get_Item(i)

        # Text box: name and value
        if isinstance(field, PdfTextBoxFieldWidget):
            textBoxField = field
            name = textBoxField.Name
            value = textBoxField.Text
            content.append(f"Text box name: {name}\n")
            content.append(f"Text box value: {value}\n\n")

        # List box: name, all options, and the selected item
        if isinstance(field, PdfListBoxWidgetFieldWidget):
            listBoxField = field
            name = listBoxField.Name
            content.append(f"List box name: {name}\n")
            content.append("List box options:\n")
            items = listBoxField.Values
            # Use j here so the outer loop index i is not shadowed
            for j in range(items.Count):
                item = items.get_Item(j)
                content.append(f"{item.Value}\n")
            selectedValue = listBoxField.SelectedValue
            content.append(f"List box selected item: {selectedValue}\n\n")

        # Drop-down list (combo box): name, all options, and the selected item
        if isinstance(field, PdfComboBoxWidgetFieldWidget):
            comBoxField = field
            name = comBoxField.Name
            content.append(f"Drop-down list name: {name}\n")
            content.append("Drop-down list options:\n")
            items = comBoxField.Values
            for j in range(items.Count):
                item = items.get_Item(j)
                content.append(f"{item.Value}\n")
            selectedValue = comBoxField.SelectedValue
            content.append(f"Drop-down list selected item: {selectedValue}\n\n")

        # Radio button: name and the selected item
        if isinstance(field, PdfRadioButtonListFieldWidget):
            radioBtnField = field
            name = radioBtnField.Name
            content.append(f"Radio button name: {name}\n")
            selectedValue = radioBtnField.SelectedValue
            content.append(f"Radio button selected item: {selectedValue}\n\n")

        # Check box: name and state
        if isinstance(field, PdfCheckBoxWidgetFieldWidget):
            checkBoxField = field
            name = checkBoxField.Name
            content.append(f"Check box name: {name}\n")
            status = checkBoxField.Checked
            if status:
                content.append("Check box state: checked\n\n")
            else:
                content.append("Check box state: unchecked\n\n")

# Write the list to a text file
with open("FormData.txt", "w", encoding="UTF-8") as file:
    file.writelines(content)

doc.Dispose()
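
The script above flattens everything into lines of text. If the extracted data will be processed further, it may be easier to collect one record per field and serialize the records as JSON. The sketch below is independent of any PDF library: the record keys (`type`, `name`, `value`, `options`) and both helper names are assumptions made for illustration, and the sample values are made up.

```python
import json


def make_record(field_type, name, value, options=None):
    """Build one dictionary describing a single form field."""
    record = {"type": field_type, "name": name, "value": value}
    if options is not None:
        record["options"] = list(options)
    return record


def dump_records(records, path):
    """Write the records to a UTF-8 JSON file, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)


# Made-up sample values standing in for what the field loop would read.
records = [
    make_record("text box", "Name", "Alice"),
    make_record("combo box", "Country", "Canada", options=["Canada", "Mexico"]),
    make_record("check box", "Subscribe", True),
]
print(json.dumps(records, ensure_ascii=False, indent=2))
```

In the real loop, each `content.append(...)` pair would become one `make_record(...)` call, and `dump_records(records, "FormData.json")` would replace the final `writelines`.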


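2. Reading the data of a specific PDF form field

Besides reading many fields at once, you can fetch a single field by its name or its index and then read its data. The following steps show how to get the name and value of one specific text box:

Create a PdfDocument instance.
Load the PDF document with the PdfDocument.LoadFromFile() method.
Get the document's form collection through the PdfDocument.Form property.
Create a list to hold the extracted form data.
Get the specific text box by name or by index.
Get the text box's name and value and add them to the list.
Use the open function to create a text file and write the list's contents to it.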

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument instance
doc = PdfDocument()

# Load the PDF document
doc.LoadFromFile("Form.pdf")

# List that collects the extracted field name and value
content = []

# Get the PDF form
form = doc.Form
formWidget = PdfFormWidget(form)

# Get the text box by name ("姓名", i.e. "Name", is the field name used in this example)
field = formWidget.FieldsWidget.get_Item("姓名")
# Or get the text box by index
# field = formWidget.FieldsWidget.get_Item(0)
textbox = PdfTextBoxFieldWidget(field.Ptr)

# Get the text box's name and value
name = textbox.Name
value = textbox.Text
content.append(f"Text box name: {name}\n")
content.append(f"Text box value: {value}")

# Save the result to a text file
with open("SpecificFormData.txt", "w", encoding="UTF-8") as file:
    file.writelines(content)

doc.Close()


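The code above shows how to extract data from the common types of PDF form fields with Python; you can extend it to match the field types in your own PDF documents.

Reading text, tables, and images from PDF files with Python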


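1. Reading text

Using fitz (PyMuPDF)

import fitz  # PyMuPDF

pdf_file = "example.pdf"
pdf_document = fitz.open(pdf_file)
text = ""
for page_number in range(len(pdf_document)):
    page = pdf_document.load_page(page_number)
    # Each block is (x0, y0, x1, y1, text, block_no, block_type)
    for block in page.get_text("blocks"):
        x0, y0, x1, y1 = block[0:4]
        text_block = block[4]
        block_type = block[6]
        # Filter blocks by their properties, for example to drop text that sits in tables.
        # This is only an example; refine it using block position and other attributes.
        if y1 - y0 < 20:  # skip short blocks (height under 20 points)
            continue
        if block_type == 1:  # skip image blocks (0 = text, 1 = image)
            continue
        text += text_block
pdf_document.close()
print(text)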
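
Because the height filter is just arithmetic on the block tuple, it can be pulled into a small function and tried out without opening a PDF. The sketch below assumes the `(x0, y0, x1, y1, text, block_no, block_type)` layout that `get_text("blocks")` returns; `keep_block`, `min_height`, and the sample tuples are illustrative.

```python
def keep_block(block, min_height=20):
    """Decide whether a block from page.get_text("blocks") should be kept."""
    x0, y0, x1, y1, text, block_no, block_type = block[:7]
    if block_type != 0:  # 0 = text block, 1 = image block
        return False
    return (y1 - y0) >= min_height


sample_blocks = [
    (72, 100, 500, 160, "A long paragraph.\n", 0, 0),  # 60 high: kept
    (72, 170, 200, 182, "cell\n", 1, 0),               # 12 high: dropped
    (72, 200, 300, 400, "<image: ...>", 2, 1),         # image block: dropped
]
kept_text = "".join(b[4] for b in sample_blocks if keep_block(b))
print(kept_text)
```

A 20-point threshold is arbitrary; it will also drop ordinary single-line paragraphs, so tune it against your own documents.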

2. Reading images

Using fitz (PyMuPDF)

import fitz
doc = fitz.open("example.pdf") # open a document
for page_index in range(len(doc)): # iterate over pdf pages
    page = doc[page_index] # get the page
    image_list = page.get_images()
    # print the number of images found on the page
    if image_list:
        print(f"Found {len(image_list)} images on page {page_index}")
    else:
        print("No images found on page", page_index)
    for image_index, img in enumerate(image_list, start=1): # enumerate the image list
        xref = img[0] # get the XREF of the image
        pix = fitz.Pixmap(doc, xref) # create a Pixmap
        if pix.n - pix.alpha > 3: # CMYK: convert to RGB first
            pix = fitz.Pixmap(fitz.csRGB, pix)
        pix.save("page_%s-image_%s.png" % (page_index, image_index)) # save the image as png
        pix = None

3. Reading tables


Using fitz (PyMuPDF), extracting table data as plain text

import fitz
doc = fitz.open("example.pdf") # open a document
out = open("output.txt", "wb") # create a text output
for page in doc: # iterate the document pages
    text = page.get_text().encode("utf8") # get plain text (is in UTF-8)
    out.write(text) # write text of page
    out.write(bytes((12,))) # write page delimiter (form feed 0x0C)
out.close()

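Using pdfplumber

import pdfplumber
import pandas as pd

# Open the PDF file; the context manager closes it when done
with pdfplumber.open("example.pdf") as pdf:
    # Access the second page (pages are zero-indexed)
    second_page = pdf.pages[1]
    # Detect tables automatically; returns a list of tables, each a list of rows
    tables = second_page.extract_tables(table_settings={})
    for table in tables:
        # Use the first row as the header and the remaining rows as data
        df = pd.DataFrame(table[1:], columns=table[0])
        print(df)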
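
pdfplumber hands back each table as a list of rows, and some cells may come back as `None` (for example around merged cells). If pandas is not needed, the rows can be turned into dictionaries with the standard library alone. In this sketch `rows_to_records` and the sample table are illustrative, and replacing `None` with an empty string is an assumption about what is convenient downstream.

```python
def rows_to_records(table):
    """Turn [header, row, row, ...] into a list of dicts keyed by header cell."""
    if not table:
        return []
    header = [cell if cell is not None else "" for cell in table[0]]
    records = []
    for row in table[1:]:
        cleaned = [cell if cell is not None else "" for cell in row]
        records.append(dict(zip(header, cleaned)))
    return records


sample_table = [["Name", "Score"], ["Ann", "90"], ["Bob", None]]
for record in rows_to_records(sample_table):
    print(record)
```

Note that `zip` silently truncates rows that are longer than the header, so ragged tables need extra handling.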

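Problems

AttributeError: 'PdfPageBase' object has no attribute 'ExtractText'

Explanation:

This error means that ExtractText was called on a PdfPageBase object that has no attribute or method of that name. PdfPageBase is the page class of Spire.PDF, the library used earlier in this article (it is not a PyPDF2 class), so the error indicates that the installed Spire.PDF version does not expose ExtractText directly on the page object.

Solution:

Make sure you call the right method on the right object. If you stay with Spire.PDF, check the documentation for your installed version for its current text-extraction API (recent releases appear to provide it through a separate PdfTextExtractor class rather than on the page itself). Alternatively, extract the text with another library: in PyPDF2, every page object has an extract_text() method, spelled in lowercase with an underscore, not ExtractText. Below is a simple example that extracts PDF text with PyPDF2:

import PyPDF2

# Open the PDF file
with open('example.pdf', 'rb') as file:
    # PdfReader replaces PdfFileReader, which was removed in PyPDF2 3.0
    reader = PyPDF2.PdfReader(file)

    # Iterate over every page of the PDF
    for page in reader.pages:
        print(page.extract_text())  # extract the text

When working with PyPDF2, make sure your code uses extract_text() rather than ExtractText. Python method names conventionally follow snake_case (lowercase words joined by underscores), while class names use CapWords. Libraries that mirror a .NET API, such as Spire.PDF, keep PascalCase method names, so check which convention the library at hand follows.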

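PyMuPDF: fixing AttributeError: 'Page' object has no attribute 'getText' when reading a PDF

Here is the failing code first:

import fitz
from tqdm import tqdm  # progress bar for the loop; optional

input_path = "example.pdf"  # path to your PDF
doc = fitz.open(input_path)
content = ''
for page in tqdm(doc):
    content += page.getText('html')

The cause is simple: in newer versions of the PyMuPDF package, the getText method has been renamed to get_text. Take note!!!

So the code becomes:

import fitz
from tqdm import tqdm  # progress bar for the loop; optional

input_path = "example.pdf"  # path to your PDF
doc = fitz.open(input_path)
content = ''
for page in tqdm(doc):
    content += page.get_text('html')

After that it runs fine!

A salute to the hour and a half I spent searching for this!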

Young_Lb

Reading text, tables, and images from PDF files with Python

nuclear2011

Quickly reading form data from PDFs with Python

Evalikepython

PyMuPDF: fixing AttributeError: 'Page' object has no attribute 'getText' when reading a PDF

Source: https://blog.csdn.net/weixin_54626591/article/details/139752459
