[949] Using re to extract unstructured tables of PDF files

Date: 2023-11-22 11:34:57

Here is the problem: this unstructured table in a PDF file cannot be extracted as a table directly. We can only extract the full text of each page.

My task is to extract the Place ID, Place Name, and Title Details. Only the Title Details entries that follow a pattern like 00XXX0000 (digits + letters + digits) are kept.
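As a minimal sketch of the "digits + letters + digits" filter, the snippet below applies a regular expression to a few sample strings. The helper name `extract_lot_plans` and the lot/plan values such as "12RP34567" are hypothetical and chosen only for illustration; the "Refer to ..." string is the one quoted in the script's comments.

```python
import re

# Digits, then letters, then digits, e.g. "12RP34567" (made-up sample value)
LOT_PLAN = re.compile(r"([0-9]+)([a-zA-Z]+)([0-9]+)")

def extract_lot_plans(title_details):
    """Return (lot, plan) tuples for every lot/plan code found in the text."""
    return [
        (lot, letters + digits)
        for lot, letters, digits in LOT_PLAN.findall(title_details)
    ]

samples = [
    "12RP34567",
    "Refer to Queensland Heritage Register Place ID 600545",
    "1SP123456, 2SP123456",
]
for sample in samples:
    # Entries without a digits + letters + digits code give an empty list
    print(sample, "->", extract_lot_plans(sample))
```

An entry whose Title Details contain no such code produces an empty list, so it can simply be skipped.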

Another issue: the extracted text contains stray \n and \n\n sequences.
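The newline clean-up can be sketched on a small hand-written string. The script treats the "\n \n" sequence as a cell boundary and replaces it with a visible "#####" marker, then removes the remaining line breaks. The sample text and the helper name `normalise` are assumptions for illustration; real output depends on the PDF and on the extraction library.

```python
def normalise(text):
    """Mark cell boundaries with '#####' and drop the remaining line breaks."""
    marked = text.replace("\n \n", "#####")
    return marked.replace("\n", "")

# Made-up sample resembling the text of one table row
raw = "1\n \nHoward War\n Memorial\n \nCnr William and\n \nSteley Streets Howard"
print(normalise(raw))
# prints: 1#####Howard War Memorial#####Cnr William and#####Steley Streets Howard
```

The order matters: replacing "\n" first would destroy the "\n \n" boundaries before they could be marked.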

The script:

import re

import pandas as pd
import PyPDF2

# Specify the path to the PDF file
pdf_path = r"D:\Bingnan_Li\01_Tasks\11_20231109_PDF_reading\Planning_LGA\Fraser Coast Regional Council\DOCSHBCC__3131535_v6_Cover_sheet_of_Local_Heritage_Register_.pdf"

# Extract the text from the PDF file page by page
with open(pdf_path, "rb") as file:
    # Create a PDF reader object (PdfFileReader, getPage and extractText
    # were removed in PyPDF2 3.0 in favour of the names used below)
    pdf_reader = PyPDF2.PdfReader(file)

    page_text = ""

    # Read the pages at zero-based indices 2 to 6
    for i in range(2, 7):
        page = pdf_reader.pages[i]
        page_text += page.extract_text()

a = page_text
# To make matching easier, replace "\n \n" with the marker "#####" and remove the remaining "\n"
a = a.replace("\n \n", "#####")
a = a.replace("\n", "")
# Delete the "*" characters in the text
a = a.replace("*", "")

# Try to match text like this:
# "#####1#####Howard War Memorial#####Cnr William and#####Steley Streets Howard#####Refer to Queensland Heritage Register Place ID 600545#####A, B, D, E, G#####2##########"
# (###[#]+[\d]{1,3}###[#]+) matches the opening delimiter, e.g. "#####1#####"
# (.*?) lazily matches the entry text in between
# (###[#]+[\d]{1,3}###[#]+) matches the closing delimiter, e.g. "#####2##########"
# [\d]{1,3} matches a number of one to three digits
pattern = r"(###[#]+[\d]{1,3}###[#]+)(.*?)(###[#]+[\d]{1,3}###[#]+)"

# Create an empty DataFrame
df = pd.DataFrame(columns=["ID", "Heritage Name", "Lot", "Plan", "LotPlan"])

# Get all the matches
# re.findall() is not suitable here: matches cannot overlap, and each match
# consumes its closing delimiter, so the entry starting at "#####2##########" would be missed
# Instead, find only the first match each time, then drop the text in front of
# its closing delimiter and search again
# Repeating this until nothing matches yields every entry
while True:
    match = re.search(pattern, a)
    if not match:
        break
    print(match.groups()[0], match.groups()[1])
    
    # From the Title Details, we need to match the lot and the plan
    pattern_2 = r"([0-9]+)([a-zA-Z]+)([0-9]+)"
    matches_2 = re.findall(pattern_2, match.groups()[1])
    
    for m_2 in matches_2:
        # Add this information to the DataFrame
        df.loc[len(df)] = [match.groups()[0].replace("#", ""), 
                           match.groups()[1].split("#####")[0], 
                           m_2[0], 
                           m_2[1] + m_2[2], 
                           m_2[0]+m_2[1]+m_2[2]]

    # Continue from the start of the closing delimiter, which opens the next entry
    a = a[match.start(3):]
    
# drop_duplicates() returns a new DataFrame, so assign the result back
df = df.drop_duplicates()
df.index = range(len(df))
df
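As an alternative sketch (not the approach used in the script), the closing delimiter can be placed in a lookahead so that it is not consumed, which lets `re.finditer` return every entry in one pass without re-slicing the string. The sample text starts with the string quoted in the script's comments; the second entry ("Example Hall", "12RP34567") and the names `DELIM`, `consuming`, `lookahead`, and `entry_ids` are made up for illustration.

```python
import re

# Delimiter around an entry number, e.g. "#####1#####" or "#####2##########"
DELIM = r"###[#]+\d{1,3}###[#]+"

# The pattern from the script: the closing delimiter is consumed by the match
consuming = re.compile(rf"({DELIM})(.*?)({DELIM})")
# Variant: the closing delimiter is only checked with a lookahead, not consumed
lookahead = re.compile(rf"({DELIM})(.*?)(?={DELIM})")

text = (
    "#####1#####Howard War Memorial#####Cnr William and#####Steley Streets Howard"
    "#####Refer to Queensland Heritage Register Place ID 600545#####A, B, D, E, G"
    "#####2##########Example Hall#####1 Example Street#####12RP34567#####A, B"
    "#####3##########"
)

def entry_ids(pattern):
    """Return the entry numbers of all non-overlapping matches in text."""
    return [m.group(1).replace("#", "") for m in pattern.finditer(text)]

print(entry_ids(consuming))  # ['1']       entry 2 is skipped
print(entry_ids(lookahead))  # ['1', '2']  every entry with a following delimiter
```

With the lookahead variant, `m.group(2)` holds the same middle text that the script reads from `match.groups()[1]`, so the rest of the processing would stay unchanged.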

From: https://www.cnblogs.com/alex-bn-lee/p/17848584.html
