
Reading text and tables from a Word document in order with Python

Time: 2023-09-03 15:57:13
Tags: docx, word, parent, Python, import, document, table, path

import os
import docx

from docx.document import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph


def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError(
            f"parent must be a Document or _Cell, got {type(parent).__name__}"
        )

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)


def read_table(table):
    """Return the table as a list of rows, each row a list of cell text strings."""
    return [[cell.text for cell in row.cells] for row in table.rows]


def read_word(word_path):
    """Print every top-level paragraph and table of the file, in document order."""
    doc = docx.Document(word_path)
    for block in iter_block_items(doc):
        if isinstance(block, Paragraph):
            print("text", [block.text])
        elif isinstance(block, Table):
            print("table", read_table(block))


if __name__ == '__main__':
    ROOT_DIR_P = os.path.abspath(os.path.dirname(os.path.dirname(__file__)))  # project root directory
    # word_path = os.path.join(ROOT_DIR_P, "data/test_to_word.docx")  # path and filename of the Word file
    word_path = r'e:/学生错题归集/word/第一周考试.docx'
    # word_path = os.path.join(ROOT_DIR_P, "data/test_to_word2.docx")  # path and filename of the Word file
    read_word(word_path)
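
The script above preserves document order because it walks the children of the body element directly, instead of reading the separate paragraph and table collections. As a rough illustration of that idea only, the sketch below does the same walk with the standard library (zipfile and xml.etree.ElementTree) rather than python-docx. The names iter_body_blocks, element_text and make_sample_docx are hypothetical helpers invented for this example, and the in-memory sample contains only word/document.xml, so it is test data for the parser and not a file Word itself would open. The sketch assumes paragraphs and tables sit directly under the body, and it ignores nested tables and content controls.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def element_text(elm):
    """Concatenate the text of every w:t descendant of *elm*."""
    return "".join(t.text or "" for t in elm.iter(W + "t"))


def iter_body_blocks(docx_file):
    """Yield ("text", str) or ("table", rows) for each top-level block, in order.

    *docx_file* may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    body = root.find(W + "body")
    for child in body:  # direct children only, so document order is kept
        if child.tag == W + "p":
            yield "text", element_text(child)
        elif child.tag == W + "tbl":
            rows = []
            for tr in child.findall(W + "tr"):
                # One string per cell; paragraphs inside a cell are joined by newlines
                rows.append([
                    "\n".join(element_text(p) for p in tc.findall(W + "p"))
                    for tc in tr.findall(W + "tc")
                ])
            yield "table", rows


def make_sample_docx():
    """Build a tiny in-memory package holding only word/document.xml (test data)."""
    xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        '<w:p><w:r><w:t>before</w:t></w:r></w:p>'
        '<w:tbl><w:tr>'
        '<w:tc><w:p><w:r><w:t>a</w:t></w:r></w:p></w:tc>'
        '<w:tc><w:p><w:r><w:t>b</w:t></w:r></w:p></w:tc>'
        '</w:tr></w:tbl>'
        '<w:p><w:r><w:t>after</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", xml)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for kind, value in iter_body_blocks(make_sample_docx()):
        print(kind, value)
```

For real files the python-docx version above is likely the better choice, since it returns Paragraph and Table objects with styles and runs attached. The standard-library walk is mainly useful for seeing what "document order" means at the XML level.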

 

From: https://www.cnblogs.com/QQ-77Ly/p/17675052.html
