利用ocr将pdf转为docx

时间：2022-11-04 21:34:48浏览次数：98

标签：docx img image file time pdf ocr

利用ocr将pdf转为docx

项目地址：https://github.com/jiangnanboy/pdf_to_docx

背景

该项目首先将pdf文件转为图片形式，再使用百度的paddleocr对这些图片文件分别进行识别，利用PPStructure对识别的内容进行结构化，最终将结构化的内容保存成docx文件，最后将所有docx文件进行合并，形成一个docx文件。

代码结构

这里将paddleocr源码拿过来进行了相应修改，以支持中文路径

├── pdf_recovery_doc.py  # 主程序，用于pdf到docx转换
├── image_process.py    # 用于pdf到image转换

主程序

def pdf2doc(pdf_file):
    '''
    pdf2doc
    :param pdf_file:
    :return:
    '''
    table_engine = PPStructure(recovery=True, lang='ch', enable_mkldnn=True)
    start_time = time.time()
    if pdf_file.endswith('.pdf') or pdf_file.endswith('.PDF'):
        pdf_file_base_name = os.path.splitext(pdf_file)[0]
        image_path = pdf_file_base_name + '/' + 'images'

        # 1.pdf to image
        pdf_image(pdf_file, image_path)
        image_files = os.listdir(image_path)

        # 2.image to docx
        for img_file in image_files:
            if img_file.endswith('png'):
                img = cv2.imdecode(np.fromfile(os.path.join(image_path, img_file), dtype=np.uint8), -1)
                img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
                if img is not None:
                    result = table_engine(img)
                    save_structure_res(result, pdf_file_base_name, img_file.split('.')[0])
                    h, w, _ = img.shape
                    res = sorted_layout_boxes(result, w)
                    convert_info_docx(img, res, pdf_file_base_name, img_file.split('.')[0])

        # 3.merge all docx
        merge_docx_v1(pdf_file_base_name)

    end_time = time.time()
    print('pdf转换doc时间: {}'.format(end_time - start_time))

例子

pdf文件

docx文件

标签：docx,img,image,file,time,pdf,ocr
From： https://www.cnblogs.com/little-horse/p/16859178.html

深入浅出DPDK 电子书 pdf
作者:朱河清/梁存铭/胡雪焜出版社:机械工业出版社链接：深入浅出DPDK 近年来，随着网络技术的不断创新和市场的发展，越来越多的网络设备基础架构开始向基于通......
如何通过Java将PDF转为Excel
当您收到一份PDF格式的表格后，却又想要对表格内容进行某项操作时（例如更改数据、改变表格样式等），将其转换成Excel文档格式可能会更加便捷。FreeSpire.PDFforJava就可以实......
异步框架tornado下使用pyppeteer将动态html转化为pdf
项目背景:云上服务器存储html，前端通过传递给后端html_url,由后端服务器获取html文件进行渲染，生成pdf，然后将pdf上传云上服务器。使用的框架/库:tornado/pyppeteer/......
pandoc+py+pdf
迟来的文章，想开始写文了[奸笑]Markdown都再熟悉不过了吧！用Markdown写笔记转pdf[bash]foriin{01..38};dotouch$i.md生成38个.md文件假设有.md的笔记38个想......
qt输出自定义的pdf文件源码详解
qt中有两种方式可以输出pdf：方式1：使用QPrinter即打印机的方式打印pdf这种方式，在qt4成为唯一的方式。QPrinterprinter(QPrinter::HighResolution);//高清晰度printer.set......
操作系统导论问题答案 pdf
链接：操作系统导论问题答案 ......
深入理解LINUX内核第三版电子书 pdf
作者:（美）博韦，西斯特出版社:中国电力出版社原作名:UnderstandingtheLinuxKernel译者:陈莉君;张琼声;张宏伟链接：深入理解LINUX内核第三版为了彻底理解......
C++17 The Complete Guide 电子书 pdf
作者:[德]NicolaiM·Josuttis 链接：C++17TheCompleteGuide ......
Memory systems Cache DRAM Disk 电子书 pdf
作者:BruceJacob/SpencerNg/DavidWang出版社:MorganKaufmann副标题:Cache,DRAM,Disk 链接：MemorysystemsCacheDRAMDisk Isyourmemoryhierarc......
《神经网络与深度学习》最新2018版中英PDF+源码
机器学习AI算法工程公众号：datayx资料获取1.关注微信公众号datayx 然后回复深度学习即可获取。不断更新资源深度学习、机器学习、数据分析、python搜索公众号添加： ......

利用ocr将pdf转为docx

利用ocr将pdf转为docx

背景

代码结构

主程序

例子

pdf文件

docx文件

相关文章

赞助商

阅读排行