[ERNIE + PaddleOCR] Build Your Own Paper Dictionary and Write Better Papers!

4.1 Environment Setup


%cd ~

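# I recommend first downloading the PaddleOCR source from GitHub (https://github.com/PaddlePaddle/PaddleOCR.git); I upload a copy here.
# Avoid the PaddleOCR package in the left-hand toolkit panel for now; that version has assorted small problems.
# I have already downloaded the source code and uploaded it, so it only needs to be unzipped.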
!unzip PaddleOCR-main.zip

# Install the dependencies the source requires
%cd ~/PaddleOCR-main
!pip install --user -r requirements.txt
!pip install --user paddleclas PyMuPDF==1.19.0

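# This package downloads very slowly, so I downloaded it locally, uploaded it, and installed it from the file. Online installs may fail because of network problems; retry a few times, or download the package and install it offline.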
# !pip install /home/aistudio/work/opencv_python_headless-

# Run setup to install
# ******  Note: if the install times out or otherwise fails, it is most likely a network problem. Please retry a few times. ******
!python setup.py build install
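
After the install steps, a quick check can confirm that the packages are visible to the Python environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution names in the loop are assumptions based on the pip commands above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names are assumed from the install commands above.
    for name in ("paddleocr", "paddleclas", "PyMuPDF"):
        print(name, installed_version(name) or "NOT INSTALLED")
```

If a package shows as missing right after a `--user` install in a notebook, restarting the kernel may be needed before it becomes importable.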

4.2 PaddleOCR Layout Analysis and Text Recognition


# My PDF file is in the work folder
%cd ~/work

# Run layout analysis and text recognition from the command line; layout recovery of the document is not needed
!paddleocr --image_dir=molecular.pdf --type=structure --recovery=false --lang='en'


%cd ~/work

import os
import cv2
import numpy as np
from paddleocr import PPStructure,save_structure_res
from paddle.utils import try_import
from PIL import Image

ocr_engine = PPStructure(table=False, ocr=True, show_log=True)

save_folder = './output'
img_path = 'molecular.pdf'

fitz = try_import("fitz")
imgs = []
with fitz.open(img_path) as pdf:
    for pg in range(0, pdf.page_count):
        page = pdf[pg]
        mat = fitz.Matrix(2, 2)
        pm = page.get_pixmap(matrix=mat, alpha=False)

        # if width or height > 2000 pixels, don't enlarge the image
        if pm.width > 2000 or pm.height > 2000:
            pm = page.get_pixmap(matrix=fitz.Matrix(1, 1), alpha=False)

        img = Image.frombytes("RGB", [pm.width, pm.height], pm.samples)
        img = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
        imgs.append(img)  # collect each rendered page for OCR

for index, img in enumerate(imgs):
    result = ocr_engine(img)
    save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0], index)
    for line in result:
        line.pop('img')  # drop the cropped image array before printing
        print(line)

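After reading the source, I found why the code fails: it does not check for PDF input, and its file-type check is flawed, so it currently works only for single-page images or GIFs. I modified the relevant code and submitted a pull request that fixes the error on direct use.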
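
The fix described above comes down to checking the input type before decoding it. Below is a rough sketch of that kind of check. It is not the code from the actual pull request; the function name `detect_input_kind` and the returned labels are hypothetical.

```python
import os


def detect_input_kind(path):
    """Classify an input file by extension so each kind can get its own loader."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".pdf":
        return "pdf"    # multi-page: render every page to an image first
    if ext == ".gif":
        return "gif"    # animated format: needs its own reader
    return "image"      # anything else is treated as a single image


print(detect_input_kind("molecular.pdf"))
```

A caller would branch on the returned label and render a PDF page by page, as in the PyMuPDF loop above, instead of passing the file to an image reader.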

4.3 Dictionary Generation


# Read the corresponding text results from output

%cd ~
import os
import re
import dill
# Assume all txt files are in this directory
directory = 'work/output/molecular'
# Collect all txt files in the directory
txt_files = [f for f in os.listdir(directory) if f.endswith('.txt')]
# Sort by page index (res_0.txt, res_1.txt, ...)
txt_files.sort(key=lambda x: int(re.search(r'res_(\d+)\.txt', x).group(1)))
result = {}
result_text = []
# Iterate over each txt file
for filename in txt_files:
    file_path = os.path.join(directory, filename)
    line_list = []
    with open(file_path, 'r', encoding='utf-8') as file:
        # Read the file line by line
        for line in file:
            # Try to convert the string to a dict
            try:
                data_dict = eval(line)
                # Process the dict data here
                if data_dict['type'] == 'text' and isinstance(data_dict['res'], list):
                    for res in data_dict['res']:
                        # Assumed loop body: keep the recognized text of each line
                        line_list.append(res['text'])
                        result_text.append(res['text'])
            except Exception as e:
                # If the conversion fails, print the error
                print(f"Error evaluating line in {filename}: {e}")
    result[filename] = line_list

# Save the data files
dill.dump(result, open('result.pkl', 'wb'))
dill.dump(result_text, open('result_text.pkl', 'wb'))
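
Each line of a `res_N.txt` file holds one layout region as a dict literal. As a sketch, `ast.literal_eval` can stand in for `eval` so that only literals are parsed and no code is executed. The helper name `extract_texts` and the sample line are made up for illustration, and the `text` key is assumed from the PP-Structure output layout.

```python
import ast


def extract_texts(line):
    """Return the recognized strings of one region line, or [] for non-text regions."""
    region = ast.literal_eval(line)
    if region.get("type") != "text" or not isinstance(region.get("res"), list):
        return []
    # Keep only entries that are dicts carrying a 'text' field.
    return [item["text"] for item in region["res"]
            if isinstance(item, dict) and "text" in item]


# Hypothetical sample line, shaped like one region record.
sample = '{"type": "text", "bbox": [0, 0, 10, 10], "res": [{"text": "Molecular graphs", "confidence": 0.99}]}'
print(extract_texts(sample))
```

The function could replace the body of the `try` block above, with its return value appended to both `line_list` and `result_text`.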



%cd ~
!pip install spacy

# I recommend downloading the model package in advance; the online download is too slow
# !pip install work/en_core_web_trf-3.7.3-py3-none-any.whl


%cd ~
import dill
import re
import spacy  

# Load spaCy's English model
nlp = spacy.load('en_core_web_trf')
def process_sentences(sentences):
    # Join the short sentences and process them
    processed_text = ' '.join(sentences)
    doc = nlp(processed_text)
    # Extract words, skipping stop words and non-alphabetic tokens
    words = [token.text for token in doc if not token.is_stop and token.is_alpha]
    return words
result_text = dill.load(open('result_text.pkl', 'rb'))
# Extract and process the words
words = process_sentences(result_text)
# Save the extracted words
dill.dump(words, open('words.pkl', 'wb'))
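
If downloading `en_core_web_trf` is not an option, the same filter (alphabetic tokens only, no stop words) can be roughly approximated with the standard library. This is a stand-in, not spaCy's tokenizer: the stop-word set below is a tiny illustrative sample and `simple_words` is a hypothetical name.

```python
import re

# Tiny illustrative stop-word sample; spaCy's own list is much larger.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "for", "on", "with"}


def simple_words(sentences):
    """Join sentences, split into word tokens, keep alphabetic non-stop words."""
    text = " ".join(sentences)
    tokens = re.findall(r"\b\w+\b", text)
    return [t for t in tokens if t.isalpha() and t.lower() not in STOP_WORDS]


print(simple_words(["The 2D graph of a molecule", "is encoded with a Transformer."]))
```

Tokens such as `2D` are dropped here because they are not purely alphabetic, which mirrors the `token.is_alpha` condition above.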


# Generate the dictionary
from collections import defaultdict, OrderedDict
import dill

words = dill.load(open('words.pkl', 'rb'))
count_dict = defaultdict(int)
for word in words:
    count_dict[word] += 1
# Sort by count in descending order
sorted_dict = OrderedDict(sorted(count_dict.items(), key=lambda x: x[1], reverse=True))

# Keep only words that appear more than 5 times, formatted as word:count
filtered_words = [k+':'+str(v) for k, v in sorted_dict.items() if v > 5]
dict_str = ';'.join(filtered_words)
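
The counting, sorting, and filtering steps above can also be written with `collections.Counter`. This is an equivalent sketch, not the author's code; `build_dict_str` and `min_count` are hypothetical names, and the demo word list is made up.

```python
from collections import Counter


def build_dict_str(words, min_count=5):
    """Count words and join those seen more than min_count times as 'word:count;...'."""
    counts = Counter(words)
    # most_common() yields (word, count) pairs in descending count order.
    return ";".join(f"{w}:{c}" for w, c in counts.most_common() if c > min_count)


demo = ["molecule"] * 7 + ["graph"] * 6 + ["atom"] * 2
print(build_dict_str(demo))  # molecule:7;graph:6
```

Counting is case-sensitive in both versions, so lowercasing the words first would merge variants such as `Molecule` and `molecule`, if that is wanted.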


4.4 ERNIE Translation

Next, use ERNIE Bot to translate sentences, giving it the prepared dictionary.

# Install ERNIE Bot
!pip install --upgrade erniebot


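# Design the prompt

# English gloss: "Here is the dictionary:"
prompt = "现在我提供字典:"
prompt += dict_str
# English gloss: "Now I will provide an English sentence for you to translate. The English sentence is:"
prompt += "现在我提供英文语句,你来翻译。英文语句为:"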
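
The prompt is plain string concatenation: an instruction, the dictionary string, then the sentence. The sketch below wraps this in a function and caps the number of dictionary entries so the prompt stays short. `build_prompt`, `max_entries`, and the English instruction text are assumptions for illustration, not the author's exact prompt.

```python
def build_prompt(dict_str, sentence, max_entries=200):
    """Assemble a translation prompt from a 'word:count;...' dictionary string."""
    # Keep at most max_entries entries; the string is already sorted by frequency.
    entries = [e for e in dict_str.split(";") if e][:max_entries]
    return (
        "Here is a dictionary of frequent terms (word:count): "
        + ";".join(entries)
        + "\nTranslate the following sentence, preferring these terms: "
        + sentence
    )


print(build_prompt("molecule:7;graph:6;atom:2", "example sentence", max_entries=2))
```

Because the dictionary string is ordered by frequency, truncating it keeps the most common terms.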



import erniebot

models = erniebot.Model.list()

# Set authentication params
erniebot.api_type = "aistudio"
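erniebot.access_token = "your token"

# Sentence to translate. English gloss: "For a molecule, its 2D and 3D forms describe
# the same set of atoms, but use different features of the structure."
content = (
    "对于分子,它的 2D 和 3D 形式描述了相同的原子集合,但使用结构的不同特征。"
)

# Create a chat completion
response = erniebot.ChatCompletion.create(
    model="ernie-3.5",  # assumed model name; pick one listed in `models`
    messages=[{"role": "user", "content": prompt+content}]
)
print(response.get_result())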


# Without the dictionary
# For molecules, their 2D and 3D forms represent the same set of atoms but use different structural features. Therefore, the key challenge lies in capturing the structural knowledge from different formulas and training the parameters to learn from the two sources of information in an expressive and compatible manner.

# With the dictionary
# For molecules, their 2D and 3D forms describe the same set of atoms, but use different characteristics of the structure. Therefore, the key challenge is to capture structural knowledge in different formulas and train parameters to be expressive and compatible when learning from both sources of information.






From: https://blog.csdn.net/class4715/article/details/139207622
