RAG workflow:
Offline:
1. Document loading
2. Document splitting
3. Vectorization
4. Loading the data into the vector database
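The offline steps above can be sketched roughly as follows. This is a toy illustration only: `embed` is a hypothetical stand-in that counts words instead of calling a real embedding model, `build_index` is a hypothetical helper, and the "vector database" is a plain Python list.

```python
from collections import Counter

def embed(text):
    # Hypothetical stand-in for an embedding model: a bag-of-words count.
    return Counter(text.lower().split())

def build_index(chunks):
    # Stand-in for loading a vector database: a list of (vector, chunk) pairs.
    return [(embed(chunk), chunk) for chunk in chunks]

# Pretend these chunks came out of the loading and splitting steps.
chunks = [
    "RAG retrieves text before generation.",
    "Documents are split into chunks first.",
]
vector_store = build_index(chunks)
print(len(vector_store))
```

A real pipeline would replace `embed` with an embedding model and the list with a vector database, but the order of the steps stays the same.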
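Online:
1. Get the user's question
2. Vectorize the user's question
3. Search the vector database
4. Fill the retrieved results and the question into the prompt template
5. Call the LLM with the final prompt
6. The LLM generates the final reply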
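Filling the prompt template (step 4 of the online flow) amounts to string templating. Here is a minimal sketch, assuming a hypothetical template wording and a hypothetical helper `build_prompt`. The retrieved chunks are hard-coded placeholders, and the LLM call itself is left out.

```python
# Hypothetical template; the wording is an assumption, not taken from the post.
PROMPT_TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question:
{question}
"""

def build_prompt(question, retrieved_chunks):
    # Join the retrieved chunks and substitute them, with the question,
    # into the template.
    context = "\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    "What does the document say?",
    ["first retrieved chunk", "second retrieved chunk"],
)
print(prompt)
```

The resulting string is what would be passed to the LLM in step 5.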
This post covers document loading and splitting (loading and splitting a PDF).
1. Document loading
Loading a PDF:
The sample file is llama2.pdf.
Install the PDF reading package:
pip install pdfminer.six

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer


# Extract text from a PDF and merge its lines into paragraphs
def extract_text_from_pdf(pdf_path, page_numbers=None, min_line_length=1):
    paragraphs = []
    buff = ''
    full_text = ''
    # Collect the text of every text container on the selected pages
    for i, page_layout in enumerate(extract_pages(pdf_path)):
        # Skip pages that were not requested (page indexes start at 0)
        if page_numbers is not None and i not in page_numbers:
            continue
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                full_text += element.get_text() + '\n'
    # Merge consecutive lines; a line shorter than min_line_length ends a paragraph
    lines = full_text.split('\n')
    for line in lines:
        if len(line) >= min_line_length:
            if buff.endswith('-'):
                # Rejoin a word that was hyphenated across a line break
                buff = buff[:-1] + line
            elif buff:
                buff += ' ' + line
            else:
                buff = line
        elif buff:
            paragraphs.append(buff)
            buff = ''
    if buff:
        paragraphs.append(buff)
    return paragraphs


# Call the function and print the first four paragraphs
paragraphs = extract_text_from_pdf('llama2.pdf', min_line_length=4)
for para in paragraphs[:4]:
    print(para + '\n')
Run it in a terminal with py .\pdfread.py to display the result.
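The line-merging idea can also be tried without a PDF. The sketch below is a standalone helper (`merge_lines` is a hypothetical name, and the sample text is made up) that works on an in-memory string: a line shorter than `min_line_length` ends the current paragraph, and a line ending in a hyphen is joined directly to the next one. Treating the trailing hyphen as a broken word is an assumption about the intent.

```python
def merge_lines(text, min_line_length=1):
    paragraphs = []
    buff = ''
    for line in text.split('\n'):
        if len(line) >= min_line_length:
            if buff.endswith('-'):
                # Assumed intent: a trailing hyphen marks a word broken
                # across lines, so drop it and join without a space.
                buff = buff[:-1] + line
            elif buff:
                buff += ' ' + line
            else:
                buff = line
        elif buff:
            # A short (here: empty) line closes the current paragraph.
            paragraphs.append(buff)
            buff = ''
    if buff:
        paragraphs.append(buff)
    return paragraphs

sample = (
    "Retrieval grounds the gen-\n"
    "eration step in documents.\n"
    "\n"
    "Second paragraph here."
)
paragraphs = merge_lines(sample, min_line_length=4)
for para in paragraphs:
    print(para)
```

One side effect of this rule is that a genuine hyphen at the end of a line (as in a compound word) is removed too, so the output may need a spot check on real documents.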
PDF loading and splitting is complete.
From: https://www.cnblogs.com/goldball/p/18181318