How to pass multimodal data directly to models
https://python.langchain.com/v0.2/docs/how_to/multimodal_inputs/
Here we demonstrate how to pass multimodal input directly to models. We currently expect all input to be passed in the same format as OpenAI expects. For other model providers that support multimodal input, we have added logic inside the class to convert to the expected format.
In this example we will ask a model to describe an image.
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o")

The most commonly supported way to pass in an image is as a base64-encoded string. This should work for most model integrations.
import base64
import httpx
image_data = base64.b64encode(httpx.get(image_url).content).decode("utf-8")

message = HumanMessage(
    content=[
        {"type": "text", "text": "describe the weather in this image"},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
        },
    ],
)
response = model.invoke([message])
print(response.content)

The weather in the image appears to be clear and pleasant. The sky is mostly blue with scattered, light clouds, suggesting a sunny day with minimal cloud cover. There is no indication of rain or strong winds, and the overall scene looks bright and calm. The lush green grass and clear visibility further indicate good weather conditions.
We can feed the image URL directly in a content block of type "image_url". Note that only some model providers support this.
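For providers that accept remote URLs, the content block carries the URL itself instead of a base64 payload. A minimal sketch, reusing the model and image_url defined above:

message = HumanMessage(
    content=[
        {"type": "text", "text": "describe the weather in this image"},
        {"type": "image_url", "image_url": {"url": image_url}},
    ],
)
response = model.invoke([message])
print(response.content)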
Multi-Vector Retriever for RAG on tables, text, and images
https://blog.langchain.dev/semi-structured-multi-modal-rag/
Summary
Seamless question-answering across diverse data types (images, text, tables) is one of the holy grails of RAG. We're releasing three new cookbooks that showcase the multi-vector retriever for RAG on documents that contain a mixture of content types. These cookbooks also present a few ideas for pairing multimodal LLMs with the multi-vector retriever to unlock RAG on images.
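The core pattern in those cookbooks is to embed short summaries for retrieval while keeping the raw elements (text chunks, tables, image references) in a separate docstore. A rough sketch of that wiring with LangChain's MultiVectorRetriever; the raw_elements and summaries lists below are made-up placeholders for whatever an LLM summarization step would produce:

import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Hypothetical inputs: each raw element (text chunk, table, or image reference)
# paired with an LLM-written summary of it.
raw_elements = ["<raw table html>", "<raw text chunk>"]
summaries = ["A table of quarterly revenue.", "A paragraph describing the method."]

# The vector store indexes the summaries; the docstore keeps the raw elements.
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
docstore = InMemoryStore()
id_key = "doc_id"

retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=docstore, id_key=id_key)

doc_ids = [str(uuid.uuid4()) for _ in raw_elements]
summary_docs = [
    Document(page_content=summary, metadata={id_key: doc_ids[i]})
    for i, summary in enumerate(summaries)
]
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, raw_elements)))

# Retrieval searches the summaries but returns the corresponding raw elements.
docs = retriever.invoke("What was quarterly revenue?")

The same structure carries over to images: store an image summary (or CLIP embedding) in the vector store and the image path or base64 payload in the docstore, then hand the retrieved raw image to a multimodal LLM.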
https://developer.volcengine.com/articles/7387287884799148073
Vectorizing images
Below is a brief walkthrough of the key steps:
Step 1: Extract images from the PDF
Use the unstructured library to extract the PDF's content and build lists of text and images. The extracted images are stored in a designated folder.
# Extract images, tables, and chunk text
from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    filename="LCM_2020_1112.pdf",
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=path,
)
Step 2: Create the vector store
Set up the vector store, then add the image URIs and the texts to it.
import os

from langchain_community.vectorstores import Chroma
from langchain_experimental.open_clip import OpenCLIPEmbeddings

# Create chroma
vectorstore = Chroma(
    collection_name="mm_rag_clip_photos",
    embedding_function=OpenCLIPEmbeddings(),
)

# Get image URIs with .jpg extension only
image_uris = sorted(
    [
        os.path.join(path, image_name)
        for image_name in os.listdir(path)
        if image_name.endswith(".jpg")
    ]
)
print(image_uris)

# Add images
vectorstore.add_images(uris=image_uris)

# Add documents
vectorstore.add_texts(texts=texts)
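Because the images and texts now share the same CLIP embedding space, a plain similarity search can return hits from either modality. A small illustrative query (the query string is made up, not from the article):

# Query the mixed image/text collection; for image hits the returned
# page_content holds the stored image data rather than readable text.
retrieved = vectorstore.similarity_search("animals near the water", k=4)
for doc in retrieved:
    print(doc.page_content[:80])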
Tags: RAG, models, text, image, multimodal, images, data
From: https://www.cnblogs.com/lightsong/p/18341289