
How to pass multimodal data directly to models

Date: 2024-08-03 23:16:58
Tags: RAG, models, text, image, multimodal, How, images, data

https://python.langchain.com/v0.2/docs/how_to/multimodal_inputs/

Here we demonstrate how to pass multimodal input directly to models. We currently expect all input to be passed in the same format as OpenAI expects. For other model providers that support multimodal input, we have added logic inside the class to convert to the expected format.

In this example we will ask a model to describe an image.

image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
 
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o")

The most commonly supported way to pass in an image is as a base64-encoded string embedded in a data URL. This should work for most model integrations.

import base64

import httpx

image_data = base64.b64encode(httpx.get(image_url).content).decode("utf-8")
 
message = HumanMessage(
    content=[
        {"type": "text", "text": "describe the weather in this image"},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
        },
    ],
)
response = model.invoke([message])
print(response.content)
 
The weather in the image appears to be clear and pleasant. The sky is mostly blue with scattered, light clouds, suggesting a sunny day with minimal cloud cover. There is no indication of rain or strong winds, and the overall scene looks bright and calm. The lush green grass and clear visibility further indicate good weather conditions.
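
The same data-URL pattern can be applied to an image stored on disk. The following is a minimal sketch, not part of the original guide: the helper names `image_file_to_data_url` and `image_content_block` are hypothetical, and it assumes the MIME type can be guessed from the file extension (falling back to JPEG otherwise).

```python
import base64
import mimetypes
from pathlib import Path


def image_file_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL (hypothetical helper)."""
    # Guess the MIME type from the file extension; JPEG is an assumed fallback.
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        mime_type = "image/jpeg"
    encoded = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def image_content_block(path: str) -> dict:
    """Build an OpenAI-style "image_url" content block from a local file."""
    return {"type": "image_url", "image_url": {"url": image_file_to_data_url(path)}}
```

The resulting block can be placed in the `content` list of a `HumanMessage` next to a `{"type": "text", ...}` block, exactly like the inline data URL above.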
 

We can feed the image URL directly in a content block of type "image_url". Note that only some model providers support this.
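
A minimal sketch of the URL variant follows. The helper names `build_image_url_content` and `describe_image_by_url` are hypothetical; the content-block layout is the same OpenAI-style format used in the base64 example, with the plain URL in place of the data URL. Whether the provider fetches the URL itself depends on the model integration.

```python
from typing import Any


def build_image_url_content(prompt: str, image_url: str) -> list[dict[str, Any]]:
    """Build message content that references the image by URL (hypothetical helper)."""
    return [
        {"type": "text", "text": prompt},
        # The URL is passed as-is; no download or base64 encoding happens here.
        {"type": "image_url", "image_url": {"url": image_url}},
    ]


def describe_image_by_url(model: Any, prompt: str, image_url: str) -> str:
    """Send the prompt and image URL to a chat model and return its reply."""
    # Imported lazily so the builder above works without langchain installed.
    from langchain_core.messages import HumanMessage

    message = HumanMessage(content=build_image_url_content(prompt, image_url))
    return model.invoke([message]).content
```

With the `model` and `image_url` defined earlier, the call would be `describe_image_by_url(model, "describe the weather in this image", image_url)`.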

 

Multi-Vector Retriever for RAG on tables, text, and images

https://blog.langchain.dev/semi-structured-multi-modal-rag/

Summary

Seamless question-answering across diverse data types (images, text, tables) is one of the holy grails of RAG. We’re releasing three new cookbooks that showcase the multi-vector retriever for RAG on documents that contain a mixture of content types. These cookbooks also present a few ideas for pairing multimodal LLMs with the multi-vector retriever to unlock RAG on images.
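
As a rough illustration of the multi-vector idea, the sketch below indexes a short summary of each item for search but returns the raw item (a table, a text chunk, or an image path) that the summary points to. Everything here is an assumption made for illustration: `ToyMultiVectorRetriever` is a hypothetical class, not the LangChain one, and the bag-of-words `toy_embed` stands in for a real embedding model.

```python
import math
import uuid
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyMultiVectorRetriever:
    """Search over summaries, return the raw content they refer to."""

    def __init__(self) -> None:
        self.index: list[tuple[Counter, str]] = []  # (summary vector, doc_id)
        self.docstore: dict[str, str] = {}          # doc_id -> raw content

    def add(self, summary: str, raw: str) -> str:
        # The shared ID links the searchable summary to the raw content.
        doc_id = str(uuid.uuid4())
        self.index.append((toy_embed(summary), doc_id))
        self.docstore[doc_id] = raw
        return doc_id

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        query_vec = toy_embed(query)
        ranked = sorted(
            self.index, key=lambda item: cosine(query_vec, item[0]), reverse=True
        )
        return [self.docstore[doc_id] for _, doc_id in ranked[:k]]
```

For images, the summary could be a caption written by a multimodal LLM, with the image path or the encoded image kept as the raw content.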

 

https://developer.volcengine.com/articles/7387287884799148073

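Embedding images as vectors

The key steps are briefly described below:

Step 1: Extract images from the PDF

   Use the unstructured library to extract the PDF's content and build lists of text and images. The extracted images need to be stored in a specific folder.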

          
# Extract images, tables, and chunk text
from unstructured.partition.pdf import partition_pdf

# Folder the extracted images are written to (assumed value; use your own location)
path = "./extracted_images/"

raw_pdf_elements = partition_pdf(
    filename="LCM_2020_1112.pdf",
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=path,
)
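
The text list mentioned above is not built by `partition_pdf` itself. One way to derive it from the returned elements is sketched below. The helper `split_elements` is hypothetical, and it assumes that chunked text arrives as `CompositeElement` objects and tables as `Table` objects, which may vary across unstructured versions.

```python
def split_elements(elements):
    """Separate partitioned PDF elements into text chunks and tables.

    Elements are matched by class name, so the function does not need to
    import unstructured; any other element type is skipped.
    """
    texts, tables = [], []
    for element in elements:
        kind = type(element).__name__
        if kind == "Table":
            tables.append(str(element))
        elif kind == "CompositeElement":
            texts.append(str(element))
    return texts, tables
```

Calling `texts, tables = split_elements(raw_pdf_elements)` would then give the `texts` list that can be added to the vector database.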
      

Step 2: Create the vector database

    Prepare the vector database, then add the image URIs and the text to it.

          
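import os

from langchain_community.vectorstores import Chroma
from langchain_experimental.open_clip import OpenCLIPEmbeddings

# Create a Chroma collection that embeds both images and text with OpenCLIP
vectorstore = Chroma(
    collection_name="mm_rag_clip_photos", embedding_function=OpenCLIPEmbeddings()
)

# Get image URIs with .jpg extension only (`path` is the image folder from step 1)
image_uris = sorted(
    [
        os.path.join(path, image_name)
        for image_name in os.listdir(path)
        if image_name.endswith(".jpg")
    ]
)

print(image_uris)

# Add images
vectorstore.add_images(uris=image_uris)

# Add documents (`texts` is the list of text chunks extracted from the PDF)
vectorstore.add_texts(texts=texts)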
      

 

From: https://www.cnblogs.com/lightsong/p/18341289
