
Shanghai AI Lab Mono-InternVL: Environment Setup & Inference Test


Introduction

The performance bottleneck of native multimodal large models has seen a new breakthrough. Jifeng Dai's team at Shanghai AI Lab has proposed Mono-InternVL, a new native multimodal large model. Compared with non-native models, it reduces first-token latency by up to 67% and achieves SOTA results on multiple evaluation datasets. OK, let's get started.

I. Model Introduction

Mono-InternVL integrates visual encoding and text decoding into a single large language model. A set of visual experts is embedded into the pretrained language model through a mixture-of-experts mechanism. By freezing the language parameters of the language model, Mono-InternVL optimizes its visual capabilities without degrading the pretrained linguistic knowledge. Building on this structure, the authors introduce Endogenous Visual Pretraining (EViP), which enables coarse-to-fine visual learning.
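
To make the structure concrete, below is a minimal PyTorch sketch of modality-based expert routing. This is an illustration only, not the actual Mono-InternVL implementation; the class name ModalityMoEFFN, the is_visual mask, and the layer sizes are all hypothetical.

import torch
import torch.nn as nn

class ModalityMoEFFN(nn.Module):
    # Hypothetical sketch of Mono-InternVL-style modality routing: visual tokens
    # pass through a trainable visual expert, text tokens through the frozen text FFN.
    def __init__(self, hidden_size=2048, intermediate_size=8192):
        super().__init__()
        self.text_ffn = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size), nn.GELU(),
            nn.Linear(intermediate_size, hidden_size))
        self.visual_ffn = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size), nn.GELU(),
            nn.Linear(intermediate_size, hidden_size))
        # Freeze the language branch so pretrained linguistic knowledge is kept.
        for p in self.text_ffn.parameters():
            p.requires_grad = False

    def forward(self, hidden_states, is_visual):
        # hidden_states: (batch, seq, hidden); is_visual: (batch, seq) bool mask
        text_out = self.text_ffn(hidden_states)
        visual_out = self.visual_ffn(hidden_states)
        return torch.where(is_visual.unsqueeze(-1), visual_out, text_out)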

Mono-InternVL outperforms the current state-of-the-art multimodal language model Mini-InternVL-2B-1.5 and significantly surpasses other native multimodal models (illustrated by a radar chart in the original post, omitted here). Its deployment efficiency is also improved, with first-token latency reduced by up to 67%.
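
If you want to approximate the first-token latency claim on your own hardware, one rough approach is to time a one-token generation through the chat interface used in the test script below. This is only a sketch and a proxy measurement (prefill plus one decode step), assuming model, tokenizer, and pixel_values are prepared as in Section III:

import time
import torch

def measure_first_token_latency(model, tokenizer, pixel_values, question):
    # Generating exactly one new token approximates time-to-first-token.
    config = dict(max_new_tokens=1, do_sample=False)
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.chat(tokenizer, pixel_values, question, config)
    torch.cuda.synchronize()
    return time.perf_counter() - start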

II. Environment Setup

1. Model Download

https://huggingface.co/OpenGVLab/Mono-InternVL-2B/tree/main
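
Alternatively, the weights can be fetched with huggingface_hub into the local directory the test script loads from (a sketch; local_dir='models' simply matches path = 'models' used below):

from huggingface_hub import snapshot_download

# Download the full repo into ./models so the test script can load it locally.
snapshot_download(repo_id='OpenGVLab/Mono-InternVL-2B', local_dir='models')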

2. Environment Installation

docker run -it --rm --gpus=all -v /datas/work/zzq:/workspace pytorch/pytorch:2.2.2-cuda12.1-cudnn8-devel bash

pip install transformers==4.37.2

pip install decord

pip install einops

pip install sentencepiece
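
A quick sanity check (not from the original post) that the pinned versions are installed and the container can see the GPU:

python -c "import torch, transformers; print(torch.__version__, transformers.__version__, torch.cuda.is_available())"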

III. Inference Test

Test code (save the following as test.py):

import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    # Pick the candidate tile grid whose aspect ratio best matches the image.
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            # On ties, prefer the larger grid only if the image is big enough to fill it.
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate candidate tile grids (i, j) with min_num <= i * j <= max_num
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values


path = 'models'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

# pure-text conversation (纯文本对话)
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# single-image single-round conversation (单图单轮对话)
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')

# single-image multi-round conversation (单图多轮对话)
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

Run the test script:

python test.py

[Original test image omitted]

[Screenshot of the model's replies omitted]

From: https://www.cnblogs.com/nick-algorithmer/p/18551900
