Transformers 4.37 Documentation (Part 45)

Original: huggingface.co/docs/transformers

OWL-ViT

Original: huggingface.co/docs/transformers/v4.37.2/en/model_doc/owlvit

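Overview

OWL-ViT (short for Vision Transformer for Open-World Localization) was proposed in Simple Open-Vocabulary Object Detection with Vision Transformers by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. OWL-ViT is an open-vocabulary object detection network trained on a variety of (image, text) pairs. It can be used to query an image with one or multiple text queries to search for and detect target objects described in text.

The abstract from the paper is the following:

Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub.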

OWL-ViT architecture. Taken from the original paper.

This model was contributed by adirik. The original code can be found here.

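Usage tips

OWL-ViT is a zero-shot text-conditioned object detection model. It uses CLIP as its multi-modal backbone, with a ViT-like Transformer to obtain visual features and a causal language model to obtain text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification head and box head to each Transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and then fine-tune it end-to-end, together with the classification and box heads, on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection.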
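
The open-vocabulary classification idea described above can be illustrated with a minimal, dependency-free sketch. This is not the library's implementation: the embeddings are made-up 3-dimensional vectors, and the helper names `normalize` and `classify_tokens` are hypothetical. It only shows that each image token is scored against every text-query embedding, so changing the queries changes the set of "classes" without retraining a fixed classification layer.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def classify_tokens(token_embeds, query_embeds):
    """For each image token, return (index of best-matching query, similarity)."""
    queries = [normalize(q) for q in query_embeds]
    results = []
    for token in token_embeds:
        t = normalize(token)
        sims = [sum(a * b for a, b in zip(t, q)) for q in queries]
        best = max(range(len(sims)), key=sims.__getitem__)
        results.append((best, sims[best]))
    return results


# Toy data (illustrative values, not real model outputs).
query_names = ["a photo of a cat", "a photo of a dog"]
query_embeds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
token_embeds = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]

predictions = classify_tokens(token_embeds, query_embeds)
for idx, (label, score) in enumerate(predictions):
    print(f"token {idx}: {query_names[label]} (similarity {score:.2f})")
```

Swapping in a different list of query embeddings changes the label set at inference time, which is the property the paragraph above refers to as open-vocabulary classification.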

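OwlViTImageProcessor can be used to resize (or rescale) and normalize images for the model, and CLIPTokenizer is used to encode the text. OwlViTProcessor wraps OwlViTImageProcessor and CLIPTokenizer into a single instance that both encodes the text and prepares the images. The following example shows how to perform object detection using OwlViTProcessor and OwlViTForObjectDetection.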

>>> import requests
>>> from PIL import Image
>>> import torch

>>> from transformers import OwlViTProcessor, OwlViTForObjectDetection

>>> processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
>>> model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> texts = [["a photo of a cat", "a photo of a dog"]]
>>> inputs = processor(text=texts, images=image, return_tensors="pt")
>>> outputs = model(**inputs)

>>> # Target image sizes (height, width) to rescale box predictions [batch_size, 2]
>>> target_sizes = torch.Tensor([image.size[::-1]])
>>> # Convert outputs (bounding boxes and class logits) to Pascal VOC format (xmin, ymin, xmax, ymax)
>>> results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.1)
>>> i = 0  # Retrieve predictions for the first image for the corresponding text queries
>>> text = texts[i]
>>> boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]
>>> for box, score, label in zip(boxes, scores, labels):
...     box = [round(i, 2) for i in box.tolist()]
...     print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
Detected a photo of a cat with confidence 0.707 at location [324.97, 20.44, 640.58, 373.29]
Detected a photo of a cat with confidence 0.717 at location [1.46, 55.26, 315.55, 472.17]
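
As a rough illustration of what the post-processing step in the example involves, the sketch below re-creates it in plain Python. It assumes the model predicts boxes in normalized center format (cx, cy, width, height) and that a sigmoid is applied to the best per-query logit before thresholding. The function names and the sample numbers are hypothetical and are not library calls.

```python
import math


def center_to_corners(box):
    """Convert (cx, cy, w, h) to (xmin, ymin, xmax, ymax)."""
    cx, cy, w, h = box
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]


def post_process(logits, boxes, target_size, threshold=0.1):
    """Keep boxes whose best query score exceeds the threshold, in pixel coordinates."""
    height, width = target_size
    detections = []
    for token_logits, box in zip(logits, boxes):
        # Pick the text query with the highest logit for this token.
        label = max(range(len(token_logits)), key=token_logits.__getitem__)
        score = 1.0 / (1.0 + math.exp(-token_logits[label]))  # sigmoid
        if score <= threshold:
            continue
        xmin, ymin, xmax, ymax = center_to_corners(box)
        detections.append(
            {
                "label": label,
                "score": score,
                "box": [xmin * width, ymin * height, xmax * width, ymax * height],
            }
        )
    return detections


# Two made-up predictions for a 480 x 640 (height x width) image and two queries.
logits = [[2.0, -3.0], [-4.0, -5.0]]
boxes = [[0.5, 0.5, 0.5, 0.5], [0.1, 0.1, 0.1, 0.1]]
detections = post_process(logits, boxes, target_size=(480, 640))
print(detections)
```

Note that `target_size` is given as (height, width), which is why the example above builds `target_sizes` from `image.size[::-1]`: PIL reports sizes as (width, height).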

Resources

A demo notebook on using OWL-ViT for zero-shot and one-shot (image-guided) object detection can be found here.

OwlViTConfig

`class transformers.OwlViTConfig`

( text_config = None vision_config = None projection_dim = 512 logit_scale_init_value = 2.6592 return_dict = True **kwargs )

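Parameters

text_config (dict, optional): Dictionary of configuration options used to initialize OwlViTTextConfig.
vision_config (dict, optional): Dictionary of configuration options used to initialize OwlViTVisionConfig.
projection_dim (int, optional, defaults to 512): Dimensionality of the text and vision projection layers.
logit_scale_init_value (float, optional, defaults to 2.6592): The initial value of the logit_scale parameter. The default follows the original OWL-ViT implementation.
return_dict (bool, optional, defaults to True): Whether or not the model should return a dictionary. If False, a tuple is returned.
kwargs (optional): Dictionary of keyword arguments.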

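OwlViTConfig is the configuration class to store the configuration of an OwlViTModel. It is used to instantiate an OWL-ViT model according to the specified arguments, defining the text model and vision model configs. Instantiating a configuration with the defaults will yield a configuration similar to that of the OWL-ViT google/owlvit-base-patch32 architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.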
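
The default `logit_scale_init_value` may look arbitrary, but it matches the natural logarithm of 1 / 0.07, the initial temperature used by CLIP. The short check below illustrates this. It assumes that the model exponentiates `logit_scale` and multiplies similarity scores by the result, as CLIP-style models do; the similarity value is made up for illustration.

```python
import math

logit_scale_init_value = 2.6592  # default from the OwlViTConfig signature

# Exponentiating gives the multiplier applied to cosine similarities.
scale = math.exp(logit_scale_init_value)
print(f"exp(logit_scale) = {scale:.3f}")
print(f"1 / 0.07         = {1 / 0.07:.3f}")

cosine_similarity = 0.3  # illustrative value, not a model output
print(f"scaled logit     = {scale * cosine_similarity:.3f}")
```

Storing the scale in log space keeps the effective multiplier positive regardless of how the parameter is updated during training.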

`from_text_vision_configs`

( text_config: Dict vision_config: Dict **kwargs ) → OwlViTConfig

Returns

OwlViTConfig

An instance of a configuration object.

Instantiate an OwlViTConfig (or a derived class) from an OWL-ViT text model configuration and an OWL-ViT vision model configuration.

OwlViTTextConfig

`class transformers.OwlViTTextConfig`

( vocab_size = 49408 hidden_size = 512 intermediate_size = 2048 num_hidden_layers = 12 num_attention_heads = 8 max_position_embeddings = 16 hidden_act = 'quick_gelu' layer_norm_eps = 1e-05 attention_dropout = 0.0 initializer_range = 0.02 initializer_factor = 1.0 pad_token_id = 0 bos_token_id = 49406 eos_token_id = 49407 **kwargs )

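Parameters

vocab_size (int, optional, defaults to 49408): Vocabulary size of the OWL-ViT text model. Defines the number of different tokens that can be represented when calling OwlViTTextModel.
hidden_size (int, optional, defaults to 512): Dimensionality of the encoder layers and the pooler layer.
intermediate_size (int, optional, defaults to 2048): Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
num_hidden_layers (int, optional, defaults to 12): Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 8): Number of attention heads for each attention layer in the Transformer encoder.
max_position_embeddings (int, optional, defaults to 16): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512, 1024, or 2048).
hidden_act (str or function, optional, defaults to "quick_gelu"): The non-linear activation function (function or string) in the encoder and pooler. If a string, "gelu", "relu", "selu", "gelu_new", and "quick_gelu" are supported.
layer_norm_eps (float, optional, defaults to 1e-05): The epsilon used by the layer normalization layers.
attention_dropout (float, optional, defaults to 0.0): The dropout ratio for the attention probabilities.
initializer_range (float, optional, defaults to 0.02): The standard deviation of the truncated normal initializer used to initialize all weight matrices.
initializer_factor (float, optional, defaults to 1.0): A factor for initializing all weight matrices (should be kept at 1; used internally for initialization testing).
pad_token_id (int, optional, defaults to 0): The id of the padding token in the input sequences.
bos_token_id (int, optional, defaults to 49406): The id of the beginning-of-sequence token in the input sequences.
eos_token_id (int, optional, defaults to 49407): The id of the end-of-sequence token in the input sequences.

This is the configuration class to store the configuration of an OwlViTTextModel. It is used to instantiate an OWL-ViT text encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the OWL-ViT google/owlvit-base-patch32 architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example: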

>>> from transformers import OwlViTTextConfig, OwlViTTextModel

>>> # Initializing an OwlViTTextConfig with google/owlvit-base-patch32 style configuration
>>> configuration = OwlViTTextConfig()

>>> # Initializing an OwlViTTextModel (with random weights) from the google/owlvit-base-patch32 style configuration
>>> model = OwlViTTextModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

OwlViTVisionConfig

`class transformers.OwlViTVisionConfig`

( hidden_size = 768 intermediate_size = 3072 num_hidden_layers = 12 num_attention_heads = 12 num_channels = 3 image_size = 768 patch_size = 32 hidden_act = 'quick_gelu' layer_norm_eps = 1e-05 attention_dropout = 0.0 initializer_range = 0.02 initializer_factor = 1.0 **kwargs )

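Parameters

hidden_size (int, optional, defaults to 768): Dimensionality of the encoder layers and the pooler layer.
intermediate_size (int, optional, defaults to 3072): Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
num_hidden_layers (int, optional, defaults to 12): Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 12): Number of attention heads for each attention layer in the Transformer encoder.
num_channels (int, optional, defaults to 3): Number of channels in the input images.
image_size (int, optional, defaults to 768): The size (resolution) of each image.
patch_size (int, optional, defaults to 32): The size (resolution) of each patch.
hidden_act (str or function, optional, defaults to "quick_gelu"): The non-linear activation function (function or string) in the encoder and pooler. If a string, "gelu", "relu", "selu", "gelu_new", and "quick_gelu" are supported.
layer_norm_eps (float, optional, defaults to 1e-05): The epsilon used by the layer normalization layers.
attention_dropout (float, optional, defaults to 0.0): The dropout ratio for the attention probabilities.
initializer_range (float, optional, defaults to 0.02): The standard deviation of the truncated normal initializer used to initialize all weight matrices.
initializer_factor (float, optional, defaults to 1.0): A factor for initializing all weight matrices (should be kept at 1; used internally for initialization testing).

This is the configuration class to store the configuration of an OwlViTVisionModel. It is used to instantiate an OWL-ViT image encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the OWL-ViT google/owlvit-base-patch32 architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example: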

>>> from transformers import OwlViTVisionConfig, OwlViTVisionModel

>>> # Initializing an OwlViTVisionConfig with google/owlvit-base-patch32 style configuration
>>> configuration = OwlViTVisionConfig()

>>> # Initializing an OwlViTVisionModel (with random weights) from the google/owlvit-base-patch32 style configuration
>>> model = OwlViTVisionModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
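
With the default vision configuration, the number of image patches follows directly from `image_size` and `patch_size`. Each patch becomes one output token, and each token receives a class and box prediction. The arithmetic below is a sketch that assumes square images split into non-overlapping square patches and does not count any extra class token. The per-head dimension is derived the usual way for multi-head attention.

```python
# Defaults copied from the OwlViTVisionConfig signature.
image_size = 768
patch_size = 32
hidden_size = 768
num_attention_heads = 12

patches_per_side = image_size // patch_size
num_patches = patches_per_side**2
head_dim = hidden_size // num_attention_heads

print(f"{patches_per_side} x {patches_per_side} = {num_patches} patches")
print(f"per-head dimension: {head_dim}")
```

A smaller `patch_size` at the same `image_size` would increase the number of tokens quadratically, and with it the number of candidate boxes the detection heads produce.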
