2024-06-06BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language ModeMotivation&Abs端到端大规模视觉语言预训练的开销极大。为此,本文提出了BLIP2,利用现成的冻住的imageencoder以及LLM引导视觉语言预训练。模态差距:通过两阶段训练的轻量级的QueryTransformer(Q-Former)弥补。第一阶段:从冻结的imageencoder引导VL学习;第二阶段:从冻结的LLM引导视