
2305.19270 Learning without Forgetting for Vision-Language Models


https://arxiv.org/pdf/2305.19270.pdf


Abstract

Class-Incremental Learning (CIL) or continual learning is a desired capability in the real world, which requires a learning system to adapt to new tasks without forgetting former ones. While traditional CIL methods focus on visual information to grasp core features, recent advances in Vision-Language Models (VLM) have shown promising capabilities in learning generalizable representations with the aid of textual information. However, when continually trained with new classes, VLMs often suffer from catastrophic forgetting of former knowledge. Applying VLMs to CIL poses two major challenges: 1) how to adapt the model without forgetting; and 2) how to make full use of the multi-modal information. To this end, we propose PROjectiOn Fusion (PROOF) that enables VLMs to learn without forgetting. To handle the first challenge, we propose training task-specific projections based on the frozen image/text encoders. When facing new tasks, new projections are expanded and former projections are fixed, alleviating the forgetting of old concepts. For the second challenge, we propose the fusion module to better utilize the cross-modality information. By jointly adjusting visual and textual features, the model can capture semantic information with a stronger representation ability. Extensive experiments on nine benchmark datasets validate that PROOF achieves state-of-the-art performance.


1 Introduction

In our ever-changing world, training data often comes in a stream format with new classes, requiring a learning system to absorb them continually [19, 18]. To address the challenge of learning emerging new classes, Class-Incremental Learning (CIL) has been proposed [47]. However, in CIL, the absence of former classes triggers catastrophic forgetting [16], where learning new concepts overwrites the knowledge of old ones and results in a decline in performance [33]. Numerous efforts have been made [37, 15, 79, 53, 62, 77] to combat catastrophic forgetting in the machine learning field.

With the rapid development of pre-training techniques [20], recent years have witnessed the transition of CIL research from training from scratch [67, 21, 78] to utilizing pre-trained models (PTM) [63, 64, 49]. With the help of PTMs, e.g., Vision Transformers [13], incremental models are born with strong transferability to grasp visual features. Facing the domain gap introduced by the incremental classes, they only need to learn a limited number of additional parameters [26, 11, 34] as patches to bridge the gap, which significantly simplifies the challenge of incremental learning.

While pre-trained ViT-based CIL methods focus on learning visual features to recognize new concepts, recent advances in Vision-Language Models (VLM) have demonstrated the potential of textual information in building generalized feature representations. A typical work, contrastive language-image pre-training (CLIP) [46], maps visual and textual information into a shared embedding space, enabling robust learning and recognition of concepts from diverse sources. This integration of visual and textual modalities presents a promising avenue for developing continual learning models that can effectively adapt to real-world scenarios.

Extending VLMs to CIL faces two significant challenges. First, sequentially tuning the VLM overwrites the innate generalizability and former concepts, leading to forgetting and poor performance on future tasks. Second, relying solely on textual information for classification neglects the valuable cross-modal features present in the multi-modal inputs. To fully utilize this information, it is necessary to explore methods for cross-modal fusion beyond textual features.

Correspondingly, we aim to turn a VLM into a continual learner that is both retentive and comprehensive. Retentive refers to the model's ability to maintain its pre-trained capabilities, thereby preserving generalizability and enabling it to perform well on future tasks without forgetting. Comprehensive refers to the model's capacity to integrate and adjust information from multiple modalities. By leveraging these characteristics, we can mitigate catastrophic forgetting and use cross-modal features to build more robust classifiers as data evolves.

In this paper, we propose PROjectiOn Fusion (PROOF) to address catastrophic forgetting in VLMs. To make the model retentive, we freeze the pre-trained image/text backbones and append linear projections on top of them. The task-specific information is encoded in the corresponding projection layer by mapping the projected features. When facing new tasks, new projections are expanded while old ones are frozen, preserving former knowledge. Besides, we aim to fuse the information from different modalities via cross-modal fusion, which allows the query embedding to be adjusted with context information. Consequently, PROOF efficiently incorporates new classes while resisting forgetting of old ones, achieving state-of-the-art performance on nine benchmark datasets. We also investigate the zero-shot performance of VLMs with new evaluation protocols and metrics, and find that PROOF maintains its zero-shot performance with a simple modification.

2 Related Work

Vision-Language Model (VLM) Tuning: Recent years have witnessed the prosperity of research in VLMs, e.g., CLIP [46], ALIGN [25], CoCa [70], Florence [73], BLIP [31], CLIPPO [54], and Flamingo [1]. These models are pre-trained on vast amounts of images and texts, achieving a unified embedding space across modalities. With great generalizability, they can be applied to downstream tasks in a zero-shot manner. However, a domain gap still exists between the pre-trained and downstream datasets, requiring further tuning for better performance. CoOp and CoCoOp [85, 84] apply prompt learning [32] to VLM tuning with learnable prompt tokens. Subsequent works explore VLM tuning via adapter tuning [17], prompt distribution learning [39], task residual learning [72], similarity learning [76], descriptor learning [42], and optimal transport mapping [10]. However, they only focus on adapting VLMs to downstream tasks while overlooking the forgetting of former ones.

Class-Incremental Learning (CIL): aims to learn from evolving data and absorb new knowledge without forgetting [81]. Replay-based methods [40, 4, 8, 38, 9] save and replay former instances to recover old knowledge when learning new ones. Knowledge distillation-based methods [47, 33, 14] build a mapping between models as regularization. Parameter regularization-based methods [27, 2, 74, 3] weigh the importance of different parameters as regularization. Model rectification-based methods [50, 78, 67, 71] rectify the inductive bias for unbiased predictions. Dynamic networks [69, 58, 82, 59] show strong performance by expanding the network structure as data evolves.

CIL with VLM: The aforementioned CIL methods aim to train an incremental model from scratch, while it would be easier to start with a pre-trained model [30]. The integration of pre-trained Vision Transformers [13] into CIL has attracted the attention of the community, and most methods [63, 64, 49] employ parameter-efficient tuning techniques to learn without forgetting. S-Prompt [61] explores CLIP in domain-incremental learning, but the application of VLMs in CIL remains relatively unexplored. WiSE-FT [66] utilizes weight ensembles for robust finetuning, but it cannot be extended to multiple tasks. This paper aims to address this research gap by presenting a comprehensive solution for tuning vision-language models without suffering from forgetting.



3 From Old Classes to New Classes

In this section, we introduce the background information about class-incremental learning and vision-language models. We also discuss the naïve solutions for tuning VLM in CIL.

3.1 Class-Incremental Learning

Given a data stream with emerging new classes, class-incremental learning aims to continually incorporate the knowledge and build a unified classifier [81]. We denote the sequence of $B$ training sets without overlapping classes as $\mathcal{D}^1, \mathcal{D}^2, \cdots, \mathcal{D}^B$, where $\mathcal{D}^b = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n_b}$ is the $b$-th training set with $n_b$ instances. A training instance $\mathbf{x}_i \in \mathbb{R}^D$ belongs to class $y_i \in Y_b$. $Y_b$ is the label space of task $b$, and $Y_b \cap Y_{b'} = \emptyset$ for $b \neq b'$. Following the typical CIL setting [47, 22, 67], a fixed number of exemplars from the former classes are selected as the exemplar set $\mathcal{E}$. During the $b$-th incremental stage, we can only access data from $\mathcal{D}^b$ and $\mathcal{E}$ for model training. The target is to build a unified classifier for all seen classes $\mathcal{Y}_b = Y_1 \cup \cdots \cup Y_b$ continually. In other words, we hope to find a model $f(\mathbf{x}): X \rightarrow \mathcal{Y}_b$ that minimizes the expected risk:

$$f^* = \operatorname*{argmin}_{f \in \mathcal{H}} \; \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}_t^1 \cup \cdots \cup \mathcal{D}_t^b} \, \mathbb{I}\left(y \neq f(\mathbf{x})\right), \tag{1}$$

where $\mathcal{H}$ denotes the hypothesis space and $\mathbb{I}(\cdot)$ is the indicator function. $\mathcal{D}_t^b$ denotes the data distribution of task $b$. Following [63, 64, 61], we assume that a pre-trained vision-language model is available as the initialization for $f(\mathbf{x})$, which will be introduced in Section 3.2.
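To make this protocol concrete, the following Python sketch walks through the incremental stages; `update`, `evaluate`, and `select_exemplars` are hypothetical placeholders for a concrete CIL method such as PROOF, not an actual API.

```python
# Schematic CIL loop (illustrative only): at stage b the learner can access
# only D_b and the exemplar set E, yet is evaluated on all classes seen so far.
def run_cil(model, tasks, exemplars_per_class=20):
    exemplar_set = []                                  # E: stored samples of former classes
    seen_test = []
    for b, task in enumerate(tasks, start=1):          # D_1, D_2, ..., D_B arrive in order
        train_data = list(task["train"]) + exemplar_set
        model.update(train_data)                       # train on D_b and E only
        seen_test += list(task["test"])
        acc = model.evaluate(seen_test)                # unified classifier over Y_1 ∪ ... ∪ Y_b
        print(f"stage {b}: accuracy on all seen classes = {acc:.3f}")
        exemplar_set += model.select_exemplars(task["train"], exemplars_per_class)
```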

3.2 Vision-Language Model

This paper focuses on contrastive language-image pre-training (CLIP) [46] as the VLM. During pre-training, CLIP jointly learns an image encoder $g_i(\cdot): \mathbb{R}^D \rightarrow \mathbb{R}^d$ and a text encoder $g_t(\cdot): \mathbb{R}^{D_t} \rightarrow \mathbb{R}^d$ in a contrastive manner, where $D$/$D_t$ are the input dimensions of image/text and $d$ is the embedding dimension. CLIP projects a batch of image-text pairs into a shared embedding space. It maximizes the cosine similarity of paired inputs and minimizes it for unmatched ones. Benefiting from the massive training data, CLIP can synthesize a zero-shot classifier that generalizes to unseen classes. The output of CLIP is formulated as:

$$p(y_i \mid \mathbf{x}) = \frac{\exp\left(\cos\left(\mathbf{z}, \mathbf{w}_i\right)/\tau\right)}{\sum_{j=1}^{|\mathcal{Y}_b|} \exp\left(\cos\left(\mathbf{z}, \mathbf{w}_j\right)/\tau\right)}, \tag{2}$$

where $\cos(\cdot, \cdot)$ denotes cosine similarity, $\tau$ is a learnable temperature parameter, and $\mathbf{z} = g_i(\mathbf{x})$ is the image embedding. Correspondingly, $\mathbf{w}_i$ is the text embedding of class $y_i$, obtained by feeding templated texts, e.g., "a photo of a [CLASS]", into the text encoder. We denote the templated text of class $i$ as $\mathbf{t}_i$. Eq. 2 aims to find the most similar text $\mathbf{t}_i$ that maximizes the cosine similarity to the query image.
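As a minimal PyTorch sketch of Eq. 2 (not the authors' code), the zero-shot prediction reduces to a temperature-scaled softmax over cosine similarities; the tensor shapes and the fixed temperature value are assumptions for illustration.

```python
import torch.nn.functional as F

def zero_shot_probs(image_emb, text_embs, tau=0.01):
    """image_emb: (d,) image embedding z = g_i(x);
    text_embs: (|Y_b|, d) class text embeddings w_j; tau: temperature."""
    z = F.normalize(image_emb, dim=-1)        # cosine similarity becomes a
    w = F.normalize(text_embs, dim=-1)        # dot product after normalization
    logits = (w @ z) / tau                    # cos(z, w_j) / tau for every class
    return logits.softmax(dim=-1)             # p(y_j | x) as in Eq. 2
```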

3.3 Overcoming Forgetting in Class-Incremental Learning

CIL, as a long-standing problem, has garnered significant attention from the research community. In this section, we introduce two typical solutions for adapting pre-trained models to new classes.

Vision-Based Learning: Traditional CIL methods primarily rely on the image encoder to capture the patterns of new classes. One such method, L2P [64], leverages visual prompt tuning [26] to enable incremental updates of a pre-trained Vision Transformer [13]. By keeping the image encoder frozen, L2P trains a learnable prompt pool Pool and combines it with patch embeddings to obtain instance-specific embeddings. The optimization target can be formulated as:

$$\mathcal{L} = \ell\left(h\left(\bar{g}_i\left(\mathbf{x}_i, \text{Pool}\right)\right), y_i\right) + \mathcal{L}_{reg}, \tag{3}$$

where $h(\cdot)$ is the classification head, $\bar{g}_i$ is the frozen image encoder, and $\mathcal{L}_{reg}$ is the regularization loss for prompt selection. By freezing the encoder, Eq. 3 grasps the new patterns with minimal forgetting.
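The sketch below illustrates the structure of Eq. 3 under simplifying assumptions: `frozen_encoder` is assumed to map an image batch plus a prompt pool to a d-dimensional feature, and the key-based prompt selection of L2P is omitted; names and sizes are illustrative, not L2P's actual interface.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptTunedClassifier(nn.Module):
    """Frozen image encoder g_i plus a trainable prompt pool and head h(.)."""
    def __init__(self, frozen_encoder, num_classes, pool_size=10, prompt_len=5, dim=768):
        super().__init__()
        self.encoder = frozen_encoder
        for p in self.encoder.parameters():
            p.requires_grad_(False)                       # keep the backbone frozen
        self.pool = nn.Parameter(0.02 * torch.randn(pool_size, prompt_len, dim))
        self.head = nn.Linear(dim, num_classes)           # classification head h(.)

    def forward(self, images):
        feat = self.encoder(images, self.pool)            # frozen g_i applied to (x, Pool)
        return self.head(feat)

def l2p_style_loss(logits, labels, reg_loss):
    # cross-entropy on the new classes plus the prompt-selection regularizer (Eq. 3)
    return F.cross_entropy(logits, labels) + reg_loss
```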

CLIP Tuning: The issue of tuning VLM without forgetting in CIL remains unaddressed, as previous works have solely focused on transferring CLIP to downstream tasks without considering the performance of former tasks. For instance, CoOp [85] converts text inputs into a learnable prompt, i.e., $\mathbf{t}_i = [V]_1 [V]_2 \cdots [V]_M [\text{CLASS}]_i$. The posterior probability in Eq. 2 is transformed into:

$$p(y_i \mid \mathbf{x}) = \frac{\exp\left(\cos\left(\mathbf{z}, g_t(\mathbf{t}_i)\right)/\tau\right)}{\sum_{j=1}^{|\mathcal{Y}_b|} \exp\left(\cos\left(\mathbf{z}, g_t(\mathbf{t}_j)\right)/\tau\right)}. \tag{4}$$

With the help of the learned prompt, Eq. 4 enables the model to be transferred to the downstream task. However, since the prompt template is shared for all tasks, sequentially tuning CoOp will suffer catastrophic forgetting of former concepts.
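A rough sketch of the learnable prompt in Eq. 4 is shown below. It assumes a `text_encoder` that accepts a sequence of token embeddings and returns a d-dimensional text feature, which is a simplification of CLIP's actual tokenizer/encoder interface; all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableContextPrompt(nn.Module):
    """M learnable context tokens [V]_1..[V]_M shared by all classes (CoOp-style)."""
    def __init__(self, text_encoder, class_token_embs, ctx_len=16, token_dim=512):
        super().__init__()
        self.text_encoder = text_encoder                    # frozen g_t, assumed interface
        self.class_token_embs = class_token_embs            # (num_classes, L, token_dim)
        self.ctx = nn.Parameter(0.02 * torch.randn(ctx_len, token_dim))

    def class_features(self):
        n = self.class_token_embs.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n, -1, -1)       # share [V]_1..[V]_M across classes
        tokens = torch.cat([ctx, self.class_token_embs], dim=1)
        return self.text_encoder(tokens)                    # g_t(t_i) for every class

def coop_logits(image_emb, prompt, tau=0.01):
    z = F.normalize(image_emb, dim=-1)                      # (d,)
    w = F.normalize(prompt.class_features(), dim=-1)        # (num_classes, d)
    return (w @ z) / tau                                    # Eq. 4 logits before softmax
```

Because the context vectors `ctx` are shared across all tasks, training them sequentially on new tasks would overwrite what earlier tasks stored in them, which is the forgetting issue described above.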

Discussions: Current methods focus on different aspects of CIL. Vision-based methods (e.g., Eq. 3) address the issue of forgetting but neglect the valuable semantic information conveyed in texts. Conversely, CLIP's pre-trained text encoder captures class-wise relationships that can enhance model learning. Meanwhile, transfer learning methods (e.g., Eq. 4) effectively leverage the cross-modal information, but sequentially tuning them suffers catastrophic forgetting of former concepts. Is it possible to combine the cross-modal information while resisting catastrophic forgetting?

4 PROOF: Projection Fusion for VLM

Observing the limitations of typical vision-based methods in utilizing textual information and the forgetting induced by CLIP tuning, we aim to leverage cross-modal knowledge in CLIP while effectively mitigating forgetting. To this end, we must make the model retentive and comprehensive. Retentive represents the ability to adapt to downstream tasks without forgetting, and we propose projections to map the pre-trained features into a projected feature space. Our unique training strategy ensures the preservation of former knowledge by freezing old projections and expanding new ones for new tasks. The comprehensive aspect involves co-adapting and utilizing cross-modal information to enhance unified predictions. The query instance's embedding is influenced by both visual and textual information, allowing for instance-specific adaptation and enabling comprehensive predictions.

In the following sections, we introduce the learning paradigm and the co-adaptation process. Lastly, we provide detailed guidelines for training and inference.

4.1 Expandable Feature Projection

CLIP is known for its strong zero-shot performance [46], i.e., Eq. 2 obtains competitive results even without explicit training on the specific tasks. However, given the domain gap between pre-trained and downstream tasks, an adaptation process is still necessary to capture the characteristics of the latter. Specifically, we introduce a linear layer (denoted as "projection"), which is appended after the frozen image and text embeddings to facilitate the matching of pair-wise projected features. Denoting the projection of image/text as $P_i(\cdot): \mathbb{R}^d \rightarrow \mathbb{R}^d$ and $P_t(\cdot): \mathbb{R}^d \rightarrow \mathbb{R}^d$, Eq. 2 is transformed into:

$$p(y_i \mid \mathbf{x}) = \underbrace{\frac{\exp\left(\cos\left(P_i(\mathbf{z}), P_t(\mathbf{w}_i)\right)/\tau\right)}{\sum_{j=1}^{|\mathcal{Y}_b|} \exp\left(\cos\left(P_i(\mathbf{z}), P_t(\mathbf{w}_j)\right)/\tau\right)}}_{\text{Projected Matching}}. \tag{5}$$

We denote the classification based on Eq. 5 as $f_{\mathrm{PM}}(\mathbf{x})$. By freezing the image and text encoders, it aligns the downstream features in the projected space, allowing the model to encode the relevant downstream information into projection layers. Since the pre-trained model outputs generalizable features, the projection layer learns to recombine features in a data-driven manner. For instance, in a task involving 'birds', the projection would assign a higher weight to features like 'beaks' and 'wings'. This adaptation enables the projected features to better discern and recognize downstream tasks.
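The following is a minimal sketch of projected matching (Eq. 5), assuming the frozen CLIP image/text embeddings are pre-computed; the single-linear-layer projections match the description above, but the dimension and temperature are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

class ProjectedMatching(nn.Module):
    """Linear projections P_i, P_t on top of frozen CLIP embeddings (Eq. 5)."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj_img = nn.Linear(dim, dim)   # P_i
        self.proj_txt = nn.Linear(dim, dim)   # P_t

    def forward(self, image_embs, text_embs, tau=0.01):
        # image_embs: (B, d) frozen features z; text_embs: (|Y_b|, d) frozen features w_j
        z = F.normalize(self.proj_img(image_embs), dim=-1)
        w = F.normalize(self.proj_txt(text_embs), dim=-1)
        return (z @ w.t()) / tau              # logits; a softmax over classes gives Eq. 5
```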

Expandable Projections: However, sequentially training a single projection layer still leads to forgetting of former tasks, resulting in confusion when combining old and new concepts. To this end, we expand task-specific projections for each new task. Specifically, we append a newly initialized projection layer $P_i^b, P_t^b$ when a new task $\mathcal{D}^b$ arrives. This results in a set of projections $\{P_i^1, P_i^2, \cdots, P_i^b\}$, $\{P_t^1, P_t^2, \cdots, P_t^b\}$, and we adopt the aggregation as the output, i.e.,

$$P_i(\mathbf{z}) = \sum_{m=1}^{b} P_i^m(\mathbf{z}), \qquad P_t(\mathbf{w}) = \sum_{n=1}^{b} P_t^n(\mathbf{w}). \tag{6}$$

In Eq. 6, projected features from different stages are mapped and aggregated to capture the different emphases of former and latter tasks. For example, former tasks might emphasize 'beak' features

[Figure: framework illustration showing the text prompt "A photo of a panda", the Image Encoder, and Visual Prototypes.]
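Combining Eq. 6 with the frozen encoders, the expandable projections might be organized as in the sketch below; this is an illustration of the described mechanism, not the released implementation, and the class and method names are assumptions. `add_task()` would be called once at the start of each incremental stage, mirroring the "expand new, freeze old" strategy.

```python
import torch.nn as nn
import torch.nn.functional as F

class ExpandableProjections(nn.Module):
    """Task-specific projection pairs {P_i^m}, {P_t^n}, aggregated by summation (Eq. 6)."""
    def __init__(self, dim=512):
        super().__init__()
        self.img_projs = nn.ModuleList()      # {P_i^1, ..., P_i^b}
        self.txt_projs = nn.ModuleList()      # {P_t^1, ..., P_t^b}
        self.dim = dim

    def add_task(self):
        # freeze all former projections, then append a new trainable pair for task b
        for projs in (self.img_projs, self.txt_projs):
            for layer in projs:
                for p in layer.parameters():
                    p.requires_grad_(False)
            projs.append(nn.Linear(self.dim, self.dim))

    def forward(self, image_embs, text_embs, tau=0.01):
        # call add_task() at least once before the first forward pass
        z = sum(p(image_embs) for p in self.img_projs)   # P_i(z) = sum_m P_i^m(z)
        w = sum(p(text_embs) for p in self.txt_projs)    # P_t(w) = sum_n P_t^n(w)
        z, w = F.normalize(z, dim=-1), F.normalize(w, dim=-1)
        return (z @ w.t()) / tau                         # projected matching over Eq. 6 outputs
```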

From: https://blog.51cto.com/u_15892225/6386157
