https://arxiv.org/pdf/2305.19270.pdf
Abstract
Class-Incremental Learning (CIL) or continual learning is a desired capability in
the real world, which requires a learning system to adapt to new tasks without
forgetting former ones. While traditional CIL methods focus on visual information
to grasp core features, recent advances in Vision-Language Models (VLM) have
shown promising capabilities in learning generalizable representations with the aid
of textual information. However, when continually trained with new classes, VLMs
often suffer from catastrophic forgetting of former knowledge. Applying VLMs to
CIL poses two major challenges: 1) how to adapt the model without forgetting; and
2) how to make full use of the multi-modal information. To this end, we propose
PROjectiOn Fusion (PROOF) that enables VLMs to learn without forgetting. To
handle the first challenge, we propose training task-specific projections based on the
frozen image/text encoders. When facing new tasks, new projections are expanded
and former projections are fixed, alleviating the forgetting of old concepts. For the
second challenge, we propose the fusion module to better utilize the cross-modality
information. By jointly adjusting visual and textual features, the model can capture
semantic information with a stronger representation ability. Extensive experiments
on nine benchmark datasets validate that PROOF achieves state-of-the-art performance.
1 Introduction
In our ever-changing world, training data often comes in a stream format with new classes, requiring
a learning system to absorb them continually [19, 18]. To address the challenge of learning emerging
new classes, Class-Incremental Learning (CIL) has been proposed [47]. However, in CIL, the absence
of former classes triggers catastrophic forgetting [16], where learning new concepts overwrites the
knowledge of old ones and results in a decline in performance [33]. Numerous efforts have been
made [37, 15, 79, 53, 62, 77] to combat catastrophic forgetting in the machine learning field.
With the rapid development of pre-training techniques [20], recent years have witnessed the transition
of CIL research from training from scratch [67, 21, 78] to utilizing pre-trained models (PTM) [63, 64,
49]. With the help of PTM, e.g., Vision Transformers [13], incremental models are born with strong
transferability to grasp the visual features. Facing the domain gap introduced by the incremental
classes, they only need to learn a limited number of additional parameters [26, 11, 34] that act as patches
to bridge the gap, which significantly simplifies the challenge of incremental learning.
While pre-trained ViT-based CIL methods focus on learning the visual features to recognize new
concepts, recent advances in Vision-Language Models (VLM) have demonstrated the potential of
textual information in building generalized feature representations. A typical work, i.e., contrastive
language-image pre-training [46] (CLIP), maps the visual and textual information in the shared
embedding space, enabling robust learning and recognition of concepts from diverse sources. This
integration of visual and textual modalities presents a promising avenue for developing continual
learning models that can effectively adapt to real-world scenarios.
Extending VLMs to CIL faces two significant challenges. First, sequentially tuning the VLM
overwrites the innate generalizability and former concepts, leading to forgetting and poor performance
on future tasks. Second, relying solely on textual information for classification neglects the valuable
cross-modal features present in the multi-modal inputs. To fully utilize this information, it is necessary
to explore methods for cross-modal fusion beyond textual features.
Correspondingly, we aim to turn a VLM into a continual learner that is both retentive and comprehensive. Retentive refers to the model’s ability to maintain its pre-trained capabilities, thereby preserving
generalizability and enabling it to perform well on future tasks without forgetting. Comprehensive
refers to the model’s capacity to integrate and adjust information from multiple modalities. By
leveraging these characteristics, we can mitigate catastrophic forgetting and use cross-modal features
to build more robust classifiers as data evolves.
In this paper, we propose PROjectiOn Fusion (PROOF) to address catastrophic forgetting in VLM.
To make the model retentive, we freeze the pre-trained image/text backbones and append linear
projections on top of them. The task-specific information is encoded in the corresponding projection
layer by mapping the projected features. When facing new tasks, new projections are extended while
old ones are frozen, preserving former knowledge. Besides, we aim to fuse the information from
different modalities via cross-modal fusion, which allows the query embedding to be adjusted
with context information. Consequently, PROOF efficiently incorporates new classes and meanwhile
resists forgetting old ones, achieving state-of-the-art performance on nine benchmark datasets. We
also investigate the zero-shot performance of VLM with new evaluation protocols and metrics, and
find that PROOF maintains its zero-shot performance with a simple modification.
2 Related Work
Vision-Language Model (VLM) Tuning: Recent years have witnessed the prosperity of research
in VLMs, e.g., CLIP [46], ALIGN [25], CoCa [70], Florence [73], BLIP [31], CLIPPO [54], and
Flamingo [1]. These models are pre-trained on vast amounts of images and texts, achieving a
unified embedding space across modalities. With great generalizability, they can be applied for
downstream tasks in a zero-shot manner. However, a domain gap still exists between the pre-trained
and downstream datasets, requiring further tuning for better performance. CoOp and CoCoOp [85, 84]
apply prompt learning [32] to VLM tuning with learnable prompt tokens. Subsequent works explore
VLM tuning via adapter tuning [17], prompt distribution learning [39], task residual learning [72],
similarity learning [76], descriptor learning [42], and optimal transport mapping [10]. However, they
only focus on adapting VLM to downstream tasks while overlooking the forgetting of former ones.
Class-Incremental Learning (CIL): aims to learn from evolving data and absorb new knowledge
without forgetting [81]. Replay-based methods [40, 4, 8, 38, 9] save and replay former instances to
recover old knowledge when learning new ones. Knowledge distillation-based methods [47, 33, 14]
build the mapping between models as regularization. Parameter regularization-based methods [27,
2, 74, 3] weigh the importance of different parameters as regularization. Model rectification-based
methods [50, 78, 67, 71] rectify the inductive bias for unbiased predictions. Dynamic networks [69,
58, 82, 59] show strong performance by expanding the network structure as data evolves.
CIL with VLM: The aforementioned CIL methods aim to train an incremental model from scratch,
whereas it would be easier to start with a pre-trained model [30]. The integration of pre-trained Vision
Transformer [13] into CIL has attracted the attention of the community, and most methods [63,
64, 49] employ parameter-efficient tuning techniques to learn without forgetting. S-Prompt [61]
explores CLIP in domain-incremental learning, but the application of VLM in CIL remains relatively
unexplored. WiSE-FT [66] utilizes weight ensembling for robust fine-tuning, but it cannot be extended
to multiple tasks. This paper aims to address this research gap by presenting a comprehensive solution
for tuning vision-language models without suffering from forgetting.
3 From Old Classes to New Classes
In this section, we introduce the background information about class-incremental learning and vision
language models. We also discuss the naïve solutions for tuning VLM in CIL.
3.1 Class-Incremental Learning

Given a data stream with emerging new classes, class-incremental learning aims to continually
incorporate the knowledge and build a unified classifier [81]. We denote the sequence of B training
sets without overlapping classes as $\mathcal{D}^1, \mathcal{D}^2, \cdots, \mathcal{D}^B$, where $\mathcal{D}^b = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n_b}$ is the $b$-th training
set with $n_b$ instances. A training instance $\mathbf{x}_i \in \mathbb{R}^D$ belongs to class $y_i \in \mathcal{Y}_b$. $\mathcal{Y}_b$ is the label space of
task $b$, and $\mathcal{Y}_b \cap \mathcal{Y}_{b'} = \emptyset$ for $b \neq b'$. Following the typical CIL setting [47, 22, 67], a fixed number of
exemplars from the former classes are selected as the exemplar set $\mathcal{E}$. During the $b$-th incremental
stage, we can only access data from $\mathcal{D}^b$ and $\mathcal{E}$ for model training. The target is to build a unified
classifier for all seen classes $\mathcal{Y}_b = \mathcal{Y}_1 \cup \cdots \cup \mathcal{Y}_b$ continually. In other words, we hope to find a model
$f(\mathbf{x}): \mathcal{X} \to \mathcal{Y}_b$ that minimizes the expected risk:
$$f^* = \operatorname*{argmin}_{f \in \mathcal{H}} \ \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}_t^1 \cup \cdots \cup \mathcal{D}_t^b} \ \mathbb{I}\left(y \neq f(\mathbf{x})\right), \tag{1}$$
where $\mathcal{H}$ denotes the hypothesis space and $\mathbb{I}(\cdot)$ is the indicator function. $\mathcal{D}_t^b$ denotes the data
distribution of task b. Following [63, 64, 61], we assume that a pre-trained vision-language model is
available as the initialization for f(x), which will be introduced in Section 3.2.
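To make the protocol concrete, the following sketch (my own illustration, not the paper's code) shows how a class-incremental stream with disjoint label spaces and a growing exemplar set $\mathcal{E}$ could be organized; the equal class split, the random exemplar choice, and the function name are assumptions.

```python
# A minimal sketch of the class-incremental protocol: classes are split into B disjoint tasks,
# and after each stage a fixed number of exemplars per class is added to E for later replay.
import random
from collections import defaultdict

def incremental_stream(samples, num_tasks, exemplars_per_class=20, seed=0):
    """samples: list of (x, y) pairs. Yields (training data for stage b, new label space Y_b)."""
    rng = random.Random(seed)
    classes = sorted({y for _, y in samples})
    rng.shuffle(classes)
    per_task = len(classes) // num_tasks
    by_class = defaultdict(list)
    for x, y in samples:
        by_class[y].append((x, y))

    exemplar_set = []                      # E: replayed alongside D^b in later stages
    for b in range(num_tasks):
        task_classes = classes[b * per_task:(b + 1) * per_task]          # Y_b, disjoint across tasks
        current_data = [s for c in task_classes for s in by_class[c]]    # D^b
        yield current_data + exemplar_set, task_classes
        for c in task_classes:             # keep a few exemplars of the just-seen classes
            exemplar_set += rng.sample(by_class[c], min(exemplars_per_class, len(by_class[c])))
```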
3.2 Vision-Language Model

This paper focuses on contrastive language-image pre-training (CLIP) [46] as the VLM. During pre-training, CLIP jointly learns an image encoder $g_i(\cdot): \mathbb{R}^D \to \mathbb{R}^d$ and a text encoder $g_t(\cdot): \mathbb{R}^{D_t} \to \mathbb{R}^d$
in a contrastive manner, where $D$ and $D_t$ are the input dimensions of image/text, and $d$ is the embedding
dimension. CLIP projects a batch of image-text pairs into a shared embedding space. It maximizes
the cosine similarity of paired inputs and minimizes it for unmatched ones. Benefiting from the
massive training data, CLIP can synthesize a zero-shot classifier that generalizes to unseen classes.
The output of CLIP is formulated as:
$$p(y_i \mid \mathbf{x}) = \frac{\exp\left(\cos(\mathbf{z}, \mathbf{w}_i)/\tau\right)}{\sum_{j=1}^{|\mathcal{Y}_b|} \exp\left(\cos(\mathbf{z}, \mathbf{w}_j)/\tau\right)}, \tag{2}$$
where $\cos(\cdot, \cdot)$ denotes cosine similarity, $\tau$ is a learnable temperature parameter, and $\mathbf{z} = g_i(\mathbf{x})$ is the image
embedding. Correspondingly, $\mathbf{w}_i$ is the text embedding of class $y_i$, obtained by feeding templated
texts, e.g., “a photo of a [CLASS]”, into the text encoder. We denote the templated text of class $i$ as $t_i$.
Eq. 2 aims to find the most similar text ti that maximizes the cosine similarity to the query image.
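To ground Eq. 2, the sketch below reproduces CLIP's zero-shot classifier with OpenAI's public `clip` package; the backbone choice ("ViT-B/16"), the class names, and the prompt template are illustrative assumptions.

```python
# Zero-shot classification as in Eq. 2: cosine similarity between the image embedding z and
# the text embeddings w_i of templated class names, scaled by the learned temperature.
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)   # backbone choice is an assumption

class_names = ["panda", "tiger", "goldfish"]               # stands in for the seen classes Y_b
text_tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

@torch.no_grad()
def zero_shot_probs(image):                                # image: a preprocess()-ed [3, H, W] tensor
    z = model.encode_image(image.unsqueeze(0).to(device))  # z = g_i(x)
    w = model.encode_text(text_tokens)                     # w_i = g_t(t_i)
    z = z / z.norm(dim=-1, keepdim=True)                   # normalize so the dot product is cos(., .)
    w = w / w.norm(dim=-1, keepdim=True)
    logits = model.logit_scale.exp() * z @ w.t()           # cos(z, w_j) / tau
    return logits.softmax(dim=-1)                          # p(y_i | x), as in Eq. 2
```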
3.3 Overcoming Forgetting in Class-Incremental Learning

CIL, as a long-standing problem, has garnered significant attention from the research community. In
this section, we introduce two typical solutions for adapting pre-trained models with new classes.
Vision-Based Learning: Traditional CIL methods primarily rely on the image encoder to capture
the patterns of new classes. One such method, L2P [64], leverages visual prompt tuning [26] to
enable incremental updates of a pre-trained Vision Transformer [13]. By keeping the image encoder
frozen, L2P trains a learnable prompt pool Pool and combines it with patch embeddings to obtain
instance-specific embeddings. The optimization target can be formulated as:
$$\mathcal{L} = \ell\left(h\left(\bar{g}_i(\mathbf{x}_i, \mathrm{Pool})\right), y_i\right) + \mathcal{L}_{reg}, \tag{3}$$
where $h(\cdot)$ is the classification head, $\bar{g}_i$ is the frozen image encoder, and $\mathcal{L}_{reg}$ is the regularization loss
for prompt selection. By freezing the encoder, Eq. 3 grasps new patterns with little forgetting.
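The sketch below conveys the spirit of Eq. 3 rather than L2P's exact implementation: a frozen backbone, a learnable prompt pool with key-based selection, and a trainable head. The pool size, the top-k selection, and the `extra_tokens` hook on the backbone are hypothetical simplifications.

```python
# A rough sketch of the vision-based objective in Eq. 3. Only the prompt pool, keys, and head
# are trainable; the backbone \bar{g}_i stays frozen.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptPoolClassifier(nn.Module):
    def __init__(self, frozen_vit, embed_dim, num_classes, pool_size=10, prompt_len=5, top_k=3):
        super().__init__()
        self.vit = frozen_vit
        for p in self.vit.parameters():                    # keep the image encoder frozen
            p.requires_grad_(False)
        self.prompt_keys = nn.Parameter(torch.randn(pool_size, embed_dim))
        self.prompts = nn.Parameter(torch.randn(pool_size, prompt_len, embed_dim))
        self.head = nn.Linear(embed_dim, num_classes)      # h(.)
        self.top_k = top_k

    def forward(self, x):
        query = self.vit(x)                                # [B, d] instance query from the frozen encoder
        sim = F.cosine_similarity(query.unsqueeze(1), self.prompt_keys, dim=-1)   # [B, pool_size]
        top = sim.topk(self.top_k, dim=1).indices
        picked = self.prompts[top].flatten(1, 2)           # [B, top_k * prompt_len, d]
        feat = self.vit(x, extra_tokens=picked)            # hypothetical hook that prepends prompts
        reg = -sim.gather(1, top).mean()                   # L_reg: pull selected keys toward the query
        return self.head(feat), reg

# Training step per Eq. 3: logits, reg = model(x); loss = F.cross_entropy(logits, y) + reg
```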
CLIP Tuning: The issue of tuning VLM without forgetting in CIL remains unaddressed, as previous
works have solely focused on transferring CLIP to downstream tasks without considering the performance of former tasks. For instance, CoOp [85] converts text inputs into a learnable prompt, i.e.,
$t_i = [V]_1 [V]_2 \cdots [V]_M [\mathrm{CLASS}]_i$. The posterior probability in Eq. 2 is transformed into:
$$p(y_i \mid \mathbf{x}) = \frac{\exp\left(\cos(\mathbf{z}, g_t(t_i))/\tau\right)}{\sum_{j=1}^{|\mathcal{Y}_b|} \exp\left(\cos(\mathbf{z}, g_t(t_j))/\tau\right)}. \tag{4}$$
With the help of the learned prompt, Eq. 4 enables the model to be transferred to the downstream
task. However, since the prompt template is shared for all tasks, sequentially tuning CoOp will suffer
catastrophic forgetting of former concepts.
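For contrast, here is a minimal sketch of the CoOp-style prompt in Eq. 4: M shared, learnable context vectors [V]_1...[V]_M are concatenated with frozen class-name token embeddings before the frozen text encoder. The encoder interface (token embeddings in, class embeddings out) and the tensor shapes are assumptions.

```python
# CoOp-style learnable prompt: only the shared context vectors are trained, so tuning them on
# a new task overwrites whatever context earlier tasks relied on.
import torch
import torch.nn as nn

class LearnableTextPrompt(nn.Module):
    def __init__(self, frozen_text_encoder, class_token_embeds, ctx_len=16, embed_dim=512):
        super().__init__()
        self.text_encoder = frozen_text_encoder            # g_t, kept frozen
        for p in self.text_encoder.parameters():
            p.requires_grad_(False)
        self.ctx = nn.Parameter(torch.empty(ctx_len, embed_dim))   # [V]_1 ... [V]_M
        nn.init.normal_(self.ctx, std=0.02)
        # [num_classes, name_len, embed_dim] token embeddings of each class name, kept fixed
        self.register_buffer("class_embeds", class_token_embeds)

    def forward(self):
        n = self.class_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n, -1, -1)      # the same context is shared by all classes
        tokens = torch.cat([ctx, self.class_embeds], dim=1)        # t_i = [V]_1..[V]_M [CLASS]_i
        return self.text_encoder(tokens)                   # w_i = g_t(t_i), plugged into Eq. 4
```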
Discussions: Current methods focus on different aspects of CIL. Vision-based methods (e.g., Eq. 3)
address the issue of forgetting but neglect the valuable semantic information conveyed in texts.
Conversely, CLIP’s pre-trained text encoder captures class-wise relationships that can enhance model
learning. Meanwhile, transfer learning methods (e.g., Eq. 4) effectively leverage cross-modal
information, but sequentially tuning them suffers from catastrophic forgetting of former concepts. Is
it possible to combine the cross-modal information and meanwhile resist catastrophic forgetting?
4 PROOF: Projection Fusion for VLM
Observing the limitations of typical vision-based methods in utilizing textual information and
forgetting in CLIP tuning, we aim to leverage cross-modality knowledge in CLIP while effectively
mitigating forgetting. To this end, we must make the model retentive and comprehensive. Retentive
represents the ability to adapt to downstream tasks without forgetting, and we propose projections
to map the pre-trained features in the projected feature space. Our unique training strategy ensures
the preservation of former knowledge by freezing old projections and expanding new ones for new
tasks. The comprehensive aspect involves co-adapting and utilizing cross-modal information to
enhance unified predictions. The query instance’s embedding is influenced by both visual and textual
information, allowing for instance-specific adaptation and enabling comprehensive predictions.
In the following sections, we introduce the learning paradigm and the co-adaptation process. Lastly,
we provide detailed guidelines for training and inference.
4.1 Expandable Feature Projection

CLIP is known for its strong zero-shot performance [46], i.e., Eq. 2 obtains competitive results even
without explicit training on the specific tasks. However, given the domain gap between pre-trained
and downstream tasks, an adaptation process is still necessary to capture the characteristics of the
latter. Specifically, we introduce a linear layer (denoted as “projection”) which is appended after the
frozen image and text embeddings to facilitate the matching of pair-wise projected features. Denoting
the projection of image/text as $P_i(\cdot): \mathbb{R}^d \to \mathbb{R}^d$ and $P_t(\cdot): \mathbb{R}^d \to \mathbb{R}^d$, Eq. 2 is transformed into:
$$p(y_i \mid \mathbf{x}) = \underbrace{\frac{\exp\left(\cos(P_i(\mathbf{z}), P_t(\mathbf{w}_i))/\tau\right)}{\sum_{j=1}^{|\mathcal{Y}_b|} \exp\left(\cos(P_i(\mathbf{z}), P_t(\mathbf{w}_j))/\tau\right)}}_{\text{Projected Matching}}. \tag{5}$$

We denote the classification based on Eq. 5 as $f_{\mathrm{PM}}(\mathbf{x})$. By freezing the image and text encoders, it
aligns the downstream features in the projected space, allowing the model to encode the relevant
downstream information into projection layers. Since the pre-trained model outputs generalizable
features, the projection layer learns to recombine features in a data-driven manner. For instance, in a
task involving ‘birds,’ the projection would assign a higher weight to features like ‘beaks’ and ‘wings.’
This adaptation enables the projected features to better discern and recognize downstream tasks.
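A minimal sketch of the projected matching in Eq. 5, assuming 512-dimensional CLIP embeddings and a fixed inverse temperature: the frozen encoders produce z and w elsewhere, and only the two linear projections are trained.

```python
# Projected matching (Eq. 5): re-map frozen image/text embeddings with linear projections
# P_i, P_t, then classify by the cosine similarity of the projected pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectedMatching(nn.Module):
    def __init__(self, embed_dim=512, inv_temperature=100.0):
        super().__init__()
        self.proj_img = nn.Linear(embed_dim, embed_dim)    # P_i
        self.proj_txt = nn.Linear(embed_dim, embed_dim)    # P_t
        self.inv_temperature = inv_temperature             # 1 / tau, fixed here for simplicity

    def forward(self, z, w):
        """z: [B, d] frozen image embeddings; w: [C, d] frozen text embeddings of seen classes."""
        zi = F.normalize(self.proj_img(z), dim=-1)         # P_i(z)
        wt = F.normalize(self.proj_txt(w), dim=-1)         # P_t(w_j)
        return self.inv_temperature * zi @ wt.t()          # logits; a softmax over them gives Eq. 5
```

Since gradients reach only the projections, the downstream information is confined to these small layers while the CLIP backbones stay intact.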
Expandable Projections: However, sequentially training a single projection layer still leads to
forgetting of former tasks, resulting in confusion when combining old and new concepts. To
this end, we expand task-specific projections for each new task. Specifically, we append a newly
initialized projection layer $P_i^b, P_t^b$ when a new task $\mathcal{D}^b$ arrives. This results in a set of projections:
$\{P_i^1, P_i^2, \cdots, P_i^b\}$, $\{P_t^1, P_t^2, \cdots, P_t^b\}$, and we adopt the aggregation as the output, i.e.,
$$P_i(\mathbf{z}) = \sum_{m=1}^{b} P_i^m(\mathbf{z}), \qquad P_t(\mathbf{w}) = \sum_{n=1}^{b} P_t^n(\mathbf{w}). \tag{6}$$
In Eq. 6, projected features from different stages are mapped and aggregated to capture the different
emphases of former and latter tasks. For example, former tasks might emphasize ‘beak’ features
[Figure: overview with the image encoder, visual prototypes, and the text prompt “A photo of a panda”.]
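To make the expansion and freezing strategy of Eq. 6 concrete, the sketch below keeps one projection pair per task, freezes all former pairs when a new task arrives, and sums their outputs; the module name and the use of plain linear layers are my own assumptions.

```python
# Expandable projections (Eq. 6): P_i(z) = sum_m P_i^m(z), P_t(w) = sum_n P_t^n(w), with the
# projections of former tasks frozen so that old knowledge is preserved.
import torch.nn as nn

class ExpandableProjection(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        self.img_projs = nn.ModuleList()                   # {P_i^1, ..., P_i^b}
        self.txt_projs = nn.ModuleList()                   # {P_t^1, ..., P_t^b}
        self.embed_dim = embed_dim

    def expand(self):
        """Call at the start of task b: freeze former projections, then append trainable new ones."""
        for proj in list(self.img_projs) + list(self.txt_projs):
            for p in proj.parameters():
                p.requires_grad_(False)
        self.img_projs.append(nn.Linear(self.embed_dim, self.embed_dim))
        self.txt_projs.append(nn.Linear(self.embed_dim, self.embed_dim))

    def forward(self, z, w):
        # expand() must have been called at least once before the first forward pass
        pz = sum(proj(z) for proj in self.img_projs)       # aggregated image projection P_i(z)
        pw = sum(proj(w) for proj in self.txt_projs)       # aggregated text projection P_t(w)
        return pz, pw
```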