https://arxiv.org/pdf/2305.19270.pdf
Abstract
Class-Incremental Learning (CIL) or continual learning is a desired capability in
the real world, which requires a learning system to adapt to new tasks without
forgetting former ones. While traditional CIL methods focus on visual information
to grasp core features, recent advances in Vision-Language Models (VLM) have
shown promising capabilities in learning generalizable representations with the aid
of textual information. However, when continually trained with new classes, VLMs
often suffer from catastrophic forgetting of former knowledge. Applying VLMs to
CIL poses two major challenges: 1) how to adapt the model without forgetting; and
2) how to make full use of the multi-modal information. To this end, we propose
PROjectiOn Fusion (PROOF) that enables VLMs to learn without forgetting. To
handle the first challenge, we propose training task-specific projections based on the
frozen image/text encoders. When facing new tasks, new projections are expanded
and former projections are fixed, alleviating the forgetting of old concepts. For the
second challenge, we propose the fusion module to better utilize the cross-modality
information. By jointly adjusting visual and textual features, the model can capture
semantic information with a stronger representation ability. Extensive experiments
on nine benchmark datasets validate that PROOF achieves state-of-the-art performance.
1 Introduction
In our ever-changing world, training data often comes in a stream format with new classes, requiring
a learning system to absorb them continually [19, 18]. To address the challenge of learning emerging
new classes, Class-Incremental Learning (CIL) has been proposed [47]. However, in CIL, the absence
of former classes triggers catastrophic forgetting [16], where learning new concepts overwrites the
knowledge of old ones and results in a decline in performance [33]. Numerous efforts have been
made [37, 15, 79, 53, 62, 77] to combat catastrophic forgetting in the machine learning field.
With the rapid development of pre-training techniques [20], recent years have witnessed the transition
of CIL research from training from scratch [67, 21, 78] to utilizing pre-trained models (PTM) [63, 64,
49]. With the help of PTM, e.g., Vision Transformers [13], incremental models are born with strong
transferability to grasp the visual features. Facing the domain gap introduced by the incremental
classes, they only need to learn a limited number of additional parameters [26, 11, 34] that act as patches
to bridge the gap, which significantly simplifies the challenge of incremental learning.
While pre-trained ViT-based CIL methods focus on learning the visual features to recognize new
concepts, recent advances in Vision-Language Models (VLM) have demonstrated the potential of
textual information in building generalized feature representations. A typical work, i.e., contrastive
language-image pre-training [46] (CLIP), maps the visual and textual information in the shared
embedding space, enabling robust learning and recognition of concepts from diverse sources. This
integration of visual and textual modalities presents a promising avenue for developing continual
learning models that can effectively adapt to real-world scenarios.
Extending VLMs to CIL faces two significant challenges. First, sequentially tuning the VLM
overwrites the innate generalizability and former concepts, leading to forgetting and poor performance
on future tasks. Second, relying solely on textual information for classification neglects the valuable
cross-modal features present in the multi-modal inputs. To fully utilize this information, it is necessary
to explore methods for cross-modal fusion beyond textual features.
Correspondingly, we aim to turn a VLM into a continual learner that is both retentive and comprehensive. Retentive refers to the model’s ability to maintain its pre-trained capabilities, thereby preserving
generalizability and enabling it to perform well on future tasks without forgetting. Comprehensive
refers to the model’s capacity to integrate and adjust information from multiple modalities. By
leveraging these characteristics, we can mitigate catastrophic forgetting and use cross-modal features
to build more robust classifiers as data evolves.
In this paper, we propose PROjectiOn Fusion (PROOF) to address catastrophic forgetting in VLM.
To make the model retentive, we freeze the pre-trained image/text backbones and append linear
projections on top of them. The task-specific information is encoded in the corresponding projection
layer by mapping the projected features. When facing new tasks, new projections are extended while
old ones are frozen, preserving former knowledge. Besides, we aim to fuse the information from
different modalities via cross-modal fusion, which allows the query embedding to be adjusted
with context information. Consequently, PROOF efficiently incorporates new classes and meanwhile
resists forgetting old ones, achieving state-of-the-art performance on nine benchmark datasets. We
also investigate the zero-shot performance of VLM with new evaluation protocols and metrics, and
find that PROOF maintains its zero-shot performance with a simple modification.
2 Related Work
Vision-Language Model (VLM) Tuning: Recent years have witnessed the prosperity of research
in VLMs, e.g., CLIP [46], ALIGN [25], CoCa [70], Florence [73], BLIP [31], CLIPPO [54], and
Flamingo [1]. These models are pre-trained on vast amounts of images and texts, achieving a
unified embedding space across modalities. With great generalizability, they can be applied for
downstream tasks in a zero-shot manner. However, a domain gap still exists between the pre-trained
and downstream datasets, requiring further tuning for better performance. CoOp and CoCoOp [85, 84]
apply prompt learning [32] to VLM tuning with learnable prompt tokens. Subsequent works explore
VLM tuning via adapter tuning [17], prompt distribution learning [39], task residual learning [72],
similarity learning [76], descriptor learning [42], and optimal transport mapping [10]. However, they
only focus on adapting VLM to downstream tasks while overlooking the forgetting of former ones.
Class-Incremental Learning (CIL): aims to learn from evolving data and absorb new knowledge
without forgetting [81]. Replay-based methods [40, 4, 8, 38, 9] save and replay former instances to
recover old knowledge when learning new ones. Knowledge distillation-based methods [47, 33, 14]
build the mapping between models as regularization. Parameter regularization-based methods [27,
2, 74, 3] weigh the importance of different parameters as regularization. Model rectification-based
methods [50, 78, 67, 71] rectify the inductive bias for unbiased predictions. Dynamic networks [69,
58, 82, 59] show strong performance by expanding the network structure as data evolves.
CIL with VLM: The aforementioned CIL methods aim to train an incremental model from scratch,
whereas it would be easier to start with a pre-trained model [30]. The integration of pre-trained Vision
Transformer [13] into CIL has attracted the attention of the community, and most methods [63,
64, 49] employ parameter-efficient tuning techniques to learn without forgetting. S-Prompt [61]
explores CLIP in domain-incremental learning, but the application of VLM in CIL remains relatively
unexplored. WiSE-FT [66] utilizes weight ensembling for robust fine-tuning, but it cannot be extended
to multiple tasks. This paper aims to address this research gap by presenting a comprehensive solution
for tuning vision-language models without suffering from forgetting.
3 From Old Classes to New Classes
In this section, we introduce the background information about class-incremental learning and vision
language models. We also discuss the naïve solutions for tuning VLM in CIL.
3.1 Class-Incremental Learning

Given a data stream with emerging new classes, class-incremental learning aims to continually
incorporate the knowledge and build a unified classifier [81]. We denote the sequence of B training
sets without overlapping classes as $\mathcal{D}^1, \mathcal{D}^2, \cdots, \mathcal{D}^B$, where $\mathcal{D}^b = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n_b}$ is the $b$-th training
set with $n_b$ instances. A training instance $\mathbf{x}_i \in \mathbb{R}^D$ belongs to class $y_i \in \mathcal{Y}_b$. $\mathcal{Y}_b$ is the label space of
task $b$, and $\mathcal{Y}_b \cap \mathcal{Y}_{b'} = \emptyset$ for $b \neq b'$. Following the typical CIL setting [47, 22, 67], a fixed number of
exemplars from the former classes are selected as the exemplar set $\mathcal{E}$. During the $b$-th incremental
stage, we can only access data from $\mathcal{D}^b$ and $\mathcal{E}$ for model training. The target is to build a unified
classifier for all seen classes $\mathcal{Y}_b = \mathcal{Y}_1 \cup \cdots \cup \mathcal{Y}_b$ continually. In other words, we hope to find a model
$f(\mathbf{x}): \mathcal{X} \to \mathcal{Y}_b$ that minimizes the expected risk:
$$f^* = \operatorname*{argmin}_{f \in \mathcal{H}} \ \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}_t^1 \cup \cdots \cup \mathcal{D}_t^b} \ \mathbb{I}\left(y \neq f(\mathbf{x})\right), \tag{1}$$
where $\mathcal{H}$ denotes the hypothesis space and $\mathbb{I}(\cdot)$ is the indicator function. $\mathcal{D}_t^b$ denotes the data
distribution of task b. Following [63, 64, 61], we assume that a pre-trained vision-language model is
available as the initialization for f(x), which will be introduced in Section 3.2.
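To make the protocol concrete, the following sketch (my own illustration, not the paper's code) shows how a class-incremental stream with disjoint label spaces and a growing exemplar set $\mathcal{E}$ could be organized; the equal class split, the random exemplar choice, and the function name are assumptions.

```python
# A minimal sketch of the class-incremental protocol: classes are split into B disjoint tasks,
# and after each stage a fixed number of exemplars per class is added to E for later replay.
import random
from collections import defaultdict

def incremental_stream(samples, num_tasks, exemplars_per_class=20, seed=0):
    """samples: list of (x, y) pairs. Yields (training data for stage b, new label space Y_b)."""
    rng = random.Random(seed)
    classes = sorted({y for _, y in samples})
    rng.shuffle(classes)
    per_task = len(classes) // num_tasks
    by_class = defaultdict(list)
    for x, y in samples:
        by_class[y].append((x, y))

    exemplar_set = []                      # E: replayed alongside D^b in later stages
    for b in range(num_tasks):
        task_classes = classes[b * per_task:(b + 1) * per_task]          # Y_b, disjoint across tasks
        current_data = [s for c in task_classes for s in by_class[c]]    # D^b
        yield current_data + exemplar_set, task_classes
        for c in task_classes:             # keep a few exemplars of the just-seen classes
            exemplar_set += rng.sample(by_class[c], min(exemplars_per_class, len(by_class[c])))
```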
3.2 Vision-Language Model

This paper focuses on contrastive language-image pre-training (CLIP) [46] as the VLM. During pre-training, CLIP jointly learns an image encoder $g_i(\cdot): \mathbb{R}^D \to \mathbb{R}^d$ and a text encoder $g_t(\cdot): \mathbb{R}^{D_t} \to \mathbb{R}^d$
in a contrastive manner, where $D$ and $D_t$ are the input dimensions of image/text, and $d$ is the embedding
dimension. CLIP projects a batch of image-text pairs into a shared embedding space. It maximizes
the cosine similarity of paired inputs and minimizes it for unmatched ones. Benefiting from the
massive training data, CLIP can synthesize a zero-shot classifier that generalizes to unseen classes.
The output of CLIP is formulated as:
$$p(y_i \mid \mathbf{x}) = \frac{\exp\left(\cos(\mathbf{z}, \mathbf{w}_i)/\tau\right)}{\sum_{j=1}^{|\mathcal{Y}_b|} \exp\left(\cos(\mathbf{z}, \mathbf{w}_j)/\tau\right)}, \tag{2}$$
where $\cos(\cdot, \cdot)$ denotes cosine similarity, $\tau$ is a learnable temperature parameter, and $\mathbf{z} = g_i(\mathbf{x})$ is the image
embedding. Correspondingly, $\mathbf{w}_i$ is the text embedding of class $y_i$, obtained by feeding templated
texts, e.g., “a photo of a [CLASS]”, into the text encoder. We denote the templated text of class $i$ as $t_i$.
Eq. 2 aims to find the most similar text ti that maximizes the cosine similarity to the query image.
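To ground Eq. 2, the sketch below reproduces CLIP's zero-shot classifier with OpenAI's public `clip` package; the backbone choice ("ViT-B/16"), the class names, and the prompt template are illustrative assumptions.

```python
# Zero-shot classification as in Eq. 2: cosine similarity between the image embedding z and
# the text embeddings w_i of templated class names, scaled by the learned temperature.
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)   # backbone choice is an assumption

class_names = ["panda", "tiger", "goldfish"]               # stands in for the seen classes Y_b
text_tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

@torch.no_grad()
def zero_shot_probs(image):                                # image: a preprocess()-ed [3, H, W] tensor
    z = model.encode_image(image.unsqueeze(0).to(device))  # z = g_i(x)
    w = model.encode_text(text_tokens)                     # w_i = g_t(t_i)
    z = z / z.norm(dim=-1, keepdim=True)                   # normalize so the dot product is cos(., .)
    w = w / w.norm(dim=-1, keepdim=True)
    logits = model.logit_scale.exp() * z @ w.t()           # cos(z, w_j) / tau
    return logits.softmax(dim=-1)                          # p(y_i | x), as in Eq. 2
```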
3.3 Overcoming Forgetting in Class-Incremental Learning

CIL, as a long-standing problem, has garnered significant attention from the research community. In
this section, we introduce two typical solutions for adapting pre-trained models with new classes.
Vision-Based Learning: Traditional CIL methods primarily rely on the image encoder to capture
the patterns of new classes. One such method, L2P [64], leverages visual prompt tuning [26] to
enable incremental updates of a pre-trained Vision Transformer [13]. By keeping the image encoder
frozen, L2P trains a learnable prompt pool Pool and combines it with patch embeddings to obtain
instance-specific embeddings. The optimization target can be formulated as:
$$\mathcal{L} = \ell\left(h\left(\bar{g}_i(\mathbf{x}_i, \mathrm{Pool})\right), y_i\right) + \mathcal{L}_{reg}, \tag{3}$$
where $h(\cdot)$ is the classification head, $\bar{g}_i$ is the frozen image encoder, and $\mathcal{L}_{reg}$ is the regularization loss
for prompt selection. By freezing the encoder, Eq. 3 grasps new patterns with little forgetting.
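The sketch below conveys the spirit of Eq. 3 rather than L2P's exact implementation: a frozen backbone, a learnable prompt pool with key-based selection, and a trainable head. The pool size, the top-k selection, and the `extra_tokens` hook on the backbone are hypothetical simplifications.

```python
# A rough sketch of the vision-based objective in Eq. 3. Only the prompt pool, keys, and head
# are trainable; the backbone \bar{g}_i stays frozen.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptPoolClassifier(nn.Module):
    def __init__(self, frozen_vit, embed_dim, num_classes, pool_size=10, prompt_len=5, top_k=3):
        super().__init__()
        self.vit = frozen_vit
        for p in self.vit.parameters():                    # keep the image encoder frozen
            p.requires_grad_(False)
        self.prompt_keys = nn.Parameter(torch.randn(pool_size, embed_dim))
        self.prompts = nn.Parameter(torch.randn(pool_size, prompt_len, embed_dim))
        self.head = nn.Linear(embed_dim, num_classes)      # h(.)
        self.top_k = top_k

    def forward(self, x):
        query = self.vit(x)                                # [B, d] instance query from the frozen encoder
        sim = F.cosine_similarity(query.unsqueeze(1), self.prompt_keys, dim=-1)   # [B, pool_size]
        top = sim.topk(self.top_k, dim=1).indices
        picked = self.prompts[top].flatten(1, 2)           # [B, top_k * prompt_len, d]
        feat = self.vit(x, extra_tokens=picked)            # hypothetical hook that prepends prompts
        reg = -sim.gather(1, top).mean()                   # L_reg: pull selected keys toward the query
        return self.head(feat), reg

# Training step per Eq. 3: logits, reg = model(x); loss = F.cross_entropy(logits, y) + reg
```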
CLIP Tuning: The issue of tuning VLM without forgetting in CIL remains unaddressed, as previous
works have solely focused on transferring CLIP to downstream tasks without considering the performance of former tasks. For instance, CoOp [85] converts text inputs into a learnable prompt, i.e.,
$t_i = [V]_1 [V]_2 \cdots [V]_M [\mathrm{CLASS}]_i$. The posterior probability in Eq. 2 is transformed into:
$$p(y_i \mid \mathbf{x}) = \frac{\exp\left(\cos(\mathbf{z}, g_t(t_i))/\tau\right)}{\sum_{j=1}^{|\mathcal{Y}_b|} \exp\left(\cos(\mathbf{z}, g_t(t_j))/\tau\right)}. \tag{4}$$
With the help of the learned prompt, Eq. 4 enables the model to be transferred to the downstream
task. However, since the prompt template is shared for all tasks, sequentially tuning CoOp will suffer
catastrophic forgetting of former concepts.
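For contrast, here is a minimal sketch of the CoOp-style prompt in Eq. 4: M shared, learnable context vectors [V]_1...[V]_M are concatenated with frozen class-name token embeddings before the frozen text encoder. The encoder interface (token embeddings in, class embeddings out) and the tensor shapes are assumptions.

```python
# CoOp-style learnable prompt: only the shared context vectors are trained, so tuning them on
# a new task overwrites whatever context earlier tasks relied on.
import torch
import torch.nn as nn

class LearnableTextPrompt(nn.Module):
    def __init__(self, frozen_text_encoder, class_token_embeds, ctx_len=16, embed_dim=512):
        super().__init__()
        self.text_encoder = frozen_text_encoder            # g_t, kept frozen
        for p in self.text_encoder.parameters():
            p.requires_grad_(False)
        self.ctx = nn.Parameter(torch.empty(ctx_len, embed_dim))   # [V]_1 ... [V]_M
        nn.init.normal_(self.ctx, std=0.02)
        # [num_classes, name_len, embed_dim] token embeddings of each class name, kept fixed
        self.register_buffer("class_embeds", class_token_embeds)

    def forward(self):
        n = self.class_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n, -1, -1)      # the same context is shared by all classes
        tokens = torch.cat([ctx, self.class_embeds], dim=1)        # t_i = [V]_1..[V]_M [CLASS]_i
        return self.text_encoder(tokens)                   # w_i = g_t(t_i), plugged into Eq. 4
```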
Discussions: Current methods focus on different aspects of CIL. Vision-based methods (e.g., Eq. 3)
address the issue of forgetting but neglect the valuable semantic information conveyed in texts.
Conversely, CLIP’s pre-trained text encoder captures class-wise relationships that can enhance model
learning. Meanwhile, transfer learning methods (e.g., Eq. 4) effectively leverage cross-modal
information, but sequentially tuning them suffers from catastrophic forgetting of former concepts. Is
it possible to combine the cross-modal information and meanwhile resist catastrophic forgetting?
4 PROOF: Projection Fusion for VLM
Observing the limitations of typical vision-based methods in utilizing textual information and
forgetting in CLIP tuning, we aim to leverage cross-modality knowledge in CLIP while effectively
mitigating forgetting. To this end, we must make the model retentive and comprehensive. Retentive
represents the ability to adapt to downstream tasks without forgetting, and we propose projections
to map the pre-trained features in the projected feature space. Our unique training strategy ensures
the preservation of former knowledge by freezing old projections and expanding new ones for new
tasks. The comprehensive aspect involves co-adapting and utilizing cross-modal information to
enhance unified predictions. The query instance’s embedding is influenced by both visual and textual
information, allowing for instance-specific adaptation and enabling comprehensive predictions.
In the following sections, we introduce the learning paradigm and the co-adaptation process. Lastly,
we provide detailed guidelines for training and inference.
4.1 Expandable Feature Projection

CLIP is known for its strong zero-shot performance [46], i.e., Eq. 2 obtains competitive results even
without explicit training on the specific tasks. However, given the domain gap between pre-trained
and downstream tasks, an adaptation process is still necessary to capture the characteristics of the
latter. Specifically, we introduce a linear layer (denoted as “projection”) which is appended after the
frozen image and text embeddings to facilitate the matching of pair-wise projected features. Denoting
the projection of image/text as $P_i(\cdot): \mathbb{R}^d \to \mathbb{R}^d$ and $P_t(\cdot): \mathbb{R}^d \to \mathbb{R}^d$, Eq. 2 is transformed into:
$$p(y_i \mid \mathbf{x}) = \underbrace{\frac{\exp\left(\cos(P_i(\mathbf{z}), P_t(\mathbf{w}_i))/\tau\right)}{\sum_{j=1}^{|\mathcal{Y}_b|} \exp\left(\cos(P_i(\mathbf{z}), P_t(\mathbf{w}_j))/\tau\right)}}_{\text{Projected Matching}}. \tag{5}$$

We denote the classification based on Eq. 5 as $f_{\mathrm{PM}}(\mathbf{x})$. By freezing the image and text encoders, it
aligns the downstream features in the projected space, allowing the model to encode the relevant
downstream information into projection layers. Since the pre-trained model outputs generalizable
features, the projection layer learns to recombine features in a data-driven manner. For instance, in a
task involving ‘birds,’ the projection would assign a higher weight to features like ‘beaks’ and ‘wings.’
This adaptation enables the projected features to better discern and recognize downstream tasks.
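A minimal sketch of the projected matching in Eq. 5, assuming 512-dimensional CLIP embeddings and a fixed inverse temperature: the frozen encoders produce z and w elsewhere, and only the two linear projections are trained.

```python
# Projected matching (Eq. 5): re-map frozen image/text embeddings with linear projections
# P_i, P_t, then classify by the cosine similarity of the projected pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectedMatching(nn.Module):
    def __init__(self, embed_dim=512, inv_temperature=100.0):
        super().__init__()
        self.proj_img = nn.Linear(embed_dim, embed_dim)    # P_i
        self.proj_txt = nn.Linear(embed_dim, embed_dim)    # P_t
        self.inv_temperature = inv_temperature             # 1 / tau, fixed here for simplicity

    def forward(self, z, w):
        """z: [B, d] frozen image embeddings; w: [C, d] frozen text embeddings of seen classes."""
        zi = F.normalize(self.proj_img(z), dim=-1)         # P_i(z)
        wt = F.normalize(self.proj_txt(w), dim=-1)         # P_t(w_j)
        return self.inv_temperature * zi @ wt.t()          # logits; a softmax over them gives Eq. 5
```

Since gradients reach only the projections, the downstream information is confined to these small layers while the CLIP backbones stay intact.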
Expandable Projections: However, sequentially training a single projection layer still leads to
forgetting of former tasks, resulting in confusion when combining old and new concepts. To
this end, we expand task-specific projections for each new task. Specifically, we append a newly
initialized projection layer $P_i^b, P_t^b$ when a new task $\mathcal{D}^b$ arrives. This results in a set of projections:
$\{P_i^1, P_i^2, \cdots, P_i^b\}$, $\{P_t^1, P_t^2, \cdots, P_t^b\}$, and we adopt the aggregation as the output, i.e.,
$$P_i(\mathbf{z}) = \sum_{m=1}^{b} P_i^m(\mathbf{z}), \qquad P_t(\mathbf{w}) = \sum_{n=1}^{b} P_t^n(\mathbf{w}). \tag{6}$$
In Eq. 6, projected features from different stages are mapped and aggregated to capture the different
emphases of former and latter tasks. For example, former tasks might emphasize ‘beak’ features
[Figure: overview with the image encoder, visual prototypes, and the text prompt “A photo of a panda”.]
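To make the expansion and freezing strategy of Eq. 6 concrete, the sketch below keeps one projection pair per task, freezes all former pairs when a new task arrives, and sums their outputs; the module name and the use of plain linear layers are my own assumptions.

```python
# Expandable projections (Eq. 6): P_i(z) = sum_m P_i^m(z), P_t(w) = sum_n P_t^n(w), with the
# projections of former tasks frozen so that old knowledge is preserved.
import torch.nn as nn

class ExpandableProjection(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        self.img_projs = nn.ModuleList()                   # {P_i^1, ..., P_i^b}
        self.txt_projs = nn.ModuleList()                   # {P_t^1, ..., P_t^b}
        self.embed_dim = embed_dim

    def expand(self):
        """Call at the start of task b: freeze former projections, then append trainable new ones."""
        for proj in list(self.img_projs) + list(self.txt_projs):
            for p in proj.parameters():
                p.requires_grad_(False)
        self.img_projs.append(nn.Linear(self.embed_dim, self.embed_dim))
        self.txt_projs.append(nn.Linear(self.embed_dim, self.embed_dim))

    def forward(self, z, w):
        # expand() must have been called at least once before the first forward pass
        pz = sum(proj(z) for proj in self.img_projs)       # aggregated image projection P_i(z)
        pw = sum(proj(w) for proj in self.txt_projs)       # aggregated text projection P_t(w)
        return pz, pw
```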