Paper from: https://ku-cvlab.github.io/locotrack/
Local All-Pair Correspondence for Point Tracking
Seokju Cho\({}^{1}\), Jiahui Huang\({}^{2}\), Jisu Nam\({}^{1}\), Honggyu An\({}^{1}\),
Seungryong Kim\({}^{1,\dagger}\), and Joon-Young Lee\({}^{2,\dagger}\)
Fig. 1: Evaluating LocoTrack against state-of-the-art methods. We compare our LocoTrack against other SOTA methods [12,24] in terms of model size (circle size), accuracy (y-axis), and throughput (x-axis). LocoTrack shows exceptionally high precision and efficiency.
Abstract. We introduce LocoTrack, a highly accurate and efficient model designed for the task of tracking any point (TAP) across video sequences. Previous approaches in this task often rely on local 2D correlation maps to establish correspondences from a point in the query image to a local region in the target image, which often struggle with homogeneous regions or repetitive features, leading to matching ambiguities. LocoTrack overcomes this challenge with a novel approach that utilizes all-pair correspondences across regions, i.e., local 4D correlation, to establish precise correspondences, with bidirectional correspondence and matching smoothness significantly enhancing robustness against ambiguities. We also incorporate a lightweight correlation encoder to enhance computational efficiency, and a compact Transformer architecture to integrate long-term temporal information. LocoTrack achieves unmatched accuracy on all TAP-Vid benchmarks and operates at a speed almost \(6 \times\) faster than the current state-of-the-art.
1 Introduction
Finding corresponding points across different views of a scene, a process known as point correspondence [1,31,68], is one of the fundamental problems in computer vision, with a variety of applications such as 3D reconstruction [34,49], autonomous driving [21,40], and pose estimation [47-49]. Recently, the emerging point tracking task [11,15] addresses point correspondence across a video. Given an input video and a query point on a physical surface, the task aims to find the corresponding position of the query point in every target frame along with its visibility status. This task demands a sophisticated understanding of motion over time and a robust capability for matching points accurately.
\({}^{ \dagger }\) Co-corresponding authors.
Fig. 2: Illustration of our core component. Our local all-pair formulation, achieved with local 4D correlation, demonstrates robustness against matching ambiguity. This contrasts with previous works [12, 15, 24, 63] that rely on point-to-region correspondences, achieved with local 2D correlation, which are susceptible to the ambiguity.
Recent methods in this task often rely on constructing a 2D local correlation map [12,15,24,63], comparing the deep features of a query point with a local region of the target frame to predict the corresponding positions. However, this approach encounters substantial difficulties in precisely identifying positions within homogeneous areas or regions with repetitive patterns, or in differentiating among co-occurring objects [46,57,68]. To resolve the matching ambiguities that arise in these challenging scenarios, establishing effective correspondence between frames is crucial. Existing works attempt to resolve these ambiguities by considering the temporal context [12,15,24,69]; however, in cases of severe occlusion or complex scenes, challenges often persist.
In this work, we aim to alleviate the problem with better spatial context, which is lacking in local 2D correlations. We revisit dense correspondence methods [6,7,28,44,58,59], as they demonstrate robustness against matching ambiguity by leveraging rich spatial context. Dense correspondence establishes a corresponding point for every point in an image. To achieve this, these methods often calculate similarities for every pair of points across two images, resulting in a 4D correlation volume [6,37,44,57,58]. This high-dimensional tensor provides dense bidirectional correspondence, offering matching priors that 2D correlation does not, such as dense matching smoothness from one image to another and vice versa. For example, 4D correlation can enforce the constraint that the correspondence of one point to another image is spatially coherent with the correspondences of its neighboring points [46]. However, incorporating the advantages of dense correspondence, which stem from the use of 4D correlation, into point tracking poses significant challenges. Not only does it introduce a substantial computational burden, but the high dimensionality of the correlation also necessitates a dedicated design for proper processing [6,35,46].
We solve the problem by formulating point tracking as a local all-pair correspondence problem, in contrast to predominant point-to-region correspondence methods [12,15,24,63], as illustrated in Fig. 2. We construct a local 4D correlation that finds all-pair matches between the local region around a query point and a corresponding local region on the target frame. With this formulation, our framework gains the ability to resolve matching ambiguities, provided by 4D correlation, while maintaining efficiency due to a constrained search range. The local 4D correlation is then processed by a lightweight correlation encoder carefully designed to handle the high-dimensional correlation volume. This encoder decomposes the processing into two branches of 2D convolution layers and produces a compact correlation embedding. We then use a Transformer [24] to integrate temporal context into the embeddings. The Transformer's global receptive field facilitates effective modeling of long-range dependencies despite its compact architecture. Our experiments demonstrate that a stack of 3 Transformer layers is sufficient to significantly outperform the state of the art [12,24]. Additionally, we found that using relative position bias [42,43,50] allows the Transformer to process sequences of variable length. This enables our model to handle long videos without the need for a hand-designed chaining process [15,24].
Our model, dubbed LocoTrack, outperforms recent state-of-the-art models while maintaining an extremely lightweight architecture, as illustrated in Fig. 1. Specifically, our small model variant achieves a +2.5 AJ increase on the TAP-Vid-DAVIS dataset compared to CoTracker [24] and offers \(6 \times\) faster inference speed. Additionally, it surpasses TAPIR [12] by +5.6 AJ with \(3.5 \times\) faster inference on the same dataset. Our larger variant, while still faster than competing state-of-the-art models [12,24], demonstrates even further performance gains.
In summary, LocoTrack is a highly efficient and accurate model for point tracking. Its core components include a novel local all-pair correspondence formulation, leveraging dense correspondence to improve robustness against matching ambiguity, a lightweight correlation encoder that ensures computational efficiency, and a Transformer for incorporating temporal information over variable context lengths.
2 Related Work
Point correspondence. The aim of point correspondence, also known as sparse feature matching [10,13,31,48], is to identify corresponding points across images within a set of detected points. This is often achieved by matching hand-designed descriptors [1,31] or, more recently, learnable deep features [10,22,32,52]. These approaches are also applicable to videos [40], as the task primarily targets image pairs with large baselines, which is similar to the case with video frames. They filter out noisy correspondences using geometric constraints [16,49,56] or their learnable counterparts [22,48,68]. However, they often struggle with objects that exhibit deformation [67]. Also, they primarily target the correspondence of geometrically salient points (i.e., detected points) rather than arbitrary points.
Long-range point correspondence in video. Recent methods [2,11,12,15,24,63,69] find point correspondences in a video, aiming to find a track for a query point over a long video sequence. They capture long-range temporal context with an MLP-Mixer [2,15], 1D convolution [12,69], or a Transformer [24]. However, they either leverage a constrained sequence length within a local temporal window and use sliding window inference to process videos longer than the fixed window size [2,15,24], or they necessitate a series of convolution layers to expand the temporal receptive field [12,69]. The recent CoTracker [24] leverages spatial context by aggregating supporting tracks with self-attention. However, this approach requires tracking additional query points, which introduces significant computational overhead. Notably, Context-PIPs [2] constructs a correlation map across sparse points around the query and the target region. However, this sparsity may limit the model's ability to fully leverage the matching prior that all-pair correlation can provide, such as matching smoothness.
Dense correspondence. Dense correspondence [28] aims to establish pixel-wise correspondence between a pair of images. Conventional methods [6,18,33,37,45,54,58,60] often leverage a 4-dimensional correlation volume, which computes pairwise cosine similarity between localized deep feature descriptors from two images, as the 4D correlation provides a means to disambiguate the matching process. Traditionally, bidirectional matches from 4D correlation are filtered to remove spurious matches using techniques such as the second nearest neighbor ratio test [31] or the mutual nearest neighbor constraint. Recent methods instead learn patterns within the correlation map to disambiguate matches. DGC-Net [33] and GLU-Net [58] proposed coarse-to-fine architectures leveraging global 4D correlation followed by local 2D correlation. CATs [6,7] propose a Transformer-based architecture to aggregate the global 4D correlation. GoCor [57], NCNet [45], and RAFT [54] developed efficient frameworks using local 4D correlation to learn spatial priors in both images of a pair, addressing matching ambiguities.
The use of 4D correlation extends beyond dense correspondence. It has been widely applied in fields such as video object segmentation [5,39], few-shot semantic segmentation [19,35], and few-shot classification [23]. However, its application to point tracking remains underexplored. Instead, several attempts have been made to integrate the strengths of off-the-shelf dense correspondence models [54] into point tracking. These include chaining dense correspondences [15,54], which has limitations in recovering from occlusion, or directly finding correspondences with distant frames [36,38,64], which is computationally expensive.
3 Method
In this work, we integrate the effectiveness of a 4D correlation volume into our point tracking pipeline. Compared to the widely used 2D correlation 11,12,15
在这项工作中,我们将4D相关体积的有效性整合到我们的点跟踪流程中。与广泛使用的2D相关性相比,11,12,15
Fig. 3: Overall architecture of LocoTrack. Our model comprises two stages: track initialization and track refinement. The track initialization stage determines a rough position by conducting feature matching with global correlation. The track refinement stage iteratively refines the track by processing the local 4D correlation.
图3:LocoTrack的整体架构。我们的模型包括两个阶段:轨迹初始化和轨迹细化。轨迹初始化阶段通过全局相关性进行特征匹配来确定一个粗略位置。轨迹细化阶段通过处理局部4D相关性来迭代细化轨迹。
24, 4D correlation offers two distinct characteristics that provide valuable information for filtering out noisy correspondences, leading to more robust tracking:
24, 4D相关性具有两个独特的特性,这些特性提供了有价值的信息,用于过滤噪声对应关系,从而实现更稳健的跟踪:
- Bidirectional correspondence: 4D correlation provides bidirectional correspondences, which can be used to verify matches and reduce ambiguity [31]. This prior is often leveraged by checking for mutual consensus [46] or by employing a ratio test [31].
- Smooth matching: A 4D correlation volume is constructed using dense all-pair correlations, which can be leveraged to enforce matching smoothness and improve matching consistency across neighboring points [46,57,58].
We aim to leverage these benefits of the 4D correlation volume while maintaining efficient computation. We achieve this by restricting the search space to a local neighborhood when constructing the 4D correlation volume. Along with the use of local 4D correlation, we also propose a recipe to benefit from the global receptive field of Transformers for long-range temporal modeling. This enables our model to capture long-range context with only a few (even just 3) stacked Transformer layers, resulting in a compact architecture.
Our method, dubbed LocoTrack, takes as input a query point \(q = (x_q, y_q, t_q) \in \mathbb{R}^{3}\) and a video \(\mathcal{V} = \{\mathcal{I}_t\}_{t=0}^{T-1}\), where \(T\) indicates the number of frames and \(\mathcal{I}_t \in \mathbb{R}^{H \times W \times 3}\) represents the \(t\)-th frame. We assume the query point can be given at an arbitrary time step. Our goal is to produce a track \(\mathcal{T} = \{\mathcal{T}_t\}_{t=0}^{T-1}\), where \(\mathcal{T}_t \in \mathbb{R}^{2}\), and associated occlusion probabilities \(\mathcal{O} = \{\mathcal{O}_t\}_{t=0}^{T-1}\), where \(\mathcal{O}_t \in [0,1]\). Following previous works [12,24], our method predicts the track in a two-stage approach: an initialization stage followed by a refinement stage, each detailed in the following and illustrated in Fig. 3.
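For concreteness, the shapes involved in this formulation are summarized in the minimal sketch below; the variable names are illustrative only.

```python
import jax.numpy as jnp

# Shape-only sketch of the inputs and outputs defined above (names hypothetical).
T, H, W = 24, 256, 256
video = jnp.zeros((T, H, W, 3))           # V = {I_t}, t = 0, ..., T-1
query = jnp.array([128.0, 64.0, 0.0])     # q = (x_q, y_q, t_q), any time step
tracks = jnp.zeros((T, 2))                # T_t in R^2, one position per frame
occlusion = jnp.zeros((T,))               # O_t in [0, 1], occlusion probability
```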
3.1 Stage I: Track Initialization
To estimate the initial track of a given query point, we conduct feature matching that constructs a global similarity map between features derived from the query point and the target frame's feature map, and choose the positions with the highest scores as the initial track. This similarity map, often referred to as a correlation map, provides a strong signal for accurately initializing the track's positions. We use a global correlation map for the initialization stage, which calculates the similarity for every pixel in each frame.

Fig. 4: Visualization of correspondence. We visualize the correspondences established between the query and target regions. Our refined 4D correlation (e) demonstrates a clear reduction in matching ambiguity and yields better correspondences compared to the noisy results produced by 2D correlation (d). This improvement aligns closely with the ground truth (c).
Specifically, we use hierarchical feature maps derived from the feature backbone [17]. Given a set of pyramidal feature maps \(\{F_t^l\}_{t=0}^{T-1} = \mathcal{E}(\mathcal{V})\), where \(\mathcal{E}(\cdot)\) represents the feature extractor and \(F_t^l\) indicates a level-\(l\) (\(l \in \{0,\ldots,L-1\}\)) feature map of frame \(t\), we sample a query feature vector \(F^l(q)\) at position \(q\) from \(F^l\) using linear interpolation for each level \(l\). The global correlation map is calculated as \(\mathrm{C}_t^l = \frac{F_t^l \cdot F^l(q)}{\|F_t^l\|_2 \|F^l(q)\|_2} \in \mathbb{R}^{H^l \times W^l}\), where \(H^l\) and \(W^l\) denote the height and width of the feature map at the \(l\)-th level, respectively. The correlation maps obtained from the multiple levels are resized to the largest feature map size and concatenated as \(\mathrm{C}_t \in \mathbb{R}^{H^0 \times W^0 \times L}\). The concatenated maps are processed as follows to generate the initial track and occlusion probabilities:
\[\mathcal{T}_{t}^{0} = \operatorname{Softargmax}\left( \operatorname{Conv2D}\left( \mathrm{C}_{t}\right) ; \tau \right), \qquad \mathcal{O}_{t}^{0} = \operatorname{Linear}\left( \left\lbrack \operatorname{Maxpool}\left( \mathrm{C}_{t}\right) ; \operatorname{Avgpool}\left( \mathrm{C}_{t}\right) \right\rbrack \right), \tag{1}\]
where Conv2D: \(\mathbb{R}^{H \times W \times L} \rightarrow \mathbb{R}^{H \times W}\) is a single 2D convolution layer, Softargmax: \(\mathbb{R}^{H \times W} \rightarrow \mathbb{R}^{2}\) is a differentiable argmax function with a Gaussian kernel [27] that provides the 2D position of the maximum value, \(\tau\) is a temperature parameter, \([\cdot\,;\cdot]\) indicates concatenation, and Linear: \(\mathbb{R}^{2L} \rightarrow \mathbb{R}\) is a linear projection. Similar to CBAM [65], we apply global max and average pooling followed by a linear projection to calculate the initial occlusion probabilities.
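To make Stage I concrete, below is a minimal JAX sketch of the multi-level global correlation map and the readout of Eq. (1). The function names are hypothetical; we assume level 0 is the highest-resolution level, the Gaussian kernel of the softargmax [27] is omitted, and how the temperature \(\tau\) enters the softmax is an assumption.

```python
import jax
import jax.numpy as jnp

def global_correlation(feat_maps, query_feats):
    """Multi-level global correlation map C_t (Sec. 3.1): cosine similarity between
    every pixel of each level and the query feature, resized to the largest level
    and stacked along a level axis."""
    H0, W0 = feat_maps[0].shape[:2]                   # level 0 assumed largest
    corrs = []
    for F_l, f_q in zip(feat_maps, query_feats):      # F_l: [H_l, W_l, C], f_q: [C]
        F_n = F_l / (jnp.linalg.norm(F_l, axis=-1, keepdims=True) + 1e-6)
        q_n = f_q / (jnp.linalg.norm(f_q) + 1e-6)
        corr = jnp.einsum('hwc,c->hw', F_n, q_n)
        corrs.append(jax.image.resize(corr, (H0, W0), method='bilinear'))
    return jnp.stack(corrs, axis=-1)                  # [H_0, W_0, L]

def soft_argmax(score_map, tau=20.0):
    """Differentiable argmax of Eq. (1) over a [H, W] score map."""
    H, W = score_map.shape
    probs = jax.nn.softmax(score_map.reshape(-1) * tau)   # tau sharpens (assumption)
    ys, xs = jnp.meshgrid(jnp.arange(H), jnp.arange(W), indexing='ij')
    coords = jnp.stack([xs.reshape(-1), ys.reshape(-1)], axis=-1)   # (x, y) order
    return (probs[:, None] * coords).sum(axis=0)          # [2] expected position

def occlusion_logit(corr, w, b):
    """Occlusion branch of Eq. (1): global max/avg pooling over [H, W, L],
    then a linear projection with w: [2L], b: scalar."""
    pooled = jnp.concatenate([corr.max(axis=(0, 1)), corr.mean(axis=(0, 1))])
    return pooled @ w + b
```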
3.2 Stage II: Track Refinement
We found that the initial track \(\mathcal{T}^{0}\) and occlusion \(\mathcal{O}^{0}\) often exhibit severe jittering, arising from the matching ambiguity of the noisy correlation map. We therefore iteratively refine the noise in the initial tracks \(\mathcal{T}^{0}\) and \(\mathcal{O}^{0}\). For each iteration, we estimate the residuals \(\Delta\mathcal{T}^{k}\) and \(\Delta\mathcal{O}^{k}\), which are then applied to the tracks as \(\mathcal{T}^{k+1} := \mathcal{T}^{k} + \Delta\mathcal{T}^{k}\) and \(\mathcal{O}^{k+1} := \mathcal{O}^{k} + \Delta\mathcal{O}^{k}\). During the refining process, the matching noise can be rectified in two ways: 1) by establishing locally dense correspondences with local 4D correlation, and 2) through temporal modeling with a Transformer [62].
Local 4D correlation. The 2D correlation \(\mathrm{C}_{t}\) often exhibits limitations when dealing with repetitive patterns or homogeneous regions, as exemplified in Fig. 4. Inspired by the dense correspondence literature, we utilize 4D correlation to provide richer information for refining tracks compared to 2D correlation. The 4D correlation \(\mathrm{C}^{4\mathrm{D}} \in \mathbb{R}^{H \times W \times H \times W}\), which computes every pairwise similarity, can be formally defined as follows:
\[\mathrm{C}_{t}^{4\mathrm{D}}(i,j) = \frac{F_{t}(i) \cdot F_{t_q}(j)}{\|F_{t}(i)\|_{2}\,\|F_{t_q}(j)\|_{2}}, \tag{2}\]
where \(F_{t_q}\) is the feature map of the frame in which the query point is located, and \(i\) and \(j\) specify locations within the feature maps. However, since a global 4D correlation volume with shape \(H \times W \times H \times W\) becomes computationally intractable, we employ a local 4D correlation \(\mathrm{L} \in \mathbb{R}^{h_p \times w_p \times h_q \times w_q}\), where \((h_p, w_p, h_q, w_q)\) denotes the spatial resolution of the local correlation. We define the correlation as follows:
\[\mathcal{N}(p, r) = \left\{ p + \delta \mid \delta \in \mathbb{Z}^{2},\ \|\delta\|_{\infty} \leq r \right\},\]
\[\mathrm{L}_{t}(i,j;p) = \frac{F_{t}(i) \cdot F_{t_q}(j)}{\|F_{t}(i)\|_{2}\,\|F_{t_q}(j)\|_{2}}, \quad i \in \mathcal{N}(p; r_p),\ j \in \mathcal{N}(q; r_q), \tag{3}\]
where \(r_p\) and \(r_q\) are the radii of the regions around points \(p\) and \(q\), respectively, resulting in \(h_p = w_p = 2r_p + 1\) and \(h_q = w_q = 2r_q + 1\). The correlation then serves as a cue for refining the track \(\mathcal{T}^{k}\). To achieve this, we calculate the set of local correlations around the intermediate predicted positions, denoted as \(\{\mathrm{L}_{t}(\mathcal{T}_{t}^{k})\}_{t=0}^{T-1}\) with slight abuse of notation.
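Below is a minimal JAX sketch of the local 4D correlation of Eq. (3). It gathers square patches of radius \(r_p\) and \(r_q\) on an integer grid and computes pairwise cosine similarities; the paper samples at sub-pixel positions via interpolation, which is omitted here, and the function name is hypothetical.

```python
import jax
import jax.numpy as jnp

def local_4d_correlation(F_t, F_q, p, q, r_p=3, r_q=3):
    """Local 4D correlation L_t (Eq. (3)) as a [h_p, w_p, h_q, w_q] volume.

    F_t, F_q : [H, W, C] feature maps of the target and query frames.
    p, q     : (x, y) integer centers on the target / query frame.
    """
    def extract_patch(F, center, r):
        x, y = center
        # dynamic_slice takes (row_start, col_start, channel_start) and a static size;
        # slices near the image border are clipped.
        return jax.lax.dynamic_slice(F, (y - r, x - r, 0),
                                     (2 * r + 1, 2 * r + 1, F.shape[-1]))

    P_t = extract_patch(F_t, p, r_p)                     # [h_p, w_p, C]
    P_q = extract_patch(F_q, q, r_q)                     # [h_q, w_q, C]
    P_t = P_t / (jnp.linalg.norm(P_t, axis=-1, keepdims=True) + 1e-6)
    P_q = P_q / (jnp.linalg.norm(P_q, axis=-1, keepdims=True) + 1e-6)
    return jnp.einsum('abc,dec->abde', P_t, P_q)         # [h_p, w_p, h_q, w_q]
```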
Local 4D correlation encoder. We then process the local 4D correlation volume to disambiguate matching ambiguities, leveraging the smoothness along both the query and target dimensions of the correlation. Note that the obtained 4D correlation is a high-dimensional tensor, posing an additional challenge for its correct processing. In this regard, we introduce an efficient encoding strategy that decomposes the processing of the correlation. We process the 4D correlation in two symmetrical branches, as shown in Fig. 5. One branch spatially processes the dimensions of the query, treating the flattened target dimensions as a channel dimension; the other branch, conversely, treats the query dimensions as channels. Each branch compresses the correlation into a single vector, and the two vectors are concatenated to form a correlation embedding \(E_{t}^{k}\):
\[E_{t}^{k} = \left\lbrack \mathcal{E}_{\mathrm{L}}\left( \mathrm{L}_{t}\left( \mathcal{T}_{t}^{k}\right) \right) ;\ \mathcal{E}_{\mathrm{L}}\left( \left( \mathrm{L}_{t}\left( \mathcal{T}_{t}^{k}\right) \right)^{T}\right) \right\rbrack, \tag{4}\]
where \(\mathrm{L}(i,j) = \mathrm{L}^{T}(j,i)\). The convolutional encoder \(\mathcal{E}_{\mathrm{L}} : \mathbb{R}^{h_p \times w_p \times h_q \times w_q} \rightarrow \mathbb{R}^{C_E}\) consists of stacks of strided 2D convolutions, group normalization [66], and ReLU activations. These operations progressively reduce the correlation's spatial dimensions, followed by a final average pooling layer for a compact representation. We obtain the correlation embedding for every feature level \(l\) and concatenate them to form the final embedding. For more details on the local 4D correlation encoder, please refer to the supplementary material.

Fig. 5: Local 4D correlation encoder.
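The two-branch decomposition can be sketched in JAX as below. Here `conv_encoder_2d` stands in for the shared encoder \(\mathcal{E}_{\mathrm{L}}\) (a stack of strided 2D convolutions with group normalization and ReLU, see the supplementary material); the reshaping logic is the point of the sketch, and the function names are hypothetical.

```python
import jax.numpy as jnp

def encode_local_corr(L4d, conv_encoder_2d):
    """Two-branch correlation encoding of Eq. (4).

    L4d             : [h_p, w_p, h_q, w_q] local 4D correlation.
    conv_encoder_2d : callable mapping a [h, w, c] map to a [C_E] vector
                      (shared between branches; shapes match when r_p = r_q).
    """
    h_p, w_p, h_q, w_q = L4d.shape
    # Branch 1: query dims are spatial, flattened target dims act as channels.
    branch_q = conv_encoder_2d(jnp.transpose(L4d, (2, 3, 0, 1)).reshape(h_q, w_q, -1))
    # Branch 2: target dims are spatial, flattened query dims act as channels.
    branch_t = conv_encoder_2d(L4d.reshape(h_p, w_p, -1))
    return jnp.concatenate([branch_q, branch_t], axis=-1)   # correlation embedding E_t^k
```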
Temporal modeling with a length-generalizable Transformer. The encoded correlation is then provided to the refinement model. The model refines the initial trajectory and predicts its error with respect to the ground truth, \(\Delta\mathcal{T}\) and \(\Delta\mathcal{O}\), which requires the ability to leverage temporal context. For temporal modeling, we explore three candidates widely used in the literature: 1D convolution [12,69], MLP-Mixer [15], and Transformer [24]. We consider two aspects to select the appropriate architecture: 1) Can the architecture handle arbitrary sequence lengths \(T\) at test time? 2) Can the temporal receptive field, crucial for capturing long-range context, be sufficiently large with just a few stacked layers? Based on these criteria, we choose the Transformer as our architecture because it can handle arbitrary sequence lengths, a capability the MLP-Mixer lacks. This lack would necessitate an additional test-time strategy (e.g., sliding window inference [15]) to accommodate sequences longer than those used during training. Additionally, the Transformer can form a global receptive field with a single layer, unlike convolution, which requires multiple layers to achieve an expanded receptive field.
Although the Transformer can process sequences of arbitrary length at test time, we found that sinusoidal position encoding [62] degrades performance for videos with sequence lengths that differ from those used during training. Instead, we use relative position bias [42,43,50], which disproportionately reduces the impact of distant tokens by adjusting the bias within the Transformer's attention map. However, a relative position bias based solely on the distance between tokens cannot distinguish their relative direction (e.g., whether token A is before or after token B), which makes it only suitable for causal attention. To address this, we divide the attention heads into two groups: one group encodes relative position only for tokens on the left, and the other for tokens on the right:
\[\operatorname{Softmax}\left( \mathrm{q} \cdot \mathrm{k}^{T} + b(h) \right), \quad \text{where} \quad b\left( t_{1}, t_{2}; h\right) = \begin{cases} b_{\text{left}}\left( t_{1}, t_{2}; h\right), & h < \left\lfloor N_{\mathrm{h}}/2 \right\rfloor, \\ b_{\text{right}}\left( t_{1}, t_{2}; h - \left\lfloor N_{\mathrm{h}}/2 \right\rfloor \right), & h \geq \left\lfloor N_{\mathrm{h}}/2 \right\rfloor, \end{cases} \tag{5}\]
where \(\mathrm{q}\) and \(\mathrm{k}\) denote the query and key, respectively, \(N_{\mathrm{h}}\) is the number of heads, and \(h \in \{0, \ldots, N_{\mathrm{h}} - 1\}\) is the index of the attention head. The bias term \(b_{\text{left}}\) adjusts the attention map to ensure that each query token attends only to key tokens located to its left or at the same position, as follows:
\[b_{\text{left}}\left( t_{1}, t_{2}; h\right) = \begin{cases} -\infty, & \text{if } t_{1} < t_{2}, \\ -s_{h}\left| t_{1} - t_{2}\right|, & \text{if } t_{1} \geq t_{2}, \end{cases} \tag{6}\]
where \(s_{h} \in \mathbb{R}^{+}\) is a scaling factor that controls the rate of bias decay as distance increases. We employ different scaling factors for each attention head, following Press et al. [42]. The function \(b_{\text{right}}(\cdot)\) can be similarly defined. With this design choice, we found that the Transformer generalizes to videos of arbitrary length, eliminating the need for test-time hand-designed techniques such as sliding window inference [15,24].
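To make the head-split bias concrete, below is a minimal JAX sketch of Eqs. (5)-(6); the resulting term is added to \(\mathrm{q} \cdot \mathrm{k}^{T}\) per head before the softmax. The geometric slope schedule for \(s_h\) is an assumption in the spirit of ALiBi [42]; the paper does not state the exact values, and the function name is hypothetical.

```python
import jax.numpy as jnp

def directional_relative_bias(T, num_heads):
    """Head-split relative position bias of Eqs. (5)-(6), returned as a
    [num_heads, T, T] additive term for the attention logits.
    Assumes an even number of heads (4 or 6 in the paper)."""
    dist = jnp.arange(T)[None, :] - jnp.arange(T)[:, None]    # dist[t1, t2] = t2 - t1
    half = num_heads // 2
    slopes = 2.0 ** (-8.0 * (jnp.arange(half) + 1) / half)    # s_h > 0, one per head
    neg_inf = jnp.finfo(jnp.float32).min
    # b_left: a query attends only to keys at or before its own position (t1 >= t2).
    b_left = jnp.where(dist > 0, neg_inf, -slopes[:, None, None] * jnp.abs(dist)[None])
    # b_right: a query attends only to keys at or after its own position (t1 <= t2).
    b_right = jnp.where(dist < 0, neg_inf, -slopes[:, None, None] * jnp.abs(dist)[None])
    return jnp.concatenate([b_left, b_right], axis=0)
```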
Iterative update. We stack \(N_S\) Transformer layers with the modified self-attention and feed the correlation embeddings \(\{E_t^k\}_{t=0}^{T-1}\), the encoded initial track \(\mathcal{T}^k\), and the occlusion status \(\mathcal{O}^k\) to the Transformer \(\mathcal{E}_S\) to predict track updates. We found that using position differences between adjacent frames improves training convergence compared to using absolute positions. This is formally defined as:
\[\Delta\mathcal{T}^{k}, \Delta\mathcal{O}^{k} = \mathcal{E}_{S}\left( \left\{ \left\lbrack \sigma\left( \mathcal{T}_{t}^{k} - \mathcal{T}_{t-1}^{k}\right) ;\ \sigma\left( \mathcal{T}_{t+1}^{k} - \mathcal{T}_{t}^{k}\right) ;\ \mathcal{O}_{t}^{k} ;\ E_{t}^{k}\right\rbrack \right\}_{t=0}^{T-1}\right), \qquad \mathcal{T}_{-1}^{k} := \mathcal{T}_{0}^{k}, \quad \mathcal{T}_{T}^{k} := \mathcal{T}_{T-1}^{k}, \tag{7}\]
where \(\sigma(\cdot)\) is a sinusoidal encoding [53], \([\cdot\,;\cdot]\) denotes concatenation, and \(\Delta\mathcal{T}^{k}\) and \(\Delta\mathcal{O}^{k}\) are the predicted updates. The predicted updates are then applied to the track as \(\mathcal{T}^{k+1} := \mathcal{T}^{k} + \Delta\mathcal{T}^{k}\) and \(\mathcal{O}^{k+1} := \mathcal{O}^{k} + \Delta\mathcal{O}^{k}\). We perform \(K\) iterations, yielding the final refined track \(\mathcal{T}^{K}\) and occlusion \(\mathcal{O}^{K}\).
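The overall refinement then reduces to the following loop; `refine_step` is a hypothetical stand-in for one pass of local 4D correlation extraction, the correlation encoder, and the Transformer \(\mathcal{E}_S\) described above.

```python
def refine_tracks(tracks0, occ0, refine_step, num_iters=4):
    """Iterative refinement loop of Stage II (Eq. (7)).

    tracks0, occ0 : [T, 2] initial track and [T] occlusion logits from Stage I.
    refine_step   : callable returning the residuals (delta_tracks: [T, 2], delta_occ: [T]).
    """
    tracks, occ = tracks0, occ0
    for _ in range(num_iters):                # K = 4 in the paper
        d_tracks, d_occ = refine_step(tracks, occ)
        tracks = tracks + d_tracks            # T^{k+1} := T^k + dT^k
        occ = occ + d_occ                     # O^{k+1} := O^k + dO^k
    return tracks, occ
```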
4 Experiments
4.1 Implementation Details
We use JAX [3] for implementation. For training, we utilize the Panning-MOVi-E dataset [12] generated with Kubric [14]. We employ the loss functions introduced in Doersch et al. [12], including the prediction of an additional uncertainty estimate for both the track initialization and the refinement model. We use the AdamW [30] optimizer with \(1 \cdot 10^{-3}\) for both the learning rate and the weight decay. We employ a cosine learning rate scheduler with a 1000-step warmup stage [29]. Following Sun et al. [51], we apply gradient clipping with a value of 1.0. The initialization stage is first trained for 100K steps, followed by track refinement model training for an additional 300K steps. This process takes approximately 4 days on 8 NVIDIA RTX 3090 GPUs with a batch size of 1 per GPU. For each batch, we randomly sample 256 tracks. We use a \(256 \times 256\) training resolution, following the standard protocol of the TAP-Vid benchmark.
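A sketch of this optimizer setup using Optax (a common choice for JAX training code, though the paper does not name the library) could look like the following; the decay horizon is assumed to match the 300K-step refinement stage.

```python
import optax

# Hypothetical optimizer sketch for the training setup described above.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0, peak_value=1e-3,
    warmup_steps=1_000, decay_steps=300_000)
optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),                          # gradient clipping at 1.0
    optax.adamw(learning_rate=schedule, weight_decay=1e-3),  # lr = weight decay = 1e-3
)
```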
Our feature backbone is ResNet-18 [17] with instance normalization [61] replacing batch normalization [20]. We use three pyramidal feature maps (\(L = 3\)) from the ResNet, with strides of 2, 4, and 8, respectively. The temperature for the softargmax is set to \(\tau = 20.0\). The radii of the local correlation windows are \(r_q = r_p = 3\). We stack \(N_S = 3\) Transformer layers for \(\mathcal{E}_S\). The number of iterations \(K\) is set to 4. For the track refinement model, we propose two variants: a small model and a base model. All ablations are conducted using the base model. The hidden dimension of the Transformer is set to 256 for the small model and 384 for the base model. The number of heads is set to 4 for the small model and 6 for the base model. For more details, please refer to the supplementary material.
Table 1: Quantitative comparison on the TAP-Vid datasets with the strided query mode. Throughput is measured on a single Nvidia RTX 3090 GPU.
| Method | Kinetics AJ | Kinetics $< \delta_{avg}^{x}$ | Kinetics OA | DAVIS AJ | DAVIS $< \delta_{avg}^{x}$ | DAVIS OA | RGB-Stacking AJ | RGB-Stacking $< \delta_{avg}^{x}$ | RGB-Stacking OA | Throughput (points/sec) |
|---|---|---|---|---|---|---|---|---|---|---|
| *Input Resolution 256×256* | | | | | | | | | | |
| Kubric-VFS-Like [14] | 40.5 | 59.0 | 80.0 | 33.1 | 48.5 | 79.4 | 57.9 | 72.6 | 91.9 | - |
| TAP-Net [11] | 46.6 | 60.9 | 85.0 | 38.4 | 53.1 | 82.3 | 59.9 | 72.8 | 90.4 | 29,535.98 |
| RAFT [54] | 34.5 | 52.5 | 79.7 | 30.0 | 46.3 | 79.6 | 44.0 | 58.6 | 90.4 | 23,405.71 |
| TAPIR [12] | 57.2 | 70.1 | 87.8 | 61.3 | 73.6 | 88.8 | 62.7 | 74.6 | 91.6 | 2,097.32 |
| LocoTrack-S | 59.6 | 72.7 | 88.1 | 66.9 | 78.8 | 88.9 | 77.4 | 87.0 | 92.9 | 7,244.47 |
| LocoTrack-B | 59.5 | 73.0 | 88.5 | 67.8 | 79.6 | 89.9 | 77.1 | 86.9 | 93.2 | 4,358.96 |
| *Input Resolution 384×512* | | | | | | | | | | |
| PIPs [15] | 35.3 | 54.8 | 77.4 | 42.0 | 59.4 | 82.1 | 37.3 | 51.0 | 91.6 | 46.43 |
| FlowTrack [8] | - | - | - | 66.0 | 79.8 | 87.2 | - | - | - | - |
| CoTracker [24] | - | - | - | 65.9 | 79.4 | 89.9 | - | - | - | 1,146.79 |
| LocoTrack-S | 58.7 | 72.2 | 84.5 | 68.4 | 80.4 | 87.5 | 71.0 | 84.4 | 83.3 | 6,820.57 |
| LocoTrack-B | 59.1 | 72.5 | 85.7 | 69.4 | 81.3 | 88.6 | 70.8 | 83.2 | 84.1 | 4,196.36 |
Table 2: Quantitative comparison on the query first mode.
| Method | Kinetics-First AJ | Kinetics-First $< \delta_{avg}^{x}$ | Kinetics-First OA | DAVIS-First AJ | DAVIS-First $< \delta_{avg}^{x}$ | DAVIS-First OA | RoboTAP-First AJ | RoboTAP-First $< \delta_{avg}^{x}$ | RoboTAP-First OA |
|---|---|---|---|---|---|---|---|---|---|
| *Input Resolution 256×256* | | | | | | | | | |
| TAP-Net [11] | 38.5 | 54.4 | 80.6 | 33.0 | 48.6 | 78.8 | 45.1 | 62.1 | 82.9 |
| TAPIR [12] | 49.6 | 64.2 | 85.0 | 56.2 | 70.0 | 86.5 | 59.6 | 73.4 | 87.0 |
| LocoTrack-S | 52.8 | 66.5 | 84.9 | 62.0 | 74.3 | 86.1 | 62.5 | 76.0 | 87.0 |
| LocoTrack-B | 52.9 | 66.8 | 85.3 | 63.0 | 75.3 | 87.2 | 62.3 | 76.2 | 87.1 |
| *Input Resolution 384×512* | | | | | | | | | |
| CoTracker [24] | 48.7 | 64.3 | 86.5 | 60.6 | 75.4 | 89.3 | - | - | - |
| LocoTrack-S | | | 81.2 | 63.2 | 76.2 | 84.6 | - | - | - |
| LocoTrack-B | | | 82.1 | 64.8 | 77.4 | 86.2 | - | - | - |
4.2 Evaluation Protocol
We evaluate the precision of the predicted tracks using the TAP-Vid benchmark [11] and the RoboTAP dataset [63]. For evaluation metrics, we use position accuracy (\(< \delta_{\text{avg}}^{x}\)), occlusion accuracy (OA), and average Jaccard (AJ). \(< \delta_{\text{avg}}^{x}\) measures position accuracy for the points visible in the ground truth: it computes the percentage of correct points (PCK) [46], averaged over error thresholds of 1, 2, 4, 8, and 16 pixels. OA represents the average accuracy of the binary occlusion classification. AJ is a metric that jointly evaluates position accuracy and occlusion accuracy.
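For reference, a minimal sketch of the \(< \delta_{\text{avg}}^{x}\) metric for a single track is given below; the benchmark implementation additionally aggregates over all tracks and videos, and the function name is hypothetical.

```python
import jax.numpy as jnp

def position_accuracy(pred, gt, visible, thresholds=(1, 2, 4, 8, 16)):
    """< delta_avg^x for one track: PCK over the ground-truth visible points,
    averaged over pixel thresholds. pred, gt: [T, 2]; visible: [T] boolean."""
    err = jnp.linalg.norm(pred - gt, axis=-1)
    accs = [((err <= t) & visible).sum() / visible.sum() for t in thresholds]
    return jnp.mean(jnp.stack(accs))
```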
Following Doersch et al. [11], we evaluate the datasets in two modes: strided query mode and first query mode. Strided query mode samples the query point along the ground-truth track at fixed intervals, sampling every 5 frames, whereas first query mode samples the query point solely from the first visible point.
Fig. 6: Qualitative comparison of long-range tracking. We visualize dense tracking results generated by LocoTrack and state-of-the-art methods [12,24]. These visualizations use query points densely distributed within the initial reference frame. Our model can establish highly precise correspondences over long ranges, even in the presence of occlusions and matching challenges like homogeneous areas or deforming objects. Best viewed in color.
4.3 Main Results
Quantitative comparison. We compare our method with recent state-of-the-art approaches [8,11,12,14,15,24,54] in both strided query mode, with scores shown in Table 1, and first query mode, with scores shown in Table 2. To ensure a fair comparison, we categorize models based on their input resolution: \(256 \times 256\) and \(384 \times 512\). Along with performance, we also present the throughput of each model, which indicates the number of points a model can process within a second. Higher throughput implies more efficient computation.
Our small variant, LocoTrack-S, already achieves state-of-the-art AJ and position accuracy across all benchmarks, surpassing both TAPIR and CoTracker by a large margin. On the DAVIS benchmark with strided query mode, we achieve a +5.6 AJ improvement compared to TAPIR and a +2.5 AJ improvement compared to CoTracker. This small variant is not only powerful but also extremely efficient compared to recent state-of-the-art methods: it demonstrates \(3.5 \times\) higher throughput than TAPIR and \(6 \times\) higher than CoTracker. The LocoTrack-B model shows even better performance, achieving a +0.9 AJ improvement over our small variant in DAVIS strided query mode.
However, our model often shows degradation on some datasets at the \(384 \times 512\) resolution. We attribute this degradation to the diminished effective receptive field of the local correlation when the resolution is increased.
Table 3: Comparison of computation cost. We measure the inference time with a varying number of query points and calculate the FLOPs for the feature backbone and refinement stage, along with the number of parameters. All metrics are measured using a video consisting of 24 frames on a single Nvidia RTX 3090 GPU.
| Method | Inference Time (s), $10^{0}$ pts | $10^{1}$ pts | $10^{2}$ pts | $10^{3}$ pts | $10^{4}$ pts | $10^{5}$ pts | Throughput (points/sec) | Backbone FLOPs (G) | FLOPs per point (G) | # of Params. (M) |
|---|---|---|---|---|---|---|---|---|---|---|
| RAFT [54] | - | - | - | - | 0.4 | - | 23,405.71 | 325.45 | - | 5.3 |
| CoTracker [24] | 0.53 | 0.53 | 0.53 | 1.18 | 8.40 | 87.2 | 1,146.79 | 624.83 | 4.65 | 45.5 |
| TAPIR [12] | 0.06 | 0.06 | 0.19 | 0.82 | 4.89 | 47.68 | 2,097.32 | 442.16 | 5.12 | 29.3 |
| LocoTrack-S | 0.04 | 0.04 | 0.05 | 0.17 | 1.44 | 14.23 | 7,244.47 | 442.16 | 1.08 | 8.2 |
| LocoTrack-B | 0.04 | 0.04 | 0.06 | 0.26 | 2.39 | 23.37 | 4,358.96 | 442.16 | 2.10 | 11.5 |
Table 4: Ablation on construction of correlation volume.
| | Local Corr. Size | Query Neighbour | AJ | DAVIS $< \delta_{avg}^{x}$ | OA |
|---|---|---|---|---|---|
| (I) | $7 \times 7$ | No neighbour (2D corr.) | 65.0 | 77.2 | 89.0 |
| (II) | $9 \times 7 \times 7$ | Uniform random in local region | 65.7 | 77.8 | 88.9 |
| (III) | $1 \times 9 \times 7 \times 7$ | Horizontal line | 66.5 | 78.4 | 89.4 |
| (IV) | $3 \times 3 \times 7 \times 7$ | Regular grid ($r_q = 1$) | 67.2 | 79.1 | 89.5 |
| (V) | $7 \times 7 \times 7 \times 7$ | Regular grid ($r_q = 3$, Ours) | 67.8 | 79.6 | 89.9 |
Qualitative comparison. The qualitative comparison is shown in Fig. 6. We visualize the results from the DAVIS [41] dataset, with the input resized to \({384} \times {512}\) resolution. Note that images at their original resolution are used for visualization. Overall, our method demonstrates superior smoothness compared to TAPIR. Our predictions are spatially coherent, even over long-range tracking sequences with occlusion.
4.4 Analysis and Ablation Study
Efficiency comparison. We compare efficiency with recent state-of-the-art methods [12,24,54] in Table 3. We measure inference time, throughput, FLOPs, and the number of parameters for a 24-frame video. We report inference time for a varying number of query points, increasing exponentially from \(10^{0}\) to \(10^{5}\). To measure throughput, we calculate the average time required to add each query point. Also, we measure FLOPs for both the feature backbone and the refinement model, focusing on the incremental FLOPs per additional point.
All variants of our model demonstrate superior efficiency across all metrics. Our small variant exhibits \(4.7 \times\) lower FLOPs per point than TAPIR and \(4.3 \times\) lower than CoTracker. Additionally, our model has a compact parameter count of only 8.2M, which is \(5.5 \times\) lower than CoTracker. Remarkably, our model can process \(10^{4}\) points in approximately one second, implying real-time processing of \(64 \times 64\) near-dense query points for a 24-frame video. This underscores the practicality of our model, paving the way for real-time applications.
Analysis on local correlation. In Table 4, we analyze the construction of our local correlation, focusing on how we sample neighboring points around the query point rather than the target point. (I) represents the performance of local 2D correlation, a common approach in the literature [12,15,24]. The performance gap between (I) and (V) demonstrates the superiority of our 4D correlation approach over 2D. (II) and (III) investigate the importance of calculating dense all-pair correlations within the local region. In (II), we use randomly sampled positions for the query point's neighbors, while (III) uses a horizontal line-shaped neighborhood. Their inferior performance compared to (IV), which samples the same number of points densely, emphasizes the value of our all-pair local 4D correlation. (IV) and (V) examine the effect of the local region size; the gap between (IV) and (V) supports our choice of region size. (V) represents our final model.
Table 5: Ablation on position encoding.
| Method | AJ | DAVIS $< \delta_{avg}^{x}$ | OA |
|---|---|---|---|
| (I) Sinusoidal encoding | 61.9 | 73.9 | 83.5 |
| (II) Relative position bias (Ours) | 67.8 | 79.6 | 89.9 |
Table 6: Ablation on architecture of \({\mathcal{E}}_{S}\) . We found that our model outperforms its counterpart while using the same number of parameters.
| Method | # of Layers | # of Params. (M) | AJ | DAVIS $< \delta_{avg}^{x}$ | OA |
|---|---|---|---|---|---|
| (I) 1D Conv Mixer (TAPIR [12]) | 3 | 11.5 | 66.1 | 78.0 | 87.5 |
| (II) LocoTrack-B (Ours) | 3 | 11.5 | 67.8 | 79.6 | 89.9 |
Ablation on the position encoding of the Transformer. In Table 5, we ablate the effect of relative position bias. With sinusoidal encoding [62], we observe significant performance degradation at inference time with variable sequence lengths (I). In contrast, relative position bias generalizes to unseen sequence lengths at inference time (II). This approach eliminates the need for hand-designed chaining processes (i.e., sliding window inference [15,24]), where window overlap leads to computational inefficiency.
Ablation on the architecture of refinement model. We verify the advantages of using a Transformer architecture over a Convolution-based architecture in Table 6. Our comparison includes the architecture proposed in Doersch et al. [12], which replaces the token mixing layer of MLP-Mixer [55] with depthwise 1D convolution. We ensure a fair comparison by matching the number of parameters and layers between the models. Our Transformer-based model achieves superior performance. We believe this difference stems from their receptive fields: Transformers can achieve a global receptive field within a single layer, while convolutions require multiple stacked layers. Although convolutions can also achieve large receptive fields with lightweight designs [4, 9], their exploration in long-range point tracking remains a promising area for future work.
Analysis on the number of iterations. We show the performance and throughput of our model, varying the number of iterations, in Fig. 7. We compare our model with TAPIR and CoTracker at their respective resolutions. Surprisingly, our model surpasses TAPIR even with a single iteration for both the small and base variants. With a single iteration,our small variant is about \(9 \times\) faster than TAPIR. Compared to CoTracker,our model is about \(9 \times\) faster at the same performance level.
Fig. 7: Results with a varying number of refinement iterations on TAP-Vid-DAVIS. The number in each circle denotes the number of iterations. (top) At \(256 \times 256\) resolution, compared to TAPIR [12], LocoTrack achieves better performance in a single iteration while being about \(9 \times\) faster. (bottom) At \(384 \times 512\) resolution, compared to CoTracker [24], LocoTrack achieves comparable performance while being about \(9 \times\) faster.
5 Conclusion
We introduce LocoTrack, an approach to the point tracking task, addressing the shortcomings of existing methods that rely solely on local 2D correlation. Our core innovation lies in a local all-pair correspondence formulation, combining the rich spatial context of \(4\mathrm{D}\) correlation with computational efficiency by limiting the search range. Further, a length-generalizable Transformer empowers the model to handle videos of varying lengths, eliminating the need for hand-designed processes. Our approach demonstrates superior performance and real-time inference while requiring significantly less computation compared to state-of-the-art methods.
Acknowledgements
This research was supported by the MSIT, Korea (IITP-2024-2020-0-01819, RS- 2023-00227592), Culture, Sports, and Tourism R&D Program through the Korea Creative Content Agency grant funded by the Ministry of Culture, Sports and Tourism (Research on neural watermark technology for copyright protection of generative AI 3D content, RS-2024-00348469, RS-2024-00333068) and National Research Foundation of Korea (RS-2024-00346597).
Local All-Pair Correspondence for Point Tracking - Supplementary Material -
A More Implementation Details
For generating the Panning-MOVi-E dataset [12], we randomly add 10-20 static objects and 5-10 dynamic objects to each scene. The dataset comprises 10,000 videos, including a validation set of 250. For the sinusoidal position encoding function \(\sigma(\cdot)\) [53], we use a channel size of 20 along with the original unnormalized coordinate, resulting in a total of 21 channels. For all qualitative comparisons, we use the LocoTrack-B model at a resolution of \({384} \times {512}\).
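Below is a minimal sketch of the encoding described above: 20 sinusoidal channels concatenated with the raw unnormalized coordinate, yielding 21 channels per input dimension. The specific frequency schedule (powers of two, in the spirit of Fourier-feature encodings [53]) is an assumption.

```python
# Hedged sketch of the sinusoidal position encoding sigma(.): 20 sinusoidal
# channels plus the raw coordinate, for 21 channels in total.
import numpy as np

def sinusoidal_encoding(x: np.ndarray, num_channels: int = 20) -> np.ndarray:
    """x: (...,) array of unnormalized coordinates.
    Returns (..., num_channels + 1): [x, sin(2^i x), cos(2^i x), ...]."""
    num_freqs = num_channels // 2                    # one sin/cos pair per frequency
    freqs = 2.0 ** np.arange(num_freqs)              # (num_freqs,) assumed schedule
    angles = x[..., None] * freqs                    # (..., num_freqs)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return np.concatenate([x[..., None], feats], axis=-1)

print(sinusoidal_encoding(np.array([0.0, 12.5])).shape)  # (2, 21)
```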
Details of the evaluation benchmark. We evaluate the precision of the predicted tracks using the TAP-Vid benchmark [11], which comprises both real-world and synthetic video datasets. TAP-Vid-Kinetics includes 1,189 real-world videos from the Kinetics [25] dataset. As the videos are collected from YouTube, they often contain edits such as scene cuts, text, fade-ins or fade-outs, or captions. TAP-Vid-DAVIS comprises 30 real-world videos from the DAVIS [41] dataset, featuring a variety of objects undergoing deformation. TAP-Vid-RGB-Stacking consists of 50 synthetic videos [26] showing a robot arm stacking geometric shapes against a monotonic background with a static camera. In addition to the TAP-Vid benchmark, we also evaluate our model on the RoboTAP dataset [63], which comprises 265 real-world videos of robot arm manipulation.
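For reference, the sketch below computes the benchmark's standard position-accuracy component: the fraction of ground-truth-visible points whose prediction lies within a pixel threshold, averaged over thresholds of 1, 2, 4, 8, and 16 pixels. Resolution normalization and the occlusion/Jaccard metrics are omitted, so this is an illustrative simplification rather than the official evaluation code.

```python
# Hedged sketch of the TAP-Vid position-accuracy metric on visible points.
import numpy as np

def position_accuracy(pred_xy, gt_xy, gt_visible, thresholds=(1, 2, 4, 8, 16)):
    """pred_xy, gt_xy: (T, N, 2) tracks in pixels; gt_visible: (T, N) boolean mask."""
    dist = np.linalg.norm(pred_xy - gt_xy, axis=-1)   # (T, N) per-point pixel error
    dist = dist[gt_visible]                           # score visible points only
    return float(np.mean([np.mean(dist < t) for t in thresholds]))
```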
Table 7: Convolutional layer configurations for different model sizes.
| Model | Channel Sizes | Kernel Sizes | Strides |
|---|---|---|---|
| Small | (64, 128) | (5, 2) | (4, 2) |
| Base | (64, 128, 128) | (3, 3, 2) | (2, 2, 2) |
Detailed architecture of the local 4D correlation encoder. We stack blocks of convolutional layers, where each block consists of a 2D convolution, group normalization [66], and a ReLU activation; see Table 7 for details. For the small model, the intermediate channel sizes are (64, 128); for the base model, they are (64, 128, 128). For every instance of group normalization, we set the group size to 16.
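The sketch below illustrates one such block (2D convolution, group normalization with group size 16, ReLU) and the per-model configurations from Table 7. It is an illustrative NumPy re-implementation, not the released code; the padding scheme, weight layout, and the 49-channel dummy input are assumptions.

```python
# Minimal sketch of one encoder block (conv2d -> GroupNorm(16) -> ReLU) and the
# Table 7 configurations. Illustrative only; not the paper's implementation.
import numpy as np

def conv2d(x, w, stride):
    """x: (H, W, C_in), w: (k, k, C_in, C_out); zero padding of k // 2 (assumed)."""
    k, _, _, c_out = w.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    H_out = (x.shape[0] + 2 * pad - k) // stride + 1
    W_out = (x.shape[1] + 2 * pad - k) // stride + 1
    out = np.zeros((H_out, W_out, c_out))
    for i in range(H_out):
        for j in range(W_out):
            patch = xp[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return out

def group_norm(x, groups=16, eps=1e-5):
    """Normalize channels within each of `groups` groups (no affine parameters)."""
    H, W, C = x.shape
    g = x.reshape(H, W, groups, C // groups)
    mean = g.mean(axis=(0, 1, 3), keepdims=True)
    var = g.var(axis=(0, 1, 3), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(H, W, C)

def encoder_block(x, w, stride):
    return np.maximum(group_norm(conv2d(x, w, stride)), 0.0)  # conv -> GN -> ReLU

# Table 7 configurations: (output channels, kernel size, stride) per block.
CONFIGS = {
    "small": [(64, 5, 4), (128, 2, 2)],
    "base":  [(64, 3, 2), (128, 3, 2), (128, 2, 2)],
}

# Running the "small" configuration on a dummy 16x16 map with 49 input channels
# (purely illustrative; weights are random).
x = np.random.randn(16, 16, 49)
c_in = x.shape[-1]
for c_out, k, s in CONFIGS["small"]:
    w = np.random.randn(k, k, c_in, c_out) * 0.01
    x = encoder_block(x, w, s)
    c_in = c_out
print(x.shape)  # spatial resolution reduced by the strides, final channels = 128
```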
Details of correlation visualization. For the correlation visualization in Fig. 3 of the main text, we train a linear layer to project the correlation embedding \({E}_{t}^{k}\) into a local 2D correlation with a shape of \(7 \times 7\). This local 2D correlation then undergoes a soft-argmax operation to predict the position, whose error is measured relative to the ground truth. We start from the pre-trained model and train the linear layer for 20,000 iterations. For clarity, we bilinearly upsample the \(7 \times 7\) correlation to \({256} \times {256}\).
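A minimal sketch of the soft-argmax readout, assuming a unit temperature and omitting the learned linear projection: the \(7 \times 7\) correlation is converted into a sub-pixel offset as the softmax-weighted average of cell coordinates.

```python
# Hedged sketch of a soft-argmax over a 7x7 local correlation.
import numpy as np

def soft_argmax_2d(corr: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """corr: (7, 7) local correlation; returns the expected (x, y) offset
    relative to the patch center, in correlation-grid units."""
    h, w = corr.shape
    logits = corr.flatten() / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    ys, xs = np.meshgrid(np.arange(h) - h // 2, np.arange(w) - w // 2, indexing="ij")
    return np.array([np.sum(probs * xs.flatten()), np.sum(probs * ys.flatten())])

print(soft_argmax_2d(np.random.rand(7, 7)))  # e.g. array([ 0.03, -0.12])
```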
B More Qualitative Comparisons
We provide more qualitative comparisons to recent state-of-the-art methods [12, 24] in Fig. 8 and Fig. 9. Our model establishes accurate correspondences in homogeneous areas and on deforming objects, and demonstrates robust occlusion handling even under severe occlusion conditions.
Fig. 8: Additional qualitative comparison with state-of-the-art methods [12, 24].
Fig. 9: Additional qualitative comparison with state-of-the-art methods [12, 24].
References
1. Bay, H., Tuytelaars, T., Van Gool, L.: Surf: Speeded up robust features. In: Computer Vision-ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, May 7-13, 2006. Proceedings, Part I 9. pp. 404-417. Springer (2006)
2. Bian, W., Huang, Z., Shi, X., Dong, Y., Li, Y., Li, H.: Context-tap: Tracking any point demands spatial context features. arXiv preprint arXiv:2306.02000 (2023)
3. Bradbury, J., Frostig, R., Hawkins, P., Johnson, M.J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., Zhang, Q.: JAX: composable transformations of Python+NumPy programs (2018), http://github.com/google/jax
4. Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)
5. Cheng, H.K., Schwing, A.G.: Xmem: Long-term video object segmentation with an atkinson-shiffrin memory model. In: European Conference on Computer Vision. pp. 640-658. Springer (2022)
6. Cho, S., Hong, S., Jeon, S., Lee, Y., Sohn, K., Kim, S.: Cats: Cost aggregation transformers for visual correspondence. Advances in Neural Information Processing Systems 34, 9011-9023 (2021)
7. Cho, S., Hong, S., Kim, S.: Cats++: Boosting cost aggregation with convolutions and transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(6), 7174-7194 (2022)
8. Cho, S., Huang, J., Kim, S., Lee, J.Y.: Flowtrack: Revisiting optical flow for long-range dense tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19268-19277 (2024)
9. Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: Proceedings of the IEEE international conference on computer vision. pp. 764-773 (2017)
10. DeTone, D., Malisiewicz, T., Rabinovich, A.: Superpoint: Self-supervised interest point detection and description. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. pp. 224-236 (2018)
11. Doersch, C., Gupta, A., Markeeva, L., Recasens, A., Smaira, L., Aytar, Y., Carreira, J., Zisserman, A., Yang, Y.: Tap-vid: A benchmark for tracking any point in a video. Advances in Neural Information Processing Systems 35, 13610-13626 (2022)
12. Doersch, C., Yang, Y., Vecerik, M., Gokay, D., Gupta, A., Aytar, Y., Carreira, J., Zisserman, A.: Tapir: Tracking any point with per-frame initialization and temporal refinement. arXiv preprint arXiv:2306.08637 (2023)
13. Dusmanu, M., Rocco, I., Pajdla, T., Pollefeys, M., Sivic, J., Torii, A., Sattler, T.: D2-net: A trainable cnn for joint detection and description of local features. arXiv preprint arXiv:1905.03561 (2019)
14. Greff, K., Belletti, F., Beyer, L., Doersch, C., Du, Y., Duckworth, D., Fleet, D.J., Gnanapragasam, D., Golemo, F., Herrmann, C., et al.: Kubric: A scalable dataset generator. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3749-3761 (2022)
15. Harley, A.W., Fang, Z., Fragkiadaki, K.: Particle video revisited: Tracking through occlusions using point trajectories. In: European Conference on Computer Vision. pp. 59-75. Springer (2022)
16. Hartley, R., Zisserman, A.: Multiple view geometry in computer vision. Cambridge university press (2003)
17. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770-778 (2016)
18. Hong, S., Cho, S., Kim, S., Lin, S.: Unifying feature and cost aggregation with transformers for semantic and visual correspondence. In: The Twelfth International Conference on Learning Representations (2024)
19. Hong, S., Cho, S., Nam, J., Lin, S., Kim, S.: Cost aggregation with 4d convolutional swin transformer for few-shot segmentation. In: European Conference on Computer Vision. pp. 108-126. Springer (2022)
20. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning. pp. 448-456. pmlr (2015)
21. Janai, J., Güney, F., Behl, A., Geiger, A., et al.: Computer vision for autonomous vehicles: Problems, datasets and state of the art. Foundations and Trends® in Computer Graphics and Vision 12(1-3), 1-308 (2020)
22. Jiang, W., Trulls, E., Hosang, J., Tagliasacchi, A., Yi, K.M.: Cotr: Correspondence transformer for matching across images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6207-6217 (2021)
23. Kang, D., Kwon, H., Min, J., Cho, M.: Relational embedding for few-shot classification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8822-8833 (2021)
24. Karaev, N., Rocco, I., Graham, B., Neverova, N., Vedaldi, A., Rupprecht, C.: Cotracker: It is better to track together. arXiv preprint arXiv:2307.07635 (2023)
25. Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
26. Lee, A.X., Devin, C.M., Zhou, Y., Lampe, T., Bousmalis, K., Springenberg, J.T., Byravan, A., Abdolmaleki, A., Gileadi, N., Khosid, D., et al.: Beyond pick-and-place: Tackling robotic stacking of diverse shapes. In: 5th Annual Conference on Robot Learning (2021)
27. Lee, J., Kim, D., Ponce, J., Ham, B.: Sfnet: Learning object-aware semantic correspondence. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2278-2287 (2019)
28. Liu, C., Yuen, J., Torralba, A.: Sift flow: Dense correspondence across scenes and its applications. IEEE transactions on pattern analysis and machine intelligence 33(5), 978-994 (2010)
29. Loshchilov, I., Hutter, F.: Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)
30. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
31. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International journal of computer vision 60, 91-110 (2004)
32. Manuelli, L., Li, Y., Florence, P., Tedrake, R.: Keypoints into the future: Self-supervised correspondence in model-based reinforcement learning. arXiv preprint arXiv:2009.05085 (2020)
33. Melekhov, I., Tiulpin, A., Sattler, T., Pollefeys, M., Rahtu, E., Kannala, J.: Dgc-net: Dense geometric correspondence network. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 1034-1042. IEEE (2019)
34. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65(1), 99-106 (2021)
35. Min, J., Kang, D., Cho, M.: Hypercorrelation squeeze for few-shot segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 6941-6952 (2021)
36. Moing, G.L., Ponce, J., Schmid, C.: Dense optical tracking: Connecting the dots. arXiv preprint arXiv:2312.00786 (2023)
37. Nam, J., Lee, G., Kim, S., Kim, H., Cho, H., Kim, S., Kim, S.: Diffmatch: Diffusion model for dense matching. arXiv preprint arXiv:2305.19094 (2023)
38. Neoral, M., Šerých, J., Matas, J.: Mft: Long-term tracking of every pixel. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 6837-6847 (2024)
39. Oh, S.W., Lee, J.Y., Xu, N., Kim, S.J.: Video object segmentation using space-time memory networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9226-9235 (2019)
40. Pollefeys, M., Nistér, D., Frahm, J.M., Akbarzadeh, A., Mordohai, P., Clipp, B., Engels, C., Gallup, D., Kim, S.J., Merrell, P., et al.: Detailed real-time urban 3d reconstruction from video. International Journal of Computer Vision 78, 143-167 (2008)
41. Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., Van Gool, L.: The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675 (2017)
42. Press, O., Smith, N.A., Lewis, M.: Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409 (2021)
43. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21(1), 5485-5551 (2020)
44. Rocco, I., Arandjelovic, R., Sivic, J.: Convolutional neural network architecture for geometric matching. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6148-6157 (2017)
45. Rocco, I., Arandjelović, R., Sivic, J.: Efficient neighbourhood consensus networks via submanifold sparse convolutions. In: Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part IX 16. pp. 605-621. Springer (2020)
46. Rocco, I., Cimpoi, M., Arandjelović, R., Torii, A., Pajdla, T., Sivic, J.: Neighbourhood consensus networks. Advances in neural information processing systems 31 (2018)
47. Sarlin, P.E., Cadena, C., Siegwart, R., Dymczyk, M.: From coarse to fine: Robust hierarchical localization at large scale. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12716-12725 (2019)
48. Sarlin, P.E., DeTone, D., Malisiewicz, T., Rabinovich, A.: Superglue: Learning feature matching with graph neural networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4938-4947 (2020)
49. Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4104-4113 (2016)
50. Shaw, P., Uszkoreit, J., Vaswani, A.: Self-attention with relative position representations. arXiv preprint arXiv:1803.02155 (2018)
51. Sun, D., Herrmann, C., Reda, F., Rubinstein, M., Fleet, D.J., Freeman, W.T.: Disentangling architecture and training for optical flow. In: European Conference on Computer Vision. pp. 165-182. Springer (2022)
52. Sun, J., Shen, Z., Wang, Y., Bao, H., Zhou, X.: Loftr: Detector-free local feature matching with transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8922-8931 (2021)
53. Tancik, M., Srinivasan, P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., Singhal, U., Ramamoorthi, R., Barron, J., Ng, R.: Fourier features let networks learn high frequency functions in low dimensional domains. Advances in Neural Information Processing Systems 33, 7537-7547 (2020)
54. Teed, Z., Deng, J.: Raft: Recurrent all-pairs field transforms for optical flow. In: Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part II 16. pp. 402-419. Springer (2020)
55. Tolstikhin, I.O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A., Keysers, D., Uszkoreit, J., et al.: Mlp-mixer: An all-mlp architecture for vision. Advances in neural information processing systems 34, 24261-24272 (2021)
56. Torr, P.H., Zisserman, A.: Feature based methods for structure and motion estimation. In: International workshop on vision algorithms. pp. 278-294. Springer (1999)
57. Truong, P., Danelljan, M., Gool, L.V., Timofte, R.: Gocor: Bringing globally optimized correspondence volumes into your neural network. Advances in Neural Information Processing Systems 33, 14278-14290 (2020)
58. Truong, P., Danelljan, M., Timofte, R.: Glu-net: Global-local universal network for dense flow and correspondences. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6258-6268 (2020)
59. Truong, P., Danelljan, M., Timofte, R., Van Gool, L.: Pdc-net+: Enhanced probabilistic dense correspondence network. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
60. Truong, P., Danelljan, M., Van Gool, L., Timofte, R.: Learning accurate dense correspondences and when to trust them. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5714-5724 (2021)
61. Ulyanov, D., Vedaldi, A., Lempitsky, V.: Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022 (2016)
62. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
63. Vecerik, M., Doersch, C., Yang, Y., Davchev, T., Aytar, Y., Zhou, G., Hadsell, R., Agapito, L., Scholz, J.: Robotap: Tracking arbitrary points for few-shot visual imitation. arXiv preprint arXiv:2308.15975 (2023)
64. Wang, Q., Chang, Y.Y., Cai, R., Li, Z., Hariharan, B., Holynski, A., Snavely, N.: Tracking everything everywhere all at once. arXiv preprint arXiv:2306.05422 (2023)
65. Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: Cbam: Convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV). pp. 3-19 (2018)
66. Wu, Y., He, K.: Group normalization. In: Proceedings of the European conference on computer vision (ECCV). pp. 3-19 (2018)
67. Xiao, J., Chai, J.x., Kanade, T.: A closed-form solution to non-rigid shape and motion recovery. In: Computer Vision-ECCV 2004: 8th European Conference on Computer Vision, Prague, Czech Republic, May 11-14, 2004. Proceedings, Part IV 8. pp. 573-587. Springer (2004)
68. Yi, K.M., Trulls, E., Ono, Y., Lepetit, V., Salzmann, M., Fua, P.: Learning to find good correspondences. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2666-2674 (2018)
69. Zheng, Y., Harley, A.W., Shen, B., Wetzstein, G., Guibas, L.J.: Pointodyssey: A large-scale synthetic dataset for long-term point tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19855-19865 (2023)