
2024-08-28-RAFT-中英对照


RAFT: Recurrent All-Pairs Field Transforms for Optical Flow

RAFT:用于光流的循环全对域变换

Zachary Teed and Jia Deng

Princeton University

普林斯顿大学

{zteed,jiadeng}@cs.princeton.edu


Abstract. We introduce Recurrent All-Pairs Field Transforms (RAFT), a new deep network architecture for optical flow. RAFT extracts per-pixel features, builds multi-scale \(4\mathrm{D}\) correlation volumes for all pairs of pixels, and iteratively updates a flow field through a recurrent unit that performs lookups on the correlation volumes. RAFT achieves state-of-the-art performance. On KITTI, RAFT achieves an F1-all error of \({5.10}\%\), a \({16}\%\) error reduction from the best published result \(\left( {{6.10}\% }\right)\). On Sintel (final pass), RAFT obtains an end-point-error of 2.855 pixels, a \({30}\%\) error reduction from the best published result (4.098 pixels). In addition, RAFT has strong cross-dataset generalization as well as high efficiency in inference time, training speed, and parameter count. Code

摘要。我们介绍了 Recurrent All-Pairs Field Transforms(RAFT),这是一种用于光流的新型深度网络架构。RAFT 提取每个像素的特征,构建所有像素对的多尺度 \(4\mathrm{D}\) 相关体积,并通过在相关体积上执行查找的循环单元迭代更新流场。RAFT 达到了最先进的性能。在 KITTI 数据集上,RAFT 实现了 \({5.10}\%\) 的 F1-all 误差,比最佳发表结果 \(\left( {{6.10}\% }\right)\) 减少了 \({16}\%\) 的误差。在 Sintel(最终通过)上,RAFT 获得了 2.855 像素的终点误差,比最佳发表结果(4.098 像素)减少了 \({30}\%\) 的误差。此外,RAFT 具有强大的跨数据集泛化能力以及在推理时间、训练速度和参数数量方面的高效率。代码

is available at https://github.com/princeton-vl/RAFT

可在 https://github.com/princeton-vl/RAFT 获取

1 Introduction

1 引言

Optical flow is the task of estimating per-pixel motion between video frames. It is a long-standing vision problem that remains unsolved. The best systems are limited by difficulties including fast-moving objects, occlusions, motion blur, and textureless surfaces.

光流是估计视频帧之间逐像素运动的任务。这是一个长期存在、至今仍未解决的视觉问题。最好的系统仍受限于快速移动物体、遮挡、运动模糊和无纹理表面等困难。

Optical flow has traditionally been approached as a hand-crafted optimization problem over the space of dense displacement fields between a pair of images [21,51,13]. Generally, the optimization objective defines a trade-off between a data term which encourages the alignment of visually similar image regions and a regularization term which imposes priors on the plausibility of motion. Such an approach has achieved considerable success, but further progress has appeared challenging, due to the difficulties in hand-designing an optimization objective that is robust to a variety of corner cases.

传统上,光流被视为在一对图像之间的密集位移场空间上手工设计的优化问题 [21,51,13]。通常,优化目标定义了数据项(鼓励视觉相似图像区域的对齐)和正则化项(对运动的合理性施加先验)之间的权衡。这种方法取得了相当大的成功,但由于很难手工设计一个对各种边缘情况都鲁棒的优化目标,进一步的进展显得颇具挑战。

Recently, deep learning has been shown as a promising alternative to traditional methods. Deep learning can side-step formulating an optimization problem and train a network to directly predict flow. Current deep learning methods [25,42,22,49,20] have achieved performance comparable to the best traditional methods while being significantly faster at inference time. A key question for further research is designing effective architectures that perform better, train more easily and generalize well to novel scenes.

近年来,深度学习已被证明是传统方法的有前途的替代方案。深度学习可以绕过制定优化问题,训练网络直接预测流。当前的深度学习方法 [25,42,22,49,20] 在推理时间上显著更快的同时,达到了与最佳传统方法相媲美的性能。进一步研究的一个关键问题是设计有效的架构,这些架构表现更好,训练更容易,并且能够很好地泛化到新颖的场景。

We introduce Recurrent All-Pairs Field Transforms (RAFT), a new deep network architecture for optical flow. RAFT enjoys the following strengths:

我们介绍了循环全对场变换(RAFT),这是一种用于光流的新型深度网络架构。RAFT具有以下优势:

Fig. 1: RAFT consists of 3 main components: (1) A feature encoder that extracts per-pixel features from both input images, along with a context encoder that extracts features from only \({I}_{1}\). (2) A correlation layer which constructs a 4D \(W \times H \times W \times H\) correlation volume by taking the inner product of all pairs of feature vectors. The last two dimensions of the 4D volume are pooled at multiple scales to construct a set of multi-scale volumes. (3) An update operator which recurrently updates optical flow by using the current estimate to look up values from the set of correlation volumes.

图1:RAFT由3个主要组件组成:(1)一个特征编码器,从两个输入图像中提取每个像素的特征,以及一个上下文编码器,仅从 \({I}_{1}\) 提取特征。(2)一个相关层,通过取所有特征向量对的内积来构建一个4D \(W \times H \times W \times H\) 相关体积。4D体积的最后两个维度在多个尺度上进行池化,以构建一组多尺度体积。(3)一个更新操作符,通过使用当前估计从相关体积集中查找值来循环更新光流。

  • State-of-the-art accuracy: On KITTI [18], RAFT achieves an F1-all error of \({5.10}\%\) ,a \({16}\%\) error reduction from the best published result \(\left( {{6.10}\% }\right)\) . On Sintel [11] (final pass), RAFT obtains an end-point-error of 2.855 pixels, a \({30}\%\) error reduction from the best published result (4.098 pixels).

  • 最先进的准确性:在KITTI [18]上,RAFT实现了 \({5.10}\%\) 的F1-all误差,比最佳已发表结果 \(\left( {{6.10}\% }\right)\) 减少了 \({16}\%\) 误差。在Sintel [11](最终通过)上,RAFT获得了2.855像素的终点误差,比最佳已发表结果(4.098像素)减少了 \({30}\%\) 误差。

  • Strong generalization: When trained only on synthetic data, RAFT achieves an end-point-error of 5.04 pixels on KITTI [18],a \({40}\%\) error reduction from the best prior deep network trained on the same data (8.36 pixels).

  • 强大的泛化能力:仅在合成数据上训练时,RAFT在KITTI [18]上实现了5.04像素的终点误差,比在相同数据上训练的最佳先前深度网络(8.36像素)减少了 \({40}\%\) 误差。

  • High efficiency: RAFT processes 1088 \(\times {436}\) videos at 10 frames per second on a 1080Ti GPU. It trains with 10X fewer iterations than other architectures. A smaller version of RAFT with \(1/5\) of the parameters runs at 20 frames per second while still outperforming all prior methods on Sintel.

  • 高效率:RAFT 在 1080Ti GPU 上以每秒 10 帧的速度处理 1088×436 的视频。它的训练迭代次数比其他架构少 10 倍。一个参数量只有 \(1/5\) 的小型 RAFT 版本以每秒 20 帧的速度运行,同时在 Sintel 上仍然优于所有先前的方法。

RAFT consists of three main components: (1) a feature encoder that extracts a feature vector for each pixel; (2) a correlation layer that produces a 4D correlation volume for all pairs of pixels, with subsequent pooling to produce lower resolution volumes; (3) a recurrent GRU-based update operator that retrieves values from the correlation volumes and iteratively updates a flow field initialized at zero. Fig. 1 illustrates the design of RAFT.

RAFT由三个主要组件组成:(1)一个特征编码器,为每个像素提取特征向量;(2)一个相关层,为所有像素对生成4D相关体积,随后进行池化以生成较低分辨率的体积;(3)一个基于GRU的循环更新操作符,从相关体积中检索值并迭代更新初始为零的流场。图1展示了RAFT的设计。

The RAFT architecture is motivated by traditional optimization-based approaches. The feature encoder extracts per-pixel features. The correlation layer computes visual similarity between pixels. The update operator mimics the steps of an iterative optimization algorithm. But unlike traditional approaches, features and motion priors are not handcrafted but learned, by the feature encoder and the update operator respectively.

RAFT架构的灵感来源于传统的基于优化的方法。特征编码器提取每个像素的特征。相关层计算像素之间的视觉相似性。更新操作符模仿迭代优化算法的步骤。但与传统方法不同,特征和运动先验不是手工制作的,而是分别由特征编码器和更新操作符学习的。

The design of RAFT draws inspiration from many existing works but is substantially novel. First, RAFT maintains and updates a single fixed flow field at high resolution. This is different from the prevailing coarse-to-fine design in prior work [42,49,22,23,50], where flow is first estimated at low resolution and upsampled and refined at high resolution. By operating on a single high-resolution flow field, RAFT overcomes several limitations of a coarse-to-fine cascade: the difficulty of recovering from errors at coarse resolutions, the tendency to miss small fast-moving objects, and the many training iterations (often over 1M) typically required for training a multi-stage cascade.

RAFT的设计受到许多现有工作的启发,但具有实质性的创新。首先,RAFT维护并更新一个单一的高分辨率固定流场。这与先前工作中流行的从粗到细的设计不同,其中流首先在低分辨率下估计,然后在高分辨率下上采样和细化。通过在高分辨率流场上操作,RAFT克服了从粗到细级联的几个限制:难以从粗分辨率的错误中恢复,容易遗漏快速移动的小物体,以及通常需要的多阶段级联的多次训练迭代(通常超过100万次)。

Second, the update operator of RAFT is recurrent and lightweight. Many recent works [24,42,49,22,25] have included some form of iterative refinement, but do not tie the weights across iterations [42,49,22] and are therefore limited to a fixed number of iterations. To our knowledge, IRR [24] is the only deep learning approach that is recurrent. It uses FlowNetS [15] or PWC-Net [42] as its recurrent unit. When using FlowNetS, it is limited by the size of the network (38M parameters) and is only applied up to 5 iterations. When using PWC-Net, iterations are limited by the number of pyramid levels. In contrast, our update operator has only \({2.7}\mathrm{M}\) parameters and can be applied \({100} +\) times during inference without divergence.

其次,RAFT 的更新操作符是循环且轻量级的。许多近期的工作 [24,42,49,22,25] 包含了某种形式的迭代细化,但没有在迭代之间绑定权重 [42,49,22],因此仅限于固定次数的迭代。据我们所知,IRR [24] 是唯一一种循环的深度学习方法。它使用 FlowNetS [15] 或 PWC-Net [42] 作为其循环单元。当使用 FlowNetS 时,受限于网络的大小(38M 参数),仅能应用至多 5 次迭代。当使用 PWC-Net 时,迭代次数受金字塔层数的限制。相比之下,我们的更新操作符仅有 \({2.7}\mathrm{M}\) 参数,并且在推理过程中可以应用 \({100} +\) 次而不会发散。

Third, the update operator has a novel design, which consists of a convolutional GRU that performs lookups on 4D multi-scale correlation volumes; in contrast, refinement modules in prior work typically use only plain convolution or correlation layers.

第三,更新操作符具有新颖的设计,它包含一个卷积GRU,该GRU在4D多尺度相关体积上执行查找;相比之下,先前工作中的细化模块通常仅使用普通的卷积或相关层。

We conduct experiments on Sintel [11] and KITTI [18]. Results show that RAFT achieves state-of-the-art performance on both datasets. In addition, we validate various design choices of RAFT through extensive ablation studies.

我们在 Sintel [11] 和 KITTI [18] 上进行了实验。结果显示,RAFT 在两个数据集上都达到了最先进的性能。此外,我们通过广泛的消融研究验证了 RAFT 的各种设计选择。

2 Related Work

2 相关工作

Optical Flow as Energy Minimization Optical flow has traditionally been treated as an energy minimization problem which imposes a tradeoff between a data term and a regularization term. Horn and Schunck [21] formulated optical flow as a continuous optimization problem using a variational framework, and were able to estimate a dense flow field by performing gradient steps. Black and Anandan [9] addressed problems with oversmoothing and noise sensitivity by introducing a robust estimation framework. TV-L1 [51] replaced the quadratic penalties with an L1 data term and total variation regularization, which allowed for motion discontinuities and was better equipped to handle outliers. Improvements have been made by defining better matching costs [45,10] and regularization terms [38].

光流作为能量最小化 传统上,光流被视为一个能量最小化问题,需要在数据项和正则化项之间进行权衡。Horn 和 Schunck [21] 使用变分框架将光流表述为一个连续优化问题,并通过执行梯度步骤估计密集的流场。Black 和 Anandan [9] 通过引入鲁棒估计框架解决了过度平滑和噪声敏感问题。TV-L1 [51] 用 L1 数据项和总变差正则化替代了二次惩罚,这允许运动不连续并能更好地处理异常值。通过定义更好的匹配成本 [45,10] 和正则化项 [38],该方向已取得进一步改进。

Such continuous formulations maintain a single estimate of optical flow which is refined at each iteration. To ensure a smooth objective function, a first order Taylor approximation is used to model the data term. As a result, they only work well for small displacements. To handle large displacements, the coarse-to-fine strategy is used, where an image pyramid is used to estimate large displacements at low resolution, then small displacements refined at high resolution. But this coarse-to-fine strategy may miss small fast-moving objects and have difficulty recovering from early mistakes. Like continuous methods, we maintain a single estimate of optical flow which is refined with each iteration. However, since we build correlation volumes for all pairs at both high resolution and low resolution, each local update uses information about both small and large displacements. In addition, instead of using a subpixel Taylor approximation of the data term, our update operator learns to propose the descent direction.

这种连续的公式保持了对光流的单一估计,该估计在每次迭代中得到改进。为了确保平滑的目标函数,使用一阶泰勒近似来建模数据项。因此,它们仅适用于小位移。为了处理大位移,采用了由粗到细的策略,其中使用图像金字塔在低分辨率下估计大位移,然后在高分辨率下细化小位移。但这种由粗到细的策略可能会错过快速移动的小物体,并且难以从早期错误中恢复。与连续方法类似,我们保持了对光流的单一估计,该估计在每次迭代中得到改进。然而,由于我们为所有对在高分辨率和低分辨率下构建相关体积,每个局部更新都使用关于小位移和大位移的信息。此外,我们的更新操作符学习提出下降方向,而不是使用数据项的亚像素泰勒近似。

More recently, optical flow has also been approached as a discrete optimization problem [35,13,47] using a global objective. One challenge of this approach is the massive size of the search space, as each pixel can be reasonably paired with thousands of points in the other frame. Menze et al. [35] pruned the search space using feature descriptors and approximated the global MAP estimate using message passing. Chen et al. [13] showed that by using the distance transform, solving the global optimization problem over the full space of flow fields is tractable. DCFlow [47] showed further improvements by using a neural network as a feature descriptor, and constructed a 4D cost volume over all pairs of features. The 4D cost volume was then processed using the Semi-Global Matching (SGM) algorithm [19]. Like DCFlow, we also construct 4D cost volumes over learned features. However, instead of processing the cost volumes using SGM, we use a neural network to estimate flow. Our approach is end-to-end differentiable, meaning the feature encoder can be trained with the rest of the network to directly minimize the error of the final flow estimate. In contrast, DCFlow requires their network to be trained using an embedding loss between pixels; it cannot be trained directly on optical flow because their cost volume processing is not differentiable.

近年来,光流问题也被视为一个使用全局目标函数的离散优化问题 [35,13,47]。这种方法的一个挑战是搜索空间的巨大规模,因为每个像素都可以合理地与另一帧中的数千个点配对。Menze 等人 [35] 通过使用特征描述符来修剪搜索空间,并使用消息传递来近似全局 MAP 估计。Chen 等人 [13] 表明,通过使用距离变换,在流场的整个空间上解决全局优化问题是可行的。DCFlow [47] 通过使用神经网络作为特征描述符,并在所有特征对上构建了一个 4D 代价体,进一步改进了这一方法,然后使用半全局匹配(SGM)算法 [19] 处理 4D 代价体。与 DCFlow 类似,我们也在学习到的特征上构建 4D 代价体。然而,我们不是使用 SGM 处理代价体,而是使用神经网络来估计流。我们的方法是端到端可微分的,这意味着特征编码器可以与网络的其他部分一起训练,以直接最小化最终流估计的误差。相比之下,DCFlow 要求他们的网络使用像素间的嵌入损失进行训练;由于他们的代价体处理是不可微分的,因此不能直接在光流上进行训练。

Direct Flow Prediction Neural networks have been trained to directly predict optical flow between a pair of frames, side-stepping the optimization problem completely. Coarse-to-fine processing has emerged as a popular ingredient in many recent works [42,22,24,20,18,2]. In contrast, our method maintains and updates a single high-resolution flow field.

直接流预测 神经网络已被训练为直接预测一对帧之间的光流,从而完全绕过了优化问题。由粗到细的处理已成为许多近期工作 [42,22,24,20,18,2] 中的一个流行要素。相比之下,我们的方法维护并更新一个单一的高分辨率流场。

Iterative Refinement for Optical Flow Many recent works have used iterative refinement to improve results on optical flow [25,39,42,22,49] and related tasks [29,53,44,28]. Ilg et al. [25] applied iterative refinement to optical flow by stacking multiple FlowNetS and FlowNetC modules in series. SpyNet [39], PWC-Net [42], LiteFlowNet [22], and VCN [49] apply iterative refinement using coarse-to-fine pyramids. The main difference of these approaches from ours is that they do not share weights between iterations.

光流迭代细化 许多近期的工作采用了迭代细化来改善光流 [25,39,42,22,49] 及相关任务 [29,53,44,28] 的结果。Ilg 等人 [25] 通过串联多个 FlowNetS 和 FlowNetC 模块将迭代细化应用于光流。SpyNet [39]、PWC-Net [42]、LiteFlowNet [22] 和 VCN [49] 使用由粗到细的金字塔结构进行迭代细化。这些方法与我们的主要区别在于它们在迭代之间不共享权重。

More closely related to our approach is IRR [24], which builds off of the FlowNetS and PWC-Net architectures but shares weights between refinement networks. When using FlowNetS, it is limited by the size of the network (38M parameters) and is only applied up to 5 iterations. When using PWC-Net, iterations are limited by the number of pyramid levels. In contrast, we use a much simpler refinement module (2.7M parameters) which can be applied for 100+ iterations during inference without divergence. Our method also shares similarities with Devon [31], namely the construction of the cost volume without warping and fixed resolution updates. However, Devon does not have any recurrent unit. It also differs from ours regarding large displacements. Devon handles large displacements using a dilated cost volume while our approach pools the correlation volume at multiple resolutions.

与我们的方法更为接近的是 IRR [24],它基于 FlowNetS 和 PWC-Net 架构构建,但在细化网络之间共享权重。当使用 FlowNetS 时,它受限于网络的大小(38M 参数),并且仅能应用至多 5 次迭代。当使用 PWC-Net 时,迭代次数受限于金字塔层级的数量。相比之下,我们使用了一个简单得多的细化模块(2.7M 参数),在推理过程中可以应用 100 次以上的迭代而不会发散。我们的方法也与 Devon [31] 有相似之处,即不使用扭曲(warping)来构建代价体,并且在固定分辨率上进行更新。然而,Devon 没有任何循环单元。在处理大位移方面也有所不同:Devon 使用膨胀的代价体,而我们的方法在多个分辨率上对相关体积进行池化。

Our method also has ties to TrellisNet [5] and Deep Equilibrium Models (DEQ) [6]. TrellisNet uses weights tied across depth over a large number of layers; DEQ simulates an infinite number of layers by solving for the fixed point directly. TrellisNet and DEQ were designed for sequence modeling tasks, but we adopt the core idea of using a large number of weight-tied units. Our update operator uses a modified GRU block [14], which is similar to the LSTM block used in TrellisNet. We found that this structure allows our update operator to more easily converge to a fixed flow field.

我们的方法也与 TrellisNet [5] 和深度均衡模型(DEQ)[6] 有关。TrellisNet 在大量层上使用跨深度绑定的权重,而 DEQ 通过直接求解不动点来模拟无限数量的层。TrellisNet 和 DEQ 是为序列建模任务设计的,但我们采用了使用大量权重绑定单元的核心思想。我们的更新操作符使用了一个修改过的 GRU 块 [14],类似于 TrellisNet 中使用的 LSTM 块。我们发现这种结构使得我们的更新操作符更容易收敛到一个固定的流场。

Learning to Optimize Many problems in vision can be formulated as an optimization problem. This has motivated several works to embed optimization problems into network architectures [43,32,44]. These works typically use a network to predict the inputs or parameters of the optimization problem, and then train the network weights by backpropagating the gradient through the solver, either implicitly [43] or by unrolling each step [32,43]. However, this technique is limited to problems with an objective that can be easily defined.

学习优化 视觉中的许多问题可以被表述为优化问题。这促使了多项工作将优化问题嵌入到网络架构中 [43,32,44]。这些工作通常使用网络来预测优化问题的输入或参数,然后通过求解器反向传播梯度来训练网络权重,无论是隐式地 [43] 还是展开每一步 [32,43]。然而,这种技术局限于目标容易定义的问题。

Another approach is to learn iterative updates directly from data [1,12]. These approaches are motivated by the fact that first order optimizers such as Primal Dual Hybrid Gradient (PDHG) [12] can be expressed as a sequence of iterative update steps. Instead of using an optimizer directly, Adler et al. [1] proposed building a network which mimics the updates of a first order algorithm. This approach has been applied to inverse problems such as image denoising [26], tomographic reconstruction [2], and novel view synthesis [17]. TVNet [16] implemented the TV-L1 algorithm as a computation graph, which enabled training the TV-L1 parameters. However, TVNet operates directly based on intensity gradients instead of learned features, which limits the achievable accuracy on challenging datasets such as Sintel.

另一种方法是直接从数据中学习迭代更新 [1,12]。这些方法的动机是,像原始对偶混合梯度(PDHG)[12] 这样的一阶优化器可以表示为一系列迭代更新步骤。Adler 等人 [1] 提出构建一个模拟一阶算法更新的网络,而不是直接使用优化器。这种方法已应用于图像去噪 [26]、断层重建 [2] 和新视角合成 [17] 等逆问题。TVNet [16] 将 TV-L1 算法实现为一个计算图,从而能够训练 TV-L1 参数。然而,TVNet 直接基于强度梯度而不是学习到的特征进行操作,这限制了在如 Sintel 这样具有挑战性的数据集上可达到的准确性。

Our approach can be viewed as learning to optimize: our network uses a large number of update blocks to emulate the steps of a first-order optimization algorithm. However, unlike prior work, we never explicitly define a gradient with respect to some optimization objective. Instead, our network retrieves features from correlation volumes to propose the descent direction.

我们的方法可以看作是学习优化:我们的网络使用大量更新块来模拟一阶优化算法的步骤。然而,与先前的工作不同,我们从未明确地定义相对于某些优化目标的梯度。相反,我们的网络从相关体积中检索特征以提出下降方向。

3 Approach

3 方法

Given a pair of consecutive RGB images, \({I}_{1},{I}_{2}\), we estimate a dense displacement field \(\left( {{\mathbf{f}}^{1},{\mathbf{f}}^{2}}\right)\) which maps each pixel \(\left( {u,v}\right)\) in \({I}_{1}\) to its corresponding coordinates \(\left( {{u}^{\prime },{v}^{\prime }}\right) = \left( {u + {f}^{1}\left( u\right) ,v + {f}^{2}\left( v\right) }\right)\) in \({I}_{2}\). An overview of our approach is given in Figure 1. Our method can be distilled down to three stages: (1) feature extraction, (2) computing visual similarity, and (3) iterative updates, where all stages are differentiable and composed into an end-to-end trainable architecture.

给定一对连续的 RGB 图像 \({I}_{1},{I}_{2}\),我们估计一个密集位移场 \(\left( {{\mathbf{f}}^{1},{\mathbf{f}}^{2}}\right)\),该场将 \({I}_{1}\) 中的每个像素 \(\left( {u,v}\right)\) 映射到 \({I}_{2}\) 中的相应坐标 \(\left( {{u}^{\prime },{v}^{\prime }}\right) = \left( {u + {f}^{1}\left( u\right) ,v + {f}^{2}\left( v\right) }\right)\)。我们的方法概述如图1所示,可以简化为三个阶段:(1)特征提取,(2)计算视觉相似度,和(3)迭代更新,所有阶段都是可微分的,并组合成一个端到端可训练的架构。

Fig. 2: Building correlation volumes. Here we depict 2D slices of a full 4D volume. For a feature vector in \({I}_{1}\), we take the inner product with all pairs in \({I}_{2}\), generating a \(4\mathrm{D}\) \(W \times H \times W \times H\) volume (each pixel in \({I}_{2}\) produces a \(2\mathrm{D}\) response map). The volume is pooled using average pooling with kernel sizes \(\{ 1,2,4,8\}\).

图2:构建相关体积。这里我们展示了一个完整4D体积的2D切片。对于 \({I}_{1}\) 中的一个特征向量,我们与 \({I}_{2}\) 中的所有对进行内积运算,生成一个 \(4\mathrm{D}W \times H \times W \times H\) 体积(\({I}_{2}\) 中的每个像素产生一个 \(2\mathrm{D}\) 响应图)。该体积使用平均池化进行池化,核大小为 \(\{ 1,2,4,8\}\)。

3.1 Feature Extraction

3.1 特征提取

Features are extracted from the input images using a convolutional network. The feature encoder network is applied to both \({I}_{1}\) and \({I}_{2}\) and maps the input images to dense feature maps at a lower resolution. Our encoder, \({g}_{\theta }\) outputs features at \(1/8\) resolution \({g}_{\theta } : {\mathbb{R}}^{H \times W \times 3} \mapsto {\mathbb{R}}^{H/8 \times W/8 \times D}\) where we set \(D = {256}\) . The feature encoder consists of 6 residual blocks,2 at \(1/2\) resolution,2 at \(1/4\) resolution, and 2 at 1/8 resolution (more details in the supplemental material).

使用卷积网络从输入图像中提取特征。特征编码器网络应用于 \({I}_{1}\) 和 \({I}_{2}\),并将输入图像映射到较低分辨率的密集特征图。我们的编码器 \({g}_{\theta }\) 输出 \(1/8\) 分辨率的特征 \({g}_{\theta } : {\mathbb{R}}^{H \times W \times 3} \mapsto {\mathbb{R}}^{H/8 \times W/8 \times D}\),我们设置 \(D = {256}\)。特征编码器由6个残差块组成,2个在 \(1/2\) 分辨率,2个在 \(1/4\) 分辨率,2个在1/8分辨率(更多细节见补充材料)。
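
As a rough, hedged illustration of this stage, the following PyTorch sketch builds an encoder with the layout described above: six residual blocks, two each at 1/2, 1/4, and 1/8 resolution, with output dimension \(D = 256\). The channel widths, the stem, and the omission of normalization layers are our own simplifications; the exact architecture is given in the paper's supplemental material.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # A minimal residual block; the real encoder also uses normalization layers.
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.down = (nn.Conv2d(in_ch, out_ch, 1, stride=stride)
                     if (stride != 1 or in_ch != out_ch) else nn.Identity())

    def forward(self, x):
        y = self.relu(self.conv1(x))
        y = self.conv2(y)
        return self.relu(y + self.down(x))

class FeatureEncoder(nn.Module):
    # Maps a B x 3 x H x W image to B x D x H/8 x W/8 features (D = 256), per Sec. 3.1.
    def __init__(self, dim=256):
        super().__init__()
        self.stem = nn.Conv2d(3, 64, 7, stride=2, padding=3)                              # 1/2
        self.layer1 = nn.Sequential(ResidualBlock(64, 64), ResidualBlock(64, 64))         # 1/2
        self.layer2 = nn.Sequential(ResidualBlock(64, 96, 2), ResidualBlock(96, 96))      # 1/4
        self.layer3 = nn.Sequential(ResidualBlock(96, 128, 2), ResidualBlock(128, 128))   # 1/8
        self.head = nn.Conv2d(128, dim, 1)

    def forward(self, x):
        x = torch.relu(self.stem(x))
        x = self.layer3(self.layer2(self.layer1(x)))
        return self.head(x)
```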

We additionally use a context network. The context network extracts features only from the first input image \({I}_{1}\). The architecture of the context network, \({h}_{\theta }\), is identical to the feature extraction network. Together, the feature network \({g}_{\theta }\) and the context network \({h}_{\theta }\) form the first stage of our approach, which only needs to be performed once.

我们还使用了一个上下文网络。该上下文网络仅从第一个输入图像 \({I}_{1}\) 中提取特征。上下文网络的架构 \({h}_{\theta }\) 与特征提取网络相同。特征网络 \({g}_{\theta }\) 和上下文网络 \({h}_{\theta }\) 共同构成了我们方法的第一阶段,该阶段只需执行一次。

3.2 Computing Visual Similarity

3.2 计算视觉相似度

We compute visual similarity by constructing a full correlation volume between all pairs. Given image features \({g}_{\theta }\left( {I}_{1}\right) \in {\mathbb{R}}^{H \times W \times D}\) and \({g}_{\theta }\left( {I}_{2}\right) \in {\mathbb{R}}^{H \times W \times D}\) ,the correlation volume is formed by taking the dot product between all pairs of feature vectors. The correlation volume, \(\mathbf{C}\) ,can be efficiently computed as a single matrix multiplication.

我们通过构建所有图像对之间的全相关体积来计算视觉相似度。给定图像特征 \({g}_{\theta }\left( {I}_{1}\right) \in {\mathbb{R}}^{H \times W \times D}\) 和 \({g}_{\theta }\left( {I}_{2}\right) \in {\mathbb{R}}^{H \times W \times D}\),相关体积是通过所有特征向量对的点积形成的。相关体积 \(\mathbf{C}\) 可以高效地计算为单一矩阵乘法。

\[\mathbf{C}\left( {{g}_{\theta }\left( {I}_{1}\right) ,{g}_{\theta }\left( {I}_{2}\right) }\right) \in {\mathbb{R}}^{H \times W \times H \times W},\;{C}_{ijkl} = \mathop{\sum }\limits_{h}{g}_{\theta }{\left( {I}_{1}\right) }_{ijh} \cdot {g}_{\theta }{\left( {I}_{2}\right) }_{klh} \tag{1} \]
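
A minimal PyTorch sketch of Eq. (1), assuming B × D × H × W feature maps from the (shared) feature encoder; the function name and tensor layout are our own choices, not part of the paper.

```python
import torch

def all_pairs_correlation(fmap1, fmap2):
    """Eq. (1): C[i, j, k, l] = <g(I1)_{ij}, g(I2)_{kl}>, computed as one matrix multiplication.

    fmap1, fmap2: B x D x H x W feature maps.
    Returns a B x H x W x H x W correlation volume.
    """
    B, D, H, W = fmap1.shape
    f1 = fmap1.flatten(2).transpose(1, 2)   # B x (H*W) x D
    f2 = fmap2.flatten(2)                   # B x D x (H*W)
    corr = torch.matmul(f1, f2)             # B x (H*W) x (H*W)
    return corr.view(B, H, W, H, W)
```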

Correlation Pyramid: We construct a 4-layer pyramid \(\left\{ {{\mathbf{C}}^{1},{\mathbf{C}}^{2},{\mathbf{C}}^{3},{\mathbf{C}}^{4}}\right\}\) by pooling the last two dimensions of the correlation volume with kernel sizes 1 , 2,4,and 8 and equivalent stride (Figure 2). Thus,volume \({\mathbf{C}}^{k}\) has dimensions \(H \times W \times H/{2}^{k} \times W/{2}^{k}\) . The set of volumes gives information about both large and small displacements; however,by maintaining the first 2 dimensions (the \({I}_{1}\) dimensions) we maintain high resolution information, allowing our method to recover the motions of small fast-moving objects.

相关金字塔:我们通过使用核大小为 1、2、4 和 8 以及等效步幅对相关体积的最后两个维度进行池化,构建了一个 4 层金字塔 \(\left\{ {{\mathbf{C}}^{1},{\mathbf{C}}^{2},{\mathbf{C}}^{3},{\mathbf{C}}^{4}}\right\}\)(图 2)。因此,体积 \({\mathbf{C}}^{k}\) 的维度为 \(H \times W \times H/{2}^{k} \times W/{2}^{k}\)。这些体积集合提供了关于大位移和小位移的信息;然而,通过保持前两个维度(\({I}_{1}\) 维度),我们保持了高分辨率信息,使得我们的方法能够恢复快速移动的小物体的运动。
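
The pyramid can be sketched as repeated 2× average pooling over the last two (the \(I_2\)) dimensions, which matches kernel sizes {1, 2, 4, 8}; the helper name and tensor layout are again assumptions on top of the sketch above.

```python
import torch.nn.functional as F

def correlation_pyramid(corr, num_levels=4):
    """Pool the last two dimensions of the all-pairs volume (Fig. 2).

    corr: B x H x W x H x W volume from all_pairs_correlation.
    Returns a list of (B*H*W) x 1 x H/2^i x W/2^i volumes, i = 0..num_levels-1.
    """
    B, H, W, H2, W2 = corr.shape
    corr = corr.view(B * H * W, 1, H2, W2)   # treat each I1 pixel as a batch item
    pyramid = [corr]
    for _ in range(num_levels - 1):
        corr = F.avg_pool2d(corr, kernel_size=2, stride=2)
        pyramid.append(corr)
    return pyramid
```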

Correlation Lookup: We define a lookup operator \({L}_{\mathbf{C}}\) which generates a feature map by indexing from the correlation pyramid. Given a current estimate of optical flow \(\left( {{\mathbf{f}}^{1},{\mathbf{f}}^{2}}\right)\) ,we map each pixel \(\mathbf{x} = \left( {u,v}\right)\) in \({I}_{1}\) to its estimated

相关查找:我们定义了一个查找操作符 \({L}_{\mathbf{C}}\),它通过从相关金字塔中索引来生成特征图。给定光流的当前估计 \(\left( {{\mathbf{f}}^{1},{\mathbf{f}}^{2}}\right)\),我们将 \({I}_{1}\) 中的每个像素 \(\mathbf{x} = \left( {u,v}\right)\) 映射到其在 \({I}_{2}\) 中的估计对应点。

correspondence in \({I}_{2} : {\mathbf{x}}^{\prime } = \left( {u + {f}^{1}\left( u\right) ,v + {f}^{2}\left( v\right) }\right)\) . We then define a local grid around \({\mathbf{x}}^{\prime }\)

然后我们在 \({\mathbf{x}}^{\prime }\) 周围定义一个局部网格

\[\mathcal{N}{\left( {\mathbf{x}}^{\prime }\right) }_{r} = \left\{ {\mathbf{x}}^{\prime } + \mathbf{dx} \mid \mathbf{dx} \in {\mathbb{Z}}^{2}, {\left\| \mathbf{dx} \right\|}_{1} \leq r \right\} \tag{2} \]

as the set of integer offsets which are within a radius of \(r\) units of \({\mathbf{x}}^{\prime }\) using the L1 distance. We use the local neighborhood \(\mathcal{N}{\left( {\mathbf{x}}^{\prime }\right) }_{r}\) to index from the correlation volume. Since \(\mathcal{N}{\left( {\mathbf{x}}^{\prime }\right) }_{r}\) is a grid of real numbers,we use bilinear sampling.

作为一组整数偏移量,这些偏移量在 \({\mathbf{x}}^{\prime }\) 的 \(r\) 单位半径内,使用 L1 距离。我们使用局部邻域 \(\mathcal{N}{\left( {\mathbf{x}}^{\prime }\right) }_{r}\) 从相关体积中索引。由于 \(\mathcal{N}{\left( {\mathbf{x}}^{\prime }\right) }_{r}\) 是一个实数网格,我们使用双线性采样。

We perform lookups on all levels of the pyramid, such that the correlation volume at level \(k,{\mathbf{C}}^{k}\) ,is indexed using the grid \(\mathcal{N}{\left( {\mathbf{x}}^{\prime }/{2}^{k}\right) }_{r}\) . A constant radius across levels means larger context at lower levels: for the lowest level, \(k = 4\) using a radius of 4 corresponds to a range of 256 pixels at the original resolution. The values from each level are then concatenated into a single feature map.

我们在金字塔的所有层级上进行查找,使得层级 \(k\) 的相关体积 \({\mathbf{C}}^{k}\) 使用网格 \(\mathcal{N}{\left( {\mathbf{x}}^{\prime }/{2}^{k}\right) }_{r}\) 进行索引。跨层级使用恒定半径意味着较低层级具有更大的上下文:对于最低层级 \(k = 4\),使用半径 4 对应于原始分辨率下 256 像素的范围。然后,将每个层级的值连接成一个单一的特征图。
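
A hedged sketch of the lookup operator \(L_{\mathbf{C}}\), reusing the layouts from the sketches above. For simplicity it samples a square (2r+1) × (2r+1) grid rather than the exact L1 ball of Eq. (2), and it assumes the flow channels are ordered \((f^1, f^2)\) = (horizontal, vertical); both are our choices, not specified by the text.

```python
import torch
import torch.nn.functional as F

def lookup(pyramid, flow, radius=4):
    """Index the correlation pyramid around the current correspondences x' = x + f(x).

    pyramid: list of (B*H*W) x 1 x H/2^i x W/2^i volumes from correlation_pyramid.
    flow:    B x 2 x H x W current flow estimate (at 1/8 input resolution).
    Returns a B x (num_levels * (2r+1)^2) x H x W map of sampled correlation values.
    """
    B, _, H, W = flow.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    coords = torch.stack([xs, ys], dim=0).float().to(flow.device)    # 2 x H x W, (x, y)
    centroids = coords[None] + flow                                  # B x 2 x H x W

    r = radius
    dy, dx = torch.meshgrid(torch.arange(-r, r + 1), torch.arange(-r, r + 1), indexing="ij")
    delta = torch.stack([dx, dy], dim=-1).float().to(flow.device)    # (2r+1) x (2r+1) x 2

    out = []
    for i, corr in enumerate(pyramid):
        # sampling locations in level-i coordinates (divide by 2^i)
        c = centroids.permute(0, 2, 3, 1).reshape(B * H * W, 1, 1, 2) / 2 ** i
        grid = c + delta.view(1, 2 * r + 1, 2 * r + 1, 2)
        h_i, w_i = corr.shape[-2:]
        # normalize to [-1, 1] for bilinear sampling with grid_sample
        grid = torch.stack([2 * grid[..., 0] / (w_i - 1) - 1,
                            2 * grid[..., 1] / (h_i - 1) - 1], dim=-1)
        sampled = F.grid_sample(corr, grid, align_corners=True)      # (B*H*W) x 1 x (2r+1) x (2r+1)
        out.append(sampled.view(B, H, W, -1).permute(0, 3, 1, 2))
    return torch.cat(out, dim=1)
```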

Efficient Computation for High Resolution Images: The all pairs correlation scales \(O\left( {N}^{2}\right)\) where \(N\) is the number of pixels,but only needs to be computed once and is constant in the number of iterations \(M\) . However,there exists an equivalent implementation of our approach which scales \(O\left( {NM}\right)\) exploiting the linearity of the inner product and average pooling. Consider the cost volume at level \(m,{\mathbf{C}}_{ijkl}^{m}\) ,and feature maps \({g}^{\left( 1\right) } = {g}_{\theta }\left( {I}_{1}\right) ,{g}^{\left( 2\right) } = {g}_{\theta }\left( {I}_{2}\right)\) :

高分辨率图像的高效计算:所有成对相关性的规模 \(O\left( {N}^{2}\right)\) 其中 \(N\) 是像素数量,但只需计算一次并且在迭代次数 \(M\) 中保持恒定。然而,存在一个等效的实现方式,利用内积和平均池化的线性性质,规模为 \(O\left( {NM}\right)\)。考虑层级 \(m,{\mathbf{C}}_{ijkl}^{m}\) 的成本体积和特征图 \({g}^{\left( 1\right) } = {g}_{\theta }\left( {I}_{1}\right) ,{g}^{\left( 2\right) } = {g}_{\theta }\left( {I}_{2}\right)\):

\[{\mathbf{C}}_{ijkl}^{m} = \frac{1}{{2}^{2m}}\mathop{\sum }\limits_{p}^{{2}^{m}}\mathop{\sum }\limits_{q}^{{2}^{m}}\left\langle {{g}_{i,j}^{\left( 1\right) },{g}_{{2}^{m}k + p,{2}^{m}l + q}^{\left( 2\right) }}\right\rangle = \left\langle {{g}_{i,j}^{\left( 1\right) },\frac{1}{{2}^{2m}}\left( {\mathop{\sum }\limits_{p}^{{2}^{m}}\mathop{\sum }\limits_{q}^{{2}^{m}}{g}_{{2}^{m}k + p,{2}^{m}l + q}^{\left( 2\right) }}\right) }\right\rangle \]

which is the average over the correlation response in the \({2}^{m} \times {2}^{m}\) grid. This means that the value at \({\mathbf{C}}_{ijkl}^{m}\) can be computed as the inner product between the feature vector \({g}_{\theta }{\left( {I}_{1}\right) }_{ij}\) and \({g}_{\theta }\left( {I}_{2}\right)\) pooled with kernel size \({2}^{m} \times {2}^{m}\) .

这是在 \({2}^{m} \times {2}^{m}\) 网格中相关性响应的平均值。这意味着在 \({\mathbf{C}}_{ijkl}^{m}\) 处的值可以计算为特征向量 \({g}_{\theta }{\left( {I}_{1}\right) }_{ij}\) 和 \({g}_{\theta }\left( {I}_{2}\right)\) 与核大小为 \({2}^{m} \times {2}^{m}\) 的池化结果的内积。
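
This identity is easy to check numerically; the self-contained snippet below compares the two orders of operations for the m = 1 case.

```python
import torch
import torch.nn.functional as F

# Pooling the correlation volume over the I2 dimensions equals correlating
# against pooled I2 features, by linearity of the inner product.
torch.manual_seed(0)
D, H, W = 8, 16, 16
g1 = torch.randn(D, H, W)
g2 = torch.randn(D, H, W)

# full volume, then 2x2 average pooling over the last two dimensions (m = 1)
corr = torch.einsum('dij,dkl->ijkl', g1, g2)                                   # H x W x H x W
pooled_corr = F.avg_pool2d(corr.view(H * W, 1, H, W), 2).view(H, W, H // 2, W // 2)

# pool g2 first, then correlate
g2_pooled = F.avg_pool2d(g2[None], 2)[0]                                       # D x H/2 x W/2
corr_alt = torch.einsum('dij,dkl->ijkl', g1, g2_pooled)

assert torch.allclose(pooled_corr, corr_alt, atol=1e-5)
```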

In this alternative implementation, we do not precompute the correlations, but instead precompute the pooled image feature maps. In each iteration, we compute each correlation value on demand, only when it is looked up. This gives a complexity of \(O\left( {NM}\right)\).

在这种替代实现中,我们不预先计算相关性,而是预先计算池化的图像特征图。在每次迭代中,我们按需计算每个相关性值——仅在查找时计算。这给出了复杂度为 \(O\left( {NM}\right)\)。

We found empirically that precomputing all pairs is easy to implement and not a bottleneck, due to highly optimized matrix routines on GPUs; even for 1088x1920 videos it takes only \({17}\%\) of total inference time. Note that we can always switch to the alternative implementation should it become a bottleneck.

我们根据经验发现,预先计算所有成对相关性易于实现,并且由于 GPU 上高度优化的矩阵例程,即使对于 1088x1920 的视频,也仅占用总推理时间的 \({17}\%\)。请注意,如果成为瓶颈,我们总是可以切换到替代实现。

3.3 Iterative Updates

3.3 迭代更新

Our update operator estimates a sequence of flow estimates \(\left\{ {{\mathbf{f}}_{1},\ldots ,{\mathbf{f}}_{N}}\right\}\) from an initial starting point \({\mathbf{f}}_{0} = \mathbf{0}\). With each iteration, it produces an update direction \(\Delta \mathbf{f}\) which is applied to the current estimate: \({\mathbf{f}}_{k + 1} = \Delta \mathbf{f} + {\mathbf{f}}_{k}\).

我们的更新操作符从初始起点 \({\mathbf{f}}_{0} = \mathbf{0}\) 估计一系列流估计 \(\left\{ {{\mathbf{f}}_{1},\ldots ,{\mathbf{f}}_{N}}\right\}\)。每次迭代中,它生成一个更新方向 \(\Delta \mathbf{f}\),并将其应用于当前估计:\({\mathbf{f}}_{k + 1} = \Delta \mathbf{f} + {\mathbf{f}}_{k}\)。

The update operator takes flow, correlation, and a latent hidden state as input, and outputs the update \(\Delta \mathbf{f}\) and an updated hidden state. The architecture of our update operator is designed to mimic the steps of an optimization algorithm. As such, we use tied weights across depth and bounded activations to encourage convergence to a fixed point. The update operator is trained to perform updates such that the sequence converges to a fixed point \({\mathbf{f}}_{k} \rightarrow {\mathbf{f}}^{ * }\).

更新操作符以流、相关性和一个潜在的隐藏状态作为输入,并输出更新 \(\Delta \mathbf{f}\) 和一个更新的隐藏状态。我们的更新操作符的架构设计模仿了优化算法的步骤。因此,我们在深度上使用绑定权重,并使用有界激活来促进收敛到一个固定点。更新操作符被训练为执行更新,使得序列收敛到一个固定点 \({\mathbf{f}}_{k} \rightarrow {\mathbf{f}}^{ * }\)。

Initialization: By default, we initialize the flow field to 0 everywhere, but our iterative approach gives us the flexibility to experiment with alternatives. When applied to video, we test warm-start initialization, where optical flow from the previous pair of frames is forward projected to the next pair of frames with occlusion gaps filled in using nearest neighbor interpolation.

初始化:默认情况下,我们将流场初始化为全零,但我们的迭代方法使我们能够灵活地尝试其他方法。当应用于视频时,我们测试热启动初始化,其中从前一对帧的光流通过最近邻插值填充遮挡间隙,向前投影到下一对帧。

Inputs: Given the current flow estimate \({\mathbf{f}}^{k}\) ,we use it to retrieve correlation features from the correlation pyramid as described in Sec. 3.2. The correlation features are then processed by 2 convolutional layers. Additionally, we apply 2 convolutional layers to the flow estimate itself to generate flow features. Finally, we directly inject the input from the context network. The input feature map is then taken as the concatenation of the correlation, flow, and context features.

输入:给定当前流估计 \({\mathbf{f}}^{k}\),我们使用它从相关性金字塔中检索相关性特征,如第3.2节所述。然后,相关性特征通过2个卷积层进行处理。此外,我们对流估计本身应用2个卷积层以生成流特征。最后,我们直接注入来自上下文网络的输入。输入特征图然后取相关性、流和上下文特征的串联。

Update: A core component of the update operator is a gated activation unit based on the GRU cell, with fully connected layers replaced with convolutions:

更新:更新操作符的核心组件是一个基于GRU单元的门控激活单元,其中全连接层被卷积替换:

\[{z}_{t} = \sigma \left( {{\operatorname{Conv}}_{3 \times 3}\left( {\left\lbrack {{h}_{t - 1},{x}_{t}}\right\rbrack ,{W}_{z}}\right) }\right) \tag{3} \]

\[{r}_{t} = \sigma \left( {{\operatorname{Conv}}_{3 \times 3}\left( {\left\lbrack {{h}_{t - 1},{x}_{t}}\right\rbrack ,{W}_{r}}\right) }\right) \tag{4} \]

\[\widetilde{{h}_{t}} = \tanh \left( {{\operatorname{Conv}}_{3 \times 3}\left( {\left\lbrack {{r}_{t} \odot {h}_{t - 1},{x}_{t}}\right\rbrack ,{W}_{h}}\right) }\right) \tag{5} \]

\[{h}_{t} = \left( {1 - {z}_{t}}\right) \odot {h}_{t - 1} + {z}_{t} \odot \widetilde{{h}_{t}} \tag{6} \]

where \({x}_{t}\) is the concatenation of flow,correlation,and context features previously defined. We also experiment with a separable ConvGRU unit, where we replace the \(3 \times 3\) convolution with two GRUs: one with a \(1 \times 5\) convolution and one with a 5 × 1 convolution to increase the receptive field without significantly increasing the size of the model.

其中 \({x}_{t}\) 是先前定义的流、相关性和上下文特征的串联。我们还尝试使用可分离的 ConvGRU 单元,其中我们用两个 GRU 替换 \(3 \times 3\) 卷积:一个具有 \(1 \times 5\) 卷积,另一个具有 5 × 1 卷积,以在不显著增加模型大小的情况下增加感受野。
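
Eqs. (3)-(6) translate almost directly into a small PyTorch module; the hidden and input dimensions below are placeholders rather than the paper's exact values.

```python
import torch
import torch.nn as nn

class ConvGRU(nn.Module):
    """Convolutional GRU cell of Eqs. (3)-(6): fully connected layers replaced by 3x3 convs."""
    def __init__(self, hidden_dim=128, input_dim=128):
        super().__init__()
        self.convz = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convr = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convh = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)

    def forward(self, h, x):
        # x is the concatenation of flow, correlation, and context features
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))                                 # Eq. (3)
        r = torch.sigmoid(self.convr(hx))                                 # Eq. (4)
        h_tilde = torch.tanh(self.convh(torch.cat([r * h, x], dim=1)))    # Eq. (5)
        return (1 - z) * h + z * h_tilde                                  # Eq. (6)
```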

Flow Prediction: The hidden state outputted by the GRU is passed through two convolutional layers to predict the flow update \(\Delta \mathbf{f}\) . The output flow is at \(1/8\) resolution of the input image. During training and evaluation,we upsample the predicted flow fields to match the resolution of the ground truth.

流预测:GRU 输出的隐藏状态通过两个卷积层来预测流更新 \(\Delta \mathbf{f}\)。输出流是输入图像的 \(1/8\) 分辨率。在训练和评估期间,我们将预测的流场上采样以匹配真实值的分辨率。

Upsampling: The network outputs optical flow at \(1/8\) resolution. We upsample the optical flow to full resolution by taking the full resolution flow at each pixel to be the convex combination of a \(3 \times 3\) grid of its coarse resolution neighbors. We use two convolutional layers to predict a \(H/8 \times W/8 \times \left( {8 \times 8 \times 9}\right)\) mask and perform softmax over the weights of the 9 neighbors. The final high resolution flow field is found by using the mask to take a weighted combination over the neighborhood, then permuting and reshaping to a \(H \times W \times 2\) dimensional flow field. This layer can be directly implemented in PyTorch using the unfold function.

上采样:网络以 \(1/8\) 分辨率输出光流。我们通过将每个像素的全分辨率流视为其粗分辨率邻居的 \(3 \times 3\) 网格的凸组合,将光流上采样到全分辨率。我们使用两个卷积层来预测 \(H/8 \times W/8 \times \left( {8 \times 8 \times 9}\right)\) 掩码,并对 9 个邻居的权重执行 softmax。最终的高分辨率流场是通过使用掩码对邻域进行加权组合,然后置换和重塑为 \(H \times W \times 2\) 维流场来找到的。该层可以直接在 PyTorch 中使用 unfold 函数实现。
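
A sketch of this convex upsampling step using unfold, assuming a 1/8-resolution flow field and a predicted \(H/8 \times W/8 \times (8 \times 8 \times 9)\) mask. The factor of 8 applied to the flow values (to convert coarse-pixel units into full-resolution units) is our addition and is not stated in the text.

```python
import torch
import torch.nn.functional as F

def convex_upsample(flow, mask):
    """Upsample 1/8-resolution flow to full resolution with a learned convex combination.

    flow: B x 2 x H/8 x W/8 coarse flow.
    mask: B x (8*8*9) x H/8 x W/8 logits predicted by two conv layers.
    Returns B x 2 x H x W full-resolution flow.
    """
    B, _, H, W = flow.shape
    mask = mask.view(B, 1, 9, 8, 8, H, W)
    mask = torch.softmax(mask, dim=2)                    # convex weights over the 3x3 neighborhood

    up = F.unfold(8 * flow, kernel_size=3, padding=1)    # B x (2*9) x (H*W)
    up = up.view(B, 2, 9, 1, 1, H, W)

    up = torch.sum(mask * up, dim=2)                     # B x 2 x 8 x 8 x H x W
    up = up.permute(0, 1, 4, 2, 5, 3)                    # B x 2 x H x 8 x W x 8
    return up.reshape(B, 2, 8 * H, 8 * W)
```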

3.4 Supervision

3.4 监督

We supervise our network on the \({l}_{1}\) distance between the predicted and ground truth flow over the full sequence of predictions, \(\left\{ {{\mathbf{f}}_{1},\ldots ,{\mathbf{f}}_{N}}\right\}\), with exponentially increasing weights. Given ground truth flow \({\mathbf{f}}_{gt}\), the loss is defined as

我们在预测的完整序列上,预测流与真实流之间的 \({l}_{1}\) 距离 \(\left\{ {{\mathbf{f}}_{1},\ldots ,{\mathbf{f}}_{N}}\right\}\) 上监督我们的网络,权重呈指数级增加。给定真实流 \({\mathbf{f}}_{gt}\),损失定义为

\[\mathcal{L} = \mathop{\sum }\limits_{{i = 1}}^{N}{\gamma }^{N - i}{\begin{Vmatrix}{\mathbf{f}}_{gt} - {\mathbf{f}}_{i}\end{Vmatrix}}_{1} \tag{7} \]

Fig. 3: Flow predictions on the Sintel test set.

图 3:在 Sintel 测试集上的流预测。

where we set \(\gamma = {0.8}\) in our experiments.

其中我们在实验中设置 \(\gamma = {0.8}\)。
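
Eq. (7) can be written as a short loss function. The per-pixel \(l_1\) distance is averaged over pixels here, and masking of invalid ground-truth pixels is omitted; both are simplifications of ours.

```python
def sequence_loss(flow_preds, flow_gt, gamma=0.8):
    """Eq. (7): weighted l1 loss over the sequence of predictions {f_1, ..., f_N}."""
    n = len(flow_preds)
    loss = 0.0
    for i, flow in enumerate(flow_preds):    # i = 0 corresponds to f_1, i = n-1 to f_N
        weight = gamma ** (n - i - 1)        # gamma^(N - i) in the paper's 1-based indexing
        loss = loss + weight * (flow_gt - flow).abs().mean()
    return loss
```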

4 Experiments

4 实验

We evaluate RAFT on Sintel [11] and KITTI [18]. Following previous works, we pretrain our network on FlyingChairs [15] and FlyingThings [33], followed by dataset-specific finetuning. Our method achieves state-of-the-art performance on both Sintel (both clean and final passes) and KITTI. Additionally, we test our method on 1080p video from the DAVIS dataset [37] to demonstrate that our method scales to videos of very high resolution.

我们在 Sintel [11] 和 KITTI [18] 上评估 RAFT。按照先前的工作,我们在 FlyingChairs [15] 和 FlyingThings [33] 上预训练我们的网络,随后进行特定数据集的微调。我们的方法在 Sintel(清洁和最终通道)和 KITTI 上都达到了最先进的性能。此外,我们在 DAVIS 数据集 [37] 的 1080p 视频上测试我们的方法,以证明我们的方法可以扩展到非常高分辨率的视频。

Implementation Details: RAFT is implemented in PyTorch [36]. All modules are initialized from scratch with random weights. During training, we use the AdamW [30] optimizer and clip gradients to the range \(\left\lbrack {-1,1}\right\rbrack\). Unless otherwise noted, we evaluate after 32 flow updates on Sintel and 24 on KITTI. For every update, \(\Delta \mathbf{f} + {\mathbf{f}}_{k}\), we only backpropagate the gradient through the \(\Delta \mathbf{f}\) branch, and zero the gradient through the \({\mathbf{f}}_{k}\) branch as suggested by [20].

实现细节:RAFT 在 PyTorch [36] 中实现。所有模块都从头开始用随机权重初始化。在训练期间,我们使用 AdamW [30] 优化器并将梯度裁剪到范围 \(\left\lbrack {-1,1}\right\rbrack\)。除非另有说明,我们在 Sintel 上进行 32 次流更新后评估,在 KITTI 上进行 24 次更新后评估。对于每次更新 \(\Delta \mathbf{f} + {\mathbf{f}}_{k}\),我们仅通过 \(\Delta \mathbf{f}\) 分支反向传播梯度,并按照 [20] 的建议将通过 \({\mathbf{f}}_{k}\) 分支的梯度置零。
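
A hedged sketch of one training step under these settings, reusing sequence_loss from the previous section. Here `model` stands for a hypothetical RAFT-style network returning the list of per-iteration predictions, and the learning rate and weight decay are placeholder values not taken from the text.

```python
import torch

def train_step(model, optimizer, image1, image2, flow_gt, iters=12, gamma=0.8):
    # model(image1, image2, iters) is assumed to return [f_1, ..., f_N]; internally,
    # each update should detach f_k so gradients flow only through the delta-f branch:
    #     f = f.detach(); f = f + delta_f
    flow_preds = model(image1, image2, iters=iters)
    loss = sequence_loss(flow_preds, flow_gt, gamma=gamma)

    optimizer.zero_grad()
    loss.backward()
    # clip gradients to the range [-1, 1] as described above
    torch.nn.utils.clip_grad_value_(model.parameters(), 1.0)
    optimizer.step()
    return loss.item()

# Example setup (hyperparameters are placeholders):
# optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=1e-4)
```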

Training Schedule: We train RAFT using two 2080Ti GPUs. We pretrain on FlyingChairs for 100k iterations with a batch size of 12, then train for 100k iterations on FlyingThings3D with a batch size of 6. We finetune on Sintel for another 100k iterations by combining data from Sintel [11], KITTI-2015 [34], and HD1K [27], similar to MaskFlowNet [52] and PWC-Net+ [41]. Finally, we finetune on KITTI-2015 for an additional 50k iterations using the weights from the model finetuned on Sintel. Details on training and data augmentation are provided in the supplemental material. For comparison with prior work, we also include results from our model when finetuning only on Sintel and only on KITTI.

训练计划:我们使用两块 2080Ti GPU 训练 RAFT。我们在 FlyingChairs 上预训练 10 万次迭代,批量大小为 12,然后在 FlyingThings3D 上训练 10 万次迭代,批量大小为 6。我们结合来自 Sintel [11]、KITTI-2015 [34] 和 HD1K [27] 的数据,在 Sintel 上再微调 10 万次迭代,做法类似于 MaskFlowNet [52] 和 PWC-Net+ [41]。最后,我们使用在 Sintel 上微调的模型权重,在 KITTI-2015 上再微调 5 万次迭代。训练和数据增强的细节在补充材料中提供。为了与先前的工作进行比较,我们还包含了仅在 Sintel 和仅在 KITTI 上微调的模型的结果。

4.1 Sintel

4.1 Sintel

We train our model using the FlyingChairs \(\rightarrow\) FlyingThings schedule and then evaluate on the Sintel dataset using the train split for validation. Results are shown in Table 1 and Figure 3, and we split results based on the data used for training. C + T means that the models are trained on FlyingChairs \(\left( C\right)\) and FlyingThings(T), while +ft indicates the model is finetuned on Sintel data. Like PWC-Net+[41] and MaskFlowNet [52] we include data from KITTI and HD1K when finetuning. We train 3 times with different seeds, and report results using the model with the median accuracy on the clean pass of Sintel (train).

我们使用 FlyingChairs \(\rightarrow\) FlyingThings 计划训练模型,然后在 Sintel 数据集上使用训练集进行验证。结果显示在表 1 和图 3 中,我们根据用于训练的数据对结果进行划分。C + T 表示模型在 FlyingChairs \(\left( C\right)\) 和 FlyingThings(T) 上进行训练,而 +ft 表示模型在 Sintel 数据上进行微调。与 PWC-Net+[41] 和 MaskFlowNet [52] 类似,我们在微调时包括了 KITTI 和 HD1K 的数据。我们使用不同的种子训练了 3 次,并报告了在 Sintel(训练)干净通道上具有中位准确率的模型的结果。

Fig. 4: Flow predictions on the KITTI test set.

图 4:KITTI 测试集上的光流预测。

When using \(\mathrm{C} + \mathrm{T}\) for training, our method outperforms all existing approaches, despite using a significantly shorter training schedule. Our method achieves an average EPE (end-point-error) of 1.43 on the Sintel (train) clean pass, which is a \({29}\%\) lower error than FlowNet2. These results demonstrate good cross-dataset generalization. One of the reasons for better generalization is the structure of our network. By constraining optical flow to be the product of a series of identical update steps, we force the network to learn an update operator which mimics the updates of a first-order descent algorithm. This constrains the search space, reduces the risk of over-fitting, and leads to faster training and better generalization.

在使用 \(\mathrm{C} + \mathrm{T}\) 进行训练时,尽管训练计划显著缩短,我们的方法仍优于所有现有方法。我们的方法在 Sintel(训练)干净通道上的平均 EPE(终点误差)为 1.43,比 FlowNet2 的误差低 \({29}\%\)。这些结果展示了良好的跨数据集泛化能力。更好泛化能力的原因之一是我们网络的结构。通过将光流约束为一系列相同更新步骤的结果,我们迫使网络学习一个模仿一阶下降算法更新的更新算子。这约束了搜索空间,降低了过拟合的风险,并带来更快的训练和更好的泛化。

When evaluating on the Sintel (test) set, we finetune on the combined clean and final passes of the training set along with KITTI and HD1K data. Our method ranks 1st on both the Sintel clean and final passes, and outperforms all prior work by 0.9 pixels (36%) on the clean pass and 1.2 pixels (30%) on the final pass. We evaluate two versions of our model: Ours (two-frame) uses zero initialization, while Ours (warm-start) initializes flow by forward projecting the flow estimate from the previous frame. Since our method operates at a single resolution, we can initialize the flow estimate to utilize motion smoothness from past frames, which cannot be easily done using the coarse-to-fine model.

在 Sintel(测试)集上评估时,我们在训练集的干净和最终通道以及 KITTI 和 HD1K 数据上进行微调。我们的方法在 Sintel 干净和最终通道上均排名第一,并且在干净通道上比所有先前的工作好 0.9 像素(36%),在最终通道上好 1.2 像素(30%)。我们评估了两个版本的模型:Ours(two-frame)使用零初始化,而 Ours(warm-start)通过从前一帧正向投影光流估计来初始化光流。由于我们的方法在单一分辨率下运行,我们可以初始化光流估计以利用过去帧的运动平滑性,这是使用粗到细模型难以实现的。

4.2 KITTI

4.2 KITTI

We also evaluate RAFT on KITTI and provide results in Table 1 and Figure 4. We first evaluate cross-dataset generalization by evaluating on the KITTI-15 (train) split after training on Chairs (C) and FlyingThings (T). Our method outperforms prior works by a large margin, improving EPE (end-point-error) from 8.36 to 5.04, which shows that the underlying structure of our network facilitates generalization. Our method ranks 1st on the KITTI leaderboard among all optical flow methods.

我们还在 KITTI 上评估了 RAFT,并在表1和图4中提供了结果。我们首先在 FlyingChairs(C)和 FlyingThings(T)上训练后,在 KITTI-15(train)划分上进行评估,以测试跨数据集的泛化能力。我们的方法以较大优势超越了先前的工作,将 EPE(终点误差)从 8.36 改善到 5.04,这表明我们网络的底层结构有助于泛化。我们的方法在 KITTI 排行榜上的所有光流方法中排名第一。

4.3 Ablations

4.3 消融实验

We perform a set of ablation experiments to show the relative importance of each component. All ablated versions are trained on FlyingChairs (C) + FlyingThings (T). Results of the ablations are shown in Table 2. In each section of

我们进行了一系列消融实验,以展示每个组件的相对重要性。所有消融版本都在FlyingChairs(C)+ FlyingThings(T)上进行训练。消融实验的结果如表2所示,在每个部分中

| Training Data | Method | Sintel (train) Clean | Sintel (train) Final | KITTI-15 (train) F1-epe | KITTI-15 (train) F1-all | Sintel (test) Clean | Sintel (test) Final | KITTI-15 (test) F1-all |
|---|---|---|---|---|---|---|---|---|
| - | FlowFields [7] | - | - | - | - | 3.75 | 5.81 | 15.31 |
| - | FlowFields++ [40] | - | - | - | - | 2.94 | 5.49 | 14.82 |
| S | DCFlow [47] | - | - | - | - | 3.54 | 5.12 | 14.86 |
| S | MR-Flow [46] | - | - | - | - | 2.53 | 5.38 | 12.19 |
| C+T | HD3 [50] | 3.84 | 8.77 | 13.17 | 24.0 | - | - | - |
| C+T | LiteFlowNet [22] | 2.48 | 4.04 | 10.39 | 28.5 | - | - | - |
| C+T | PWC-Net [42] | 2.55 | 3.93 | 10.35 | 33.7 | - | - | - |
| C+T | LiteFlowNet2 [23] | 2.24 | 3.78 | 8.97 | 25.9 | - | - | - |
| C+T | VCN [49] | 2.21 | 3.68 | 8.36 | 25.1 | - | - | - |
| C+T | MaskFlowNet [52] | 2.25 | 3.61 | - | 23.1 | - | - | - |
| C+T | FlowNet2 [25] | 2.02 | 3.54¹ | 10.08 | 30.0 | 3.96 | 6.02 | - |
| C+T | Ours (small) | 2.21 | 3.35 | 7.51 | 26.9 | - | - | - |
| C+T | Ours (2-view) | 1.43 | 2.71 | 5.04 | 17.4 | - | - | - |
| C+T+S/K | FlowNet2 [25] | (1.45) | (2.01) | (2.30) | (6.8) | 4.16 | 5.74 | 11.48 |
| C+T+S/K | HD3 [50] | (1.87) | (1.17) | (1.31) | (4.1) | 4.79 | 4.67 | 6.55 |
| C+T+S/K | IRR-PWC [24] | (1.92) | (2.51) | (1.63) | (5.3) | 3.84 | 4.58 | 7.65 |
| C+T+S/K | ScopeFlow [8] | - | - | - | - | 3.59 | 4.10 | 6.82 |
| C+T+S/K | Ours (2-view) | (0.77) | (1.20) | (0.64) | (1.5) | 2.08 | 3.41 | 5.27 |
| C+T+S+K+H | LiteFlowNet2² [23] | (1.30) | (1.62) | (1.47) | (4.8) | 3.48 | 4.69 | 7.74 |
| C+T+S+K+H | PWC-Net+ [41] | (1.71) | (2.34) | (1.50) | (5.3) | 3.45 | 4.60 | 7.72 |
| C+T+S+K+H | VCN [49] | (1.66) | (2.24) | (1.16) | (4.1) | 2.81 | 4.40 | 6.30 |
| C+T+S+K+H | MaskFlowNet [52] | - | - | - | - | 2.52 | 4.17 | 6.10 |
| C+T+S+K+H | Ours (2-view) | (0.76) | (1.22) | (0.63) | (1.5) | 1.94 | 3.18 | 5.10 |
| C+T+S+K+H | Ours (warm-start) | (0.77) | (1.27) | - | - | 1.61 | 2.86 | - |

Table 1: Results on Sintel and KITTI datasets. We test the generalization performance on Sintel (train) after training on FlyingChairs (C) and FlyingThings (T), and outperform all existing methods on both the clean and final pass. The bottom two sections show the performance of our model on public leaderboards after dataset-specific finetuning. S/K includes methods which use only Sintel data for finetuning on Sintel and only KITTI data when finetuning on KITTI. +S+K+H includes methods which combine KITTI, HD1K, and Sintel data when finetuning on Sintel. Ours (warm-start) ranks 1st on both the Sintel clean and final passes, and 1st among all flow approaches on KITTI. (¹FlowNet2 originally reported results on the disparity split of Sintel; 3.54 is the EPE when their model is evaluated on the standard data [22]. ²[23] finds that HD1K data does not help significantly during Sintel finetuning and reports results without it.)

表1:Sintel 和 KITTI 数据集上的结果。我们在 FlyingChairs(C)和 FlyingThings(T)上训练后,测试 Sintel(train)上的泛化性能,并在干净和最终通道上超越所有现有方法。底部两部分显示了我们的模型在数据集特定微调后在公共排行榜上的性能。S/K 包括仅使用 Sintel 数据在 Sintel 上微调以及仅使用 KITTI 数据在 KITTI 上微调的方法。+S+K+H 包括在 Sintel 上微调时结合 KITTI、HD1K 和 Sintel 数据的方法。Ours(warm-start)在 Sintel 干净和最终通道上均排名第一,并且在 KITTI 上所有光流方法中排名第一。(¹FlowNet2 最初在 Sintel 的视差划分上报告结果,3.54 是他们的模型在标准数据上评估时的 EPE [22]。²[23] 发现 HD1K 数据在 Sintel 微调期间没有显著帮助,因此报告了不使用它的结果。)

the table, we test a specific component of our approach in isolation; the settings which are used in our final model are underlined. Below we describe each of the experiments in more detail.

在表中,我们单独测试了我们方法的一个特定组件,最终模型中使用的设置被下划线标出。下面我们更详细地描述每个实验。

Architecture of Update Operator: We use a gated activation unit based on the GRU cell. We experiment with replacing the convolutional GRU with a set of 3 convolutional layers with ReLU activation. We achieve better performance by using the GRU block, likely because the gated activation makes it easier for the sequence of flow estimates to converge.

更新操作符的架构:我们使用基于GRU单元的门控激活单元。我们尝试用一组具有ReLU激活的3个卷积层替换卷积GRU。通过使用GRU块,我们实现了更好的性能,可能是因为门控激活使得光流估计序列更容易收敛。

Weight Tying: By default, we tied the weights across all instances of the update operator. Here, we test a version of our approach where each update operator learns a separate set of weights. Accuracy is better when weights are tied and the parameter count is significantly lower.

权重绑定:默认情况下,我们在所有更新操作符实例中绑定权重。在这里,我们测试了我们方法的一个版本,其中每个更新操作符学习一组单独的权重。当权重被绑定时,准确性更好,且参数数量显著减少。


Reference model (bilinear upsampling), Training: 100k(C) → 60k(T):

| Experiment | Method | Sintel Clean | Sintel Final | KITTI F1-epe | KITTI F1-all | Parameters |
|---|---|---|---|---|---|---|
| Update Op. | ConvGRU | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Update Op. | Conv | 2.04 | 3.21 | 7.66 | 26.1 | 4.1M |
| Tying | Tied Weights | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Tying | Untied Weights | 1.96 | 3.20 | 7.64 | 24.1 | 32.5M |
| Context | Context | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Context | No Context | 1.93 | 3.06 | 6.25 | 23.1 | 3.3M |
| Feature Scale | Single-Scale | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Feature Scale | Multi-Scale | 2.08 | 3.12 | 6.91 | 23.2 | 6.6M |
| Lookup Radius | 0 | 3.41 | 4.53 | 23.6 | 44.8 | 4.7M |
| Lookup Radius | 1 | 1.80 | 2.99 | 6.27 | 21.5 | 4.7M |
| Lookup Radius | 2 | 1.78 | 2.82 | 5.84 | 21.1 | 4.8M |
| Lookup Radius | 4 | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Correlation Pooling | No | 1.95 | 3.02 | 6.07 | 23.2 | 4.7M |
| Correlation Pooling | Yes | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Correlation Range | 32px | 2.91 | 4.48 | 10.4 | 28.8 | 4.8M |
| Correlation Range | 64px | 2.06 | 3.16 | 6.24 | 20.9 | 4.8M |
| Correlation Range | 128px | 1.64 | 2.81 | 6.00 | 19.9 | 4.8M |
| Correlation Range | All-Pairs | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Features for Refinement | Correlation | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Features for Refinement | Warping | 2.27 | 3.73 | 11.83 | 32.1 | 2.8M |

Reference Model (convex upsampling),Training: \({100}\mathrm{k}\left( \mathrm{C}\right) \rightarrow {100}\mathrm{k}\left( \mathrm{T}\right)\)

参考模型(凸上采样),训练:\({100}\mathrm{k}\left( \mathrm{C}\right) \rightarrow {100}\mathrm{k}\left( \mathrm{T}\right)\)

| Experiment | Method | Sintel Clean | Sintel Final | KITTI F1-epe | KITTI F1-all | Parameters |
|---|---|---|---|---|---|---|
| Upsampling | Convex | 1.43 | 2.71 | 5.04 | 17.4 | 5.3M |
| Upsampling | Bilinear | 1.60 | 2.79 | 5.17 | 19.2 | 4.8M |
| Inference Updates | 1 | 4.04 | 5.45 | 15.30 | 44.5 | 5.3M |
| Inference Updates | 3 | 2.14 | 3.52 | 8.98 | 29.9 | 5.3M |
| Inference Updates | 8 | 1.61 | 2.88 | 5.99 | 19.6 | 5.3M |
| Inference Updates | 32 | 1.43 | 2.71 | 5.00 | 17.4 | 5.3M |
| Inference Updates | 100 | 1.41 | 2.72 | 4.95 | 17.4 | 5.3M |
| Inference Updates | 200 | 1.40 | 2.73 | 4.94 | 17.4 | 5.3M |

Table 2: Ablation experiments. Settings used in our final model are underlined. See Sec. 4.3 for details.

表2:消融实验。我们最终模型中使用的设置已加下划线。详情请参见第4.3节。

Context: We test the importance of context by training a model with the context network removed. Without context, we still achieve good results, outperforming all existing works on both Sintel and KITTI. But context is helpful. Directly injecting image features into the update operator likely allows spatial information to be better aggregated within motion boundaries.

上下文:我们通过移除上下文网络来训练模型,以测试上下文的重要性。没有上下文,我们仍然取得了良好的结果,在 Sintel 和 KITTI 上都优于所有现有方法。但上下文是有帮助的。直接将图像特征注入更新操作符可能使得空间信息在运动边界内更好地聚合。

Feature Scale: By default, we extract features at a single resolution. We also try extracting features at multiple resolutions by building a correlation volume at each scale separately. Single-resolution features simplify the network architecture and allow fine-grained matching even at large displacements.

特征尺度:默认情况下,我们提取单一分辨率的特征。我们还尝试通过在每个尺度上分别构建相关性体积来提取多分辨率的特征。单一分辨率的特征简化了网络架构,并允许在大位移情况下进行细粒度匹配。

Lookup Radius: The lookup radius specifies the dimensions of the grid used in the lookup operation. When a radius of 0 is used, the correlation volume is retrieved at a single point. Surprisingly, we can still get a rough estimate of flow when the radius is 0, which means the network is learning to use zeroth-order information. However, we see better results as the radius is increased.

查找半径:查找半径指定了查找操作中使用的网格的尺寸。当使用0半径时,相关性体积在单个点上检索。令人惊讶的是,即使半径为0,我们仍然可以得到流的大致估计,这意味着网络正在学习使用0阶信息。然而,随着半径的增加,我们看到了更好的结果。

Correlation Pooling: We output features at a single resolution and then perform pooling to generate multiscale volumes. Here we test the impact when this pooling is removed. Results are better with pooling, because large and small displacements are both captured.

相关性池化:我们以单一分辨率输出特征,然后进行池化以生成多尺度体积。这里我们测试移除池化的影响。结果显示,有池化时效果更好,因为大位移和小位移都被捕获了。

Correlation Range: Instead of all-pairs correlation, we also try constructing the correlation volume only for a local neighborhood around each pixel. We try a range of 32 pixels, 64 pixels, and 128 pixels. Overall we get the best results when the all-pairs are used, although a 128px range is sufficient to perform well on Sintel because most displacements fall within this range. That said, all-pairs is still preferable because it eliminates the need to specify a range. It is also more convenient to implement: it can be computed using matrix multiplication allowing our approach to be implemented entirely in PyTorch.

相关范围:我们尝试构建仅针对每个像素周围局部邻域的相关体积,而不是所有成对的相关性。我们尝试了32像素、64像素和128像素的范围。总体而言,当使用所有成对相关时,我们获得了最佳结果,尽管128像素范围足以在Sintel上表现良好,因为大多数位移落在此范围内。尽管如此,所有成对相关仍然是首选,因为它消除了指定范围的需要。它也更便于实现:可以使用矩阵乘法计算,从而使我们的方法完全在PyTorch中实现。

Features for Refinement: We compute visual similarity by building a correlation volume between all pairs of pixels. In this experiment, we try replacing the correlation volume with a warping layer, which uses the current estimate of optical flow to warp features from \({I}_{2}\) onto \({I}_{1}\) and then estimates the residual displacement. While warping is still competitive with prior work on Sintel, correlation performs significantly better, especially on KITTI.

细化特征:我们通过构建所有像素对之间的相关体积来计算视觉相似度。在这个实验中,我们尝试用翘曲层替换相关体积,该层使用光流的当前估计将特征从\({I}_{2}\)翘曲到\({I}_{1}\),然后估计残余位移。尽管翘曲在Sintel上的先前工作仍然具有竞争力,但相关性表现明显更好,尤其是在KITTI上。

Upsampling: RAFT outputs flow fields at \(1/8\) resolution. We compare bilinear upsampling to our learned upsampling module. The upsampling module produces better results, particularly near motion boundaries.

上采样:RAFT输出的流场分辨率为\(1/8\)。我们将双线性上采样与我们的学习上采样模块进行了比较。上采样模块产生了更好的结果,特别是在运动边界附近。

Inference Updates: Although we unroll 12 updates during training, we can apply an arbitrary number of updates during inference. In Table 2 we provide numerical results for selected number of updates, and test an extreme case of 200 to show that our method doesn't diverge. Our method quickly converges, surpassing PWC-Net after 3 updates and FlowNet2 after 6 updates, but continues to improve with more updates.

推理更新:尽管我们在训练期间展开了12次更新,但在推理期间我们可以应用任意数量的更新。在表2中,我们为选定的更新次数提供了数值结果,并测试了200次更新的极端情况,以表明我们的方法不会发散。我们的方法迅速收敛,在3次更新后超过PWC-Net,在6次更新后超过FlowNet2,但随着更多更新继续改进。

4.4 Timing and Parameter Counts

4.4 时间和参数计数

Inference time and parameter counts are shown in Figure 5. Accuracy is determined by performance on the Sintel (train) final pass after training on FlyingChairs and FlyingThings (C+T). In these plots, we report accuracy and timing after 10 iterations, and we time our method using a GTX 1080Ti GPU. Parameter counts for other methods are taken as reported in their papers, and we report times when run on our hardware. RAFT is more efficient in terms of parameter count, inference time, and training iterations. Ours-S uses only 1M parameters, but outperforms PWC-Net and VCN which are more than 6x larger. We provide an additional table with numerical values for parameters, timing, and training iterations in the supplemental material.

推理时间和参数量如图5所示。准确性是通过在 FlyingChairs 和 FlyingThings(C+T)上训练后在 Sintel(train)最终通道上的表现来确定的。在这些图中,我们报告了 10 次迭代后的准确性和时间,并使用 GTX 1080Ti GPU 对我们的方法计时。其他方法的参数量取自其论文中的报告,时间则是在我们的硬件上运行时测得的。RAFT 在参数量、推理时间和训练迭代方面更高效。Ours-S 仅使用 100 万个参数,但性能优于大 6 倍以上的 PWC-Net 和 VCN。我们在补充材料中提供了参数、时间和训练迭代的数值表。

Fig. 5: Plots comparing parameter counts, inference time, and training iterations vs. accuracy. Accuracy is measured by the EPE on the Sintel (train) final pass after training on C+T. Left: Parameter count vs. accuracy compared to other methods. RAFT is more parameter efficient while achieving lower EPE. Middle: Inference time vs. accuracy, timed using our hardware. Right: Training iterations vs. accuracy (taken as the product of iterations and GPUs used).

图5:参数计数、推断时间和训练迭代与准确性的比较图。准确性是通过在C+T训练后在Sintel(train)最终通过上的EPE来衡量的。左图:与其他方法相比的参数计数与准确性。RAFT在实现较低EPE的同时更参数高效。中图:使用我们的硬件计时的推断时间与准确性。右图:训练迭代与准确性(取迭代次数和使用的GPU数量的乘积)。

Fig. 6: Results on 1080p (1088x1920) video from DAVIS (550 ms per frame).

图6:来自DAVIS的1080p(1088x1920)视频的结果(每帧550毫秒)。

4.5 Video of Very High Resolution

4.5 极高分辨率视频

To demonstrate that our method scales well to videos of very high resolution, we apply our network to HD video from the DAVIS [37] dataset. We use 1080p (1088x1920) resolution video and apply 12 iterations of our approach. Inference takes 550 ms for 12 iterations on 1080p video, with the all-pairs correlation taking 95 ms. Fig. 6 visualizes example results on DAVIS.

为了证明我们的方法能够很好地扩展到极高分辨率的视频,我们将网络应用于 DAVIS [37] 数据集中的高清视频。我们使用 1080p(1088x1920)分辨率的视频,并应用 12 次迭代。在 1080p 视频上,12 次迭代的推理耗时 550 毫秒,其中全对相关性计算耗时 95 毫秒。图6展示了 DAVIS 上的示例结果。

5 Conclusions

5 结论

We have proposed RAFT (Recurrent All-Pairs Field Transforms), a new end-to-end trainable model for optical flow. RAFT is unique in that it operates at a single resolution using a large number of lightweight, recurrent update operators. Our method achieves state-of-the-art accuracy across a diverse range of datasets, strong cross-dataset generalization, and is efficient in terms of inference time, parameter count, and training iterations.

我们提出了 RAFT(循环全对场变换),一种新的端到端可训练的光流模型。RAFT 的独特之处在于它以单一分辨率运行,使用大量轻量级的循环更新操作符。我们的方法在多种数据集上达到了最先进的准确性,具有强大的跨数据集泛化能力,并且在推理时间、参数数量和训练迭代次数方面都很高效。

Acknowledgments: This work was partially funded by the National Science Foundation under Grant No. 1617767.

致谢:本工作部分由美国国家科学基金会根据第 1617767 号资助项目资助。

References

参考文献

  1. Adler, J., Oktem, O.: Solving ill-posed inverse problems using iterative deep neural networks. Inverse Problems 33(12), 124007 (2017)

  2. Adler, J., Öktem, O.: Learned primal-dual reconstruction. IEEE Transactions on Medical Imaging 37(6), 1322-1332 (2018)

  3. Agrawal, A., Amos, B., Barratt, S., Boyd, S., Diamond, S., Kolter, J.Z.: Differentiable convex optimization layers. In: Advances in Neural Information Processing Systems. pp. 9558-9570 (2019)

  4. Amos, B., Kolter, J.Z.: Optnet: Differentiable optimization as a layer in neural networks. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70. pp. 136-145. JMLR. org (2017)

  5. Bai, S., Kolter, J.Z., Koltun, V.: Trellis networks for sequence modeling. arXiv preprint arXiv:1810.06682 (2018)

  6. Bai, S., Kolter, J.Z., Koltun, V.: Deep equilibrium models. In: Advances in Neural Information Processing Systems. pp. 688-699 (2019)

  7. Bailer, C., Taetz, B., Stricker, D.: Flow fields: Dense correspondence fields for highly accurate large displacement optical flow estimation. In: Proceedings of the IEEE international conference on computer vision. pp. 4015-4023 (2015)

  8. Bar-Haim, A., Wolf, L.: Scopeflow: Dynamic scene scoping for optical flow. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7998-8007 (2020)

  9. Black, M.J., Anandan, P.: A framework for the robust estimation of optical flow. In: 1993 (4th) International Conference on Computer Vision. pp. 231-236. IEEE (1993)

  10. Brox, T., Bregler, C., Malik, J.: Large displacement optical flow. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 41-48. IEEE (2009)

  11. Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalistic open source movie for optical flow evaluation. In: European conference on computer vision. pp. 611- 625. Springer (2012)

  12. Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision 40(1), 120-145 (2011)

  13. Chen, Q., Koltun, V.: Full flow: Optical flow estimation by global optimization over regular grids. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4706-4714 (2016)

  14. Cho, K., Van Merriënboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014)

  15. Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., Brox, T.: Flownet: Learning optical flow with convolutional networks. In: Proceedings of the IEEE international conference on computer vision. pp. 2758-2766 (2015)

  16. Fan, L., Huang, W., Gan, C., Ermon, S., Gong, B., Huang, J.: End-to-end learning of motion representation for video understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6016-6025 (2018)

  17. Flynn, J., Broxton, M., Debevec, P., DuVall, M., Fyffe, G., Overbeck, R.S., Snavely, N., Tucker, R.: Deepview: High-quality view synthesis by learned gradient descent (2019)

  18. Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The kitti dataset. The International Journal of Robotics Research 32(11), 1231-1237 (2013)

  19. Hirschmuller, H.: Stereo processing by semiglobal matching and mutual information. IEEE Transactions on pattern analysis and machine intelligence 30(2), 328-341 (2007)

  20. Hofinger, M., Bulò, S.R., Porzi, L., Knapitsch, A., Kontschieder, P.: Improving optical flow on a pyramidal level. In: ECCV (2020)

  21. Horn, B.K., Schunck, B.G.: Determining optical flow. In: Techniques and Applications of Image Understanding. vol. 281, pp. 319-331. International Society for Optics and Photonics (1981)

  22. Hui, T.W., Tang, X., Change Loy, C.: Liteflownet: A lightweight convolutional neural network for optical flow estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8981-8989 (2018)

  23. Hui, T.W., Tang, X., Loy, C.C.: A lightweight optical flow cnn - revisiting data fidelity and regularization. arXiv preprint arXiv:1903.07414 (2019)

  24. Hur, J., Roth, S.: Iterative residual refinement for joint optical flow and occlusion estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5754-5763 (2019)

  25. Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: Flownet 2.0: Evolution of optical flow estimation with deep networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2462-2470 (2017)

  26. Kobler, E., Klatzer, T., Hammernik, K., Pock, T.: Variational networks: connecting variational methods and deep learning. In: German conference on pattern recognition. pp. 281-293. Springer (2017)

  27. Kondermann, D., Nair, R., Honauer, K., Krispin, K., Andrulis, J., Brock, A., Gussefeld, B., Rahimimoghaddam, M., Hofmann, S., Brenner, C., et al.: The hci benchmark suite: Stereo and flow ground truth with uncertainties for urban autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 19-28 (2016)

  28. Li, X., Wu, J., Lin, Z., Liu, H., Zha, H.: Recurrent squeeze-and-excitation context aggregation net for single image deraining. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 254-269 (2018)

  29. Liang, Z., Feng, Y., Guo, Y., Liu, H., Chen, W., Qiao, L., Zhou, L., Zhang, J.: Learning for disparity estimation through feature constancy. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2811- 2820 (2018)

  30. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

  31. Lu, Y., Valmadre, J., Wang, H., Kannala, J., Harandi, M., Torr, P.: Devon: Deformable volume network for learning optical flow. In: The IEEE Winter Conference on Applications of Computer Vision. pp. 2705-2713 (2020)

  32. Lv, Z., Dellaert, F., Rehg, J.M., Geiger, A.: Taking a deeper look at the inverse compositional algorithm. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4581-4590 (2019)

  33. Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4040-4048 (2016)

  34. Menze, M., Geiger, A.: Object scene flow for autonomous vehicles. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3061-3070 (2015)

  35. Menze, M., Heipke, C., Geiger, A.: Discrete optimization for optical flow. In: German Conference on Pattern Recognition. pp. 16-28. Springer (2015)

  36. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch (2017)

  37. Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., Van Gool, L.: The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675 (2017)

  38. Ranftl, R., Bredies, K., Pock, T.: Non-local total generalized variation for optical flow estimation. In: European Conference on Computer Vision. pp. 439-454. Springer (2014)

  39. Ranjan, A., Black, M.J.: Optical flow estimation using a spatial pyramid network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4161-4170 (2017)

  40. Schuster, R., Bailer, C., Wasenmüller, O., Stricker, D.: Flowfields++: Accurate optical flow correspondences meet robust interpolation. In: 2018 25th IEEE International Conference on Image Processing (ICIP). pp. 1463-1467. IEEE (2018)

  41. Sun, D., Yang, X., Liu, M.Y., Kautz, J.: Models matter, so does training: An empirical study of cnns for optical flow estimation. arXiv preprint arXiv:1809.05571 (2018)

  42. Sun, D., Yang, X., Liu, M.Y., Kautz, J.: Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8934-8943 (2018)

  43. Tang, C., Tan, P.: Ba-net: Dense bundle adjustment network. arXiv preprint arXiv:1806.04807 (2018)

  44. Teed, Z., Deng, J.: Deepv2d: Video to depth with differentiable structure from motion. arXiv preprint arXiv:1812.04605 (2018)

  45. Weinzaepfel, P., Revaud, J., Harchaoui, Z., Schmid, C.: Deepflow: Large displacement optical flow with deep matching. In: Proceedings of the IEEE international conference on computer vision. pp. 1385-1392 (2013)

  46. Wulff, J., Sevilla-Lara, L., Black, M.J.: Optical flow in mostly rigid scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4671-4680 (2017)

  47. Xu, J., Ranftl, R., Koltun, V.: Accurate optical flow via direct cost volume processing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1289-1297 (2017)

  48. Yang, G., Manela, J., Happold, M., Ramanan, D.: Hierarchical deep stereo matching on high-resolution images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5515-5524 (2019)

  49. Yang, G., Ramanan, D.: Volumetric correspondence networks for optical flow. In: Advances in Neural Information Processing Systems. pp. 793-803 (2019)

  50. Yin, Z., Darrell, T., Yu, F.: Hierarchical discrete distribution decomposition for match density estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6044-6053 (2019)

  51. Zach, C., Pock, T., Bischof, H.: A duality based approach for realtime tv-l 1 optical flow. In: Joint pattern recognition symposium. pp. 214-223. Springer (2007)

  52. Zhao, S., Sheng, Y., Dong, Y., Chang, E.I., Xu, Y., et al.: Maskflownet: Asymmetric feature matching with learnable occlusion mask. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6278-6287 (2020)

  53. Zhou, H., Ummenhofer, B., Brox, T.: Deeptam: Deep tracking and mapping. In: Proceedings of the European conference on computer vision (ECCV). pp. 822-838 (2018)

A Network Architecture

A 网络架构

Fig. 7: Network architecture details for the full 4.8M parameter model (5.3M with the upsampling module) and the small 1.0M parameter model. The context and feature encoders have the same architecture; the only difference is that the feature encoder uses instance normalization while the context encoder uses batch normalization. In RAFT-S, we replace the residual units with bottleneck residual units. The update block takes in context features, correlation features, and flow features to update the latent hidden state. The updated hidden state is used to predict the flow update. The full model uses two convolutional GRU update blocks with \(1 \times 5\) and \(5 \times 1\) filters respectively, while the small model uses a single GRU with \(3 \times 3\) filters.

图 7:完整 4.8M 参数模型(包含上采样模块为 5.3M 参数)和小型 1.0M 参数模型的网络架构细节。上下文编码器和特征编码器具有相同的架构,唯一的区别是特征编码器使用实例归一化,而上下文编码器使用批归一化。在 RAFT-S 中,我们将残差单元替换为瓶颈残差单元。更新块接收上下文特征、相关特征和流特征来更新潜在隐藏状态。更新的隐藏状态用于预测流更新。完整模型使用两个卷积 GRU 更新块,分别具有 \(1 \times 5\) 滤波器和 \(5 \times 1\) 滤波器,而小型模型使用一个具有 3x3 滤波器的 GRU。
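
The separable convolutional GRU mentioned in the caption can be sketched as follows: a 1x5 horizontal pass followed by a 5x1 vertical pass over the concatenated hidden state and input features. The gating structure follows the description above; the channel sizes are illustrative assumptions rather than values taken from the released code.

```python
import torch
import torch.nn as nn


class SepConvGRU(nn.Module):
    """Two-pass convolutional GRU: a 1x5 horizontal pass followed by a 5x1 vertical pass."""

    def __init__(self, hidden_dim=128, input_dim=256):  # channel sizes are assumptions
        super().__init__()
        self.convz1 = nn.Conv2d(hidden_dim + input_dim, hidden_dim, (1, 5), padding=(0, 2))
        self.convr1 = nn.Conv2d(hidden_dim + input_dim, hidden_dim, (1, 5), padding=(0, 2))
        self.convq1 = nn.Conv2d(hidden_dim + input_dim, hidden_dim, (1, 5), padding=(0, 2))
        self.convz2 = nn.Conv2d(hidden_dim + input_dim, hidden_dim, (5, 1), padding=(2, 0))
        self.convr2 = nn.Conv2d(hidden_dim + input_dim, hidden_dim, (5, 1), padding=(2, 0))
        self.convq2 = nn.Conv2d(hidden_dim + input_dim, hidden_dim, (5, 1), padding=(2, 0))

    @staticmethod
    def _gru(h, x, convz, convr, convq):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(convz(hx))                          # update gate
        r = torch.sigmoid(convr(hx))                          # reset gate
        q = torch.tanh(convq(torch.cat([r * h, x], dim=1)))   # candidate state
        return (1 - z) * h + z * q

    def forward(self, h, x):
        """h: hidden state [N, hidden_dim, H, W]; x: context + motion features."""
        h = self._gru(h, x, self.convz1, self.convr1, self.convq1)  # 1x5 pass
        h = self._gru(h, x, self.convz2, self.convr2, self.convq2)  # 5x1 pass
        return h
```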

B Upsampling Module

B 上采样模块

Fig. 8: Illustration of the upsampling module. Each pixel of the high resolution flow field (small boxes) is taken to be the convex combination of its 9 coarse resolution neighbors using weights predicted by the network.

图 8:上采样模块的图示。高分辨率流场(小方框)的每个像素被视为其 9 个粗分辨率邻居的凸组合,权重由网络预测。
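
A possible implementation of this convex upsampling step is sketched below. It assumes the network predicts, for every coarse pixel, an 8x8 grid of 9 combination weights (the released model upsamples by a factor of 8; here the factor is a parameter); the softmax over the 9 neighbors enforces the convex combination described in the caption.

```python
import torch
import torch.nn.functional as F


def convex_upsample(flow, mask, factor=8):
    """Upsample flow [N, 2, H, W] -> [N, 2, factor*H, factor*W] as a convex combination."""
    N, _, H, W = flow.shape
    mask = mask.view(N, 1, 9, factor, factor, H, W)
    mask = torch.softmax(mask, dim=2)            # convex weights over the 9 coarse neighbors

    # 3x3 neighborhood of every coarse pixel; displacements are also multiplied by
    # `factor` because they were measured on the 1/factor resolution grid.
    up_flow = F.unfold(factor * flow, kernel_size=3, padding=1)
    up_flow = up_flow.view(N, 2, 9, 1, 1, H, W)

    up_flow = torch.sum(mask * up_flow, dim=2)   # weighted combination -> [N, 2, f, f, H, W]
    up_flow = up_flow.permute(0, 1, 4, 2, 5, 3)  # interleave the factor x factor sub-grid
    return up_flow.reshape(N, 2, factor * H, factor * W)
```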

Fig. 9: Our upsampling module improves accuracy near motion boundaries, and also allows RAFT to recover the flow of small fast moving objects such as the birds shown in the figure.

图 9:我们的上采样模块提高了运动边界附近的准确性,并且还允许 RAFT 恢复小而快速移动物体的流,如图中所示的鸟类。

C Training Details

C 训练细节

| Stage | Weights | Training Data | Learning Rate | Batch Size (per GPU) | Weight Decay | Crop Size |
| --- | --- | --- | --- | --- | --- | --- |
| Chairs | - | C | 4e-4 | 6 | 1e-4 | [368, 496] |
| Things | Chairs | T | 1.2e-4 | 3 | 1e-4 | [400, 720] |
| Sintel | Things | S+T+K+H | 1.2e-4 | 3 | 1e-5 | [368, 768] |
| KITTI | Sintel | K | 1e-4 | 3 | 1e-5 | [288, 960] |

Table 3: Details of the training schedule. Dataset abbreviations: C: FlyingChairs, T: FlyingThings, S: Sintel, K: KITTI-2015, H: HD1K. During the Sintel Fine-tuning phase, the dataset distribution is S(.71), T(.135), K(.135), H(.02).

表3:训练计划详情。数据集缩写:C: FlyingChairs, T: FlyingThings, S: Sintel, K: KITTI-2015, H: HD1K。在Sintel微调阶段,数据集分布为S(.71), T(.135), K(.135), H(.02)。
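
For reference, the schedule in Table 3 can be written down as plain Python configuration. The dictionary values are copied from the table; the optimizer construction (AdamW with the stage's learning rate and weight decay, in the spirit of [30]) is an illustrative assumption, not the released training script.

```python
import torch

# stage: (initial weights, data, learning rate, batch size per GPU, weight decay, crop size)
TRAINING_STAGES = {
    "chairs": (None,     "C",       4e-4,   6, 1e-4, (368, 496)),
    "things": ("chairs", "T",       1.2e-4, 3, 1e-4, (400, 720)),
    "sintel": ("things", "S+T+K+H", 1.2e-4, 3, 1e-5, (368, 768)),
    "kitti":  ("sintel", "K",       1e-4,   3, 1e-5, (288, 960)),
}


def make_optimizer(model, stage):
    _, _, lr, _, weight_decay, _ = TRAINING_STAGES[stage]
    return torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
```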

Photometric Augmentation: We perform photometric augmentation by randomly perturbing brightness, contrast, saturation, and hue. We use the Torchvision ColorJitter with brightness 0.4, contrast 0.4, saturation 0.4, and hue \(0.5/\pi\).

光度增强:我们通过随机扰动亮度、对比度、饱和度和色调来进行光度增强。我们使用 Torchvision 的 ColorJitter,亮度为 0.4,对比度为 0.4,饱和度为 0.4,色调为 \(0.5/\pi\)。

On KITTI, we reduce the degree of augmentation to brightness 0.3, contrast 0.3, saturation 0.3, and hue \(0.3/\pi\). With probability 0.2, color augmentation is performed on each of the images independently.

在 KITTI 上,我们将增强程度降低到亮度 0.3、对比度 0.3、饱和度 0.3、色调 \(0.3/\pi\)。以 0.2 的概率对每张图像独立进行颜色增强。
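
A minimal sketch of this photometric augmentation, using the torchvision ColorJitter transform with the parameters stated above. Applying the jitter to the two frames independently with probability 0.2 follows the text; the symmetric branch (one shared perturbation for the pair) is an assumption about the remaining cases.

```python
import math
import random

import torch
from torchvision import transforms


def photometric_aug(img1, img2, kitti=False):
    """img1, img2: float image tensors [3, H, W] with values in [0, 1]."""
    s = 0.3 if kitti else 0.4
    hue = (0.3 if kitti else 0.5) / math.pi
    jitter = transforms.ColorJitter(brightness=s, contrast=s, saturation=s, hue=hue)

    if random.random() < 0.2:
        # Asymmetric: each frame receives its own random perturbation.
        return jitter(img1), jitter(img2)
    # Symmetric (assumed for the remaining cases): share one perturbation across the pair.
    img1, img2 = torch.chunk(jitter(torch.cat([img1, img2], dim=1)), 2, dim=1)
    return img1, img2
```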

Spatial Augmentation: We perform spatial augmentation by randomly rescaling and stretching the images. The degree of random scaling depends on the dataset. For FlyingChairs, we perform spatial augmentation in the range \(2^{[-0.2, 1.0]}\), FlyingThings \(2^{[-0.4, 0.8]}\), Sintel \(2^{[-0.2, 0.6]}\), and KITTI \(2^{[-0.2, 0.4]}\). Spatial augmentation is performed with probability 0.8.

空间增强:我们通过随机缩放和拉伸图像来进行空间增强。随机缩放的程度取决于数据集。对于 FlyingChairs,我们在范围 \(2^{[-0.2, 1.0]}\) 内进行空间增强,FlyingThings 为 \(2^{[-0.4, 0.8]}\),Sintel 为 \(2^{[-0.2, 0.6]}\),KITTI 为 \(2^{[-0.2, 0.4]}\)。空间增强以 0.8 的概率进行。
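
The spatial augmentation can be sketched as below. The scale ranges and the 0.8 probability are taken from the text; drawing a single isotropic scale factor is a simplification (the independent stretching mentioned above is omitted), and resizing the flow field requires scaling its values by the same factor.

```python
import random

import torch
import torch.nn.functional as F

SCALE_RANGES = {"chairs": (-0.2, 1.0), "things": (-0.4, 0.8),
                "sintel": (-0.2, 0.6), "kitti": (-0.2, 0.4)}


def spatial_aug(img1, img2, flow, dataset="sintel", prob=0.8):
    """img1, img2: [3, H, W]; flow: [2, H, W] with per-pixel (u, v) displacements."""
    if random.random() > prob:
        return img1, img2, flow
    lo, hi = SCALE_RANGES[dataset]
    scale = 2.0 ** random.uniform(lo, hi)

    def resize(x):
        return F.interpolate(x[None], scale_factor=scale, mode="bilinear",
                             align_corners=False, recompute_scale_factor=True)[0]

    img1, img2 = resize(img1), resize(img2)
    flow = resize(flow) * scale   # displacements grow with the image
    return img1, img2, flow
```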

Occlusion Augmentation: Following HSM-Net [48], we also randomly erase rectangular regions in \({I}_{2}\) with probability 0.5 to simulate occlusions.

遮挡增强:遵循HSM-Net [48],我们也以0.5的概率随机擦除\({I}_{2}\)中的矩形区域以模拟遮挡。
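
A sketch of this occlusion augmentation follows. The probability 0.5 and the target image \({I}_{2}\) come from the text; the number and size of the erased rectangles, and filling them with the mean image color, are illustrative assumptions.

```python
import random


def occlusion_aug(img2, prob=0.5, max_boxes=3, max_size=100):
    """img2: float tensor [3, H, W]; returns a copy with random rectangles erased."""
    if random.random() > prob:
        return img2
    img2 = img2.clone()
    mean_color = img2.reshape(3, -1).mean(dim=1)[:, None, None]  # assumed fill value
    _, H, W = img2.shape
    for _ in range(random.randint(1, max_boxes)):
        h, w = random.randint(10, max_size), random.randint(10, max_size)
        y, x = random.randint(0, max(0, H - h)), random.randint(0, max(0, W - w))
        img2[:, y:y + h, x:x + w] = mean_color
    return img2
```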

D Timing, Parameters, and Training Iterations

D 时间、参数和训练迭代次数

Fig. 10: (Left) EPE on the Sintel set as a function of the number of iterations at inference time. (Right) Magnitude of each update \(\|\Delta \mathbf{f}_k\|_2\) averaged over all pixels, indicating convergence to a fixed point \(\mathbf{f}_k \rightarrow \mathbf{f}^*\).

图10:(左)Sintel 数据集上的 EPE 随推理迭代次数变化的曲线。(右)每次更新 \(\|\Delta \mathbf{f}_k\|_2\) 在所有像素上的平均幅度,表明收敛到固定点 \(\mathbf{f}_k \rightarrow \mathbf{f}^*\)。
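
The quantity plotted on the right of Fig. 10 can be computed from the sequence of intermediate flow estimates as sketched below; it assumes the model returns one flow prediction per recurrent iteration.

```python
def update_magnitudes(flow_predictions):
    """flow_predictions: list of [N, 2, H, W] tensors, one per recurrent iteration."""
    mags = []
    for f_prev, f_next in zip(flow_predictions[:-1], flow_predictions[1:]):
        delta = f_next - f_prev                       # the update applied at this iteration
        mags.append(delta.norm(dim=1).mean().item())  # per-pixel L2 norm, averaged
    return mags
```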

| Method | Parameters (M) | Time (Reported) | Time (1080Ti) | Training Iter. (#GPUs) | Accuracy |
| --- | --- | --- | --- | --- | --- |
| LiteFlowNetX [22] | 0.9M | 0.03s | - | 2000k | 4.79 |
| LiteFlowNet [22] | 5.4M | 0.09s | 0.09s | 2000k | 4.04 |
| IRR-PWC [24] | 6.4M | - | 0.20s | 850k | 3.95 |
| PWCNet+ [41] | 9.4M | 0.03s | 0.04s | 1700k | 3.93 |
| VCN [49] | 6.2M | 0.18s | 0.26s | 220k (4) | 3.63 |
| FlowNet2 [25] | 162M | 0.12s | 0.11s | 7000k | 3.54 |
| Ours (small) | 1.0M | - | 0.05s | 160k (2) | 3.37 |
| Ours (mixed) | 5.3M | - | 0.10s | 240k (1) | 2.85 |
| Ours | 5.3M | - | 0.10s | 200k (2) | 2.83 |

Table 4: Parameter counts, inference time, training iterations, and accuracy on the Sintel (train) final pass. We report the timing and accuracy of our method after 10 updates using a GTX 1080Ti GPU. Where possible, we download the code of the other methods and re-time them on our machine. If a model is trained using more than one GPU, we report the number of GPUs used for training in parentheses. We can also train RAFT using mixed precision training (Ours (mixed)) and achieve similar results while training on only a single GPU. Overall, RAFT requires fewer training iterations and parameters than prior work.

表 4:参数数量、推理时间、训练迭代次数以及在 Sintel(训练集)最终通过上的准确度。我们报告了使用 GTX 1080Ti GPU 进行 10 次更新后我们方法的计时和准确度。在可能的情况下,我们下载其他方法的代码并在我们的机器上重新计时。如果模型使用多个 GPU 训练,我们在括号中报告训练所用的 GPU 数量。我们还可以使用混合精度训练 RAFT(Ours (mixed)),在仅使用单个 GPU 的情况下获得类似的结果。总的来说,与先前的工作相比,RAFT 需要的训练迭代次数和参数更少。
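
A minimal sketch of the mixed-precision setup referred to as Ours (mixed), using PyTorch automatic mixed precision; the loss term and the model call signature are placeholders, not the authors' training code.

```python
import torch


def train_step(model, optimizer, scaler, image1, image2, flow_gt):
    """One mixed-precision step; `scaler` is a torch.cuda.amp.GradScaler created once."""
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        flow_pred = model(image1, image2)           # assumed call signature
        loss = (flow_pred - flow_gt).abs().mean()   # placeholder L1 loss
    scaler.scale(loss).backward()                   # backward on the scaled loss
    scaler.step(optimizer)                          # unscale gradients, then optimizer step
    scaler.update()
    return loss.item()
```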
