介绍 torch.compile
原文:
pytorch.org/tutorials/intermediate/torch_compile_tutorial.html
译者:飞龙
注意
点击这里下载完整的示例代码
作者: William Wen
torch.compile
是加速 PyTorch 代码的最新方法!torch.compile
通过将 PyTorch 代码 JIT 编译成优化的内核来使 PyTorch 代码运行更快,同时需要最少的代码更改。
在本教程中,我们涵盖了基本的torch.compile
用法,并展示了torch.compile
相对于之前的 PyTorch 编译器解决方案(如TorchScript和FX Tracing)的优势。
目录
-
基本用法
-
演示加速效果
-
与 TorchScript 和 FX Tracing 的比较
-
TorchDynamo 和 FX 图
-
结论
所需的 pip 依赖项
-
torch >= 2.0
-
torchvision
-
numpy
-
scipy
-
tabulate
注意:为了重现下面显示的速度提升数字以及其他地方记录的数字,建议使用现代的 NVIDIA GPU(H100、A100 或 V100)进行本教程。
import torch
import warnings
gpu_ok = False
if torch.cuda.is_available():
device_cap = torch.cuda.get_device_capability()
if device_cap in ((7, 0), (8, 0), (9, 0)):
gpu_ok = True
if not gpu_ok:
warnings.warn(
"GPU is not NVIDIA V100, A100, or H100\. Speedup numbers may be lower "
"than expected."
)
/var/lib/jenkins/workspace/intermediate_source/torch_compile_tutorial.py:48: UserWarning:
GPU is not NVIDIA V100, A100, or H100\. Speedup numbers may be lower than expected.
基本用法
torch.compile
已包含在最新的 PyTorch 中。在 GPU 上运行 TorchInductor 需要 Triton,Triton 已包含在 PyTorch 2.0 nightly 二进制文件中。如果 Triton 仍然缺失,请尝试通过 pip 安装torchtriton
(pip install torchtriton --extra-index-url "https://download.pytorch.org/whl/nightly/cu117"
用于 CUDA 11.7)。
通过将可调用对象传递给torch.compile
,可以优化任意的 Python 函数。然后我们可以调用返回的优化函数来替代原始函数。
def foo(x, y):
a = torch.sin(x)
b = torch.cos(y)
return a + b
opt_foo1 = torch.compile(foo)
print(opt_foo1(torch.randn(10, 10), torch.randn(10, 10)))
tensor([[ 1.6850, 1.9924, 1.7090, 0.0034, 1.1414, -0.1822, 0.4861, -0.0536,
-0.2252, 1.9398],
[ 0.3693, -0.0695, 0.1748, 0.3436, 0.1939, 1.5721, 1.9882, -0.2235,
0.3161, 1.2642],
[ 0.2480, 1.8793, 1.7152, 1.6772, 1.8881, 1.4748, 1.3466, 1.7763,
0.7469, 1.0407],
[-0.1121, 1.6015, -0.0188, 0.2128, 0.5218, 1.9838, 0.8185, 0.5093,
-0.3603, 0.1793],
[-1.7890, 1.7532, -0.4040, 0.1222, -0.0029, 1.7975, -0.3877, 0.5123,
0.1673, 0.1330],
[ 1.0627, 0.9609, 0.1019, 1.8814, 0.1142, -0.2338, -0.9621, 0.7631,
0.6506, 0.1853],
[ 0.4584, 1.7648, -0.0444, 1.9610, 1.5884, 0.7353, 1.2190, 1.3662,
1.0938, -0.1587],
[-0.7502, 1.6640, 0.3495, 1.3496, 0.8187, 1.1719, 0.5820, 0.1498,
0.0885, 0.1036],
[ 0.3961, 0.6043, -0.0861, -0.3371, 0.8622, 1.4341, 1.2988, 0.5023,
0.3074, 0.1277],
[ 0.9748, 0.4117, 1.2616, 1.6314, 0.4693, 0.4092, 0.0401, 1.1196,
1.2458, 1.3280]])
或者,我们可以装饰这个函数。
@torch.compile
def opt_foo2(x, y):
a = torch.sin(x)
b = torch.cos(y)
return a + b
print(opt_foo2(torch.randn(10, 10), torch.randn(10, 10)))
tensor([[ 0.5360, 0.1697, -0.0561, 0.1890, -0.1310, 1.2276, 1.1739, 0.1944,
-0.1561, 1.6990],
[ 1.0421, 1.9472, 0.2682, 0.2701, 1.3346, 0.7651, 1.0897, 1.1730,
0.6161, 0.9223],
[ 1.5756, 1.5294, 0.0112, -0.1522, -0.7674, 1.8515, -0.2443, 0.3696,
0.2693, 0.8735],
[-0.3701, 1.1190, 1.4164, 1.8648, 1.2080, 0.0732, 1.5274, 0.6868,
1.2440, 1.0715],
[-1.2454, -0.0159, 0.4315, 0.1317, 1.0530, -1.0603, -0.0532, 0.6661,
1.7101, -0.2076],
[-0.7091, 0.7824, 1.7161, 1.2750, 0.6368, 1.2488, 0.4897, 1.2429,
1.3409, 1.3735],
[ 0.8345, 0.0653, 0.3462, 1.2383, -0.4092, 1.6438, -0.0962, 0.4011,
0.2463, -0.5802],
[ 1.6349, 0.7297, 1.2547, -0.3113, 0.9310, 0.1162, 1.7618, 0.4882,
0.7640, 0.2930],
[ 1.1669, -0.7775, 1.2000, 0.6008, -0.2814, 0.5541, 0.5753, 1.4731,
1.6835, 0.7370],
[ 1.5087, 0.6195, 0.1153, 1.2966, 1.8815, 1.1678, 1.5686, 1.6018,
0.2193, 1.3500]])
我们也可以优化torch.nn.Module
实例。
class MyModule(torch.nn.Module):
def __init__(self):
super().__init__()
self.lin = torch.nn.Linear(100, 10)
def forward(self, x):
return torch.nn.functional.relu(self.lin(x))
mod = MyModule()
opt_mod = torch.compile(mod)
print(opt_mod(torch.randn(10, 100)))
tensor([[-0.0000, -0.0000, 0.2419, 0.0446, 0.9011, 0.2674, 0.3633, 0.4984, -0.0000,
0.0988],
[0.6906, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, 0.8490, -0.0000, -0.0000,
0.5475],
[0.0852, 0.2762, 0.7441, -0.0000, -0.0000, 0.1820, -0.0000, -0.0000, -0.0000,
0.0334],
[0.3024, 0.0077, 1.2572, -0.0000, -0.0000, 0.6520, -0.0000, -0.0000, -0.0000,
0.8976],
[0.1998, 0.3333, -0.0000, 0.7803, 0.4202, 0.0915, -0.0000, 1.2543, -0.0000,
0.4615],
[0.2487, 0.4187, -0.0000, -0.0000, 0.5124, -0.0000, 0.2512, -0.0000, 0.5850,
-0.0000],
[-0.0000, 0.0048, -0.0000, -0.0000, -0.0000, 0.2287, -0.0000, 0.4841, 0.3915,
-0.0000],
[0.2017, -0.0000, 0.0896, 1.4135, 0.0593, 0.3788, -0.0000, -0.0000, -0.0000,
0.4972],
[-0.0000, -0.0000, 1.6580, 0.6414, -0.0000, -0.0000, -0.0000, -0.0000, 0.6491,
0.7755],
[-0.0000, -0.0000, 0.6442, 0.0260, 0.7456, 0.1000, -0.0000, -0.0000, 0.5366,
0.1193]], grad_fn=<CompiledFunctionBackward>)
演示加速
让我们现在演示一下,使用torch.compile
可以加速真实模型。我们将通过在随机数据上评估和训练一个torchvision
模型来比较标准的急切模式和torch.compile
。
在开始之前,我们需要定义一些实用函数。
# Returns the result of running `fn()` and the time it took for `fn()` to run,
# in seconds. We use CUDA events and synchronization for the most accurate
# measurements.
def timed(fn):
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
result = fn()
end.record()
torch.cuda.synchronize()
return result, start.elapsed_time(end) / 1000
# Generates random input and targets data for the model, where `b` is
# batch size.
def generate_data(b):
return (
torch.randn(b, 3, 128, 128).to(torch.float32).cuda(),
torch.randint(1000, (b,)).cuda(),
)
N_ITERS = 10
from torchvision.models import densenet121
def init_model():
return densenet121().to(torch.float32).cuda()
首先,让我们比较推理。
请注意,在调用torch.compile
时,我们有额外的mode
参数,我们将在下面讨论。
model = init_model()
# Reset since we are using a different mode.
import torch._dynamo
torch._dynamo.reset()
model_opt = torch.compile(model, mode="reduce-overhead")
inp = generate_data(16)[0]
with torch.no_grad():
print("eager:", timed(lambda: model(inp))[1])
print("compile:", timed(lambda: model_opt(inp))[1])
eager: 0.3166423034667969
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:140: UserWarning:
TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
compile: 76.9008984375
请注意,与急切模式相比,torch.compile
需要更长的时间才能完成。这是因为torch.compile
在执行时将模型编译为优化的内核。在我们的示例中,模型的结构没有改变,因此不需要重新编译。因此,如果我们运行我们优化过的模型多次,我们应该会看到与急切模式相比的显著改进。
eager_times = []
for i in range(N_ITERS):
inp = generate_data(16)[0]
with torch.no_grad():
_, eager_time = timed(lambda: model(inp))
eager_times.append(eager_time)
print(f"eager eval time {i}: {eager_time}")
print("~" * 10)
compile_times = []
for i in range(N_ITERS):
inp = generate_data(16)[0]
with torch.no_grad():
_, compile_time = timed(lambda: model_opt(inp))
compile_times.append(compile_time)
print(f"compile eval time {i}: {compile_time}")
print("~" * 10)
import numpy as np
eager_med = np.median(eager_times)
compile_med = np.median(compile_times)
speedup = eager_med / compile_med
assert(speedup > 1)
print(f"(eval) eager median: {eager_med}, compile median: {compile_med}, speedup: {speedup}x")
print("~" * 10)
eager eval time 0: 0.018123775482177733
eager eval time 1: 0.01638707160949707
eager eval time 2: 0.015945728302001954
eager eval time 3: 0.015856639862060547
eager eval time 4: 0.016062463760375977
eager eval time 5: 0.016149408340454103
eager eval time 6: 0.01600307273864746
eager eval time 7: 0.01600614356994629
eager eval time 8: 0.015964159965515135
eager eval time 9: 0.015935487747192383
~~~~~~~~~~
compile eval time 0: 0.708474853515625
compile eval time 1: 0.008540160179138183
compile eval time 2: 0.00828006362915039
compile eval time 3: 0.008294400215148925
compile eval time 4: 0.00828825569152832
compile eval time 5: 0.008264703750610352
compile eval time 6: 0.008274944305419921
compile eval time 7: 0.008263680458068847
compile eval time 8: 0.008263680458068847
compile eval time 9: 0.00827187156677246
~~~~~~~~~~
(eval) eager median: 0.016004608154296874, compile median: 0.008277503967285157, speedup: 1.9335065519208734x
~~~~~~~~~~
事实上,我们可以看到使用 torch.compile
运行我们的模型会显著加速。加速主要来自减少 Python 开销和 GPU 读写,因此观察到的加速可能会受到模型架构和批量大小等因素的影响。例如,如果模型的架构简单且数据量大,则瓶颈将是 GPU 计算,观察到的加速可能不那么显著。
您可能会看到不同的加速结果,这取决于所选择的 mode
参数。"reduce-overhead"
模式使用 CUDA 图来进一步减少 Python 的开销。对于您自己的模型,您可能需要尝试不同的模式以最大化加速。您可以在这里阅读更多关于模式的信息。
您可能还注意到,我们使用torch.compile
运行模型的第二次比其他运行要慢得多,尽管它比第一次运行要快得多。这是因为"reduce-overhead"
模式会为 CUDA 图运行几次预热迭代。
对于一般的 PyTorch 基准测试,您可以尝试使用torch.utils.benchmark
而不是我们上面定义的timed
函数。我们在本教程中编写了自己的计时函数,以展示torch.compile
的编译延迟。
现在,让我们考虑比较训练。
model = init_model()
opt = torch.optim.Adam(model.parameters())
def train(mod, data):
opt.zero_grad(True)
pred = mod(data[0])
loss = torch.nn.CrossEntropyLoss()(pred, data[1])
loss.backward()
opt.step()
eager_times = []
for i in range(N_ITERS):
inp = generate_data(16)
_, eager_time = timed(lambda: train(model, inp))
eager_times.append(eager_time)
print(f"eager train time {i}: {eager_time}")
print("~" * 10)
model = init_model()
opt = torch.optim.Adam(model.parameters())
train_opt = torch.compile(train, mode="reduce-overhead")
compile_times = []
for i in range(N_ITERS):
inp = generate_data(16)
_, compile_time = timed(lambda: train_opt(model, inp))
compile_times.append(compile_time)
print(f"compile train time {i}: {compile_time}")
print("~" * 10)
eager_med = np.median(eager_times)
compile_med = np.median(compile_times)
speedup = eager_med / compile_med
assert(speedup > 1)
print(f"(train) eager median: {eager_med}, compile median: {compile_med}, speedup: {speedup}x")
print("~" * 10)
eager train time 0: 0.3557437438964844
eager train time 1: 0.0508171501159668
eager train time 2: 0.04858163070678711
eager train time 3: 0.048674816131591796
eager train time 4: 0.04914883041381836
eager train time 5: 0.04877619171142578
eager train time 6: 0.048503807067871094
eager train time 7: 0.048318462371826174
eager train time 8: 0.04821299362182617
eager train time 9: 0.04865331268310547
~~~~~~~~~~
compile train time 0: 208.459546875
compile train time 1: 5.33654541015625
compile train time 2: 0.0332677116394043
compile train time 3: 0.023565311431884766
compile train time 4: 0.023459840774536132
compile train time 5: 0.02349772834777832
compile train time 6: 0.023554048538208007
compile train time 7: 0.02490163230895996
compile train time 8: 0.023513023376464843
compile train time 9: 0.02345062446594238
~~~~~~~~~~
(train) eager median: 0.048664064407348634, compile median: 0.023559679985046385, speedup: 2.065565595043579x
~~~~~~~~~~
同样,我们可以看到torch.compile
在第一次迭代中需要更长的时间,因为它必须编译模型,但在后续迭代中,与急切执行相比,我们看到了显著的加速。
我们注意到,本教程中呈现的加速比仅用于演示目的。官方加速数值可以在TorchInductor 性能仪表板上查看。
与 TorchScript 和 FX 跟踪的比较
我们已经看到 torch.compile
可以加速 PyTorch 代码。除此之外,为什么我们应该使用 torch.compile
而不是现有的 PyTorch 编译器解决方案,比如 TorchScript 或 FX 追踪呢?主要优势在于 torch.compile
能够处理任意 Python 代码,而对现有代码的更改很小。
torch.compile
可以处理其他编译器解决方案难以处理的一个案例,即数据相关的控制流(下面的 if x.sum() < 0:
行)。
def f1(x, y):
if x.sum() < 0:
return -y
return y
# Test that `fn1` and `fn2` return the same result, given
# the same arguments `args`. Typically, `fn1` will be an eager function
# while `fn2` will be a compiled function (torch.compile, TorchScript, or FX graph).
def test_fns(fn1, fn2, args):
out1 = fn1(*args)
out2 = fn2(*args)
return torch.allclose(out1, out2)
inp1 = torch.randn(5, 5)
inp2 = torch.randn(5, 5)
TorchScript 追踪 f1
会导致结果错误,因为只有实际的控制流路径被追踪。
traced_f1 = torch.jit.trace(f1, (inp1, inp2))
print("traced 1, 1:", test_fns(f1, traced_f1, (inp1, inp2)))
print("traced 1, 2:", test_fns(f1, traced_f1, (-inp1, inp2)))
/var/lib/jenkins/workspace/intermediate_source/torch_compile_tutorial.py:274: TracerWarning:
Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
traced 1, 1: True
traced 1, 2: False
FX 追踪 f1
会因为存在数据相关的控制流而导致错误。
import traceback as tb
try:
torch.fx.symbolic_trace(f1)
except:
tb.print_exc()
Traceback (most recent call last):
File "/var/lib/jenkins/workspace/intermediate_source/torch_compile_tutorial.py", line 304, in <module>
torch.fx.symbolic_trace(f1)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/_symbolic_trace.py", line 1154, in symbolic_trace
graph = tracer.trace(root, concrete_args)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
return fn(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/_symbolic_trace.py", line 821, in trace
(self.create_arg(fn(*args)),),
File "/var/lib/jenkins/workspace/intermediate_source/torch_compile_tutorial.py", line 274, in f1
if x.sum() < 0:
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/proxy.py", line 441, in __bool__
return self.tracer.to_bool(self)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/proxy.py", line 301, in to_bool
raise TraceError('symbolically traced variables cannot be used as inputs to control flow')
torch.fx.proxy.TraceError: symbolically traced variables cannot be used as inputs to control flow
如果我们为 x
提供一个值,然后尝试 FX 追踪 f1
,那么我们会遇到与 TorchScript 追踪相同的问题,因为追踪函数中的数据相关控制流被移除了。
fx_f1 = torch.fx.symbolic_trace(f1, concrete_args={"x": inp1})
print("fx 1, 1:", test_fns(f1, fx_f1, (inp1, inp2)))
print("fx 1, 2:", test_fns(f1, fx_f1, (-inp1, inp2)))
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/_symbolic_trace.py:638: UserWarning:
Was not able to add assertion to guarantee correct input x to specialized function. It is up to the user to make sure that your inputs match the inputs you specialized the function with.
fx 1, 1: True
fx 1, 2: False
现在我们可以看到 torch.compile
正确处理了数据相关的控制流。
# Reset since we are using a different mode.
torch._dynamo.reset()
compile_f1 = torch.compile(f1)
print("compile 1, 1:", test_fns(f1, compile_f1, (inp1, inp2)))
print("compile 1, 2:", test_fns(f1, compile_f1, (-inp1, inp2)))
print("~" * 10)
compile 1, 1: True
compile 1, 2: True
~~~~~~~~~~
TorchScript 脚本化可以处理数据相关的控制流,但这种解决方案也带来了一系列问题。换句话说,TorchScript 脚本化可能需要进行重大代码更改,并且在使用不受支持的 Python 时会引发错误。
在下面的示例中,我们忘记了 TorchScript 类型注释,因此收到了一个 TorchScript 错误,因为参数 y
的输入类型为 int
,与默认参数类型 torch.Tensor
不匹配。
def f2(x, y):
return x + y
inp1 = torch.randn(5, 5)
inp2 = 3
script_f2 = torch.jit.script(f2)
try:
script_f2(inp1, inp2)
except:
tb.print_exc()
Traceback (most recent call last):
File "/var/lib/jenkins/workspace/intermediate_source/torch_compile_tutorial.py", line 347, in <module>
script_f2(inp1, inp2)
RuntimeError: f2() Expected a value of type 'Tensor (inferred)' for argument 'y' but instead found type 'int'.
Inferred 'y' to be of type 'Tensor' because it was not annotated with an explicit type.
Position: 1
Value: 3
Declaration: f2(Tensor x, Tensor y) -> Tensor
Cast error details: Unable to cast 3 to Tensor
然而,torch.compile
能够轻松处理 f2
。
compile_f2 = torch.compile(f2)
print("compile 2:", test_fns(f2, compile_f2, (inp1, inp2)))
print("~" * 10)
compile 2: True
~~~~~~~~~~
与以往编译器解决方案相比,torch.compile
在处理非 PyTorch 函数的使用方面表现良好。
import scipy
def f3(x):
x = x * 2
x = scipy.fft.dct(x.numpy())
x = torch.from_numpy(x)
x = x * 2
return x
TorchScript 跟踪将来自非 PyTorch 函数调用的结果视为常量,因此我们的结果可能会悄无声息地出错。
inp1 = torch.randn(5, 5)
inp2 = torch.randn(5, 5)
traced_f3 = torch.jit.trace(f3, (inp1,))
print("traced 3:", test_fns(f3, traced_f3, (inp2,)))
/var/lib/jenkins/workspace/intermediate_source/torch_compile_tutorial.py:365: TracerWarning:
Converting a tensor to a NumPy array might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
/var/lib/jenkins/workspace/intermediate_source/torch_compile_tutorial.py:366: TracerWarning:
torch.from_numpy results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
traced 3: False
TorchScript 脚本化和 FX 跟踪不允许非 PyTorch 函数调用。
try:
torch.jit.script(f3)
except:
tb.print_exc()
try:
torch.fx.symbolic_trace(f3)
except:
tb.print_exc()
Traceback (most recent call last):
File "/var/lib/jenkins/workspace/intermediate_source/torch_compile_tutorial.py", line 383, in <module>
torch.jit.script(f3)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/jit/_script.py", line 1395, in script
fn = torch._C._jit_script_compile(
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_jit_internal.py", line 1216, in _try_get_dispatched_fn
return boolean_dispatched.get(fn)
File "/opt/conda/envs/py_3.10/lib/python3.10/weakref.py", line 453, in get
return self.data.get(ref(key),default)
TypeError: cannot create weak reference to 'uarray._Function' object
Traceback (most recent call last):
File "/var/lib/jenkins/workspace/intermediate_source/torch_compile_tutorial.py", line 388, in <module>
torch.fx.symbolic_trace(f3)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/_symbolic_trace.py", line 1154, in symbolic_trace
graph = tracer.trace(root, concrete_args)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
return fn(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/_symbolic_trace.py", line 821, in trace
(self.create_arg(fn(*args)),),
File "/var/lib/jenkins/workspace/intermediate_source/torch_compile_tutorial.py", line 365, in f3
x = scipy.fft.dct(x.numpy())
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/scipy/fft/_backend.py", line 25, in __ua_function__
return fn(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/scipy/fft/_pocketfft/realtransforms.py", line 19, in _r2r
tmp = _asfarray(x)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/scipy/fft/_pocketfft/helper.py", line 89, in _asfarray
if x.dtype == np.float16:
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/proxy.py", line 546, in impl
return tracer.create_proxy('call_function', target, args, kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/proxy.py", line 185, in create_proxy
args_ = self.create_arg(args)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/_symbolic_trace.py", line 389, in create_arg
return super().create_arg(a)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/proxy.py", line 256, in create_arg
return type(a)(self.create_arg(elem) for elem in a)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/proxy.py", line 256, in <genexpr>
return type(a)(self.create_arg(elem) for elem in a)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/_symbolic_trace.py", line 389, in create_arg
return super().create_arg(a)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/proxy.py", line 292, in create_arg
raise NotImplementedError(f"argument of type: {type(a)}")
NotImplementedError: argument of type: <class 'type'>
相比之下,torch.compile
能够轻松处理非 PyTorch 函数调用。
compile_f3 = torch.compile(f3)
print("compile 3:", test_fns(f3, compile_f3, (inp2,)))
compile 3: True
TorchDynamo 和 FX 图
torch.compile
的一个重要组件是 TorchDynamo。TorchDynamo 负责将任意 Python 代码即时编译成FX 图,然后可以进一步优化。TorchDynamo 通过分析 Python 字节码并检测对 PyTorch 操作的调用来提取 FX 图。
通常,torch.compile
的另一个组件 TorchInductor 会进一步将 FX 图编译成优化的内核,但 TorchDynamo 允许使用不同的后端。为了检查 TorchDynamo 输出的 FX 图,让我们创建一个自定义后端,输出 FX 图并简单地返回图的未优化前向方法。
from typing import List
def custom_backend(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
print("custom backend called with FX graph:")
gm.graph.print_tabular()
return gm.forward
# Reset since we are using a different backend.
torch._dynamo.reset()
opt_model = torch.compile(init_model(), backend=custom_backend)
opt_model(generate_data(16)[0])
custom backend called with FX graph:
opcode name target args kwargs
------------- ------------------------------------------------- ---------------------------------------------------------- ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -----------------
placeholder l_x_ L_x_ () {}
call_module l__self___features_conv0 L__self___features_conv0 (l_x_,) {}
call_module l__self___features_norm0 L__self___features_norm0 (l__self___features_conv0,) {}
call_module l__self___features_relu0 L__self___features_relu0 (l__self___features_norm0,) {}
call_module l__self___features_pool0 L__self___features_pool0 (l__self___features_relu0,) {}
call_function concated_features <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_pool0], 1) {}
call_module l__self___features_denseblock1_denselayer1_norm1 L__self___features_denseblock1_denselayer1_norm1 (concated_features,) {}
call_module l__self___features_denseblock1_denselayer1_relu1 L__self___features_denseblock1_denselayer1_relu1 (l__self___features_denseblock1_denselayer1_norm1,) {}
call_module bottleneck_output L__self___features_denseblock1_denselayer1_conv1 (l__self___features_denseblock1_denselayer1_relu1,) {}
call_module l__self___features_denseblock1_denselayer1_norm2 L__self___features_denseblock1_denselayer1_norm2 (bottleneck_output,) {}
call_module l__self___features_denseblock1_denselayer1_relu2 L__self___features_denseblock1_denselayer1_relu2 (l__self___features_denseblock1_denselayer1_norm2,) {}
call_module new_features L__self___features_denseblock1_denselayer1_conv2 (l__self___features_denseblock1_denselayer1_relu2,) {}
call_function concated_features_1 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_pool0, new_features], 1) {}
call_module l__self___features_denseblock1_denselayer2_norm1 L__self___features_denseblock1_denselayer2_norm1 (concated_features_1,) {}
call_module l__self___features_denseblock1_denselayer2_relu1 L__self___features_denseblock1_denselayer2_relu1 (l__self___features_denseblock1_denselayer2_norm1,) {}
call_module bottleneck_output_2 L__self___features_denseblock1_denselayer2_conv1 (l__self___features_denseblock1_denselayer2_relu1,) {}
call_module l__self___features_denseblock1_denselayer2_norm2 L__self___features_denseblock1_denselayer2_norm2 (bottleneck_output_2,) {}
call_module l__self___features_denseblock1_denselayer2_relu2 L__self___features_denseblock1_denselayer2_relu2 (l__self___features_denseblock1_denselayer2_norm2,) {}
call_module new_features_2 L__self___features_denseblock1_denselayer2_conv2 (l__self___features_denseblock1_denselayer2_relu2,) {}
call_function concated_features_2 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_pool0, new_features, new_features_2], 1) {}
call_module l__self___features_denseblock1_denselayer3_norm1 L__self___features_denseblock1_denselayer3_norm1 (concated_features_2,) {}
call_module l__self___features_denseblock1_denselayer3_relu1 L__self___features_denseblock1_denselayer3_relu1 (l__self___features_denseblock1_denselayer3_norm1,) {}
call_module bottleneck_output_4 L__self___features_denseblock1_denselayer3_conv1 (l__self___features_denseblock1_denselayer3_relu1,) {}
call_module l__self___features_denseblock1_denselayer3_norm2 L__self___features_denseblock1_denselayer3_norm2 (bottleneck_output_4,) {}
call_module l__self___features_denseblock1_denselayer3_relu2 L__self___features_denseblock1_denselayer3_relu2 (l__self___features_denseblock1_denselayer3_norm2,) {}
call_module new_features_4 L__self___features_denseblock1_denselayer3_conv2 (l__self___features_denseblock1_denselayer3_relu2,) {}
call_function concated_features_3 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_pool0, new_features, new_features_2, new_features_4], 1) {}
call_module l__self___features_denseblock1_denselayer4_norm1 L__self___features_denseblock1_denselayer4_norm1 (concated_features_3,) {}
call_module l__self___features_denseblock1_denselayer4_relu1 L__self___features_denseblock1_denselayer4_relu1 (l__self___features_denseblock1_denselayer4_norm1,) {}
call_module bottleneck_output_6 L__self___features_denseblock1_denselayer4_conv1 (l__self___features_denseblock1_denselayer4_relu1,) {}
call_module l__self___features_denseblock1_denselayer4_norm2 L__self___features_denseblock1_denselayer4_norm2 (bottleneck_output_6,) {}
call_module l__self___features_denseblock1_denselayer4_relu2 L__self___features_denseblock1_denselayer4_relu2 (l__self___features_denseblock1_denselayer4_norm2,) {}
call_module new_features_6 L__self___features_denseblock1_denselayer4_conv2 (l__self___features_denseblock1_denselayer4_relu2,) {}
call_function concated_features_4 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_pool0, new_features, new_features_2, new_features_4, new_features_6], 1) {}
call_module l__self___features_denseblock1_denselayer5_norm1 L__self___features_denseblock1_denselayer5_norm1 (concated_features_4,) {}
call_module l__self___features_denseblock1_denselayer5_relu1 L__self___features_denseblock1_denselayer5_relu1 (l__self___features_denseblock1_denselayer5_norm1,) {}
call_module bottleneck_output_8 L__self___features_denseblock1_denselayer5_conv1 (l__self___features_denseblock1_denselayer5_relu1,) {}
call_module l__self___features_denseblock1_denselayer5_norm2 L__self___features_denseblock1_denselayer5_norm2 (bottleneck_output_8,) {}
call_module l__self___features_denseblock1_denselayer5_relu2 L__self___features_denseblock1_denselayer5_relu2 (l__self___features_denseblock1_denselayer5_norm2,) {}
call_module new_features_8 L__self___features_denseblock1_denselayer5_conv2 (l__self___features_denseblock1_denselayer5_relu2,) {}
call_function concated_features_5 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_pool0, new_features, new_features_2, new_features_4, new_features_6, new_features_8], 1) {}
call_module l__self___features_denseblock1_denselayer6_norm1 L__self___features_denseblock1_denselayer6_norm1 (concated_features_5,) {}
call_module l__self___features_denseblock1_denselayer6_relu1 L__self___features_denseblock1_denselayer6_relu1 (l__self___features_denseblock1_denselayer6_norm1,) {}
call_module bottleneck_output_10 L__self___features_denseblock1_denselayer6_conv1 (l__self___features_denseblock1_denselayer6_relu1,) {}
call_module l__self___features_denseblock1_denselayer6_norm2 L__self___features_denseblock1_denselayer6_norm2 (bottleneck_output_10,) {}
call_module l__self___features_denseblock1_denselayer6_relu2 L__self___features_denseblock1_denselayer6_relu2 (l__self___features_denseblock1_denselayer6_norm2,) {}
call_module new_features_10 L__self___features_denseblock1_denselayer6_conv2 (l__self___features_denseblock1_denselayer6_relu2,) {}
call_function cat_6 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_pool0, new_features, new_features_2, new_features_4, new_features_6, new_features_8, new_features_10], 1) {}
call_module l__self___features_transition1_norm L__self___features_transition1_norm (cat_6,) {}
call_module l__self___features_transition1_relu L__self___features_transition1_relu (l__self___features_transition1_norm,) {}
call_module l__self___features_transition1_conv L__self___features_transition1_conv (l__self___features_transition1_relu,) {}
call_module l__self___features_transition1_pool L__self___features_transition1_pool (l__self___features_transition1_conv,) {}
call_function concated_features_6 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition1_pool], 1) {}
call_module l__self___features_denseblock2_denselayer1_norm1 L__self___features_denseblock2_denselayer1_norm1 (concated_features_6,) {}
call_module l__self___features_denseblock2_denselayer1_relu1 L__self___features_denseblock2_denselayer1_relu1 (l__self___features_denseblock2_denselayer1_norm1,) {}
call_module bottleneck_output_12 L__self___features_denseblock2_denselayer1_conv1 (l__self___features_denseblock2_denselayer1_relu1,) {}
call_module l__self___features_denseblock2_denselayer1_norm2 L__self___features_denseblock2_denselayer1_norm2 (bottleneck_output_12,) {}
call_module l__self___features_denseblock2_denselayer1_relu2 L__self___features_denseblock2_denselayer1_relu2 (l__self___features_denseblock2_denselayer1_norm2,) {}
call_module new_features_12 L__self___features_denseblock2_denselayer1_conv2 (l__self___features_denseblock2_denselayer1_relu2,) {}
call_function concated_features_7 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition1_pool, new_features_12], 1) {}
call_module l__self___features_denseblock2_denselayer2_norm1 L__self___features_denseblock2_denselayer2_norm1 (concated_features_7,) {}
call_module l__self___features_denseblock2_denselayer2_relu1 L__self___features_denseblock2_denselayer2_relu1 (l__self___features_denseblock2_denselayer2_norm1,) {}
call_module bottleneck_output_14 L__self___features_denseblock2_denselayer2_conv1 (l__self___features_denseblock2_denselayer2_relu1,) {}
call_module l__self___features_denseblock2_denselayer2_norm2 L__self___features_denseblock2_denselayer2_norm2 (bottleneck_output_14,) {}
call_module l__self___features_denseblock2_denselayer2_relu2 L__self___features_denseblock2_denselayer2_relu2 (l__self___features_denseblock2_denselayer2_norm2,) {}
call_module new_features_14 L__self___features_denseblock2_denselayer2_conv2 (l__self___features_denseblock2_denselayer2_relu2,) {}
call_function concated_features_8 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition1_pool, new_features_12, new_features_14], 1) {}
call_module l__self___features_denseblock2_denselayer3_norm1 L__self___features_denseblock2_denselayer3_norm1 (concated_features_8,) {}
call_module l__self___features_denseblock2_denselayer3_relu1 L__self___features_denseblock2_denselayer3_relu1 (l__self___features_denseblock2_denselayer3_norm1,) {}
call_module bottleneck_output_16 L__self___features_denseblock2_denselayer3_conv1 (l__self___features_denseblock2_denselayer3_relu1,) {}
call_module l__self___features_denseblock2_denselayer3_norm2 L__self___features_denseblock2_denselayer3_norm2 (bottleneck_output_16,) {}
call_module l__self___features_denseblock2_denselayer3_relu2 L__self___features_denseblock2_denselayer3_relu2 (l__self___features_denseblock2_denselayer3_norm2,) {}
call_module new_features_16 L__self___features_denseblock2_denselayer3_conv2 (l__self___features_denseblock2_denselayer3_relu2,) {}
call_function concated_features_9 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition1_pool, new_features_12, new_features_14, new_features_16], 1) {}
call_module l__self___features_denseblock2_denselayer4_norm1 L__self___features_denseblock2_denselayer4_norm1 (concated_features_9,) {}
call_module l__self___features_denseblock2_denselayer4_relu1 L__self___features_denseblock2_denselayer4_relu1 (l__self___features_denseblock2_denselayer4_norm1,) {}
call_module bottleneck_output_18 L__self___features_denseblock2_denselayer4_conv1 (l__self___features_denseblock2_denselayer4_relu1,) {}
call_module l__self___features_denseblock2_denselayer4_norm2 L__self___features_denseblock2_denselayer4_norm2 (bottleneck_output_18,) {}
call_module l__self___features_denseblock2_denselayer4_relu2 L__self___features_denseblock2_denselayer4_relu2 (l__self___features_denseblock2_denselayer4_norm2,) {}
call_module new_features_18 L__self___features_denseblock2_denselayer4_conv2 (l__self___features_denseblock2_denselayer4_relu2,) {}
call_function concated_features_10 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition1_pool, new_features_12, new_features_14, new_features_16, new_features_18], 1) {}
call_module l__self___features_denseblock2_denselayer5_norm1 L__self___features_denseblock2_denselayer5_norm1 (concated_features_10,) {}
call_module l__self___features_denseblock2_denselayer5_relu1 L__self___features_denseblock2_denselayer5_relu1 (l__self___features_denseblock2_denselayer5_norm1,) {}
call_module bottleneck_output_20 L__self___features_denseblock2_denselayer5_conv1 (l__self___features_denseblock2_denselayer5_relu1,) {}
call_module l__self___features_denseblock2_denselayer5_norm2 L__self___features_denseblock2_denselayer5_norm2 (bottleneck_output_20,) {}
call_module l__self___features_denseblock2_denselayer5_relu2 L__self___features_denseblock2_denselayer5_relu2 (l__self___features_denseblock2_denselayer5_norm2,) {}
call_module new_features_20 L__self___features_denseblock2_denselayer5_conv2 (l__self___features_denseblock2_denselayer5_relu2,) {}
call_function concated_features_11 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition1_pool, new_features_12, new_features_14, new_features_16, new_features_18, new_features_20], 1) {}
call_module l__self___features_denseblock2_denselayer6_norm1 L__self___features_denseblock2_denselayer6_norm1 (concated_features_11,) {}
call_module l__self___features_denseblock2_denselayer6_relu1 L__self___features_denseblock2_denselayer6_relu1 (l__self___features_denseblock2_denselayer6_norm1,) {}
call_module bottleneck_output_22 L__self___features_denseblock2_denselayer6_conv1 (l__self___features_denseblock2_denselayer6_relu1,) {}
call_module l__self___features_denseblock2_denselayer6_norm2 L__self___features_denseblock2_denselayer6_norm2 (bottleneck_output_22,) {}
call_module l__self___features_denseblock2_denselayer6_relu2 L__self___features_denseblock2_denselayer6_relu2 (l__self___features_denseblock2_denselayer6_norm2,) {}
call_module new_features_22 L__self___features_denseblock2_denselayer6_conv2 (l__self___features_denseblock2_denselayer6_relu2,) {}
call_function concated_features_12 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition1_pool, new_features_12, new_features_14, new_features_16, new_features_18, new_features_20, new_features_22], 1) {}
call_module l__self___features_denseblock2_denselayer7_norm1 L__self___features_denseblock2_denselayer7_norm1 (concated_features_12,) {}
call_module l__self___features_denseblock2_denselayer7_relu1 L__self___features_denseblock2_denselayer7_relu1 (l__self___features_denseblock2_denselayer7_norm1,) {}
call_module bottleneck_output_24 L__self___features_denseblock2_denselayer7_conv1 (l__self___features_denseblock2_denselayer7_relu1,) {}
call_module l__self___features_denseblock2_denselayer7_norm2 L__self___features_denseblock2_denselayer7_norm2 (bottleneck_output_24,) {}
call_module l__self___features_denseblock2_denselayer7_relu2 L__self___features_denseblock2_denselayer7_relu2 (l__self___features_denseblock2_denselayer7_norm2,) {}
call_module new_features_24 L__self___features_denseblock2_denselayer7_conv2 (l__self___features_denseblock2_denselayer7_relu2,) {}
call_function concated_features_13 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition1_pool, new_features_12, new_features_14, new_features_16, new_features_18, new_features_20, new_features_22, new_features_24], 1) {}
call_module l__self___features_denseblock2_denselayer8_norm1 L__self___features_denseblock2_denselayer8_norm1 (concated_features_13,) {}
call_module l__self___features_denseblock2_denselayer8_relu1 L__self___features_denseblock2_denselayer8_relu1 (l__self___features_denseblock2_denselayer8_norm1,) {}
call_module bottleneck_output_26 L__self___features_denseblock2_denselayer8_conv1 (l__self___features_denseblock2_denselayer8_relu1,) {}
call_module l__self___features_denseblock2_denselayer8_norm2 L__self___features_denseblock2_denselayer8_norm2 (bottleneck_output_26,) {}
call_module l__self___features_denseblock2_denselayer8_relu2 L__self___features_denseblock2_denselayer8_relu2 (l__self___features_denseblock2_denselayer8_norm2,) {}
call_module new_features_26 L__self___features_denseblock2_denselayer8_conv2 (l__self___features_denseblock2_denselayer8_relu2,) {}
call_function concated_features_14 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition1_pool, new_features_12, new_features_14, new_features_16, new_features_18, new_features_20, new_features_22, new_features_24, new_features_26], 1) {}
call_module l__self___features_denseblock2_denselayer9_norm1 L__self___features_denseblock2_denselayer9_norm1 (concated_features_14,) {}
call_module l__self___features_denseblock2_denselayer9_relu1 L__self___features_denseblock2_denselayer9_relu1 (l__self___features_denseblock2_denselayer9_norm1,) {}
call_module bottleneck_output_28 L__self___features_denseblock2_denselayer9_conv1 (l__self___features_denseblock2_denselayer9_relu1,) {}
call_module l__self___features_denseblock2_denselayer9_norm2 L__self___features_denseblock2_denselayer9_norm2 (bottleneck_output_28,) {}
call_module l__self___features_denseblock2_denselayer9_relu2 L__self___features_denseblock2_denselayer9_relu2 (l__self___features_denseblock2_denselayer9_norm2,) {}
call_module new_features_28 L__self___features_denseblock2_denselayer9_conv2 (l__self___features_denseblock2_denselayer9_relu2,) {}
call_function concated_features_15 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition1_pool, new_features_12, new_features_14, new_features_16, new_features_18, new_features_20, new_features_22, new_features_24, new_features_26, new_features_28], 1) {}
call_module l__self___features_denseblock2_denselayer10_norm1 L__self___features_denseblock2_denselayer10_norm1 (concated_features_15,) {}
call_module l__self___features_denseblock2_denselayer10_relu1 L__self___features_denseblock2_denselayer10_relu1 (l__self___features_denseblock2_denselayer10_norm1,) {}
call_module bottleneck_output_30 L__self___features_denseblock2_denselayer10_conv1 (l__self___features_denseblock2_denselayer10_relu1,) {}
call_module l__self___features_denseblock2_denselayer10_norm2 L__self___features_denseblock2_denselayer10_norm2 (bottleneck_output_30,) {}
call_module l__self___features_denseblock2_denselayer10_relu2 L__self___features_denseblock2_denselayer10_relu2 (l__self___features_denseblock2_denselayer10_norm2,) {}
call_module new_features_30 L__self___features_denseblock2_denselayer10_conv2 (l__self___features_denseblock2_denselayer10_relu2,) {}
call_function concated_features_16 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition1_pool, new_features_12, new_features_14, new_features_16, new_features_18, new_features_20, new_features_22, new_features_24, new_features_26, new_features_28, new_features_30], 1) {}
call_module l__self___features_denseblock2_denselayer11_norm1 L__self___features_denseblock2_denselayer11_norm1 (concated_features_16,) {}
call_module l__self___features_denseblock2_denselayer11_relu1 L__self___features_denseblock2_denselayer11_relu1 (l__self___features_denseblock2_denselayer11_norm1,) {}
call_module bottleneck_output_32 L__self___features_denseblock2_denselayer11_conv1 (l__self___features_denseblock2_denselayer11_relu1,) {}
call_module l__self___features_denseblock2_denselayer11_norm2 L__self___features_denseblock2_denselayer11_norm2 (bottleneck_output_32,) {}
call_module l__self___features_denseblock2_denselayer11_relu2 L__self___features_denseblock2_denselayer11_relu2 (l__self___features_denseblock2_denselayer11_norm2,) {}
call_module new_features_32 L__self___features_denseblock2_denselayer11_conv2 (l__self___features_denseblock2_denselayer11_relu2,) {}
call_function concated_features_17 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition1_pool, new_features_12, new_features_14, new_features_16, new_features_18, new_features_20, new_features_22, new_features_24, new_features_26, new_features_28, new_features_30, new_features_32], 1) {}
call_module l__self___features_denseblock2_denselayer12_norm1 L__self___features_denseblock2_denselayer12_norm1 (concated_features_17,) {}
call_module l__self___features_denseblock2_denselayer12_relu1 L__self___features_denseblock2_denselayer12_relu1 (l__self___features_denseblock2_denselayer12_norm1,) {}
call_module bottleneck_output_34 L__self___features_denseblock2_denselayer12_conv1 (l__self___features_denseblock2_denselayer12_relu1,) {}
call_module l__self___features_denseblock2_denselayer12_norm2 L__self___features_denseblock2_denselayer12_norm2 (bottleneck_output_34,) {}
call_module l__self___features_denseblock2_denselayer12_relu2 L__self___features_denseblock2_denselayer12_relu2 (l__self___features_denseblock2_denselayer12_norm2,) {}
call_module new_features_34 L__self___features_denseblock2_denselayer12_conv2 (l__self___features_denseblock2_denselayer12_relu2,) {}
call_function cat_19 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition1_pool, new_features_12, new_features_14, new_features_16, new_features_18, new_features_20, new_features_22, new_features_24, new_features_26, new_features_28, new_features_30, new_features_32, new_features_34], 1) {}
call_module l__self___features_transition2_norm L__self___features_transition2_norm (cat_19,) {}
call_module l__self___features_transition2_relu L__self___features_transition2_relu (l__self___features_transition2_norm,) {}
call_module l__self___features_transition2_conv L__self___features_transition2_conv (l__self___features_transition2_relu,) {}
call_module l__self___features_transition2_pool L__self___features_transition2_pool (l__self___features_transition2_conv,) {}
call_function concated_features_18 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool], 1) {}
call_module l__self___features_denseblock3_denselayer1_norm1 L__self___features_denseblock3_denselayer1_norm1 (concated_features_18,) {}
call_module l__self___features_denseblock3_denselayer1_relu1 L__self___features_denseblock3_denselayer1_relu1 (l__self___features_denseblock3_denselayer1_norm1,) {}
call_module bottleneck_output_36 L__self___features_denseblock3_denselayer1_conv1 (l__self___features_denseblock3_denselayer1_relu1,) {}
call_module l__self___features_denseblock3_denselayer1_norm2 L__self___features_denseblock3_denselayer1_norm2 (bottleneck_output_36,) {}
call_module l__self___features_denseblock3_denselayer1_relu2 L__self___features_denseblock3_denselayer1_relu2 (l__self___features_denseblock3_denselayer1_norm2,) {}
call_module new_features_36 L__self___features_denseblock3_denselayer1_conv2 (l__self___features_denseblock3_denselayer1_relu2,) {}
call_function concated_features_19 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool, new_features_36], 1) {}
call_module l__self___features_denseblock3_denselayer2_norm1 L__self___features_denseblock3_denselayer2_norm1 (concated_features_19,) {}
call_module l__self___features_denseblock3_denselayer2_relu1 L__self___features_denseblock3_denselayer2_relu1 (l__self___features_denseblock3_denselayer2_norm1,) {}
call_module bottleneck_output_38 L__self___features_denseblock3_denselayer2_conv1 (l__self___features_denseblock3_denselayer2_relu1,) {}
call_module l__self___features_denseblock3_denselayer2_norm2 L__self___features_denseblock3_denselayer2_norm2 (bottleneck_output_38,) {}
call_module l__self___features_denseblock3_denselayer2_relu2 L__self___features_denseblock3_denselayer2_relu2 (l__self___features_denseblock3_denselayer2_norm2,) {}
call_module new_features_38 L__self___features_denseblock3_denselayer2_conv2 (l__self___features_denseblock3_denselayer2_relu2,) {}
call_function concated_features_20 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool, new_features_36, new_features_38], 1) {}
call_module l__self___features_denseblock3_denselayer3_norm1 L__self___features_denseblock3_denselayer3_norm1 (concated_features_20,) {}
call_module l__self___features_denseblock3_denselayer3_relu1 L__self___features_denseblock3_denselayer3_relu1 (l__self___features_denseblock3_denselayer3_norm1,) {}
call_module bottleneck_output_40 L__self___features_denseblock3_denselayer3_conv1 (l__self___features_denseblock3_denselayer3_relu1,) {}
call_module l__self___features_denseblock3_denselayer3_norm2 L__self___features_denseblock3_denselayer3_norm2 (bottleneck_output_40,) {}
call_module l__self___features_denseblock3_denselayer3_relu2 L__self___features_denseblock3_denselayer3_relu2 (l__self___features_denseblock3_denselayer3_norm2,) {}
call_module new_features_40 L__self___features_denseblock3_denselayer3_conv2 (l__self___features_denseblock3_denselayer3_relu2,) {}
call_function concated_features_21 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool, new_features_36, new_features_38, new_features_40], 1) {}
call_module l__self___features_denseblock3_denselayer4_norm1 L__self___features_denseblock3_denselayer4_norm1 (concated_features_21,) {}
call_module l__self___features_denseblock3_denselayer4_relu1 L__self___features_denseblock3_denselayer4_relu1 (l__self___features_denseblock3_denselayer4_norm1,) {}
call_module bottleneck_output_42 L__self___features_denseblock3_denselayer4_conv1 (l__self___features_denseblock3_denselayer4_relu1,) {}
call_module l__self___features_denseblock3_denselayer4_norm2 L__self___features_denseblock3_denselayer4_norm2 (bottleneck_output_42,) {}
call_module l__self___features_denseblock3_denselayer4_relu2 L__self___features_denseblock3_denselayer4_relu2 (l__self___features_denseblock3_denselayer4_norm2,) {}
call_module new_features_42 L__self___features_denseblock3_denselayer4_conv2 (l__self___features_denseblock3_denselayer4_relu2,) {}
call_function concated_features_22 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool, new_features_36, new_features_38, new_features_40, new_features_42], 1) {}
call_module l__self___features_denseblock3_denselayer5_norm1 L__self___features_denseblock3_denselayer5_norm1 (concated_features_22,) {}
call_module l__self___features_denseblock3_denselayer5_relu1 L__self___features_denseblock3_denselayer5_relu1 (l__self___features_denseblock3_denselayer5_norm1,) {}
call_module bottleneck_output_44 L__self___features_denseblock3_denselayer5_conv1 (l__self___features_denseblock3_denselayer5_relu1,) {}
call_module l__self___features_denseblock3_denselayer5_norm2 L__self___features_denseblock3_denselayer5_norm2 (bottleneck_output_44,) {}
call_module l__self___features_denseblock3_denselayer5_relu2 L__self___features_denseblock3_denselayer5_relu2 (l__self___features_denseblock3_denselayer5_norm2,) {}
call_module new_features_44 L__self___features_denseblock3_denselayer5_conv2 (l__self___features_denseblock3_denselayer5_relu2,) {}
call_function concated_features_23 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool, new_features_36, new_features_38, new_features_40, new_features_42, new_features_44], 1) {}
call_module l__self___features_denseblock3_denselayer6_norm1 L__self___features_denseblock3_denselayer6_norm1 (concated_features_23,) {}
call_module l__self___features_denseblock3_denselayer6_relu1 L__self___features_denseblock3_denselayer6_relu1 (l__self___features_denseblock3_denselayer6_norm1,) {}
call_module bottleneck_output_46 L__self___features_denseblock3_denselayer6_conv1 (l__self___features_denseblock3_denselayer6_relu1,) {}
call_module l__self___features_denseblock3_denselayer6_norm2 L__self___features_denseblock3_denselayer6_norm2 (bottleneck_output_46,) {}
call_module l__self___features_denseblock3_denselayer6_relu2 L__self___features_denseblock3_denselayer6_relu2 (l__self___features_denseblock3_denselayer6_norm2,) {}
call_module new_features_46 L__self___features_denseblock3_denselayer6_conv2 (l__self___features_denseblock3_denselayer6_relu2,) {}
call_function concated_features_24 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool, new_features_36, new_features_38, new_features_40, new_features_42, new_features_44, new_features_46], 1) {}
call_module l__self___features_denseblock3_denselayer7_norm1 L__self___features_denseblock3_denselayer7_norm1 (concated_features_24,) {}
call_module l__self___features_denseblock3_denselayer7_relu1 L__self___features_denseblock3_denselayer7_relu1 (l__self___features_denseblock3_denselayer7_norm1,) {}
call_module bottleneck_output_48 L__self___features_denseblock3_denselayer7_conv1 (l__self___features_denseblock3_denselayer7_relu1,) {}
call_module l__self___features_denseblock3_denselayer7_norm2 L__self___features_denseblock3_denselayer7_norm2 (bottleneck_output_48,) {}
call_module l__self___features_denseblock3_denselayer7_relu2 L__self___features_denseblock3_denselayer7_relu2 (l__self___features_denseblock3_denselayer7_norm2,) {}
call_module new_features_48 L__self___features_denseblock3_denselayer7_conv2 (l__self___features_denseblock3_denselayer7_relu2,) {}
call_function concated_features_25 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool, new_features_36, new_features_38, new_features_40, new_features_42, new_features_44, new_features_46, new_features_48], 1) {}
call_module l__self___features_denseblock3_denselayer8_norm1 L__self___features_denseblock3_denselayer8_norm1 (concated_features_25,) {}
call_module l__self___features_denseblock3_denselayer8_relu1 L__self___features_denseblock3_denselayer8_relu1 (l__self___features_denseblock3_denselayer8_norm1,) {}
call_module bottleneck_output_50 L__self___features_denseblock3_denselayer8_conv1 (l__self___features_denseblock3_denselayer8_relu1,) {}
call_module l__self___features_denseblock3_denselayer8_norm2 L__self___features_denseblock3_denselayer8_norm2 (bottleneck_output_50,) {}
call_module l__self___features_denseblock3_denselayer8_relu2 L__self___features_denseblock3_denselayer8_relu2 (l__self___features_denseblock3_denselayer8_norm2,) {}
call_module new_features_50 L__self___features_denseblock3_denselayer8_conv2 (l__self___features_denseblock3_denselayer8_relu2,) {}
call_function concated_features_26 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool, new_features_36, new_features_38, new_features_40, new_features_42, new_features_44, new_features_46, new_features_48, new_features_50], 1) {}
call_module l__self___features_denseblock3_denselayer9_norm1 L__self___features_denseblock3_denselayer9_norm1 (concated_features_26,) {}
call_module l__self___features_denseblock3_denselayer9_relu1 L__self___features_denseblock3_denselayer9_relu1 (l__self___features_denseblock3_denselayer9_norm1,) {}
call_module bottleneck_output_52 L__self___features_denseblock3_denselayer9_conv1 (l__self___features_denseblock3_denselayer9_relu1,) {}
call_module l__self___features_denseblock3_denselayer9_norm2 L__self___features_denseblock3_denselayer9_norm2 (bottleneck_output_52,) {}
call_module l__self___features_denseblock3_denselayer9_relu2 L__self___features_denseblock3_denselayer9_relu2 (l__self___features_denseblock3_denselayer9_norm2,) {}
call_module new_features_52 L__self___features_denseblock3_denselayer9_conv2 (l__self___features_denseblock3_denselayer9_relu2,) {}
call_function concated_features_27 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool, new_features_36, new_features_38, new_features_40, new_features_42, new_features_44, new_features_46, new_features_48, new_features_50, new_features_52], 1) {}
call_module l__self___features_denseblock3_denselayer10_norm1 L__self___features_denseblock3_denselayer10_norm1 (concated_features_27,) {}
call_module l__self___features_denseblock3_denselayer10_relu1 L__self___features_denseblock3_denselayer10_relu1 (l__self___features_denseblock3_denselayer10_norm1,) {}
call_module bottleneck_output_54 L__self___features_denseblock3_denselayer10_conv1 (l__self___features_denseblock3_denselayer10_relu1,) {}
call_module l__self___features_denseblock3_denselayer10_norm2 L__self___features_denseblock3_denselayer10_norm2 (bottleneck_output_54,) {}
call_module l__self___features_denseblock3_denselayer10_relu2 L__self___features_denseblock3_denselayer10_relu2 (l__self___features_denseblock3_denselayer10_norm2,) {}
call_module new_features_54 L__self___features_denseblock3_denselayer10_conv2 (l__self___features_denseblock3_denselayer10_relu2,) {}
call_function concated_features_28 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool, new_features_36, new_features_38, new_features_40, new_features_42, new_features_44, new_features_46, new_features_48, new_features_50, new_features_52, new_features_54], 1) {}
call_module l__self___features_denseblock3_denselayer11_norm1 L__self___features_denseblock3_denselayer11_norm1 (concated_features_28,) {}
call_module l__self___features_denseblock3_denselayer11_relu1 L__self___features_denseblock3_denselayer11_relu1 (l__self___features_denseblock3_denselayer11_norm1,) {}
call_module bottleneck_output_56 L__self___features_denseblock3_denselayer11_conv1 (l__self___features_denseblock3_denselayer11_relu1,) {}
call_module l__self___features_denseblock3_denselayer11_norm2 L__self___features_denseblock3_denselayer11_norm2 (bottleneck_output_56,) {}
call_module l__self___features_denseblock3_denselayer11_relu2 L__self___features_denseblock3_denselayer11_relu2 (l__self___features_denseblock3_denselayer11_norm2,) {}
call_module new_features_56 L__self___features_denseblock3_denselayer11_conv2 (l__self___features_denseblock3_denselayer11_relu2,) {}
call_function concated_features_29 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool, new_features_36, new_features_38, new_features_40, new_features_42, new_features_44, new_features_46, new_features_48, new_features_50, new_features_52, new_features_54, new_features_56], 1) {}
call_module l__self___features_denseblock3_denselayer12_norm1 L__self___features_denseblock3_denselayer12_norm1 (concated_features_29,) {}
call_module l__self___features_denseblock3_denselayer12_relu1 L__self___features_denseblock3_denselayer12_relu1 (l__self___features_denseblock3_denselayer12_norm1,) {}
call_module bottleneck_output_58 L__self___features_denseblock3_denselayer12_conv1 (l__self___features_denseblock3_denselayer12_relu1,) {}
call_module l__self___features_denseblock3_denselayer12_norm2 L__self___features_denseblock3_denselayer12_norm2 (bottleneck_output_58,) {}
call_module l__self___features_denseblock3_denselayer12_relu2 L__self___features_denseblock3_denselayer12_relu2 (l__self___features_denseblock3_denselayer12_norm2,) {}
call_module new_features_58 L__self___features_denseblock3_denselayer12_conv2 (l__self___features_denseblock3_denselayer12_relu2,) {}
call_function concated_features_30 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool, new_features_36, new_features_38, new_features_40, new_features_42, new_features_44, new_features_46, new_features_48, new_features_50, new_features_52, new_features_54, new_features_56, new_features_58], 1) {}
call_module l__self___features_denseblock3_denselayer13_norm1 L__self___features_denseblock3_denselayer13_norm1 (concated_features_30,) {}
call_module l__self___features_denseblock3_denselayer13_relu1 L__self___features_denseblock3_denselayer13_relu1 (l__self___features_denseblock3_denselayer13_norm1,) {}
call_module bottleneck_output_60 L__self___features_denseblock3_denselayer13_conv1 (l__self___features_denseblock3_denselayer13_relu1,) {}
call_module l__self___features_denseblock3_denselayer13_norm2 L__self___features_denseblock3_denselayer13_norm2 (bottleneck_output_60,) {}
call_module l__self___features_denseblock3_denselayer13_relu2 L__self___features_denseblock3_denselayer13_relu2 (l__self___features_denseblock3_denselayer13_norm2,) {}
call_module new_features_60 L__self___features_denseblock3_denselayer13_conv2 (l__self___features_denseblock3_denselayer13_relu2,) {}
call_function concated_features_31 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool, new_features_36, new_features_38, new_features_40, new_features_42, new_features_44, new_features_46, new_features_48, new_features_50, new_features_52, new_features_54, new_features_56, new_features_58, new_features_60], 1) {}
call_module l__self___features_denseblock3_denselayer14_norm1 L__self___features_denseblock3_denselayer14_norm1 (concated_features_31,) {}
call_module l__self___features_denseblock3_denselayer14_relu1 L__self___features_denseblock3_denselayer14_relu1 (l__self___features_denseblock3_denselayer14_norm1,) {}
call_module bottleneck_output_62 L__self___features_denseblock3_denselayer14_conv1 (l__self___features_denseblock3_denselayer14_relu1,) {}
call_module l__self___features_denseblock3_denselayer14_norm2 L__self___features_denseblock3_denselayer14_norm2 (bottleneck_output_62,) {}
call_module l__self___features_denseblock3_denselayer14_relu2 L__self___features_denseblock3_denselayer14_relu2 (l__self___features_denseblock3_denselayer14_norm2,) {}
call_module new_features_62 L__self___features_denseblock3_denselayer14_conv2 (l__self___features_denseblock3_denselayer14_relu2,) {}
call_function concated_features_32 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool, new_features_36, new_features_38, new_features_40, new_features_42, new_features_44, new_features_46, new_features_48, new_features_50, new_features_52, new_features_54, new_features_56, new_features_58, new_features_60, new_features_62], 1) {}
call_module l__self___features_denseblock3_denselayer15_norm1 L__self___features_denseblock3_denselayer15_norm1 (concated_features_32,) {}
call_module l__self___features_denseblock3_denselayer15_relu1 L__self___features_denseblock3_denselayer15_relu1 (l__self___features_denseblock3_denselayer15_norm1,) {}
call_module bottleneck_output_64 L__self___features_denseblock3_denselayer15_conv1 (l__self___features_denseblock3_denselayer15_relu1,) {}
call_module l__self___features_denseblock3_denselayer15_norm2 L__self___features_denseblock3_denselayer15_norm2 (bottleneck_output_64,) {}
call_module l__self___features_denseblock3_denselayer15_relu2 L__self___features_denseblock3_denselayer15_relu2 (l__self___features_denseblock3_denselayer15_norm2,) {}
call_module new_features_64 L__self___features_denseblock3_denselayer15_conv2 (l__self___features_denseblock3_denselayer15_relu2,) {}
call_function concated_features_33 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool, new_features_36, new_features_38, new_features_40, new_features_42, new_features_44, new_features_46, new_features_48, new_features_50, new_features_52, new_features_54, new_features_56, new_features_58, new_features_60, new_features_62, new_features_64], 1) {}
call_module l__self___features_denseblock3_denselayer16_norm1 L__self___features_denseblock3_denselayer16_norm1 (concated_features_33,) {}
call_module l__self___features_denseblock3_denselayer16_relu1 L__self___features_denseblock3_denselayer16_relu1 (l__self___features_denseblock3_denselayer16_norm1,) {}
call_module bottleneck_output_66 L__self___features_denseblock3_denselayer16_conv1 (l__self___features_denseblock3_denselayer16_relu1,) {}
call_module l__self___features_denseblock3_denselayer16_norm2 L__self___features_denseblock3_denselayer16_norm2 (bottleneck_output_66,) {}
call_module l__self___features_denseblock3_denselayer16_relu2 L__self___features_denseblock3_denselayer16_relu2 (l__self___features_denseblock3_denselayer16_norm2,) {}
call_module new_features_66 L__self___features_denseblock3_denselayer16_conv2 (l__self___features_denseblock3_denselayer16_relu2,) {}
call_function concated_features_34 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool, new_features_36, new_features_38, new_features_40, new_features_42, new_features_44, new_features_46, new_features_48, new_features_50, new_features_52, new_features_54, new_features_56, new_features_58, new_features_60, new_features_62, new_features_64, new_features_66], 1) {}
call_module l__self___features_denseblock3_denselayer17_norm1 L__self___features_denseblock3_denselayer17_norm1 (concated_features_34,) {}
call_module l__self___features_denseblock3_denselayer17_relu1 L__self___features_denseblock3_denselayer17_relu1 (l__self___features_denseblock3_denselayer17_norm1,) {}
call_module bottleneck_output_68 L__self___features_denseblock3_denselayer17_conv1 (l__self___features_denseblock3_denselayer17_relu1,) {}
call_module l__self___features_denseblock3_denselayer17_norm2 L__self___features_denseblock3_denselayer17_norm2 (bottleneck_output_68,) {}
call_module l__self___features_denseblock3_denselayer17_relu2 L__self___features_denseblock3_denselayer17_relu2 (l__self___features_denseblock3_denselayer17_norm2,) {}
call_module new_features_68 L__self___features_denseblock3_denselayer17_conv2 (l__self___features_denseblock3_denselayer17_relu2,) {}
call_function concated_features_35 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool, new_features_36, new_features_38, new_features_40, new_features_42, new_features_44, new_features_46, new_features_48, new_features_50, new_features_52, new_features_54, new_features_56, new_features_58, new_features_60, new_features_62, new_features_64, new_features_66, new_features_68], 1) {}
call_module l__self___features_denseblock3_denselayer18_norm1 L__self___features_denseblock3_denselayer18_norm1 (concated_features_35,) {}
call_module l__self___features_denseblock3_denselayer18_relu1 L__self___features_denseblock3_denselayer18_relu1 (l__self___features_denseblock3_denselayer18_norm1,) {}
call_module bottleneck_output_70 L__self___features_denseblock3_denselayer18_conv1 (l__self___features_denseblock3_denselayer18_relu1,) {}
call_module l__self___features_denseblock3_denselayer18_norm2 L__self___features_denseblock3_denselayer18_norm2 (bottleneck_output_70,) {}
call_module l__self___features_denseblock3_denselayer18_relu2 L__self___features_denseblock3_denselayer18_relu2 (l__self___features_denseblock3_denselayer18_norm2,) {}
call_module new_features_70 L__self___features_denseblock3_denselayer18_conv2 (l__self___features_denseblock3_denselayer18_relu2,) {}
call_function concated_features_36 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool, new_features_36, new_features_38, new_features_40, new_features_42, new_features_44, new_features_46, new_features_48, new_features_50, new_features_52, new_features_54, new_features_56, new_features_58, new_features_60, new_features_62, new_features_64, new_features_66, new_features_68, new_features_70], 1) {}
call_module l__self___features_denseblock3_denselayer19_norm1 L__self___features_denseblock3_denselayer19_norm1 (concated_features_36,) {}
call_module l__self___features_denseblock3_denselayer19_relu1 L__self___features_denseblock3_denselayer19_relu1 (l__self___features_denseblock3_denselayer19_norm1,) {}
call_module bottleneck_output_72 L__self___features_denseblock3_denselayer19_conv1 (l__self___features_denseblock3_denselayer19_relu1,) {}
call_module l__self___features_denseblock3_denselayer19_norm2 L__self___features_denseblock3_denselayer19_norm2 (bottleneck_output_72,) {}
call_module l__self___features_denseblock3_denselayer19_relu2 L__self___features_denseblock3_denselayer19_relu2 (l__self___features_denseblock3_denselayer19_norm2,) {}
call_module new_features_72 L__self___features_denseblock3_denselayer19_conv2 (l__self___features_denseblock3_denselayer19_relu2,) {}
call_function concated_features_37 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool, new_features_36, new_features_38, new_features_40, new_features_42, new_features_44, new_features_46, new_features_48, new_features_50, new_features_52, new_features_54, new_features_56, new_features_58, new_features_60, new_features_62, new_features_64, new_features_66, new_features_68, new_features_70, new_features_72], 1) {}
call_module l__self___features_denseblock3_denselayer20_norm1 L__self___features_denseblock3_denselayer20_norm1 (concated_features_37,) {}
call_module l__self___features_denseblock3_denselayer20_relu1 L__self___features_denseblock3_denselayer20_relu1 (l__self___features_denseblock3_denselayer20_norm1,) {}
call_module bottleneck_output_74 L__self___features_denseblock3_denselayer20_conv1 (l__self___features_denseblock3_denselayer20_relu1,) {}
call_module l__self___features_denseblock3_denselayer20_norm2 L__self___features_denseblock3_denselayer20_norm2 (bottleneck_output_74,) {}
call_module l__self___features_denseblock3_denselayer20_relu2 L__self___features_denseblock3_denselayer20_relu2 (l__self___features_denseblock3_denselayer20_norm2,) {}
call_module new_features_74 L__self___features_denseblock3_denselayer20_conv2 (l__self___features_denseblock3_denselayer20_relu2,) {}
call_function concated_features_38 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool, new_features_36, new_features_38, new_features_40, new_features_42, new_features_44, new_features_46, new_features_48, new_features_50, new_features_52, new_features_54, new_features_56, new_features_58, new_features_60, new_features_62, new_features_64, new_features_66, new_features_68, new_features_70, new_features_72, new_features_74], 1) {}
call_module l__self___features_denseblock3_denselayer21_norm1 L__self___features_denseblock3_denselayer21_norm1 (concated_features_38,) {}
call_module l__self___features_denseblock3_denselayer21_relu1 L__self___features_denseblock3_denselayer21_relu1 (l__self___features_denseblock3_denselayer21_norm1,) {}
call_module bottleneck_output_76 L__self___features_denseblock3_denselayer21_conv1 (l__self___features_denseblock3_denselayer21_relu1,) {}
call_module l__self___features_denseblock3_denselayer21_norm2 L__self___features_denseblock3_denselayer21_norm2 (bottleneck_output_76,) {}
call_module l__self___features_denseblock3_denselayer21_relu2 L__self___features_denseblock3_denselayer21_relu2 (l__self___features_denseblock3_denselayer21_norm2,) {}
call_module new_features_76 L__self___features_denseblock3_denselayer21_conv2 (l__self___features_denseblock3_denselayer21_relu2,) {}
call_function concated_features_39 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool, new_features_36, new_features_38, new_features_40, new_features_42, new_features_44, new_features_46, new_features_48, new_features_50, new_features_52, new_features_54, new_features_56, new_features_58, new_features_60, new_features_62, new_features_64, new_features_66, new_features_68, new_features_70, new_features_72, new_features_74, new_features_76], 1) {}
call_module l__self___features_denseblock3_denselayer22_norm1 L__self___features_denseblock3_denselayer22_norm1 (concated_features_39,) {}
call_module l__self___features_denseblock3_denselayer22_relu1 L__self___features_denseblock3_denselayer22_relu1 (l__self___features_denseblock3_denselayer22_norm1,) {}
call_module bottleneck_output_78 L__self___features_denseblock3_denselayer22_conv1 (l__self___features_denseblock3_denselayer22_relu1,) {}
call_module l__self___features_denseblock3_denselayer22_norm2 L__self___features_denseblock3_denselayer22_norm2 (bottleneck_output_78,) {}
call_module l__self___features_denseblock3_denselayer22_relu2 L__self___features_denseblock3_denselayer22_relu2 (l__self___features_denseblock3_denselayer22_norm2,) {}
call_module new_features_78 L__self___features_denseblock3_denselayer22_conv2 (l__self___features_denseblock3_denselayer22_relu2,) {}
call_function concated_features_40 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool, new_features_36, new_features_38, new_features_40, new_features_42, new_features_44, new_features_46, new_features_48, new_features_50, new_features_52, new_features_54, new_features_56, new_features_58, new_features_60, new_features_62, new_features_64, new_features_66, new_features_68, new_features_70, new_features_72, new_features_74, new_features_76, new_features_78], 1) {}
call_module l__self___features_denseblock3_denselayer23_norm1 L__self___features_denseblock3_denselayer23_norm1 (concated_features_40,) {}
call_module l__self___features_denseblock3_denselayer23_relu1 L__self___features_denseblock3_denselayer23_relu1 (l__self___features_denseblock3_denselayer23_norm1,) {}
call_module bottleneck_output_80 L__self___features_denseblock3_denselayer23_conv1 (l__self___features_denseblock3_denselayer23_relu1,) {}
call_module l__self___features_denseblock3_denselayer23_norm2 L__self___features_denseblock3_denselayer23_norm2 (bottleneck_output_80,) {}
call_module l__self___features_denseblock3_denselayer23_relu2 L__self___features_denseblock3_denselayer23_relu2 (l__self___features_denseblock3_denselayer23_norm2,) {}
call_module new_features_80 L__self___features_denseblock3_denselayer23_conv2 (l__self___features_denseblock3_denselayer23_relu2,) {}
call_function concated_features_41 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool, new_features_36, new_features_38, new_features_40, new_features_42, new_features_44, new_features_46, new_features_48, new_features_50, new_features_52, new_features_54, new_features_56, new_features_58, new_features_60, new_features_62, new_features_64, new_features_66, new_features_68, new_features_70, new_features_72, new_features_74, new_features_76, new_features_78, new_features_80], 1) {}
call_module l__self___features_denseblock3_denselayer24_norm1 L__self___features_denseblock3_denselayer24_norm1 (concated_features_41,) {}
call_module l__self___features_denseblock3_denselayer24_relu1 L__self___features_denseblock3_denselayer24_relu1 (l__self___features_denseblock3_denselayer24_norm1,) {}
call_module bottleneck_output_82 L__self___features_denseblock3_denselayer24_conv1 (l__self___features_denseblock3_denselayer24_relu1,) {}
call_module l__self___features_denseblock3_denselayer24_norm2 L__self___features_denseblock3_denselayer24_norm2 (bottleneck_output_82,) {}
call_module l__self___features_denseblock3_denselayer24_relu2 L__self___features_denseblock3_denselayer24_relu2 (l__self___features_denseblock3_denselayer24_norm2,) {}
call_module new_features_82 L__self___features_denseblock3_denselayer24_conv2 (l__self___features_denseblock3_denselayer24_relu2,) {}
call_function cat_44 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition2_pool, new_features_36, new_features_38, new_features_40, new_features_42, new_features_44, new_features_46, new_features_48, new_features_50, new_features_52, new_features_54, new_features_56, new_features_58, new_features_60, new_features_62, new_features_64, new_features_66, new_features_68, new_features_70, new_features_72, new_features_74, new_features_76, new_features_78, new_features_80, new_features_82], 1) {}
call_module l__self___features_transition3_norm L__self___features_transition3_norm (cat_44,) {}
call_module l__self___features_transition3_relu L__self___features_transition3_relu (l__self___features_transition3_norm,) {}
call_module l__self___features_transition3_conv L__self___features_transition3_conv (l__self___features_transition3_relu,) {}
call_module l__self___features_transition3_pool L__self___features_transition3_pool (l__self___features_transition3_conv,) {}
call_function concated_features_42 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition3_pool], 1) {}
call_module l__self___features_denseblock4_denselayer1_norm1 L__self___features_denseblock4_denselayer1_norm1 (concated_features_42,) {}
call_module l__self___features_denseblock4_denselayer1_relu1 L__self___features_denseblock4_denselayer1_relu1 (l__self___features_denseblock4_denselayer1_norm1,) {}
call_module bottleneck_output_84 L__self___features_denseblock4_denselayer1_conv1 (l__self___features_denseblock4_denselayer1_relu1,) {}
call_module l__self___features_denseblock4_denselayer1_norm2 L__self___features_denseblock4_denselayer1_norm2 (bottleneck_output_84,) {}
call_module l__self___features_denseblock4_denselayer1_relu2 L__self___features_denseblock4_denselayer1_relu2 (l__self___features_denseblock4_denselayer1_norm2,) {}
call_module new_features_84 L__self___features_denseblock4_denselayer1_conv2 (l__self___features_denseblock4_denselayer1_relu2,) {}
call_function concated_features_43 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition3_pool, new_features_84], 1) {}
call_module l__self___features_denseblock4_denselayer2_norm1 L__self___features_denseblock4_denselayer2_norm1 (concated_features_43,) {}
call_module l__self___features_denseblock4_denselayer2_relu1 L__self___features_denseblock4_denselayer2_relu1 (l__self___features_denseblock4_denselayer2_norm1,) {}
call_module bottleneck_output_86 L__self___features_denseblock4_denselayer2_conv1 (l__self___features_denseblock4_denselayer2_relu1,) {}
call_module l__self___features_denseblock4_denselayer2_norm2 L__self___features_denseblock4_denselayer2_norm2 (bottleneck_output_86,) {}
call_module l__self___features_denseblock4_denselayer2_relu2 L__self___features_denseblock4_denselayer2_relu2 (l__self___features_denseblock4_denselayer2_norm2,) {}
call_module new_features_86 L__self___features_denseblock4_denselayer2_conv2 (l__self___features_denseblock4_denselayer2_relu2,) {}
call_function concated_features_44 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition3_pool, new_features_84, new_features_86], 1) {}
call_module l__self___features_denseblock4_denselayer3_norm1 L__self___features_denseblock4_denselayer3_norm1 (concated_features_44,) {}
call_module l__self___features_denseblock4_denselayer3_relu1 L__self___features_denseblock4_denselayer3_relu1 (l__self___features_denseblock4_denselayer3_norm1,) {}
call_module bottleneck_output_88 L__self___features_denseblock4_denselayer3_conv1 (l__self___features_denseblock4_denselayer3_relu1,) {}
call_module l__self___features_denseblock4_denselayer3_norm2 L__self___features_denseblock4_denselayer3_norm2 (bottleneck_output_88,) {}
call_module l__self___features_denseblock4_denselayer3_relu2 L__self___features_denseblock4_denselayer3_relu2 (l__self___features_denseblock4_denselayer3_norm2,) {}
call_module new_features_88 L__self___features_denseblock4_denselayer3_conv2 (l__self___features_denseblock4_denselayer3_relu2,) {}
call_function concated_features_45 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition3_pool, new_features_84, new_features_86, new_features_88], 1) {}
call_module l__self___features_denseblock4_denselayer4_norm1 L__self___features_denseblock4_denselayer4_norm1 (concated_features_45,) {}
call_module l__self___features_denseblock4_denselayer4_relu1 L__self___features_denseblock4_denselayer4_relu1 (l__self___features_denseblock4_denselayer4_norm1,) {}
call_module bottleneck_output_90 L__self___features_denseblock4_denselayer4_conv1 (l__self___features_denseblock4_denselayer4_relu1,) {}
call_module l__self___features_denseblock4_denselayer4_norm2 L__self___features_denseblock4_denselayer4_norm2 (bottleneck_output_90,) {}
call_module l__self___features_denseblock4_denselayer4_relu2 L__self___features_denseblock4_denselayer4_relu2 (l__self___features_denseblock4_denselayer4_norm2,) {}
call_module new_features_90 L__self___features_denseblock4_denselayer4_conv2 (l__self___features_denseblock4_denselayer4_relu2,) {}
call_function concated_features_46 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition3_pool, new_features_84, new_features_86, new_features_88, new_features_90], 1) {}
call_module l__self___features_denseblock4_denselayer5_norm1 L__self___features_denseblock4_denselayer5_norm1 (concated_features_46,) {}
call_module l__self___features_denseblock4_denselayer5_relu1 L__self___features_denseblock4_denselayer5_relu1 (l__self___features_denseblock4_denselayer5_norm1,) {}
call_module bottleneck_output_92 L__self___features_denseblock4_denselayer5_conv1 (l__self___features_denseblock4_denselayer5_relu1,) {}
call_module l__self___features_denseblock4_denselayer5_norm2 L__self___features_denseblock4_denselayer5_norm2 (bottleneck_output_92,) {}
call_module l__self___features_denseblock4_denselayer5_relu2 L__self___features_denseblock4_denselayer5_relu2 (l__self___features_denseblock4_denselayer5_norm2,) {}
call_module new_features_92 L__self___features_denseblock4_denselayer5_conv2 (l__self___features_denseblock4_denselayer5_relu2,) {}
call_function concated_features_47 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition3_pool, new_features_84, new_features_86, new_features_88, new_features_90, new_features_92], 1) {}
call_module l__self___features_denseblock4_denselayer6_norm1 L__self___features_denseblock4_denselayer6_norm1 (concated_features_47,) {}
call_module l__self___features_denseblock4_denselayer6_relu1 L__self___features_denseblock4_denselayer6_relu1 (l__self___features_denseblock4_denselayer6_norm1,) {}
call_module bottleneck_output_94 L__self___features_denseblock4_denselayer6_conv1 (l__self___features_denseblock4_denselayer6_relu1,) {}
call_module l__self___features_denseblock4_denselayer6_norm2 L__self___features_denseblock4_denselayer6_norm2 (bottleneck_output_94,) {}
call_module l__self___features_denseblock4_denselayer6_relu2 L__self___features_denseblock4_denselayer6_relu2 (l__self___features_denseblock4_denselayer6_norm2,) {}
call_module new_features_94 L__self___features_denseblock4_denselayer6_conv2 (l__self___features_denseblock4_denselayer6_relu2,) {}
call_function concated_features_48 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition3_pool, new_features_84, new_features_86, new_features_88, new_features_90, new_features_92, new_features_94], 1) {}
call_module l__self___features_denseblock4_denselayer7_norm1 L__self___features_denseblock4_denselayer7_norm1 (concated_features_48,) {}
call_module l__self___features_denseblock4_denselayer7_relu1 L__self___features_denseblock4_denselayer7_relu1 (l__self___features_denseblock4_denselayer7_norm1,) {}
call_module bottleneck_output_96 L__self___features_denseblock4_denselayer7_conv1 (l__self___features_denseblock4_denselayer7_relu1,) {}
call_module l__self___features_denseblock4_denselayer7_norm2 L__self___features_denseblock4_denselayer7_norm2 (bottleneck_output_96,) {}
call_module l__self___features_denseblock4_denselayer7_relu2 L__self___features_denseblock4_denselayer7_relu2 (l__self___features_denseblock4_denselayer7_norm2,) {}
call_module new_features_96 L__self___features_denseblock4_denselayer7_conv2 (l__self___features_denseblock4_denselayer7_relu2,) {}
call_function concated_features_49 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition3_pool, new_features_84, new_features_86, new_features_88, new_features_90, new_features_92, new_features_94, new_features_96], 1) {}
call_module l__self___features_denseblock4_denselayer8_norm1 L__self___features_denseblock4_denselayer8_norm1 (concated_features_49,) {}
call_module l__self___features_denseblock4_denselayer8_relu1 L__self___features_denseblock4_denselayer8_relu1 (l__self___features_denseblock4_denselayer8_norm1,) {}
call_module bottleneck_output_98 L__self___features_denseblock4_denselayer8_conv1 (l__self___features_denseblock4_denselayer8_relu1,) {}
call_module l__self___features_denseblock4_denselayer8_norm2 L__self___features_denseblock4_denselayer8_norm2 (bottleneck_output_98,) {}
call_module l__self___features_denseblock4_denselayer8_relu2 L__self___features_denseblock4_denselayer8_relu2 (l__self___features_denseblock4_denselayer8_norm2,) {}
call_module new_features_98 L__self___features_denseblock4_denselayer8_conv2 (l__self___features_denseblock4_denselayer8_relu2,) {}
call_function concated_features_50 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition3_pool, new_features_84, new_features_86, new_features_88, new_features_90, new_features_92, new_features_94, new_features_96, new_features_98], 1) {}
call_module l__self___features_denseblock4_denselayer9_norm1 L__self___features_denseblock4_denselayer9_norm1 (concated_features_50,) {}
call_module l__self___features_denseblock4_denselayer9_relu1 L__self___features_denseblock4_denselayer9_relu1 (l__self___features_denseblock4_denselayer9_norm1,) {}
call_module bottleneck_output_100 L__self___features_denseblock4_denselayer9_conv1 (l__self___features_denseblock4_denselayer9_relu1,) {}
call_module l__self___features_denseblock4_denselayer9_norm2 L__self___features_denseblock4_denselayer9_norm2 (bottleneck_output_100,) {}
call_module l__self___features_denseblock4_denselayer9_relu2 L__self___features_denseblock4_denselayer9_relu2 (l__self___features_denseblock4_denselayer9_norm2,) {}
call_module new_features_100 L__self___features_denseblock4_denselayer9_conv2 (l__self___features_denseblock4_denselayer9_relu2,) {}
call_function concated_features_51 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition3_pool, new_features_84, new_features_86, new_features_88, new_features_90, new_features_92, new_features_94, new_features_96, new_features_98, new_features_100], 1) {}
call_module l__self___features_denseblock4_denselayer10_norm1 L__self___features_denseblock4_denselayer10_norm1 (concated_features_51,) {}
call_module l__self___features_denseblock4_denselayer10_relu1 L__self___features_denseblock4_denselayer10_relu1 (l__self___features_denseblock4_denselayer10_norm1,) {}
call_module bottleneck_output_102 L__self___features_denseblock4_denselayer10_conv1 (l__self___features_denseblock4_denselayer10_relu1,) {}
call_module l__self___features_denseblock4_denselayer10_norm2 L__self___features_denseblock4_denselayer10_norm2 (bottleneck_output_102,) {}
call_module l__self___features_denseblock4_denselayer10_relu2 L__self___features_denseblock4_denselayer10_relu2 (l__self___features_denseblock4_denselayer10_norm2,) {}
call_module new_features_102 L__self___features_denseblock4_denselayer10_conv2 (l__self___features_denseblock4_denselayer10_relu2,) {}
call_function concated_features_52 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition3_pool, new_features_84, new_features_86, new_features_88, new_features_90, new_features_92, new_features_94, new_features_96, new_features_98, new_features_100, new_features_102], 1) {}
call_module l__self___features_denseblock4_denselayer11_norm1 L__self___features_denseblock4_denselayer11_norm1 (concated_features_52,) {}
call_module l__self___features_denseblock4_denselayer11_relu1 L__self___features_denseblock4_denselayer11_relu1 (l__self___features_denseblock4_denselayer11_norm1,) {}
call_module bottleneck_output_104 L__self___features_denseblock4_denselayer11_conv1 (l__self___features_denseblock4_denselayer11_relu1,) {}
call_module l__self___features_denseblock4_denselayer11_norm2 L__self___features_denseblock4_denselayer11_norm2 (bottleneck_output_104,) {}
call_module l__self___features_denseblock4_denselayer11_relu2 L__self___features_denseblock4_denselayer11_relu2 (l__self___features_denseblock4_denselayer11_norm2,) {}
call_module new_features_104 L__self___features_denseblock4_denselayer11_conv2 (l__self___features_denseblock4_denselayer11_relu2,) {}
call_function concated_features_53 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition3_pool, new_features_84, new_features_86, new_features_88, new_features_90, new_features_92, new_features_94, new_features_96, new_features_98, new_features_100, new_features_102, new_features_104], 1) {}
call_module l__self___features_denseblock4_denselayer12_norm1 L__self___features_denseblock4_denselayer12_norm1 (concated_features_53,) {}
call_module l__self___features_denseblock4_denselayer12_relu1 L__self___features_denseblock4_denselayer12_relu1 (l__self___features_denseblock4_denselayer12_norm1,) {}
call_module bottleneck_output_106 L__self___features_denseblock4_denselayer12_conv1 (l__self___features_denseblock4_denselayer12_relu1,) {}
call_module l__self___features_denseblock4_denselayer12_norm2 L__self___features_denseblock4_denselayer12_norm2 (bottleneck_output_106,) {}
call_module l__self___features_denseblock4_denselayer12_relu2 L__self___features_denseblock4_denselayer12_relu2 (l__self___features_denseblock4_denselayer12_norm2,) {}
call_module new_features_106 L__self___features_denseblock4_denselayer12_conv2 (l__self___features_denseblock4_denselayer12_relu2,) {}
call_function concated_features_54 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition3_pool, new_features_84, new_features_86, new_features_88, new_features_90, new_features_92, new_features_94, new_features_96, new_features_98, new_features_100, new_features_102, new_features_104, new_features_106], 1) {}
call_module l__self___features_denseblock4_denselayer13_norm1 L__self___features_denseblock4_denselayer13_norm1 (concated_features_54,) {}
call_module l__self___features_denseblock4_denselayer13_relu1 L__self___features_denseblock4_denselayer13_relu1 (l__self___features_denseblock4_denselayer13_norm1,) {}
call_module bottleneck_output_108 L__self___features_denseblock4_denselayer13_conv1 (l__self___features_denseblock4_denselayer13_relu1,) {}
call_module l__self___features_denseblock4_denselayer13_norm2 L__self___features_denseblock4_denselayer13_norm2 (bottleneck_output_108,) {}
call_module l__self___features_denseblock4_denselayer13_relu2 L__self___features_denseblock4_denselayer13_relu2 (l__self___features_denseblock4_denselayer13_norm2,) {}
call_module new_features_108 L__self___features_denseblock4_denselayer13_conv2 (l__self___features_denseblock4_denselayer13_relu2,) {}
call_function concated_features_55 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition3_pool, new_features_84, new_features_86, new_features_88, new_features_90, new_features_92, new_features_94, new_features_96, new_features_98, new_features_100, new_features_102, new_features_104, new_features_106, new_features_108], 1) {}
call_module l__self___features_denseblock4_denselayer14_norm1 L__self___features_denseblock4_denselayer14_norm1 (concated_features_55,) {}
call_module l__self___features_denseblock4_denselayer14_relu1 L__self___features_denseblock4_denselayer14_relu1 (l__self___features_denseblock4_denselayer14_norm1,) {}
call_module bottleneck_output_110 L__self___features_denseblock4_denselayer14_conv1 (l__self___features_denseblock4_denselayer14_relu1,) {}
call_module l__self___features_denseblock4_denselayer14_norm2 L__self___features_denseblock4_denselayer14_norm2 (bottleneck_output_110,) {}
call_module l__self___features_denseblock4_denselayer14_relu2 L__self___features_denseblock4_denselayer14_relu2 (l__self___features_denseblock4_denselayer14_norm2,) {}
call_module new_features_110 L__self___features_denseblock4_denselayer14_conv2 (l__self___features_denseblock4_denselayer14_relu2,) {}
call_function concated_features_56 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition3_pool, new_features_84, new_features_86, new_features_88, new_features_90, new_features_92, new_features_94, new_features_96, new_features_98, new_features_100, new_features_102, new_features_104, new_features_106, new_features_108, new_features_110], 1) {}
call_module l__self___features_denseblock4_denselayer15_norm1 L__self___features_denseblock4_denselayer15_norm1 (concated_features_56,) {}
call_module l__self___features_denseblock4_denselayer15_relu1 L__self___features_denseblock4_denselayer15_relu1 (l__self___features_denseblock4_denselayer15_norm1,) {}
call_module bottleneck_output_112 L__self___features_denseblock4_denselayer15_conv1 (l__self___features_denseblock4_denselayer15_relu1,) {}
call_module l__self___features_denseblock4_denselayer15_norm2 L__self___features_denseblock4_denselayer15_norm2 (bottleneck_output_112,) {}
call_module l__self___features_denseblock4_denselayer15_relu2 L__self___features_denseblock4_denselayer15_relu2 (l__self___features_denseblock4_denselayer15_norm2,) {}
call_module new_features_112 L__self___features_denseblock4_denselayer15_conv2 (l__self___features_denseblock4_denselayer15_relu2,) {}
call_function concated_features_57 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition3_pool, new_features_84, new_features_86, new_features_88, new_features_90, new_features_92, new_features_94, new_features_96, new_features_98, new_features_100, new_features_102, new_features_104, new_features_106, new_features_108, new_features_110, new_features_112], 1) {}
call_module l__self___features_denseblock4_denselayer16_norm1 L__self___features_denseblock4_denselayer16_norm1 (concated_features_57,) {}
call_module l__self___features_denseblock4_denselayer16_relu1 L__self___features_denseblock4_denselayer16_relu1 (l__self___features_denseblock4_denselayer16_norm1,) {}
call_module bottleneck_output_114 L__self___features_denseblock4_denselayer16_conv1 (l__self___features_denseblock4_denselayer16_relu1,) {}
call_module l__self___features_denseblock4_denselayer16_norm2 L__self___features_denseblock4_denselayer16_norm2 (bottleneck_output_114,) {}
call_module l__self___features_denseblock4_denselayer16_relu2 L__self___features_denseblock4_denselayer16_relu2 (l__self___features_denseblock4_denselayer16_norm2,) {}
call_module new_features_114 L__self___features_denseblock4_denselayer16_conv2 (l__self___features_denseblock4_denselayer16_relu2,) {}
call_function cat_61 <built-in method cat of type object at 0x7fcfe8e34840> ([l__self___features_transition3_pool, new_features_84, new_features_86, new_features_88, new_features_90, new_features_92, new_features_94, new_features_96, new_features_98, new_features_100, new_features_102, new_features_104, new_features_106, new_features_108, new_features_110, new_features_112, new_features_114], 1) {}
call_module features L__self___features_norm5 (cat_61,) {}
call_function out <function relu at 0x7fcf10a65510> (features,) {'inplace': True}
call_function out_1 <function adaptive_avg_pool2d at 0x7fcf10a65000> (out, (1, 1)) {}
call_function out_2 <built-in method flatten of type object at 0x7fcfe8e34840> (out_1, 1) {}
call_module out_3 L__self___classifier (out_2,) {}
output output output ((out_3,),) {}
tensor([[ 0.0614, -0.4023, -0.2792, ..., -0.5549, 0.0976, -0.0634],
[-0.2032, -0.2706, -0.0935, ..., -0.4815, 0.0758, -0.1038],
[ 0.0637, -0.3492, -0.1492, ..., -0.4841, 0.1776, -0.0723],
...,
[-0.1050, -0.3393, 0.0092, ..., -0.4862, 0.0555, -0.1058],
[ 0.0018, -0.2431, -0.1656, ..., -0.5072, 0.0977, -0.1387],
[ 0.1192, -0.3563, -0.1147, ..., -0.4839, 0.1770, -0.0659]],
device='cuda:0', grad_fn=<AddmmBackward0>)
使用我们的自定义后端,我们现在可以看到 TorchDynamo 如何处理数据相关的控制流。考虑下面的函数,其中if b.sum() < 0
这一行是数据相关控制流的源头。
def bar(a, b):
x = a / (torch.abs(a) + 1)
if b.sum() < 0:
b = b * -1
return x * b
opt_bar = torch.compile(bar, backend=custom_backend)
inp1 = torch.randn(10)
inp2 = torch.randn(10)
opt_bar(inp1, inp2)
opt_bar(inp1, -inp2)
custom backend called with FX graph:
opcode name target args kwargs
------------- ------ ------------------------------------------------------ ----------- --------
placeholder l_a_ L_a_ () {}
placeholder l_b_ L_b_ () {}
call_function abs_1 <built-in method abs of type object at 0x7fcfe8e34840> (l_a_,) {}
call_function add <built-in function add> (abs_1, 1) {}
call_function x <built-in function truediv> (l_a_, add) {}
call_method sum_1 sum (l_b_,) {}
call_function lt <built-in function lt> (sum_1, 0) {}
output output output ((x, lt),) {}
custom backend called with FX graph:
opcode name target args kwargs
------------- ------ ----------------------- ------------ --------
placeholder l_x_ L_x_ () {}
placeholder l_b_ L_b_ () {}
call_function mul <built-in function mul> (l_x_, l_b_) {}
output output output ((mul,),) {}
custom backend called with FX graph:
opcode name target args kwargs
------------- ------ ----------------------- ----------- --------
placeholder l_b_ L_b_ () {}
placeholder l_x_ L_x_ () {}
call_function b <built-in function mul> (l_b_, -1) {}
call_function mul_1 <built-in function mul> (l_x_, b) {}
output output output ((mul_1,),) {}
tensor([-0.0176, 1.0753, 0.0282, 0.0756, -0.0176, 0.0633, -0.9161, 0.1333,
-0.1971, -0.3406])
输出显示 TorchDynamo 提取了 3 个不同的 FX 图,对应以下代码(顺序可能与上面的输出不同):
-
x = a / (torch.abs(a) + 1)
-
b = b * -1; return x * b
-
return x * b
当 TorchDynamo 遇到不支持的 Python 特性,比如数据相关的控制流,它会中断计算图,让默认的 Python 解释器处理不支持的代码,然后继续捕获图。
让我们通过示例来探究 TorchDynamo 如何逐步执行bar
。如果b.sum() < 0
,那么 TorchDynamo 将运行图 1,让 Python 确定条件的结果,然后运行图 2。另一方面,如果not b.sum() < 0
,那么 TorchDynamo 将运行图 1,让 Python 确定条件的结果,然后运行图 3。
这突显了 TorchDynamo 与以前的 PyTorch 编译器解决方案之间的主要区别。当遇到不支持的 Python 特性时,以前的解决方案要么引发错误,要么悄悄失败。另一方面,TorchDynamo 会中断计算图。
我们可以通过使用torch._dynamo.explain
来查看 TorchDynamo 中断图的位置:
# Reset since we are using a different backend.
torch._dynamo.reset()
explain_output = torch._dynamo.explain(bar)(torch.randn(10), torch.randn(10))
print(explain_output)
Graph Count: 2
Graph Break Count: 1
Op Count: 6
Break Reasons:
Break Reason 1:
Reason: generic_jump TensorVariable()
User Stack:
<FrameSummary file /var/lib/jenkins/workspace/intermediate_source/torch_compile_tutorial.py, line 434 in bar>
Ops per Graph:
Ops 1:
<built-in method abs of type object at 0x7fcfe8e34840>
<built-in function add>
<built-in function truediv>
<built-in function lt>
Ops 2:
<built-in function mul>
<built-in function mul>
Out Guards:
Guard 1:
Name: "L['a']"
Source: local
Create Function: TENSOR_MATCH
Guard Types: ['TENSOR_MATCH']
Code List: ["hasattr(L['a'], '_dynamo_dynamic_indices') == False"]
Object Weakref: <weakref at 0x7fce93c22110; dead>
Guarded Class Weakref: <weakref at 0x7fcf10f646d0; to 'torch._C._TensorMeta' at 0x60eec90 (Tensor)>
Guard 2:
Name: ''
Source: shape_env
Create Function: SHAPE_ENV
Guard Types: None
Code List: None
Object Weakref: None
Guarded Class Weakref: None
Guard 3:
Name: ''
Source: global
Create Function: GRAD_MODE
Guard Types: None
Code List: None
Object Weakref: None
Guarded Class Weakref: None
Guard 4:
Name: ''
Source: global
Create Function: CONFIG_HASH_MATCH
Guard Types: ['CONFIG_HASH_MATCH']
Code List: ["___compile_config_hash() == 'a9446d0645a24f8e5db15f38d621b2a5'"]
Object Weakref: None
Guarded Class Weakref: None
Guard 5:
Name: ''
Source: global
Create Function: DETERMINISTIC_ALGORITHMS
Guard Types: None
Code List: None
Object Weakref: None
Guarded Class Weakref: None
Guard 6:
Name: ''
Source: global
Create Function: HAS_GRAPH_BREAK
Guard Types: ['HAS_GRAPH_BREAK']
Code List: ['not ___needs_nopython()']
Object Weakref: None
Guarded Class Weakref: None
Guard 7:
Name: ''
Source: global
Create Function: TORCH_FUNCTION_STATE
Guard Types: None
Code List: None
Object Weakref: None
Guarded Class Weakref: None
Guard 8:
Name: "L['b']"
Source: local
Create Function: TENSOR_MATCH
Guard Types: ['TENSOR_MATCH']
Code List: ["hasattr(L['b'], '_dynamo_dynamic_indices') == False"]
Object Weakref: <weakref at 0x7fce89d67c90; dead>
Guarded Class Weakref: <weakref at 0x7fcf10f646d0; to 'torch._C._TensorMeta' at 0x60eec90 (Tensor)>
Guard 9:
Name: ''
Source: global
Create Function: DEFAULT_DEVICE
Guard Types: ['DEFAULT_DEVICE']
Code List: ['utils_device.CURRENT_DEVICE == None']
Object Weakref: None
Guarded Class Weakref: None
Guard 10:
Name: "G['torch']"
Source: global
Create Function: FUNCTION_MATCH
Guard Types: None
Code List: None
Object Weakref: None
Guarded Class Weakref: None
Guard 11:
Name: ''
Source: global
Create Function: BACKEND_MATCH
Guard Types: ['BACKEND_MATCH']
Code List: ['(___skip_backend_check() or ___current_backend() == ___lookup_backend(140524583700368))']
Object Weakref: None
Guarded Class Weakref: None
Guard 12:
Name: ''
Source: shape_env
Create Function: SHAPE_ENV
Guard Types: None
Code List: None
Object Weakref: None
Guarded Class Weakref: None
Guard 13:
Name: ''
Source: global
Create Function: GRAD_MODE
Guard Types: None
Code List: None
Object Weakref: None
Guarded Class Weakref: None
Guard 14:
Name: ''
Source: global
Create Function: CONFIG_HASH_MATCH
Guard Types: ['CONFIG_HASH_MATCH']
Code List: ["___compile_config_hash() == 'a9446d0645a24f8e5db15f38d621b2a5'"]
Object Weakref: None
Guarded Class Weakref: None
Guard 15:
Name: "L['b']"
Source: local
Create Function: TENSOR_MATCH
Guard Types: ['TENSOR_MATCH']
Code List: ["hasattr(L['b'], '_dynamo_dynamic_indices') == False"]
Object Weakref: <weakref at 0x7fce89d67c90; dead>
Guarded Class Weakref: <weakref at 0x7fcf10f646d0; to 'torch._C._TensorMeta' at 0x60eec90 (Tensor)>
Guard 16:
Name: ''
Source: global
Create Function: DETERMINISTIC_ALGORITHMS
Guard Types: None
Code List: None
Object Weakref: None
Guarded Class Weakref: None
Guard 17:
Name: ''
Source: global
Create Function: TORCH_FUNCTION_STATE
Guard Types: None
Code List: None
Object Weakref: None
Guarded Class Weakref: None
Guard 18:
Name: ''
Source: global
Create Function: DEFAULT_DEVICE
Guard Types: ['DEFAULT_DEVICE']
Code List: ['utils_device.CURRENT_DEVICE == None']
Object Weakref: None
Guarded Class Weakref: None
Guard 19:
Name: ''
Source: global
Create Function: BACKEND_MATCH
Guard Types: ['BACKEND_MATCH']
Code List: ['(___skip_backend_check() or ___current_backend() == ___lookup_backend(140524583700368))']
Object Weakref: None
Guarded Class Weakref: None
Guard 20:
Name: "L['x']"
Source: local
Create Function: TENSOR_MATCH
Guard Types: ['TENSOR_MATCH']
Code List: ["hasattr(L['x'], '_dynamo_dynamic_indices') == False"]
Object Weakref: <weakref at 0x7fce8830e1b0; dead>
Guarded Class Weakref: <weakref at 0x7fcf10f646d0; to 'torch._C._TensorMeta' at 0x60eec90 (Tensor)>
Compile Times: TorchDynamo compilation metrics:
Function Runtimes (s)
------------------------------- --------------
_compile.<locals>.compile_inner 0.0143, 0.0078
OutputGraph.call_user_compiler 0.0010, 0.0000
为了最大化加速,图中断应该受到限制。我们可以通过使用fullgraph=True
来强制 TorchDynamo 在遇到第一个图中断时引发错误:
opt_bar = torch.compile(bar, fullgraph=True)
try:
opt_bar(torch.randn(10), torch.randn(10))
except:
tb.print_exc()
Traceback (most recent call last):
File "/var/lib/jenkins/workspace/intermediate_source/torch_compile_tutorial.py", line 482, in <module>
opt_bar(torch.randn(10), torch.randn(10))
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
return fn(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 655, in catch_errors
return callback(frame, cache_entry, hooks, frame_state)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 383, in _convert_frame_assert
compiled_product = _compile(
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 646, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 244, in time_wrapper
r = func(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 562, in compile_inner
out_code = transform_code_object(code, transform)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
transformations(instructions, code_options)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 151, in _fn
return fn(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 527, in transform
tracer.run()
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2128, in run
super().run()
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 818, in run
and self.step()
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 781, in step
getattr(self, inst.opname)(inst)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 445, in inner
raise exc.UserError(
torch._dynamo.exc.UserError: Dynamic control flow is not supported at the moment. Please use functorch.experimental.control_flow.cond to explicitly capture the control flow. For more information about this error, see: https://pytorch.org/docs/main/generated/exportdb/index.html#cond-operands
from user code:
File "/var/lib/jenkins/workspace/intermediate_source/torch_compile_tutorial.py", line 434, in bar
if b.sum() < 0:
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
在下面,我们演示了 TorchDynamo 在我们用于演示加速的模型上不会中断图。
opt_model = torch.compile(init_model(), fullgraph=True)
print(opt_model(generate_data(16)[0]))
tensor([[ 0.1346, 0.1981, -0.2058, ..., 0.1868, -0.1626, -0.1644],
[ 0.2679, 0.2347, -0.1904, ..., 0.2167, -0.0060, 0.0307],
[ 0.0166, 0.2182, -0.1113, ..., 0.1708, -0.1683, -0.0637],
...,
[ 0.0808, 0.2680, -0.1887, ..., 0.0378, -0.2078, -0.1444],
[-0.0211, 0.0857, -0.2459, ..., 0.1863, -0.1282, -0.0283],
[-0.0388, 0.0728, -0.1961, ..., 0.0860, -0.2200, -0.1485]],
device='cuda:0', grad_fn=<CompiledFunctionBackward>)
我们可以使用torch.export
(从 PyTorch 2.1+开始)从输入的 PyTorch 程序中提取一个可导出的 FX 图。导出的图旨在在不同(即无 Python)环境上运行。一个重要的限制是torch.export
不支持图断点。请查看此教程了解更多关于torch.export
的详细信息。
结论
在本教程中,我们介绍了torch.compile
,涵盖了基本用法,演示了相对于急切模式的加速效果,与之前的 PyTorch 编译器解决方案进行了比较,并简要调查了 TorchDynamo 及其与 FX 图的交互。我们希望您会尝试使用torch.compile
!
脚本的总运行时间:(6 分钟 7.888 秒)
下载 Python 源代码:torch_compile_tutorial.py
下载 Jupyter 笔记本:torch_compile_tutorial.ipynb
电感器 CPU 后端调试和性能分析
原文:
pytorch.org/tutorials/intermediate/inductor_debug_cpu.html
译者:飞龙
注意
点击这里下载完整示例代码
作者: Xuan Liao, Haozhe Zhu, Jiong Gong, Weihan Wang
概述
PyTorch 2.0 引入了名为torch.compile
的编译 API。这一新功能通过默认的电感器后端提供的图级优化,显著加快了急切模式执行的速度。
本教程旨在通过深入研究torch.compile
的复杂性,提供有关电感器 CPU 后端调试和性能分析的深入介绍。
同时,您还可以在基本用法、全面的故障排除和 GPU 特定知识(如GPU 性能分析)周围找到与torch.compile
相关的教程。
我们将从一个激励示例开始调试,通过演示调试过程来准确定位问题,触发编译问题和准确性问题。
通过启用日志记录并探索生成的底层代码,您可以逐步学习如何缩小失败范围,最终找出根本原因。
接下来,我们将讨论如何对编译后的代码进行性能分析,并通过与急切模式的性能比较详细说明为什么torch.compile
可以提供额外的性能提升,与其急切模式相比。
调试
这里有一个简单的示例,使用 Inductor 运行torch.compile
并将其结果与急切模式进行比较:
import torch
def foo1(x1, x2):
a = torch.neg(x1)
b = torch.maximum(x2, a)
y = torch.cat([b], dim=0)
return y
x1 = torch.randint(256, (1, 8), dtype=torch.uint8)
x2 = torch.randint(256, (8390, 8), dtype=torch.uint8)
compiled_foo1 = torch.compile(foo1)
result = compiled_foo1(x1, x2)
在cpp
代码生成中,neg
的正确实现如下:
def neg1(x):
return f"decltype({x})(-{x})"
为了演示调试,我们将稍后将函数修改为错误的。
获取更多日志信息
如果您默认运行这个简单示例,将不会提供调试信息。为了获得更多有用的调试和日志信息,通常我们会添加一个TORCH_COMPILE_DEBUG
环境变量,如下所示:
TORCH_COMPILE_DEBUG=1 python xx.py
这将在输出日志中打印更多的调试信息,并且在代码生成过程中生成的中间 IR 也会被转储。您可以在日志中找到转储文件路径,如下所示:
torch._inductor.debug: [WARNING] model___20 debug trace: /tmp/torchinductor_root/rx/crxfi2ybd7yp5sbj2pnhw33wfhtdw7wumvrobyp5sjvdui5ktjc2.debug
在这个目录中,以下文件被保存用于调试目的:
文件 | 描述 |
---|---|
fx_graph_runnable.py |
可执行的 FX 图,在分解之后,在模式匹配之前 |
fx_graph_transformed.py |
经过模式匹配后的转换后的 FX 图 |
ir_post_fusion.txt |
融合前的电感 IR |
ir_pre_fusion.txt |
融合后的电感 IR |
output_code.py |
生成的用于图形的 Python 代码,带有 C++/Triton 内核 |
请注意,为了更容易调试,fx_graph_runnable.py
和output_code.py
都是可运行和可编辑的。以下是从文件中提取的代码的主要部分,我们将 C++生成的行与 FX 代码行进行了对应。
fx_graph_runnable
:
def forward1(self, arg0_1, arg1_1):
neg = torch.ops.aten.neg.default(arg0_1); arg0_1 = None
maximum = torch.ops.aten.maximum.default(arg1_1, neg); arg1_1 = neg = None
clone = torch.ops.aten.clone.default(maximum); maximum = None
return (clone,)
output_code
中的 C++内核:
from torch._inductor.codecache import AsyncCompile
async_compile = AsyncCompile()
cpp_fused_cat_maximum_neg_0 = async_compile.cpp('''
#include "/tmp/torchinductor_root/gv/cgv6n5aotqjo5w4vknjibhengeycuattfto532hkxpozszcgxr3x.h"
extern "C" void kernel(const unsigned char* in_ptr0,
const unsigned char* in_ptr1,
unsigned char* out_ptr0)
{
{
#pragma GCC ivdep
for(long i0=static_cast<long>(0L); i0<static_cast<long>(8390L); i0+=static_cast<long>(1L))
{
#pragma GCC ivdep
for(long i1=static_cast<long>(0L); i1<static_cast<long>(8L); i1+=static_cast<long>(1L))
{
auto tmp0 = in_ptr0[static_cast<long>(i1 + (8L*i0))];
auto tmp1 = in_ptr1[static_cast<long>(i1)];
// Corresponding FX code line: neg = torch.ops.aten.neg.default(arg0_1); arg0_1 = None
auto tmp2 = decltype(tmp1)(-tmp1);
// Corresponding FX code line: maximum = torch.ops.aten.maximum.default(arg1_1, neg); arg1_1 = neg = None
auto tmp3 = max_propagate_nan(tmp0, tmp2);
// Corresponding FX code line: clone = torch.ops.aten.clone.default(maximum); maximum = None
out_ptr0[static_cast<long>(i1 + (8L*i0))] = tmp3;
}
}
}
}''')
确定错误的组件
在遇到错误或准确性问题时,找到错误的一个直接解决方案是缩小问题范围。首先要做的是确定错误发生的组件。幸运的是,通过更改torch.compile
的后端就可以简单实现。
代码 | 描述 |
---|---|
torch.compile(fn, backend="eager") |
启用 Dynamo |
torch.compile(fn, backend="aot_eager") |
启用 Dynamo + AOT Autograd |
torch.compile(fn, backend="inductor") |
启用 Dynamo + AOT Autograd + Inductor |
如果模型在将后端设置为eager
或aot_eager
时可以成功运行,而在inductor
时失败,我们可以将失败缩小到 Inductor。
编译错误
正如我们所知,图级优化的演变链是这样的:
torch.neg (Python) -> torch.ops.aten.neg.default (within FX graph) -> ops.neg (within IR node) -> tmp2 = -tmp1 (within C++ kernel)
如果遇到编译错误,说明在输出代码中编译 C++内核时出现了问题。这种类型的错误表明在将 IR 节点降级为输出代码时引入了错误。编译错误的根本原因通常在回溯日志中显示。
例如,neg
函数被修改如下:
def neg2(x):
return f"-{x}"
日志记录显示了以下编译错误,原因相当明确。
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
CppCompileError: C++ compile error
/tmp/torchinductor_root/xg/cxga5tk3b4lkwoxyigrtocjp5s7vc5cg2ikuscf6bk6pjqip2bhx.cpp: In function ‘void kernel(const unsigned char*, const unsigned char*, unsigned char*)’:
/tmp/torchinductor_root/xg/cxga5tk3b4lkwoxyigrtocjp5s7vc5cg2ikuscf6bk6pjqip2bhx.cpp:17:57: error: no matching function for call to ‘max_propagate_nan(unsigned char&, int&)’
17 | auto tmp3 = max_propagate_nan(tmp0, tmp2);
| ^
In file included from /tmp/torchinductor_root/xg/cxga5tk3b4lkwoxyigrtocjp5s7vc5cg2ikuscf6bk6pjqip2bhx.cpp:2:
/tmp/torchinductor_root/gv/cgv6n5aotqjo5w4vknjibhengeycuattfto532hkxpozszcgxr3x.h:27:17: note: candidate: ‘template<class scalar_t> scalar_t max_propagate_nan(scalar_t, scalar_t)’
27 | inline scalar_t max_propagate_nan(scalar_t a, scalar_t b) {
| ^~~~~~~~~~~~~~~~~
/tmp/torchinductor_root/gv/cgv6n5aotqjo5w4vknjibhengeycuattfto532hkxpozszcgxr3x.h:27:17: note: template argument deduction/substitution failed:
/tmp/torchinductor_root/xg/cxga5tk3b4lkwoxyigrtocjp5s7vc5cg2ikuscf6bk6pjqip2bhx.cpp:17:57: note: deduced conflicting types for parameter ‘scalar_t’ (‘unsigned char’ and ‘int’)
17 | auto tmp3 = max_propagate_nan(tmp0, tmp2);
| ^
让我们也看看输出代码和 IR 节点中对应的 C++内核。
C++内核:
include "/tmp/torchinductor_root/gv/cgv6n5aotqjo5w4vknjibhengeycuattfto532hkxpozszcgxr3x.h"
extern "C" void kernel(const unsigned char* in_ptr0,
const unsigned char* in_ptr1,
unsigned char* out_ptr0)
{
{
#pragma GCC ivdep
for(long i0=static_cast<long>(0L); i0<static_cast<long>(8390L); i0+=static_cast<long>(1L))
{
#pragma GCC ivdep
for(long i1=static_cast<long>(0L); i1<static_cast<long>(8L); i1+=static_cast<long>(1L))
{
auto tmp0 = in_ptr0[static_cast<long>(i1 + (8L*i0))];
auto tmp1 = in_ptr1[static_cast<long>(i1)];
auto tmp2 = -tmp1;
auto tmp3 = max_propagate_nan(tmp0, tmp2);
out_ptr0[static_cast<long>(i1 + (8L*i0))] = tmp3;
}
}
}
}
IR 节点:
buf0: SchedulerNode(ComputedBuffer)
buf0.writes = [MemoryDep('buf0', c0, {c0: 67120})]
buf0.unmet_dependencies = []
buf0.met_dependencies =
[ MemoryDep('arg0_1', c1, {c0: 8390, c1: 8}),
MemoryDep('arg1_1', c0, {c0: 67120})]
buf0.users = [NodeUser(node=OUTPUT, can_inplace=False)]
buf0.group.device = cpu
buf0.group.iteration = ((8390, 8), ())
buf0.sizes = ([8390, 8], [])
class buf0_loop_body:
var_ranges = {z0: 8390, z1: 8}
index0 = 8*z0 + z1
index1 = z1
def body(self, ops):
get_index = self.get_index('index0')
load = ops.load('arg1_1', get_index)
get_index_1 = self.get_index('index1')
load_1 = ops.load('arg0_1', get_index_1)
neg = ops.neg(load_1)
maximum = ops.maximum(load, neg)
get_index_2 = self.get_index('index0')
store = ops.store('buf0', get_index_2, maximum, None)
return store
根据回溯日志,编译错误是由于max_propagate_nan
的输入数据类型不一致造成的。通过检查 C++内核,我们知道在执行-
后,tmp2
不再是long
,因为tmp0
是long
。我们可以在 C++内核中使用ops.neg
和ops.maximum
分别匹配-
和max_propagate_nan
。
现在我们成功找到了根本原因,即在cpp
代码生成中ops.neg
的实现,当执行neg
时会悄悄地更改数据类型。
准确性调试
否则,如果模型运行时出现其他错误或准确性问题,可以使用名为Minifier的 PyTorch 调试工具。
Minifier
的核心思想是不断删除图的节点和输入,直到找到具有问题的最小图。它通过 4 种策略自动生成一个经过缩小的有问题的图:截断后缀、增量调试、消除死代码和删除未使用的输入。
现在我们将展示如何通过Minifer
来调试准确性问题。准确性问题指的是后端 eager 和 inductor 的输出不同的情况。
例如,我们将示例修改如下:
from torch._dynamo.utils import same
def foo2(x1, x2):
a = torch.neg(x1)
b = torch.maximum(x2, a)
y = torch.cat([b], dim=0)
return y
x1 = torch.randn((1, 8), dtype=torch.float32)
x2 = torch.randn((8390, 8), dtype=torch.float32)
expected_result = foo2(x1, x2)
compiled_foo2 = torch.compile(foo2)
actual_result = compiled_foo2(x1, x2)
assert same(expected_result, actual_result) == True
还要修改neg
函数:
def neg3(x):
return f"decltype({x})(2 * {x})"
准确性问题将如下提出:
torch._dynamo.utils: [ERROR] Accuracy failed: allclose not within tol=0.0001
Traceback (most recent call last):
File "test_script.py", line 18, in <module>
assert same(expected_result, actual_result) == True
AssertionError
要调试 Minifier 的准确性问题,需要两个环境变量:
TORCHDYNAMO_REPRO_AFTER="aot" TORCHDYNAMO_REPRO_LEVEL=4 python xx.py
这给我们提供了记录信息,展示了缩小步骤的过程:
Started off with 6 nodes
Trying granularity 2
Strategy: Truncate suffix (G: 2) (6 nodes, 2 inputs)
SUCCESS: Went from 6 to 4 nodes
Trying granularity 4
Strategy: Remove unused inputs (G: 4) (4 nodes, 2 inputs)
SUCCESS: Went from 4 to 3 nodes
运行后,我们得到了目标节点neg
的最终缩小图:
def forward2(self, arg0_1):
neg = torch.ops.aten.neg.default(arg0_1); arg0_1 = None
return (neg,)
关于 Minifier 的更多使用细节,请参考Troubleshooting。
性能分析
在本节中,我们将演示使用 Inductor CPU 后端编译的模型进行性能分析的过程。在下面的示例中,我们使用急切模式和 Inductor 图模式对 Hugging Face Transformer 模型MobileBertForQuestionAnswering
进行基准测试。基准测试后打印出 Inductor 的执行时间和加速比。我们使用 Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz,并在第一个插槽上运行基准测试,以展示本节中的优化。我们设置以下环境变量作为在 Intel(R) CPU 上进行基准测试的最佳实践。
export KMP_BLOCKTIME=1
export KMP_SETTINGS=1
export KMP_AFFINITY=granularity=fine,compact,1,0
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
numactl -C 0-31 -m 0 python bench.py
# bench.py
from transformers import MobileBertForQuestionAnswering
# Initialize an eager model
model = MobileBertForQuestionAnswering.from_pretrained("csarron/mobilebert-uncased-squad-v2")
seq_length = 128
bs = 128
vocab_size = model.config.vocab_size
input = torch.randint(0, vocab_size, (bs, seq_length), dtype=torch.int64)
input_dict = {"input_ids": input}
# Initialize the inductor model
compiled_model = torch.compile(model)
with torch.no_grad():
compiled_model(**input_dict)
NUM_ITERS=50
import timeit
with torch.no_grad():
# warmup
for _ in range(10):
model(**input_dict)
eager_t = timeit.timeit("model(**input_dict)", number=NUM_ITERS, globals=globals())
with torch.no_grad():
# warmup
for _ in range(10):
compiled_model(**input_dict)
inductor_t = timeit.timeit("compiled_model(**input_dict)", number=NUM_ITERS, globals=globals())
# print(f"eager use: {eager_t * 1000 / NUM_ITERS} ms/iter")
# print(f"inductor use: {inductor_t * 1000 / NUM_ITERS} ms/iter")
# print(f"speed up ratio: {eager_t / inductor_t}")
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning:
torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
Downloading config.json: 0%| | 0.00/765 [00:00<?, ?B/s]
Downloading config.json: 100%|##########| 765/765 [00:00<00:00, 4.90MB/s]
Downloading model.safetensors: 0%| | 0.00/98.5M [00:00<?, ?B/s]
Downloading model.safetensors: 32%|###1 | 31.5M/98.5M [00:00<00:00, 293MB/s]
Downloading model.safetensors: 75%|#######4 | 73.4M/98.5M [00:00<00:00, 342MB/s]
Downloading model.safetensors: 100%|##########| 98.5M/98.5M [00:00<00:00, 346MB/s]
输出:
eager use: 802.1023553796113 ms/iter
inductor use: 339.95180135127157 ms/iter
speed up ratio: 2.359459053287382
在我们自己的测试中,我们发现 Inductor CPU 后端可以将模型的速度提高约 2.355 倍。
接下来,让我们深入了解操作级别的性能,以了解速度提升来自哪里。Pytorch Profiler是一个帮助我们的好工具。Inductor CPU 后端支持使用enable_kernel_profile
配置选项将融合内核的时间报告给性能分析器:
from torch._inductor import config
config.cpp.enable_kernel_profile = True
按照Pytorch Profiler中的步骤,我们能够获得性能分析表和跟踪文件。
# bench.py
from torch.profiler import profile, schedule, ProfilerActivity
RESULT_DIR = "./prof_trace"
my_schedule = schedule(
skip_first=10,
wait=5,
warmup=5,
active=1,
repeat=5)
def trace_handler(p):
output = p.key_averages().table(sort_by="self_cpu_time_total", row_limit=20)
# print(output)
p.export_chrome_trace(f"{RESULT_DIR}/{p.step_num}.json")
for _ in range(10):
model(**input_dict) # compiled_model(**input_dict) to get inductor model profiling
total = 0
with profile(
activities=[ProfilerActivity.CPU],
schedule=my_schedule,
on_trace_ready=trace_handler
) as p:
for _ in range(50):
model(**input_dict) # compiled_model(**input_dict) to get inductor model profiling
p.step()
我们得到了急切模式模型的以下性能分析表(省略了一些列):
------------------------- ------------ ------------ ------------
Name CPU total % CPU total # of Calls
------------------------- ------------ ------------ ------------
aten::addmm 45.73% 370.814ms 362
aten::add 19.89% 161.276ms 363
aten::copy_ 14.97% 121.416ms 488
aten::mul 9.02% 73.154ms 194
aten::clamp_min 8.81% 71.444ms 96
aten::bmm 5.46% 44.258ms 48
ProfilerStep* 100.00% 810.920ms 1
aten::div 2.89% 23.447ms 24
aten::_softmax 1.00% 8.087ms 24
aten::linear 46.48% 376.888ms 362
aten::clone 2.77% 22.430ms 98
aten::t 0.31% 2.502ms 362
aten::view 0.14% 1.161ms 850
aten::transpose 0.17% 1.377ms 386
aten::index_select 0.12% 952.000us 3
aten::expand 0.12% 986.000us 458
aten::matmul 8.31% 67.420ms 48
aten::cat 0.09% 703.000us 1
aten::as_strided 0.08% 656.000us 963
aten::relu 8.86% 71.864ms 96
------------------------- ------------ ------------ ------------
Self CPU time total: 810.920ms
同样,我们还得到了使用 Inductor 编译模型的表格(省略了一些列):
----------------------------------------------- ------------ ------------ ------------
Name CPU total % CPU total # of Calls
----------------------------------------------- ------------ ------------ ------------
mkl::_mkl_linear 68.79% 231.573ms 362
aten::bmm 8.02% 26.992ms 48
ProfilerStep* 100.00% 336.642ms 1
graph_0_cpp_fused_constant_pad_nd_embedding_0 0.27% 915.000us 1
aten::empty 0.27% 911.000us 362
graph_0_cpp_fused__mkl_linear_add_mul_relu_151 0.27% 901.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_226 0.27% 899.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_361 0.27% 898.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_121 0.27% 895.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_31 0.27% 893.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_76 0.26% 892.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_256 0.26% 892.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_346 0.26% 892.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_241 0.26% 891.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_316 0.26% 891.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_91 0.26% 890.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_106 0.26% 890.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_211 0.26% 890.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_61 0.26% 889.000us 1
graph_0_cpp_fused__mkl_linear_add_mul_relu_286 0.26% 889.000us 1
----------------------------------------------- ------------ ------------ ------------
Self CPU time total: 336.642ms
从急切模型的分析表中,我们可以看到最耗时的操作是[aten::addmm
, aten::add
, aten::copy_
, aten::mul
, aten::clamp_min
, aten::bmm
]。与感应器模型的分析表相比,我们注意到一个mkl::_mkl_linear
条目和多个形式为graph_0_cpp_fused_*
的融合内核。它们是感应器模型正在进行的主要优化。让我们分别讨论它们。
关于mkl::_mkl_linear
:您可能会注意到对该内核的调用次数为 362,这恰好与急切模型分析表中的aten::linear
相同。aten::linear
的 CPU 总时间为 376.888 毫秒,而mkl::_mkl_linear
为 231.573 毫秒。这表明“linear”部分的速度提升约为 1.63 倍。速度提升主要来自将权重张量打包到块内存格式,并在 Inductor CPU 后端中调用cblas_sgemm_compute,以在 GEMM 计算过程中获得更好的缓存行为。
(2) 关于其他内存密集型操作:在我们的测试中,急切/电感器模型的端到端延迟为 802/339 毫秒。因此,我们可以粗略推断,其他内存密集型操作的加速大约是 3.94 倍。让我们阅读生成的代码,以了解电感器是如何实现这一令人印象深刻的优化的。您可以通过在 output_code.py 中搜索cpp_fused__mkl_linear_add_mul_relu_151
来找到生成的代码。
cpp_fused__mkl_linear_add_mul_relu_151 = async_compile.cpp('''
#include <ATen/record_function.h>
#include "/tmp/torchinductor_root/lr/clrlgu27q4ggd472umdzwsu6qcpqxcuusjxqvx2hwitjbujiiz7z.h"
extern "C" void kernel(float* in_out_ptr0,
const float* in_ptr0,
const float* in_ptr1,
const float* in_ptr2,
const float* in_ptr3)
{
RECORD_FUNCTION("graph_0_cpp_fused__mkl_linear_add_mul_relu_151", c10::ArrayRef<c10::IValue>({}));
#pragma omp parallel num_threads(32)
{
{
#pragma omp for
for(long i0=static_cast<long>(0L); i0<static_cast<long>(16384L); i0+=static_cast<long>(1L))
{
for(long i1=static_cast<long>(0L); i1<static_cast<long>(512L); i1+=static_cast<long>(8L))
{
auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(i1 + (512L*i0)));
auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(i1));
auto tmp3 = at::vec::Vectorized<float>::loadu(in_out_ptr0 + static_cast<long>(i1 + (512L*i0)));
auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(i1));
auto tmp7 = at::vec::Vectorized<float>::loadu(in_ptr3 + static_cast<long>(i1));
auto tmp2 = tmp0 + tmp1;
auto tmp4 = tmp2 + tmp3;
auto tmp6 = tmp4 * tmp5;
auto tmp8 = tmp6 + tmp7;
tmp8.store(in_out_ptr0 + static_cast<long>(i1 + (512L*i0)));
}
}
}
}
}''')
从上面生成的代码中,我们可以看到这个内核在[add, add, mul, add]
上进行了典型的循环融合。这是一个限制性能的内存瓶颈。为了更直观地了解这种优化,我们可以推断输入的大小和步幅,并进一步对这种[add, add, mul, add]
模式进行基准测试。
# bench.py
def func(arg_0, arg_1, arg_2, arg_3, arg_4):
add_0 = arg_0 + arg_1
add_1 = add_0 + arg_2
mul_1 = add_1 * arg_3
add_2 = mul_1 + arg_4
arg_2 = add_2
return arg_2
arg_0 = torch.rand(16384, 512)
arg_1 = torch.rand(1, 512)
arg_2 = torch.zeros(16384, 512)
arg_3 = torch.rand(1, 512)
arg_4 = torch.rand(1, 512)
input = (arg_0, arg_1, arg_2, arg_3, arg_4)
inductor_func = torch.compile(func)
with torch.no_grad():
inductor_func(*input)
import timeit
NUM_ITERS=100
with torch.no_grad():
# warmup
for _ in range(10):
func(*input)
eager_t = timeit.timeit("func(*input)", number=NUM_ITERS, globals=globals())
with torch.no_grad():
# warmup
for _ in range(10):
inductor_func(*input)
inductor_t = timeit.timeit("inductor_func(*input)", number=NUM_ITERS, globals=globals())
# print(f"eager use: {eager_t * 1000 / NUM_ITERS} ms/iter")
# print(f"inductor use: {inductor_t * 1000 / NUM_ITERS} ms/iter")
# print(f"speed up ratio: {eager_t / inductor_t}")
输出:
eager use: 5.780875144992024 ms/iter
inductor use: 0.9588955780491233 ms/iter
speed up ratio: 6.0286805751604735
这只是一个例子。在这个模型中,分析表显示所有的逐元素操作都会自动在电感器内部融合。您可以在 output_code.py 中阅读更多内核。
结论
本文档为电感器 CPU 后端提供了深入的教程。
通过激励性示例,我们演示了调试和性能分析的过程。主要思想是缩小问题范围。
我们逐步演示了深入研究问题并找到失败根本原因的方法,借助调试日志和 Minifier 工具的帮助。首先确定故障发生在哪个组件,然后尝试生成能够重现故障的最小代码片段。
当电感器的性能优于急切模式时,我们为性能分析提供了一种可靠的分析方法。我们展示了如何使用 PyTorch Profiler 找到耗时的热点,并找出解释现象的操作级或内核级原因。
脚本的总运行时间:(9 分钟 21.695 秒)
下载 Python 源代码:inductor_debug_cpu.py
下载 Jupyter 笔记本:inductor_debug_cpu.ipynb