Informer: time-series forecasting model
1 Introduction
LSTF (Long Sequence Time-series Forecasting)
Three significant limitations of the vanilla Transformer in LSTF:
- The quadratic computation of self-attention. The atomic operation of the self-attention mechanism, the canonical dot-product, causes the time complexity and memory usage per layer to be \(O(L^2)\).
- The memory bottleneck in stacking layers for long inputs. The stack of J encoder/decoder layers makes the total memory usage \(O(J \cdot L^2)\), which limits the model's scalability in receiving long sequence inputs.
- The speed plunge in predicting long outputs. The dynamic decoding of the vanilla Transformer makes step-by-step inference as slow as an RNN-based model (Fig. 1b).
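To see where the quadratic cost comes from, here is a minimal NumPy sketch of canonical scaled dot-product attention; the L x L score matrix is what drives the \(O(L^2)\) time and memory per layer (shapes and names are illustrative, not from the paper's code).

```python
import numpy as np

def canonical_attention(Q, K, V):
    """Canonical scaled dot-product attention.

    Q, K, V: arrays of shape (L, d). The score matrix Q @ K.T has shape
    (L, L), so time and memory grow quadratically with the input length L,
    which is the first limitation noted above.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # (L, L): the O(L^2) term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (L, d)

L, d = 4096, 64
Q, K, V = (np.random.randn(L, d) for _ in range(3))
out = canonical_attention(Q, K, V)                    # builds a 4096 x 4096 score matrix
```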
Prior works:
- Vanilla Transformer(2017)
- The Sparse Transformer(2019)
- LogSparse Transformer(2019)
- Longformer(2020)
- Reformer(2019)
- Linformer(2020)
- Transformer-XL(2019)
- Compressive Transformer(2019)
2 Preliminary
3 Methodology
Efficient Self-attention Mechanism
The i-th query's attention is defined as a kernel smoother in a probability form:
\(\mathcal{A}(q_i, K, V) = \mathbb{E}_{p(k_j|q_i)}[v_j]\)
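Writing the kernel smoother out in full (following the paper's notation, with the asymmetric exponential kernel):

\[
\mathcal{A}(q_i, K, V) = \sum_{j} \frac{k(q_i, k_j)}{\sum_{l} k(q_i, k_l)}\, v_j = \mathbb{E}_{p(k_j|q_i)}[v_j],
\qquad
p(k_j|q_i) = \frac{k(q_i, k_j)}{\sum_{l} k(q_i, k_l)},
\qquad
k(q_i, k_j) = \exp\!\left(\frac{q_i k_j^{\top}}{\sqrt{d}}\right)
\]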
- The Sparse Transformer: "self-attention probability has potential sparsity"
Query Sparsity Measurement
- a few dot-product pairs contribute to the major attention,
- others generate trivial attention.
Distinguishing the "important" queries:
- compare the query's attention distribution with the uniform distribution via the Kullback-Leibler divergence
- dropping the constant term yields the query's sparsity measurement \(M(q_i, K)\)
- its first term is the Log-Sum-Exp (LSE) of the query's dot-products over all keys, the second term is their arithmetic mean
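Concretely, the sparsity measurement from the paper is

\[
M(q_i, K) = \ln \sum_{j=1}^{L_K} e^{\frac{q_i k_j^{\top}}{\sqrt{d}}} \;-\; \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^{\top}}{\sqrt{d}}
\]

where the first term is the LSE and the second the arithmetic mean; a query with a larger \(M(q_i, K)\) has a more "diverse" attention probability and is more likely to contain the dominant dot-product pairs.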
ProbSparse Self-attention
- ProbSparse self-attention allows each key to attend only to the u dominant queries, i.e. the top-u queries under the sparsity measurement \(M(q, K)\), as formalized below
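Following the paper:

\[
\mathcal{A}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{\bar{Q} K^{\top}}{\sqrt{d}}\right) V
\]

where \(\bar{Q}\) is a sparse matrix of the same size as \(Q\) containing only the top-\(u\) queries under \(M(q, K)\), with \(u = c \cdot \ln L_Q\) for a constant sampling factor \(c\); this brings the per-layer time and memory down to \(O(L \ln L)\).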
Encoder
- extracts the robust long-range dependency of the long sequential inputs
Self-attention Distilling
- distilling operation (inspired by dilated convolution): between attention blocks, privilege the dominating features and halve the input length of the next layer (see the sketch after this list)
- Attention Block
- Conv1d( ): 1-D convolution along the time dimension
- ELU( ): activation function
- MaxPool( ): stride-2 max pooling that halves the sequence length
reduces the total memory usage to \(O((2-\epsilon) L \log L)\)
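A minimal PyTorch sketch of one distilling step between attention blocks, assuming a (batch, length, d_model) layout; the layer name and hyperparameters here are illustrative, not taken from the official implementation.

```python
import torch
import torch.nn as nn

class DistillingLayer(nn.Module):
    """Conv1d -> ELU -> MaxPool between attention blocks.

    Halves the temporal length so the next attention block runs on a
    shorter, "distilled" feature map (illustrative sketch, not the
    official Informer code).
    """
    def __init__(self, d_model: int):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, d_model); Conv1d expects (batch, channels, L)
        x = self.conv(x.transpose(1, 2))
        x = self.pool(self.act(x))
        return x.transpose(1, 2)             # (batch, ceil(L/2), d_model)

x = torch.randn(8, 96, 512)                  # e.g. an encoder feature map of length 96
print(DistillingLayer(512)(x).shape)         # torch.Size([8, 48, 512])
```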
Decoder
- a stack of two identical multi-head attention layers: masked ProbSparse self-attention followed by canonical cross-attention over the encoder's feature map
Generative Inference
- sample an \(L_{token}\)-long sequence from the input (an earlier slice before the output sequence)
- e.g. take the known 5 days before the target sequence as the "start token"
- feed the generative-style inference decoder with the start token concatenated with a placeholder for the target sequence (values set to zero, timestamps kept)
- a single forward procedure predicts all outputs, instead of step-by-step dynamic decoding (see the sketch after this list)
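A small PyTorch sketch (names are illustrative, not the paper's code) of how the generative decoder input can be assembled: the known start-token slice is concatenated with zero placeholders for the L_y target positions, and the whole block is decoded in one forward pass.

```python
import torch

def build_decoder_input(x_enc: torch.Tensor, L_token: int, L_y: int) -> torch.Tensor:
    """Concatenate a start token with zero placeholders for the targets.

    x_enc: (batch, L_x, d) known input sequence; the last L_token steps are
    reused as the "start token". Illustrative sketch -- the paper's placeholder
    zeroes only the target values while keeping their timestamps.
    """
    x_token = x_enc[:, -L_token:, :]                           # (batch, L_token, d)
    x_zeros = x_enc.new_zeros(x_enc.size(0), L_y, x_enc.size(2))  # (batch, L_y, d)
    return torch.cat([x_token, x_zeros], dim=1)                # (batch, L_token + L_y, d)

x_enc = torch.randn(8, 96, 7)          # e.g. 96 known steps, 7 features
x_dec = build_decoder_input(x_enc, L_token=48, L_y=24)
print(x_dec.shape)                     # torch.Size([8, 72, 7])
```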
Loss function
- MSE loss on the prediction w.r.t. the target sequences; the loss is propagated back from the decoder's outputs across the entire model
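In the usual mean-squared-error form (notation mine, with \(L_y\) predicted points):

\[
\mathcal{L}_{\mathrm{MSE}} = \frac{1}{L_y} \sum_{t=1}^{L_y} \left\lVert \hat{y}_t - y_t \right\rVert^2
\]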
4 Experiment
Datasets
Four datasets: 2 collected real-world datasets for LSTF (ETT at hour-level and 15-minute-level granularity) and 2 public benchmark datasets (ECL and Weather).
ETT (Electricity Transformer Temperature)
ECL (Electricity Consuming Load)
Weather
Experimental Details
Baselines:
- ARIMA(2014)
- Prophet(2018)
- LSTMa(2015)
- LSTnet(2018)
- DeepAR(2017)
Self-attention variants compared:
- the canonical self-attention variant
- Reformer(2019)
- LogSparse self-attention(2019)
Metrics
- MSE
- MAE
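For reference, a minimal NumPy version of the two metrics as usually defined (not taken from the paper's evaluation code):

```python
import numpy as np

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error over all predicted points."""
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error over all predicted points."""
    return float(np.mean(np.abs(y_true - y_pred)))
```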
Platform:
- a single Nvidia V100 32GB GPU