
The pooled vectors p₁,…,p_K are concatenated and fed to the classification head. By allowing multiple “pools,” ATP can capture both short‑term actions and long‑range context.

For a video of T clips we obtain a feature sequence F = (f₁, …, f_T). ATP learns a set of K pooling weights wₖ ∈ ℝᵀ via a small 1‑D convolution followed by a softmax over the temporal axis.
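The pooling step can be illustrated with a minimal NumPy sketch. It is one plausible reading of the description, not the authors' implementation. It assumes clip features F of shape (T, d), a 1‑D convolution over time with an odd kernel size that maps the d feature channels to K score channels, a softmax over the T time steps, and pooled vectors pₖ = Σₜ wₖ[t]·fₜ. The function name `atp_pool`, the kernel size of 3, and the random weights are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def atp_pool(F, conv_w, conv_b):
    """Sketch of K-way temporal pooling (assumed form, see lead-in).

    F:      (T, d) clip features f_1..f_T
    conv_w: (K, d, ksize) 1-D convolution kernels, ksize assumed odd
    conv_b: (K,) convolution biases
    Returns the concatenated pooled vector (K * d,) and the weights (K, T).
    """
    T, d = F.shape
    K, _, ksize = conv_w.shape
    pad = ksize // 2
    F_pad = np.pad(F, ((pad, pad), (0, 0)))  # zero-pad along time only

    scores = np.empty((K, T))
    for t in range(T):
        window = F_pad[t:t + ksize]  # (ksize, d) temporal neighbourhood of clip t
        scores[:, t] = np.einsum("kcj,jc->k", conv_w, window) + conv_b

    w = softmax(scores, axis=1)  # each w_k sums to 1 over the T clips
    pooled = w @ F               # (K, d): p_k = sum_t w_k[t] * f_t
    return pooled.reshape(-1), w


# Example with hypothetical sizes: 8 clips, 4-dim features, 3 pools.
rng = np.random.default_rng(0)
F = rng.normal(size=(8, 4))
conv_w = 0.1 * rng.normal(size=(3, 4, 3))
conv_b = np.zeros(3)
p, w = atp_pool(F, conv_w, conv_b)
```

With all‑zero convolution weights every wₖ becomes uniform and each pool reduces to plain average pooling, which is a convenient sanity check for the sketch.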

\[
\begin{aligned}
g_v &= \sigma(W_g v + b_g), \quad g_a = \sigma(W_g a + b_g), \\
\tilde{v} &= \text{LN}\Big(v + \text{Softmax}\big((g_v \odot v)(g_a \odot a)^\top / \sqrt{n}\big)\,(g_a \odot a)\Big), \\
\tilde{a} &= \text{LN}\Big(a + \text{Softmax}\big((g_a \odot a)(g_v \odot v)^\top / \sqrt{n}\big)\,(g_v \odot v)\Big),
\end{aligned}
\]
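The gated cross‑modal update above can be sketched in NumPy as follows. This is an illustrative reading of the equations, not a reference implementation. It assumes v and a are matrices of shape (T_v, n) and (T_a, n) with one row per time step, that n is the shared feature dimension, that σ is the logistic sigmoid, that the softmax runs over the key axis, and that LN is a layer normalization without learnable scale or shift. The function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-5):
    # Assumption: plain layer norm over the feature axis, no learnable parameters.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def gated_cross_attention(v, a, W_g, b_g):
    """v: (T_v, n) visual features, a: (T_a, n) audio features,
    W_g: (n, n) and b_g: (n,) gate parameters shared by both modalities."""
    n = v.shape[-1]
    g_v = sigmoid(v @ W_g.T + b_g)  # g_v = sigma(W_g v + b_g), row-wise
    g_a = sigmoid(a @ W_g.T + b_g)  # g_a = sigma(W_g a + b_g), row-wise
    gv, ga = g_v * v, g_a * a       # gated features g ⊙ x

    # Visual rows attend over gated audio rows, and vice versa.
    v_tilde = layer_norm(v + softmax(gv @ ga.T / np.sqrt(n)) @ ga)
    a_tilde = layer_norm(a + softmax(ga @ gv.T / np.sqrt(n)) @ gv)
    return v_tilde, a_tilde


# Example with hypothetical sizes: 6 visual steps, 10 audio steps, n = 16.
rng = np.random.default_rng(0)
v = rng.normal(size=(6, 16))
a = rng.normal(size=(10, 16))
W_g = 0.1 * rng.normal(size=(16, 16))
b_g = np.zeros(16)
v_tilde, a_tilde = gated_cross_attention(v, a, W_g, b_g)
```

Because the same gated features serve as both keys and values, each direction needs only one matrix product per modality beyond the gate itself; the two sequences may have different lengths as long as they share the feature dimension n.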

| Category | Representative Works | Key Idea | Limitations |
|---|---|---|---|
| Visual‑only | SlowFast [1], ViViT [2] | Spatiotemporal convolutions / transformers | No audio information |
| Audio‑only | WaveNet [3], PANNs [4] | Raw waveform / spectrogram modeling | No visual context |
| Early Fusion | AVFusion [5] | Concatenate raw frames + spectrograms | Temporal misalignment |
| Late Fusion | Two‑Stream LSTM [6] | Separate predictions + averaging | Ignores cross‑modal dynamics |
| Intermediate Fusion | Cross‑modal Transformers [7] | Shared self‑attention | High memory/computation |
| Hybrid | MMT [8] | Modality‑specific backbones + cross‑attention | Still computationally heavy |

The results confirm that both the gated cross‑modal attention and ATP are crucial for performance, while the fully‑connected design preserves speed.