Our approach is generic and builds on top of any given 2D spatial network. In terms of wall-clock runtime, it trains 16.1× faster and runs 5.1× faster during inference than other state-of-the-art methods while maintaining competitive accuracy. It enables whole-video analysis in a single end-to-end pass while requiring 1.5× fewer GFLOPs. We report competitive results on the Kinetics-400 and Moments in Time benchmarks and present an ablation study of VTN properties and the trade-off between accuracy and inference speed.
We hope our approach will serve as a new baseline and start a fresh line of research in the video recognition domain.
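As a concrete illustration of the design described above, the sketch below composes a per-frame 2D backbone with a temporal Transformer encoder so that a whole clip is classified in one forward pass. This is a minimal PyTorch sketch under stated assumptions: the ResNet-50 backbone, the vanilla nn.TransformerEncoder temporal module, the learned CLS token and positional embedding, and all hyperparameters are illustrative choices, not the exact VTN configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class FramesToFeatures(nn.Module):
    """Apply a 2D spatial network to every frame of a clip."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        backbone = models.resnet50(weights=None)  # any 2D spatial network fits here
        backbone.fc = nn.Identity()               # keep the 2048-d pooled features
        self.backbone = backbone
        self.proj = nn.Linear(2048, embed_dim)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, time, channels, height, width)
        b, t, c, h, w = video.shape
        feats = self.backbone(video.reshape(b * t, c, h, w))  # (b*t, 2048)
        return self.proj(feats).reshape(b, t, -1)             # (b, t, embed_dim)


class VideoTransformer(nn.Module):
    """Per-frame 2D features + temporal attention, classified in one pass."""

    def __init__(self, num_classes: int, embed_dim: int = 512, max_frames: int = 64):
        super().__init__()
        self.frames = FramesToFeatures(embed_dim)
        # CLS token and positional embedding are illustrative assumptions.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos = nn.Parameter(torch.zeros(1, 1 + max_frames, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=3)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        x = self.frames(video)                           # (b, t, d)
        cls = self.cls_token.expand(x.size(0), -1, -1)   # prepend a CLS token
        x = torch.cat([cls, x], dim=1) + self.pos[:, : x.size(1) + 1]
        x = self.temporal(x)                             # attend across all frames
        return self.head(x[:, 0])                        # classify from the CLS token


if __name__ == "__main__":
    model = VideoTransformer(num_classes=400)   # e.g. Kinetics-400
    clip = torch.randn(2, 16, 3, 224, 224)      # two clips of 16 RGB frames
    print(model(clip).shape)                    # torch.Size([2, 400])
```

Because the temporal module attends over the entire frame sequence, the whole video is processed in a single end-to-end pass rather than by averaging predictions over many short crops.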