Enhancing low-resolution, low-frame-rate videos to high-resolution, high-frame-rate quality is essential for a seamless user experience, motivating advancements in Continuous Spatial-Temporal Video Super-Resolution (C-STVSR). While prior methods employ Implicit Neural Representation (INR) for continuous encoding, they often struggle to capture the complexity of video data, relying on simple coordinate concatenation and a pre-trained optical flow network for motion representation.
Interestingly, we find that adding position encoding, contrary to common observations, does not improve performance and can even degrade it. This issue becomes particularly pronounced when combined with pre-trained optical flow networks, which can limit the model's flexibility. To address these issues, we propose BF-STVSR, a C-STVSR framework with two key modules tailored to better represent the spatial and temporal characteristics of video: 1) a B-spline Mapper for smooth temporal interpolation, and 2) a Fourier Mapper for capturing dominant spatial frequencies. Our approach achieves state-of-the-art PSNR and SSIM performance, showing enhanced spatial details and natural temporal consistency.
First, two input frames are encoded as low-resolution feature maps. Based on these features, the Fourier Mapper predicts the dominant frequency information, while the B-spline Mapper predicts smoothly interpolated motion, which is then processed into optical flows at an arbitrary time t. The frequency information is temporally propagated by being warped with the optical flows. Finally, the warped frequency information is decoded to generate a high-resolution interpolated RGB frame.
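The warping step in this pipeline can be illustrated with a minimal sketch. It assumes backward warping with bilinear sampling on a single-channel feature map stored as a list of rows; the function names `bilinear_sample` and `backward_warp` are hypothetical, and the actual warping operator used in BF-STVSR may differ.

```python
import math

def bilinear_sample(feat, x, y):
    """Sample a 2D grid (list of rows) at real-valued (x, y), clamping to the border."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
    bottom = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
    return (1 - ay) * top + ay * bottom

def backward_warp(feat, flow):
    """Assumed convention: out[y][x] = feat sampled at (x + u, y + v), flow[y][x] = (u, v)."""
    h, w = len(feat), len(feat[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            u, v = flow[y][x]
            row.append(bilinear_sample(feat, x + u, y + v))
        out.append(row)
    return out
```

With a zero flow field the output equals the input; a non-zero flow pulls each output value from the displaced location in the source feature map.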
(a) Fourier Mapper estimates the dominant frequency and its amplitude to capture fine-detail information from the given frames. (b) B-spline Mapper estimates B-spline coefficients to model inherent motion, which smoothly interpolates motion features temporally.
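The two ideas in this caption can be sketched numerically. The sketch assumes a uniform cubic B-spline for temporal interpolation and an amplitude-weighted cosine/sine pair for the frequency feature; both the spline order and the exact feature form are assumptions made for illustration, and `bspline_interp` and `fourier_feature` are hypothetical names rather than the authors' implementation.

```python
import math

def cubic_bspline_basis(s):
    """Uniform cubic B-spline basis weights for local parameter s in [0, 1]."""
    return (
        (1 - s) ** 3 / 6.0,
        (3 * s**3 - 6 * s**2 + 4) / 6.0,
        (-3 * s**3 + 3 * s**2 + 3 * s + 1) / 6.0,
        s**3 / 6.0,
    )

def bspline_interp(coeffs, t):
    """Evaluate a uniform cubic B-spline with scalar coefficients at t in [0, 1]."""
    n = len(coeffs)
    if n < 4:
        raise ValueError("a cubic B-spline needs at least 4 coefficients")
    pos = t * (n - 3)              # map t onto the n - 3 spline segments
    seg = min(int(pos), n - 4)     # segment index, clamped so t = 1 stays valid
    s = pos - seg                  # local parameter inside the segment
    basis = cubic_bspline_basis(s)
    return sum(b * coeffs[seg + k] for k, b in enumerate(basis))

def fourier_feature(amplitude, freq, coord):
    """Amplitude-weighted (cos, sin) pair for one 2D frequency at one 2D coordinate."""
    phase = math.pi * (freq[0] * coord[0] + freq[1] * coord[1])
    return (amplitude * math.cos(phase), amplitude * math.sin(phase))
```

Because the basis weights are smooth in `s` and always sum to one, the interpolated value varies smoothly with `t`, which is the property the caption attributes to the B-spline Mapper.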
Performance comparison with fixed-scale STVSR baselines on the Vid4, GoPro, and Adobe240 datasets. Results are evaluated using PSNR (dB) and SSIM metrics. All frames are interpolated by a factor of ×4 in the spatial axis and ×8 in the temporal axis. “Average” refers to metrics calculated across all 8 interpolated frames, while “Center” refers to metrics measured using the 1st, 4th and 9th frames of the interpolated sequence (that is, single-frame interpolation). Red and blue indicate the best and the second-best performance, respectively.
Performance comparison with C-STVSR baselines at out-of-distribution scales on the GoPro dataset. Results are evaluated using PSNR (dB) and SSIM metrics. All frames are interpolated by the scaling factor specified in the table, and metrics are calculated across all interpolated frames. Bold indicates the best performance.
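For reference, the PSNR metric and the difference between averaging over all frames and over a chosen subset can be sketched as follows. The sketch assumes frames are flattened lists of pixel values in [0, 1]; `psnr` and `mean_psnr` are hypothetical helper names, and the evaluation scripts behind the tables may apply additional steps (for example colour-space conversion) that are not shown here.

```python
import math

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val**2 / mse)

def mean_psnr(preds, targets, indices=None):
    """Mean PSNR over all frames, or only over the given 0-based frame indices."""
    if indices is None:
        indices = range(len(preds))
    scores = [psnr(preds[i], targets[i]) for i in indices]
    return sum(scores) / len(scores)
```

Calling `mean_psnr(preds, targets)` averages over every frame, while passing `indices` restricts the average to the selected frames, which mirrors the "Average" versus "Center" distinction in the caption.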
Qualitative comparison at an in-distribution scale, ×4 in spatial scale and ×8 in temporal scale, on the GoPro dataset.
Qualitative comparison at an in-distribution scale, ×4 in spatial scale and ×12 in temporal scale, on the GoPro dataset.
@article{kim2025bf,
author = {Kim, Eunjin and Kim, Hyeonjin and Jin, Kyong Hwan and Yoo, Jaejun},
title = {BF-STVSR: B-Splines and Fourier-Best Friends for High Fidelity Spatial-Temporal Video Super-Resolution},
journal = {arXiv},
year = {2025}
}