BF-STVSR: B-Splines and Fourier-Best Friends for High Fidelity Spatial-Temporal Video Super-Resolution

Abstract

While prior methods in Continuous Spatial-Temporal Video Super-Resolution (C-STVSR) employ Implicit Neural Representation (INR) for continuous encoding, they often struggle to capture the complexity of video data, relying on simple coordinate concatenation and pre-trained optical flow networks for motion representation. Interestingly, we find that adding position encoding, contrary to common observations, does not improve performance and can even degrade it. This issue becomes particularly pronounced when combined with pre-trained optical flow networks, which can limit the model's flexibility. To address these issues, we propose BF-STVSR, a C-STVSR framework with two key modules tailored to better represent the spatial and temporal characteristics of video: 1) a B-spline Mapper for smooth temporal interpolation, and 2) a Fourier Mapper for capturing dominant spatial frequencies. Our approach achieves state-of-the-art results on various metrics, including PSNR and SSIM, with enhanced spatial details and natural temporal consistency.

Method

BF-STVSR Overview

First, two input frames are encoded as low-resolution feature maps. Based on these features, the Fourier Mapper predicts the dominant frequency information, while the B-spline Mapper predicts smoothly interpolated motion, which is then processed into optical flows at an arbitrary time t. The frequency information is propagated temporally by warping it with the optical flows. Finally, the warped frequency information is decoded to generate a high-resolution interpolated RGB frame.
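The warping step in this overview can be illustrated with a minimal sketch of backward warping using bilinear sampling. This is a generic textbook formulation on a single-channel grid, not the authors' implementation; the function names `bilinear_sample` and `backward_warp`, the border clamping, and the nested-list data layout are all assumptions made for illustration.

```python
import math

def bilinear_sample(feat, x, y):
    """Sample a 2D grid (list of rows) at fractional (x, y), clamping to the border."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    # Blend the four neighbouring grid values.
    top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
    bottom = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
    return (1 - ay) * top + ay * bottom

def backward_warp(feat, flow):
    """Warp `feat` with a per-pixel flow: out[y][x] = feat sampled at (x + u, y + v).

    `flow[y][x]` is a hypothetical (u, v) displacement in pixels.
    """
    h, w = len(feat), len(feat[0])
    return [
        [bilinear_sample(feat, x + flow[y][x][0], y + flow[y][x][1]) for x in range(w)]
        for y in range(h)
    ]

# Usage: shift a 3x3 grid by one pixel along x.
feat = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
flow = [[(1.0, 0.0)] * 3 for _ in range(3)]
warped = backward_warp(feat, flow)
```

In practice a deep-learning framework would perform this on multi-channel feature tensors; the sketch only shows the sampling logic that "warping with the optical flows" refers to.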

B-spline Mapper and Fourier Mapper

(a) B-spline Mapper estimates B-spline coefficients to model inherent motion, which smoothly interpolates motion features temporally. (b) Fourier Mapper estimates the dominant frequency and its amplitude to capture fine-detail information from the given frames.
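To make the two ideas concrete, the sketch below evaluates a uniform cubic B-spline from four coefficients and builds a single Fourier feature from an amplitude and a frequency. Both are standard textbook forms chosen for illustration; the exact parameterization used by the B-spline Mapper and Fourier Mapper is not specified here, and the names `cubic_bspline_basis`, `interpolate_motion`, and `fourier_feature` are hypothetical.

```python
import math

def cubic_bspline_basis(u):
    """Uniform cubic B-spline blending weights for a local parameter u in [0, 1]."""
    return (
        (1 - u) ** 3 / 6,
        (3 * u**3 - 6 * u**2 + 4) / 6,
        (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6,
        u**3 / 6,
    )

def interpolate_motion(coeffs, t):
    """Blend four control coefficients at normalized time t in [0, 1].

    In this sketch the coefficients are plain scalars; a learned mapper
    would presumably predict them per pixel and per channel.
    """
    return sum(b * c for b, c in zip(cubic_bspline_basis(t), coeffs))

def fourier_feature(amplitude, freq, offset):
    """Return (cos, sin) responses of one 2D frequency at a spatial offset."""
    phase = 2 * math.pi * (freq[0] * offset[0] + freq[1] * offset[1])
    return amplitude * math.cos(phase), amplitude * math.sin(phase)

# Usage: query the motion at an arbitrary time and one frequency response.
motion_at_t = interpolate_motion([0.0, 1.0, 2.0, 3.0], 0.5)
cos_part, sin_part = fourier_feature(2.0, (1.0, 0.0), (0.25, 0.0))
```

The basis weights always sum to one and vary smoothly with t, which is why B-spline blending yields smooth temporal interpolation, while the amplitude and frequency pair lets a small number of terms represent the dominant spatial detail.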

Experiments

Quantitative results

Performance comparison with fixed-scale STVSR baselines on the Vid4, GoPro, and Adobe240 datasets. L_RAFT refers to the optical flow supervision. Results are evaluated using PSNR (dB) and SSIM metrics. All frames are interpolated by a factor of ×4 in the spatial axis and ×8 in the temporal axis. “Average” refers to metrics calculated across all 8 interpolated frames, while “Center” refers to metrics measured using the 1st, 4th, and 9th frames of the interpolated sequence (that is, single-frame interpolation). Red and blue indicate the best and the second-best performance, respectively.

Performance comparison with C-STVSR baselines at out-of-distribution scales on the GoPro dataset. L_RAFT refers to the optical flow supervision. Results are evaluated using PSNR (dB) and SSIM metrics. All frames are interpolated by the scaling factors specified in the table, and metrics are calculated across all interpolated frames. Red and blue indicate the best and the second-best performance, respectively.
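As a reference for how such numbers are typically obtained, here is a minimal sketch of PSNR and of averaging it over either all frames or a chosen subset of frames. It assumes single-channel images stored as nested lists with values in [0, 1]; the helper names `psnr` and `mean_psnr` are hypothetical, and real evaluation protocols may differ (for example, in colour space or cropping).

```python
import math

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    flat_p = [v for row in pred for v in row]
    flat_t = [v for row in target for v in row]
    mse = sum((p - t) ** 2 for p, t in zip(flat_p, flat_t)) / len(flat_p)
    return math.inf if mse == 0 else 10.0 * math.log10(max_val**2 / mse)

def mean_psnr(preds, targets, indices=None):
    """Average PSNR over all frames, or only over the given 0-based frame indices."""
    if indices is None:
        indices = range(len(preds))
    scores = [psnr(preds[i], targets[i]) for i in indices]
    return sum(scores) / len(scores)

# Usage: two tiny frames, scored over all frames and over a subset.
targets = [[[0.6, 0.6]], [[0.6, 0.6]]]
preds = [[[0.5, 0.5]], [[0.59, 0.59]]]
average_score = mean_psnr(preds, targets)
subset_score = mean_psnr(preds, targets, indices=[1])
```

An "Average"-style score corresponds to calling `mean_psnr` with no indices, while a "Center"-style score passes only the frame positions of interest.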

Qualitative results

Qualitative comparison at an in-distribution scale (×4 spatial, ×8 temporal) on the GoPro dataset.

Qualitative comparison at an in-distribution scale (×4 spatial, ×12 temporal) on the GoPro dataset.

BibTeX

@article{kim2025bf,
  author    = {Kim, Eunjin and Kim, Hyeonjin and Jin, Kyong Hwan and Yoo, Jaejun},
  title     = {BF-STVSR: B-Splines and Fourier-Best Friends for High Fidelity Spatial-Temporal Video Super-Resolution},
  journal   = {arXiv},
  year      = {2025}
}

[CVPR 2025] BF-STVSR: B-Splines and Fourier-Best Friends for High Fidelity Spatial-Temporal Video Super-Resolution

BF-STVSR includes two positional embeddings, one for each axis. It captures high-frequency spatial features with the Fourier Mapper and interpolates temporal information smoothly with the B-spline Mapper.