Jump Cut Smoothing for Talking Heads

University of Washington, Adobe Research

TL;DR: Given a talking head video, we remove filler and repetitive words, smooth the resulting jump cuts, and output a seamless video.

Abstract

A jump cut creates an abrupt, sometimes unwanted change in the viewing experience. We present a novel framework for smoothing these jump cuts in the context of talking head videos. We leverage the appearance of the subject from other source frames in the video, fusing it with a mid-level representation driven by DensePose keypoints and face landmarks. To achieve motion, we interpolate the keypoints and landmarks between the end frames around the cut. We then use an image translation network to synthesize pixels from the keypoints and source frames. Because keypoints can contain errors, we propose a cross-modal attention scheme to select the most appropriate source amongst multiple options for each keypoint. By leveraging this mid-level representation, our method achieves stronger results than a strong video interpolation baseline. We demonstrate our method on various jump cuts in talking head videos, such as cutting filler words, pauses, and even random cuts. Our experiments show that we can achieve seamless transitions, even in challenging cases where the talking head rotates or moves drastically across the jump cut.
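As a concrete illustration of the interpolation step, the sketch below interpolates a set of 2D keypoints between the two end frames of a cut, with an ease-in/ease-out schedule. The array shapes, easing choice, and function name are our own assumptions for illustration, not the paper's code.

    import numpy as np

    def interpolate_keypoints(kp_before, kp_after, num_frames):
        """Interpolate keypoints across a jump cut.

        kp_before, kp_after: (K, 2) arrays of 2D keypoints (DensePose points
        augmented with face landmarks) from the end frames around the cut.
        Returns a (num_frames, K, 2) array of intermediate keypoint sets.
        """
        # Interior time steps only; the end frames themselves already exist.
        t = np.linspace(0.0, 1.0, num_frames + 2)[1:-1]
        # Smoothstep easing so motion starts and stops gently; using t
        # directly instead gives plain linear interpolation.
        t = t * t * (3.0 - 2.0 * t)
        return (1.0 - t)[:, None, None] * kp_before + t[:, None, None] * kp_after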

Method

In the training stage, we randomly sample source (green rectangle) and target (red rectangle) frames, and extract their corresponding DensePose keypoints augmented with facial landmarks (not shown here for simplicity). Our method extracts source dense keypoint features as keys, target dense keypoint features as queries, and source image features as values; cross-attention is then applied to gather the values for each query, yielding a warped feature. This warped feature is fed into a generator inspired by CoModGAN to synthesize a realistic target image, which is compared against the ground-truth target frame. To apply jump cut smoothing at inference, we interpolate dense keypoints between the jump cut end frames, and synthesize the transition frames from the interpolated keypoint sequence (yellow rectangle).
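At its core, this cross-modal attention is standard scaled dot-product attention with the modalities split across the three roles. Below is a minimal PyTorch sketch of that computation; the tensor shapes, flattened spatial layout, and function name are our own assumptions, not the released implementation.

    import torch

    def warp_by_cross_attention(q_tgt_kp, k_src_kp, v_src_img):
        """Warp source image features to the target pose via cross-attention.

        q_tgt_kp:  (B, N_tgt, C) target dense-keypoint features -> queries
        k_src_kp:  (B, N_src, C) source dense-keypoint features -> keys
        v_src_img: (B, N_src, C) source image features          -> values
        With multiple source frames, keys/values are concatenated along
        N_src so each query can pick the most appropriate source location.
        """
        scale = q_tgt_kp.shape[-1] ** -0.5
        # (B, N_tgt, N_src) attention weights from keypoint-feature similarity
        attn = torch.softmax(q_tgt_kp @ k_src_kp.transpose(1, 2) * scale, dim=-1)
        # Warped feature: attention-weighted sum of source image features
        return attn @ v_src_img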

Jump Cut Smoothing for Filler Words Removal

Given an input talking head video, we apply a filler-word detection algorithm to remove the filler words, and also manually cut unnecessary pauses and other repetitive words. This creates unnatural jump cuts in the video. We then apply our method to smooth the jump cuts, producing a fluent talking video.
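As a sketch of this editing step, suppose the detector returns (start, end) timestamps in seconds; the hypothetical helper below drops those frame ranges and records each resulting jump cut's end-frame pair, which is exactly the input the smoothing model needs.

    def cut_segments(frames, fps, filler_spans):
        """Remove detected filler-word spans and record the resulting jump cuts.

        frames: list of video frames; fps: frames per second.
        filler_spans: list of (start_sec, end_sec) intervals to delete.
        Returns (kept_frames, cuts), where cuts holds (last_frame_before,
        first_frame_after) index pairs into kept_frames, i.e., the end
        frames between which transition frames should be synthesized.
        """
        drop = set()
        for start, end in filler_spans:
            drop.update(range(int(start * fps), int(end * fps) + 1))

        kept, cuts, prev_orig_idx = [], [], None
        for i, frame in enumerate(frames):
            if i in drop:
                continue
            if prev_orig_idx is not None and i != prev_orig_idx + 1:
                cuts.append((len(kept) - 1, len(kept)))  # jump cut boundary
            kept.append(frame)
            prev_orig_idx = i
        return kept, cuts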

Input video with filler words · Filler-word removal creates abrupt jump cuts · Our smoothed video (five video examples, each laid out in these three columns)
Baseline Comparisons

We show the video transitions created by our method in comparison with the state-of-the-art frame interpolation method FILM, played in slow motion, and highlight the facial details in the rightmost column. Note that FILM exhibits severe facial distortion, especially when the head rotates from one side to the other.

Four comparison videos (example1_comp through example4_comp), each showing, left to right: FILM · Ours + FILM kpts · Ours

Transition Control with Facial Landmark Manipulation

Given the same jump cut end frames, we show our synthesized transition sequences with different facial landmark trajectories: on the left, we use linearly interpolated facial landmarks; in the middle, we close the mouth mid-transition while keeping other facial regions unchanged from the left; on the right, we provide a facial landmark trajectory that simulates a natural speaking sequence.

Linearly interpolate the face landmarks · Close the mouth in the middle · Simulate a natural speaking sequence
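A minimal sketch of the middle variant, assuming the standard 68-point landmark layout in which indices 48–67 form the mouth: the mouth points are pulled toward their vertical center with a weight that peaks mid-transition, while all other landmarks keep their linearly interpolated path. The index convention and blending weight are our own illustration, not the authors' code.

    import numpy as np

    MOUTH = slice(48, 68)  # mouth indices in the standard 68-point layout

    def close_mouth_midway(traj):
        """Edit a (T, 68, 2) landmark trajectory so the mouth closes mid-transition.

        traj: linearly interpolated landmarks; a modified copy is returned.
        """
        traj = traj.copy()
        T = traj.shape[0]
        for t in range(T):
            # Weight peaks at the midpoint (1.0) and falls to 0.0 at the ends.
            w = 1.0 - abs(2.0 * t / (T - 1) - 1.0)
            mouth = traj[t, MOUTH]
            # Collapse mouth points toward their vertical center to close the lips.
            center_y = mouth[:, 1].mean()
            mouth[:, 1] = (1.0 - w) * mouth[:, 1] + w * center_y
        return traj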

Attention Visualizations

We highlight the correspondences learned by our attention mechanism, keeping only pairs with a peak attention score of at least 0.5.
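Concretely, such a visualization can be produced by taking, for each query location, the source location with the highest attention weight and keeping only pairs whose peak score clears the threshold. A small sketch under assumed shapes:

    import torch

    def peak_correspondences(attn, threshold=0.5):
        """Extract strong query -> source correspondences from an attention map.

        attn: (N_tgt, N_src) attention weights (rows sum to 1).
        Returns (tgt_idx, src_idx) index tensors for pairs whose peak
        attention score is >= threshold.
        """
        scores, src_idx = attn.max(dim=-1)  # best source per target query
        keep = scores >= threshold          # only confident matches
        return keep.nonzero(as_tuple=True)[0], src_idx[keep]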

BibTeX


    @article{wang2023jumpcut,
      title={Jump Cut Smoothing for Talking Heads},
      author={Xiaojuan Wang and Taesung Park and Yang Zhou and Eli Shechtman and Richard Zhang},
      journal={arXiv},
      year={2023}
    }