Learning Rolling Shutter Correction from Real Data without Camera Motion Assumption


An overview of the proposed system: given a rolling shutter (RS) image, DepthNet generates pixel-wise depth, while AnchorNet predicts a set of pose anchors that are interpolated to estimate a camera pose for each image row. The pose estimates and the depth map are then used in a geometric projection to recover the corresponding global shutter (GS) image. [Red: input RS image; Yellow: deep neural networks; Green: network outputs; Blue: recovered GS image]
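The geometric projection step in this overview can be illustrated for a single pixel. The following is a minimal sketch, not the authors' implementation: it assumes a pinhole camera with hypothetical intrinsics `fx, fy, cx, cy`, an axis-angle plus translation pose parameterization, and a pose convention that maps points from the camera frame of the pixel's row into the GS reference frame. None of these choices is stated in the source.

```python
import math


def rodrigues(rvec):
    """Convert an axis-angle vector into a 3x3 rotation matrix (nested lists)."""
    theta = math.sqrt(sum(c * c for c in rvec))
    if theta < 1e-12:
        # Negligible rotation: return the identity.
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = (c / theta for c in rvec)
    c, s = math.cos(theta), math.sin(theta)
    v = 1.0 - c
    return [
        [c + kx * kx * v, kx * ky * v - kz * s, kx * kz * v + ky * s],
        [ky * kx * v + kz * s, c + ky * ky * v, ky * kz * v - kx * s],
        [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
    ]


def rs_pixel_to_gs(u, v, depth, pose, fx, fy, cx, cy):
    """Map an RS pixel (u, v) with known depth to GS image coordinates.

    pose = (rx, ry, rz, tx, ty, tz) is ASSUMED to take points from the
    camera frame at row v to the GS reference frame.
    """
    # Back-project the pixel to a 3D point in the row's camera frame.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    z = depth
    # Move the point into the GS reference frame.
    R = rodrigues(pose[:3])
    t = pose[3:]
    p = [R[i][0] * x + R[i][1] * y + R[i][2] * z + t[i] for i in range(3)]
    # Re-project with the same pinhole intrinsics.
    return fx * p[0] / p[2] + cx, fy * p[1] / p[2] + cy
```

With an all-zero pose the pixel maps to itself; a small sideways translation shifts near points more than far ones, which is why the depth map is needed alongside the row poses.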




The rolling shutter mechanism in modern cameras introduces distortions because images are formed on the sensor through a row-by-row readout process; this is highly undesirable for photography and for vision-based algorithms (e.g., structure-from-motion and visual SLAM). In this paper, we propose a deep neural network that predicts depth and camera poses for single-frame rolling shutter correction. Unlike state-of-the-art methods, the proposed method makes no assumptions about camera motion. This is made possible by training on real images captured by rolling shutter cameras rather than on synthetic images generated under specific motion assumptions. Consequently, the proposed method performs better on real rolling shutter images. This allows numerous vision-based algorithms to use imagery captured by rolling shutter cameras and still produce highly accurate results. Our evaluations on the TUM rolling shutter dataset using DSO and COLMAP validate the accuracy and robustness of the proposed method.




Data Generation


The data capture device and an overview of the image processing pipeline. [Red: input RS image; Green: ground truth; Blue: recovered GS image]




The network architecture of AnchorNet: a backbone feature-extraction network (ResNet-34 or VGG-16) is followed by five convolutional blocks that learn 6N parameters, where N denotes the number of anchors. Note that AnchorNet reduces to VelocityNet with the default ResNet-34 backbone and N=1 anchor.
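To make the role of the 6N output parameters concrete, here is a minimal sketch of how such a vector could be split into N six-degree-of-freedom anchors and interpolated into one pose per image row. Several details are assumptions rather than facts from the source: the anchors are taken to sit at evenly spaced rows with the last one on the final row, an identity (all-zero) pose is assumed at row 0, and the six components are interpolated linearly, which is only an approximation for rotations. The function names `split_anchors` and `interpolate_row_poses` are hypothetical.

```python
def split_anchors(params):
    """Split a flat list of 6N values into N pose anchors of 6 values each."""
    if len(params) == 0 or len(params) % 6 != 0:
        raise ValueError("expected a non-empty list of 6N values")
    return [list(params[i:i + 6]) for i in range(0, len(params), 6)]


def interpolate_row_poses(anchors, num_rows):
    """Linearly interpolate one 6-DoF pose vector per image row.

    ASSUMPTIONS: identity pose at row 0, anchors evenly spaced over the
    rows, the last anchor located at the final row.
    """
    if not anchors:
        raise ValueError("at least one anchor is required")
    knots = [[0.0] * 6] + [list(a) for a in anchors]
    n = len(anchors)
    last = max(num_rows - 1, 1)
    poses = []
    for row in range(num_rows):
        t = row / last * n        # position along the knots, in [0, n]
        i = min(int(t), n - 1)    # index of the left knot
        w = t - i                 # blend weight toward the right knot
        poses.append([(1 - w) * a + w * b
                      for a, b in zip(knots[i], knots[i + 1])])
    return poses
```

Under these assumptions, N=1 yields poses that grow linearly with the row index, i.e. a constant-velocity motion over the readout, which is consistent with the caption's remark that the single-anchor case reduces to VelocityNet.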




Data Generation Verification


Top to bottom: scenes reconstructed by DSO on RS images, on GS1 images, on images corrected with 1 anchor, and on images corrected with 8 anchors.


Network Evaluation


End-point errors (EPEs) plotted against acceleration for different numbers of anchors. The EPEs are sorted by acceleration and averaged over the nearest 200 samples.
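The smoothing described in this caption can be sketched as a sort followed by a moving average. This is an illustrative reading, not the authors' plotting code: the window is assumed to be centered on each sample and truncated at the ends of the sorted sequence, and the function name `smooth_epe_by_acceleration` is hypothetical.

```python
def smooth_epe_by_acceleration(accels, epes, window=200):
    """Sort EPEs by acceleration and average each with its neighbors.

    ASSUMPTION: a centered window of up to 2 * (window // 2) + 1 samples,
    shortened near the ends of the sorted sequence.
    """
    if len(accels) != len(epes):
        raise ValueError("accels and epes must have the same length")
    pairs = sorted(zip(accels, epes), key=lambda p: p[0])
    sorted_acc = [a for a, _ in pairs]
    sorted_epe = [e for _, e in pairs]

    # Prefix sums make every window average an O(1) lookup.
    prefix = [0.0]
    for e in sorted_epe:
        prefix.append(prefix[-1] + e)

    half = window // 2
    n = len(sorted_epe)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        smoothed.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return sorted_acc, smoothed
```

Plotting `smoothed` against `sorted_acc` for each anchor count would give one curve per network configuration.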



Samples of GS images predicted by the network with 256 anchors. The first column shows the input RS images, the second column shows the predicted GS images, and the third column shows the ground-truth GS1 images.