Abstract:
This paper proposes a consistency-guided probabilistic model based on a spatiotemporal deformable Transformer for gaze target detection. The model addresses the shortcomings of existing methods in handling dynamic gaze in videos and improves performance through novel structures and algorithms. It consists of three components: a per-frame gaze model, a spatiotemporal gaze relationship model, and a future semantic gaze estimation module. Experiments on multiple datasets show that the model performs strongly on both video gaze target detection and video co-gaze target detection, outperforming previous methods. Ablation studies confirm the positive contribution of the key modules and of the individual loss functions to overall performance, and visualizations demonstrate the model's effectiveness in dynamic gaze scenarios. This work offers new methods and ideas for gaze target detection and supports the application of related techniques in areas such as human-computer interaction.