Abstract:
A video scene graph is a structured description that takes the objects in a video as nodes and the relationships between objects as edges, which is conducive to a deep understanding of video content. Existing methods struggle to distinguish similar spatial positional relationships and to capture changes in contact relationships caused by temporal variation. To address these issues, a video scene graph generation model based on spatial-temporal feature prototypes is proposed. In this model, cluster prototype learning is applied to the spatial-temporal features extracted by a spatial-temporal Transformer, yielding spatial feature prototypes and temporal feature prototypes of the relationships. The similarity between the current features and each spatial feature prototype is computed, and the current features are fused with the spatial feature prototype at the minimum cosine distance. Because the spatial feature prototypes model the relationship features globally, this fusion helps distinguish similar spatial positional relationships. Meanwhile, the temporal feature prototypes model the relationship context features based on the global temporal features, so fusing them with the current features combines local and global temporal information and helps distinguish dynamic relationship changes. The spatial feature prototype generator and the temporal feature prototype generator correspond to the feature prototypes of the interaction relationships in space and time, respectively, and they complement each other in learning spatial and temporal relationship features. Experiments on the Action Genome dataset show that the proposed model outperforms existing video scene graph generation methods.
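
The spatial prototype matching step described in the abstract can be illustrated with a minimal sketch. It selects the prototype at the minimum cosine distance from a relationship feature and fuses the two. The function names, the toy two-dimensional vectors, and in particular the fusion rule (a convex combination weighted by `alpha`) are assumptions made for illustration only; the abstract does not specify the fusion operator, and this is not the model's actual implementation.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_prototype(feature, prototypes):
    """Return the index of the prototype with minimum cosine distance."""
    distances = [cosine_distance(feature, p) for p in prototypes]
    return min(range(len(prototypes)), key=distances.__getitem__)


def fuse_with_prototype(feature, prototypes, alpha=0.5):
    """Blend a feature with its nearest prototype.

    Assumption: fusion is a convex combination with weight alpha.
    The source does not state which fusion operator is used.
    """
    idx = nearest_prototype(feature, prototypes)
    proto = prototypes[idx]
    fused = [(1.0 - alpha) * f + alpha * p for f, p in zip(feature, proto)]
    return idx, fused


# Toy data (hypothetical): two prototypes and one current feature.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
feature = [0.9, 0.1]
idx, fused = fuse_with_prototype(feature, prototypes)
print(idx, fused)
```

In this toy case the feature lies closest in angle to the first prototype, so it is blended with that one. In practice the prototypes would come from cluster prototype learning over the Transformer features rather than being fixed by hand.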