• Video scene graph generation based on spatial-temporal feature prototypes

    Abstract: Video scene graph generation produces a structured description of a video in which objects are nodes and the relationships between objects are edges, which supports a deeper understanding of visual content. Existing methods are unable to distinguish similar spatial positional relationships, or to distinguish changes in contact relationships caused by variation over time. To address these problems, a video scene graph generation model based on spatial-temporal feature prototypes is proposed. In this model, clustering-based prototype learning is applied to the spatial-temporal features extracted by a spatial-temporal Transformer, yielding spatial feature prototypes and temporal feature prototypes for the relationships. The similarity between the current feature and each spatial feature prototype is computed, and the current feature is fused with the spatial feature prototype at the smallest cosine distance. Because the spatial feature prototypes model relationship features globally, they help distinguish similar spatial positional relationships. The temporal feature prototypes, in turn, model relationship context features based on global temporal features, so fusing them with the current feature combines local and global temporal information and helps distinguish dynamic relationship changes. The spatial feature prototype generator and the temporal feature prototype generator describe the feature prototypes of interaction relationships in space and in time, respectively, and the two are strongly complementary in learning spatial and temporal relationship features. Experiments on the Action Genome dataset show that the proposed model outperforms existing video scene graph generation methods.
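The prototype matching step described in the abstract (compute similarity against each spatial feature prototype, then fuse with the one at the smallest cosine distance) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the toy two-dimensional prototypes, and the convex-combination fusion with weight `alpha` are all assumptions made for illustration; the abstract does not specify how fusion is performed or how the prototypes are clustered, so the prototypes here are simply given as fixed vectors.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as orthogonal to everything
    return 1.0 - dot / (norm_a * norm_b)


def nearest_prototype(feature, prototypes):
    """Index of the prototype with the smallest cosine distance to feature."""
    return min(
        range(len(prototypes)),
        key=lambda i: cosine_distance(feature, prototypes[i]),
    )


def fuse(feature, prototype, alpha=0.5):
    """Hypothetical fusion rule: convex combination of feature and prototype.

    The abstract does not state the fusion operator; this is an assumption.
    """
    return [alpha * f + (1.0 - alpha) * p for f, p in zip(feature, prototype)]


# Toy prototypes, assumed to come from an earlier clustering step.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
feature = [0.9, 0.1]

idx = nearest_prototype(feature, prototypes)
fused = fuse(feature, prototypes[idx])
print(idx, fused)
```

In this toy case the feature lies closest to the first prototype, so the fused vector is pulled toward it. A real model would operate on learned high-dimensional relationship features and would likely use a learned fusion module rather than a fixed weight.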

       
