Abstract:
Existing video summarization algorithms and their evaluation methods do not fully account for the characteristics of video data perceived by industrial intelligent terminals or the application requirements of industrial intelligent perception. To address these limitations, this paper revises the representativeness and diversity evaluation constraints and, building on these improvements, proposes a hybrid bidirectional multi-layer industrial video summarization framework that integrates Depthwise Convolution (DWConv) and Convolutional Long Short-Term Memory (ConvLSTM). The framework comprises three primary components: global coarse-grained feature extraction, local fine-grained feature extraction, and query-driven feedback-based feature fusion. To address the significant redundancy inherent in industrial data, a global feature extraction module is developed by combining ConvLSTM with an attention mechanism. To comprehensively capture the spatiotemporal characteristics of video data, a local feature extraction module is built by integrating the attention mechanism with DB-DWConvLSTM. Exploiting the periodicity and local stability of industrial data, a fused DWConv feedback module is designed, inspired by the principles of residual networks. Furthermore, to emphasize the salient features of key frames and improve key-frame selection, a query-driven feature fusion module and a secondary screening method for summary evaluation are investigated. The effectiveness and practicality of the proposed scheme are analyzed and verified on the TVSum and SumMe datasets. Experimental results show that the proposed method performs well in cross-validation, ablation studies, and comparative analyses.
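The abstract names ConvLSTM as the core building block of both the global and local feature extraction modules. As a point of reference only, the sketch below shows a minimal ConvLSTM cell in plain NumPy: each gate is computed by a convolution over the concatenation of the current frame features and the previous hidden state. This is a generic textbook-style cell, not the paper's DB-DWConvLSTM or attention-augmented variant; all names and shapes here are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, w):
    # Naive 'same'-padded 2-D cross-correlation.
    # x: (C_in, H, W), w: (C_out, C_in, k, k) with odd k.
    C_out, C_in, k, _ = w.shape
    pad = k // 2
    H, W = x.shape[1], x.shape[2]
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((C_out, H, W))
    for o in range(C_out):
        for i in range(C_in):
            for dy in range(k):
                for dx in range(k):
                    out[o] += w[o, i, dy, dx] * xp[i, dy:dy + H, dx:dx + W]
    return out

class ConvLSTMCell:
    """Minimal ConvLSTM cell: LSTM gates realized as convolutions
    over the stacked [input frame, previous hidden state]."""
    def __init__(self, c_in, c_hid, k=3, seed=0):
        rng = np.random.default_rng(seed)
        self.c_hid = c_hid
        # One small random weight tensor per gate:
        # i = input gate, f = forget gate, o = output gate, c = candidate.
        self.w = {g: rng.standard_normal((c_hid, c_in + c_hid, k, k)) * 0.1
                  for g in "ifoc"}

    def step(self, x, h, c):
        z = np.concatenate([x, h], axis=0)        # stack input and state
        i = sigmoid(conv2d_same(z, self.w["i"]))
        f = sigmoid(conv2d_same(z, self.w["f"]))
        o = sigmoid(conv2d_same(z, self.w["o"]))
        g = np.tanh(conv2d_same(z, self.w["c"]))
        c = f * c + i * g                         # cell-state update
        h = o * np.tanh(c)                        # new hidden state
        return h, c

# Toy usage: run the cell over a short sequence of 4x4 single-channel frames.
cell = ConvLSTMCell(c_in=1, c_hid=2, k=3)
h = np.zeros((2, 4, 4))
c = np.zeros((2, 4, 4))
for t in range(3):
    frame = np.full((1, 4, 4), float(t))
    h, c = cell.step(frame, h, c)
```

Because the hidden state retains spatial dimensions, per-frame features keep their layout across time, which is what makes the convolutional recurrence suitable for capturing the spatiotemporal structure the abstract refers to.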