• Unsupervised keyphrase extraction method with content and topic constraints

    • Abstract: Unsupervised keyphrase extraction automatically identifies keyphrases that summarize the core content and topics of a document, and it is widely used in tasks such as information retrieval, text summarization, and topic modeling. Existing unsupervised methods usually score a candidate phrase by its similarity to the document in a high-dimensional semantic space. This captures how well the phrase matches the document's overall semantics, but it does not adequately model the consistency between the phrase and the document's topic, which limits the accuracy and semantic consistency of the extracted results. To address this, an unsupervised keyphrase extraction method that combines content and topic constraints is proposed. The method is based on the T5 model: self-attention scores produced by the encoder capture the association between each candidate phrase and the document content, and a prompt template fed to the decoder is used to compute a generation probability that measures the candidate's semantic relevance and topic consistency. By combining the self-attention mechanism with the prompt-based generation mechanism, the model extracts keyphrases that closely match both the semantics and the topic of the text without supervision. Experimental results on the public SemEval2017, Inspec, and SemEval2010 datasets show that the proposed method significantly outperforms current mainstream unsupervised methods in F1 score.
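The two-signal ranking described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how attention is aggregated or how the two scores are combined, so the column-mean aggregation, the length-normalised probability, the weighted geometric combination, and the names `content_score`, `topic_score`, `rank_candidates`, and `alpha` are all assumptions made for illustration. The attention matrix and log-probabilities are toy values standing in for what a T5 encoder and decoder would produce.

```python
import math


def content_score(attention, span):
    """Mean attention that the tokens in `span` receive from all document tokens.

    attention[i][j] is the weight token i places on token j (toy values here).
    span is a (start, end) pair with `end` exclusive.
    Assumption: column-mean aggregation; the abstract does not specify one.
    """
    start, end = span
    n = len(attention)
    received = [sum(attention[i][j] for i in range(n)) / n
                for j in range(start, end)]
    return sum(received) / len(received)


def topic_score(token_logprobs):
    """Length-normalised generation probability of a candidate's tokens.

    token_logprobs would come from a decoder conditioned on a prompt template;
    here they are supplied directly. Normalising by length is an assumption.
    """
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def rank_candidates(candidates, attention, alpha=0.5):
    """Rank candidates by a weighted geometric mean of the two scores.

    The combination rule and the weight `alpha` are hypothetical.
    """
    scored = []
    for cand in candidates:
        s_content = content_score(attention, cand["span"])
        s_topic = topic_score(cand["logprobs"])
        score = (s_content ** alpha) * (s_topic ** (1 - alpha))
        scored.append((cand["phrase"], score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy document of four tokens; every row of the attention matrix sums to 1.
attention = [
    [0.1, 0.4, 0.4, 0.1],
    [0.1, 0.4, 0.4, 0.1],
    [0.1, 0.4, 0.4, 0.1],
    [0.1, 0.4, 0.4, 0.1],
]
candidates = [
    {"phrase": "keyphrase extraction", "span": (1, 3), "logprobs": [-0.2, -0.4]},
    {"phrase": "method", "span": (3, 4), "logprobs": [-2.0]},
]

ranked = rank_candidates(candidates, attention)
for phrase, score in ranked:
    print(f"{phrase}: {score:.3f}")
```

In this toy setting a candidate ranks highly only when it both attracts attention from the document (content constraint) and is likely to be generated under the prompt (topic constraint); a multiplicative combination penalises a candidate that is strong on one signal but weak on the other.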

       
