Abstract:
Unsupervised keyphrase extraction automatically identifies phrases that summarize the core content and themes of a document, and it is widely applied in tasks such as information retrieval, text summarization, and topic modeling. Existing unsupervised methods usually assess the importance of a candidate phrase by computing its similarity to the document in a high-dimensional semantic space. Although these methods capture the correlation between a phrase and the overall semantics of the document, they do not fully model the consistency between the phrase and the document's theme, which limits the accuracy and semantic consistency of the extracted results. To address this limitation, this paper proposes an unsupervised keyphrase extraction method that combines content and topic constraints. Built on the T5 model, the method uses the self-attention scores produced by the encoder to capture the association between candidate phrases and the document content, and uses a prompt template in the decoder to compute generation probabilities that measure the semantic relevance and topic consistency of each candidate phrase. By combining the self-attention mechanism with the prompt-based generation mechanism, the model extracts keyphrases that are highly consistent with both the semantics and the theme of the text without any supervision. Experimental results on the public datasets SemEval2017, Inspec, and SemEval2010 show that the proposed method significantly outperforms current mainstream unsupervised methods in terms of F1 score.
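
The two signals described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not state how the encoder attention is aggregated or how the two scores are combined, so the column-sum aggregation, the min-max normalization, the linear weight `alpha`, and all function names below are assumptions made for illustration. The attention matrix and token log-probabilities are hand-written toy values standing in for what a T5 encoder and a prompt-conditioned T5 decoder would produce.

```python
def content_score(attention, span):
    """Mean attention received by the candidate's tokens (assumed aggregation).

    attention[i][j] is the weight token i places on token j.
    span is a half-open (start, end) token range covering the candidate.
    """
    start, end = span
    n = len(attention)
    received = [sum(attention[i][j] for i in range(n)) for j in range(start, end)]
    return sum(received) / len(received)


def topic_score(token_log_probs):
    """Length-normalized log-probability of the candidate under a prompt."""
    return sum(token_log_probs) / len(token_log_probs)


def min_max(values):
    """Rescale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_candidates(candidates, attention, alpha=0.5):
    """Rank (phrase, span, token_log_probs) triples by a weighted sum.

    alpha is a hypothetical mixing weight between content and topic scores.
    """
    content = min_max([content_score(attention, span) for _, span, _ in candidates])
    topic = min_max([topic_score(lp) for _, _, lp in candidates])
    combined = [alpha * c + (1 - alpha) * t for c, t in zip(content, topic)]
    order = sorted(range(len(candidates)), key=lambda k: combined[k], reverse=True)
    return [(candidates[k][0], combined[k]) for k in order]


# Toy 4-token document; each row is one token's attention distribution.
attention = [
    [0.4, 0.3, 0.2, 0.1],
    [0.3, 0.4, 0.2, 0.1],
    [0.3, 0.3, 0.3, 0.1],
    [0.3, 0.3, 0.2, 0.2],
]

# Toy candidates with made-up decoder log-probabilities per generated token.
candidates = [
    ("neural network", (0, 2), [-0.5, -0.7]),
    ("paper", (3, 4), [-2.0]),
]

ranked = rank_candidates(candidates, attention)
for phrase, score in ranked:
    print(phrase, round(score, 3))
```

In this toy setting the candidate that both receives more attention and is more probable under the prompt ranks first. With a real model, the attention matrix and the log-probabilities would have to be extracted from T5 itself, and the aggregation and weighting would need to follow the choices made in the full paper.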