Automatic keyphrase extraction: a comparison of methods

Hussey, R., Williams, S. and Mitchell, R. (2012) Automatic keyphrase extraction: a comparison of methods. In: eKNOW, Proceedings of The Fourth International Conference on Information, Process, and Knowledge Management, Valencia, Spain, pp. 18-23.

Full text not archived in this repository.

It is advisable to refer to the publisher's version if you intend to cite from this work.

There are many published methods for creating keyphrases for documents. Previous work in the field has shown that, in a significant proportion of cases, author-selected keyphrases are not appropriate for the document they accompany: often the keyphrases are not updated when the focus of a paper changes, or they are more classificatory than explanatory. Automated methods are therefore needed to improve the use of keyphrases. The published methods are all evaluated using different corpora, typically one relevant to their field of study. This not only makes it difficult to incorporate the useful elements of algorithms in future work but also makes comparing the results of each method inefficient and ineffective. This paper describes the work undertaken to compare five methods across a common baseline of six corpora. The methods chosen were term frequency, inverse document frequency, the C-Value, the NC-Value, and a synonym-based approach. These methods were compared to evaluate performance and quality of results, and to provide a future benchmark. It is shown that, with the comparison metric used for this study, Term Frequency and Inverse Document Frequency were the best algorithms, followed by the synonym-based approach. Further work in the area is required to determine an appropriate (or more appropriate) comparison metric.
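
As a rough illustration of the two simplest methods named in the abstract, the sketch below scores single words by term frequency and by inverse document frequency using the common textbook definitions. It is not the formulation or the comparison metric used in the paper: the function names, the whitespace tokenizer, the restriction to single words rather than multi-word keyphrases, and the unsmoothed logarithm are all assumptions made for this example.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?()\"'"


def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation (assumed tokenizer)."""
    stripped = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [word for word in stripped if word]


def term_frequency(tokens):
    """Relative frequency of each term within a single document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def inverse_document_frequency(corpus_tokens):
    """log(N / df) per term, where df is the number of documents containing the term."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def top_terms(scores, k=3):
    """The k highest-scoring terms, ties broken alphabetically."""
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]
```

Calling `top_terms(term_frequency(tokenize(text)))` on one document gives its most frequent words, while `inverse_document_frequency` needs the tokenized corpus as a whole, since a term's score depends on how many documents contain it.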

Item Type: Conference or Workshop Item (Paper)
Divisions: Science > School of Mathematical, Physical and Computational Sciences > Department of Computer Science
ID Code: 27770
Uncontrolled Keywords: Term Frequency, Inverse Document Frequency, C-Value, NC-Value, Synonyms, Comparisons, Automated Keyphrase Extraction, Document Classification
Additional Information: ISBN 9781612081816
Publisher Statement: Publisher makes available at as stated at
