Accessibility navigation

A comparison of automated keyphrase extraction techniques and of automatic evaluation vs. human evaluation

Hussey, R., Williams, S., Mitchell, R. and Field, I. (2012) A comparison of automated keyphrase extraction techniques and of automatic evaluation vs. human evaluation. International Journal on Advances in Life Sciences, 4 (3 and 4). pp. 136-153. ISSN 1942-2660

Full text not archived in this repository.

It is advisable to refer to the publisher's version if you intend to cite from this work. See Guidance on citing.

Official URL:


Keyphrases are added to documents to help identify the areas of interest they contain. However, in a significant proportion of papers author selected keyphrases are not appropriate for the document they accompany: for instance, they can be classificatory rather than explanatory, or they are not updated when the focus of the paper changes. As such, automated methods for improving the use of keyphrases are needed, and various methods have been published. However, each method was evaluated using a different corpus, typically one relevant to the field of study of the method’s authors. This not only makes it difficult to incorporate the useful elements of algorithms in future work, but also makes comparing the results of each method inefficient and ineffective. This paper describes the work undertaken to compare five methods across a common baseline of corpora. The methods chosen were Term Frequency, Inverse Document Frequency, the C-Value, the NC-Value, and a Synonym based approach. These methods were analysed to evaluate performance and quality of results, and to provide a future benchmark. It is shown that Term Frequency and Inverse Document Frequency were the best algorithms, with the Synonym approach following them. Following these findings, a study was undertaken into the value of using human evaluators to judge the outputs. The Synonym method was compared to the original author keyphrases of the Reuters’ News Corpus. The findings show that authors of Reuters’ news articles provide good keyphrases but that more often than not they do not provide any keyphrases.

Item Type:Article
Divisions:Science > School of Mathematical, Physical and Computational Sciences > Department of Computer Science
ID Code:32266
Uncontrolled Keywords:Automated Keyphrase Extraction; C-Value; Comparisons; Document Classification; Human Evaluation; Inverse Document Frequency; NC-Value; Reuters News Corpus; Synonyms; Term Frequency

University Staff: Request a correction | Centaur Editors: Update this record

Page navigation