Automatic keyphrase extraction: a comparison of methods
Hussey, R., Williams, S. and Mitchell, R. (2012) Automatic keyphrase extraction: a comparison of methods. In: eKNOW, Proceedings of The Fourth International Conference on Information, Process, and Knowledge Management, Valencia, Spain, pp. 18-23.
Full text not archived in this repository.
Official URL: http://www.thinkmind.org/index.php?view=article&ar...
There are many published methods for creating keyphrases for documents. Previous work in the field has shown that, in a significant proportion of cases, author-selected keyphrases are not appropriate for the document they accompany: often the keyphrases are not updated when the focus of a paper changes, or they are more classificatory than explanatory. Automated methods are therefore needed to improve the use of keyphrases. The published methods are, however, all evaluated using different corpora, typically one relevant to their field of study. This makes it difficult to incorporate the useful elements of the algorithms in future work, and it makes comparing the results of each method inefficient and ineffective. This paper describes the work undertaken to compare five methods across a common baseline of six corpora. The methods chosen were term frequency, inverse document frequency, the C-Value, the NC-Value, and a synonym-based approach. These methods were compared to evaluate performance and quality of results, and to provide a future benchmark. It is shown that, with the comparison metric used for this study, term frequency and inverse document frequency were the best-performing algorithms, followed by the synonym-based approach. Further work in the area is required to determine an appropriate (or more appropriate) comparison metric.
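As a rough illustration of the two simplest methods named in the abstract, the sketch below scores single-word terms by term frequency and by inverse document frequency. The abstract does not give the paper's exact formulations, so the tokenizer, the raw-count definition of term frequency, the log(N / df) definition of inverse document frequency, and all function names here are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequency(doc_tokens):
    """Raw count of each term within a single document."""
    return Counter(doc_tokens)


def inverse_document_frequency(corpus_tokens):
    """log(N / df) per term, where df is the number of documents containing it."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def top_terms(scores, k=3):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical three-document corpus, used only to exercise the functions.
corpus = [
    "Keyphrase extraction compares keyphrase methods",
    "Term frequency counts each term",
    "Methods differ across corpora",
]
corpus_tokens = [tokenize(doc) for doc in corpus]

tf_first_doc = term_frequency(corpus_tokens[0])
idf = inverse_document_frequency(corpus_tokens)

print(top_terms(tf_first_doc, k=1))
print(top_terms(idf, k=3))
```

Under these assumptions, term frequency favours words repeated within one document, while inverse document frequency favours words that occur in few documents of the corpus. The C-Value, NC-Value, and synonym-based methods need candidate multi-word phrases or a thesaurus and are not sketched here.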