A comparison of automated keyphrase extraction techniques and of automatic evaluation vs. human evaluation

Hussey, Richard; Williams, Shirley; Mitchell, Richard; Field, Ian

Hussey, R., Williams, S., Mitchell, R. and Field, I. (2012) A comparison of automated keyphrase extraction techniques and of automatic evaluation vs. human evaluation. International Journal on Advances in Life Sciences, 4 (3 and 4). pp. 136-153. ISSN 1942-2660

Abstract/Summary

Keyphrases are added to documents to help identify the areas of interest they contain. However, in a significant proportion of papers the author-selected keyphrases are not appropriate for the document they accompany: for instance, they can be classificatory rather than explanatory, or they are not updated when the focus of the paper changes. Automated methods for improving the use of keyphrases are therefore needed, and various methods have been published. However, each method was evaluated using a different corpus, typically one relevant to the field of study of the method’s authors. This not only makes it difficult to incorporate the useful elements of algorithms in future work, but also makes comparing the results of each method inefficient and ineffective. This paper describes the work undertaken to compare five methods across a common baseline of corpora. The methods chosen were Term Frequency, Inverse Document Frequency, the C-Value, the NC-Value, and a Synonym-based approach. These methods were analysed to evaluate performance and quality of results, and to provide a future benchmark. It is shown that Term Frequency and Inverse Document Frequency were the best algorithms, with the Synonym approach following them. Following these findings, a study was undertaken into the value of using human evaluators to judge the outputs. The Synonym method was compared to the original author keyphrases of the Reuters News Corpus. The findings show that authors of Reuters news articles provide good keyphrases but that more often than not they do not provide any keyphrases.
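For readers unfamiliar with the two best-performing methods named in the abstract, the following is a minimal sketch of Term Frequency and Inverse Document Frequency scoring. It uses the common textbook definitions (relative frequency within a document, and log(N / df) across a corpus); this record does not state the exact formulations used in the paper, so the formulas, the function names, the single-word tokenisation, and the three-sentence toy corpus are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequency(doc_tokens):
    """Relative frequency of each term within a single document."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {term: count / total for term, count in counts.items()}


def inverse_document_frequency(corpus_tokens):
    """log(N / df) per term, where df is the number of documents containing it."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for doc_tokens in corpus_tokens:
        doc_freq.update(set(doc_tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def top_terms(scores, k=3):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus, not taken from the paper's corpora.
corpus = [
    "keyphrase extraction compares keyphrase methods",
    "news articles often lack keyphrase metadata",
    "human evaluation of extraction methods",
]
tokens = [tokenize(doc) for doc in corpus]
tf_first = term_frequency(tokens[0])
idf = inverse_document_frequency(tokens)

print(top_terms(tf_first, k=1))
print(round(idf["human"], 3), round(idf["keyphrase"], 3))
```

Under these definitions, Term Frequency favours words repeated within one document, while Inverse Document Frequency favours words that appear in few documents of the corpus. The sketch scores single words only; extracting multi-word keyphrases would need an additional candidate-generation step that is not shown here.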

Item Type	Article
URI	https://centaur.reading.ac.uk/id/eprint/32266
Refereed	Yes
Divisions	Science > School of Mathematical, Physical and Computational Sciences > Department of Computer Science
Uncontrolled Keywords	Automated Keyphrase Extraction; C-Value; Comparisons; Document Classification; Human Evaluation; Inverse Document Frequency; NC-Value; Reuters News Corpus; Synonyms; Term Frequency
Publisher	IARIA

Deposit Details

Date Deposited:	24 Apr 2013 07:52
Last Modified:	09 Jun 2024 01:38