Improved filter-based feature selection using correlation and clustering techniques

Atmakuru, Akhila; Di Fatta, Giuseppe; Nicosia, Giuseppe; Badii, Atta

Text - Accepted Version

Atmakuru, A., Di Fatta, G., Nicosia, G. and Badii, A. (2024) Improved filter-based feature selection using correlation and clustering techniques. In: Nicosia, G., Ojha, V., La Malfa, E., La Malfa, G., Pardalos, P. M. and Umeton, R. (eds.) Machine Learning, Optimization, and Data Science: 9th International Conference, LOD 2023, Grasmere, UK, September 22–26, 2023, Revised Selected Papers, Part I. Lecture Notes in Computer Science, 14505. Springer, pp. 379-389. ISBN 9783031539688 doi: 10.1007/978-3-031-53969-5_28

Abstract/Summary

Feature engineering and feature selection are essential to most data science and machine learning applications: the former transforms raw data into features, and the latter selects the most effective subset of those features for the application. Feature selection techniques are particularly useful when dealing with high-dimensional datasets that contain noisy and redundant data. An optimised feature subset can enhance both the performance and the interpretability of the model. There are three types of feature selection methods, namely filter, wrapper and embedded techniques. Of these, the filter method is the most efficient, as it is computationally less expensive and more generalised. This work presents two improved filter-based feature selection methods based on a correlation coefficient and clustering techniques. The first approach is based on feature correlation: a similarity threshold identifies a kind of neighbourhood for each feature, and the feature subset consists of features above that threshold. The second method applies cluster analysis to the correlation data to identify features that can represent an entire cluster. The resulting feature subsets have been used as a pre-processing step for logistic regression and artificial neural networks. The performance of the proposed methods has been compared against the popular ReliefF feature selection method. The experimental analysis shows that the proposed feature selection methods provide an observable improvement in accuracy by choosing the most effective features.
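
The two ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the selection rules, so the use of absolute Pearson correlation, the greedy neighbourhood rule, average-linkage hierarchical clustering, the choice of cluster representative, and all names (`neighbourhood_selection`, `cluster_selection`, `threshold`, `n_clusters`) are assumptions made purely for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def abs_correlation(X):
    """Absolute Pearson correlation between the columns (features) of X.

    Assumes no column is constant (a constant column would yield NaN).
    """
    return np.abs(np.corrcoef(X, rowvar=False))


def neighbourhood_selection(X, threshold=0.8):
    """Assumed greedy rule: keep a feature, then drop its neighbours.

    The neighbourhood of feature i is every remaining feature whose
    absolute correlation with i is at least `threshold`.
    """
    corr = abs_correlation(X)
    remaining = set(range(corr.shape[0]))
    selected = []
    for i in range(corr.shape[0]):
        if i not in remaining:
            continue  # already covered by an earlier feature's neighbourhood
        selected.append(i)
        neighbours = {j for j in remaining if corr[i, j] >= threshold}
        remaining -= neighbours
        remaining.discard(i)
    return selected


def cluster_selection(X, n_clusters=2):
    """Assumed rule: cluster features on 1 - |corr|, keep one per cluster.

    The representative is the member with the highest mean absolute
    correlation to the members of its own cluster.
    """
    corr = abs_correlation(X)
    dist = np.clip(1.0 - corr, 0.0, None)  # guard against tiny negatives
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    selected = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        mean_corr = corr[np.ix_(members, members)].mean(axis=1)
        selected.append(int(members[np.argmax(mean_corr)]))
    return sorted(selected)


# Synthetic data: two groups of near-duplicate features.
rng = np.random.default_rng(0)
a, b = rng.normal(size=200), rng.normal(size=200)
X = np.column_stack([
    a, a + 0.05 * rng.normal(size=200),
    b, b + 0.05 * rng.normal(size=200),
])
print(neighbourhood_selection(X, threshold=0.8))
print(cluster_selection(X, n_clusters=2))
```

On this synthetic data both functions should keep one feature from each correlated pair. In a pipeline like the one the abstract describes, the selected column indices would then be used to subset the data before fitting a classifier such as logistic regression.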

Item Type	Book or Report Section
URI	https://centaur.reading.ac.uk/id/eprint/120660
Identification Number/DOI	10.1007/978-3-031-53969-5_28
Refereed	Yes
Divisions	Science > School of Mathematical, Physical and Computational Sciences > Department of Computer Science
Publisher	Springer

Date Deposited:	06 Feb 2025 10:05	Date item deposited into CentAUR
Last Modified:	16 Feb 2025 01:38	Date item last modified