Development and application of Machine Learning classification methods: optimal feature identification and prediction of antimicrobial peptides and other soft matter systems

Jullapech, N. (2025) Development and application of Machine Learning classification methods: optimal feature identification and prediction of antimicrobial peptides and other soft matter systems. PhD thesis, University of Reading

To link to this item DOI: 10.48683/1926.00127287

Abstract/Summary

The research in this thesis integrates the complementary strengths of five machine learning (ML) methods, namely logistic regression (LR), elastic net logistic regression (ENET), support vector machines (SVM), random forests (RF), and neural networks (NN), to simultaneously optimise predictive accuracy and refine variable selection in binary classification tasks. The predictive performance of the ensemble of these five base predictors is evaluated, and the impacts of heterogeneity in predictive accuracy and of dependency between base predictors are quantified. A novel ensemble prediction and feature selection method based on the majority vote approach, denoted MV-FS (majority vote-feature selection), is developed to mitigate the limitations of individual predictors and so ensure robust performance across diverse datasets. Theoretical and simulation evaluations of the proposed methodology provide insights into how variations in model performance and inter-model relationships influence overall effectiveness, which in turn can be used to ensure generalisation and stability, even in challenging classification scenarios.

To validate its practical effectiveness, the ensemble is applied to two distinct cases in the biological and soft matter areas. In the first case, the MV-FS method is employed to explore the relationship between the physicochemical features of antimicrobial peptides (AMPs) and their antibacterial activities, a task of significant importance in drug discovery due to the growing need for novel antibiotics. In this regard, ML has emerged as a powerful tool for predicting peptide sequences with enhanced antimicrobial activity and selectivity, revolutionising the way researchers approach the development of novel antimicrobial agents. By extending existing ML methods, combining scientific knowledge of the physicochemical and structural characteristics of AMPs with a data-driven ensemble approach, the MV-FS method increases confidence in identifying the key physicochemical features governing antimicrobial activity and predicts regions of the physicochemical descriptor space with a high probability of containing active AMPs.

In the second case, the ensemble is preliminarily tested for predicting the conformational transitions of single charged polymer chains, which can help in understanding the structural variations of biomacromolecules in different environments and the association behaviours of synthesised polymers for developing novel functional materials. Using molecular dynamics simulation results as training datasets, the conformational regimes predicted by the ensemble method agree well with theoretical expectations, indicating the strong potential of ML methods for predicting the structural properties of macromolecules, especially in regimes where brute-force simulations are computationally very costly.

Overall, the research presented in this thesis not only advances the state of the art in ensemble learning for binary classification but also provides a scalable and adaptable framework that can be extended to other domains. By combining interpretability, robustness, and high predictive accuracy, the proposed methodology offers a powerful tool for researchers and practitioners seeking to address classification problems with high precision and reliability.
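
As a rough illustration of the majority vote idea described in the abstract, the following minimal Python sketch combines binary predictions from five base models, keeps the features that enough models select, and computes the accuracy of a majority vote when the base predictors are assumed to be independent. It is not the MV-FS algorithm from the thesis: the function names, the `min_votes` selection rule, the toy predictions, and the feature names (`charge`, `hydrophobicity`, `length`) are all hypothetical and chosen only for illustration.

```python
from itertools import product
from math import prod


def majority_vote(votes):
    """Return 1 if more than half of the binary votes are 1, else 0."""
    return int(sum(votes) * 2 > len(votes))


def ensemble_predict(prediction_table):
    """Combine per-model 0/1 predictions (one row per model), sample by sample."""
    return [majority_vote(column) for column in zip(*prediction_table)]


def select_features(selections, min_votes):
    """Keep features chosen by at least `min_votes` models (an assumed rule)."""
    counts = {}
    for chosen in selections.values():
        for feature in chosen:
            counts[feature] = counts.get(feature, 0) + 1
    return sorted(f for f, c in counts.items() if c >= min_votes)


def majority_accuracy(accuracies):
    """Probability that a majority of INDEPENDENT predictors is correct."""
    n = len(accuracies)
    total = 0.0
    # Enumerate every pattern of correct (1) and incorrect (0) predictors.
    for outcome in product([0, 1], repeat=n):
        if sum(outcome) * 2 > n:
            total += prod(p if ok else 1 - p
                          for p, ok in zip(accuracies, outcome))
    return total


if __name__ == "__main__":
    # Hypothetical 0/1 predictions for four samples; rows are LR, ENET, SVM, RF, NN.
    table = [
        [1, 0, 1, 0],
        [1, 0, 0, 0],
        [1, 1, 1, 0],
        [0, 1, 1, 0],
        [1, 1, 0, 1],
    ]
    print(ensemble_predict(table))

    # Hypothetical feature subsets selected by each base model.
    chosen = {
        "LR": {"charge", "hydrophobicity"},
        "ENET": {"charge"},
        "SVM": {"charge", "length"},
        "RF": {"charge", "hydrophobicity", "length"},
        "NN": {"hydrophobicity"},
    }
    print(select_features(chosen, min_votes=3))

    # Equal and heterogeneous base accuracies under the independence assumption.
    print(majority_accuracy([0.8] * 5))
    print(majority_accuracy([0.9, 0.8, 0.8, 0.7, 0.6]))
```

With five independent predictors that are each correct 80% of the time, the majority vote is correct with probability of about 0.942, which is the standard binomial result. The independence assumption is what makes this calculation simple; dependency between base predictors, which the abstract says the thesis quantifies, would change this figure and is not modelled here.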

Item Type: Thesis (PhD)
Thesis Supervisor: Baksh, F.
Thesis/Report Department: School of Mathematical, Physical and Computational Sciences
Identification Number/DOI: 10.48683/1926.00127287
Divisions: Science > School of Mathematical, Physical and Computational Sciences > Department of Mathematics and Statistics
ID Code: 127287
