Online semi-supervised learning in non-stationary environmentsIdrees, M. M. (2024) Online semi-supervised learning in non-stationary environments. PhD thesis, University of Reading
It is advisable to refer to the publisher's version if you intend to cite from this work. See Guidance on citing. To link to this item DOI: 10.48683/1926.00114864 Abstract/SummaryExisting Data Stream Mining (DSM) algorithms assume the availability of labelled and balanced data, immediately or after some delay, to extract worthwhile knowledge from the continuous and rapid data streams. However, in many real-world applications such as Robotics, Weather Monitoring, Fraud Detection Systems, Cyber Security, and Computer Network Traffic Flow, an enormous amount of high-speed data is generated by Internet of Things sensors and real-time data on the Internet. Manual labelling of these data streams is not practical due to time consumption and the need for domain expertise. Another challenge is learning under Non-Stationary Environments (NSEs), which occurs due to changes in the data distributions in a set of input variables and/or class labels. The problem of Extreme Verification Latency (EVL) under NSEs is referred to as Initially Labelled Non-Stationary Environment (ILNSE). This is a challenging task because the learning algorithms have no access to the true class labels directly when the concept evolves. Several approaches exist that deal with NSE and EVL in isolation. However, few algorithms address both issues simultaneously. This research directly responds to ILNSE’s challenge in proposing two novel algorithms “Predictor for Streaming Data with Scarce Labels” (PSDSL) and Heterogeneous Dynamic Weighted Majority (HDWM) classifier. PSDSL is an Online Semi-Supervised Learning (OSSL) method for real-time DSM and is closely related to label scarcity issues in online machine learning. The key capabilities of PSDSL include learning from a small amount of labelled data in an incremental or online manner and being available to predict at any time. To achieve this, PSDSL utilises both labelled and unlabelled data to train the prediction models, meaning it continuously learns from incoming data and updates the model as new labelled or unlabelled data becomes available over time. Furthermore, it can predict under NSE conditions under the scarcity of class labels. PSDSL is built on top of the HDWM classifier, which preserves the diversity of the classifiers. PSDSL and HDWM can intelligently switch and adapt to the conditions. The PSDSL adapts to learning states between self-learning, micro-clustering and CGC, whichever approach is beneficial, based on the characteristics of the data stream. HDWM makes use of “seed” learners of different types in an ensemble to maintain its diversity. The ensembles are simply the combination of predictive models grouped to improve the predictive performance of a single classifier. PSDSL is empirically evaluated against COMPOSE, LEVELIW, SCARGC and MClassification on benchmarks, NSE datasets as well as Massive Online Analysis (MOA) data streams and real-world datasets. The results showed that PSDSL performed significantly better than existing approaches on most real-time data streams including randomised data instances. PSDSL performed significantly better than ‘Static’ i.e. the classifier is not updated after it is trained with the first examples in the data streams. When applied to MOA-generated data streams, PSDSL ranked highest (1.5) and thus performed significantly better than SCARGC, while SCARGC performed the same as the Static. PSDSL achieved better average prediction accuracies in a short time than SCARGC. The HDWM algorithm is evaluated on artificial and real-world data streams against existing well-known approaches such as the heterogeneous WMA and the homogeneous Dynamic DWM algorithm. The results showed that HDWM performed significantly better than WMA and DWM. Also, when recurring concept drifts were present, the predictive performance of HDWM showed an improvement over DWM. In both drift and real-world streams, significance tests and post hoc comparisons found significant differences between algorithms, HDWM performed significantly better than DWM and WMA when applied to MOA data streams and 4 real-world datasets Electric, Spam, Sensor and Forest cover. The seeding mechanism and dynamic inclusion of new base learners in the HDWM algorithms benefit from the use of both forgetting and retaining the models. The algorithm also provides the independence of selecting the optimal base classifier in its ensemble depending on the problem. A new approach, Envelope-Clustering is introduced to resolve the cluster overlap conflicts during the cluster labelling process. In this process, PSDSL transforms the centroids’ information of micro-clusters into micro-instances and generates new clusters called Envelopes. The nearest envelope clusters assist the conflicted micro-clusters and successfully guide the cluster labelling process after the concept drifts in the absence of true class labels. PSDSL has been evaluated on real-world problem ‘keystroke dynamics’, and the results show that PSDSL achieved higher prediction accuracy (85.3%) and SCARGC (81.6%), while the Static (49.0%) significantly degrades the performance due to changes in the users typing pattern. Furthermore, the predictive accuracies of SCARGC are found highly fluctuated between (41.1% to 81.6%) based on different values of parameter ‘k’ (number of clusters), while PSDSL automatically determine the best values for this parameter.
Download Statistics DownloadsDownloads per month over past year Altmetric Deposit Details University Staff: Request a correction | Centaur Editors: Update this record |