Towards Reliable Machine Learning in Evolving Data Streams
Data streams are an integral part of many modern applications (e.g. social media, trading, fraud detection, credit scoring). In data streams, we have to deal with a potentially infinite number of observations, real-time demands, limited hardware capacity, and shifts of the data-generating distribution (concept drift). At DSAR, we develop novel methods for more robust, efficient and interpretable machine learning in evolving data streams:
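The standard way to learn under these constraints is a test-then-train (prequential) loop: each incoming observation is first used for prediction and only afterwards for an incremental model update. The following is a minimal sketch of that loop over an invented synthetic stream with one abrupt concept drift; the model, stream, and drift point are illustrative assumptions, not part of any DSAR method.

```python
# Test-then-train (prequential) loop over a synthetic binary stream.
# The labelling concept flips sign at t = 1000, simulating abrupt drift.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)
n, d = 2000, 5
X = rng.normal(size=(n, d))
w_old, w_new = np.ones(d), -np.ones(d)               # concept flips at t = 1000
y = np.array([(x @ (w_old if t < 1000 else w_new)) > 0
              for t, x in enumerate(X)]).astype(int)

model = SGDClassifier(random_state=0)                # incremental linear model
correct = 0
for t, (x, label) in enumerate(zip(X, y)):
    if t > 0:                                        # test first ...
        correct += int(model.predict(x.reshape(1, -1))[0] == label)
    model.partial_fit(x.reshape(1, -1), [label], classes=[0, 1])  # ... then train

accuracy = correct / (n - 1)                         # prequential accuracy
```

Because every observation is tested before it is trained on, prequential accuracy gives an unbiased picture of how the model copes with the drift at t = 1000.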
Online Feature Selection - Feature selection is a popular pre-processing technique for mitigating the negative effects of high-dimensional data. Unlike offline feature selection methods, online feature selection models need to be updated continuously. Accordingly, online feature selection aims at discriminative feature sets, flexibility during concept drift, and stability with respect to noisy inputs. FIRES combines these valuable properties in a holistic framework for online feature selection (KDD-Paper, Github).
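To make the setting concrete, here is a deliberately simplified sketch of online feature selection: a linear model is trained incrementally and, after every mini-batch, the k features with the largest absolute weights form the current selection. This weight-magnitude heuristic is NOT the FIRES algorithm (which derives importances from a probabilistic model); the data and all names are invented for illustration.

```python
# Simplified online feature selection: rank features by the magnitude of an
# incrementally trained linear model's weights and keep the top k.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
n, d, k = 1000, 10, 3
X = rng.normal(size=(n, d))
# Only features 0, 1 and 2 are informative; the rest are noise.
y = (X[:, 0] + 2 * X[:, 1] - 1.5 * X[:, 2] > 0).astype(int)

model = SGDClassifier(random_state=0)
for i in range(0, n, 50):                            # consume the stream in batches
    model.partial_fit(X[i:i + 50], y[i:i + 50], classes=[0, 1])
    top_k = np.argsort(-np.abs(model.coef_[0]))[:k]  # currently selected features
```

A real online feature selector would additionally have to keep this ranking stable under noise while still letting it change when the underlying concept drifts, which is exactly the trade-off FIRES addresses.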
Concept Drift Detection - To deal with concept drift, online learning models either adapt dynamically over time or rely on dedicated concept drift detection methods. At DSAR, we introduced ERICS, a novel framework that monitors the latent uncertainty of the predictive model for more reliable concept drift detection (ICPR-Paper, Github).
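For intuition, the sketch below implements a classic error-rate based detector in the spirit of DDM (Gama et al.): it tracks the running error rate p and its standard deviation s, remembers the best (lowest) p + s seen so far, and flags drift once p + s rises three s-units above that optimum. This is NOT ERICS, which monitors model uncertainty instead of errors; thresholds and data here are illustrative.

```python
# Minimal DDM-style drift detector on a stream of 0/1 prediction errors.
import math

class SimpleDriftDetector:
    def __init__(self, drift_level=3.0):
        self.n = 0
        self.p = 1.0                                  # running error rate
        self.s = 0.0                                  # its standard deviation
        self.p_min = float("inf")
        self.s_min = float("inf")
        self.drift_level = drift_level

    def update(self, error):
        """error: 1 if the model misclassified the latest sample, else 0."""
        self.n += 1
        self.p += (error - self.p) / self.n           # incremental mean
        self.s = math.sqrt(self.p * (1 - self.p) / self.n)
        # Remember the best state (after a short warm-up, as in DDM).
        if self.n >= 30 and self.p + self.s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, self.s
        return self.p + self.s > self.p_min + self.drift_level * self.s_min

detector = SimpleDriftDetector()
# 10% error rate for 500 steps, then (almost) constant errors: drift.
errors = [1 if t % 10 == 0 else 0 for t in range(500)] + [1] * 100
flagged = [t for t, e in enumerate(errors) if detector.update(e)]
```

The detector stays silent during the stable phase and fires shortly after the error rate jumps, illustrating the detection-delay trade-off that any drift detector has to manage.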
Interpretable Online Learning - Compared to other domains such as image recognition, relatively little attention has been paid to the interpretability and explainability of machine learning in evolving data streams. At DSAR, we aim to develop a better understanding of interpretable online learning. In this context, we introduced the Dynamic Model Tree, a more powerful, flexible and interpretable alternative to popular online learning frameworks like the Hoeffding Tree (ICDE-Paper, Github).
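The Hoeffding Tree mentioned above decides when to split a node using the Hoeffding bound: with probability 1 - delta, the observed mean of a variable with range R is within epsilon of its true mean after n observations, so a split on the best attribute is safe once its observed advantage exceeds epsilon. A small sketch of that criterion (the numbers are illustrative assumptions, not from any paper):

```python
# Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n)).
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# Example: information gain lies in [0, 1] (R = 1) and the two best
# candidate attributes differ by 0.05 in observed gain. How many samples
# until the tree may split with confidence 1 - delta?
delta, gain_gap = 1e-7, 0.05
n = 100
while hoeffding_bound(1.0, delta, n) >= gain_gap:    # wait for more samples
    n += 100
```

The bound shrinks with 1/sqrt(n), so close races between attributes require many observations, one reason alternatives such as the Dynamic Model Tree can be attractive.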
Local Explainability in Data Streams - Local attribution methods are a popular and model-agnostic explanation technique. However, it often remains unclear how incremental model updates and concept drift affect the validity of local attributions over time. At DSAR, we investigated the behaviour of local attributions in evolving data streams. In this context, we proposed CDLEEDS, a novel local change detection method that enables more effective and efficient attribution-based explainability in online learning applications (CIKM-Paper, Github). Likewise, we investigated the role of the baseline for meaningful explanations (AAAI-Workshop-Paper).
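To see why the baseline matters, consider the simplest occlusion-style local attribution for a linear model: the contribution of feature i is f(x) minus f(x) with x_i replaced by its baseline value. This generic sketch is not CDLEEDS; the model and baselines are invented to show that changing the baseline changes every attribution score.

```python
# Baseline-based local attribution for a toy linear model f(x) = w @ x.
import numpy as np

w = np.array([2.0, -1.0, 0.5])

def attribution(x, baseline):
    """Contribution of each feature: f(x) - f(x with x_i set to baseline_i)."""
    f_x = w @ x
    return np.array([f_x - w @ np.where(np.arange(len(x)) == i, baseline, x)
                     for i in range(len(x))])

x = np.array([1.0, 1.0, 1.0])
attr_zero = attribution(x, np.zeros(3))              # zero baseline
attr_mean = attribution(x, np.full(3, 0.5))          # a "mean"-style baseline
```

For the linear model the attribution of feature i is w_i * (x_i - baseline_i), so moving the baseline from 0 to 0.5 halves every score. In a stream, both the model weights and a data-derived baseline change over time, which is why the validity of local attributions has to be monitored.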
Standardized Evaluation of Online Learning Methods - Due to the lack of real-world streaming data sets, it is difficult to evaluate new online learning methods under realistic conditions. Moreover, the absence of common standards has led to a variety of evaluation practices in the data stream literature. At DSAR, we summarized meaningful evaluation strategies, performance measures and best practices in a comprehensive paper (arXiv-Paper). In addition, we introduced float, an open-source Python framework that enables more comparable and standardized evaluations of online machine learning methods (Github, Pypi).
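One evaluation practice worth standardizing is the sliding-window performance measure: instead of accuracy over the whole stream, aggregate only the most recent W predictions so the metric reflects current performance and recovers after drift. The sketch below is a generic illustration of that idea, not the float API; class and parameter names are invented.

```python
# Sliding-window prequential accuracy over the last `window` predictions.
from collections import deque

class WindowedAccuracy:
    def __init__(self, window=100):
        self.hits = deque(maxlen=window)             # 1 = correct, 0 = wrong

    def update(self, y_true, y_pred):
        self.hits.append(int(y_true == y_pred))
        return sum(self.hits) / len(self.hits)       # accuracy over the window

metric = WindowedAccuracy(window=4)
for y_true, y_pred in [(1, 1), (0, 1), (1, 1), (1, 1), (0, 0)]:
    acc = metric.update(y_true, y_pred)
```

After the fifth update the window holds only the last four outcomes, so one old mistake can drop out of the metric entirely, exactly the forgetting behaviour a stream evaluation needs.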