Department of Computer Science

Our Research Topics

The DSAR group focuses on the qualitative evaluation, interpretability, robustness, and fairness of machine learning methods and their scalable applicability in real-world scenarios. If you have questions about our research or are interested in a collaboration, please feel free to contact us.

Prof. Gjergji Kasneci

Our General Research Objective - The research at DSAR focuses on the development of robust and fair machine learning methods with high predictive quality and interpretability. Currently, this involves the following considerations:

  • The generation/selection of robust feature sets for complex (stream-)learning procedures
  • Mechanisms for robust and fair predictions
  • Techniques for the quantification and improvement of data quality

In general, we aim to provide the groundwork for feasible machine learning solutions in various practical applications.

Hamed Jalali

Bayesian Unsupervised Ensemble Methods in Distributed Learning: In this work, we develop Bayesian aggregations on partitioned data. Local estimators (or experts) are trained on different subsets of the data and, following the divide-and-conquer approach, their predictions must be aggregated. Several aggregation schemes exist, e.g., the Product of Experts and the Bayesian Committee Machine. Since these assume independence between the local experts, they often lead to poor predictions in practice. We aim to develop an aggregation based on correlated experts and Bayesian unsupervised ensemble techniques for random/disjoint partitioning, with a focus on Gaussian processes.
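
As an illustration, the following sketch (Python with scikit-learn; the data, shard count, and all names are our own choices) shows the standard Product-of-Experts aggregation whose independence assumption this work tries to relax:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(600, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(600)

    # Random disjoint partitioning: each expert only sees one shard.
    shards = np.array_split(rng.permutation(600), 6)
    experts = [GaussianProcessRegressor(alpha=0.01).fit(X[i], y[i]) for i in shards]

    X_test = np.linspace(-3, 3, 50).reshape(-1, 1)
    means, stds = zip(*(e.predict(X_test, return_std=True) for e in experts))
    means, variances = np.array(means), np.array(stds) ** 2

    # Product of Experts: precision-weighted fusion that assumes
    # independent experts (exactly the assumption we aim to drop).
    precision = (1.0 / variances).sum(axis=0)
    mean_agg = (means / variances).sum(axis=0) / precision
    var_agg = 1.0 / precision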

Graph-Based Aggregation in Local Approximation: When we train a set of experts on local subsets of the training data, the interactions between the experts can be used to develop a fast, accurate, and consistent predictive distribution. Since the state-of-the-art method that uses the correlations between the experts is not efficient for large and high-dimensional data sets, we use graphical models and conditional independence to choose the most important experts and connections for the estimation.
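
A hedged sketch of the underlying idea, assuming a Gaussian graphical model estimated with the graphical lasso (an illustrative choice, not necessarily the estimator used in this work):

    import numpy as np
    from sklearn.covariance import GraphicalLasso

    # Rows: validation points; columns: predictions of one local expert each.
    rng = np.random.default_rng(1)
    signal = rng.standard_normal((200, 1))
    expert_preds = signal + 0.5 * rng.standard_normal((200, 8))

    # A sparse precision matrix encodes the conditional-independence
    # structure among the experts.
    glasso = GraphicalLasso(alpha=0.1).fit(expert_preds)
    precision = glasso.precision_

    # Experts with (almost) no off-diagonal edges are conditionally
    # independent of the rest and contribute little to the aggregation.
    off_diag = np.abs(precision - np.diag(np.diag(precision)))
    keep = np.where(off_diag.sum(axis=1) > 1e-3)[0]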

Undirected Graphical Models with Latent Variables in Distributed Learning: In this line of work, we consider a graphical model with observed random variables and one latent variable. The observed variables represent the local experts, while the latent variable is the desired estimator. We aim to find an appropriate joint distribution over the nodes, where the aggregation is based on an energy function.
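
For illustration (our notation, not a final model), such a joint distribution could take the form

    p(y, f_1, \dots, f_M) \propto \exp\{-E(y, f_1, \dots, f_M)\},
    \qquad E(y, f) = \sum_{i=1}^{M} w_i (y - f_i)^2,

where f_1, ..., f_M are the expert predictions and y is the latent estimator. For this simple quadratic energy, the aggregate reduces to the weighted average \hat{y} = \sum_i w_i f_i / \sum_i w_i; richer energy functions can encode the dependencies between the experts.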

Johannes Haug

Towards Reliable Machine Learning in Evolving Data Streams

Data streams are an integral part of many modern applications (e.g., social media, trading, fraud detection, or credit scoring). In data streams, we have to deal with a potentially infinite number of observations, real-time demands, limited hardware capacity, and shifts of the data-generating distribution (concept drift). At DSAR, we develop novel methods for more robust, efficient, and interpretable machine learning in evolving data streams:

Online Feature Selection - Feature selection is a popular pre-processing technique to mitigate the negative effects of high-dimensional data. Unlike offline feature selection methods, online feature selection models need to be continuously updated. Accordingly, online feature selection aims at discriminative feature sets, flexibility during concept drift, and stability with respect to noisy inputs. FIRES combines these valuable properties in a holistic framework for online feature selection (KDD-Paper, Github).
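
Below is a generic sketch of such an online feature selection loop; it is not the FIRES algorithm itself, and the logistic model, learning rate, and top-k rule are illustrative stand-ins:

    import numpy as np

    rng = np.random.default_rng(0)
    d, k, lr = 20, 5, 0.01
    w = np.zeros(d)                            # running feature importances

    for t in range(1000):                      # simulated data stream
        x = rng.standard_normal(d)
        y = float(x[:3].sum() > 0)             # only the first 3 features matter
        p = 1.0 / (1.0 + np.exp(-w @ x))       # incremental logistic model
        w -= lr * (p - y) * x                  # one stochastic gradient step
        active = np.argsort(-np.abs(w))[:k]    # current top-k feature set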

Concept Drift Detection - In order to deal with concept drift, online learning models either dynamically adapt over time or rely on dedicated concept drift detection methods. At DSAR, we introduced ERICS, a novel framework that monitors the latent uncertainty in the predictive model for more reliable concept drift detection (ICPR-Paper, Github). 
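
For context, a classic error-based drift detector such as the Page-Hinkley test fits in a few lines; this is a baseline sketch, not ERICS, which monitors the latent uncertainty of the predictive model instead:

    class PageHinkley:
        """Classic Page-Hinkley test on a stream of model errors."""

        def __init__(self, delta=0.005, threshold=5.0):
            self.delta, self.threshold = delta, threshold
            self.mean, self.cum, self.min_cum, self.n = 0.0, 0.0, 0.0, 0

        def update(self, error):
            self.n += 1
            self.mean += (error - self.mean) / self.n     # running mean
            self.cum += error - self.mean - self.delta    # cumulative deviation
            self.min_cum = min(self.min_cum, self.cum)
            return self.cum - self.min_cum > self.threshold  # True = drift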

Interpretable Online Learning - Compared to other domains such as image recognition, relatively little attention has been paid to the interpretability and explainability of machine learning in evolving data streams. At DSAR, we aim to develop a better understanding of interpretable online learning. In this context, we introduced the Dynamic Model Tree, a more powerful, flexible and interpretable alternative to popular online learning frameworks like the Hoeffding Tree (ICDE-Paper, Github).
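
For orientation, a minimal usage sketch of the Hoeffding Tree baseline with the river library (the toy stream is ours):

    from river import tree

    model = tree.HoeffdingTreeClassifier()
    stream = [({"x1": 0.2, "x2": 1.3}, 1), ({"x1": -0.7, "x2": 0.4}, 0)]

    for x, y in stream:
        y_pred = model.predict_one(x)   # test first ...
        model.learn_one(x, y)           # ... then train (prequential order)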

Local Explainability in Data Streams - Local attribution methods are a popular and model-agnostic explanation technique. However, it often remains unclear how incremental model updates and concept drift affect the validity of local attributions over time. At DSAR, we investigated the behaviour of local attributions in evolving data streams. In this context, we proposed CDLEEDS, a novel local change detection method that enables more effective and efficient attribution-based explainability in online learning applications (CIKM-Paper, Github). Likewise, we investigated the role of the baseline for meaningful explanations (AAAI-Workshop-Paper).
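
As a concrete example, here is a simple baseline-replacement (occlusion) attribution, the kind of local attribution whose validity over time is in question here (model and baseline are placeholders for any predictor and reference point):

    import numpy as np

    def occlusion_attributions(model, x, baseline):
        """Attribution of feature i = prediction change when feature i
        is replaced by its baseline value."""
        reference = model(x)
        attributions = np.zeros(len(x))
        for i in range(len(x)):
            x_masked = x.copy()
            x_masked[i] = baseline[i]
            attributions[i] = reference - model(x_masked)
        return attributions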

Standardized Evaluation of Online Learning Methods - Due to the lack of real-world streaming data sets, it is difficult to evaluate new online learning methods under realistic conditions. In this context, the lack of common standards has led to a variety of different evaluation practices applied in the data stream literature. At DSAR, we summarized meaningful evaluation strategies, performance measures and best practices in a comprehensive paper (arXiv-Paper). In addition, we introduced float, an open source Python framework that enables more comparable and standardized evaluations of online machine learning methods (Github, Pypi).
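
The core protocol such evaluations standardize is the prequential (test-then-train) scheme; below is a hand-rolled sketch, assuming a river-style predict_one/learn_one model interface (float's actual API differs; see its documentation):

    def prequential_accuracy(model, stream):
        """Running accuracy under the test-then-train protocol."""
        correct = 0
        for n, (x, y) in enumerate(stream, start=1):
            y_pred = model.predict_one(x)   # 1) test on the new observation
            model.learn_one(x, y)           # 2) then train on it
            correct += int(y_pred == y)
            yield correct / n               # accuracy over time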

Martin Pawelczyk

Actionable Recourse Through Data-Supported Counterfactuals - Counterfactual explanations can be obtained by identifying small changes to an input vector that influence predictions; for example, from 'loan rejected' to 'awarded', or from 'high risk of cardiovascular disease' to 'low risk'. Previous approaches often emphasized that counterfactuals should be easily interpretable to humans, motivating solutions with few changes to the input vectors. However, these approaches do not guarantee that the produced counterfactuals are (1) proximate (i.e., close to the current state), (2) connected to regions with substantial data density (i.e., supported by data), and (3) attainable without excessive effort, three requirements that are fundamental when making suggestions that individuals can actually realize. In our WWW 2020 paper, we use data density approximators (for example, VAEs) to find close and attainable counterfactuals.
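
For contrast, here is a minimal gradient-based counterfactual search in the spirit of Wachter et al.; this is not our WWW 2020 method, which additionally steers the search towards high-density regions (e.g., via a VAE):

    import torch

    def counterfactual(model, x, target, lam=0.1, steps=500, lr=0.01):
        """Searches for x_cf with model(x_cf) close to `target` while
        keeping x_cf close to x (the L1 penalty encourages sparse changes).
        Assumes `model` maps a tensor to a scalar prediction."""
        x_cf = x.clone().requires_grad_(True)
        optimizer = torch.optim.Adam([x_cf], lr=lr)
        for _ in range(steps):
            optimizer.zero_grad()
            loss = (model(x_cf) - target) ** 2 + lam * torch.norm(x_cf - x, p=1)
            loss.backward()
            optimizer.step()
        return x_cf.detach()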

Relating Data-Supported & Greedy Counterfactuals under Predictive Multiplicity - Recent work has revitalized an old insight by the late Leo Breiman: there often does not exist a single superior solution to a prediction problem with respect to commonly used measures of interest (e.g., error rate). In fact, multiple different classifiers often provide almost equally good solutions; this is called predictive multiplicity. In this work, we derive upper bounds on the costs of counterfactual explanations under predictive multiplicity.
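
Predictive multiplicity is easy to observe empirically; in this sketch, the dataset, model family, and seeds are arbitrary stand-ins:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=2000, random_state=0)
    X_train, y_train, X_test, y_test = X[:1500], y[:1500], X[1500:], y[1500:]

    models = [RandomForestClassifier(random_state=s).fit(X_train, y_train)
              for s in range(5)]
    preds = np.array([m.predict(X_test) for m in models])

    accuracies = (preds == y_test).mean(axis=1)            # nearly identical
    disagreement = (preds != preds[0]).any(axis=0).mean()  # yet predictions differ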

Auditing Black-Box Classifiers for Proxy Influence - In this line of work, we develop a method to generate transparent predictions, which differ from explainable predictions. Our goal is to answer which inputs are being used, locally and globally, and to what extent, even if they were not used to train the model.
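
As a loose illustration of the setting (not our method), one very simple proxy audit checks how strongly the black box's outputs track an attribute that was never among the training features; all names are hypothetical:

    import numpy as np

    def proxy_influence(predictions, hidden_attribute):
        """Correlation between black-box outputs and an attribute that was
        never part of the training features; a high absolute value signals
        that proxies for the attribute leak into the model."""
        p = np.asarray(predictions, dtype=float)
        a = np.asarray(hidden_attribute, dtype=float)
        return np.corrcoef(p, a)[0, 1]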

Vadim Borisov

Explainable Deep Learning - We proposed a new layer, CancelOut, that can help identify a subset of relevant input features for streaming or static data; the code is available online.
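
A minimal sketch of the idea behind CancelOut, assuming PyTorch; the initialization is an illustrative choice, see the linked repository for the original code:

    import torch
    import torch.nn as nn

    class CancelOut(nn.Module):
        """Learnable elementwise gate in front of a network; features whose
        gate is driven towards zero are effectively 'cancelled out'."""

        def __init__(self, n_features):
            super().__init__()
            # One learnable weight per input feature, initialized 'open'.
            self.weights = nn.Parameter(torch.ones(n_features) * 4.0)

        def forward(self, x):
            return x * torch.sigmoid(self.weights)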

Multi-Output Regression - We open-sourced a Python framework dedicated to multi-output regression problems: amorf.
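
For orientation, a generic multi-output regression example with scikit-learn (amorf's own API may differ; see its documentation):

    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.multioutput import MultiOutputRegressor

    X, Y = make_regression(n_samples=500, n_features=10, n_targets=3,
                           random_state=0)
    model = MultiOutputRegressor(GradientBoostingRegressor()).fit(X, Y)
    Y_pred = model.predict(X)   # shape: (500, 3)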

Deep Learning for Tabular Data - Nowadays, artificial neural networks have become the main tool for machine learning tasks within a wide range of domains, including vision, NLP, and speech. However, deep neural networks often show only moderate performance on heterogeneous tabular data. We would like to address this issue.

Tobias Leemann

Page under construction...

Open Reading Group

DSAR regularly hosts an open reading group on various topics from the fields of data science and machine learning. Due to the current circumstances, the reading group is suspended until further notice. So far, we have covered the following topics:

  • 22.10.19 (presented by Vadim): Shavitt, Ira, and Eran Segal. "Regularization learning networks: deep learning for tabular datasets." Advances in Neural Information Processing Systems (2018).
  • 05.11.19 (presented by Martin): Ancona, Marco, Cengiz Öztireli, and Markus Gross. "Explaining Deep Neural Networks with a Polynomial Time Algorithm for Shapley Values Approximation." ICML (2019).
  • 19.11.19 (presented by Johannes): Arik, Sercan O., and Tomas Pfister. "TabNet: Attentive Interpretable Tabular Learning." arXiv preprint arXiv:1908.07442 (2019).
  • NeurIPS & winter break
  • 04.02.20 (presented by Oemer from the HCI chair): On Graph Neural Networks
  • 04.03.20 (presented by Hamed): Deisenroth, Marc, and Jun Wei Ng. "Distributed Gaussian Processes." ICML (2015).