Research | University of Tübingen

Our Research Areas

The DSAR group focuses on the qualitative assessment, interpretability, robustness and fairness of machine learning methods and their scalable applicability in real-life scenarios. If you have any questions about our research, or if you are interested in collaborating, please do not hesitate to contact us.

Prof. Gjergji Kasneci

Our General Research Objective - The research at DSAR focuses on the development of robust and fair machine learning methods with high predictive quality and interpretability. Currently, this involves the following considerations:

The generation/selection of robust feature sets for complex (stream-) learning procedures
Mechanisms for robust and fair predictions
Techniques for the quantification and improvement of data quality

In general, we aim to provide the groundwork for feasible machine learning solutions in various practical applications.

Hamed Jalali

Bayesian unsupervised ensemble methods in distributed learning: In this work, we develop Bayesian aggregation methods for partitioned data. Local estimators (or experts) are trained on different subsets of the data, and because of this divide-and-conquer approach, their predictions must be aggregated. Several aggregation schemes exist, e.g. the Product of Experts and the Bayesian Committee Machine. Because these schemes assume independence between the local experts, they often yield poor predictions in practice. We want to develop an aggregation that accounts for correlated experts, using Bayesian unsupervised ensemble techniques under random or disjoint partitioning, with a focus on Gaussian processes.
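
As a rough illustration of the baseline mentioned above, the following sketch combines Gaussian predictions from several local experts with the standard Product of Experts rule (precisions add up, and the mean is precision-weighted). It is not the group's proposed method; the function name and the toy numbers are assumptions made for this example.

```python
def product_of_experts(means, variances):
    """Combine Gaussian expert predictions at a single test point.

    Assumes the experts are independent, which is exactly the
    assumption criticised in the text above.
    """
    if not means or len(means) != len(variances):
        raise ValueError("need one mean and one variance per expert")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    # Under independence, the precisions (inverse variances) simply add up.
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)

    # The aggregated mean is the precision-weighted average of expert means.
    mean = sum(p * m for p, m in zip(precisions, means)) / total_precision
    return mean, 1.0 / total_precision


# Two hypothetical experts with equal confidence but different predictions.
poe_mean, poe_var = product_of_experts([1.0, 3.0], [1.0, 1.0])

# Four identical experts: the variance shrinks although no new
# information was added, which shows the overconfidence problem.
dup_mean, dup_var = product_of_experts([2.0] * 4, [1.0] * 4)
```

In the second call, the aggregated variance is a quarter of each expert's variance even though all four experts say the same thing, which is one way to see why ignoring correlation between experts can hurt predictive quality.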

Graph-based aggregation in local approximation: When we train a set of experts on local subsets of the training data, the interactions between experts can be used to obtain a fast, accurate, and consistent predictive distribution. Since the state-of-the-art method that exploits the correlation between experts is not efficient for large and high-dimensional data sets, we use graphical models and conditional independence to select the most important experts and connections for estimation.

Undirected graphical models with a latent variable in distributed learning: In this line of work, we consider a graphical model with observed random variables and a latent variable. The observed variables represent the local experts, while the latent variable is the desired estimator. This work seeks an appropriate joint distribution over the nodes, with the aggregation based on an energy function.

Johannes Haug

Towards Reliable Machine Learning in Evolving Data Streams

Data streams are an integral part of many modern applications (e.g. social media, trading, fraud detection, credit scoring, ...). In data streams, we have to deal with a potentially infinite number of observations, real-time demands, limited hardware capacity and shifts of the data generating distribution (concept drift). At DSAR, we develop novel methods for more robust, efficient and interpretable machine learning in evolving data streams:

Online Feature Selection - Feature selection is a popular pre-processing technique to mitigate the negative effects of high-dimensional data. Unlike offline feature selection methods, online feature selection models need to be continuously updated. Accordingly, online feature selection aims at discriminative feature sets, flexibility during concept drift and stability with respect to noisy inputs. FIRES combines these valuable properties in a holistic framework for online feature selection (KDD-Paper, Github).

Concept Drift Detection - In order to deal with concept drift, online learning models either dynamically adapt over time or rely on dedicated concept drift detection methods. At DSAR, we introduced ERICS, a novel framework that monitors the latent uncertainty in the predictive model for more reliable concept drift detection (ICPR-Paper, Github).

Interpretable Online Learning - Compared to other domains such as image recognition, relatively little attention has been paid to the interpretability and explainability of machine learning in evolving data streams. At DSAR, we aim to develop a better understanding of interpretable online learning. In this context, we introduced the Dynamic Model Tree, a more powerful, flexible and interpretable alternative to popular online learning frameworks like the Hoeffding Tree (ICDE-Paper, Github).

Local Explainability in Data Streams - Local attribution methods are a popular and model-agnostic explanation technique. However, it often remains unclear how incremental model updates and concept drift affect the validity of local attributions over time. At DSAR, we investigated the behaviour of local attributions in evolving data streams. In this context, we proposed CDLEEDS, a novel local change detection method that enables more effective and efficient attribution-based explainability in online learning applications (CIKM-Paper, Github). Likewise, we investigated the role of the baseline for meaningful explanations (AAAI-Workshop-Paper).

Standardized Evaluation of Online Learning Methods - Due to the lack of real-world streaming data sets, it is difficult to evaluate new online learning methods under realistic conditions. In this context, the lack of common standards has led to a variety of different evaluation practices applied in the data stream literature. At DSAR, we summarized meaningful evaluation strategies, performance measures and best practices in a comprehensive paper (arXiv-Paper). In addition, we introduced float, an open source Python framework that enables more comparable and standardized evaluations of online machine learning methods (Github, Pypi).
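
To make the evaluation setting more concrete, here is a minimal sketch of prequential (test-then-train) evaluation, a common strategy in the data stream literature in which each observation is first used for testing and only afterwards for training. It does not use the API of float; the classifier, the function names and the toy stream are assumptions made for this example.

```python
from collections import Counter


class MajorityClassifier:
    """Toy online model that predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # Before any label has been seen there is nothing to predict.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(model, stream):
    """Test on each observation first, then train on it."""
    correct = 0
    total = 0
    for x, y in stream:
        if model.predict(x) == y:
            correct += 1
        total += 1
        model.learn(x, y)  # the model only sees the label after predicting
    return correct / total if total else 0.0


# Hypothetical stream of (features, label) pairs.
toy_stream = [(0, "a"), (1, "a"), (2, "b"), (3, "a")]
accuracy = prequential_accuracy(MajorityClassifier(), toy_stream)
```

Because every prediction is made before the corresponding label is revealed, the resulting accuracy reflects how the model would have performed had it been deployed on the stream from the start.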

Martin Pawelczyk

Actionable Recourse Through Data Supported Counterfactuals - Counterfactual explanations can be obtained by identifying small changes to a feature vector that alter the prediction; for example, from 'loan rejected' to 'awarded' or from 'high risk of cardiovascular disease' to 'low risk'. Previous approaches often emphasized that counterfactuals should be easily interpretable to humans, motivating solutions with few changes to the feature vector. However, some approaches do not guarantee that the produced counterfactuals are (1) proximate (i.e., close to the current state), (2) connected to regions with substantial data density (i.e., reachable from the current state) and (3) attainable without excessive effort. These three requirements are fundamental if the suggestions made to individuals are meant to be attainable. In our WWW 2020 paper, we use data density approximators (for example VAEs) to find close and attainable counterfactuals.
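
The basic idea of a counterfactual can be sketched for a linear classifier, where the smallest change (in Euclidean distance) that flips the prediction has a closed form. This sketch only covers proximity; it does not model data density or attainability and is not the method of the WWW 2020 paper. The function names, weights and the margin parameter are assumptions made for this example.

```python
def linear_score(x, w, b):
    """Decision score of a linear classifier; positive means 'awarded'."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def linear_counterfactual(x, w, b, margin=0.01):
    """Return the closest point (L2) whose score lies just across the boundary."""
    norm_sq = sum(wi * wi for wi in w)
    if norm_sq == 0:
        raise ValueError("weight vector must be non-zero")

    score = linear_score(x, w, b)
    # Aim slightly past the decision boundary, on the opposite side.
    target = margin if score <= 0 else -margin

    # Moving along w changes the score fastest, so it gives the shortest path.
    step = (target - score) / norm_sq
    return [xi + step * wi for xi, wi in zip(x, w)]


# Hypothetical applicant who is currently rejected (negative score).
w, b = [1.0, 2.0], -5.0
x = [1.0, 1.0]
x_cf = linear_counterfactual(x, w, b)
```

The difference between x_cf and x tells the applicant how much each feature would have to change to flip the decision. Nothing in this construction prevents x_cf from landing in a region with little or no data, which is the gap that density-based approaches aim to close.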

Relating Data Supported & Greedy Counterfactuals under Predictive Multiplicity - Recent work has revitalized an old insight from the late Leo Breiman: there often does not exist one superior solution to a prediction problem with respect to commonly used measures of interest (e.g. error rate). In fact, multiple different classifiers often achieve almost equally good solutions, a phenomenon called predictive multiplicity. In this work, we derive upper bounds for the costs of counterfactual explanations under predictive multiplicity.

Auditing Black Box Classifiers for Proxy Influence - In this line of work, we develop a method to generate transparent predictions, which differ from explainable predictions. Our goal is to determine which inputs are used locally and globally, and to what extent, even if they were not used to train the model.

Vadim Borisov

Explainable Deep Learning - We proposed a new layer - CancelOut - that can help identify subsets of relevant input features for streaming or static data. We made our code available online.

Multi-Output Regression - We open-sourced amorf, a Python framework dedicated to multi-output regression problems.

Deep Learning for Tabular Data - Artificial neural networks have become the main tool for machine learning tasks within a wide range of domains, including vision, NLP, and speech. However, deep neural networks show only moderate performance on heterogeneous tabular data. In this line of work, we address this issue.

Tobias Leemann

Page is under construction...

Open Reading Group

DSAR regularly organizes an open reading group on various topics from the fields of data science and machine learning. Due to the ongoing situation, the reading group is suspended until further notice. So far, we talked about the following topics:

22.10.19 (presentation by Vadim): Shavitt, Ira, and Eran Segal. "Regularization learning networks: deep learning for tabular datasets." Advances in Neural Information Processing Systems. 2018.
05.11.19 (presentation by Martin): Ancona, Marco, Cengiz Öztireli, and Markus Gross. "Explaining Deep Neural Networks with a Polynomial Time Algorithm for Shapley Values Approximation." arXiv preprint arXiv:1903.10992 (2019).
19.11.19 (presentation by Johannes): Arik, Sercan O., and Tomas Pfister. "TabNet: Attentive Interpretable Tabular Learning." arXiv preprint arXiv:1908.07442 (2019).
NeurIPS & Winter Break
04.02.20 (presentation by Oemer from the HCI chair): On Graph Neural Networks
04.03.20 (presentation by Hamed): Deisenroth, Marc, and Jun Wei Ng. "Distributed Gaussian Processes." International Conference on Machine Learning. 2015.