Scientific machine learning for data-driven discovery
Scientific discovery lacks, by definition, a ground truth. We do not know whether the problem is solvable or how well we can do. Benchmark datasets rarely exist. Data is often missing, and markedly not at random, because of sensor failures and collection bias. A substantial body of prior knowledge must be taken into account: Conservation laws, dynamical equations, integrity constraints. Prediction is seldom enough: Causal understanding is the ultimate goal, and uncertainty evaluation and interpretability are requisites. Data acquisition is mediated not by analytics of web behaviour but by expensive, often unique experiments, and data modalities are often mixed, sometimes exotic.
We see substantial potential for new developments at the interface between ML and topical research. Besides the abundant algorithmic challenges in scaling, robustness, interpretability and expression of inductive biases, there are opportunities at the edges of the ML pipeline, i.e. at the steps that are most actionable for domain scientists: Problem definition, data collection, feature development, quality evaluation, and the formulation of new hypotheses and interventions to address causality.
With its own challenges, methods, and standards, a field is emerging. Some are calling this cross-disciplinary endeavour scientific machine learning.