SISSO: Statistical versus Computational Complexity
Date:
In this talk, we investigate several fundamental questions about the SISSO method for identifying interpretable symbolic regression models.
Abstract: SISSO (Sure Independence Screening and Sparsifying Operator) is a sparse linear modelling method designed to efficiently fit symbolic regression models of materials properties. It iteratively applies a sparsifying operator, such as the LASSO or best-subset search, to a growing pool of variables selected through efficient marginal-effect screening. A key parameter of this procedure is the number of variables s that is added to the pool in each iteration. Practitioners working with a fixed training dataset tend to maximise this parameter subject to the available computational resources, and notable successes have been achieved this way. However, from a methodological perspective, several aspects of this approach are not fully understood, which likely hinders further progress such as the development of appropriate uncertainty quantification for SISSO. Open questions include how s affects the number of data points required to guarantee the discovery of the optimal model with high confidence, whether good values of s can be identified in a data-driven fashion, and, most fundamentally, which choices of s are even computationally feasible. In this talk, we present partial answers to these questions. Firstly, we provide theoretical results on the scalability of s for three sparsifying operators, the LASSO, best-subset search, and the adaptive LASSO, followed by an empirical investigation of the data efficiency of each SISSO variant when choosing an optimal computationally feasible s. Finally, we investigate how that data efficiency changes when s is chosen in a data-driven fashion for individual prediction problems. Our results indicate that the adaptive LASSO is the preferable choice among the three investigated sparsifying operators: it permits a much larger range of computationally feasible s values and, even when s is picked from that range in a data-driven way, it tends to identify optimal models from substantially smaller datasets than the second-best choice, best-subset search.
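To make the role of s concrete, the sketch below shows one plausible SISSO-style loop: in each iteration, s new variables are screened into the pool by their marginal correlation with the current residual, and a sparsifying operator is then applied to the whole pool. This is a minimal illustration only, using scikit-learn's Lasso as a stand-in for the sparsifying operator; the function name sisso_fit and all defaults are assumptions for this sketch, not the implementation discussed in the talk.

    # Minimal sketch of a SISSO-style iteration (illustrative, not the authors' code).
    # Assumes: candidate features are the (standardised) columns of X, and
    # scikit-learn's Lasso plays the role of the sparsifying operator.
    import numpy as np
    from sklearn.linear_model import Lasso

    def sisso_fit(X, y, s, n_iterations, alpha=0.01):
        """Screen s new features per iteration, then sparsify the growing pool."""
        pool = []                    # indices of features admitted so far
        residual = y - y.mean()      # start screening against the centred target
        coef = None
        for _ in range(n_iterations):
            # Sure independence screening: rank unused features by the absolute
            # correlation of each column with the current residual.
            scores = np.abs(X.T @ residual)
            scores[pool] = -np.inf   # never re-admit a pooled feature
            pool.extend(int(j) for j in np.argsort(scores)[::-1][:s])
            # Sparsifying operator on the pool (here: the LASSO, one of the
            # three operators compared in the talk).
            model = Lasso(alpha=alpha).fit(X[:, pool], y)
            residual = y - model.predict(X[:, pool])
            coef = model.coef_
        return pool, coef

    # Example usage on synthetic data (illustration only):
    # rng = np.random.default_rng(0)
    # X = rng.standard_normal((200, 5000)); y = X[:, 3] - 2 * X[:, 17]
    # pool, coef = sisso_fit(X, y, s=50, n_iterations=2)

The sketch makes the trade-off discussed above visible: a larger s enlarges the pool handed to the sparsifying operator, which can improve the chance of admitting the truly relevant features but raises the operator's computational (and, for best-subset search, combinatorial) cost.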