Synthetic Benchmark Datasets for Finance
SyBenDaFin
Abstract
This project addresses a key bottleneck in applying machine learning to finance: the lack of accessible, standardized benchmark datasets. Real financial data are often scarce, proprietary, or restricted by privacy constraints, which limits transparent evaluation and comparison of algorithms. The project aims to lay the foundations for synthetic benchmark datasets for finance, analogous to MNIST or ImageNet in computer vision. Building on recent advances in signature methods and generative modeling, it treats dataset construction as a completion problem, starting from a small number of carefully selected simulations. Signature-based quantile regression models are used to generate realistic path-valued time series, while payoff profiles that depend on full path histories are approximated using polynomial and Padé-type expansions in the signature space. A central emphasis is placed on quantifying model uncertainty, so that imperfections in the simulated data are explicitly measured and documented and meaningful benchmarks become possible at an early stage.
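As a rough illustration of the two ingredients named in the abstract, the sketch below computes the first two signature levels of a piecewise-linear path and evaluates the pinball loss that quantile regression minimizes. It is not the project's implementation: the function names `signature_level2` and `pinball_loss`, the truncation at level 2, and the toy path are all assumptions chosen for brevity.

```python
def signature_level2(path):
    """Levels 1 and 2 of the signature of a piecewise-linear path.

    `path` is a list of points, each a list of d coordinates.
    Hypothetical helper for illustration only.
    """
    d = len(path[0])
    s1 = [0.0] * d                       # level 1: total increment
    s2 = [[0.0] * d for _ in range(d)]   # level 2: iterated integrals
    for prev, curr in zip(path, path[1:]):
        delta = [c - p for p, c in zip(prev, curr)]
        # Chen's identity for appending one linear segment:
        # S2 <- S2 + S1 (x) delta + 0.5 * delta (x) delta
        for i in range(d):
            for j in range(d):
                s2[i][j] += s1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        # Update level 1 only after level 2 has used the old value.
        for i in range(d):
            s1[i] += delta[i]
    return s1, s2


def pinball_loss(y_true, y_pred, tau):
    """Mean pinball (quantile) loss at level tau in (0, 1)."""
    losses = [
        max(tau * (y - q), (tau - 1.0) * (y - q))
        for y, q in zip(y_true, y_pred)
    ]
    return sum(losses) / len(losses)


# Toy two-dimensional path: one step right, then one step up.
level1, level2 = signature_level2([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
levy_area = 0.5 * (level2[0][1] - level2[1][0])
```

In a signature-based quantile regression, features such as `level1` and `level2` would serve as regressors, and a model would be fitted by minimizing `pinball_loss` at one or more levels `tau`. The antisymmetric part of level 2 (`levy_area`) is one example of path-dependent information that increments alone do not capture.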




