Synthetic Benchmark Datasets for Finance



In applying machine learning techniques to financial settings, a key element is the training and testing of algorithms on suitable datasets. However, such datasets are currently subject to limitations: real financial data is often scarce or restricted by a number of constraints, and privacy considerations limit the analysis of machine learning models beyond individual companies. The goal of this project is to pave the way towards establishing simulated benchmark datasets for finance. Widely accessible reference datasets have been a significant advantage in other areas, for example in image classification, where MNIST and ImageNet have become de facto standards for the assessment of algorithms. Such benchmarks are currently not available for finance. We aim to begin filling this gap gradually in the course of this project by treating the creation of the dataset as a completion problem that starts from a few carefully selected simulations. While solving the problem fully is an ambitious long-term goal, valuable first steps can already be made within the upcoming semester, mainly by lowering the expectations placed on the “quality” of benchmark datasets and permitting data simulated with currently available generative techniques (see [Buehler, Horvath, Lyons, Perez, Wood 2021]). Delivering a precise quantification of the modelling error (uncertainty) alongside the “imperfect” synthetic data is a crucial step towards setting up first benchmarks, which can then be improved upon in later research.