Simulation is increasingly being used for generating large labelled datasets in many machine learning problems. Recent methods have focused on adjusting simulator parameters with the goal of maximising accuracy on a validation task, usually relying on REINFORCE-like gradient estimators. However, these approaches are very expensive, as they treat the entire data generation, model training, and validation pipeline as a black box and require multiple costly objective evaluations at each iteration. We propose an efficient alternative for optimal synthetic data generation, based on a novel differentiable approximation of the objective. This allows us to optimize the simulator, which may be non-differentiable, with only one objective evaluation per iteration and little overhead. We demonstrate on a state-of-the-art photorealistic renderer that the proposed method finds the optimal data distribution faster (up to 50×), with significantly reduced training data generation (up to 30×) and better accuracy (+8.7%) on real-world test datasets than previous methods.
@article{arxiv.2008.08424,
title = {AutoSimulate: (Quickly) Learning Synthetic Data Generation},
author = {Harkirat Singh Behl and Atılım Güneş Baydin and Ran Gal and Philip H. S. Torr and Vibhav Vineet},
journal = {arXiv preprint arXiv:2008.08424},
year = {2020}
}