The Handbook of Growing: An Empirical Guide for Practitioners
- Sponsored by: TUM Algorithmic Machine Learning and Explainable AI, Munich Data Science Institute (MDSI)
- Project lead: Dr. Ricardo Acevedo Cabra
- Scientific lead: Ferdinand Kapl, Vincent Pauline, Tobias Höppe, Prof. Stefan Bauer
- Term: Summer semester 2026
- Application deadline: Sunday 25.01.2026
Apply to this project here
Summary
We will build a practical, compute-aware “handbook” of growing strategies for deep neural networks (when and how to add depth or width during training to increase the model's parameter count) through a systematic ablation study, simple rules of thumb, and open-source tooling.
Motivation & Background
Growing architectures (adding layers or width over time) can save compute relative to strong baselines trained from scratch and can exhibit an inductive bias toward improved reasoning in text (Saunshi et al., 2024). For frontier LLMs and VLMs, where pre-training and fine-tuning incur substantial cost, well-designed growth schedules can yield sizable wall-clock and energy savings while preserving or improving downstream capabilities. Yet practitioners lack clear guidance on when and how to grow. This project distills empirical best practices into actionable recipes by systematically investigating the design space.
Project Goals
- G1: When-to-grow. Derive simple, robust triggers (e.g., loss-improvement plateaus, gradient/curvature surrogates) and stage lengths (e.g., equal tokens or equal FLOPs per stage); a trigger sketch follows this list.
- G2: How-to-grow. Compare existing growth operators (growing in depth, in width, or jointly) and propose novel operators with improved knowledge reuse.
- G3: Optimizer & HP transfer. Provide recipes for mapping optimizer state (e.g., momenta, Adam statistics), LR warm-ups/decays, and regularization across growth events.
- G4: Open toolkit & report. Release a lightweight PyTorch library with scripts and a short Handbook summarizing recommendations and trade-offs.
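To make G1 concrete, here is a minimal sketch of a loss-plateau growth trigger. The class name `PlateauGrowthTrigger`, the window size, and the improvement threshold are illustrative assumptions, not project recommendations.

```python
from collections import deque


class PlateauGrowthTrigger:
    """Signals a growth event when the smoothed training loss stops improving."""

    def __init__(self, window: int = 500, min_rel_improvement: float = 0.01):
        self.window = window
        self.min_rel_improvement = min_rel_improvement
        self.losses = deque(maxlen=2 * window)  # keep two consecutive windows

    def update(self, loss: float) -> bool:
        """Record one training-step loss; return True once a plateau is detected."""
        self.losses.append(loss)
        if len(self.losses) < 2 * self.window:
            return False  # not enough history yet
        history = list(self.losses)
        prev = sum(history[:self.window]) / self.window   # older window mean
        curr = sum(history[self.window:]) / self.window   # recent window mean
        # Fire when the recent window improved on the older one by less than
        # the relative threshold.
        return (prev - curr) < self.min_rel_improvement * abs(prev)
```

In a training loop this would be checked once per step, e.g. `if trigger.update(loss.item()): grow(...)`, where `grow` stands in for whichever operator is chosen; comparing such a detector against fixed equal-token or equal-FLOP stage lengths is exactly the kind of ablation G1 targets.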
Key Methods (student-implemented)
Implement, investigate and improve:
- Growth triggers. Loss-slope/plateau detectors, scaling-law-derived thresholds, and simple but strong baselines.
- Operators. Depth growth (e.g., layer stacking) vs. width growth (e.g., learnable mappings).
- Optimizer state mapping. Preserving training dynamics; learning-rate (and other hyperparameter) adaptation post-growth (both sketched after this list).
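The sketch below illustrates one simple depth-growth operator (duplicating every layer, in the spirit of stacking) together with a naive mapping of Adam statistics onto the duplicated parameters. The helper names and the duplicate-every-layer scheme are illustrative assumptions, not the operators the project will necessarily adopt.

```python
import copy

import torch
import torch.nn as nn


def grow_depth_by_stacking(layers: nn.ModuleList) -> tuple[nn.ModuleList, dict]:
    """Duplicate each layer in place (1, 2, ... -> 1, 1, 2, 2, ...).

    Returns the grown ModuleList and a map from each new layer index to the
    old layer index it was copied from, so optimizer state can be re-attached.
    """
    grown, source = [], {}
    for old_idx, layer in enumerate(layers):
        for _ in range(2):  # each old layer contributes two copies
            grown.append(copy.deepcopy(layer))
            source[len(grown) - 1] = old_idx
    return nn.ModuleList(grown), source


def transfer_adam_state(old_opt: torch.optim.Adam,
                        old_layers: nn.ModuleList,
                        new_layers: nn.ModuleList,
                        source: dict,
                        lr: float) -> torch.optim.Adam:
    """Create a fresh Adam over the grown layers and copy the accumulated
    statistics (exp_avg, exp_avg_sq, step) from each new parameter's source."""
    new_opt = torch.optim.Adam(new_layers.parameters(), lr=lr)
    for new_idx, old_idx in source.items():
        old_params = list(old_layers[old_idx].parameters())
        new_params = list(new_layers[new_idx].parameters())
        for p_old, p_new in zip(old_params, new_params):
            if p_old in old_opt.state:  # parameter has accumulated statistics
                new_opt.state[p_new] = {
                    k: v.clone() if torch.is_tensor(v) else v
                    for k, v in old_opt.state[p_old].items()
                }
    return new_opt


# Usage (illustrative): grow a toy stack of linear layers mid-training.
if __name__ == "__main__":
    dim = 32
    layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(4)])
    opt = torch.optim.Adam(layers.parameters(), lr=3e-4)

    # ... train for a while, then a growth trigger fires ...
    loss = sum(layer(torch.randn(8, dim)).pow(2).mean() for layer in layers)
    loss.backward()
    opt.step()

    grown_layers, source = grow_depth_by_stacking(layers)
    opt = transfer_adam_state(opt, layers, grown_layers, source, lr=3e-4)
```

A width-growth operator would instead expand weight matrices with a learnable (or function-preserving) mapping; the same source-index bookkeeping then determines how Adam statistics are tiled, rescaled, or re-initialized after the growth event.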
Open-Source Data
- Vision: CIFAR-10/100, ImageNet.
- Language: OpenWebText, Fineweb-Edu.
Expected Outcomes
- A concise Handbook with practical guidelines and scaling laws for growing.
- An open-source PyTorch library: growth schedulers, operator implementations, optimizer-state transfer, and compute-normalized evaluation.
Student Profile
Team of up to five students; strong PyTorch, ML engineering, and HPC experience; interest in modern architectures and alternative training paradigms.
References
Saunshi, Nikunj, Stefani Karp, Shankar Krishnan, Sobhan Miryoosefi, Sashank Jakkam Reddi, and Sanjiv Kumar (2024). “On the inductive bias of stacking towards improving reasoning”. In: Advances in Neural Information Processing Systems 37, pp. 71437–71464.
