publications

* denotes equal contribution

2024

NeurIPS

Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training

Atli Kosson, Bettina Messmer, and Martin Jaggi

In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

arXiv link

ICML

Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks

Atli Kosson*, Bettina Messmer*, and Martin Jaggi

In ICML, 2024

arXiv link

NeurIPS

Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro von Werra, and 1 more author

In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

arXiv link

AAAI

Ghost Noise for Regularizing Deep Neural Networks

Atli Kosson, Dongyang Fan, and Martin Jaggi

Proceedings of the AAAI Conference on Artificial Intelligence, 2024

arXiv link

2023

NeurIPS

Multiplication-Free Transformer Training via Piecewise Affine Operations

Atli Kosson and Martin Jaggi

In Thirty-seventh Conference on Neural Information Processing Systems, 2023

arXiv link

2021

MLSys

Pipelined Backpropagation at Scale: Training Large Models without Batches

Atli Kosson*, Vitaliy Chiley*, Abhinav Venigalla, Joel Hestness, and Urs Köster

In Proceedings of Machine Learning and Systems, 2021

arXiv link

2020

Workshop

Adaptive Braking for Mitigating Gradient Delay

Abhinav Venigalla*, Atli Kosson*, Vitaliy Chiley, and Urs Köster

In ICML 2020 Workshop on Beyond First Order Methods in Machine Learning Systems, 2020

arXiv link

2019

NeurIPS

Online Normalization for Training Neural Networks

Vitaliy Chiley, Ilya Sharapov, Atli Kosson, Urs Köster, Ryan Reece, and 3 more authors

Advances in Neural Information Processing Systems, 2019

arXiv link
