611. Beyond One‑Hot Labels: KL‑Divergence Training with Empirical Distributions for Faster Optimization
Invited abstract in session MC-12: Robust optimisation and its applications, stream Applications: AI, uncertainty management and sustainability.
Monday, 14:00-16:00, Room: B100/8009
Authors (first author is the speaker)
1. Arman Bolatov, Machine Learning, MBZUAI
Abstract
We introduce an alternative training approach for deep learning classification that enhances standard pipelines by minimizing KL divergence against approximated true underlying distributions rather than cross-entropy on one-hot labels. Our work demonstrates two effective strategies: first, training models with KL divergence against distributions that accurately reflect label relationships and potential ambiguities; second, leveraging locality-sensitive hashing to construct empirical distributions from semantically similar examples. Experiments on image classification tasks show that these approaches lead to faster and more stable optimization. We further establish the effectiveness of teacher-student knowledge transfer: teacher models trained with KL divergence on the approximated true distributions successfully guide new networks, outperforming traditional training methods, particularly when labels contain noise or ambiguity.
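The core change described in the abstract, replacing cross-entropy on one-hot labels with KL divergence against a soft target distribution, can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes PyTorch, and `target_dist` is a hypothetical soft-label tensor standing in for the approximated true label distribution.

```python
# Minimal sketch (assumption: PyTorch; `target_dist` is a hypothetical
# soft-label tensor approximating the true underlying label distribution).
import torch
import torch.nn.functional as F

def kl_loss(logits, target_dist):
    # F.kl_div expects log-probabilities as input and probabilities as target,
    # and computes KL(target_dist || model distribution).
    log_probs = F.log_softmax(logits, dim=-1)
    return F.kl_div(log_probs, target_dist, reduction="batchmean")

# Example: 3 classes. Instead of the one-hot label [0, 0, 1], the target
# spreads some mass onto related classes to encode label ambiguity.
logits = torch.randn(4, 3, requires_grad=True)
target_dist = torch.tensor([[0.05, 0.15, 0.80]]).expand(4, -1)
loss = kl_loss(logits, target_dist)
loss.backward()
```

Note that when `target_dist` is exactly one-hot, the KL objective coincides with standard cross-entropy (the target entropy term is zero), so the formulation strictly generalizes the usual pipeline.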
Keywords
- Applications of continuous optimization
- Large-scale optimization
- Distributed optimization
Status: accepted