Accurate modelling and prediction of speech-sound durations is important for generating more natural synthetic speech. Deep neural networks (DNNs) offer powerful models, and large, found corpora of natural speech are easily acquired for training them. Unfortunately, poor quality control (e.g., transcription errors) and phenomena such as reductions and filled pauses complicate duration modelling from found speech data. To mitigate the issues caused by these idiosyncrasies, we propose to incorporate methods from robust statistics into speech synthesis. Robust methods can disregard ill-fitting training-data points, such as errors and other outliers, to better describe the typical case. For instance, parameter estimation can be made robust by replacing maximum likelihood with a robust estimation criterion based on the density power divergence (also known as the beta-divergence). Alternatively, a standard approximation for output generation with mixture density networks (MDNs) can be interpreted as a robust output-generation heuristic. To evaluate the potential benefits of robust techniques, we adapted data from a free online audiobook to build several DNN-based text-to-speech systems, with either conventional or robust duration prediction. Our objective results indicate that the robust methods described typical durations better than the baselines. Additionally, listeners significantly preferred synthetic speech generated using the robust methods in a subjective evaluation.
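
As background for the robust estimation criterion mentioned above, the following is a brief sketch of the density power divergence (beta-divergence) objective in its standard empirical form; the exact formulation and parameterisation used in the work may differ. For a parametric model density $f_\theta$ and i.i.d. observations $x_1,\dots,x_n$, the criterion to be minimised is

\[
  L_\beta(\theta)
  \;=\; \int f_\theta(x)^{1+\beta}\,\mathrm{d}x
  \;-\; \Bigl(1 + \tfrac{1}{\beta}\Bigr)\,\frac{1}{n}\sum_{i=1}^{n} f_\theta(x_i)^{\beta},
  \qquad \beta > 0 .
\]

As $\beta \to 0$, this criterion reduces to maximum likelihood (the negative average log-likelihood, up to terms that do not depend on $\theta$), while larger $\beta$ down-weights observations assigned low model density, which is what makes the resulting estimator robust to outliers.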