The remarkable ability of deep neural networks to perfectly fit training data when optimized by gradient-based algorithms has yet to be fully explained theoretically. Recent theoretical explanations require networks that are wider by orders of magnitude than those used in practice. In this work, we take a step towards closing the gap between theory and practice. We show that a randomly initialized deep neural network with ReLU activation converges to a global minimum in a logarithmic number of gradient-descent iterations, under a considerably milder condition on its width. Our analysis is based on a novel technique of training a network with fixed activation patterns. We study the unique properties of this technique that enable the improved convergence, and show that the resulting network can be transformed at any time into an equivalent ReLU network of reasonable size.
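
For intuition only, the following is a minimal sketch of the fixed-activation-pattern idea as described above; the concrete setup (a one-hidden-layer network, squared loss, fixed random output weights, plain gradient descent) is our own illustrative assumption, not necessarily the construction used in the paper. The ReLU gates are computed once from the initial weights and then frozen as binary masks, so the prediction is linear in the trained weights while gradient descent runs.

    import numpy as np

    # Sketch of training with activation patterns frozen at initialization
    # (assumed setting: one hidden layer, squared loss, fixed output weights).
    rng = np.random.default_rng(0)
    n, d, m = 32, 10, 256                               # samples, input dim, hidden width
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)

    W = rng.standard_normal((m, d)) / np.sqrt(d)        # hidden weights (trained)
    a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)    # output weights (kept fixed)
    D = (X @ W.T > 0).astype(float)                     # activation patterns, frozen hereafter

    lr = 0.1
    for _ in range(200):
        preds = (D * (X @ W.T)) @ a                     # masked pre-activations replace ReLU
        resid = preds - y
        grad_W = ((D * np.outer(resid, a)).T @ X) / n   # gradient of 0.5 * mean squared error
        W -= lr * grad_W

    preds = (D * (X @ W.T)) @ a
    print("final training loss:", 0.5 * np.mean((preds - y) ** 2))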