Distributed Cloud-Based Machine Learning

This post is authored by Dhruv Mahajan, Sundararajan Sellamanickam and Keerthi Selvaraj, Researchers at Microsoft’s Cloud & Information Services Lab (CISL) and at Microsoft Research.

Enterprises of all stripes are amassing huge troves of data assets, e.g. logs pertaining to user behavior, system access, usage patterns and much more. Companies will benefit enormously by using the power of cloud services platforms such as Microsoft Azure not merely to host such data or perform classic “look-in-the-rear-view mirror” BI, but by applying the power and scale of cloud-based predictive analytics. Using modern tools such as Azure Machine Learning, for instance, companies can obtain actionable insights about how the future of their businesses might evolve – insights that can give them a competitive edge.

Gathering and maintaining “big data” is becoming a common need across many applications. As data sizes explode, it becomes necessary to store data in a distributed fashion. In many applications, the collection of data itself is a decentralized process, naturally leading to distributed data storage. In such situations it becomes necessary to build machine learning (ML) solutions over distributed data using distributed computing. Examples of such situations include click-through rate estimation via logistic regression in the online advertising universe, or deep learning solutions applied to huge image or speech training datasets, or log analytics to detect anomalous patterns.

Efficient distributed training of ML solutions on a cluster, therefore, is an important focus area at the Microsoft Cloud & Information Services Lab (CISL, that’s pronounced “sizzle” :-)) to which the authors belong. In this post, we delve a bit into this topic, discuss a few related issues and our recent research where we try to addresses some of the same. Some of the details presented here are rather technical, but we attempt to explain the central ideas in as simple a manner as possible. Anybody interested in doing distributed ML on big data will gain by understanding these ideas, and we look forward to your comments and feedback too.

Choosing the Right Infrastructure

In a recent post, John Langford described the Vowpal Wabbit (VW) system for fast learning, where he briefly touched on distributed learning over terascale datasets. Most ML algorithms being iterative in nature, choosing the right distributed framework to run them is crucial.

Map Reduce and its open source implementation, Hadoop, are popular platforms for distributed data processing. However, they are not well-suited for iterative ML algorithms as each iteration has large overheads – e.g. job scheduling, data transfer and data parsing.

Better alternatives would be to add communication infrastructure such as All Reduce, which is compatible with Hadoop (as in VW), or to employ newer distributed frameworks such as REEF which support efficient iterative computation.


Current state-of-the-art algorithms for distributed ML such as the one in VW are based on the Statistical Query Model (SQM). In SQM, learning is based on doing some computation on each data point and then accumulating the resulting information over all the data points. As an example, consider linear ML problems where the output is formed by doing a dot product of a feature vector with the vector of weight parameters. This includes important predictive models such as logistic regression, SVMs and least squares fitting. In this case, at each iteration, the overall gradient of the training objective function is computed by summing the gradients associated with individual data points. Each node forms the partial gradient corresponding to the training data present in that node and then an All Reduce operation is used to get the overall gradient.

Communication Bottleneck

Distributed computing often faces a critical bottleneck in the form of a large ratio of computation to communication bandwidth. E.g. it is quite common to see communication being 10x to 50x slower than computation.

Let Tcomm and Tcomp denote the per iteration time for communication and computation respectively. Thus, the overall cost of an iterative ML algorithm can be written as:

Toverall = (Tcomm + Tcomp) * #iterations

Tcomp typically decreases linearly with increasing number of nodes while Tcomm increases or remains constant (in best implementations of All Reduce). ML solutions involving Big Data often have a huge number of weight parameters (d) that must be updated and communicated between the computing nodes of a cluster in each iteration. Moreover, there are other steps like gradient computation in SQM that also require O(d) communication. The situation is even worse in Map Reduce where each iteration requires a separate Map Reduce job. Hence, Tcomm is large when d is large. SQM does not place sufficient emphasis on the inefficiency associated with this.

Overcoming the Communication Bottleneck

Our recent research addresses this important issue. It is based on the following observation: Consider a scenario in which Tcomm, the time for communicating the weight parameters between nodes, is large. In each iteration, what happens with a standard approach such as SQM is that Tcomp, the time associated with computations within each node, is a lot less than Tcomm. So we ask the following question: Is it possible to modify the algorithm and its iterations in such a way that Tcomp is increased to come closer to Tcomm, and, in the process, make the algorithm converge to the desired solution in fewer iterations?

Of course, answering this question is non-trivial since it requires a fundamental algorithmic change.

More Nitty-Gritty Details

Consider the ML problem of learning linear models. In our algorithm, the weight updates and gradients in the nodes are shared in a way similar to the SQM based method. However, at each node, the gradient (computed using All Reduce) and the local data in the node are used in a non-trivial way to form a local approximation of the global problem. Each node solves its approximate problem to form local updates of weight variables. Then the local updates from all nodes are combined together to form a global update of the weight variables. Note that solving the approximate problem leads to increased computation in each node, but it does not require any extra communication. As a result Tcomp increases and, since Tcomm is already high, the per-iteration cost is not affected significantly. However, since each node now is solving the approximate global view of the problem, the number of iterations needed to solve the problem is reduced significantly. Think of a case where the amount of data is so large that the data present within each node is itself sufficient to do good learning. For this case, the approximate problem formed in each node is close to the global problem; the result is that our algorithm requires just one or two iterations while SQM based methods need hundreds or even thousands of iterations. In addition, our approach is flexible and allows a class of approximations rather than a specific one. In general, our algorithm is almost always faster than SQM and, on average, about two to three times faster.

One could also think of distributing the weight vector over many cluster nodes and setting up the distributed data storage and computation in such a way that all updates for any one weight variable happen only in one cluster node. This turns out to be attractive in some situations, for example when one is interested in zeroing out irrelevant weight variables in linear ML problems or for doing distributed deep net training. Here again, we have developed specialized iterative algorithms that do increased computation in each node while decreasing the number of iterations.


We focused above on algorithms suited to communication-heavy situations. But not all problems solved in practice are of this nature. For general situations, there exist a range of good distributed ML algorithms in the recent academic literature. But a careful evaluation of these methods has not been performed. Best methods are finding their way into cloud ML libraries.

Automating Distributed ML to Suit User Needs

There is also another important side to the whole story. Users of distributed ML on the cloud have a variety of needs. They may be interested in minimizing the total solution time, or the cost in dollars associated with the solution. Users may be willing to sacrifice accuracy a bit while optimizing the above mentioned variables. Alternatively, they may be keen to get the best accuracy irrespective of time and cost. Given a problem description, a varied set of such user specifications, and details of the system configuration available for use, it is important to have an automatic procedure for choosing the right algorithm and its parameter settings. Our current research focuses on this aspect.

Automated distributed ML solutions will be one of the important areas/considerations for Azure ML as we evolve our product and expand our offering in the future.

Dhruv, Sundar and Keerthi