1 Introduction
In this paper we take up the study of the fundamental possibilities and limitations of distributed methods for high-dimensional, or nonparametric, problems. The design and study of such methods has attracted substantial attention recently. This is in large part motivated by the ever-increasing size of datasets, which makes it necessary to analyze data while they are distributed over multiple machines and/or cores. Other reasons to consider distributed methods include privacy considerations, or the simple fact that in some situations data are physically collected at multiple locations.
By now a variety of methods are available for fitting nonparametric or high-dimensional models to data in a distributed manner. A (certainly incomplete) list of recent references includes the papers [18, 25, 13, 6, 19, 20, 11, 1, 15]. Some of these papers propose new methods, some study theoretical aspects of such methods, and some do both. The number of theoretical papers on the fundamental performance of distributed methods is, however, still rather limited. In the paper [21]
we recently introduced a distributed version of the canonical signal-in-white-noise model to serve as a benchmark for studying aspects like convergence rates and optimal tuning of distributed methods. We used it to compare the performance of a number of distributed nonparametric methods recently introduced in the literature. The study illustrated the intuitively obvious fact that in order to achieve an optimal bias-variance trade-off, or, equivalently, to find the correct balance between over- and underfitting, distributed methods need to be tuned differently than methods that handle all the data at once. Moreover, our comparison showed that some of the proposed methods are more successful at this than others.
A major challenge and fundamental question for nonparametric distributed methods is whether or not it is possible to achieve a form of adaptive inference; in other words, whether we can design methods that do automatic, data-driven tuning in order to achieve the optimal bias-variance trade-off. We illustrated by example in [21] that naively using methods that are known to achieve optimal adaptation in non-distributed settings can lead to sub-optimal performance in the distributed case. In the recent paper [27], which considers the same distributed signal-in-white-noise model and was written independently of and at the same time as the present paper, it is in fact conjectured that adaptation is not possible in the particular distributed model considered.
In order to study convergence rates and adaptation for distributed methods in a meaningful way, the class of methods should be restricted somehow. Indeed, if there is no limitation on communication or computation, then we could simply communicate all data from the various local machines to a central machine, aggregate them, and use some existing adaptive, rate-optimal procedure. In this paper we consider a setting in which the communication between the local and the central machines is restricted, much in the same way as the communication restrictions imposed in [25] in a parametric framework and recently in the simultaneously written paper [27] in the context of the distributed signal-in-white-noise model we introduced in [21].
In the distributed nonparametric regression model with communication constraints that we consider, we can derive minimax lower bounds for the best possible rate that any distributed procedure can achieve under smoothness conditions on the true regression function. Technically, this essentially relies on an extension of the information-theoretic approach of [25] to the infinite-dimensional setting (this is different from the approach taken in [27], which relies on results from [23]). It turns out that there are different regimes, depending on how much communication is allowed. At one extreme, and in accordance with intuition, if enough communication is allowed, it is possible to achieve the same convergence rates in the distributed setting as in the non-distributed case. At the other extreme, so little communication is allowed that combining different machines does not help; the optimal rate under the communication restriction can then already be obtained by just using a single local machine and discarding the others. The interesting case is the intermediate regime. For that case we show that there exists an optimal strategy that involves grouping the machines in a certain way and letting them work on different parts of the regression function.
These first results on rate-optimal distributed estimators are not adaptive, in the sense that the optimal procedures depend on the regularity of the unknown regression function. The same holds true for the procedure obtained in parallel in [27]. In this paper we go a step further and show that, perhaps contrary to intuition, and contrary to the conjecture in [27], adaptation is in fact possible. Indeed, we exhibit an adaptive distributed method which involves a very specific grouping of the local machines, in combination with a Lepski-type method that is carried out in the central machine. We prove that the resulting distributed estimator adapts to a range of smoothness levels of the unknown regression function and that, up to logarithmic factors, it attains the minimax lower bound.
Although our analysis is theoretical, we believe it contains interesting messages that are ultimately very relevant for the development of applied distributed methods in high-dimensional settings. First of all, we show that, depending on the communication budget, it might be advantageous to group local machines and let different groups work on different aspects of the high-dimensional object of interest. Secondly, we show that it is possible to have adaptation in communication-restricted distributed settings, i.e. to have data-driven tuning that automatically achieves the correct bias-variance trade-off. We note, however, that although our proof of this fact is constructive, the method we exhibit appears to be still rather impractical. We view our adaptation result primarily as a first proof of concept, which will hopefully invite the development of more practical adaptation techniques for distributed settings.
1.1 Notations
For two positive sequences $a_n, b_n$ we use the notation $a_n \lesssim b_n$ if there exists a universal positive constant $C$ such that $a_n \le C b_n$. Along the same lines, $a_n \asymp b_n$ denotes that $a_n \lesssim b_n$ and $b_n \lesssim a_n$ hold simultaneously. Furthermore, we write $a_n \ll b_n$ if $a_n / b_n = o(1)$. Let us denote by $\lceil x \rceil$ and $\lfloor x \rfloor$ the upper and lower integer values of the real number $x$, respectively. The sum $\sum_{i=j}^{k} a_i$ for a positive real number $k$ denotes the sum $\sum_{i=j}^{\lfloor k \rfloor} a_i$. For a set $A$, let $|A|$ denote the size of the set. For $f \in L_2$ we denote the standard $L_2$-norm by $\|f\|_2$, while for bounded functions $\|f\|_\infty$ denotes the $L_\infty$-norm. The function $\operatorname{sign}(x)$ evaluates to $1$ on $[0,\infty)$ and to $-1$ on $(-\infty,0)$. Furthermore, we use the notation $\log_2$ for the base-2 logarithm. Throughout the paper, $c$ and $C$ denote global constants whose value may change from one line to another.
2 Main results
We work with the distributed version of the random design regression model. We assume that we have $m$ ‘local’ machines and that in the $i$th machine we observe pairs of random variables $(X_l^{(i)}, Y_l^{(i)})$, $l = 1, \dots, n/m$ (with $X_l^{(i)} \in [0,1]$), satisfying
(2.1) $Y_l^{(i)} = f(X_l^{(i)}) + \varepsilon_l^{(i)}$,
where the $\varepsilon_l^{(i)}$ are independent standard Gaussian noise variables and $f$ (which is the same for all machines) is the unknown functional parameter of interest. We denote the local distribution and expectation corresponding to the $i$th machine in (2.1) by $P^{(i)}$ and $E^{(i)}$, respectively, and the joint distribution and expectation over all machines $i = 1, \dots, m$
, by $P$ and $E$, respectively. We assume that the total sample size $n$ is known to every local machine. For our theoretical results we will assume that the unknown true function belongs to some regularity class. In our analysis we work with Besov smoothness classes; more specifically, we assume that for some degree of smoothness $s > 0$ we have $f \in B^{s}_{2,\infty}(L)$ or $f \in B^{s}_{\infty,\infty}(L)$. The first class is of Sobolev type, while the second one is of Hölder type, with minimax estimation rates $n^{-s/(1+2s)}$ and $(n/\log n)^{-s/(1+2s)}$, respectively. For precise definitions, see Section B in the supplementary material [suppl:szabo:zanten:2018]. Each local machine carries out (in parallel with the others) a local statistical procedure and transmits the results to a central machine, which produces an estimator for the signal by somehow aggregating the messages received from the local machines. We study these distributed procedures under communication constraints between the local machines and the central machine. We allow the $i$th local machine to send at most $B^{(i)}$ bits on average to the central machine. More formally, a distributed estimator is a measurable function of binary strings, or messages, passed down from the local machines to the central machine. We denote by $Y^{(i)}$ the finite binary string transmitted from machine $i$ to the central machine, which is a measurable function of the local data. For a class of potential signals $\mathcal{F}$, we restrict the communication between the machines by assuming that, for numbers $B^{(1)}, \dots, B^{(m)}$, it holds that $E_f |Y^{(i)}| \le B^{(i)}$ for every $i$ and every $f \in \mathcal{F}$, where $|Y^{(i)}|$ denotes the length of the string $Y^{(i)}$. We refer to the resulting class of estimators as communication-restricted distributed estimators. The number of machines $m$ and the communication constraints $B^{(i)}$ are allowed to depend on the overall sample size $n$; in fact, that is the interesting situation. To alleviate the notational burden somewhat, however, we do not make this explicit in the notation.
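To make the setting concrete, the data-generating mechanism of (2.1) can be sketched in a few lines. The uniform design, the noise level, and all function names below are illustrative assumptions of ours, not the paper's notation:

```python
import math
import random

def simulate_distributed_data(f, n, m, seed=0):
    """Simulate the distributed random design regression model:
    a total of n pairs (X, Y) with Y = f(X) + eps, split evenly
    over m local machines (n // m pairs each).
    Illustrative assumptions: X ~ Uniform[0, 1], eps ~ N(0, 1)."""
    rng = random.Random(seed)
    per_machine = n // m
    data = []
    for _ in range(m):
        machine_pairs = []
        for _ in range(per_machine):
            x = rng.random()                  # random design point in [0, 1]
            y = f(x) + rng.gauss(0.0, 1.0)    # noisy observation of f(x)
            machine_pairs.append((x, y))
        data.append(machine_pairs)
    return data

# each machine only ever sees its own list of (x, y) pairs
machines = simulate_distributed_data(math.sin, n=1000, m=10)
```

Each local machine then computes its message from its own list of pairs only; the communication constraint applies to the length of that message, not to the raw data.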
2.1 Distributed minimax lower bounds for the $L_2$ risk
The first theorem we present gives a minimax lower bound for distributed procedures for the $L_2$ risk, uniformly over Sobolev-type Besov balls; see Section B in the supplement for rigorous definitions.
Theorem 2.1.
Consider $s, L > 0$ and communication constraints $B^{(1)}, \dots, B^{(m)}$. Let the sequence $\delta_n$ be defined as the solution to the equation
(2.2) 
Then in the distributed random design nonparametric regression model (2.1) we have that
Proof.
See Section 3.1. ∎
We briefly comment on the derived result. First of all, note that the quantity $\delta_n$ in (2.2) is well defined, since the left-hand side of the equation is increasing, while the right-hand side is decreasing in $\delta_n$. The proof of the theorem is based on an application of a version of Fano’s inequality, frequently used to derive minimax lower bounds. More specifically, as a first step we find, as usual, a large enough finite subset of the functional space over which the minimax rate is the same as over the whole space. This is done by finding the ‘effective resolution level’ in the wavelet representation of the function of interest and perturbing the corresponding wavelet coefficients, while setting the rest of the coefficients to zero. For $s$-smooth functions and the $L_2$-norm, this effective resolution level is usually of the order $\frac{1}{1+2s}\log_2 n$ in non-distributed models (e.g. [10]). However, in our distributed setting the effective resolution level changes, and can be substantially different from the non-distributed case, as it strongly depends on the number of transmitted bits. The dependence on the expected number of transmitted bits enters the formula through a variation of Shannon’s source coding theorem. Many of the information-theoretic manipulations in the proof are extended and adapted versions of the approach introduced in [25], where similar results were derived in the context of distributed methods with communication constraints for parametric models.
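For orientation, the standard non-distributed calculation behind the effective resolution level balances squared bias against variance. In schematic form (a heuristic, with constants omitted):

```latex
\underbrace{2^{-2js}}_{\text{squared bias at level } j}
\;\asymp\;
\underbrace{\frac{2^{j}}{n}}_{\text{variance at level } j}
\qquad\Longrightarrow\qquad
2^{j_n} \asymp n^{\frac{1}{1+2s}},
```

which recovers the classical minimax rate $n^{-s/(1+2s)}$. In the distributed setting the communication constraint inflates the variance term, shifting the balance point and hence the effective resolution level.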
To understand the result it is illustrative to consider the special case in which the communication constraints are the same for all machines, i.e. $B^{(i)} = B$ for all $i$, for some $B$. We can then distinguish three regimes: (1) the case of a large communication budget; (2) the intermediate case; and (3) the case of a small communication budget.
In regime 1 we have a large communication budget, and by elementary computations we get that the minimum in (2.2) is attained at the second fraction; hence the derived lower bound corresponds to the usual non-distributed minimax rate $n^{-s/(1+2s)}$. In the other extreme case, regime 3, the minimum is attained at the first term in (2.2), so the lower bound is, up to a logarithmic factor, equal to the minimax rate corresponding to the sample size $n/m$ of a single machine. Consequently, in this case it does not make sense to consider distributed methods, since by just using a single machine the best rate can already be obtained (up to a logarithmic factor). In the intermediate case 2 the lower bound interpolates between these extremes. It follows that if we are only allowed to communicate ‘strictly’ less than in case 1, then the lower bound is strictly worse than the minimax rate corresponding to the non-distributed setting.
The findings above are summarized in the following corollary.
Corollary 2.2.
Consider $s, L > 0$, a communication constraint $B$ for each machine, and assume that . Then

(1) if ,
(2) if ,
(3) if ,
2.2 Non-adaptive rate-optimal distributed procedures for the $L_2$ risk
Next we show that the derived lower bounds are sharp by exhibiting distributed procedures that attain the bounds (up to logarithmic factors). We note that it is sufficient to consider only cases 1 and 2, since otherwise distributed techniques do not perform better than standard techniques carried out on one of the local servers. In case 3, therefore, one would probably prefer to use a single local machine instead of a complicated distributed method with (possibly) worse performance.
As a first step, let us consider Daubechies wavelets with sufficiently many vanishing moments (for details, see Section B in the supplement). Then let us estimate the wavelet coefficients of the underlying function in each local problem, i.e. for every resolution level and location index let us construct the corresponding local coefficient estimator, and note that
Since one can transmit only a finite number of bits, we have to approximate the estimators of the wavelet coefficients. Let us take an arbitrary real number and write it in a scientific binary representation. Then let us take the approximation consisting of the same digits up to the $D$th digit after the binary dot, for some $D \in \mathbb{N}$, and truncated there; see also Algorithm 1.
Observe that the length of the approximation (viewed as a binary string) is bounded from above by a number of bits of the order of $D$, plus the number of digits in front of the binary dot. The following lemma asserts that the expected length of the constructed binary string is under control (for sufficiently large $n$ and an appropriate choice of $D$) and that the approximation is sufficiently close to the original estimator.
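The truncation in Algorithm 1 amounts to rounding toward zero at the $D$th binary digit after the binary dot. A minimal sketch (the function name and framing are ours, not the paper's):

```python
import math

def truncate_binary(x, D):
    """Keep the digits of x up to (and including) the D-th binary digit
    after the binary dot and drop the rest, as in Algorithm 1.
    The truncation error is at most 2 ** -D in absolute value."""
    scale = 2 ** D
    sign = -1 if x < 0 else 1
    return sign * math.floor(abs(x) * scale) / scale

approx = truncate_binary(3.14159, 8)
# the string to transmit then consists of a sign bit, the digits in
# front of the binary dot, and D digits after it
```

Choosing $D$ larger makes the approximation finer at the price of longer messages, which is exactly the trade-off controlled by the communication budget.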
Lemma 2.3.
Assume that . Then the approximation of given in Algorithm 1 satisfies that
Proof.
See Section 3.4. ∎
After these preparations we can exhibit procedures attaining (nearly) the theoretical limits obtained in Corollary 2.2.
We first consider case 1. In this case each local machine transmits the approximations (given in Algorithm 1) of its first estimated wavelet coefficients. Then in the central machine we simply average the transmitted approximations to obtain the estimated wavelet coefficients
The final estimator for $f$ is the function with these wavelet coefficients, i.e. the corresponding wavelet series. The method is summarized as Algorithm 2 below.
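In a toy form, the local estimation and central averaging of Algorithm 2 look as follows; the Gaussian stand-in for the wavelet-coefficient estimates and all names are ours:

```python
import random

def local_coefficient_estimates(true_coeffs, noise_scale, rng):
    """Toy stand-in for one machine's estimates of the first wavelet
    coefficients: truth plus independent Gaussian noise."""
    return [c + rng.gauss(0.0, noise_scale) for c in true_coeffs]

def divide_and_conquer(messages):
    """Central machine of Algorithm 2: coordinate-wise average of the
    coefficient estimates transmitted by the m machines."""
    m = len(messages)
    return [sum(msg[k] for msg in messages) / m
            for k in range(len(messages[0]))]

rng = random.Random(1)
truth = [1.0, -0.5, 0.25]
messages = [local_coefficient_estimates(truth, 0.1, rng) for _ in range(50)]
aggregated = divide_and_conquer(messages)
# averaging over m machines reduces the variance of each coefficient
# estimate by a factor of m
```

In the actual procedure each message is additionally truncated to finitely many binary digits (Algorithm 1) before transmission, which is omitted here for brevity.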
We note again that the procedure outlined in Algorithm 2 is just simple averaging, sometimes called “divide and conquer” or “embarrassingly parallel” in the learning literature (e.g. [26], [17]). The following theorem asserts that the constructed estimator indeed attains the lower bound in case 1 (up to a logarithmic factor for communication budgets close to the threshold).
Theorem 2.4.
Let $s, L > 0$ and suppose that we are in case 1. Then the distributed estimator described in Algorithm 2 respects the communication constraints and satisfies
Proof.
See Section 3.2. ∎
Next we consider case 2 of Corollary 2.2, i.e. the case of an intermediate communication restriction. For technical reasons we also impose an additional assumption. Using Algorithm 2 in this case would result in a highly suboptimal procedure, as we prove at the end of Section 3.3. It turns out that under this more severe communication restriction we can do much better if we form different groups of machines that work on different parts of the signal.
We introduce the notation . Then we group the local machines into groups and let the different groups work on different parts of the wavelet domain as follows: the machines in the first group each transmit the approximations of the estimated wavelet coefficients at the coarsest resolution levels; the machines in the next group each transmit the approximations at the next resolution level, and so on; the machines in the last group transmit the approximations at the highest resolution level considered. Then in the central machine we average the corresponding transmitted noisy coefficients in the obvious way. Formally, the aggregated estimator is the function with wavelet coefficients given by
The procedure is summarized as Algorithm 3.
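The grouping step of Algorithm 3 can be sketched as follows; the equal block sizes and function names are simplified illustrations of the scheme described above, not the paper's exact assignment:

```python
def assign_groups(m, num_groups):
    """Partition machine indices 0..m-1 into num_groups equal-sized
    groups; group g is made responsible for the g-th block of wavelet
    coefficients (leftover machines would be discarded)."""
    size = m // num_groups
    return [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]

def aggregate_by_group(messages, groups):
    """Central machine: for each coefficient block, average only the
    messages of the machines in the group responsible for that block.
    messages[i] holds machine i's estimates for its own block."""
    aggregated = []
    for group in groups:
        block_len = len(messages[group[0]])
        aggregated.append([sum(messages[i][k] for i in group) / len(group)
                           for k in range(block_len)])
    return aggregated

groups = assign_groups(m=6, num_groups=3)
```

The point of the grouping is that each machine spends its entire bit budget on a small block of coefficients, instead of spreading it thinly over all of them.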
The following theorem asserts that this estimator attains the lower bound in case 2 (up to a logarithmic factor).
Theorem 2.5.
Let $s, L > 0$ and suppose that we are in case 2. Then the distributed estimator described in Algorithm 3 respects the communication constraints and satisfies
with .
Proof.
See Section 3.3. ∎
2.3 Distributed minimax results for the $L_\infty$ risk
When we replace the $L_2$-norm by the $L_\infty$-norm and correspondingly change the type of Besov balls we consider, we can derive a lower bound similar to Theorem 2.1 (see Section B in the supplement for the rigorous definition of the Besov balls).
Theorem 2.6.
Proof.
See Section 4.1. ∎
The proof of the theorem is very similar to that of Theorem 2.1. The first term on the right-hand side follows from the usual non-distributed minimax lower bound. For the second term we use the standard version of Fano’s inequality. We again consider a large enough finite subset of the parameter space. The effective resolution level for the $L_\infty$-norm in the non-distributed case is . Similarly to the $L_2$ case, the effective resolution level changes in the distributed setting, where it can again be substantially different from the non-distributed case. The rest of the proof follows the same line of reasoning as the proof of Theorem 2.1.
Similarly to the $L_2$-norm, we consider again the specific case where all communication budgets are taken to be equal, i.e. $B^{(i)} = B$. One can easily see that there are again three regimes of $B$ (slightly different compared to the $L_2$ case).
Corollary 2.7.
Consider $s, L > 0$, a communication constraint $B$ for each machine, and assume that .

(1) If , then
(2) If , then
(3) If , then
Next we provide matching upper bounds (up to a logarithmic factor) in the first two cases, i.e. 1 and 2. In the third case the lower bound matches (up to a logarithmic factor) the minimax rate corresponding to a single local machine; hence it is not advantageous at all to develop complicated distributed techniques, as a single server with only a fraction of the total information performs at least as well. In the previous section, dealing with $L_2$ estimation, we provided two algorithms (one where the machines had the same tasks and one where the machines were divided into groups and assigned different tasks) to highlight the differences between the cases. Here, for simplicity, we combine the algorithms into a single one, but essentially the same techniques are used as before.
In each local machine we compute the local estimators of the wavelet coefficients and transmit finite-digit approximations of them, as in the $L_2$ case. Then let us divide the machines into equal-sized groups (a single group corresponds to case 1, while a larger number of groups corresponds to case 2). Similarly to before, the machines in the first group transmit the approximations of the estimated wavelet coefficients in the first block, and so on; the machines in the last group transmit the approximations for the last block. In the central machine we average the corresponding transmitted coefficients in the obvious way, i.e. the aggregated estimator is the function with wavelet coefficients given by
where . The procedure is summarized as Algorithm 4 and the (up to a logarithmic factor) optimal behaviour is given in Theorem 2.8 below.
Theorem 2.8.
Let $s, L > 0$. Then the distributed estimator described in Algorithm 4 respects the communication constraints and satisfies

for ,

for ,
with .
Proof.
See Section 4.2. ∎
We can draw similar conclusions for the $L_\infty$-norm as for the $L_2$-norm. If we do not transmit a sufficient number of bits (at least up to a logarithmic factor) from the local machines to the central one, then the lower bound from the theorem exceeds the minimax risk corresponding to the non-distributed case. Furthermore, by transmitting a sufficient number of bits (i.e. up to a logarithmic factor) corresponding to the class, the lower bound coincides with the non-distributed minimax estimation rate.
2.4 Adaptive distributed estimation
The (almost) rate-optimal procedures considered so far have in common that they are non-adaptive, in the sense that they all use knowledge of the regularity level of the unknown functional parameter of interest. In this section we exhibit a distributed algorithm attaining the lower bounds (up to a logarithmic factor) across a whole range of regularities simultaneously. In the non-distributed setting it is well known that this is possible, and many adaptation methods exist, including for instance the block Stein method, Lepski’s method, wavelet thresholding, and Bayesian adaptation methods, to mention but a few (e.g. [22, 10]). In the distributed case the matter is more complicated. Using the usual adaptive tuning methods in the local machines will typically not work (see [21]), and in fact it was recently conjectured that adaptation, if at all possible, would require more communication than is allowed in our model (see [27]).
We will show, however, that in our setting, if all machines have the same communication restriction $B$, it is possible to adapt to regularities ranging in the interval $[s_{\min}, s_{\max}]$, where
(2.3) 
and $s_{\max}$ is the regularity of the considered Daubechies wavelet, which can be chosen arbitrarily large. Note that $s_{\min}$ is well defined. If the true regularity lies in this interval, then we are in one of the nontrivial cases 1 or 2 of Corollary 2.2. We will construct a distributed method which, up to logarithmic factors, attains the corresponding lower bounds without using knowledge about the regularity level $s$.
Remark 2.9.
We provide some examples of the value of $s_{\min}$ for different choices of $m$ and $B$. Taking we have for all that . For and we get . For and we have that . Note that it is intuitively clear that when the number of machines is large, it is typically advantageous to use a distributed method compared to a single local machine, as we would lose too much information in the latter case. However, if we have a small number of machines and can transmit only a very limited amount of information, then it might be more advantageous to use only a single machine for inference.
In the non-adaptive case we saw that different strategies were required to attain the optimal rate, with case 2 requiring a particular grouping of the local machines. The cutoff between cases 1 and 2 depends, however, on the value of the smoothness $s$, so in the present adaptive setting we do not know beforehand in which of the two cases we are. In order to tackle this problem we introduce a somewhat more involved grouping of the machines, which essentially gives us the possibility to carry out both strategies simultaneously. This is combined with a modified version of Lepski’s method, carried out in the central machine, ultimately leading to (nearly) optimal distributed concentration rates for every regularity level in the considered range simultaneously. (We note that in our distributed regression setting, deriving an appropriate version of Lepski’s method requires some non-standard technical work; see Section 3.5.)
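A generic, unmodified Lepski-type selection rule, of the kind adapted in the paper, can be sketched as follows. The distance, the error bounds `sigmas`, and the constant `kappa` are illustrative placeholders of ours; the actual method in Section 3.5 uses non-standard modifications:

```python
import math

def lepski_select(estimates, sigmas, kappa=1.0):
    """Generic Lepski-type model selection: given coefficient vectors
    estimates[0..J-1] of increasing resolution and stochastic-error
    bounds sigmas[j] (increasing in j), pick the smallest index j such
    that estimates[j] stays within kappa * sigmas[k] of every finer
    estimate estimates[k], k > j."""
    def dist(a, b):
        # pad the shorter vector with zeros: a lower-resolution
        # estimator simply sets the finer coefficients to zero
        length = max(len(a), len(b))
        a = a + [0.0] * (length - len(a))
        b = b + [0.0] * (length - len(b))
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    J = len(estimates)
    for j in range(J):
        if all(dist(estimates[j], estimates[k]) <= kappa * sigmas[k]
               for k in range(j + 1, J)):
            return j
    return J - 1
```

In the paper's setting the candidate estimators are the preliminary estimators built from the grouped transmissions, and the selection is carried out entirely in the central machine.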
As a first step in our adaptive procedure we divide the machines into groups. To begin with, let us take the first machines and denote the set of their index numbers by . Then the remaining machines are split into equally sized groups (for simplicity each group has machines and the leftovers are discarded), where
The corresponding sets of indexes are denoted by . Note that , for . Then the machines in the group transmit the approximations (with in Algorithm 1) of the local estimators of the wavelet coefficients , for , with
to the central machine. The machines in group , , will be responsible for transmitting the coefficients at resolution level . First for every , the machines in group are split again into equal size groups (for simplicity each group has machines and the leftovers are discarded again), denoted by . A machine in one of the groups for transmits the approximations (again with in Algorithm 1) of the local estimators of the wavelet coefficients , for and to the central machine.
In the central machine we first average the transmitted approximations of the corresponding coefficients. We define
(2.4) 
Using these coefficients we can construct for every the preliminary estimator
(2.5) 
This gives us a sequence of estimators from which we select the appropriate one using a modified version of Lepski’s method. We consider