 ### Transcription

Japan. J. Math. 14, 27–65 (2019)DOI: 10.1007/s11537-018-1727-9Information complexity and applications?Takagi Lectures NotesMark BravermanReceived: 3 January 2018 / Revised: 6 December 2018 / Accepted: 27 December 2018Published online: 5 March 2019 The Mathematical Society of Japan and Springer Japan KK, part of Springer Nature 2019Communicated by: Toshiyuki KobayashiAbstract. This paper is a lecture note accompanying the 19th Takagi Lectures lectures in July2017 at Kyoto University.We give a high-level overview of information complexity theory and its connections to communication complexity. We then discuss some fundamental properties of information complexity,and applications to direct sum theorems and to exact communication bounds. We conclude withsome open questions and directions.Keywords and phrases: communication complexity, information complexityMathematics Subject Classification (2010): 68Q17, 94A15, 94A241. Information theory1.1. Prelude: noiseless one-way communicationWithin the theoretical study of computation and communication, Shannon’s information theory stands out as one of the most complete and satisfying theories.The primary goal of information theory is to broadly understand the theoretical?This article is based on the 19th Takagi Lectures that the author delivered at Research Institutefor Mathematical Science, Kyoto University on July 8 and 9, 2017.M. B RAVERMANDepartment of Computer Science, Princeton University35 Olden Street, Princeton, NJ 08540-5233 USA andInstitute for Advanced Study1 Einstein Drive, Princeton, NJ 08540, USA(e-mail: [email protected])

28M. Bravermanlimits of communication in various scenarios, and to devise ways to attain orapproach those limits. The simplest scenario studied in information theory isthat of two terminals, or parties, Alice and Bob1 , communicating over a channelC . Alice wishes to transmit a message X to Bob. In this paper, unless statedotherwise, we will assume that all messages have a prior distribution2 . Thus, Xis a random variable, distributed according to some distribution .Let us assume, for simplicity, that C is the Noiseless Binary Channel—Alicecan send bits to Bob using C , at a cost of one unit per bit. Let us denote byC.X/ the (expected) cost of sending a sample X over the noiseless binarychannel. As a running example, let us adopt D Unif f1; 2; 3g—that is, T is auniformly random element of f1; 2; 3g. In this case, an example of an optimalcodebook that Alice can use in encoding X as a binary string is 1 ! ‘0’, 2 !‘10’, 3 ! ‘11’, and C.T / D 13 1 C 13 2 C 13 2 D 53 . Note that we requirethe encoding to be “prefix-free”, so that no codeword is a prefix of anothercodeword. The encoding 1 ! ‘0’, 2 ! ‘1’, 3 ! ‘11’ appears to be moreefficient, but cannot be used to encode a sequence of symbols, as ‘111’ canbe decoded as either ‘23’, ‘32’, or ‘222’; this does not happen with prefix-freecodes.Denote by T n a sequence of n i.i.d. copies of T , that is, a uniformly randomsequence in f1; 2; 3gn . The encoding scheme for T immediately gives C.T n / 529323 n. Unfortunately, this is not tight. One can easily see that C.T / 9 53 2. The fact that C.X/ is not additive, and has a combinatorial flavor to it,makes it not the easiest quantity to study (even if in the case of C.T n / onecan produce a closed-form expression for it). Shannon’s entropy H.X/ is the“right” analytic analogue of C.X/, which, as we shall see, is better behaved,mainly due to the fact that it is analytic and not combinatorial in nature.Shannon’s entropy H.X/ is given by the equation4X1(1)H.X/ DPrŒX D x log:PrŒX D x x2supp.X/A key early result in the field, Shannon’s Source Coding Theorem, asserts that51In this paper we will adopt terminology typically used in the Theoretical Computer Scienceliterature.2 Kolmogorov complexity [LV08] dispenses with this assumption, essentially by introducinga universal prior on strings using computability theory. The probabilistic treatment is typicallycleaner in the contexts that interest us here.3 By encoding the 9 possible sequences as ‘000’, ‘001’, ‘010’, ‘011’, ‘100’, ‘101’, ‘110’,‘1110’, ‘1111’.4 All logs in this paper are base-2, unless stated otherwise.5 In fact, Shannon’s Source Coding Theorem asserts that due to concentration the worst casecommunication cost scales as H.X / as well, if we allow negligible error. We ignore this strongerstatement at the present level of abstraction.

Information complexity and applications: Takagi Lectures Notes(2)29C.X n /limD H.X/:n!1nThis fact can be viewed as the operational definition of entropy, i.e. one that isgrounded in reality. Whereas definition (1) may appear artificial, (2) implies thatit is the right one, since it connects to the “naturally occurring” quantity C.X n /.In the example above, we get that limn!1 C.T n / n D log2 3.Another indirect piece of evidence indicating that H.X/ is a natural quantityis its additivity property:(3)H.X n / D n H.X/;and more generally, if XY is the concatenation of random variables X and Y ,then H.XY / D H.X/ C H.Y / whenever X and Y are independent. We justsaw that (3) fails to hold for C.X/, making H.X/ a “nicer” quantity to dealwith than C.X/. Equation (2) implies thatC.X/ C.X n / H.X/n(it is also not hard to see this fact directly). Conversely, Huffman coding impliesthat(4)C.X/ H.X/ C 1:Thus, in the case of simple one-way transmission, the distinction between communication cost C.X/ and the information-theoretic quantity H.X/ is not thatconsequential. As we shall see later, this is no longer the case in interactivecomputation scenarios, where Alice and Bob exchange messages to achieve acertain goal.1.2. Quantities of interestInformally, Shannon’s entropy captures the (average) inherent amount of uncertainty (in bits) a sample of a random variable X contains. The act of transmittinga realization x of the random variable, reduces this uncertainty from H.X/ to0, and thus it is reasonable to expect it to take H.X/ bits of communication. Putit another way, Bob (the receiver) starts knowing no information about X , andfinishes the communication with H.X/ bits of information.With a little additional work, the notion of Shannon’s entropy can be usedto define additional quantities. For example, suppose .X; Y / are two (possiblycorrelated) random variables, where Alice knows X and Bob knows Y . Howmany bits should one expect to communicate to transmit X ? Bob starts outknowing H.Y / bits of information about the pair XY (as he completely knowsY ). He finishes the communication knowing both X and Y , thus having H.XY /

30M. Bravermaninformation about the pair XY . Thus we should expect H.XY / H.Y / bitsof communication to be required. This latter quantity is known as conditionalentropy(5)H.XjY / WD H.XY / H.Y / D Ey Y H.XjY D y/;where the last equality is (an easy) lemma. The mathematical expression H.XjY /can thus be parsed as “the (average) amount of uncertainty that remains in Xwhen one knows Y ”. Shannon’s Source Coding theorem (2) generalizes to thissetting: given a stream of i.i.d. pairs .Xi ; Yi /, such that Alice knows the Xi ’sand Bob the Yi ’s, the communication cost of transmitting the Xi ’s to Bob scalesas H.XjY /.The following statement, known as the Chain Rule for entropy, is almosttautological:(6)H.XY / D H.X/ C H.Y jX/:More generally, for three random variables X; Y; Z,(7)H.XY jZ/ D H.XjZ/ C H.Y jXZ/:It is not hard to prove that H.XjY / H.X/ (i.e., that adding informationabout Y cannot, on average, increase the uncertainty about X ). Another way toexpress this fact is by stating that entropy is sub-additive: H.XY / H.X/ CH.Y /. This allows us to define the following non-negative quantity, called themutual information between X and Y :(8)I.XI Y / WD H.X/ H.XjY / D H.X/ C H.Y / H.XY /D H.Y / H.Y jX/ 0:Note that the equalities imply that the notion is symmetric: I.XI Y / D I.Y I X/.Mutual information can be given the following operational meaning. It measures how much information knowing Y reveals about X . Another useful wayof looking at it is through the following communication setup. Alice and Bobhave access to a shared random source. Alice knows X , but Bob does not. Theywould like to jointly sample from Y jX .If X and Y were independent (i.e., H.XjY / D H.X/ and I.XI Y / D 0),then clearly no communication is needed. The parties can just use the sharedrandom source to sample from the distribution of Y . This does not work, however, when there is a dependence between X and Y : Bob does not know X , andwould need Alice’s help to sample from Y jX . It turns out that I.XI Y / is exactly the amount of information Alice needs to communicate to Bob to sampleY jX together with him. This is true in the limit sense as in the Source Codingtheorem (2): if Alice has a stream of variables X1 ; : : : ; Xn , and would like to

Information complexity and applications: Takagi Lectures Notes31sample, jointly with Bob, the sequence Y1 ; : : : ; Yn , where Xi Yi are independent, identically distributed as XY , then the optimal communication cost forthis task is n I.XI Y / C o.n/. It should be noted that the “one-shot” versionof this statement—along the lines of (4) generally fails to hold. This is part ofwhat makes the interactive information theory story which we shall discuss inthe next section more interesting.A final quantity that will be useful to us combines the two definitions aboveto give us the notion of conditional mutual information. For a 3-tuple of randomvariables .X; Y; Z/, the mutual information of X and Y conditioned on Z isdefined as(9)I.XI Y jZ/ WD H.XjZ/ H.XjY Z/:Conditional mutual information is, once again, symmetric. It is still true thatI.XI Y jZ/ 0, although the proof is slightly less straightforward (it still relies, fundamentally, on the concavity of the log function). Informally, it can bephrased as “the amount of information X reveals about Y to someone who already knows Z”.The Chain Rule for entropy (6), (7) implies the chain rule for mutual information:(10)I.XY I Z/ D H.XY / H.XY jZ/D H.X/ C H.Y jX/ H.XjZ/ H.Y jXZ/D H.X/ H.XjZ/ C H.Y jX/ H.Y jXZ/D I.XI Z/ C I.Y I ZjX/:Similarly, the following slightly more general statement holds for a 4-tupleof random variables .X; Y; Z; W /:(11)I.XY I ZjW / D I.XI ZjW / C I.Y I ZjXW /:2. Two-party information complexityArmed with the basic definitions, we can proceed to considering interactivecommunication, and the two-party information complexity. We will start withbasic notions of communication complexity, giving an overview that is closelybased on the overview in [Bra14].2.1. Communication complexityIn theoretical computer science interactive communication models are studiedwithin the area of communication complexity. While communication complexity can be viewed as a direct extension of the study of non-interactive communication models which were discussed in the previous section, its development

32M. Bravermanhas been largely disjoint from the development of information theory, and theareas have not reconnected until fairly recently. This may be partially explainedby the combinatorial nature of the tools most prevalent in theoretical computerscience.Communication complexity was introduced by Yao in [Yao79], and is thesubject of the text [KN97]. It has found numerous applications for unconditionallower bounds in a variety of models of computation, including Turing machines,streaming, sketching, data structure lower bounds, and VLSI layout, to namea few. In the basic (two-party) setup, the two parties Alice and Bob are giveninputs X 2 X and Y 2 Y , respectively, and are required to compute a functionF .X; Y / of these inputs (i.e., both parties should know the answer in the endof the communication), while communicating over a noiseless binary channel6 .The parties are computationally unbounded, and their only goal is to minimizethe number of bits transmitted in the process of computing F .X; Y /. In a typicalsetup F is a function F W f0; 1gn f0; 1gn ! f0; 1g. Examples of functionscommonly discussed and used include the Equality function EQn .X; Y / WD1XDY .X; Y /, and the Disjointness function(12)Disjn .X; Y / WDn .:Xi :Yi /:i D1We will return to these functions later in our discussion.Of course, the (non-interactive) transmission problem can be viewed as justa special case of computing the function Px W X f?g ! X , which maps.X; ?/ to X . However, there are two important distinctions between the “flavors” of typical information theory results and communication complexity.Firstly, information theory is often concerned with coding results where blocklength—i.e., the number of copies of the communication task to be performed—goes to infinity. Recall, for example, that Shannon’s Source Coding Theorem (2)gave Shannon’s entropy as a closed-form expression for the amortized transmission cost of sending a growing number of samples X . The descriptions giving anoperational meaning to H.XjY / and I.XI Y / also had this flavor. On the otherhand, communication complexity more commonly studies the communicationcost of computing a single copy of F . Secondly, as in the examples above, communication complexity often studies functions whose output is only a singlebit or a small number of bits, thus “counting style” direct lower bound proofsrarely apply. Tools that have been successfully applied in communication complexity over the years include combinatorics, linear algebra, discrepancy theory,and—only later—classical information theory.To make our discussion of communication complexity more technical, wewill first focus on the two-party setting. The basic notion in communicationLater in this note we will treat X and Y as random variables, but for now assume they arejust inputs unknown to Bob and Alice, respectively.6

Information complexity and applications: Takagi Lectures Notes33complexity is that of a communication protocol. A communication protocol overa binary channel formalizes a conversation, where each message only dependson the input to the speaker and the conversation so far:Definition 2.1. A (deterministic) protocol for F W X Y ! f0; 1g is definedas a finite rooted binary tree, whose nodes correspond to partial communicationtranscripts, such that the two edges coming out of each vertex are labeled witha 0 and 1. Each leaf is labeled by an output value f 2 f0; 1g. Each internalnode v is labeled by a player’s name and either by a function av W X !f0; 1g or bv W Y ! f0; 1g corresponding to the next message of Alice or Bob,respectively.The protocol .X; Y / is executed on a pair of inputs .X; Y / by starting fromthe root of the tree. At each internal node labeled by av the protocol followsthe child av .X/ (corresponding to Alice sending a message), and similarly ateach internal node labeled by bv the protocol follows bv .Y /. When a leaf isreached the protocol outputs f .By a slight abuse of notation, .X; Y / will denote both the transcript andthe output of the protocol; which is the case will be clear from the context. Thecommunication cost of a protocol is the depth of the corresponding protocoltree. A protocol succeeds on input .X; Y / if .X; Y / D F .X; Y /. Its communication cost on this pair of inputs is the depth of the leaf reached by the execution.The communication complexity CC.F / of a function F is the lowest attainablecommunication cost of a protocol that successfully computes F . In the case ofdeterministic communication we require the protocol to succeed on all inputs.A deterministic communication protocol induces a partition of the inputspace X Y into sets S by the leaf that .X; Y / reaches. Since at each stepthe next move of the protocol depends only on either X or Y alone, each S isa combinatorial rectangle of the form S D S X S Y . This key combinatorialproperty is at the heart of many combinatorial communication complexity lowerbounds. To give an example of such a simple combinatorial proof, consider therank bound. Let N D jX j, M D jY j, and consider the N M matrix MFover R whose .X; Y /-th entry is F .X; Y /. Each protocol with leaf set L ofsize L, induces a partition of X Y into combinatorial rectangles fS g 2L .Let M be the matrix whose entries are equal to MX;Y for .X; Y / 2 SP and are 0elsewhere. Since fS g 2L is a partition of X Y , we have MF D 2L M .Assuming is always correct, each M is monochromatic, i.e., either all-0, orall-1 on S , depending on the value of f . Thus, rank.M / 1, andXrank.M / L:(13)rank.MF / 2LIn fact, a stronger bound of .L 1/ holds unless MF is the trivial all-1 matrix. Thus any protocol computing F must have communication cost of at least

34M. Bravermanlog.rank.MF / C 1/, and it follows that the communication complexity of F isat least log.rank.MF / C 1/. As an example of an application, if F D EQnis the Equality function, then MEQn D I2n is the identity matrix, and thusCC.EQn / n C 1. In other words, the trivial protocol where Alice sends Bobher input X (n bits), and Bob responds whether X D Y (1 bit), is optimal.As in many other areas of theoretical computer science, there is much to begained from randomization. For example, in practice, the Equality function doesnot require linear communication as Alice and Bob can just hash their inputslocally and then compare the hash keys. The shorter protocol may return a falsepositive, but it is correct with high probability, and reduces the communicationcomplexity from .n C 1/ to O.log n/.More generally, a randomized protocol is a protocol that tosses coins (i.e.,accesses random bits), and produces the correct answer with high probability.The distributional setting, where there is a prior probability distribution on theinputs and the players need to output the correct answer with high probabilitywith respect to is closely related to the randomized setting, as will be seenbelow. In the randomized setting there are two possible types of random coins.Public coins, also called shared randomness, are generated at random and areaccessible to both Alice and Bob at no communication cost. Private coins, alsocalled private randomness, are coins generated privately by Alice and Bob, andare only accessible by the player who generated them. If Alice wants to shareher coins with Bob, she needs to use the communication channel. In the contextof communication complexity the pubic-coin model is clearly more powerfulthan the private coin one. Fortunately, the gap between the two is not very large[New91], and can be mostly ignored. For convenience reasons, we will focus onthe public-coin model.The definition of a randomized public-coin communication protocol R isidentical to Definition 2.1, except a public random string R is chosen at thebeginning of the execution of the randomized R , and all functions at the nodesof R may depend on R in addition to the respective input X or Y . We stillrequire the answer f to be unequivocally determined by the leaf alone. Thecommunication cost j R j of R is still its worst-case communication cost7 .The randomized communication complexity of F with error " 0 is givenby(14)R" .F / WDmin R W8X;Y PrR Œ R .X;Y /DF .X;Y / 1 "j R j:For a distribution on X Y the distributional communication complexity D ;" .F / is defined as the cost of the best protocol that achieves expectederror " with respect to . Note that in this case fixing public randomness R to7The worst-case communication cost notion is mainly chosen for historic reasons; an averagecase notion would also have been meaningful to discuss here.

Information complexity and applications: Takagi Lectures Notes35a uniformly random value does not change (on average) the expected successprobability of R with respect to . Therefore, without loss of generality, wemay require to be deterministic:(15)D ;" .F / WDmin W fX;Y W .X;Y /DF .X;Y /g 1 "j j:It is easy to see that for all , D ;" .F / R" .F /. By an elegant minimaxargument [Yao83], a partial converse is also true: for each F and ", there isa distribution against which the distributional communication complexity is ashigh as the randomized:R" .F / D max D ;" .F /:(16) For this reason, we will be able to discuss distributional and randomized communication complexity interchangeably.How can one prove lower bounds for the randomized setting? This settingis much less restrictive than the deterministic one, making lower bounds morechallenging. Given a function F , one can guess the hard distribution , and thentry to lower bound the distributional communication complexity D ;" .F /—that is, show that there is no low-communication protocol that computes Fwith error " with respect to . Such a protocol of cost k D j j still induces a partition fS g 2L of the inputs according to the leaf they reach, withL 2k and each S a combinatorial rectangle. However, it is no longer thecase that when we consider the corresponding submatrix M of MF it must bemonochromatic—the output of is allowed to be wrong on a fraction of S ,and thus for some inputs the output of on S may disagree with the value ofF . Still, it should be true that for most leaves the value of F on S is stronglybiased one way or the other, since the contribution of S to the error is(17)e.S / D min. .S \ F 1 .0//; .S \ F 1 .1///:In particular, a fruitful lower bound strategy is to show that all “large” rectangles with respect to have e.S / .S / ", and thus there must be manysmaller rectangles—giving a lower bound on L 2j j . One simple instantiationof this strategy is the discrepancy bound: for a distribution , the discrepancyDisc .F / of F with respect to is the maximum over all combinatorial rectangles R ofDisc .R; F / WD j .F 1 .0/ \ R/ .F 1 .1/ \ R/j:In other words, if F has low discrepancy with respect to then only very smallrectangles (as measured by ) can be unbalanced. With some calculations, itcan be shown that for all " 0 (see [KN97] and references therein),(18)D ; 1 " .F / log2 .2" Disc .F //:2

36M. BravermanNote that (18) not only says that if the discrepancy is low then the communication complexity is high, but also that it remains high even if we are only tryingto gain a tiny advantage over random guessing in computing F ! An example ofa natural function to which the discrepancy method can be applied is the n-bitInner Product function IPn .X; Y / D hX; Y i mod 2. This simple discrepancymethod can be generalized to a richer family of corruption bounds that can beviewed as combinatorial generalizations of the discrepancy bound. More on thismethod can be found in the survey [LS09].One of the early successes of applying combinatorial methods in communication complexity was the proof that the randomized communication complexity of the set disjointness problem (12) is linear, R1 4 .Disjn / D ‚.n/. The firstproof of this fact was given in the 1980s [KS92], and a much simpler proof wasdiscovered soon after [Raz92]. The proofs exhibit a specific distribution ofinputs on which the distributional communication complexity D ;1 4 .Disjn / is .n/. Note that the uniform distribution would not be a great fit, since uniformlydrawn sets are non-disjoint with a very high probability. It turns out that the following family of distributions is hard: select each coordinate pair .Xi ; Yi /i.i.d. from a distribution on f.0; 0/; .0; 1/; .1; 0/g (e.g. uniformly). This generates a distribution on pairs of disjoint sets. Now, with probability 1 2 choose auniformly random coordinate i 2U Œn and set .Xi ; Yi / D .1; 1/. Thus, under , X and Y are disjoint with probability 1 2.Treating communication complexity as a generalization of one-way communication and applying information-theoretic machinery to it is a very naturalapproach (perhaps the most natural, given the success of information theoryin communication theory). Interestingly, however, this is not how the field hasevolved. In fact, the fairly recent survey [LS09] was able to present the majority of communication complexity techniques to its date without dealing withinformation theory at all.A substantial amount of literature exists on communication complexity withinthe information theory community, see for example [OEG88], [OR95] and references therein. The flavor of the results is usually different from the ones discussed above. In particular, there is much more focus on bounded-round communication, and significantly less focus on techniques for obtaining specificlower bounds for the communication complexity of specific functions such asthe disjointness function. The most relevant work to our current discussion is arelatively recent line of work by Ishwar and Ma, which studied interactive amortized communication and obtained characterizations closely related to the onesdiscussed below [MI11], [MI13].Within the theoretical computer science literature, in the context of communication complexity, information theoretic tools were explicitly introducedin [CSWY01] in the early 2000s for the simultaneous message model (i.e., 2non-interactive rounds of communication). Building on this work, [BYJKS04]