Tensors, i.e., multi-linear functions, are a fundamental building block of machine learning algorithms. In order to train on large datasets, it is common practice to distribute the computation among workers. However, stragglers and other faults can severely impact performance and overall training time. A novel strategy for mitigating these failures is coded computation. We introduce a new metric for analysis, the typical recovery threshold, which focuses on the most likely event, and we provide a novel construction of distributed coded tensor operations that is optimal with respect to this measure. We show that our general framework encompasses many other computational schemes and metrics as special cases. In particular, we prove that the recovery threshold and the tensor rank can be recovered as special cases of the typical recovery threshold when the probability of noise, i.e., of a fault, is equal to zero, thereby providing a noisy generalization of noiseless computation as a serendipitous result. Far from being a purely theoretical construction, these definitions lead us to practical random code constructions, namely locally random p-adic alloy codes, which are optimal with respect to these measures. We analyze experiments conducted on Amazon EC2 and establish that these codes are faster and more numerically stable in practice than many other benchmark computation schemes, as predicted by the theory.
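To make the recovery-threshold idea concrete, below is a minimal sketch of coded distributed matrix multiplication (a special case of coded tensor computation) using a generic random linear code; the block sizes, worker counts, and code are illustrative assumptions, not the paper's locally random p-adic alloy codes.

```python
import numpy as np

# Illustrative coded distributed matrix multiplication with a generic random
# linear code (NOT the paper's p-adic alloy codes); it only demonstrates the
# idea of a recovery threshold in the presence of stragglers.
rng = np.random.default_rng(0)
m, workers = 4, 7                      # 4 data blocks, 7 workers
A = rng.standard_normal((80, 60))      # A is split into m row blocks
B = rng.standard_normal((60, 50))
A_blocks = np.split(A, m)

# Encoding: worker i receives a random linear combination of the A blocks.
R = rng.standard_normal((workers, m))
tasks = [sum(R[i, j] * A_blocks[j] for j in range(m)) @ B
         for i in range(workers)]      # each worker computes its coded product

# Decoding: ANY m of the 7 results suffice (recovery threshold = m); here
# workers {0, 2, 3, 6} respond first and the remaining three are stragglers.
alive = [0, 2, 3, 6]
coeffs = R[alive]                              # m x m, invertible a.s.
stacked = np.stack([tasks[i] for i in alive])  # shape (m, rows/m, cols)
decoded = np.einsum('ij,jkl->ikl', np.linalg.inv(coeffs), stacked)
assert np.allclose(np.vstack(decoded), A @ B)
```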
We prove a central limit theorem for the entropic transportation cost between subgaussian probability measures, centered at the population cost. This is the first result that allows for asymptotically valid inference for entropic optimal transport between measures which are not necessarily discrete. In the compactly supported case, we complement these results with new, faster convergence rates for the expected entropic transportation cost between empirical measures. Our proof is based on strengthening convergence results for dual solutions to the entropic optimal transport problem.
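As a companion illustration, the sketch below estimates the entropic transportation cost between two empirical measures with plain Sinkhorn iterations; the sample sizes, regularization parameter, and cost convention are assumptions for illustration, not part of the paper's proofs.

```python
import numpy as np

# Minimal sketch: entropic transportation cost between two empirical
# measures via Sinkhorn iterations (log-domain stabilization omitted).
rng = np.random.default_rng(1)
n, eps = 200, 0.5                          # sample size, regularization
x = rng.standard_normal((n, 2))            # samples from P
y = rng.standard_normal((n, 2)) + 1.0      # samples from Q
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # squared-distance cost

K = np.exp(-C / eps)
a = b = np.full(n, 1.0 / n)                # uniform empirical weights
u, v = np.ones(n), np.ones(n)
for _ in range(500):                       # Sinkhorn fixed-point iterations
    u = a / (K @ v)
    v = b / (K.T @ u)
pi = u[:, None] * K * v[None, :]           # entropic optimal coupling
cost = (pi * C).sum()                      # transport part <pi, C>; note
# that conventions differ on whether the eps * KL term is also included.
print(f"estimated entropic OT cost: {cost:.3f}")
```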
In this work, a quantum analogue of Bayesian statistical inference is considered. Based on the notion of an instrument, we propose a sequential measurement scheme from which the observations needed for statistical inference are obtained. We further put forward a quantum analogue of Bayes rule, which states how the prior normal state of a quantum system is updated under those observations. We next generalize the fundamental notions and results of Bayesian statistics according to the quantum Bayes rule. We also note that our theory retains the classical one as a special case. Finally, we investigate the limit of the posterior normal state as the number of observations tends to infinity.
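For orientation, the standard instrument-based state update takes the schematic form below; the paper's quantum Bayes rule should be read as a rigorous development in this spirit, and the measurement operators $M_x$ here are illustrative.

```latex
% Schematic state update induced by a quantum instrument (standard form;
% the operators M_x below are illustrative).
\[
  \rho \;\longmapsto\; \rho_x
  = \frac{\mathcal{I}_x(\rho)}{\operatorname{tr}\,\mathcal{I}_x(\rho)},
  \qquad \text{e.g. } \mathcal{I}_x(\rho) = M_x \rho M_x^{\dagger},
  \quad \sum_x M_x^{\dagger} M_x = I .
\]
```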
This paper focuses on stochastic saddle point problems with decision-dependent distributions. These are problems whose objective is the expected value of a stochastic payoff function, where the random variables are drawn from a distribution induced by a distributional map. For general distributional maps, finding saddle points is computationally burdensome, even when the distribution is known. To enable a tractable solution approach, we introduce the notion of equilibrium points -- saddle points of the stationary stochastic minimax problem that they themselves induce -- and provide conditions for their existence and uniqueness. We demonstrate that the distance between the two solution types is bounded provided that the objective has a strongly-convex-strongly-concave payoff and a Lipschitz continuous distributional map. We develop deterministic and stochastic primal-dual algorithms and demonstrate their convergence to the equilibrium point. In particular, by modeling the errors emerging from a stochastic gradient estimator as sub-Weibull random variables, we provide error bounds in expectation and in high probability that hold for each iteration; moreover, we show convergence to a neighborhood almost surely. Finally, we investigate a condition on the distributional map -- which we call opposing mixture dominance -- that ensures the objective is strongly-convex-strongly-concave. We tailor the convergence results for the primal-dual algorithms to this opposing mixture dominance setup.
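A minimal sketch of a stochastic primal-dual iteration with a decision-dependent distribution is given below; the toy payoff, Gaussian distributional map, and step size are assumptions made for illustration, not the paper's setting.

```python
import numpy as np

# Sketch of stochastic primal-dual updates where the sampled data shifts
# with the current decision. Toy payoff f(x, y; w) = (x - w)*y - 0.5*y**2,
# distributional map w ~ N(mu + gamma*x, 1). NOTE: this toy payoff is only
# linear in x; the paper assumes a strongly-convex-strongly-concave payoff.
rng = np.random.default_rng(2)
mu, gamma, eta = 1.0, 0.3, 0.01
x, y = 0.0, 0.0
for t in range(20000):
    w = mu + gamma * x + rng.standard_normal()   # sample from D(x)
    gx = y                                       # stochastic gradient in x
    gy = (x - w) - y                             # stochastic gradient in y
    x -= eta * gx                                # primal descent step
    y += eta * gy                                # dual ascent step
# At an equilibrium point the ascent step gives y* = x* - E_{w~D(x*)}[w]
# and the descent step needs y* = 0, so x* solves x = mu + gamma*x.
print(f"x = {x:.3f}, y = {y:.3f}, predicted x* = {mu / (1 - gamma):.3f}")
```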
We study online convex optimization with switching costs, a practically important but extremely challenging problem due to the lack of complete offline information. By tapping into the power of machine learning (ML) based optimizers, ML-augmented online algorithms (also referred to as expert calibration in this paper) have emerged as the state of the art, with provable worst-case performance guarantees. Nonetheless, under the standard practice of training an ML model as a standalone optimizer and plugging it into an ML-augmented algorithm, the average cost performance can be even worse than using ML predictions alone. To address this "how to learn" challenge, we propose EC-L2O (expert-calibrated learning to optimize), which trains an ML-based optimizer by explicitly taking into account the downstream expert calibrator. To accomplish this, we propose a new differentiable expert calibrator that generalizes regularized online balanced descent and offers a provably better competitive ratio than pure ML predictions when the prediction error is large. For training, our loss function is a weighted sum of two losses -- one minimizing the average ML prediction error for better robustness, and the other minimizing the post-calibration average cost. We also provide theoretical analysis for EC-L2O, highlighting that expert calibration can even be beneficial for the average cost performance and that the high-percentile tail ratio of the cost achieved by EC-L2O to that of the offline optimal oracle (i.e., the tail cost ratio) can be bounded. Finally, we test EC-L2O by running simulations for sustainable datacenter demand response. Our results demonstrate that EC-L2O can empirically achieve both a lower average cost and a lower competitive ratio than the existing baseline algorithms.
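The following sketch shows the shape of such a weighted-sum training loss; the stand-in calibrator and cost model are assumptions, not the paper's differentiable generalization of regularized online balanced descent.

```python
import numpy as np

# Sketch of an EC-L2O-style training loss: a weighted sum of (i) average ML
# prediction error and (ii) average cost AFTER a differentiable calibrator.
# The calibrator below is a toy stand-in, not the paper's construction.
def calibrate(pred, prev, rho=0.5):
    # toy "calibrator": pull the action toward the previous action
    return rho * prev + (1 - rho) * pred

def ec_l2o_loss(preds, targets, cost_fn, lam=0.3):
    pred_err = np.mean((preds - targets) ** 2)      # robustness term
    prev, total_cost = 0.0, 0.0
    for p in preds:
        a = calibrate(p, prev)
        total_cost += cost_fn(a) + abs(a - prev)    # hitting + switching cost
        prev = a
    return lam * pred_err + (1 - lam) * total_cost / len(preds)

preds = np.array([0.2, 0.7, 0.6, 0.9])
targets = np.array([0.1, 0.8, 0.5, 1.0])
print(ec_l2o_loss(preds, targets, cost_fn=lambda a: a ** 2))
```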
The classical coding theorem in Kolmogorov complexity states that if an $n$-bit string $x$ is sampled with probability $\delta$ by an algorithm with prefix-free domain, then K$(x) \leq \log(1/\delta) + O(1)$. In a recent work, Lu and Oliveira [LO21] established an unconditional time-bounded version of this result, showing that if $x$ can be efficiently sampled with probability $\delta$ then rKt$(x) = O(\log(1/\delta)) + O(\log n)$, where rKt denotes the randomized analogue of Levin's Kt complexity. Unfortunately, this result is often insufficient when transferring applications of the classical coding theorem to the time-bounded setting, as it achieves an $O(\log(1/\delta))$ bound instead of the information-theoretically optimal $\log(1/\delta)$. We show a coding theorem for rKt with a multiplicative factor of $2$, i.e., with leading term $2\log(1/\delta)$. As in previous work, our coding theorem is efficient in the sense that it provides a polynomial-time probabilistic algorithm that, when given $x$, the code of the sampler, and $\delta$, outputs, with probability $\ge 0.99$, a probabilistic representation of $x$ that certifies this rKt complexity bound. Assuming the security of cryptographic pseudorandom generators, we show that no efficient coding theorem can achieve a bound of the form rKt$(x) \leq (2 - o(1)) \cdot \log(1/\delta) +$ poly$(\log n)$. Under a weaker assumption, we exhibit a gap between efficient coding theorems and existential coding theorems with near-optimal parameters. Finally, we consider pK$^t$ complexity [GKLO22], a variant of rKt in which the randomness is public and the time bound is fixed. We observe the existence of an optimal coding theorem for pK$^t$, and employ this result to establish an unconditional version of a theorem of Antunes and Fortnow [AF09] characterizing the worst-case running times of languages that are in average polynomial-time over all P-samplable distributions.
The lossless compression of a single source $X^n$ was recently shown to be achievable with a notion of strong locality: any $X_i$ can be decoded from a \emph{constant} number of compressed bits, with a probability of error vanishing in $n$. In contrast with the single-source setup, we show that for two separately encoded sources $(X^n,Y^n)$, lossless compression with strong locality is generally not possible. More precisely, we show that, for the class of "confusable" sources, strong locality cannot be achieved whenever one of the sources is compressed below its entropy. In this case, irrespective of $n$, the probability of error in decoding any $(X_i,Y_i)$ is lower bounded by $2^{-O(d_{\mathrm{loc}})}$, where $d_{\mathrm{loc}}$ denotes the number of compressed bits accessed by the local decoder. Conversely, if the sources are not confusable, strong locality is possible even if one of the sources is compressed below its entropy. These results extend to any number of sources.
In this work, we develop quantization and variable-length source codecs for the feedback links in linear-quadratic-Gaussian (LQG) control systems. We prove that for any fixed control performance, the approaches we propose nearly achieve lower bounds on communication cost that have been established in prior work. In particular, we refine the analysis of a classical achievability approach with an eye towards more practical details. Notably, in the prior literature the source codecs used to demonstrate the (near) achievability of these lower bounds are often implicitly assumed to be time-varying. For single-input single-output (SISO) plants, we prove that it suffices to consider time-invariant quantization and source coding. This result follows from analyzing the long-term stochastic behavior of the system's quantized measurements and reconstruction errors. To our knowledge, this time-invariant achievability result is the first in the literature.
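As a toy illustration of the time-invariant theme, the sketch below runs a scalar linear plant with one fixed uniform quantizer on the feedback link; the plant, gain, and quantizer step are invented for illustration and do not reproduce the paper's codecs or bounds.

```python
import numpy as np

# Toy scalar LQG-style loop with a TIME-INVARIANT quantizer: the SAME
# uniform quantizer is applied at every step (the paper proves that this
# suffices for SISO plants; all parameters here are illustrative).
rng = np.random.default_rng(3)
a, L, delta = 1.2, 1.2, 0.25               # plant pole, control gain, step

def q(z):                                   # fixed (time-invariant) quantizer
    return delta * np.round(z / delta)

x, costs, symbols = 0.0, [], []
for t in range(10000):
    y = q(x)                                # quantized measurement sent over
    u = -L * y                              # the feedback link; linear law
    costs.append(x ** 2)
    symbols.append(y)                       # inputs to a variable-length code
    x = a * x + u + 0.1 * rng.standard_normal()
print(f"avg state cost: {np.mean(costs):.4f}, "
      f"distinct symbols: {len(set(symbols))}")  # small alphabet -> low rate
```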
We recall some of the history of the information-theoretic approach to deriving core results in probability theory, and indicate parts of the recent resurgence of interest in this area, with current progress along several interesting directions. We then give a new information-theoretic proof of a finite version of de Finetti's classical representation theorem for finite-valued random variables. Specifically, we derive an upper bound on the relative entropy between the distribution of the first $k$ variables in a sequence of $n$ exchangeable random variables and an appropriate mixture over product distributions. The mixing measure is characterised as the law of the empirical measure of the original sequence, and de Finetti's result is recovered as a corollary. The proof is motivated by the Gibbs conditioning principle from statistical mechanics, and it follows along an appealing sequence of steps. The technical estimates required for these steps are obtained via a collection of combinatorial tools known within information theory as `the method of types.'
In the pooled data problem, we are given a set of $n$ agents, each of which holds a hidden state bit, either $0$ or $1$. A querying procedure returns, for a query set, the sum of the states of the queried agents. The goal is to reconstruct the states using as few queries as possible. In this paper, we consider two noise models for the pooled data problem. In the noisy channel model, the state bit of each queried agent is flipped with a certain probability. In the noisy query model, each query result is subject to random Gaussian noise. Our results are twofold. First, we present and analyze, for both error models, a simple and efficient distributed algorithm that reconstructs the initial states in a greedy fashion. Our novel analysis pins down the range of error probabilities and distributions for which our algorithm reconstructs the exact initial states with high probability. Second, we present simulation results for our algorithm and compare its performance with approximate message passing (AMP) algorithms, which are conjectured to be optimal for a number of related problems.
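A small simulation of the noisy query model is sketched below; the correlation-and-threshold decoder is a simple illustrative stand-in, not the paper's greedy algorithm, and the parameters are arbitrary.

```python
import numpy as np

# Pooled data under the noisy query model: each query returns the sum of
# the queried agents' bits plus Gaussian noise. The decoder below is a
# simple correlation/threshold rule used only to illustrate the setup.
rng = np.random.default_rng(4)
n, m, sigma = 500, 8000, 1.0
x = rng.integers(0, 2, n)                     # hidden state bits
A = rng.integers(0, 2, (m, n))                # random query sets
r = A @ x + sigma * rng.standard_normal(m)    # noisy pooled results

# Cov(A[:, i], r) = x_i / 4 per query, so after centering, scores
# concentrate near m/4 for agents with bit 1 and near 0 for bit 0.
scores = (A - 0.5).T @ (r - r.mean())
x_hat = (scores > m / 8).astype(int)          # threshold at the midpoint
print(f"recovered {np.mean(x_hat == x):.1%} of the states correctly")
```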
Many tasks in natural language processing can be viewed as multi-label classification problems. However, most existing models are trained with the standard cross-entropy loss function and use a fixed prediction policy (e.g., a threshold of 0.5) for all labels, which ignores the varying complexity of, and dependencies among, different labels. In this paper, we propose a meta-learning method to capture these complex label dependencies. More specifically, our method utilizes a meta-learner to jointly learn the training policies and prediction policies for different labels. The training policies are used to train the classifier with the cross-entropy loss function, and the prediction policies are then applied at prediction time. Experimental results on fine-grained entity typing and text classification demonstrate that our proposed method yields more accurate multi-label classification results.
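To see why a fixed 0.5 threshold can be a poor prediction policy, the sketch below tunes one threshold per label on synthetic, per-label-miscalibrated scores; this toy tuning is only an illustration and is far simpler than the paper's meta-learned policies (a real system would also tune on a held-out split).

```python
import numpy as np

# Fixed 0.5 threshold vs. per-label thresholds on synthetic scores from a
# classifier that is miscalibrated differently per label (illustrative only).
rng = np.random.default_rng(5)
n, L = 2000, 3
y = rng.integers(0, 2, (n, L))
bias = np.array([-0.3, 0.0, 0.25])            # per-label miscalibration
p = np.clip(0.5 * y + 0.25 + bias + 0.15 * rng.standard_normal((n, L)), 0, 1)

def micro_f1(y_true, y_pred):
    tp = np.sum(y_true & y_pred)
    return 2 * tp / (2 * tp + np.sum(y_true ^ y_pred))  # micro-averaged F1

fixed = (p > 0.5).astype(int)
grid = np.linspace(0.05, 0.95, 19)
best = [max(grid, key=lambda t, j=j: micro_f1(y[:, j], (p[:, j] > t).astype(int)))
        for j in range(L)]                    # tune one threshold per label
tuned = (p > np.array(best)).astype(int)
print(f"fixed 0.5 F1: {micro_f1(y, fixed):.3f}, "
      f"per-label F1: {micro_f1(y, tuned):.3f}")
```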