In this paper, we empirically study the optimization dynamics of multi-task learning, particularly focusing on those that govern a collection of tasks with significant data imbalance. We present a simple yet effective method of pre-training on high-resource tasks, followed by fine-tuning on a mixture of high/low-resource tasks. We provide a thorough empirical study and analysis of this method's benefits showing that it achieves consistent improvements relative to the performance trade-off profile of standard static weighting. We analyze under what data regimes this method is applicable and show its improvements empirically in neural machine translation (NMT) and multi-lingual language modeling.
In this study, we introduce a sketch-based method for testing how subtle expressive cues influence the perception of affect in illustrations of human figures. We specifically study the impact of human posture and gaze direction, implicitly specified through nose orientation, on perceived emotions and mood. Through a series of user studies using sketchy illustrations of a running figure, where a professional illustrator manipulated gaze direction through adjustments on the nose orientation, we found that this simple change resulted in a diverse range of perceived affects, spanning from fear to concern and wonder. These findings shed light on the importance of fine details in defining context for context-aware system designs and underscore the importance of recognizing and expressing affect. Understanding minor expressive cues is crucial to developing emotionally intelligent systems capable of expressing affect.
In this paper, we propose an advancement to Tarskian model-theoretic semantics, leading to a unified quantitative theory of semantic information and communication. We start with description of inductive logic and probabilities, which serve as notable tools in development of the proposed theory. Then, we identify two disparate kinds of uncertainty in semantic communication, that of physical and content, present refined interpretations of semantic information measures, and conclude with proposing a new measure for semantic content-information and entropy. Our proposition standardizes semantic information across different universes and systems, hence bringing measurability and comparability into semantic communication. We then proceed with introducing conditional and mutual semantic cont-information measures and point out to their utility in formulating practical and optimizable lossless and lossy semantic compression objectives. Finally, we experimentally demonstrate the value of our theoretical propositions.
In this paper, we prove the universal consistency of wide and deep ReLU neural network classifiers trained on the logistic loss. We also give sufficient conditions for a class of probability measures for which classifiers based on neural networks achieve minimax optimal rates of convergence. The result applies to a wide range of known function classes. In particular, while most previous works impose explicit smoothness assumptions on the regression function, our framework encompasses more general settings. The proposed neural networks are either the minimizers of the logistic loss or the $0$-$1$ loss. In the former case, they are interpolating classifiers that exhibit a benign overfitting behavior.
In this paper, we consider multi-robot localization problems with focus on cooperative localization and observability analysis of relative pose estimation. For cooperative localization, there is extra information available to each robot via communication network and message passing. If odometry data of a target robot can be transmitted to the ego-robot then the observability of their relative pose estimation can be achieved by range-only or bearing-only measurements provided both of their linear velocities are non-zero. If odometry data of a target robot is not directly transmitted but estimated by the ego-robot then there must be both range and bearing measurements to guarantee the observability of relative pose estimation. For ROS/Gazebo simulations, we consider four different sensing and communication structures in which extended Kalman filtering (EKF) and pose graph optimization (PGO) estimation with different robust loss functions (filtering and smoothing with different batch sizes of sliding window) are compared in terms of estimation accuracy. For hardware experiments, two Turtlebot3 equipped with UWB modules are used for real-world inter-robot relative pose estimation, in which both EKF and PGO are applied and compared.
In this paper, we focus on the design of binary constant-weight codes that admit low-complexity encoding and decoding algorithms, and that have size as a power of $2$. We construct a family of $(n=2^\ell, M=2^k, d=2)$ constant-weight codes ${\cal C}[\ell, r]$ parameterized by integers $\ell \geq 3$ and $1 \leq r \leq \lfloor \frac{\ell+3}{4} \rfloor$, by encoding information in the gaps between successive $1$'s of a vector. The code has weight $w = \ell$ and combinatorial dimension $k$ that scales quadratically with $\ell$. The encoding time is linear in the input size $k$, and the decoding time is poly-logarithmic in the input size $n$, discounting the linear time spent on parsing the input. Encoding and decoding algorithms of similar codes known in either information-theoretic or combinatorial literature require computation of large number of binomial coefficients. Our algorithms fully eliminate the need to evaluate binomial coefficients. While the code has a natural price to pay in $k$, it performs fairly well against the information-theoretic upper bound $\lfloor \log_2 {n \choose w} \rfloor$. When $\ell =3$, the code is optimal achieving the upper bound; when $\ell=4$, it is one bit away from the upper bound, and as $\ell$ grows it is order-optimal in the sense that the ratio of $k$ with its upper bound becomes a constant $\frac{11}{16}$ when $r=\lfloor \frac{\ell+3}{4} \rfloor$. With the same or even lower complexity, we derive new codes permitting a wider range of parameters by modifying ${\cal C}[\ell, r]$ in two different ways. The code derived using the first approach has the same blocklength $n=2^\ell$, but weight $w$ is allowed to vary from $\ell-1$ to $1$. In the second approach, the weight remains fixed as $w = \ell$, but the blocklength is reduced to $n=2^\ell - 2^r +1$. For certain selected values of parameters, these modified codes have an optimal $k$.
In this paper, we provide a theoretical analysis for a preconditioned steepest descent (PSD) iterative solver that improves the computational time of a finite difference numerical scheme for the Cahn-Hilliard equation with Flory-Huggins energy potential. In the numerical design, a convex splitting approach is applied to the chemical potential such that the logarithmic and the surface diffusion terms are treated implicitly while the expansive concave term is treated with an explicit update. The nonlinear and singular nature of the logarithmic energy potential makes the numerical implementation very challenging. However, the positivity-preserving property for the logarithmic arguments, unconditional energy stability, and optimal rate error estimates have been established in a recent work and it has been shown that successful solvers ensure a similar positivity-preserving property at each iteration stage. Therefore, in this work, we will show that the PSD solver ensures a positivity-preserving property at each iteration stage. The PSD solver consists of first computing a search direction (involved with solving a Poisson-like equation) and then takes a one-parameter optimization step over the search direction in which the Newton iteration becomes very powerful. A theoretical analysis is applied to the PSD iteration solver and a geometric convergence rate is proved for the iteration. In particular, the strict separation property of the numerical solution, which indicates a uniform distance between the numerical solution and the singular limit values of $\pm 1$ for the phase variable, plays an essential role in the iteration convergence analysis. A few numerical results are presented to demonstrate the robustness and efficiency of the PSD solver.
In this paper, we study the asymptotic nonnegative rank of matrices, which characterizes the asymptotic growth of the nonnegative rank of fixed nonnegative matrices under the Kronecker product. This quantity is important since it governs several notions in information theory such as the so-called exact R\'enyi common information and the amortized communication complexity. By using the theory of asymptotic spectra of V. Strassen (J. Reine Angew. Math. 1988), we define formally the asymptotic spectrum of nonnegative matrices and give a dual characterization of the asymptotic nonnegative rank. As a complementary of the nonnegative rank, we introduce the notion of the subrank of a nonnegative matrix and show that it is exactly equal to the size of the maximum induced matching of the bipartite graph defined on the support of the matrix (therefore, independent of the value of entries). Finally, we show that two matrix parameters, namely rank and fractional cover number, belong to the asymptotic spectrum of nonnegative matrices.
In this work, we study a natural nonparametric estimator of the transition probability matrices of a finite controlled Markov chain. We consider an offline setting with a fixed dataset, collected using a so-called logging policy. We develop sample complexity bounds for the estimator and establish conditions for minimaxity. Our statistical bounds depend on the logging policy through its mixing properties. We show that achieving a particular statistical risk bound involves a subtle and interesting trade-off between the strength of the mixing properties and the number of samples. We demonstrate the validity of our results under various examples, such as ergodic Markov chains, weakly ergodic inhomogeneous Markov chains, and controlled Markov chains with non-stationary Markov, episodic, and greedy controls. Lastly, we use these sample complexity bounds to establish concomitant ones for offline evaluation of stationary Markov control policies.
In this paper, we study the expressivity of scalar, Markovian reward functions in Reinforcement Learning (RL), and identify several limitations to what they can express. Specifically, we look at three classes of RL tasks; multi-objective RL, risk-sensitive RL, and modal RL. For each class, we derive necessary and sufficient conditions that describe when a problem in this class can be expressed using a scalar, Markovian reward. Moreover, we find that scalar, Markovian rewards are unable to express most of the instances in each of these three classes. We thereby contribute to a more complete understanding of what standard reward functions can and cannot express. In addition to this, we also call attention to modal problems as a new class of problems, since they have so far not been given any systematic treatment in the RL literature. We also briefly outline some approaches for solving some of the problems we discuss, by means of bespoke RL algorithms.
This work considers the question of how convenient access to copious data impacts our ability to learn causal effects and relations. In what ways is learning causality in the era of big data different from -- or the same as -- the traditional one? To answer this question, this survey provides a comprehensive and structured review of both traditional and frontier methods in learning causality and relations along with the connections between causality and machine learning. This work points out on a case-by-case basis how big data facilitates, complicates, or motivates each approach.