This report gives the rationale for the setting of the target queue delay parameter in the reference Linux implementation of PI$^2$ Active Queue Management (AQM).
We investigate a novel scheduling problem where we have $n$ clients, each associated with a single job on each of a set of $m$ different days. On each day, a single machine is available to process the $n$ jobs non-preemptively. The goal is to provide an equitable set of schedules for all $m$ days such that the sum of completion times of each client over all days is not greater than some specified equitability parameter $k$. The $1\mid\mid\max_j\sum_i C_{i,j}$ problem, as we refer to it in this paper, fits nicely into a new model introduced by Heeger et al. [AAAI '21] that aims at capturing a generic notion of fairness in scheduling settings where the same set of clients repeatedly submit scheduling requests over a fixed period of time. We show that the $1\mid\mid\max_j\sum_i C_{i,j}$ problem is NP-hard even under quite severe restrictions. This leads us to investigate two natural special cases: one where we assume the number of days to be small, and one where we consider the number of clients to be small. We present several tractability results for both cases.
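To make the objective completely explicit, the following minimal sketch evaluates the equitability condition for a given collection of daily schedules; the argument names (p, orders, k) and the helper functions are illustrative assumptions, not part of the paper.

    def max_total_completion_time(p, orders):
        # p[i][j]: processing time of client j's job on day i;
        # orders[i]: the processing order (a permutation of the clients) chosen for day i.
        n = len(p[0])
        totals = [0.0] * n              # sum of completion times per client over all days
        for day_times, order in zip(p, orders):
            t = 0.0
            for j in order:             # jobs run non-preemptively in the chosen order
                t += day_times[j]       # completion time C_{i,j} of client j on day i
                totals[j] += t
        return max(totals)

    def is_equitable(p, orders, k):
        # The schedules are equitable iff max_j sum_i C_{i,j} <= k.
        return max_total_completion_time(p, orders) <= k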
The joint convexity of the map $(X,A) \mapsto X^* A^{-1} X$, an integral representation of convex operator functions, and an observation of Ando are used to obtain a simple proof of both the joint convexity of relative entropy and a trace convexity result of Lieb. The latter was the key ingredient in the original proof of the strong subadditivity of quantum entropy.
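For reference, and only to fix notation, the two convexity statements in play can be written as follows (all matrices positive definite where inverses and logarithms appear):
\[
(X, A) \;\longmapsto\; X^* A^{-1} X \quad \text{is jointly convex,}
\qquad
(\rho, \sigma) \;\longmapsto\; \operatorname{Tr}\bigl[\rho(\log \rho - \log \sigma)\bigr] \quad \text{is jointly convex.}
\]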
Multilingual neural machine translation (MNMT) aims to translate multiple languages with a single model and has proved successful thanks to effective knowledge transfer among different languages with shared parameters. However, it is still an open question which parameters should be shared and which ones need to be task-specific. Currently, the common practice is to heuristically design or search for language-specific modules, which makes it difficult to find the optimal configuration. In this paper, we propose a novel parameter differentiation based method that allows the model to determine which parameters should be language-specific during training. Inspired by cellular differentiation, each shared parameter in our method can dynamically differentiate into more specialized types. We further define the differentiation criterion as inter-task gradient similarity, so parameters with conflicting inter-task gradients are more likely to become language-specific. Extensive experiments on multilingual datasets demonstrate that our method significantly outperforms various strong baselines with different parameter sharing configurations. Further analyses reveal that the parameter sharing configuration obtained by our method correlates well with linguistic proximity.
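A schematic sketch of the gradient-similarity criterion described above is given below; the data layout, the threshold, and the function names are illustrative assumptions rather than the authors' implementation.

    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    def conflict_score(per_language_grads):
        # per_language_grads: list of flattened gradient vectors of one shared
        # parameter, one vector per language / translation direction.
        sims = [cosine(per_language_grads[a], per_language_grads[b])
                for a in range(len(per_language_grads))
                for b in range(a + 1, len(per_language_grads))]
        return sum(sims) / len(sims)

    def should_differentiate(per_language_grads, threshold=0.0):
        # A shared parameter whose inter-task gradients conflict (low or negative
        # average similarity) is a candidate to split into language-specific copies.
        return conflict_score(per_language_grads) < threshold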
We prove that the border rank of the Kronecker square of the little Coppersmith-Winograd tensor $T_{cw,q}$ is the square of its border rank for $q > 2$ and that the border rank of its Kronecker cube is the cube of its border rank for $q > 4$. This answers questions raised implicitly in [Coppersmith-Winograd, 1990] and explicitly in [Bl\"aser, 2013] and rules out the possibility of proving new upper bounds on the exponent of matrix multiplication using the square or cube of a little Coppersmith-Winograd tensor in this range. In the positive direction, we enlarge the list of explicit tensors potentially useful for Strassen's laser method, introducing a skew-symmetric version of the Coppersmith-Winograd tensor, $T_{skewcw,q}$. For $q = 2$, the Kronecker square of this tensor coincides with the $3\times 3$ determinant polynomial, $\det_3 \in \mathbb{C}^9\otimes \mathbb{C}^9\otimes \mathbb{C}^9$, regarded as a tensor. We show that this tensor could potentially be used to show that the exponent of matrix multiplication is two. We determine new upper bounds for the (Waring) rank and the (Waring) border rank of $\det_3$, exhibiting a strict submultiplicative behaviour for $T_{skewcw,2}$ which is promising for the laser method. We establish general results regarding border ranks of Kronecker powers of tensors, and make a detailed study of Kronecker squares of tensors in $\mathbb{C}^3\otimes \mathbb{C}^3\otimes \mathbb{C}^3$.
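For orientation, recall the standard definition of the little Coppersmith-Winograd tensor and the submultiplicativity of border rank under the Kronecker product, which is the inequality whose tightness is at issue here:
\[
T_{cw,q} \;=\; \sum_{i=1}^{q}\bigl(a_0\otimes b_i\otimes c_i + a_i\otimes b_0\otimes c_i + a_i\otimes b_i\otimes c_0\bigr)\;\in\;\mathbb{C}^{q+1}\otimes\mathbb{C}^{q+1}\otimes\mathbb{C}^{q+1},
\qquad
\underline{\mathbf{R}}(T_1\boxtimes T_2)\;\le\;\underline{\mathbf{R}}(T_1)\,\underline{\mathbf{R}}(T_2).
\]
Since $\underline{\mathbf{R}}(T_{cw,q}) = q+2$, the first result above says the Kronecker square attains the value $(q+2)^2$ for $q > 2$.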
This paper discusses the estimation of the generalization gap, the difference between the generalization error and the empirical error, for overparameterized models (e.g., neural networks). We first show that the functional variance, a key concept in defining the widely applicable information criterion, characterizes the generalization gap even in overparameterized settings where conventional theory cannot be applied. We also propose a computationally efficient approximation of the functional variance, the Langevin approximation of the functional variance (Langevin FV). This method leverages only the first-order gradient of the squared loss function, without referencing the second-order gradient; this ensures that the computation is efficient and the implementation is consistent with gradient-based optimization algorithms. We demonstrate Langevin FV numerically by estimating the generalization gaps of overparameterized linear regression and non-linear neural network models.
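In Watanabe's framework the functional variance is the variance, over sampled parameters, of each example's log-likelihood term, summed over examples; the sketch below estimates the analogous quantity for a squared loss with a plain unadjusted Langevin sampler, which is one way to realize "first-order gradients only". The sampler, step size, and function names are assumptions for illustration, not the authors' exact procedure.

    import numpy as np

    def langevin_fv(grad_loss, per_example_loss, w0, step=1e-4,
                    n_samples=200, burn_in=100, rng=None):
        # Monte-Carlo estimate of the functional variance: the variance, across
        # sampled parameters, of each example's loss, summed over examples.
        # Parameters are sampled with an unadjusted Langevin update, which needs
        # only first-order gradients of the total loss.
        rng = np.random.default_rng() if rng is None else rng
        w = np.asarray(w0, dtype=float).copy()
        rows = []
        for t in range(burn_in + n_samples):
            w = w - step * grad_loss(w) + np.sqrt(2.0 * step) * rng.normal(size=w.shape)
            if t >= burn_in:
                rows.append(per_example_loss(w))   # vector of per-example losses at w
        losses = np.array(rows)                    # shape: (n_samples, n_examples)
        return float(losses.var(axis=0, ddof=1).sum())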
We consider equivariant estimation of location/scale parameters of a general bivariate distribution, under quite general conditions on the underlying distributions and the loss function, when it is known a priori that these parameters satisfy an order restriction. This paper unifies various results in the literature on the inadmissibility of location/scale equivariant estimators and on finding their improvements. The usefulness of these unified results and their relation to existing results in the literature are illustrated through various examples.
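For concreteness, in the location case the setup above reads as follows (a standard formulation, included only to fix terminology): the parameters are known to satisfy $\theta_1 \le \theta_2$, and an estimator $\delta$ of $\theta_1$, say, is location equivariant if
\[
\delta(X_1 + c,\, X_2 + c) \;=\; \delta(X_1, X_2) + c \qquad \text{for every } c \in \mathbb{R}.
\]
The improvements referred to above are estimators that dominate such a $\delta$ by exploiting the restriction $\theta_1 \le \theta_2$.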
In recent years, Bi-Level Optimization (BLO) techniques have received extensive attention from both the learning and vision communities. A variety of BLO models arising in complex and practical tasks have a non-convex follower structure (a.k.a., without Lower-Level Convexity, LLC for short). However, this challenging class of BLOs lacks both efficient solution strategies and solid theoretical guarantees. In this work, we propose a new algorithmic framework, named Initialization Auxiliary and Pessimistic Trajectory Truncated Gradient Method (IAPTT-GM), to partially address the above issues. In particular, by introducing an auxiliary variable as initialization to guide the optimization dynamics and designing a pessimistic trajectory truncation operation, we construct a reliable approximate version of the original BLO in the absence of the LLC hypothesis. Our theoretical investigations establish the convergence of solutions returned by IAPTT-GM towards those of the original BLO without LLC. As an additional bonus, we also theoretically justify the quality of our IAPTT-GM embedded with Nesterov's accelerated dynamics under LLC. The experimental results confirm both the convergence of our algorithm without LLC and the theoretical findings under LLC.
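A schematic sketch of the two ingredients named above, auxiliary initialization and pessimistic trajectory truncation, is given below; the objectives F (upper level) and f (lower level), the step sizes, and the truncation rule shown are illustrative assumptions, not the authors' exact algorithm. In the full method, the upper-level variable and the auxiliary initialization would both be updated by differentiating the upper-level objective through the truncated trajectory; that outer loop is omitted here.

    import numpy as np

    def lower_level_trajectory(x, z0, grad_f, step=0.1, K=50):
        # Run K gradient steps on the lower-level objective f(x, .), starting from
        # the auxiliary initialization z0, and record every iterate.
        z, traj = np.asarray(z0, dtype=float).copy(), []
        for _ in range(K):
            z = z - step * grad_f(x, z)
            traj.append(z.copy())
        return traj

    def pessimistic_truncation(x, traj, F):
        # Pessimistic trajectory truncation: keep the prefix of the trajectory up
        # to the iterate with the worst (largest) upper-level value F(x, z).
        worst = max(range(len(traj)), key=lambda t: F(x, traj[t]))
        return traj[:worst + 1]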
Since deep neural networks were developed, they have made huge contributions to everyday life, and machine learning now provides more rational advice than humans can in many aspects of daily life. However, despite this achievement, the design and training of neural networks remain challenging and unpredictable. To lower the technical threshold for non-expert users, automated hyper-parameter optimization (HPO) has become a popular topic in both academia and industry. This paper provides a review of the most essential topics in HPO. The first section introduces the key hyper-parameters related to model training and structure, discusses their importance, and describes methods for defining their value ranges. The review then focuses on the major optimization algorithms and their applicability, covering their efficiency and accuracy, especially for deep learning networks. It next surveys the major services and toolkits for HPO, comparing their support for state-of-the-art search algorithms, their compatibility with major deep learning frameworks, and their extensibility for user-designed modules. The paper concludes with the problems that arise when HPO is applied to deep learning, a comparison of optimization algorithms, and prominent approaches for model evaluation under limited computational resources.
We present a new clustering method, in the form of a single clustering equation, that is able to directly discover groupings in the data. The main proposition is that the first neighbor of each sample is all one needs to discover large chains and find the groups in the data. In contrast to most existing clustering algorithms, our method does not require any hyper-parameters, distance thresholds, or prior specification of the number of clusters. The proposed algorithm belongs to the family of hierarchical agglomerative methods. The technique has very low computational overhead, is easily scalable, and is applicable to large practical problems. Evaluation on well-known datasets from different domains, ranging from 1,077 to 8.1 million samples, shows substantial performance gains over existing clustering techniques.
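A minimal sketch of the first-neighbor idea is given below, assuming Euclidean distance: it performs a single linking round (the actual algorithm is hierarchical and continues merging), and the brute-force distance matrix is only for brevity, whereas the method itself is described as scalable.

    import numpy as np

    def first_neighbor_clusters(X):
        # Link every sample to its nearest neighbor and return the connected
        # components of the resulting graph as cluster labels (one linking round).
        X = np.asarray(X, dtype=float)
        n = len(X)
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        first = d.argmin(axis=1)                 # index of each sample's first neighbor

        parent = list(range(n))                  # union-find over the links i -- first[i]
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        for i in range(n):
            parent[find(i)] = find(first[i])
        roots = sorted({find(i) for i in range(n)})
        label = {r: c for c, r in enumerate(roots)}
        return [label[find(i)] for i in range(n)]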
For neural networks (NNs) with rectified linear unit (ReLU) or binary activation functions, we show that training can be accomplished in a reduced parameter space. Specifically, the weights in each neuron can be trained on the unit sphere, as opposed to the entire space, and the threshold can be trained in a bounded interval, as opposed to the real line. We show that NNs in the reduced parameter space are mathematically equivalent to standard NNs with parameters in the whole space. The reduced parameter space should facilitate the optimization procedure for network training, as the search space becomes (much) smaller. We demonstrate the improved training performance using numerical examples.
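The key property behind such an equivalence, at least in the ReLU case, is the positive homogeneity of the activation: for $w \neq 0$,
\[
\mathrm{ReLU}\bigl(w^{\top}x - b\bigr) \;=\; \|w\|\,\mathrm{ReLU}\!\Bigl(\tfrac{w^{\top}}{\|w\|}\,x - \tfrac{b}{\|w\|}\Bigr),
\]
so the direction $w/\|w\|$ lives on the unit sphere while the scale $\|w\|$ can be absorbed into the next layer; and when a neuron's inputs are bounded, effective thresholds $b/\|w\|$ outside that bounded range make the neuron identically zero or purely affine, which is presumably why a bounded interval suffices. This is a reading of the claimed equivalence; the paper's exact reparameterization may differ in details.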