In many real-life situations, we are often faced with combinatorial problems under linear cost functions. In this paper, we propose a fast method for exactly enumerating a very large number of all lower cost solutions for various combinatorial problems. Our method is based on backtracking for a given decision diagram which represents the feasible solutions of a problem. The main idea is to memoize the intervals of cost bounds to avoid duplicate search in the backtracking process. Although existing state-of-the-art methods with dynamic programming requires a pseudo-polynomial time with the total cost values, it may take an exponential time when the cost values become large. In contrast, the computation time of the proposed method does not directly depend on the total cost values, but is bounded by the input and output size of the decision diagrams. Therefore, it can be much faster than the existing methods if the cost values are large but the output of the decision diagrams are well-compressed. Our experimental results show that, for some practical instances of the Hamiltonian path problem, we succeeded in exactly enumerating billions of all lower cost solutions in a few seconds, which is hundred or more times faster than existing methods. Our method can be regarded as a novel search algorithm which integrates the two classical techniques, branch-and-bound and dynamic programming. This method would have many applications in various fields, including operations research, data mining, statistical testing, hardware/software system design, etc.
Implementation of many statistical methods for large, multivariate data sets requires one to solve a linear system that, depending on the method, is of the dimension of the number of observations or each individual data vector. This is often the limiting factor in scaling the method with data size and complexity. In this paper we illustrate the use of Krylov subspace methods to address this issue in a statistical solution to a source separation problem in cosmology where the data size is prohibitively large for direct solution of the required system. Two distinct approaches are described: one that uses the method of conjugate gradients directly to the Kronecker-structured problem and another that reformulates the system as a Sylvester matrix equation. We show that both approaches produce an accurate solution within an acceptable computation time and with practical memory requirements for the data size that is currently available.
In selfish bin packing, each item is regarded as a player, who aims to minimize the cost-share by choosing a bin it can fit in. To have a least number of bins used, cost-sharing rules play an important role. The currently best known cost sharing rule has a lower bound on $PoA$ larger than 1.45, while a general lower bound 4/3 on $PoA$ applies to any cost-sharing rule under which no items have incentive unilaterally moving to an empty bin. In this paper, we propose a novel and simple rule with a $PoA$ matching the lower bound, thus completely resolving this game. The new rule always admits a Nash equilibrium and its $PoS$ is one. Furthermore, the well-known bin packing algorithm $BFD$ (Best-Fit Decreasing) is shown to achieve a strong equilibrium, implying that a stable packing with an asymptotic approximation ratio of $11/9$ can be produced in polynomial time.
In this paper we get error bounds for fully discrete approximations of infinite horizon problems via the dynamic programming approach. It is well known that considering a time discretization with a positive step size $h$ an error bound of size $h$ can be proved for the difference between the value function (viscosity solution of the Hamilton-Jacobi-Bellman equation corresponding to the infinite horizon) and the value function of the discrete time problem. However, including also a spatial discretization based on elements of size $k$ an error bound of size $O(k/h)$ can be found in the literature for the error between the value functions of the continuous problem and the fully discrete problem. In this paper we revise the error bound of the fully discrete method and prove, under similar assumptions to those of the time discrete case, that the error of the fully discrete case is in fact $O(h+k)$ which gives first order in time and space for the method. This error bound matches the numerical experiments of many papers in the literature in which the behaviour $1/h$ from the bound $O(k/h)$ have not been observed.
The Mixture-of-Experts (MoE) technique can scale up the model size of Transformers with an affordable computational overhead. We point out that existing learning-to-route MoE methods suffer from the routing fluctuation issue, i.e., the target expert of the same input may change along with training, but only one expert will be activated for the input during inference. The routing fluctuation tends to harm sample efficiency because the same input updates different experts but only one is finally used. In this paper, we propose StableMoE with two training stages to address the routing fluctuation problem. In the first training stage, we learn a balanced and cohesive routing strategy and distill it into a lightweight router decoupled from the backbone model. In the second training stage, we utilize the distilled router to determine the token-to-expert assignment and freeze it for a stable routing strategy. We validate our method on language modeling and multilingual machine translation. The results show that StableMoE outperforms existing MoE methods in terms of both convergence speed and performance.
We study the distributed minimum spanning tree (MST) problem, a fundamental problem in distributed computing. It is well-known that distributed MST can be solved in $\tilde{O}(D+\sqrt{n})$ rounds in the standard CONGEST model (where $n$ is the network size and $D$ is the network diameter) and this is essentially the best possible round complexity (up to logarithmic factors). However, in resource-constrained networks such as ad hoc wireless and sensor networks, nodes spending so much time can lead to significant spending of resources such as energy. Motivated by the above consideration, we study distributed algorithms for MST under the \emph{sleeping model} [Chatterjee et al., PODC 2020], a model for design and analysis of resource-efficient distributed algorithms. In the sleeping model, a node can be in one of two modes in any round -- \emph{sleeping} or \emph{awake} (unlike the traditional model where nodes are always awake). Only the rounds in which a node is \emph{awake} are counted, while \emph{sleeping} rounds are ignored. A node spends resources only in the awake rounds and hence the main goal is to minimize the \emph{awake complexity} of a distributed algorithm, the worst-case number of rounds any node is awake. We present deterministic and randomized distributed MST algorithms that have an \emph{optimal} awake complexity of $O(\log n)$ time with a matching lower bound. We also show that our randomized awake-optimal algorithm has essentially the best possible round complexity by presenting a lower bound of $\tilde{\Omega}(n)$ on the product of the awake and round complexity of any distributed algorithm (including randomized) that outputs an MST, where $\tilde{\Omega}$ hides a $1/(\text{polylog } n)$ factor.
We describe a numerical algorithm for approximating the equilibrium-reduced density matrix and the effective (mean force) Hamiltonian for a set of system spins coupled strongly to a set of bath spins when the total system (system+bath) is held in canonical thermal equilibrium by weak coupling with a "super-bath". Our approach is a generalization of now standard typicality algorithms for computing the quantum expectation value of observables of bare quantum systems via trace estimators and Krylov subspace methods. In particular, our algorithm makes use of the fact that the reduced system density, when the bath is measured in a given random state, tends to concentrate about the corresponding thermodynamic averaged reduced system density. Theoretical error analysis and numerical experiments are given to validate the accuracy of our algorithm. Further numerical experiments demonstrate the potential of our approach for applications including the study of quantum phase transitions and entanglement entropy for long-range interaction systems.
We present faster-than-native alternatives for the full AVX512-VP2INTERSECT instruction subset using basic AVX512F instructions. These alternatives compute only one of the output masks, which is sufficient for the typical case of computing the intersection of two sorted lists of integers, or computing the size of such an intersection. While the na\"ive implementation (compare the first input vector against all rotations of the second) is slower than the native instructions, we show that by rotating both the first and second operands at the same time there is a significant saving in the total number of vector rotations, resulting in the emulations being faster than the native instructions, for all instructions in the VP2INTERSECT subset. Additionally, the emulations can be easily extended to other types of inputs (e.g. packed vectors of 16-bit integers) for which native instructions are not available.
Tensor PCA is a stylized statistical inference problem introduced by Montanari and Richard to study the computational difficulty of estimating an unknown parameter from higher-order moment tensors. Unlike its matrix counterpart, Tensor PCA exhibits a statistical-computational gap, i.e., a sample size regime where the problem is information-theoretically solvable but conjectured to be computationally hard. This paper derives computational lower bounds on the run-time of memory bounded algorithms for Tensor PCA using communication complexity. These lower bounds specify a trade-off among the number of passes through the data sample, the sample size, and the memory required by any algorithm that successfully solves Tensor PCA. While the lower bounds do not rule out polynomial-time algorithms, they do imply that many commonly-used algorithms, such as gradient descent and power method, must have a higher iteration count when the sample size is not large enough. Similar lower bounds are obtained for Non-Gaussian Component Analysis, a family of statistical estimation problems in which low-order moment tensors carry no information about the unknown parameter. Finally, stronger lower bounds are obtained for an asymmetric variant of Tensor PCA and related statistical estimation problems. These results explain why many estimators for these problems use a memory state that is significantly larger than the effective dimensionality of the parameter of interest.
Gradient descent is slow to converge for ill-conditioned problems and non-convex problems. An important technique for acceleration is step-size adaptation. The first part of this paper contains a detailed review of step-size adaptation methods, including Polyak step-size, L4, LossGrad, Adam, IDBD, and Hypergradient descent, and the relation of step-size adaptation to meta-gradient methods. In the second part of this paper, we propose a new class of methods of accelerating gradient descent that have some distinctiveness from existing techniques. The new methods, which we call {\em step-size planning}, use the {\em update experience} to learn an improved way of updating the parameters. The methods organize the experience into $K$ steps away from each other to facilitate planning. From the past experience, our planning algorithm, Csawg, learns a step-size model which is a form of multi-step machine that predicts future updates. We extends Csawg to applying step-size planning multiple steps, which leads to further speedup. We discuss and highlight the projection power of the diagonal-matrix step-size for future large scale applications. We show for a convex problem, our methods can surpass the convergence rate of Nesterov's accelerated gradient, $1 - \sqrt{\mu/L}$, where $\mu, L$ are the strongly convex factor of the loss function $F$ and the Lipschitz constant of $F'$, which is the theoretical limit for the convergence rate of first-order methods. On the well-known non-convex Rosenbrock function, our planning methods achieve zero error below 500 gradient evaluations, while gradient descent takes about 10000 gradient evaluations to reach a $10^{-3}$ accuracy. We discuss the connection of step-size planing to planning in reinforcement learning, in particular, Dyna architectures.
We propose a new fast streaming algorithm for the tensor completion problem of imputing missing entries of a low-tubal-rank tensor using the tensor singular value decomposition (t-SVD) algebraic framework. We show the t-SVD is a specialization of the well-studied block-term decomposition for third-order tensors, and we present an algorithm under this model that can track changing free submodules from incomplete streaming 2-D data. The proposed algorithm uses principles from incremental gradient descent on the Grassmann manifold of subspaces to solve the tensor completion problem with linear complexity and constant memory in the number of time samples. We provide a local expected linear convergence result for our algorithm. Our empirical results are competitive in accuracy but much faster in compute time than state-of-the-art tensor completion algorithms on real applications to recover temporal chemo-sensing and MRI data under limited sampling.