In this paper, we propose an alternating direction method of multipliers (ADMM)-based optimization algorithm to achieve better undersampling rate for multiple measurement vector (MMV) problem. The core is to introduce the $\ell_{2,0}$-norm sparsity constraint to describe the joint-sparsity of the MMV problem, which is different from the widely used $\ell_{2,1}$-norm constraint in the existing research. In order to illustrate the better performance of $\ell_{2,0}$-norm, first this paper proves the equivalence of the sparsity of the row support set of a matrix and its $\ell_{2,0}$-norm. Afterward, the MMV problem based on $\ell_{2,0}$-norm is proposed. Moreover, building on the Kurdyka-Lojasiewicz property, this paper establishes that the sequence generated by ADMM globally converges to the optimal point of the MMV problem. Finally, the performance of our algorithm and comparison with other algorithms under different conditions is studied by simulated examples.
Over-the-air federated learning (OTA-FL) is an emerging technique to reduce the computation and communication overload at the PS caused by the orthogonal transmissions of the model updates in conventional federated learning (FL). This reduction is achieved at the expense of introducing aggregation error that can be efficiently suppressed by means of receive beamforming via large array-antennas. This paper studies OTA-FL in massive multiple-input multiple-output (MIMO) systems by considering a realistic scenario in which the edge server, despite its large antenna array, is restricted in the number of radio frequency (RF)-chains. For this setting, the beamforming for over-the-air model aggregation needs to be addressed jointly with antenna selection. This leads to an NP-hard problem due to the combinatorial nature of the optimization. We tackle this problem via two different approaches. In the first approach, we use the penalty dual decomposition (PDD) technique to develop a two-tier algorithm for joint antenna selection and beamforming. The second approach interprets the antenna selection task as a sparse recovery problem and develops two iterative joint algorithms based on the Lasso and fast iterative soft-thresholding methods. Convergence and complexity analysis is presented for all the schemes. The numerical investigations depict that the algorithms based on the sparse recovery techniques outperform the PDD-based algorithm, when the number of RF-chains at the edge server is much smaller than its array size. However, as the number of RF-chains increases, the PDD approach starts to be superior. Our simulations further depict that learning performance with all the antennas being active at the PS can be closely tracked by selecting less than 20% of the antennas at the PS.
A piecewise linear function can be described in different forms: as an arbitrarily nested expression of $\min$- and $\max$-functions, as a difference of two convex piecewise linear functions, or as a linear combination of maxima of affine-linear functions. In this paper, we provide two main results: first, we show that for every piecewise linear function there exists a linear combination of $\max$-functions with at most $n+1$ arguments, and give an algorithm for its computation. Moreover, these arguments are contained in the finite set of affine-linear functions that coincide with the given function in some open set. Second, we prove that the piecewise linear function $\max(0, x_{1}, \ldots, x_{n})$ cannot be represented as a linear combination of maxima of less than $n+1$ affine-linear arguments. This was conjectured by Wang and Sun in 2005 in a paper on representations of piecewise linear functions as linear combination of maxima.
Consider a hiring process with candidates coming from different universities. It is easy to order candidates who have the same background, yet it can be challenging to compare them otherwise. The latter case requires additional costly assessments and can result in sub-optimal hiring decisions. Given an assigned budget, what would be an optimal strategy to select the most qualified candidate? We model the above problem by introducing a new variant of the secretary problem in which sequentially observed candidates are split into two distinct groups. For each new candidate, the decision maker observes its rank among already seen candidates from the same group and can access its rank among all observed candidates at some fixed cost. To tackle this new problem, we introduce and study the family of Dynamic Double Threshold (DDT) algorithms. We show that, with well-chosen parameters, their success probability converges rapidly to 1/e as the budget grows, recovering the optimal success probability from the usual secretary problem. Finally, focusing on the class of memory-less algorithms, we propose an optimal algorithm in the non-asymptotic regime and show that it belongs to the DDT family when the number of candidates is large.
In this paper we derive sufficient conditions for the convergence of two popular alternating minimisation algorithms for dictionary learning - the Method of Optimal Directions (MOD) and Online Dictionary Learning (ODL), which can also be thought of as approximative K-SVD. We show that given a well-behaved initialisation that is either within distance at most $1/\log(K)$ to the generating dictionary or has a special structure ensuring that each element of the initialisation only points to one generating element, both algorithms will converge with geometric convergence rate to the generating dictionary. This is done even for data models with non-uniform distributions on the supports of the sparse coefficients. These allow the appearance frequency of the dictionary elements to vary heavily and thus model real data more closely.
How well do language models deal with quantification? In this study, we focus on 'few'-type quantifiers, as in 'few children like toys', which might pose a particular challenge for language models because the sentence components with out the quantifier are likely to co-occur, and 'few'-type quantifiers are rare. We present 960 English sentence stimuli from two human neurolinguistic experiments to 22 autoregressive transformer models of differing sizes. Not only do all the models perform poorly on 'few'-type quantifiers, but overall the larger the model, the worse its performance. This inverse scaling is consistent with previous work suggesting that larger models increasingly reflect online rather than offline human processing, and we argue that the decreasing performance of larger models may challenge uses of language models as the basis for natural language systems.
We revisit $M$-ary classification of Gutman (TIT 1989), where one is tasked to determine whether a testing sequence is generated with the same distribution as one of the $M$ training sequences or not. Our main result is a two-phase test, its theoretical analysis and its optimality guarantee. Specifically, our two-phase test is a special case of a sequential test with only two decision time points: the first phase of our test is a fixed-length test with a reject option, the second-phase of our test proceeds only if a reject option is decided in the first phase and the second phase of our test does \emph{not} allow a reject option. To provide theoretical guarantee for our test, we derive achievable error exponents using the method of types and derive a converse result for the optimal sequential test using the techniques recently proposed by Hsu, Li and Wang (ITW, 2022) for binary classification. Analytically and numerically, we show that our two phase test achieves the performance of an optimal sequential test with proper choice of test parameters. In particular, similarly as the optimal sequential test, our test does not need a final reject option to achieve the optimal error exponent region while an optimal fixed-length test needs a reject option to achieve the same region. Finally, we specialize our results to binary classification when $M=2$ and to $M$-ary hypothesis testing when the ratio of the lengths of training sequences and testing sequences tends to infinity so that generating distributions can be estimated perfectly.
We consider the variable selection problem for two-sample tests, aiming to select the most informative variables to distinguish samples from two groups. To solve this problem, we propose a framework based on the kernel maximum mean discrepancy (MMD). Our approach seeks a group of variables with a pre-specified size that maximizes the variance-regularized MMD statistics. This formulation also corresponds to the minimization of asymptotic type-II error while controlling type-I error, as studied in the literature. We present mixed-integer programming formulations and offer exact and approximation algorithms with performance guarantees for linear and quadratic types of kernel functions. Experimental results demonstrate the superior performance of our framework.
The article discusses the localization of radiation sources whose number and other relevant parameters are not known in advance. The data collection is ensured by an autonomous mobile robot that performs a survey in a defined region of interest populated with static obstacles. The measurement trajectory is information-driven rather than pre-planned. The localization exploits a regularized particle filter estimating the sources' parameters continuously. The dynamic robot control switches between two modes, one attempting to minimize the Shannon entropy and the other aiming to reduce the variance of expected measurements in unexplored parts of the target area; both of the modes maintain safe clearance from the obstacles. The performance of the algorithms was tested in a simulation study based on real-world data acquired previously from three radiation sources exhibiting various activities. Our approach reduces the time necessary to explore the region and to find the sources by approximately 40 %; at present, however, the method is unable to reliably localize sources that have a relatively low intensity. In this context, additional research has been planned to increase the credibility and robustness of the procedure and to improve the robotic platform autonomy.
Due to the high communication overhead when training machine learning models in a distributed environment, modern algorithms invariably rely on lossy communication compression. However, when untreated, the errors caused by compression propagate, and can lead to severely unstable behavior, including exponential divergence. Almost a decade ago, Seide et al [2014] proposed an error feedback (EF) mechanism, which we refer to as EF14, as an immensely effective heuristic for mitigating this issue. However, despite steady algorithmic and theoretical advances in the EF field in the last decade, our understanding is far from complete. In this work we address one of the most pressing issues. In particular, in the canonical nonconvex setting, all known variants of EF rely on very large batch sizes to converge, which can be prohibitive in practice. We propose a surprisingly simple fix which removes this issue both theoretically, and in practice: the application of Polyak's momentum to the latest incarnation of EF due to Richt\'{a}rik et al. [2021] known as EF21. Our algorithm, for which we coin the name EF21-SGDM, improves the communication and sample complexities of previous error feedback algorithms under standard smoothness and bounded variance assumptions, and does not require any further strong assumptions such as bounded gradient dissimilarity. Moreover, we propose a double momentum version of our method that improves the complexities even further. Our proof seems to be novel even when compression is removed from the method, and as such, our proof technique is of independent interest in the study of nonconvex stochastic optimization enriched with Polyak's momentum.
We consider the adversarial linear contextual bandit setting, which allows for the loss functions associated with each of $K$ arms to change over time without restriction. Assuming the $d$-dimensional contexts are drawn from a fixed known distribution, the worst-case expected regret over the course of $T$ rounds is known to scale as $\tilde O(\sqrt{Kd T})$. Under the additional assumption that the density of the contexts is log-concave, we obtain a second-order bound of order $\tilde O(K\sqrt{d V_T})$ in terms of the cumulative second moment of the learner's losses $V_T$, and a closely related first-order bound of order $\tilde O(K\sqrt{d L_T^*})$ in terms of the cumulative loss of the best policy $L_T^*$. Since $V_T$ or $L_T^*$ may be significantly smaller than $T$, these improve over the worst-case regret whenever the environment is relatively benign. Our results are obtained using a truncated version of the continuous exponential weights algorithm over the probability simplex, which we analyse by exploiting a novel connection to the linear bandit setting without contexts.