Group testing can help maintain a widespread testing program using fewer resources amid a pandemic. In group testing, we are given $n$ samples, one per individual. These samples are arranged into $m < n$ pooled samples, where each pool is obtained by mixing a subset of the $n$ individual samples. Infected individuals are then identified using a group testing algorithm. In this paper, we use side information (SI) collected from contact tracing (CT) within nonadaptive/single-stage group testing algorithms. We generate CT SI data by incorporating characteristics of disease spread between individuals. These data are fed into two signal and measurement models for group testing, and numerical results show that our algorithms provide improved sensitivity and specificity. We also show how to incorporate CT SI into the design of the pooling matrix. That said, our numerical results suggest that the utilization of SI in the pooling matrix design based on the minimization of a weighted coherence measure does not yield significant performance gains beyond the incorporation of SI in the group testing algorithm.
Due to the COVID 19 pandemic, smartphone-based proximity tracing systems became of utmost interest. Many of these systems use BLE signals to estimate the distance between two persons. The quality of this method depends on many factors and, therefore, does not always deliver accurate results. In this paper, we present a multi-channel approach to improve proximity classification, and a novel, publicly available data set that contains matched IEEE 802.11 (2.4 GHz and 5 GHz) and BLE signal strength data, measured in four different environments. We have developed and evaluated a combined classification model based on BLE and IEEE 802.11 signals. Our approach significantly improves the distance classification and consequently also the contact tracing accuracy. We are able to achieve good results with our approach in everyday public transport scenarios. However, in our implementation based on IEEE 802.11 probe requests, we also encountered privacy problems and limitations due to the consistency and interval at which such probes are sent. We discuss these limitations and sketch how our approach could be improved to make it suitable for real-world deployment.
Current practices in metric evaluation focus on one single dataset, e.g., Newstest dataset in each year's WMT Metrics Shared Task. However, in this paper, we qualitatively and quantitatively show that the performances of metrics are sensitive to data. The ranking of metrics varies when the evaluation is conducted on different datasets. Then this paper further investigates two potential hypotheses, i.e., insignificant data points and the deviation of Independent and Identically Distributed (i.i.d) assumption, which may take responsibility for the issue of data variance. In conclusion, our findings suggest that when evaluating automatic translation metrics, researchers should take data variance into account and be cautious to claim the result on a single dataset, because it may leads to inconsistent results with most of other datasets.
We study streaming algorithms in the white-box adversarial model, where the stream is chosen adaptively by an adversary who observes the entire internal state of the algorithm at each time step. We show that nontrivial algorithms are still possible. We first give a randomized algorithm for the $L_1$-heavy hitters problem that outperforms the optimal deterministic Misra-Gries algorithm on long streams. If the white-box adversary is computationally bounded, we use cryptographic techniques to reduce the memory of our $L_1$-heavy hitters algorithm even further and to design a number of additional algorithms for graph, string, and linear algebra problems. The existence of such algorithms is surprising, as the streaming algorithm does not even have a secret key in this model, i.e., its state is entirely known to the adversary. One algorithm we design is for estimating the number of distinct elements in a stream with insertions and deletions achieving a multiplicative approximation and sublinear space; such an algorithm is impossible for deterministic algorithms. We also give a general technique that translates any two-player deterministic communication lower bound to a lower bound for {\it randomized} algorithms robust to a white-box adversary. In particular, our results show that for all $p\ge 0$, there exists a constant $C_p>1$ such that any $C_p$-approximation algorithm for $F_p$ moment estimation in insertion-only streams with a white-box adversary requires $\Omega(n)$ space for a universe of size $n$. Similarly, there is a constant $C>1$ such that any $C$-approximation algorithm in an insertion-only stream for matrix rank requires $\Omega(n)$ space with a white-box adversary. Our algorithmic results based on cryptography thus show a separation between computationally bounded and unbounded adversaries. (Abstract shortened to meet arXiv limits.)
Community detection refers to the problem of clustering the nodes of a network into groups. Existing inferential methods for community structure mainly focus on unweighted (binary) networks. Many real-world networks are nonetheless weighted and a common practice is to dichotomize a weighted network to an unweighted one which is known to result in information loss. Literature on hypothesis testing in the latter situation is still missing. In this paper, we study the problem of testing the existence of community structure in weighted networks. Our contributions are threefold: (a). We use the (possibly infinite-dimensional) exponential family to model the weights and derive the sharp information-theoretic limit for the existence of consistent test. Within the limit, any test is inconsistent; and beyond the limit, we propose a useful consistent test. (b). Based on the information-theoretic limits, we provide the first formal way to quantify the loss of information incurred by dichotomizing weighted graphs into unweighted graphs in the context of hypothesis testing. (c). We propose several new and practically useful test statistics. Simulation study show that the proposed tests have good performance. Finally, we apply the proposed tests to an animal social network.
The key relay protocol (KRP) plays an important role in improving the performance and the security of quantum key distribution (QKD) networks. On the other hand, there is also an existing research field called secure network coding (SNC), which has similar goal and structure. We here analyze differences and similarities between the KRP and SNC rigorously. We found, rather surprisingly, that there is a definite gap in security between the KRP and SNC; that is, certain KRPs achieve better security than any SNC schemes on the same graph. We also found that this gap can be closed if we generalize the notion of SNC by adding free public channels; that is, KRPs are equivalent to SNC schemes augmented with free public channels.
We study the problem of testing whether a function $f: \mathbb{R}^n \to \mathbb{R}$ is a polynomial of degree at most $d$ in the \emph{distribution-free} testing model. Here, the distance between functions is measured with respect to an unknown distribution $\mathcal{D}$ over $\mathbb{R}^n$ from which we can draw samples. In contrast to previous work, we do not assume that $\mathcal{D}$ has finite support. We design a tester that given query access to $f$, and sample access to $\mathcal{D}$, makes $(d/\varepsilon)^{O(1)}$ many queries to $f$, accepts with probability $1$ if $f$ is a polynomial of degree $d$, and rejects with probability at least $2/3$ if every degree-$d$ polynomial $P$ disagrees with $f$ on a set of mass at least $\varepsilon$ with respect to $\mathcal{D}$. Our result also holds under mild assumptions when we receive only a polynomial number of bits of precision for each query to $f$, or when $f$ can only be queried on rational points representable using a logarithmic number of bits. Along the way, we prove a new stability theorem for multivariate polynomials that may be of independent interest.
We propose a novel framework for learning a low-dimensional representation of data based on nonlinear dynamical systems, which we call dynamical dimension reduction (DDR). In the DDR model, each point is evolved via a nonlinear flow towards a lower-dimensional subspace; the projection onto the subspace gives the low-dimensional embedding. Training the model involves identifying the nonlinear flow and the subspace. Following the equation discovery method, we represent the vector field that defines the flow using a linear combination of dictionary elements, where each element is a pre-specified linear/nonlinear candidate function. A regularization term for the average total kinetic energy is also introduced and motivated by optimal transport theory. We prove that the resulting optimization problem is well-posed and establish several properties of the DDR method. We also show how the DDR method can be trained using a gradient-based optimization method, where the gradients are computed using the adjoint method from optimal control theory. The DDR method is implemented and compared on synthetic and example datasets to other dimension reductions methods, including PCA, t-SNE, and Umap.
The problem of scheduling unrelated machines has been studied since the inception of algorithmic mechanism design~\cite{NR99}. It is a resource allocation problem that entails assigning $m$ tasks to $n$ machines for execution. Machines are regarded as strategic agents who may lie about their execution costs so as to minimize their allocated workload. To address the situation when monetary payment is not an option to compensate the machines' costs, \citeauthor{DBLP:journals/mst/Koutsoupias14} [2014] devised two \textit{truthful} mechanisms, K and P respectively, that achieve an approximation ratio of $\frac{n+1}{2}$ and $n$, for social cost minimization. In addition, no truthful mechanism can achieve an approximation ratio better than $\frac{n+1}{2}$. Hence, mechanism K is optimal. While approximation ratio provides a strong worst-case guarantee, it also limits us to a comprehensive understanding of mechanism performance on various inputs. This paper investigates these two scheduling mechanisms beyond the worst case. We first show that mechanism K achieves a smaller social cost than mechanism P on every input. That is, mechanism K is pointwise better than mechanism P. Next, for each task $j$, when machines' execution costs $t_i^j$ are independent and identically drawn from a task-specific distribution $F^j(t)$, we show that the average-case approximation ratio of mechanism K converges to a constant. This bound is tight for mechanism K. For a better understanding of this distribution dependent constant, on the one hand, we estimate its value by plugging in a few common distributions; on the other, we show that this converging bound improves a known bound \cite{DBLP:conf/aaai/Zhang18} which only captures the single-task setting. Last, we find that the average-case approximation ratio of mechanism P converges to the same constant.
There are many important high dimensional function classes that have fast agnostic learning algorithms when strong assumptions on the distribution of examples can be made, such as Gaussianity or uniformity over the domain. But how can one be sufficiently confident that the data indeed satisfies the distributional assumption, so that one can trust in the output quality of the agnostic learning algorithm? We propose a model by which to systematically study the design of tester-learner pairs $(\mathcal{A},\mathcal{T})$, such that if the distribution on examples in the data passes the tester $\mathcal{T}$ then one can safely trust the output of the agnostic learner $\mathcal{A}$ on the data. To demonstrate the power of the model, we apply it to the classical problem of agnostically learning halfspaces under the standard Gaussian distribution and present a tester-learner pair with a combined run-time of $n^{\tilde{O}(1/\epsilon^4)}$. This qualitatively matches that of the best known ordinary agnostic learning algorithms for this task. In contrast, finite sample Gaussian distribution testers do not exist for the $L_1$ and EMD distance measures. A key step in the analysis is a novel characterization of concentration and anti-concentration properties of a distribution whose low-degree moments approximately match those of a Gaussian. We also use tools from polynomial approximation theory. In contrast, we show strong lower bounds on the combined run-times of tester-learner pairs for the problems of agnostically learning convex sets under the Gaussian distribution and for monotone Boolean functions under the uniform distribution over $\{0,1\}^n$. Through these lower bounds we exhibit natural problems where there is a dramatic gap between standard agnostic learning run-time and the run-time of the best tester-learner pair.
Multimodal sentiment analysis is a very actively growing field of research. A promising area of opportunity in this field is to improve the multimodal fusion mechanism. We present a novel feature fusion strategy that proceeds in a hierarchical fashion, first fusing the modalities two in two and only then fusing all three modalities. On multimodal sentiment analysis of individual utterances, our strategy outperforms conventional concatenation of features by 1%, which amounts to 5% reduction in error rate. On utterance-level multimodal sentiment analysis of multi-utterance video clips, for which current state-of-the-art techniques incorporate contextual information from other utterances of the same clip, our hierarchical fusion gives up to 2.4% (almost 10% error rate reduction) over currently used concatenation. The implementation of our method is publicly available in the form of open-source code.