The Hilbert metric is a distance function defined for points lying within the interior of a convex body. It arises in the analysis and processing of convex bodies, machine learning, and quantum information theory. In this paper, we show how to adapt the Euclidean Delaunay triangulation to the Hilbert geometry defined by a convex polygon in the plane. We analyze the geometric properties of the Hilbert Delaunay triangulation, which has some notable differences with respect to the Euclidean case, including the fact that the triangulation does not necessarily cover the convex hull of the point set. We also introduce the notion of a Hilbert ball at infinity, which is a Hilbert metric ball centered on the boundary of the convex polygon. We present a simple randomized incremental algorithm that computes the Hilbert Delaunay triangulation for a set of $n$ points in the Hilbert geometry defined by a convex $m$-gon. The algorithm runs in $O(n (\log n + \log^3 m))$ expected time. In addition we introduce the notion of the Hilbert hull of a set of points, which we define to be the region covered by their Hilbert Delaunay triangulation. We present an algorithm for computing the Hilbert hull in time $O(n h \log^2 m)$, where $h$ is the number of points on the hull's boundary.
In Natural Language Processing, entity linking (EL) has centered around Wikipedia, but yet remains underexplored for the job market domain. Disambiguating skill mentions can help us get insight into the current labor market demands. In this work, we are the first to explore EL in this domain, specifically targeting the linkage of occupational skills to the ESCO taxonomy (le Vrang et al., 2014). Previous efforts linked coarse-grained (full) sentences to a corresponding ESCO skill. In this work, we link more fine-grained span-level mentions of skills. We tune two high-performing neural EL models, a bi-encoder (Wu et al., 2020) and an autoregressive model (Cao et al., 2021), on a synthetically generated mention--skill pair dataset and evaluate them on a human-annotated skill-linking benchmark. Our findings reveal that both models are capable of linking implicit mentions of skills to their correct taxonomy counterparts. Empirically, BLINK outperforms GENRE in strict evaluation, but GENRE performs better in loose evaluation (accuracy@$k$).
The number of Language Models (LMs) dedicated to processing scientific text is on the rise. Keeping pace with the rapid growth of scientific LMs (SciLMs) has become a daunting task for researchers. To date, no comprehensive surveys on SciLMs have been undertaken, leaving this issue unaddressed. Given the constant stream of new SciLMs, appraising the state-of-the-art and how they compare to each other remain largely unknown. This work fills that gap and provides a comprehensive review of SciLMs, including an extensive analysis of their effectiveness across different domains, tasks and datasets, and a discussion on the challenges that lie ahead.
We reiterate the contribution made by Harrow, Hassidim, and Llyod to the quantum matrix equation solver with the emphasis on the algorithm description and the error analysis derivation details. Moreover, the behavior of the amplitudes of the phase register on the completion of the Quantum Phase Estimation is studied. This study is beneficial for the comprehension of the choice of the phase register size and its interrelation with the Hamiltonian simulation duration in the algorithm setup phase.
We propose a method for computing integrable orthogonal frame fields on planar surfaces. Frames and their symmetries are implicitly represented using orthogonally decomposable (odeco) tensors. To formulate an integrability criterion, we express the frame field's Lie bracket solely in terms of the tensor representation; this is made possible by studying the sensitivity of the frame with respect to perturbations in the tensor. We construct an energy formulation that computes smooth and integrable frame fields, in both isotropic and anisotropic settings. The user can prescribe any size and orientation constraints in input, and the solver creates and places the singularities required to fit the constraints with the correct topology. The computed frame field can be integrated to a seamless parametrization that is aligned with the frame field.
Numerous applications in the field of molecular communications (MC) such as healthcare systems are often event-driven. The conventional Shannon capacity may not be the appropriate metric for assessing performance in such cases. We propose the identification (ID) capacity as an alternative metric. Particularly, we consider randomized identification (RI) over the discrete-time Poisson channel (DTPC), which is typically used as a model for MC systems that utilize molecule-counting receivers. In the ID paradigm, the receiver's focus is not on decoding the message sent. However, he wants to determine whether a message of particular significance to him has been sent or not. In contrast to Shannon transmission codes, the size of ID codes for a Discrete Memoryless Channel (DMC) grows doubly exponentially fast with the blocklength, if randomized encoding is used. In this paper, we derive the capacity formula for RI over the DTPC subject to some peak and average power constraints. Furthermore, we analyze the case of state-dependent DTPC.
Missing data is a pernicious problem in epidemiologic research. Research on the validity of complete case analysis for missing data has typically focused on estimating the average treatment effect (ATE) in the whole population. However, other target populations like the treated (ATT) or external targets can be of substantive interest. In such cases, whether missing covariate data occurs within or outside the target population may impact the validity of complete case analysis. We sought to assess bias in complete case analysis when covariate data is missing outside the target (e.g., missing covariate data among the untreated when estimating the ATT). We simulated a study of the effect of a binary treatment X on a binary outcome Y in the presence of 3 confounders C1-C3 that modified the risk difference (RD). We induced missingness in C1 only among the untreated under 4 scenarios: completely randomly (similar to MCAR); randomly based on C2 and C3 (similar to MAR); randomly based on C1 (similar to MNAR); or randomly based on Y (similar to MAR). We estimated the ATE and ATT using weighting and averaged results across the replicates. We conducted a parallel simulation transporting trial results to a target population in the presence of missing covariate data in the trial. In the complete case analysis, estimated ATE was unbiased only when C1 was MCAR among the untreated. The estimated ATT, on the other hand, was unbiased in all scenarios except when Y caused missingness. The parallel simulation of generalizing and transporting trial results saw similar bias patterns. If missing covariate data is only present outside the target population, complete case analysis is unbiased except when missingness is associated with the outcome.
One of the primary sequencing methods gaining prominence in DNA storage is nanopore sequencing, attributed to various factors. In this work, we consider a simplified model of the sequencer, characterized as a channel. This channel takes a sequence and processes it using a sliding window of length $\ell$, shifting the window by $\delta$ characters each time. The output of this channel, which we refer to as the read vector, is a vector containing the sums of the entries in each of the windows. The capacity of the channel is defined as the maximal information rate of the channel. Previous works have already revealed capacity values for certain parameters $\ell$ and $\delta$. In this work, we show that when $\delta < \ell < 2\delta$, the capacity value is given by $\frac{1}{\delta}\log_2 \frac{1}{2}(\ell+1+ \sqrt{(\ell+1)^2 - 4(\ell - \delta)(\ell-\delta +1)})$. Additionally, we construct an upper bound when $2\delta < \ell$. Finally, we extend the model to the two-dimensional case and present several results on its capacity.
Although continuous advances in theoretical modelling of Molecular Communications (MC) are observed, there is still an insuperable gap between theory and experimental testbeds, especially at the microscale. In this paper, the development of the first testbed incorporating engineered yeast cells is reported. Different from the existing literature, eukaryotic yeast cells are considered for both the sender and the receiver, with {\alpha}-factor molecules facilitating the information transfer. The use of such cells is motivated mainly by the well understood biological mechanism of yeast mating, together with their genetic amenability. In addition, recent advances in yeast biosensing establish yeast as a suitable detector and a neat interface to in-body sensor networks. The system under consideration is presented first, and the mathematical models of the underlying biological processes leading to an end-to-end (E2E) system are given. The experimental setup is then described and used to obtain experimental results which validate the developed mathematical models. Beyond that, the ability of the system to effectively generate output pulses in response to repeated stimuli is demonstrated, reporting one event per two hours. However, fast RNA fluctuations indicate cell responses in less than three minutes, demonstrating the potential for much higher rates in the future.
We use Markov categories to develop generalizations of the theory of Markov chains and hidden Markov models in an abstract setting. This comprises characterizations of hidden Markov models in terms of local and global conditional independences as well as existing algorithms for Bayesian filtering and smoothing applicable in all Markov categories with conditionals. We show that these algorithms specialize to existing ones such as the Kalman filter, forward-backward algorithm, and the Rauch-Tung-Striebel smoother when instantiated in appropriate Markov categories. Under slightly stronger assumptions, we also prove that the sequence of outputs of the Bayes filter is itself a Markov chain with a concrete formula for its transition maps. There are two main features of this categorical framework. The first is its generality, as it can be used in any Markov category with conditionals. In particular, it provides a systematic unified account of hidden Markov models and algorithms for filtering and smoothing in discrete probability, Gaussian probability, measure-theoretic probability, possibilistic nondeterminism and others at the same time. The second feature is the intuitive visual representation of information flow in these algorithms in terms of string diagrams.
The generalization mystery in deep learning is the following: Why do over-parameterized neural networks trained with gradient descent (GD) generalize well on real datasets even though they are capable of fitting random datasets of comparable size? Furthermore, from among all solutions that fit the training data, how does GD find one that generalizes well (when such a well-generalizing solution exists)? We argue that the answer to both questions lies in the interaction of the gradients of different examples during training. Intuitively, if the per-example gradients are well-aligned, that is, if they are coherent, then one may expect GD to be (algorithmically) stable, and hence generalize well. We formalize this argument with an easy to compute and interpretable metric for coherence, and show that the metric takes on very different values on real and random datasets for several common vision networks. The theory also explains a number of other phenomena in deep learning, such as why some examples are reliably learned earlier than others, why early stopping works, and why it is possible to learn from noisy labels. Moreover, since the theory provides a causal explanation of how GD finds a well-generalizing solution when one exists, it motivates a class of simple modifications to GD that attenuate memorization and improve generalization. Generalization in deep learning is an extremely broad phenomenon, and therefore, it requires an equally general explanation. We conclude with a survey of alternative lines of attack on this problem, and argue that the proposed approach is the most viable one on this basis.