Unlike other metaheuristics, Differential Evolution (DE) employs a crossover operation that filters which variables are mutated, a feature that contributes to its successful application to a variety of complicated optimization problems. However, the underlying working principles of the crossover operation are not yet fully understood. In this paper, we try to reveal the influence of the binomial crossover by performing a theoretical comparison between the $(1+1)EA$ and its variants, the $(1+1)EA_{C}$ and the $(1+1)EA_{CM}$. Generally, introducing the binomial crossover enhances the exploration ability while degrading the exploitation ability, and under some conditions leads to dominance of the transition matrix for binary optimization problems. As a result, both the $(1+1)EA_{C}$ and the $(1+1)EA_{CM}$ outperform the $(1+1)EA$ on the unimodal OneMax problem, but do not always dominate it on the Deceptive problem. Finally, we perform an exploration analysis by investigating the probabilities of transferring from non-optimal states to the optimal state of the Deceptive problem; inspired by this analysis, we propose adaptive strategies to improve the ability of global exploration. Our results suggest that incorporating the binomial crossover is a feasible strategy for improving the performance of randomized search heuristics.
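To make the operators concrete, here is a minimal Python sketch of a (1+1)EA on OneMax with an optional binomial crossover that filters which mutated bits enter the offspring. The mutation rate $1/n$ and crossover rate `cr` are illustrative choices of ours; the exact $(1+1)EA_{C}$ and $(1+1)EA_{CM}$ variants analyzed in the paper may differ in detail.

```python
import random

def one_max(x):
    """Fitness of the OneMax problem: the number of one-bits."""
    return sum(x)

def evolve(n=50, cr=0.5, use_crossover=True, max_iters=100_000, seed=0):
    """(1+1)EA on OneMax. With use_crossover=True, a binomial crossover
    with rate cr filters which mutated coordinates enter the offspring
    (a sketch in the spirit of the (1+1)EA_C; details may differ)."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    for t in range(max_iters):
        # standard bitwise mutation with rate 1/n
        y = [b ^ (rng.random() < 1.0 / n) for b in x]
        if use_crossover:
            # binomial crossover: keep each mutated coordinate with prob cr,
            # forcing one coordinate from the mutant to avoid a clone
            j = rng.randrange(n)
            y = [y[i] if (rng.random() < cr or i == j) else x[i]
                 for i in range(n)]
        if one_max(y) >= one_max(x):   # elitist replacement
            x = y
        if one_max(x) == n:
            return t + 1               # hitting time of the optimum
    return max_iters
```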
The Moore-Penrose inverse is widely used in physics, statistics and various fields of engineering. Among other characteristics, it captures well the notion of inversion of linear operators in the case of overcomplete data. In data science, nonlinear operators are extensively used. In this paper we define and characterize the fundamental properties of a pseudo-inverse for nonlinear operators. The concept is defined broadly: first for general sets, and then refined for normed spaces. Our pseudo-inverse for normed spaces yields the Moore-Penrose inverse when the operator is a matrix. We present conditions for the existence and uniqueness of a pseudo-inverse and establish theoretical results investigating its properties, such as its continuity, its value for operator compositions and projection operators, and others. Analytic expressions are given for the pseudo-inverse of some well-known, non-invertible, nonlinear operators, such as hard- or soft-thresholding and ReLU. Finally, we analyze a neural layer and discuss relations to wavelet thresholding and to regularized loss minimization.
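As an illustration of the Moore-Penrose-style consistency property $T \circ T^\dagger \circ T = T$, the sketch below checks one plausible candidate pseudo-inverse for soft-thresholding (undoing the shrinkage on the range of the operator) and one for ReLU (the identity on $[0,\infty)$). These candidates are our own illustrative guesses and need not coincide with the analytic expressions derived in the paper.

```python
import numpy as np

def soft_threshold(x, lam=1.0):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def soft_threshold_pinv(y, lam=1.0):
    # candidate pseudo-inverse: undo the shrinkage on the range of T
    return np.where(y != 0, y + lam * np.sign(y), 0.0)

def relu(x):
    return np.maximum(x, 0.0)

def relu_pinv(y):
    # candidate pseudo-inverse: identity on the range [0, inf)
    return np.maximum(y, 0.0)

x = np.linspace(-3, 3, 101)
# Moore-Penrose-style consistency check: T(T+(T(x))) == T(x)
assert np.allclose(soft_threshold(soft_threshold_pinv(soft_threshold(x))),
                   soft_threshold(x))
assert np.allclose(relu(relu_pinv(relu(x))), relu(x))
```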
Recent advances in the theoretical understanding of SGD led to a formula for the optimal batch size minimizing the number of effective data passes, i.e., the number of iterations times the batch size. However, this formula is of no practical value as it depends on knowledge of the variance of the stochastic gradients evaluated at the optimum. In this paper we design a practical SGD method capable of learning the optimal batch size adaptively throughout its iterations for strongly convex and smooth functions. Our method does this provably, and in our experiments with synthetic and real data it robustly exhibits nearly optimal behaviour; that is, it works as if the optimal batch size was known a priori. Further, we generalize our method to several new batch strategies not considered in the literature before, including a sampling scheme suitable for distributed implementations.
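For intuition only, here is a deliberately crude sketch of the idea: a running estimate of the per-sample gradient variance stands in for the unknown variance at the optimum and drives the batch size. The variance proxy, batch-size rule, and step size below are invented for illustration; the paper's estimator and formula are different and come with guarantees that this toy does not.

```python
import numpy as np

def sgd_adaptive_batch(grad_i, n, x0, lr=0.1, eps=1e-2, iters=200, seed=0):
    """SGD with a crude adaptive batch size (illustrative sketch only).
    grad_i(x, i) returns the gradient of the i-th loss term at x.
    The batch size grows with a running variance estimate, standing in
    for the variance at the optimum used by the theoretical formula."""
    rng = np.random.default_rng(seed)
    x, var_est = np.asarray(x0, float), 1.0
    for _ in range(iters):
        b = int(np.clip(np.ceil(var_est / eps), 1, n))  # heuristic batch size
        idx = rng.choice(n, size=b, replace=False)
        grads = np.stack([grad_i(x, i) for i in idx])
        g = grads.mean(axis=0)
        # running estimate of per-sample gradient variance (summed trace)
        var_est = 0.9 * var_est + 0.1 * grads.var(axis=0).sum()
        x = x - lr * g
    return x

# toy strongly convex problem: f(x) = mean_i 0.5 * ||x - a_i||^2
a = np.random.default_rng(1).normal(size=(100, 5))
x_star = sgd_adaptive_batch(lambda x, i: x - a[i], n=100, x0=np.zeros(5))
```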
We study a fundamental question concerning adversarial noise models in statistical problems where the algorithm receives i.i.d. draws from a distribution $\mathcal{D}$. The definitions of these adversaries specify the type of allowable corruptions (noise model) as well as when these corruptions can be made (adaptivity); the latter differentiates between oblivious adversaries that can only corrupt the distribution $\mathcal{D}$ and adaptive adversaries that can have their corruptions depend on the specific sample $S$ that is drawn from $\mathcal{D}$. In this work, we investigate whether oblivious adversaries are effectively equivalent to adaptive adversaries, across all noise models studied in the literature. Specifically, can the behavior of an algorithm $\mathcal{A}$ in the presence of oblivious adversaries always be well-approximated by that of an algorithm $\mathcal{A}'$ in the presence of adaptive adversaries? Our first result shows that this is indeed the case for the broad class of statistical query algorithms, under all reasonable noise models. We then show that in the specific case of additive noise, this equivalence holds for all algorithms. Finally, we map out an approach towards proving this statement in its fullest generality, for all algorithms and under all reasonable noise models.
The case-crossover design (Maclure, 1991) is widely used in epidemiology and other fields to study causal effects of transient treatments on acute outcomes. However, its validity and causal interpretation have only been justified under informal conditions. Here, we place the design in a formal counterfactual framework for the first time. Doing so helps to clarify its assumptions and interpretation. In particular, when the treatment effect is non-null, we identify a previously unnoticed bias arising from common causes of the outcome at different person-times. We analytically characterize the direction and size of this bias and demonstrate its potential importance with a simulation. We also use our derivation of the limit of the case-crossover estimator to analyze its sensitivity to treatment effect heterogeneity, a violation of one of the informal criteria for validity. The upshot of this work for practitioners is that, while the case-crossover design can be useful for testing the causal null hypothesis in the presence of baseline confounders, extra caution is warranted when using the case-crossover design for point estimation of causal effects.
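For readers unfamiliar with the design, the following toy simulation shows the discordant-pair (Mantel-Haenszel) estimator recovering the true effect under idealized assumptions (rare outcome, exchangeable exposure windows). The bias the paper identifies stems from common causes of the outcome at different person-times, which this idealized sketch deliberately omits; the paper's own simulation and data-generating mechanism differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_exposure, b0, b1 = 200_000, 0.3, -6.0, 1.0  # rare outcome, true log-OR b1

# each person contributes a referent window (t=1) and a case window (t=2)
x1 = rng.binomial(1, p_exposure, n)
x2 = rng.binomial(1, p_exposure, n)
y2 = rng.binomial(1, 1 / (1 + np.exp(-(b0 + b1 * x2))))  # acute outcome at t=2

# case-crossover: among cases only, compare exposure across the two windows;
# the discordant-pair odds ratio estimates exp(b1)
cases = y2 == 1
n10 = np.sum(cases & (x2 == 1) & (x1 == 0))  # exposed at case window only
n01 = np.sum(cases & (x2 == 0) & (x1 == 1))  # exposed at referent window only
print("estimated log-OR:", np.log(n10 / n01), "truth:", b1)
```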
We provide an in-depth study of autocorrelations in Neural Markov Chain Monte Carlo simulations, a version of the traditional Metropolis algorithm which employs neural networks to provide independent proposals. We illustrate our ideas using the two-dimensional Ising model. We propose several estimates of autocorrelation times, some inspired by analytical results derived for the Metropolized Independent Sampler, which we compare and study as a function of inverse temperature $\beta$. Based on these results, we propose an alternative loss function and study its impact on the autocorrelation times. Furthermore, we investigate the impact of imposing system symmetries ($Z_2$ and/or translational) in the neural network training process on the autocorrelation times. Finally, we propose a scheme which incorporates partial heat-bath updates. The impact of the above enhancements is discussed for a $16 \times 16$ spin system. A summary of our findings may serve as a guide to the implementation of Neural Markov Chain Monte Carlo simulations of more complicated models.
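For reference, the standard estimator of the integrated autocorrelation time $\tau_{\mathrm{int}} = 1/2 + \sum_t \rho(t)$, one of the quantities such studies compare (other estimates in the paper are specific to the Metropolized Independent Sampler), can be sketched as follows; the truncation uses the usual self-consistent window heuristic.

```python
import numpy as np

def integrated_autocorrelation_time(series):
    """Estimate tau_int = 1/2 + sum_t rho(t) from a Monte Carlo time series,
    truncating the sum once t >> tau (self-consistent window heuristic)."""
    x = np.asarray(series, float)
    x = x - x.mean()
    n = len(x)
    f = np.fft.rfft(x, 2 * n)                    # zero-padded FFT
    acf = np.fft.irfft(f * np.conj(f))[:n]       # linear autocorrelation
    rho = acf / acf[0]
    tau = 0.5
    for t in range(1, n):
        tau += rho[t]
        if t >= 6.0 * tau:                       # truncate when t >> tau
            break
    return tau

# example: AR(1) with rho(t) = phi**t, so tau_int = 1/2 + phi/(1-phi) = 9.5
rng, phi = np.random.default_rng(0), 0.9
z = np.zeros(50_000)
for i in range(1, len(z)):
    z[i] = phi * z[i - 1] + rng.normal()
print("tau_int estimate:", integrated_autocorrelation_time(z))
```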
The minimisation of cost functions is crucial in various optimisation fields. However, identifying their global minimum remains challenging owing to the huge computational cost incurred. This work analytically expresses the computational cost of identifying an approximate global minimum for a class of cost functions defined over a high-dimensional discrete state space. We then derive an optimal global search scheme that minimises the computational cost. Mathematical analyses demonstrate that a combination of the gradient descent algorithm and the selection and crossover algorithm, with a biased crossover weight, maximises the search efficiency. Remarkably, its computational cost is of square-root order compared with that of conventional gradient descent algorithms, indicating a quadratic speedup of global search. We corroborate this proposition using numerical analyses of the travelling salesman problem. The simple computational architecture and minimal computational cost of the proposed scheme are highly desirable for biological organisms and neuromorphic hardware.
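A toy version of such a hybrid scheme can be sketched as follows: greedy single-flip descent plays the role of gradient descent on a discrete state space, and offspring bits are drawn from the fitter parent with a biased weight. The weight 0.7 and the cost function below are arbitrary stand-ins of ours, whereas the paper derives the optimal bias analytically.

```python
import numpy as np

def biased_crossover_search(cost, n, pop=20, w=0.7, steps=200, seed=0):
    """Toy hybrid of greedy descent and selection/crossover on {0,1}^n.
    w > 0.5 biases each offspring bit toward the fitter parent, a stand-in
    for a biased crossover weight (here chosen arbitrarily)."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, (pop, n))
    for _ in range(steps):
        # greedy descent: accept a single-bit flip if it lowers the cost
        for x in X:
            i = rng.integers(n)
            y = x.copy(); y[i] ^= 1
            if cost(y) < cost(x):
                x[:] = y
        # selection and biased crossover
        f = np.array([cost(x) for x in X])
        a, b = X[np.argsort(f)[:2]]        # two fittest parents
        mask = rng.random(n) < w           # take each bit from fitter parent
        X[np.argmax(f)] = np.where(mask, a, b)   # replace worst individual
    f = np.array([cost(x) for x in X])
    return X[np.argmin(f)], f.min()

# toy rugged cost: random couplings induce many local minima
J = np.random.default_rng(1).normal(size=(30, 30))
best, val = biased_crossover_search(lambda s: s @ J @ s, 30)
```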
Real-world robotic tasks require complex reward functions. When we define the problem the robot needs to solve, we pretend that a designer specifies this complex reward exactly, and it is set in stone from then on. In practice, however, reward design is an iterative process: the designer chooses a reward, eventually encounters an "edge-case" environment where the reward incentivizes the wrong behavior, revises the reward, and repeats. What would it mean to rethink robotics problems to formally account for this iterative nature of reward design? We propose that the robot not take the specified reward for granted, but rather have uncertainty about it, and account for the future design iterations as future evidence. We contribute an Assisted Reward Design method that speeds up the design process by anticipating and influencing this future evidence: rather than letting the designer eventually encounter failure cases and revise the reward then, the method actively exposes the designer to such environments during the development phase. We test this method in a simplified autonomous driving task and find that it more quickly improves the car's behavior in held-out environments by proposing environments that are "edge cases" for the current reward.
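The following sketch illustrates the active-querying idea on a toy model: a posterior over candidate reward weights is updated from the designer's feedback, and the next environment proposed is the one where the hypotheses disagree most. The hypothesis set, acquisition score, and feedback model below are invented for illustration and differ from the paper's Assisted Reward Design method.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(50, 3))         # hypothesis set over true reward weights
post = np.ones(50) / 50              # uniform posterior over hypotheses
envs = rng.normal(size=(200, 3))     # candidate environments (feature vectors)
w_true = W[7]                        # the designer's latent true reward

for step in range(5):
    # acquisition: propose the environment where the hypotheses disagree
    # most about the reward (a stand-in for the paper's criterion)
    pred = W @ envs.T                # (hypotheses x envs) predicted rewards
    mean = np.average(pred, weights=post, axis=0)
    scores = np.average((pred - mean) ** 2, weights=post, axis=0)
    e = envs[np.argmax(scores)]
    # the designer "tests" the environment; we model the feedback as the
    # sign of the true reward there, observed with 10% error
    answer = np.sign(w_true @ e)
    likelihood = np.where(np.sign(W @ e) == answer, 0.9, 0.1)
    post = post * likelihood
    post /= post.sum()

print("posterior mass on the true hypothesis:", post[7])
```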
The Shannon-Hartley theorem accurately gives the channel capacity when the signal observation time is infinite. However, the finite-time capacity, which remains unknown, is essential for guiding the design of practical communication systems. In this paper, we investigate the capacity between two correlated Gaussian processes within a finite-time observation window. We first derive the finite-time capacity by providing a limit expression. Then we numerically compute the maximum transmission rate within a single finite-time window. We reveal that the number of bits transmitted per second within the finite-time window can exceed the classical Shannon capacity, which we call the Exceed-Shannon phenomenon. Furthermore, we derive a finite-time capacity formula under a typical signal autocorrelation case by utilizing the Mercer expansion of trace-class operators, and reveal the connection between the finite-time capacity problem and operator theory. Finally, we analytically prove the existence of the Exceed-Shannon phenomenon in this typical case, and demonstrate the achievability of the finite-time capacity and its compatibility with the classical Shannon capacity.
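To illustrate the Mercer-expansion mechanics, the sketch below discretizes an assumed exponential autocorrelation kernel on a finite window, eigendecomposes it, and water-fills power over the resulting eigenchannels. The kernel, noise level, and power budget are assumptions of ours; the paper's setting (capacity between two correlated Gaussian processes) and its limit expression differ from this generic eigenchannel computation.

```python
import numpy as np

T, m, P, sigma2 = 1.0, 400, 1.0, 0.05     # window length, grid, power, noise
t = np.linspace(0, T, m)
r = lambda tau: np.exp(-np.abs(tau))      # assumed signal autocorrelation
K = r(t[:, None] - t[None, :]) * (T / m)  # discretized kernel (Mercer operator)
lam = np.linalg.eigvalsh(K)[::-1]
lam = lam[lam > 1e-12]                    # keep numerically positive eigenvalues

def waterfill(lam, P, sigma2):
    """Water-filling over eigenchannels: C = sum log2(1 + p_k lam_k / sigma2)."""
    lo, hi = 0.0, P + sigma2 / lam.min()
    for _ in range(100):                  # bisection on the water level
        mu = (lo + hi) / 2
        p = np.maximum(mu - sigma2 / lam, 0.0)
        lo, hi = (mu, hi) if p.sum() < P else (lo, mu)
    return np.sum(np.log2(1 + p * lam / sigma2))

print("bits per window:", waterfill(lam, P, sigma2))
```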
Thul et al. (2020) called attention to problems that arise when chronometric experiments implementing specific factorial designs are analysed with the generalized additive mixed model (GAMM), using factor smooths to capture trial-to-trial dependencies. From a series of simulations incorporating such dependencies, they conclude that GAMMs are inappropriate for between-subject designs. They argue that, in addition, GAMMs come with too many modeling possibilities, and advise using the linear mixed model (LMM) instead. We address the questions raised by Thul et al. (2020), who clearly demonstrated that problems can indeed arise when using factor smooths in combination with factorial designs. We show that the problem does not arise when using by-smooths. Furthermore, we have traced a bug in the implementation of factor smooths in the mgcv package, which is fixed from version 1.8-36 onwards. To illustrate that GAMMs now produce correct estimates, we report simulation studies implementing different by-subject longitudinal effects. The maximal LMM emerges as slightly conservative compared to GAMMs, and GAMMs provide estimated coefficients that can be less variable across simulation runs. We also discuss two datasets where time-varying effects interact with numerical predictors in a theoretically informative way. Furthermore, we argue that the wide range of tools that GAMMs make available to researchers across all domains of scientific inquiry does not come with uncontrolled researcher degrees of freedom once confronted with a specific psycholinguistic dataset. We also introduce a distinction between replicable and non-replicable non-linear effects. We conclude that GAMMs are an excellent and reliable tool for understanding experimental data, including chronometric data with time-varying effects.
Several recent applications of optimal transport (OT) theory to machine learning have relied on regularization, notably entropy and the Sinkhorn algorithm. Because matrix-vector products are pervasive in the Sinkhorn algorithm, several works have proposed to \textit{approximate} kernel matrices appearing in its iterations using low-rank factors. Another route lies instead in imposing low-rank constraints on the feasible set of couplings considered in OT problems, with no approximation of either cost or kernel matrices. This route was first explored by Forrow et al., 2018, who proposed an algorithm tailored for the squared Euclidean ground cost, using a proxy objective that can be solved through the machinery of regularized 2-Wasserstein barycenters. Building on this, we introduce in this work a generic approach that aims at solving, in full generality, the OT problem under low-rank constraints with arbitrary costs. Our algorithm relies on an explicit factorization of low-rank couplings as a product of \textit{sub-coupling} factors linked by a common marginal; similar to an NMF approach, we alternately update these factors. We prove the non-asymptotic stationary convergence of this algorithm and illustrate its efficiency on benchmark experiments.
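For context, the baseline the abstract refers to is entropy-regularized OT solved by Sinkhorn iterations, each dominated by matrix-vector products with the full $n \times m$ kernel, exactly the cost that a low-rank coupling factorization is designed to avoid. The sketch below shows only this standard baseline, not the paper's low-rank algorithm.

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.1, iters=500):
    """Entropy-regularized OT via Sinkhorn iterations; each step performs
    matrix-vector products with the dense n x m kernel K."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # coupling P = diag(u) K diag(v)

# toy problem: two point clouds with uniform marginals
rng = np.random.default_rng(0)
x, y = rng.normal(size=(60, 2)), rng.normal(size=(80, 2)) + 1.0
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
P = sinkhorn(C, np.ones(60) / 60, np.ones(80) / 80)
print("marginal error:", abs(P.sum(1) - 1/60).max(), abs(P.sum(0) - 1/80).max())
```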