End-to-end backpropagation has a few shortcomings: it requires loading the entire model during training, which can be impossible in constrained settings, and it suffers from three locking problems (forward locking, update locking, and backward locking), which prohibit training the layers in parallel. Solving layer-wise optimization problems instead can address these issues and has been used in on-device training of neural networks. We develop a layer-wise training method, particularly well-adapted to ResNets, inspired by the minimizing movement scheme for gradient flows in distribution space. The method amounts to a kinetic energy regularization of each block that makes the blocks optimal transport maps and endows them with regularity. It works by alleviating the stagnation problem observed in layer-wise training, whereby greedily trained early layers overfit and deeper layers stop increasing test accuracy after a certain depth. We show on classification tasks that the test accuracy of block-wise trained ResNets is improved when using our method, whether the blocks are trained sequentially or in parallel.
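As a concrete illustration of such a block-wise objective, the minimal sketch below trains one toy residual block with an auxiliary classifier and a transport-cost penalty; the block structure, auxiliary head, and weight `lam` are illustrative stand-ins, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """One residual block, trained greedily with an auxiliary head."""
    def __init__(self, dim, n_classes):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.aux = nn.Linear(dim, n_classes)   # auxiliary classifier, illustrative

    def forward(self, x):
        return x + self.g(x)   # residual update: one explicit Euler step of a flow

def blockwise_loss(block, x, y, lam=0.1):
    # Kinetic-energy (transport-cost) penalty: how far the block moves its
    # input, ||f(x) - x||^2 = ||g(x)||^2, added to the usual task loss.
    out = block(x)
    task = F.cross_entropy(block.aux(out), y)
    kinetic = (out - x).pow(2).sum(dim=1).mean()
    return task + lam * kinetic

x, y = torch.randn(32, 16), torch.randint(0, 10, (32,))
print(blockwise_loss(Block(16, 10), x, y))
```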
Operating in the near vicinity of marine energy devices poses significant challenges for the control of underwater vehicles, predominantly due to large-magnitude wave disturbances that cause hazardous state perturbations. Approaches to tackle this problem have varied, but one promising solution is to adopt predictive control methods. Given the predictable nature of ocean waves, the potential exists to incorporate disturbance estimations directly within the plant model; this requires the inclusion of a wave predictor to provide online preview information. To this end, this paper presents a Nonlinear Model Predictive Controller with an integrated Deterministic Sea Wave Predictor for trajectory tracking of underwater vehicles. State information is obtained through an Extended Kalman Filter, forming a complete closed-loop strategy and facilitating online wave load estimations. The strategy is compared to a similar feed-forward disturbance mitigation scheme, showing mean improvements of 51% in positional error and 44.5% in attitude error. The preliminary results presented here provide strong evidence of the proposed method's potential to effectively mitigate disturbances, facilitating accurate tracking performance even under high wave loading.
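To make the preview idea concrete, here is a minimal receding-horizon sketch on a one-dimensional linear surrogate, with a known disturbance forecast standing in for the Deterministic Sea Wave Predictor; the dynamics, horizon, and weights are invented for illustration, and the EKF and nonlinear vehicle model are omitted.

```python
import numpy as np
import cvxpy as cp

# Toy surge model x_{t+1} = A x_t + B u_t + w_t, with a wave preview w_hat
# supplied by an (idealized) sea-wave predictor.
A, B, H = 0.95, 0.1, 20                      # dynamics and horizon (illustrative)
x0 = 1.0                                     # current positional error
w_hat = 0.05 * np.sin(0.4 * np.arange(H))    # stand-in for the predictor output

x = cp.Variable(H + 1)
u = cp.Variable(H)
cost = cp.sum_squares(x) + 0.01 * cp.sum_squares(u)   # drive the error to zero
cons = [x[0] == x0]
cons += [x[t + 1] == A * x[t] + B * u[t] + w_hat[t] for t in range(H)]
cp.Problem(cp.Minimize(cost), cons).solve()
print("first control move:", u.value[0])   # apply, then re-plan (receding horizon)
```

Because the predicted wave loads enter the constraints directly, the optimizer counteracts disturbances before they hit the vehicle, which is exactly the benefit the feed-forward baseline lacks.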
Low-density parity-check codes together with belief propagation (BP) decoding are known to be well-performing for large block lengths. However, for short block lengths there is still a considerable gap between the performance of the BP decoder and the maximum likelihood decoder. Different ensemble decoding schemes, such as the automorphism ensemble decoder (AED), can reduce this gap in the short block length regime. We propose a generalized AED (GAED) that uses automorphisms according to their definition in linear algebra: an automorphism of a vector space is a linear, bijective self-mapping, whereas coding theory commonly restricts itself to self-mappings that are scaled permutations. We show that the more general definition leads to an explicit joint construction of codes and automorphisms, and significantly enlarges the search space for automorphisms of existing linear codes. Furthermore, we provide a proof of concept that generalized automorphisms can indeed be used to improve decoding. Additionally, we propose a construction of parity-check codes with suitably designed automorphisms. Finally, we analyze the decoding performance of the GAED for some of our constructed codes.
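The core notion is easy to check numerically: a linear self-map T of F_2^n is a generalized automorphism of a code exactly when it is invertible over F_2 and maps codewords to codewords. The sketch below uses the cyclic [7,4] Hamming code; the matrices and the cyclic-shift example are illustrative, not taken from the paper.

```python
import numpy as np

# T is a generalized automorphism of the code with generator G and
# parity-check H iff T is bijective over F_2 and H T G^T = 0 (mod 2).
G = np.array([[1, 1, 0, 1, 0, 0, 0],
              [0, 1, 1, 0, 1, 0, 0],
              [0, 0, 1, 1, 0, 1, 0],
              [0, 0, 0, 1, 1, 0, 1]])   # cyclic [7,4] Hamming, g(x) = 1 + x + x^3
H = np.array([[1, 0, 1, 1, 1, 0, 0],
              [0, 1, 0, 1, 1, 1, 0],
              [0, 0, 1, 0, 1, 1, 1]])   # matching parity-check matrix

def is_generalized_automorphism(T, G, H):
    invertible = round(np.linalg.det(T)) % 2 == 1    # bijective over F_2
    preserves_code = not ((H @ T @ G.T) % 2).any()   # T maps the code into itself
    return invertible and preserves_code

# A cyclic shift is an ordinary (permutation) automorphism of this code; the
# GAED additionally searches over non-permutation invertible binary matrices.
T_shift = np.roll(np.eye(7, dtype=int), 1, axis=1)
print(is_generalized_automorphism(T_shift, G, H))   # True
```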
Gradient Descent (GD) is a powerful workhorse of modern machine learning thanks to its scalability and efficiency in high-dimensional spaces. Its ability to find local minimisers is only guaranteed for losses with Lipschitz gradients, where it can be seen as a `bona-fide' discretisation of an underlying gradient flow. Yet, many ML setups involving overparametrised models do not fall into this problem class, which has motivated research beyond the so-called ``Edge of Stability'' (EoS), where the step-size crosses the admissibility threshold that is inversely proportional to the Lipschitz constant above. Perhaps surprisingly, GD has been empirically observed to still converge regardless of local instability and oscillatory behavior. The incipient theoretical analysis of this phenomenon has mainly focused on the overparametrised regime, where the effect of choosing a large learning rate may be associated with a `Sharpness-Minimisation' implicit regularisation within the manifold of minimisers, under appropriate asymptotic limits. In contrast, in this work we directly examine the conditions for such unstable convergence, focusing on simple yet representative learning problems, via analysis of two-step gradient updates. Specifically, we characterize a local condition involving third-order derivatives that guarantees existence and convergence to fixed points of the two-step updates, and leverage this property in a teacher-student setting, under population loss. Finally, starting from Matrix Factorization, we provide observations of period-2 orbits of GD in high-dimensional settings, with intuition about their dynamics, along with an exploration of more general settings.
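The flavor of such unstable convergence can be reproduced in one dimension. The following toy example (ours, not the paper's setting) runs GD on f(x) = log cosh x with a step size above the classical threshold and lands on an attracting period-2 orbit.

```python
import numpy as np

# GD on f(x) = log cosh x, whose curvature at the minimum is f''(0) = 1,
# with a step size above the classical admissibility threshold 2/f''(0) = 2.
def grad(x):
    return np.tanh(x)

eta = 2.6                       # past the Edge of Stability for this loss
x = 1.5
for _ in range(200):
    x = x - eta * grad(x)
print(x)   # oscillates between +/- x*, where eta * tanh(x*) = 2 * x*

# The minimum x = 0 is locally unstable (|1 - eta * f''(0)| > 1), yet the
# two-step map is contracting on the orbit: (1 - eta * f''(x*))^2 < 1, so GD
# settles into a stable period-2 oscillation rather than diverging.
```

The stability of the two-step map hinges on how the curvature f'' varies away from the minimum, which is why third-order derivatives enter the analysis.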
Attention models are typically learned by optimizing one of three standard loss functions, variously called soft attention, hard attention, and latent variable marginal likelihood (LVML) attention. All three paradigms are motivated by the same goal of finding two models: a `focus' model that `selects' the right \textit{segment} of the input, and a `classification' model that processes the selected segment into the target label. However, they differ significantly in the way the selected segments are aggregated, resulting in distinct dynamics and final results. We observe a unique signature of models learned using these paradigms and explain it as a consequence of the evolution of the classification model under gradient descent when the focus model is fixed. We also analyze these paradigms in a simple setting and derive closed-form expressions for the parameter trajectory under gradient flow. With the soft attention loss, the focus model improves quickly at initialization but sputters later on; the hard attention loss behaves in the opposite fashion. Based on our observations, we propose a simple hybrid approach that combines the advantages of the different loss functions, and demonstrate it on a collection of semi-synthetic and real-world datasets.
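The three objectives differ only in where the aggregation over segments happens. A small sketch with invented tensors (segment features z, focus distribution p, linear classifier W) makes the contrast explicit.

```python
import torch
import torch.nn.functional as F

# n candidate segments with d-dim features, C classes; all values illustrative.
torch.manual_seed(0)
n, d, C = 4, 8, 3
z = torch.randn(n, d)                        # per-segment features
p = torch.softmax(torch.randn(n), dim=0)     # focus model's segment distribution
W = torch.randn(d, C, requires_grad=True)    # linear classification model
y = torch.tensor(1)                          # target label

# Soft attention: blend the segments first, then classify the aggregate.
soft_loss = F.cross_entropy((p @ z @ W).unsqueeze(0), y.unsqueeze(0))

# LVML attention: classify each segment, then marginalize over the focus.
per_seg = torch.log_softmax(z @ W, dim=1)[:, y]      # log P(y | segment i)
lvml_loss = -torch.logsumexp(per_seg + p.log(), dim=0)

# Hard attention replaces the exact marginal with Monte Carlo samples i ~ p
# and a score-function (REINFORCE-style) gradient for the focus model.
print(soft_loss.item(), lvml_loss.item())
```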
The growing demand for accurate control in varying and unknown environments has sparked a corresponding increase in the requirements for power supply components, including permanent magnet synchronous motors (PMSMs). To infer the unknown part of the system, machine learning techniques are widely employed, especially Gaussian process regression (GPR), owing to its flexibility in modeling continuous systems and its performance guarantees. For practical implementation, distributed GPR is adopted to alleviate the high computational complexity. However, the study of distributed GPR from a control perspective remains an open problem. In this paper, a control-aware optimal aggregation strategy of distributed GPR for PMSMs is proposed based on Lyapunov stability theory. The strategy exclusively leverages the posterior mean, thereby obviating the computationally intensive posterior-variance calculations required by alternative approaches. Moreover, its straightforward calculation lends itself to seamless implementation in high-frequency PMSM control. The effectiveness of the proposed strategy is demonstrated in simulations.
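The appeal of a posterior-mean-only rule is easy to see in code. The sketch below combines the means of M local GP experts with fixed weights standing in for the paper's Lyapunov-derived, control-aware weights; the kernel, data, and weights are illustrative.

```python
import numpy as np

# Each expert contributes only its posterior mean, so the per-expert variance
# (the costly ingredient of standard rules such as PoE/BCM) is never computed.
def rbf(A, B, ell=0.5):
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell**2)

def gp_mean(Xm, ym, xq, noise=1e-2):
    K = rbf(Xm, Xm) + noise * np.eye(len(Xm))
    return rbf(xq, Xm) @ np.linalg.solve(K, ym)

rng = np.random.default_rng(0)
f = np.sin                                           # "unknown" system part
experts = [rng.uniform(0, 3, 20) for _ in range(3)]  # M = 3 local datasets
xq = np.array([1.5])
means = [gp_mean(Xm, f(Xm), xq) for Xm in experts]

w = np.ones(3) / 3   # stand-in for the control-aware (Lyapunov-based) weights
print(float(w @ np.concatenate(means)), "vs true", f(1.5))
```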
Next-generation wireless communication systems impose much stricter requirements on transmission rate, latency, and reliability. The peak data rate of 6G networks should be no less than 1 Tb/s, comparable to existing long-haul optical transport networks. It is believed that using long error-correcting codes (ECC) with soft-decision decoding (SDD) is not feasible in this case due to the resulting high power consumption. On the other hand, ECC with hard-decision decoding (HDD) suffer from significant performance degradation. In this paper, we consider a concatenated solution consisting of an outer long HDD code and an inner short SDD code. The inner code is a crucial component of the system and the focus of our research. Due to its short length, it cannot correct all errors, but it is designed to minimize their number; such codes are known as error-reducing codes. We investigate the error-reducing properties of superposition codes. Initially, we explore sparse regression codes (SPARCs) with Gaussian signals. This approach outperforms the error-reducing binary LDPC codes optimized by Barakatain et al. (2018), but faces limitations in practical applicability due to high implementation complexity. Subsequently, we propose an LDPC-based superposition code scheme with low-complexity soft successive interference cancellation (SIC) decoding. This scheme demonstrates performance comparable to SPARCs while maintaining manageable complexity. Numerical results are given for inner codes with an overhead (OH) of 8.24% within a concatenated scheme (15% OH) with an outer hard-decision decoded staircase code (6.25% OH).
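The layered structure behind such schemes can be illustrated with a toy two-level superposition and hard SIC. The paper's scheme uses LDPC component codes and soft SIC, so the snippet below (uncoded BPSK layers, invented amplitudes) only conveys the decode-and-cancel principle.

```python
import numpy as np

# Two superimposed BPSK layers at amplitudes a1 > a2, decoded successively:
# detect the strong layer, subtract its reconstruction, detect the weak one.
rng = np.random.default_rng(1)
n, a1, a2, sigma = 10000, 1.0, 0.5, 0.3
b1, b2 = rng.integers(0, 2, n), rng.integers(0, 2, n)
x = a1 * (1 - 2 * b1) + a2 * (1 - 2 * b2)      # superimposed transmit signal
y = x + sigma * rng.standard_normal(n)          # AWGN channel

b1_hat = (y < 0).astype(int)        # layer 1, treating layer 2 as extra noise
y2 = y - a1 * (1 - 2 * b1_hat)      # cancel layer 1's estimated contribution
b2_hat = (y2 < 0).astype(int)       # layer 2 from the residual
print("BER layer 1:", (b1_hat != b1).mean(), "BER layer 2:", (b2_hat != b2).mean())
```

In the error-reducing setting, residual bit errors after SIC are acceptable as long as they stay within the correction capability of the outer staircase code.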
Quantization is a promising approach to reduce the high computational complexity of image super-resolution (SR) networks. However, compared to high-level tasks like image classification, low-bit quantization leads to severe accuracy loss in SR networks. This is because the feature distributions of SR networks diverge significantly across channels and input images, making it difficult to determine a quantization range. Existing SR quantization works approach this distribution mismatch problem by dynamically adapting quantization ranges to the varying distributions during test time. However, such dynamic adaptation incurs additional computational costs that limit the benefits of quantization. Instead, we propose a new quantization-aware training framework that effectively Overcomes the Distribution Mismatch problem in SR networks without the need for dynamic adaptation. Intuitively, the mismatch can be reduced by directly regularizing the feature variance during training. However, we observe that variance regularization can collide with the reconstruction loss during training and adversely impact SR accuracy. Thus, we avoid the conflict between the two losses by regularizing the variance only when the gradients of the variance regularization are cooperative with those of the reconstruction loss. Additionally, to further reduce the distribution mismatch, we introduce distribution offsets to layers with a significant mismatch, which either scale or shift channel-wise features. Our proposed algorithm, called ODM, effectively reduces the distribution mismatch with minimal computational overhead. Experimental results show that ODM outperforms existing SR quantization approaches at similar or lower computational cost, demonstrating the importance of reducing the distribution mismatch problem. Our code is available at //github.com/Cheeun/ODM.
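The gating idea can be written down compactly. The following sketch is our paraphrase, not the released ODM code: it applies the variance regularizer's gradient only on coordinates where it agrees in sign with the reconstruction gradient.

```python
import torch

# One "cooperative" update step, assuming rec_loss and var_loss were computed
# from the same forward pass over the shared parameters in `params`.
def cooperative_step(params, rec_loss, var_loss, lr=1e-4):
    g_rec = torch.autograd.grad(rec_loss, params, retain_graph=True)
    g_var = torch.autograd.grad(var_loss, params)
    with torch.no_grad():
        for p, gr, gv in zip(params, g_rec, g_var):
            mask = (gr * gv >= 0).to(gv.dtype)   # element-wise agreement test
            p -= lr * (gr + mask * gv)           # drop conflicting components

# Example usage (names hypothetical):
#   params   = [p for p in model.parameters() if p.requires_grad]
#   var_loss = feat.var(dim=(0, 2, 3)).mean()   # channel-wise variance, NCHW
```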
The key challenge of image manipulation detection is how to learn features that are generalizable, i.e., sensitive to manipulations in novel data, yet specific enough to prevent false alarms on authentic images. Current research emphasizes the sensitivity, while the specificity is overlooked. In this paper we address both aspects by multi-view feature learning and multi-scale supervision. By exploiting the noise distribution and boundary artifacts surrounding tampered regions, the former aims to learn semantic-agnostic and thus more generalizable features. The latter allows us to learn from authentic images, which are nontrivial to take into account for current methods based on semantic segmentation networks. We realize these ideas in a new network, which we term MVSS-Net. Extensive experiments on five benchmark sets justify the viability of MVSS-Net for both pixel-level and image-level manipulation detection.
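As a rough illustration of the noise view, a fixed high-pass residual filter can stand in for a learned noise-sensitive branch; MVSS-Net's actual branches, fusion, and multi-scale supervision follow the paper, and the kernel below is just a common Laplacian-style choice.

```python
import torch
import torch.nn.functional as F

# Suppress image semantics and keep local residuals, which is where
# tampering traces (noise inconsistencies) tend to live.
kernel = torch.tensor([[-1., 2., -1.],
                       [ 2., -4., 2.],
                       [-1., 2., -1.]]) / 4.0
weight = kernel.repeat(3, 1, 1, 1)          # one filter per RGB channel
img = torch.rand(1, 3, 64, 64)              # placeholder input image
noise_view = F.conv2d(img, weight, padding=1, groups=3)
print(noise_view.shape)   # (1, 3, 64, 64): semantic-agnostic residual features
# A pixel-level prediction map can then be reduced to an image-level score,
# e.g., by global max pooling, so authentic images also provide supervision.
```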
We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using heavy data-augmentation and optionally distillation, it attains surprisingly good accuracy/complexity trade-offs on ImageNet. We will share our code based on the Timm library and pre-trained models.
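A block matching this description is short to write down. The sketch below follows the two alternating sublayers as described; the Affine normalization and the expansion factor are our reading of the design, not the released code.

```python
import torch
import torch.nn as nn

class Affine(nn.Module):
    """Element-wise rescale-and-shift used in place of normalization."""
    def __init__(self, dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))
    def forward(self, x):
        return self.alpha * x + self.beta

class ResMLPBlock(nn.Module):
    def __init__(self, n_patches, dim):
        super().__init__()
        self.norm1, self.norm2 = Affine(dim), Affine(dim)
        self.cross_patch = nn.Linear(n_patches, n_patches)   # (i) patches interact
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))    # (ii) channels interact
    def forward(self, x):   # x: (batch, n_patches, dim)
        x = x + self.cross_patch(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.mlp(self.norm2(x))

x = torch.randn(2, 196, 384)   # e.g., 14x14 patches of width 384
print(ResMLPBlock(196, 384)(x).shape)
```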
Retrieving object instances among cluttered scenes efficiently requires compact yet comprehensive regional image representations. Intuitively, object semantics can help build the index that focuses on the most relevant regions. However, due to the lack of bounding-box datasets for objects of interest among retrieval benchmarks, most recent work on regional representations has focused on either uniform or class-agnostic region selection. In this paper, we first fill the void by providing a new dataset of landmark bounding boxes, based on the Google Landmarks dataset, that includes $94k$ images with manually curated boxes from $15k$ unique landmarks. Then, we demonstrate how a trained landmark detector, using our new dataset, can be leveraged to index image regions and improve retrieval accuracy while being much more efficient than existing regional methods. In addition, we further introduce a novel regional aggregated selective match kernel (R-ASMK) to effectively combine information from detected regions into an improved holistic image representation. R-ASMK boosts image retrieval accuracy substantially at no additional memory cost, while even outperforming systems that index image regions independently. Our complete image retrieval system improves upon the previous state-of-the-art by significant margins on the Revisited Oxford and Paris datasets. Code and data will be released.
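For intuition about the kernel itself, here is a schematic ASMK-style scorer over descriptors already assigned to visual words; the selectivity exponent, threshold, and data layout are illustrative, and the region-level aggregation that turns this into R-ASMK is omitted.

```python
import numpy as np

# Per visual word: sum-aggregate residuals, L2-normalize, and compare the
# query and database vectors through a selectivity function.
def aggregate(desc_by_word, centroids):
    agg = {}
    for w, D in desc_by_word.items():
        r = (np.asarray(D) - centroids[w]).sum(axis=0)   # aggregate residuals
        agg[w] = r / (np.linalg.norm(r) + 1e-12)
    return agg

def selectivity(u, alpha=3.0, tau=0.0):
    # suppress weak, unreliable word-level similarities
    return np.sign(u) * abs(u) ** alpha if u > tau else 0.0

def score(agg_q, agg_db):
    common = agg_q.keys() & agg_db.keys()
    return sum(selectivity(float(agg_q[w] @ agg_db[w])) for w in common)

rng = np.random.default_rng(0)
cent = {0: rng.normal(size=4), 1: rng.normal(size=4)}
q = aggregate({0: rng.normal(size=(3, 4))}, cent)
db = aggregate({0: rng.normal(size=(5, 4)), 1: rng.normal(size=(2, 4))}, cent)
print(score(q, db))
```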