An $(n,k,\ell)$ array code has $k$ information coordinates and $r = n-k$ parity coordinates, where each coordinate is a vector in $\mathbb{F}_q^{\ell}$ for some field $\mathbb{F}_q$. An $(n,k,\ell)$ MDS array code has the additional property that any $k$ out of $n$ coordinates suffice to recover the whole codeword. Dimakis et al. considered the problem of repairing the erasure of a single coordinate and proved a lower bound on the amount of data transmission that is needed for the repair. A minimum storage regenerating (MSR) array code with repair degree $d$ is an MDS array code that achieves this lower bound for the repair of any single erased coordinate from any $d$ out of $n-1$ remaining coordinates. An MSR code has the optimal access property if the amount of accessed data is the same as the amount of transmitted data in the repair procedure. The sub-packetization $\ell$ and the field size $q$ are of paramount importance in MSR array code constructions. For optimal-access MSR codes, Balaji et al. proved that $\ell\geq s^{\left\lceil n/s \right\rceil}$, where $s = d-k+1$. Rawat et al. showed that this lower bound is attainable for all admissible values of $d$ when the field size is exponential in $n$. Since then, considerable effort has been devoted to reducing the field size. However, until now, a reduction to linear field size has been available only for $d\in\{k+1,k+2,k+3\}$ and $d=n-1$. In this paper, we construct optimal-access MSR codes with linear field size and smallest sub-packetization $\ell = s^{\left\lceil n/s \right\rceil}$ for all $d$ between $k+1$ and $n-1$. We also construct another class of MSR codes that are not optimal-access but have even smaller sub-packetization $s^{\left\lceil n/(s+1)\right\rceil }$. The second class also has linear field size and works for all admissible values of $d$.
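To make the two sub-packetization levels concrete, the following minimal sketch evaluates $s^{\left\lceil n/s \right\rceil}$ and $s^{\left\lceil n/(s+1)\right\rceil}$ with $s = d-k+1$ for given parameters. The function names and the range check $k+1 \leq d \leq n-1$ are illustrative assumptions, not part of the construction itself.

```python
def _ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive b (avoids floating point)."""
    return -(-a // b)


def subpacketization(n: int, k: int, d: int) -> dict:
    """Evaluate the two sub-packetization formulas quoted in the abstract.

    Assumption: d is restricted to k + 1 <= d <= n - 1, the range stated
    for the optimal-access construction.
    """
    if not (k + 1 <= d <= n - 1):
        raise ValueError("expected k + 1 <= d <= n - 1")
    s = d - k + 1
    return {
        "s": s,
        # Optimal-access class: l = s^ceil(n / s).
        "optimal_access": s ** _ceil_div(n, s),
        # Second class (not optimal-access): l = s^ceil(n / (s + 1)).
        "smaller": s ** _ceil_div(n, s + 1),
    }


if __name__ == "__main__":
    # Hypothetical parameters chosen only for illustration.
    print(subpacketization(n=14, k=10, d=12))
```

For the hypothetical parameters $n=14$, $k=10$, $d=12$ we get $s=3$, so the first formula gives $3^5 = 243$ and the second gives $3^4 = 81$.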
A code of length $n$ is said to be (combinatorially) $(\rho,L)$-list decodable if the Hamming ball of radius $\rho n$ around any vector in the ambient space does not contain more than $L$ codewords. We study a recently introduced class of higher order MDS codes, which are closely related (via duality) to codes that achieve a generalized Singleton bound for list decodability. For some $\ell\geq 1$, higher order MDS codes of length $n$, dimension $k$, and order $\ell$ are denoted as $(n,k)$-MDS($\ell$) codes. We present a number of results on the structure of these codes, identifying the `extend-ability' of their parameters in various scenarios. Specifically, for some parameter regimes, we identify conditions under which $(n_1,k_1)$-MDS($\ell_1$) codes can be obtained from $(n_2,k_2)$-MDS($\ell_2$) codes, via various techniques. We believe that these results will aid in efficient constructions of higher order MDS codes. We also obtain a new field size upper bound for the existence of such codes, which arguably improves over the best known existing bound, in some parameter regimes.
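The combinatorial definition of $(\rho,L)$-list decodability can be checked directly on very small codes. The sketch below is a brute-force illustration only: it takes the integer radius (playing the role of $\rho n$) as input, enumerates every center in the ambient space, and reports the largest number of codewords in any Hamming ball. The function names and the alphabet $\{0,\dots,q-1\}$ are assumptions made for this example.

```python
from itertools import product


def hamming_distance(u, v):
    """Number of positions in which u and v differ."""
    return sum(a != b for a, b in zip(u, v))


def max_list_size(code, q, radius):
    """Largest number of codewords in any Hamming ball of the given radius.

    code   : list of equal-length tuples over {0, ..., q-1}
    radius : integer radius, standing in for rho * n
    The search is exponential in the length, so it only suits toy codes.
    """
    n = len(code[0])
    return max(
        sum(1 for c in code if hamming_distance(center, c) <= radius)
        for center in product(range(q), repeat=n)
    )


def is_list_decodable(code, q, radius, L):
    """True if no ball of the given radius holds more than L codewords."""
    return max_list_size(code, q, radius) <= L


if __name__ == "__main__":
    repetition = [(0, 0, 0), (1, 1, 1)]
    print(max_list_size(repetition, q=2, radius=1))
    print(max_list_size(repetition, q=2, radius=2))
```

For the binary repetition code of length 3, balls of radius 1 contain at most one codeword, while a ball of radius 2 around a vector such as $(0,0,1)$ contains both.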
We consider the multi-access coded caching problem, which consists of a central server with $N$ files, $K$ caches with $M$ units of memory each, and $K$ users, each of which is connected to $L (\geq 1)$ consecutive caches, with a cyclic wrap-around. Caches are populated with content related to the files, and each user then requests a file that has to be served via a broadcast message from the central server with the help of the caches. We aim to design placement and delivery policies for this setup that minimize the central server's transmission rate while satisfying an additional linear sub-packetization constraint. We propose policies that satisfy this constraint and derive upper bounds on the achieved server transmission rate, which, upon comparison with the literature, establish the improvement provided by our results. To derive our results, we map the multi-access coded caching problem to variants of the well-known index coding problem. In this process, we also derive new bounds on the optimal transmission size for a `structured' index coding problem, which might be of independent interest.
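The cyclic connectivity between users and caches can be written down in a few lines. In the sketch below, users and caches are indexed $0,\dots,K-1$ and user $u$ is assumed to access caches $u, u+1, \dots, u+L-1$ modulo $K$; this indexing convention and the function names are assumptions chosen for illustration, since only "consecutive caches with a cyclic wrap-around" is specified.

```python
def caches_for_user(user: int, K: int, L: int) -> list:
    """Caches accessed by a user: L consecutive indices with wrap-around.

    Assumed convention: user u starts at cache u.
    """
    if not (1 <= L <= K):
        raise ValueError("expected 1 <= L <= K")
    return [(user + j) % K for j in range(L)]


def users_for_cache(cache: int, K: int, L: int) -> list:
    """Users that can access a given cache under the same convention."""
    if not (1 <= L <= K):
        raise ValueError("expected 1 <= L <= K")
    return sorted((cache - j) % K for j in range(L))


if __name__ == "__main__":
    K, L = 5, 2
    for u in range(K):
        print(u, caches_for_user(u, K, L))
```

With $K=5$ and $L=2$, the last user wraps around and accesses caches 4 and 0, and every cache is shared by exactly $L$ users.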
Statistical models typically capture uncertainties in our knowledge of the corresponding real-world processes; however, it is less common for this uncertainty specification to capture uncertainty surrounding the values of the inputs to the model, which are often assumed known. We develop a general modelling methodology with uncertain inputs in the context of the Bayes linear paradigm, which involves adjustment of second-order belief specifications over all quantities of interest only, without the requirement for probabilistic specifications. In particular, we propose an extension of commonly employed second-order modelling assumptions to the case of uncertain inputs, with explicit implementation in the context of regression analysis, stochastic process modelling, and statistical emulation. We apply the methodology to a regression model for extracting aluminium by electrolysis, and to emulation of the motivating epidemiological simulator chain to model the impact of an airborne infectious disease.
The trade algorithm, which includes the curveball and fastball implementations, is the state of the art for uniformly sampling r x c binary matrices with fixed row and column sums. The mixing time of the trade algorithm is currently unknown, although 5r is currently used as a heuristic. We propose a distribution-based approach to estimating the mixing time that can also return a sample of matrices that are nearly guaranteed to be uniformly randomly sampled. In numerical experiments on matrices that vary by size, fill, and row and column sum distributions, we find that the upper bound on mixing time is at least 10r, and that it increases as a function of both c and the fraction of cells containing a 1.
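As an illustration of what a single trade does, the sketch below implements a curveball-style trade as it is commonly described: two rows are chosen at random, the columns that hold a 1 in exactly one of them are pooled and shuffled, and each row receives back as many of those columns as it contributed. This is a minimal sketch under that assumption, not the implementation studied here; the function names and the set-of-columns representation are hypothetical.

```python
import random
from collections import Counter


def curveball_trade(rows, rng):
    """Apply one trade in place; rows is a list of sets of column indices."""
    i, j = rng.sample(range(len(rows)), 2)
    shared = rows[i] & rows[j]
    only_i = rows[i] - shared
    only_j = rows[j] - shared
    # Pool the columns unique to either row and redistribute them at random,
    # keeping the number of columns each row holds unchanged.
    pool = sorted(only_i | only_j)
    rng.shuffle(pool)
    rows[i] = shared | set(pool[:len(only_i)])
    rows[j] = shared | set(pool[len(only_i):])


def run_trades(rows, n_trades, seed=None):
    """Return a copy of the matrix after n_trades trades (needs >= 2 rows)."""
    rng = random.Random(seed)
    rows = [set(r) for r in rows]
    for _ in range(n_trades):
        curveball_trade(rows, rng)
    return rows


def row_sums(rows):
    return [len(r) for r in rows]


def col_sums(rows):
    return Counter(c for r in rows for c in r)


if __name__ == "__main__":
    start = [{0, 1, 2}, {1, 3}, {0, 4}, {2, 3, 4}]
    # 5r trades, the heuristic mentioned in the text (not a proven bound).
    out = run_trades(start, 5 * len(start), seed=1)
    print(row_sums(out) == row_sums(start), col_sums(out) == col_sums(start))
```

Each trade leaves every row sum and every column sum unchanged, which is why the chain stays inside the set of matrices with the prescribed margins; how many trades are needed before the output is close to uniform is exactly the mixing-time question.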
We present a new class of private information retrieval (PIR) schemes that keep the identity of the file requested private in the presence of at most $t$ colluding servers, based on the recent framework developed for such $t$-PIR schemes using star products of transitive codes. These $t$-PIR schemes employ the class of Berman codes as the storage-retrieval code pairs. Berman codes, which are binary linear codes of length $n^m$ for any integers $n\geq 2$ and $m\geq 1$, were recently shown to achieve the capacity of the binary erasure channel. We provide a complete characterization of the star products of the Berman code pairs, enabling us to calculate the PIR rate of the star product-based schemes that employ these codes. The schemes we present have flexibility in the number of servers, the PIR rate, the storage rate, and the collusion parameter $t$, owing to the numerous codes available in the class of Berman codes.
Instruction tuning has been shown to improve the cross-task generalization of language models. However, it is still challenging for language models to complete the target tasks by following the instructions, as the instructions are general and lack intermediate steps. To address this problem, we propose incorporating step-by-step instructions, which provide detailed and specific procedures for completing the target tasks, to help language models decompose the tasks. The step-by-step instructions are obtained automatically by prompting ChatGPT and are then combined with the original instructions to tune language models. Extensive experiments on SUP-NATINST show that high-quality step-by-step instructions can improve cross-task generalization across different model sizes. Moreover, further analysis indicates that the order of the steps in the step-by-step instructions is important for the improvement. To facilitate future research, we release the step-by-step instructions and their human quality evaluation results.
Motivated by applications in distributed storage, distributed computing, and homomorphic secret sharing, we study communication-efficient schemes for computing linear combinations of coded symbols. Specifically, we design low-bandwidth schemes that evaluate the weighted sum of $\ell$ coded symbols in a codeword $\pmb{c}\in\mathbb{F}^n$, when we are given access to $d$ of the remaining components in $\pmb{c}$. Formally, suppose that $\mathbb{F}$ is a field extension of $\mathbb{B}$ of degree $t$. Let $\pmb{c}$ be a codeword in a Reed-Solomon code of dimension $k$; our task is to compute the weighted sum of $\ell$ coded symbols. In this paper, for some $s<t$, we provide an explicit scheme that performs this task by downloading $d(t-s)$ sub-symbols in $\mathbb{B}$ from $d$ available nodes, whenever $d\geq \ell|\mathbb{B}|^s-\ell+k$. In many cases, our scheme outperforms previous schemes in the literature. Furthermore, we provide a characterization of evaluation schemes for general linear codes. Then, in the special case of Reed-Solomon codes, we use this characterization to derive a lower bound for the evaluation bandwidth.
Recently, $(\beta,\gamma)$-Chebyshev functions, as well as the corresponding zeros, have been introduced as a generalization of classical Chebyshev polynomials of the first kind and related roots. They consist of a family of orthogonal functions on a subset of $[-1,1]$, which indeed satisfies a three-term recurrence formula. In this paper we present further properties, which are proven to comply with various results about classical orthogonal polynomials. In addition, we prove a conjecture concerning the Lebesgue constant's behavior related to the roots of $(\beta,\gamma)$-Chebyshev functions in the corresponding orthogonality interval.
We study the query version of the approximate heavy hitter and quantile problems. In the former problem, the input is a parameter $\varepsilon$ and a set $P$ of $n$ points in $\mathbb{R}^d$ where each point is assigned a color from a set $C$, and we want to build a structure s.t. given any geometric range $\gamma$, we can efficiently find a list of approximate heavy hitters in $\gamma\cap P$, i.e., colors that appear at least $\varepsilon |\gamma \cap P|$ times in $\gamma \cap P$, as well as their frequencies with an additive error of $\varepsilon |\gamma \cap P|$. In the latter problem, each point is assigned a weight from a totally ordered universe and the query must output a sequence $S$ of $1+1/\varepsilon$ weights s.t. the $i$-th weight in $S$ has approximate rank $i\varepsilon|\gamma\cap P|$, meaning, rank $i\varepsilon|\gamma\cap P|$ up to an additive error of $\varepsilon|\gamma\cap P|$. Previously, optimal results were only known in 1D [WY11] but a few sub-optimal methods were available in higher dimensions [AW17, ACH+12]. We study the problems for 3D halfspace and dominance queries. We consider the real RAM model with integer registers of size $w=\Theta(\log n)$ bits. For dominance queries, we show optimal solutions for both heavy hitter and quantile problems: using linear space, we can answer both queries in time $O(\log n + 1/\varepsilon)$. Note that as the output size is $\frac{1}{\varepsilon}$, after investing the initial $O(\log n)$ searching time, our structure takes on average $O(1)$ time to find a heavy hitter or a quantile! For more general halfspace heavy hitter queries, the same optimal query time can be achieved by increasing the space by an extra $\log_w\frac{1}{\varepsilon}$ (resp. $\log\log_w\frac{1}{\varepsilon}$) factor in 3D (resp. 2D). By spending extra $\log^{O(1)}\frac{1}{\varepsilon}$ factors in time and space, we can also support quantile queries.
A multifold $1$-perfect code ($1$-perfect code for list decoding) in any graph is a set $C$ of vertices such that every vertex of the graph is at distance not more than $1$ from exactly $\mu$ elements of $C$. In $q$-ary Hamming graphs, where $q$ is a prime power, we characterise all parameters of multifold $1$-perfect codes and all parameters of additive multifold $1$-perfect codes. In particular, we show that additive multifold $1$-perfect codes are related to special multiset generalizations of spreads, multispreads, and that multispreads of parameters corresponding to multifold $1$-perfect codes always exist. Keywords: perfect codes, multifold packing, multiple covering, list-decoding codes, additive codes, spreads, multispreads, completely regular codes, intriguing sets.
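The defining property of a multifold $1$-perfect code can be verified by brute force in small Hamming graphs. The sketch below counts, for every vertex of the $q$-ary Hamming graph of length $n$, the codewords at distance at most $1$, and returns the common value $\mu$ if it is the same for all vertices. The function names and the vertex representation as tuples over $\{0,\dots,q-1\}$ are assumptions made for this example.

```python
from itertools import product


def codewords_within_one(vertex, code):
    """Number of codewords at Hamming distance at most 1 from vertex."""
    return sum(
        1 for c in code
        if sum(a != b for a, b in zip(vertex, c)) <= 1
    )


def multifold_multiplicity(code, n, q):
    """Return mu if code is a mu-fold 1-perfect code in H(n, q), else None.

    code is a collection of distinct length-n tuples over {0, ..., q-1}.
    The check enumerates all q**n vertices, so it only suits small cases.
    """
    counts = {
        codewords_within_one(v, code)
        for v in product(range(q), repeat=n)
    }
    return counts.pop() if len(counts) == 1 else None


if __name__ == "__main__":
    repetition = {(0, 0, 0), (1, 1, 1)}
    two_fold = repetition | {(0, 0, 1), (1, 1, 0)}
    even_weight = {(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)}
    print(multifold_multiplicity(repetition, n=3, q=2))
    print(multifold_multiplicity(two_fold, n=3, q=2))
    print(multifold_multiplicity(even_weight, n=3, q=2))
```

In the binary Hamming graph of length 3, the repetition code is $1$-perfect in the usual sense ($\mu = 1$), the union of two disjoint translates of it has $\mu = 2$, and the even-weight code is not multifold $1$-perfect because codewords see one codeword in their ball while odd-weight vertices see three.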