Private information retrieval (PIR) schemes (with or without colluding servers) have been proposed for realistic coded distributed data storage systems. Star product PIR schemes with colluding servers for general coded distributed storage system were constructed over general finite fields by R. Freij-Hollanti, O. W. Gnilke, C. Hollanti and A. Karpuk in 2017. These star product PIR schemes with colluding servers are suitable for the storage of files over small fields and can be constructed for coded distributed storage system with large number of servers. In this paper for an efficient storage code, the problem to find good retrieval codes is considered. In general if the storage code is a binary Reed-Muller code the retrieval code needs not to be a binary Reed-Muller code in general. It is proved that when the storage code contains some special codewords, nonzero retrieval rate star product PIR schemes with colluding servers can only protect against small number of colluding servers. We also give examples to show that when the storage code is a good cyclic code, the best choice of the retrieval code is not cyclic in general. Therefore in the design of star product PIR schemes with colluding servers, the scheme with the storage code and the retrieval code in the same family of algebraic codes is not always efficient.
In this paper, we propose a new covering technique localized for the trajectories of SGD. This localization provides an algorithm-specific complexity measured by the covering number, which can have dimension-independent cardinality in contrast to standard uniform covering arguments that result in exponential dimension dependency. Based on this localized construction, we show that if the objective function is a finite perturbation of a piecewise strongly convex and smooth function with $P$ pieces, i.e. non-convex and non-smooth in general, the generalization error can be upper bounded by $O(\sqrt{(\log n\log(nP))/n})$, where $n$ is the number of data samples. In particular, this rate is independent of dimension and does not require early stopping and decaying step size. Finally, we employ these results in various contexts and derive generalization bounds for multi-index linear models, multi-class support vector machines, and $K$-means clustering for both hard and soft label setups, improving the known state-of-the-art rates.
Investigators are increasingly using novel methods for extending (generalizing or transporting) causal inferences from a trial to a target population. In many generalizability and transportability analyses, the trial and the observational data from the target population are separately sampled, following a non-nested trial design. In practical implementations of this design, non-randomized individuals from the target population are often identified by conditioning on the use of a particular treatment, while individuals who used other candidate treatments for the same indication or individuals who did not use any treatment are excluded. In this paper, we argue that conditioning on treatment in the target population changes the estimand of generalizability and transportability analyses and potentially introduces serious bias in the estimation of causal estimands in the target population or the subset of the target population using a specific treatment. Furthermore, we argue that the naive application of marginalization-based or weighting-based standardization methods does not produce estimates of any reasonable causal estimand. We use causal graphs and counterfactual arguments to characterize the identification problems induced by conditioning on treatment in the target population and illustrate the problems using simulated data. We conclude by considering the implications of our findings for applied work.
A coreset for a set of points is a small subset of weighted points that approximately preserves important properties of the original set. Specifically, if $P$ is a set of points, $Q$ is a set of queries, and $f:P\times Q\to\mathbb{R}$ is a cost function, then a set $S\subseteq P$ with weights $w:P\to[0,\infty)$ is an $\epsilon$-coreset for some parameter $\epsilon>0$ if $\sum_{s\in S}w(s)f(s,q)$ is a $(1+\epsilon)$ multiplicative approximation to $\sum_{p\in P}f(p,q)$ for all $q\in Q$. Coresets are used to solve fundamental problems in machine learning under various big data models of computation. Many of the suggested coresets in the recent decade used, or could have used a general framework for constructing coresets whose size depends quadratically on what is known as total sensitivity $t$. In this paper we improve this bound from $O(t^2)$ to $O(t\log t)$. Thus our results imply more space efficient solutions to a number of problems, including projective clustering, $k$-line clustering, and subspace approximation. Moreover, we generalize the notion of sensitivity sampling for sup-sampling that supports non-multiplicative approximations, negative cost functions and more. The main technical result is a generic reduction to the sample complexity of learning a class of functions with bounded VC dimension. We show that obtaining an $(\nu,\alpha)$-sample for this class of functions with appropriate parameters $\nu$ and $\alpha$ suffices to achieve space efficient $\epsilon$-coresets. Our result implies more efficient coreset constructions for a number of interesting problems in machine learning; we show applications to $k$-median/$k$-means, $k$-line clustering, $j$-subspace approximation, and the integer $(j,k)$-projective clustering problem.
We study two problems of private matrix multiplication, over a distributed computing system consisting of a master node, and multiple servers who collectively store a family of public matrices using Maximum-Distance-Separable (MDS) codes. In the first problem of Private and Secure Matrix Multiplication from Colluding servers (MDS-C-PSMM), the master intends to compute the product of its confidential matrix $\mathbf{A}$ with a target matrix stored on the servers, without revealing any information about $\mathbf{A}$ and the index of target matrix to some colluding servers. In the second problem of Fully Private Matrix Multiplication from Colluding servers (MDS-C-FPMM), the matrix $\mathbf{A}$ is also selected from another family of public matrices stored at the servers in MDS form. In this case, the indices of the two target matrices should both be kept private from colluding servers. We develop novel strategies for MDS-C-PSMM and MDS-C-FPMM, which simultaneously guarantee information-theoretic data/index privacy and computation correctness. The key ingredient is a careful design of secret sharings of the matrix $\mathbf{A}$ and the private indices, which are tailored to matrix multiplication task and MDS storage structure, such that the computation results from the servers can be viewed as evaluations of a polynomial at distinct points, from which the intended result can be obtained through polynomial interpolation. We compare the proposed MDS-C-PSMM strategy with a previous MDS-PSMM strategy with a weaker privacy guarantee (non-colluding servers), and demonstrate substantial improvements over the previous strategy in terms of communication and computation performance.
Various cryptographic techniques are used in outsourced database systems to ensure data privacy while allowing for efficient querying. This work proposes a definition and components of a new secure and efficient outsourced database system, which answers various types of queries, with different privacy guarantees in different security models. This work starts with the survey of five order-revealing encryption schemes that can be used directly in many database indices and five range query protocols with various security / efficiency tradeoffs. The survey systematizes the state-of-the-art range query solutions in a snapshot adversary setting and offers some non-obvious observations regarding the efficiency of the constructions. In $\mathcal{E}\text{psolute}$, a secure range query engine, security is achieved in a setting with a much stronger adversary where she can continuously observe everything on the server, and leaking even the result size can enable a reconstruction attack. $\mathcal{E}\text{psolute}$ proposes a definition, construction, analysis, and experimental evaluation of a system that provably hides both access pattern and communication volume while remaining efficient. The work concludes with $k\text{-a}n\text{o}n$ -- a secure similarity search engine in a snapshot adversary model. The work presents a construction in which the security of $k\text{NN}$ queries is achieved similarly to OPE / ORE solutions -- encrypting the input with an approximate Distance Comparison Preserving Encryption scheme so that the inputs, the points in a hyperspace, are perturbed, but the query algorithm still produces accurate results. We use TREC datasets and queries for the search, and track the rank quality metrics such as MRR and nDCG. For the attacks, we build an LSTM model that trains on the correlation between a sentence and its embedding and then predicts words from the embedding.
Phase field models are promising to tackle various fracture problems where a diffusive crack is introduced and modelled using the phase variable. Owing to the non-convexity of the energy functional, the derived partial differential equations are usually solved in a staggered manner. However, this method suffers from a low convergence rate, and a large number of staggered iterations are needed, especially at the fracture nucleation and propagation. In this study, we propose novel staggered schemes, which are inspired by the fixed-stress split scheme in poromechanics. By fixing the stress when solving the damage evolution, the displacement increment is expressed in terms of the increment of the phase variable. The relation between these two increments enables a prediction of the displacement and the active energy based on the increment of the phase variable. Thus, the maximum number of staggered iterations is reduced, and the computational efficiency is improved. We present three staggered schemes by fixing the first invariant, second invariant, or both invariants of the stress, denoted by S1, S2, and S3 schemes. The performance of the schemes is then verified by comparing with the standard staggered scheme through three benchmark examples, i.e., tensile, shear, and L-shape panel tests. The results exhibit that the force-displacement relations and the crack patterns computed using the fast schemes are consistent with the ones based on the standard staggered scheme. Moreover, the proposed S1 and S2 schemes can largely reduce the maximum number of staggered iterations and total CPU time in all benchmark tests. The S2 scheme performs comparably except in the shear test, where the underlying assumption is violated in the region close to the crack.
A goodness-of-fit test for one-parameter count distributions with finite second moment is proposed. The test statistic is derived from the $L_1$-distance of a function of the probability generating function of the model under the null hypothesis and that of the random variable actually generating data, when the latter belongs to a suitable wide class of alternatives. The test statistic has a rather simple form and it is asymptotically normally distributed under the null hypothesis, allowing a straightforward implementation of the test. Moreover, the test is consistent for alternative distributions belonging to the class, but also for all the alternative distributions whose probability of zero is different from that under the null hypothesis. Thus, the use of the test is proposed and investigated also for alternatives not in the class. The finite-sample properties of the test are assessed by means of an extensive simulation study.
In the real-world question answering scenarios, hybrid form combining both tabular and textual contents has attracted more and more attention, among which numerical reasoning problem is one of the most typical and challenging problems. Existing methods usually adopt encoder-decoder framework to represent hybrid contents and generate answers. However, it can not capture the rich relationship among numerical value, table schema, and text information on the encoder side. The decoder uses a simple predefined operator classifier which is not flexible enough to handle numerical reasoning processes with diverse expressions. To address these problems, this paper proposes a \textbf{Re}lational \textbf{G}raph enhanced \textbf{H}ybrid table-text \textbf{N}umerical reasoning model with \textbf{T}ree decoder (\textbf{RegHNT}). It models the numerical question answering over table-text hybrid contents as an expression tree generation task. Moreover, we propose a novel relational graph modeling method, which models alignment between questions, tables, and paragraphs. We validated our model on the publicly available table-text hybrid QA benchmark (TAT-QA). The proposed RegHNT significantly outperform the baseline model and achieve state-of-the-art results\footnote{We openly released the source code and data at~\url{//github.com/lfy79001/RegHNT}}~(2022-05-05).
We present new results for online contention resolution schemes for the matching polytope of graphs, in the random-order (RCRS) and adversarial (OCRS) arrival models. Our results include improved selectability guarantees (i.e., lower bounds), as well as new impossibility results (i.e., upper bounds). By well-known reductions to the prophet (secretary) matching problem, a $c$-selectable OCRS (RCRS) implies a $c$-competitive algorithm for adversarial (random order) edge arrivals. Similar reductions are also known for the query-commit matching problem. For the adversarial arrival model, we present a new analysis of the OCRS of Ezra et al.~(EC, 2020). We show that this scheme is $0.344$-selectable for general graphs and $0.349$-selectable for bipartite graphs, improving on the previous $0.337$ selectability result for this algorithm. We also show that the selectability of this scheme cannot be greater than $0.361$ for general graphs and $0.382$ for bipartite graphs. We further show that no OCRS can achieve a selectability greater than $0.4$ for general graphs, and $0.433$ for bipartite graphs. For random-order arrivals, we present two attenuation-based schemes which use new attenuation functions. Our first RCRS is $0.474$-selectable for general graphs, and our second is $0.476$-selectable for bipartite graphs. These results improve upon the recent $0.45$ (and $0.456$) selectability results for general graphs (respectively, bipartite graphs) due to Pollner et al.~(EC, 2022). On general graphs, our 0.474-selectable RCRS provides the best known positive result even for offline contention resolution, and also for the correlation gap. We conclude by proving a fundamental upper bound of 0.5 on the selectability of RCRS, using bipartite graphs.
The core of information retrieval (IR) is to identify relevant information from large-scale resources and return it as a ranked list to respond to user's information need. Recently, the resurgence of deep learning has greatly advanced this field and leads to a hot topic named NeuIR (i.e., neural information retrieval), especially the paradigm of pre-training methods (PTMs). Owing to sophisticated pre-training objectives and huge model size, pre-trained models can learn universal language representations from massive textual data, which are beneficial to the ranking task of IR. Since there have been a large number of works dedicating to the application of PTMs in IR, we believe it is the right time to summarize the current status, learn from existing methods, and gain some insights for future development. In this survey, we present an overview of PTMs applied in different components of IR system, including the retrieval component, the re-ranking component, and other components. In addition, we also introduce PTMs specifically designed for IR, and summarize available datasets as well as benchmark leaderboards. Moreover, we discuss some open challenges and envision some promising directions, with the hope of inspiring more works on these topics for future research.