The Fr\'{e}chet distance is one of the most studied distance measures between curves $P$ and $Q$. The data structure variant is a longstanding open problem: Efficiently preprocess $P$, so that for any $Q$ given at query time, one can efficiently approximate their Fr\'{e}chet distance. There exist conditional lower bounds that prohibit $(1 + \varepsilon)$-approximate Fr\'{e}chet distance computations in subquadratic time, even when preprocessing $P$ using any polynomial amount of time and space. As a consequence, the problem has been studied under various restrictions: restricting $Q$ to be a (horizontal) segment, or requiring $P$ and $Q$ to be so-called \emph{realistic} input curves. We give a data structure for $(1+\varepsilon)$-approximate discrete Fr\'{e}chet distance in any metric space $\mathcal{X}$ between a realistic input curve $P$ and any query curve $Q$. After preprocessing the input curve $P$ (of length $|P|=n$) in $O(n \log n)$ time, we can answer queries specifying a query curve $Q$ and a value $\varepsilon$, and output a value $d(P,Q)$ that is within a factor of $(1+\varepsilon)$ of the true Fr\'{e}chet distance between $P$ and $Q$. Thus, we give the first data structure that adapts to $\varepsilon$-values specified at query time, and the first data structure to handle query curves with arbitrarily many vertices. Our query time is asymptotically linear in $|Q|=m$, $\frac{1}{\varepsilon}$, $\log n$, and the realism parameter $c$ or $\kappa$. The method presented in this paper simplifies and generalizes previous contributions to the static problem variant. We obtain efficient queries (and therefore static algorithms) for Fr\'{e}chet distance computation in high-dimensional spaces and other metric spaces (e.g., when $\mathcal{X}$ is a graph under the shortest path metric). Our method supports subcurve queries at no additional cost.
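For context, the quadratic-time baseline to which the conditional lower bounds above refer is the classical dynamic program for the discrete Fr\'{e}chet distance. A minimal sketch of that baseline (not of the data structure contributed here), assuming curves are given as lists of points and \texttt{dist} is any metric:

\begin{verbatim}
from functools import lru_cache
import math

def discrete_frechet(P, Q, dist=math.dist):
    """Classical O(|P|*|Q|) dynamic program for the discrete Frechet
    distance, via memoized recursion over prefix pairs."""
    @lru_cache(maxsize=None)
    def d(i, j):
        cur = dist(P[i], Q[j])
        if i == 0 and j == 0:
            return cur
        if i == 0:
            return max(d(0, j - 1), cur)
        if j == 0:
            return max(d(i - 1, 0), cur)
        return max(min(d(i - 1, j), d(i, j - 1), d(i - 1, j - 1)), cur)

    return d(len(P) - 1, len(Q) - 1)
\end{verbatim}

For long curves, the same recurrence is usually filled in iteratively to avoid Python's recursion limit.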
Pressure for higher productivity and faster delivery is increasingly pervading software organizations. This can lead software engineers to act like chess players playing a gambit -- sacrificing their technically sound estimates, thus submitting their teams to time pressure. In turn, time pressure can have varied detrimental effects, such as poor product quality, emotional distress, and decreased productivity, which leads to more time pressure and delays: a hard-to-stop vicious cycle. This reveals a need to move on from the more passive strategy of yielding to pressure to a more active one of defending software estimates. Therefore, we propose an approach to support software estimators in acquiring knowledge on how to carry out such a defense, by introducing negotiation principles encapsulated in a set of defense lenses, presented through a digital simulation. We evaluated the proposed approach through a controlled experiment with software practitioners from different companies. We collected data on participants' attitudes, subjective norms, perceived behavioral control, and intentions to defend their estimates, in light of the Theory of Planned Behavior. We employed both a frequentist and a Bayesian approach to data analysis. Results show improved scores among experimental group participants after engaging with the digital simulation and learning about the lenses. They were also more inclined to choose a defense action when facing pressure scenarios than a control group that was instead prompted to reflect on the reasons for and outcomes of pressure over estimates. Qualitative evidence reveals that practitioners perceived the set of lenses as useful in their current work environments. Collectively, these results show the effectiveness of the proposed approach and its perceived relevance for the industry, despite the small amount of time required to engage with it.
Modern time series analysis requires the ability to handle datasets that are inherently high-dimensional; examples include applications in climatology, where measurements from numerous sensors must be taken into account, and inventory tracking in large shops, where the dimension is defined by the number of tracked items. The standard way to mitigate computational issues arising from the high dimensionality of the data is to apply a dimension reduction technique that preserves the structural properties of the ambient space. The dissimilarity between two time series is often measured by ``discrete'' notions of distance, e.g., the dynamic time warping distance or the discrete Fr\'echet distance. Since all these distance functions are computed directly on the points of a time series, they are sensitive to different sampling rates or gaps. The continuous Fr\'echet distance offers a popular alternative which aims to alleviate this by taking into account all points on the polygonal curve obtained by linearly interpolating between any two consecutive points in a sequence. We study the ability of random projections \`a la Johnson and Lindenstrauss to preserve the continuous Fr\'echet distance of polygonal curves by effectively reducing the dimension. In particular, we show that one can reduce the dimension to $O(\epsilon^{-2} \log N)$, where $N$ is the total number of input points, while preserving the continuous Fr\'echet distance between any two determined polygonal curves within a factor of $1\pm \epsilon$. We conclude with applications to clustering.
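The projection itself acts only on the vertices of the curves. A minimal sketch of the standard Gaussian Johnson--Lindenstrauss construction with target dimension $k = O(\epsilon^{-2}\log N)$ (the constant below is illustrative, and the guarantee for the continuous Fr\'echet distance is the subject of the paper, not of this snippet):

\begin{verbatim}
import numpy as np

def jl_project_curves(curves, eps, seed=0):
    """Project the vertices of all curves (each a (len_i, d) array) to
    k = O(eps^-2 log N) dimensions with a scaled Gaussian matrix."""
    rng = np.random.default_rng(seed)
    N = sum(len(c) for c in curves)                     # total number of points
    d = curves[0].shape[1]                              # ambient dimension
    k = max(1, int(np.ceil(8 * np.log(N) / eps ** 2)))  # illustrative constant 8
    G = rng.normal(size=(d, k)) / np.sqrt(k)            # entries ~ N(0, 1/k)
    return [c @ G for c in curves]
\end{verbatim}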
Fr\'echet Inception Distance (FID) is the primary metric for ranking models in data-driven generative modeling. While remarkably successful, the metric is known to sometimes disagree with human judgement. We investigate a root cause of these discrepancies and visualize what FID "looks at" in generated images. We show that the feature space that FID is (typically) computed in is so close to the ImageNet classifications that aligning the histograms of Top-$N$ classifications between sets of generated and real images can reduce FID substantially -- without actually improving the quality of results. Thus, we conclude that FID is prone to intentional or accidental distortions. As a practical example of an accidental distortion, we discuss a case where an ImageNet pre-trained FastGAN achieves an FID comparable to that of StyleGAN2, while being worse in terms of human evaluation.
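For reference, FID is the Fr\'echet distance between two Gaussians fitted to Inception features of real and generated images. A minimal sketch of the standard formula, with feature extraction omitted:

\begin{verbatim}
import numpy as np
from scipy import linalg

def fid(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets,
    each of shape (num_samples, feature_dim)."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov1 = np.cov(feats_real, rowvar=False)
    cov2 = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(cov1 @ cov2)   # matrix square root
    if np.iscomplexobj(covmean):          # drop tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1 + cov2 - 2 * covmean))
\end{verbatim}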
The development of efficient sampling algorithms catering to non-Euclidean geometries has been a challenging endeavor, as discretization techniques which succeed in the Euclidean setting do not readily carry over to more general settings. We develop a non-Euclidean analog of the recent proximal sampler of [LST21], which naturally induces regularization by an object known as the log-Laplace transform (LLT) of a density. We prove new mathematical properties (with an algorithmic flavor) of the LLT, such as strong convexity-smoothness duality and an isoperimetric inequality, which are used to prove a mixing time bound for our proximal sampler matching [LST21] under a warm start. As our main application, we show that our warm-started sampler improves the value oracle complexity of differentially private convex optimization in $\ell_p$ and Schatten-$p$ norms for $p \in [1, 2]$ to match the Euclidean setting [GLL22], while retaining state-of-the-art excess risk bounds [GLLST23]. We find our investigation of the LLT to be a promising proof-of-concept of its utility as a tool for designing samplers, and outline directions for future exploration.
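For orientation, in the Euclidean setting the proximal sampler of [LST21] alternates a Gaussian forward step with a restricted Gaussian (proximal) step, and the log-Laplace transform of a density is the standard object below; the paper's non-Euclidean analog generalizes this setup, and its exact definition of the LLT may differ in normalization:

\[
y_k \sim \mathcal{N}(x_k, \eta I), \qquad
x_{k+1} \sim \pi^{X \mid Y = y_k}(x) \propto \pi(x)\, e^{-\|x - y_k\|_2^2 / (2\eta)},
\qquad
L_{\pi}(\theta) = \log \int_{\mathbb{R}^d} e^{\langle \theta, x \rangle}\, \pi(x)\, \mathrm{d}x .
\]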
Pervasive cross-section dependence is increasingly recognized as a characteristic of economic data, and the approximate factor model provides a useful framework for analysis. Assuming a strong factor structure where $\Lambda'\Lambda/N^{\alpha}$ is positive definite in the limit when $\alpha=1$, early work established convergence of the principal component estimates of the factors and loadings up to a rotation matrix. This paper shows that the estimates are still consistent and asymptotically normal when $\alpha\in(0,1]$, albeit at slower rates and under additional assumptions on the sample size. The results hold whether $\alpha$ is constant or varies across factor loadings. The framework developed for heterogeneous loadings and the simplified proofs, which can also be used in strong factor analysis, are of independent interest.
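In standard approximate factor model notation (the paper's exact normalization may differ), an $N \times T$ panel is generated by $r$ common factors, and the factor strength condition reads

\[
X_{it} = \lambda_i' F_t + e_{it}, \qquad
\frac{\Lambda'\Lambda}{N^{\alpha}} \longrightarrow \Sigma_{\Lambda} \succ 0, \qquad \alpha \in (0,1],
\]

where $\Lambda = (\lambda_1, \dots, \lambda_N)'$ stacks the loadings and $\alpha = 1$ corresponds to the strong factor case analyzed in the earlier literature.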
A patient seller aims to sell a good to an impatient buyer (i.e., one who discounts utility over time). The buyer will remain in the market for a period of time $T$, and her private value is drawn from a publicly known distribution. What is the revenue-optimal pricing curve (sequence of (price, time) pairs) for the seller? Does randomization help here? Is the revenue-optimal pricing curve computable in polynomial time? We answer these questions in this paper. We give an efficient algorithm for computing the revenue-optimal pricing curve. We show that pricing curves, which post a price at each point in time and let the buyer pick her utility-maximizing time to buy, are revenue-optimal among a much broader class of sequential lottery mechanisms: mechanisms that allow the seller to post a menu of lotteries at each point in time cannot obtain any higher revenue than pricing curves. We also show that the even broader class of mechanisms that allow the menu of lotteries to be set adaptively can earn strictly higher revenue than pricing curves, and that the revenue gap can be as large as the support size of the buyer's value distribution.
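As an illustrative formalization (the exponential discount factor $\delta$ below is an assumption made for concreteness, not notation taken from the paper): facing a pricing curve $(p_1, t_1), \dots, (p_k, t_k)$, a buyer with value $v$ picks the purchase time that maximizes her discounted utility, while the patient seller collects the undiscounted price,

\[
j^*(v) \in \arg\max_{j \,:\, t_j \le T} \; \delta^{t_j}\,(v - p_j), \qquad
\mathrm{Rev} = \mathbb{E}_{v}\!\left[\, p_{j^*(v)} \cdot \mathbf{1}\{v \ge p_{j^*(v)}\} \,\right].
\]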
This paper develops and analyzes an accelerated proximal descent method for finding stationary points of nonconvex composite optimization problems. The objective function is of the form $f+h$, where $h$ is a proper closed convex function, $f$ is a differentiable function on the domain of $h$, and $\nabla f$ is Lipschitz continuous on the domain of $h$. The main advantage of this method is that it is "parameter-free" in the sense that it does not require knowledge of the Lipschitz constant of $\nabla f$ or of any global topological properties of $f$. It is shown that the proposed method can obtain an $\varepsilon$-approximate stationary point with iteration complexity bounds that are optimal, up to logarithmic terms in $\varepsilon$, in both the convex and nonconvex settings. Some discussion is also given about how the proposed method can be leveraged in other existing optimization frameworks, such as min-max smoothing and penalty frameworks for constrained programming, to create more specialized parameter-free methods. Finally, numerical experiments are presented to support the practical viability of the method.
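The "parameter-free" idea can be illustrated, in a much simplified and non-accelerated form, by a proximal gradient step whose stepsize is found by backtracking rather than from a known Lipschitz constant; this generic sketch is not the accelerated method of the paper, and \texttt{prox\_h} is assumed to be supplied by the user:

\begin{verbatim}
def backtracking_prox_step(x, f, grad_f, prox_h, lam=1.0, shrink=0.5):
    """One step x+ = prox_{lam*h}(x - lam*grad_f(x)) on NumPy vectors,
    with lam found by backtracking (no Lipschitz constant needed)."""
    gx, fx = grad_f(x), f(x)
    while True:
        x_new = prox_h(x - lam * gx, lam)
        diff = x_new - x
        # sufficient-decrease test replacing the usual L-based condition
        if f(x_new) <= fx + gx @ diff + (diff @ diff) / (2 * lam):
            return x_new, lam
        lam *= shrink
\end{verbatim}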
We revisit the following problem: given a set of indices $S = \{1, \dots, n\}$ and weights $w_1, \dots, w_n \in \mathbb{R}_{> 0}$, provide samples from $S$ with distribution $p(i) = w_i / W$, where $W = \sum_j w_j$ gives the proper normalization. In the static setting, there is a simple data structure due to Walker, called the Alias Table, that allows samples to be drawn in constant time. A more challenging task is to maintain the distribution in a dynamic setting, where elements may be added or removed, or weights may change over time; here, existing solutions restrict the permissible weights, require rebuilding of the associated data structure after a number of updates, or are rather complex. In this paper, we describe, analyze, and engineer a simple data structure for maintaining a discrete probability distribution in the dynamic setting. Construction of the data structure for an arbitrary distribution takes time $O(n)$, sampling takes expected time $O(1)$, and updates of size $\Delta = O(W / n)$ can be processed in time $O(1)$. To evaluate the efficiency of the data structure, we conduct an experimental study. The results suggest that the dynamic sampling performance is comparable to that of the static Alias Table, with only a minor slowdown.
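For reference, the static Alias Table mentioned above can be built in $O(n)$ time (Vose's variant of Walker's method) and sampled in $O(1)$; a minimal sketch, not reproducing the dynamic data structure of the paper:

\begin{verbatim}
import random

def build_alias_table(weights):
    """Vose's O(n) construction: returns (prob, alias) for O(1) sampling."""
    n, W = len(weights), sum(weights)
    scaled = [w * n / W for w in weights]
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    prob, alias = [1.0] * n, list(range(n))
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l   # column s: keep s w.p. scaled[s]
        scaled[l] -= 1.0 - scaled[s]       # donate the remainder from l
        (small if scaled[l] < 1.0 else large).append(l)
    return prob, alias

def sample(prob, alias, rng=random):
    i = rng.randrange(len(prob))           # pick a column uniformly
    return i if rng.random() < prob[i] else alias[i]
\end{verbatim}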
The Fr\'{e}chet distance is one of the most studied distance measures between curves $P$ and $Q$. The data structure variant is a longstanding open problem: Efficiently preprocess $P$, so that for any $Q$ given at query time, one can efficiently approximate their Fr\'{e}chet distance. There exist conditional lower bounds that prohibit $(1 + \varepsilon)$-approximate Fr\'{e}chet distance computations in subquadratic time, even when preprocessing $P$ using any polynomial amount of time and space. As a consequence, the problem has been studied under various restrictions: restricting $Q$ to be a (horizontal) segment, or requiring $P$ and $Q$ to be so-called \emph{realistic} input curves. We give a data structure for $(1+\varepsilon)$-approximate discrete Fr\'{e}chet distance in any metric space $\mathcal{X}$ between a realistic input curve $P$ and any query curve $Q$. After preprocessing the input curve $P$ (of length $|P|=n$) in $O(n \log n)$ time, we can answer queries specifying a query curve $Q$ and a value $\varepsilon$, and output a value $d(P,Q)$ that is within a factor of $(1+\varepsilon)$ of the true Fr\'{e}chet distance between $P$ and $Q$. Our query time is asymptotically linear in $|Q|=m$, $\frac{1}{\varepsilon}$, $\log n$, and the realism parameter $c$ or $\kappa$. Our data structure is the first to adapt to the approximation parameter $\varepsilon$ at query time, the first to handle query curves with arbitrarily many vertices, the first to work for any ambient space of the curves, and the first to be dynamic. The method presented in this paper simplifies and generalizes previous contributions to the static problem variant. We obtain efficient queries (and therefore static algorithms) for Fr\'{e}chet distance computation in high-dimensional spaces and other ambient metric spaces.
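One standard realism assumption associated with a parameter $c$ is $c$-packedness (stated here for illustration; the precise realism classes used in the paper, and the meaning of $\kappa$, may differ): a curve $P$ in a metric space $\mathcal{X}$ is $c$-packed if, for every center $q \in \mathcal{X}$ and radius $r > 0$, the total length of $P$ inside the ball $B(q, r)$ is at most $c \cdot r$, i.e.,

\[
\lVert P \cap B(q, r) \rVert \;\le\; c \cdot r \qquad \text{for all } q \in \mathcal{X},\; r > 0 .
\]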
When estimating a Global Average Treatment Effect (GATE) under network interference, units can have widely different relationships to the treatment depending on a combination of the structure of their network neighborhood, the structure of the interference mechanism, and how the treatment was distributed in their neighborhood. In this work, we introduce a sequential procedure to generate and select graph- and treatment-based covariates for GATE estimation under regression adjustment. We show that it is possible to simultaneously achieve low bias and considerably reduce variance with such a procedure. To tackle inferential complications caused by our feature generation and selection process, we introduce a way to construct confidence intervals based on a block bootstrap. We illustrate that our selection procedure and subsequent estimator can achieve good performance in terms of root mean squared error in several semi-synthetic experiments with Bernoulli designs, comparing favorably to an oracle estimator that takes advantage of regression adjustments for the known underlying interference structure. We apply our method to a real-world experimental dataset with strong evidence of interference and demonstrate that it can estimate the GATE reasonably well without knowing the interference process a priori.
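Below is a simplified illustration: two graph- and treatment-based covariates under a Bernoulli design, followed by a plain regression adjustment (the paper's sequential generation and selection procedure and its block-bootstrap intervals are not reproduced here). The inputs \texttt{A} (adjacency matrix), \texttt{z} (treatment vector), and \texttt{y} (outcomes) are hypothetical:

\begin{verbatim}
import numpy as np

def gate_regression_adjustment(A, z, y):
    """GATE estimate via linear regression adjustment with two simple
    covariates: degree and fraction of treated neighbors. Counterfactual
    predictions contrast everyone treated vs. everyone in control."""
    deg = A.sum(axis=1).astype(float)
    frac = np.divide(A @ z, deg, out=np.zeros_like(deg), where=deg > 0)
    ones, zeros = np.ones_like(deg), np.zeros_like(deg)
    X = np.column_stack([ones, z, deg, frac])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    X_all_treated = np.column_stack([ones, ones, deg, ones])
    X_all_control = np.column_stack([ones, zeros, deg, zeros])
    return float((X_all_treated @ beta - X_all_control @ beta).mean())
\end{verbatim}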