Compact shape summaries for I/O distributions - and explicit verdicts for when moments are the wrong tool.


A tool I’ve been thinking about for fifteen years

Some time around 2011, I started wondering whether you could characterize the shape of an I/O latency distribution - disk, filesystem, any streaming measurement - using only its first few moments, computed online, in kernel space where memory is expensive. Every PhD mathematician I asked told me some variant of “no, you can’t reconstruct a distribution from a finite number of moments; the problem is ill-posed.” They were answering a different question than the one I was asking. It took me most of fifteen years to articulate which question I actually had, and why their objection, though correct, did not apply to it.

This article is about that distinction. There is a tool I’d like to exist, called iomoments, that sits between two well-known positions - the “pure reconstruction is impossible” of moment-problem theory and the “here’s a histogram, ship it” of engineering observability. The middle is empty because the two neighboring answers are each coherent without it. But the middle position is the useful one for characterizing I/O distributions in constrained environments, and it has a clean mathematical statement.

The problem: compact shape, honestly

Say you’re running a filesystem benchmark, or monitoring a production disk. Latency per operation is a random variable drawn from some distribution. You want to understand its shape - not the exact density, just enough to know “is it Gaussian-ish, or log-normal-ish?”; “is the right tail heavy?”; “are there two modes (cache hit vs. cache miss), or one?”; “has something drifted since yesterday?” These are shape questions, and they are what most performance analysis actually needs.

The standard answer is a histogram. A log-bucketed histogram like bcc’s biolatency or HdrHistogram gives you a direct picture of the distribution with O(1) update cost and a fixed footprint - typically a few kibibytes for several decades of dynamic range. That’s small on a laptop, but per-CPU counters multiplied across many monitored subsystems add up, and kernel memory budgets are tighter than userspace budgets even when they aren’t literally counted in cache lines. There is room for something smaller when you only care about shape.

Moments occupy that space. If X is the random variable,

  • m_1 = E[X] (mean),
  • m_2 = E[X^2] (second raw moment; variance is m_2 - m_1^2),
  • m_3 = E[X^3] (third raw moment; skewness derives from this and the first two),
  • m_k = E[X^k] for any order k,

and a streaming estimator updates these as samples arrive using nothing but a few scalar accumulators. Skewness tells you about tail asymmetry. Kurtosis tells you about tail heaviness. A handful of higher moments can begin to resolve bimodality. Six moments fit in 48 bytes - roughly sixteen times smaller than a modest log-bucketed histogram, and they live in CPU registers during updates.
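
For concreteness, here is a minimal sketch (plain Python; the function name is mine) of how the usual shape descriptors fall out of the first four raw moments. These are the textbook conversions; as the implementation section notes later, computing them from naively accumulated power sums is numerically dangerous, so treat this as a definition, not an accumulation strategy.

    # Shape descriptors from the first four raw moments m1..m4.
    # Textbook conversions; computing them from naively accumulated power
    # sums invites catastrophic cancellation (see the numerical core below).
    def shape_from_raw_moments(m1, m2, m3, m4):
        mu2 = m2 - m1**2                              # variance
        mu3 = m3 - 3*m1*m2 + 2*m1**3                  # third central moment
        mu4 = m4 - 4*m1*m3 + 6*m1**2*m2 - 3*m1**4     # fourth central moment
        return {
            "mean": m1,
            "variance": mu2,
            "skewness": mu3 / mu2**1.5,
            "excess_kurtosis": mu4 / mu2**2 - 3.0,    # 0 for a Gaussian
        }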

So far, this sounds like a free win: pay ~50 bytes per monitored quantity, get meaningful shape descriptors. The catch is that the interpretation of those numbers depends on properties of the underlying distribution that may or may not hold. The rest of this article is about what those properties are, how to test for them at runtime, and what to do when they fail.

The classical moment problem - and why mathematicians flinch

If you walk into a probability theorist’s office and say “I want to reconstruct a distribution from its moments,” they will probably hand you a copy of Akhiezer (1965) or Schmüdgen (2017) and send you home. The classical moment problem, going back to Stieltjes (1894), Hamburger (1920), and Hausdorff (1921), asks:

Given a sequence of real numbers m_0, m_1, m_2, ..., is there a probability measure whose moments are exactly that sequence? If so, is it unique?

This is the exact-reconstruction question, and the answers are genuinely discouraging:

  • Existence is characterized by positive-semidefiniteness of certain Hankel matrices (the matrices H_k whose (i, j) entry is m_{i+j}). This is constructive in principle but numerically fragile in practice - Hankel matrices built from real data become catastrophically ill-conditioned as their order grows; a small numerical sketch follows this list.
  • Uniqueness - called determinacy - is the real problem. There exist distinct probability distributions that share every moment. The canonical counterexample is the log-normal distribution: it has finite moments of all orders, yet infinitely many distinct densities share exactly those moments (Heyde, 1963; Stoyanov, 2013). A sufficient condition for uniqueness is Carleman’s condition (Carleman, 1926), which says that the even-moment sequence must not grow too fast. The condition is sufficient but not necessary: distributions that violate it are not automatically indeterminate, but they are not proved determinate by this route either. The log-normal violates Carleman and is indeterminate; other distributions violate it and remain determinate. The practical takeaway is that indeterminacy is a real risk for heavy-tailed or heavy-moment distributions, and the reconstruction-from-moments program has no general remedy for it.
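
To see the fragility rather than take it on faith, here is a small numpy sketch of mine that builds H_k from a sample's raw moments and reports the smallest eigenvalue (the existence check) and the condition number as the order grows; the order cutoff is arbitrary.

    # Build H_k with (i, j) entry m_{i+j} from sample raw moments and watch
    # the condition number explode with order. Illustration only.
    import numpy as np

    def hankel_conditioning(samples, max_order=5):
        x = np.asarray(samples, dtype=float)
        raw = [np.mean(x**p) for p in range(2 * max_order + 1)]   # m_0 .. m_{2k}
        report = {}
        for k in range(1, max_order + 1):
            H = np.array([[raw[i + j] for j in range(k + 1)]
                          for i in range(k + 1)])
            eigs = np.linalg.eigvalsh(H)          # ascending
            report[k] = {"min_eigenvalue": eigs[0],
                         "condition_number": np.linalg.cond(H)}
        return report

    # e.g. hankel_conditioning(np.random.lognormal(0.0, 1.0, 100_000))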

When a mathematician hears “shape from a few moments,” they hear the classical moment problem, and they are right to flinch. Truncated moment inversion is ill-posed, and even in the idealized case where you somehow obtained the entire infinite moment sequence exactly, some of the distributions most relevant to performance work - log-normal being the worst offender - are moment-indeterminate. No amount of moment collection will distinguish them from their doppelgängers.

This is a real result. It is, however, a result about the wrong question.

The engineering question - different, and well-posed

The question I actually want to answer is not “what distribution produced these moments?” It is:

Given N estimated moments and a battery of diagnostic tests, can I output a shape summary for which I am willing to state the conditions under which it is meaningful?

This is well-posed. The answer is yes for some workloads (Carleman-satisfying tails, stationary processes, sampled above the Nyquist rate of any interesting system dynamics) and no for others (heavy-tailed distributions where higher moments fail to exist, non-stationary processes, temporally aliased sampling regimes). The contribution is not the moments themselves - Pébay (2008) gave us the numerically stable update rules for those nearly two decades ago. The contribution is answering which case we are in right now, for this workload, and refusing to emit moment-based summaries when the tests say we shouldn’t.

The rest of this article sketches the tests. They come in two families: distribution-shape diagnostics, which ask whether moments can meaningfully characterize whatever distribution is generating the samples, and temporal-coherence diagnostics, which ask whether our sampling is fast enough to see the dynamics that matter.

Diagnostics I: is this distribution moment-friendly at all?

A short warmup phase collects raw samples - say, a few seconds to a minute - and runs several small numerical probes. None of these are new. All of them are textbook statistics. The point is assembling them into a single verdict before any moment-based output is trusted.

Does the tail exist?

For a power-law tail of the form P(X > x) ~ x^{-α}, the k-th moment exists only if α > k. If α falls below 2 - a regime that can arise under contention, though the actual tail index is workload- and hardware-dependent and must be measured per system - then variance does not exist. Reported variance will be dominated by the single largest outlier in the trace and will not converge as sample count grows. Kurtosis is worse. Everything beyond the mean is noise.

The Hill estimator (Hill, 1975; Embrechts, Klüppelberg & Mikosch, 1997) gives a rough estimate of the tail index α from the top k order statistics of the sample. If α is below the largest moment order you want to report, the tool should refuse to report those moments. This is the canonical infeasibility case: moments are not the wrong tool, they are the wrong primitive for this distribution.
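
A minimal Hill-estimator sketch, assuming positive samples; the cutoff k - how many top order statistics to use - is the usual awkward free parameter, and the top-5% default below is an assumption of mine, not a recommendation from the literature.

    import numpy as np

    def hill_tail_index(samples, k=None):
        """Hill estimate of the tail index alpha from the top-k order statistics."""
        x = np.sort(np.asarray(samples, dtype=float))
        n = len(x)
        if k is None:
            k = max(10, n // 20)            # assumption: top 5% of the sample
        gamma = np.mean(np.log(x[n - k:]) - np.log(x[n - k - 1]))   # 1/alpha
        return 1.0 / gamma

    # Moments of order >= alpha are untrustworthy:
    #   alpha = hill_tail_index(latencies)
    #   max_reportable_order = int(np.ceil(alpha)) - 1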

Is the moment sequence determinate?

Carleman’s condition says that moments uniquely determine a distribution when the series ∑ m_{2k}^{-1/(2k)} diverges. You can estimate the partial sums of this series up to the largest reliably-estimable even moment and check whether the partial sums look like they are diverging or converging. For log-normal-flavored data, raw-space moments grow super-exponentially in order (as exp(kμ + k²σ²/2)), which makes m_{2k}^{-1/(2k)} decay geometrically and the Carleman series converge - the test flags indeterminacy. Meanwhile, in log space the data is approximately Gaussian, whose moments grow slowly enough that Carleman’s series diverges and determinacy is established. That observation informs the next diagnostic.
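
A sketch of that probe, with the obvious caveat that only a handful of even moments are reliably estimable from a warmup sample, so “diverging versus converging” is a judgment about the trend of a short sequence of terms, not a limit.

    import numpy as np

    def carleman_terms(samples, max_even_order=12):
        """Terms m_{2k}^(-1/(2k)) of Carleman's series for the sample.

        Geometrically decaying terms (partial sums flattening out) flag a
        possibly indeterminate moment sequence; terms that shrink slowly,
        so the partial sums keep growing, are consistent with determinacy.
        """
        x = np.asarray(samples, dtype=float)
        return np.array([np.mean(x ** (2 * k)) ** (-1.0 / (2 * k))
                         for k in range(1, max_even_order // 2 + 1)])

    # Compare raw space with log space for lognormal-flavored latencies:
    #   carleman_terms(latencies)           # geometric decay -> converging
    #   carleman_terms(np.log(latencies))   # slow decay      -> diverging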

Log space or raw space?

Fit a shifted log-normal to the warmup sample by maximum likelihood. Measure the Kolmogorov–Smirnov distance between the empirical CDF and the fitted log-normal CDF. If the distance is small, recommend computing moments in log(latency) space, where the distribution is approximately Gaussian and moments converge fast. If the distance is large and the distribution looks symmetric around its mean, raw space is fine. If it’s multi-modal, emit a caveat: moments in either space will average over modes and hide them.
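
A sketch of that decision using scipy’s maximum-likelihood lognormal fit. Only the KS distance is used - the p-value is meaningless once the parameters are fitted from the same data - and the distance and symmetry thresholds below are illustrative placeholders of mine, not calibrated values.

    import numpy as np
    from scipy import stats

    def choose_moment_space(samples, ks_threshold=0.05, skew_threshold=0.5):
        """Decide whether to accumulate moments in log(latency) or raw space."""
        x = np.asarray(samples, dtype=float)
        # Shifted lognormal fit (scipy parameterization: s = sigma,
        # loc = shift, scale = exp(mu)).
        s, loc, scale = stats.lognorm.fit(x)
        ks_distance, _ = stats.kstest(x, "lognorm", args=(s, loc, scale))
        if ks_distance < ks_threshold:
            return "log", ks_distance          # ~Gaussian in log space
        if abs(stats.skew(x)) < skew_threshold:
            return "raw", ks_distance          # roughly symmetric
        return "raw-with-caveat", ks_distance  # skewed or possibly multi-modal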

How many modes are there, approximately?

Fit a small family of Gaussian mixture models with K = 1, 2, 3, ... components and pick K by the Bayesian information criterion. A one-component fit resolves the modality question - the moments are summarizing a single shape rather than a mixture - but the tail-existence, determinacy, and estimation-stability diagnostics still have to pass independently before moments can be trusted. A two-component fit means the mean is a weighted average of two centers and the variance is inflated by their separation; skewness will be dominated by the mixture weights; higher moments become hard to interpret. There is a related structural fact worth noting: a K-atomic discrete measure makes the (k+1)-by-(k+1) Hankel H_k rank-deficient for k ≥ K (classical; see Curto & Fialkow, 1991 for a modern treatment of the truncated moment problem). For continuous multimodal distributions this rank-drop is a heuristic rather than a theorem - well-separated narrow modes behave approximately like atoms - so in practice BIC over GMM fits is the primary mode estimate and Hankel numerical rank is a cross-check.

Using either BIC or the Hankel rank heuristic, iomoments can report “this workload has approximately K modes” and tune its output accordingly: if K > 1, it emits moments with caveats, not as if they summarized a unimodal shape. The Hankel-rank approach is exact for atomic measures and approximate for continuous multimodal distributions; BIC over GMM fits is more direct for the continuous case, and the two agree in the regime where both apply.
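
A sketch of both estimates side by side, using scikit-learn’s GaussianMixture for the BIC scan and a singular-value tolerance as the Hankel matrix’s numerical rank. The tolerance and the component cap are assumptions of mine, and on genuinely continuous unimodal data the Hankel rank will usually come back full - which is exactly the caveat above.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def estimate_mode_count(samples, k_max=4, sv_tol=1e-8):
        """Mode count by BIC over small Gaussian mixtures, with a Hankel
        numerical-rank cross-check. Thresholds are illustrative."""
        x = np.asarray(samples, dtype=float).reshape(-1, 1)
        bics = [GaussianMixture(n_components=k, n_init=3).fit(x).bic(x)
                for k in range(1, k_max + 1)]
        k_bic = int(np.argmin(bics)) + 1

        flat = x.ravel()
        raw = [np.mean(flat**p) for p in range(2 * k_max + 1)]
        H = np.array([[raw[i + j] for j in range(k_max + 1)]
                      for i in range(k_max + 1)])
        svals = np.linalg.svd(H, compute_uv=False)
        k_hankel = int(np.sum(svals > sv_tol * svals[0]))

        return {"k_bic": k_bic, "k_hankel_rank": k_hankel}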

How stable are the estimates at this sample size?

Split the warmup sample at random into halves. Compute moments 1 through N separately on each half. Compare. A moment whose two half-estimates differ by 50% is not reliably estimated yet. By the central limit theorem, each moment estimator’s noise decays as n^{-1/2}, which tells you directly how much more sample you need for a given target precision. The catch is that the constant multiplying n^{-1/2} grows with moment order - the variance of the k-th sample moment scales with the 2k-th population moment, which typically blows up quickly for anything heavier-tailed than a Gaussian. For Gaussian data, a few thousand samples give 5% relative precision on kurtosis; for non-Gaussian data, the required sample count can be orders of magnitude larger, easily into the millions or more for moderately heavy-tailed distributions.

The probe can report this as a sample budget: “at your current event rate, reaching 5% kurtosis precision takes about fifteen minutes of warmup.” The operator can choose whether to wait, back off to lower-order moments, or switch tools.
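
A sketch of the split-half probe and the resulting sample budget; the number of random splits and the 5% default target are choices of mine, and the budget is the bare n^{-1/2} extrapolation described above.

    import numpy as np

    def split_half_stability(samples, max_order=6, n_splits=20, target_rel=0.05):
        """Worst relative disagreement between random half-sample estimates of
        raw moments 1..max_order, plus a crude n^{-1/2} sample budget for
        reaching target_rel relative precision."""
        x = np.asarray(samples, dtype=float)
        rng = np.random.default_rng(0)
        worst = np.zeros(max_order)
        for _ in range(n_splits):
            perm = rng.permutation(len(x))
            a, b = x[perm[: len(x) // 2]], x[perm[len(x) // 2:]]
            for k in range(1, max_order + 1):
                ma, mb = np.mean(a**k), np.mean(b**k)
                rel = abs(ma - mb) / max(abs(ma), abs(mb), 1e-300)
                worst[k - 1] = max(worst[k - 1], rel)
        unreliable = [k + 1 for k in range(max_order) if worst[k] > 0.5]
        budget = {k + 1: int(len(x) * (worst[k] / target_rel) ** 2)
                  for k in range(max_order)}
        return {"worst_rel_err": worst,
                "unreliable_orders": unreliable,
                "samples_for_target": budget}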

Diagnostics II: are we sampling fast enough to see the system?

The second family of diagnostics is a more subtle trap. Even if the distribution is perfectly moment-friendly, moments computed over a time window that is slow relative to the system’s internal dynamics can alias real structure into nonsense, in exactly the sense of the Shannon (1949) / Nyquist (1928) sampling theorem. This concern is the one that motivated the question in the first place - an I/O subsystem has physical inertia and bounded bandwidth, and any interesting time-variation in its output must be sampled at more than twice its highest frequency to be faithfully captured.

Inter-arrival coefficient of variation

Compute moments of the time between events, not just the event values. A Poisson process has inter-arrival coefficient of variation 1. Higher CV means bursty arrivals: during bursts, the effective sampling rate is much higher than average; during quiet intervals, it can be far lower. A workload with CV = 5 provides worse temporal coverage than its mean event rate suggests, and the operator should be told.
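
The probe itself is a few lines; timestamps in any consistent unit work.

    import numpy as np

    def interarrival_cv(event_timestamps):
        """Coefficient of variation of inter-arrival times: ~1 for Poisson
        arrivals, substantially above 1 for bursty arrivals (and therefore
        uneven temporal coverage)."""
        dt = np.diff(np.sort(np.asarray(event_timestamps, dtype=float)))
        return float(np.std(dt) / np.mean(dt))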

Variance of windowed moments

Window the sample stream into windows of length W. Compute the mean-in-window for each window. Take the variance of those means across windows, as a function of W. For a stationary ergodic process, that variance decays as 1/W. If instead it plateaus, the stream contains slow structure - a periodic latency cycle, a drifting subsystem - that the W-averaged mean cannot resolve. If it oscillates as you sweep W, look at the local minima: when W is an integer multiple of the hidden period T, each window spans a whole number of cycles and the windowed means become insensitive to phase, so variance across windows drops. Local minima of the variance-vs-W curve therefore occur at W = T, 2T, 3T, ..., which makes T recoverable from the spacing between minima. This gives a crude but reliable aliasing detector using only accumulators you were already keeping.
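
A sketch of that sweep. For simplicity it windows by sample count rather than by wall-clock time, which is a reasonable stand-in when the event rate is roughly steady and an understatement of the problem when it is not.

    import numpy as np

    def windowed_mean_variance(samples, window_sizes):
        """Variance of per-window means as a function of window length W.
        For a stationary short-memory stream this decays like 1/W; a plateau
        or oscillation indicates slow or periodic structure."""
        x = np.asarray(samples, dtype=float)
        out = {}
        for w in window_sizes:
            n_win = len(x) // w
            if n_win < 2:
                continue
            means = x[: n_win * w].reshape(n_win, w).mean(axis=1)
            out[w] = float(means.var())
        return out

    # Sweep powers of two and look for plateaus or local minima:
    #   windowed_mean_variance(latencies, [2**k for k in range(4, 14)])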

Spectral density of the moment stream

With a short ring buffer of windowed moments, compute a Welch (1967) power spectral density estimate. Two aliasing signatures are:

  • Energy concentrated near the Nyquist frequency f_s/2 - the “almost aliasing” signature.
  • Unexpected spectral peaks in the [0, f_s/2] band. A signal component at a true frequency f > f_s/2 folds back into this band at |f - k·f_s| for whichever integer k brings it into range, producing spurious peaks at specific fold-back locations. Peaks whose locations don’t match any plausible physical rate implied by the workload or the hardware are suspect.
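
A sketch of the check using scipy.signal.welch on the windowed-mean sequence; the width of the “near Nyquist” band and the order-of-magnitude peak criterion are placeholders of mine, not calibrated thresholds.

    import numpy as np
    from scipy.signal import welch

    def aliasing_signatures(windowed_means, window_rate_hz, edge_fraction=0.1):
        """Welch PSD of the windowed-mean sequence with two crude flags:
        energy piled up near the Nyquist frequency, and prominent peaks."""
        f, pxx = welch(np.asarray(windowed_means, dtype=float), fs=window_rate_hz)
        nyquist = window_rate_hz / 2.0
        near_edge = f > (1.0 - edge_fraction) * nyquist
        edge_energy_share = float(pxx[near_edge].sum() / pxx.sum())
        # Peaks standing an order of magnitude above the median PSD level.
        peak_freqs = f[pxx > 10.0 * np.median(pxx)]
        return {"edge_energy_share": edge_energy_share,
                "suspect_peak_freqs_hz": peak_freqs}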

Autocorrelation and Gaussianization

For a stationary ergodic process with no periodic component, autocorrelation of the windowed-mean sequence should decay to zero at large lag. Persistent or oscillating autocorrelation at characteristic lags is consistent with stationarity but signals hidden periodicity or long-range dependence - which is the diagnostic point here, since periodic latency structure is precisely one of the things iomoments is trying to catch. Similarly, by the central limit theorem, windowed means should become approximately Gaussian as W grows under suitable mixing conditions; if higher moments of the windowed-mean distribution stay far from their Gaussian values, the process has long-range dependence the simple moments cannot resolve.
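
A sketch of both checks on the same windowed-mean sequence; the lag range is arbitrary, and the Gaussianization check is simply how far the skewness and excess kurtosis of the windowed means sit from zero.

    import numpy as np
    from scipy import stats

    def windowed_mean_memory(windowed_means, max_lag=50):
        """Autocorrelation of the windowed-mean sequence (should decay toward
        zero) and a crude Gaussianization check on its shape."""
        x = np.asarray(windowed_means, dtype=float)
        xc = x - x.mean()
        denom = float(np.dot(xc, xc))
        acf = np.array([np.dot(xc[:len(xc) - lag], xc[lag:]) / denom
                        for lag in range(1, max_lag + 1)])
        return {"acf": acf,
                "skewness": float(stats.skew(x)),             # -> 0 as W grows
                "excess_kurtosis": float(stats.kurtosis(x))}  # -> 0 as W grows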

The common thread: moments of moments. The instantaneous moments are a summary; the moments of those summaries across time windows tell you whether the summary itself is capturing the shape, or silently averaging over structure you care about.

Feasibility verdicts

Running those diagnostics, iomoments emits a verdict - not just numbers. Four categories:

  • Green. Moments are a trustworthy shape summary for this workload. Emit moments with an expected error budget (derived from the stability test) and move on.
  • Yellow. Moments are informative but miss some structure - multiple modes, mild non-stationarity, a soft periodicity at a detectable frequency. Emit moments with explicit caveats that the operator can read.
  • Amber. Moments are likely biased - an aliasing signature was detected, or the stability test shows unexpectedly slow convergence. Emit moments alongside a recommendation for further investigation.
  • Red. Moments are the wrong primitive. Heavy tail where variance does not exist; extreme non-stationarity; sampling rate below the characteristic system frequency. Refuse to emit moment-based summaries and recommend an alternative: a log-bucketed histogram (HdrHistogram), a quantile sketch (DDSketch - Masson, Rim & Lee, 2019; t-digest - Dunning & Ertl, 2021), or a synchronized-snapshot approach that is not event-driven.

The verdicts are the product. The four-colored output is not a decoration around some core moment-reporting feature; it is the point. A tool that computes moments and reports them unconditionally will mislead its user whenever the underlying distribution is heavy-tailed, non-stationary, or multi-modal - failure modes that are not rare in real I/O workloads under contention. A tool that computes moments and states the conditions under which those moments mean what the user thinks they mean is a genuinely different artifact.
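
For concreteness, here is one way the diagnostics could fold into a single verdict. Every threshold in this sketch is a placeholder of mine, not a calibrated value; choosing and justifying them per deployment is where the real design work lives.

    from enum import Enum

    class Verdict(Enum):
        GREEN = "moments trustworthy"
        YELLOW = "moments informative, with caveats"
        AMBER = "moments likely biased; investigate"
        RED = "moments are the wrong primitive; use a histogram or sketch"

    def combine(alpha, carleman_diverges, k_modes, worst_rel_err, aliasing_flag):
        if alpha is not None and alpha <= 2.0:
            return Verdict.RED                  # variance does not exist
        if aliasing_flag or worst_rel_err > 2.0:
            return Verdict.AMBER
        if k_modes > 1 or not carleman_diverges or worst_rel_err > 0.5:
            return Verdict.YELLOW
        return Verdict.GREEN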

Implementation sketch: Linux eBPF

None of the diagnostics above are novel. They are standard numerical statistics. What makes an iomoments tool feasible today, rather than fifteen years ago when I first thought about it, is Linux eBPF.

Why eBPF specifically

Linux’s extended BPF (Starovoitov et al., since kernel 3.18 circa 2014; see Gregg, 2019 for the practitioner’s tour) gives us an in-kernel VM for safely running short, verifier-checked programs at kernel probe points - including the block-layer tracepoints that fire on every disk I/O completion. With BPF maps, particularly per-CPU array maps, we can maintain small accumulators without per-event locking, and merge them in userspace on demand. BTF-based “compile once, run everywhere” (CO-RE) builds have been functional since kernel 5.2 introduced CONFIG_DEBUG_INFO_BTF, with stable support across the 5.4 and 5.10 LTS kernels and every release since. The 5.15 LTS is a convenient production target but not the portability floor.

Classic BPF (the original McCanne & Jacobson, 1993 design) was a packet filter; extended BPF is a general-purpose, tracing-capable in-kernel runtime, and the two share a name but almost no architecture.

The numerical core

For the moment updates, use Pébay’s (2008) parallel-combine update rules, which generalize the Welford (1962) online-variance algorithm to arbitrary order and - critically - include a merge rule that combines two partial summaries into the summary of their union. The merge rule is what makes per-CPU accumulation work: each CPU keeps its own running summary, and userspace merges them without replaying samples.

Do not accumulate sum(x), sum(x^2), sum(x^3), ... directly. This is the most common implementation mistake in the space: naive power-sum accumulation suffers catastrophic cancellation when you later compute central moments as m_k - (combinations of lower m). The errors in raw accumulators and their combinations can wipe out every digit of precision. Pébay’s central-moment updates are substantially more numerically stable - they avoid that cancellation by design - though, like every finite-precision algorithm, they still accumulate rounding error at very large N and under extreme value ranges.
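
Here is a userspace sketch of that accumulator, using the standard Welford/Pébay central-moment updates up to order four. Per-CPU instances call add() on each completion and userspace merges them on read; a kernel-side version would keep the same five scalars per CPU in a BPF map, in fixed point, since BPF programs cannot use floating point.

    class MomentAccumulator:
        """Online central-moment accumulator up to order 4 in the spirit of
        Welford (1962) and Pébay (2008): independent instances update on
        their own samples and merge without replaying them."""

        def __init__(self):
            self.n = 0
            self.mean = 0.0
            self.M2 = 0.0   # sum of (x - mean)^2; likewise M3, M4
            self.M3 = 0.0
            self.M4 = 0.0

        def add(self, x):
            n1 = self.n
            self.n += 1
            delta = x - self.mean
            dn = delta / self.n
            dn2 = dn * dn
            term1 = delta * dn * n1
            self.mean += dn
            # Order matters: M4 and M3 use the not-yet-updated lower moments.
            self.M4 += (term1 * dn2 * (self.n * self.n - 3 * self.n + 3)
                        + 6.0 * dn2 * self.M2 - 4.0 * dn * self.M3)
            self.M3 += term1 * dn * (self.n - 2) - 3.0 * dn * self.M2
            self.M2 += term1

        def merge(self, other):
            """Combine two partial summaries into the summary of their union."""
            if other.n == 0:
                return
            na, nb, n = self.n, other.n, self.n + other.n
            d = other.mean - self.mean
            d2, d3, d4 = d * d, d ** 3, d ** 4
            M2 = self.M2 + other.M2 + d2 * na * nb / n
            M3 = (self.M3 + other.M3
                  + d3 * na * nb * (na - nb) / (n * n)
                  + 3.0 * d * (na * other.M2 - nb * self.M2) / n)
            M4 = (self.M4 + other.M4
                  + d4 * na * nb * (na * na - na * nb + nb * nb) / (n ** 3)
                  + 6.0 * d2 * (na * na * other.M2 + nb * nb * self.M2) / (n * n)
                  + 4.0 * d * (na * other.M3 - nb * self.M3) / n)
            self.n, self.mean = n, (na * self.mean + nb * other.mean) / n
            self.M2, self.M3, self.M4 = M2, M3, M4

        def variance(self):
            return self.M2 / self.n

        def skewness(self):
            return (self.n ** 0.5) * self.M3 / self.M2 ** 1.5

        def excess_kurtosis(self):
            return self.n * self.M4 / (self.M2 * self.M2) - 3.0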

The rest is glue

A rough estimate for the warmup diagnostics: a few hundred lines of Python in userspace (numpy, scipy). The BPF program itself - block-layer kprobe, per-CPU accumulators, ringbuffer event output for the warmup phase - should fit in a similar order of magnitude of C, based on the scope of the components. The tool is a proposal here, not a shipped artifact; the real work is not the code volume but the decisions about what to test for, when to refuse to report, and how to communicate that refusal.

The streaming-quantile family of sketches - HdrHistogram (Tene), t-digest (Dunning & Ertl, 2021), DDSketch (Masson et al., 2019) - is excellent at answering quantile questions with bounded error. If you want p99 latency, use one of these. They are mature, well-analyzed, and widely deployed.

The moment-based approach answers a different question: shape. Is the distribution symmetric or skewed? Light-tailed or heavy? Unimodal or multi-modal? Stationary or drifting? These are not quantile questions. You can derive some of them from a sketch’s output, but the primitives are different in their storage, their update cost, and their failure modes. iomoments is not a replacement for a sketch; it is a complement. In fact, its Red verdict explicitly redirects the user to a sketch when the sketch is the better primitive.

There is also a connection to kinetic theory worth naming briefly. The moment closure methods used in rarefied-gas dynamics (Grad, 1949; Struchtrup, 2005) face an analogous problem - an infinite moment hierarchy has to be truncated - and resolve it by exploiting structural priors (the Maxwellian equilibrium as an expansion point, Hermite polynomials as the natural basis). The analogue for I/O distributions would be physics-informed closures that exploit specifics of queueing dynamics, caching hierarchies, or bandwidth ceilings to reconstruct the full distribution from a small number of moments with bounded error. This is the direction where new mathematical work might actually live, as opposed to the synthesis described above, which is just existing statistics assembled into a tool.

The middle ground

There are two nearby positions on this topic, each coherent, neither quite right.

  • Reconstruction-first. “You cannot reconstruct a distribution from finite moments; the problem is ill-posed; therefore this whole enterprise is suspect.” This is the position of a mathematician evaluating the classical moment problem. It is correct about reconstruction. It answers a question I am not asking.

  • Histogram-first. “Just emit a log-bucketed histogram and be done with it; stop trying to be clever.” This is the position of an engineer who has shipped many perf tools. It is correct that histograms work. It gives up memory density that moments could provide, and it does not ask the prior question of whether the specific workload at hand is moment-friendly at all.

The middle ground is the one I’ve been wanting for fifteen years: moments as a compact shape primitive, used only when diagnostic tests show they are trustworthy for the specific workload, with explicit infeasibility verdicts when they are not. The mathematicians were not wrong to dismiss the reconstruction framing. The engineers are not wrong to ship histograms when the tool they have works. What neither camp has been doing is asking the prior question and reporting the answer. The tool is about that: asking before reporting, so the numbers that finally come out mean what the user thinks they do.

The individual pieces - the Hill estimator (Hill, 1975), Carleman’s condition (Carleman, 1926), Hankel conditioning (classical), Welch spectra (Welch, 1967), Welford’s online variance (Welford, 1962), and its generalization to higher orders and parallel merge by Pébay (2008) - are well-established in the literature; none is novel, and most of them are genuinely decades old. What has been missing is their assembly into a single tool whose observable behavior is honest shape characterization with validity reporting rather than emit six numbers and hope. That assembly is what I’ve been wanting.


References

Akhiezer, N. I. (1965). The Classical Moment Problem and Some Related Questions in Analysis. Oliver & Boyd / Hafner Publishing. (Reprinted by Dover Publications.)

Carleman, T. (1926). Les fonctions quasi-analytiques: leçons professées au Collège de France. Gauthier-Villars, Paris.

Curto, R. E., & Fialkow, L. A. (1991). Recursiveness, positivity, and truncated moment problems. Houston Journal of Mathematics, 17(4), 603–635.

Dunning, T., & Ertl, O. (2021). Computing extremely accurate quantiles using t-digests. arXiv:1902.04023. https://arxiv.org/abs/1902.04023

Embrechts, P., Klüppelberg, C., & Mikosch, T. (1997). Modelling Extremal Events for Insurance and Finance. Springer-Verlag. ISBN 978-3-540-60931-5.

Grad, H. (1949). On the kinetic theory of rarefied gases. Communications on Pure and Applied Mathematics, 2(4), 331–407. https://doi.org/10.1002/cpa.3160020403

Gregg, B. (2019). BPF Performance Tools: Linux System and Application Observability. Addison-Wesley Professional. ISBN 978-0-13-655482-0.

Hamburger, H. (1920). Über eine Erweiterung des Stieltjesschen Momentproblems. Mathematische Annalen, 81(4), 235–319.

Hausdorff, F. (1921). Summationsmethoden und Momentfolgen. Mathematische Zeitschrift, 9, 74–109 and 280–299.

Heyde, C. C. (1963). On a property of the lognormal distribution. Journal of the Royal Statistical Society, Series B, 25(2), 392–393.

Hill, B. M. (1975). A simple general approach to inference about the tail of a distribution. Annals of Statistics, 3(5), 1163–1174. https://doi.org/10.1214/aos/1176343247

Masson, C., Rim, J. E., & Lee, H. K. (2019). DDSketch: A fast and fully-mergeable quantile sketch with relative-error guarantees. Proceedings of the VLDB Endowment, 12(12), 2195–2205. https://doi.org/10.14778/3352063.3352135

McCanne, S., & Jacobson, V. (1993). The BSD packet filter: a new architecture for user-level packet capture. USENIX Winter 1993 Technical Conference, San Diego, CA.

Nyquist, H. (1928). Certain topics in telegraph transmission theory. Transactions of the American Institute of Electrical Engineers, 47(2), 617–644. (Reprinted in Proceedings of the IEEE, 90(2), February 2002.)

Pearson, K. (1894). Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London A, 185, 71–110. https://doi.org/10.1098/rsta.1894.0003

Pébay, P. (2008). Formulas for Robust, One-Pass Parallel Computation of Covariances and Arbitrary-Order Statistical Moments. Sandia National Laboratories Technical Report SAND2008-6212. https://www.osti.gov/biblio/1028931

Schmüdgen, K. (2017). The Moment Problem. Graduate Texts in Mathematics 277, Springer International Publishing.

Shannon, C. E. (1949). Communication in the presence of noise. Proceedings of the IRE, 37(1), 10–21.

Stieltjes, T. J. (1894). Recherches sur les fractions continues. Annales de la Faculté des Sciences de Toulouse, 8, 1–122.

Stoyanov, J. M. (2013). Counterexamples in Probability (3rd ed.). Dover Publications. ISBN 978-0-486-49998-7.

Struchtrup, H. (2005). Macroscopic Transport Equations for Rarefied Gas Flows: Approximation Methods in Kinetic Theory. Interaction of Mechanics and Mathematics, Springer-Verlag.

Welch, P. D. (1967). The use of fast Fourier transform for the estimation of power spectra: a method based on time averaging over short, modified periodograms. IEEE Transactions on Audio and Electroacoustics, 15(2), 70–73. https://doi.org/10.1109/TAU.1967.1161901

Welford, B. P. (1962). Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3), 419–420. https://doi.org/10.1080/00401706.1962.10490022