Class Quantiles
- java.lang.Object
-
- com.google.common.math.Quantiles
-
@Beta @GwtIncompatible public final class Quantiles extends java.lang.Object
Provides a fluent API for calculating quantiles.Examples
To compute the median:
wheredouble myMedian = median().compute(myDataset);
median()
has been statically imported.To compute the 99th percentile:
wheredouble myPercentile99 = percentiles().index(99).compute(myDataset);
percentiles()
has been statically imported.To compute median and the 90th and 99th percentiles:
whereMap<Integer, Double> myPercentiles = percentiles().indexes(50, 90, 99).compute(myDataset);
percentiles()
has been statically imported:myPercentiles
maps the keys 50, 90, and 99, to their corresponding quantile values.To compute quartiles, use
quartiles()
instead ofpercentiles()
. To compute arbitrary q-quantiles, usescale(q)
.These examples all take a copy of your dataset. If you have a double array, you are okay with it being arbitrarily reordered, and you want to avoid that copy, you can use
computeInPlace
instead ofcompute
.Definition and notes on interpolation
The definition of the kth q-quantile of N values is as follows: define x = k * (N - 1) / q; if x is an integer, the result is the value which would appear at index x in the sorted dataset (unless there are
NaN
values, see below); otherwise, the result is the average of the values which would appear at the indexes floor(x) and ceil(x) weighted by (1-frac(x)) and frac(x) respectively. This is the same definition as used by Excel and by S, it is the Type 7 definition in R, and it is described by wikipedia as providing "Linear interpolation of the modes for the order statistics for the uniform distribution on [0,1]."Handling of non-finite values
If any values in the input are
NaN
then all values returned areNaN
. (This is the one occasion when the behaviour is not the same as you'd get from sorting withArrays.sort(double[])
orCollections.sort(List<Double>)
and selecting the required value(s). Those methods would sortNaN
as if it is greater than any other value and place them at the end of the dataset, even afterPOSITIVE_INFINITY
.)Otherwise,
NEGATIVE_INFINITY
andPOSITIVE_INFINITY
sort to the beginning and the end of the dataset, as you would expect.If required to do a weighted average between an infinity and a finite value, or between an infinite value and itself, the infinite value is returned. If required to do a weighted average between
NEGATIVE_INFINITY
andPOSITIVE_INFINITY
,NaN
is returned (note that this will only happen if the dataset contains no finite values).Performance
The average time complexity of the computation is O(N) in the size of the dataset. There is a worst case time complexity of O(N^2). You are extremely unlikely to hit this quadratic case on randomly ordered data (the probability decreases faster than exponentially in N), but if you are passing in unsanitized user data then a malicious user could force it. A light shuffle of the data using an unpredictable seed should normally be enough to thwart this attack.
The time taken to compute multiple quantiles on the same dataset using
indexes
is generally less than the total time taken to compute each of them separately, and sometimes much less. For example, on a large enough dataset, computing the 90th and 99th percentiles together takes about 55% as long as computing them separately.When calling
Quantiles.ScaleAndIndex.compute(java.util.Collection<? extends java.lang.Number>)
(in either form), the memory requirement is 8*N bytes for the copy of the dataset plus an overhead which is independent of N (but depends on the quantiles being computed). When callingcomputeInPlace
(in either form), only the overhead is required. The number of object allocations is independent of N in both cases.- Since:
- 20.0
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
Quantiles.Scale
Describes the point in a fluent API chain where only the scale (i.e.static class
Quantiles.ScaleAndIndex
Describes the point in a fluent API chain where the scale and a single quantile index (i.e.static class
Quantiles.ScaleAndIndexes
Describes the point in a fluent API chain where the scale and a multiple quantile indexes (i.e.
-
Constructor Summary
Constructors Constructor Description Quantiles()
-
Method Summary
All Methods Static Methods Concrete Methods Modifier and Type Method Description private static void
checkIndex(int index, int scale)
private static int
chooseNextSelection(int[] allRequired, int requiredFrom, int requiredTo, int from, int to)
Chooses the next selection to do from the required selections.private static boolean
containsNaN(double... dataset)
Returns whether any of the values indataset
areNaN
.private static double
interpolate(double lower, double upper, double remainder, double scale)
Returns a value a fraction(remainder / scale)
of the way betweenlower
andupper
.private static double[]
intsToDoubles(int[] ints)
private static double[]
longsToDoubles(long[] longs)
static Quantiles.ScaleAndIndex
median()
Specifies the computation of a median (i.e.private static void
movePivotToStartOfSlice(double[] array, int from, int to)
Selects the pivot to use, namely the median of the values atfrom
,to
, and halfway between the two (rounded down), fromarray
, and ensure (by swapping elements if necessary) that that pivot value appears at the start of the slice i.e.private static int
partition(double[] array, int from, int to)
Performs a partition operation on the slice ofarray
with elements in the range [from
,to
].static Quantiles.Scale
percentiles()
Specifies the computation of percentiles (i.e.static Quantiles.Scale
quartiles()
Specifies the computation of quartiles (i.e.static Quantiles.Scale
scale(int scale)
Specifies the computation of q-quantiles.private static void
selectAllInPlace(int[] allRequired, int requiredFrom, int requiredTo, double[] array, int from, int to)
Performs an in-place selection, likeselectInPlace(int, double[], int, int)
, to select all the indexesallRequired[i]
fori
in the range [requiredFrom
,requiredTo
].private static void
selectInPlace(int required, double[] array, int from, int to)
Performs an in-place selection to find the element which would appear at a given index in a dataset if it were sorted.private static void
swap(double[] array, int i, int j)
Swaps the values ati
andj
inarray
.
-
-
-
Method Detail
-
median
public static Quantiles.ScaleAndIndex median()
Specifies the computation of a median (i.e. the 1st 2-quantile).
-
quartiles
public static Quantiles.Scale quartiles()
Specifies the computation of quartiles (i.e. 4-quantiles).
-
percentiles
public static Quantiles.Scale percentiles()
Specifies the computation of percentiles (i.e. 100-quantiles).
-
scale
public static Quantiles.Scale scale(int scale)
Specifies the computation of q-quantiles.- Parameters:
scale
- the scale for the quantiles to be calculated, i.e. the q of the q-quantiles, which must be positive
-
containsNaN
private static boolean containsNaN(double... dataset)
Returns whether any of the values indataset
areNaN
.
-
interpolate
private static double interpolate(double lower, double upper, double remainder, double scale)
Returns a value a fraction(remainder / scale)
of the way betweenlower
andupper
. Assumes thatlower <= upper
. Correctly handles infinities (but notNaN
).
-
checkIndex
private static void checkIndex(int index, int scale)
-
longsToDoubles
private static double[] longsToDoubles(long[] longs)
-
intsToDoubles
private static double[] intsToDoubles(int[] ints)
-
selectInPlace
private static void selectInPlace(int required, double[] array, int from, int to)
Performs an in-place selection to find the element which would appear at a given index in a dataset if it were sorted. The following preconditions should hold:required
,from
, andto
should all be indexes intoarray
;required
should be in the range [from
,to
];- all the values with indexes in the range [0,
from
) should be less than or equal to all the values with indexes in the range [from
,to
]; - all the values with indexes in the range (
to
,array.length - 1
] should be greater than or equal to all the values with indexes in the range [from
,to
].
from
,to
] such that all the values with indexes in the range [from
,required
) are less than or equal to the value with indexrequired
, and all the values with indexes in the range (required
,to
] are greater than or equal to that value. Therefore, the value atrequired
is the value which would appear at that index in the sorted dataset.
-
partition
private static int partition(double[] array, int from, int to)
Performs a partition operation on the slice ofarray
with elements in the range [from
,to
]. Uses the median offrom
,to
, and the midpoint between them as a pivot. Returns the index which the slice is partitioned around, i.e. if it returnsret
then we know that the values with indexes in [from
,ret
) are less than or equal to the value atret
and the values with indexes in (ret
,to
] are greater than or equal to that.
-
movePivotToStartOfSlice
private static void movePivotToStartOfSlice(double[] array, int from, int to)
Selects the pivot to use, namely the median of the values atfrom
,to
, and halfway between the two (rounded down), fromarray
, and ensure (by swapping elements if necessary) that that pivot value appears at the start of the slice i.e. atfrom
. Expects thatfrom
is strictly less thanto
.
-
selectAllInPlace
private static void selectAllInPlace(int[] allRequired, int requiredFrom, int requiredTo, double[] array, int from, int to)
Performs an in-place selection, likeselectInPlace(int, double[], int, int)
, to select all the indexesallRequired[i]
fori
in the range [requiredFrom
,requiredTo
]. These indexes must be sorted in the array and must all be in the range [from
,to
].
-
chooseNextSelection
private static int chooseNextSelection(int[] allRequired, int requiredFrom, int requiredTo, int from, int to)
Chooses the next selection to do from the required selections. It is required that the arrayallRequired
is sorted and thatallRequired[i]
are in the range [from
,to
] for alli
in the range [requiredFrom
,requiredTo
]. The value returned by this method is thei
in that range such thatallRequired[i]
is as close as possible to the center of the range [from
,to
]. Choosing the value closest to the center of the range first is the most efficient strategy because it minimizes the size of the subranges from which the remaining selections must be done.
-
swap
private static void swap(double[] array, int i, int j)
Swaps the values ati
andj
inarray
.
-
-