Newer Version Available

This content describes an older version of this product. View Latest

Aggregate Functions

Use aggregate functions to perform computations on values.

Using an aggregate function on an empty set returns null. For example, if you use an aggregate function with a nonmatching column of an outer cogrouping, you might have an empty set.

Aggregation functions treat each line as its own group if not preceded by group by.

This table lists the aggregate functions that are supported:

Aggregate Function Description
avg() or average() Returns the average value of a numeric field.
For example, to calculate the average number of miles:
1a1 = group a by (origin, dest);
2a2 = foreach a1 generate origin as origin, dest as destination, average(miles) as miles;
count() Returns the number of rows that match the query criteria.
For example, to calculate the number of carriers:
1q = foreach q generate 'carrier' as 'carrier', count() as 'count';
The count() function operates on streams that were inputs to the group or cogroup statements. It doesn’t operate on the newly grouped stream or on an ungrouped stream.
1a = load "0Fcyy000000002qCAA/0Fcyy000000002WCAQ";
2a1 = group a by (Year);
3q = foreach a1 generate count(a) as countYear, count() as count, Year as year;
4q = limit q 20;
You can’t pass a1 to the count() function because it’s a newly grouped stream.
first() Returns the value for the first tuple. To work as expected, you must be aware of the sort order or know that the values of that measure are the same for all tuples in the set.

For example, you can use these statements to compute the distance between each combination of origin and destination:

1a1 = group a by (origin, dest);
2a2 = foreach a1 generate origin as origin, dest as destination, first(miles) as miles;
last() Returns the value for the last tuple.

For example, to compute the distance between each combination of origin and destination:

1a1 = group a by (origin, dest);
2a2 = foreach a1 generate origin as origin, dest as destination, last(miles) as miles;
max() Returns the maximum value of a field.

This function takes only a measure as an argument. It can’t take a dimension.

median() Accepts a grouped expression of numeric type and returns the middle number (by sorted order, ignoring null values). If there is no one middle number (in other words, the count of non-null values is even), then median returns the average of the two numbers closest to the middle.

The expression can be any identifier, such as ‘xInt’ or ‘price’, but cannot be a complex expression, such as price/100 or ceil(distance), or a literal, such as 2.5.

1q = load "data/airline";
2q = group q by dest;
3q = foreach q generate dest, median(miles) as medMiles;
4limit q 5;
If median is not preceded by a group by clause, it treats each individual row as its own group:
1q = load "data/airline";
2q = foreach q generate dest, median(miles) as medMiles;
3limit q 5;
min() Returns the minimum value of a field.

This function takes only a measure as an argument. It can’t take a dimension.

sum() Returns the sum of a numeric field.
1a = load "0Fbxx000000002qCAA/0Fcxx000000002WCAQ";
2a = filter a by dest in ["ORD", "LAX", "ATL", "DFW", "PHX", "DEN", "LGA"]; 
3a = group a by carrier; 
4b = foreach a generate carrier as airline, sum(miles) as miles;
unique() Returns the count of unique values.

For example, to find how many origins and destinations a carrier flies from:

1a1 = group a by carrier;
2a2 = foreach a1 generate carrier as carrier, unique(origin) as origins, unique(dest) as destinations;
stddev() Returns the sample standard deviation computed on the group.

Accepts a grouped expression of numeric type. If the number of non-null values in the group is equal to 1, stddev return null. Otherwise, stddev returns the sample standard deviation computed on the group, ignoring null values.

The expression can be any numeric identifier, such as 'xInt' or 'price', but cannot be a complex expression, such as price/100 or ceil(price), or a literal, such as 2.5.

1q = load "data/airline";
2q = group q by dest;
3q = foreach q generate dest, stddev(miles) as stddevMiles;
4limit q 5;
stddevp() Returns the population standard deviation computed on the group.

Accepts a grouped expression of numeric type and returns the population standard deviation computed on the group, ignoring null values. The expression can be any numeric identifier, such as 'xInt' or 'price', but cannot be a complex expression, such as price/100 or ceil(price), or a literal, such as 2.5.

1q = load "data/airline";
2q = group q by dest;
3q = foreach q generate dest, stddevp(miles) as stddevMiles;
4limit q 5;
var() Returns the sample variance (also called the unbiased variance) computed on the group.

Accepts a grouped expression of numeric type. If the number of non-null values in the group is equal to 1, var return null. Otherwise, var returns the sample variance computed on the group, ignoring null values. The expression can be any numeric identifier, such as 'xInt' or 'price', but cannot be a complex expression, such as price/100 or ceil(price), or a literal, such as 2.5.

1q = load "data/airline";
2q = group q by dest;
3q = foreach q generate dest, var(miles) as varMiles;
4limit q 5;
varp() Returns the population variance (also called the biased variance) computed on the group.

Accepts a grouped expression of numeric type and returns the population variance computed on the group, ignoring null values. The expression can be any numeric identifier, such as 'xInt' or 'price', but cannot be a complex expression, such as price/100 or ceil(price), or a literal, such as 2.5.

1q = load "data/airline";
2q = group q by dest;
3q = foreach q generate dest, varp(miles) as varMiles;
4limit q 5;
percentile_disc() Computes a specific percentile for sorted values in an entire rowset or within distinct partitions of a rowset. The full syntax is:

percentile_disc(p as numeric) within group (order by expr [asc | desc])

The percentile_disc function accepts a grouped expression expr of numeric type and sorts it in the specified order (asc or desc). If order is not specified, the default order is asc. It returns the value behind which (100*p)% of values in the group would fall in the sorted order, ignoring null values.

p can be any real numeric value between 0 and 1, and is accurate to 8 decimal places of precision. expr can be any identifier, such as 'xInt' or 'price', but cannot be a complex expression, such as price/100 or ceil(distance), or a literal, such as 2.5.

If expr contains no value that falls exactly at the 100*p-th percentile mark, percentile_disc will return the next value from expr in the sort order.

For example, if Mea1 contains the values [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] then:

1percentile_disc(0.25) within group (order by Mea1 asc) = 4
2percentile_disc(0.25) within group (order by Mea1 desc) = 9
3percentile_disc(0) within group (order by Mea1 asc) = 0
4percentile_disc(1) within group (order by Mea1 asc) = 13

Example query:

1q = load "data/airline";
2q = group q by dest;         
3q = foreach q generate dest, percentile_disc(0.25) within group (order by miles desc) as perMiles;
4limit q 5;
percentile_cont() Calculates a percentile based on a continuous distribution of the column value. The full syntax is:

percentile_cont(p as numeric) within group (order by expr [asc | desc])

The percentile_cont function accepts a grouped expression expr of numeric type and sorts it in the specified order (asc or desc). If the order is not specified, the default order is asc. It returns the value behind which (100*p)% of values in the group would fall in the sorted order, ignoring null values.

p can be any real numeric value between 0 and 1. expr can be any identifier, such as 'xInt' or 'price', but cannot be a complex expression, such as price/100 or ceil(distance), or a literal, such as 2.5.

If expr contains no value that falls exactly at the 100*p-th percentile mark, percentile_cont returns a value linear interpolated from the two closest values in expr.

For example, if Mea1 contains the values [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] then:

1percentile_cont(0.25) within group (order by Mea1 asc) = 3.25
2percentile_cont(0.25) within group (order by Mea1 desc) = 9.75
3percentile_cont(0) within group (order by Mea1 asc) = 0
4percentile_cont(1) within group (order by Mea1 asc) = 13

Example query:

1q = load "data/airline";
2q = group q by dest;
3q = foreach q generate dest, percentile_cont(0.25) within group (order by miles) as perMiles;
4limit q 5;