Newer Version Available
Aggregate Functions
Using an aggregate function on an empty set returns null. For example, if you use an aggregate function with a nonmatching column of an outer cogrouping, you might have an empty set.
Aggregation functions treat each line as its own group if not preceded by group by.
This table lists the aggregate functions that are supported:
| Aggregate Function | Description |
|---|---|
| avg() or average() | Returns the average value of a numeric field. For example, to calculate the average
number of
miles:
|
| count() | Returns the number of rows that match the query criteria. For example, to calculate
the number of
carriers:
The
count() function operates on streams that were inputs
to the group or cogroup statements. It doesn’t operate on the newly grouped stream or on an
ungrouped stream.
|
| first() | Returns the value for the first tuple. To work as expected, you must be aware of the
sort order or know that the values of that measure are the same for all tuples in the
set. For example, you can use these statements to compute the distance between each combination of origin and destination: |
| last() | Returns the value for the last tuple. For example, to compute the distance between each combination of origin and destination: |
| max() | Returns the maximum value of a field. This function takes only a measure as an argument. It can’t take a dimension. |
| median() | Accepts a grouped expression of numeric type and returns the middle number (by sorted
order, ignoring null values). If there is no one middle number (in other words, the count of
non-null values is even), then median returns the average of the two numbers closest to the
middle. The expression can be any identifier, such as ‘xInt’ or ‘price’, but cannot be a complex expression, such as price/100 or ceil(distance), or a literal, such as 2.5. |
| min() | Returns the minimum value of a field. This function takes only a measure as an argument. It can’t take a dimension. |
| sum() | Returns the sum of a numeric
field. |
| unique() | Returns the count of unique values. For example, to find how many origins and destinations a carrier flies from: |
| stddev() | Returns the sample standard deviation computed on the group. Accepts a grouped expression of numeric type. If the number of non-null values in the group is equal to 1, stddev return null. Otherwise, stddev returns the sample standard deviation computed on the group, ignoring null values. The expression can be any numeric identifier, such as 'xInt' or 'price', but cannot be a complex expression, such as price/100 or ceil(price), or a literal, such as 2.5. |
| stddevp() | Returns the population standard deviation computed on the group. Accepts a grouped expression of numeric type and returns the population standard deviation computed on the group, ignoring null values. The expression can be any numeric identifier, such as 'xInt' or 'price', but cannot be a complex expression, such as price/100 or ceil(price), or a literal, such as 2.5. |
| var() | Returns the sample variance (also called the unbiased variance) computed on the group.
Accepts a grouped expression of numeric type. If the number of non-null values in the group is equal to 1, var return null. Otherwise, var returns the sample variance computed on the group, ignoring null values. The expression can be any numeric identifier, such as 'xInt' or 'price', but cannot be a complex expression, such as price/100 or ceil(price), or a literal, such as 2.5. |
| varp() | Returns the population variance (also called the biased variance) computed on the
group. Accepts a grouped expression of numeric type and returns the population variance computed on the group, ignoring null values. The expression can be any numeric identifier, such as 'xInt' or 'price', but cannot be a complex expression, such as price/100 or ceil(price), or a literal, such as 2.5. |
| percentile_disc() | Computes a specific percentile for sorted values in an entire rowset or within
distinct partitions of a rowset. The full syntax is: percentile_disc(p as numeric) within group (order by expr [asc | desc]) The percentile_disc function accepts a grouped expression expr of numeric type and sorts it in the specified order (asc or desc). If order is not specified, the default order is asc. It returns the value behind which (100*p)% of values in the group would fall in the sorted order, ignoring null values. p can be any real numeric value between 0 and 1, and is accurate to 8 decimal places of precision. expr can be any identifier, such as 'xInt' or 'price', but cannot be a complex expression, such as price/100 or ceil(distance), or a literal, such as 2.5. If expr contains no value that falls exactly at the 100*p-th percentile mark, percentile_disc will return the next value from expr in the sort order. For example, if Mea1 contains the values [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] then: Example query: |
| percentile_cont() | Calculates a percentile based on a continuous distribution of the column value. The
full syntax is: percentile_cont(p as numeric) within group (order by expr [asc | desc]) The percentile_cont function accepts a grouped expression expr of numeric type and sorts it in the specified order (asc or desc). If the order is not specified, the default order is asc. It returns the value behind which (100*p)% of values in the group would fall in the sorted order, ignoring null values. p can be any real numeric value between 0 and 1. expr can be any identifier, such as 'xInt' or 'price', but cannot be a complex expression, such as price/100 or ceil(distance), or a literal, such as 2.5. If expr contains no value that falls exactly at the 100*p-th percentile mark, percentile_cont returns a value linear interpolated from the two closest values in expr. For example, if Mea1 contains the values [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] then: Example query: |
| regr_slope(y, x) |
The simple linear regression function regr_slope accepts a grouped dependent numeric expression y and a grouped independent numeric expression x, and returns the slope of the regression line. The regr_slope function only considers pairs of (x, y) values where both values are not null, and returns null if there exists fewer than 2 such pairs in the given group. Simple linear regression functions work like aggregation functions on simple grouped values, but do not work with cogroups. Example query: |
| regr_intercept(y, x) |
The simple linear regression function regr_intercept accepts a grouped dependent numeric expression y and a grouped independent numeric expression x, and returns the y-intercept for the regression line. The regr_intercept function only considers pairs of (x, y) values where both values are not null, and returns null if there exists fewer than 2 such pairs in the given group. Simple linear regression functions work like aggregation functions on simple grouped values, but do not work with cogroups. Example query: |
| regr_r2(y, x) |
The simple linear regression function regr_r2 accepts a grouped dependent numeric expression y and a grouped independent numeric expression x, and returns the coefficient of determination (also called R-squared or goodness of fit) for the regression. The regr_r2 function only considers pairs of (x, y) values where both values are not null, and returns null if there exists fewer than 2 such pairs in the given group. Simple linear regression functions work like aggregation functions on simple grouped values, but do not work with cogroups. Example query: |