sample

Returns a random sample from a large dataset, where each data point has an equal probability of being selected. This keyword uses the Bernoulli distribution.

Syntax

sample(percentage-size-of-dataset) repeatable(seed)
sample
Required. Specifies the percentage of the dataset that is returned as a random sample. The percentage size value can be any positive decimal.
repeatable
Optional. To create a random sample deterministically, specify a seed. sample returns the same subset of data each time you pass repeatable the same seed value. The seed value can be any positive integer.
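To make the two parameters concrete, here is a minimal Python sketch of Bernoulli sampling with an optional seed. It only illustrates the idea and is not how the product implements sample; the function name bernoulli_sample, its parameters, and the 4,600-row stand-in dataset are all hypothetical.

```python
import random

def bernoulli_sample(rows, percentage, seed=None):
    """Keep each row independently with probability percentage / 100.

    Hypothetical helper for illustration only; not part of SAQL.
    """
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    p = percentage / 100.0
    return [row for row in rows if rng.random() < p]

# Stand-in dataset: 4,600 row ids (an assumed size, for illustration).
rows = list(range(4600))

first = bernoulli_sample(rows, 10, seed=1)
second = bernoulli_sample(rows, 10, seed=1)

print(first == second)  # True: same seed and same data give the same subset
print(len(first))       # close to 460, but rarely exactly 10% of the rows
```

Because each row is kept or dropped independently, the percentage sets the probability of selection rather than an exact sample size.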

Usage

Use sample to run a query against a representative sample of your dataset, where each data point has an equal probability of being selected. sample runs pre-projection.

Add sample and repeatable after the load statement. Any operation performed on the query after the load statement affects only the random sample of data. Let’s look at an example.

q = load "Opportunity" sample(10) repeatable(1);
q = group q by all;
q = foreach q generate count() as 'count';
q = limit q 2000;
Count of Rows
453

Here, the query returns the row count of the sample: 453, or around 10% of the dataset's 4.6k rows. With the repeatable keyword, the query returns the same result each time you run it, as long as the dataset is unchanged. Without the repeatable keyword, the query returns a sample of a slightly different size each time you run it. If you modify your dataset, for example by adding more data, then the same seed no longer returns the same result.
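The variation in sample size follows from Bernoulli sampling: the number of selected rows is binomially distributed. The short Python sketch below works out the expected size and its standard deviation. It assumes exactly 4,600 rows (the text gives only "4.6k"), and the function name sample_size_stats is hypothetical.

```python
import math

def sample_size_stats(total_rows, percentage):
    """Return the mean and standard deviation of a Bernoulli sample's size.

    Hypothetical helper; uses the binomial formulas n*p and sqrt(n*p*(1-p)).
    """
    p = percentage / 100.0
    mean = total_rows * p
    std_dev = math.sqrt(total_rows * p * (1 - p))
    return mean, std_dev

# Assumed dataset size of 4,600 rows with sample(10).
mean, std_dev = sample_size_stats(4600, 10)
print(f"expected rows: {mean:.0f}, standard deviation: {std_dev:.1f}")
# expected rows: 460, standard deviation: 20.3
```

Under these assumptions, the observed count of 453 is within one standard deviation of the expected 460, which is the kind of run-to-run difference you can expect without a seed.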

group-by Example

This query returns the counts of opportunities for each stage. Since the query operates on 10% of the dataset, the counts for each stage are approximately 1/10 of the original count.

q = load "Opportunity" sample(10) repeatable(1);
q = group q by 'StageName';
q = foreach q generate 'StageName', count() as 'count';
q = limit q 2000;
Stage                   Count of Rows
Closed Lost             89
Closed Won              254
Id. Decision Makers     13
Needs Analysis          15
Negotiation/Review      6
Perception Analysis     13
Proposal/Price Quote    9
Prospecting             10
Qualification           25
Value Proposition       19
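Since each sampled count is roughly 1/10 of the original, you can estimate the full-dataset counts by scaling the sampled counts back up by 100 divided by the sample percentage. The Python sketch below does this for a few rows of the results; the function name estimate_full_counts is hypothetical, and the outputs are estimates, not the dataset's actual counts.

```python
def estimate_full_counts(sample_counts, percentage):
    """Scale sampled counts up to approximate full-dataset counts.

    Hypothetical helper; each estimate is count * (100 / percentage).
    """
    scale = 100.0 / percentage
    return {stage: round(count * scale) for stage, count in sample_counts.items()}

# A few sampled counts taken from the group-by results for sample(10).
sample_counts = {"Closed Lost": 89, "Closed Won": 254, "Negotiation/Review": 6}

estimates = estimate_full_counts(sample_counts, 10)
print(estimates["Closed Won"])          # 2540
print(estimates["Negotiation/Review"])  # 60
```

Estimates built from small sampled counts, such as the 6 rows for Negotiation/Review, are much noisier than estimates built from large ones, so treat them as rough guides.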

filter Example

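This query returns the count of won opportunities, grouped by stage. Since the query operates on 10% of the dataset, the count for each stage is approximately 1/10 of the original count. Because this query omits repeatable, the sample, and therefore the count, can differ slightly each time you run it.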

q = load "Opportunity" sample(10);
q = filter q by 'IsWon' == "true";
q = group q by 'StageName';
q = foreach q generate 'StageName', count() as 'count';
q = limit q 2000;
Stage                   Count of Rows
Closed Won              275