Newer Version Available

This content describes an older version of this product. View Latest

cogroup

Use cogroup to combine data from two or more data streams into a single data stream. The data streams must have at least one common field.
cogroup is similar to relational database joins, but with some important differences. Unlike a relational database join, in a cogroup the datasets are grouped first, and then the groups are joined. You can use cogroup in these ways:
  • inner cogroup
  • left outer cogroup
  • right outer cogroup
  • full outer cogroup

The statements cogroup and group are interchangeable. For clarity, we use group for statements involving one data stream and cogroup for statements involving two or more data streams.

Note

Inner cogroup

Inner cogroup combines data from two or more data streams into a resulting data stream. The resulting data stream only contains values that exist in both data streams. That is, unmatched records are dropped.
Syntax
1result = cogroup data_stream_1 by field1, data_stream_2 by field2;

field1 and field2 must be the same type, but can have different names. For example, q=group ops by 'Owner', quota by 'Name';

Example - Inner cogroup

Suppose that you want to understand how much time your reps spend meeting with each account. Is there a relationship between spending more time and winning an account? Are some reps spending much more or much less time than average? To answer these questions, first combine meeting data with account data using cogroup.

Suppose that you have a dataset of meeting information from the Salesforce Event object. In this example, your reps have had six meetings with four different companies. The Meetings dataset has a MeetingDuration column, which contains the meeting duration in hours.

Diagram showing the meeting dataset.

The account data exists in the Salesforce Opportunity object. The Ops dataset has an Account, Won, and Amount column. The Amount column contains the dollar value of the opportunity, in millions.

Diagram showing the Ops datasets.

To see the effect of meeting duration on opportunities, you start by combining these two datasets into a single data stream using cogroup.

1q = cogroup ops by 'Account', meetings by 'Company';

Internally (you cannot see these results yet), the resulting cogrouped data stream contains the following data. Note how the data streams are rolled up on one or more dimensions.

1(1,{(Shoes2Go,2,), (Shoes2Go,5)},{(Shoes2Go,1,1.5), (Shoes2Go,0,3})
2
3(2,{(FreshMeals,3), (FreshMeals, 5)},{(FreshMeals,1,2) (FreshMeals, 1, 1.4)})
4
5(3,{(ZipBikeShare,4)},{(ZipBikeShare,1, 1.1)})
6
7(4,{(ZenRetreats,6)},{(ZenRetreats,0, 2)})

Now the datasets are combined. To see the data, you create a projection using foreach:

1ops = load "Ops";
2meetings = load "Meetings";
3q = cogroup ops by 'Account', meetings by 'Company';
4q = foreach q generate ops.'Account' as 'Account', sum(ops.'Amount') as 'sum_Amount', sum(meetings.'MeetingDuration') as 'TimeSpent';

The resulting data stream contains the sum of amount and total meeting time for each company. The sum of amount is the sum of the dollar value for every opportunity for the company.

Diagram showing the combined dataset.

Now that you have combined the data into a single data stream, you can analyze the effects that total meeting time has on your opportunities.

Left Outer cogroup

Left outer cogroup combines data from the right data stream with the left data stream. The resulting data stream only contains values that exist in the left data stream. If the left data stream has a value that the right data stream does not, the missing value is null in the resulting data stream.

Use coalesce to replace a null value with the value of your choice.

Tip

Syntax
1result = cogroup data_stream_1 by field1 left, data_stream_2 by field2;

field1 and field2 must be the same type, but can have different names. For example, q=group ops by 'Owner' left, quota by 'Name';

Example - Left Outer cogroup With coalesce

Suppose that you want to see what percentage of quota that your reps have obtained. Your quota dataset shows each employee's quota (notice that Farah does not have a quota):

Quota dataset.

Your opportunities data shows the opportunity amount that each employee has won (notice that Jonathan does not have a won opportunity).

Opportunities dataset.

Use a left outer cogroup to show only employees that have quotas. Also show the percentage of quota attained.

1quota = load "Quota";
2opp = load "Opportunity";
3q = group quota by 'Employee' left, opp by 'Employee';
4q = foreach q generate quota.'Employee' as 'Employee', trunc(sum(opp.'Amount')/sum(quota.'Quota')*100, 2) as 'Percent Attained';

Jonathan has not won any opportunities yet, so his percent attained is null.

Opportunities dataset.

Use coalesce to replace the null opportunities with a zero.

1quota = load "Quota";
2opp = load "Opportunity";
3q = group quota by 'Employee' left, opp by 'Employee';
4q = foreach q generate quota.'Employee' as 'Employee', trunc(coalesce(sum(opp.'Amount'),0)/sum(quota.'Quota')*100, 2) as 'Percent Attained';

Now Jonathan's percent attained is displayed as zero.

Opportunities dataset.

Right Outer cogroup

Right outer cogroup combines data from the left data stream with the right data stream. The resulting data stream only contains values that exist in the right data stream. If the right data stream has a value that the left data stream does not, the missing value is null in the resulting data stream.

Use coalesce to replace a null value with the value of your choice.

Tip

Syntax
1result = cogroup data_stream_1 by field1 right, data_stream_2 by field2;

field1 and field2 must be the same type, but can have different names. For example, q=group ops by 'Owner' right, quota by 'Name';

Full Outer cogroup

Full outer cogroup combines data from the left and right data streams. The resulting data stream contains all values. If one data stream has a value that the other data stream does not, the missing value is null in the resulting data stream.

Use coalesce to replace a null value with the value of your choice.

Tip

Syntax
1result = cogroup data_stream_1 by field1 full, data_stream_2 by field2;

field1 and field2 must be the same type, but can have different names. For example, q=group ops by 'Owner' full, quota by 'Name';