join semi and anti
Usage
A semi-join returns the rows from one data stream if one or more matching rows are found in the second data stream. Each matched row is returned one time. The row data types must match for the specified data streams.
An anti-join returns the rows in the first data stream that don’t match any rows in the second data stream.
Syntax
results = join alias1 by (field1, ... fieldK) join_type, alias2 by (field1, ... fieldK)
Name | Description |
---|---|
alias1 | Required. The data stream to report semi-join or anti-join results for. |
alias2 | Required. The data stream to look for matches or no matches in. |
field1 | Required. The field name as it appears in the data stream. The field data type must be the same in alias1 and alias2 to match. Multi-value fields aren’t allowed. At least 1 field is requires, with a maximum of 5 fields allowed. Duplicate field names aren’t allowed in either data stream. |
join_type | Required. The type of join to run. Valid values are semi and anti. |
The result stream contains the matched or unmatched rows from the alias1 data stream only. For a semi-join, a row from alias1 is only present if it satisfies the join criteria. The syntax supports equijoin (equality) criteria only. There isn't a guarantee that the rows in the result stream are in the same order as in alias1.
The parenthesis used to specify the fields are optional if there’s only 1 field.
The input data stream aliases must be unique. These streams can't be unprojected group or cogroup results, either directly or indirectly. The group or cogroup statement is made after the join statement.
Semi-Join Syntax
a = join a by (id) semi, b by (id);
Anti-Join Syntax
a = join a by (id) anti, b by (id);
Multiway Semi-Join Syntax
join statements can be combined to form a multiway semi-join. A maximum of 3 join statements are allowed in a query. These statements combine into a conjunctive predicate.
a = join a by (id) semi, b by (id);
a = join a by (id) anti, c by (id);
join Use Cases
Use a join statement to query for accounts with at least 1 opportunity
account = load \"accounts\";
opp = load \"opportunities\";
q = join account by (id) semi, opp by (accountId);
Use a join statement to query for accounts with opportunity amount more than 10K.
account = load \"accounts\";
opp = load \"opportunities\";
opp = filter opp by amount > 10000;
q = join account by (id) semi, opp by (accountId);
Use a join statement to query for accounts with more than 10 opportunities.
account = load \"accounts\";
opp = load \"opportunities\";
opp = group opp by accountId;
opp = foreach opp generate accountId, count() as count;
opp = filter opp by count > 10;
q = join account by (id) semi, opp by (accountId);
Use a join statement to query for accounts with no opportunities.
account = load \"accounts\";
opp = load \"opportunities\";
q = join account by (id) anti, opp by (accountId);
Use a join statement to query for accounts with opportunities, but no orders.
account = load \"accounts\";
opp = load \"opportunities\";
orders = load \"orders\";
q = join account by (id) semi, opp by (accountId);
q = join q by (id) anti, orders by (accountId)
Null Handling
Running the join query with null fields is a special case. For the SAQL anti-join statement, null isn't equal to null, which differs from the cogroup statement. The behavior of the statement is the same as NOT EXISTS in SQL.
In this example, imagine you’re joining the accounts and the opportunities data streams, which contain these rows:
accounts | opportunities | ||
---|---|---|---|
id | account_id | ||
1 | 1 | ||
2 | NULL | ||
NULL |
For SAQL, this statement:
a = load \"accounts\";
opp = load \"opportunities\";
q = join a by (id) anti, opp by (account_id);
q = foreach q generate id, name;
has the same behavior as this SQL statement:
select id, name from accounts a where not EXISTS (select 1 from opportunities opp where opp.account_id = a.id);
The SAQL anti-join query returns two rows:
[
{ id : null },
{ id : 2 }
]
Considerations
Using a union statement and a join statement in the same query has strict enforcements. When a semi or anti join statement is present, the union statement returns an error if there are:
- mismatched number of columns
- mismatched data types
The errors appear as "Different number of fields found across union streams" or "Different types found for field 1: 'fieldName' in different union inputs."
q1 = load "Opportunity";
q2 = load "Opportunity";
q1 =join q1 by (AccountId) semi, q3 by (AccountId);
q2 = group q2 by all;
q1 = foreach q1 generate q1.'AccountId' as 'AccountId',q1.StageName as 'StageName', q1.'Description' as 'Description', q1.'Id' as 'Id',q1.'Amount' as 'Amount', q1.'Probability' as 'Probability';
q2 = foreach q2 generate null as 'AccountId', null as 'StageName',null as 'Description',null as 'Id', sum(unique('Amount')) over([..] partition by all) as 'Amount_total',sum(unique('Probability')) over([..] partition by all) as 'Probability';
q1 = limit q1 5;
qU = union q1, q2;