join semi and anti

Use the join statement with the join_type to create semi-join or anti-join results.

Usage

A semi-join returns the rows from one data stream if one or more matching rows are found in the second data stream. Each matched row is returned one time. The row data types must match for the specified data streams.

An anti-join returns the rows in the first data stream that don’t match any rows in the second data stream.

Syntax

results = join alias1 by (field1, ... fieldK) join_type, alias2 by (field1, ... fieldK)
Name Description
alias1 Required. The data stream to report semi-join or anti-join results for.
alias2 Required. The data stream to look for matches or no matches in.
field1 Required. The field name as it appears in the data stream. The field data type must be the same in alias1 and alias2 to match. Multi-value fields aren’t allowed. At least 1 field is requires, with a maximum of 5 fields allowed. Duplicate field names aren’t allowed in either data stream.
join_type Required. The type of join to run. Valid values are semi and anti.

The result stream contains the matched or unmatched rows from the alias1 data stream only. For a semi-join, a row from alias1 is only present if it satisfies the join criteria. The syntax supports equijoin (equality) criteria only. There isn't a guarantee that the rows in the result stream are in the same order as in alias1.

The parenthesis used to specify the fields are optional if there’s only 1 field.

The input data stream aliases must be unique. These streams can't be unprojected group or cogroup results, either directly or indirectly. The group or cogroup statement is made after the join statement.

The join alias field must be a dimension or a date. If you use a measure field as an alias, the query returns an error stating Error in join: non-dimension field: {field1} is not allowed in pre-projection alias: {alias1} at join keys list position: 1. This restriction is only for pre-projection alias. All data types are allowed in the post-projection alias.

For example, to use a measure in a join, project the measure field first. This query joins Number_of_Employees, a measure, by projecting it before running the join.

c = load \"cases\";
c = foreach c generate ID, Industry, Name, Year, Number_of_Employees;
a = load \"accounts\";
a = foreach a generate ID, Industry, Name, Year, Number_of_Employees;
a = join a by Number_of_Employees semi, c by Number_of_Employees;
a = order a by (ID);

Note

Semi-Join Syntax

a = join a by (id) semi, b by (id);

Anti-Join Syntax

a = join a by (id) anti, b by (id);

Multiway Semi-Join Syntax

join statements can be combined to form a multiway semi-join. A maximum of 3 join statements are allowed in a query. These statements combine into a conjunctive predicate.

a = join a by (id) semi, b by (id);
a = join a by (id) anti, c by (id);

join Use Cases

Use a join statement to query for accounts with at least 1 opportunity

account = load \"accounts\";
opp = load \"opportunities\";
q = join account by (id) semi, opp by (accountId);

Use a join statement to query for accounts with opportunity amount more than 10K.

account = load \"accounts\";
opp = load \"opportunities\";
opp = filter opp by amount > 10000;
q = join account by (id) semi, opp by (accountId);

Use a join statement to query for accounts with more than 10 opportunities.

account = load \"accounts\";
opp = load \"opportunities\";
opp = group opp by accountId;
opp = foreach opp generate accountId, count() as count;
opp = filter opp by count > 10;
q = join account by (id) semi, opp by (accountId);

Use a join statement to query for accounts with no opportunities.

account = load \"accounts\";
opp = load \"opportunities\";
q = join account by (id) anti, opp by (accountId);

Use a join statement to query for accounts with opportunities, but no orders.

account = load \"accounts\";
opp = load \"opportunities\";
orders = load \"orders\";
q = join account by (id) semi, opp by (accountId);
q = join q by (id) anti, orders by (accountId)

Null Handling

Running the join query with null fields is a special case. For the SAQL anti-join statement, null isn't equal to null, which differs from the cogroup statement. The behavior of the statement is the same as NOT EXISTS in SQL.

In this example, imagine you’re joining the accounts and the opportunities data streams, which contain these rows:

accounts opportunities
id account_id
1 1
2 NULL
NULL

For SAQL, this statement:

a = load \"accounts\";
opp = load \"opportunities\";
q = join a by (id) anti, opp by (account_id);
q = foreach q generate id, name;

has the same behavior as this SQL statement:

select id, name from accounts a where not EXISTS (select 1 from opportunities opp where opp.account_id = a.id);

The SAQL anti-join query returns two rows:

[
    { id : null },
    { id : 2 }
]

Considerations

Using a union statement and a join statement in the same query has strict enforcements. When a semi or anti join statement is present, the union statement returns an error if there are:

  • mismatched number of columns
  • mismatched data types

The errors appear as "Different number of fields found across union streams" or "Different types found for field 1: 'fieldName' in different union inputs."

There's also strict checking on the name of the columns. Using this SAQL example with a semi join statement and a union statement:
q1 = load "Opportunity";
q2 = load "Opportunity";

q1 =join q1 by (AccountId) semi, q3 by (AccountId);

q2 = group q2 by all;
q1 = foreach q1 generate q1.'AccountId' as 'AccountId',q1.StageName as 'StageName', q1.'Description' as 'Description', q1.'Id' as 'Id',q1.'Amount' as 'Amount', q1.'Probability' as 'Probability';
q2 = foreach q2 generate null as 'AccountId', null as 'StageName',null as 'Description',null as 'Id', sum(unique('Amount')) over([..] partition by all) as 'Amount_total',sum(unique('Probability')) over([..] partition by all) as 'Probability';
q1 = limit q1 5;

qU = union q1, q2;
When the join is present, strict name checking only allows for one ‘Amount’ column name, which comes from the first stream, q1. The ‘Amount_total’ summary column name from the second stream, q2, isn’t honored and only the normal 'Amount' column name is in the union results. If the join statement is removed, both the normal ‘Amount’ and the summary ‘Amount_total’ column names are in the union results.