Set Up an AWS Glue Data Catalog File Federation Connection (Beta)
Create an AWS Glue Data Catalog connection to pull data into Data 360.
| User Permissions Needed | |
|---|---|
| To create an AWS Glue Data Catalog Connection: | System Admin profile or Data 360 Architect permission set |
Requirements:
- Firewall: If your Glue Data Catalog-managed Iceberg data lakehouse is behind a network firewall, add these Data 360 IP addresses to your allowlist before creating a connection. If different network firewalls protect the Glue Data Catalog and the underlying S3 bucket, update both firewalls. Both the Glue Data Catalog and the S3 bucket must also be publicly accessible.
Connecting over AWS PrivateLink to either the Glue Data Catalog or the storage bucket isn't supported.
Set Up Connector
1. In Data Cloud, click Setup, and select Data Cloud Setup.
2. Under External Integrations, select Other Connectors.
3. Click New.
4. On the Source tab, select AWS Glue Data Catalog and click Next.
5. Enter a name for the connection.
6. In the Authentication Details section, enter the AWS Access Key ID and AWS Secret Access Key for the IAM user. The IAM user must be assigned a policy that grants read-only access to the Glue Data Catalog. Use AWS Lake Formation to configure storage credential vending, which allows the Glue Data Catalog to vend temporary credentials at runtime so that Data 360 can access the underlying storage bucket.
   If you don't configure storage credential vending, the policy must also explicitly cover the S3 bucket.
7. For Catalog URL, enter the Glue Data Catalog's URL. See Connecting to the Data Catalog using AWS Glue Iceberg REST endpoint.
8. For Catalog ID, enter the 12-digit AWS account ID linked to the Glue Data Catalog.
9. For Signing Region, enter the AWS Region that the Glue Data Catalog is hosted in.
10. For Signing Service, enter glue. Data 360 uses the Signing Service, together with the AWS Access Key ID, Secret Access Key, and Signing Region, to sign requests to the Glue Data Catalog with AWS Signature Version 4.
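As a starting point for the read-only policy the IAM user needs, the sketch below shows the general shape. The specific Glue actions listed are an assumption to verify against AWS documentation for your tables, not a definitive minimum:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GlueCatalogReadOnly",
      "Effect": "Allow",
      "Action": [
        "glue:GetCatalog",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetTable",
        "glue:GetTables"
      ],
      "Resource": "*"
    }
  ]
}
```

If you don't configure storage credential vending, also attach statements granting s3:GetObject and s3:ListBucket scoped to the underlying bucket.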
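The Signing Region and Signing Service fields map directly onto AWS Signature Version 4 key derivation. This stdlib-only sketch (placeholder credentials; an illustration of SigV4, not of Data 360's internals) shows how the signing key is derived from the Secret Access Key, the request date, the region, and the service name:

```python
import hashlib
import hmac

def _hmac_sha256(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

def sigv4_signing_key(secret_access_key: str, date_stamp: str,
                      region: str, service: str) -> bytes:
    """Derive the AWS SigV4 signing key.

    date_stamp is YYYYMMDD; region and service correspond to the
    connection's Signing Region and Signing Service fields.
    """
    k_date = _hmac_sha256(("AWS4" + secret_access_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")

# Placeholder secret key; with Signing Service "glue" and Signing Region
# "us-east-1", this is the key used to sign catalog requests for that day.
key = sigv4_signing_key("EXAMPLE-SECRET-KEY", "20250101", "us-east-1", "glue")
print(key.hex())
```

Because the service name is folded into the key, a request signed for glue is rejected if it's replayed against any other AWS service.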
Considerations
- Row-Level Updates: Querying Iceberg tables that are configured to use Iceberg V2 merge-on-read (MoR) position or equality deletes, or Iceberg V3 deletion vectors, isn't supported. To express a merge-on-read table as a copy-on-write (CoW) table, see Submitting Spark Jobs for MoR to CoW Conversion.
- Views: Querying Iceberg views isn't supported.
- Namespaces: Only single-level (catalog -> database -> table) and two-level (catalog -> database -> schema -> table) namespaces are supported.
- Temporal Data Types: The time and timestamp_ntz data types aren't supported.
- Cross-Region S3 Storage Bucket: If your Data 360 org isn't in the same region as your S3 storage bucket and your catalog doesn't support storage credential vending, the server hosting the REST catalog must include the client.region property in the LoadTableResult object. See the Iceberg OpenAPI specification.
- S3 Storage Bucket Name: If you use AWS S3, the S3 bucket's name must not contain a period (“.”). Buckets with periods in their names can be accessed only via path-style addressing, which AWS has marked for deprecation. Only buckets that can be accessed via virtual-hosted-style addressing with SSL are supported.
- Change Data: Certain features in Data Cloud, such as data actions, require the ability to detect when a data lake object changes. For Data Cloud to construct a change data feed, that is, a log of incremental changes to an external data lake object, a primary key must be specified. To use these features, use Iceberg's identifier-field-ids construct to define which columns make up a table's primary key, and ensure that your query engines (writers) respect that configuration.
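For the cross-region consideration above, the client.region hint travels in the config map of the REST catalog's LoadTableResult response. A trimmed, illustrative sketch (the bucket path and region are placeholders, and the full table metadata is omitted):

```json
{
  "metadata-location": "s3://example-bucket/warehouse/db/orders/metadata/00001.metadata.json",
  "metadata": {},
  "config": {
    "client.region": "us-west-2"
  }
}
```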
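Engines expose identifier-field-ids through DDL. With the Iceberg Spark SQL extensions enabled, a table's primary key can be declared as follows (the catalog, table, and column names are illustrative):

```sql
-- Mark order_id as the table's identifier (primary key) field so that
-- incremental changes can be keyed for a change data feed.
ALTER TABLE glue_catalog.sales.orders SET IDENTIFIER FIELDS order_id;
```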