Set Up an AWS Glue Data Catalog File Federation Connection (Beta)
Create an AWS Glue Data Catalog connection to pull data into Data 360.
| User Permissions Needed | |
|---|---|
| To create an AWS Glue Data Catalog connection: | System Admin profile or Data 360 Architect permission set |
Requirements:
- Firewall: If your Glue Data Catalog-managed Iceberg data lakehouse is behind a firewall, add these Data 360 IP Addresses to your allowlist before creating a connection. If the Glue Data Catalog and the underlying S3 bucket are protected by different network firewalls, update both firewalls. Also make sure that both the Glue Data Catalog and the S3 bucket are publicly accessible.
Data 360 doesn't support connecting over AWS PrivateLink to either the Glue Data Catalog or the storage bucket.
Set Up the Connector
- In Data Cloud, click Setup, and select Data Cloud Setup.
- Under External Integrations, select Other Connectors.
- Click New.
- On the Source tab, select AWS Glue Data Catalog and click Next.
- Enter a name for the connection.
- In the Authentication Details section, enter the AWS Access Key ID and AWS Secret Access Key for the Identity and Access Management (IAM) user. Make sure that the IAM user has a policy that grants read-only access to the Glue Data Catalog. Use AWS Lake Formation to configure storage credential vending. With this approach, the Glue Data Catalog vends temporary credentials at run time so that Data 360 can access the underlying storage bucket.
  If you don't configure storage credential vending, make sure that the policy also explicitly covers the S3 bucket. Add this IAM policy, and remove the component of the previously mentioned policy that is specific to Lake Formation.
- For Catalog URL, enter the URL of the Glue Data Catalog. See Connecting to the Data Catalog by using AWS Glue Iceberg REST endpoint.
- For Catalog ID, enter the 12-digit AWS account ID linked to the Glue Data Catalog.
- For Signing Region, enter the AWS region that hosts the Glue Data Catalog.
- For Signing Service, enter glue. Data 360 requires the Signing Service, in addition to the AWS Access Key ID, Secret Access Key, and Signing Region, to sign requests to the Glue Data Catalog by using AWS Signature Version 4.
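To illustrate why the Signing Region and Signing Service are needed alongside the access keys, here is a minimal Python sketch of the AWS Signature Version 4 signing-key derivation, using only the standard library. It isn't Data 360's implementation. The function names, the example region, and the placeholder secret are assumptions made for illustration, and the Catalog ID check only mirrors the 12-digit requirement stated above.

```python
import hashlib
import hmac
import re


def is_valid_catalog_id(catalog_id: str) -> bool:
    """Return True if the value is exactly 12 ASCII digits (an AWS account ID)."""
    return re.fullmatch(r"[0-9]{12}", catalog_id) is not None


def _hmac_sha256(key: bytes, message: str) -> bytes:
    """HMAC-SHA256 of message under key, returned as raw bytes."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def derive_sigv4_signing_key(secret_access_key: str, date_stamp: str,
                             signing_region: str,
                             signing_service: str = "glue") -> bytes:
    """Derive the SigV4 signing key.

    date_stamp is the request date in YYYYMMDD form (UTC). The key is
    scoped to that date, the signing region, and the signing service.
    """
    k_date = _hmac_sha256(("AWS4" + secret_access_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, signing_region)
    k_service = _hmac_sha256(k_region, signing_service)
    return _hmac_sha256(k_service, "aws4_request")


def credential_scope(date_stamp: str, signing_region: str,
                     signing_service: str = "glue") -> str:
    """Build the credential scope string that accompanies a signed request."""
    return f"{date_stamp}/{signing_region}/{signing_service}/aws4_request"


if __name__ == "__main__":
    # Placeholder values for illustration only; never hard-code real secrets.
    key = derive_sigv4_signing_key("EXAMPLE-SECRET", "20250101", "us-east-1")
    print(len(key), credential_scope("20250101", "us-east-1"))
```

Because the region and service are mixed into the key, a request signed for the wrong region or for a service other than glue produces a different signature and is rejected by the catalog.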
Storage Considerations
- Cross-Region S3 Storage Bucket: If your Data 360 org isn't in the same region as your S3 storage bucket and your catalog doesn't support storage credential vending, make sure that the server hosting the REST catalog returns the client.region property in the LoadTableResult object. See Iceberg OpenAPI.
- S3 Storage Bucket Name: If you use AWS S3, make sure that the S3 bucket's name doesn't contain a period. Buckets whose names contain a period must be accessed via path-style addressing, which AWS has marked for deprecation. Data 360 supports only buckets that are accessed via virtual host-style addressing with SSL.
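The two storage considerations above can be checked before you create the connection. The following Python sketch is one way to do that, under stated assumptions: the function names are hypothetical, the URL pattern is the standard AWS virtual host-style form, and the lookup assumes that client.region appears in the `config` map of the LoadTableResult response, as the Iceberg REST OpenAPI specification describes it.

```python
from typing import Any, Dict, Optional


def bucket_name_is_supported(bucket_name: str) -> bool:
    """Return True if the bucket name contains no period."""
    return "." not in bucket_name


def virtual_hosted_url(bucket_name: str, region: str, key: str) -> str:
    """Build a virtual host-style HTTPS URL for an S3 object.

    Raises ValueError for bucket names that contain a period, because
    those buckets need path-style addressing.
    """
    if not bucket_name_is_supported(bucket_name):
        raise ValueError(f"bucket name contains a period: {bucket_name!r}")
    return f"https://{bucket_name}.s3.{region}.amazonaws.com/{key}"


def client_region(load_table_result: Dict[str, Any]) -> Optional[str]:
    """Return client.region from a parsed LoadTableResult, or None if absent.

    Assumption: the property is carried in the response's "config" map.
    """
    return load_table_result.get("config", {}).get("client.region")


if __name__ == "__main__":
    # Hypothetical bucket, region, and response used for illustration.
    print(virtual_hosted_url("my-data-lake", "us-west-2", "warehouse/t/metadata.json"))
    print(client_region({"config": {"client.region": "us-west-2"}}))
```

A `None` result from `client_region` for a cross-region bucket without credential vending suggests that the catalog server still needs the property configured.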
Other Considerations
- Row-Level Updates: Querying Iceberg tables that are configured to use Iceberg V2 merge-on-read (MoR) position or equality deletes, or Iceberg V3 deletion vectors, isn't supported. To convert a merge-on-read table to a copy-on-write (CoW) table, see Submitting Spark Jobs for MoR to CoW Conversion.
- Views: Querying Iceberg views isn't supported.
- Namespaces: Only single-level (catalog -> database -> table) and two-level (catalog -> database -> schema -> table) namespaces are supported.
- Temporal Data Types: The time and timestamp_ntz data types aren't supported.
- Change Data: Certain features in Data Cloud, such as data actions, require the ability to detect when a data lake object changes. To construct a change data feed, that is, a log of incremental changes to an external data lake object, Data Cloud needs a primary key. For such features, use Iceberg's identifier-field-ids construct to define which columns make up a table's primary key, and make sure that your query engines (writers) respect your configuration.
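Several of the considerations above can be checked against a table's Iceberg metadata before you rely on it. The Python sketch below is a hypothetical preflight helper, not a Data 360 tool. It assumes the metadata has been parsed from the table metadata JSON into a dictionary, and it treats the standard Iceberg table properties write.delete.mode, write.update.mode, and write.merge.mode as a hint that writers produce merge-on-read deletes. Those properties show how writers are configured; they don't prove whether delete files already exist.

```python
from typing import Any, Dict, List

# Iceberg table properties that select copy-on-write or merge-on-read.
ROW_LEVEL_MODE_PROPERTIES = (
    "write.delete.mode",
    "write.update.mode",
    "write.merge.mode",
)


def preflight_issues(namespace: List[str], metadata: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable issues; an empty list means none found.

    namespace holds the levels between catalog and table, for example
    ["database"] or ["database", "schema"].
    """
    issues: List[str] = []

    # Only single-level and two-level namespaces are supported.
    if not 1 <= len(namespace) <= 2:
        issues.append(f"namespace has {len(namespace)} levels; expected 1 or 2")

    # Flag writers configured for merge-on-read row-level operations.
    properties = metadata.get("properties", {})
    for name in ROW_LEVEL_MODE_PROPERTIES:
        if properties.get(name) == "merge-on-read":
            issues.append(f"{name} is set to merge-on-read")

    # Change data features need a primary key via identifier-field-ids.
    current_id = metadata.get("current-schema-id")
    schema = next(
        (s for s in metadata.get("schemas", []) if s.get("schema-id") == current_id),
        None,
    )
    if schema is None or not schema.get("identifier-field-ids"):
        issues.append("current schema defines no identifier-field-ids")

    return issues


if __name__ == "__main__":
    # Hypothetical metadata fragment used for illustration.
    example = {
        "current-schema-id": 0,
        "schemas": [{"schema-id": 0, "identifier-field-ids": [1], "fields": []}],
        "properties": {"write.delete.mode": "copy-on-write"},
    }
    print(preflight_issues(["sales"], example))
```

Returning a list of issues instead of raising on the first problem lets you report every unsupported setting for a table in one pass.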