Set Up an Apache Iceberg File Federation Connection
Set up the Apache Iceberg File Federation connection to start the flow of data into Data 360.
| User Permissions Needed | |
|---|---|
| To create a connection: | System Admin profile or Data Cloud Architect permission set |
Requirements
- Firewall: If the Iceberg data lake is behind a network firewall, add these Data 360 IP addresses to your access control list before configuring a connection. If different network firewalls protect the server hosting the lake's metadata catalog and the lake's storage bucket, update both. Both the metadata catalog and the storage bucket must be publicly accessible, and Data 360 doesn't support connecting over AWS PrivateLink or Azure Private Link. The one exception: if the AWS S3 bucket is in the same AWS region as the Data 360 tenant and a VPC gateway endpoint is provisioned for the bucket, Data 360 can transparently use AWS PrivateLink to connect to the bucket.
- Catalog: The metadata catalog (REST catalog) must both implement the REST OpenAPI specification and be publicly accessible.
| REST Catalog Endpoint | Description |
|---|---|
| /v1/config (Required) | Get all configuration metadata. When you save a connection, Data 360 makes a request to /v1/config, and it must receive a 200 response for the connection to be saved. See REST Config Spec. |
| /v1/{prefix}/namespaces (Required) | List all namespaces in the catalog. See Namespaces Spec. Even if the catalog doesn't support nested namespaces, make sure that a request to /v1/{prefix}/namespaces?parent={some_namespace} returns an empty list per the SupportsNamespaces API. |
| /v1/{prefix}/namespaces/{namespace} (Required) | Get metadata about the specified namespace. See Namespace Metadata Spec. |
| /v1/{prefix}/namespaces/{namespace}/tables (Required) | List all tables in the namespace. See Tables List Spec. |
| /v1/{prefix}/namespaces/{namespace}/tables/{table} (Required) | Get metadata about the specified table. See Table Metadata Spec. |
| /v1/oauth/tokens (Optional) | When the access token that you provide to Data 360 expires, Data 360 exchanges the expired token for a new one by making a token-exchange request to /v1/oauth/tokens. The request contains the required fields defined in RFC 8693. This endpoint is deprecated but still supported by some catalog implementations. See Token Exchange Spec. If your catalog doesn't support this endpoint, you must manually or programmatically update the connection to use a new access token. |
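To see what the save-time catalog checks amount to in practice, here's a minimal Python sketch that issues the same /v1/config and namespace-listing requests described in the table. This isn't Data 360's actual implementation: the URL and token are placeholders, and the prefix handling is an assumption, since some catalogs return a route prefix in the config response's overrides and others don't.

```python
# Sketch of a REST catalog preflight check; not Data 360's actual implementation.
# CATALOG_URL and ACCESS_TOKEN are placeholders for your own values.
import requests

CATALOG_URL = "https://myicebergdatalakehouse:443/iceberg"
ACCESS_TOKEN = "<client-access-token>"  # the OAuth 2.0 bearer token
HEADERS = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

# Data 360 must receive a 200 response from /v1/config to save the connection.
resp = requests.get(f"{CATALOG_URL}/v1/config", headers=HEADERS, timeout=30)
resp.raise_for_status()

# Some catalogs return a route prefix in the config response; default to none.
prefix = resp.json().get("overrides", {}).get("prefix")
base = f"{CATALOG_URL}/v1/{prefix}" if prefix else f"{CATALOG_URL}/v1"

# List top-level namespaces (the /v1/{prefix}/namespaces endpoint).
resp = requests.get(f"{base}/namespaces", headers=HEADERS, timeout=30)
resp.raise_for_status()
print(resp.json().get("namespaces", []))
```

The same pattern extends to the ?parent={some_namespace} query described in the namespaces row.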
- Storage: The data must be formatted as Apache Parquet files that are organized as Iceberg V1 tables and stored in one of these locations: a publicly accessible AWS S3 bucket, an AWS S3 bucket in a VPC in the same AWS region as the Data 360 org, a publicly accessible ADLS Gen2 container, or a publicly accessible Azure Blob Storage container. On-premises and GCP-hosted storage buckets aren't supported.
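One way to confirm the Iceberg V1 requirement before connecting is to load a table with PyIceberg and inspect its format version. This is an illustrative sketch, not a required step; the catalog URI, token, and table identifier are placeholders.

```python
# Illustrative check that a table is an Iceberg V1 table, using PyIceberg.
# The catalog URI, token, and table identifier below are placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "my_catalog",  # arbitrary local name for the catalog
    uri="https://myicebergdatalakehouse:443/iceberg",
    token="<client-access-token>",
)

table = catalog.load_table("my_namespace.my_table")

# Data 360 supports Iceberg V1 tables; V2/V3 row-level deletes aren't supported.
if table.metadata.format_version != 1:
    raise ValueError(f"Unsupported format version: {table.metadata.format_version}")
```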
Set Up the Connection
1. In Data Cloud, click Setup, and select Data Cloud Setup.
2. Under External Integrations, select Other Connectors.
3. Click New.
4. On the Source tab, select Apache Iceberg File Federation and click Next.
5. Enter a connection name and connection API name.
6. Enter the authentication details. If your data store's REST catalog supports storage credential vending, select CATALOG_PROVIDED. Otherwise, select S3 if the data is stored in AWS S3, or AZURE if the data is stored in either Azure Blob Storage or ADLS Gen2.
   - For REST Catalog, in the Client Access Token field, enter the OAuth 2.0 bearer token.
   - For Storage Bucket, if you didn't select CATALOG_PROVIDED, Data 360 requires additional information about your storage bucket or container. You can sanity-check these credentials outside Data 360; see the sketch after these steps.

   | Storage Type | Authentication Details |
   |---|---|
   | AWS S3 | Bucket Name: Provide the name of the storage bucket. See General purpose bucket naming rules - Amazon Simple Storage Service. Access Key ID: Provide the access key ID for the IAM user that Data 360 uses to access the storage bucket. See Manage access keys for IAM users. Secret Access Key: Provide the secret access key for the IAM user that Data 360 uses to access the storage bucket. See Manage access keys for IAM users. |
   | Azure Blob Storage or Azure Data Lake Gen2 | Storage Account Name: Provide the name of the storage account. See Storage account overview - Azure Storage on Microsoft Learn. SAS Token: Provide the shared access signature token that Data 360 uses to access the relevant storage container within the storage account. See Create shared access signature (SAS) tokens - Azure AI services on Microsoft Learn. |
7. In the Connection Details section, enter the publicly accessible HTTPS URL of the Apache Iceberg REST catalog. To make a request to an endpoint, Data 360 appends the endpoint to this URL. For example, a typical REST catalog URL is https://myicebergdatalakehouse:443/iceberg.
8. If your REST catalog's implementation of the /v1/config REST endpoint accepts a value for the optional warehouse parameter and you want Data 360 to read warehouse-specific configuration metadata, specify a value for the warehouse field.
9. To review your configuration, click Test Connection.
10. Click Save.
After the connector details are accepted, the connection is created and listed under Connectors. You can now create data streams.
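If you want to verify the storage credentials from step 6 before entering them, you can test them outside Data 360. This is a hedged sketch, not part of the product: the bucket, account, container, and credential values are placeholders, and it assumes the boto3 and azure-storage-blob packages are installed.

```python
# Sanity-check storage credentials before entering them in the connection form.
# All bucket, account, container, and credential values are placeholders.
import boto3
from azure.storage.blob import ContainerClient

# AWS S3: confirm the IAM access key pair can read the bucket.
s3 = boto3.client(
    "s3",
    aws_access_key_id="<access-key-id>",
    aws_secret_access_key="<secret-access-key>",
)
s3.list_objects_v2(Bucket="my-iceberg-bucket", MaxKeys=1)  # raises on bad credentials

# Azure: confirm the SAS token can read the container.
container = ContainerClient(
    account_url="https://mystorageaccount.blob.core.windows.net",
    container_name="my-iceberg-container",
    credential="<sas-token>",
)
next(container.list_blobs(), None)  # raises on an invalid or expired SAS token
```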
Keep these behaviors in mind when you set up an Apache Iceberg File Federation connection.
Storage Considerations
- Cross-Region S3 Storage Bucket: If the Data 360 org isn't in the same region as the S3 storage bucket and your catalog doesn't support storage credential vending, make sure that the server hosting the REST catalog includes the client.region property in the LoadTableResult object. See the Iceberg OpenAPI spec.
- S3 Storage Bucket Name: If you use AWS S3, the S3 bucket's name must not contain a period ("."). Buckets with periods in their names can be accessed only via path-style addressing, which AWS has marked for deprecation. Only buckets that can be accessed via virtual host-style addressing with SSL are supported. See Virtual hosting of general purpose buckets - Amazon Simple Storage Service.
- Azure Storage: If the data is stored in Azure, make sure that the paths to all files (the JSON metadata file, Avro manifest list file, Avro manifest files, and Parquet data files) included in the LoadTableResult object use either the abfs or abfss protocols and not the wasb or wasbs protocols. The latter two protocols have been officially deprecated by Microsoft in favor of the former.
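To catch deprecated wasb or wasbs paths ahead of time, you can scan the file locations in a saved LoadTableResult payload. A rough sketch, assuming you've saved the JSON response from the table metadata endpoint to a local file:

```python
# Scan a saved LoadTableResult payload for deprecated Azure storage protocols.
# "load_table_result.json" is a placeholder: a saved response body from the
# /v1/{prefix}/namespaces/{namespace}/tables/{table} endpoint.
import json

with open("load_table_result.json") as f:
    result = json.load(f)

# Collect the metadata file location plus each snapshot's manifest list location.
locations = [result["metadata-location"]]
locations += [
    s["manifest-list"]
    for s in result["metadata"].get("snapshots", [])
    if "manifest-list" in s
]

deprecated = [loc for loc in locations if loc.startswith(("wasb://", "wasbs://"))]
if deprecated:
    print("Deprecated wasb/wasbs paths found; rewrite them to abfs or abfss:")
    print("\n".join(deprecated))
```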
Other Considerations
- Row-Level Updates: Querying Iceberg tables that are configured to use Iceberg V2 merge-on-read (MoR) position or equality deletes, or Iceberg V3 deletion vectors, isn't supported.
- Views: Querying Iceberg views isn't supported.
- Namespaces: Up to two levels of nesting are supported: catalog -> namespace -> table, or catalog -> namespace -> namespace -> table. When you configure a data stream, the database dropdown displays the names of all top-level namespaces, and the schema dropdown displays the names of all secondary namespaces registered under the selected top-level namespace. If no namespaces are registered under the selected top-level namespace, the schema dropdown is empty. To preview how your catalog's hierarchy maps onto these dropdowns, see the sketch after this list.
- Temporal Data Types: The time and timestamp_ntz data types aren't supported.
- Governance: To allow Data 360 to see only a portion of the tables registered with the REST catalog, configure a service principal that is only allowed to see the relevant portion and provide Data 360 the access token that corresponds to that service principal.
- Change Data: Features in Data 360 (such as Data Actions) that require tracking changes to a data lake object aren't currently supported. To enable incremental change logging, Open Catalog would need to:
- Use the identifier-field-ids construct to define a primary key for the source Iceberg table.
- Provide a way for Data 360 to access the second-to-last (parent) snapshot of the source Iceberg table to compare changes. This functionality is not supported by Snowflake.
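To preview how the database and schema dropdowns map onto your catalog's namespace hierarchy (see the Namespaces consideration above), you can walk both levels with the REST API. A sketch under the same placeholder URL and token assumptions as earlier, for a catalog with no route prefix; per the REST spec, multipart parent namespaces join their parts with the 0x1F unit separator.

```python
# Walk up to two levels of namespaces, mirroring the database and schema dropdowns.
# Assumes a catalog with no route prefix; URL and token are placeholders.
import requests

CATALOG_URL = "https://myicebergdatalakehouse:443/iceberg"
HEADERS = {"Authorization": "Bearer <client-access-token>"}

def list_namespaces(parent_parts=None):
    # Per the REST spec, multipart parent namespaces join parts with 0x1F.
    params = {"parent": "\x1f".join(parent_parts)} if parent_parts else {}
    resp = requests.get(f"{CATALOG_URL}/v1/namespaces", headers=HEADERS,
                        params=params, timeout=30)
    resp.raise_for_status()
    # Each namespace is a list of name parts, for example ["sales", "emea"].
    return resp.json().get("namespaces", [])

for top in list_namespaces():           # database dropdown: top-level namespaces
    print("database:", ".".join(top))
    for child in list_namespaces(top):  # schema dropdown: second-level namespaces
        print("  schema:", ".".join(child))
```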