Set Up a Databricks File Federation Connection
Configure a connection between Data Cloud and an AWS- or Azure-hosted Databricks workspace and federate data into Data Cloud.
| User Permissions Needed | |
| --- | --- |
| To create a connection in Data Cloud: | System Administrator |
Prerequisites:
- Firewall: If the Databricks workspace is behind a network firewall, add these Data Cloud IP addresses to your access control list before configuring a connection. If the network firewall that protects the server hosting the catalog is distinct from the network firewall that protects the storage bucket, make sure that you update both. Both the Unity Catalog and the storage bucket must be publicly accessible. Connecting over AWS PrivateLink or Azure Private Link is not supported.
- Catalog: The Databricks workspace must be managed by a Unity Catalog, and the Unity Catalog must be configured to generate Iceberg metadata for every Delta table (see the sketch after this list). See Read Delta Tables with Iceberg Clients. Data Cloud doesn't support file federation with a workspace that is managed by a Hive Metastore.
- Storage: If the Databricks workspace is hosted on AWS, data must be stored in an AWS S3 storage bucket. If the Databricks workspace is hosted on Azure, data must be stored in an Azure Data Lake Storage Gen2 storage container.
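For reference, the Databricks article linked above describes table properties that turn on Iceberg metadata generation (UniForm) for a Delta table. The following is a minimal sketch of enabling it for a single table: the catalog, schema, and table names are hypothetical, and the property names and runtime requirements should be confirmed against Read Delta Tables with Iceberg Clients for your Databricks Runtime version.

```python
# Run in a Databricks notebook attached to Unity Catalog-enabled compute.
# The table name is a hypothetical placeholder; the property names follow the
# Databricks UniForm documentation (verify them for your runtime version).
spark.sql("""
    ALTER TABLE main.sales.orders SET TBLPROPERTIES (
        'delta.columnMapping.mode'             = 'name',
        'delta.enableIcebergCompatV2'          = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```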
Set Up Connection
1. In Data Cloud, go to Data Cloud Setup.
2. Under External Integrations, select Other Connectors.
3. On the Source tab, select Databricks File Federation and click Next.
4. Enter a connection name and connection API name.
5. In the Authentication Details section, select CATALOG_PROVIDED if your Unity Catalog supports storage credential vending. Otherwise, select S3 if data is stored in AWS S3, or AZURE if data is stored in either Azure Blob Storage or ADLS Gen2.
   - Unity Catalog: In the Personal Access Token field, enter the personal access token for your Databricks workspace. Data Cloud uses the personal access token to communicate with the Unity Catalog. See Databricks personal access token authentication.
   - Storage Bucket: If you didn't select CATALOG_PROVIDED, provide additional information about your storage bucket or container.
| Storage Type | Authentication Details |
| --- | --- |
| Azure Blob Storage or Azure Data Lake Storage Gen2 | Storage Account Name - Provide the name of the storage account. |
| | SAS Token - Provide the shared access signature token that Data Cloud uses to access the relevant storage container within the storage account. |
| S3 | Bucket Name - Provide the name of the storage bucket. |
| | Access Key - Provide the access key ID for the IAM user that Data Cloud uses to access the storage bucket. |
| | Secret Key - Provide the secret access key for the IAM user that Data Cloud uses to access the storage bucket. |
| | AWS Region - Provide the name of the AWS region the storage bucket is hosted in. See Regions, Availability Zones, and Local Zones - Amazon Relational Database Service for the list of AWS regions. |
6. In the Connection Details section, enter the publicly accessible HTTPS URL of the Unity Catalog. If the Databricks workspace is hosted on AWS, the format of the URL is https://<workspace-instance>/api/2.1/unity-catalog/iceberg. If the Databricks workspace is hosted on Azure, the format of the URL is https://adb-<workspace_id>.<random_number>.azuredatabricks.net/api/2.1/unity-catalog/iceberg. The number after api corresponds to the version of the Databricks REST API. To optionally verify the URL and personal access token before testing, see the sketch after these steps.
7. To review the configuration, click Test Connection.
8. If the test succeeds, click Save.
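If you want to sanity-check the Unity Catalog URL and personal access token before running Test Connection, one option is to call the catalog's Iceberg REST configuration endpoint directly. This is a minimal sketch, not part of the documented setup flow: the /v1/config path follows the Iceberg REST catalog specification, and the workspace URL, token, and catalog name are hypothetical placeholders.

```python
import requests

# Hypothetical placeholders; substitute your own values.
CATALOG_URL = "https://<workspace-instance>/api/2.1/unity-catalog/iceberg"
TOKEN = "<databricks-personal-access-token>"

# GET /v1/config is the Iceberg REST catalog configuration endpoint.
# A 200 response indicates the URL is reachable and the token is accepted.
resp = requests.get(
    f"{CATALOG_URL}/v1/config",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"warehouse": "<catalog-name>"},  # assumption: catalog passed as the warehouse
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```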
Considerations
- Nullability: Suppose a Delta table has N columns, some of which are defined as NOT NULL. If a SQL query copies those N columns into either a new Delta table (SELECT INTO) or an existing Delta table (INSERT INTO) whose schema doesn't enforce the same nullability constraints, the Iceberg table-level metadata generated for the destination Delta table differs from the Parquet file-level metadata generated for the underlying Parquet files, and Data Cloud can't query the destination table. In short, copying data between Delta tables that differ in nullability can cause inconsistencies. See the sketch after this list.
- Cross-Region S3 Storage Bucket: If your Data Cloud org is not in the same region as your S3 storage bucket and your catalog does not support storage credential vending, you must ensure that the server hosting the REST catalog includes the client.region property in the LoadTableResult object.
- Row-Level Updates: Querying Iceberg tables that are configured to use Iceberg V2 MoR Position / Equality Deletes or Iceberg V3 Deletion Vectors is not supported.
- Views: Querying Iceberg views is not supported.
- Namespaces: Only single-level (catalog -> database -> table) and two-level (catalog -> database -> schema -> table) namespaces are supported.
- Temporal Data Types: The time and timestamp_ntz data types are unsupported.
- Change Data: Certain features in Data Cloud, for example Data Actions, require the ability to detect when a data lake object changes. For Data Cloud to construct a change data feed, that is, a log of incremental changes to an external data lake object, a primary key must be specified. To use such features, (1) use Iceberg’s identifier-field-ids construct to define which columns comprise a table’s primary key, and (2) ensure that your query engines (writers) respect that configuration.
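To make the nullability consideration concrete, here is a minimal sketch, using hypothetical table names, of how a copy can lose a NOT NULL constraint and how to avoid the mismatch by declaring the same constraint on the destination before copying.

```python
# Run in a Databricks notebook; table names are hypothetical placeholders.

# Source Delta table enforces NOT NULL on customer_id.
spark.sql("""
    CREATE TABLE main.demo.orders_src (
        order_id    BIGINT,
        customer_id BIGINT NOT NULL
    ) USING DELTA
""")

# A plain CTAS copy may not carry the NOT NULL constraint, so the destination's
# Iceberg table-level metadata and Parquet file-level metadata can disagree
# about nullability, and Data Cloud can't query the copy.
spark.sql("CREATE TABLE main.demo.orders_copy AS SELECT * FROM main.demo.orders_src")

# Safer: declare the same constraints on the destination, then copy into it.
spark.sql("""
    CREATE TABLE main.demo.orders_dest (
        order_id    BIGINT,
        customer_id BIGINT NOT NULL
    ) USING DELTA
""")
spark.sql("INSERT INTO main.demo.orders_dest SELECT * FROM main.demo.orders_src")
```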