Databricks
Databricks is a data lakehouse platform built on Delta Lake, an open-source storage layer, that allows you to operate a lakehouse architecture providing data warehousing performance at data lake cost. Databricks runs on top of your existing data lake and is fully compatible with Apache Spark APIs. Apache Spark is an open-source data analytics engine that can perform analytics and data processing on very large data sets. Read A Gentle Introduction to Apache Spark on Databricks.
Hevo can load data from any of your Sources into Databricks. You can set up the Databricks Destination on the fly, as part of the Pipeline creation process or independently. The ingested data is first staged in Hevo’s S3 bucket before it is batched and loaded to the Databricks Destination. Additionally, Hevo supports Databricks on the AWS, Azure, and GCP platforms.
You can connect your Databricks warehouse to Hevo using one of the following modes:
- A Databricks cluster (version 7.0 and above). A cluster defines the computing resources to be used for loading the objects to the Databricks warehouse. For instructions to set up a cluster, read Create a Databricks Cluster. Apache Spark jobs are available only in the Cluster mode.
- A Databricks SQL endpoint. An SQL endpoint is a computation resource that allows you to run only SQL commands on the data objects. For instructions to set up an SQL endpoint, read Create a Databricks SQL endpoint.
Clusters and SQL endpoints can be created within a workspace. A workspace refers to your Databricks deployment in the cloud service account.
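Both modes expose the same SQL interface over HTTPS, so you can sanity-check either one from outside Hevo before building your Pipeline. The following is a minimal sketch, assuming the databricks-sql-connector Python package and placeholder credentials; replace the hostname, HTTP path, and token with the values you collect in the sections below.

```python
# Minimal connectivity check for a Databricks cluster or SQL endpoint using the
# databricks-sql-connector package (pip install databricks-sql-connector).
# All connection values below are placeholders, not real credentials.
from databricks import sql

connection = sql.connect(
    server_hostname="dbc-xxxxxxxx-xxxx.cloud.databricks.com",  # Server Hostname
    http_path="sql/protocolv1/o/0/0000-000000-example000",     # HTTP Path
    access_token="dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXX",           # Personal Access Token
)

with connection.cursor() as cursor:
    cursor.execute("SELECT 1")   # trivial query; success confirms connectivity
    print(cursor.fetchone())

connection.close()
```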
Prerequisites
- An active AWS account is available.

- A workspace is created in Databricks.

- The Databricks workspace URL is available. Format: <deployment name>.cloud.databricks.com.

- The Databricks cluster or SQL endpoint is created.

- The database credentials (hostname, port, and HTTP path) and Personal Access Token (PAT) of the Databricks instance are available.
Perform the following steps to configure Databricks as a Destination:
(Optional) Create a Databricks Workspace
- Log in to your Databricks account.

- Create a workspace. You are automatically added as an admin to the workspace that you create.
(Optional) Add Members to the Workspace
Once you have created the workspace, add your team members who can access the workspace and create and manage clusters in it.
- Log in as the workspace admin and follow these steps to add users to the workspace.

- Follow these steps to assign admin privileges to the user(s) for creating clusters.
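If you prefer to script user management instead of using the console, the sketch below is one possible approach using the Databricks SCIM API; the workspace URL, admin token, and email address are placeholder assumptions, and the allow-cluster-create entitlement is what allows the user to create clusters.

```python
# Hypothetical sketch: add a user to the workspace and allow them to create
# clusters via the SCIM API (POST /api/2.0/preview/scim/v2/Users).
# Workspace URL, admin PAT, and user email are placeholders.
import requests

WORKSPACE_URL = "https://dbc-xxxxxxxx-xxxx.cloud.databricks.com"  # placeholder
ADMIN_TOKEN = "dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXX"                  # placeholder admin PAT

response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/preview/scim/v2/Users",
    headers={
        "Authorization": f"Bearer {ADMIN_TOKEN}",
        "Content-Type": "application/scim+json",
    },
    json={
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:User"],
        "userName": "teammate@example.com",                  # placeholder email
        "entitlements": [{"value": "allow-cluster-create"}], # cluster-creation right
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["id"])  # SCIM ID of the newly added user
```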
Connect your Databricks Warehouse
Use one of the following options to connect your Databricks warehouse to Hevo:
Option 1: Create a Databricks cluster
Clusters are created within a Databricks workspace. You can connect an existing Databricks cluster to which you want to load the data or create one now.
To do this:
- Log in to your Databricks workspace. URL: https://<workspace-name><env>.databricks.com/

- In the Databricks console, select Data Science & Engineering in the drop-down.

- Do one of the following:

  - Click Create, and then click Cluster.

  - Click Compute, and then click + Create Cluster.

- Specify a Cluster name and select the required configuration, such as the Worker type and Driver type.

- Expand the Advanced options section and select the Spark tab.

- In the Spark Config box, paste the following code that specifies the configurations needed to read the data from your S3 account:

        spark.databricks.delta.alterTable.rename.enabledOnAWS true
        spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
        spark.hadoop.fs.s3n.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
        spark.hadoop.fs.s3n.impl.disable.cache true
        spark.hadoop.fs.s3.impl.disable.cache true
        spark.hadoop.fs.s3a.impl.disable.cache true
        spark.hadoop.fs.s3.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem

- Click Create Cluster to create your cluster.
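If you would rather create the cluster from a script than from the console, the following is a rough sketch using the Databricks Clusters REST API; the workspace URL, token, runtime version, node type, and cluster name are placeholder assumptions, and the spark_conf dictionary mirrors the Spark Config shown above.

```python
# Rough sketch: create a cluster with the Spark Config above via the Databricks
# Clusters API 2.0 (POST /api/2.0/clusters/create). All values are placeholders;
# adjust the runtime version and node type to your workspace before running.
import requests

WORKSPACE_URL = "https://dbc-xxxxxxxx-xxxx.cloud.databricks.com"  # placeholder
TOKEN = "dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXX"                        # placeholder PAT

spark_conf = {
    "spark.databricks.delta.alterTable.rename.enabledOnAWS": "true",
    "spark.hadoop.fs.s3a.impl": "shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem",
    "spark.hadoop.fs.s3n.impl": "shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem",
    "spark.hadoop.fs.s3.impl": "shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem",
    "spark.hadoop.fs.s3n.impl.disable.cache": "true",
    "spark.hadoop.fs.s3.impl.disable.cache": "true",
    "spark.hadoop.fs.s3a.impl.disable.cache": "true",
}

response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_name": "hevo-destination-cluster",  # placeholder name
        "spark_version": "7.3.x-scala2.12",          # any 7.0+ runtime works
        "node_type_id": "i3.xlarge",                 # placeholder worker/driver type
        "num_workers": 2,
        "spark_conf": spark_conf,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())  # response contains the new cluster_id
```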
Option 2: Create a Databricks SQL endpoint
- Log in to your Databricks workspace. URL: https://<workspace-name><env>.databricks.com/

- In the Databricks console, select SQL in the drop-down.

- Do one of the following:

  - Click Create, and then click SQL Endpoint.

  - Click SQL Endpoints, and then click Create your first SQL endpoint.

- In the New SQL Endpoint window:

  - Specify a Name for the endpoint.

  - Select your Cluster Size.

  - Configure other endpoint options, as required.

  - Click Create.
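Databricks has since renamed SQL endpoints to SQL warehouses. If you want to confirm the endpoint's state and connection details from a script, the following is a hedged sketch assuming the /api/2.0/sql/warehouses REST endpoint is available in your workspace; the URL and token are placeholders.

```python
# Sketch: list SQL warehouses (formerly SQL endpoints) and print the connection
# details that Hevo needs. Assumes the /api/2.0/sql/warehouses endpoint is
# available in your workspace; URL and token are placeholders.
import requests

WORKSPACE_URL = "https://dbc-xxxxxxxx-xxxx.cloud.databricks.com"  # placeholder
TOKEN = "dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXX"                        # placeholder PAT

response = requests.get(
    f"{WORKSPACE_URL}/api/2.0/sql/warehouses",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
response.raise_for_status()

for warehouse in response.json().get("warehouses", []):
    odbc = warehouse.get("odbc_params", {})  # hostname and HTTP path for Hevo
    print(warehouse["name"], warehouse["state"], odbc.get("hostname"), odbc.get("path"))
```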
Obtain the Databricks Credentials
Once you have a cluster that you want to load data to, obtain the cluster details that you must provide while configuring Databricks in Hevo. To do this:
- In the Databricks console, click Compute in the left navigation bar.

- Click the cluster you want to use.

- In the Configuration tab, scroll down to the Advanced Options section and select the JDBC/ODBC tab.

- Make a note of the following values:

  - Server Hostname

  - Port

  - HTTP Path
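For reference, the values copied from the JDBC/ODBC tab typically look like the placeholders below; this is an illustrative sketch only, and your actual hostname and paths will differ.

```python
# Illustrative only: typical shape of the connection values copied from the
# JDBC/ODBC tab. These are placeholders, not real credentials.
databricks_credentials = {
    "server_hostname": "dbc-xxxxxxxx-xxxx.cloud.databricks.com",
    "port": 443,  # default port expected by Hevo
    # For a cluster, the HTTP Path usually looks like this:
    "http_path": "sql/protocolv1/o/1234567890123456/0123-456789-example1",
    # For a SQL endpoint, it usually looks like: /sql/1.0/endpoints/<endpoint-id>
}
```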
Create a Personal Access Token (PAT)
Hevo requires a Databricks Personal Access Token (PAT) to authenticate and connect to your Databricks instance and use the Databricks REST APIs.
To generate the PAT:
- In the Databricks console, click Settings in the left navigation bar, and then click User Settings.

- Click the Access Tokens tab.

- Click Generate New Token.

- Optionally, provide a description in the Comment field and the token Lifetime (expiration period).

- Click Generate.

- Copy the generated token. This token is used to connect Databricks as a Destination in Hevo.
Note: PATs are similar to passwords; store these securely.
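If you manage tokens outside the console, the Token API is an alternative to the steps above; the following is a sketch with placeholder values, and it assumes you already have some credential to authenticate the API call itself.

```python
# Sketch: create a PAT via the Databricks Token API (POST /api/2.0/token/create)
# instead of the console. Workspace URL and existing credential are placeholders.
import requests

WORKSPACE_URL = "https://dbc-xxxxxxxx-xxxx.cloud.databricks.com"  # placeholder
EXISTING_TOKEN = "dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXX"               # placeholder

response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/token/create",
    headers={"Authorization": f"Bearer {EXISTING_TOKEN}"},
    json={
        "comment": "Hevo Destination token",  # description shown in the console
        "lifetime_seconds": 90 * 24 * 3600,   # optional expiry; omit for no expiry
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["token_value"])  # treat this like a password and store it securely
```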
Configure Databricks as a Destination
Perform the following steps to configure Databricks as a Destination in Hevo:
- Click DESTINATIONS in the Asset Palette.

- Click + CREATE in the Destinations List View.

- On the Add Destination page, select Databricks.

- On the Configure your Databricks Destination page, specify the following:

  - Destination Name: A unique name for the Destination.

  - Server Hostname: The server hostname from your cluster credentials.

  - Database Port: The port from your cluster credentials. Default value: 443.

  - HTTP Path: The HTTP path to the data source in Databricks, from your cluster credentials.

  - Personal Access Token (PAT): The PAT generated in Databricks that Hevo must use to authenticate and connect to Databricks. It works similarly to a username-password combination.

  - Advanced Settings:

    - Populate Loaded Timestamp: If enabled, Hevo appends the __hevo_loaded_at column to the Destination table to indicate the time when the Event was loaded.

    - Sanitize Table/Column Names: If enabled, Hevo removes all non-alphanumeric characters and spaces from the table and column names and replaces them with an underscore (_). Read Name Sanitization.

    - Create Delta Tables in External Location (Optional): If enabled, you can create tables in a location other than the Databricks File System location registered with the cluster. Read Identifying the External Location for Delta Tables. If disabled, the default Databricks File System location registered with the cluster is used, and Hevo creates the external Delta tables in the /{schema}/{table} path.

    - Vacuum Delta Tables: If enabled, Hevo runs the VACUUM operation every weekend to delete the uncommitted files and clean up your Delta tables. Read VACUUM | Databricks on AWS. Databricks charges additional costs for these queries.

    - Optimize Delta Tables: If enabled, Hevo runs OPTIMIZE queries every weekend to optimize the layout of the data and improve query speed. Read OPTIMIZE (Delta Lake on Databricks). Databricks charges additional costs for these queries.

- Click TEST CONNECTION to test and SAVE & CONTINUE to complete the setup.
Identifying the External Location for Delta Tables
If the Create Delta Tables in External Location option is enabled, Hevo creates the Delta tables in the {external-location}/{schema}/{table} path specified by you.

To locate the path of the external location, do one of the following:

- If you have DBFS access in Databricks:

  - In the Databricks console, click Data in the left navigation bar.

  - Click the DBFS tab at the top of the sliding sidebar.

  - Select or view the path where the tables must be created. For example, if /demo/default is the selected path, the external location is derived as /demo/default/{schema}/{table}.

- If you do not have DBFS access, run the following command in your Databricks instance or the Destination workbench in Hevo:

        DESCRIBE TABLE EXTENDED <table-name>;

  Read Describe Table.
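If you prefer to check the location programmatically, the following is a small sketch that runs the same command through the databricks-sql-connector used earlier; the table name and credentials are hypothetical placeholders.

```python
# Sketch: read the Location row from DESCRIBE TABLE EXTENDED using the
# databricks-sql-connector. Table name and credentials are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="dbc-xxxxxxxx-xxxx.cloud.databricks.com",  # placeholder
    http_path="sql/protocolv1/o/0/0000-000000-example000",     # placeholder
    access_token="dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXX",           # placeholder
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("DESCRIBE TABLE EXTENDED my_schema.my_table")  # hypothetical table
        for row in cursor.fetchall():
            if row[0] == "Location":
                print(row[1])  # e.g. dbfs:/demo/default/my_schema/my_table
```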
Destination Considerations
None.
Limitations
None.
See Also
Revision History
Refer to the following table for the list of key updates made to this page:
| Date | Release | Description of Change |
| --- | --- | --- |
| Jan-03-2022 | 1.79 | New document. |