Databricks

Databricks is a cloud-based data analytics platform that allows you to operate a lakehouse architecture, providing data warehousing performance at data lake cost. Its Delta Lake storage layer runs on top of your existing data lake and is fully compatible with Apache Spark APIs. Apache Spark is an open-source data analytics engine that can perform analytics and data processing on very large sets of data. Read A Gentle Introduction to Apache Spark on Databricks.

Hevo can load data from any of your Sources into Databricks. You can set up the Databricks Destination on the fly, as part of the Pipeline creation process or independently. The ingested data is first staged in Hevo’s S3 bucket before it is batched and loaded to the Databricks Destination. Additionally, Hevo supports Databricks on the AWS, Azure, and GCP platforms.

You can connect your Databricks warehouse to Hevo using one of the following modes:

  • A Databricks cluster (Option 1 below)

  • A Databricks SQL endpoint (Option 2 below)

Both clusters and SQL endpoints are created within a workspace. A workspace refers to your Databricks deployment in the cloud service account.


Prerequisites

  • An active Databricks account on the AWS, Azure, or GCP platform.

  • A Databricks workspace with a cluster or SQL endpoint that Hevo can connect to, or the permissions to create these.


Perform the following steps to configure Databricks as a Destination:

(Optional) Create a Databricks Workspace

  1. Log in to your Databricks account.

  2. Create a workspace. You are automatically added as an admin to the workspace that you create.


(Optional) Add Members to the Workspace

Once you have created the workspace, add your team members who can access the workspace and create and manage clusters in it.

  1. Log in as the workspace admin and follow these steps to add users to the workspace.

  2. Follow these steps to assign admin privileges to the user(s) for creating clusters.
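
If you prefer to add users programmatically, the Databricks SCIM (Users) API can be used instead of the console. The following is a minimal sketch in Python, assuming the SCIM API is available for your workspace; the workspace URL, admin token, and email address are placeholders that you must replace:

    # Minimal sketch: add a user to a Databricks workspace via the SCIM API.
    # WORKSPACE_URL, TOKEN, and the email address are placeholders.
    import requests

    WORKSPACE_URL = "https://<workspace-name><env>.databricks.com"
    TOKEN = "<admin-personal-access-token>"  # a workspace admin's PAT

    response = requests.post(
        f"{WORKSPACE_URL}/api/2.0/preview/scim/v2/Users",
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/scim+json",
        },
        json={
            "schemas": ["urn:ietf:params:scim:schemas:core:2.0:User"],
            "userName": "new.user@example.com",
        },
    )
    response.raise_for_status()
    print(response.json())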


Connect your Databricks Warehouse

Use one of the following options to connect your Databricks warehouse to Hevo:

Option 1: Create a Databricks cluster

Clusters are created within a Databricks workspace. You can connect an existing Databricks cluster to which you want to load the data or create one now.

To do this:

  1. Log in to your Databricks workspace. URL: https://<workspace-name><env>.databricks.com/

  2. In the Databricks console, select Data Science & Engineering from the drop-down.

    DSE option

  3. Do one of the following:

    Create cluster

    • Click Create, and then click Cluster.

    • Click Compute, and then click + Create Cluster.

  4. Specify a Cluster name and select the required configuration, such as the Worker type and Driver type.

  5. Expand the Advanced options section and select the Spark tab.

    Spark tab

  6. In the Spark Config box, paste the following configuration, which is needed to read the staged data from the S3 bucket (if you script cluster creation, the same settings can be supplied through the Clusters API, as shown after these steps):

    spark.databricks.delta.alterTable.rename.enabledOnAWS true
    spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.hadoop.fs.s3n.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.hadoop.fs.s3n.impl.disable.cache true
    spark.hadoop.fs.s3.impl.disable.cache true
    spark.hadoop.fs.s3a.impl.disable.cache true
    spark.hadoop.fs.s3.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
    
  7. Click Create Cluster to create your cluster.
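
If you script your infrastructure, the same cluster, including the Spark configuration from step 6, can be created through the Databricks Clusters REST API instead of the console. The following is a minimal sketch in Python; the workspace URL, token, runtime version, and node type are placeholders that you must replace with values valid for your cloud:

    # Minimal sketch: create a cluster carrying the S3-related Spark
    # configuration from step 6, via the Databricks Clusters REST API.
    # WORKSPACE_URL, TOKEN, <spark-version>, and <node-type-id> are placeholders.
    import requests

    WORKSPACE_URL = "https://<workspace-name><env>.databricks.com"
    TOKEN = "<personal-access-token>"

    spark_conf = {
        "spark.databricks.delta.alterTable.rename.enabledOnAWS": "true",
        "spark.hadoop.fs.s3a.impl": "shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.hadoop.fs.s3n.impl": "shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.hadoop.fs.s3.impl": "shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.hadoop.fs.s3n.impl.disable.cache": "true",
        "spark.hadoop.fs.s3.impl.disable.cache": "true",
        "spark.hadoop.fs.s3a.impl.disable.cache": "true",
    }

    response = requests.post(
        f"{WORKSPACE_URL}/api/2.0/clusters/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "cluster_name": "hevo-destination-cluster",  # placeholder name
            "spark_version": "<spark-version>",          # Databricks runtime version
            "node_type_id": "<node-type-id>",            # worker/driver instance type
            "num_workers": 2,
            "spark_conf": spark_conf,
        },
    )
    response.raise_for_status()
    print("Created cluster:", response.json()["cluster_id"])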


Option 2: Create a Databricks SQL endpoint

  1. Log in to your Databricks workspace. URL: https://<workspace-name><env>.databricks.com/

  2. In the Databricks console, select SQL from the drop-down.
    SQL

  3. Do one of the following:

    SQL Endpoint

    • Click Create, and then click SQL Endpoint.

    • Click SQL Endpoints, and then click Create your first SQL endpoint.

  4. In the New SQL Endpoint window:

    Create SQL Endpoint

    • Specify a Name for the endpoint.

    • Select your Cluster Size.

    • Configure other endpoint options, as required.

    • Click Create.
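
Alternatively, the endpoint can be created programmatically. The following is a minimal sketch in Python using the Databricks SQL Endpoints REST API (renamed to SQL Warehouses in later Databricks releases); the workspace URL, token, and sizing values are placeholders:

    # Minimal sketch: create a SQL endpoint via the Databricks SQL Endpoints
    # REST API (later renamed to SQL Warehouses). All values are placeholders.
    import requests

    WORKSPACE_URL = "https://<workspace-name><env>.databricks.com"
    TOKEN = "<personal-access-token>"

    response = requests.post(
        f"{WORKSPACE_URL}/api/2.0/sql/endpoints/",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "name": "hevo-destination-endpoint",  # placeholder name
            "cluster_size": "Small",              # choose a size your workspace allows
            "auto_stop_mins": 120,
        },
    )
    response.raise_for_status()
    print("Created SQL endpoint:", response.json()["id"])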


Obtain the Databricks Credentials

Once you have a cluster that you want to load data to, obtain the cluster details that you must provide while configuring Databricks in Hevo. To do this:

  1. In the Databricks console, click Compute in the left navigation bar.

  2. Click the cluster you want to use.

  3. In the Configuration tab, scroll down to the Advanced Options section and select the JDBC/ODBC tab.

  4. Make a note of the following values:

    • Server Hostname

    • Port

    • HTTP Path

    Credentials

Create a Personal Access Token (PAT)

Hevo requires a Databricks Personal Access Token (PAT) to authenticate and connect to your Databricks instance and use the Databricks REST APIs.

To generate the PAT:

  1. In the Databricks console, click Settings in the left navigation bar, and then click User Settings.

    PAT

  2. Click the Access Tokens tab.

  3. Click Generate New Token.

  4. Optionally, provide a description in the Comment field and set the token Lifetime (expiration period).

  5. Click Generate.

  6. Copy the generated token. This token is used to connect Databricks as a Destination in Hevo.

Note: PATs are similar to passwords; store them securely.
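
Before configuring the Destination in Hevo, you can optionally verify that the Server Hostname, HTTP Path, and PAT work together. The following is a minimal sketch in Python using the databricks-sql-connector package (pip install databricks-sql-connector); the three placeholder values are the ones you noted above:

    # Minimal sketch: verify the credentials that Hevo will use, with the
    # databricks-sql-connector package. The three values are placeholders.
    from databricks import sql

    with sql.connect(
        server_hostname="<server-hostname>",     # from the JDBC/ODBC tab
        http_path="<http-path>",                 # from the JDBC/ODBC tab
        access_token="<personal-access-token>",  # the PAT generated above
    ) as connection:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
            print(cursor.fetchone())  # (1,) confirms the connection works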


Configure Databricks as a Destination

Perform the following steps to configure Databricks as a Destination in Hevo:

  1. Click DESTINATIONS in the Asset Palette.

  2. Click + CREATE in the Destinations List View.

  3. On the Add Destination page, select Databricks.

  4. In the Configure your Databricks Destination page, specify the following:

    Databricks settings

    • Destination Name: A unique name for the Destination.

    • Server Hostname: The server hostname in your cluster credentials.

    • Database Port: The port in your cluster credentials. Default value: 443.

    • HTTP Path: The HTTP path to the data source in Databricks, from your cluster credentials.

    • Personal Access Token (PAT): The PAT generated in Databricks, which Hevo must use to authenticate and connect to Databricks. It works similarly to a username and password combination.

    • Advanced Settings:

      • Populate Loaded Timestamp: If enabled, Hevo appends the __hevo_loaded_at column to the Destination table to indicate the time when the Event was loaded.

      • Sanitize Table/Column Names: If enabled, Hevo sanitizes the table and column names by removing all non-alphanumeric characters and spaces and replacing them with an underscore (_). Read Name Sanitization.

      • Create Delta Tables in External Location (Optional): If enabled, you can create tables in a different location than the Databricks File System location registered with the cluster. Read Identifying the External Location for Delta Tables.

        If disabled, the default Databricks File System location registered with the cluster is used, and Hevo creates the Delta tables in the /{schema}/{table} path.

      • Vacuum Delta Tables: If enabled, Hevo runs the Vacuum operation every weekend to delete the uncommitted files and clean up your Delta tables. Read VACUUM | Databricks on AWS. Databricks charges additional costs for these queries.

      • Optimize Delta Tables: If enabled, Hevo runs the Optimize queries every weekend to optimize the layout of the data and improve the query speed. Read OPTIMIZE (Delta Lake on Databricks). Databricks charges additional costs for these queries.

  5. Click TEST CONNECTION to test and SAVE & CONTINUE to complete the setup.
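
Once a Pipeline has run, you can optionally confirm that Events reached the Destination. The following is a minimal sketch in Python using the databricks-sql-connector package; my_schema.my_table is a hypothetical table name, and the __hevo_loaded_at column exists only if Populate Loaded Timestamp is enabled:

    # Minimal sketch: check that Events were loaded and when the latest batch
    # arrived. my_schema.my_table is a hypothetical table name; replace it.
    from databricks import sql

    with sql.connect(
        server_hostname="<server-hostname>",
        http_path="<http-path>",
        access_token="<personal-access-token>",
    ) as connection:
        with connection.cursor() as cursor:
            cursor.execute(
                "SELECT COUNT(*) AS events, MAX(__hevo_loaded_at) AS last_loaded "
                "FROM my_schema.my_table"
            )
            print(cursor.fetchone())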


Identifying the External Location for Delta Tables

If the Create Delta Tables in External Location option is enabled, Hevo creates the Delta tables in the {external-location}/{schema}/{table} path that you specify.

To identify the path of the external location, do one of the following:

  • If you have DBFS access in Databricks:

    1. In the Databricks console, click Data in the left navigation bar.

      DBFS

    2. Click the DBFS tab at the top of the sliding sidebar.

    3. Select/view the path where the tables must be created. For example, in the above image, /demo/default is the path, and the external location is derived as /demo/default/{schema}/{table}.

  • If you do not have DBFS access:

    • Run the following command in your Databricks instance or the Destination workbench in Hevo:
       DESCRIBE TABLE EXTENDED <table-name>;
    

    Read Describe Table.
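
If you run the command through a script rather than a notebook, the same information is available via the SQL connector. The following is a minimal sketch in Python; my_schema.my_table is a hypothetical table name, and the Location row of the output shows where the table's files are stored:

    # Minimal sketch: run DESCRIBE TABLE EXTENDED and print the "Location" row,
    # which shows the path backing the Delta table.
    # my_schema.my_table is a hypothetical table name; replace it with yours.
    from databricks import sql

    with sql.connect(
        server_hostname="<server-hostname>",
        http_path="<http-path>",
        access_token="<personal-access-token>",
    ) as connection:
        with connection.cursor() as cursor:
            cursor.execute("DESCRIBE TABLE EXTENDED my_schema.my_table")
            for col_name, data_type, _comment in cursor.fetchall():
                if col_name == "Location":
                    print("Table location:", data_type)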


Destination Considerations

None.


Limitations

None.



See Also


Revision History

Refer to the following table for the list of key updates made to this page:

Date          Release   Description of Change
Jan-03-2022   1.79      New document.