Databricks
Databricks is a data lakehouse platform that provides data warehousing performance at data lake cost. It is built on the open-source Delta Lake storage layer, runs on top of your existing data lake, and is fully compatible with Apache Spark APIs. Apache Spark is an open-source data analytics engine that can perform analytics and data processing on very large data sets. Read A Gentle Introduction to Apache Spark on Databricks.
Hevo can load data from any of your Sources into Databricks. You can set up the Databricks Destination on the fly, as part of the Pipeline creation process or independently. The ingested data is first staged in Hevo’s S3 bucket before it is batched and loaded to the Databricks Destination. Additionally, Hevo supports Databricks on the AWS, Azure, and GCP platforms.
You can connect your Databricks warehouse to Hevo using one of the following methods:
- Using the Databricks credentials: Hevo allows you to configure Databricks as a Destination using the credentials obtained from your Databricks account. For this, you can use one of the following modes:
  - A Databricks cluster (version 7.0 and above). A cluster defines the computing resources used to load the objects to the Databricks warehouse. For instructions to set up a cluster, read Create a Databricks Cluster. Apache Spark jobs are available only in the Cluster mode.
  - A Databricks SQL warehouse. A SQL warehouse is a compute resource that allows you to run only SQL commands on the data objects. For instructions to set up an SQL warehouse, read Create a Databricks SQL warehouse.
  Clusters and SQL warehouses are created within a workspace. A workspace refers to your Databricks deployment in the cloud service account.
- Using Databricks Partner Connect (Recommended Method): In collaboration with Databricks, Hevo allows you to configure Databricks as a Destination from the Databricks Partner Connect page. Refer to the section, Connect Using the Databricks Partner Connect, for the steps to do this.
Prerequisites
- An active AWS account is available.
- A workspace is created in Databricks.
- The Databricks workspace URL is available. Format: <deployment name>.cloud.databricks.com.
- The Databricks cluster or SQL warehouse is created, if you are connecting using the Databricks credentials.
- The database credentials (server hostname, port, and HTTP path) and the Personal Access Token (PAT) of the Databricks instance are available, if you are connecting using the Databricks credentials.
- You are assigned the Team Collaborator role or any administrator role except the Billing Administrator role in Hevo, to create the Destination.
Connect to Databricks as a Destination using either of the following methods:
Connect Using the Databricks Credentials
(Optional) Create a Databricks workspace
- Log in to your Databricks account.
- Create a workspace. You are automatically added as an admin to the workspace that you create.
(Optional) Add Members to the Workspace
Once you have created the workspace, add the team members who need to access the workspace and create and manage clusters in it.
- Log in as the workspace admin and follow these steps to add users to the workspace.
- Follow these steps to assign admin privileges to the user(s) for creating clusters.
Connect your Databricks Warehouse
Use one of the following options to connect your Databricks warehouse to Hevo:
Create a Databricks cluster
Clusters are created within a Databricks workspace. You can connect an existing Databricks cluster to which you want to load the data, or create one now.
To do this:
- Log in to your Databricks workspace. URL: https://<workspace-name><env>.databricks.com/
- In the Databricks console, select Data Science & Engineering from the drop-down.
- In the left navigation pane, click New, and then click Cluster.
- Specify a Cluster name and select the required configuration, such as the Worker type and Driver type.
- Expand the Advanced options section and select the Spark tab.
- In the Spark Config box, paste the following code that specifies the configurations needed to read the data from your S3 account. (A scripted alternative using the Clusters REST API is sketched after these steps.)
  spark.databricks.delta.alterTable.rename.enabledOnAWS true
  spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
  spark.hadoop.fs.s3n.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
  spark.hadoop.fs.s3n.impl.disable.cache true
  spark.hadoop.fs.s3.impl.disable.cache true
  spark.hadoop.fs.s3a.impl.disable.cache true
  spark.hadoop.fs.s3.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
- Click Create Cluster to create your cluster.
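If you prefer to script this step instead of using the Databricks UI, the same Spark settings can be passed through the Databricks Clusters REST API. The following is a minimal sketch in Python, not part of the Hevo setup itself; the workspace URL, token, cluster name, runtime version, and node type are placeholder assumptions you should replace with your own values.

import requests

# Placeholders (assumptions): replace with your workspace URL and a personal access token.
WORKSPACE_URL = "https://<workspace-name>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

# The same Spark settings listed in the Spark Config step above.
SPARK_CONF = {
    "spark.databricks.delta.alterTable.rename.enabledOnAWS": "true",
    "spark.hadoop.fs.s3a.impl": "shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem",
    "spark.hadoop.fs.s3n.impl": "shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem",
    "spark.hadoop.fs.s3n.impl.disable.cache": "true",
    "spark.hadoop.fs.s3.impl.disable.cache": "true",
    "spark.hadoop.fs.s3a.impl.disable.cache": "true",
    "spark.hadoop.fs.s3.impl": "shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem",
}

payload = {
    "cluster_name": "hevo-destination-cluster",  # hypothetical name
    "spark_version": "11.3.x-scala2.12",         # any Databricks Runtime version 7.0 or above
    "node_type_id": "i3.xlarge",                 # example AWS node type
    "num_workers": 2,
    "spark_conf": SPARK_CONF,
}

# Create the cluster through the Clusters API and print its ID.
response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()
print("Created cluster:", response.json().get("cluster_id"))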
Create a Databricks SQL warehouse
- Log in to your Databricks workspace. URL: https://<workspace-name><env>.databricks.com/
- In the Databricks console, select SQL from the drop-down.
- In the left navigation pane, click New, and then click SQL Warehouse.
- In the New SQL Warehouse window, do the following:
  - Specify a Name for the warehouse.
  - Select your Cluster Size.
  - Configure other warehouse options, as required.
  - Click Create.
Obtain the Databricks Credentials
Once you have a cluster that you want to load data to, obtain the cluster details that you must provide while configuring Databricks in Hevo. To do this:
- In the Databricks console, click Compute in the left navigation bar.
- Click the cluster you want to use.
- In the Configuration tab, scroll down to the Advanced Options section and select the JDBC/ODBC tab.
- Make a note of the following values:
  - Server Hostname
  - Port
  - HTTP Path
Create a Personal Access Token (PAT)
Hevo requires a Databricks Personal Access Token (PAT) to authenticate and connect to your Databricks instance and use the Databricks REST APIs.
To generate the PAT:
- In the top right of your Databricks console, open the drop-down and click User Settings.
- Click the Access Tokens tab.
- Click Generate new token.
- Optionally, in the Generate new token dialog box, provide a description in the Comment field and specify the token Lifetime (Expiration Period).
- Click Generate.
- Copy the generated token and save it securely, like any other password. Use this token to connect Databricks as a Destination in Hevo.
Note: PATs are similar to passwords; store these securely.
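Before entering these values in Hevo, you can optionally verify that the Server Hostname, HTTP Path, and PAT work together. The following is a minimal sketch, assuming the databricks-sql-connector Python package and placeholder credential values; it is not required for the Hevo setup.

from databricks import sql  # pip install databricks-sql-connector

# Placeholders: use the Server Hostname and HTTP Path noted from the JDBC/ODBC tab,
# and the PAT generated above.
SERVER_HOSTNAME = "<workspace-name>.cloud.databricks.com"
HTTP_PATH = "<http-path>"
ACCESS_TOKEN = "<personal-access-token>"

# If this connection and query succeed, the same values should work in Hevo.
with sql.connect(
    server_hostname=SERVER_HOSTNAME,
    http_path=HTTP_PATH,
    access_token=ACCESS_TOKEN,
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT 1")
        print("Connection OK:", cursor.fetchone())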
Configure Databricks as a Destination
Perform the following steps to configure Databricks as a Destination in Hevo:
- Click DESTINATIONS in the Navigation Bar.
- Click + CREATE in the Destinations List View.
- On the Add Destination page, select Databricks.
- On the Configure your Databricks Destination page, specify the following:
  - Destination Name: A unique name for the Destination.
  - Server Hostname: The server hostname from your cluster credentials.
  - Database Port: The port from your cluster credentials. Default value: 443.
  - HTTP Path: The HTTP path to the data source in Databricks, from your cluster credentials.
  - Personal Access Token (PAT): The PAT generated in Databricks that Hevo must use to authenticate and connect to Databricks. It works similar to a username-password combination.
  - Advanced Settings:
    - Populate Loaded Timestamp: If enabled, Hevo appends the ___hevo_loaded_at_ column to the Destination table to indicate the time when the Event was loaded.
    - Sanitize Table/Column Names: If enabled, Hevo removes all non-alphanumeric characters and spaces from the table and column names and replaces them with an underscore (_). Read Name Sanitization.
    - Create Delta Tables in External Location (Optional): If enabled, you can create tables in a location other than the Databricks File System location registered with the cluster. Read Identifying the External Location for Delta Tables. If disabled, the default Databricks File System location registered with the cluster is used, and Hevo creates the external Delta tables in the /{schema}/{table} path.
    - Vacuum Delta Tables: If enabled, Hevo runs the Vacuum operation every weekend to delete the uncommitted files and clean up your Delta tables. Read VACUUM | Databricks on AWS. Databricks charges additional costs for these queries.
    - Optimize Delta Tables: If enabled, Hevo runs the Optimize queries every weekend to optimize the layout of the data and improve the query speed. Read OPTIMIZE (Delta Lake on Databricks). Databricks charges additional costs for these queries. (A sketch of the equivalent maintenance commands follows these steps.)
- Click TEST CONNECTION. This button is enabled once all the mandatory fields are specified.
- Click SAVE & CONTINUE. This button is enabled once all the mandatory fields are specified.
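The Vacuum Delta Tables and Optimize Delta Tables settings above correspond to standard Delta Lake maintenance commands. If you ever want to run the same maintenance on demand rather than waiting for Hevo's weekend schedule, you can issue the commands yourself. The following is a minimal sketch, assuming the databricks-sql-connector package, placeholder credentials, and a hypothetical table name my_schema.my_table; these commands consume Databricks compute and incur the additional costs noted above.

from databricks import sql  # pip install databricks-sql-connector

# Placeholders: the same connection details used when configuring the Destination.
with sql.connect(
    server_hostname="<workspace-name>.cloud.databricks.com",
    http_path="<http-path>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        # Compact small files to improve query speed (what Optimize Delta Tables schedules).
        cursor.execute("OPTIMIZE my_schema.my_table")
        # Remove files no longer referenced by the Delta table
        # (what Vacuum Delta Tables schedules); the default 7-day retention is kept.
        cursor.execute("VACUUM my_schema.my_table")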
Connect Using the Databricks Partner Connect (Recommended Method)
- Log in to your Databricks account.
- In the left navigation pane, click Partner Connect.
- On the Partner Connect page, under Data Ingestion, click HEVO.
- In the Connect to partner pop-up window, select the options according to your requirements, and click Next.
- Specify your Email, and click Connect to Hevo Data.
- Sign up for Hevo or log in to your Hevo account. Post-login, you are redirected to the Configure your Databricks Destination page.
- On the Configure your Databricks Destination page, specify the following:
  - Destination Name: A unique name for the Destination.
  - Schema Name: The name of the Destination database schema. Default value: default.
  - Advanced Settings:
    - Populate Loaded Timestamp: If enabled, Hevo appends the ___hevo_loaded_at_ column to the Destination table to indicate the time when the Event was loaded.
    - Sanitize Table/Column Names: If enabled, Hevo removes all non-alphanumeric characters and spaces from the table and column names and replaces them with an underscore (_). Read Name Sanitization.
    - Create Delta Tables in External Location (Optional): This option is disabled, as Databricks configures it automatically when you connect using Databricks Partner Connect. The default Databricks File System location registered with the cluster is used, and Hevo creates the external Delta tables in the /{schema}/{table} path.
    - Vacuum Delta Tables: If enabled, Hevo runs the Vacuum operation every weekend to delete the uncommitted files and clean up your Delta tables. Read VACUUM | Databricks on AWS. Databricks charges additional costs for these queries.
    - Optimize Delta Tables: If enabled, Hevo runs the Optimize queries every weekend to optimize the layout of the data and improve the query speed. Read OPTIMIZE (Delta Lake on Databricks). Databricks charges additional costs for these queries.
- Click TEST CONNECTION. This button is enabled once all the mandatory fields are specified.
- Click SAVE & CONTINUE. This button is enabled once all the mandatory fields are specified.
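Once the Pipeline starts loading data, you can optionally confirm that Hevo is creating tables in the Schema Name you configured. The following is a minimal sketch, assuming the databricks-sql-connector package, placeholder credentials, and the default schema; it is a verification aid, not part of the Hevo setup itself.

from databricks import sql  # pip install databricks-sql-connector

# Placeholders: credentials for the SQL warehouse created through Partner Connect.
with sql.connect(
    server_hostname="<workspace-name>.cloud.databricks.com",
    http_path="<http-path>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        # List the tables created in the configured schema ("default" here).
        cursor.execute("SHOW TABLES IN default")
        for row in cursor.fetchall():
            print(row)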
Identifying the External Location for Delta Tables
If the Create Delta Tables in External Location option is enabled, Hevo creates the Delta tables in the {external-location}/{schema}/{table} path specified by you.
To locate the path of the external location, do one of the following:
- If you have DBFS access in Databricks:
  - In the Databricks console, click Data in the left navigation bar.
  - Click the DBFS tab at the top of the sliding sidebar.
  - Select or view the path where the tables must be created. For example, if /demo/default is the selected path, the external location is derived as /demo/default/{schema}/{table}.
- If you do not have DBFS access:
  - Run the following command in your Databricks instance or the Destination workbench in Hevo:
    DESCRIBE TABLE EXTENDED <table-name>;
  Read Describe Table. (A sketch for reading the location from this command's output follows.)
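For reference, DESCRIBE TABLE EXTENDED returns a result set with col_name, data_type, and comment columns; the row whose col_name is Location holds the table's storage path. The following is a minimal sketch that extracts it, assuming the databricks-sql-connector package, placeholder credentials, and a hypothetical table name my_schema.my_table.

from databricks import sql  # pip install databricks-sql-connector

# Placeholders: connection details and a hypothetical table name.
with sql.connect(
    server_hostname="<workspace-name>.cloud.databricks.com",
    http_path="<http-path>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("DESCRIBE TABLE EXTENDED my_schema.my_table")
        # Scan the extended description for the 'Location' row, which carries
        # the table's storage path in its data_type column.
        for col_name, data_type, _comment in cursor.fetchall():
            if col_name == "Location":
                print("Table location:", data_type)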
Additional Information
Read the detailed Hevo documentation for the following related topics:
Destination Considerations
None.
Limitations
- Hevo currently does not support Databricks as a Destination in the US-GCP region.
See Also
Revision History
Refer to the following table for the list of key updates made to this page:
Date | Release | Description of Change |
---|---|---|
Apr-25-2023 | 2.12 | Updated section, Connect Using the Databricks Partner Connect (Recommended Method) to add information that you must specify all fields to create a Pipeline. |
Nov-23-2022 | 2.02 | - Added section, Connect Using the Databricks Partner Connect to mention about Databricks Partner Connect integration. - Updated screenshots in the page to reflect the latest Databricks UI. |
Oct-17-2022 | NA | Updated section, Limitations to add limitation regarding Hevo not supporting Databricks on Google Cloud. |
Jan-03-2022 | 1.79 | New document. |