Types of Data Synchronization

Hevo categorizes data synchronization by the timeline of the data that it ingests: historical data, incremental data, and data refresh.

Historical Data

A historical data load is the one-time initial load of the data that already existed in the Source before the Pipeline was created.

All Sources in Hevo, except those that depend purely on webhooks, support historical data load. How far back in time Hevo can replicate data may depend on the constraints imposed by the Source. Read the Source-specific articles for more information on individual Sources.

Enabling Historical Load for a Pipeline

Historical data is ingested the first time you run the Pipeline. Some Sources let you select or deselect the Load Historical Data Advanced Setting during Source configuration to specify whether historical data must be ingested.

Note: Hevo does not allow you to choose the tables/collections for which the historical load is performed. To disable the historical load for certain tables/collections, you can skip the individual Historical Load jobs from the Pipeline Overview screen.

Ingesting Historical Data

Hevo ingests the historical data as follows:

  1. A Historical Load job is created for each table/collection in the database.

  2. The job starts replicating the data from the beginning and performs one-time ingestion for all the records in the table/collection.

  3. Once all the data has been ingested, the job is marked with status Historical Load Finished and is never run again unless restarted.

A historical load job uses a Primary Key or a Unique Key to replicate data from the tables/collections. Wherever the database provides information about these keys, Hevo picks it up automatically. In other cases, Hevo asks you to provide a set of unique columns for each table/collection during the Pipeline creation process.
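The key-based replication described above can be sketched as follows. This is a minimal illustration and not Hevo's actual implementation: the function name `historical_load`, the in-memory list of rows, and the `batch_size` and `start_after` parameters are all hypothetical. It assumes the key values are unique and sortable, so the last key read can serve as the position from which to continue.

```python
from typing import Iterator, Optional


def historical_load(
    rows: list[dict],
    key: str,
    batch_size: int = 2,
    start_after: Optional[int] = None,
) -> Iterator[list[dict]]:
    """Yield batches of rows in key order, starting after `start_after`.

    Hypothetical sketch: `rows` stands in for a Source table, and `key`
    for its Primary Key or Unique Key column.
    """
    ordered = sorted(rows, key=lambda r: r[key])
    position = start_after
    while True:
        # Read the next batch of rows whose key is beyond the current position.
        remaining = [r for r in ordered if position is None or r[key] > position]
        batch = remaining[:batch_size]
        if not batch:
            return  # Every record has been ingested once.
        position = batch[-1][key]
        yield batch


table = [{"id": i} for i in (3, 1, 2, 5, 4)]
for batch in historical_load(table, key="id"):
    print([row["id"] for row in batch])
```

Because the position is just the last key read, passing it back as `start_after` continues the load without re-reading earlier records.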

Prioritization of Historical Load Jobs

Log Replication jobs (BinLog, WAL, OpLog, ChangeStream) have precedence over Historical Load jobs. Once the log-based replication completes, historical data ingestion starts.

To avoid overwriting updated data with older data, historical load and log replication never run in parallel. While log replication is running, all historical loads are put in the QUEUED status, and vice versa.

In Hevo, every Pipeline job has a maximum run time of 1 hour. Therefore, each historical load runs for up to an hour and then waits for the log replication to run before resuming. The ingestion always resumes from the position where the previous run stopped.
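The alternation between log replication and historical load can be pictured with a small simulation. This is a sketch under stated assumptions, not Hevo's scheduler: the `HistoricalJob` class and the `schedule` function are hypothetical, and the 1-hour run limit is modeled as a fixed number of records (`budget`) that one run can ingest.

```python
from dataclasses import dataclass


@dataclass
class HistoricalJob:
    """Hypothetical model of one Historical Load job."""
    total_records: int
    position: int = 0          # Where the previous run stopped.
    status: str = "QUEUED"

    def run(self, budget: int) -> None:
        # Resume from the saved position and ingest at most `budget` records.
        self.position = min(self.total_records, self.position + budget)
        if self.position == self.total_records:
            self.status = "HISTORICAL_LOAD_FINISHED"
        else:
            self.status = "QUEUED"  # Wait for log replication to run again.


def schedule(job: HistoricalJob, budget: int) -> list[str]:
    """Alternate log replication and historical load until the job finishes."""
    timeline = []
    while job.status != "HISTORICAL_LOAD_FINISHED":
        timeline.append("log_replication")  # Log replication goes first.
        job.run(budget)
        timeline.append(f"historical_load@{job.position}")
    return timeline


print(schedule(HistoricalJob(total_records=250), budget=100))
```

The two job types never overlap in the timeline, and each historical run picks up at the position recorded by the one before it.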

Incremental Data

Incremental data is the changed data that is fetched continuously. Examples include log-based jobs for databases, daily synchronization jobs for SaaS Sources, and webhook-based jobs.

Incremental load brings in only the data that is new or modified in the Source. After the initial load, Hevo loads most of the objects using incremental updates. Hevo uses a variety of mechanisms to capture the changes in the Source data, depending on how the Source provides these changes. During incremental load, Hevo maintains an internal position, which lets it track the exact point where the last successful load stopped.

Incremental load is efficient because it updates only the changed data instead of re-ingesting all the data for the objects.
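As a rough illustration of position-based incremental loading, the sketch below copies only the rows changed since the last saved position. It is not how Hevo is implemented: the function `incremental_load`, the `updated_at` column used as the position, and the dictionary standing in for the Destination are all assumptions made for the example. Real Sources may expose changes through logs, timestamps, or API cursors.

```python
def incremental_load(source_rows, destination, position):
    """Copy rows changed after `position`; return (new_position, rows_loaded).

    Hypothetical sketch: `updated_at` stands in for whatever change marker
    the Source provides, and `destination` is a dict keyed by row id.
    """
    changed = sorted(
        (row for row in source_rows if row["updated_at"] > position),
        key=lambda row: row["updated_at"],
    )
    for row in changed:
        destination[row["id"]] = dict(row)  # Insert or overwrite by key.
        position = row["updated_at"]        # Advance the saved position.
    return position, len(changed)


source = [
    {"id": 1, "updated_at": 1, "name": "a"},
    {"id": 2, "updated_at": 2, "name": "b"},
]
destination = {}
position, loaded = incremental_load(source, destination, position=0)

# Only the row modified after the saved position is picked up next time.
source[0].update(name="a2", updated_at=3)
position, loaded = incremental_load(source, destination, position)
```

On the second call, a single row is loaded even though the Source still holds two rows.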

Data Refresh

Data refresh is important for marketing-oriented Sources, such as Marketo, Facebook, and LinkedIn, that use conversion attribution windows to track a “conversion” (a purchase, signup, or any other action) and attribute it to a click on an ad or a post.

The attribution window is the number of days between a person clicking on your ad and subsequently taking an action on it, such as a purchase or sign-up.

To keep the data in the Destination up to date, Hevo re-ingests the data for a configurable period on every data refresh. The data refresh period depends on the Source and on the settings that you configure for it.

For example, in Microsoft Advertising, if a customer clicks on a link on Day 1 and later signs up on Day 10 (conversion), the metrics are updated in the report of Day 10 and not Day 1. Suppose the data refresh period is configured as 2 days. Then, the data refresh that occurs on Day 11 updates the records for Day 9 and Day 10, correctly attributing the user action to the click of Day 1.
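The arithmetic in this example can be written out as a short sketch. The function name `refresh_window` and the calendar dates are hypothetical. The only assumption carried over is the one the example makes: a refresh covers the configured number of days immediately before the day on which it runs.

```python
from datetime import date, timedelta


def refresh_window(run_date: date, refresh_days: int) -> list[date]:
    """Return the report dates re-ingested by a refresh on `run_date`.

    Hypothetical helper: covers the `refresh_days` days before `run_date`,
    oldest first, and excludes `run_date` itself.
    """
    return [
        run_date - timedelta(days=offset)
        for offset in range(refresh_days, 0, -1)
    ]


# Treat 2020-12-11 as "Day 11" with a 2-day refresh period.
for day in refresh_window(date(2020, 12, 11), refresh_days=2):
    print(day.isoformat())
```

With a 2-day period, the refresh on Day 11 covers Day 9 and Day 10, so the conversion recorded in the Day 10 report is picked up.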

Last updated on 22 Dec 2020