Types of Data Synchronization

Data synchronization in Hevo is categorized based on the timeline of the data that Hevo ingests.

Historical Data

Historical load is the one-time, initial load of the data that existed in the Source before the Pipeline was created.

All Sources in Hevo, except those that depend purely on webhooks, support loading of historical data. How far back in time Hevo can replicate data may depend on constraints imposed by the Source. Read Sources for more information.

Enabling historical load for a Pipeline

Historical data is ingested the first time you run the Pipeline. Some Sources provide the option to select or deselect the Load Historical Data Advanced Setting during Source configuration, to specify whether historical data must be ingested. If this option is deselected, Events older than the Pipeline creation date are not loaded.

Note: Hevo does not allow you to choose the tables or collections for which historical load is to be performed. To disable historical load for certain tables or collections, you can skip the individual objects in the Pipeline Overview page.

Ingesting historical data

To ingest the historical data:

  1. A Historical Load job is created for each table or collection in the database.

  2. The job starts replicating the data from the beginning and performs a one-time ingestion of all the records in the table or collection.

  3. Once all the data has been ingested, the job is marked with status Historical Load Finished and is never run again unless restarted.

Note: If you restart the historical load, all the re-ingested Events count towards your Events quota. Read Pipeline Frequency and Events Quota Consumption for more information.

A historical load job uses a Primary Key or a Unique Key to replicate data from the tables or collections. Wherever the database provides information about these keys, Hevo picks them up automatically. In other cases, Hevo asks you to provide a set of unique columns for each table or collection during the Pipeline creation process.
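The key-based replication described above can be sketched as keyset pagination: page through the table ordered by the key, remembering the last key seen, until no rows remain. This is only an illustrative sketch; the `fetch_page` helper and all names here are hypothetical, not Hevo's actual implementation.

```python
def historical_load(fetch_page, key_column, page_size=1000):
    """One-time ingestion of all records, paging by a Primary/Unique Key.

    `fetch_page` is a hypothetical helper that returns rows with
    key > `after`, ordered by the key, up to `limit` rows.
    """
    last_key = None
    events = []
    while True:
        page = fetch_page(key_column, after=last_key, limit=page_size)
        if not page:
            break  # all records ingested: Historical Load Finished
        events.extend(page)
        last_key = page[-1][key_column]  # resume point for the next page
    return events
```

Paging by a unique key (rather than by offset) keeps each page cheap to fetch and makes the job safely resumable from the last key it processed.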

Prioritization of historical load jobs

Log Replication (BinLog, WAL, OpLog, Change Tracking) has precedence over historical data loading. Once the log-based replication completes, historical data ingestion is started.

To avoid overwriting updated data with older data, historical load and log replication never occur in parallel. When the log replication is running, all historical loads are put in QUEUED status and vice versa.

In Hevo, every Pipeline job has a maximum run time of one hour. Therefore, a historical load runs for an hour and then waits for the log replication to run before resuming. The ingestion always resumes from the position where the last run stopped.
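The scheduling behavior above can be summarized in a small sketch: log replication runs first, historical loads wait in QUEUED status, run in time-boxed slices, and resume from a saved position. This is a simplified model under assumed names, not Hevo's actual scheduler.

```python
MAX_RUN_SECONDS = 3600  # every Pipeline job runs for at most one hour

def run_cycle(log_job, historical_jobs):
    """One scheduling cycle: log replication, then queued historical loads."""
    # Log replication has precedence; historical loads wait in QUEUED
    # so the two never run in parallel.
    for job in historical_jobs:
        job["status"] = "QUEUED"
    log_job["run"]()
    for job in historical_jobs:
        job["status"] = "RUNNING"
        # Run one time-boxed slice, continuing from the saved position.
        # The job's run() returns the new position, or None when done.
        job["position"] = job["run"](job["position"], MAX_RUN_SECONDS)
        job["status"] = ("HISTORICAL_LOAD_FINISHED"
                         if job["position"] is None else "QUEUED")
```

A finished job is never scheduled again unless restarted, matching the Historical Load Finished status described earlier.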

Incremental Data

Incremental data is the changed data that is fetched continuously. Examples include log-based jobs for databases, daily synchronization jobs for SaaS Sources, and Webhook-based jobs.

Incremental load updates only the new or modified data in the Source. After the initial load, Hevo loads most of the objects using incremental updates. Hevo uses a variety of mechanisms to capture the changes in the Source data, depending on how the Source provides these changes. During incremental load, Hevo maintains an internal position, which lets Hevo track the exact point where the last successful load stopped.

Incremental load is efficient as it updates only the changed data, instead of re-ingesting the entire data set for the objects.
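The internal position mentioned above can be modeled as a saved offset into the Source's change stream: each run fetches only changes after the position and then advances it. A minimal sketch, assuming a hypothetical `read_changes` helper that stands in for the Source's change feed:

```python
def incremental_sync(read_changes, position):
    """Fetch only the changes after `position`.

    Returns (events, new_position), where new_position records exactly
    where this run stopped, so the next run continues from there.
    """
    events = read_changes(after=position)
    if events:
        position = events[-1]["position"]
    return events, position
```

Because the position advances only after a successful fetch, a failed run simply retries from the last recorded position, and no change is skipped or double-counted.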

Data Refresh

Data refresh is important for marketing-oriented Sources, such as Marketo, Facebook, and LinkedIn, that use conversion attribution windows to track a “conversion” (a purchase, sign-up, or any other action) and attribute it to a click on an ad or a post.

An attribution window is the number of days between a person clicking on your ad and subsequently taking an action on it, such as a purchase or sign-up.

To keep the data in the Destination updated and fresh, Hevo re-ingests the data for a configurable period on every data refresh. The data refresh period is defined by the Source and your configuration settings within it.

For example, in Microsoft Advertising, if a customer clicks on a link on Day 1, and later signs up on Day 10 (conversion), then, the metrics are updated in the report of Day 10 and not Day 1. Suppose the data refresh period is configured as 2 days. Then, the data refresh that occurs on Day 11 would update the records for Day 9 and Day 10, correctly attributing the user action to the click of Day 1.
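The Microsoft Advertising example above reduces to a simple calculation: a refresh on a given day re-ingests the reports for the preceding refresh-period days. A hedged sketch (the function name is illustrative, not part of Hevo's API):

```python
from datetime import date, timedelta

def refresh_dates(run_date, refresh_period_days):
    """Report dates re-ingested by a data refresh on `run_date`.

    With a 2-day period, a refresh on Day 11 re-ingests the reports
    for Day 9 and Day 10, picking up late-arriving conversions.
    """
    return [run_date - timedelta(days=d)
            for d in range(refresh_period_days, 0, -1)]
```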

See Also

  - Pipeline Frequency and Events Quota Consumption

Revision History

Refer to the following table for the list of key updates made to this page:

Date         Release   Description of Change
Nov-09-2021  NA        - Added a note in the section, Historical Data, about Events quota consumption during a historical load restart.
                       - Added a See Also link to the Pipeline Frequency and Events Quota Consumption page.
Last updated on 19 Nov 2021