Types of Data Synchronization
Data Synchronization is categorized depending on the timeline of the data that Hevo ingests.
Historical Data
Historical load is the one-time initial load of data that the Source already had before the Pipeline was created.
All Sources in Hevo, except the ones purely dependent on webhooks, support loading of historical data. How far back in time Hevo can replicate data may depend on constraints imposed by the Source. Read Sources for more information.
Enabling historical load for a Pipeline
Historical data is ingested the first time you run the Pipeline. Some Sources provide the option to select or deselect the Load Historical Data advanced setting during Source configuration. If this option is deselected, Events older than the Pipeline creation date are not loaded.
Note: Hevo does not allow you to choose the tables or collections for which historical load is to be performed. To disable historical load for certain tables or collections, you can skip the individual objects in the Pipeline Overview page.
Historical data ingestion methods
Hevo uses three methods for ingesting historical data, depending on the Source:
Recent Data First: For many Sources, Hevo uses the Recent Data First method to ingest historical data starting from the most recent data to the earliest data. This method provides you with faster access to your most recent historical data.
The Recent Data First method is applicable for the following SaaS Sources:
Historical load parallelization: For many Sources, Hevo divides the historical data into multiple parts and ingests these parts simultaneously. This method provides faster access to your historical data.
The Historical load parallelization method is applicable for the following Sources:
Earliest Data First: For all other Sources, Hevo ingests historical data starting from the earliest data to the most recent data.
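Conceptually, the parallelized method splits the historical key or time range into parts and ingests them concurrently, while Recent Data First walks the range from newest to oldest. A minimal sketch with hypothetical helper names, not Hevo's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def split_range(start, end, parts):
    """Divide a [start, end) key range into roughly equal parts."""
    step = (end - start + parts - 1) // parts
    return [(lo, min(lo + step, end)) for lo in range(start, end, step)]

def ingest_parallel(fetch, start, end, parts=4):
    """Historical load parallelization: ingest all parts concurrently.
    Recent Data First would instead visit the parts from newest to oldest."""
    with ThreadPoolExecutor(max_workers=parts) as pool:
        chunks = pool.map(lambda r: fetch(*r), split_range(start, end, parts))
    return [row for chunk in chunks for row in chunk]
```

Here `fetch` stands in for whatever Source-specific call retrieves rows for a sub-range; the parts together cover the full historical range exactly once.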
Ingesting historical data
Hevo ingests your historical data using the following steps:
A Historical load task is created for each table or collection in the database, or object in the Source.
Hevo starts ingesting historical data using either the Recent Data First or Earliest Data First method and performs one-time ingestion for all the Events in the Source. For a few Sources such as LinkedIn Ads and Instagram Business, Hevo allows you to specify the historical sync duration while setting up the Source in Hevo. Refer to the respective Source document for more details.
Once all the data is ingested, Hevo displays the status Historical Load Ingested, and the historical load is never run again unless restarted.
In a historical load, wherever primary keys are defined in the Source, Hevo uses these primary keys to replicate data to the Destination. In other cases, Hevo asks you to provide a set of unique columns for each table or collection during the Pipeline creation process.
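Replication keyed on a primary key (or on the unique columns you provide) behaves like an upsert: a row whose key already exists in the Destination is overwritten rather than duplicated. A minimal dictionary-based sketch, not Hevo's implementation:

```python
def upsert(destination, rows, key_columns):
    """Merge rows into the destination keyed on the given columns:
    new keys are inserted, existing keys are overwritten."""
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        destination[key] = row
    return destination
```

Loading `{"id": 1, "name": "a"}` and later `{"id": 1, "name": "a2"}` with `key_columns=["id"]` leaves a single row holding the latest values.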
Prioritization of historical data loads
Log replication (BinLog, WAL, OpLog, Change Tracking) takes precedence over historical data loading. Once the log-based replication completes, historical data ingestion is started.
To avoid overwriting updated data with older data, historical load and log replication never occur in parallel. When the log replication is running, all historical loads are put in QUEUED status and vice versa.
In Hevo, every Pipeline job has a maximum run time of one hour. Therefore, a historical load runs for up to an hour and then waits for the log replication to run before resuming. The ingestion always resumes from the position where the last run stopped.
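The interplay of the one-hour run cap, the QUEUED status, and position-based resumption can be sketched as follows (hypothetical names; Hevo's internal scheduler is not public):

```python
MAX_RUN_SECONDS = 3600  # every Pipeline job runs for at most one hour

class Task:
    """A resumable ingestion task that remembers its position across runs."""
    def __init__(self, total_work):
        self.total, self.position, self.status = total_work, 0, "QUEUED"

    @property
    def done(self):
        return self.position >= self.total

    def run(self, budget):
        # advance the saved position by at most `budget` units of work
        self.position = min(self.total, self.position + budget)

def run_cycle(log_task, historical_tasks, budget=MAX_RUN_SECONDS):
    """Log replication takes precedence; historical loads stay QUEUED
    until it completes, then resume from where the last run stopped."""
    log_task.run(budget)
    if not log_task.done:
        return  # historical loads remain QUEUED, never run in parallel
    for task in historical_tasks:
        if not task.done:
            task.status = "RUNNING"
            task.run(budget)
            task.status = "DONE" if task.done else "QUEUED"
```

A historical load larger than one run's budget simply ends the cycle QUEUED and picks up from its saved position in the next cycle, which is why older historical data can never overwrite newer log-replicated data.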
Incremental Data
Incremental data is the changed data that is fetched continuously, for example, through log-based jobs for databases, daily synchronization jobs for SaaS Sources, or Webhook-based jobs.
Incremental load updates only the new or modified data in the Source. After the initial load, Hevo loads most of the objects using incremental updates. Hevo uses a variety of mechanisms to capture the changes in the Source data, depending on how the Source provides these changes. During incremental load, Hevo maintains an internal position, which lets Hevo track the exact point where the last successful load stopped.
Incremental load is efficient for the user as it updates only the changed data, instead of re-ingesting the entire data for the objects.
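The internal position can be pictured as a cursor, for example a log offset or an updated-at timestamp, that is persisted after every successful load. A minimal sketch, not Hevo's actual mechanism:

```python
def incremental_load(source_rows, position):
    """Fetch only rows changed after the saved position, then advance it."""
    changed = [r for r in source_rows if r["updated_at"] > position]
    new_position = max((r["updated_at"] for r in changed), default=position)
    return changed, new_position
```

Calling this again with the advanced position returns nothing until the Source data changes, which is what makes incremental load cheaper than re-ingesting every object.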
Data Refresh
Data refresh refers to the process of re-ingesting data from the Source and loading it into the Destination again to keep the data up to date. The data refresh period is usually defined at the Source (for example, the past 30 days in the case of PostgreSQL); however, for a few Sources, you can define this setting in Hevo. Hevo performs the data refresh task on every run of the Pipeline.
Data refresh is important in marketing-oriented Sources such as Marketo, Facebook, and LinkedIn. Such Sources use the conversion or attribution window to track the conversion (purchase or sign-up or any other user action) and attribute it to a click on an ad or a post.
A conversion or attribution window is the number of days within which a person who clicks on your ad subsequently takes an action, such as a purchase or sign-up.
For example, let us assume a prospect clicks a LinkedIn product ad on Day 1 and converts or signs up for the product on Day 10. The conversion is attributed to the click Event of Day 1, therefore, that record is updated with the attribution information and the Day 10 timestamp. Now, suppose the data refresh period is 2 days. Then, the data refresh on Day 11 picks up all the records having the timestamp of the past two days. Therefore, the modified record of Day 1 carrying the attribution information also gets picked up and loaded to the Destination, thereby capturing the conversion and the attribution correctly.
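The example above reduces to a window filter over last-modified timestamps: every run re-ingests all records modified within the refresh period. A minimal sketch with hypothetical field names:

```python
def refresh_window(records, today, refresh_days):
    """Select records whose last-modified day falls within the refresh
    period, so updated attributions are re-ingested on the next run."""
    cutoff = today - refresh_days
    return [r for r in records if r["modified_day"] >= cutoff]
```

With a Day-1 click updated on Day 10 and a 2-day refresh period, the Day-11 run picks the record up because its last-modified day (10) falls inside the window, even though the click itself is outside it.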
Note: After each data refresh, the number of ingested Events displayed in the Pipeline Activity section could be greater than the actual number of Events in your Source. This is because the ingestion count also includes Events that are re-ingested due to changes in the Source data, even though the number of records or rows in the Source has not changed. However, for Google Sheets, the entire data set (changed and unchanged) is re-ingested on each run of the Pipeline, as the Google Sheets API provides no way to identify just the changed data.
Refer to the following table for the list of key updates made to this page:
| Date | Release | Description of Change |
| --- | --- | --- |
| Oct-17-2022 | NA | - Updated section, Historical data ingestion methods to add information about historical load parallelization.<br>- Updated section, Ingesting historical data to add information about historical sync duration. |
| May-24-2022 | NA | Updated the sub-section, Historical data ingestion methods to:<br>- Add Pendo to the list of Sources.<br>- Organize the content as per the SaaS Source category. |
| Mar-07-2022 | NA | Updated section, Historical Data to add information about different methods used for ingesting historical data. |
| Feb-21-2022 | 1.82 | - Updated section, Data Refresh with a note about the number of Events.<br>- Removed the note about Events quota consumption during a historical load restart from the Historical Data section as all historical data ingestion is free. |
| Nov-09-2021 | NA | - Added a note in the section, Historical Data about Events quota consumption during a historical load restart.<br>- Added a See Also link to the Pipeline Frequency and Events Quota Consumption page. |