Historical or Initial Loads in Hevo

A Historical Load is the one-time initial load of the data that already existed in the source before the Pipeline was created.

All sources in Hevo, except those that depend purely on webhooks, support the Historical Load of data. How far back in time Hevo can replicate data may depend on constraints imposed by the source. Please read the source-specific guides for more information on individual sources.

Hevo creates jobs in a Pipeline to replicate data from a Source. In most cases, each of these jobs also performs the historical load before it starts real-time incremental replication.

In a few sources, however, the Historical Load jobs are separate from the jobs that are responsible for real-time incremental replication. These sources are usually databases from which Hevo replicates data by reading a write-ahead database log.

Sources that have different jobs for Historical Loads

The following sources, in the Pipeline modes listed, use different jobs for Historical Loads:

Source   | Pipeline Mode
MySQL    | BinLog
Postgres | Logical Replication (WAL)
MongoDB  | OpLog or Change Streams
Oracle   | RedoLog

How Historical Load Jobs Work

  • A Historical Load job is created for each table/collection in the database.
  • The job starts replicating the data from the beginning and performs a one-time ingestion of all the records in the table/collection.
  • Once all the data has been ingested, the job is marked with the status Historical Load Finished and does not run again unless it is restarted.
  • A Historical Load job uses a Primary Key or a Unique Key to replicate data from the tables/collections. Wherever the database provides information about these keys, Hevo picks them up automatically. In other cases, you will be asked to provide a set of unique columns for each table/collection during the Pipeline creation process.
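The role of the key can be illustrated with a small sketch. This is not Hevo's implementation, which is not described here; it assumes a key-ordered, batched read (often called keyset pagination) as one plausible way a Primary Key or Unique Key lets a job ingest every record exactly once. The function name historical_load, the orders table, and the batch size are all hypothetical, and SQLite stands in for the source database.

```python
import sqlite3


def historical_load(conn, table, key_column, batch_size=2):
    """Yield every row of `table` once, in key order, one batch at a time.

    Hypothetical sketch: the key column gives a stable ordering, so each
    batch can start strictly after the last key already read.
    """
    last_key = None
    while True:
        if last_key is None:
            query = f"SELECT * FROM {table} ORDER BY {key_column} LIMIT ?"
            params = (batch_size,)
        else:
            query = (
                f"SELECT * FROM {table} WHERE {key_column} > ? "
                f"ORDER BY {key_column} LIMIT ?"
            )
            params = (last_key, batch_size)
        rows = conn.execute(query, params).fetchall()
        if not rows:
            return  # nothing left: the one-time load is complete
        yield from rows
        last_key = rows[-1][key_column]  # remember where this batch ended


# Stand-in source database with a primary key on `id`.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows access to columns by name
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(3, "c"), (1, "a"), (2, "b"), (5, "e"), (4, "d")],
)

loaded_ids = [row["id"] for row in historical_load(conn, "orders", "id")]
print(loaded_ids)  # [1, 2, 3, 4, 5]
```

Without a unique key there is no reliable "last record read" to continue from, which is one reason a set of unique columns would be needed when the database does not expose one.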

Enabling Historical Load for a Pipeline

Hevo provides the option to enable Historical Load during the creation of the Pipeline, on the Source Configuration screen under Advanced Settings. When this option is enabled, the historical load is performed for all the tables/collections.

Hevo does not allow you to choose the tables/collections for which the historical load is to be performed. To disable the historical load for certain tables/collections, you can skip the individual Historical Load jobs from the Pipeline Overview screen.

Technical Details on Historical Load Jobs

  • Log Replication jobs (BinLog, WAL, OpLog, ChangeStream) take precedence over Historical Load jobs. Once the log replication job finishes its poll, the historical load jobs start polling the data.
  • To avoid overwriting updated data with older data, historical load jobs and log replication jobs are never run in parallel. When the log replication job is running, all historical load jobs are put in QUEUED status, and vice versa.
  • In Hevo, every pipeline job has a maximum runtime of 1 hour. Therefore, a historical load job runs for up to an hour and then waits for the log replication job to run before running again. Each run resumes from the position where the previous run stopped.
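The alternation described above can be sketched as a small simulation. It is illustrative only and does not reflect Hevo's internal scheduler: the class HistoricalJob, the function run_cycle, the throughput figure, and the status strings are hypothetical, and historical jobs are simulated one after another for simplicity. The only values taken from the text are the 1-hour runtime cap, the precedence of the log replication job, and resumption from the last position.

```python
from dataclasses import dataclass

MAX_RUNTIME_MINUTES = 60  # per-run cap stated in the text
QUEUED = "QUEUED"
FINISHED = "HISTORICAL_LOAD_FINISHED"


@dataclass
class HistoricalJob:
    table: str
    total_rows: int
    rows_per_minute: int  # hypothetical throughput, for the simulation only
    position: int = 0     # where the previous run stopped
    status: str = QUEUED

    def run(self):
        """Ingest for at most MAX_RUNTIME_MINUTES, resuming from `position`."""
        budget = self.rows_per_minute * MAX_RUNTIME_MINUTES
        self.position = min(self.total_rows, self.position + budget)
        # An unfinished job goes back to the queue until the next cycle.
        self.status = FINISHED if self.position == self.total_rows else QUEUED


def run_cycle(jobs, timeline):
    """One cycle: the log replication poll runs first, then historical jobs.

    The two kinds of job never overlap, so older historical rows cannot
    overwrite newer changes read from the log.
    """
    timeline.append("log_replication")
    for job in jobs:
        if job.status != FINISHED:
            job.run()
            timeline.append(f"historical:{job.table}@{job.position}")


jobs = [HistoricalJob("orders", total_rows=250, rows_per_minute=2)]
timeline = []
cycles = 0
while any(job.status != FINISHED for job in jobs):
    run_cycle(jobs, timeline)
    cycles += 1

print(cycles)    # 3
print(timeline)
```

With these made-up numbers the job reads 120 rows per run, so it needs three runs (120, 240, then 250 rows), each preceded by a log replication poll.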

Important Statuses for Historical Load Jobs

Status                   | Description
Bootstrapping            | This is the first run of the job and it is actively ingesting historical data from the source.
Historical Load Finished | Historical load has finished. You can run the historical load again by restarting the job.
Queued                   | The job is ready to ingest events and will begin as soon as the resources are available.
Skipped                  | The job has been skipped and will not ingest events until it is included again in the pipeline.