Loading Data to a Data Warehouse

The data ingested from the Source is loaded to the Destination data warehouse at each run of your Pipeline. By default, Hevo maintains in the Destination tables any primary keys defined in the Source data.

You can load both types of data:

  • Data without primary keys

  • Data with primary keys

Data without Primary Keys

If primary keys are not present in the Destination tables, Hevo directly appends the data to the target tables. While this can result in duplicate Events in the Destination, the data loading process itself adds no resource overhead.
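For illustration, an append-only load reduces to a plain INSERT … SELECT. The following is a minimal sketch, assuming a hypothetical staging table stage_events and target table events (these names are not part of Hevo's actual implementation):

```sql
-- Append-only load: no key lookups and no deduplication queries,
-- so the only cost is the insert itself. If the same Event is
-- ingested twice, both copies end up in the target table.
INSERT INTO events (event_id, payload, __hevo_ingested_at, __hevo_loaded_at)
SELECT event_id, payload, __hevo_ingested_at, CURRENT_TIMESTAMP
FROM stage_events;
```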

Data with Primary Keys

If primary keys are present in the Source data but are not enforceable on the Destination data warehouse, as is the case with Google BigQuery, Amazon Redshift, and Snowflake, then uniqueness of the data cannot be ensured by default. Hevo circumvents this lack of primary key enforcement and guarantees that no duplicate data is loaded to, or exists in, the Destination tables by:

  • Adding temporary Hevo-internal meta columns to the tables to identify eligible Events,

  • Using specific queries to cleanse the data of any duplicate and stale Events (a sketch follows the note below), and

  • Adding metadata to each Event to uniquely identify its ingestion and loading times.

Note: These steps utilize your Destination system’s resources: CPU for running the queries and additional storage for the duration of the data processing.
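As a minimal sketch of the cleansing step, assuming a hypothetical staging table stage_orders with logical key order_id (the actual Hevo-internal queries are warehouse-specific and differ from this), duplicates can be removed by keeping only the most recent Event per key:

```sql
-- Keep only the newest staged Event per logical key; older duplicates
-- and stale replays are dropped before loading. QUALIFY is supported
-- by Snowflake and BigQuery; other warehouses need a subquery instead.
CREATE OR REPLACE TABLE stage_orders_deduped AS
SELECT *
FROM stage_orders
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY order_id
    ORDER BY __hevo_ingested_at DESC
) = 1;
```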

Additions to the Destination Schema

Irrespective of the type of data, Hevo adds the following columns to the data warehouse tables as part of the data loading process:

  • __hevo_ingested_at: A timestamp applied to each Event during ingestion. It helps verify that the ingested Event is more current than what already exists in the Destination. For example, by the time a failed Event is resolved and replayed, a more recent Event may already have been loaded to the Destination; by comparing the ingestion timestamps, Hevo discards the stale record from the ingested data. This timestamp is also retained in the Destination table.

  • __hevo_loaded_at: A timestamp indicating when data was inserted, updated, or deleted (delete flag updated) in the Destination table. The difference between __hevo_ingested_at and __hevo_loaded_at measures the total time taken by Hevo to process the respective Event, and can be used to identify latency.

  • __hevo__marked_deleted: A column that logically represents a deleted Event. When an Event is deleted in the Source, it is not physically deleted from the Destination table during the data loading process. Instead, the logical column __hevo__marked_deleted is set to True for it.
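For example, a hypothetical query against a table named orders shows how these columns are typically used: to exclude logically deleted Events and to measure per-Event load latency:

```sql
-- Exclude logically deleted Events and compute load latency per Event.
-- DATEDIFF is shown in Snowflake form; Redshift uses DATEDIFF(second, ...)
-- and BigQuery uses TIMESTAMP_DIFF(..., SECOND).
SELECT
    order_id,
    DATEDIFF('second', __hevo_ingested_at, __hevo_loaded_at) AS load_latency_seconds
FROM orders
WHERE NOT __hevo__marked_deleted;
```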

Note: Hevo also adds the internal columns __he__msg_seq_id and __hevo__consumption_id to the ingested data to help with the deduplication process; these columns are removed before the final step of loading data into the Destination tables.

Atomicity of Data Loaded to the Destination

  • Amazon Redshift: Since the data loading steps of updating, inserting, and deleting Events use distinct queries, any system consuming data from the Destination table may see changes while Hevo is applying the updates. Further, an update is a two-step process: the stale Destination record is first deleted, and the more current record is then inserted. If the Destination is queried between these two steps, the results may momentarily be inconsistent (see the sketch after this list).

  • Google BigQuery and Snowflake: The MERGE SQL statement is used, which handles all three operations of deleting, inserting, and updating Events through a single query. This makes the operation faster and atomic, giving you a consistent view of the data.
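The contrast can be sketched with hypothetical tables orders (Destination) and stage_orders (staging), keyed on order_id with an illustrative payload column amount; the actual queries Hevo runs differ per warehouse:

```sql
-- Redshift-style two-step update: between the DELETE and the INSERT,
-- a concurrent reader may briefly see the affected rows missing.
DELETE FROM orders
USING stage_orders
WHERE orders.order_id = stage_orders.order_id;

INSERT INTO orders
SELECT * FROM stage_orders;

-- BigQuery/Snowflake-style MERGE: one atomic statement covers updates,
-- inserts, and logical deletes, so readers always see a consistent view.
MERGE INTO orders AS t
USING stage_orders AS s
    ON t.order_id = s.order_id
WHEN MATCHED AND s.__hevo_ingested_at > t.__hevo_ingested_at THEN UPDATE SET
    amount                 = s.amount,
    __hevo_ingested_at     = s.__hevo_ingested_at,
    __hevo_loaded_at       = CURRENT_TIMESTAMP,
    __hevo__marked_deleted = s.__hevo__marked_deleted
WHEN NOT MATCHED THEN INSERT
    (order_id, amount, __hevo_ingested_at, __hevo_loaded_at, __hevo__marked_deleted)
    VALUES (s.order_id, s.amount, s.__hevo_ingested_at, CURRENT_TIMESTAMP,
            s.__hevo__marked_deleted);
```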

Data Loading Process

The following diagram illustrates the typical process followed by Hevo to load your data to a data warehouse.

[Illustration of steps for loading data into a data warehouse]

Refer to the data loading process of each data warehouse to understand how the eligible records to be loaded are determined and to view the queries run at each step.


See Also