Data Flow in a Pipeline

Hevo Pipelines extract data from a Source, transform or cleanse the data, and then, load it to a Destination. In this process, the data goes through the following stages:

Data Flow Architecture.png

Stage 1: Establishing a connection

Hevo initiates the replication process by establishing a connection with the Source successfully. Based on the nature of the Source, Hevo either uses REST API or appropriate drivers to establish the connection.

Stage 2: Ingestion

Tasks to ingest the data are created for each Source object. This stage is referred to as Ingestion. In the case of databases with log-based replication, such as MySQL and PostgreSQL, a task is created to sync the historical load and another task to Capture Data Change (CDC). Read Types of Data Synchronization to understand how Hevo performs this task. From here, data is streamed across other stages discussed below. You can adjust the ingestion rate by adjusting the ingestion frequency from the UI, based on the Source.

Stage 3: Transformations

You can transform the data that is streaming before loading it to the Destination. You can write Python scripts or use drag-and-drop Transformation blocks to do the same. After the Transformation script is deployed, every Event (Row) that gets ingested goes through the transformation. Read Transformations.

Stage 4: Schema mapping

Hevo ingests your data from Sources, applies Transformations on it, and brings it to the Schema Mapper. In the Schema Mapper, you can define how the data is mapped in the Destination. You can map the tables and columns, omit columns, or skip loading the data to a table in the Destination. Hevo can automatically create and manage all these mappings for you if you enable Auto Mapping for a Pipeline. Based on the configurations applied in this stage, the data from the Source is mapped to the data in the Destination.

Stage 5: Staging in GCS/S3 bucket

In the case of JDBC Destinations, such as MySQL, PostgreSQL, and MS SQL Server, the data is directly loaded to the Destination tables from the Schema Mapper phase.

For data warehouse Destinations, after being mapped, the data is stored in a temporary staging area as CSV files. This staging area is a GCS bucket in the case of a Google BigQuery Destination and an AWS S3 bucket in the case of other data warehouse Destinations, such as AWS Redshift, Snowflake, Databricks, and Firebolt. The data is loaded to the Destination tables from the staging area and then flushed or cleared from the staging area.

Stage 6: Loading

  • Loading to AWS Redshift, Snowflake, Databricks, and Firebolt: Data is loaded to Redshift, Databricks, Firebolt, or Snowflake from the S3 bucket in batches. Data is deduplicated using the value of the primary key and copied to an intermediate S3 staging table from where it is loaded to the Destination table. Read Loading Data to a Data Warehouse for details about the steps to load data to the Destination tables from an S3 bucket.

  • Loading to Google BigQuery: Data is loaded to BigQuery from the GCS bucket in batches. Data is deduplicated and loaded to an intermediate BigQuery staging table before loading it to the Destination table. The Merge SQL statement is used, which handles all three operations of deletions, insertions, and updates through a single query. This makes the operation faster and atomic, and you have a consistent view of the data. Read Loading Data to a Google BigQuery Data Warehouse for details about the steps followed to load data to BigQuery from a GCS bucket.

  • Loading to PostgreSQL, MS SQL Server, MySQL: Data is replicated from the Schema Mapper based on the primary keys defined in the Destination table. Here, data is not staged in S3/GCS buckets and is bulk inserted directly to the Destination. As part of loading the data, deduplication is done to ensure that unique records are replicated and duplicates are dropped. This deduplication is done using the value of the primary key. It is recommended not to use these databases as a Destination if the volume of data is high as they do not perform well while ingesting a huge volume of data. Read Loading Data in a Database Destination for details about how data is loaded to the database.


Revision History

Refer to the following table for the list of key updates made to this page:

Date Release Description of Change
Jul-27-2022 NA Updated the content in Stage 5: Staging in GCS/S3 bucket for clarity and coherence.
Mar-21-2022 NA New document.
Last updated on 27 Jul 2022

Tell us what went wrong