Loading Data to a Google BigQuery Data Warehouse

This section describes the queries for loading data into a Google BigQuery data warehouse. It assumes that you are familiar with Hevo’s process for Loading Data to a Data Warehouse.

Note: The queries listed here have been simplified to facilitate understanding of the overall data deduplication and loading process.

Loading Data without Primary Keys

If primary keys are not present in the Destination tables, Hevo appends the data directly to the target tables in the Destination data warehouse using the following steps:

  1. Apply the __hevo__ingested_at timestamp to each Event at the time of ingestion from the Source. This column is also retained in the Destination table.

  2. Ensure that the __hevo__ingested_at and, if required, __hevo__loaded_at columns are present in the Destination table; create them if they are not.

  3. Copy files into the target table.

   load <gs://path_to/file> to <dataset>.<target_table>
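The append-only flow above can be sketched in Python using plain in-memory rows. This is an illustrative simulation, not Hevo's implementation; the function name and row shapes are assumptions made for the example.

```python
from datetime import datetime, timezone

def append_events(target_table, events):
    """Simulate the append-only load: stamp each Event with the
    __hevo__ingested_at and __hevo__loaded_at metadata timestamps,
    then append it to the target table without any deduplication."""
    for event in events:
        # In Hevo, __hevo__ingested_at is set at ingestion time; keep
        # an existing value if the event already carries one.
        event.setdefault("__hevo__ingested_at", datetime.now(timezone.utc))
        event["__hevo__loaded_at"] = datetime.now(timezone.utc)
        target_table.append(event)
    return target_table

# Usage: without primary keys there is nothing to deduplicate on,
# so repeated events land in the table as separate rows.
table = []
append_events(table, [{"id": 1, "name": "a"}, {"id": 1, "name": "a"}])
print(len(table))  # 2
```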

Loading Data with Primary Keys

If the Destination tables have primary keys, Hevo performs the following steps to deduplicate the data and load it to the data warehouse:

  1. Apply the __hevo__ingested_at timestamp to each Event at the time of ingestion from the Source. This column is also retained in the Destination table.

  2. Ensure that the __hevo__ingested_at and, if required, __hevo__loaded_at columns are present in the Destination table; create them if they are not.

  3. Create a temporary staging table with the same schema as the target table, plus the additional metadata columns. The ingested data is loaded into this table, and all the deduplication steps are performed on it. Read Creating and using tables in Google BigQuery.

  4. Copy files to the staging table.
      load <gs://path_to/file> to <dataset>.<stage_table>
    
  5. Delete duplicate data from the staging table on the basis of the __hevo__ingested_at, __he__msg_seq_id, and __hevo__consumption_id columns.

      DELETE FROM <dataset>.<stage_table>
        WHERE STRUCT(<PK1, PK2, ...PKn>, __hevo__ingested_at)
               NOT IN (SELECT AS STRUCT <PK1, PK2, ...PKn>, max(__hevo__ingested_at)
                         FROM <dataset>.<stage_table>
                       GROUP BY <PK1, PK2, ...PKn>)
    
      DELETE FROM <dataset>.<stage_table>
        WHERE STRUCT(<PK1, PK2, ...PKn>, __he__msg_seq_id)
              NOT IN (SELECT AS STRUCT <PK1, PK2, ...PKn>, max(__he__msg_seq_id)
                        FROM <dataset>.<stage_table>
                      GROUP BY <PK1, PK2, ...PKn>)
    
      DELETE FROM <dataset>.<stage_table>
        WHERE STRUCT(<PK1, PK2, ...PKn>, __hevo__consumption_id)
              NOT IN (SELECT AS STRUCT <PK1, PK2, ...PKn>, max(__hevo__consumption_id)
                       FROM <dataset>.<stage_table>
                     GROUP BY <PK1, PK2, ...PKn>)
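The combined effect of the three DELETE statements can be sketched in Python: per primary-key group, only the row that is maximal on __hevo__ingested_at, then __he__msg_seq_id, then __hevo__consumption_id survives. This is an illustrative simulation under that assumption; note that in SQL, rows tied on all three values would all survive, whereas this sketch keeps one.

```python
def deduplicate(stage_rows, pk_cols):
    """Keep, per primary-key group, the row that is maximal on the
    (__hevo__ingested_at, __he__msg_seq_id, __hevo__consumption_id)
    tuple, mirroring the three sequential DELETE statements."""
    order = ("__hevo__ingested_at", "__he__msg_seq_id", "__hevo__consumption_id")
    latest = {}
    for row in stage_rows:
        key = tuple(row[c] for c in pk_cols)
        rank = tuple(row[c] for c in order)
        if key not in latest or rank > tuple(latest[key][c] for c in order):
            latest[key] = row
    return list(latest.values())

# Usage: two versions of the same primary key; the later ingestion wins.
rows = [
    {"id": 1, "v": "old", "__hevo__ingested_at": 1,
     "__he__msg_seq_id": 1, "__hevo__consumption_id": 1},
    {"id": 1, "v": "new", "__hevo__ingested_at": 2,
     "__he__msg_seq_id": 1, "__hevo__consumption_id": 1},
]
print(deduplicate(rows, ["id"]))  # only the row with v == "new" remains
```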
    
  6. Load the data into the target table:

    Note: Google BigQuery uses a single MERGE statement to handle all three operations on the target table: deletions, insertions, and updates.

       MERGE INTO <dataset>.<target_table> AS T
         USING <dataset>.<stage_table> AS S
           ON T.PK1 = S.PK1 AND
              T.PK2 = S.PK2 AND
              ...
              T.PKn = S.PKn
             WHEN MATCHED AND
                  S.__hevo__ingested_at >= T.__hevo__ingested_at AND
                  S.__hevo__marked_deleted IS NOT NULL AND
                  S.__hevo__marked_deleted = true
               THEN UPDATE SET T.__hevo__marked_deleted = true,
                               T.__hevo__ingested_at = S.__hevo__ingested_at,
                               T.__hevo__loaded_at = <loaded_at>
             WHEN MATCHED AND
                  S.__hevo__ingested_at >= T.__hevo__ingested_at AND
                  (S.__hevo__marked_deleted IS NULL OR S.__hevo__marked_deleted = false)
               THEN UPDATE SET T.field_1 = S.field_1,
                               T.field_2 = S.field_2,
                               ...
                               T.field_n = S.field_n,
                               T.__hevo__loaded_at = <loaded_at>
             WHEN NOT MATCHED AND
                  (S.__hevo__marked_deleted IS NULL OR S.__hevo__marked_deleted = false)
               THEN INSERT (field_1,
                            field_2,
                            ...
                            field_n,
                            __hevo__loaded_at)
                      VALUES (S.field_1,
                              S.field_2,
                              ...
                              S.field_n,
                              <loaded_at>);
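The MERGE semantics above can be sketched as an in-memory simulation: a matched, newer, deleted staging row flags the target row as deleted; a matched, newer, non-deleted row updates the data fields; an unmatched, non-deleted row is inserted. This is an illustrative sketch, not Hevo's code, and the function and parameter names are assumptions.

```python
def merge(target, stage, pk_cols, loaded_at):
    """Apply the MERGE clauses to an in-memory target table keyed by
    primary key. Rows flagged __hevo__marked_deleted are kept in the
    target but marked deleted, matching the first WHEN clause."""
    index = {tuple(r[c] for c in pk_cols): r for r in target}
    for s in stage:
        key = tuple(s[c] for c in pk_cols)
        deleted = bool(s.get("__hevo__marked_deleted"))
        t = index.get(key)
        if t is not None and s["__hevo__ingested_at"] >= t["__hevo__ingested_at"]:
            if deleted:
                # WHEN MATCHED and deleted: only flag the row.
                t["__hevo__marked_deleted"] = True
                t["__hevo__ingested_at"] = s["__hevo__ingested_at"]
            else:
                # WHEN MATCHED and not deleted: update data fields only,
                # leaving the target's metadata columns as the SQL does.
                t.update({k: v for k, v in s.items() if not k.startswith("__he")})
            t["__hevo__loaded_at"] = loaded_at
        elif t is None and not deleted:
            # WHEN NOT MATCHED and not deleted: insert the new row.
            row = dict(s, __hevo__loaded_at=loaded_at)
            target.append(row)
            index[key] = row
    return target

# Usage: one update, one insert, and one delete event that never
# reaches the target because its key is not matched.
target = [{"id": 1, "v": "old", "__hevo__ingested_at": 1}]
stage = [
    {"id": 1, "v": "new", "__hevo__ingested_at": 2},
    {"id": 2, "v": "x", "__hevo__ingested_at": 2},
    {"id": 3, "v": "y", "__hevo__ingested_at": 2, "__hevo__marked_deleted": True},
]
merge(target, stage, ["id"], loaded_at=99)
print([r["id"] for r in target])  # [1, 2]
```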
    
  7. Drop the staging table.

This completes the loading of data to the Google BigQuery data warehouse.



Last updated on 08 Oct 2020