Google BigQuery

Last updated on Sep 03, 2024

Google BigQuery is a fully managed, serverless data warehouse that enables scalable analysis over large volumes of data. Hevo allows you to migrate multiple datasets and tables within a BigQuery project to any other data warehouse of your choice.

Organization of data in BigQuery

Google BigQuery uses Projects to store data. An organization can have multiple projects associated with it. However, each Pipeline can be associated with only one BigQuery project.

Within a project, the data tables are organized into units called datasets.

[Image: Data structure in BigQuery]
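
This hierarchy can be explored with the google-cloud-bigquery Python client. The following is a minimal sketch, assuming a hypothetical project ID and default application credentials; it lists each dataset in the project and the tables it contains:

```python
from google.cloud import bigquery

# "my-project" is a hypothetical project ID; replace it with your own.
client = bigquery.Client(project="my-project")

# A project contains datasets; each dataset groups the data tables.
for dataset in client.list_datasets():
    print(f"Dataset: {dataset.dataset_id}")
    for table in client.list_tables(dataset.dataset_id):
        print(f"  Table: {table.table_id}")
```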

Permissions

Hevo needs permission to access your data in BigQuery as well as in Google Cloud Storage (GCS). These permissions are assigned to the account you use to authenticate Hevo on BigQuery. Any files that Hevo writes to GCS are deleted as soon as they are moved to the next stage in the Pipeline. Read Google Account Authentication Methods for more information.

Data Replication Strategy

Hevo adopts one of the following strategies to replicate data from your Google BigQuery Source:

  • Direct Query: Hevo adopts this replication strategy to ingest data from non-partitioned tables in your dataset. This strategy is also used for partitioned tables if a GCS bucket is not specified at the time of creating the Pipeline. In this strategy, Hevo first scans the selected objects (tables) and then reads data from them. To identify incremental data, Hevo scans the entire table and computes the difference between the new and existing data.

    To avoid fetching data that has already been ingested, Hevo saves the ingested data in temporary tables, writing to them using streaming inserts. As streaming inserts are not available for GCP Free Tier accounts, billing must be enabled for your GCP project. A minimal sketch of a streaming insert is shown after this list.

  • GCS Export: Hevo adopts this replication strategy to ingest data from partitioned tables in your dataset if you have specified a GCS bucket at the time of creating the Pipeline. In this strategy, Hevo first ingests data from the partitions and then temporarily exports it to a bucket in your Google Cloud Storage. From there, the data is loaded into the Destination. Hevo maintains an offset to identify the latest partition, and data from that partition is ingested as incremental data. A sketch of a partition export appears after this list.

    Read Introduction to partitioned tables to understand how partitioning affects the data processing performance and costs in Google BigQuery.
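
The following is a minimal sketch of a streaming insert using the google-cloud-bigquery Python client, illustrating the write mechanism that the Direct Query strategy relies on. The project, table, and row contents are hypothetical; the insert fails unless billing is enabled on the project:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# Streaming inserts use the tabledata.insertAll API, which is unavailable
# on GCP Free Tier accounts; billing must be enabled on the project.
table_id = "my-project.my_dataset.my_table"     # hypothetical table
rows = [{"id": 1, "name": "alpha"}]             # rows matching the table schema

errors = client.insert_rows_json(table_id, rows)
if errors:
    print(f"Streaming insert failed: {errors}")
```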

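Similarly, a single partition of a day-partitioned table can be exported to GCS with an extract job, which approximates the export step of the GCS Export strategy. This is a sketch only; the partition decorator suffix ($YYYYMMDD), bucket, and table names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# A partition decorator ("table$YYYYMMDD") addresses one partition of a
# day-partitioned table; the extract job writes its rows to GCS files.
partition = "my-project.my_dataset.my_table$20240901"       # hypothetical
destination_uri = "gs://my-bucket/exports/my_table-*.avro"  # hypothetical

job_config = bigquery.ExtractJobConfig(destination_format="AVRO")
job = client.extract_table(partition, destination_uri, job_config=job_config)
job.result()  # block until the export completes
```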

Source Considerations

  • The Cloud Storage bucket into which Hevo temporarily exports your ingested data must exist in the same location as your BigQuery dataset, with the exception of datasets in the US multi-region. Read Location considerations for further details on the requirements.

Limitations

  • Updates in the BigQuery Source data are appended as new rows in the Destination; the existing rows are not modified. As a result, both the old and new versions of a row exist in the Destination. A hedged deduplication sketch is shown after this list.

  • Deleted data is not marked or removed in the Destination.

  • Hevo requests access to your data in Cloud Storage even if you do not specify a GCS bucket while configuring the Pipeline.
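
Because both old and new versions of updated rows accumulate, a common workaround is to deduplicate in the Destination. The snippet below is a sketch, not Hevo functionality: the key column id and the ordering column updated_at are hypothetical, and the BigQuery-flavored SQL must be adapted to your Destination warehouse's dialect.

```python
# A sketch of a deduplication query over a Destination table. The key column
# "id" and ordering column "updated_at" are hypothetical; adapt the SQL to
# your Destination warehouse's dialect before using it.
DEDUP_QUERY = """
SELECT * EXCEPT (row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS row_num
  FROM my_dataset.my_table  -- hypothetical Destination table
)
WHERE row_num = 1           -- keep only the latest version of each row
"""
```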



Revision History

Refer to the following table for the list of key updates made to this page:

Date         Release  Description of Change
Mar-05-2024  2.21     Updated the ingestion frequency table in the Data Replication section.
Feb-02-2024  NA       Updated section, Prerequisites to add information about the required permissions.
Jun-19-2023  NA       Updated sections, Data Replication Strategy and Prerequisites to add information about enabling billing to use streaming inserts.
Mar-09-2023  NA       Updated section, Configuring Google BigQuery as a Source to add a note about switching your authentication method post-Pipeline creation.
Dec-07-2022  2.03     Updated section, Configuring Google BigQuery as a Source to add information about support for service accounts.
Sep-13-2022  1.97     - Added the Data Replication Strategy subsection in the overview text to explain the different data ingestion strategies.
                      - Added the Source Considerations section.
                      - Updated the Configuring your Google BigQuery Source section to add the GCS bucket field description.
                      - Updated the Limitations section to inform about Hevo requesting access to data in GCS.
                      - Modified the content for historical and incremental data in the Data Replication section to describe the impact of providing a GCS bucket on data ingestion.
Mar-22-2022  NA       Updated information regarding Historical Data in the Data Replication section to remove the mention of historical sync duration.
Jul-12-2021  1.67     Added the field Include New Tables in the Pipeline under Source configuration settings.
Apr-20-2021  1.61     New document.
