Google BigQuery

Google BigQuery is a fully-managed, serverless data warehouse that supports Structured Query Language (SQL) to derive meaningful insights from the data. With Hevo Pipelines, you can ingest data from any Source system and load it to a Google BigQuery Destination.

Note: For a select range of Sources, Hevo provides a pre-configured and fully managed Google BigQuery data warehouse Destination. Read Hevo Managed Google BigQuery.

Organization of data in BigQuery

Google BigQuery is hosted on the Google Cloud Platform (GCP). GCP uses projects to organize resources. An organization can create multiple projects. Each project has a set of users, a set of APIs, and a billing account. Creating a project provides access to the BigQuery APIs, which allow you to store data in units called datasets, within the project. Datasets organize and control access to your BigQuery tables and views. Each dataset has a location associated with it that affects how BigQuery loads, queries, or exports data.
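
For illustration, the following sketch creates a dataset in a chosen location using the google-cloud-bigquery Python client. The project and dataset names used here are hypothetical placeholders, not values Hevo requires.

```python
# A minimal sketch using the google-cloud-bigquery Python client.
# "my-gcp-project" and "hevo_dataset" are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

# Datasets organize tables and views; each dataset has a fixed location
# that affects where BigQuery loads, queries, and exports data.
dataset = bigquery.Dataset("my-gcp-project.hevo_dataset")
dataset.location = "US"  # set at creation; cannot be changed later

dataset = client.create_dataset(dataset, exists_ok=True)
print(f"Created {dataset.full_dataset_id} in {dataset.location}")
```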

The Hevo Pipeline stages the Source data in Google Cloud Storage (GCS) buckets before loading it to your BigQuery Destination. For this, the location of the dataset and the GCS bucket must be the same, with the exception of the datasets in the US multi-region. Read Location considerations for further details about the requirements.
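
As a rough illustration of this requirement, the sketch below compares the locations of a dataset and a GCS bucket using the Python clients for BigQuery and Cloud Storage. The project, dataset, and bucket names are hypothetical placeholders.

```python
# A hypothetical check that the staging bucket and the dataset share a
# location, assuming the google-cloud-bigquery and google-cloud-storage
# Python clients are installed and authenticated.
from google.cloud import bigquery, storage

bq = bigquery.Client(project="my-gcp-project")   # hypothetical project ID
gcs = storage.Client(project="my-gcp-project")

dataset = bq.get_dataset("my-gcp-project.hevo_dataset")  # hypothetical dataset
bucket = gcs.get_bucket("hevo-staging-bucket")           # hypothetical bucket

# Apart from datasets in the US multi-region, the staging bucket and the
# dataset must be co-located for loads to succeed.
if dataset.location.upper() != bucket.location.upper():
    print(f"Mismatch: dataset in {dataset.location}, bucket in {bucket.location}")
```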

Data partitioning and clustering

The Hevo Pipeline loads the Source data using BigQuery’s Batch Load feature. To optimize the SQL queries that are run on the BigQuery Destination for deduplication and loading of data, you can create partitioned or clustered tables. Partitions make it easier to manage and query your data. Similarly, clustering organizes data based on the contents of one or more columns of the table and helps you colocate related data. Read the BigQuery documentation on partitioned and clustered tables to know when to use clustering over partitioning and vice versa.
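
For example, the following sketch creates a table that is both partitioned and clustered, using the google-cloud-bigquery Python client. The table, schema, and column names are hypothetical and only illustrate the feature.

```python
# A minimal sketch of a partitioned and clustered table; all names here
# ("orders", "customer_id", "created_at") are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

table = bigquery.Table(
    "my-gcp-project.hevo_dataset.orders",
    schema=[
        bigquery.SchemaField("order_id", "STRING"),
        bigquery.SchemaField("customer_id", "STRING"),
        bigquery.SchemaField("created_at", "TIMESTAMP"),
    ],
)

# Partition by day on the TIMESTAMP column so queries can prune partitions.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="created_at",
)

# Cluster within each partition to colocate related rows.
table.clustering_fields = ["customer_id"]

client.create_table(table, exists_ok=True)
```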

Permissions

To load data into BigQuery, Hevo needs permissions to read from and write to both BigQuery and GCS. These permissions are assigned to the account you use to authenticate Hevo on BigQuery. The files written to GCS are deleted after seven days to free up your storage. Read Google Account Authentication Methods for more information.
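
If you want to verify that the authenticating account has the required access before configuring the Destination, a rough smoke test such as the one below can help. It assumes the google-cloud-bigquery and google-cloud-storage Python clients; the project and bucket names are hypothetical placeholders.

```python
# A hypothetical smoke test that the authenticated account can read from
# BigQuery and write to the GCS staging bucket (names are placeholders).
from google.cloud import bigquery, storage

bq = bigquery.Client(project="my-gcp-project")
gcs = storage.Client(project="my-gcp-project")

# Read access to BigQuery: listing datasets should not raise an error.
print([d.dataset_id for d in bq.list_datasets()])

# Write access to the staging bucket: upload and delete a marker object.
blob = gcs.bucket("hevo-staging-bucket").blob("hevo_access_check.txt")
blob.upload_from_string("ok")
blob.delete()
```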

User Authentication

You can connect to the BigQuery Destination via User accounts (OAuth) or Service accounts. A single Service account can be mapped to your entire team. Read Google Account Authentication Methods to know how to set up a Service account if you do not already have one.
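
For example, if you use a Service account, a client can be constructed directly from its JSON key file with the google-cloud-bigquery Python client, as in this minimal sketch. The key file path and project ID are hypothetical placeholders.

```python
# A minimal sketch of authenticating with a Service account instead of a
# User account (OAuth); the key file path below is a hypothetical placeholder.
from google.cloud import bigquery

# The JSON key file is downloaded when the Service account key is created.
client = bigquery.Client.from_service_account_json(
    "/path/to/service-account-key.json",
    project="my-gcp-project",
)
print(client.project)
```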


Handling Source Data with Different Data Types

For teams created in or after Hevo Release 1.56, Hevo automatically modifies the data type of Google BigQuery table columns to accommodate Source data with different data types. Data type promotion is performed on tables whose size is less than 50 GB. Read Handling Different Data Types in Source Data.

Note: Your Hevo release version is mentioned at the bottom of the Asset Palette.


Handling JSON Fields

Read Parsing Nested JSON Fields in Events to know how Hevo parses JSON data and makes it available at the Destination. Because BigQuery has built-in support for nested and repeated columns, JSON data is neither split nor compressed. However, depending on the Source type, arrays may be collapsed into JSON strings or passed as-is to the Destination. Read JSON functions in Standard SQL to know how to parse JSON strings back into arrays.
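
For illustration, the following sketch runs such a Standard SQL query through the google-cloud-bigquery Python client, using JSON_EXTRACT_ARRAY and UNNEST to expand a JSON-string column back into rows. The table and column names are hypothetical placeholders.

```python
# A hedged example of parsing a JSON-string column back into array elements
# with BigQuery's Standard SQL JSON functions; "events", "payload", and
# "$.items" are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

# JSON_EXTRACT_ARRAY splits a JSON-encoded array string into ARRAY<STRING>
# elements, which UNNEST then flattens into one row per element.
sql = """
SELECT id, item
FROM `my-gcp-project.hevo_dataset.events`,
     UNNEST(JSON_EXTRACT_ARRAY(payload, '$.items')) AS item
"""
for row in client.query(sql).result():
    print(row.id, row.item)
```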

The fields of a JSON object are shown as a struct (for RECORD) or an array (mode: REPEATED) in the Schema Mapper. The nested fields are not editable.


Modifying BigQuery Destination Configuration

You can modify some of the settings after creating the BigQuery Destination. However, changing any configuration affects all the Pipelines that use this Destination. Refer to the section, Potential Impacts of Modifying the Configuration Settings, to know the impacts.

To modify the BigQuery Destination configuration:

  1. Navigate to the Destination Detailed View.

  2. Do one of the following:

    • Click the Settings icon next to the Destination name and then click the Edit icon.

    • Click the kebab menu on the right and click Edit.

  3. Update the required field values in the Edit Google BigQuery Destination connection settings page. You can modify the following settings:

    • Destination name

    • Project ID

    • Dataset ID

    • GCS Bucket

    • Populate Loaded Timestamp

    Read the following section to know the impact of making these changes.

  4. Click SAVE DESTINATION.

Potential impacts of modifying the configuration settings

  • Changing any configuration setting impacts all Pipelines using the Destination.

  • Changing the Project ID or Dataset ID may lead to data getting split between the old and the new Dataset, because the Events in the old Dataset are not copied to the new one. Further, new tables are created in the new Dataset for all the existing Pipelines that use this Destination.

  • Changing the Bucket ID may lead to Events getting lost, as the Events not yet copied from the old bucket are not copied to the Dataset.


Limitations

  • Hevo does not map arrays of type struct in the Source to arrays of type struct in the BigQuery Destination.

See Also


Revision History

Refer to the following table for the list of key updates made to this page:

Date Release Description of Change
Nov-08-2022 2.01 Updated screenshot in the Configure Google BigQuery as a Destination section to reflect the latest UI.
Oct-17-2022 NA - Added the Set up the Google Cloud Platform section to help set up GCP.
- Modified the Prerequisites to add the requirement of an existing GCP project.
Aug-24-2022 NA Updated section, Limitations to add the limitation regarding Hevo not mapping arrays of type struct in the Source to arrays of type struct in the BigQuery Destination.
Apr-27-2022 NA - Removed sections, Retrieve the Project ID, Get your Dataset ID and Location, and Get your GCS Bucket and Bucket Location.
- Updated section, Configure Google BigQuery as a Destination.
Apr-11-2022 NA Updated screenshot in the Configure Google BigQuery as a Destination section to reflect the latest UI.
Aug-09-2021 1.69 Updated the section, Handling JSON Fields.
May-19-2021 NA Added the section Modifying BigQuery Destination Configuration.
Apr-20-2021 1.61 - Updated the page overview section and added sub-section User Authentication.
- Updated section Configure Google BigQuery as a Destination to include the option to connect to BigQuery using service accounts.
- Moved the section Assign BigQuery Admin and Storage Admin Roles to the User to Google Account Authentication Methods.
Apr-06-2021 1.60 Updated the overview to inform customers when to use clustering over partitioning and vice versa.
Feb-22-2021 1.57 Added the Enable Streaming Writes setting in Step 7 of section Configure Google BigQuery as a Destination.
Feb-11-2021 NA Updated the page overview to inform customers that the files written to GCS are deleted after seven days to free up their Google Cloud storage.
Last updated on 16 Nov 2022
