Google BigQuery

Last updated on May 30, 2023

Google BigQuery is a fully managed, serverless data warehouse that supports Structured Query Language (SQL) queries, helping you derive meaningful insights from your data. With Hevo Pipelines, you can ingest data from any Source system and load it to a Google BigQuery Destination.

Note: For a select range of Sources, Hevo provides a pre-configured and fully managed Google BigQuery data warehouse Destination. Read Hevo Managed Google BigQuery.

Organization of data in BigQuery

Google BigQuery is hosted on the Google Cloud Platform (GCP). GCP uses projects to organize resources. An organization can create multiple projects. Each project has a set of users, a set of APIs, and a billing account. Creating a project provides access to the BigQuery APIs, which allow you to store data in units called datasets, within the project. Datasets organize and control access to your BigQuery tables and views. Each dataset has a location associated with it that affects how BigQuery loads, queries, or exports data.

The Hevo Pipeline stages the Source data in Google Cloud Storage (GCS) buckets before loading it to your BigQuery Destination. For this, the dataset and the GCS bucket must be in the same location, except for datasets in the US multi-region. Read Location considerations for further details about the requirements.
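
For reference, the following is a minimal sketch, assuming the google-cloud-bigquery and google-cloud-storage Python libraries and hypothetical resource names, of checking that a dataset and a GCS bucket are in the same location before you configure them in Hevo:

    # Minimal sketch: compare the locations of the Destination dataset and the
    # staging GCS bucket. "my-gcp-project", "my_dataset", and
    # "my-hevo-staging-bucket" are hypothetical names.
    from google.cloud import bigquery, storage

    bq = bigquery.Client(project="my-gcp-project")
    gcs = storage.Client(project="my-gcp-project")

    dataset_location = bq.get_dataset("my_dataset").location             # for example, "US" or "asia-south1"
    bucket_location = gcs.get_bucket("my-hevo-staging-bucket").location  # GCS reports locations in uppercase

    if dataset_location.lower() != bucket_location.lower():
        print("Location mismatch:", dataset_location, bucket_location)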

Roles and permissions

Hevo requires that the following predefined roles be assigned to the connecting Google account for the configured GCP project, to enable Hevo to read data from and write data to BigQuery and the GCS buckets.

BigQuery

  • BigQuery Data Editor: Grants Hevo permission to:

    - Create new datasets in the project.

    - Read a dataset’s metadata and list the tables in the dataset.

    - Create, update, get, and delete tables and views from a dataset.

    - Read and update data and metadata for the tables and views in a dataset.

  • BigQuery Job User: Grants Hevo permission to perform actions such as loading data, exporting data, copying data, and querying data in a project.

Google Cloud Storage

  • Storage Admin: Grants Hevo permission to list, create, view, and delete:

    - Objects in GCS buckets.

    - GCS buckets.

If you do not want to assign the Storage Admin role, you can create a custom role at the project or organization level with the following permissions and assign it to the connecting Google account to grant Hevo access to GCS resources (a sketch of creating such a role follows this list):

  • resourcemanager.projects.get

  • storage.buckets.create

  • storage.buckets.delete

  • storage.buckets.get

  • storage.buckets.list

  • storage.buckets.update

  • storage.multipartUploads.abort

  • storage.multipartUploads.create

  • storage.multipartUploads.list

  • storage.multipartUploads.listParts

  • storage.objects.create

  • storage.objects.delete

  • storage.objects.get

  • storage.objects.list

  • storage.objects.update
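
For example, the following is a minimal sketch of creating such a custom role with the IAM API through the google-api-python-client library. The project ID (my-gcp-project) and role ID (hevo_gcs_access) are placeholder values, and the permissions are the ones listed above; this is an illustration, not a Hevo-provided script.

    # Minimal sketch: create a project-level custom role containing the
    # permissions listed above. Assumes Application Default Credentials and the
    # hypothetical project ID "my-gcp-project" and role ID "hevo_gcs_access".
    from googleapiclient import discovery

    GCS_PERMISSIONS = [
        "resourcemanager.projects.get",
        "storage.buckets.create",
        "storage.buckets.delete",
        "storage.buckets.get",
        "storage.buckets.list",
        "storage.buckets.update",
        "storage.multipartUploads.abort",
        "storage.multipartUploads.create",
        "storage.multipartUploads.list",
        "storage.multipartUploads.listParts",
        "storage.objects.create",
        "storage.objects.delete",
        "storage.objects.get",
        "storage.objects.list",
        "storage.objects.update",
    ]

    iam = discovery.build("iam", "v1")
    iam.projects().roles().create(
        parent="projects/my-gcp-project",
        body={
            "roleId": "hevo_gcs_access",
            "role": {
                "title": "Hevo GCS Access",
                "includedPermissions": GCS_PERMISSIONS,
                "stage": "GA",
            },
        },
    ).execute()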

Note: Hevo uses its own GCS bucket if the connecting Google account is not assigned the Storage Admin role or a custom role.

Google account authentication

You can connect to your BigQuery Destination with a user account (OAuth) or a service account. A single service account can be mapped to your entire team. Whichever account type you use, it must be assigned the required roles to access GCP-hosted services such as BigQuery and the GCS buckets. Read Authentication for GCP-hosted Services to know how to assign these roles to your user account or service account.
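
As a quick check, the following is a minimal sketch, assuming the google-cloud-bigquery library and a hypothetical service account key file, of verifying that a service account can list the datasets in your project before you use it in the Destination configuration:

    # Minimal sketch: authenticate with a service account key and list datasets.
    # "hevo-service-account.json" and "my-gcp-project" are hypothetical values.
    from google.cloud import bigquery
    from google.oauth2 import service_account

    credentials = service_account.Credentials.from_service_account_file(
        "hevo-service-account.json"
    )
    client = bigquery.Client(project="my-gcp-project", credentials=credentials)
    print([dataset.dataset_id for dataset in client.list_datasets()])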

Data partitioning and clustering

The Hevo Pipeline loads the Source data using BigQuery’s Batch Load feature. To optimize the SQL queries that are run on the BigQuery Destination for deduplication and loading of data, you can create partitioned or clustered tables. Partitions make it easier to manage and query your data. Similarly, clustering allows you to organize data based on the contents of one or more columns of the table, and helps you colocate related data. Read this to know when to use clustering over partitioning and vice versa.
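
For illustration, the following is a minimal sketch, assuming the google-cloud-bigquery library and a hypothetical table with an event timestamp and a customer identifier, of creating a table that is partitioned by day and clustered on one column:

    # Minimal sketch: create a day-partitioned, clustered table.
    # "my-gcp-project", "my_dataset", "events", and the columns are hypothetical.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-gcp-project")

    table = bigquery.Table(
        "my-gcp-project.my_dataset.events",
        schema=[
            bigquery.SchemaField("event_id", "STRING"),
            bigquery.SchemaField("customer_id", "STRING"),
            bigquery.SchemaField("created_at", "TIMESTAMP"),
        ],
    )
    # Partition on the day portion of created_at; colocate rows with the same customer_id.
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY, field="created_at"
    )
    table.clustering_fields = ["customer_id"]
    client.create_table(table)

Whether partitioning, clustering, or both helps depends on how your queries filter and group the table, which is why the referenced guidance on choosing between them is worth reviewing.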


Handling Source Data with Different Data Types

For teams created in or after Hevo Release 1.56, Hevo automatically modifies the data type of Google BigQuery table columns to accommodate Source data with different data types. Data type promotion is performed on tables whose size is less than 50 GB. Read Handling Different Data Types in Source Data.

Note: Your Hevo release version is mentioned at the bottom of the Navigation Bar.


Handling JSON Fields

Read Parsing Nested JSON Fields in Events to know how Hevo parses the JSON data and makes it available at the Destination. Because BigQuery has built-in support for nested and repeated columns, JSON data is neither split nor compressed. However, depending on the Source type, arrays may be collapsed into JSON strings or passed as-is to the Destination. Read JSON functions in Standard SQL to know how to parse JSON strings back into arrays.
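
For example, the following minimal sketch, assuming the google-cloud-bigquery library and a hypothetical table my_dataset.orders whose items column was loaded as a JSON string, shows how such a string can be parsed back into an array:

    # Minimal sketch: parse a JSON-string column back into an array using
    # BigQuery's JSON_EXTRACT_ARRAY function. Table and column names are hypothetical.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-gcp-project")

    sql = """
    SELECT
      order_id,
      JSON_EXTRACT_ARRAY(items) AS items_array  -- each element remains a JSON-formatted string
    FROM `my-gcp-project.my_dataset.orders`
    """
    for row in client.query(sql).result():
        print(row.order_id, row.items_array)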

The fields of a JSON object are shown as a struct (type: RECORD) or an array (mode: REPEATED) in the Schema Mapper. The nested fields are not editable.


Modifying BigQuery Destination Configuration

You can modify some of the settings after creating the BigQuery Destination. However, changing any configuration affects all the Pipelines that are using this Destination. Refer to the section Potential Impacts of Modifying the Configuration Settings to know the impacts.

To modify the BigQuery Destination configuration:

  1. Navigate to the Destination Detailed View.

  2. Do one of the following:

    • Click the Settings icon next to the Destination name and then click the Edit icon.


    • Click the kebab menu on the right and click Edit.

  3. Update the required field values in the Edit Google BigQuery Destination connection settings page. You can modify the following settings:

    • Destination name

    • Project ID

    • Dataset ID

    • GCS Bucket

    • Populate Loaded Timestamp

    Read the following section to know the impact of making these changes.

  4. Click SAVE DESTINATION.

Potential impacts of modifying the configuration settings

  • Changing any configuration setting impacts all Pipelines using the Destination.

  • Changing the Project ID or Dataset ID may lead to data getting split between the old and the new Dataset. This is because the Events in the old Dataset are not copied to the new one. Further, new tables for all the existing Pipelines that are using this Destination are created in the new Dataset.

  • Changing the GCS Bucket may lead to Events getting lost, as the Events not yet copied from the old bucket are not copied to the Dataset.


Limitations

  • Hevo does not map arrays of type struct in the Source data to arrays of type struct in the BigQuery Destination.


Revision History

Refer to the following table for the list of key updates made to this page:

Date Release Description of Change
Mar-09-2023 NA Updated section, Configuring Google BigQuery as a Destination to add a note about switching your authentication method post-Pipeline creation.
Dec-20-2022 NA - Added content in the Roles and Permissions section to explain the permissions needed by Hevo.
- Updated the Prerequisites section to remove BigQuery Admin and Storage Admin roles requirement.
Nov-08-2022 2.01 Updated screenshot in the Configure Google BigQuery as a Destination section to reflect the latest UI.
Oct-17-2022 NA - Added the Set up the Google Cloud Platform section to help set up GCP.
- Modified the Prerequisites to add the requirement of an existing GCP project.
Aug-24-2022 NA Updated section, Limitations to add the limitation regarding Hevo not mapping arrays of type struct in the Source to arrays of type struct in the BigQuery Destination.
Apr-27-2022 NA - Removed sections, Retrieve the Project ID, Get your Dataset ID and Location, and Get your GCS Bucket and Bucket Location.
- Updated section, Configure Google BigQuery as a Destination.
Apr-11-2022 NA Updated screenshot in the Configure Google BigQuery as a Destination section to reflect the latest UI.
Aug-09-2021 1.69 Updated the section, Handling JSON Fields.
May-19-2021 NA Added the section Modifying BigQuery Destination Configuration.
Apr-20-2021 1.61 - Updated the page overview section and added sub-section User Authentication.
- Updated section Configure Google BigQuery as a Destination to include the option to connect to BigQuery using service accounts.
- Moved the section Assign BigQuery Admin and Storage Admin Roles to the User to Google Account Authentication Methods.
Apr-06-2021 1.60 Updated the overview to inform customers when to use clustering over partitioning and vice versa.
Feb-22-2021 1.57 Added the Enable Streaming Writes setting in Step 7 of section Configure Google BigQuery as a Destination.
Feb-11-2021 NA Updated the page overview to inform customers that the files written to GCS are deleted after seven days to free up their Google Cloud storage.
