Google BigQuery

Google BigQuery is a fully managed, serverless data warehouse that supports Structured Query Language (SQL) to derive meaningful insights from the data. Read more.

With Hevo Pipelines, you can configure Google BigQuery as a Destination for loading data from any Source system.

Note: For a select range of Sources, Hevo provides a pre-configured and fully managed Google BigQuery data warehouse Destination. Read Hevo Managed Google BigQuery.

Organization of Data in BigQuery

Google BigQuery is hosted on the Google Cloud Platform (GCP). GCP uses projects to organize resources. An organization can have multiple projects associated with it. Each project has a set of users, a set of APIs, and a billing account. Once you enable the BigQuery API, you can store your data in BigQuery tables. BigQuery organizes the data tables into units called datasets. These datasets are in turn stored in your GCP project.

Each dataset has a location associated with it. By default, the Pipeline loads the Source data to your Google Cloud Storage (GCS) buckets every five minutes. GCS is the storage service offered by GCP. From the GCS buckets, the data is loaded into BigQuery in batches using BigQuery’s Batch Load feature. For this to happen successfully, the GCS buckets must be in the same location as the dataset.

Note: If your dataset’s location is other than US, the regional or multi-regional Cloud Storage bucket must be in the same region as the dataset. You can read more about this here.
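
For reference, the sketch below shows what such a batch load from a GCS bucket into a BigQuery table looks like with the google-cloud-bigquery Python client. The project, dataset, table, and staged file names are placeholders, and this only illustrates BigQuery’s Batch Load feature in general, not Hevo’s internal loading logic.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project ID

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    autodetect=True,
)

# The staging bucket must be in the same location as the destination dataset.
load_job = client.load_table_from_uri(
    "gs://my-hevo-stage/events/batch-0001.json",  # placeholder staged file
    "my-gcp-project.test_dataset.events",         # placeholder destination table
    job_config=job_config,
)
load_job.result()  # waits for the batch load job to complete
```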

To optimize the SQL queries you run on the BigQuery Destination, you can apply partitioning or clustering to the Source fields. To know when to use clustering over partitioning and vice versa, read this.
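
For example, the following sketch creates a table that is partitioned on a timestamp field and clustered on an identifier field, using the google-cloud-bigquery Python client; the table and field names are illustrative only.

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-gcp-project.test_dataset.orders"  # placeholder table ID

schema = [
    bigquery.SchemaField("order_ts", "TIMESTAMP"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("amount", "NUMERIC"),
]

table = bigquery.Table(table_id, schema=schema)
# Partition by the day of order_ts and cluster by customer_id so that queries
# filtering on these fields scan less data.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="order_ts",
)
table.clustering_fields = ["customer_id"]

client.create_table(table)
```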

Permissions

To load data into BigQuery, Hevo needs permissions to read from and write to both BigQuery and GCS. The files written to GCS are deleted after seven days to free up your storage. These permissions are assigned to the account you use to authenticate Hevo on BigQuery. Refer to Google Account Authentication Methods for more information.

User Authentication

You can connect to the BigQuery Destination via User accounts (OAuth) or Service accounts. One service account can be mapped to your entire team. Read Google Account Authentication Methods to know how to set up a service account if you do not already have one.
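
If you want to verify a service account key before using it, the minimal sketch below authenticates the google-cloud-bigquery Python client with a JSON key file; the file name is a placeholder.

```python
from google.cloud import bigquery

# Placeholder path to the service account key file (JSON format, the only
# format Hevo accepts for the key file).
client = bigquery.Client.from_service_account_json("hevo-service-account.json")
print(client.project)  # the project the service account belongs to
```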


Prerequisites

  • The authenticating user has the BigQuery Admin and Storage Admin roles for the Google BigQuery project. These roles are required irrespective of whether the user has the Owner or Admin role within the project. Read Google Account Authentication Methods for steps to configure these roles.

  • An active Billing account is associated with your BigQuery project.


Perform the following steps to configure your BigQuery Destination:

Retrieve the Project ID

  1. Go to the Google API Console, and in the top left corner, beside the hamburger menu icon, click the drop-down to view all the projects.

    The names and IDs for all the projects you are a member of are displayed.

    List of projects

  2. Copy the ID of the required project.
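
If you prefer to confirm the Project ID programmatically, a minimal check with the google-cloud-bigquery Python client (using Application Default Credentials) is shown below.

```python
from google.cloud import bigquery

# Prints the project ID that the authenticated credentials resolve to.
client = bigquery.Client()
print(client.project)
```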


Get your Dataset ID and Location

  1. Go to your BigQuery instance.

  2. Select the Project ID.

  3. From the list of datasets for the project, choose the one you want the dataset ID and location for.

    Select the dataset

  4. Copy the Dataset ID and Data Location. The dataset ID is displayed as project-name:dataset-ID. Copy only the dataset ID. For example, in the image above, the dataset ID is test-dataset.
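
Alternatively, you can look up the dataset ID and location with the google-cloud-bigquery Python client, as in this sketch with placeholder project and dataset names.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")           # placeholder project ID
dataset = client.get_dataset("my-gcp-project.test_dataset")  # placeholder dataset

print(dataset.dataset_id)  # e.g. test_dataset
print(dataset.location)    # e.g. US
```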


Get your GCS Bucket and Bucket Location

  1. Go to Storage in Google Cloud Platform. In the Storage browser, check for the bucket Name and Location.

    GCS bucket name and location

Note: The bucket name must be unique across the global namespace of Google Cloud Storage.
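
You can also read the bucket name and location with the google-cloud-storage Python client, as in this sketch with placeholder names.

```python
from google.cloud import storage

client = storage.Client(project="my-gcp-project")  # placeholder project ID
bucket = client.get_bucket("my-hevo-stage")        # placeholder bucket name

# The bucket location must match the BigQuery dataset location noted earlier.
print(bucket.name, bucket.location)
```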


Add Billing Details to your Project

Read this document to set up the billing details of your project.


Configure Google BigQuery as a Destination

You can modify only some of the settings that you provide here after the Destination is created. Refer to the section Modifying BigQuery Destination Configuration below for more information.

  1. Click DESTINATIONS in the Asset Palette.

  2. Click + CREATE in the Destinations List View.

  3. On the Add Destination page, select Google BigQuery as the Destination type.

  4. In the Configure your Google BigQuery Account page, select the authentication method for connecting to BigQuery.

    Google account authentication methods

  5. Do one of the following:

    • To connect using a Service Account:

      1. Attach the Service Account Key file.

        Configure Google service account

        Note: Hevo supports only JSON format for the key file.

      2. Click CONFIGURE GOOGLE BIGQUERY ACCOUNT.

    • To connect using a User Account:

      1. Click + ADD GOOGLE BIGQUERY ACCOUNT.

      2. Sign in as a user with BigQuery Admin and Storage Admin permissions.

      3. Click Allow to authorize Hevo to access your data.

        Authorize Hevo

  6. In the Configure your Google BigQuery Warehouse page, specify the following details:

    BigQuery Settings

    • Destination Name: A unique name for your Destination.

    • Project ID: The Project ID of your BigQuery instance, as retrieved in the section Retrieve the Project ID above, for which you have granted permissions in the above steps.

    • Dataset ID: Name of the dataset, as retrieved in the section Get your Dataset ID and Location above, in which you want to sync your data.

    • GCS Bucket: The Cloud Storage bucket, as retrieved in the section Get your GCS Bucket and Bucket Location above, where files must be staged before being loaded into BigQuery.

    • Enable Streaming Inserts: Enable this option to stream data to your BigQuery Destination as it is received from the Source, instead of loading it via a job as per the defined Pipeline schedule. Read Near Real-time Data Loading using Streaming.

      Note: You cannot modify this setting later.

    • Sanitize Table/Column Names: Enable this option to remove all non-alphanumeric characters and spaces from the table and column names and replace them with an underscore (_); see the sketch after these steps for an illustration. Read Name Sanitization.

      Note: You cannot modify this setting later.

    • Advanced Settings:

      • Populate Loaded Timestamp: Enable this option to append the __hevo_loaded_at column to the Destination table to indicate the time when the Event was loaded to the Destination.

        Read Loading Data to a Data Warehouse for more information.

  7. Click TEST CONNECTION to test the connection, and then click SAVE DESTINATION to complete the setup.
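
To illustrate the Sanitize Table/Column Names option from step 6 above, the following Python sketch approximates the documented rule. It is only an approximation of the described behavior, not Hevo’s exact implementation.

```python
import re

def sanitize_name(name: str) -> str:
    # Replace runs of non-alphanumeric characters (including spaces) with "_".
    return re.sub(r"[^0-9a-zA-Z]+", "_", name)

print(sanitize_name("order details!v2"))  # order_details_v2
```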


Handling Source Data with Different Data Types

For teams created in or after Hevo Release 1.56, Hevo automatically modifies the data type of Google BigQuery table columns to accommodate Source data with different data types. Data type promotion is performed on tables whose size is less than 50 GB. Read Handling Different Data Types in Source Data.

Note: Your Hevo release version is mentioned at the bottom of the Asset Palette.


Handling JSON Fields

Refer to Parsing Nested JSON Fields in Events to know how Hevo parses the JSON data and makes it available at the Destination. Because BigQuery has built-in support for nested and repeated columns, JSON data is neither split nor compressed but passed as-is to the Destination for all the Pipelines with BigQuery as a Destination.

The fields of a JSON object are shown as a struct (RECORD type) or an array (REPEATED mode) in the Schema Mapper. The nested fields are not editable.
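
As an illustration, a nested JSON object and a JSON array could map to a BigQuery schema like the hypothetical one below, built with the google-cloud-bigquery Python client.

```python
from google.cloud import bigquery

# Hypothetical schema: a nested JSON object maps to a RECORD (struct) column,
# and a JSON array maps to a column with REPEATED mode.
schema = [
    bigquery.SchemaField("order_id", "STRING"),
    bigquery.SchemaField(
        "customer", "RECORD",
        fields=[
            bigquery.SchemaField("name", "STRING"),
            bigquery.SchemaField("email", "STRING"),
        ],
    ),
    bigquery.SchemaField("tags", "STRING", mode="REPEATED"),
]
```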


Modifying BigQuery Destination Configuration

You can modify some of the settings after creating the BigQuery Destination. However, changing any configuration affects all the Pipelines that are using this Destination. Refer to the section Potential Impacts of Modifying the Configuration Settings below to know the impacts.

To modify the BigQuery Destination configuration:

  1. Navigate to the Destination Detailed View.

  2. Do one of the following:

    • Click the Settings icon next to the Destination name and then click the Edit icon.

      Modify BQ settings

    • Click the kebab menu on the right and click Edit.

  3. Update the required field values in the Edit Google BigQuery Destination connection settings page. You can modify the following settings:

    • Destination name

    • Project ID

    • Dataset ID

    • GCS Bucket

    • Populate Loaded Timestamp
      Read the following section to know the impact of making these changes.

  4. Click SAVE DESTINATION.

Potential Impacts of Modifying the Configuration Settings

  • Changing any configuration setting impacts all Pipelines using the Destination.

  • Changing the Project ID or Dataset ID may lead to data getting split between the old and the new Dataset. This is because the Events in the old Dataset are not copied to the new one. Further, new tables for all the existing Pipelines that are using this Destination are created in the new Dataset.

  • Changing the GCS Bucket may lead to Events getting lost. The Events not yet copied from the old bucket are not copied to the Dataset.


Limitations

None.



Revision History

Refer to the following table for the list of key updates made to this page:

Date         Release   Description of Change
May-19-2021  NA        Added the section Modifying BigQuery Destination Configuration.
Apr-20-2021  1.61      - Updated the page overview section and added sub-section User Authentication.
                       - Updated section Configure Google BigQuery as a Destination to include the option to connect to BigQuery using service accounts.
                       - Moved the section Assign BigQuery Admin and Storage Admin Roles to the User to Google Account Authentication Methods.
Apr-06-2021  1.60      Updated the overview to inform customers when to use clustering over partitioning and vice versa.
Feb-22-2021  1.57      Added the Enable Streaming Writes setting in Step 7 of section Configure Google BigQuery as a Destination.
Feb-11-2021  NA        Updated the page overview to inform customers that the files written to GCS are deleted after seven days to free up their Google Cloud storage.
Last updated on 01 Jun 2021