Google BigQuery

Google BigQuery is a fully managed, serverless data warehouse that supports Structured Query Language (SQL) queries to derive meaningful insights from your data. Refer to this article to get started with your BigQuery account.

You can connect Google BigQuery as a Destination for your Pipeline created using any Source system.

Note: For a select range of Sources, Hevo provides a pre-configured and fully managed Google BigQuery data warehouse Destination. Read Hevo Managed Google BigQuery.

Google Cloud Platform (GCP) uses projects to organize resources. An organization can have multiple projects associated with it. Each project has a set of users, a set of APIs, and a billing account. Once you enable the BigQuery API, you can store your data in BigQuery tables. BigQuery organizes the data tables into units called datasets, which in turn are stored in your GCP project.

Each dataset has a location associated with it. By default, the Pipeline loads the Source data every five minutes to your Google Cloud Storage (GCS) buckets, from where it is loaded to BigQuery in batches using BigQuery’s Batch Load feature. For this to happen successfully, the buckets must be in the same location as the dataset.

Note: If your dataset’s location is set to a value other than US, the regional or multi-regional Cloud Storage bucket must be in the same region as the dataset. You can read more about this here.

To load data into BigQuery, Hevo needs permissions to read from and write to both BigQuery and GCS. The files written to GCS are deleted after seven days to free up your storage.
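
For context, the following is a minimal sketch of the GCS-to-BigQuery batch-load pattern described above, using the google-cloud-bigquery Python client. Hevo performs this loading for you; the project, bucket, dataset, and table names here are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project-id")  # hypothetical project ID

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
    )

    # The staging bucket must be in the same location as the target dataset.
    load_job = client.load_table_from_uri(
        "gs://my-staging-bucket/events/batch_0001.json",  # hypothetical staged file
        "my-project-id.my_dataset.events",                # hypothetical table
        job_config=job_config,
    )
    load_job.result()  # wait for the batch load job to finish
    print(f"Loaded {load_job.output_rows} rows.")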


Prerequisites

  • The authenticating user has the BigQuery Admin and Storage Admin roles on the existing project. These roles are required even if the user has the Owner or Admin role within the project.

  • An active Billing account associated with your Project.


Perform the following steps to configure your BigQuery Destination:

Assign BigQuery Admin and Storage Admin Roles to the User

  1. Log in to the Google Cloud Platform console as a user with the Owner role.

    Click here and check the Role listed against your name. If you are not an Owner, contact a user with the Owner role to perform the following steps.

  2. In the left navigation menu, click IAM & Admin.

  3. Click IAM.

    Select IAM

  4. Select the Name of the project you want to grant permissions for.

    Select the project

  5. Click ADD in the IAM & Admin page.
    OR
    In the PERMISSIONS tab, click the Edit icon for the user whose permissions you want to modify.

  6. If you are adding a new user, enter your email address in the New members field. Skip this step if you are editing permissions for an existing user.

    Add a new user to assign permissions to

  7. In the Select a role drop-down, select BigQuery in the left pane, and then select BigQuery Admin.

    Assign BigQuery Admin role

  8. Click SAVE, and then click + ADD ANOTHER ROLE.

  9. In the Role drop-down, select Cloud Storage in the left pane, and then select Storage Admin.

    Storage Admin permission

  10. Click SAVE and exit.

    Save and exit
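
Once the roles are assigned, you can optionally sanity-check that the authenticated account can reach both services. The sketch below, using the google-cloud-bigquery and google-cloud-storage Python clients, only confirms basic list (read) access, not every permission the Admin roles grant; the project ID is a placeholder.

    from google.cloud import bigquery, storage

    project_id = "my-project-id"  # hypothetical project ID

    bq_client = bigquery.Client(project=project_id)
    gcs_client = storage.Client(project=project_id)

    # Listing datasets and buckets fails if the account lacks basic access.
    print("Datasets:", [d.dataset_id for d in bq_client.list_datasets()])
    print("Buckets:", [b.name for b in gcs_client.list_buckets()])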


Retrieve the Project ID

  1. Go to the Google API Console, and in the top left corner, beside the hamburger menu icon, click the drop-down to view all the projects.

    The names and IDs for all the projects you are a member of are displayed.

    List of projects

  2. Copy the ID of the required project.
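
If you prefer to look up project IDs programmatically, a minimal sketch using the google-cloud-resource-manager Python client follows; it assumes Application Default Credentials are configured for the same user.

    from google.cloud import resourcemanager_v3

    client = resourcemanager_v3.ProjectsClient()
    # Lists the projects the authenticated user can access.
    for project in client.search_projects():
        print(project.display_name, project.project_id)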


Get your Dataset ID and Location

  1. Go to your BigQuery instance.

  2. Select the Project ID. From the list of datasets for the project, choose the one you want the Dataset ID and location for.

    Select the dataset

  3. Copy the Dataset ID and Data Location.
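
You can also read the Dataset ID and location programmatically. A small sketch with the google-cloud-bigquery Python client follows; the project and dataset names are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project-id")   # hypothetical project ID
    dataset = client.get_dataset("my_dataset")          # hypothetical dataset ID

    print("Dataset ID:", dataset.dataset_id)
    print("Location:", dataset.location)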


Get your GCS Bucket and Bucket Location

  1. Go to Storage in Google Cloud Platform. In the Storage browser, check for the bucket Name and Location.

    GCS bucket name and location

Note: The bucket name must be globally unique across the Google Cloud Storage namespace.
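
Similarly, the bucket name and location can be confirmed programmatically. The sketch below uses the google-cloud-storage Python client; the bucket name is a placeholder.

    from google.cloud import storage

    client = storage.Client(project="my-project-id")    # hypothetical project ID
    bucket = client.get_bucket("my-staging-bucket")     # hypothetical bucket name

    print("Bucket:", bucket.name)
    print("Location:", bucket.location)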


Add Billing Details to your Project

Read this document to set up the billing details of your project.


Configure Google BigQuery as a Destination

  1. Click DESTINATIONS in the Asset Palette.

  2. Click + CREATE in the Destinations List View.

  3. In the Add Destination page, select Google BigQuery as the Destination type.

  4. In the Configure your Google BigQuery Account page, click + ADD GOOGLE BIGQUERY ACCOUNT.

  5. Sign in as the same user you granted BigQuery and Storage Admin permissions to in Step 1 above.

  6. Click Allow to authorize Hevo to access your data.

    Authorize Hevo

  7. In the Configure your Google BigQuery Warehouse page, specify the following details:

    BigQuery Settings

    • Destination Name: A unique name for your Destination.

    • Project ID: The Project ID of your BigQuery instance, as retrieved in Step 2 above, for which you granted permissions in the preceding steps.

    • Dataset ID: Name of the dataset, as retrieved in Step 3 above, in which you want to sync your data.

    • GCS Bucket: The Cloud Storage bucket, as retrieved in Step 4 above, where files must be staged before being uploaded to BigQuery.

    • Enable Streaming Inserts: Enable this option to stream data to your BigQuery Destination as it is received from the Source, instead of loading it via a job as per the defined Pipeline schedule (an illustrative sketch follows these steps). Read Near Real-time Data Loading using Streaming.

      Note: You cannot modify this setting later.

    • Sanitize Table/Column Names: Enable this option to remove all non-alphanumeric characters and spaces in the table and column names and replace them with an underscore (_). Read Name Sanitization.

      Note: You cannot modify this setting later.

    • Advanced Settings:

      • Populate Loaded Timestamp: Enable this option to append the __hevo_loaded_at column to the Destination table to indicate the time when the Event was loaded to the Destination.

        Read Loading Data to a Data Warehouse for more information.

  8. Click TEST CONNECTION to test and SAVE DESTINATION to complete the setup.
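
For reference, the sketch below shows what a streaming insert looks like at the BigQuery API level, which is the behaviour the Enable Streaming Inserts option switches on. Hevo manages this internally, so the snippet is only illustrative; the table name and rows are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project-id")   # hypothetical project ID
    rows = [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]  # sample rows

    # Rows become queryable almost immediately instead of waiting for a batch job.
    errors = client.insert_rows_json("my-project-id.my_dataset.events", rows)
    if errors:
        print("Some rows failed to insert:", errors)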


Handling Source Data with Different Data Types

For teams created in or after Hevo Release 1.56, Hevo automatically modifies the data type of a Destination table column to accommodate Source data with a different data type, provided the table size is less than 50 GB. Read more.


Handling of JSON Fields

Refer to Parsing Nested JSON Fields in Events to know how Hevo parses the JSON data and makes it available at the Destination. Because BigQuery has built-in support for nested and repeated columns, JSON data is neither split nor compressed, but is passed as-is to the Destination for all Pipelines with BigQuery as a Destination.

The fields of a JSON object are shown as a struct (for RECORD) or an array (mode: REPEATED) in the Schema Mapper. The nested fields are not editable.
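
To illustrate how nested JSON maps to BigQuery’s nested and repeated columns, here is a hedged sketch that defines a table with a REPEATED RECORD field using the google-cloud-bigquery Python client; the field and table names are placeholders, not the schema Hevo generates.

    from google.cloud import bigquery

    schema = [
        bigquery.SchemaField("order_id", "STRING"),
        bigquery.SchemaField(
            "items",                      # array of JSON objects -> REPEATED RECORD
            "RECORD",
            mode="REPEATED",
            fields=[
                bigquery.SchemaField("sku", "STRING"),
                bigquery.SchemaField("quantity", "INTEGER"),
            ],
        ),
    ]

    client = bigquery.Client(project="my-project-id")        # hypothetical project ID
    table = bigquery.Table("my-project-id.my_dataset.orders", schema=schema)
    client.create_table(table)                                # hypothetical table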


Limitations

None.


See Also


Revision History

Refer to the following table for the list of key updates made to the page:

Date Release No. Description of Change
22-Feb-2021 1.57 Added the Enable Streaming Inserts setting in Step 7 of section Configure Google BigQuery as a Destination.
11-Feb-2021 NA Updated the page overview to inform customers that the files written to GCS are deleted after seven days to free up their Google Cloud storage.
Last updated on 27 Feb 2021