Google BigQuery is a fully managed, serverless data warehouse that supports Structured Query Language (SQL) queries to derive meaningful insights from your data. With Hevo Pipelines, you can ingest data from any Source system and load it to a Google BigQuery Destination.
Note: For a select range of Sources, Hevo provides a pre-configured and fully managed Google BigQuery data warehouse Destination. Read Hevo Managed Google BigQuery.
Organization of data in BigQuery
Google BigQuery is hosted on the Google Cloud Platform (GCP). GCP uses projects to organize resources. An organization can create multiple projects. Each project has a set of users, a set of APIs, and a billing account. Creating a project provides access to the BigQuery APIs, which allow you to store data in units called datasets, within the project. Datasets organize and control access to your BigQuery tables and views. Each dataset has a location associated with it that affects how BigQuery loads, queries, or exports data.
The Hevo Pipeline stages the Source data in Google Cloud Storage (GCS) buckets before loading it to your BigQuery Destination. For this, the location of the dataset and the GCS bucket must be the same, except for the datasets in the US multi-region. Read Location considerations for further details about the requirements.
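The location rule above can be sketched as a small check. This is an illustrative reading of the requirement, not Hevo's actual validation logic; the function name is hypothetical:

```python
# Staging requires the GCS bucket and the BigQuery dataset to be in the
# same location, except when the dataset is in the US multi-region.
def staging_location_ok(dataset_location: str, bucket_location: str) -> bool:
    dataset_location = dataset_location.upper()
    bucket_location = bucket_location.upper()
    if dataset_location == "US":            # US multi-region datasets are exempt
        return True
    return dataset_location == bucket_location

print(staging_location_ok("asia-south1", "ASIA-SOUTH1"))  # True
print(staging_location_ok("US", "us-central1"))           # True (US multi-region)
print(staging_location_ok("EU", "us-east1"))              # False
```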
Roles and permissions
To read data from and write data to BigQuery and the GCS buckets, Hevo requires the following predefined roles to be assigned to the connecting Google account in the configured GCP project:
| Service | Role | Grants Hevo permission to |
| --- | --- | --- |
| Google BigQuery | BigQuery Data Editor | - Create new datasets in the project.<br>- Read a dataset's metadata and list the tables in the dataset.<br>- Create, update, get, and delete tables and views from a dataset.<br>- Read and update data and metadata for the tables and views in a dataset. |
| Google BigQuery | BigQuery Job User | - Perform actions such as loading data, exporting data, copying data, and querying data in a project. |
| Google Cloud Storage | Storage Admin | - List, create, view, and delete objects in GCS buckets.<br>- List, create, view, and delete GCS buckets. |
If you do not want to assign the Storage Admin role, you can instead create a custom role at the project or organization level with the following permissions and assign it to the connecting Google account to grant Hevo access to GCS resources:
Note: Hevo uses its GCS bucket if the connecting Google account is not assigned the Storage Admin role or a custom role.
Google account authentication
You can connect to your BigQuery Destination with a user account (OAuth) or a service account. One service account can be mapped to your entire team. You must provide the required roles to both types of accounts for accessing GCP-hosted services such as BigQuery and the GCS buckets. Read Authentication for GCP-hosted Services to know how to assign these roles to your user account and service account.
Data partitioning and clustering
The Hevo Pipeline loads the Source data using BigQuery’s Batch Load feature. To optimize the SQL queries that are run on the BigQuery Destination for deduplication and loading of data, you can create partitioned or clustered tables. Partitions make it easier to manage and query your data. Similarly, clustering allows you to organize data based on the contents of one or more columns of the table, and helps you colocate related data. Read this to know when to use clustering over partitioning and vice versa.
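As a concrete illustration of partitioning and clustering, the snippet below builds a BigQuery Standard SQL DDL statement that creates a table partitioned by day and clustered on a column. The dataset, table, and column names are hypothetical:

```python
# Hypothetical DDL for a table that is both partitioned and clustered.
# Partitioning prunes scans to the relevant days; clustering colocates
# rows with the same customer_id inside each partition.
ddl = """
CREATE TABLE my_dataset.orders (
  order_id    INT64,
  customer_id STRING,
  order_ts    TIMESTAMP
)
PARTITION BY DATE(order_ts)
CLUSTER BY customer_id;
"""
```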
The following image illustrates the key steps that you need to complete to configure Google BigQuery as a Destination in Hevo:
Handling Source Data with Different Data Types
For teams created in or after Hevo Release 1.56, Hevo automatically modifies the data type of Google BigQuery table columns to accommodate Source data with different data types. Data type promotion is performed on tables whose size is less than 50 GB. Read Handling Different Data Types in Source Data.
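The idea behind data type promotion can be sketched as widening a column to a type that can hold both the existing and the incoming values. The promotion ladder below is an assumption for illustration only, not Hevo's actual promotion rules:

```python
# Illustrative promotion ladder (an assumption, not Hevo's actual rules):
# a column is widened to the rightmost type needed to hold both values.
LADDER = ["INT64", "FLOAT64", "STRING"]

def promote(current: str, incoming: str) -> str:
    # Pick whichever type sits further along the ladder.
    return max(current, incoming, key=LADDER.index)

print(promote("INT64", "FLOAT64"))  # FLOAT64
print(promote("FLOAT64", "INT64"))  # FLOAT64
print(promote("INT64", "STRING"))   # STRING
```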
Note: Your Hevo release version is mentioned at the bottom of the Navigation Bar.
Handling JSON Fields
Read Parsing Nested JSON Fields in Events to know how Hevo parses the JSON data and makes it available at the Destination. Because BigQuery has built-in support for nested and repeated columns, JSON data is neither split nor compressed. However, depending on the Source type, arrays may be collapsed into JSON strings or passed as-is to the Destination. Read JSON functions in Standard SQL to know about parsing JSON strings back to arrays.
The fields of a JSON object are shown as a struct (for RECORD) or an array (mode: REPEATED) in the Schema Mapper. The nested fields are not editable.
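What "collapsing an array into a JSON string" means can be shown in a few lines: the Source array arrives in the Destination column as a single string, which can be parsed back into an array. The column value below is hypothetical; inside BigQuery itself, the JSON functions referenced above (for example, `JSON_EXTRACT_ARRAY`) serve the same purpose:

```python
import json

# A Source array collapsed into one JSON string in the Destination column.
collapsed = '["red", "green", "blue"]'   # hypothetical column value

# Parsing restores the original array structure.
restored = json.loads(collapsed)
print(restored)  # ['red', 'green', 'blue']
```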
Modifying BigQuery Destination Configuration
You can modify some of the settings after creating the BigQuery Destination. However, changing any configuration affects all the Pipelines that use this Destination. Refer to the section Potential Impacts of Modifying the Configuration Settings to understand the impacts.
To modify the BigQuery Destination configuration:

1. Navigate to the Destination Detailed View.
2. Do one of the following:
   - Click the Settings icon next to the Destination name and then, click the Edit icon.
   - Click the kebab menu on the right and click Edit.
3. Update the required field values in the Edit Google BigQuery Destination connection settings page. You can modify the following settings:
   - Populate Loaded Timestamp

   Read the following section to know the impact of making these changes.
4. Click SAVE DESTINATION.
Potential impacts of modifying the configuration settings
Changing any configuration setting impacts all Pipelines using the Destination.
Changing the Project ID or Dataset ID may lead to data getting split between the old and the new Dataset. This is because the Events in the old Dataset are not copied to the new one. Further, new tables for all the existing Pipelines that are using this Destination are created in the new Dataset.
Changing the Bucket ID may lead to Events getting lost. The Events not yet copied from the old bucket are not copied to the Dataset.
Hevo supports loading Events in Geometry formats (such as Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection). However, if you are loading any latitude values to your BigQuery Destination, these values must lie between -90 and +90, as that is the range BigQuery supports. If the data lies outside this range, you can either change the data type in the Source or create a Transformation to change the data type to varchar.
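The latitude fallback described above can be sketched as a Transformation-style check. The function name is hypothetical and this is an illustration of the workaround, not Hevo's Transformation API:

```python
# BigQuery only accepts latitudes in [-90, 90] for geographic data; values
# outside that range are converted to strings (varchar) instead.
def safe_latitude(value: float):
    if -90 <= value <= 90:
        return value          # in range: load as-is
    return str(value)         # out of range: fall back to varchar

print(safe_latitude(48.85))   # prints 48.85
print(safe_latitude(123.4))   # returns the string '123.4'
```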
Hevo currently does not map arrays of type struct in the Source data to arrays of type struct in the BigQuery Destination. Instead, any struct arrays in the Source are mapped as varchar arrays. If needed, you can parse these varchar arrays in the Destination to change them back to struct. Read Strategy: Replicate JSON structure as is while collapsing arrays into strings.
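Recovering struct arrays in the Destination then amounts to parsing each element of the varchar array back into a struct-like object. The field values below are hypothetical:

```python
import json

# A Source array of structs arrives as an array of JSON strings (varchar).
varchar_array = ['{"sku": "A1", "qty": 2}', '{"sku": "B7", "qty": 1}']

# Parse each string back into a struct-like object.
structs = [json.loads(item) for item in varchar_array]
print(structs[0]["sku"])  # A1
```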
Hevo replicates a maximum of 4096 columns to each BigQuery table, of which six are Hevo-reserved metadata columns used during data replication. Therefore, your Pipeline can replicate up to 4090 (4096-6) columns for each table. Read Limits on the Number of Destination Columns.
To circumvent this limitation, you can use Transformations to remove the columns you do not need.
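The arithmetic behind the limit, and a sketch of dropping columns in a Transformation, look like this. The event and field names are hypothetical and the dictionary filter only illustrates the idea of removing unneeded columns:

```python
# 4096 total columns minus 6 Hevo-reserved metadata columns.
MAX_COLUMNS = 4096
HEVO_RESERVED = 6
usable = MAX_COLUMNS - HEVO_RESERVED
print(usable)  # 4090

# Illustrative Transformation: drop columns you do not need before loading.
event = {"id": 1, "name": "a", "debug_blob": "..."}
drop = {"debug_blob"}
trimmed = {k: v for k, v in event.items() if k not in drop}
```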
Refer to the following table for the list of key updates made to this page:
| Date | Release | Description of Change |
| --- | --- | --- |
| Sep-21-2023 | NA | Added section, Destination Considerations. |
| Aug-11-2023 | NA | Added limitation about the number of columns supported by Hevo in BigQuery tables. |
| Jul-25-2023 | NA | Added the process flow diagram in the page overview section. |
| Jun-19-2023 | NA | Updated sections, Create a Google Cloud project and Configure Google BigQuery as a Destination to add information about enabling billing on the project to use streaming inserts. |
| Mar-09-2023 | NA | Updated section, Configuring Google BigQuery as a Destination to add a note about switching your authentication method post-Pipeline creation. |
| Dec-20-2022 | NA | - Added content in the Roles and Permissions section to explain the permissions needed by Hevo.<br>- Updated the Prerequisites section to remove the BigQuery Admin and Storage Admin roles requirement. |
| Nov-08-2022 | 2.01 | Updated screenshot in the Configure Google BigQuery as a Destination section to reflect the latest UI. |
| Oct-17-2022 | NA | - Added the Set up the Google Cloud Platform section to help set up GCP.<br>- Modified the Prerequisites to add the requirement of an existing GCP project. |
| Aug-24-2022 | NA | Updated section, Limitations to add the limitation regarding Hevo not mapping arrays of type struct in the Source to arrays of type struct in the BigQuery Destination. |
| Apr-27-2022 | NA | - Removed sections, Retrieve the Project ID, Get your Dataset ID and Location, and Get your GCS Bucket and Bucket Location.<br>- Updated section, Configure Google BigQuery as a Destination. |
| Apr-11-2022 | NA | Updated screenshot in the Configure Google BigQuery as a Destination section to reflect the latest UI. |
| Aug-09-2021 | 1.69 | Updated the section, Handling JSON Fields. |
| May-19-2021 | NA | Added the section Modifying BigQuery Destination Configuration. |
| Apr-20-2021 | 1.61 | - Updated the page overview section and added sub-section User Authentication.<br>- Updated section Configure Google BigQuery as a Destination to include the option to connect to BigQuery using service accounts.<br>- Moved the section Assign BigQuery Admin and Storage Admin Roles to the User to Google Account Authentication Methods. |
| Apr-06-2021 | 1.60 | Updated the overview to inform customers when to use clustering over partitioning and vice versa. |
| Feb-22-2021 | 1.57 | Added the Enable Streaming Writes setting in Step 7 of section Configure Google BigQuery as a Destination. |
| Feb-11-2021 | NA | Updated the page overview to inform customers that the files written to GCS are deleted after seven days to free up their Google Cloud storage. |