Structure of Data in the Snowflake Data Warehouse

Last updated on May 30, 2023

Snowflake is a cloud data warehouse that you can use for storage of your enterprise data or your data repository. It is a truly SaaS offering, which means that you do not need to install any hardware or software to use it, and the management and maintenance of the warehouse is handled by Snowflake itself. Its micro-partitioning and data clustering features also take care of the query performance and table optimization for data storage.

Snowflake Architecture

Snowflake separates data storage from the computing functions, which allows you to scale each of these independently. Its architecture consists of the three distinct layers of:

  • Database Storage: All the data in Snowflake is maintained in databases in an encrypted and compressed form. Each database consists of one or more schemas, which define the logical groupings of the tables and views within the database. Snowflake supports both unstructured data such as CSV and semi-structured data, such as, JSON and AVRO.

  • Virtual Warehouses or Compute Clusters: Snowflake handles queries using massively parallel processing (MPP) compute clusters. MPP enables faster performance and allows for quick scaling up. Each virtual warehouse is an independent MPP compute cluster, where each node of the cluster contains a part of the data repository. One virtual warehouse does not share compute resources (such as CPU, memory, and temporary storage) with other virtual warehouses and, therefore, does not impact the performance of any other virtual warehouse. A virtual warehouses can be used across databases and auto-suspended when not in use.

  • Managed Services: The managed cloud services coordinate all the operations in Snowflake, such as, data encryption, SQL querying, client sessions, and authentication. The queries are first routed to this layer’s optimizer before being forwarded to the virtual warehouses. This layer also stores the metadata needed to optimize the queries or filter data.

Snowflake Architecture

Snowflake Data Structure

Data may come into Snowflake from a variety of Sources. To support this, the Snowflake warehouse can store both structured data, such as data similar to CSV format, and semi-structured data, such as data in JSON format.

When you create a Pipeline, Hevo ingests the data from your Source, saves the data in JSON format on its server, and loads it to an S3 location, from where, the data is then loaded to Snowflake in columnar fashion.

Once the data is loaded safely to your Snowflake Destination, it is deleted from these locations.

Data flow

Example

Consider the following Event ingested from a webhook Source, where the metadata field is of type, string.

{
    "event_name": "agents",
    "properties": {
        "agent_name": "James",
        "agent_id": "bond007",
        "metadata": {"message":"this is test message","rand_id":3523}
    }
}

When Hevo writes the data to S3, this Event appears as follows:

{"agent_name": "James", "agent_id": "bond007", "metadata": {"message":"this is test message","rand_id":3523}, "__hevo__ingested_at": ...}

Using SQL queries, Hevo loads this data to your Snowflake warehouse. Read Loading Data to a Snowflake Data Warehouse for more information.

You can verify the data type of the final data through the Destination Workbench in Hevo. For example, in the following image, the metadata field from the above query is loaded as Variant data type that supports JSON data.

Data type in Destination

Snowflake Schema

The data warehouse schema can be seen in terms of a dimensional model in a star formation, composed of a central fact table and a set of surrounding dimension tables that hold second-level or child data for each respective dimension or data in the fact table. The snowflake schema is a variation of the star schema.

Example: Star schema

Star schema

In the star schema above, sales is the fact table, holding key information about the sale. Details of each information item become a separate Dimension table, such as, time, product, and customer.

The snowflake schema is a variant of the star schema, where each dimension of the star can further have a star schema of its own. This formation normalizes, or further expands, the dimension tables, thereby further splitting the data to remove redundancy.

Example: Snowflake schema

Snowflake schema

In the image above, each dimension table, such as, customer and store, is further expanded to break down the information into its simplest representation.

A snowflake schema provides the advantage of space saving by removing redundancy, while also leading to the disadvantage of needing more joins to execute an SQL query while fetching or loading data.

Micro-partitioning

The data in Snowflake is stored in database tables in compressed form, logically structured as collections of columns and rows. Snowflake automatically groups the rows into micro-partitions of 50–500 MB of data. Snowflake is column-based and horizontally partitioned, meaning a row of data is stored in the same micro-partition. A very large table may end up comprising millions, or even hundreds of millions, of micro-partitions. The micro-partitions use the ordering of the data as it is inserted/loaded. The queries can use the micro-partition metadata to determine which partitions are relevant for a query and scan only those partitions. In addition to this, Snowflake further limits the scanning of a partition to only the columns filtered in a query.

Snowflake identifies the rows related to a query by using the metadata it maintains about the rows stored in the micro-partitions.

Clustering

Snowflake automatically uses clustering to order and sort the data within a micro-partition. With clustering, the data stored in the tables is ordered on a subset of some of the columns that can be used to co-locate data. Snowflake collects clustering metadata for each micro-partition created during data load. This metadata is then leveraged to avoid unnecessary scanning of micro-partitions.

You can also define your own cluster keys to keep the frequently accessed data in the same micro-partition for faster retrieval, or, if your queries are running slower than normal. However, manually assigned cluster keys override Snowflake’s natural clustering for that table and require Snowflake to use additional compute resources to reclassify the data. Therefore, this incurs a cost for you in the form of Snowflake credits. Snowflake credits are used to pay for the consumption of resources on Snowflake. Read Understanding Virtual Warehouse, Storage, and Cloud Services Usage for more information.

Caching of Query Results

The Snowflake architecture includes caching at various levels to help speed up the queries you run on the data in your Snowflake Destination and minimize the costs. For example, when a query is executed, Snowflake holds the results of the query for 24 hours. So if the same query is executed again, by you or another user within your account, the results are already available to be returned, provided that the underlying data has not changed. This is especially beneficial for analysis work where you do not need to rerun complex queries to access previous data or when you want to compare the results of complex query before and after a change.


See Also


Revision History

Refer to the following table for the list of key updates made to this page:

Date Release Description of Change
Jul-04-2022 NA New document.

Tell us what went wrong