Iceberg Data Lake

Last updated on Feb 19, 2025

Edge Pipeline is currently available under Early Access. You can request access to evaluate and test its features.

Apache Iceberg is a modern, open-source table format specification designed for managing large analytic datasets. It defines a metadata model that:

  • Decouples Metadata from Data Files, allowing compute engines to plan queries efficiently by pruning irrelevant files instead of scanning all of them.

  • Supports Robust Schema and Partition Evolution by tracking changes over time and using stable identifiers, such as partition field IDs and column IDs.

  • Enables ACID Transactions and Time Travel with immutable snapshots that record every commit.

  • Supports Versioning, allowing tables to be rolled back to a previously stable state.

  • Scales to Very Large Datasets through the use of manifest lists and manifest files for efficient file management.
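
If you want to explore the snapshot and time travel capabilities listed above outside of Hevo, the open-source pyiceberg library can read Iceberg tables directly. The sketch below is a minimal illustration, not part of Hevo's product: it assumes an AWS Glue catalog (which Hevo uses, as described later on this page) and a hypothetical analytics_db.events table, lists the table's snapshots, and reads the data as of an earlier one.

    # Minimal sketch: listing snapshots and time-traveling with pyiceberg.
    # The catalog type, database, and table names are assumptions.
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("glue", **{"type": "glue"})   # AWS Glue catalog
    table = catalog.load_table("analytics_db.events")    # hypothetical table

    # Every commit creates an immutable snapshot; list them to see the history.
    for snapshot in table.snapshots():
        print(snapshot.snapshot_id, snapshot.timestamp_ms)

    # Time travel: read the table as of its earliest recorded snapshot.
    earliest = table.snapshots()[0]
    rows = table.scan(snapshot_id=earliest.snapshot_id).to_arrow()
    print(rows.num_rows)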

Structure of an Iceberg Table

An Iceberg table has three main components:

  • Catalog: The Iceberg catalog is a centralized system responsible for managing and organizing table metadata. Query engines, such as Spark, Trino, and Hive, do not directly reference the metadata files. Instead, they interact with a catalog to discover, create, update, and drop tables. Currently, Hevo maintains your Iceberg catalog in an AWS Glue database.

  • Metadata: This layer stores structured information (metadata) about an Iceberg table in files in your S3 bucket. It contains the metadata files, manifest lists, and manifest files.

    • A metadata file, typically stored in JSON format, contains details such as the table’s schema, partitions, and snapshot history. A snapshot tracks all the files in an Iceberg table at a specific point in time.

    • A manifest is an immutable Avro file containing detailed metadata about a subset of the data files in your Iceberg table. For each data file it tracks, a manifest records the file path, partition values, and statistics such as row counts and min/max column values. A manifest list acts as an index and points to a collection of manifest files.

  • Data: This layer comprises immutable files containing your actual data records. Data files can be written in a columnar format, such as Parquet, or a row-based format, such as Avro. At this time, Hevo loads the ingested data in Append mode, writing it as compressed Parquet files to your S3 bucket.
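
To make the three layers concrete, here is a rough, hedged sketch using pyiceberg against an AWS Glue catalog. The database and table names (hevo_lake.orders) are placeholders. The calls resolve the table through the catalog, print metadata-layer details, and then plan a scan that walks the manifest list and manifests down to the Parquet data files.

    # Rough sketch of the catalog, metadata, and data layers with pyiceberg.
    # The Glue database and table names are hypothetical.
    from pyiceberg.catalog import load_catalog

    # Catalog layer: Glue resolves the table name to its current metadata file.
    catalog = load_catalog("glue", **{"type": "glue"})
    table = catalog.load_table("hevo_lake.orders")

    # Metadata layer: the metadata file records schema, partitions, and snapshots.
    print(table.metadata_location)    # e.g. s3://<bucket>/.../metadata/....metadata.json
    print(table.schema())
    print(table.current_snapshot())

    # Data layer: planning a scan walks the manifest list and manifest files
    # to find the data files (Parquet) that a query would read.
    for task in table.scan().plan_files():
        print(task.file.file_path, task.file.record_count)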

You can configure Iceberg Data Lake as a Destination in your Hevo Edge Pipeline to leverage its reliable and efficient framework for handling extensive analytic datasets.


Modifying Iceberg Data Lake Destination Configuration in Edge

You can modify some settings of your Iceberg Data Lake Destination after its creation. However, any configuration changes will affect all the Pipelines using that Destination.

To modify the configuration of your Iceberg Destination in Edge:

  1. In the detailed view of your Destination, do one of the following:

    • Click the More icon to access the Destination Actions menu, and then click Edit Destination.


    • In the Destination Configuration section, click EDIT.


  2. On the <Your Destination Name> editing page:


    Note: The settings that cannot be changed are grayed out.

    • You can specify a new name for your Destination, not exceeding 255 characters.
  3. Click TEST & SAVE to check the connection to your Iceberg Data Lake Destination and then save the modified configuration.


Data Type Mapping

Hevo internally maps the Source data type to a unified data type, referred to as the Hevo Data Type in the table below. This data type losslessly represents data from all supported Source data types. The Hevo data types are then mapped to the corresponding data types supported in each Destination.

Hevo Data Type            Iceberg Data Type
JSON                      STRING
ARRAY                     LIST
BOOLEAN                   BOOLEAN
BYTE, BYTE_ARRAY          BINARY
DATE                      DATE
DECIMAL                   DECIMAL
DOUBLE                    DOUBLE
FLOAT                     FLOAT
INTEGER                   INT
LONG                      LONG
SHORT                     INT
TIME, TIME_TZ             TIME
TIMESTAMP, TIMESTAMP_TZ   TIMESTAMP
VARCHAR                   STRING

At this time, the following Iceberg data types are not supported by Hevo:

  • MAP

  • STRUCT

  • Any other data type not listed in the table above.
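
As a hedged illustration of the mapping above, the sketch below builds an Iceberg schema with pyiceberg's type classes for several of the supported target types. The field names and IDs are invented for the example and are not part of Hevo's behavior.

    # Illustrative only: an Iceberg schema using target types from the mapping table.
    from pyiceberg.schema import Schema
    from pyiceberg.types import (
        NestedField, StringType, ListType, BooleanType, DateType,
        DecimalType, IntegerType, LongType, TimestamptzType,
    )

    schema = Schema(
        NestedField(field_id=1, name="payload_json", field_type=StringType(), required=False),   # JSON -> STRING
        NestedField(field_id=2, name="is_active", field_type=BooleanType(), required=False),     # BOOLEAN -> BOOLEAN
        NestedField(field_id=3, name="order_date", field_type=DateType(), required=False),       # DATE -> DATE
        NestedField(field_id=4, name="amount",                                                   # DECIMAL -> DECIMAL
                    field_type=DecimalType(precision=38, scale=9), required=False),
        NestedField(field_id=5, name="quantity", field_type=IntegerType(), required=False),      # INTEGER/SHORT -> INT
        NestedField(field_id=6, name="row_id", field_type=LongType(), required=False),           # LONG -> LONG
        NestedField(field_id=7, name="event_ts", field_type=TimestamptzType(), required=False),  # TIMESTAMP_TZ -> TIMESTAMP
        NestedField(field_id=8, name="tags",                                                     # ARRAY -> LIST
                    field_type=ListType(element_id=100, element_type=StringType(), element_required=False),
                    required=False),
    )
    print(schema)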


Destination Considerations

  • Iceberg does not support the JSON data type. Hence, JSON data is written to the Iceberg table using the STRING data type.

  • When creating an AWS Glue database for your Iceberg Data Lake Destination, ensure that the database name does not contain hyphens, as they are not allowed in Iceberg namespaces.
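
Because of the first consideration above, JSON values land in the table as plain strings. The following minimal sketch, assuming a hypothetical hevo_lake.events table with a payload_json column, shows one way to parse those strings back into objects after reading the table with pyiceberg.

    # Minimal sketch: parsing JSON that was stored as STRING.
    # The catalog, table, and column names are assumptions for illustration.
    import json

    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("glue", **{"type": "glue"})
    table = catalog.load_table("hevo_lake.events")

    # Read only the JSON column, then decode each non-empty string value.
    rows = table.scan(selected_fields=("payload_json",)).to_arrow()
    payloads = [json.loads(v) for v in rows["payload_json"].to_pylist() if v]
    print(payloads[:3])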


Limitations

  • The STRUCT and MAP data types are not supported.

  • Hevo supports loading data only in the Append mode.

  • Currently, AWS Glue is the only supported data catalog.
