Amazon S3

Last updated on Sep 03, 2024

This Destination is currently available for Early Access. Please contact your Hevo account executive or the Support team to enable it for your team. Alternatively, request early access to try out one or more such features.

Amazon Simple Storage Service (S3) is a durable, efficient, secure, and scalable cloud storage service provided by Amazon Web Services (AWS) that can be accessed from anywhere. S3 uses buckets to store data in multiple formats, such as images, videos, and documents, and lets you organize that data and retrieve it from the cloud at any time. It also provides access control, versioning, and integration with other AWS services.

Hevo can ingest data from any of your Pipelines and load it in near real-time to your S3 bucket using the Append Rows on Update mode. The ingested data is loaded as Parquet or JSONL files to the S3 buckets.

Note: As the data is stored in file format in the S3 bucket, you cannot view the Destination schema through the Schema Mapper or query the loaded data using the Workbench.

Hevo can store data in compressed or uncompressed form in the S3 bucket. Refer to the table below for the supported compression algorithms:

File Format   Supported Compression
Parquet       Uncompressed, Snappy
JSONL         Uncompressed, Gzip
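
For example, a gzip-compressed JSONL file loaded by Hevo can be read back with a short Python script. The bucket name and object key below are placeholders and the file naming is only an assumption; the boto3, gzip, and json usage is standard:

import gzip
import json

import boto3

# Placeholder bucket and key; replace with your bucket name and the path of a
# file that Hevo loaded.
BUCKET = "my-hevo-destination-bucket"
KEY = "sales_pipeline/orders/2024-07-08/1234/part-00000.jsonl.gz"

s3 = boto3.client("s3")

# Download the object and decompress it in memory.
body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
lines = gzip.decompress(body).decode("utf-8").splitlines()

# Each line in a JSONL file is an independent JSON record.
records = [json.loads(line) for line in lines if line.strip()]
print(f"Read {len(records)} records")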

If you are new to AWS or do not have an AWS account, follow the steps listed in the Create an AWS account section, and then set up an Amazon S3 bucket. You can then configure the S3 bucket as a Destination in Hevo.

The following image illustrates the key steps required for configuring Amazon S3 as a Destination in Hevo:

S3 Destination Process Flow


Configuring the Pipeline Settings

When you create a Pipeline with an S3 Destination, you must specify the directory path, or folder structure, at which Hevo loads the data files into your S3 bucket.

This is the default directory path:

${PIPELINE_NAME}/${OBJECT_NAME}/${DATE}/${JOB_ID}

Hevo creates the data files in this path by replacing these parameters as follows:

  • ${PIPELINE_NAME}: The name of your Pipeline that uses the configured S3 bucket as a Destination.

  • ${OBJECT_NAME}: The name of the Source object from which data was ingested.

  • ${DATE}: The date when the data was loaded to your S3 bucket.

  • ${JOB_ID}: The ID of the job in which the data ingestion task ran.

If you specify a prefix while configuring your S3 Destination, it is prepended to the directory path, and your data files are created at that location.

Note: ${PIPELINE_NAME} and ${OBJECT_NAME} are mandatory parameters, and your directory path must contain both of them.
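
For illustration, assume a Pipeline named sales_pipeline ingesting an orders object, with the prefix analytics specified in the Destination configuration. A data file loaded on Jul-08-2024 as part of job 1234 would then be created under a path similar to the following (the exact date format and file names that Hevo generates are not shown here):

analytics/sales_pipeline/orders/2024-07-08/1234/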

You can also specify a directory path to organize your data files into folders created using time-based parameters. For this, append one or more of the following parameters after ${PIPELINE_NAME}/${OBJECT_NAME}:

  • ${YEAR}: The year when the data load task ran.

  • ${MONTH}: The month when the data load task ran.

  • ${DAY}: The day when the data load task ran.

  • ${HOUR}: The hour of the day when the data load task ran.

For example, if you want to organize your Source data in the S3 bucket based on the day and hour, you should specify the path as ${PIPELINE_NAME}/${OBJECT_NAME}/${DAY}/${HOUR}.
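
Once data is loaded into such time-based folders, you can list the files for a given day and hour with the AWS SDK. The sketch below uses boto3; the bucket name and prefix are placeholders, and the folder values shown assume the ${DAY} and ${HOUR} parameters resolve to two-digit numbers:

import boto3

# Placeholder names; replace with your bucket, Pipeline, and Source object.
BUCKET = "my-hevo-destination-bucket"
PREFIX = "sales_pipeline/orders/08/13/"  # ${PIPELINE_NAME}/${OBJECT_NAME}/${DAY}/${HOUR}

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# List every data file Hevo loaded for this object in that day and hour.
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])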


Limitations

  • Your S3 bucket must be created in one of the AWS regions supported by Hevo.

  • At this time, Hevo supports loading data only in the Append Rows on Update mode.



Revision History

Refer to the following table for the list of key updates made to this page:

Date          Release   Description of Change
Jul-08-2024   NA        Updated section Configuring the Pipeline Settings to revise the definitions of the directory path parameters.
Feb-05-2024   2.20      New document.
