Amazon S3

Last updated on Mar 05, 2024

Amazon Simple Storage Service (S3) is a durable, efficient, secure, and scalable cloud storage service provided by Amazon Web Services (AWS) that can be accessed from anywhere. S3 uses buckets to store data such as images, videos, and documents in multiple formats, organize that data, and retrieve it at any time from the cloud. It also provides access control, versioning, and integration with other AWS services.

Hevo supports the replication of S3 data in the AVRO, CSV, JSON, TSV, and XML file formats. While ingesting data, Hevo automatically unzips any Gzipped files. Further, if any file is updated in the Source, Hevo re-ingests its entire contents as it is not possible to identify individual changes.

For all Pipelines created from Release 1.66 onwards, Hevo uploads the __hevo_source_modified_at column to the Destination as a metadata field to track how recent the replicated data is. As a result, this field is not visible or available for mapping via the Schema Mapper. However, for older Pipelines:

  • If this field is displayed in the Source Event Type, you must ignore it and not map it to a Destination table column; otherwise, the Pipeline displays an error.

  • If this field is already present in the Destination table, Hevo automatically loads the metadata to it.

You can continue to use the __hevo_source_modified_at field to create Transformations using the function event.getSourceModifiedAt(). Read Metadata Column __hevo_source_modified_at.
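
For instance, a minimal Python Transformation along the following lines could copy this value into a regular Destination column. The transform(event) and event.getProperties() skeleton follows Hevo's Transformations documentation; the column name below is illustrative, not a Hevo requirement:

  # A minimal sketch of a Hevo Python Transformation; the Destination column
  # name below is illustrative.
  def transform(event):
      properties = event.getProperties()
      # Copy the Source file's modification time into a regular column.
      properties['source_modified_at'] = event.getSourceModifiedAt()
      return event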

Accessing data in S3 buckets

In S3, access is defined through IAM policies and an IAM role or user. You can create an IAM user or an IAM role, and assign the IAM policy to either of these to define what data Hevo can access.

The following diagram illustrates the steps to do this and configure Amazon S3 as a Source for your Hevo Pipeline. These steps are explained in detail further in this document.

Steps to create an Amazon S3 Pipeline


Prerequisites


Create an IAM Policy

Create an IAM policy with the ListBucket and GetObject permissions. These permissions are required for Hevo to access data from your S3 bucket.

To do this:

  1. Log in to the AWS IAM Console.

  2. In the left navigation pane, under Access management, click Policies.


  3. In the Policies page, click Create policy.


  4. In the Specify permissions page, click JSON, and in the Policy editor section, paste the following JSON statements:


    Note: Replace the placeholder values in the statements below with your own. For example, replace <your_bucket_name> with Hevo-S3-bucket.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::<your_bucket_name>",
                    "arn:aws:s3:::<your_bucket_name>/*"
                ]
            }
        ]
    }
    

    The JSON statements allow Hevo to access and ingest data from the bucket you specify.

  5. At the bottom of the page, click Next.


  6. In the Review and create page, specify the Policy name, and at the bottom of the page, click Create policy.

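If you prefer to script this step, a minimal boto3 sketch that creates the same policy could look like the following. The policy name is an example, and <your_bucket_name> remains a placeholder, as in the JSON above:

  # A boto3 sketch equivalent to the console steps above.
  import json
  import boto3

  iam = boto3.client("iam")

  policy_document = {
      "Version": "2012-10-17",
      "Statement": [{
          "Sid": "VisualEditor0",
          "Effect": "Allow",
          "Action": ["s3:GetObject", "s3:ListBucket"],
          "Resource": [
              "arn:aws:s3:::<your_bucket_name>",
              "arn:aws:s3:::<your_bucket_name>/*",
          ],
      }],
  }

  response = iam.create_policy(
      PolicyName="hevo-s3-read-policy",  # example name
      PolicyDocument=json.dumps(policy_document),
  )
  print(response["Policy"]["Arn"])  # note this ARN; you attach it in later steps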


Obtain Amazon S3 Credentials

To allow Hevo to access and ingest your S3 data, you must generate either IAM role-based credentials or access credentials, attaching the IAM policy created above to the corresponding IAM role or user.

Generate IAM role-based credentials

To generate your IAM role-based credentials, you need to create an IAM role for Hevo and assign the policy that you created in Step 1 above, to the role. Use the Amazon Resource Name (ARN) and external ID from this role while creating your Pipeline.

1. Create an IAM role and assign the IAM policy

  1. Log in to the AWS IAM Console.

  2. In the left navigation pane, under Access Management, click Roles.


  3. In the Roles page, click Create role.


  4. In the Trusted entity type section, select AWS account.


  5. In the An AWS account section, select Another AWS account, and in the Account ID field, specify Hevo’s Account ID, 393309748692.


    This account ID enables you to assign a role to Hevo and ingest data from your S3 bucket for replicating it to your desired Destination.

  6. In the Options section, select the Require external ID check box, specify an External ID of your choice, and click Next.


  7. In the Add Permissions page, select the policy that you created in Step 1 above, and at the bottom of the page, click Next.


  8. In the Name, review, and create page, specify the Role name and Description of your choice, and at the bottom of the page, click Create role.


You are redirected to the Roles page.
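
For reference, the selections above produce a trust policy on the role that trusts Hevo's AWS account and requires the external ID. The following boto3 sketch mirrors the same setup; the role name is an example, the external ID is the sample value used elsewhere in this document, and the policy ARN placeholder refers to the policy from Step 1:

  # A boto3 sketch equivalent to the console steps above.
  import json
  import boto3

  iam = boto3.client("iam")

  # Trust policy: trust Hevo's AWS account (393309748692) and require the
  # external ID on sts:AssumeRole, mirroring steps 4-6 above.
  trust_policy = {
      "Version": "2012-10-17",
      "Statement": [{
          "Effect": "Allow",
          "Principal": {"AWS": "arn:aws:iam::393309748692:root"},
          "Action": "sts:AssumeRole",
          "Condition": {"StringEquals": {"sts:ExternalId": "hevo-role-external-id"}},
      }],
  }

  role = iam.create_role(
      RoleName="hevo-s3-role",  # example name
      AssumeRolePolicyDocument=json.dumps(trust_policy),
      Description="Allows Hevo to read data from the S3 bucket",
  )
  iam.attach_role_policy(
      RoleName="hevo-s3-role",
      PolicyArn="arn:aws:iam::<your_account_id>:policy/<your_policy_name>",
  )
  print(role["Role"]["Arn"])  # the ARN you supply while creating the Pipeline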

2. Obtain the ARN and external ID

  1. In the Roles page of your IAM console, click the role that you created above.


  2. In the <Role name> page, Summary section, click the copy icon below the ARN field and save it securely like any other password.


  3. In the Trust relationships tab, copy the external ID corresponding to the sts:ExternalID field. For example, hevo-role-external-id in the image below.


You can use this ARN and external ID while configuring your Pipeline.
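
To sanity-check the role before configuring the Pipeline, you can assume it yourself and list the bucket. The following boto3 sketch mirrors how the ARN and external ID are used on Hevo's side; all angle-bracket values are placeholders:

  # A sketch that assumes the role with the external ID and lists the bucket.
  import boto3

  sts = boto3.client("sts")
  creds = sts.assume_role(
      RoleArn="arn:aws:iam::<your_account_id>:role/<your_role_name>",
      RoleSessionName="hevo-role-check",   # arbitrary session name
      ExternalId="hevo-role-external-id",  # must match the trust policy
  )["Credentials"]

  s3 = boto3.client(
      "s3",
      aws_access_key_id=creds["AccessKeyId"],
      aws_secret_access_key=creds["SecretAccessKey"],
      aws_session_token=creds["SessionToken"],
  )
  print(s3.list_objects_v2(Bucket="<your_bucket_name>").get("KeyCount"))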

Generate access credentials

Your access credentials include the access key and the secret access key. To generate these, you need to create an IAM user for Hevo and assign the policy you created in Step 1 above, to it.

Note: The secret key is associated with an access key and is visible only once. Therefore, you must make sure to save the details or download the key file for later use.

1. Create an IAM user and assign the IAM policy

  1. Log in to the AWS IAM Console.

  2. In the left navigation pane, under Access management, click Users.


  3. In the Users page, click Add users.


  4. In the Specify user details page, specify the User name, and click Next.


  5. In the Set permissions page, Permissions options section, click Attach policies directly.


  6. In the Permissions policies section, search and select the check box corresponding to the policy that you created in Step 1 above, and at the bottom of the page, click Next.


  7. At the bottom of the Review and create page, click Create user.


2. Generate the access keys

  1. In the Users page of your IAM console, click the user that you created above.


  2. In the <User name> page, select the Security credentials tab.


  3. In the Access keys section, click Create access key.


  4. In the Access key best practices & alternatives page, select Command Line Interface (CLI).


  5. At the bottom of the page, select the I understand the above… check box and click Next.

  6. (Optional) Specify a description for the access key.


  7. Click Create access key.

  8. In the Retrieve access keys page, Access key section, click the copy icon in the Access key and Secret access key fields and save the keys securely like any other password.
    Optionally, click Download .csv file to save the keys on your local machine.

    Note: Once you leave this page, you cannot view these keys again.


You can use these keys while configuring your Pipeline.
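
Before configuring the Pipeline, you can verify that the keys cover both permissions in the IAM policy. A minimal boto3 sketch, with placeholders in angle brackets:

  # A sketch that exercises the two permissions in the IAM policy: ListBucket
  # to enumerate files and GetObject to read one.
  import boto3

  s3 = boto3.client(
      "s3",
      aws_access_key_id="<your_access_key>",
      aws_secret_access_key="<your_secret_access_key>",
  )

  # s3:ListBucket -- the Prefix argument corresponds to the Path Prefix field
  # described later in this document.
  listing = s3.list_objects_v2(Bucket="<your_bucket_name>", Prefix="")
  keys = [obj["Key"] for obj in listing.get("Contents", [])]
  print(keys[:5])

  # s3:GetObject -- read the first file to confirm object access.
  if keys:
      body = s3.get_object(Bucket="<your_bucket_name>", Key=keys[0])["Body"].read()
      print(len(body), "bytes")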


Configuring Amazon S3 as a Source

Perform the following steps to configure S3 as the Source in your Pipeline:

  1. Click PIPELINES in the Navigation Bar.

  2. Click + CREATE in the Pipeline List View.

  3. In the Select Source Type page, select S3.

  4. In the Configure your S3 Source page, specify the following:


    • Pipeline Name: A unique name for the Pipeline.

    • Source Setup: The credentials that allow Hevo to access data from your S3 account. Select one of the following setup methods: connect using the IAM role, providing the ARN and external ID obtained above, or connect using the access credentials, providing the access key and secret access key obtained above.

  5. Click TEST & CONTINUE.

  6. In the Data Root section, specify the following. The data root is the set of directories or files that contain your data. By default, files are listed from the root directory.


    • Select the folders from which you want to ingest data.

      Note: If Hevo cannot retrieve the list of files from your S3 bucket, it displays the Path Prefix field. In this case, you must specify the prefix of the path for the directory that contains your data. To specify path prefixes for multiple files, click the Plus (+) icon.

    • File Format: The format of the data file in the selected folders. Hevo supports AVRO, CSV, JSON, TSV, and XML formats.

      Note: You can select only one file format at a time. If your Source data is in a different format, you can export it to one of the supported formats and then ingest the files.

      Based on the format you select, you must specify some additional settings:

      • Field Delimiter: The character that separates the fields in each line. For example, \t or ,.

        This field is visible only for CSV data.

      • Create Events from child nodes: If enabled, Hevo loads each node present under the root node in the XML file as a separate Event. If disabled, Hevo combines and loads all nodes present in the XML file as a single Event.

        This field is visible only for XML data.

      • Treat First Row as Column Headers: If enabled, Hevo uses the first row of your CSV file as the column headers rather than treating it as an Event. If disabled, Hevo automatically creates the column headers during ingestion. Default setting: Enabled. Refer to the Example section below.

        This field is visible only for CSV data.

      • Include compressed files: If enabled, Hevo also ingests the compressed files of the selected file format from the folders. Hevo supports the tar.gz and zip compression types only. If disabled, Hevo does not ingest any compressed file present in the selected folders.

        This field is visible for all supported data formats.

      • Create Event Types from folders: If enabled, Hevo ingests each subfolder as a separate Event Type. If disabled, Hevo merges subfolders into their parent folders and ingests them as one Event Type.

        This field is visible for all supported data formats.

      • Convert date/time format fields to timestamp: If enabled, Hevo converts the date/time format fields within the files of the selected folders to timestamps. For example, the date/time value 07/11/2022, 12:39:23 converts to the timestamp 1667804963. If disabled, Hevo ingests the date/time fields in their original format. (A sketch reproducing this sample conversion appears after these steps.)

        This field is visible for all supported data formats.

    • Click CONFIGURE SOURCE.

  7. Proceed to configuring the data ingestion and setting up the Destination.
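
For reference, the sample conversion mentioned in the Convert date/time format fields to timestamp option can be reproduced in Python. Matching the documented value 1667804963 requires reading 07/11/2022, 12:39:23 as DD/MM/YYYY in the UTC+05:30 time zone; this is an inference from the numbers, not something Hevo states:

  # Reproduces the sample conversion: 07/11/2022, 12:39:23 -> 1667804963.
  # Assumption: DD/MM/YYYY format and a UTC+05:30 offset, inferred from the
  # documented values.
  from datetime import datetime, timedelta, timezone

  tz = timezone(timedelta(hours=5, minutes=30))
  dt = datetime.strptime("07/11/2022, 12:39:23", "%d/%m/%Y, %H:%M:%S")
  print(int(dt.replace(tzinfo=tz).timestamp()))  # 1667804963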


Data Replication

For Teams Created   | Default Ingestion Frequency | Minimum Ingestion Frequency | Maximum Ingestion Frequency | Custom Frequency Range (in Hrs)
Before Release 2.21 | 1 Hr                        | 5 Mins                      | 24 Hrs                      | 1-24
After Release 2.21  | 6 Hrs                       | 30 Mins                     | 24 Hrs                      | 1-24

Note: The custom frequency must be set in hours as an integer value. For example, 1, 2, or 3 but not 1.5 or 1.75.


Example: Automatic Column Header Creation for CSV Tables

If you disable the Treat First Row as Column Headers option while creating a Pipeline, Hevo automatically generates the column headers while ingesting data from the Source.

For example, consider the following data in CSV format, which has no column headers.

  CLAY COUNTY,32003,11973623
  CLAY COUNTY,32003,46448094
  CLAY COUNTY,32003,55206893
  CLAY COUNTY,32003,15333743
  SUWANNEE COUNTY,32060,85751490
  SUWANNEE COUNTY,32062,50972562
  ST JOHNS COUNTY,32033,846636
  NASSAU COUNTY,32025,88310177
  NASSAU COUNTY,32041,34865452

When Hevo ingests this data, it auto-generates the column headers, as displayed below:

Column headers generated by Hevo for CSV data

The record in the Destination appears as follows:

Destination record with auto-generated column headers
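
The behavior is similar to reading a header-less CSV and generating positional column names. A pandas sketch of the idea follows; the file name and the column_N naming scheme are illustrative, and Hevo's actual generated names may differ:

  # Illustrative only: reads a header-less CSV and assigns positional column
  # names. Hevo's actual auto-generated header names may differ.
  import pandas as pd

  df = pd.read_csv("counties.csv", header=None)  # hypothetical file name
  df.columns = [f"column_{i + 1}" for i in range(df.shape[1])]
  print(df.head())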





Revision History

Refer to the following table for the list of key updates made to this page:

Date        | Release | Description of Change
Mar-05-2024 | 2.21    | Added the Data Replication section.
Jul-17-2023 | NA      | Updated section, Configuring Amazon S3 as a Source to add information about path prefix.
Jun-26-2023 | NA      | Updated the page to provide better clarity.
Apr-14-2023 | NA      | Updated the overview section to add information about the file formats supported by Hevo.
Mar-09-2023 | NA      | Updated section, Generate the IAM role-based credentials for consistent information structure.
Nov-08-2022 | NA      | Updated section, Configuring Amazon S3 as a Source to add information about the Convert date/time format fields to timestamp option.
Oct-17-2022 | 1.99    | Updated section, Configuring Amazon S3 as a Source to add information about ingesting compressed files from selected folders.
Sep-21-2022 | 1.98    | Added sections, Obtaining Amazon S3 Credentials and Generate the IAM role based credentials; renamed section, (Optional) Obtain your Access Key ID and Secret Access Key to Generate access credentials; updated section, Configuring Amazon S3 as a Source to add information about connecting to Amazon S3 using an IAM role.
Sep-07-2022 | 1.97    | Updated section, Configuring Amazon S3 as a Source to reflect the latest UI.
Apr-18-2022 | NA      | Added section, (Optional) Obtain your Access Key ID and Secret Access Key.
Apr-11-2022 | 1.86    | Updated section, Configuring Amazon S3 as a Source to reflect support for the TSV file format.
Mar-21-2022 | 1.85    | Removed section, Limitations as Hevo now supports the UTF-16 encoding format for CSV files.
Jun-28-2021 | 1.66    | Updated the page overview with information about __hevo_source_modified_at being uploaded as a metadata field from Release 1.66 onwards.
Feb-22-2021 | NA      | Added the limitation about Hevo not supporting the UTF-16 encoding format for CSV data.
