Amazon S3

Last updated on May 30, 2023

You can load data from files in an S3 bucket into your Destination database or data warehouse using Hevo Pipelines. Hevo supports replication of data from your Amazon S3 bucket in the following file formats: AVRO, CSV, TSV, JSON, and XML.

Hevo automatically unzips any Gzipped files on ingestion. Further, if a file is updated, it is re-ingested in full, as it is not possible to identify the individual changes within it.

As of Release 1.66, __hevo_source_modified_at is uploaded to the Destination as a metadata field. For existing Pipelines that have this field:

  • If this field is displayed in the Schema Mapper, do not map it to a Destination table column; otherwise, the Pipeline displays an error.

  • Hevo automatically loads this information in the __hevo_source_modified_at column, which is already present in the Destination table.

You can, however, continue to use __hevo_source_modified_at to create transformations using the function event.getSourceModifiedAt(). Read Metadata Column __hevo_source_modified_at.

Existing Pipelines that do not have this field are not impacted.
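
If you use this field in a Transformation, a minimal sketch could look like the following. It assumes Hevo's Python code-based Transformation interface (a transform(event) function and event.getProperties()); only event.getSourceModifiedAt() is taken from this page, so verify the rest against your Transformation editor.

  # Minimal sketch of a Python Transformation that reads the source file's
  # modification time via event.getSourceModifiedAt(). The transform(event)
  # and event.getProperties() structure is assumed, not taken from this page.
  def transform(event):
      properties = event.getProperties()

      # Copy the modification time into a regular field so it can be
      # inspected or filtered on in the Destination.
      properties["source_modified_at"] = event.getSourceModifiedAt()

      return event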


Prerequisites


Obtaining Amazon S3 Credentials

You must either obtain the access credentials or generate IAM role-based credentials to allow Hevo to connect to your Amazon S3 account and ingest data from it. Either method allows Hevo to authenticate itself and replicate your Amazon S3 data into your desired Destination.

Obtain the access credentials

You need the Access Key ID and Secret Access Key from your Amazon S3 account to allow Hevo to access the data in it. The secret access key is associated with the access key and is visible only once, at the time it is created. Therefore, you must copy the details or download the key file for later use.

Perform the following steps to obtain your AWS Access Key ID and Secret Access Key:

  1. Log in to the AWS Console.

  2. Click the drop-down next to your profile name in the top right corner of the AWS user interface, and click Security Credentials.

    Security Credentials on console

  3. In the Security Credentials page, expand Access Keys (Access Key ID and Secret Access Key).

  4. Click Create New Access Key.

    Access Key tab

  5. Click Show Access Key to display the generated Access Key ID and Secret Access Key.

    Show Access Key

  6. Copy the keys and save them in a secure location. Alternatively, click Download Key File to download the key file for later use.

    Download Access Key
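
Optionally, before entering these keys in Hevo, you can confirm that they provide the bucket-listing and object-read access that Hevo needs. The following boto3 sketch is illustrative only; the bucket name, object key, and region are placeholders.

  # Optional sanity check: confirm the new access key pair can list the bucket
  # and read an object. Bucket, object key, and region are placeholders.
  import boto3

  s3 = boto3.client(
      "s3",
      aws_access_key_id="AKIA...",          # Access Key ID from step 5
      aws_secret_access_key="...",          # Secret Access Key from step 5
      region_name="us-east-1",              # your bucket's region
  )

  # Requires the s3:ListBucket permission.
  response = s3.list_objects_v2(Bucket="my-hevo-source-bucket", MaxKeys=5)
  print([obj["Key"] for obj in response.get("Contents", [])])

  # Requires the s3:GetObject permission.
  obj = s3.get_object(Bucket="my-hevo-source-bucket", Key="data/sample.csv")
  print(obj["ContentLength"], "bytes readable")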

Generate the IAM role-based credentials

To generate your IAM role-based credentials, you need to:

  1. Create an IAM policy with the required permissions to access data from your S3 bucket.

  2. Create an IAM role for Hevo. The Amazon Resource Name (ARN) and the external ID from this role are required to configure the Amazon S3 Source in Hevo.

These steps are explained in detail below:

Step 1. Create an IAM policy

An IAM policy is required for Hevo to access and ingest the data present in your specified Amazon S3 bucket.

Perform the following steps to create an IAM policy:

  1. Log in to the AWS Console, and select the IAM service.

  2. In the left navigation pane, under Access Management, click Policies.

    AWS nav bar

  3. In the Policies page, click Create policy.

    Create policy

  4. In the Create Policy page, Visual editor section, click Choose a service corresponding to the Service drop-down, and then search for and select S3.

    Select service

  5. In the Actions drop-down, under the Access level section, expand the List and Read drop-downs, and select the ListBucket and GetObject check boxes, respectively.

    Permissions

  6. In the Resources drop-down, do the following to add ARNs for the bucket resource:

    1. Click Add ARN corresponding to the bucket resource.

      Add ARN bucket

    2. In the Add ARN(s) pop-up window, do the following:

      Bucket ARN details

      1. Specify the Bucket name which you want Hevo to access. Alternatively, select the Any check box to grant access to all the buckets in your Amazon S3 account.

        The Specify ARN for bucket field is updated as per the bucket name you specify.

        You can obtain the bucket name from the Buckets page in your Amazon S3 console.

        S3 console

      2. Click Add.

  7. In the Resources drop-down, click Add ARN corresponding to the object resource, and do one of the following in the Add ARN(s) pop-up window:

    Add ARN object

    • Generate the ARN(s) using the object name:

      Object ARN details

      1. Specify the Bucket name which you want Hevo to access. Alternatively, select the Any check box to grant access to all the buckets in your Amazon S3 account.

      2. Specify the Object name whose data you want to ingest using Hevo. Alternatively, select the Any check box to ingest all the objects in your Amazon S3 bucket.

        The Specify ARN for object field is updated as per the details you specify.

      3. Click Add.

      4. Optionally, repeat the above steps to include more objects.

    • Specify the object ARN(s):

      1. Click List ARNs manually.

        List ARNs manually

      2. Obtain the ARN for each object from the object Properties page in your Amazon S3 console.

        Obtaining ARN

      3. Paste the ARN into the Type or paste a list of ARNs field. To add multiple ARNs, specify one ARN per line.

        Paste ARN

      4. Click Add.

  8. (Optional) In the Request conditions drop-down, select the Source IP check box, and in the IP range field, specify Hevo’s IP address for your region.

    Whitelist IP

  9. At the bottom of the page, click Next: Tags.

  10. At the bottom of the Create policy page, click Next: Review.

  11. In the Review policy page, specify a Name and Description for your policy, and click Create policy.

    Policy Description

You are redirected to the Policies page, where you can find the policy that you created.
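
If you prefer to script this step, an equivalent policy can be created with the AWS SDK. The following boto3 sketch is a minimal example, not the exact policy generated by the console; the bucket name and policy name are placeholders, and the optional Source IP condition from step 8 is omitted.

  # Create an equivalent IAM policy programmatically: s3:ListBucket on the
  # bucket and s3:GetObject on its objects. Names below are placeholders.
  import json
  import boto3

  BUCKET = "my-hevo-source-bucket"

  policy_document = {
      "Version": "2012-10-17",
      "Statement": [
          {
              # ListBucket is granted on the bucket itself...
              "Effect": "Allow",
              "Action": "s3:ListBucket",
              "Resource": f"arn:aws:s3:::{BUCKET}",
          },
          {
              # ...while GetObject is granted on the objects inside it.
              "Effect": "Allow",
              "Action": "s3:GetObject",
              "Resource": f"arn:aws:s3:::{BUCKET}/*",
          },
      ],
  }

  iam = boto3.client("iam")
  response = iam.create_policy(
      PolicyName="hevo-s3-read-policy",
      PolicyDocument=json.dumps(policy_document),
      Description="Allows Hevo to list the bucket and read its objects",
  )
  print(response["Policy"]["Arn"])  # note this ARN for Step 2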

Step 2. Create an IAM role and obtain the IAM role ARN and external ID

After you define the IAM policy, you need to create a role for Hevo and assign that policy to it. From this role, you must obtain the ARN and external ID required for configuring Amazon S3 as a Source in Hevo.

Note: You can see the external ID only once, while creating the role. At that time, you must copy and save it in a secure location for later use.

Perform the following steps to create an IAM role:

  1. Log in to the AWS Console, and select the IAM service.

  2. In the left navigation pane, under Access Management, click Roles.

    Role Nav bar

  3. In the Roles page, click Create role.

    Create role

  4. In the Trusted entity type section, select AWS account.

    AWS account

  5. In the An AWS account section, do the following:

    1. Select the Another AWS account option, and specify Hevo’s Account ID (393309748692). This allows you to create a role for Hevo to access and ingest data from your S3 bucket and replicate it to your desired Destination.

      Account ID

    2. In the Options section, select the Require external ID check box, and specify an External ID of your choice. For example, hevo-role-external-id.

      Note: You must save this external ID in a secure location like any other password. This is required while setting up a Pipeline in Hevo.

  6. Click Next.

  7. In the Permissions policies section, select the policy that you created in Step 1 above, and click Next at the bottom of the page.

    Select Policy

  8. In the Name, review, and create page, specify the Role name and Description, and at the bottom of the page, click Create role.

    Role Description

  9. In the Roles page, select the role that you created above.

    Select Role

  10. In the Summary section of your role, copy the ARN. Use this ARN while configuring your Hevo Pipeline.

    Copy ARN
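
This step can also be scripted. The sketch below creates a role that trusts Hevo's AWS account (393309748692) only when the external ID matches, attaches the policy from Step 1, and prints the role ARN. The role name, external ID, and policy ARN are placeholders; adjust them to your own values.

  # Create the cross-account role for Hevo and attach the Step 1 policy.
  # Role name, external ID, and policy ARN are placeholders; the Hevo
  # account ID comes from the steps above.
  import json
  import boto3

  HEVO_ACCOUNT_ID = "393309748692"
  EXTERNAL_ID = "hevo-role-external-id"   # choose your own and save it securely
  POLICY_ARN = "arn:aws:iam::<your-account-id>:policy/hevo-s3-read-policy"

  trust_policy = {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Principal": {"AWS": f"arn:aws:iam::{HEVO_ACCOUNT_ID}:root"},
              "Action": "sts:AssumeRole",
              # Hevo can assume this role only if it supplies the external ID.
              "Condition": {"StringEquals": {"sts:ExternalId": EXTERNAL_ID}},
          }
      ],
  }

  iam = boto3.client("iam")
  role = iam.create_role(
      RoleName="hevo-s3-access-role",
      AssumeRolePolicyDocument=json.dumps(trust_policy),
      Description="Allows Hevo to read data from the S3 bucket",
  )
  iam.attach_role_policy(RoleName="hevo-s3-access-role", PolicyArn=POLICY_ARN)

  # Use this ARN, together with the external ID, while configuring the Source.
  print(role["Role"]["Arn"])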


Configuring Amazon S3 as a Source

Perform the following steps to configure S3 as the Source in your Pipeline:

  1. Click PIPELINES in the Navigation Bar.

  2. Click + CREATE in the Pipeline List View.

  3. In the Select Source Type page, select S3.

  4. In the Configure your S3 Source page, specify the following:

    S3 settings

    • Pipeline Name: A unique name for the Pipeline.

    • Source Setup: The credentials needed to allow Hevo to access your data. Perform the following steps to connect to your Amazon S3 account:

      1. Do one of the following:

        • Connect using Access Credentials:

          • Access Key ID: The AWS Access Key ID that you retrieved in the Obtain the access credentials section above.

          • Secret Access Key: The AWS Secret Access Key for the Access Key ID, which you retrieved in the Obtain the access credentials section above.

          • Bucket Name: The name of the bucket from which you want to ingest data.

          • Bucket Region: The AWS region where the bucket is located.

        • Connect using IAM Role:

          • IAM Role ARN: The Amazon Resource Name (ARN) of the IAM role that you copied in Step 2 above.

          • External ID: The external ID that you specified in Step 2 above.

          • Bucket Name: The name of the bucket from which you want to ingest data.

          • Bucket Region: The AWS region where the bucket is located.

      2. Click TEST & CONTINUE.

    • Data Root: The path for the directory which contains your data. By default, the files are listed from the root of the directory.

      Perform the following steps to select the folder(s) and the data format which you want to ingest using Hevo:

      Select folders to be ingested

      1. Select Folders: The folders which contain the data to be ingested.

      2. Select Type of File: The format of the data file in the Source. Hevo currently supports AVRO, CSV, JSON, TSV, and XML formats.

        Note: You can select only one file format at a time. If your Source data is in a different format, you can export the data to one of the supported formats and then ingest the files.

        Based on the format you select, you must specify some additional settings:

        • AVRO, JSON, TSV:

          • Enable the Include compressed files option if you want to ingest the compressed files of the selected file format from the folders. Hevo currently supports the tar.gz and zip compression types only.

          • Enable the Create Event Types from folders option if the selected folder(s) have subfolders containing files in different formats. Hevo reads each subfolder as a separate Event Type and creates a separate table for it in the Destination.

          • Enable the Convert date/time format fields to timestamp option if you want to convert the date/time format within the files of selected folders to timestamp. For example, the date/time format 07/11/2022, 12:39:23 converts to timestamp 1667804963.

        • CSV:

          • Specify the Field Delimiter. This is the character that separates the fields in each line. For example, \t (tab) or , (comma).

          • Enable the Include compressed files option if you want to ingest the compressed files of the selected file format from the folders. Hevo currently supports the tar.gz and zip compression types only.

          • Disable the Treat First Row As Column Headers option if the Source data file does not contain column headers. Hevo then automatically creates the column headers during ingestion. Default setting: Enabled. Refer to the Example section below.

          • Enable the Create Event Types from folders option if the selected folder(s) have subfolders containing files in different formats. Hevo reads each subfolder as a separate Event Type and creates a separate table for it in the Destination.

          • Enable the Convert date/time format fields to timestamp option if you want to convert the date/time format within the files of selected folders to timestamp. For example, the date/time format 07/11/2022, 12:39:23 converts to timestamp 1667804963.

        • XML:

          • Enable the Include compressed files option if you want to ingest the compressed files of the selected file format from the folders. Hevo currently supports the tar.gz and zip compression types only.

          • Enable the Create Events from child nodes option to load each node under the root node in the XML file as a separate Event.

          • Enable the Create Event Types from folders option if the selected folder(s) have subfolders containing files in different formats. Hevo reads each subfolder as a separate Event Type and creates a separate table for it in the Destination.

      3. Click CONFIGURE SOURCE.

  5. Proceed to configuring the data ingestion and setting up the Destination.
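
As an illustration of the Convert date/time format fields to timestamp option described above, the example value 07/11/2022, 12:39:23 yields the Unix timestamp 1667804963 when the string is read as 7-Nov-2022 in a UTC+05:30 time zone. The time zone applied to your own data may differ, so treat the following Python check as a sketch.

  # Reproduce the conversion quoted above: "07/11/2022, 12:39:23" -> 1667804963.
  # The match assumes day-first parsing and a UTC+05:30 offset, which is an
  # inference from the quoted values, not something this page states.
  from datetime import datetime, timedelta, timezone

  value = "07/11/2022, 12:39:23"
  dt = datetime.strptime(value, "%d/%m/%Y, %H:%M:%S")
  dt = dt.replace(tzinfo=timezone(timedelta(hours=5, minutes=30)))

  print(int(dt.timestamp()))  # 1667804963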


Example: Automatic Column Header Creation for CSV Tables

Consider the following data in CSV format, which has no column headers.

  CLAY COUNTY,32003,11973623
  CLAY COUNTY,32003,46448094
  CLAY COUNTY,32003,55206893
  CLAY COUNTY,32003,15333743
  SUWANNEE COUNTY,32060,85751490
  SUWANNEE COUNTY,32062,50972562
  ST JOHNS COUNTY,846636,32033,
  NASSAU COUNTY,32025,88310177
  NASSAU COUNTY,32041,34865452

If you disable the Treat First Row As Column Headers option, Hevo auto-generates the column headers, as seen in the schema map here:

Column headers generated by Hevo for CSV data

The record in the Destination appears as follows:

Destination record with auto-generated column headers



Revision History

Refer to the following table for the list of key updates made to this page:

Date Release Description of Change
Apr-14-2023 NA Updated the overview section to add information about the file formats supported by Hevo.
Mar-09-2023 NA Updated section, Generate the IAM role-based credentials for consistent information structure.
Nov-08-2022 NA Updated section, Configuring Amazon S3 as a Source to add information about the Convert date/time format fields to timestamp option.
Oct-17-2022 1.99 Updated section, Configuring Amazon S3 as a Source to add information about ingesting compressed files from selected folders.
Sep-21-2022 1.98 - Added sections, Obtaining Amazon S3 Credentials and Generate the IAM role based credentials.
- Renamed section, (Optional) Obtain your Access Key ID and Secret Access Key to Obtain the access credentials.
- Updated section, Configuring Amazon S3 as a Source to add information about connecting to Amazon S3 using IAM role.
Sep-07-2022 1.97 Updated section, Configuring Amazon S3 as a Source to reflect the latest UI.
Apr-18-2022 NA Added section, (Optional) Obtain your Access Key ID and Secret Access Key.
Apr-11-2022 1.86 Updated section, Configuring Amazon S3 as a Source to reflect support for TSV file format.
Mar-21-2022 1.85 Removed section, Limitations as Hevo now supports UTF-16 encoding format for CSV files.
Jun-28-2021 1.66 Updated the page overview with information about __hevo_source_modified_at being uploaded as a metadata field from Release 1.66 onwards.
Feb-22-2021 NA Added the limitation about Hevo not supporting UTF-16 encoding format for CSV data.
