Amazon S3

You can load data from files in an S3 bucket into your Destination database or data warehouse using Hevo Pipelines.

Hevo automatically unzips any Gzipped files on ingestion. Further, if a file is updated, it is re-ingested in its entirety, as it is not possible to identify the individual changes made to it.
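Gzipped files are typically recognized by their magic bytes (0x1f 0x8b). The following is a minimal Python sketch of this auto-unzip behavior; the helper name is hypothetical and this is not Hevo's actual implementation:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any Gzip stream

def maybe_decompress(raw: bytes) -> bytes:
    """Return the uncompressed payload if `raw` is Gzipped, else `raw` unchanged.

    Hypothetical helper illustrating 'auto-unzip on ingestion'; not Hevo's code.
    """
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw

# Usage: a Gzipped payload is transparently unpacked, plain data passes through.
payload = gzip.compress(b"CLAY COUNTY,32003,11973623\n")
assert maybe_decompress(payload) == b"CLAY COUNTY,32003,11973623\n"
assert maybe_decompress(b"plain text") == b"plain text"
```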

As of Release 1.66, __hevo_source_modified_at is uploaded to the Destination as a metadata field. For existing Pipelines that have this field:

  • If this field is displayed in the Schema Mapper, you must ignore it and not map it to a Destination table column; otherwise, the Pipeline displays an error.

  • Hevo automatically loads this information in the __hevo_source_modified_at column, which is already present in the Destination table.

You can, however, continue to use __hevo_source_modified_at to create transformations using the function event.getSourceModifiedAt(). Read Metadata Column __hevo_source_modified_at.

Existing Pipelines that do not have this field are not impacted.


Prerequisites

  • An active AWS account with root user access or an IAM user with the required permissions.

  • The user has the ListObjects and GetObject permissions on the S3 bucket.
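In IAM terms, the ListObjects operation is authorized by the s3:ListBucket action, while GetObject maps to s3:GetObject. A minimal policy granting these permissions might look like the following sketch, where the bucket name is a placeholder:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::your-bucket-name"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::your-bucket-name/*"
    }
  ]
}
```

Note that s3:ListBucket applies to the bucket ARN itself, while s3:GetObject applies to the objects within it (the /* suffix).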


(Optional) Obtain your Access Key ID and Secret Access Key

The AWS Access Key ID and Secret Access Key allow Hevo to authenticate and replicate your Amazon S3 data into your desired Destination. You must specify these while configuring Amazon S3 as a Source in Hevo.

To get a Secret Access Key, you must create a new Access Key. The Secret Access Key is visible only once, immediately after the Access Key is created. At that time, you must either copy the details or download the key file for later use.

Perform the following steps to obtain your AWS Access Key ID and Secret Access Key:

  1. Log in to the AWS Console.

  2. Click the drop-down next to your profile name in the top right corner of the AWS user interface, and click Security Credentials.

    Security Credentials on console

  3. In the Security Credentials page, expand Access Keys (Access Key ID and Secret Access Key).

    Access Key tab

  4. Click Create New Access Key.

  5. Click Show Access Key to display the generated Access Key ID and Secret Access Key. Copy the details or download the key file for later use.

    Download Access Key


Configuring Amazon S3 as a Source

To configure Amazon S3 as a Source in Hevo:

  1. Click PIPELINES in the Asset Palette.

  2. Click + CREATE in the Pipeline List View.

  3. In the Select Source Type page, select S3.

  4. In the Configure your S3 Source page, specify the following:

    S3 settings

    • Pipeline Name: A unique name for the Pipeline.

    • Access Key ID: The AWS Access Key ID that you obtained in the section above.

    • Secret Access Key: The AWS Secret Access Key for the Access Key ID that you obtained in the section above.

    • Bucket: The name of the bucket from which you want to ingest data.

    • Bucket Region: Choose the AWS region where the bucket is located.

    • Path Prefix: The prefix of the path for the directory that contains the data. By default, the files are listed from the root of the directory.

    • File Format: The format of the data file in the Source. Hevo currently supports AVRO, CSV, JSON, TSV, and XML formats. Contact Hevo Support if your Source data is in any other format.

      Note: Files located directly at the prefix path (and not within a subdirectory) are ignored.

      Based on the format you select, you must specify some additional settings:

      • CSV:

        1. Specify the Field Delimiter. This is the character that separates the fields in each line. For example, `\t` or `,`.

        2. Disable the Treat First Row As Column Headers option if the Source data file does not contain column headers. Hevo then automatically creates the headers during ingestion. Default setting: Enabled. Refer to the section, Example: Automatic Column Header Creation for CSV Tables.

        3. Enable the Create Event Types from folders option if the path prefix has subdirectories containing files in different formats. Hevo reads each subdirectory as a separate Event Type.

      • TSV:

        1. Disable the Treat First Row As Column Headers option if the Source data file does not contain column headers. Hevo automatically creates the headers during ingestion. Default setting: Enabled.

        2. Enable the Create Event Types from folders option if the path prefix has subdirectories containing files in different formats. Hevo reads each subdirectory as a separate Event Type.

      • JSON: Enable the Create Event Types from folders option if the path prefix has subdirectories containing files in different formats. Hevo reads each subdirectory as a separate Event Type.

      • XML: Enable the Create Events from child nodes option to load each node under the root node in the XML file as a separate Event.

    • Advanced Settings

      • Delay in minutes: The time (in minutes) that Hevo must wait post-authentication for the files to be available for ingestion.

        For the S3 Source, a file you upload may become available for ingestion only after some delay. Therefore, if Hevo ingests objects strictly from the last modified timestamp onwards, the latest uploaded objects might fail to get ingested. Further, since the timestamp moves ahead in the next ingestion cycle, these objects would not be ingested in subsequent runs either and would eventually be missed. To circumvent this issue, Hevo processes only the files whose last modified timestamp < (current timestamp - Delay in minutes).

        However, with this, your data ingestion always lags the current time by the value specified in this field. If you need to modify the Delay in minutes value later, you must create a new Pipeline.

        Read the recent update about Amazon S3 Strong Consistency.

  5. Click TEST & CONTINUE to proceed with setting up the Destination.
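The Delay in minutes behavior described above can be sketched as a simple timestamp filter. This is a hypothetical illustration of the selection rule, not Hevo's implementation; the function name and the (key, last_modified) file list are assumptions:

```python
from datetime import datetime, timedelta, timezone

def files_ready_for_ingestion(files, delay_minutes, now=None):
    """Select files whose last-modified timestamp is older than (now - delay).

    `files` is a list of (key, last_modified) tuples; only files with
    last_modified < now - delay_minutes are considered safe to ingest,
    so that late-arriving uploads are not skipped over.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(minutes=delay_minutes)
    return [key for key, last_modified in files if last_modified < cutoff]

# Usage: with a 15-minute delay, a file modified 5 minutes ago is held back
# until the next ingestion cycle, while an older file is picked up now.
now = datetime(2022, 4, 28, 12, 0, tzinfo=timezone.utc)
files = [
    ("data/orders/a.csv", now - timedelta(minutes=30)),  # old enough: ingested
    ("data/orders/b.csv", now - timedelta(minutes=5)),   # too recent: deferred
]
assert files_ready_for_ingestion(files, 15, now=now) == ["data/orders/a.csv"]
```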


Example: Automatic Column Header Creation for CSV Tables

Consider the following data in CSV format, which has no column headers.

  CLAY COUNTY,32003,11973623
  CLAY COUNTY,32003,46448094
  CLAY COUNTY,32003,55206893
  CLAY COUNTY,32003,15333743
  SUWANNEE COUNTY,32060,85751490
  SUWANNEE COUNTY,32062,50972562
  ST JOHNS COUNTY,32033,846636
  NASSAU COUNTY,32025,88310177
  NASSAU COUNTY,32041,34865452

If you disable the Treat First Row As Column Headers option, Hevo auto-generates the column headers, as seen in the schema map here:

Column headers generated by Hevo for CSV data

The record in the Destination appears as follows:

Destination record with auto-generated column headers
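The header handling above can be sketched in a few lines of Python. The `column_N` naming scheme is an assumption for illustration only; it is not necessarily the names Hevo generates:

```python
import csv
import io

def parse_csv(text, first_row_is_header=False, delimiter=","):
    """Parse CSV text into records; when the first row is not a header,
    generate placeholder column names (the `column_N` scheme is hypothetical)."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if first_row_is_header:
        header, data = rows[0], rows[1:]
    else:
        header = [f"column_{i + 1}" for i in range(len(rows[0]))]
        data = rows
    return [dict(zip(header, row)) for row in data]

# Usage: headerless Source data gets auto-generated column names.
records = parse_csv("CLAY COUNTY,32003,11973623\nNASSAU COUNTY,32025,88310177\n")
assert records[0] == {"column_1": "CLAY COUNTY",
                     "column_2": "32003",
                     "column_3": "11973623"}
```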




Revision History

Refer to the following table for the list of key updates made to this page:

Date | Release | Description of Change
Apr-18-2022 | NA | Added section, (Optional) Obtain your Access Key ID and Secret Access Key.
Apr-11-2022 | 1.86 | Updated section, Configuring Amazon S3 as a Source to reflect support for TSV file format.
Mar-21-2022 | 1.85 | Removed section, Limitations as Hevo now supports UTF-16 encoding format for CSV files.
Jun-28-2021 | 1.66 | Updated the page overview with information about __hevo_source_modified_at being uploaded as a metadata field from Release 1.66 onwards.
Feb-22-2021 | NA | Added the limitation about Hevo not supporting UTF-16 encoding format for CSV data.
Last updated on 28 Apr 2022