Amazon S3
You can load data from files in an S3 bucket into your Destination database or data warehouse using Hevo Pipelines. Hevo supports replication of data from your Amazon S3 bucket in the following file formats: AVRO, CSV, TSV, JSON, and XML.
Hevo automatically unzips any Gzipped files on ingestion. Further, if a file is updated in the S3 bucket, it is re-ingested in full, as it is not possible to identify the individual changes within it.
As of Release 1.66, __hevo_source_modified_at is uploaded to the Destination as a metadata field. For existing Pipelines that have this field:

- If this field is displayed in the Schema Mapper, you must ignore it and not try to map it to a Destination table column; otherwise, the Pipeline displays an error.
- Hevo automatically loads this information into the __hevo_source_modified_at column, which is already present in the Destination table.

You can, however, continue to use __hevo_source_modified_at to create Transformations using the function event.getSourceModifiedAt(). Read Metadata Column __hevo_source_modified_at.

Existing Pipelines that do not have this field are not impacted.
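As a sketch of how this metadata can be used in a Transformation: the snippet below assumes Hevo's Python-based Transformations, in which a transform(event) function receives each Event and event.getProperties() returns its fields. The Event class here is only a local stand-in for illustration; in an actual Transformation, the platform supplies the event object.

```python
# Minimal stand-in for the Event object that Hevo passes to a
# Transformation. This class exists only so the sketch runs locally.
class Event:
    def __init__(self, properties, source_modified_at):
        self._properties = properties
        self._source_modified_at = source_modified_at

    def getProperties(self):
        return self._properties

    def getSourceModifiedAt(self):
        return self._source_modified_at


def transform(event):
    # Copy the Source-modification timestamp into a regular field so it
    # can be queried downstream, e.g. for data-freshness checks.
    properties = event.getProperties()
    properties["modified_at_epoch"] = event.getSourceModifiedAt()
    return event


# Local usage example with a sample record and epoch timestamp:
e = Event({"county": "CLAY COUNTY"}, 1667804963)
transform(e)
print(e.getProperties()["modified_at_epoch"])  # 1667804963
```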
Prerequisites
- An active AWS account and an S3 bucket from which data is to be ingested.
- You are logged in as a root user, or as an IAM user with the permissions to obtain the access credentials or IAM role-based credentials.
- The access credentials or IAM role-based credentials are available to authenticate Hevo on your AWS account.
- The ListBucket and GetObject permissions are granted in your IAM policy, if you are configuring your S3 Source using IAM role-based credentials.
- You are assigned the Team Administrator, Team Collaborator, or Pipeline Administrator role in Hevo to create the Pipeline.
Obtaining Amazon S3 Credentials
You must either obtain the access credentials or generate the IAM role-based credentials to allow Hevo to connect to your Amazon S3 account and ingest data from it. These methods allow Hevo to establish authentication and replicate your Amazon S3 data into your desired Destination.
Obtain the access credentials
You need the Access Key ID and Secret Access Key from your Amazon S3 account to allow Hevo to access the data from it. A secret key is associated with the access key and is visible only once. Therefore, you must make sure to copy the details or download the key file for later use.
Perform the following steps to obtain your AWS Access Key ID and Secret Access Key:
1. Log in to the AWS Console.
2. Click the drop-down next to your profile name in the top right corner of the AWS user interface, and click Security Credentials.
3. In the Security Credentials page, expand Access Keys (Access Key ID and Secret Access Key).
4. Click Create New Access Key.
5. Click Show Access Key to display the generated Access Key ID and Secret Access Key.
6. Copy the keys and save them in a secure location. Alternatively, click Download Key File to download the key file for later use.
Generate the IAM role-based credentials
To generate your IAM role-based credentials, you need to:

1. Create an IAM policy with the required permissions to access data from your S3 bucket.
2. Create an IAM role for Hevo. The Amazon Resource Name (ARN) and the external ID from this role are required to configure the Amazon S3 Source in Hevo.
These steps are explained in detail below:
Step 1. Create an IAM policy
An IAM policy is required for Hevo to access and ingest the data present in your specified Amazon S3 bucket.
Perform the following steps to create an IAM policy:
1. Log in to the AWS Console, and select the IAM service.
2. In the left navigation pane, under Access Management, click Policies.
3. In the Policies page, click Create policy.
4. In the Create Policy page, Visual editor section, click Choose a service corresponding to the Service drop-down, and search for and select S3.
5. In the Actions drop-down, under the Access level section, expand the List and Read drop-downs, and select the ListBucket and GetObject check boxes, respectively.
6. In the Resources drop-down, do the following to add ARNs for the bucket resource:
    - Click Add ARN corresponding to the bucket resource.
    - In the Add ARN(s) pop-up window, do the following:
        - Specify the Bucket name which you want Hevo to access. Alternatively, select the Any check box to grant access to all the buckets in your Amazon S3 account. The Specify ARN for bucket field is updated as per the bucket name you specify. You can obtain the bucket name from the Buckets page in your Amazon S3 console.
        - Click Add.
7. In the Resources drop-down, click Add ARN corresponding to the object resource, and do one of the following in the Add ARN(s) pop-up window:
    - Generate the ARN(s) using the object name:
        - Specify the Bucket name which you want Hevo to access. Alternatively, select the Any check box to grant access to all the buckets in your Amazon S3 account.
        - Specify the Object name whose data you want to ingest using Hevo. Alternatively, select the Any check box to ingest all the objects in your Amazon S3 bucket. The Specify ARN for object field is updated as per the details you specify.
        - Click Add.
        - Optionally, repeat the above steps to include more objects.
    - Specify the object ARN(s) manually:
        - Click List ARNs manually.
        - Obtain the ARN for each object from the object Properties page in your Amazon S3 console.
        - Paste the ARNs into the Type or paste a list of ARNs field. To add multiple ARNs, specify one ARN per line.
        - Click Add.
8. (Optional) In the Request conditions drop-down, select the Source IP check box, and in the IP range field, specify Hevo's IP address for your region.
9. At the bottom of the page, click Next: Tags.
10. At the bottom of the Create policy page, click Next: Review.
11. In the Review policy page, specify a Name and Description for your policy, and click Create policy.

You are redirected to the Policies page, where you can find the policy that you created.
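The Visual editor selections above produce a JSON policy document equivalent to the following sketch, which grants only the two permissions listed in the Prerequisites. The bucket name your-bucket is a placeholder; the console generates the actual JSON for you.

```python
import json

# Illustrative IAM policy equivalent to the Visual editor steps above:
# s3:ListBucket on the bucket itself, and s3:GetObject on its objects.
# "your-bucket" is a placeholder; substitute your own bucket name.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::your-bucket",
        },
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::your-bucket/*",
        },
    ],
}

print(json.dumps(policy, indent=2))
```

Note that ListBucket applies to the bucket ARN, while GetObject applies to the object ARNs (the bucket ARN followed by /*), which is why the steps above add ARNs for both resource types.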
Step 2. Create an IAM role and obtain the IAM role ARN and external ID
After you define the IAM policy, you need to create a role for Hevo and assign that policy to it. From this role, you must obtain the ARN and external ID required for configuring Amazon S3 as a Source in Hevo.
Note: You can see the external ID only once, while creating the role. At that time, you must copy and save it in a secure location for later use.
Perform the following steps to create an IAM role:
1. Log in to the AWS Console, and select the IAM service.
2. In the left navigation pane, under Access Management, click Roles.
3. In the Roles page, click Create role.
4. In the Trusted entity type section, select AWS account.
5. In the An AWS account section, do the following:
    - Select the Another AWS account option, and specify Hevo's Account ID (393309748692). This allows you to create a role for Hevo to access and ingest data from your S3 bucket and replicate it to your desired Destination.
    - In the Options section, select the Require external ID check box, and specify an External ID of your choice. For example, hevo-role-external-id.
      Note: You must save this external ID in a secure location like any other password. This is required while setting up a Pipeline in Hevo.
6. Click Next.
7. In the Permissions policies section, select the policy that you created in Step 1 above, and click Next at the bottom of the page.
8. In the Name, review, and create page, specify the Role name and Description, and at the bottom of the page, click Create role.
9. In the Roles page, select the role that you created above.
10. In the Summary section of your role, copy the ARN. Use this ARN while configuring your Hevo Pipeline.
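The selections in the role-creation steps above translate into a trust policy on the role like the following sketch: Hevo's AWS account (393309748692, from the steps above) is allowed to assume the role only when it presents your external ID. The external ID value shown is the example from the steps; the exact JSON the console generates may differ slightly.

```python
import json

# Illustrative trust policy produced by the "Another AWS account" +
# "Require external ID" selections above. The external ID value is the
# example used in the steps; replace it with the one you chose.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::393309748692:root"},
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {"sts:ExternalId": "hevo-role-external-id"}
            },
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
```

The external ID condition is what prevents any other Hevo customer from pointing a Pipeline at your role's ARN, which is why it must be kept secret like a password.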
Configuring Amazon S3 as a Source
Perform the following steps to configure S3 as the Source in your Pipeline:
1. Click PIPELINES in the Navigation Bar.
2. Click + CREATE in the Pipeline List View.
3. In the Select Source Type page, select S3.
4. In the Configure your S3 Source page, specify the following:
    - Pipeline Name: A unique name for the Pipeline.
    - Source Setup: The credentials needed to allow Hevo to access your data. Perform the following steps to connect to your Amazon S3 account:
        - Do one of the following:
            - Connect using Access Credentials:
                - Access Key ID: The AWS access key ID that you retrieved in the Obtain the access credentials section above.
                - Secret Access Key: The AWS secret access key associated with the access key ID specified above.
                - Bucket Name: The name of the bucket from which you want to ingest data.
                - Bucket Region: The AWS region where the bucket is located.
            - Connect using IAM Role: Specify the IAM role ARN and the external ID that you obtained in the Generate the IAM role-based credentials section above.
        - Click TEST & CONTINUE.
    - Data Root: The path of the directory which contains your data. By default, the files are listed from the root of the directory. Perform the following steps to select the folder(s) and the data format which you want to ingest using Hevo:
        - Select Folders: The folders which contain the data to be ingested.
        - Select Type of File: The format of the data files in the Source. Hevo currently supports the AVRO, CSV, JSON, TSV, and XML formats.
          Note: You can select only one file format at a time. If your Source data is in a different format, you can export the data to one of the supported formats, and then ingest the files.
          Based on the format you select, you must specify some additional settings:
            - AVRO, JSON, TSV:
                - Enable the Include compressed files option if you want to ingest the compressed files of the selected file format from the folders. Hevo currently supports the tar.gz and zip compression types only.
                - Enable the Create Event Types from folders option if the selected folder(s) have subfolders containing files in different formats. Hevo reads each subfolder as a separate Event Type and creates a separate table in the Destination for each of your selected folders.
                - Enable the Convert date/time format fields to timestamp option if you want to convert the date/time values within the files of the selected folders to timestamps. For example, the date/time value 07/11/2022, 12:39:23 converts to the timestamp 1667804963.
            - CSV:
                - Specify the Field Delimiter. This is the character on which the fields in each line are separated. For example, \t or ,.
                - Enable the Include compressed files option if you want to ingest the compressed files of the selected file format from the folders. Hevo currently supports the tar.gz and zip compression types only.
                - Disable the Treat First Row As Column Headers option if the Source data file does not contain column headers. Hevo then automatically creates the headers during ingestion. Default setting: Enabled. Refer to the Example section below.
                - Enable the Create Event Types from folders option if the selected folder(s) have subfolders containing files in different formats. Hevo reads each subfolder as a separate Event Type and creates a separate table in the Destination for each of your selected folders.
                - Enable the Convert date/time format fields to timestamp option if you want to convert the date/time values within the files of the selected folders to timestamps. For example, the date/time value 07/11/2022, 12:39:23 converts to the timestamp 1667804963.
            - XML:
                - Enable the Include compressed files option if you want to ingest the compressed files of the selected file format from the folders. Hevo currently supports the tar.gz and zip compression types only.
                - Enable the Create Events from child nodes option to load each node under the root node in the XML file as a separate Event.
                - Enable the Create Event Types from folders option if the selected folder(s) have subfolders containing files in different formats. Hevo reads each subfolder as a separate Event Type and creates a separate table in the Destination for each of your selected folders.
        - Click CONFIGURE SOURCE.
5. Proceed to configuring the data ingestion and setting up the Destination.
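The Convert date/time format fields to timestamp option described above can be reproduced locally to check the example value. The sketch below assumes a DD/MM/YYYY date format and a UTC+05:30 offset, since those assumptions make the page's example pair match; the actual format and timezone depend on your Source data.

```python
from datetime import datetime, timedelta, timezone

def to_epoch(value, tz=timezone(timedelta(hours=5, minutes=30))):
    """Convert a 'DD/MM/YYYY, HH:MM:SS' string to a Unix timestamp.

    The UTC+05:30 default offset is an assumption chosen so that the
    example from this page round-trips; adjust both the format string
    and the offset for your own data.
    """
    dt = datetime.strptime(value, "%d/%m/%Y, %H:%M:%S").replace(tzinfo=tz)
    return int(dt.timestamp())

# The example pair from the settings above:
print(to_epoch("07/11/2022, 12:39:23"))  # 1667804963
```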
Example: Automatic Column Header Creation for CSV Tables
Consider the following data in CSV format, which has no column headers.
CLAY COUNTY,32003,11973623
CLAY COUNTY,32003,46448094
CLAY COUNTY,32003,55206893
CLAY COUNTY,32003,15333743
SUWANNEE COUNTY,32060,85751490
SUWANNEE COUNTY,32062,50972562
ST JOHNS COUNTY,846636,32033,
NASSAU COUNTY,32025,88310177
NASSAU COUNTY,32041,34865452
If you disable the Treat First Row As Column Headers option, Hevo auto-generates the column headers during ingestion. You can view the generated headers in the Schema Mapper, and the records are loaded to the Destination under these headers.
Revision History
Refer to the following table for the list of key updates made to this page:
| Date | Release | Description of Change |
| --- | --- | --- |
| Apr-14-2023 | NA | Updated the overview section to add information about the file formats supported by Hevo. |
| Mar-09-2023 | NA | Updated section, Generate the IAM role-based credentials for consistent information structure. |
| Nov-08-2022 | NA | Updated section, Configuring Amazon S3 as a Source to add information about the Convert date/time format fields to timestamp option. |
| Oct-17-2022 | 1.99 | Updated section, Configuring Amazon S3 as a Source to add information about ingesting compressed files from selected folders. |
| Sep-21-2022 | 1.98 | Added sections, Obtaining Amazon S3 Credentials and Generate the IAM role-based credentials. Renamed section, (Optional) Obtain your Access Key ID and Secret Access Key to Obtain the access credentials. Updated section, Configuring Amazon S3 as a Source to add information about connecting to Amazon S3 using IAM role. |
| Sep-07-2022 | 1.97 | Updated section, Configuring Amazon S3 as a Source to reflect the latest UI. |
| Apr-18-2022 | NA | Added section, (Optional) Obtain your Access Key ID and Secret Access Key. |
| Apr-11-2022 | 1.86 | Updated section, Configuring Amazon S3 as a Source to reflect support for TSV file format. |
| Mar-21-2022 | 1.85 | Removed section, Limitations as Hevo now supports UTF-16 encoding format for CSV files. |
| Jun-28-2021 | 1.66 | Updated the page overview with information about __hevo_source_modified_at being uploaded as a metadata field from Release 1.66 onwards. |
| Feb-22-2021 | NA | Added the limitation about Hevo not supporting UTF-16 encoding format for CSV data. |