Amazon S3
Hevo lets you load data from files in an S3 bucket into your data warehouse.
Prerequisites
- The user has the following permissions on the S3 account:
  - `ListObjects`
  - `GetObject`
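A minimal IAM policy granting these permissions might look like the following (the bucket name `my-bucket` is a placeholder; note that listing is granted via the `s3:ListBucket` action on the bucket ARN, while `s3:GetObject` applies to the objects within it):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-bucket"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::my-bucket/*"
    }
  ]
}
```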
Configuring Amazon S3 as a Source
To configure Amazon S3 as a Source in Hevo:
1. Click PIPELINES in the Asset Palette.
2. Click + CREATE in the Pipeline List View.
3. In the Select Source Type page, select S3.
4. In the Configure your S3 Source page, specify the following:
- Pipeline Name: A unique name for the Pipeline.
- Access Key ID: The AWS access key ID that has permissions to read from the specified bucket.
- Secret Access Key: The AWS secret access key for the above access key ID.
- Bucket: The name of the bucket from which you want to ingest data.
- Bucket Region: Choose the AWS region where the bucket is located.
- Path Prefix: Path Prefix for the data directory. By default, the files are listed from the root of the directory.
- File Format: The format of the data file in the Source. Hevo currently supports AVRO, CSV, JSON, and XML formats. Contact Hevo Support if your Source data is in another format.
Based on the format you select, you must specify some additional settings:
- CSV:
  - Specify the Field Delimiter. This is the character by which fields in each line are separated. For example, `\t` or `,`.
  - Disable the Treat First Row As Column Headers option if the Source data file does not contain column headers. Hevo then automatically creates these during ingestion. Default setting: Enabled. See Example below.
  - Enable the Create Event Types from folders option if the path prefix has subdirectories containing files in different formats. Hevo reads each subdirectory as a separate Event Type.
    Note: Files lying at the prefix path (and not in a subdirectory) are ignored.
- JSON:
  - Enable the Create Event Types from folders option if the path prefix has subdirectories containing files in different formats. Hevo reads each subdirectory as a separate Event Type.
    Note: Files lying at the prefix path (and not in a subdirectory) are ignored.
- XML:
  - Enable the Create Events from child nodes option to load each node under the root node in the XML file as a separate Event.
- Advanced Settings:
  - Delay in minutes: The time (in minutes) that Hevo must wait after a file's last modified timestamp before considering it for ingestion.
    For the S3 Source, a file you upload may become available for ingestion with some delay. If Hevo ingested objects based purely on the last modified timestamp, objects that you uploaded but that are not yet present in the list (ls) operation could fail to be ingested. Since the timestamp moves ahead in the next ingestion cycle, these objects would not be ingested in a subsequent run either, and would eventually be missed. To circumvent this issue, Hevo only processes files where the last modified timestamp < (current timestamp - delay in minutes). As a result, your data ingestion always remains behind the current time by the value specified for this field. If you need to modify the Delay in Minutes value later, you must create a new Pipeline.
    Also read the recent update about Amazon S3 Strong Consistency.
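The effect of this setting can be sketched in a few lines of Python (a simplified illustration, not Hevo's actual implementation; the object keys and delay value are made up):

```python
from datetime import datetime, timedelta, timezone

def eligible_for_ingestion(objects, delay_minutes):
    """Return only the objects whose last-modified timestamp is older
    than (current time - delay), mirroring the Delay in minutes rule."""
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=delay_minutes)
    return [key for key, last_modified in objects if last_modified < cutoff]

# Hypothetical bucket listing: (object key, last-modified timestamp) pairs.
now = datetime.now(timezone.utc)
listing = [
    ("data/orders_1.csv", now - timedelta(minutes=45)),  # old enough
    ("data/orders_2.csv", now - timedelta(minutes=5)),   # too recent
]
print(eligible_for_ingestion(listing, delay_minutes=30))
# → ['data/orders_1.csv']
```

With a 30-minute delay, only files last modified more than 30 minutes ago are picked up; the recently modified file waits until a later ingestion cycle.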
Things to Note
- Gzipped files are automatically unzipped on ingestion by Hevo.
- Files are re-ingested on update.
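As an illustration of the first point, gzipped payloads can be recognized by their two-byte magic number and decompressed transparently (a sketch under that assumption; Hevo's internal handling may differ):

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def read_maybe_gzipped(raw: bytes) -> bytes:
    """Decompress the payload if it is gzipped, else return it unchanged."""
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw

payload = gzip.compress(b"county,zip,value\n")
print(read_maybe_gzipped(payload))  # → b'county,zip,value\n'
```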
Example: Automatic Column Header Creation for CSV Tables
Consider the following data in CSV format, which has no column headers.
CLAY COUNTY,32003,11973623
CLAY COUNTY,32003,46448094
CLAY COUNTY,32003,55206893
CLAY COUNTY,32003,15333743
SUWANNEE COUNTY,32060,85751490
SUWANNEE COUNTY,32062,50972562
ST JOHNS COUNTY,32033,846636
NASSAU COUNTY,32025,88310177
NASSAU COUNTY,32041,34865452
If you disable the Treat first row as column headers option, Hevo auto-generates the column headers, as seen in the schema map here:
The record in the Destination appears as follows:
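Header auto-generation of this kind can be sketched in Python (the `column_1`, `column_2`, … naming pattern is an assumption for illustration; Hevo's actual generated names may differ):

```python
import csv
import io

def with_generated_headers(csv_text: str):
    """Parse headerless CSV and assign positional column names."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    headers = [f"column_{i + 1}" for i in range(len(rows[0]))]
    return [dict(zip(headers, row)) for row in rows]

sample = "CLAY COUNTY,32003,11973623\nSUWANNEE COUNTY,32060,85751490\n"
print(with_generated_headers(sample)[0])
# → {'column_1': 'CLAY COUNTY', 'column_2': '32003', 'column_3': '11973623'}
```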