Hevo Data Lake

Note: The Hevo Data Lake Destination is not generally available. Please reach out to Hevo Support to enable it for your team.

At its simplest, a data lake is a repository that stores massive amounts of data, both structured and unstructured, in its native format, and addresses the three Vs of big data (Volume, Velocity, and Variety). A data lake removes the restrictions of a typical data warehouse by offering virtually unlimited storage, no limits on file size, schema-on-read, and multiple ways to access data, including SQL-like queries and ad hoc queries using engines such as Presto and Apache Impala.

This article describes how to connect to a Hevo Data Lake as a Destination.

Prerequisites

Hevo Data Lake needs access to your S3 bucket. Attach the following bucket policy to the bucket, replacing the placeholders with your own values:

{
    "Version": "2012-10-17",
    "Id": "access-to-hevo-data-lake",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<AWS Account ID>:role/<EMR Role for EC2>"
            },
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::<S3 Bucket Name>/*",
                "arn:aws:s3:::<S3 Bucket Name>"
            ]
        }
    ]
}
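
If you prefer to attach the policy programmatically, the following is a minimal sketch using Python and boto3 (not part of the Hevo product). It assumes your AWS credentials are already configured; the account ID, role name, and bucket name are placeholders you must substitute.

import json
import boto3

# Placeholders: substitute your AWS account ID, EMR role for EC2, and bucket name.
policy = {
    "Version": "2012-10-17",
    "Id": "access-to-hevo-data-lake",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<AWS Account ID>:role/<EMR Role for EC2>"
            },
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::<S3 Bucket Name>/*",
                "arn:aws:s3:::<S3 Bucket Name>",
            ],
        }
    ],
}

# Attach the policy to the bucket.
s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket="<S3 Bucket Name>", Policy=json.dumps(policy))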

AWS Account ID: To find your account ID, open the AWS Management Console, choose Support in the navigation bar at the upper-right, and then choose Support Center. Your currently signed-in account number (ID) appears in the Support Center title bar.
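
You can also look up the account ID programmatically. A minimal sketch, assuming boto3 and configured AWS credentials:

import boto3

# STS returns the 12-digit account ID of the currently authenticated identity.
print(boto3.client("sts").get_caller_identity()["Account"])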

EMR Role for EC2: An IAM role is an IAM identity that you can create in your account with specific permissions. Here, use the EC2 instance profile role of your EMR cluster (typically EMR_EC2_DefaultRole). You can read more about EMR roles in the AWS EMR documentation.

S3 Bucket Name: The name of the S3 bucket to which the data is written.

Setup Guide

  1. A Destination can be added either while creating a Pipeline, or directly: go to the Destinations option under the Admin tab in the left app bar and click the Add Destination button.
  2. Select the Destination type as Data Lake from the Select Destination Type drop-down.
  3. Configure the Tenant Settings for the execution layer of the Data Lake.
    1. Create a new tenant by clicking Add New Tenant, or select an existing one using the radio buttons. It is highly recommended not to create a new tenant that points to an existing tenant’s cluster.
    2. Tenant Name: A unique name for the tenant.
    3. Executor Host: Host IP of the master node of your EMR cluster.
    4. Executor Port: Port of the Livy server running on your EMR cluster; it is 8998 by default. (A connectivity check is sketched after this list.)
    5. Metastore Host: Host IP of the Hive Metastore.
    6. Metastore Port: Port of the Hive Metastore (the Thrift port, typically 9083).
    7. JDBC Host: Host IP of the JDBC server.
    8. JDBC Port: Port of the JDBC server.
    9. Click Save Tenant to continue with setting up the storage layer.
  4. Configure the storage layer of the Data Lake.

    1. Destination Name: A unique name for the destination.

    2. Database Name: The database in which all the tables will be created. If it doesn’t exist, it is created for you.

    3. Bucket Name: Since S3 is used as the data store, this is the name of the S3 bucket to which your data is written.

    4. Prefix: The location prefix under which your data is stored in the bucket.

    5. File Format: Select one of the file formats appropriate to your use case.

  5. Click Continue to save the Destination. You can also test the connection: Hevo tries to create and then delete a dummy table named ‘dummy_table’, and notifies you if it encounters a failure.
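
Before saving a tenant in step 3, you may want to verify that the endpoints you entered are reachable. The following is a minimal sketch, not part of Hevo itself; the hosts are placeholders, and the ports assume the defaults mentioned above (Livy 8998, Hive Metastore 9083) plus an illustrative JDBC port of 10000:

import socket
import requests

EXECUTOR_HOST = "<Executor Host>"  # master node of your EMR cluster

# Livy exposes a REST API; GET /sessions should return HTTP 200 if it is up.
resp = requests.get(f"http://{EXECUTOR_HOST}:8998/sessions", timeout=10)
print("Livy reachable:", resp.status_code == 200)

# Plain TCP reachability checks for the metastore and JDBC endpoints.
for name, host, port in [
    ("Hive Metastore", "<Metastore Host>", 9083),  # usual Thrift port
    ("JDBC Server", "<JDBC Host>", 10000),         # substitute your JDBC port
]:
    with socket.socket() as s:
        s.settimeout(5)
        reachable = s.connect_ex((host, port)) == 0
    print(f"{name} reachable:", reachable)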

Note: You will get a Thrift URI that you can use to plug in any external query engine, such as Presto or Apache Impala.
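
For example, to query the Data Lake from Presto, you could point a Hive catalog at that Thrift URI. A sketch of a catalog properties file (the file name etc/catalog/hevodatalake.properties and the values below are illustrative; substitute the Thrift URI Hevo gives you):

connector.name=hive-hadoop2
hive.metastore.uri=thrift://<Metastore Host>:<Metastore Port>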

Last updated on 16 Oct 2020