Elasticsearch

Elasticsearch is a distributed, RESTful search and analytics engine that centrally stores your data so you can search, index, and analyze data of all shapes and sizes.

Hevo connects to your Elasticsearch cluster using the Elasticsearch Transport Client and synchronizes the data available in the cluster to your preferred data warehouse using indices. Currently, Hevo supports the following variants:

  • Generic Elasticsearch
  • AWS Elasticsearch

Prerequisites

  • Elasticsearch version greater than 7.0. View versions.

  • There is at least one sortable field in each document. To be sortable, a field must be of one of these types: unsigned_long, long, integer, short, byte, float, double, half_float, scaled_float, date, or date_nanos.

  • The database username and password are available if your Elasticsearch host uses Native Realm authentication.
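The sortable-field prerequisite can be checked against an index mapping before you create the Pipeline. The following is a minimal sketch, assuming you have already retrieved the "properties" object of the mapping as a Python dict. The function name and the example mapping are hypothetical and are not part of Hevo or Elasticsearch.

```python
# Field types that the prerequisites list as sortable.
SORTABLE_TYPES = {
    "unsigned_long", "long", "integer", "short", "byte",
    "float", "double", "half_float", "scaled_float",
    "date", "date_nanos",
}


def sortable_fields(properties, prefix=""):
    """Return the dotted names of fields whose mapping type is sortable.

    `properties` is the "properties" object of an index mapping, that is,
    a dict of field name -> field definition.
    """
    found = []
    for name, definition in properties.items():
        path = prefix + name
        if definition.get("type") in SORTABLE_TYPES:
            found.append(path)
        # Object fields nest their children under their own "properties".
        nested = definition.get("properties")
        if nested:
            found.extend(sortable_fields(nested, prefix=path + "."))
    return sorted(found)


# Hypothetical mapping, for illustration only.
example_properties = {
    "title": {"type": "text"},
    "updated_at": {"type": "date"},
    "stats": {"properties": {"views": {"type": "long"}}},
}
print(sortable_fields(example_properties))  # ['stats.views', 'updated_at']
```

An empty result suggests that the index does not meet the prerequisite.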


Perform the following steps to configure your Elasticsearch Source:

Retrieve the Hostname

  • For self-hosted or cloud-based Elasticsearch databases, contact your system admin to know the database hostname and port.

  • For AWS Elasticsearch services, contact your service provider.


(Optional) Obtain Username and Password

The Elastic Stack security features authenticate users by using realms and one or more token-based authentication services. Currently, Hevo’s Elasticsearch integration supports only Native Realm authentication.

Contact your system administrator to obtain the username and password if you do not have these details.
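Native Realm users are identified by a username and password, which REST clients typically send as an HTTP Basic Authorization header. The sketch below only shows how such a header is formed. The function name and the credentials are placeholders, and this is not a description of how Hevo sends credentials internally.

```python
import base64


def basic_auth_header(username, password):
    """Build the HTTP Basic Authorization header for a username/password pair."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}


# Placeholder credentials; substitute the ones from your administrator.
headers = basic_auth_header("hevo_reader", "example-password")
```

Basic authentication encodes the credentials but does not encrypt them, so it is best used over HTTPS.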


(Optional) Connect to Elasticsearch hosted inside a Virtual Private Cloud (VPC)

Hevo connects to your Elasticsearch instance hosted inside a VPC using a reverse proxy server set up on Amazon EC2. The server routes all requests that Hevo makes to ingest data to your Elasticsearch instance inside the VPC.

To enable Hevo to connect to your Elasticsearch instance configured inside a VPC, you need to:

  1. Set up an EC2 instance.

  2. Whitelist Hevo’s IP addresses, and launch the EC2 instance.

  3. Retrieve the public Endpoint and connect to the EC2 instance.

  4. Configure a reverse proxy server in the EC2 instance.

The following steps use NGINX Open Source as the reverse proxy server. You can also use another web server, such as Apache or Caddy.

1. Set up the EC2 instance

  1. Open the EC2 Management Console in your AWS account, and launch an EC2 instance with a name of your choice. For example, NGINX_Elasticsearch.

  2. Configure the network settings for the instance, such that it is in the same VPC as your Elasticsearch database. Also, retain the default setting for Auto-assign Public IP, to assign a public IP and DNS to the instance. Read Configure Instance Details.

2. Whitelist Hevo’s IP addresses and launch the instance

  1. Configure the security group settings to whitelist Hevo’s IP addresses for your region for the HTTP and HTTPS protocol types. Read Configure Security Group.

  2. Review the instance settings, and in the pop-up dialog box that is displayed, create a key pair or use an existing one. A key pair, which consists of a public key and a private key, allows you to connect to your instance securely. Read Review Instance and Launch.

  3. Click Download Key Pair to download the created key pair and save it in a secure location.

  4. Click Launch Instances.

3. Retrieve the public Endpoint and connect to the EC2 instance

  1. In the Launch Status page, click View Instances to retrieve the public endpoint of the instance you created. This could be the public IPv4 address or DNS.

  2. Connect to the EC2 instance using one of the available methods, such as SSH or EC2 Instance Connect. Read Connect to your Linux instance.

4. Configure your reverse proxy server

  1. Install NGINX Open Source in the EC2 instance. Read Installing NGINX.

  2. Perform the following steps to edit the NGINX configuration, and add your Elasticsearch instance public endpoint and port number:

    1. Navigate to the configuration file directory. For example, /etc/nginx.

    2. Edit the configuration file in /etc/nginx/conf.d, and add the following information:

      server {
          listen 443;

          location / {
              proxy_pass http://<elasticsearch-services-endpoint>:443;
          }
      }
      
    3. Save the file and restart the NGINX service. For example,

      $ sudo service nginx restart
      

Configure Elasticsearch Connection Settings

Perform the following steps to configure Elasticsearch as the Source in Hevo:

  1. Click PIPELINES in the Asset Palette.

  2. Click + CREATE in the Pipelines List View.

  3. In the Select Source Type page, select Elasticsearch.

  4. In the Configure your Elasticsearch Source page, specify the following:

    • Pipeline Name: A unique name for your Pipeline, not exceeding 255 characters.

    • Database Host: The Elasticsearch database host’s IP address or DNS. Provide the public IP address or DNS of the EC2 instance as retrieved in Step 3 if your Elasticsearch database is hosted inside a VPC.

      Note: For URL-based hostnames, exclude the protocol part (http:// or https://).

    • Database Port: The port on which your Elasticsearch server listens for connections. Default value: 9200.

      Note: For an Elasticsearch database hosted inside a VPC, this port number is 443.

    • Database User: The authenticated user that can read the tables in your database.

    • Database Password: The password for the database user.

    • Connection Options: Select one of the following options to specify how Hevo must access your database instance:

      • Connect through SSH: Enable this option to connect to Hevo using an SSH tunnel, instead of directly connecting your Elasticsearch database host to Hevo. This provides an additional level of security to your database by not exposing your Elasticsearch setup to the public. Read Connecting Through SSH.

        If this option is disabled, you must whitelist Hevo’s IP addresses to allow Hevo to connect to your Elasticsearch host.

        Note: This option does not apply to an AWS Elasticsearch Source. To connect to that Source, you must set up a reverse proxy server.

      • Connect through HTTPS: Enable this option if your cluster is configured to use HTTPS. Contact your administrator if you do not have this information. Keep this option disabled to connect using HTTP.

    • Advanced Settings:

      • Load Historical Data: If this option is enabled, the entire table data is fetched during the first run of the Pipeline. If disabled, Hevo loads only the data that was written in your database after the time of creation of the Pipeline.

      • Include New Tables in the Pipeline: Applicable for all Pipeline modes except Custom SQL.

        If enabled, Hevo automatically ingests data from tables created in the Source after the Pipeline has been built. These may include completely new tables or previously deleted tables that have been re-created in the Source.

        If disabled, new and re-created tables are not ingested automatically. They are added in SKIPPED state in the objects list, in the Pipeline Overview page. You can update their status to INCLUDED to ingest data.

        You can change this setting later.

  5. Click TEST & CONTINUE.

  6. Proceed to configuring the data ingestion and setting up the Destination.
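The Database Host note above asks you to exclude the protocol part from URL-based hostnames. The following sketch shows one way to normalize a copied URL before pasting it into the field, and how the host, port, and HTTPS settings combine into an address. Both helper names and the example hostnames are hypothetical and are not part of Hevo.

```python
def normalize_host(value):
    """Strip a leading http:// or https:// and any trailing slash."""
    host = value.strip()
    for scheme in ("http://", "https://"):
        if host.lower().startswith(scheme):
            host = host[len(scheme):]
            break
    return host.rstrip("/")


def base_url(host, port=9200, use_https=False):
    """Assemble the URL a client would use for the given settings."""
    scheme = "https" if use_https else "http"
    return f"{scheme}://{normalize_host(host)}:{port}"


# Hypothetical hostname, for illustration only.
print(normalize_host("https://es.example.com/"))  # es.example.com
print(base_url("es.example.com"))                 # http://es.example.com:9200
```

The default port of 9200 mirrors the Database Port default described above. For an instance reached through the reverse proxy, pass 443 instead.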


Data Replication

  • Historical Load: When you create the Pipeline, Hevo fetches all the data available in the Source database. However, you can limit the number of Events ingested in each run of the Pipeline to manage the processing load on your cluster.

  • Incremental Data: New and changed data is fetched every 15 minutes, by default. You can configure this frequency using the Change Schedule option in the Pipeline Summary Bar.


Source Considerations

  • Elasticsearch does not expose individual document modifications. Therefore, each document needs at least one incrementing field of a sortable type, and the identity column is used as the tiebreaker if more than one document has the same value in the sortable field.

    The _id field created by default is used if none is specified.
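The role of the tiebreaker can be illustrated with a small in-memory simulation. This is a sketch of the general technique of paging through records ordered by a sortable field plus an identifier, not Hevo's actual implementation. The function name, the updated_at field, and the sample documents are assumptions made for the example.

```python
def next_batch(documents, sort_field, cursor=None, size=2):
    """Return (batch, new_cursor), with documents ordered by (sort_field, _id).

    `cursor` is the (sort value, _id) pair of the last document already
    read, or None for the first batch.
    """
    ordered = sorted(documents, key=lambda d: (d[sort_field], d["_id"]))
    if cursor is not None:
        # Keep only documents strictly after the last one already read.
        ordered = [d for d in ordered if (d[sort_field], d["_id"]) > cursor]
    batch = ordered[:size]
    if batch:
        cursor = (batch[-1][sort_field], batch[-1]["_id"])
    return batch, cursor


# Three documents share the same value in the sortable field.
docs = [
    {"_id": "a", "updated_at": 100},
    {"_id": "b", "updated_at": 100},
    {"_id": "c", "updated_at": 100},
    {"_id": "d", "updated_at": 200},
]

first, cursor = next_batch(docs, "updated_at")
second, cursor = next_batch(docs, "updated_at", cursor)
print([d["_id"] for d in first], [d["_id"] for d in second])
# ['a', 'b'] ['c', 'd']
```

If the cursor held only the updated_at value, resuming after the first batch with a strict comparison would skip document c, and a non-strict comparison would read a and b again. Including the identifier in the cursor avoids both problems.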


Limitations

  • Only Native Realm authentication is supported.



Revision History

Refer to the following table for the list of key updates made to this page:

Date | Release | Description of Change
Apr-11-2022 | 1.86 | Added a note in the Connection Settings about setting up a reverse proxy server for connecting to an AWS Elasticsearch Source.
Feb-21-2022 | 1.82 | Added the section, (Optional) Connect to Elasticsearch hosted inside a Virtual Private Cloud (VPC).
Jan-03-2022 | 1.79 | Updated the description of the Include New Tables in the Pipeline advanced setting in the Configure Elasticsearch Connection Settings section.
Jul-26-2021 | 1.68 | Added a note for the Database Host field.
Jul-12-2021 | 1.67 | Added the field Include New Tables in the Pipeline under Source configuration settings.
Jun-01-2021 | 1.64 | Updated the Configure Elasticsearch Connection Settings section to include the Connect through HTTPS setting.
Last updated on 28 Apr 2022