Elasticsearch

Elasticsearch is a distributed, RESTful search and analytics engine that centrally stores your data so you can search, index, and analyze data of all shapes and sizes.

Hevo connects to your Elasticsearch cluster using the Elasticsearch Transport Client and synchronizes the data available in the cluster to your preferred data warehouse using indices. Currently, Hevo supports the following variants:

  • Generic Elasticsearch
  • AWS Elasticsearch

Prerequisites

  • Elasticsearch version greater than 7.0. View versions.

  • There is at least one sortable field in each document. To be sortable, the field must be of one of these types: unsigned_long, long, integer, short, byte, float, double, half_float, scaled_float, date, or date_nanos. Refer to the mapping sketch after this list for an example.

  • The database username and password are available, if your Elasticsearch host uses Native Realm authentication.
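
For reference, the following is a minimal sketch (in Python, using the requests library) of creating an index whose documents carry a sortable date field. The host, credentials, index name, and field names are placeholders for illustration only.

```python
import requests

# Placeholder connection details -- replace with your own host, port, and credentials.
ES_URL = "http://elasticsearch.example.com:9200"
AUTH = ("hevo_reader", "your-password")

# Create an index whose mapping includes a sortable "last_modified" date field.
# Any of the sortable types listed above (long, integer, date, and so on) works.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "last_modified": {"type": "date"},
        }
    }
}

resp = requests.put(f"{ES_URL}/orders", json=mapping, auth=AUTH)
resp.raise_for_status()
print(resp.json())  # {"acknowledged": true, "shards_acknowledged": true, "index": "orders"}
```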


Perform the following steps to configure your Elasticsearch Source:

Retrieve the Hostname

  • For self-hosted or cloud-based Elasticsearch databases, contact your system admin to know the database hostname and port.

  • For the AWS Elasticsearch service, contact your service provider.
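
If you want to confirm that the hostname and port you receive are reachable from your network, a quick check such as the one below can help. The host and port shown are placeholders.

```python
import socket

# Placeholder values -- use the hostname and port provided by your admin or service provider.
host = "elasticsearch.example.com"
port = 9200

# A plain TCP connection only verifies network reachability,
# not authentication or cluster health.
try:
    with socket.create_connection((host, port), timeout=5):
        print(f"{host}:{port} is reachable")
except OSError as err:
    print(f"Could not reach {host}:{port}: {err}")
```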


Obtain Username and Password (optional)

The Elastic Stack security features authenticate users by using realms and one or more token-based authentication services. Currently, Hevo’s Elasticsearch integration supports only Native Realm authentication.

Contact your system administrator to obtain the username and password if you do not have these details.
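
If you administer the cluster yourself and want to create a dedicated native realm user for Hevo, a call to the Elasticsearch security API similar to the following can be used. This is a sketch only: the admin credentials, username, password, and role name are placeholders, and your cluster must have the Elastic Stack security features enabled.

```python
import requests

# Placeholder admin credentials and endpoint -- adjust for your cluster.
ES_URL = "https://elasticsearch.example.com:9200"
ADMIN_AUTH = ("elastic", "admin-password")

# Create a native realm user for Hevo. The role name is a placeholder;
# grant it read access to the indices you plan to replicate.
user = {
    "password": "a-strong-password",
    "roles": ["hevo_read_role"],
    "full_name": "Hevo reader",
}

resp = requests.post(f"{ES_URL}/_security/user/hevo_reader", json=user, auth=ADMIN_AUTH)
resp.raise_for_status()
print(resp.json())  # {"created": true}
```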


Configure Elasticsearch Connection Settings

Perform the following steps to configure Elasticsearch as the Source in Hevo:

  1. Click PIPELINES in the Asset Palette.

  2. Click + CREATE in the Pipelines List View.

  3. In the Select Source Type page, select Elasticsearch.

  4. In the Configure your Elasticsearch Source page, specify the following:

    Elasticsearch settings

    • Pipeline Name: A unique name for your Pipeline, not exceeding 255 characters.

    • Database Host: The Elasticsearch database host’s IP address or DNS.

      Note: For URL-based hostnames, exclude the protocol part (http:// or https://).

    • Database Port: The port on which your Elasticsearch server is listening for connections. Default value: 9200.

    • Database User: The authenticated user that can read the tables in your database.

    • Database Password: The password for the database user.

    • Connection Options: Select one of the following options to specify how Hevo must access your database instance:

      • Connect through SSH: Enable this option to connect to Hevo using an SSH tunnel, instead of directly connecting your Elasticsearch database host to Hevo. This provides an additional level of security to your database by not exposing your Elasticsearch setup to the public. Read Connecting Through SSH.

      If this option is disabled, you must whitelist [Hevo’s IP addresses](/about/regions/) to allow Hevo to connect to your Elasticsearch host.

      • Connect through HTTPS: Enable this option if your cluster is configured to use HTTPS. Contact your administrator if you do not have this information. Keep this option disabled to connect using HTTP. You can verify the connection details yourself using the connectivity sketch after these steps.
    • Advanced Settings:

      • Load Historical Data: If this option is enabled, the entire table data is fetched during the first run of the Pipeline. If disabled, Hevo loads only the data that was written in your database after the time of creation of the Pipeline.

      • Include New Tables in the Pipeline: Applicable for all Pipeline modes except Custom SQL. If this option is enabled, Hevo automatically ingests data from tables created after the Pipeline has been built. If disabled, the new tables are listed in the Pipeline Detailed View in Skipped state, and you can manually include the ones you want and load their historical data.

        You can change this setting later.

  5. Click TEST & CONTINUE.

  6. Proceed to configuring the data ingestion and setting up the Destination.
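
Optionally, before clicking TEST & CONTINUE, you can verify the same connection details from your own machine. The sketch below uses placeholder host, port, and credentials; switch the scheme between http and https to match your Connect through HTTPS setting.

```python
import requests

# Placeholder values matching the fields in the Hevo form.
scheme = "https"   # use "http" if Connect through HTTPS is disabled
host = "elasticsearch.example.com"
port = 9200
auth = ("hevo_reader", "your-password")

# The root endpoint returns the cluster name and version when the
# scheme, host, port, and credentials are all correct.
resp = requests.get(f"{scheme}://{host}:{port}/", auth=auth, timeout=10)
resp.raise_for_status()
info = resp.json()
print(info["cluster_name"], info["version"]["number"])
```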


Data Replication

  • Historical Load: When you create the Pipeline, Hevo fetches all the data available in the Source database. However, you can limit the number of Events ingested in each run of the Pipeline to control the processing load on your cluster.

  • Incremental Data: New and changed data is fetched every 15 minutes by default. You can configure this frequency using the Change Schedule option in the Pipeline Summary Bar.


Source Considerations

  • Elasticsearch does not have the capability to expose each document modification. Therefore, at least one incrementing column of a sortable type is mandatorily required. If the sortable field holds the same value for more than one document, the identity column is used as the tiebreaker. Refer to the sketch below for an illustration.

    The _id field created by default is used if none is specified.
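
To illustrate the idea (this is a conceptual sketch, not Hevo’s actual implementation), the snippet below pages through documents in sorted order with the _search API and search_after, using the _id field as the tiebreaker. The host, credentials, index, field name, and checkpoint value are placeholders.

```python
import requests

ES_URL = "http://elasticsearch.example.com:9200"  # placeholder host
AUTH = ("hevo_reader", "your-password")           # placeholder credentials
INDEX = "orders"                                  # placeholder index

# Sort on the incrementing field, with _id as a tiebreaker so that documents
# sharing the same last_modified value still come back in a stable order.
# (Elasticsearch recommends a dedicated tiebreaker field over _id in production.)
query = {
    "size": 1000,
    "sort": [{"last_modified": "asc"}, {"_id": "asc"}],
    "query": {"range": {"last_modified": {"gt": "2021-10-01T00:00:00Z"}}},  # last checkpoint (placeholder)
}

search_after = None
while True:
    if search_after:
        query["search_after"] = search_after
    resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query, auth=AUTH)
    resp.raise_for_status()
    hits = resp.json()["hits"]["hits"]
    if not hits:
        break
    for hit in hits:
        print(hit["_id"], hit["_source"])  # stand-in for loading the document downstream
    search_after = hits[-1]["sort"]        # resume from the last document's sort values
```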


Limitations

  • Only Native Realm authentication is supported.



See Also


Revision History

Refer to the following table for the list of key updates made to this page:

| Date | Release | Description of Change |
| --- | --- | --- |
| Jul-26-2021 | 1.68 | Added a note for the Database Host field. |
| Jul-12-2021 | 1.67 | Added the field Include New Tables in the Pipeline under Source configuration settings. |
| Jun-01-2021 | 1.64 | Updated the Configure Elasticsearch Connection Settings section to include the Connect Through HTTPS setting. |