How we sync DynamoDB with OpenSearch

At Hivelight we use both DynamoDB and OpenSearch as data stores. They may hold the same data, but they serve different purposes.

We mostly use DynamoDB for data resiliency and Amazon OpenSearch Service (formerly known as Amazon Elasticsearch Service) for querying. Here’s how each component contributes to the overall benefits:

DynamoDB for Data Resiliency:

High Availability: DynamoDB offers built-in multi-AZ replication, which ensures that our data is replicated across multiple availability zones.

No ops: DynamoDB is a fully managed service: AWS takes care of the underlying infrastructure, including hardware provisioning, setup, and maintenance. This reduces operational overhead and minimizes the risk of infrastructure-related failures.

Backup and Restore: DynamoDB offers backup and restore features, allowing us to create full backups of our tables. These backups can be retained for extended periods, providing an additional layer of data protection against deletions or corruptions.

Amazon OpenSearch Service for Querying:

Rich Querying Capabilities: OpenSearch provides powerful full-text search and analytics capabilities. This enables sophisticated querying of large volumes of data with high performance.

Schema-less: OpenSearch is schema-less, meaning that it can index and search unstructured or semi-structured data without the need to define a rigid schema upfront. However, this does not always hold for properties that can hold values of more than one type, because a field's type is fixed once it has been mapped.

Data Analysis: OpenSearch supports real-time indexing and querying, allowing us to analyze streaming data as it arrives. This capability is useful for monitoring, logging, and real-time analytics applications where timely insights are critical.
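To illustrate the schema-less caveat above: with dynamic mapping, a field's type is derived from the first value indexed for it, so a property that is sometimes a number and sometimes free text or an object can be rejected later. Here is a minimal sketch of one way around that, by normalising the value to a string before indexing. The `customValue` property and the `normalizeCustomValue` helper are hypothetical names used for illustration only, not part of our actual code.

```javascript
// Hypothetical helper: make sure `customValue` is always indexed as a string,
// whatever type it holds in DynamoDB (number, boolean, string, object...).
function normalizeCustomValue(doc) {
  const value = doc.customValue;

  // Nothing to normalise: return the document untouched
  if (value === undefined || value === null) {
    return doc;
  }

  return {
    ...doc,
    // Objects and arrays are serialised, scalars are simply stringified
    customValue:
      typeof value === "object" ? JSON.stringify(value) : String(value)
  };
}
```

Whether a string is the right target type depends on how the field is queried; an explicit index mapping is another option.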

We get the best of both worlds

Decoupled Architecture: Using DynamoDB for data storage and OpenSearch for querying allows us to decouple our data storage layer from the query layer. This separation of concerns improves scalability, fault isolation, and flexibility in choosing the right tools for each task.

Performance Optimization: By offloading complex querying tasks to OpenSearch, we can optimize the performance of our primary data store (DynamoDB). This separation of concerns ensures that DynamoDB can focus on serving high-throughput, low-latency read and write operations, while OpenSearch handles complex search queries.

Resilience to Querying Load: Since OpenSearch is optimized for querying and analytics, it can efficiently handle heavy querying workloads without impacting the performance of DynamoDB. This resilience to querying load ensures that our applications remain responsive even during peak query times.

Enhanced Analytics: By leveraging OpenSearch’s querying capabilities, we can gain deeper insights into our data, uncover patterns, trends, and anomalies, and make informed business decisions. This analytical power complements the operational benefits provided by DynamoDB, creating a comprehensive data management solution.

How do we keep DynamoDB and OpenSearch in sync?

Long story short, we use DynamoDB Streams, which are ingested by a Lambda function that massages the data before indexing it into OpenSearch.

The DynamoDB Stream, as mentioned in one of my previous posts, lets the Lambda function know of all changes by providing the old record and the new record (provided the stream's view type is NEW_AND_OLD_IMAGES).

We can use that to decide how to let OpenSearch know about each change.

Here is a basic Lambda function triggered by the Stream (code simplified for the sake of the example):

import { unmarshall } from "@aws-sdk/util-dynamodb";
import sync from "../lib/sync.js";

export const handler = async (event) => {
  const toCreate = [];
  const toUpdate = [];
  const toDelete = [];

  event.Records.forEach((record) => {
    // Getting the stream data as a plain JSON object
    const newImage =
      record.dynamodb.NewImage && unmarshall(record.dynamodb.NewImage);
    const oldImage =
      record.dynamodb.OldImage && unmarshall(record.dynamodb.OldImage);

    if (record.eventName === "INSERT") {
      toCreate.push(newImage);
    } else if (record.eventName === "REMOVE") {
      // we need the old image as there is no "new image" when deleting
      toDelete.push(oldImage);
    } else if (record.eventName === "MODIFY") {
      toUpdate.push(newImage);
    }
  });

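  // Sync that to OpenSearch. We await the call so the Lambda does not
  // return (and get frozen) before the bulk request has completed.
  await sync(toCreate, toDelete, toUpdate);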
};

This is pretty simple, right? The function takes the records in and pushes each change into a create, delete, or update array.
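One thing the simplified handler glosses over is ordering. Grouping changes by type loses the relative order of events for the same item, so a batch that contains a REMOVE followed by an INSERT for the same id could end with the document deleted. Here is a minimal sketch of one way to guard against that, by keeping only the last change per item before building the arrays. The `lastChangePerItem` helper and the `{ eventName, newImage, oldImage }` shape are assumptions made for this example, and it assumes every item carries an `id` property.

```javascript
// Hypothetical helper: keep only the most recent change for each item id.
// `changes` is an array of { eventName, newImage, oldImage } objects,
// in the order the stream delivered them, with images already unmarshalled.
function lastChangePerItem(changes) {
  const latest = new Map();

  for (const change of changes) {
    // REMOVE events only carry the old image
    const image =
      change.eventName === "REMOVE" ? change.oldImage : change.newImage;

    // A later change for the same id overwrites the earlier one
    latest.set(image.id, change);
  }

  return [...latest.values()];
}

// Example: item "1" is removed and then re-created in the same batch
const changes = lastChangePerItem([
  { eventName: "REMOVE", oldImage: { id: "1", name: "Ada" } },
  { eventName: "INSERT", newImage: { id: "1", name: "Ada Lovelace" } },
  { eventName: "MODIFY", newImage: { id: "2", name: "Grace" } }
]);
// `changes` now holds one entry per item: the INSERT for "1" and the MODIFY for "2"
```

The surviving changes can then be dispatched to the create, update, and delete arrays exactly as in the handler.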

Now, let’s see how to sync that into OpenSearch (code simplified for the sake of the example):

import { defaultProvider } from "@aws-sdk/credential-provider-node";
import { AwsSigv4Signer } from "@opensearch-project/opensearch/aws";
import { Client } from "@opensearch-project/opensearch";

const client = new Client({
  ...AwsSigv4Signer({
    service: "es",
    region: process.env.AWS_REGION,
    getCredentials: () => {
      const credentialsProvider = defaultProvider();
      return credentialsProvider();
    }
  }),
  node: process.env.YOUR_OPENSEARCH_ENDPOINT_URL
});

// Name of the OpenSearch index that receives the documents
const indexName = "users";

export default async (toCreate, toDelete, toUpdate) => {
  let createOperations = toCreate.map((doc) => [
    { index: { _index: indexName, _id: doc.id } },
    doc
  ]);
  let updateOperations = toUpdate.map((doc) => [
    { index: { _index: indexName, _id: doc.id } },
    doc
  ]);
  let deleteOperations = toDelete.map((doc) => [
    { delete: { _index: indexName, _id: doc.id } }
  ]);

  const operations = [
    ...createOperations,
    ...updateOperations,
    ...deleteOperations
  ].filter((o) => o);

  if (!operations.length) {
    return;
  }
  const body = operations.flat().map(JSON.stringify).join("\n") + "\n";

  return client.bulk({ body });
};

And bang! The updated data is now indexed into OpenSearch.
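One caveat: a bulk request can succeed as a whole while individual operations fail. The bulk response body has an `errors` flag and an `items` array with one result per operation. Here is a minimal sketch of how those failures could be surfaced; the `failedBulkItems` helper is a hypothetical name and is not part of the OpenSearch client.

```javascript
// Hypothetical helper: list the operations that failed in a bulk response body.
// Each entry of `items` looks like { index: { _index, _id, status, error? } }
// or { delete: { ... } }, keyed by the action that was requested.
function failedBulkItems(bulkBody) {
  if (!bulkBody.errors) {
    return [];
  }

  return bulkBody.items
    .map((item) => {
      const [action] = Object.keys(item);
      return { action, ...item[action] };
    })
    .filter((result) => result.error);
}

// Example with a hand-written response body
const failures = failedBulkItems({
  errors: true,
  items: [
    { index: { _index: "users", _id: "1", status: 201 } },
    {
      index: {
        _index: "users",
        _id: "2",
        status: 400,
        error: { type: "mapper_parsing_exception", reason: "failed to parse" }
      }
    }
  ]
});
// `failures` contains a single entry, for the document with _id "2"
```

In the sync module this could be applied to the `body` of the awaited `client.bulk` response (assuming the client version in use exposes the response that way). Throwing when the list is not empty would make Lambda retry the stream batch.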

Why we are not using the DynamoDB to OpenSearch ETL service

The DynamoDB zero-ETL integration with Amazon OpenSearch Service covers most of what we’ve implemented, and even offers additional functionality, but it comes with limitations.

A significant constraint for us is the ability to manipulate the data before it is sent to OpenSearch. For example, since our OpenSearch indexes mainly serve queries from our end users, we must omit sensitive data. We also want to refine and format certain data for better presentation, and to enrich the records with data from other sources. Removing, adding, or updating properties of the records to be indexed is easy to do within our Lambda function.
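As a rough illustration of that kind of manipulation, here is a minimal sketch of a transform that could run on each record before it is pushed to the bulk request. The field names (`passwordHash`, `resetToken`, `email`), the `extra` argument, and the `toSearchDocument` helper are all hypothetical; they only stand in for whatever needs removing, formatting, or adding.

```javascript
// Hypothetical list of properties that must never reach the search index
const SENSITIVE_FIELDS = ["passwordHash", "resetToken"];

// Hypothetical helper: build the document to index from a DynamoDB item.
// `extra` holds data fetched from other sources to enrich the record.
function toSearchDocument(item, extra = {}) {
  // Copy the item and append the extra properties
  const doc = { ...item, ...extra };

  // Remove sensitive properties
  for (const field of SENSITIVE_FIELDS) {
    delete doc[field];
  }

  // Format a property for consistent searching and presentation
  if (typeof doc.email === "string") {
    doc.email = doc.email.trim().toLowerCase();
  }

  return doc;
}

// Example usage
const searchDoc = toSearchDocument(
  { id: "1", email: " Ada@Example.com ", passwordHash: "secret" },
  { teamName: "Core" }
);
// `searchDoc` keeps `id`, has a normalised `email`, gains `teamName`,
// and no longer has `passwordHash`
```

In the handler, this would be applied to `newImage` before it is pushed to `toCreate` or `toUpdate`.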

What is the cost?

Another long story short, the costs are minimal in our scenario. Typically, the function runs in less than 50 milliseconds. However, it’s important to consider the synchronization latency. While changes are written to DynamoDB right away, there might be a delay of up to 3 seconds before end users see them in results served from OpenSearch. For example, when a user saves an item and navigates to the list of items, the recent changes may not immediately appear without some front-end work.
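The front-end work can take many forms. Here is a minimal sketch of one possible approach, assuming the front end keeps the item it has just saved and merges it into the list returned by the search until the index catches up. The `withSavedItem` helper and the `id` property are assumptions made for this example, not a description of our actual front end.

```javascript
// Hypothetical helper: show the item the user just saved at the top of the
// list, replacing any stale copy returned by the search index.
function withSavedItem(searchResults, savedItem) {
  const others = searchResults.filter((item) => item.id !== savedItem.id);
  return [savedItem, ...others];
}

// Example: the index still returns the old title for item "2"
const visibleItems = withSavedItem(
  [
    { id: "1", title: "First" },
    { id: "2", title: "Old title" }
  ],
  { id: "2", title: "New title" }
);
// `visibleItems` lists item "2" first with its new title, then item "1"
```

This trades strict ordering of the list for immediate feedback, which may or may not suit a given screen.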

At Hivelight, our data hinges on both DynamoDB and OpenSearch. While they may contain similar datasets, they serve distinct purposes. DynamoDB stands as our robust data fortress, while OpenSearch takes the lead in complex querying tasks.

Our DynamoDB Stream-Lambda architecture ensures seamless synchronization between the two, maintaining data integrity and timeliness. While very cost-effective, it’s essential to note a slight delay in UI updates. Nonetheless, with this strategic pairing, we’re delivering exceptional user experiences through insightful data management.

Questions, comments, feedback? Keen to hear about them!

By Xabi Errotabehere

The post How we sync DynamoDB with OpenSearch appeared first on Hivelight.

