Full-text search for DynamoDB using Lambda, EFS, Tantivy and Rust.

TL;DR: The code for this blog post is available on GitHub.

Background

In serverless-land, DynamoDB reigns supreme for data persistence. It is ridiculously fast, easy to use and scales to pretty much anything you throw at it. The trick to using it successfully is that you have to carefully plan your access patterns up front. When you need more complicated queries, single-table design and filters can sometimes help, but for full-text search you're out of luck: DynamoDB doesn't support it.

For complicated search queries AWS recommends streaming data from DynamoDB to another database such as Elasticsearch. But setting up a whole Elasticsearch cluster can be complicated, and it's expensive to run servers 24/7. Another solution is to use Kinesis Firehose to send data to S3, which you can then query with Athena. The downside of Athena is that it tends to be pretty slow, especially for user-facing search requests. Depending on your use case there are also other solutions available outside of AWS, such as Rockset, Algolia, Quickwit, ChaosSearch and Materialize.

While these are all great solutions, what's the fun in using some off-the-shelf product when we can build something ourselves, right? In this article I want to show you a proof of concept that uses DynamoDB, Lambda and Rust to provide a blazingly fast serverless search experience. Depending on interest I could try to turn it into a more generic solution, e.g. a CDK construct that you point at a table and it takes care of the rest.

Proof of concept

For this proof of concept I'm going to model a service used to send emails. Emails will be sent using a service like Mailgun and stored in DynamoDB. We need to be able to handle sending 100K emails per hour at peak. Messages can have multiple recipients. The service will also receive bounce notifications via webhooks. Lastly, we only need to store messages for 90 days. We will rely on DynamoDB TTL to automatically delete items as they expire. Therefore our index will need to be able to support inserts, updates and deletes. For search we need to be able to:

Filter by email address(es).
Filter by keywords, or exact match phrases in the subject or body.
Filter by date range (last hour, last 24 hours, last 7 days).
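To put numbers on these requirements, here is a small Rust sketch of the arithmetic. The function names are hypothetical, and it assumes the TTL attribute holds a Unix timestamp in seconds, which is the format DynamoDB TTL expects.

```rust
/// Seconds in one day.
const SECONDS_PER_DAY: i64 = 24 * 60 * 60;

/// Retention period from the requirements above: 90 days.
const RETENTION_DAYS: i64 = 90;

/// Expiry time (Unix seconds) for a message created at `created_at`.
/// Hypothetical helper: the result would be written to the table's TTL attribute.
fn ttl_expiry(created_at: i64) -> i64 {
    created_at + RETENTION_DAYS * SECONDS_PER_DAY
}

/// Average number of emails per second for a given hourly volume.
fn emails_per_second(emails_per_hour: f64) -> f64 {
    emails_per_hour / 3600.0
}
```

Calling `emails_per_second(100_000.0)` gives roughly 28 emails per second, which is the sustained write rate the indexer has to keep up with at peak.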

Solution discussion

Given we're using DynamoDB as our primary datastore we can take advantage of DynamoDB streams to bulk index the data as it changes. DynamoDB will buffer data until it reaches a configurable batch size / batch window before invoking Lambda to do the indexing. To index the data we will be using Tantivy. Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust. It's open source and maintained by the engineers at Quickwit.
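The "batch size or batch window, whichever comes first" rule can be pictured with a toy model. This is only an illustration of the rule, not how the stream integration is actually implemented, and the `Batcher` type and its methods are made up for this sketch.

```rust
/// Toy model of "flush when the batch is full or the window has elapsed".
struct Batcher {
    batch_size: usize,
    window_secs: u64,
    /// Arrival time (seconds) of the first record in the current buffer.
    window_start: Option<u64>,
    buffer: Vec<String>,
}

impl Batcher {
    fn new(batch_size: usize, window_secs: u64) -> Self {
        Batcher { batch_size, window_secs, window_start: None, buffer: Vec::new() }
    }

    /// Buffer a record that arrived at `now`. Returns a batch if one is due.
    fn push(&mut self, now: u64, record: String) -> Option<Vec<String>> {
        if self.window_start.is_none() {
            self.window_start = Some(now);
        }
        self.buffer.push(record);
        self.flush_if_due(now)
    }

    /// Returns the buffered batch if the size limit or the window has been reached.
    fn flush_if_due(&mut self, now: u64) -> Option<Vec<String>> {
        let start = self.window_start?; // nothing buffered, nothing to flush
        let full = self.buffer.len() >= self.batch_size;
        let expired = now.saturating_sub(start) >= self.window_secs;
        if full || expired {
            self.window_start = None;
            Some(std::mem::take(&mut self.buffer))
        } else {
            None
        }
    }
}
```

With a large batch size and window, a busy stream produces a few big batches instead of many small ones, which means fewer commits for the index writer.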

We need somewhere to store the index so the main candidates would be S3 or EFS. EFS is a serverless elastic file system that you can attach to Lambda. It scales automatically and you only pay for the storage you use. EFS is also really fast, with sub-millisecond read latencies. Tantivy works out of the box with a filesystem so that's what I decided to run with.

One downside of EFS is that it runs within a VPC. We definitely don't want to create any NAT gateway instances as they are expensive to run. Without public internet access from our lambda functions we will need to use a VPC endpoint to access DynamoDB.

It's important to note that we don't need to store the entire document in the index -- just the terms and the ids of the emails. This helps reduce storage costs and allows Tantivy to do what it does best: indexing. Tantivy will return the ids from a query, then we will fetch the actual document from DynamoDB using a BatchGetItem request.
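One practical detail of the fetch step: a single BatchGetItem request accepts at most 100 keys, so a larger result set has to be split up. A minimal sketch of that split, with a hypothetical helper name:

```rust
/// Maximum number of keys a single BatchGetItem request accepts.
const BATCH_GET_MAX_KEYS: usize = 100;

/// Split the ids returned by the search into request-sized groups.
/// Hypothetical helper: each group would become one BatchGetItem call.
fn batch_get_chunks(ids: &[String]) -> Vec<Vec<String>> {
    ids.chunks(BATCH_GET_MAX_KEYS)
        .map(|chunk| chunk.to_vec())
        .collect()
}
```

BatchGetItem does not return items in any particular order, so if the search ranking matters the fetched items need to be re-sorted to match the order of the ids.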

With Tantivy you can only have one process writing to the index at a time -- it locks the index while writing. So to ensure we only have a single index writer, we will cap the Lambda at one concurrent execution (reserved concurrency set to 1) and set the parallelization factor to 1 on the stream. There's another potential problem here: scalability. If too much data arrives from the stream we might not be able to keep up. To help mitigate this problem it's important to use a large batch size and batch window. If the indexing Lambda cannot keep up you can get throttling, so for testing I've set up an SQS DLQ which we can check for failures. As always, use load testing to ensure the solution will meet your requirements.

Architecture

Indexing flow

The email service will perform CRUD operations on DynamoDB. Changes are streamed in batches to the "index writer" lambda function. It uses Tantivy to index the email and store it on EFS. It also acts as a "singleton" function, with only one instance of it running at a time.
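As a rough sketch of what the writer has to decide for each stream record: DynamoDB stream records carry an event name of INSERT, MODIFY or REMOVE, and each one maps naturally onto an index operation. The `IndexOp` enum and `index_op` function below are hypothetical names for illustration, not the code from the repo.

```rust
/// What the index writer should do for one stream record (hypothetical type).
#[derive(Debug, PartialEq)]
enum IndexOp {
    /// New item: add a document.
    Add,
    /// Changed item: delete the old document, then add the new one.
    Replace,
    /// Deleted item: delete the document.
    Delete,
    /// Unknown event name: ignore the record.
    Skip,
}

/// Map a DynamoDB stream event name to an index operation.
fn index_op(event_name: &str) -> IndexOp {
    match event_name {
        "INSERT" => IndexOp::Add,
        "MODIFY" => IndexOp::Replace,
        "REMOVE" => IndexOp::Delete,
        _ => IndexOp::Skip,
    }
}
```

Items expired by TTL also arrive on the stream as REMOVE records, which is how the 90-day expiry reaches the index.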

[Diagram: indexing flow]

Search flow

We are going to use a Lambda Function URL so that we can invoke the "index reader" lambda over HTTP using IAM authentication. The first thing the function will do is search the Tantivy index stored on EFS. Then using the ids returned, it will batch get the original documents from DynamoDB. This function can scale out because Tantivy supports having concurrent index readers.

[Diagram: search flow]

Index Schema

Tantivy indexes have very strict schemas. The schema declares which fields are in the index and for each field its type and "how it should be indexed".

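use tantivy::schema::{Schema, INDEXED, STORED, STRING, TEXT};

let mut builder = Schema::builder();

// Exact-match identifier, stored so it can be returned with search results.
let id = builder.add_text_field("id", STRING | STORED);
// Unix timestamp, indexed so we can filter by date range.
let timestamp = builder.add_i64_field("timestamp", INDEXED);
// Tokenized full-text fields.
let subject = builder.add_text_field("subject", TEXT);
let body = builder.add_text_field("body", TEXT);
let to = builder.add_text_field("to", TEXT);

let schema = builder.build();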

For the id field we use STRING | STORED. STRING fields are indexed but not tokenized, so an id is matched as a single exact term. We also need to STORE the id in the index so that it can be read back from search results. Exact matching on the id is also what lets us "update" a document. Tantivy does not support updates directly. Instead you have to perform a delete then insert before committing. There's an example of how to perform an update operation in the Tantivy examples.
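To make the delete-then-insert idea concrete, here is a toy in-memory inverted index written with only the standard library. It is not Tantivy's API, just a model of the pattern, and `ToyIndex` and its methods are invented for this sketch.

```rust
use std::collections::{HashMap, HashSet};

/// Toy inverted index keyed by an exact-match id (illustration only).
#[derive(Default)]
struct ToyIndex {
    /// id -> document text
    docs: HashMap<String, String>,
    /// term -> ids of the documents containing it
    postings: HashMap<String, HashSet<String>>,
}

impl ToyIndex {
    /// Remove every trace of the document with this id.
    fn delete(&mut self, id: &str) {
        if self.docs.remove(id).is_some() {
            for ids in self.postings.values_mut() {
                ids.remove(id);
            }
            self.postings.retain(|_, ids| !ids.is_empty());
        }
    }

    /// Index the whitespace-separated, lowercased terms of `text` under `id`.
    fn insert(&mut self, id: &str, text: &str) {
        for term in text.split_whitespace() {
            self.postings
                .entry(term.to_lowercase())
                .or_default()
                .insert(id.to_string());
        }
        self.docs.insert(id.to_string(), text.to_string());
    }

    /// "Update": delete the old version by id, then insert the new one.
    fn upsert(&mut self, id: &str, text: &str) {
        self.delete(id);
        self.insert(id, text);
    }

    /// Ids of the documents containing `term`, sorted for stable output.
    fn search(&self, term: &str) -> Vec<String> {
        let mut ids: Vec<String> = self
            .postings
            .get(&term.to_lowercase())
            .map(|set| set.iter().cloned().collect())
            .unwrap_or_default();
        ids.sort();
        ids
    }
}
```

If `upsert` skipped the delete step, terms from the old version of a document would keep matching after the document changed, which is why the delete has to come first.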

For the timestamp field we use an i64 to store the Unix timestamp of the message. This field is INDEXED so that we can filter based on date e.g. last hour, last 24 hours, last 7 days.
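A date filter then boils down to a range over Unix timestamps. The sketch below builds a query string in the same `timestamp:>=... AND timestamp:<=...` form used in the search examples later in this post. The helper name and constants are hypothetical.

```rust
/// Window lengths in seconds for the three filters we need.
const LAST_HOUR: i64 = 60 * 60;
const LAST_24_HOURS: i64 = 24 * LAST_HOUR;
const LAST_7_DAYS: i64 = 7 * LAST_24_HOURS;

/// Build a query string selecting messages from the last `window_secs` seconds.
/// `now` is the current Unix time in seconds.
fn last_window_query(now: i64, window_secs: i64) -> String {
    format!("timestamp:>={} AND timestamp:<={}", now - window_secs, now)
}
```

For example, `last_window_query(now, LAST_24_HOURS)` produces the filter for the last 24 hours, and it can be combined with other clauses using AND.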

For the rest of the fields we use TEXT. These fields will be tokenized and indexed.
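To get a feel for what "tokenized" means here, the sketch below roughly approximates a simple tokenizer: split on anything that isn't a letter or digit, then lowercase. As far as I understand, Tantivy's default tokenizer behaves along these lines (it also drops very long tokens, which this sketch ignores), so treat the `tokenize` function as an approximation rather than the real thing.

```rust
/// Rough approximation of a simple tokenizer:
/// split on non-alphanumeric characters and lowercase each token.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|token| !token.is_empty())
        .map(|token| token.to_lowercase())
        .collect()
}
```

Under this approximation a recipient such as `"Reina Medhurst" <reina.medhurst@yahoo.com>` is indexed as the separate terms reina, medhurst, reina, medhurst, yahoo and com, rather than as one email-address term.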

When creating a document you can add the same field more than once. Below is a snippet from the "index writer" lambda function which parses events from DynamoDB and turns them into a Tantivy document. Notice we add a to field for each recipient.

fn parse_document(
    config: &Config,
    attributes: HashMap<String, AttributeValue>,
) -> anyhow::Result<Document> {
    let id = parse_string(&attributes, "id")?;
    let timestamp: i64 = parse_string(&attributes, "timestamp")?.parse()?;
    let subject = parse_string(&attributes, "subject")?;
    let body = parse_string(&attributes, "body")?;
    let to = parse_string_array(&attributes, "to")?;

    let mut doc = doc!(
        config.email_index_schema.fields.id => id,
        config.email_index_schema.fields.timestamp => timestamp,
        config.email_index_schema.fields.subject => subject,
        config.email_index_schema.fields.body => body,
    );

    for email in to {
        doc.add_text(config.email_index_schema.fields.to, email);
    }

    Ok(doc)
}

Deployment

Set up a profile called "dynamodb-email-indexer". Don't skip this step! We use this within our package.json for deployments and testing.

aws configure --profile dynamodb-email-indexer
AWS Access Key ID [None]: ENTER YOUR ACCESS_KEY
AWS Secret Access Key [None]: ENTER YOUR SECRET_KEY
Default region name [None]: ENTER YOUR AWS_REGION
Default output format [None]:

Clone the code from GitHub

git clone https://github.com/jakejscott/dynamodb-email-indexer
cd dynamodb-email-indexer

Make sure the .env file is created

cp .env.sample .env

Fill out the fields with your AWS account id and region.

CDK_DEFAULT_ACCOUNT=ENTER YOUR AWS ACCOUNT_ID
CDK_DEFAULT_REGION=ENTER YOUR AWS_REGION

Install the latest version of the AWS CDK

npm install -g aws-cdk

Install the CDK dependencies

npm install

Make sure you have the latest version of Rust. I wrote a blog post if you need help.

rustup update

We need to cross-compile our code to run on Lambda. To do this we need to install an additional compilation target.

rustup target add x86_64-unknown-linux-musl

Build the Rust lambda functions. I created a utility program called package.rs to make this easy.

npm run package

Bootstrap CDK into the account you're deploying into

npm run bootstrap

Finally, all going well, we should be able to deploy the stack!

npm run deploy

We have two stack outputs: the name of the DynamoDB table that is created, and a Lambda function URL for the index reader. Check that these have been written out to a file called outputs.json. Mine looks like this:

{
  "dynamodb-email-indexer": {
    "EmailIndexReaderFunctionUrl": "https://g4pvyrwfr65mpzjfb7f3wvlncy0vtpzf.lambda-url.ap-southeast-2.on.aws/",
    "EmailTableName": "dynamodb-email-indexer-EmailTableEFC3BCB9-22OESOSH92FY"
  }
}

Load testing writes

The first thing we should do is a smoke test to make sure everything is working. I created a benchmark.rs utility to make this easy.

cargo run --example benchmark -- --how-many 1000 --profile dynamodb-email-indexer

You should see some output similar to this

> dynamodb-email-indexer@0.1.0 benchmark
> cargo run --example benchmark -- --how-many 100 --profile dynamodb-email-indexer
[2022-04-14T00:36:42Z INFO  benchmark] index num docs before starting: 34888
[2022-04-14T00:36:42Z INFO  benchmark] sent 25 of 1000
[2022-04-14T00:36:42Z INFO  benchmark] sent 50 of 1000
[2022-04-14T00:36:43Z INFO  benchmark] sent 75 of 1000
[2022-04-14T00:36:43Z INFO  benchmark] sent 100 of 1000
[2022-04-14T00:36:43Z INFO  benchmark] sent 125 of 1000
...
[2022-04-14T00:36:44Z INFO  benchmark] sent 925 of 1000
[2022-04-14T00:36:44Z INFO  benchmark] sent 950 of 1000
[2022-04-14T00:36:45Z INFO  benchmark] sent 975 of 1000
[2022-04-14T00:36:45Z INFO  benchmark] sent 1000 of 1000
[2022-04-14T00:36:45Z INFO  benchmark] checking total docs count
[2022-04-14T00:36:48Z INFO  benchmark] index num docs: 34888 total indexed: 0 elapsed: 6.455459188s
[2022-04-14T00:36:51Z INFO  benchmark] index num docs: 34888 total indexed: 0 elapsed: 9.665401669s
[2022-04-14T00:36:54Z INFO  benchmark] index num docs: 34888 total indexed: 0 elapsed: 12.887532513s
[2022-04-14T00:36:57Z INFO  benchmark] index num docs: 34888 total indexed: 0 elapsed: 16.106789721s
[2022-04-14T00:37:01Z INFO  benchmark] index num docs: 34888 total indexed: 0 elapsed: 19.333715467s
[2022-04-14T00:37:04Z INFO  benchmark] index num docs: 34888 total indexed: 0 elapsed: 22.573496403s
[2022-04-14T00:37:07Z INFO  benchmark] index num docs: 35387 total indexed: 499 elapsed: 25.896673675s
[2022-04-14T00:37:11Z INFO  benchmark] index num docs: 35632 total indexed: 744 elapsed: 29.187348488s
[2022-04-14T00:37:14Z INFO  benchmark] index num docs: 35888 total indexed: 1000 elapsed: 32.713401804s
[2022-04-14T00:37:14Z INFO  benchmark] done: 32.713490464s

Notice that the program queries how many documents are in the index before starting. It then starts sending documents to DynamoDB. Once it has finished sending the documents it keeps querying the index count until all documents have been indexed. In this case it took about 33 seconds to index 1000 documents.

Let's try indexing 10,000 records...

cargo run --example benchmark -- --how-many 10000 --profile dynamodb-email-indexer

Check the output: it took about 54 seconds to index 10,000 documents.

[2022-04-14T00:46:08Z INFO  benchmark] index num docs before starting: 35888
[2022-04-14T00:46:09Z INFO  benchmark] sent 25 of 10000
...
[2022-04-14T00:46:30Z INFO  benchmark] sent 10000 of 10000
[2022-04-14T00:46:30Z INFO  benchmark] checking total docs count
[2022-04-14T00:46:34Z INFO  benchmark] index num docs: 43888 total indexed: 8000 elapsed: 26.661290598s
[2022-04-14T00:46:37Z INFO  benchmark] index num docs: 43888 total indexed: 8000 elapsed: 29.866499914s
[2022-04-14T00:46:40Z INFO  benchmark] index num docs: 43888 total indexed: 8000 elapsed: 33.511384528s
[2022-04-14T00:46:44Z INFO  benchmark] index num docs: 43888 total indexed: 8000 elapsed: 36.712733001s
[2022-04-14T00:46:47Z INFO  benchmark] index num docs: 43888 total indexed: 8000 elapsed: 39.926994794s
[2022-04-14T00:46:50Z INFO  benchmark] index num docs: 43888 total indexed: 8000 elapsed: 43.127705086s
[2022-04-14T00:46:53Z INFO  benchmark] index num docs: 43888 total indexed: 8000 elapsed: 46.342801106s
[2022-04-14T00:46:57Z INFO  benchmark] index num docs: 44466 total indexed: 8578 elapsed: 49.637049724s
[2022-04-14T00:47:01Z INFO  benchmark] index num docs: 45888 total indexed: 10000 elapsed: 53.943496132s
[2022-04-14T00:47:01Z INFO  benchmark] done: 53.94358911s

Let's try 100K records

cargo run --example benchmark -- --how-many 100000 --profile dynamodb-email-indexer
[2022-04-14T00:49:07Z INFO  benchmark] index num docs before starting: 45888
[2022-04-14T00:49:10Z INFO  benchmark] sent 25 of 100000
...
[2022-04-14T00:52:41Z INFO  benchmark] sent 99975 of 100000
[2022-04-14T00:52:41Z INFO  benchmark] sent 100000 of 100000
[2022-04-14T00:52:41Z INFO  benchmark] checking total docs count
[2022-04-14T00:52:45Z INFO  benchmark] index num docs: 143888 total indexed: 98000 elapsed: 218.283679171s
[2022-04-14T00:52:48Z INFO  benchmark] index num docs: 143888 total indexed: 98000 elapsed: 221.598423283s
[2022-04-14T00:52:52Z INFO  benchmark] index num docs: 143888 total indexed: 98000 elapsed: 224.826200708s
[2022-04-14T00:52:55Z INFO  benchmark] index num docs: 143888 total indexed: 98000 elapsed: 228.106123255s
[2022-04-14T00:52:58Z INFO  benchmark] index num docs: 143888 total indexed: 98000 elapsed: 231.340136718s
[2022-04-14T00:53:02Z INFO  benchmark] index num docs: 143888 total indexed: 98000 elapsed: 234.6622887s
[2022-04-14T00:53:05Z INFO  benchmark] index num docs: 144820 total indexed: 98932 elapsed: 237.907739259s
[2022-04-14T00:53:08Z INFO  benchmark] index num docs: 145467 total indexed: 99579 elapsed: 241.268484872s
[2022-04-14T00:53:12Z INFO  benchmark] index num docs: 145888 total indexed: 100000 elapsed: 244.759438031s
[2022-04-14T00:53:12Z INFO  benchmark] done: 244.759524446s

About 4 minutes to index 100K emails. Most of that time was spent sending the documents to DynamoDB: by the time the benchmark started checking how many documents had been indexed, the count was already at 98,000.

How about 1 million?

[2022-04-16T00:33:58Z INFO  benchmark] sent 999950 of 1000000
[2022-04-16T00:33:58Z INFO  benchmark] sent 999975 of 1000000
[2022-04-16T00:33:58Z INFO  benchmark] sent 1000000 of 1000000
[2022-04-16T00:33:58Z INFO  benchmark] checking total docs count
[2022-04-16T00:34:03Z INFO  benchmark] index num docs: 999000 total indexed: 999000 elapsed: 2060.0101917s
[2022-04-16T00:34:06Z INFO  benchmark] index num docs: 999000 total indexed: 999000 elapsed: 2063.231698s
[2022-04-16T00:34:09Z INFO  benchmark] index num docs: 999000 total indexed: 999000 elapsed: 2066.4664465s
[2022-04-16T00:34:12Z INFO  benchmark] index num docs: 999000 total indexed: 999000 elapsed: 2069.6761106s
[2022-04-16T00:34:16Z INFO  benchmark] index num docs: 999000 total indexed: 999000 elapsed: 2073.0062907s
[2022-04-16T00:34:19Z INFO  benchmark] index num docs: 999000 total indexed: 999000 elapsed: 2076.2273748s
[2022-04-16T00:34:22Z INFO  benchmark] index num docs: 999000 total indexed: 999000 elapsed: 2079.4475651s
[2022-04-16T00:34:26Z INFO  benchmark] index num docs: 999634 total indexed: 999634 elapsed: 2082.9070893s
[2022-04-16T00:34:29Z INFO  benchmark] index num docs: 1000000 total indexed: 1000000 elapsed: 2086.2907929s
[2022-04-16T00:34:29Z INFO  benchmark] done: 2086.2912984s

About 35 minutes to index 1M documents, again with most of the time spent sending the docs to DynamoDB and the index only trailing by about 30 seconds! I think we can say that it meets our goal of being able to index 100K emails per hour ✅.
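To compare the benchmark runs against the 100K per hour target, the elapsed times from the logs can be converted into an hourly rate. This is a small sketch with a hypothetical function name. Note that the elapsed time includes sending the documents to DynamoDB, so it measures the whole pipeline rather than the indexer alone.

```rust
/// End-to-end documents per hour, given a document count and elapsed seconds.
fn docs_per_hour(docs: u64, elapsed_secs: f64) -> f64 {
    docs as f64 / elapsed_secs * 3600.0
}

/// The four benchmark runs above as (documents, elapsed seconds from the logs).
const RUNS: [(u64, f64); 4] = [
    (1_000, 32.713490464),
    (10_000, 53.94358911),
    (100_000, 244.759524446),
    (1_000_000, 2086.2912984),
];
```

The 1M run works out to roughly 1.7 million documents per hour, and even the small 1000-document run, where the fixed batching delay dominates, stays above the 100K per hour target.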

Load testing reads

We also need to load test how many queries we can do per second. To make this easier I disabled the IAM Auth on the lambda function URL. I'm using k6 to run the load test. This is the test script:

import http from "k6/http";
import { check, sleep } from "k6";

export const options = {
  stages: [
    { duration: "15s", target: 10 },
    { duration: "30s", target: 25 },
    { duration: "60s", target: 50 },
    { duration: "20s", target: 0 },
  ],
  thresholds: {
    // 95% of requests must finish within 100ms.
    http_req_duration: ["p(95) < 100"],
  },
  ext: {
    loadimpact: {
      distribution: {
        "amazon:au:sydney": { loadZone: "amazon:au:sydney", percent: 100 },
      },
    },
  },
};

export default function () {
  const url = __ENV.URL;
  const query = __ENV.QUERY;
  const limit = parseInt(__ENV.LIMIT);

  const payload = JSON.stringify({
    query: query,
    limit: limit,
  });

  const params = {
    headers: {
      "Content-Type": "application/json",
    },
  };

  const res = http.post(url, payload, params);
  check(res, { "status was 200": (r) => r.status == 200 });

  const json = res.json();
  check(json, {
    "no errors": (x) => x.error == null,
  });

  sleep(1);
}

From my machine we are able to get about 22 requests per second, with 95% of all requests finishing in under 80ms. The stack is deployed in Sydney (ap-southeast-2), while I live in New Zealand, so there's a bit of latency.

k6 run -e QUERY="to:jaeden.shields@hotmail.com" -e LIMIT=10 -e URL=https://mnitgr7azia56rmhypjltqitlu0ggqmd.lambda-url.ap-southeast-2.on.aws .\loadtest\index.js

          /\      |‾‾| /‾‾/   /‾‾/
     /\  /  \     |  |/  /   /  /
    /  \/    \    |     (   /   ‾‾\
   /          \   |  |\  \ |  (‾)  |
  / __________ \  |__| \__\ \_____/ .io

  execution: local
     script: .\loadtest\index.js
     output: -

  scenarios: (100.00%) 1 scenario, 50 max VUs, 2m5s max duration (incl. graceful stop):
           * default: Up to 50 looping VUs for 1m35s over 4 stages (gracefulRampDown: 30s, gracefulStop: 30s)


running (1m35.3s), 00/50 VUs, 2085 complete and 0 interrupted iterations
default ✓ [======================================] 00/50 VUs  1m35s

     ✓ status was 200
     ✓ no errors

     checks.........................: 100.00% ✓ 4170      ✗ 0
     data_received..................: 1.6 MB  17 kB/s
     data_sent......................: 576 kB  6.0 kB/s
     http_req_blocked...............: avg=3.34ms   min=0s      med=0s      max=237.28ms p(90)=0s      p(95)=0s
     http_req_connecting............: avg=1.05ms   min=0s      med=0s      max=102.87ms p(90)=0s      p(95)=0s
   ✓ http_req_duration..............: avg=60.71ms  min=43.92ms med=57.27ms max=368.81ms p(90)=70.92ms p(95)=79.93ms
       { expected_response:true }...: avg=60.71ms  min=43.92ms med=57.27ms max=368.81ms p(90)=70.92ms p(95)=79.93ms
     http_req_failed................: 0.00%   ✓ 0         ✗ 2085
     http_req_receiving.............: avg=180.73µs min=0s      med=0s      max=4.76ms   p(90)=808.3µs p(95)=1ms
     http_req_sending...............: avg=5.64µs   min=0s      med=0s      max=925.8µs  p(90)=0s      p(95)=0s
     http_req_tls_handshaking.......: avg=2.24ms   min=0s      med=0s      max=193.67ms p(90)=0s      p(95)=0s
     http_req_waiting...............: avg=60.53ms  min=42.96ms med=57.09ms max=368.81ms p(90)=70.66ms p(95)=79.93ms
     http_reqs......................: 2085    21.870785/s
     iteration_duration.............: avg=1.07s    min=1.04s   med=1.06s   max=1.37s    p(90)=1.08s   p(95)=1.1s
     iterations.....................: 2085    21.870785/s
     vus............................: 2       min=1       max=50
     vus_max........................: 50      min=50      max=50
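One thing worth keeping in mind when reading these numbers: because each virtual user sleeps for a second between requests, the request rate is bounded by the number of virtual users rather than by the Lambda. A back-of-the-envelope sketch of that relationship (Little's law), using the figures from the output above and hypothetical function names:

```rust
/// Upper bound on requests per second when each virtual user completes
/// one iteration every `iteration_secs` seconds.
fn max_requests_per_second(vus: f64, iteration_secs: f64) -> f64 {
    vus / iteration_secs
}

/// Average number of active virtual users implied by an observed rate
/// (Little's law: concurrency = throughput x time per iteration).
fn average_active_vus(requests_per_sec: f64, iteration_secs: f64) -> f64 {
    requests_per_sec * iteration_secs
}
```

With an average iteration of 1.07s, 50 virtual users can generate at most about 47 requests per second, and the observed 21.87 requests per second corresponds to about 23 active users on average over the ramp.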

Let's run the test using k6.io cloud. First log in to k6

k6 login cloud -t YOUR_K6_CLOUD_TOKEN

Next run the script in the cloud using k6 cloud rather than k6 run.

k6 cloud -e QUERY="to:jaeden.shields@hotmail.com" -e LIMIT=10 -e URL=https://mnitgr7azia56rmhypjltqitlu0ggqmd.lambda-url.ap-southeast-2.on.aws .\loadtest\index.js

With the k6 trial account I'm only able to create 50 virtual users and it looks like we peak at about 50 req/s. The average response time stays flat during the test which is good, but we'd probably need to add more users to see how well it scales. I think we can give this a pass for now 😊.

Search testing

Filter by email address:

cargo run --example search -- --query 'to:reina.medhurst@yahoo.com' --limit 10 --profile dynamodb-email-indexer

Response:

[2022-04-14T02:05:12Z INFO  search] search response:
    {
      "emails": [
        {
          "body": "ratione beatae ratione officia et.\nquos eveniet alias fuga unde doloremque aperiam.",
          "id": "01G06C7J8T0AKNJVZT8H5X083S",
          "subject": "inventore maxime quia ea.",
          "timestamp": 1649481599,
          "to": [
            "\"Reina Medhurst\" <reina.medhurst@yahoo.com>"
          ]
        }
      ],
      "error": null,
      "index_num_docs": 145888,
      "query_num_docs": 1
    }
[2022-04-14T02:05:12Z INFO  search] done: 260.592ms

Filter by email addresses:

cargo run --example search -- --query 'to:reina.medhurst@yahoo.com OR to:lolita.kulas@hotmail.com' --limit 10 --profile dynamodb-email-indexer

Response:

[2022-04-14T02:04:24Z INFO  search] search response:
    {
      "emails": [
        {
          "body": "dignissimos consectetur doloremque enim nam ut quo.",
          "id": "01G06C7J8Z46A9FVWQABWMD8K6",
          "subject": "illum ut.",
          "timestamp": 1649481599,
          "to": [
            "\"Lolita Kulas\" <lolita.kulas@hotmail.com>"
          ]
        },
        {
          "body": "ratione beatae ratione officia et.\nquos eveniet alias fuga unde doloremque aperiam.",
          "id": "01G06C7J8T0AKNJVZT8H5X083S",
          "subject": "inventore maxime quia ea.",
          "timestamp": 1649481599,
          "to": [
            "\"Reina Medhurst\" <reina.medhurst@yahoo.com>"
          ]
        }
      ],
      "error": null,
      "index_num_docs": 145888,
      "query_num_docs": 2
    }
[2022-04-14T02:04:24Z INFO  search] done: 227.0634ms

Filter by keywords, or exact match phrases in the subject or body.

cargo run --example search -- --query 'subject:magni+asperiores AND body:quasi+debitis+rerum+tempore+doloribus' --limit 10 --profile dynamodb-email-indexer

Response:

[2022-04-14T02:02:34Z INFO  search] search response:
    {
      "emails": [
        {
          "body": "quasi debitis rerum tempore doloribus.",
          "id": "01G0JRQ6ZY1CGNM60YSX41A314",
          "subject": "magni asperiores.",
          "timestamp": 1649897348,
          "to": [
            "\"Nya Jenkins\" <nya.jenkins@yahoo.com>"
          ]
        }
      ],
      "error": null,
      "index_num_docs": 145888,
      "query_num_docs": 1
    }
[2022-04-14T02:02:34Z INFO  search] done: 267.226ms

Filter by date range

cargo run --example search -- --query 'timestamp:>=1649481599 AND timestamp:<=1649902135' --limit 5 --profile dynamodb-email-indexer

Response:

[2022-04-14T02:10:33Z INFO  search] search response:
    {
      "emails": [
        {
          "body": "esse quas amet voluptatibus.",
          "id": "01G06C7J92JMCT94A5SA9JH31N",
          "subject": "deserunt.",
          "timestamp": 1649481599,
          "to": [
            "\"Torrey Wyman\" <torrey.wyman@gmail.com>"
          ]
        },
        {
          "body": "ratione beatae ratione officia et.\nquos eveniet alias fuga unde doloremque aperiam.",
          "id": "01G06C7J8T0AKNJVZT8H5X083S",
          "subject": "inventore maxime quia ea.",
          "timestamp": 1649481599,
          "to": [
            "\"Reina Medhurst\" <reina.medhurst@yahoo.com>"
          ]
        },
        {
          "body": "aut rem sit commodi ipsam labore nam ut minima.\nnatus est vero sit assumenda maxime.",
          "id": "01G06C7J98DQP0R7EG7BZQQRH9",
          "subject": "veritatis reprehenderit suscipit.",
          "timestamp": 1649481599,
          "to": [
            "\"Colten Jerde\" <colten.jerde@yahoo.com>"
          ]
        },
        {
          "body": "ut alias aperiam porro excepturi eum beatae.",
          "id": "01G06C7J91F35Z77ANADB6D2R0",
          "subject": "est.",
          "timestamp": 1649481599,
          "to": [
            "\"Alisha Gutmann\" <alisha.gutmann@yahoo.com>"
          ]
        },
        {
          "body": "dignissimos consectetur doloremque enim nam ut quo.",
          "id": "01G06C7J8Z46A9FVWQABWMD8K6",
          "subject": "illum ut.",
          "timestamp": 1649481599,
          "to": [
            "\"Lolita Kulas\" <lolita.kulas@hotmail.com>"
          ]
        }
      ],
      "error": null,
      "index_num_docs": 145888,
      "query_num_docs": 145885
    }
[2022-04-14T02:10:33Z INFO  search] done: 207.5986ms

Next steps

I'd love to spend some time thinking about how to turn this into a more generic solution or open source project that could be packaged up in the form of a CDK construct. These are some of the features I've thought about:

Indexing an existing table (could start from a DynamoDB table backup).
Ability to manage indexes, modify, re-index etc.
Rolling index per day / month.
Admin UI using Remix and Tailwind UI.
Other use cases like being able to search EventBridge events.

If this is something that you're interested in, please reach out to me on Twitter, leave a comment or star the GitHub repo. If you want to learn more about Tantivy, check out the examples and join the Quickwit Discord channel. They are very helpful and have answered a bunch of my questions.

Thanks for reading!

Jake Scott
