Building an Email Response Generation System
We built AI smart enough to write our emails but not smart enough to convince people to stop sending so many.
Introduction
Think of an organization that has to deal with a large volume of emails every day. Wouldn’t it be nice if we could generate email responses at scale?
It’s a nice thought, but difficult to execute at scale. The biggest challenge is ensuring the quality of the email responses. Making an API call to a large language model (LLM) is easy; what truly matters is the quality of what comes back. Providing LLMs with the stored data and email history might not be enough: an email can be the continuation of a conversation that happened over a call, and if an organization doesn’t store call or video communications, then even the most powerful LLM won’t be able to generate optimal responses to every email.
We design this architecture under certain assumptions, trying to keep them as realistic as possible. We will assume that emails don’t have attachments, or that if they do, the attachment data has already been extracted and stored in a database for an LLM to access. Data storage and extraction is a crucial design concern; we leave it out deliberately to keep the article to a reasonable length.
Architecture
Here is the detailed architecture diagram. We will walk through its components one by one.
A typical journey of responding to an email goes like this: an email hits a user’s inbox, the user clicks on the email, reads it, types out a response if needed, and then sends it. This journey implies that there is ample time to process an email response before the user sees it. We can exploit this lag to create an asynchronous flow.
As soon as an email hits an inbox, it is sent to Kafka asynchronously so that the response can be pre-populated before the user requests it. However, it's possible that a user requests a response before it has been generated. Since that increases the user's wait time, we want to minimize such scenarios. Instead of waiting for the user to press the response-generation button, we send a synchronous request as soon as the user opens an email to read it. This gives us a head start and reduces the probability of the user pressing the button before the response is ready.
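To make the ingestion step concrete, here is a minimal sketch of publishing an incoming email to Kafka with a fresh correlation ID, assuming the kafka-python client; the topic name and payload fields are illustrative assumptions rather than part of the original design:

```python
# Sketch of the asynchronous ingestion step (assumes the kafka-python client;
# the topic name and payload fields are illustrative, not part of the design).
import json
import uuid

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def on_email_received(email: dict) -> str:
    """Publish an incoming email to the low-priority topic with a new correlation ID."""
    correlation_id = str(uuid.uuid4())
    event = {
        "correlation_id": correlation_id,
        "message_id": email["message_id"],
        "thread_id": email["thread_id"],
    }
    # Fire-and-forget: the response is pre-generated before the user ever asks for it.
    producer.send("email-responses-low-priority", value=event)
    return correlation_id
```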
When the user opens an email, a synchronous request is sent, and the pre-generated response is fetched from Redis and returned. The response can be held in memory within the client application (such as an email client); this is ephemeral storage that exists only for the current session. In applications built with frameworks such as React, Angular, or Vue, the response could be stored in the component state or in a state-management store (such as Redux) associated with the email view.
Redis
We use Redis for its distributed capabilities, in-memory storage, and Pub/Sub messaging. Apart from storage, Redis tracks email status via a correlation ID. A correlation ID is a unique identifier assigned to each email request that allows the system to track it through the entire processing pipeline. After every processing step, the respective service updates the current status of this correlation ID in Redis.
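As a rough sketch of this bookkeeping, each service could update a small Redis hash keyed by the correlation ID (shown here with redis-py); the key layout and status values below are our own assumptions:

```python
# Sketch of correlation-ID status tracking in Redis (redis-py). Key names,
# fields, and status values are assumptions made for illustration.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def set_status(correlation_id: str, status: str) -> None:
    # Each service calls this after its step, e.g. "queued", "generating",
    # "evaluating", "completed".
    r.hset(f"email:{correlation_id}", mapping={"status": status})

def store_response(correlation_id: str, response_text: str) -> None:
    # Written by the pipeline once the final response is ready.
    r.hset(f"email:{correlation_id}", mapping={
        "status": "completed",
        "response": response_text,
    })
    r.expire(f"email:{correlation_id}", 24 * 3600)  # keep the hot store bounded

def get_response(correlation_id: str):
    # Read by the Query Service when the user opens the email.
    return r.hget(f"email:{correlation_id}", "response")
```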
When a client request comes in to fetch a response (during an email-open event) and the response is not yet present in Redis, a short-lived WebSocket connection (with a timeout) is established between the API Gateway and the client. The correlation ID status is marked as having an "active connection" in Redis.
When the email response is generated and the correlation ID status changes to "completed," a message is published to Redis Pub/Sub, triggering the read operation that delivers the response to the waiting client.
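The publishing side of this hand-off can be a single call; the channel name and message shape below are assumptions made for illustration:

```python
# Sketch of the completion notification over Redis Pub/Sub (redis-py). The
# channel name and message shape are assumptions for illustration.
import json
import redis

r = redis.Redis(decode_responses=True)

def publish_completion(correlation_id: str) -> None:
    # Called once the response is stored and the status is set to "completed".
    r.publish("email-response-events", json.dumps({
        "correlation_id": correlation_id,
        "status": "completed",
    }))
```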
Redis Cache Miss
We've implemented a Redis-managed database lookup approach for cache misses. When data isn't found in Redis, the DB Read-Through Manager in Redis detects the cache miss and directly queries the database. The database then returns results to Redis, which forwards these results to the Query Service without caching them.
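Assuming the DB Read-Through Manager is a thin service colocated with the Redis cluster, its fallback logic could look roughly like the sketch below; fetch_from_database is a hypothetical stand-in for the actual database query:

```python
# Sketch of the read-through fallback on a cache miss. fetch_from_database is
# a hypothetical stand-in for the real database query; the key layout mirrors
# the status-tracking sketch above.
import redis

r = redis.Redis(decode_responses=True)

def fetch_from_database(correlation_id: str):
    # Hypothetical: query the long-term store for a previously persisted response.
    raise NotImplementedError

def read_through(correlation_id: str):
    cached = r.hgetall(f"email:{correlation_id}")
    if cached:
        return cached
    # Cache miss: query the database directly and hand the result back to the
    # Query Service without writing it into Redis.
    return fetch_from_database(correlation_id)
```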
We chose to have Redis handle database lookups for cache misses rather than having the Query Service do this work. This design creates a centralized caching strategy where all data-access decisions are handled in one place. The Query Service remains focused on serving clients rather than managing data-access logic, creating a cleaner separation of responsibilities. The Redis layer can also maintain predictive algorithms to pre-load emails from the database, as well as optimized database connection pools, providing better resource utilization across the system.
We passed on the more traditional approach, in which the Query Service handles database fallbacks. That approach has its own merits: Redis would remain purely a cache without needing database credentials, the Query Service would have complete autonomy and control over read operations, and the data-access paths would be more explicit in the application code rather than partially handled in infrastructure.
Connection Monitor and Active Connection Tracking
The Connection Monitor component lives within our Query Service, functioning as our system's notification hub. It maintains a direct channel to the API Gateway, allowing it to push real-time updates to "active connections" without requiring them to poll for changes.
For its intelligence, the Connection Monitor relies on the Active Connection Tracking module in our Redis HA Cluster. This tracking system maintains a living registry of the connected clients and what updates they're authorized to receive.
When the Redis Pub/Sub channels broadcast status changes or operation completions, the Connection Monitor consults the Active Connection Tracking registry to determine precisely which connected clients should receive these updates, then routes them accordingly through the API Gateway.
This partnership ensures our system delivers timely notifications only to relevant active connections, optimizing both bandwidth usage and processing resources within our distributed architecture.
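A minimal sketch of that loop, assuming redis-py, a hypothetical push_to_gateway helper for the API Gateway hop, and the correlation-ID hash introduced earlier as the registry:

```python
# Sketch of the Connection Monitor loop inside the Query Service (redis-py).
# push_to_gateway and the registry fields are assumptions for illustration.
import json
import redis

r = redis.Redis(decode_responses=True)

def push_to_gateway(connection_id: str, payload: dict) -> None:
    # Hypothetical: forward the payload to the API Gateway, which holds the
    # short-lived WebSocket back to the client.
    raise NotImplementedError

def run_connection_monitor() -> None:
    pubsub = r.pubsub()
    pubsub.subscribe("email-response-events")
    for message in pubsub.listen():
        if message["type"] != "message":
            continue
        event = json.loads(message["data"])
        correlation_id = event["correlation_id"]
        # Active Connection Tracking: only clients still waiting have a
        # connection_id recorded against their correlation ID.
        connection_id = r.hget(f"email:{correlation_id}", "connection_id")
        if connection_id:
            response = r.hget(f"email:{correlation_id}", "response")
            push_to_gateway(connection_id, {
                "correlation_id": correlation_id,
                "response": response,
            })
```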
Priority Manager
When a user opens an email (email open event) and no pre-generated response is found in Redis, this absence can occur for one of two reasons:
Scenario 1: A request against a given correlation ID is being sent for the first time.
Scenario 2: A request against a given correlation ID was already sent asynchronously to Kafka and is queued or in process.
Let's consider scenario 2. If generation is already in progress, then we simply wait for the response. However, if generation has not yet started, then the priority has to be raised to high (live requests). One cannot change the priority of a message that is already sitting in a Kafka topic, so this is handled by the priority manager. Since individual messages cannot be deleted from a Kafka topic, the priority manager republishes the correlation ID to the high-priority topic and marks the low-priority copy as superseded, so that the low-priority consumer skips it. This step reduces the latency of our system.
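A sketch of that upgrade path, combining the Redis status check with a republish to the high-priority topic (topic names and status values are again assumptions):

```python
# Sketch of the priority upgrade. A message already sitting in the low-priority
# topic cannot be deleted, so we mark it as superseded in Redis and republish
# the correlation ID to the high-priority topic. Names are assumptions.
import json
import redis
from kafka import KafkaProducer

r = redis.Redis(decode_responses=True)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def upgrade_priority(correlation_id: str) -> None:
    status = r.hget(f"email:{correlation_id}", "status")
    if status in ("generating", "evaluating", "completed"):
        return  # already in progress or done; nothing to upgrade
    # Mark the queued low-priority message as superseded so its consumer skips it ...
    r.hset(f"email:{correlation_id}", mapping={"status": "upgraded"})
    # ... and republish on the high-priority topic for immediate processing.
    producer.send("email-responses-high-priority", value={"correlation_id": correlation_id})
```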
Email Complexity
We use another trick to reduce overall system latency. When an asynchronous request comes in, we first check the complexity of the email thread. We would prefer complex email threads to be processed first, so that their responses are more likely to be present when an email open event happens. The email complexity analysis happens only for asynchronous requests and not for live requests because we want to process all the live requests with high priority.
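The complexity check itself can be as simple as a weighted heuristic over thread features; the signals and weights below are purely illustrative:

```python
# A toy complexity score used only to order asynchronous processing. The
# signals and weights are illustrative assumptions, not tuned values.
def thread_complexity(thread: list[dict]) -> float:
    num_messages = len(thread)
    num_participants = len({m["sender"] for m in thread})
    total_words = sum(len(m["body"].split()) for m in thread)
    num_questions = sum(m["body"].count("?") for m in thread)
    return (1.0 * num_messages
            + 2.0 * num_participants
            + 0.01 * total_words
            + 1.5 * num_questions)

def order_for_processing(threads: list[list[dict]]) -> list[list[dict]]:
    # Complex threads go first so their responses are more likely to be ready
    # by the time the email-open event arrives.
    return sorted(threads, key=thread_complexity, reverse=True)
```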
Response Generation
There are five services involved in generating an email response.
Data Retrieval: The job of this service is to retrieve requisite data from databases, which is used in prompts to generate relevant responses. One can use a large language model (LLM) to generate SQL queries via a prompt, if needed.
Thread Retrieval: It's a RAG (Retrieval Augmented Generation) based service to get top k similar email threads (historical data). The similar threads are fed to the prompt during response generation.
Response Generation: One can either fine-tune an LLM (if historical data is available) or simply write a prompt for a large language model (LLM) to generate the email. As shown in the diagram, this service can use either closed-source models such as those from OpenAI or open-source models such as Mixtral to craft appropriate responses based on the retrieved data and context.
Response Evaluation: An LLM is used as a judge: it evaluates the response, attaches feedback if the quality is low, and can trigger response regeneration. One can fix a maximum number of retries (for instance, 3); a sketch of this loop follows the list. Evaluation is not done for live requests, to keep latency low. This quality-control step helps maintain consistent communication standards. Because we skip evaluation during live requests, complexity-based prioritization of asynchronous requests becomes all the more important.
Notification: This service notifies the other architectural components of the final response, ensuring that the generated response is properly delivered back through the system to the waiting client application and completing the request-response cycle.
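Here is a rough sketch of the generate-then-judge retry loop mentioned above; generate and judge are placeholders for whichever model calls the Response Generation and Response Evaluation services actually make:

```python
# Sketch of the generate-then-judge retry loop used for asynchronous requests.
# generate() and judge() are placeholders for the actual LLM calls made by the
# Response Generation and Response Evaluation services.
MAX_RETRIES = 3

def generate(email_thread: str, retrieved_data: str, similar_threads: list[str],
             feedback: str = "") -> str:
    # Placeholder: in practice this prompts (or calls a fine-tuned) LLM with the
    # thread, retrieved data, top-k similar threads, and any judge feedback.
    return f"Draft reply to: {email_thread[:40]}..."

def judge(email_thread: str, draft: str) -> tuple[bool, str]:
    # Placeholder: in practice an LLM-as-judge scores the draft and returns
    # (acceptable, feedback). Here we accept everything.
    return True, ""

def generate_response(email_thread: str, retrieved_data: str,
                      similar_threads: list[str]) -> str:
    feedback = ""
    draft = ""
    for _ in range(MAX_RETRIES):
        draft = generate(email_thread, retrieved_data, similar_threads, feedback)
        acceptable, feedback = judge(email_thread, draft)
        if acceptable:
            break
    return draft  # best effort once the retry budget is exhausted
```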
A workflow orchestrator (Temporal/Airflow) manages the entire flow and each service updates Redis to ensure that the state of each correlation ID is tracked in the system. The orchestrator is essential for maintaining reliable email processing by automatically handling service failures, managing retries, and providing end-to-end visibility into each email's journey through our distributed pipeline. Each service in this pipeline is designed to be modular, allowing for independent scaling and updates as requirements change or as LLM technology evolves.
Command Query Responsibility Segregation
For easy tracking, we use the CQRS (Command Query Responsibility Segregation) pattern, where all write operations to Redis happen via the Command Service and all read operations happen via the Query Service. The Command Service also subscribes to Redis Pub/Sub to write completed responses to a database for long-term storage (a sketch of this split follows the list of advantages below).
The read/write separation offers several key advantages:
It allows us to optimize read and write operations independently based on their different workload patterns
It reduces system contention by preventing read operations from blocking writes and vice versa
The design simplifies debugging, monitoring, and maintenance as responsibilities are clearly separated
It improves security by allowing more granular access control for different operation types
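A minimal sketch of that split, assuming redis-py and a hypothetical persist_to_database helper for the long-term store:

```python
# Sketch of the CQRS split: the Command Service owns all Redis writes and
# persists completed responses, while the Query Service only reads.
# persist_to_database and the channel name are assumptions for illustration.
import json
import redis

def persist_to_database(correlation_id: str, response) -> None:
    # Hypothetical: write the final response to the long-term store.
    raise NotImplementedError

class CommandService:
    def __init__(self) -> None:
        self.redis = redis.Redis(decode_responses=True)

    def write_status(self, correlation_id: str, status: str) -> None:
        self.redis.hset(f"email:{correlation_id}", mapping={"status": status})

    def persist_completed_responses(self) -> None:
        # Subscribes to Pub/Sub and copies completed responses to long-term storage.
        pubsub = self.redis.pubsub()
        pubsub.subscribe("email-response-events")
        for message in pubsub.listen():
            if message["type"] != "message":
                continue
            event = json.loads(message["data"])
            response = self.redis.hget(f"email:{event['correlation_id']}", "response")
            persist_to_database(event["correlation_id"], response)

class QueryService:
    def __init__(self) -> None:
        self.redis = redis.Redis(decode_responses=True)

    def read_response(self, correlation_id: str):
        # Read-only path used when a client opens an email.
        return self.redis.hget(f"email:{correlation_id}", "response")
```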
Conclusion
The Agentic GenAI Email Service architecture provides an efficient solution to process high email volumes by combining asynchronous pre-processing with smart prioritization. The system's distributed nature ensures scalability, while CQRS principles enable independent optimization of read and write operations. As LLM capabilities continue to evolve, this architecture offers a flexible foundation that can adapt to changing requirements while maintaining enterprise-grade performance and reliability.