Designing Common Systems: Design a Notification System

📌 Question

"Design a system that sends notifications to users across channels such as email, SMS, and push notifications."

This is a highly practical system design interview question. It tests your ability to build modular, asynchronous, and fault-tolerant architectures that scale and adapt to diverse communication channels and user needs.

✅ Solution

1. Clarify the Requirements

Functional Requirements:

Deliver notifications based on events like friend requests, promotions, messages, or alerts.
Support multiple channels: email, push notifications, and SMS.
Allow users to set preferences: opt in/out by channel and notification type.
Support real-time and scheduled notifications.
Ensure delivery with retry logic and fallback mechanisms.

Non-Functional Requirements:

Scale to millions of notifications daily.
Ensure low-latency delivery for time-sensitive messages.
Maintain high availability and fault tolerance.
Include observability: logs, metrics, and traceability.

2. High-Level Architecture Overview

Notification Trigger Service: Accepts events from upstream services.
Queueing Layer: Buffers messages using a queueing system like Kafka, RabbitMQ, or AWS SQS.
Notification Worker Pool: Processes jobs from the queue, checks preferences, applies logic, and routes to the correct channel dispatcher.
Channel Dispatchers: Send the final message via email, SMS, or push (e.g., SendGrid, Twilio, FCM).
User Preferences Service: Stores notification preferences and quiet-hour settings.
Delivery Tracker & Logger: Logs message status (sent, failed, retried, delivered).
Monitoring Dashboard: Alerts on queue size, failure rates, latency spikes.
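To make the "pluggable dispatcher" idea concrete, here is a minimal sketch of how workers might route to channel dispatchers through a shared interface. All names (Notification, Dispatcher, RecordingDispatcher, DispatcherRegistry) are hypothetical, and the recording dispatcher stands in for a real provider client such as SendGrid or Twilio.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Notification:
    user_id: str
    channel: str  # "email", "sms", or "push"
    body: str


class Dispatcher(Protocol):
    """Interface every channel dispatcher is assumed to implement."""

    def send(self, notification: Notification) -> bool: ...


class RecordingDispatcher:
    """Stand-in for a provider client: records messages instead of sending."""

    def __init__(self) -> None:
        self.sent: list[Notification] = []

    def send(self, notification: Notification) -> bool:
        self.sent.append(notification)
        return True


class DispatcherRegistry:
    """Maps a channel name to the dispatcher that handles it."""

    def __init__(self) -> None:
        self._by_channel: dict[str, Dispatcher] = {}

    def register(self, channel: str, dispatcher: Dispatcher) -> None:
        self._by_channel[channel] = dispatcher

    def route(self, notification: Notification) -> bool:
        dispatcher = self._by_channel.get(notification.channel)
        if dispatcher is None:
            return False  # unknown channel: the caller decides what to do
        return dispatcher.send(notification)
```

With this shape, adding a new channel would mean registering one more dispatcher, without touching the worker code.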

3. Event Flow Example

A user receives a direct message → the messaging service emits a notification trigger event.
The event is added to a queue.
A worker dequeues it and checks the user’s notification preferences.
If the user has opted in, the message is formatted and routed to the right dispatcher.
The dispatcher sends it using an external provider (e.g., SendGrid).
Status is logged, and retry logic kicks in if the send fails.
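The flow above can be sketched with an in-memory queue. This is only an illustration: PREFERENCES, send_via_provider, and status_log are hypothetical stand-ins for the preferences service, a provider client, and the delivery tracker, and retries are left out to keep the flow readable.

```python
import queue
from dataclasses import dataclass


@dataclass
class Event:
    user_id: str
    kind: str     # e.g. "message"
    channel: str  # e.g. "push"
    text: str


# Hypothetical preference data: (user_id, channel) -> opted in?
PREFERENCES = {("alice", "push"): True, ("bob", "push"): False}

# Hypothetical delivery log of (user_id, status) pairs.
status_log: list[tuple[str, str]] = []


def send_via_provider(event: Event) -> bool:
    """Placeholder for a real provider call; always succeeds here."""
    return True


def process_one(events: "queue.Queue[Event]") -> None:
    """Dequeue one event, check preferences, send, and log the outcome."""
    event = events.get()
    if not PREFERENCES.get((event.user_id, event.channel), False):
        status_log.append((event.user_id, "skipped"))
        return
    ok = send_via_provider(event)
    status_log.append((event.user_id, "sent" if ok else "failed"))
```

In a real deployment the queue would be Kafka, RabbitMQ, or SQS, and a "failed" status would hand the event to the retry path.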

4. Key Design Components

Notification Types

Transactional (e.g., receipts, security alerts)
Informational (e.g., friend activity)
Marketing (e.g., promotions)

Each type may have different user controls and urgency levels.

Channel Selection Logic

Check user preferences
Determine fallback strategy if one channel fails (e.g., SMS if push fails)
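A minimal sketch of this selection logic, assuming a fallback-only strategy (try channels one at a time in priority order). The function names and the Sender callable type are hypothetical.

```python
from typing import Callable, Optional

# A sender takes (user_id, body) and reports whether the send succeeded.
Sender = Callable[[str, str], bool]


def choose_channels(preferred_order: list[str], opted_in: set[str]) -> list[str]:
    """Keep the priority order, dropping channels the user opted out of."""
    return [channel for channel in preferred_order if channel in opted_in]


def send_with_fallback(
    user_id: str,
    body: str,
    channels: list[str],
    senders: dict[str, Sender],
) -> Optional[str]:
    """Try each channel in order; return the one that worked, or None."""
    for channel in channels:
        if senders[channel](user_id, body):
            return channel
    return None
```

For example, with the order push, SMS, email and a user who opted out of email, a failed push would fall through to SMS and stop there.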

Message Formatting

Use templates based on channel, language, and content
Support personalization (e.g., user's name, offer codes)
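One simple way to illustrate templating is Python's standard string.Template. The template store, its keys, and the English fallback rule below are assumptions made for the sketch, not a prescribed design.

```python
from string import Template

# Hypothetical template store keyed by (notification type, channel, language).
TEMPLATES = {
    ("promotion", "email", "en"): Template("Hi $name, use code $code for 10% off."),
    ("promotion", "sms", "en"): Template("$name: code $code = 10% off"),
    ("promotion", "email", "es"): Template("Hola $name, tu código es $code."),
}


def render(kind: str, channel: str, language: str, **fields: str) -> str:
    """Render the template for this type, channel, and language.

    Falls back to English when no template exists for the language
    (an assumption of this sketch).
    """
    template = TEMPLATES.get((kind, channel, language))
    if template is None:
        template = TEMPLATES[(kind, channel, "en")]
    return template.substitute(**fields)


print(render("promotion", "email", "en", name="Ada", code="SPRING"))
# Hi Ada, use code SPRING for 10% off.
```

Keeping one template per channel lets the SMS version stay short while the email version carries more text.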

Retry Handling

Use exponential backoff for transient failures
Store messages in a Dead-Letter Queue if they permanently fail
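The sketch below shows one possible shape for this retry path. It assumes two hypothetical exception types to separate transient from permanent failures, adds random jitter on top of the exponential delay, and treats exhausted retries as a permanent failure. The dead-letter queue is modeled as a plain list.

```python
import random
import time
from typing import Callable


class TransientError(Exception):
    """A failure worth retrying, e.g. a timeout (hypothetical name)."""


class PermanentError(Exception):
    """A failure retrying cannot fix, e.g. an invalid address (hypothetical name)."""


def send_with_retry(
    send: Callable[[str], None],
    message: str,
    dead_letters: list[str],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Send with exponential backoff; dead-letter the message on final failure."""
    for attempt in range(max_attempts):
        try:
            send(message)
            return True
        except PermanentError:
            break  # retrying will not help
        except TransientError:
            if attempt == max_attempts - 1:
                break  # retries exhausted
            delay = base_delay * 2 ** attempt  # 0.5, 1.0, 2.0, ...
            # Jitter keeps many workers from retrying at the same instant.
            sleep(delay + random.uniform(0, delay / 2))
    dead_letters.append(message)
    return False
```

The sleep parameter is injectable so the function can be exercised without actually waiting. A production worker would more likely re-enqueue the message with a delay than block a thread.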

5. User Preferences and Quiet Hours

Allow users to customize:

What types of notifications they want
Which channels to use
When they want to receive messages (quiet hours)

Use a fast-access data store (e.g., Redis, DynamoDB) to check preferences at send time
Batch updates to avoid spamming users (e.g., daily digest for promotions)
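A quiet-hours check has one subtle case: a window such as 22:00 to 07:00 wraps past midnight. Here is a minimal sketch, assuming the user's time zone is available as a tzinfo object. The example uses a fixed UTC offset to stay self-contained; a real system would more likely pass zoneinfo.ZoneInfo objects so daylight saving time is handled.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_quiet_hours(now_utc: datetime, user_tz: tzinfo, start: time, end: time) -> bool:
    """Return True if the user's local time falls inside [start, end)."""
    local = now_utc.astimezone(user_tz).time()
    if start <= end:
        return start <= local < end
    # The window wraps past midnight, e.g. 22:00 to 07:00.
    return local >= start or local < end


# Fixed UTC-5 offset, used for the example only.
utc_minus_5 = timezone(timedelta(hours=-5))
late_night = datetime(2024, 1, 15, 4, 30, tzinfo=timezone.utc)  # 23:30 local
print(in_quiet_hours(late_night, utc_minus_5, time(22, 0), time(7, 0)))  # True
```

A worker that gets True here could either drop the notification or defer it until the quiet window ends, depending on the notification type.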

6. Scheduling and Delays

Support:

Delayed delivery (e.g., reminders)
Scheduled sends (e.g., 9am local time)
Time zone awareness

Use a scheduling service or delayed queue with a job execution engine (e.g., Celery, AWS EventBridge, cron-like scheduler).
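As an illustration, the sketch below computes the next occurrence of a local send time in UTC and keeps pending jobs in a small in-memory delayed queue. Both next_local_send_time and DelayedQueue are hypothetical names, and the heap stands in for whatever scheduler or delayed queue a real system would use.

```python
import heapq
from datetime import datetime, time, timedelta, timezone, tzinfo


def next_local_send_time(now_utc: datetime, user_tz: tzinfo, local_time: time) -> datetime:
    """Return the next occurrence of local_time in the user's zone, as UTC."""
    local_now = now_utc.astimezone(user_tz)
    candidate = datetime.combine(local_now.date(), local_time, tzinfo=user_tz)
    if candidate <= local_now:
        # Today's slot has already passed, so schedule for tomorrow.
        tomorrow = local_now.date() + timedelta(days=1)
        candidate = datetime.combine(tomorrow, local_time, tzinfo=user_tz)
    return candidate.astimezone(timezone.utc)


class DelayedQueue:
    """Min-heap of (due time, insertion order, job id)."""

    def __init__(self) -> None:
        self._heap: list[tuple[datetime, int, str]] = []
        self._counter = 0  # tie-breaker so equal due times pop in FIFO order

    def schedule(self, due_utc: datetime, job_id: str) -> None:
        heapq.heappush(self._heap, (due_utc, self._counter, job_id))
        self._counter += 1

    def pop_due(self, now_utc: datetime) -> list[str]:
        """Remove and return every job whose due time has arrived."""
        due = []
        while self._heap and self._heap[0][0] <= now_utc:
            due.append(heapq.heappop(self._heap)[2])
        return due
```

A scheduler loop would call pop_due periodically and push the returned jobs onto the main notification queue.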

7. Scaling Considerations

Horizontally scale worker pools to meet throughput demands.
Use message deduplication techniques (e.g., UUIDs or idempotency keys).
Rate-limit API calls to third-party providers to avoid throttling.
Implement bulk-sending strategies for high-volume campaigns.
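Two of these points can be sketched briefly: deduplication with an idempotency key, and client-side rate limiting with a token bucket. The key format, class names, and in-memory storage are assumptions. A production system would keep seen keys in a shared store with a TTL.

```python
import hashlib


def idempotency_key(user_id: str, event_id: str, channel: str) -> str:
    """Derive a stable key so the same event is sent once per user and channel."""
    raw = f"{user_id}:{event_id}:{channel}".encode()
    return hashlib.sha256(raw).hexdigest()


class Deduplicator:
    """Remembers keys it has seen (in memory, for the sketch only)."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def first_time(self, key: str) -> bool:
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The bucket takes the current time as an argument, so a caller would pass time.monotonic() in practice and can pass fixed values when testing.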

8. Monitoring & Observability

Track:

Number of notifications sent per channel
Delivery success/failure rates
Queue depth and processing latency
Retry counts

Alert on spikes, drop-offs, excessive retries, and service degradation. Visualize these metrics on dashboards (e.g., Grafana, Datadog).
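As a rough illustration of per-channel tracking, the sketch below counts outcomes in memory and flags channels whose failure rate crosses a threshold. DeliveryMetrics and its methods are hypothetical. A real system would export counters to a metrics backend and define the alert rule there.

```python
from collections import Counter


class DeliveryMetrics:
    """Counts (channel, status) outcomes and derives failure rates."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, channel: str, status: str) -> None:
        self.counts[(channel, status)] += 1

    def failure_rate(self, channel: str) -> float:
        sent = self.counts[(channel, "sent")]
        failed = self.counts[(channel, "failed")]
        total = sent + failed
        return failed / total if total else 0.0

    def channels_over(self, threshold: float) -> list[str]:
        """Channels whose failure rate exceeds the alert threshold."""
        channels = {channel for channel, _ in self.counts}
        return sorted(c for c in channels if self.failure_rate(c) > threshold)
```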

9. Trade-Offs to Discuss

Decision Area	Options / Trade-offs
Delivery Guarantees	At-least-once (with retries) vs. at-most-once (no retry)
Channel Priority	Parallel dispatch vs. fallback-only approach
Queueing Model	One global queue vs. per-channel queues
User Preferences	Real-time check vs. precomputed config
Message Status Tracking	Lightweight logs vs. full delivery pipeline and callbacks
Failure Recovery	Retry queues, DLQ, circuit breakers
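Of the failure-recovery options in the table, the circuit breaker is the least self-explanatory, so here is a minimal sketch. It assumes a simple consecutive-failure threshold and a fixed cooldown, after which a trial request is let through. The class and parameter names are hypothetical.

```python
from typing import Optional


class CircuitBreaker:
    """Stops calling a failing provider until a cooldown has passed."""

    def __init__(self, failure_threshold: int, reset_after: float) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # cooldown in seconds
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        # Open: block until the cooldown elapses, then allow a trial call.
        return now - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # open, or re-open after a failed trial
```

While the breaker for one provider is open, the worker could route to a fallback channel or park messages in a retry queue instead of repeatedly calling a provider that is down.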

10. What Interviewers Look For

Ability to break down a multi-component asynchronous system
Thoughtfulness in user experience (preferences, quiet hours)
Understanding of trade-offs in delivery reliability
Scalability and real-world feasibility
Observability, monitoring, and logging awareness
Graceful degradation under partial failures

✅ Summary

A notification system must balance performance, personalization, and reliability. A well-architected design includes:

Queue-based asynchronous processing
Pluggable, channel-specific dispatchers
Preference-aware delivery logic
Scalable workers and retry mechanisms
Observability and monitoring to catch failures early