📌 Question
"Design a system that sends notifications to users across channels such as email, SMS, and push notifications."
This is a highly practical system design interview question. It tests your ability to build modular, asynchronous, and fault-tolerant architectures that scale and adapt to diverse communication channels and user needs.
✅ Solution
1. Clarify the Requirements
Functional Requirements:
- Deliver notifications based on events like friend requests, promotions, messages, or alerts.
- Support multiple channels: email, push notifications, and SMS.
- Allow users to set preferences: opt in/out by channel and notification type.
- Support real-time and scheduled notifications.
- Ensure delivery with retry logic and fallback mechanisms.
Non-Functional Requirements:
- Scale to millions of notifications daily.
- Ensure low-latency delivery for time-sensitive messages.
- Maintain high availability and fault tolerance.
- Include observability: logs, metrics, and traceability.
2. High-Level Architecture Overview
- Notification Trigger Service: Accepts events from upstream services.
- Queueing Layer: Buffers events in a message broker such as Kafka, RabbitMQ, or Amazon SQS.
- Notification Worker Pool: Processes jobs from the queue, checks preferences, applies logic, and routes to the correct channel dispatcher.
- Channel Dispatchers: Send the final message through an external provider for email, SMS, or push (e.g., SendGrid, Twilio, FCM).
- User Preferences Service: Stores notification preferences and quiet-hour settings.
- Delivery Tracker & Logger: Logs message status (sent, failed, retried, delivered).
- Monitoring Dashboard: Alerts on queue size, failure rates, latency spikes.
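To make the "pluggable dispatcher" idea concrete, here is a minimal sketch of how channel dispatchers could sit behind one interface. The names `Notification`, `Dispatcher`, `RecordingDispatcher`, and `DispatcherRegistry` are illustrative assumptions, and the recording dispatcher is only a stand-in for a real provider client.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Notification:
    user_id: str
    channel: str  # "email", "sms", or "push"
    body: str


class Dispatcher(Protocol):
    def send(self, notification: Notification) -> bool:
        """Return True if the provider accepted the message."""
        ...


class RecordingDispatcher:
    """Hypothetical stand-in for a provider client; it only records sends."""

    def __init__(self) -> None:
        self.sent: list[Notification] = []

    def send(self, notification: Notification) -> bool:
        self.sent.append(notification)
        return True


class DispatcherRegistry:
    """Maps a channel name to the dispatcher responsible for it."""

    def __init__(self) -> None:
        self._dispatchers: dict[str, Dispatcher] = {}

    def register(self, channel: str, dispatcher: Dispatcher) -> None:
        self._dispatchers[channel] = dispatcher

    def dispatch(self, notification: Notification) -> bool:
        dispatcher = self._dispatchers.get(notification.channel)
        if dispatcher is None:
            return False  # no dispatcher registered for this channel
        return dispatcher.send(notification)
```

With this shape, adding a new channel means registering one more dispatcher, and the worker code that calls `dispatch` does not need to change.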
3. Event Flow Example
- A message is sent to a user → a notification trigger event is emitted.
- The event is added to a queue.
- A worker dequeues it and checks the user’s notification preferences.
- If the user has opted in, the message is formatted and routed to the right dispatcher.
- The dispatcher sends it using an external provider (e.g., SendGrid).
- Status is logged, and retry logic kicks in if the send fails.
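The flow above can be sketched as a single worker step. This is a simplified, in-process illustration that uses Python's standard `queue.Queue` in place of a real broker; the event fields (`id`, `user_id`, `channel`, `sender`) and the `is_opted_in` and `send` callables are assumptions made for the example.

```python
import queue
from typing import Callable


def process_next(
    events: queue.Queue,
    is_opted_in: Callable[[str, str], bool],
    send: Callable[[str, str, str], bool],
    status_log: list,
) -> None:
    """Handle one queued event: check preferences, format, dispatch, log."""
    event = events.get_nowait()  # raises queue.Empty if nothing is queued
    user_id, channel = event["user_id"], event["channel"]

    if not is_opted_in(user_id, channel):
        status_log.append((event["id"], "skipped"))
        return

    body = f"New message from {event['sender']}"
    if send(user_id, channel, body):
        status_log.append((event["id"], "sent"))
    else:
        status_log.append((event["id"], "failed"))
        # Naive requeue for illustration; a real system would bound the
        # number of retries and delay them.
        events.put(event)
```

In a real deployment the queue would be a broker consumer and `send` would call a channel dispatcher, but the order of the steps stays the same.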
4. Key Design Components
Notification Types
- Transactional (e.g., receipts, security alerts)
- Informational (e.g., friend activity)
- Marketing (e.g., promotions)
Each type may have different user controls and urgency levels.
Channel Selection Logic
- Check user preferences
- Determine fallback strategy if one channel fails (e.g., SMS if push fails)
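One way to express this selection logic is an ordered list of channels that is filtered by the user's preferences and tried in turn. The sketch below assumes a hypothetical `send_with_fallback` helper and simple sender callables that return `True` on success.

```python
from typing import Callable, Optional


def send_with_fallback(
    preferred_order: list[str],
    enabled_channels: set[str],
    senders: dict[str, Callable[[str], bool]],
    message: str,
) -> Optional[str]:
    """Try channels in order, skipping any the user disabled.

    Returns the channel that accepted the message, or None if all failed.
    """
    for channel in preferred_order:
        if channel not in enabled_channels:
            continue  # user opted out of this channel
        sender = senders.get(channel)
        if sender is not None and sender(message):
            return channel
    return None
```

For example, with the order `["push", "sms", "email"]`, a failed push send falls through to SMS only if the user has SMS enabled.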
Message Formatting
- Use templates based on channel, language, and content
- Support personalization (e.g., user's name, offer codes)
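As a small illustration of template lookup by channel and language, the sketch below uses Python's standard `string.Template`. The template texts, the `TEMPLATES` table, and the fall-back-to-English rule are assumptions for the example; a production system would more likely load templates from a store or a dedicated templating engine.

```python
from string import Template

# Hypothetical templates keyed by (channel, language).
TEMPLATES = {
    ("email", "en"): Template("Hi $name, your code $offer_code is ready."),
    ("email", "es"): Template("Hola $name, tu código $offer_code está listo."),
    ("sms", "en"): Template("$name: use code $offer_code"),
}


def render(channel: str, language: str, **fields: str) -> str:
    """Render the template for a channel, falling back to English."""
    template = TEMPLATES.get((channel, language))
    if template is None:
        template = TEMPLATES[(channel, "en")]  # KeyError if channel is unknown
    # substitute() raises KeyError when a placeholder has no value, which
    # surfaces missing personalization data instead of sending "$name".
    return template.substitute(**fields)
```

Calling `render("sms", "fr", name="Ana", offer_code="X1")` would use the English SMS template here, because no French one is defined.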
Retry Handling
- Use exponential backoff for transient failures
- Store messages in a Dead-Letter Queue if they permanently fail
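A minimal sketch of these two rules, assuming the dispatcher signals failure with two hypothetical exception types (`TransientError` and `PermanentError`) and that the dead-letter queue is a plain list. The delays double on each attempt up to a cap; adding random jitter is a common refinement that is omitted here to keep the example deterministic.

```python
import time
from typing import Callable


class TransientError(Exception):
    """Failure worth retrying, e.g. a timeout or provider throttling."""


class PermanentError(Exception):
    """Failure that retrying cannot fix, e.g. an invalid address."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Delay after failed attempt number `attempt` (0-based), doubling up to `cap`."""
    return min(cap, base * (2 ** attempt))


def deliver(
    send: Callable[[str], None],
    message: str,
    dead_letters: list,
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Send with exponential backoff; park the message in the DLQ on give-up."""
    for attempt in range(max_attempts):
        try:
            send(message)
            return True
        except PermanentError:
            break  # retrying will not help
        except TransientError:
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))
    dead_letters.append(message)
    return False
```

Passing `sleep` as a parameter keeps the function easy to test and lets a worker swap in a delayed requeue instead of blocking.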
5. User Preferences and Quiet Hours
- Allow users to customize:
  - What types of notifications they want
  - Which channels to use
  - When they want to receive messages (quiet hours)
- Use a fast-access data store (e.g., Redis, DynamoDB) to check preferences at send time
- Batch low-urgency notifications to avoid spamming users (e.g., a daily digest for promotions)
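Quiet hours are easy to get wrong when the window crosses midnight (for example 22:00 to 07:00). Here is a small sketch of that check, assuming the window is stored as a start and end time in the user's local time and that `in_quiet_hours` is a hypothetical helper name.

```python
from datetime import time


def in_quiet_hours(now: time, start: time, end: time) -> bool:
    """Return True if `now` falls inside the quiet window [start, end)."""
    if start == end:
        return False  # assumption: an empty window means no quiet hours
    if start < end:
        # Same-day window, e.g. 13:00 to 14:00.
        return start <= now < end
    # Window wraps past midnight, e.g. 22:00 to 07:00.
    return now >= start or now < end
```

A worker could call this at send time and either drop the notification or defer it until the window ends, depending on the notification type.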
6. Scheduling and Delays
Support:
- Delayed delivery (e.g., reminders)
- Scheduled sends (e.g., 9am local time)
- Time zone awareness
Use a scheduling service or a delayed queue backed by a job execution engine (e.g., Celery, Amazon EventBridge, or a cron-like scheduler).
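To illustrate time zone awareness, the sketch below computes the next "9am local time" for a user and returns it in UTC, which is the form a scheduler or delayed queue would typically store. The function name `next_local_send` is an assumption; it accepts any `tzinfo`, so in practice you could pass `zoneinfo.ZoneInfo("America/New_York")` from the standard library.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def next_local_send(
    now_utc: datetime,
    user_tz: tzinfo,
    send_at: time = time(9, 0),
) -> datetime:
    """Return the next occurrence of `send_at` in the user's zone, as UTC."""
    local_now = now_utc.astimezone(user_tz)
    candidate = datetime.combine(local_now.date(), send_at, tzinfo=user_tz)
    if candidate <= local_now:
        # Today's slot has passed; build tomorrow's from the calendar date
        # so the wall-clock time stays fixed across offset changes.
        tomorrow = local_now.date() + timedelta(days=1)
        candidate = datetime.combine(tomorrow, send_at, tzinfo=user_tz)
    return candidate.astimezone(timezone.utc)
```

Storing the result in UTC keeps the scheduler itself time zone agnostic; only this conversion step needs to know the user's zone.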
7. Scaling Considerations
- Horizontally scale worker pools to meet throughput demands.
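- Deduplicate messages by attaching an idempotency key (e.g., a UUID generated once per event) and checking it before sending, so retries and redelivered queue messages do not notify the user twice.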
- Rate-limit API calls to third-party providers to avoid throttling.
- Implement bulk-sending strategies for high-volume campaigns.
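Two of these points lend themselves to short sketches: deduplication by idempotency key and rate limiting calls to a provider. Both classes below are in-memory illustrations with hypothetical names; in a multi-worker deployment the seen-key set and the token count would need to live in shared storage.

```python
import time
from typing import Callable


class Deduplicator:
    """Remembers idempotency keys so each notification is sent once."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def first_time(self, key: str) -> bool:
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec` tokens/second."""

    def __init__(
        self,
        rate_per_sec: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; otherwise report that the caller must wait."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

A worker would check `first_time(event_id)` before sending and call `try_acquire()` before each provider request, requeueing the job when the bucket is empty.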
8. Monitoring & Observability
Track:
- Number of notifications sent per channel
- Delivery success/failure rates
- Queue depth and processing latency
- Retry volume and dead-letter queue size
Visualize these metrics on dashboards (e.g., Grafana, Datadog) and alert on spikes, drop-offs, and service degradation.
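As a rough sketch of how per-channel delivery metrics might feed an alert, the class below counts outcomes in memory and computes a failure rate. The class name, the status labels, and the 5% threshold are assumptions for illustration; a real system would export these counters to a metrics backend instead.

```python
from collections import Counter


class DeliveryMetrics:
    """Counts delivery outcomes per channel and derives a failure rate."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, channel: str, status: str) -> None:
        """Record one outcome; `status` is "sent" or "failed" in this sketch."""
        self._counts[(channel, status)] += 1

    def failure_rate(self, channel: str) -> float:
        sent = self._counts[(channel, "sent")]
        failed = self._counts[(channel, "failed")]
        total = sent + failed
        return failed / total if total else 0.0

    def should_alert(self, channel: str, threshold: float = 0.05) -> bool:
        """True when the channel's failure rate exceeds the threshold."""
        return self.failure_rate(channel) > threshold
```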
9. Trade-Offs to Discuss
| Decision Area | Options / Trade-offs |
|---|---|
| Delivery Guarantees | At-least-once (with retries) vs. at-most-once (no retry) |
| Channel Priority | Parallel dispatch vs. fallback-only approach |
| Queueing Model | One global queue vs. per-channel queues |
| User Preferences | Real-time check vs. precomputed config |
| Message Status Tracking | Lightweight logs vs. full delivery pipeline and callbacks |
| Failure Recovery | Retry queues, DLQ, circuit breakers |
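Of the failure-recovery options in the table, the circuit breaker is the least self-explanatory, so here is a minimal sketch. It assumes a simple policy: open after a fixed number of consecutive failures, then allow a trial call once a cool-down has elapsed. The class name and default values are illustrative.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stops calling a failing provider until a cool-down period has passed."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a call to the provider may be attempted."""
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: only allow a trial call once the cool-down has elapsed.
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None  # close the circuit again

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed trial
```

While the breaker is open, a worker can skip the provider entirely and route to a fallback channel or a retry queue instead of piling up timeouts.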
10. What Interviewers Look For
- Ability to break down a multi-component asynchronous system
- Thoughtfulness in user experience (preferences, quiet hours)
- Understanding of trade-offs in delivery reliability
- Scalability and real-world feasibility
- Observability, monitoring, and logging awareness
- Graceful degradation under partial failures
✅ Summary
A notification system must balance performance, personalization, and reliability. A well-architected design includes:
- Queue-based asynchronous processing
- Pluggable, channel-specific dispatchers
- Preference-aware delivery logic
- Scalable workers and retry mechanisms
- Observability and monitoring to catch delivery failures early