Designing Common Systems: Design a Notification System

📌 Question

"Design a system that sends notifications to users across channels such as email, SMS, and push notifications."

This is a highly practical system design interview question. It tests your ability to build modular, asynchronous, and fault-tolerant architectures that scale and adapt to diverse communication channels and user needs.


✅ Solution

1. Clarify the Requirements

Functional Requirements:

  • Deliver notifications based on events like friend requests, promotions, messages, or alerts.
  • Support multiple channels: email, push notifications, and SMS.
  • Allow users to set preferences: opt in/out by channel and notification type.
  • Support real-time and scheduled notifications.
  • Improve delivery reliability with retry logic and fallback channels.

Non-Functional Requirements:

  • Scale to millions of notifications daily.
  • Ensure low-latency delivery for time-sensitive messages.
  • Maintain high availability and fault tolerance.
  • Include observability: logs, metrics, and traceability.

2. High-Level Architecture Overview

  • Notification Trigger Service: Accepts events from upstream services.
  • Queueing Layer: Buffers messages using a queueing system like Kafka, RabbitMQ, or AWS SQS.
  • Notification Worker Pool: Processes jobs from the queue, checks preferences, applies logic, and routes to the correct channel dispatcher.
  • Channel Dispatchers: Send the final message via email, SMS, or push through an external provider (e.g., SendGrid, Twilio, FCM).
  • User Preferences Service: Stores notification preferences and quiet-hour settings.
  • Delivery Tracker & Logger: Logs message status (sent, failed, retried, delivered).
  • Monitoring Dashboard: Alerts on queue size, failure rates, latency spikes.
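
As a rough sketch of how these components could fit together in code, the event shape and dispatcher interface might look like the following. The names NotificationEvent, ChannelDispatcher, InMemoryDispatcher, and DISPATCHERS, along with all field names, are assumptions made for illustration; a real dispatcher would wrap a provider client rather than record messages in memory.

```python
import uuid
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class NotificationEvent:
    """Event emitted by an upstream service (field names are assumptions)."""
    user_id: str
    notification_type: str  # e.g. "friend_request" or "promotion"
    payload: dict
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class ChannelDispatcher(Protocol):
    """Interface each channel-specific dispatcher is expected to implement."""
    channel: str

    def send(self, user_id: str, body: str) -> bool:
        ...


class InMemoryDispatcher:
    """Stand-in for a provider client; records messages instead of sending."""

    def __init__(self, channel: str) -> None:
        self.channel = channel
        self.sent: list = []

    def send(self, user_id: str, body: str) -> bool:
        self.sent.append((user_id, body))
        return True


# Registry a worker could use to route a message to the right channel.
DISPATCHERS = {name: InMemoryDispatcher(name) for name in ("email", "sms", "push")}
```

Keeping dispatchers behind one interface is what makes channels pluggable: adding a new channel means registering one more object, with no change to the workers.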

3. Event Flow Example

  1. One user sends another a message → the messaging service emits a notification trigger event.
  2. The event is added to a queue.
  3. A worker dequeues it and checks the user’s notification preferences.
  4. If the user has opted in, the message is formatted and routed to the right dispatcher; otherwise the event is dropped and logged as skipped.
  5. The dispatcher sends it using an external provider (e.g., SendGrid).
  6. Status is logged, and retry logic kicks in if the send fails.
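
The flow above can be sketched with an in-process queue. This is a minimal illustration, assuming a single email channel; PREFERENCES, send_email, and the event fields are hypothetical stand-ins, and retry handling is left out to keep the example short.

```python
import queue

# Hypothetical in-memory stand-ins for the preference store and delivery log.
PREFERENCES = {"alice": {"email": True}, "bob": {"email": False}}
delivery_log = []


def send_email(user_id: str, body: str) -> bool:
    """Pretend provider call; a real dispatcher would call an external API."""
    return True


def process(event: dict) -> str:
    """Steps 3 to 6: check preferences, format, dispatch, log the status."""
    if not PREFERENCES.get(event["user_id"], {}).get("email", False):
        status = "skipped"
    else:
        body = f"New message from {event['sender']}"
        status = "sent" if send_email(event["user_id"], body) else "failed"
    delivery_log.append((event["user_id"], status))
    return status


events = queue.Queue()
events.put({"user_id": "alice", "sender": "bob"})  # steps 1 and 2
events.put({"user_id": "bob", "sender": "alice"})

while not events.empty():  # a worker draining the queue
    process(events.get())
```

In a production system the in-process queue would be replaced by a broker such as those named in section 2, and several workers would consume from it concurrently.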

4. Key Design Components

Notification Types

  • Transactional (e.g., receipts, security alerts)
  • Informational (e.g., friend activity)
  • Marketing (e.g., promotions)

Each type may have different user controls and urgency levels.

Channel Selection Logic

  • Check user preferences
  • Determine fallback strategy if one channel fails (e.g., SMS if push fails)
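
One possible shape for the fallback strategy is to try channels in priority order and stop at the first success. The function deliver_with_fallback and its parameters are assumptions for this sketch, not a prescribed API.

```python
from typing import Callable, Optional

Sender = Callable[[str, str], bool]


def deliver_with_fallback(
    user_id: str,
    body: str,
    channel_order: list,
    enabled: set,
    senders: dict,
) -> Optional[str]:
    """Try channels in priority order; return the channel that succeeded."""
    for channel in channel_order:
        if channel not in enabled:
            continue  # the user opted out of this channel
        if senders[channel](user_id, body):
            return channel
    return None  # every permitted channel failed


# Usage: push fails, so delivery falls back to SMS; email is never tried
# because the user has not enabled it.
senders = {
    "push": lambda user_id, body: False,
    "sms": lambda user_id, body: True,
    "email": lambda user_id, body: True,
}
used = deliver_with_fallback(
    "u1", "Your code is 1234", ["push", "sms", "email"], {"push", "sms"}, senders
)
```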

Message Formatting

  • Use templates based on channel, language, and content
  • Support personalization (e.g., user's name, offer codes)
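
A small sketch of template selection and personalization, using Python's standard string.Template. The TEMPLATES table, its keys, and the render function are hypothetical; a real system would likely load templates from a store and use a richer templating engine.

```python
from string import Template

# Hypothetical template table keyed by (channel, language).
TEMPLATES = {
    ("email", "en"): Template("Hi $name, your offer code $code is ready."),
    ("sms", "en"): Template("$name: code $code"),
    ("email", "es"): Template("Hola $name, tu código $code está listo."),
}


def render(channel: str, language: str, **fields: str) -> str:
    """Pick the template for the channel and language, defaulting to English."""
    template = TEMPLATES.get((channel, language)) or TEMPLATES[(channel, "en")]
    # safe_substitute leaves unknown placeholders in place instead of raising.
    return template.safe_substitute(**fields)
```

For example, render("sms", "fr", name="Ana", code="X1") has no French SMS template to use, so it falls back to the English one.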

Retry Handling

  • Use exponential backoff for transient failures
  • Move messages that fail permanently, or that exhaust their retries, to a dead-letter queue (DLQ) for inspection
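
A minimal sketch of this retry policy, assuming failures can be classified as transient or permanent. TransientError, PermanentError, backoff_delay, and send_with_retry are names invented for the example, and the dead-letter queue is modeled as a plain list. The delay uses exponential backoff with random jitter, a common variant that keeps many workers from retrying in lockstep.

```python
import random
import time
from typing import Callable


class TransientError(Exception):
    """Failure worth retrying, such as a timeout (hypothetical classification)."""


class PermanentError(Exception):
    """Failure that retrying cannot fix, such as an invalid address."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def send_with_retry(
    send: Callable[[dict], None],
    message: dict,
    dead_letters: list,
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True on success; otherwise park the message in dead_letters."""
    for attempt in range(max_attempts):
        try:
            send(message)
            return True
        except PermanentError:
            break  # retrying will not help
        except TransientError:
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))
    dead_letters.append(message)
    return False
```

The sleep parameter is injectable so the policy can be exercised without real waiting; in a queue-based deployment the delay would more likely be expressed as a delayed re-enqueue than as a blocking sleep in the worker.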

5. User Preferences and Quiet Hours

  • Allow users to customize:
    • What types of notifications they want
    • Which channels to use
    • When they want to receive messages (quiet hours)
  • Use a fast-access data store (e.g., Redis, DynamoDB) to check preferences at send time
  • Batch updates to avoid spamming users (e.g., daily digest for promotions)
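
The send-time preference check might look like the sketch below. The layout of the prefs dictionary (channels, muted_types, quiet_hours) is an assumption made for illustration. The part that is easy to get wrong is a quiet-hours window that crosses midnight, such as 22:00 to 07:00.

```python
from datetime import time


def in_quiet_hours(now: time, start: time, end: time) -> bool:
    """True if now falls in [start, end), including windows crossing midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def should_send(prefs: dict, channel: str, notification_type: str, now: time) -> bool:
    """Apply opt-in and quiet-hour rules; the prefs layout is an assumption."""
    if channel not in prefs.get("channels", ()):
        return False
    if notification_type in prefs.get("muted_types", ()):
        return False
    quiet = prefs.get("quiet_hours")  # (start, end) in the user's local time
    if quiet and in_quiet_hours(now, *quiet):
        return False
    return True
```

Here now is expected to be the user's local time. A fuller design would probably let urgent transactional notifications bypass quiet hours and defer, rather than drop, the rest.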

6. Scheduling and Delays

Support:

  • Delayed delivery (e.g., reminders)
  • Scheduled sends (e.g., 9am local time)
  • Time zone awareness

Use a scheduling service or delayed queue with a job execution engine (e.g., Celery, AWS EventBridge, cron-like scheduler).
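
To illustrate time zone awareness, the sketch below computes the next occurrence of a local wall-clock time and returns it in UTC, which is the value a scheduler or delayed queue would typically store. The function name next_local_send_time is hypothetical. The example passes a fixed UTC offset to stay self-contained; passing zoneinfo.ZoneInfo("America/New_York") instead would also account for daylight saving time.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def next_local_send_time(now_utc: datetime, tz: tzinfo, local_time: time) -> datetime:
    """Return the next occurrence of local_time in the user's zone, as UTC."""
    now_local = now_utc.astimezone(tz)
    candidate = datetime.combine(now_local.date(), local_time, tzinfo=tz)
    if candidate <= now_local:
        # Today's slot has already passed, so schedule for tomorrow.
        next_day = now_local.date() + timedelta(days=1)
        candidate = datetime.combine(next_day, local_time, tzinfo=tz)
    return candidate.astimezone(timezone.utc)


# Usage: a user at UTC-5 who should be notified at 9am local time.
user_tz = timezone(timedelta(hours=-5))
now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)  # 07:00 local
send_at = next_local_send_time(now, user_tz, time(9, 0))
```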


7. Scaling Considerations

  • Horizontally scale worker pools to meet throughput demands.
  • Deduplicate messages with idempotency keys (e.g., a UUID assigned to each event) so that retries and queue redeliveries do not notify a user twice.
  • Rate-limit API calls to third-party providers to avoid throttling.
  • Implement bulk-sending strategies for high-volume campaigns.
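
Two of these ideas are small enough to sketch directly: deduplication by idempotency key and a token bucket for rate-limiting provider calls. Deduplicator and TokenBucket are illustrative names. Both keep state in memory, whereas a multi-worker deployment would need a shared store with key expiry, and the clock is passed in explicitly so the behavior is deterministic (a real caller might pass time.monotonic()).

```python
class Deduplicator:
    """Remembers idempotency keys that have already been processed."""

    def __init__(self) -> None:
        self._seen = set()

    def first_time(self, key: str) -> bool:
        """Return True only the first time a key is seen."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


class TokenBucket:
    """Allows bursts up to capacity and a sustained rate of `rate` calls per second."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now: float) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A worker would call first_time(event_id) before dispatching and allow(now) before each provider call, re-enqueueing the message with a delay when the bucket is empty.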

8. Monitoring & Observability

Track:

  • Number of notifications sent per channel
  • Delivery success/failure rates
  • Queue depth and processing latency
  • Retry counts and dead-letter queue volume

Visualize these metrics on dashboards (e.g., Grafana, Datadog), and alert on spikes, drop-offs, rising retry counts, and other signs of service degradation.
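
As a toy illustration of the per-channel metrics and an alert rule, the sketch below counts outcomes in memory. DeliveryMetrics, its method names, and the threshold values are assumptions; a production system would emit these counters to a metrics backend instead of holding them in process.

```python
from collections import Counter


class DeliveryMetrics:
    """In-memory per-channel counters for delivery outcomes."""

    def __init__(self) -> None:
        self.counts = Counter()

    def record(self, channel: str, status: str) -> None:
        self.counts[(channel, status)] += 1

    def failure_rate(self, channel: str) -> float:
        """Fraction of finished sends that failed; 0.0 when there is no data."""
        sent = self.counts[(channel, "sent")]
        failed = self.counts[(channel, "failed")]
        total = sent + failed
        return failed / total if total else 0.0

    def should_alert(self, channel: str, threshold: float = 0.05, min_samples: int = 20) -> bool:
        """Alert only once enough samples exist to make the rate meaningful."""
        total = self.counts[(channel, "sent")] + self.counts[(channel, "failed")]
        return total >= min_samples and self.failure_rate(channel) > threshold
```

The min_samples guard is a design choice to avoid paging on a single failed send during a quiet period.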


9. Trade-Offs to Discuss

Decision Area            | Options / Trade-offs
-------------------------|----------------------------------------------------------
Delivery Guarantees      | At-least-once (with retries) vs. at-most-once (no retry)
Channel Priority         | Parallel dispatch vs. fallback-only approach
Queueing Model           | One global queue vs. per-channel queues
User Preferences         | Real-time check vs. precomputed config
Message Status Tracking  | Lightweight logs vs. full delivery pipeline and callbacks
Failure Recovery         | Retry queues, DLQ, circuit breakers
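
Of the failure-recovery options listed, the circuit breaker is the least self-explanatory, so here is a minimal sketch. The CircuitBreaker class and its parameters are hypothetical, and the clock is passed in explicitly to keep the example deterministic. The breaker opens after a run of consecutive failures, rejects calls while open, and lets a trial call through once reset_after seconds have passed.

```python
from typing import Optional


class CircuitBreaker:
    """Stops calling a failing provider until a cool-down period has passed."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        """Closed: always allow. Open: allow a trial call after the cool-down."""
        if self.opened_at is None:
            return True
        return now - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # close the circuit

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # open, or re-open after a failed trial
```

While the breaker for one provider is open, a dispatcher can route to a fallback channel or leave messages queued, which is one way to degrade gracefully under partial failure.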

10. What Interviewers Look For

  • Ability to break down a multi-component asynchronous system
  • Thoughtfulness in user experience (preferences, quiet hours)
  • Understanding of trade-offs in delivery reliability
  • Scalability and real-world feasibility
  • Observability, monitoring, and logging awareness
  • Graceful degradation under partial failures

✅ Summary

A notification system must balance performance, personalization, and reliability. A well-architected design includes:

  • Queue-based asynchronous processing
  • Pluggable, channel-specific dispatchers
  • Preference-aware delivery logic
  • Scalable workers and retry mechanisms
  • Observability and monitoring to catch failures early