The Complete Guide to Transactional Email Troubleshooting: A DevOps Engineer’s Handbook
Introduction
Transactional emails are the backbone of modern application communication. Whether it’s password resets, order confirmations, or critical system alerts, these messages must reach their destination reliably and promptly. Yet for DevOps engineers, troubleshooting email delivery issues remains one of the most frustrating debugging experiences—a black box where messages disappear into the void without clear visibility into what went wrong.
After two decades of managing enterprise infrastructure and dealing with countless email delivery incidents across AWS, on-premise systems, and hybrid environments, I’ve developed a systematic approach to diagnosing and resolving transactional email issues. This guide distills that experience into actionable troubleshooting strategies you can apply immediately.
What you’ll learn:
- How to diagnose email delivery failures using logs, headers, and DNS records
- Common SMTP, SPF, DKIM, and DMARC misconfigurations and how to fix them
- Practical troubleshooting workflows for AWS SES, SendGrid, and other major providers
- Infrastructure-as-code patterns for reliable email configuration
- Monitoring and alerting strategies to catch issues before users report them
Understanding Transactional Email Architecture
Before diving into troubleshooting, let’s establish a mental model of how transactional emails traverse the internet. Understanding this journey is crucial for effective debugging.
The Email Delivery Pipeline
When your application sends a transactional email, it passes through multiple layers:
Application Layer: Your application generates the email content and metadata, then hands it to an SMTP client library or API.
SMTP Relay/MTA: The message reaches your Mail Transfer Agent—either a self-hosted MTA like Postfix, a cloud service like AWS SES, or a third-party provider like SendGrid.
DNS Authentication Layer: When the message arrives, the receiving server queries DNS for your SPF, DKIM, and DMARC records to verify that the sender is legitimate.
Recipient MTA: The destination mail server receives the message, applies spam filters, and makes the final delivery decision.
Inbox Placement: The email either lands in the inbox, spam folder, or gets rejected entirely.
Key Components That Can Fail
Each layer introduces potential failure points:
- Application issues: Invalid email formats, missing headers, encoding problems
- SMTP problems: Authentication failures, rate limits, connection timeouts
- DNS misconfigurations: Missing or incorrect SPF/DKIM/DMARC records
- Reputation issues: Blacklisted IPs, poor sender score, spam complaints
- Recipient problems: Invalid addresses, full mailboxes, aggressive filters
The troubleshooting challenge lies in identifying which layer failed and why.
Essential Tools for Email Troubleshooting
Effective troubleshooting requires the right tools. Here’s my essential toolkit:
Command-Line Tools
dig and nslookup: Query DNS records for SPF, DKIM, and DMARC configuration.
# Check SPF record
dig TXT example.com +short | grep "v=spf1"
# Check DKIM record (replace 'selector' with your actual DKIM selector)
dig TXT selector._domainkey.example.com +short
# Check DMARC record
dig TXT _dmarc.example.com +short
openssl s_client: Test SMTP connectivity and TLS encryption.
# Test SMTP connection with STARTTLS
openssl s_client -connect smtp.example.com:587 -starttls smtp
# Test implicit TLS (port 465)
openssl s_client -connect smtp.example.com:465
swaks: The Swiss Army knife of SMTP testing, allowing you to craft and send test emails with complete control.
# Basic test email
swaks --to user@example.com \
--from sender@yourdomain.com \
--server smtp.yourdomain.com \
--auth-user apikey \
--auth-password your-api-key
# Test with specific headers
swaks --to user@example.com \
--from sender@yourdomain.com \
--header "X-Custom-Header: test" \
--body "Test message" \
--server smtp.yourdomain.com
Online Testing Services
MXToolbox: Comprehensive email testing including blacklist checks, SPF validation, and DMARC analysis. Essential for reputation monitoring.
Mail-tester.com: Send a test email to their address and receive a detailed deliverability score with specific recommendations.
DMARC Analyzer: Tools like dmarcian or Postmark’s DMARC analyzer help interpret DMARC reports and identify authentication failures.
Log Analysis Tools
CloudWatch Logs (AWS): If using AWS SES, CloudWatch Logs Insights becomes indispensable for querying email events.
# Find all bounces in the last hour
fields @timestamp, mail.destination, bounce.bounceType
| filter eventType = "Bounce"
| sort @timestamp desc
| limit 100
ELK Stack or Splunk: For self-hosted MTAs, centralized logging helps correlate application logs with SMTP server logs.
Systematic Troubleshooting Methodology
When an email doesn’t arrive, follow this systematic approach to identify the root cause quickly.
Step 1: Confirm the Email Was Sent
This sounds obvious, but verify the email actually left your application.
Check application logs: Look for successful API calls or SMTP connections.
# Python example with proper logging
import logging

import boto3

logger = logging.getLogger(__name__)
ses_client = boto3.client('ses')

# recipient, subject, and body are supplied by the calling code
try:
    response = ses_client.send_email(
        Source='sender@example.com',
        Destination={'ToAddresses': [recipient]},
        Message={'Subject': {'Data': subject}, 'Body': {'Text': {'Data': body}}}
    )
    logger.info(f"Email sent successfully. MessageId: {response['MessageId']}")
except Exception as e:
    logger.error(f"Failed to send email: {str(e)}", exc_info=True)
Check email provider dashboard: AWS SES, SendGrid, Mailgun all provide dashboards showing sends, deliveries, bounces, and complaints.
Verify API responses: If using an email API, ensure you’re receiving successful response codes (usually 200 or 202).
Step 2: Check Email Provider Logs
Once confirmed sent, examine your email service provider’s logs.
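AWS SES CloudWatch Logs: SES does not write to CloudWatch Logs by itself. Attach a configuration set and route its events into a log group (for example through EventBridge or an SNS-triggered Lambda); the CloudWatch event destination on its own publishes metrics only. The log group name below is an example.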
# Query SES events for a specific recipient
aws logs filter-log-events \
--log-group-name /aws/ses/events \
--filter-pattern "user@example.com" \
--start-time $(date -d '1 hour ago' +%s)000
SendGrid Event Webhook: Configure event webhooks to capture all email events in your own logs.
// Express.js webhook handler
app.post('/sendgrid-webhook', (req, res) => {
const events = req.body;
events.forEach(event => {
console.log(`Event: ${event.event}, Email: ${event.email}, Timestamp: ${event.timestamp}`);
// Store in your logging system
});
res.sendStatus(200);
});
Step 3: Analyze Bounce Messages
Bounces come in two types: hard bounces and soft bounces.
Hard bounces indicate permanent delivery failures:
- Invalid email address
- Domain doesn’t exist
- Recipient address rejected
Soft bounces indicate temporary issues:
- Mailbox full
- Temporary server issues
- Message size too large
AWS SES bounce example:
{
  "eventType": "Bounce",
  "bounce": {
    "bounceType": "Permanent",
    "bounceSubType": "General",
    "bouncedRecipients": [
      {
        "emailAddress": "user@example.com",
        "action": "failed",
        "status": "5.1.1",
        "diagnosticCode": "smtp; 550 5.1.1 user unknown"
      }
    ]
  }
}
The diagnostic code tells the story. Status codes starting with 5.x.x indicate permanent failures, while 4.x.x codes indicate temporary issues.
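The 5.x.x versus 4.x.x rule is easy to automate. The following is a minimal sketch, assuming bounce events shaped like the SES example above; `classify_status` and `recipients_to_suppress` are hypothetical helper names, not part of any SDK.

```python
import re

# Matches an enhanced status code such as "5.1.1" or "4.2.2" while
# refusing to match inside longer dotted numbers such as IP addresses.
_STATUS_RE = re.compile(r"(?<![\d.])([245])\.\d{1,3}\.\d{1,3}(?![\d.])")


def classify_status(text):
    """Return 'permanent', 'temporary', 'success', or 'unknown' for a status string."""
    match = _STATUS_RE.search(text or "")
    if not match:
        return "unknown"
    return {"2": "success", "4": "temporary", "5": "permanent"}[match.group(1)]


def recipients_to_suppress(event):
    """List addresses in an SES-style bounce event that failed permanently."""
    suppress = []
    for recipient in event.get("bounce", {}).get("bouncedRecipients", []):
        # Prefer the per-recipient status; fall back to the diagnostic text.
        status = recipient.get("status") or recipient.get("diagnosticCode", "")
        if classify_status(status) == "permanent":
            suppress.append(recipient["emailAddress"])
    return suppress
```

Addresses returned by `recipients_to_suppress` are candidates for a suppression list, while temporary failures are usually worth retrying later.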
Step 4: Verify DNS Authentication Records
Authentication failures are among the most common causes of email delivery problems.
Check SPF record:
dig TXT yourdomain.com +short
You should see something like:
"v=spf1 include:_spf.google.com include:amazonses.com ~all"
Common SPF mistakes:
- Missing include for your email provider
- Too many DNS lookups (limit: 10)
- Using +all instead of ~all or -all
- Multiple SPF records (only one allowed)
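The common SPF mistakes can be checked mechanically once you have the domain's TXT records. Below is a rough sketch, assuming the records are passed in as plain strings with the surrounding quotes already removed; `lint_spf` is a hypothetical helper. It counts lookup-triggering terms at the top level only, so nested includes (which also count toward the limit of 10) are not followed.

```python
import re

# Terms that trigger a DNS lookup during SPF evaluation.
LOOKUP_TERMS = ("include", "a", "mx", "ptr", "exists", "redirect")


def lint_spf(txt_records, required_includes=()):
    """Return a list of human-readable problems found in a domain's TXT records."""
    spf_records = [r for r in txt_records if r.lower().startswith("v=spf1")]
    if not spf_records:
        return ["no SPF record found"]
    problems = []
    if len(spf_records) > 1:
        problems.append("multiple SPF records (only one is allowed)")

    terms = spf_records[0].split()[1:]
    lookups = 0
    for term in terms:
        # Strip the qualifier (+, -, ~, ?) and anything after ':', '/' or '='.
        name = re.split(r"[:/=]", term.lstrip("+-~?"), maxsplit=1)[0].lower()
        if name in LOOKUP_TERMS:
            lookups += 1
    if lookups > 10:
        problems.append(f"{lookups} DNS-lookup terms at the top level (limit is 10)")

    # A bare "all" is equivalent to "+all".
    if any(t.lower() in ("all", "+all") for t in terms):
        problems.append("'+all' authorizes every sender; use '~all' or '-all'")

    present = [t.lower().lstrip("+") for t in terms]
    for domain in required_includes:
        if f"include:{domain}".lower() not in present:
            problems.append(f"missing include:{domain}")
    return problems
```

For example, `lint_spf(records, required_includes=["amazonses.com"])` reports a missing provider include alongside any structural problems.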
Check DKIM signature:
First, find the DKIM selector in an email header, then query DNS:
# Get selector from email header (usually in DKIM-Signature header)
# Then query DNS
dig TXT selector._domainkey.yourdomain.com +short
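Finding the selector by eye gets tedious, so here is a small sketch that pulls the `s=` and `d=` tags out of a DKIM-Signature header value and builds the DNS name to query. `dkim_dns_name` is a hypothetical helper name, and the parsing is deliberately simple (it splits on semicolons).

```python
import re


def dkim_dns_name(dkim_signature):
    """Build the DNS name of the public key from a DKIM-Signature header value."""
    # Header values may be folded across lines; collapse whitespace first.
    flat = re.sub(r"\s+", " ", dkim_signature)
    tags = {}
    for part in flat.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            tags[key.strip()] = value.strip()
    if "s" not in tags or "d" not in tags:
        raise ValueError("DKIM-Signature is missing the s= or d= tag")
    return f"{tags['s']}._domainkey.{tags['d']}"
```

The returned name can be passed straight to `dig TXT`.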
Check DMARC policy:
dig TXT _dmarc.yourdomain.com +short
A basic DMARC record looks like:
"v=DMARC1; p=quarantine; rua=mailto:dmarc@yourdomain.com"
DMARC policies:
- p=none: Monitor only, no action taken
- p=quarantine: Send to spam if authentication fails
- p=reject: Reject email if authentication fails
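To compare the published record against the policy you intended, it helps to split it into tags. This is a minimal sketch, assuming the record text has already been fetched and unquoted; `parse_dmarc` is a hypothetical helper, and the defaults it fills in (relaxed alignment, pct=100) are the ones defined in RFC 7489.

```python
def parse_dmarc(record):
    """Split a DMARC TXT record into a dict of tags, filling in RFC 7489 defaults."""
    tags = {}
    for part in record.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            tags[key.strip().lower()] = value.strip()
    if tags.get("v") != "DMARC1":
        raise ValueError("not a DMARC record (missing v=DMARC1)")
    if tags.get("p") not in ("none", "quarantine", "reject"):
        raise ValueError("p= must be none, quarantine, or reject")
    # Defaults when the tag is absent: relaxed alignment, policy applied to all mail.
    tags.setdefault("adkim", "r")
    tags.setdefault("aspf", "r")
    tags.setdefault("pct", "100")
    return tags
```

Calling `parse_dmarc(record)["p"]` gives the effective policy to compare against the list above.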
Step 5: Check Sender Reputation and Blacklists
Even with perfect configuration, poor sender reputation causes delivery issues.
Check blacklist status:
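# Use MXToolbox or check manually: reverse the octets of your sending IP
# and append the blacklist zone. 127.0.0.2 is the standard DNSBL test
# address and is always listed, so it confirms the lookup itself works.
host 2.0.0.127.zen.spamhaus.org
# If listed, you'll get a 127.0.0.x response
# If not listed, you'll get "not found"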
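The manual lookup generalizes to any IPv4 address by reversing its octets. The sketch below assumes IPv4 only; `dnsbl_query_name` and `is_listed` are hypothetical helper names. `is_listed` needs network access, and treating `127.255.255.x` answers as errors rather than listings reflects Spamhaus conventions, so check the documentation for any other zone you use.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone="zen.spamhaus.org"):
    """Return the DNSBL hostname for an IPv4 address: reversed octets plus the zone."""
    address = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(address).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone="zen.spamhaus.org"):
    """True if the DNSBL answers with a listing code (requires network access)."""
    try:
        answer = socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False  # no A record: the address is not listed
    # Assumption: 127.255.255.x signals a query error (for example, a blocked
    # public resolver), not a listing.
    return not answer.startswith("127.255.255.")
```

Running `is_listed` for each sending IP on a schedule gives a simple blacklist monitor.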
Major blacklists to monitor:
- Spamhaus ZEN
- Spamcop
- Barracuda
- Invaluement
Check sender score: Use tools like Sender Score or Postmaster Tools (Gmail) to monitor your reputation.
Step 6: Examine Email Headers
The email headers contain a complete delivery trace. If you have access to a successfully delivered test email, analyze its headers.
Key headers to examine:
Authentication-Results: Shows SPF, DKIM, DMARC pass/fail
Received: Shows the path the email took
X-Spam-Status: Spam filter score and rules triggered
Return-Path: Bounce address configuration
Reading Authentication-Results:
Authentication-Results: mx.google.com;
dkim=pass header.i=@yourdomain.com header.s=selector header.b=abc123;
spf=pass (google.com: domain of sender@yourdomain.com designates 1.2.3.4 as permitted sender);
dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE)
All three should show pass for optimal deliverability.
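When you are checking many messages, the verdicts can be pulled out of the header with a regular expression. This is a simplified sketch, not a full parser for the header grammar; `auth_results` and `failing_checks` are hypothetical helper names, and only the first verdict per method is kept.

```python
import re


def auth_results(header_value):
    """Extract spf, dkim, and dmarc verdicts from an Authentication-Results header."""
    verdicts = {"spf": "missing", "dkim": "missing", "dmarc": "missing"}
    pattern = r"\b(spf|dkim|dmarc)=(\w+)"
    for method, result in re.findall(pattern, header_value, flags=re.IGNORECASE):
        key = method.lower()
        # Keep the first verdict per method; later ones may belong to extra signatures.
        if verdicts[key] == "missing":
            verdicts[key] = result.lower()
    return verdicts


def failing_checks(header_value):
    """Return the methods whose verdict is anything other than 'pass'."""
    return [m for m, r in auth_results(header_value).items() if r != "pass"]
```

An empty list from `failing_checks` means all three mechanisms passed.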
Common Problems and Solutions
Let’s walk through the most frequent issues I encounter and their solutions.
Problem 1: SPF Authentication Failures
Symptoms: Emails marked as spam or rejected, SPF shows fail or softfail in headers.
Diagnosis:
dig TXT yourdomain.com +short | grep spf1
Common causes:
Missing email provider in SPF record:
# Before (missing SendGrid)
"v=spf1 include:_spf.google.com ~all"
# After (including SendGrid)
"v=spf1 include:_spf.google.com include:sendgrid.net ~all"
Too many DNS lookups (SPF limit is 10):
# Bad - too many includes
"v=spf1 include:provider1.com include:provider2.com include:provider3.com include:provider4.com include:provider5.com include:provider6.com include:provider7.com include:provider8.com include:provider9.com include:provider10.com include:provider11.com ~all"
# Better - consolidate or use ip4/ip6 mechanisms
"v=spf1 include:provider1.com ip4:1.2.3.4 ip4:5.6.7.8 ~all"
Solution: Update your SPF record to include all legitimate sending sources. Use ~all (softfail) for testing, then switch to -all (hardfail) for production.
Problem 2: DKIM Signature Failures
Symptoms: DKIM shows fail or none in authentication results.
Diagnosis:
- Get the DKIM selector from a sent email’s headers
- Query DNS for the DKIM public key
dig TXT selector._domainkey.yourdomain.com +short
Common causes:
DNS record not published or expired:
# No response or NXDOMAIN
dig TXT 20230101._domainkey.yourdomain.com +short
# (no output)
Solution for AWS SES:
# Get DKIM tokens
aws ses verify-domain-dkim --domain yourdomain.com
# Add three CNAME records to DNS:
# token1._domainkey.yourdomain.com -> token1.dkim.amazonses.com
# token2._domainkey.yourdomain.com -> token2.dkim.amazonses.com
# token3._domainkey.yourdomain.com -> token3.dkim.amazonses.com
Solution for self-hosted (using OpenDKIM):
# Generate DKIM keys
opendkim-genkey -s selector -d yourdomain.com
# Add public key to DNS
cat selector.txt
# Copy the TXT record contents to your DNS
Clock skew causing signature validation failures:
# Check system time synchronization
timedatectl status
# Ensure NTP is enabled
sudo timedatectl set-ntp true
Problem 3: DMARC Alignment Issues
Symptoms: DMARC shows fail even when SPF and DKIM pass individually.
Diagnosis: DMARC requires alignment between the From domain and either SPF or DKIM.
Understanding alignment:
From: sender@yourdomain.com
Return-Path: bounce@mail.yourdomain.com
DKIM signature: d=yourdomain.com
# Strict alignment: Domains must match exactly
# Relaxed alignment: Organizational domains must match
Common cause: Using a third-party email service with mismatched domains.
From: sender@yourdomain.com
Return-Path: bounce@sendgrid.net
DKIM: d=sendgrid.net
# This fails DMARC alignment
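The alignment rule can be expressed in a few lines. The sketch below makes one loud simplification: `org_domain` treats the last two labels as the organizational domain, which is wrong for suffixes such as `co.uk`; a real check would consult the Public Suffix List. All function names here are hypothetical.

```python
def org_domain(domain):
    """Naive organizational domain: the last two labels (ASSUMPTION, no suffix list)."""
    labels = domain.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def is_aligned(from_domain, auth_domain, mode="r"):
    """Check identifier alignment in relaxed ('r') or strict ('s') mode."""
    a = from_domain.lower().rstrip(".")
    b = auth_domain.lower().rstrip(".")
    if mode == "s":
        return a == b
    return org_domain(a) == org_domain(b)


def dmarc_passes(from_domain, spf_pass, spf_domain, dkim_pass, dkim_domain,
                 aspf="r", adkim="r"):
    """DMARC passes when at least one of SPF or DKIM both passes and aligns."""
    spf_ok = bool(spf_pass) and is_aligned(from_domain, spf_domain, aspf)
    dkim_ok = bool(dkim_pass) and is_aligned(from_domain, dkim_domain, adkim)
    return spf_ok or dkim_ok
```

With the mismatched example above, `dmarc_passes("yourdomain.com", True, "sendgrid.net", True, "sendgrid.net")` is `False` even though both mechanisms passed on their own.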
Solution: Configure custom return path (also called bounce domain).
AWS SES example:
# Set up custom MAIL FROM domain
aws ses set-identity-mail-from-domain \
--identity yourdomain.com \
--mail-from-domain bounce.yourdomain.com
# Add MX record in DNS:
# bounce.yourdomain.com MX 10 feedback-smtp.us-east-1.amazonses.com
SendGrid example in Terraform:
resource "sendgrid_authenticated_domain" "domain" {
domain = "yourdomain.com"
subdomain = "mail"
automatic_security = true
custom_spf = true
default = true
}
Problem 4: Rate Limiting and Throttling
Symptoms: Some emails send successfully, others fail with rate limit errors.
Diagnosis: Check your email provider’s sending rate limits.
AWS SES rate limits:
# Check your sending limits
aws ses get-send-quota
# Output shows:
# Max24HourSend: 50000
# MaxSendRate: 14 (emails per second)
# SentLast24Hours: 12543
Solution: Implement rate limiting in your application.
Python example with token bucket algorithm:
import time
from threading import Lock

class RateLimiter:
    def __init__(self, rate_per_second):
        self.rate = rate_per_second
        self.allowance = rate_per_second
        self.last_check = time.time()
        self.lock = Lock()

    def try_consume(self, tokens=1):
        with self.lock:
            current = time.time()
            time_passed = current - self.last_check
            self.last_check = current
            self.allowance += time_passed * self.rate
            if self.allowance > self.rate:
                self.allowance = self.rate
            if self.allowance < tokens:
                return False
            self.allowance -= tokens
            return True

    def wait_and_consume(self, tokens=1):
        while not self.try_consume(tokens):
            time.sleep(0.1)

# Usage
limiter = RateLimiter(rate_per_second=10)
for email in email_queue:
    limiter.wait_and_consume()
    send_email(email)
Node.js example with bottleneck:
const Bottleneck = require('bottleneck');
// AWS SES default: 14 emails per second
const limiter = new Bottleneck({
reservoir: 14,
reservoirRefreshAmount: 14,
reservoirRefreshInterval: 1000,
maxConcurrent: 5
});
// Wrap send function
const sendEmail = limiter.wrap(async (emailParams) => {
return await ses.sendEmail(emailParams).promise();
});
Problem 5: Content-Based Spam Filtering
Symptoms: Emails deliver but consistently land in spam folders.
Diagnosis: Send a test email to mail-tester.com and review the spam score report.
Common triggers:
Spammy subject lines:
Bad: "FREE MONEY!!! Click here NOW!!!"
Good: "Your order confirmation #12345"
Poor HTML formatting:
<!-- Bad: No text version, excessive styling -->
<html>
<body style="background: red; font-size: 72px;">
<center>BUY NOW!!!</center>
</body>
</html>
<!-- Good: Clean HTML with text alternative -->
<html>
<body>
<p>Thank you for your order.</p>
<p>Order details...</p>
</body>
</html>
Missing or broken unsubscribe links (for marketing emails):
<!-- Always include for bulk emails -->
<a href="{{unsubscribe_url}}">Unsubscribe</a>
Solutions:
Test with Litmus or Email on Acid before deploying new templates.
Always include both HTML and plain text versions:
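# Python example with both versions
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

message = MIMEMultipart('alternative')
text_part = MIMEText(plain_text_body, 'plain')
html_part = MIMEText(html_body, 'html')
# Attach plain text first; clients prefer the last part they can render
message.attach(text_part)
message.attach(html_part)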
Maintain a healthy text-to-image ratio (aim for at least 60% text).
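Use proper email headers (one-click unsubscribe requires an HTTPS URL in List-Unsubscribe; a mailto address alone is not enough):
List-Unsubscribe: <https://example.com/unsubscribe>, <mailto:unsubscribe@example.com>
List-Unsubscribe-Post: List-Unsubscribe=One-Click
Precedence: bulk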
Problem 6: TLS/SSL Connection Failures
Symptoms: SMTP connection errors, timeout errors, or certificate verification failures.
Diagnosis:
# Test TLS connection
openssl s_client -connect smtp.example.com:587 -starttls smtp
# Check certificate validity
echo | openssl s_client -connect smtp.example.com:587 -starttls smtp 2>/dev/null | openssl x509 -noout -dates
Common causes:
Expired or invalid certificates.
Incorrect SMTP port configuration:
- Port 25: Server-to-server relay, STARTTLS optional (outbound traffic often blocked by cloud providers)
- Port 587: STARTTLS (encrypted after connection)
- Port 465: Implicit TLS (encrypted from start)
Missing or outdated CA certificates:
# Update CA certificates
sudo apt-get update
sudo apt-get install ca-certificates
# Python: show the CA bundle that certifi ships (requests uses this bundle, not the system store)
import certifi
print(certifi.where())
Solution: Always use encrypted connections (587 or 465) with valid certificates.
Python example with proper TLS:
import smtplib
import ssl
from email.mime.text import MIMEText

def send_email_secure(recipient, subject, body):
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = 'sender@example.com'
    msg['To'] = recipient

    # The default context verifies the server certificate and hostname;
    # starttls() without a context skips that verification
    context = ssl.create_default_context()

    # Use STARTTLS (port 587)
    with smtplib.SMTP('smtp.example.com', 587) as server:
        server.starttls(context=context)  # Upgrade to TLS
        server.login('username', 'password')
        server.send_message(msg)
Infrastructure as Code for Email Configuration
Managing email configuration manually leads to drift and inconsistencies. Here’s how to codify your email infrastructure.
Terraform: AWS SES Configuration
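# Domain verification
resource "aws_ses_domain_identity" "main" {
  domain = var.domain_name
}

# TXT record that proves domain ownership to SES
resource "aws_route53_record" "ses_verification" {
  zone_id = var.route53_zone_id
  name    = "_amazonses.${aws_ses_domain_identity.main.id}"
  type    = "TXT"
  ttl     = 600
  records = [aws_ses_domain_identity.main.verification_token]
}

resource "aws_ses_domain_identity_verification" "main" {
  domain     = aws_ses_domain_identity.main.id
  depends_on = [aws_route53_record.ses_verification]
}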
# DKIM configuration
resource "aws_ses_domain_dkim" "main" {
domain = aws_ses_domain_identity.main.domain
}
resource "aws_route53_record" "dkim" {
count = 3
zone_id = var.route53_zone_id
name = "${element(aws_ses_domain_dkim.main.dkim_tokens, count.index)}._domainkey"
type = "CNAME"
ttl = 600
records = ["${element(aws_ses_domain_dkim.main.dkim_tokens, count.index)}.dkim.amazonses.com"]
}
# Custom MAIL FROM domain
resource "aws_ses_domain_mail_from" "main" {
domain = aws_ses_domain_identity.main.domain
mail_from_domain = "bounce.${aws_ses_domain_identity.main.domain}"
}
resource "aws_route53_record" "mail_from_mx" {
zone_id = var.route53_zone_id
name = aws_ses_domain_mail_from.main.mail_from_domain
type = "MX"
ttl = 600
records = ["10 feedback-smtp.${var.aws_region}.amazonses.com"]
}
resource "aws_route53_record" "mail_from_spf" {
zone_id = var.route53_zone_id
name = aws_ses_domain_mail_from.main.mail_from_domain
type = "TXT"
ttl = 600
records = ["v=spf1 include:amazonses.com ~all"]
}
# Configuration set with CloudWatch logging
resource "aws_ses_configuration_set" "main" {
name = "${var.environment}-email-tracking"
}
resource "aws_ses_event_destination" "cloudwatch" {
name = "cloudwatch-destination"
configuration_set_name = aws_ses_configuration_set.main.name
enabled = true
matching_types = ["send", "reject", "bounce", "complaint", "delivery"]
cloudwatch_destination {
default_value = "default"
dimension_name = "EmailType"
value_source = "messageTag"
}
}
# SNS topic for bounce and complaint notifications
resource "aws_sns_topic" "email_notifications" {
name = "${var.environment}-email-notifications"
}
resource "aws_ses_identity_notification_topic" "bounce" {
topic_arn = aws_sns_topic.email_notifications.arn
notification_type = "Bounce"
identity = aws_ses_domain_identity.main.domain
}
resource "aws_ses_identity_notification_topic" "complaint" {
topic_arn = aws_sns_topic.email_notifications.arn
notification_type = "Complaint"
identity = aws_ses_domain_identity.main.domain
}
Terraform: DNS Records for Email Authentication
# SPF record
resource "aws_route53_record" "spf" {
zone_id = var.route53_zone_id
name = var.domain_name
type = "TXT"
ttl = 300
records = ["v=spf1 include:amazonses.com include:_spf.google.com ~all"]
}
# DMARC record
resource "aws_route53_record" "dmarc" {
zone_id = var.route53_zone_id
name = "_dmarc.${var.domain_name}"
type = "TXT"
ttl = 300
records = [
"v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@${var.domain_name}; ruf=mailto:dmarc-forensics@${var.domain_name}; fo=1; adkim=r; aspf=r; pct=100"
]
}
# MX record (if receiving email)
resource "aws_route53_record" "mx" {
zone_id = var.route53_zone_id
name = var.domain_name
type = "MX"
ttl = 300
records = [
"1 ASPMX.L.GOOGLE.COM",
"5 ALT1.ASPMX.L.GOOGLE.COM",
"5 ALT2.ASPMX.L.GOOGLE.COM",
"10 ALT3.ASPMX.L.GOOGLE.COM",
"10 ALT4.ASPMX.L.GOOGLE.COM"
]
}
Ansible: Self-Hosted Postfix Configuration
---
- name: Configure Postfix for transactional email
  hosts: mail_servers
  become: yes
  vars:
    postfix_domain: example.com
    smtp_relay_host: smtp.sendgrid.net
    smtp_relay_port: 587
    smtp_relay_user: apikey
  tasks:
    - name: Install Postfix and required packages
      apt:
        name:
          - postfix
          - opendkim
          - opendkim-tools
          - libsasl2-modules
        state: present
        update_cache: yes

    - name: Configure Postfix main.cf
      template:
        src: main.cf.j2
        dest: /etc/postfix/main.cf
        owner: root
        group: root
        mode: '0644'
      notify: restart postfix

    - name: Set up SMTP relay credentials
      template:
        src: sasl_passwd.j2
        dest: /etc/postfix/sasl_passwd
        owner: root
        group: root
        mode: '0600'
      notify:
        - hash sasl_passwd
        - restart postfix

    - name: Generate DKIM keys
      command: opendkim-genkey -s {{ ansible_date_time.year }}{{ ansible_date_time.month }} -d {{ postfix_domain }}
      args:
        chdir: /etc/opendkim/keys
        creates: /etc/opendkim/keys/{{ ansible_date_time.year }}{{ ansible_date_time.month }}.private

    - name: Configure OpenDKIM
      template:
        src: opendkim.conf.j2
        dest: /etc/opendkim.conf
      notify: restart opendkim

    - name: Set up OpenDKIM signing table
      template:
        src: signing.table.j2
        dest: /etc/opendkim/signing.table
      notify: restart opendkim

  handlers:
    - name: restart postfix
      service:
        name: postfix
        state: restarted

    - name: restart opendkim
      service:
        name: opendkim
        state: restarted

    - name: hash sasl_passwd
      command: postmap /etc/postfix/sasl_passwd
Monitoring and Alerting
Proactive monitoring catches email issues before they impact users.
Key Metrics to Monitor
Delivery rate: Percentage of emails successfully delivered vs. sent.
delivery_rate = (delivered / sent) * 100
Target: >98%
Bounce rate: Percentage of emails that bounce.
bounce_rate = (bounced / sent) * 100
Target: <5% (lower is better)
Complaint rate: Percentage of recipients marking as spam.
complaint_rate = (complaints / delivered) * 100
Target: <0.1% (critical threshold)
Open rate (for applicable transactional emails):
open_rate = (opens / delivered) * 100
Varies by email type
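The formulas and targets above translate directly into a health check. Here is a minimal sketch, assuming you already have raw counts for a time window; `email_health` is a hypothetical helper, and the thresholds are the targets listed in this section.

```python
def email_health(sent, delivered, bounced, complaints):
    """Compute delivery, bounce, and complaint rates (percent) and flag breaches."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    rates = {
        "delivery_rate": delivered * 100 / sent,
        "bounce_rate": bounced * 100 / sent,
        # Complaints are measured against delivered mail, as in the formula above.
        "complaint_rate": (complaints * 100 / delivered) if delivered else 0.0,
    }
    alerts = []
    if rates["delivery_rate"] <= 98:
        alerts.append("delivery rate at or below 98%")
    if rates["bounce_rate"] >= 5:
        alerts.append("bounce rate at or above 5%")
    if rates["complaint_rate"] >= 0.1:
        alerts.append("complaint rate at or above 0.1%")
    return rates, alerts
```

A scheduled job can call `email_health` with the latest counts and page someone whenever the alert list is non-empty.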
CloudWatch Alarms for AWS SES
# Bounce rate alarm
resource "aws_cloudwatch_metric_alarm" "high_bounce_rate" {
alarm_name = "ses-high-bounce-rate"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = "2"
metric_name = "Reputation.BounceRate"
namespace = "AWS/SES"
period = "900"
statistic = "Average"
threshold = "0.05"
alarm_description = "Alert when bounce rate exceeds 5%"
alarm_actions = [aws_sns_topic.alerts.arn]
}
# Complaint rate alarm (critical)
resource "aws_cloudwatch_metric_alarm" "high_complaint_rate" {
alarm_name = "ses-high-complaint-rate"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = "1"
metric_name = "Reputation.ComplaintRate"
namespace = "AWS/SES"
period = "900"
statistic = "Average"
threshold = "0.001"
alarm_description = "CRITICAL: Complaint rate exceeds 0.1%"
alarm_actions = [aws_sns_topic.critical_alerts.arn]
treat_missing_data = "notBreaching"
}
# Send quota utilization (confirm this metric is actually published in your account before relying on the alarm)
resource "aws_cloudwatch_metric_alarm" "send_quota_utilization" {
alarm_name = "ses-quota-near-limit"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = "1"
metric_name = "SendQuotaUtilization"
namespace = "AWS/SES"
period = "300"
statistic = "Maximum"
threshold = "80"
alarm_description = "Alert when send quota utilization exceeds 80%"
alarm_actions = [aws_sns_topic.alerts.arn]
}
Prometheus Metrics for Custom Monitoring
from prometheus_client import Counter, Histogram, Gauge
import time

# Define metrics
emails_sent_total = Counter(
    'emails_sent_total',
    'Total emails sent',
    ['template', 'status']
)
email_send_duration = Histogram(
    'email_send_duration_seconds',
    'Time to send email',
    ['template']
)
email_queue_size = Gauge(
    'email_queue_size',
    'Current email queue size'
)

def send_email_with_metrics(template_name, recipient, content):
    start_time = time.time()
    try:
        # Actual email sending logic
        result = send_email_api(recipient, content)
        emails_sent_total.labels(
            template=template_name,
            status='success'
        ).inc()
        return result
    except Exception:
        emails_sent_total.labels(
            template=template_name,
            status='failed'
        ).inc()
        raise
    finally:
        duration = time.time() - start_time
        email_send_duration.labels(
            template=template_name
        ).observe(duration)
Grafana Dashboard Query Examples
# Email send success rate (last hour), aggregated across templates
sum(rate(emails_sent_total{status="success"}[1h]))
/
sum(rate(emails_sent_total[1h]))
# 95th percentile send latency
histogram_quantile(0.95,
rate(email_send_duration_seconds_bucket[5m])
)
# Failed sends by template
sum by (template) (
rate(emails_sent_total{status="failed"}[5m])
)
Advanced Troubleshooting Techniques
Debugging with SMTP Session Logs
Enable verbose SMTP logging to capture the complete conversation:
import smtplib

server = smtplib.SMTP('smtp.example.com', 587)
# smtplib writes its debug output to stderr, not to the logging module;
# level 2 adds timestamps to each line
server.set_debuglevel(2)
server.starttls()
server.login('username', 'password')
# Debug output shows the SMTP session from this point on
Using tcpdump to Capture Email Traffic
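When application logs aren’t sufficient, capture network traffic. Sessions on port 465, and everything after STARTTLS on ports 25 and 587, are encrypted, so the capture mainly shows connection setup and the pre-TLS exchange: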
# Capture SMTP traffic
sudo tcpdump -i any -s 0 -w smtp-capture.pcap 'port 25 or port 587 or port 465'
# Analyze with Wireshark or tshark
tshark -r smtp-capture.pcap -Y smtp -T fields -e smtp.req.command -e smtp.response.code
Email Header Analysis for Deliverability
Extract and analyze headers from delivered emails:
import email
from email import policy

def analyze_email_headers(raw_email):
    msg = email.message_from_string(raw_email, policy=policy.default)

    # Extract authentication results
    auth_results = msg.get('Authentication-Results', '')
    print(f"Authentication: {auth_results}")

    # Extract spam score
    spam_status = msg.get('X-Spam-Status', '')
    print(f"Spam Status: {spam_status}")

    # Trace email path
    received_headers = msg.get_all('Received', [])
    print(f"\nEmail path ({len(received_headers)} hops):")
    for i, received in enumerate(received_headers, 1):
        print(f"{i}. {received}")

    # Check DKIM signature
    dkim_signature = msg.get('DKIM-Signature', '')
    if dkim_signature:
        print(f"\nDKIM Signature present: {dkim_signature[:100]}...")
Testing with Different Email Providers
Send test emails to various providers to identify provider-specific issues:
#!/bin/bash
# Test email delivery to major providers
PROVIDERS=(
"gmail-test@gmail.com"
"outlook-test@outlook.com"
"yahoo-test@yahoo.com"
"icloud-test@icloud.com"
"protonmail-test@protonmail.com"
)
for email in "${PROVIDERS[@]}"; do
echo "Testing delivery to ${email}..."
swaks --to "${email}" \
--from "test@yourdomain.com" \
--server smtp.yourdomain.com \
--auth-user "apikey" \
--auth-password "${API_KEY}" \
--header "Subject: Deliverability Test $(date)" \
--body "This is a test email sent at $(date)"
sleep 5
done
Email Security Best Practices
Preventing Email Spoofing
Implement strict DMARC policies:
# Start with monitoring
v=DMARC1; p=none; rua=mailto:dmarc@yourdomain.com; pct=100
# Move to quarantine after monitoring shows compliance
v=DMARC1; p=quarantine; rua=mailto:dmarc@yourdomain.com; pct=100
# Enforce strict rejection
v=DMARC1; p=reject; rua=mailto:dmarc@yourdomain.com; pct=100
Securing SMTP Credentials
Never hardcode credentials. Use secrets management:
import json

import boto3
from botocore.exceptions import ClientError

def get_smtp_credentials():
    secret_name = "prod/smtp/credentials"
    region_name = "us-east-1"
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )
    try:
        response = client.get_secret_value(SecretId=secret_name)
        return json.loads(response['SecretString'])
    except ClientError as e:
        raise Exception(f"Failed to retrieve credentials: {e}") from e
Rate Limiting to Prevent Abuse
Implement application-level rate limiting:
from functools import wraps
from datetime import datetime, timedelta
import redis

redis_client = redis.Redis(host='localhost', port=6379, db=0)

def rate_limit(max_requests, window_seconds):
    def decorator(func):
        @wraps(func)
        def wrapper(user_id, *args, **kwargs):
            key = f"email_rate_limit:{user_id}"
            current_time = datetime.now()

            # Get request timestamps from Redis
            timestamps = redis_client.lrange(key, 0, -1)
            timestamps = [
                datetime.fromisoformat(ts.decode())
                for ts in timestamps
            ]

            # Remove old timestamps outside the window
            window_start = current_time - timedelta(seconds=window_seconds)
            recent_timestamps = [
                ts for ts in timestamps
                if ts > window_start
            ]

            if len(recent_timestamps) >= max_requests:
                raise Exception(
                    f"Rate limit exceeded: {max_requests} emails per {window_seconds}s"
                )

            # Add current timestamp
            redis_client.rpush(key, current_time.isoformat())
            redis_client.expire(key, window_seconds)
            return func(user_id, *args, **kwargs)
        return wrapper
    return decorator

@rate_limit(max_requests=10, window_seconds=3600)
def send_email(user_id, recipient, subject, body):
    # Email sending logic
    pass
Performance Optimization
Batch Email Sending
For bulk transactional emails, use batch APIs:
import json

import boto3

ses = boto3.client('ses', region_name='us-east-1')

def send_bulk_emails(recipients, subject, body_text, body_html):
    """Send to up to 50 recipients per API call"""
    batch_size = 50
    for i in range(0, len(recipients), batch_size):
        batch = recipients[i:i + batch_size]
        destinations = [
            {
                'Destination': {'ToAddresses': [email]},
                'ReplacementTemplateData': json.dumps({
                    'email': email
                })
            }
            for email in batch
        ]
        try:
            response = ses.send_bulk_templated_email(
                Source='noreply@yourdomain.com',
                Template='TransactionalTemplate',
                Destinations=destinations,
                DefaultTemplateData=json.dumps({
                    'subject': subject,
                    'body_text': body_text,
                    'body_html': body_html
                })
            )
            print(f"Sent batch of {len(batch)} emails")
        except Exception as e:
            print(f"Error sending batch: {e}")
Async Email Sending
Avoid blocking application threads:
import asyncio
import aiosmtplib
from email.mime.text import MIMEText

async def send_email_async(recipient, subject, body):
    message = MIMEText(body)
    message['From'] = 'sender@yourdomain.com'
    message['To'] = recipient
    message['Subject'] = subject

    # start_tls=True upgrades the connection with STARTTLS during connect,
    # so no separate starttls() call is needed
    async with aiosmtplib.SMTP(
        hostname='smtp.yourdomain.com',
        port=587,
        start_tls=True
    ) as smtp:
        await smtp.login('username', 'password')
        await smtp.send_message(message)

# Send multiple emails concurrently
async def send_multiple_emails(email_list):
    tasks = [
        send_email_async(
            email['recipient'],
            email['subject'],
            email['body']
        )
        for email in email_list
    ]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    for i, result in enumerate(results):
        if isinstance(result, Exception):
            print(f"Failed to send email {i}: {result}")
        else:
            print(f"Successfully sent email {i}")

# Usage
email_queue = [
    {'recipient': 'user1@example.com', 'subject': 'Test 1', 'body': 'Body 1'},
    {'recipient': 'user2@example.com', 'subject': 'Test 2', 'body': 'Body 2'},
    # ... more emails
]
asyncio.run(send_multiple_emails(email_queue))
Queue-Based Email Processing
Use message queues for reliable delivery:
import boto3
import json
from datetime import datetime

sqs = boto3.client('sqs', region_name='us-east-1')
ses = boto3.client('ses', region_name='us-east-1')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789/email-queue'

def enqueue_email(recipient, subject, body):
    """Add email to SQS queue"""
    message = {
        'recipient': recipient,
        'subject': subject,
        'body': body,
        'timestamp': datetime.now().isoformat()
    }
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(message)
    )

def process_email_queue():
    """Worker process to send emails from queue"""
    while True:
        # Long polling: wait up to 20 seconds for messages to arrive
        response = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20
        )
        if 'Messages' not in response:
            continue
        for message in response['Messages']:
            try:
                email_data = json.loads(message['Body'])
                ses.send_email(
                    Source='noreply@yourdomain.com',
                    Destination={'ToAddresses': [email_data['recipient']]},
                    Message={
                        'Subject': {'Data': email_data['subject']},
                        'Body': {'Text': {'Data': email_data['body']}}
                    }
                )
                # Delete message from queue on success
                sqs.delete_message(
                    QueueUrl=QUEUE_URL,
                    ReceiptHandle=message['ReceiptHandle']
                )
            except Exception as e:
                print(f"Error processing message: {e}")
                # Message becomes visible again after the queue's visibility
                # timeout and is then retried
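The worker above leaves a failed message on the queue, so a message that can never succeed (malformed JSON, a permanently rejected address) keeps coming back until the retention period expires. A dead-letter queue is one common way to cap those retries. The following is a sketch under assumptions: the `build_redrive_attributes` helper, the DLQ ARN, and the retry count of 5 are all hypothetical values for illustration.

```python
import json


def build_redrive_attributes(dlq_arn, max_receive_count=5):
    """Return SQS queue attributes that route repeatedly failing messages to a DLQ."""
    return {
        'RedrivePolicy': json.dumps({
            'deadLetterTargetArn': dlq_arn,
            # After this many receives without a delete, SQS moves the message
            'maxReceiveCount': str(max_receive_count),
        })
    }


# Hypothetical ARN; replace with the ARN of your own dead-letter queue
DLQ_ARN = 'arn:aws:sqs:us-east-1:123456789012:email-dlq'
attributes = build_redrive_attributes(DLQ_ARN)

# Applying the policy needs AWS credentials, so the call is left commented out:
# sqs.set_queue_attributes(QueueUrl=QUEUE_URL, Attributes=attributes)
```

Messages that land in the dead-letter queue can then be inspected by hand, which is usually faster than searching worker logs for a repeating error.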
Troubleshooting Checklist
When facing email delivery issues, work through this systematic checklist:
Initial Diagnosis
- [ ] Confirm email was sent (check application logs)
- [ ] Verify API response codes (200/202 for success)
- [ ] Check email service provider dashboard for send status
- [ ] Look for error messages in application logs
DNS Configuration
- [ ] Verify SPF record exists and includes all sending sources
- [ ] Confirm DKIM records are published and accessible
- [ ] Check DMARC policy is configured correctly
- [ ] Ensure DNS changes have propagated (timing depends on record TTLs; allow up to 48 hours)
- [ ] Verify MX records if receiving replies
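Part of the SPF check can be scripted once you have the TXT record text in hand. The sketch below is a rough lint, not a full SPF validator: the `lint_spf` name is hypothetical, and it counts only the lookup-causing terms in the top-level record, whereas the 10-lookup limit in RFC 7208 also covers lookups made inside included records.

```python
def lint_spf(record):
    """Return a list of problems found in an SPF TXT record string."""
    terms = record.split()
    if not terms or terms[0].lower() != 'v=spf1':
        return ['record does not start with v=spf1']

    problems = []
    lookups = 0
    for term in terms[1:]:
        # Strip the optional qualifier (+, -, ~, ?) and any argument
        name = term.lstrip('+-~?').lower()
        for separator in (':', '/', '='):
            name = name.split(separator, 1)[0]
        # These mechanisms and the redirect modifier each cost a DNS lookup
        if name in ('include', 'a', 'mx', 'ptr', 'exists', 'redirect'):
            lookups += 1

    if lookups > 10:
        problems.append(f'{lookups} DNS lookups exceed the limit of 10')
    last = terms[-1].lower()
    if last.lstrip('+-~?') != 'all' and not last.startswith('redirect='):
        problems.append('record has no terminating "all" or redirect')
    return problems


print(lint_spf('v=spf1 include:amazonses.com include:sendgrid.net -all'))
```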
Authentication
- [ ] Confirm SPF passes for your sending IP
- [ ] Verify DKIM signatures validate correctly
- [ ] Check DMARC alignment (the SPF MAIL FROM domain or the DKIM d= domain must match the From header domain)
- [ ] Ensure custom return path (MAIL FROM) is configured
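The quickest way to confirm these results is to read the Authentication-Results header that the receiving server adds to a delivered test message. A minimal sketch using only the standard library follows. The `auth_results` name is hypothetical, and the regular expression assumes the common lowercase `method=result` layout, which not every receiver follows exactly.

```python
import re
from email import message_from_string


def auth_results(raw_message):
    """Extract spf/dkim/dmarc verdicts from the Authentication-Results headers."""
    message = message_from_string(raw_message)
    verdicts = {}
    for header in message.get_all('Authentication-Results', []):
        for method, result in re.findall(r'\b(spf|dkim|dmarc)=(\w+)', header):
            # Keep the first verdict seen for each method (topmost header wins)
            verdicts.setdefault(method, result)
    return verdicts
```

Paste the raw source of a received message (in Gmail, "Show original") into `auth_results` to get a dictionary such as `{'spf': 'pass', 'dkim': 'pass', 'dmarc': 'pass'}`.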
Reputation & Deliverability
- [ ] Check if sending IP is blacklisted
- [ ] Monitor sender reputation score
- [ ] Review bounce rate (should be <5%)
- [ ] Check complaint rate (must be <0.1%)
- [ ] Verify you’re not hitting rate limits
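The bounce and complaint thresholds above are simple ratios, so they are easy to check from whatever counters your provider exposes. The sketch below assumes you already have raw counts for a time window. The `delivery_health` name and its parameters are hypothetical, and the defaults mirror the 5% and 0.1% figures from the checklist.

```python
def delivery_health(sent, bounces, complaints,
                    max_bounce_rate=0.05, max_complaint_rate=0.001):
    """Compare bounce and complaint rates against the checklist thresholds."""
    if sent <= 0:
        raise ValueError('sent must be positive')
    bounce_rate = bounces / sent
    complaint_rate = complaints / sent
    return {
        'bounce_rate': bounce_rate,
        'complaint_rate': complaint_rate,
        'bounce_ok': bounce_rate < max_bounce_rate,
        'complaint_ok': complaint_rate < max_complaint_rate,
    }


# Example window: 10,000 sent, 200 bounces, 5 complaints
print(delivery_health(10_000, 200, 5))
```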
Content & Format
- [ ] Test email with mail-tester.com for spam score
- [ ] Ensure both HTML and plain text versions exist
- [ ] Check for spam trigger words in subject/body
- [ ] Verify images have alt text and proper hosting
- [ ] Confirm unsubscribe link works (for bulk email)
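For the HTML-plus-plain-text item, the standard library can build the multipart/alternative structure directly. This is a minimal sketch: the `build_message` helper and the addresses are placeholders rather than anything from a specific provider SDK.

```python
from email.message import EmailMessage


def build_message(sender, recipient, subject, text_body, html_body):
    """Build a multipart/alternative message with plain-text and HTML parts."""
    message = EmailMessage()
    message['From'] = sender
    message['To'] = recipient
    message['Subject'] = subject
    # Plain text goes first; clients render the last part they can display
    message.set_content(text_body)
    message.add_alternative(html_body, subtype='html')
    return message


msg = build_message(
    'sender@yourdomain.com',
    'user@example.com',
    'Your receipt',
    'Thanks for your order.',
    '<p>Thanks for your order.</p>',
)
print(msg.get_content_type())
```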
Infrastructure
- [ ] Verify SMTP credentials are correct
- [ ] Check firewall rules allow outbound SMTP traffic
- [ ] Ensure TLS/SSL certificates are valid
- [ ] Confirm correct SMTP port (587 or 465)
- [ ] Check system time synchronization for DKIM
Recipient Issues
- [ ] Verify email address format is valid
- [ ] Check if domain exists (MX record query)
- [ ] Look for “user unknown” or “mailbox full” errors
- [ ] Test sending to different email providers
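Errors such as "user unknown" and "mailbox full" map onto SMTP reply codes, which tell you whether a retry is worthwhile. The sketch below sorts a reply line into rough buckets. The `classify_smtp_reply` name and the bucket labels are hypothetical, and treating a full mailbox (enhanced status x.2.2) as retryable even on a 5xx reply is a policy assumption, not a rule from the RFCs.

```python
import re


def classify_smtp_reply(reply):
    """Classify an SMTP reply line as accepted, soft_bounce, hard_bounce, or unknown."""
    match = re.match(r'\s*(\d{3})[ -](?:(\d\.\d{1,3}\.\d{1,3})\s)?', reply)
    if not match:
        return 'unknown'
    code, enhanced = match.group(1), match.group(2)
    if code.startswith('2'):
        return 'accepted'
    # Mailbox full (x.2.2) may clear up later, so treat it as retryable
    if enhanced and enhanced.split('.', 1)[1] == '2.2':
        return 'soft_bounce'
    if code.startswith('4'):
        return 'soft_bounce'   # transient failure: retry with backoff
    if code.startswith('5'):
        return 'hard_bounce'   # permanent failure: suppress the address
    return 'unknown'
```

Hard bounces should feed a suppression list so the same address is not retried, since repeated sends to invalid mailboxes damage sender reputation.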
Conclusion
Troubleshooting transactional email delivery requires a systematic approach, an understanding of email infrastructure, and the right tools. By following the methods outlined in this guide, you can diagnose and resolve most email issues efficiently.
Key takeaways:
- Start with the basics: Confirm the email was sent before investigating complex issues.
- Authentication is critical: Properly configured SPF, DKIM, and DMARC records are non-negotiable for deliverability.
- Monitor proactively: Set up alerts for bounce rates, complaint rates, and quota utilization before issues impact users.
- Use Infrastructure as Code: Terraform, Ansible, or CloudFormation ensures consistency and prevents configuration drift.
- Test thoroughly: Send test emails to multiple providers and use tools like mail-tester.com before deploying to production.
Remember that email delivery is a reputation game. Maintain good sending practices, respond quickly to bounces and complaints, and your transactional emails will reliably reach the inbox.
Additional Resources
Email Authentication:
- RFC 7208: SPF specification
- RFC 6376: DKIM specification
- RFC 7489: DMARC specification
Testing Tools:
- MXToolbox: https://mxtoolbox.com
- Mail-tester: https://www.mail-tester.com
- Google Postmaster Tools: https://postmaster.google.com
- Microsoft SNDS: https://sendersupport.olc.protection.outlook.com/snds/
Provider Documentation:
- AWS SES: https://docs.aws.amazon.com/ses/
- SendGrid: https://docs.sendgrid.com
- Mailgun: https://documentation.mailgun.com
- Postmark: https://postmarkapp.com/developer
Monitoring & Analytics:
- DMARC Analyzer: https://www.dmarcanalyzer.com
- Postmark DMARC Monitor: https://dmarc.postmarkapp.com
Have questions about email deliverability or want to share your troubleshooting experiences? Leave a comment below or reach out on LinkedIn.