When you need to process tasks asynchronously at scale, a robust queue system becomes essential. Here's what I learned building one that handles 100K+ jobs daily.
Why Not Just Use SQS?
Amazon SQS is great, but for our use case we needed:

- Sub-second job pickup latency
- Complex retry strategies per job type
- Real-time job progress tracking
- Priority queues
Redis with Bull gave us all of this with simpler operational overhead.
Architecture Overview
The system has three main components:
1. **Producers** - API servers that enqueue jobs
2. **Redis** - The queue storage and pub/sub backbone
3. **Workers** - Horizontally scalable job processors
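The split between the three components can be sketched in miniature. This is an in-memory stand-in, not the real system: the array plays the role of Redis, and the producer and worker would run in separate processes.

```javascript
// In-memory sketch of the producer/worker split.
// In production, `queue` is a Redis-backed Bull queue, not an array.
const queue = [];

// Producer side: an API server enqueues a job.
function enqueue(type, payload) {
  queue.push({ type, payload, enqueuedAt: Date.now() });
}

// Worker side: pull the next job and dispatch it to a handler.
function processNext(handlers) {
  const job = queue.shift();
  if (!job) return null;
  return handlers[job.type](job.payload);
}

enqueue('email', { to: 'user@example.com' });
const result = processNext({
  email: (p) => `sent to ${p.to}`,
});
// result === 'sent to user@example.com'
```

The real system gets its sub-second pickup latency because workers block on Redis rather than polling an array, but the data flow is the same.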
Key Design Decisions
1. Job Persistence
While Redis is fast, we needed durability. Every job is:

- Written to Redis for processing
- Logged to PostgreSQL for an audit trail
- Persisted with its result after completion
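The enqueue path writes to both stores. A minimal sketch of that dual-write, where `redisQueue` and `auditLog` are in-memory stand-ins (assumptions for illustration) for the Redis queue and the PostgreSQL audit table:

```javascript
// Sketch of the dual-write enqueue path.
// `redisQueue` stands in for Redis; `auditLog` stands in for PostgreSQL.
const redisQueue = [];
const auditLog = [];

function enqueueWithAudit(type, payload) {
  const job = { id: auditLog.length + 1, type, payload, status: 'queued' };
  redisQueue.push(job);        // fast path: Redis, for workers to pick up
  auditLog.push({ ...job });   // durable path: one audit row per job
  return job.id;
}

function completeJob(id, result) {
  const row = auditLog.find((r) => r.id === id);
  row.status = 'completed';
  row.result = result;         // result persisted after completion
}

const id = enqueueWithAudit('email', { to: 'a@b.c' });
completeJob(id, 'delivered');
```

If Redis loses data, the audit table tells you which jobs never completed and need re-enqueueing.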
2. Retry Strategy
Different jobs need different retry behaviors:
```javascript
const emailQueue = new Bull('email', {
  defaultJobOptions: {
    attempts: 5,
    backoff: {
      type: 'exponential',
      delay: 2000, // retry delays: 2s, 4s, 8s, 16s
    },
  },
});
```

3. Worker Scaling
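The intuition behind our scaling is to keep a roughly fixed backlog per worker and add replicas as queue depth grows. A sketch of that calculation (the per-worker target of 100 and the min/max bounds are illustrative assumptions):

```javascript
// Sketch of HPA-style sizing from queue depth: one worker per
// `targetPerWorker` queued jobs, clamped to a replica range.
function desiredReplicas(queueDepth, targetPerWorker, min = 1, max = 50) {
  const desired = Math.ceil(queueDepth / targetPerWorker);
  return Math.min(max, Math.max(min, desired));
}

desiredReplicas(1000, 100); // 10 workers for a backlog of 1000
desiredReplicas(30, 100);   // scales down to the minimum of 1
```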
Workers scale based on queue depth using Kubernetes HPA:
```yaml
metrics:
  - type: External
    external:
      metric:
        name: redis_queue_depth
      target:
        type: AverageValue
        averageValue: "100"
```

Lessons Learned
1. **Always set job timeouts** - Stuck jobs will block workers
2. **Use separate queues for different priorities** - Don't let bulk jobs block critical ones
3. **Monitor everything** - Queue depth, processing time, failure rates
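The three metrics in the last lesson all fall out of per-job records. A sketch of computing them (the field names here are illustrative, not Bull's API):

```javascript
// Sketch of the queue health metrics worth alerting on,
// derived from per-job records. Field names are illustrative.
function queueStats(jobs) {
  const done = jobs.filter((j) => j.finishedAt);
  const failed = jobs.filter((j) => j.failed);
  const times = done.map((j) => j.finishedAt - j.startedAt);
  return {
    depth: jobs.length - done.length - failed.length, // still waiting
    avgProcessingMs: times.length
      ? times.reduce((a, b) => a + b, 0) / times.length
      : 0,
    failureRate: jobs.length ? failed.length / jobs.length : 0,
  };
}

const stats = queueStats([
  { startedAt: 0, finishedAt: 120 },
  { startedAt: 0, finishedAt: 80 },
  { failed: true },
  {}, // still queued
]);
// stats.depth === 1, stats.avgProcessingMs === 100, stats.failureRate === 0.25
```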
The system has been running in production for two years with a 99.9% job completion rate.