
Building Scalable Task Queues with Redis and Node.js

Lessons learned from building a distributed task processing system handling 100K+ daily jobs.

Node.js · Redis · Architecture

When you need to process tasks asynchronously at scale, a robust queue system becomes essential. Here's what I learned building one that handles 100K+ jobs daily.

Why Not Just Use SQS?

Amazon SQS is great, but for our use case we needed:

- Sub-second job pickup latency
- Complex retry strategies per job type
- Real-time job progress tracking
- Priority queues

Redis with Bull gave us all of this with simpler operational overhead.

Architecture Overview

The system has three main components:

  1. **Producers** - API servers that enqueue jobs
  2. **Redis** - The queue storage and pub/sub backbone
  3. **Workers** - Horizontally scalable job processors
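In miniature, the producer/worker contract between those components looks like the sketch below. This is an in-memory stand-in for what Bull provides on top of Redis; the `TaskQueue` class is illustrative, not part of Bull's API.

```javascript
// Illustrative in-memory stand-in for the producer/worker contract.
// In the real system, Bull stores jobs in Redis and workers run in
// separate processes; here everything lives in one process for clarity.
class TaskQueue {
  constructor() {
    this.jobs = [];
  }
  // Producer side: API servers push job payloads onto the queue.
  enqueue(name, data) {
    this.jobs.push({ name, data });
  }
  // Worker side: processors drain jobs in FIFO order.
  dequeue() {
    return this.jobs.shift();
  }
}

const queue = new TaskQueue();
queue.enqueue('email', { to: 'user@example.com' });
queue.enqueue('email', { to: 'admin@example.com' });

const job = queue.dequeue();
console.log(job.data.to); // user@example.com
```

The point of the split is that producers and workers never talk to each other directly; they only agree on the shape of the job payload, which is what lets workers scale independently.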

Key Design Decisions

1. Job Persistence

While Redis is fast, we needed durability. Every job is:

- Written to Redis for processing
- Logged to PostgreSQL for an audit trail
- Persisted with its result after completion
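The dual-write pattern can be sketched as follows. The `queue` and `auditLog` objects below are in-memory stubs standing in for the real Bull queue and PostgreSQL client, and `enqueueWithAudit`/`recordCompletion` are illustrative names, not real APIs:

```javascript
// Stubs standing in for the real Bull queue and PostgreSQL client.
const queue = { jobs: [], add: (name, data) => queue.jobs.push({ name, data }) };
const auditLog = { rows: [], insert: (row) => auditLog.rows.push(row) };

// Dual-write: the job goes to Redis for processing AND to Postgres
// for the audit trail, so job history survives a Redis failure.
function enqueueWithAudit(name, data) {
  auditLog.insert({ name, data, status: 'enqueued', at: Date.now() });
  queue.add(name, data);
}

// After a worker finishes, the result is persisted as well.
function recordCompletion(name, result) {
  auditLog.insert({ name, result, status: 'completed', at: Date.now() });
}

enqueueWithAudit('email', { to: 'user@example.com' });
recordCompletion('email', { delivered: true });
console.log(auditLog.rows.length); // 2
```

In production you would also want the audit write and the enqueue to be reconciled if one of them fails; the sketch only shows the happy path.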

2. Retry Strategy

Different jobs need different retry behaviors:

```javascript
const Bull = require('bull');

const emailQueue = new Bull('email', {
  defaultJobOptions: {
    attempts: 5,
    backoff: {
      type: 'exponential',
      delay: 2000, // retries after 2s, 4s, 8s, 16s
    },
  },
});
```
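Bull's exponential backoff doubles the delay on each retry (base delay × 2^(retry − 1)), and `attempts: 5` means one initial attempt plus four retries. A small helper makes the schedule explicit; `backoffSchedule` is an illustrative function, not part of Bull:

```javascript
// Compute the retry delays Bull's 'exponential' backoff produces:
// baseDelay * 2^(retry - 1) for each retry after the first attempt.
function backoffSchedule(attempts, baseDelay) {
  const delays = [];
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(baseDelay * 2 ** (retry - 1));
  }
  return delays;
}

console.log(backoffSchedule(5, 2000)); // [ 2000, 4000, 8000, 16000 ]
```

Writing the schedule out like this is also a handy sanity check before picking `attempts` for a new job type: the total retry window above is 30 seconds, which may be far too short for a flaky third-party API.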

3. Worker Scaling

Workers scale based on queue depth using Kubernetes HPA:

```yaml
metrics:
  - type: External
    external:
      metric:
        name: redis_queue_depth
      target:
        type: AverageValue
        averageValue: "100"
```
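With an `AverageValue` target, the HPA sizes the deployment so that each replica sees roughly the target metric value, which for a total queue depth works out to approximately ceil(depth / target). A quick illustration of that math (`desiredReplicas` is just the formula, not the actual controller, which also applies stabilization windows and min/max bounds):

```javascript
// For an External metric with an AverageValue target, the HPA aims
// for (total metric / replicas) ≈ target, i.e. roughly ceil(total / target).
function desiredReplicas(queueDepth, targetPerReplica) {
  return Math.max(1, Math.ceil(queueDepth / targetPerReplica));
}

console.log(desiredReplicas(850, 100)); // 9
console.log(desiredReplicas(40, 100)); // 1
```

This is why the target value matters so much: with `averageValue: "100"`, a burst of 10,000 jobs asks for 100 workers, so the HPA's `maxReplicas` becomes the real safety limit.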

Lessons Learned

  1. **Always set job timeouts** - Stuck jobs will block workers
  2. **Use separate queues for different priorities** - Don't let bulk jobs block critical ones
  3. **Monitor everything** - Queue depth, processing time, failure rates
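For the first lesson, Bull accepts a per-job `timeout` option in milliseconds: a job still running past it is failed, freeing the worker. A sketch of such options (the values here are illustrative, not our production settings):

```javascript
// Job options with a hard timeout so a stuck job cannot hold a
// worker indefinitely; Bull fails the job once the timeout elapses,
// after which the normal retry/backoff policy takes over.
const criticalJobOptions = {
  attempts: 3,
  timeout: 30000, // fail the job if it runs longer than 30s
  backoff: { type: 'exponential', delay: 2000 },
};

console.log(criticalJobOptions.timeout); // 30000
```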

The system has been running in production for 2 years with 99.9% job completion rate.