How KEDA Works

KEDA (Kubernetes Event-driven Autoscaling) watches external event sources such as queue depth or custom metrics and scales Kubernetes workloads, such as Deployments, in response.

Components:

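  • KEDA Operator: Watches ScaledObjects, creates an HPA for each one, and itself scales the target between 0 and 1 replica (activation and deactivation)
  • Metrics Server: Exposes trigger metrics to the HPA through the Kubernetes external metrics API
  • ScaledObject: Custom resource that defines the scaling target, triggers, and replica bounds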
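The split of responsibilities can be sketched as one simplified polling cycle: KEDA's operator handles activation (0 → 1) and deactivation (→ 0) itself and leaves 1 ↔ N scaling to the HPA it creates. This is an illustrative Python model, not KEDA's code. The function name `reconcile` and its parameters are hypothetical, the trigger is assumed active whenever the queue holds at least one message, and HPA stabilization windows and rate limits are ignored.

```python
import math


def reconcile(current, queue_depth, idle_seconds, *, min_replicas=0,
              max_replicas=50, queue_length=5, cooldown_period=60):
    """Return (component, target_replicas) for one polling cycle.

    Hypothetical, simplified model: the KEDA operator owns 0 <-> 1,
    the HPA owns 1 <-> N. Stabilization windows are ignored.
    """
    active = queue_depth > 0  # assumes an activation threshold of 0 messages

    # Activation: only the operator moves a scale-to-zero workload off zero replicas.
    if current == 0 and min_replicas == 0:
        return ("keda-operator", 1 if active else 0)

    # Deactivation: idle for at least cooldown_period -> operator scales to zero.
    if not active and min_replicas == 0 and idle_seconds >= cooldown_period:
        return ("keda-operator", 0)

    # Otherwise the HPA sizes the workload; it never goes below 1 replica.
    desired = math.ceil(queue_depth / queue_length)
    return ("hpa", min(max_replicas, max(1, min_replicas, desired)))


print(reconcile(0, 100, 0))   # ('keda-operator', 1)
print(reconcile(1, 100, 0))   # ('hpa', 20)
print(reconcile(20, 0, 60))   # ('keda-operator', 0)
```

The HPA branch floors at one replica because a standard HPA cannot target zero; that gap is exactly what KEDA's activation logic fills.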

Asya Integration

The Asya operator creates a KEDA ScaledObject for each AsyncActor that has autoscaling enabled:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: text-processor
spec:
  scaleTargetRef:
    name: text-processor   # Deployment to scale
  minReplicaCount: 0       # Scale to zero when idle
  maxReplicaCount: 50      # Max replicas
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/.../asya-text-processor
      queueLength: "5"      # Target: 5 messages per replica
      awsRegion: us-east-1

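Formula: desiredReplicas = min(maxReplicaCount, max(minReplicaCount, ceil(queueDepth / queueLength)))

While the queue is non-empty, the HPA computes this value and never goes below 1 replica; scaling to 0 is done by KEDA itself once the queue has been empty for cooldownPeriod.

Example: 100 messages, queueLength=5 → ceil(100 / 5) = 20 replicas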
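The formula can be checked against the numbers used in this section with a short Python sketch. `desired_replicas` is a hypothetical helper; its defaults mirror the ScaledObject above (queueLength 5, minReplicaCount 0, maxReplicaCount 50), and the zero result stands for KEDA's scale-to-zero after the cooldown period, since the HPA alone never targets fewer than one replica.

```python
import math


def desired_replicas(queue_depth, queue_length=5, min_replicas=0, max_replicas=50):
    """Replica count implied by queue depth, clamped to [min_replicas, max_replicas]."""
    if queue_length <= 0:
        raise ValueError("queue_length must be positive")
    raw = math.ceil(queue_depth / queue_length)  # target messages per replica
    return max(min_replicas, min(max_replicas, raw))


for depth in (0, 10, 100, 250, 500):
    print(depth, "->", desired_replicas(depth))
# 0 -> 0
# 10 -> 2
# 100 -> 20
# 250 -> 50
# 500 -> 50
```

The last line shows the cap: 500 messages would imply 100 replicas, but maxReplicaCount limits the result to 50.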

Benefits

Scale to zero:

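  • 0 messages → 0 pods → no pod compute spend for that actor (node costs drop too only if a cluster autoscaler removes the idle nodes)
  • Queue fills → KEDA activates the Deployment on its next poll and the HPA scales out toward maxReplicas; time to ready pods depends on image pull, startup, and node provisioning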

Independent scaling:

  • Each actor scales based on its own queue depth
  • A data-loader actor scales independently of an LLM-inference actor

Cost optimization:

  • GPU pods run only while their queue has work
  • No always-on warm pool; the trade-off is cold-start latency when an actor scales up from zero

Handle bursts:

  • Automatic scale-out when traffic spikes
  • Gradual scale-down when load decreases (HPA scale-down stabilization down to 1 replica, then scale to zero after cooldownPeriod)

Configuration

Scaling is configured in the AsyncActor spec:

spec:
  scaling:
    enabled: true            # Enable KEDA autoscaling
    minReplicas: 0           # Minimum pods (0 for scale-to-zero)
    maxReplicas: 100         # Maximum pods
    queueLength: 5           # Target messages per replica
    cooldownPeriod: 60       # Seconds the queue must stay empty before scaling to zero (default: 60s)
    pollingInterval: 10      # How often KEDA checks queue depth (default: 10s)

Parameters:

  • enabled: Enable/disable KEDA autoscaling (default: false)
  • minReplicas: Minimum pods (default: 0 for scale-to-zero)
  • maxReplicas: Maximum pods (default: 50)
  • queueLength: Target messages per replica (default: 5)
  • cooldownPeriod: Seconds to wait after the queue was last non-empty before scaling to zero (default: 60); scale-down between N and 1 replicas is governed by the HPA
  • pollingInterval: Queue check frequency in seconds (default: 10)
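The operator's translation from the AsyncActor `scaling` block to a ScaledObject might look roughly like the Python sketch below. The one-to-one field mapping (minReplicas → minReplicaCount, maxReplicas → maxReplicaCount, and so on), the function name `scaled_object_for`, and the choice of an SQS trigger are assumptions for illustration, not Asya's actual operator code; the defaults are the ones listed above.

```python
DEFAULTS = {
    "enabled": False,
    "minReplicas": 0,
    "maxReplicas": 50,
    "queueLength": 5,
    "cooldownPeriod": 60,
    "pollingInterval": 10,
}


def scaled_object_for(name, scaling, queue_url, aws_region):
    """Build a ScaledObject manifest (as a dict) from an AsyncActor scaling spec.

    Hypothetical 1:1 field mapping; returns None when autoscaling is disabled.
    """
    cfg = {**DEFAULTS, **scaling}  # user values override the documented defaults
    if not cfg["enabled"]:
        return None
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {"name": name},
            "minReplicaCount": cfg["minReplicas"],
            "maxReplicaCount": cfg["maxReplicas"],
            "pollingInterval": cfg["pollingInterval"],
            "cooldownPeriod": cfg["cooldownPeriod"],
            "triggers": [{
                "type": "aws-sqs-queue",
                "metadata": {
                    "queueURL": queue_url,
                    "queueLength": str(cfg["queueLength"]),  # trigger metadata values are strings
                    "awsRegion": aws_region,
                },
            }],
        },
    }


obj = scaled_object_for("text-processor", {"enabled": True, "maxReplicas": 100},
                        queue_url="<queue-url>", aws_region="us-east-1")
print(obj["spec"]["maxReplicaCount"], obj["spec"]["cooldownPeriod"])  # 100 60
```

Merging the user's values over a defaults table keeps the documented defaults in one place; a disabled actor yields no ScaledObject at all.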

Scaling Scenarios

Idle Workload

  • Queue: 0 messages
  • Replicas: 0 (minReplicas=0)
  • Cost: no pod compute for this actor

Low Load

  • Queue: 10 messages, queueLength=5
  • Replicas: 2
  • Processing: ~5 messages per replica

High Load

  • Queue: 250 messages, queueLength=5
  • Replicas: 50 (250 / 5 = 50, which equals maxReplicas=50; a deeper queue would not add replicas)
  • Processing: ~5 messages per replica

Burst

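  • Queue suddenly: 500 messages (ceil(500 / 5) = 100 desired, capped at maxReplicas=50)
  • KEDA activates the Deployment (0 → 1) on its next poll, then the HPA ramps toward 50; with the default HPA scale-up policy (at most +100% or +4 pods per 15 s, whichever is larger) this takes several steps, plus pod startup time
  • After processing: Queue drains → HPA scales down to 1 → KEDA scales to 0 after cooldownPeriod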
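To see why the ramp takes more than one step, the sketch below replays the Kubernetes HPA default scale-up policy: per 15-second period, add the larger of 100% of the current replicas or 4 pods. It is a simplified Python model that assumes one scaling step per period, instant pod readiness, and a queue that stays above the target; `scale_up_steps` is a hypothetical helper.

```python
def scale_up_steps(start, target, period_seconds=15):
    """Replica targets per period under the default HPA scale-up policy (simplified)."""
    if start < 1:
        raise ValueError("the HPA only acts once KEDA has activated the workload (>= 1 replica)")
    steps = [start]
    while steps[-1] < target:
        current = steps[-1]
        # Default policies: +100% or +4 pods per period; the HPA picks the larger (selectPolicy: Max).
        steps.append(min(target, max(current * 2, current + 4)))
    return steps, (len(steps) - 1) * period_seconds


steps, seconds = scale_up_steps(1, 50)
print(steps, seconds)  # [1, 5, 10, 20, 40, 50] 75
```

Under these assumptions the replica target reaches 50 about 75 seconds after activation; real timings also include KEDA's polling interval, metric lag, and pod startup.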

Transport-Specific Triggers

SQS

triggers:
- type: aws-sqs-queue
  metadata:
    queueURL: https://sqs.us-east-1.amazonaws.com/.../asya-actor
    queueLength: "5"
    awsRegion: us-east-1

RabbitMQ

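triggers:
- type: rabbitmq
  metadata:
    host: amqp://rabbitmq:5672
    queueName: asya-actor
    queueLength: "5"

Note: recent KEDA releases mark the rabbitmq trigger's queueLength field as deprecated in favor of mode: QueueLength with value: "5".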
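Both trigger shapes can be produced from one queue-length setting; the Python sketch below shows one way to do it. `build_trigger` and its keyword arguments are hypothetical, and the RabbitMQ branch emits the `mode`/`value` form that KEDA documents as the replacement for `queueLength`; if an operator targets an older KEDA release that still expects `queueLength`, that branch would need adjusting.

```python
def build_trigger(transport, queue_length, **conn):
    """Return a KEDA trigger dict for a queue transport (hypothetical helper)."""
    target = str(queue_length)  # trigger metadata values are strings
    if transport == "sqs":
        return {
            "type": "aws-sqs-queue",
            "metadata": {
                "queueURL": conn["queue_url"],
                "queueLength": target,
                "awsRegion": conn["aws_region"],
            },
        }
    if transport == "rabbitmq":
        return {
            "type": "rabbitmq",
            "metadata": {
                "host": conn["host"],
                "queueName": conn["queue_name"],
                "mode": "QueueLength",  # replaces the deprecated queueLength field
                "value": target,
            },
        }
    raise ValueError(f"unsupported transport: {transport!r}")


print(build_trigger("rabbitmq", 5, host="amqp://rabbitmq:5672", queue_name="asya-actor"))
```

Keeping the per-transport differences in one function means the scaling spec only ever carries a single queueLength number.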

Monitoring Autoscaling

# Watch HPA status (KEDA names the HPA keda-hpa-<ScaledObject name>)
kubectl get hpa -w

# View ScaledObject
kubectl get scaledobject text-processor -o yaml

# List external metrics served by KEDA's metrics server
kubectl get --raw /apis/external.metrics.k8s.io/v1beta1
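To read the current value for one ScaledObject, KEDA's troubleshooting guidance queries a specific metric with a label selector, for example `kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/<namespace>/<metric-name>?labelSelector=scaledobject.keda.sh%2Fname%3Dtext-processor"`, where the metric name is listed in the ScaledObject's `status.externalMetricNames`. The response is an `ExternalMetricValueList`. The Python sketch below parses such a response; `metric_values` is a hypothetical helper, it handles only plain and milli-unit (`m`) quantities, and the sample payload is illustrative rather than captured output.

```python
import json


def parse_quantity(quantity):
    """Parse plain numbers and milli-units ("1500m" -> 1.5); other suffixes are not handled."""
    if quantity.endswith("m"):
        return float(quantity[:-1]) / 1000
    return float(quantity)


def metric_values(raw_json):
    """Map metricName -> numeric value from an ExternalMetricValueList response."""
    data = json.loads(raw_json)
    if data.get("kind") != "ExternalMetricValueList":
        raise ValueError(f"unexpected kind: {data.get('kind')!r}")
    return {item["metricName"]: parse_quantity(item["value"]) for item in data.get("items", [])}


# Illustrative payload shape; the metric name is a placeholder.
sample = """{
  "kind": "ExternalMetricValueList",
  "apiVersion": "external.metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {"metricName": "example-metric", "metricLabels": null,
     "timestamp": "2024-01-01T00:00:00Z", "value": "100"}
  ]
}"""
print(metric_values(sample))  # {'example-metric': 100.0}
```

Comparing this value with queueLength times the current replica count shows whether the HPA should be scaling out or in.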

See: observability.md for autoscaling metrics.