Deploy and manage Asya🎭 infrastructure.

Overview

As a platform engineer, you:

  • Deploy Asya operator and gateway
  • Configure transports (SQS, RabbitMQ)
  • Manage IAM roles and permissions
  • Monitor system health
  • Support data science teams

Prerequisites

  • Kubernetes cluster (EKS, GKE, Kind)
  • kubectl and Helm configured
  • Transport backend (SQS + S3 or RabbitMQ + MinIO)
  • KEDA installed

Quick Start

1. Install KEDA

helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace

2. Install CRDs

kubectl apply -f src/asya-operator/config/crd/

3. Configure Transports

For AWS (SQS):

# operator-values.yaml
transports:
  sqs:
    enabled: true
    type: sqs
    config:
      region: us-east-1

serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/asya-operator-role
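
The role referenced in the annotation is what lets the operator create and manage SQS queues (the "Queue Not Created" troubleshooting section lists missing IAM permissions as a common cause). A minimal policy sketch, assuming the asya- queue prefix used elsewhere in this guide; adjust region, account, and actions to your setup:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sqs:CreateQueue",
        "sqs:DeleteQueue",
        "sqs:TagQueue",
        "sqs:GetQueueUrl",
        "sqs:GetQueueAttributes",
        "sqs:SetQueueAttributes"
      ],
      "Resource": "arn:aws:sqs:us-east-1:ACCOUNT:asya-*"
    }
  ]
}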

For self-hosted (RabbitMQ):

# operator-values.yaml
transports:
  rabbitmq:
    enabled: true
    type: rabbitmq
    config:
      host: rabbitmq.default.svc.cluster.local
      port: 5672
      username: guest
      passwordSecretRef:
        name: rabbitmq-secret
        key: password
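
The passwordSecretRef above points to a Kubernetes secret that must exist in the operator's namespace before installation; a minimal sketch, assuming the asya-system namespace used in the next step:

kubectl create secret generic rabbitmq-secret \
  --namespace asya-system \
  --from-literal=password=YOUR_PASSWORD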

4. Install Operator

helm install asya-operator deploy/helm-charts/asya-operator/ \
  -n asya-system --create-namespace \
  -f operator-values.yaml

5. Install Gateway (Optional)

# gateway-values.yaml
config:
  sqsRegion: us-east-1  # or skip for RabbitMQ
  postgresHost: postgres.default.svc.cluster.local
  postgresDatabase: asya_gateway

routes:
  tools:
  - name: example
    description: Example tool
    parameters:
      text:
        type: string
        required: true
    route: [example-actor]

helm install asya-gateway deploy/helm-charts/asya-gateway/ -f gateway-values.yaml

6. Install Crew Actors

# crew-values.yaml
happy-end:
  enabled: true
  transport: sqs  # or rabbitmq
  workload:
    template:
      spec:
        containers:
        - name: asya-runtime
          env:
          - name: ASYA_HANDLER
            value: handlers.end_handlers.happy_end_handler
          - name: ASYA_S3_BUCKET
            value: asya-results
          # For MinIO:
          # - name: ASYA_S3_ENDPOINT
          #   value: http://minio:9000
          # - name: ASYA_S3_ACCESS_KEY
          #   value: minioadmin
          # - name: ASYA_S3_SECRET_KEY
          #   value: minioadmin

error-end:
  enabled: true
  transport: sqs  # or rabbitmq
  workload:
    template:
      spec:
        containers:
        - name: asya-runtime
          env:
          - name: ASYA_HANDLER
            value: handlers.end_handlers.error_end_handler
          - name: ASYA_S3_BUCKET
            value: asya-results

helm install asya-crew deploy/helm-charts/asya-crew/ -f crew-values.yaml

7. Verify Installation

# Check operator
kubectl get pods -n asya-system

# Check KEDA
kubectl get pods -n keda

# Check CRDs
kubectl get crd | grep asya

# Check crew actors
kubectl get asya

Supporting Data Science Teams

Provide Template

Share this AsyncActor template with DS teams:

apiVersion: asya.sh/v1alpha1
kind: AsyncActor
metadata:
  name: my-actor
spec:
  transport: sqs  # or rabbitmq
  scaling:
    enabled: true
    minReplicas: 0
    maxReplicas: 50
    queueLength: 5
  workload:
    kind: Deployment
    template:
      spec:
        containers:
        - name: asya-runtime
          image: YOUR_IMAGE:TAG
          env:
          - name: ASYA_HANDLER
            value: "module.function"
            # For class handlers: "module.Class.method"
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"

Key fields to explain:

  • spec.transport - Which transport to use (ask platform team)
  • spec.scaling.enabled - Enable KEDA autoscaling (default: false)
  • spec.scaling.minReplicas - Minimum pods (0 for scale-to-zero)
  • spec.scaling.maxReplicas - Maximum pods
  • spec.scaling.queueLength - Messages per replica target
  • spec.workload.kind - Deployment or StatefulSet
  • env.ASYA_HANDLER - Handler path (module.function or module.Class.method)
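
Once filled in, teams apply the manifest and check status with standard kubectl commands (my-actor.yaml is a placeholder filename; asya is the resource short name used throughout this guide):

kubectl apply -f my-actor.yaml
kubectl get asya my-actor
kubectl describe asya my-actor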

Configure Gateway Tools

Add tools for DS teams to call:

# gateway-values.yaml
routes:
  tools:
  - name: text-processor
    description: Process text with ML model
    parameters:
      text:
        type: string
        required: true
      model:
        type: string
        default: "default"
    route: [text-preprocess, text-infer, text-postprocess]

Grant Access

AWS (IRSA): Configure the IAM role annotation on the AsyncActor:

metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/asya-actor-role

AWS (Pod Identity): Create a pod identity association:

aws eks create-pod-identity-association \
  --cluster-name my-cluster \
  --namespace default \
  --service-account asya-my-actor \
  --role-arn arn:aws:iam::ACCOUNT:role/asya-actor-role
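
With either mechanism, the attached role still needs SQS access for the actor's queue and, if results are written to S3, the results bucket. A minimal policy sketch, assuming the asya- queue prefix and asya-results bucket used earlier (adjust region, account, ARNs, and actions to your setup):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:SendMessage",
        "sqs:GetQueueUrl",
        "sqs:GetQueueAttributes"
      ],
      "Resource": "arn:aws:sqs:us-east-1:ACCOUNT:asya-*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::asya-results/*"
    }
  ]
}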

RabbitMQ: Provide credentials via a Kubernetes secret:

kubectl create secret generic rabbitmq-secret \
  --from-literal=password=YOUR_PASSWORD

Monitoring

Prometheus Metrics

Important: The operator does NOT automatically create ServiceMonitors. You must configure Prometheus scraping yourself.

Key sidecar metrics (namespace: asya_actor):

  • asya_actor_processing_duration_seconds{queue} - Processing time
  • asya_actor_messages_processed_total{queue, status} - Messages processed
  • asya_actor_messages_failed_total{queue, reason} - Failed messages
  • asya_actor_runtime_errors_total{queue, error_type} - Runtime errors

Key operator metrics:

  • controller_runtime_reconcile_total{controller="asyncactor"} - Reconciliations
  • controller_runtime_reconcile_errors_total{controller="asyncactor"} - Errors

KEDA metrics:

  • keda_scaler_active{scaledObject} - Active scalers
  • keda_scaler_metrics_value{scaledObject} - Queue depth

Prometheus Configuration

ServiceMonitor (Prometheus Operator):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: asya-actors
spec:
  selector:
    matchExpressions:
    - key: asya.sh/actor
      operator: Exists
  endpoints:
  - port: metrics
    path: /metrics
    interval: 30s

Scrape config (standard Prometheus):

scrape_configs:
- job_name: asya-actors
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_asya_sh_actor]
    action: keep
    regex: .+
  - source_labels: [__meta_kubernetes_pod_container_name]
    action: keep
    regex: asya-sidecar
  - source_labels: [__address__]
    action: replace
    regex: ([^:]+)(?::\d+)?
    replacement: $1:8080
    target_label: __address__

Grafana Dashboards

Example queries:

Actor throughput:

rate(asya_actor_messages_processed_total{queue="asya-my-actor"}[5m])

Queue depth:

keda_scaler_metrics_value{scaledObject="my-actor"}

Error rate:

rate(asya_actor_messages_failed_total{queue="asya-my-actor"}[5m])

P95 latency:

histogram_quantile(0.95, rate(asya_actor_processing_duration_seconds_bucket{queue="asya-my-actor"}[5m]))
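
The same expressions can drive alerting. A sketch of a PrometheusRule for the Prometheus Operator (the alert name, threshold, and duration below are placeholders, not shipped defaults):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: asya-actors
spec:
  groups:
  - name: asya-actors
    rules:
    - alert: AsyaActorHighFailureRate
      expr: rate(asya_actor_messages_failed_total[5m]) > 0.1
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Actor queue {{ $labels.queue }} is failing messages"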

See: ../operate/monitoring.md for complete metrics and alerts.

Logging

View operator logs:

kubectl logs -n asya-system deploy/asya-operator -f

View actor logs:

# Runtime logs (handler output)
kubectl logs -l asya.sh/actor=my-actor -c asya-runtime -f

# Sidecar logs (routing, transport)
kubectl logs -l asya.sh/actor=my-actor -c asya-sidecar -f

Troubleshooting

Queue Not Created

# Check operator logs
kubectl logs -n asya-system deploy/asya-operator

# Check AsyncActor status
kubectl describe asya my-actor

# Check conditions
kubectl get asya my-actor -o jsonpath='{.status.conditions}'

Common causes:

  • Transport not enabled in operator config
  • Missing IAM permissions (SQS)
  • RabbitMQ connection failure
  • AWS SQS 60-second cooldown after queue deletion
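
For SQS, you can also confirm from the AWS side whether the queue was actually created (the asya- prefix matches the queue names used elsewhere in this guide):

aws sqs list-queues --queue-name-prefix asya-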

Actor Not Scaling

# Check ScaledObject
kubectl get scaledobject my-actor -o yaml
kubectl describe scaledobject my-actor

# Check HPA created by KEDA
kubectl get hpa

# Check KEDA operator logs
kubectl logs -n keda deploy/keda-operator

Common causes:

  • spec.scaling.enabled not set to true
  • KEDA not installed
  • Queue doesn't exist
  • Missing IAM permissions for KEDA to read queue metrics
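
For SQS transports it also helps to confirm that messages are actually waiting, since KEDA scales on queue depth (the queue URL below is a placeholder):

aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/ACCOUNT/asya-my-actor \
  --attribute-names ApproximateNumberOfMessages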

Sidecar Connection Errors

# Check sidecar logs
kubectl logs deploy/my-actor -c asya-sidecar

# Check transport config
kubectl get asya my-actor -o jsonpath='{.spec.transport}'

Common issues:

  • Wrong transport configured (sqs vs rabbitmq)
  • Missing IAM permissions (SQS)
  • RabbitMQ credentials incorrect
  • Queue doesn't exist
  • Network connectivity issues
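
For RabbitMQ, a quick in-cluster connectivity test helps separate credential problems from network problems (host and port taken from the transport config in the Quick Start):

kubectl run -it --rm nc-test --image=busybox --restart=Never -- \
  nc -zv rabbitmq.default.svc.cluster.local 5672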

Runtime Errors

# Check runtime logs
kubectl logs deploy/my-actor -c asya-runtime

# Check handler config
kubectl get asya my-actor -o jsonpath='{.spec.workload.template.spec.containers[?(@.name=="asya-runtime")].env}'

Common issues:

  • Wrong ASYA_HANDLER value (handler not found)
  • Missing Python dependencies in image
  • Class handler __init__ parameters missing defaults
  • OOM errors (insufficient memory)
  • CUDA OOM (GPU memory exhausted)
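
OOM kills show up in the pod's last container state rather than in the logs; this prints OOMKilled if memory was the problem (label selector as in the logging examples above):

kubectl get pods -l asya.sh/actor=my-actor \
  -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}'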

Pod Status Issues

Check AsyncActor status for detailed error information:

kubectl get asya my-actor

Status values:

  • Running - Healthy
  • Napping - Scaled to zero (normal with minReplicas=0)
  • Creating - Initial deployment
  • RuntimeError - Runtime container crashing
  • SidecarError - Sidecar container crashing
  • ImagePullError - Cannot pull image
  • TransportError - Queue/transport issues

Scaling Configuration

Queue-based Autoscaling (KEDA)

spec:
  scaling:
    enabled: true
    minReplicas: 0          # Scale to zero when idle
    maxReplicas: 50         # Max replicas
    queueLength: 5          # Target: 5 messages per replica
    pollingInterval: 10     # Check queue every 10s
    cooldownPeriod: 60      # Wait 60s before scaling down

Formula: desiredReplicas = ceil(queueDepth / queueLength)

Example: 100 messages, queueLength=5 → 20 replicas

GPU Workloads

spec:
  workload:
    template:
      spec:
        containers:
        - name: asya-runtime
          resources:
            limits:
              nvidia.com/gpu: 1
        nodeSelector:
          nvidia.com/gpu: "true"
        tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule

Note: Ensure a GPU node group exists and the NVIDIA device plugin is installed.
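
A quick way to verify both, assuming nodes carry the same nvidia.com/gpu=true label used in the nodeSelector above (allocatable nvidia.com/gpu should be non-zero once the device plugin is running):

# List GPU-labeled nodes
kubectl get nodes -l nvidia.com/gpu=true

# Check allocatable GPU resources
kubectl describe nodes -l nvidia.com/gpu=true | grep -i "nvidia.com/gpu"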

StatefulSet (for stateful workloads)

spec:
  workload:
    kind: StatefulSet
    template:
      spec:
        containers:
        - name: asya-runtime
          volumeMounts:
          - name: data
            mountPath: /data
    volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi

Cost Optimization

Enable scale-to-zero:

spec:
  scaling:
    enabled: true
    minReplicas: 0  # $0 when idle

Set an appropriate queueLength:

  • Higher = fewer pods, slower processing, lower cost
  • Lower = more pods, faster processing, higher cost

Examples:

  • queueLength: 5 → 100 messages = 20 pods
  • queueLength: 10 → 100 messages = 10 pods
  • queueLength: 20 → 100 messages = 5 pods

Use Spot Instances (AWS):

eksctl create nodegroup \
  --cluster my-cluster \
  --spot \
  --instance-types g4dn.xlarge \
  --nodes-min 0 \
  --nodes-max 10

SQS cost optimization:

  • First 1M requests/month free
  • $0.40 per million requests after
  • No idle costs (pay per use)
  • Scale to zero = $0

Upgrades

# Upgrade operator
helm upgrade asya-operator deploy/helm-charts/asya-operator/ \
  -n asya-system \
  -f operator-values.yaml

# Upgrade gateway
helm upgrade asya-gateway deploy/helm-charts/asya-gateway/ \
  -f gateway-values.yaml

# Upgrade crew
helm upgrade asya-crew deploy/helm-charts/asya-crew/ \
  -f crew-values.yaml

Important: Always upgrade the operator before upgrading actors. AsyncActors may need to be reconciled after an operator upgrade.
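
Because the CRDs were applied with kubectl rather than managed by the Helm release, helm upgrade will not update them; re-apply them before upgrading the operator:

kubectl apply -f src/asya-operator/config/crd/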

Next Steps