The Problem

AI/ML teams building production pipelines face fundamental architectural challenges:

  • Logic Entanglement: Pipeline orchestration (if/else, retries, error handling) is mixed with business logic (AI/ML inference, data processing, API calls, decision making). Frameworks require code instrumentation (@flow, @step decorators), making it impossible to test components independently.
  • Scaling Limitations: All components scale together, wasting resources. GPU pods sit idle 80% of the time but still cost thousands of dollars per month. Different pipeline stages cannot be deployed or scaled independently.
  • Infrastructure Complexity: Data scientists manage infrastructure code alongside ML code. Platform engineers struggle to operate heterogeneous deployment patterns at scale. Batch and streaming require completely different frameworks.
  • Vendor Lock-in: API rate limits throttle throughput. Model changes break pipelines. Costs scale unpredictably.

Core issue: Traditional approaches treat AI workloads as HTTP services when they're actually batch processing jobs that need queue-based semantics.

Avoiding this coupling is second nature to backend engineers, but it runs against the grain of typical data science workflows.
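
For a concrete feel of logic entanglement, here is a hypothetical, framework-free pipeline step (all function names are invented stand-ins): retries, backoff, and branching are interleaved with the inference and data-handling code they wrap.

```python
import time

def extract_text(doc: dict) -> str:      # stand-in for real parsing
    return doc.get("text", "")

def summarize(text: str) -> str:         # stand-in for an LLM call
    return text[:1000]

def classify(text: str) -> dict:         # stand-in for model inference
    return {"label": "ok", "chars": len(text)}

def process_document(doc: dict) -> dict:
    # Orchestration (retries, backoff, branching) is interleaved with the
    # business logic it wraps, so neither can be tested or scaled alone.
    for attempt in range(3):              # retry policy: orchestration
        try:
            text = extract_text(doc)      # business logic
            if len(text) > 10_000:        # routing decision: orchestration
                text = summarize(text)    # business logic
            return classify(text)         # business logic
        except TimeoutError:
            time.sleep(2 ** attempt)      # backoff: orchestration
    raise RuntimeError("document failed after 3 attempts")
```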

Traditional async request-response pattern

❌ Traditional request-response pattern: Clients orchestrate workflows, hold state in memory, get stuck on failures. Servers scale independently but clients waste resources waiting.

Asya🎭 async actor pattern

✅ Asya🎭 pattern: Actors scale independently based on queue depth. Messages flow through pipeline. Errors route to DLQ. No client orchestration - framework handles everything.

What is Asya🎭?

Asya🎭 is a Kubernetes-native async actor framework for orchestrating complex near-realtime AI pipelines at scale.

Core principle: Decouple pipeline logic, infrastructure logic, and component logic.

  • Each component is an independent actor
  • Each actor has a sidecar (routing logic) + runtime (user code)
  • Zero pip dependencies - a radically simple interface for data scientists (sketched below)
  • Actors communicate via async message passing (pluggable transports: SQS, RabbitMQ)
  • Pipeline structure is data, not code - defined by the route each message carries
  • Built-in observability, reliability, extensibility, scalability
  • Optional MCP HTTP gateway for easy integration
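
To make "zero pip dependencies" concrete, here is a minimal sketch of what an actor's user code could look like, assuming the runtime simply invokes a plain Python function (the handler signature is an assumption, not the documented interface):

```python
def handler(payload: dict) -> dict:
    # Pure business logic: no framework imports, no queue handling, no retries.
    # The sidecar receives the message, invokes this function, and forwards
    # the result to the next hop in the message's route.
    embedding = [float(ord(c)) for c in payload["text"][:8]]  # toy "inference"
    return {**payload, "embedding": embedding}
```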

When to Use Asya🎭

Good Fit

Kubernetes-native deployments
  • Already running on K8s or planning to migrate

Near-realtime data processing
  • Latency requirements: seconds to minutes per component
  • Total pipeline: tens or hundreds of components with very different latencies (ms to minutes each)

Mixed workload types
  • Self-hosted AI components (LLMs, vision models)
  • Data processing components
  • Backend engineering components

Bursty workloads
  • Unpredictable traffic patterns
  • Need for cost optimization through scale-to-zero
  • GPU-intensive tasks requiring independent scaling

Resilient processing
  • Automatic retries, dead-letter queues
  • Built-in error handling

Easy integration
  • Configurable MCP Gateway for HTTP API access out of the box (see the sketch below)
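
As a sketch of how a client might reach a pipeline through the gateway, assuming a hypothetical /pipelines/<name> HTTP endpoint (the real gateway's paths and payload shape may differ):

```python
import json
import urllib.request

# Endpoint path and payload shape are illustrative assumptions.
req = urllib.request.Request(
    "http://asya-gateway.internal/pipelines/embed-and-classify",
    data=json.dumps({"text": "hello"}).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:  # returns once the message is enqueued
    print(json.load(resp))
```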

Not Good Fit

Synchronous request-response APIs
  • Use HTTP services (KServe, Seldon, BentoML) instead

Sub-second latency requirements
  • Queue overhead adds ~10-500ms
  • Scale-to-zero inevitably brings delays due to pod startup time (we're working on minimizing it)

Simple single-step processing
  • Operational complexity overhead may not be worth it

Stateful workflows requiring session affinity
  • Actors shine when they are stateless

Problems Asya🎭 Solves

No Single Point of Failure

  • Fully distributed architecture
  • No central orchestrator/DAG/flow
  • Messages carry their own routes (see the envelope sketch below)
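
A sketch of such a self-routing message, with field names that are illustrative assumptions rather than the framework's wire format:

```python
# Field names are assumptions, not the framework's wire format.
message = {
    "payload": {"doc_id": "a1", "text": "..."},
    "route": ["preprocess", "embed", "classify", "persist"],  # remaining hops
    "attempt": 0,
}
# Each sidecar processes the first hop it owns and enqueues the result to the
# next actor's queue; an empty route means the pipeline is finished.
```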

Separation of Concerns

  • Pipeline structure: data, not code (no @flow decorators)
  • Infrastructure layer: K8s-native, zero infrastructure management for data scientists
  • Component logic: fully controlled by data scientists

Scalability

  • Each component independently scalable based on queue depth or custom metrics
  • Scale to zero prevents wasted GPU costs
  • KEDA-based autoscaling (see the sketch below)
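
Conceptually, queue-depth autoscaling reduces to the calculation below (a simplified sketch of what a KEDA queue scaler computes, not Asya's actual configuration):

```python
import math

def desired_replicas(queue_depth: int, target_per_replica: int = 5,
                     max_replicas: int = 32) -> int:
    # Roughly what KEDA's queue scalers compute: enough replicas to keep the
    # per-pod backlog near the target, and zero replicas when the queue is empty.
    if queue_depth == 0:
        return 0              # scale-to-zero: no idle GPU pods
    return min(max_replicas, math.ceil(queue_depth / target_per_replica))
```

Scale-to-zero falls out naturally: an empty queue means zero replicas, so idle GPU pods cost nothing.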

Extensibility

  • Pluggable transports (SQS, RabbitMQ; Kafka planned) - interface sketched below
  • Easy integration with open-source tools
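
A pluggable transport boils down to a small send/receive/ack surface; the interface below is a hypothetical sketch, not the framework's actual API:

```python
from typing import Protocol

class Transport(Protocol):
    """One implementation per broker (SQS, RabbitMQ, later Kafka)."""

    def send(self, queue: str, body: bytes) -> None: ...

    def receive(self, queue: str, max_messages: int = 1) -> list[tuple[str, bytes]]:
        """Return (receipt_handle, body) pairs; unacked messages are redelivered."""
        ...

    def ack(self, queue: str, receipt_handle: str) -> None: ...
```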

Observability

  • Built-in metrics for actors, sidecars, runtimes, operator, gateway
  • OpenTelemetry integration (see the sketch below)
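
For example, per-message tracing with the standard OpenTelemetry Python API might look like this in user code (a sketch; Asya's built-in instrumentation may differ):

```python
from opentelemetry import trace  # standard OpenTelemetry Python API

tracer = trace.get_tracer("asya.runtime")

def traced_handle(handler, payload: dict) -> dict:
    # Wrap a handler invocation in a span so each message hop shows up
    # in the trace alongside sidecar and gateway spans.
    with tracer.start_as_current_span("actor.handle") as span:
        span.set_attribute("message.payload_keys", ",".join(payload))
        return handler(payload)
```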

Reliability

  • Built-in retries, DLQs, error handling
  • At-least-once delivery guarantees (sketched below)
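
A simplified sketch of the retry-then-DLQ behavior, reusing the hypothetical transport interface from above (names and limits are illustrative):

```python
import json

MAX_ATTEMPTS = 3

def deliver_once(transport, queue: str, dlq: str, handler) -> None:
    # Pull a batch, run the handler, and on failure either redeliver the
    # message or, after MAX_ATTEMPTS, route it to the dead-letter queue.
    for receipt, body in transport.receive(queue):
        message = json.loads(body)
        try:
            handler(message["payload"])
        except Exception:
            message["attempt"] += 1
            target = dlq if message["attempt"] >= MAX_ATTEMPTS else queue
            transport.send(target, json.dumps(message).encode())
        transport.ack(queue, receipt)
```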

Usability

  • Zero infrastructure management for data scientists
  • Easy for platform engineers to operate at scale (K8s-native)

Problems Asya🎭 Does NOT Solve

  • Pre-defined AI components: Asya doesn't provide inference runtimes - integrate with existing ones (KAITO, LLM-d)
  • CI/CD: Bring your own deployment pipeline
  • Data storage: Bring your own databases, object stores
  • Data processing frameworks: data scientists build their own runtimes
  • Synchronous HTTP APIs: Cannot compete with ms-latency LLM deployments due to queue overhead
  • Managed service: Bring your own K8s cluster

Existing Solutions Comparison

Workflow Orchestrators (Airflow, Prefect, Dagster, Kubeflow Pipelines, Temporal)

  • Monolithic orchestrators with central DAG/flow definition
  • Not truly async - state held in orchestrator
  • Hard to scale different components independently
  • Hard to deploy components independently

Actor Frameworks (Dapr)

  • K8s-native async actor framework
  • Not designed for data science workloads
  • Lacks built-in AI pipeline features such as queue-depth autoscaling, DLQ-based reliability, and pipeline-level observability

Custom K8s Solutions

  • Require significant engineering effort to build and maintain
  • Lack standardized patterns for AI orchestration

LLM Deployment Tools (KAITO, LLM-d)

  • Perfect for deploying LLMs as REST APIs
  • Asya integrates with these via HTTP calls from actors

Key Insight

Traditional architectures treat AI workloads as HTTP services. AI workloads are actually batch processing jobs with unique requirements:

  • Expensive GPU compute sitting idle
  • Mixed latencies (10ms preprocessing + 30s inference)
  • Bursty traffic patterns (10x spikes during business hours)
  • Multi-step dependencies needing orchestration

Such systems are therefore best operated asynchronously, but building that asynchrony by hand is tricky.

Async actors invert control: messages carry their own routes instead of clients orchestrating, and actors scale with the work available instead of running as always-on servers.
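
A minimal sketch of that inversion, with illustrative names: after a handler finishes, the sidecar, not the client, advances the message to the next hop of its own route.

```python
import json

def forward(transport, message: dict, result: dict) -> None:
    # The sidecar, not the client, advances the message along its route.
    remaining = message["route"][1:]          # drop the hop just completed
    if not remaining:
        return                                # end of pipeline
    transport.send(remaining[0], json.dumps({
        "payload": result,
        "route": remaining,
        "attempt": 0,
    }).encode())
```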