A step‑by‑step guide to building, testing, and deploying production‑grade AI agents with LangChain, LangGraph, and Haystack on Kubernetes, with coverage of observability, reliability, and cost control.
TL;DR – The Go‑to Stack for Production‑Grade AI Agents
| Goal | Core Framework | Orchestration / Ops | Why It Works in Production |
|---|---|---|---|
| Prototype → Production | LangChain + LangGraph (Python/JS) | Kubernetes + Docker + CI/CD (GitHub/GitLab) + Observability (Prometheus + Grafana, OpenTelemetry) | Rich component library, native LLM & tool abstractions, graph‑based “agent‑flow” DSL that compiles to deterministic pipelines, strong community support. |
| Enterprise‑scale RAG / Multi‑modal | Haystack (or custom DeepSpeed stack) | K8s + Airflow + Service Mesh (Istio) + Vector DB (Weaviate/PGVector) + Vault | End‑to‑end retrieval‑augmented generation, scalable indexing, built‑in pipelines, easy plug‑in of fine‑tuned models. |
| Fully Managed, Low‑Ops | OpenAI (or Azure) Assistants API | Serverless (Azure Functions/Cloudflare Workers) + Managed logging (Azure Monitor) | No model‑hosting, built‑in tool calling, versioned assistants, per‑assistant pricing. |
| Open‑Source, Self‑Hosted | AutoGPT / BabyAGI clones (LangChain core) | Docker‑Compose → K8s + Argo Workflows | Good for research/experimentation; needs extra hardening for production. |
What “Production‑Grade” Really Means
| Dimension | Must‑Haves | Common Pitfalls |
|---|---|---|
| Reliability | Idempotent steps, retries, circuit‑breakers, graceful degradation | LLM latency spikes → timeouts; unhandled rate limits → API bans |
| Scalability | Horizontal scaling of inference (vLLM, DeepSpeed) and vector stores | Single‑node inference choking under load |
| Observability | Structured logs, distributed tracing, metrics (latency, token usage, cost), alerts | Black‑box LLM calls hide failures |
| Security & Compliance | Secret management, encryption at rest, audit trails, PII redaction | Hard‑coded keys, prompt‑injection attacks |
| Versioning & CI/CD | Container images per version, schema migrations, canary releases | Monolithic script changes cause downtime |
| Governance | Prompt templating, policy enforcement, content moderation hooks | Uncontrolled output leads to policy violations |
| Cost Control | Token‑usage monitoring, budget alerts, fallback to cheaper models | Surprise bills from runaway loops |
A framework that addresses most of these out‑of‑the‑box is the sweet spot for production.
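To make the cost‑control row concrete, here is a minimal, framework‑agnostic sketch of a token‑budget tracker. The class name, limits, and session IDs are illustrative, not from any library; a real deployment would wire `record()` into LLM response callbacks and emit alerts instead of returning a boolean:

```python
# Minimal token-budget tracker: records per-session token usage and
# flags when a configurable budget is breached (illustrative sketch).
from collections import defaultdict

class TokenBudget:
    def __init__(self, limit: int):
        self.limit = limit            # max tokens allowed per session
        self.used = defaultdict(int)  # session_id -> tokens consumed so far

    def record(self, session_id: str, tokens: int) -> None:
        self.used[session_id] += tokens

    def over_budget(self, session_id: str) -> bool:
        return self.used[session_id] > self.limit

budget = TokenBudget(limit=10_000)
budget.record("sess-1", 8_000)
budget.record("sess-1", 3_000)
print(budget.over_budget("sess-1"))  # a real system would alert or fall back to a cheaper model
```

The same counter doubles as the hook for the "fallback to cheaper models" pitfall: a graph branch can consult `over_budget()` before choosing which model to call.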
Leading Frameworks (2024‑2025)
| Framework | Language(s) | Core Idea | Production‑Ready Features |
|---|---|---|---|
| LangChain (v0.3+) | Python, TypeScript/JS | Composable “chains” of LLM calls, tools, memory, retrieval | Prompt templating, tool abstraction, retries/timeout wrappers, tracing (LangSmith), 30+ vector‑DB adapters, multi‑provider LLM adapters |
| LangGraph (extension) | Python | Declarative state‑machine graphs that compile to deterministic workflows | Persistent state (Redis/Postgres), graph versioning, parallel branches, human‑in‑the‑loop nodes |
| Haystack (v2.x) | Python | Scalable RAG pipelines + document store | Indexing with Elasticsearch/Weaviate/Milvus/PGVector, pipeline API, integration with vLLM/TGI, REST & GraphQL endpoints |
| DeepSpeed‑based custom stack | Python | Zero‑Optimized large‑model serving | 8‑/4‑bit quantization, ZeRO‑III sharding, Ray‑based autoscaling on K8s |
| OpenAI / Azure Assistants API | Any (REST) | First‑class “assistant” objects that manage threads, tool calling, and functions | Provider‑managed scaling, versioned assistants, automatic transcript storage |
| AutoGPT / BabyAGI clones | Python | Self‑prompting loops that generate new tasks | Simple CLI, basic retry handling – great for experiments but needs hardening |
| LangSmith / Haystack‑MLflow | – | Experiment tracking & model registry | Metric logging, comparative runs, model versioning |
Bottom line
- Most flexible, community‑driven stack: LangChain + LangGraph.
- Heavy RAG / knowledge‑base focus: Haystack (can be combined with LangChain for tool calling).
- Zero‑ops, managed LLMs: OpenAI Assistants API (or Azure OpenAI).
All three can run on Kubernetes and be wired into the same ops stack, enabling hybrid solutions.
Reference Production Architecture
```
+-------------------+  HTTPS   +--------------------+  gRPC   +------------------------+
| Front‑end (Web/   | <------> | API Gateway (Envoy | <-----> | Ingress (Istio/Traefik)|
| Mobile)           |          | / Kong / APIGW)    |         |                        |
+-------------------+          +--------------------+         +------------------------+
                                                                          |
                                                                          v
                                                           +----------------------------+
                                                           | Agent Service (K8s)        |
                                                           |  ├─ LangGraph Engine       |
                                                           |  ├─ LangChain Library      |
                                                           |  ├─ Vector Store           |
                                                           |  │    (Pinecone/Weaviate)  |
                                                           |  ├─ Memory DB (Redis)      |
                                                           |  └─ Secrets (Vault)        |
                                                           +----------------------------+
                                                                          |
                                                        +-----------------+-----------------+
                                                        |                 |                 |
                                                        v                 v                 v
                                                 LLM Provider       Tool Service       Monitoring
                                                 (OpenAI,           (DB calls,         (Prometheus,
                                                  Anthropic,         code              Grafana,
                                                  vLLM self‑host)    interpreter)      OpenTelemetry)
```
Key Production Practices baked in
| Area | Practices |
|---|---|
| Retries & Circuit‑breakers | `tenacity` or custom `safe_call` wrappers around every external request. |
| Rate‑limit handling | Central token‑bucket middleware, auto‑backoff, per‑LLM quota enforcement. |
| Prompt Management | Store prompts in version‑controlled files; load via `PromptTemplate`. |
| Tool Security | Input validation with `pydantic`, sandboxed Docker containers for code execution, whitelist of libraries. |
| Observability | LangChain/LangGraph OpenTelemetry instrumentation, custom metrics (tokens, cost, latency). |
| State Persistence | LangGraph state stored in Redis/Postgres → crash‑recovery without losing conversation. |
| Testing | Unit‑test each tool in isolation; integration‑test the full graph by invoking the compiled graph with stubbed LLM and tool clients. |
| Deployment | Multi‑arch Docker image, Helm chart with autoscaling, health checks, secret injection. |
| Cost Control | Middleware logs token usage; periodic aggregation triggers alerts on quota breach; fallback to cheaper models via a graph branch. |
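The retry row above can be sketched without any third‑party library. In production you would likely reach for `tenacity` (retries) or `pybreaker` (circuit breaking), but the core idea is just bounded retries with exponential back‑off; everything below is an illustrative stdlib sketch:

```python
import time

def with_retries(fn, retries=3, base_delay=0.01,
                 exceptions=(TimeoutError, ConnectionError)):
    """Call fn(), retrying on transient errors with exponential back-off."""
    for attempt in range(1, retries + 1):
        try:
            return fn()
        except exceptions:
            if attempt == retries:
                raise                              # exhausted: surface the error
            time.sleep(base_delay * 2 ** attempt)  # back off before the next try

# Demo: a flaky call that fails twice, then succeeds on the third attempt
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient")
    return "ok"

print(with_retries(flaky))  # "ok" after two retried failures
```

A circuit breaker adds one more piece of state on top of this: after N consecutive failures it stops calling the backend entirely for a cool‑down period, which is what protects you from hammering a rate‑limited LLM API.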
Decision Tree – Which Framework to Choose?
Do you need to self‑host LLMs?
├─ Yes → LangChain + LangGraph on vLLM (or DeepSpeed) + K8s
│ • Need heavy RAG? Add Haystack for indexing.
│ • Want managed LLM? Swap vLLM for OpenAI/Anthropic.
└─ No → OpenAI (or Azure) Assistants API
• Need custom tools? Layer LangChain on top of Assistants.
• Need fine‑grained workflow? Use LangGraph with Assistants as the LLM node.
| Requirement | Best Fit |
|---|---|
| Real‑time sub‑second latency | Self‑hosted vLLM + GPU pool; avoid external API hops |
| Multi‑modal (image/audio) | LangChain adapters + gpt‑4o or CLIP embeddings in vector store |
| Regulated industry (HIPAA/GDPR) | Self‑hosted stack + strict encryption, no PHI to SaaS |
| Rapid PoC → Prod | Start with LangChain + HuggingFace model on Docker → migrate to LangGraph + K8s |
| Existing Java/Scala team | Use LangChain‑JS/TS (Node) or wrap Python micro‑service behind HTTP |
Quick‑Start Production‑Ready LangGraph Agent (Skeleton)
```python
# agent.py
import asyncio
import functools
import json
import os
from typing import TypedDict

import httpx
import redis
from fastapi import FastAPI, Request
from opentelemetry import trace
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langgraph.graph import StateGraph, START, END

# ---------- Observability ----------
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)
HTTPXClientInstrumentor().instrument()
RedisInstrumentor().instrument()

# ---------- Core Components ----------
LLM = ChatOpenAI(
    model=os.getenv("LLM_MODEL", "gpt-4o-mini"),
    temperature=0.0,
    api_key=os.getenv("OPENAI_API_KEY"),
)
EMBED = OpenAIEmbeddings(api_key=os.getenv("OPENAI_API_KEY"))
VECTOR = PineconeVectorStore.from_existing_index(
    index_name=os.getenv("PINECONE_INDEX"),
    embedding=EMBED,
    namespace=os.getenv("PINECONE_NS", "default"),
)
REDIS = redis.StrictRedis.from_url(os.getenv("REDIS_URL", "redis://localhost:6379"))

# ---------- Helper: retry with exponential back-off ----------
def safe_call(fn):
    @functools.wraps(fn)
    async def wrapper(*args, **kwargs):
        retries = 3
        for attempt in range(1, retries + 1):
            try:
                return await fn(*args, **kwargs)
            except (httpx.HTTPError, TimeoutError):
                if attempt == retries:
                    raise
                # exponential back-off; alerting / circuit-breaker hooks go here
                await asyncio.sleep(2 ** attempt)
    return wrapper

# ---------- Graph State & Nodes ----------
class AgentState(TypedDict, total=False):
    user_message: str
    retrieved_docs: list[str]
    assistant_reply: str

def retrieve(state: AgentState):
    docs = VECTOR.similarity_search(state["user_message"], k=4)
    return {"retrieved_docs": [d.page_content for d in docs]}

@safe_call
async def generate(state: AgentState):
    prompt = f"""You are an assistant. Use the context below to answer the question.
Context:
{state['retrieved_docs']}
Question: {state['user_message']}
Answer (concise):"""
    rsp = await LLM.ainvoke(prompt)
    return {"assistant_reply": rsp.content.strip()}

# ---------- Build the State Graph ----------
workflow = StateGraph(AgentState)  # takes the state schema class, not a string
workflow.add_node("retrieve", retrieve)
workflow.add_node("generate", generate)
workflow.add_edge(START, "retrieve")
workflow.add_edge("retrieve", "generate")
workflow.add_edge("generate", END)
Agent = workflow.compile()

# ---------- FastAPI entry point ----------
app = FastAPI()

@app.post("/chat")
async def chat_endpoint(req: Request):
    payload = await req.json()
    user_msg = payload["message"]
    session_id = payload.get("session_id", "anonymous")
    # Load persisted state if any (JSON, never eval(), which would execute stored input)
    stored = REDIS.get(session_id)
    state: AgentState = {"user_message": user_msg}
    if stored:
        state.update(json.loads(stored))
    result = await Agent.ainvoke(state)
    # Persist the latest reply for crash-recovery
    REDIS.set(session_id, json.dumps({"assistant_reply": result["assistant_reply"]}), ex=3600)
    return {"reply": result["assistant_reply"]}
```
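Per the testing practice above, each node can be unit‑tested in isolation by injecting fakes instead of real services. A pure‑Python sketch (the fake classes and the dependency‑injected `retrieve` variant are illustrative, not part of the skeleton above):

```python
# Unit-testing a retrieval node by injecting a fake vector store,
# so no network or real index is needed (illustrative sketch).
class FakeDoc:
    def __init__(self, text):
        self.page_content = text

class FakeVectorStore:
    def similarity_search(self, query, k=4):
        return [FakeDoc(f"doc about {query} #{i}") for i in range(k)]

def retrieve(state, vector_store):
    # Same shape as the graph node, but with the store passed in explicitly
    docs = vector_store.similarity_search(state["user_message"], k=4)
    return {"retrieved_docs": [d.page_content for d in docs]}

out = retrieve({"user_message": "kubernetes"}, FakeVectorStore())
assert len(out["retrieved_docs"]) == 4
assert "kubernetes" in out["retrieved_docs"][0]
print("retrieve node test passed")
```

The same pattern (pass collaborators in, fake them in tests) applies to the LLM node: stub `ainvoke` with a canned response and assert on the returned state update.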
Deploy in a few steps
```bash
# Build Docker image
docker build -t my-agent:latest -f Dockerfile .

# Helm install (example)
helm upgrade --install my-agent ./helm/my-agent \
  --set image.repository=my-agent \
  --set env.OPENAI_API_KEY=$OPENAI_KEY \
  --set env.PINECONE_INDEX=my-index \
  --set env.REDIS_URL=redis://redis:6379 \
  --set replicaCount=3 \
  --set resources.limits.cpu=2,resources.limits.memory=4Gi
```
The skeleton already includes:
- Retry / circuit‑breaker (plug in `pybreaker` for production).
- Distributed tracing (OpenTelemetry).
- State persistence (Redis).
- K8s‑ready container with health checks and autoscaling.
From here you can add more nodes (tool calls, human‑in‑the‑loop, validation steps) without touching the surrounding infrastructure.
Frequently Asked Follow‑Ups (Quick Answers)
| Question | Answer |
|---|---|
| Can LangChain run on Java/Scala? | Not natively. Use LangChain‑JS/TS (Node) or expose a Python micro‑service via HTTP. |
| How to achieve multi‑tenant isolation? | Separate vector‑store namespaces, Redis key prefixes, and (if needed) per‑tenant LLM API keys. Enforce via NetworkPolicies and OPA. |
| Do I need Airflow/Prefect for agents? | No for real‑time chat – LangGraph is the in‑process workflow engine. Use Airflow/Prefect only for batch jobs (e.g., nightly index refresh). |
| How to guard against hallucinations? | Add a validation node that runs the generated answer through a second LLM with a “Is this statement supported by the retrieved docs?” prompt before returning it. |
| Cost comparison: Assistants API vs. self‑hosted vLLM? | Assistants: $0.002–$0.012 per 1 k tokens + per‑assistant request fee. Self‑hosted: GPU cost (≈ $3–$4/hr for 8 × A100) + infra. Low QPS → Assistants cheaper; high QPS or large context windows → self‑hosted often wins. |
| Can I version prompts? | Yes. Store prompts in a Git repo, load them via `importlib.resources`, tag with semantic versions (`PROMPT_V1_2`), and switch via feature flags. |
| Do I need a separate workflow engine? | For synchronous chat, LangGraph suffices. For scheduled batch pipelines, combine with Airflow or Prefect. |
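The hallucination guard described above can be sketched as an extra graph node that asks a second model whether the answer is grounded in the retrieved docs. The `judge` callable below is a stand‑in for a real LLM call, and the prompt wording and fallback message are illustrative:

```python
def make_validation_node(judge):
    """judge(prompt) -> "yes"/"no"; builds a node that flags unsupported answers."""
    def validate(state):
        prompt = (
            "Is this statement supported by the retrieved docs?\n"
            f"Docs: {state['retrieved_docs']}\n"
            f"Statement: {state['assistant_reply']}\n"
            "Answer yes or no."
        )
        supported = judge(prompt).strip().lower().startswith("y")
        if not supported:
            # Replace the reply rather than return an ungrounded answer
            return {"assistant_reply":
                    "I could not verify that answer against the knowledge base."}
        return {}  # no state change: the answer passes through unchanged
    return validate

# Demo with a stub judge that rejects everything
node = make_validation_node(lambda prompt: "no")
out = node({"retrieved_docs": ["..."], "assistant_reply": "The moon is cheese."})
print(out["assistant_reply"])
```

Wired into the skeleton, this node would sit between `generate` and `END`, with `judge` bound to a cheap second model so the check does not double your per‑request cost.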
Final Takeaway
- If you need flexibility, testability, and observability while keeping the door open to self‑hosting large models → go with LangChain + LangGraph on Kubernetes.
- If your primary challenge is massive retrieval and you’re building a knowledge‑base‑centric product → layer Haystack under the same orchestration.
- If you prefer a zero‑ops approach and are comfortable sending data to a cloud provider → adopt OpenAI (or Azure) Assistants API, optionally wrapped by LangGraph for richer tool orchestration.
Pick the stack that aligns with your deployment model, privacy requirements, and engineering bandwidth. The ecosystem is mature enough that you can start with a small prototype, iterate quickly, and scale to a production‑grade AI agent system without having to rip‑and‑replace core components later. Happy building! 🚀