Building a Stateful IT Service Desk Agent with LangGraph on Amazon EKS

IT support teams face a persistent challenge: employees expect instant answers to common questions (VPN setup, single sign-on troubleshooting, new-hire onboarding), but novel or complex issues still require human expertise. An AI agent that confidently answers, “How do I reset my VPN?” but hallucinates a response to “My IAM Identity Center session keeps expiring after last night’s IdP migration” creates more tickets than it resolves.

In this post, we present an IT Service Desk agent that handles routine Level 1 (L1) support requests autonomously and escalates complex issues to Level 2/Level 3 (L2/L3) engineers when confidence is low. The agent is built with LangGraph, an open source (MIT-licensed) framework for stateful AI workflows maintained by the LangChain community. LangGraph’s interrupt() and checkpointing primitives map directly to tiered support escalation. While we demonstrate deployment on Amazon Elastic Kubernetes Service (Amazon EKS) with Amazon DynamoDB for state persistence, this LangGraph pattern runs on any Kubernetes platform; the orchestration layer is portable and the state backend is pluggable.

In addition to LangGraph, this solution builds on FastAPI for the HTTP layer, OpenTelemetry for tracing, and Karpenter for node autoscaling.

Why LangGraph for stateful support workflows

LangGraph models AI workflows as directed graphs, which makes tiered support escalation a natural fit. This implementation pairs LangGraph with Kubernetes for orchestration; for this post, we will use Amazon EKS. Key properties:

Automatic L1 resolution: Well-documented issues (VPN, password resets, software installation) are resolved without human intervention.

Context-preserving escalation: When the knowledge base lacks coverage or the issue is novel, LangGraph escalates with full context: the employee’s question, the AI’s attempted answer, and the retrieved documentation.

Durable state persistence: LangGraph’s checkpointing (implemented here with Amazon DynamoDB, but also supporting PostgreSQL, Redis, or custom backends) keeps escalated tickets intact across pod restarts, scaling events, and multi-replica routing.

Complete execution tracing: OpenTelemetry records every escalation decision: why the agent escalated, what documentation it searched, and who resolved the ticket.

Horizontal scaling: Kubernetes scales the agent based on demand; on Amazon EKS we use Karpenter configured to provision Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances for resource optimization.

Architecture overview

The system consists of a FastAPI application running a LangGraph agent on Amazon EKS. The agent retrieves context from Amazon OpenSearch Serverless (indexed IT runbooks, troubleshooting guides, and onboarding documentation), generates answers using Anthropic’s Claude model on Amazon Bedrock, and conditionally escalates to human L2/L3 engineers when confidence is low. Amazon DynamoDB persists conversation state across pod restarts.

Architecture diagram showing a LangGraph AI agent on Amazon EKS with DynamoDB checkpointing, OpenSearch Serverless retrieval, and Bedrock LLM integration for an IT service desk escalation workflow.

Architecture diagram: overall workflow with Amazon EKS, LangGraph, and Amazon Bedrock.

The knowledge base in Amazon Simple Storage Service (Amazon S3) contains IT runbooks, troubleshooting guides, VPN configuration docs, SSO setup procedures, onboarding checklists, and known-issue bulletins. An indexing pipeline (for example, an AWS Lambda function triggered by S3 PutObject events) chunks each document, generates embeddings using Amazon Titan Text Embeddings v2 on Amazon Bedrock, and writes the resulting vectors to an Amazon OpenSearch Serverless vector index for semantic kNN retrieval.
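
The chunking step of that indexing pipeline can be as simple as a sliding window over characters. The following is a minimal sketch, not the pipeline used in this post: the function name chunk_text, the 1,000-character window, and the 200-character overlap are illustrative assumptions. Each resulting chunk would then be embedded and written to the vector index.

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows.

    The window and overlap sizes are assumptions for illustration; tune them
    to the length and structure of your runbooks.
    """
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("require max_chars > 0 and 0 <= overlap < max_chars")

    chunks: list[str] = []
    step = max_chars - overlap  # how far the window advances each iteration
    for start in range(0, len(text), step):
        chunk = text[start:start + max_chars].strip()
        if chunk:
            chunks.append(chunk)
        # Stop once the current window has reached the end of the text.
        if start + max_chars >= len(text):
            break
    return chunks
```

The overlap keeps a step that straddles a window boundary visible in both neighboring chunks, at the cost of some duplicated text in the index.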

The architecture uses IAM Roles for Service Accounts (IRSA) so that pods assume an IAM role to access Amazon Bedrock, Amazon DynamoDB, and Amazon OpenSearch Serverless without static credentials. The Horizontal Pod Autoscaler (HPA) scales from 2 to 10 replicas based on CPU utilization, and Karpenter is configured to provision additional nodes using Spot Instances when capacity is needed.

LangGraph agent design

LangGraph models AI workflows as directed graphs where nodes are Python functions and edges define control flow. Three properties make it suitable for IT support escalation:

Checkpointing: Automatic state persistence at every node transition. When an issue is escalated to a human engineer, the full context is preserved in Amazon DynamoDB regardless of pod lifecycle.

interrupt(): Pauses graph execution, persists state, and resumes when the L2/L3 engineer provides a resolution. The escalated ticket can sit for minutes or hours and the state is safe.

Conditional edges: Route to different nodes based on confidence, enabling the L1 (AI) to L2/L3 (human) escalation pattern.

The agent graph has three nodes: Retrieve, Generate, and Escalate.

Confidence routing uses a hybrid approach: if the best retrieval score from Amazon OpenSearch Serverless (cosine similarity over Amazon Titan embeddings) is below 0.7, or the LLM’s self-assessed confidence (extracted via structured output prompting) is below 7/10, the request escalates to a human engineer. These thresholds are starting points, not fixed values; the right settings depend on your runbook coverage and how much escalation volume your L2/L3 team can absorb. Tune them empirically: raise the bar (for example, 0.8 / 8) to escalate more aggressively and protect answer quality while the knowledge base is still thin, then relax it as coverage improves and you see which escalations the AI could have handled confidently. Reviewing a sample of escalated and auto-resolved tickets each week is a practical way to calibrate. This hybrid check catches two failure modes:

Knowledge gap: The issue isn’t documented (for example, a new infrastructure change broke something).

Ambiguous context: The runbook exists but the employee’s situation doesn’t clearly match.
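
The hybrid check can be sketched as a small pure function that also records which signal triggered the escalation. This is an illustrative sketch under stated assumptions: the names decide_escalation and EscalationDecision are hypothetical, and mapping a low retrieval score to "knowledge_gap" and a low self-assessed confidence to "ambiguous_context" is one interpretation of the two failure modes, not something the agent code in this post emits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationDecision:
    escalate: bool
    reasons: tuple[str, ...]


def decide_escalation(
    best_retrieval_score: float,
    llm_confidence: float,
    retrieval_threshold: float = 0.7,
    confidence_threshold: float = 7.0,
) -> EscalationDecision:
    """Escalate when either signal falls below its threshold."""
    reasons: list[str] = []
    if best_retrieval_score < retrieval_threshold:
        # Nothing in the index closely matches the question.
        reasons.append("knowledge_gap")
    if llm_confidence < confidence_threshold:
        # Documentation was found, but the model is unsure that it applies.
        reasons.append("ambiguous_context")
    return EscalationDecision(escalate=bool(reasons), reasons=tuple(reasons))
```

Recording the reason alongside the decision makes the weekly calibration review easier, because you can see which threshold is driving escalation volume.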

RAG pipeline with confidence-based human escalation.

Building the agent

In this section, we walk through building the LangGraph agent step by step.

State schema

The state flows through every node in the graph. It captures the IT support context needed for both AI resolution and human escalation.

# Shared state passed through every node: question, retrieved docs,
# confidence, escalation flag, and routing metadata.

from dataclasses import dataclass, field
from typing import Annotated
from langgraph.graph.message import add_messages


@dataclass
class Document:
    content: str
    source: str
    score: float = 0.0


@dataclass
class SupportState:
    question: str = ""
    documents: list[Document] = field(default_factory=list)
    generation: str = ""
    confidence_score: float = 0.0
    needs_escalation: bool = False
    engineer_response: str | None = None
    category: str = ""  # vpn, sso, onboarding, hardware, access, other

Retrieve node

This node embeds the employee’s question with Amazon Titan Text Embeddings v2 on Amazon Bedrock, then runs a kNN search against the Amazon OpenSearch Serverless vector index built from the runbook corpus. It returns the top 5 documents along with their similarity scores.

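# Retrieve node: embeds the question with Titan Text Embeddings v2 on Bedrock,
# then runs a kNN search against the OpenSearch Serverless vector index.
# get_opensearch_client() is assumed to return an authenticated client.

import json
import os

import boto3

# Bedrock runtime client, shared by the retrieve and generate nodes.
bedrock_client = boto3.client(
    "bedrock-runtime", region_name=os.environ.get("AWS_REGION", "us-east-1")
)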


def embed_question(question: str) -> list[float]:
    response = bedrock_client.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": question}),
    )
    return json.loads(response["body"].read())["embedding"]


def retrieve(state: SupportState) -> dict:
    client = get_opensearch_client()
    query_vector = embed_question(state.question)
    response = client.search(
        index="it-runbooks",
        body={
            "size": 5,
            "query": {
                "knn": {
                    "embedding": {
                        "vector": query_vector,
                        "k": 5,
                    }
                }
            },
        },
    )
    documents = [
        Document(
            content=hit["_source"]["content"],
            source=hit["_source"].get("source", ""),
            score=hit["_score"],
        )
        for hit in response["hits"]["hits"]
    ]
    return {"documents": documents}
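
The retrieve node calls get_opensearch_client() without showing it. The following is one possible sketch, assuming the opensearch-py and boto3 packages and the AWS_REGION and OPENSEARCH_ENDPOINT environment variables from the Helm values; the helper opensearch_host is a hypothetical name introduced here. It signs requests with SigV4 for the aoss service, so the credentials come from the pod's IRSA role rather than static keys.

```python
import os
from urllib.parse import urlparse


def opensearch_host(endpoint: str) -> str:
    """Return the bare hostname from an endpoint with or without a scheme."""
    candidate = endpoint if "://" in endpoint else f"https://{endpoint}"
    host = urlparse(candidate).hostname
    if not host:
        raise ValueError(f"cannot parse OpenSearch endpoint: {endpoint!r}")
    return host


def get_opensearch_client():
    # Imported lazily so this module loads even where the SDKs are absent.
    import boto3
    from opensearchpy import AWSV4SignerAuth, OpenSearch, RequestsHttpConnection

    region = os.environ.get("AWS_REGION", "us-east-1")
    credentials = boto3.Session().get_credentials()
    # "aoss" is the SigV4 service name for OpenSearch Serverless.
    auth = AWSV4SignerAuth(credentials, region, "aoss")
    return OpenSearch(
        hosts=[{"host": opensearch_host(os.environ["OPENSEARCH_ENDPOINT"]), "port": 443}],
        http_auth=auth,
        use_ssl=True,
        verify_certs=True,
        connection_class=RequestsHttpConnection,
    )
```

In a long-running service you would likely create the client once at startup and reuse it rather than building a new one for every request.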

Generate node

This node sends the retrieved runbook content and the employee’s question to Anthropic’s Claude model on Amazon Bedrock. The prompt instructs the model to answer as an IT support agent and self-assess its confidence. Note that this self-assessed score is just one input: a model’s own confidence rating may not always reflect whether its answer is actually correct, so it helps to pair it with other signals. The hybrid approach does this by also considering the retrieval score from Amazon OpenSearch Serverless, and you may want to validate the thresholds against real escalation outcomes over time.

# Generate node: answers the question from retrieved docs via Claude on Bedrock,
# self-assesses confidence, and flags escalation when confidence or retrieval
# score falls below threshold.

import re

RETRIEVAL_THRESHOLD = 0.7
CONFIDENCE_THRESHOLD = 7
MODEL_ID = "anthropic.claude-sonnet-4-6"  # Choose your Bedrock Claude Sonnet model


def parse_model_response(text: str) -> tuple[str, float, str]:
    # Pull the trailing "CONFIDENCE: <n>" and "CATEGORY: <label>" lines out of
    # the model output. A missing confidence defaults to 0.0, which escalates.
    confidence_match = re.search(r"^CONFIDENCE:\s*(\d+(?:\.\d+)?)", text, re.MULTILINE)
    category_match = re.search(r"^CATEGORY:\s*(\w+)", text, re.MULTILINE)
    confidence = float(confidence_match.group(1)) if confidence_match else 0.0
    category = category_match.group(1).lower() if category_match else "other"
    answer = re.split(r"^CONFIDENCE:", text, maxsplit=1, flags=re.MULTILINE)[0].strip()
    return answer, confidence, category


def generate(state: SupportState) -> dict:
    best_retrieval_score = max((d.score for d in state.documents), default=0.0)
    context = "\n\n".join(doc.content for doc in state.documents)

    prompt = f"""You are an IT Service Desk agent. Answer the employee's question using
ONLY the provided IT documentation.

Rules:
- Only provide steps you can verify from the documentation.
- If the issue appears to be caused by a recent infrastructure change not covered in docs, say so.
- Never guess at credentials, endpoints, or configuration values.

After your answer, rate your confidence from 1-10 on a new line:
CONFIDENCE: <number>

Also categorize the issue on a new line:
CATEGORY: <vpn|sso|onboarding|hardware|access|other>

IT Documentation:
{context}

Employee Question: {state.question}

Answer:"""

    response = bedrock_client.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    body = json.loads(response["body"].read())
    raw_text = body["content"][0]["text"]
    answer, confidence, category = parse_model_response(raw_text)

    needs_escalation = (
        confidence < CONFIDENCE_THRESHOLD
        or best_retrieval_score < RETRIEVAL_THRESHOLD
    )

    return {
        "generation": answer,
        "confidence_score": confidence,
        "needs_escalation": needs_escalation,
        "category": category,
    }

Escalation node

The interrupt() function pauses graph execution and persists the full support context to Amazon DynamoDB. The graph remains paused until an L2/L3 engineer provides a resolution, which could be minutes or hours later, on any pod in the cluster. In production, the escalation node also publishes a notification (for example, to an Amazon SQS queue feeding an engineer console) so the on-call engineer is alerted to the new ticket; that notification is omitted here for brevity.

# Escalate node: pauses the graph via interrupt(), persisting full context to
# DynamoDB, and resumes with the L2/L3 engineer's resolution when they respond.

from langgraph.types import interrupt


def escalate_to_engineer(state: SupportState) -> dict:
    engineer_response = interrupt(
        {
            "question": state.question,
            "ai_attempted_answer": state.generation,
            "confidence": state.confidence_score,
            "category": state.category,
            "reason": "Low confidence - escalating to L2/L3 engineer.",
            "retrieved_sources": [d.source for d in state.documents],
        }
    )
    return {
        "engineer_response": engineer_response,
        "generation": engineer_response,
    }
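
The engineer notification mentioned above could look like the following sketch. It assumes boto3 and an existing Amazon SQS queue; the function names, the queue URL parameter, and the message fields are illustrative choices, not part of the agent code in this post. Because LangGraph re-executes a node from its beginning when it resumes after interrupt(), it is safer to publish from a separate node that runs before escalate_to_engineer, or to make the publish idempotent, than to place it directly above the interrupt() call.

```python
import json


def build_escalation_message(ticket_id: str, interrupt_payload: dict) -> str:
    """Serialize the fields an on-call engineer needs to triage the ticket."""
    return json.dumps(
        {
            "ticket_id": ticket_id,
            "category": interrupt_payload.get("category", "other"),
            "question": interrupt_payload.get("question", ""),
            "confidence": interrupt_payload.get("confidence"),
            "retrieved_sources": interrupt_payload.get("retrieved_sources", []),
        },
        sort_keys=True,
    )


def notify_engineers(queue_url: str, ticket_id: str, interrupt_payload: dict) -> None:
    # Imported lazily so the message builder can be tested without the AWS SDK.
    import boto3

    sqs = boto3.client("sqs")
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=build_escalation_message(ticket_id, interrupt_payload),
    )
```

The IRSA role would also need sqs:SendMessage on the queue, which is not included in the policy shown later in this post.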

Graph assembly

The graph assembly wires the three nodes into the support flow. Execution starts at retrieve, passes to generate, and then branches: route_after_generate inspects the needs_escalation flag set during generation and routes low-confidence requests to escalate_to_engineer while sending resolved requests straight to END.

# Wires retrieve -> generate, then branches on needs_escalation to either
# escalate to a human engineer or end the flow.

from langgraph.graph import StateGraph, START, END


def route_after_generate(state: SupportState) -> str:
    if state.needs_escalation:
        return "escalate_to_engineer"
    return END


def build_graph():
    graph = StateGraph(SupportState)
    graph.add_node("retrieve", retrieve)
    graph.add_node("generate", generate)
    graph.add_node("escalate_to_engineer", escalate_to_engineer)

    graph.add_edge(START, "retrieve")
    graph.add_edge("retrieve", "generate")
    graph.add_conditional_edges(
        "generate",
        route_after_generate,
        ["escalate_to_engineer", END],
    )
    graph.add_edge("escalate_to_engineer", END)
    return graph

FastAPI wrapper

The application exposes two endpoints: one for employees to submit IT support requests, and one for engineers to resolve escalated tickets.

# FastAPI wrapper: /support runs the graph (returning either an answer or an
# escalation), and /resolve resumes a paused graph with the engineer's response
# via DynamoDB-backed checkpointing.

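import uuid
from typing import Any

from fastapi import FastAPI
from pydantic import BaseModel
from langgraph.types import Command
from langgraph.checkpoint.aws import DynamoDBSaver


# Request/response models. Field names match the sample requests and responses
# in "Testing the complete flow".
class SupportRequest(BaseModel):
    question: str
    thread_id: str | None = None


class ResolveRequest(BaseModel):
    engineer_response: str


class SupportResponse(BaseModel):
    ticket_id: str
    answer: str | None = None
    needs_escalation: bool = False
    category: str = ""
    interrupt_context: dict[str, Any] | None = None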

app = FastAPI()
checkpointer = DynamoDBSaver(
    table_name="it-support-checkpoints",
    region_name="us-east-1",
)
graph = build_graph().compile(checkpointer=checkpointer)


@app.post("/support")
async def support(request: SupportRequest):
    thread_id = request.thread_id or str(uuid.uuid4())
    config = {"configurable": {"thread_id": thread_id}}
    result = graph.invoke({"question": request.question}, config)
    state = graph.get_state(config)

    if state.next:  # Graph is paused - escalated to engineer
        interrupt_payload = state.tasks[0].interrupts[0].value
        return SupportResponse(
            ticket_id=thread_id,
            needs_escalation=True,
            category=interrupt_payload.get("category", ""),
            interrupt_context=interrupt_payload,
        )
    return SupportResponse(
        ticket_id=thread_id,
        answer=result["generation"],
        category=result.get("category", ""),
    )


@app.post("/support/{ticket_id}/resolve")
async def resolve(ticket_id: str, request: ResolveRequest):
    config = {"configurable": {"thread_id": ticket_id}}
    result = graph.invoke(
        Command(resume=request.engineer_response), config
    )
    return SupportResponse(
        ticket_id=ticket_id,
        answer=result["generation"],
        category=result.get("category", ""),
    )

Amazon DynamoDB checkpointing

Amazon DynamoDB checkpointing is what makes the escalation pattern viable on Kubernetes. Without it, an escalated ticket is lost if the handling pod restarts or if the engineer’s resolution routes to a different replica.

The langgraph-checkpoint-aws library persists the full graph state at every node transition. When a request is escalated, the state, including the employee’s question, the AI’s attempted answer, retrieved runbook content, and confidence scores, is already in Amazon DynamoDB. When the engineer resolves the ticket (potentially hours later, on any pod in the cluster), the graph loads the checkpoint and delivers the resolution.

This also provides resilience against infrastructure failures. If a pod crashes mid-execution, the support request can be retried from the last completed node rather than asking the employee to repeat themselves.
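
To see why a shared checkpoint store makes the handoff independent of any one pod, consider the following toy model. It is a conceptual sketch only: the classes InMemoryCheckpointStore and Replica are hypothetical, the dictionary stands in for the DynamoDB table, and none of this reflects how langgraph-checkpoint-aws actually lays out its items.

```python
import json


class InMemoryCheckpointStore:
    """Stands in for the shared DynamoDB table; keyed by thread ID."""

    def __init__(self) -> None:
        self._items: dict[str, str] = {}

    def put(self, thread_id: str, state: dict) -> None:
        # Serialize so that replicas never share live Python objects.
        self._items[thread_id] = json.dumps(state)

    def get(self, thread_id: str) -> dict:
        return json.loads(self._items[thread_id])


class Replica:
    """Stands in for one agent pod. It keeps no ticket state of its own."""

    def __init__(self, name: str, store: InMemoryCheckpointStore) -> None:
        self.name = name
        self.store = store

    def handle_support(self, thread_id: str, question: str, confidence: float) -> dict:
        status = "paused" if confidence < 7 else "done"
        state = {"question": question, "status": status, "handled_by": [self.name]}
        self.store.put(thread_id, state)
        return state

    def handle_resolve(self, thread_id: str, engineer_response: str) -> dict:
        state = self.store.get(thread_id)  # load the checkpoint by thread ID
        if state["status"] != "paused":
            raise ValueError(f"ticket {thread_id} is not awaiting an engineer")
        state.update(generation=engineer_response, status="done")
        state["handled_by"].append(self.name)
        self.store.put(thread_id, state)
        return state
```

If pod-a handles the original request and is then terminated, pod-b can still resolve the ticket, because everything it needs is looked up from the store by thread ID.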

Containerize and deploy

Prerequisites

To deploy the IT support agent on Amazon EKS, you need the following:

  • An AWS account. If you don’t have one, you can sign up for one.
  • An AWS Identity and Access Management (IAM) user with permissions to work with Amazon EKS, Amazon Bedrock, Amazon DynamoDB, and Amazon OpenSearch Serverless.
  • The AWS CLI, kubectl, eksctl, and helm installed in your terminal.
  • Access to Amazon Bedrock foundation models (example: Anthropic Claude Sonnet) enabled in your account.

Step 1: Containerize the application

Create a Dockerfile that packages the agent and FastAPI app into a slim Python image and serves it with uvicorn on port 8000.

# Dockerfile: packages the agent and FastAPI app into a slim Python image and
# serves it with uvicorn on port 8000.

FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY agent/ agent/
COPY app/ app/
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Build the container image and push it to Amazon Elastic Container Registry (Amazon ECR). Replace 111122223333 with your AWS account ID.

# Build the container image, authenticate Docker to Amazon ECR, then tag and
# push the image to your ECR repository.

docker build -t it-support-agent .
# Create the ECR repository first if it does not already exist.
aws ecr create-repository --repository-name it-support-agent --region us-east-1
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin \
  111122223333.dkr.ecr.us-east-1.amazonaws.com
docker tag it-support-agent:latest \
  111122223333.dkr.ecr.us-east-1.amazonaws.com/it-support-agent:latest
docker push \
  111122223333.dkr.ecr.us-east-1.amazonaws.com/it-support-agent:latest

Note: Replace 111122223333 with your AWS account ID throughout this post.

Step 2: Set up the Amazon EKS cluster

Create an Amazon EKS cluster using the following cluster-config.yaml. This provisions a private-networking EKS 1.31 cluster with OIDC enabled (for IRSA), a managed node group, Karpenter discovery tags, and core add-ons.

# eksctl cluster config: provisions a private-networking EKS 1.31 cluster with
# OIDC enabled (for IRSA), a managed node group, Karpenter discovery tags, and
# core add-ons.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: it-support-agent
  region: us-east-1
  version: "1.31"
  tags:
    karpenter.sh/discovery: it-support-agent
iam:
  withOIDC: true
managedNodeGroups:
  - name: default
    desiredCapacity: 3
    minSize: 2
    maxSize: 5
    instanceType: m7i.large
    privateNetworking: true
addons:
  - name: vpc-cni
    attachPolicyARNs:
      - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
  - name: coredns
    version: latest
  - name: kube-proxy
    version: latest

Run this command to create your Amazon EKS cluster:

eksctl create cluster -f cluster-config.yaml

Verify that nodes are running:

kubectl get nodes

Step 3: Create the Amazon DynamoDB table

Create the DynamoDB table for LangGraph checkpoints, using a composite PK/SK key schema and on-demand (pay-per-request) billing.

# Create the DynamoDB table for LangGraph checkpoints, using a composite PK/SK
# key schema and on-demand (pay-per-request) billing.

aws dynamodb create-table \
  --table-name it-support-checkpoints \
  --attribute-definitions \
      AttributeName=PK,AttributeType=S \
      AttributeName=SK,AttributeType=S \
  --key-schema \
      AttributeName=PK,KeyType=HASH \
      AttributeName=SK,KeyType=RANGE \
  --billing-mode PAY_PER_REQUEST \
  --region us-east-1

Step 4: Deploy with Helm

The Helm chart defines the Kubernetes resources: a Deployment with an IRSA-annotated service account, a ClusterIP Service, an ALB Ingress, and a Horizontal Pod Autoscaler. The values file sets replica count, binds the pod’s service account to the IRSA role, injects environment variables, and configures the HPA to scale 2 to 10 replicas at 70% CPU.

# Helm values: sets replica count, binds the pod's service account to the IRSA
# role, injects AWS/OpenSearch/checkpoint/OTel env vars, and configures HPA to
# scale 2 to 10 replicas at 70% CPU.

replicaCount: 2
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/it-support-agent-irsa
env:
  AWS_REGION: us-east-1
  OPENSEARCH_ENDPOINT: <your-opensearch-endpoint>
  CHECKPOINT_TABLE: it-support-checkpoints
  OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector.observability:4317
hpa:
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilization: 70
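
To get a feel for how these HPA settings behave, the core of the Kubernetes scaling rule can be sketched in a few lines. This is a simplified model based on the documented formula (desired replicas = ceil(current replicas × current metric / target metric)), clamped to the configured bounds. It deliberately ignores the tolerance band, stabilization windows, and pod readiness handling that the real controller applies, and the function name is hypothetical.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu_pct: float,
    target_cpu_pct: float = 70.0,
    min_replicas: int = 2,
    max_replicas: int = 10,
) -> int:
    """Simplified HPA rule: scale proportionally to the metric ratio, then clamp."""
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))
```

For example, two replicas averaging 140% of their CPU request would be scaled toward four, while a burst that calls for twenty replicas is capped at the configured maximum of ten.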

Deploy:

# Deploy the agent to the cluster with Helm, overriding the OpenSearch endpoint
# at install time.

helm install it-support-agent ./helm/it-support-agent \
  --set env.OPENSEARCH_ENDPOINT=<your-opensearch-endpoint>

The IAM role attached through IRSA requires the following permissions, defined as an inline policy on the role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BedrockInvoke",
      "Effect": "Allow",
      "Action": ["bedrock:InvokeModel"],
      "Resource": [
        "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-sonnet-4-6",
        "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0"
      ]
    },
    {
      "Sid": "DynamoDBCheckpoints",
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:Query",
        "dynamodb:UpdateItem",
        "dynamodb:DeleteItem"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:111122223333:table/it-support-checkpoints"
    },
    {
      "Sid": "OpenSearchServerless",
      "Effect": "Allow",
      "Action": ["aoss:APIAccessAll"],
      "Resource": "*"
    }
  ]
}

Observability with OpenTelemetry

For an IT Service Desk, observability serves dual purposes: operational monitoring and audit compliance. Every escalation decision is traceable. You can demonstrate why the AI escalated, what documentation it searched, and who resolved the ticket.

# Initializes OpenTelemetry: creates a tracer provider for the service, exports
# spans to the OTLP collector, and auto-instruments the FastAPI app for request
# tracing.

import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor


def init_telemetry(app):
    provider = TracerProvider(
        resource=Resource.create({"service.name": "it-support-agent"})
    )
    # Use the collector endpoint injected through the Helm values
    # (OTEL_EXPORTER_OTLP_ENDPOINT), falling back to the same address.
    otlp_endpoint = os.environ.get(
        "OTEL_EXPORTER_OTLP_ENDPOINT",
        "http://otel-collector.observability:4317",
    )
    provider.add_span_processor(
        BatchSpanProcessor(
            OTLPSpanExporter(
                endpoint=otlp_endpoint,
                insecure=True,
            )
        )
    )
    trace.set_tracer_provider(provider)
    FastAPIInstrumentor.instrument_app(app)

Each node records attributes that explain its decisions:

Span               Attributes
support.retrieve   question, documents_retrieved, top_score, runbooks_searched
support.generate   confidence_score, best_retrieval_score, needs_escalation, category
support.escalate   confidence_score, category, engineer_resolved, resolution_time_ms

This enables SLA tracking and helps answer questions like: How long from escalation to resolution? Which categories escalate most frequently? Which runbooks have coverage gaps?
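
One way to attach these attributes is to wrap each node function in a span. The sketch below assumes the opentelemetry-api package is installed and that init_telemetry has already run; the helper names with_span and generate_span_attributes are hypothetical, and the attribute keys follow the support.generate span described in this section.

```python
from typing import Any, Callable


def generate_span_attributes(state: Any, result: dict[str, Any]) -> dict[str, Any]:
    """Build the support.generate attributes from the node's input and output."""
    best_score = max((doc.score for doc in state.documents), default=0.0)
    return {
        "confidence_score": float(result["confidence_score"]),
        "best_retrieval_score": float(best_score),
        "needs_escalation": bool(result["needs_escalation"]),
        "category": str(result["category"]),
    }


def with_span(
    span_name: str,
    node_fn: Callable[[Any], dict[str, Any]],
    attributes_fn: Callable[[Any, dict[str, Any]], dict[str, Any]],
) -> Callable[[Any], dict[str, Any]]:
    """Wrap a graph node so each call runs inside a named span."""

    def wrapper(state: Any) -> dict[str, Any]:
        # Imported lazily so the attribute builder is testable without the SDK.
        from opentelemetry import trace

        tracer = trace.get_tracer("it-support-agent")
        with tracer.start_as_current_span(span_name) as span:
            result = node_fn(state)
            span.set_attributes(attributes_fn(state, result))
            return result

    return wrapper
```

With this in place, the node could be registered as graph.add_node("generate", with_span("support.generate", generate, generate_span_attributes)), leaving the node function itself free of tracing code.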

Testing the complete flow

Let’s verify the application is working correctly by testing both paths: L1 AI resolution and L2/L3 escalation.

L1 resolution: Routine VPN question

curl -X POST http://<ENDPOINT>/support \
  -H "Content-Type: application/json" \
  -d '{"question": "How do I connect to the corporate VPN from my Mac?"}'

Response:

{
  "ticket_id": "52964e68-a2d2-4a4f-9914-57fea1a2d0f9",
  "answer": "To connect to the corporate VPN on macOS:\n1. Open System Settings > VPN\n2. Select Add VPN Configuration\n3. Select IKEv2 as the type\n4. Server: vpn.corp.example.com\n5. Remote ID: vpn.corp.example.com\n6. Authentication: Username (your SSO credentials)\n7. Select Connect and authenticate with your MFA prompt.",
  "needs_escalation": false,
  "category": "vpn"
}

Escalation: Novel issue after infrastructure change

curl -X POST http://<ENDPOINT>/support \
  -H "Content-Type: application/json" \
  -d '{"question": "My IAM Identity Center session keeps expiring every 10 minutes since the IdP migration last Thursday. I have to re-authenticate constantly."}'

Response:

{
  "ticket_id": "b7142ee4-c933-491d-ad54-9a701b4c1397",
  "answer": null,
  "needs_escalation": true,
  "category": "sso",
  "interrupt_context": {
    "question": "My IAM Identity Center session keeps expiring every 10 minutes since the IdP migration last Thursday...",
    "ai_attempted_answer": "Based on available documentation, session duration is configured in IAM Identity Center settings. However, I cannot confirm whether the recent IdP migration changed these settings.",
    "confidence": 4.0,
    "reason": "Low confidence - escalating to L2/L3 engineer.",
    "retrieved_sources": ["sso-setup-guide.md", "iam-identity-center-config.md"]
  }
}

Engineer resolves the escalated ticket

curl -X POST http://<ENDPOINT>/support/b7142ee4-c933-491d-ad54-9a701b4c1397/resolve \
  -H "Content-Type: application/json" \
  -d '{"engineer_response": "This is a known issue from the IdP migration on Thursday. The session duration was reset to the 10-minute default during cutover. Fix: Go to IAM Identity Center, then Settings, then Session Duration and set it back to 8 hours. We are pushing a bulk fix tonight that will resolve this for all users by tomorrow morning."}'

Response:

{
  "ticket_id": "b7142ee4-c933-491d-ad54-9a701b4c1397",
  "answer": "This is a known issue from the IdP migration on Thursday. The session duration was reset to the 10-minute default during cutover. Fix: Go to IAM Identity Center, then Settings, then Session Duration and set it back to 8 hours. We are pushing a bulk fix tonight that will resolve this for all users by tomorrow morning.",
  "needs_escalation": false,
  "category": "sso"
}

Between the escalation and the resolution, the original pod could have restarted or been scaled down, or the resolve request could have been routed to a different replica. Amazon DynamoDB checkpointing makes this transparent.

Cleaning up

To avoid ongoing charges, delete the deployed resources:

helm uninstall it-support-agent
aws dynamodb delete-table --table-name it-support-checkpoints
aws ecr delete-repository --repository-name it-support-agent --force
eksctl delete cluster -f cluster-config.yaml

If you created an Amazon OpenSearch Serverless collection and Amazon S3 bucket for the IT runbooks, delete those as well through the AWS Management Console.

Conclusion

In this post, we demonstrated a production pattern for stateful IT support that mirrors how support teams already work: handle routine L1 requests automatically, escalate complex issues to human engineers with full context, and maintain a complete execution trace of every decision.

This implementation demonstrates several key points:

  • LangGraph’s interrupt() maps directly to IT support escalation, with no custom ticketing queue infrastructure needed.
  • LangGraph checkpointing (DynamoDB here, or PostgreSQL/Redis elsewhere) keeps escalated tickets intact across pod restarts, scaling events, and multi-replica routing.
  • Confidence-based routing catches both knowledge gaps (undocumented issues, such as breakage from a recent infrastructure change) and ambiguous context (a runbook exists, but the employee’s situation doesn’t clearly match it).
  • OpenTelemetry provides SLA tracking, escalation analytics, and compliance audit trails.
  • The pattern is portable: while shown on Amazon EKS, it runs on any Kubernetes platform.

Next steps

  • Add a feedback loop that stores engineer resolutions back into the knowledge base, so the same question gets resolved at L1 next time.
  • Implement priority routing using the category field to route escalations to specialized engineers (network team for VPN, identity team for SSO).
  • Build SLA dashboards by aggregating OpenTelemetry spans to track mean-time-to-resolution by category.
  • Contribute your own checkpoint backends or deployment recipes to the LangGraph community.

To learn more, visit the LangGraph documentation, Amazon EKS documentation, and Amazon Bedrock documentation.
