Cloudflare expanded Workers AI to support large language models like Kimi K2.5, enabling serverless LLM inference at scale. Here's what this means for your AI agent infrastructure.

Run production AI agents on distributed infrastructure without managing inference endpoints: serverless LLM inference is now viable at scale, with predictable latency and costs.
Signal analysis
Lead AI Dot Dev tracked Cloudflare's announcement of large language model support on Workers AI, and this represents a fundamental shift in how developers can deploy inference workloads. Previously, Workers AI focused on smaller, optimized models suitable for lightweight inference tasks. The platform now runs production-grade LLMs like Kimi K2.5 - a capable Chinese model - directly on Cloudflare's global edge network. This means you can build full AI agent stacks without provisioning separate inference infrastructure.
The technical architecture matters here. Cloudflare optimized their inference stack specifically for agent use cases, which typically involve multiple sequential model calls, branching logic, and stateful interactions. By embedding LLM inference directly into the Workers runtime, latency drops significantly compared to calling external APIs. You're eliminating the network hop to a separate inference service.
Cost structure changes too. Cloudflare reduced inference pricing for agent workloads, making the economics more viable for applications that need frequent model calls. Per-token pricing scales predictably, and you avoid the overhead of managing dedicated GPU instances.
AI agents require a different infrastructure pattern than traditional LLM applications. Agents need to think step-by-step, call external tools, evaluate results, and iterate. Each step involves an inference call. Running these agents on external APIs compounds latency and costs - a 10-step reasoning chain hitting OpenAI's API 10 times becomes slow and expensive.
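The compounding effect is easy to quantify. A minimal sketch of the latency math, using made-up placeholder numbers (the overhead and compute figures below are assumptions for illustration, not measured values):

```typescript
// Illustrative latency math for a sequential agent reasoning chain.
// All numbers are placeholder assumptions, not benchmarks.

interface InferenceProfile {
  networkOverheadMs: number; // round-trip to the inference endpoint
  computeMs: number;         // time the model spends generating
}

// Total wall-clock time for `steps` sequential calls:
// each step pays the network hop plus model compute.
function chainLatencyMs(profile: InferenceProfile, steps: number): number {
  return steps * (profile.networkOverheadMs + profile.computeMs);
}

// Hypothetical profiles: an external API with a cross-region hop
// versus inference co-located with the compute runtime.
const externalApi: InferenceProfile = { networkOverheadMs: 150, computeMs: 800 };
const edgeInference: InferenceProfile = { networkOverheadMs: 5, computeMs: 800 };

const externalTotal = chainLatencyMs(externalApi, 10); // → 9500 ms
const edgeTotal = chainLatencyMs(edgeInference, 10);   // → 8050 ms
console.log({ externalTotal, edgeTotal, savedMs: externalTotal - edgeTotal });
```

The network hop is fixed per call, so its share of total latency grows linearly with chain depth: the deeper the agent's reasoning, the more co-located inference pays off.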
Cloudflare's approach eliminates this constraint. Since inference runs at the edge near your users, response times stay acceptable even for multi-step agentic workflows. More importantly, you maintain execution context within a single compute environment. No serializing state between services, no managing connection pools to multiple inference endpoints.
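The single-environment pattern looks roughly like the sketch below. The planning logic and `runModel` stub are hypothetical stand-ins (in a real Worker, inference would go through the platform's AI binding); the point is that the loop's context stays in memory across steps rather than being serialized between services.

```typescript
// A minimal sketch of a multi-step agent loop that keeps all state
// in one runtime. `runModel` is a stub standing in for an inference
// call; its "planning" is fake and exists only to drive the loop.

type Step = { thought: string; done: boolean };

// Stub inference call: counts prior steps in the prompt and stops at 3.
// A real implementation would invoke an actual model here.
async function runModel(prompt: string): Promise<Step> {
  const n = (prompt.match(/step/g) ?? []).length;
  return { thought: `step ${n + 1}`, done: n + 1 >= 3 };
}

// Each iteration feeds the accumulated context back into the model.
// `context` lives in memory for the whole run: no cross-service
// serialization, no connection pools to external inference endpoints.
async function runAgent(task: string, maxSteps = 10): Promise<string[]> {
  const context: string[] = [task];
  for (let i = 0; i < maxSteps; i++) {
    const step = await runModel(context.join("\n"));
    context.push(step.thought);
    if (step.done) break;
  }
  return context.slice(1); // the agent's steps, excluding the task itself
}

runAgent("summarize the report").then((steps) => console.log(steps));
```

The `maxSteps` cap is worth keeping in any real agent loop: it bounds both cost and latency when the model fails to converge on a terminal answer.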
The Kimi K2.5 model selection signals something else: Cloudflare is building partnerships with diverse model providers, not just relying on OpenAI-class models. This matters because agent workloads often benefit from specialized models - reasoning models for planning, code models for tool use, domain-specific models for vertical applications. You'll likely see more model options added to Workers AI.
For teams building agents today, this removes a major operational headache. You're no longer choosing between 'use expensive API calls for good models' or 'self-host infrastructure and manage scaling.' Cloudflare abstracts the scaling problem while keeping costs reasonable.
This announcement reflects a broader market shift. We're moving past the era where all inference happened in centralized cloud regions. Edge providers - Cloudflare, Fastly, others - are racing to embed ML capabilities closer to users and applications. It's the same infrastructure evolution that made serverless compute viable: pushing compute to where it's needed rather than forcing developers to manage central capacity.
The competitive pressure is real. Vercel added AI inference through partnerships. AWS is positioning Bedrock across regions. Cloudflare's move signals that companies building on Workers can't be stuck in a world where 'real AI' lives somewhere else. This is table stakes for modern platform vendors now.
What builders should recognize: this is a reversal of the cloud consolidation trend of the 2010s. Back then, centralized mega-clouds won because they had better economics at scale. Now, economics and latency both favor distributed inference. Your infrastructure decisions need to account for this shift. If you're evaluating platforms, proximity to inference capability should rank alongside compute performance and pricing.
If you're building agents, audit your current inference approach. Are you hitting external APIs for each reasoning step? Calculate the latency and cost for typical agent workflows. Then compare against what Workers AI pricing might look like. The breakeven point is probably lower than you expect, especially for multi-step operations.
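That audit is simple arithmetic. A back-of-the-envelope sketch, where every price and token count is a made-up placeholder to substitute with your provider's actual per-token rates and your own measured usage:

```typescript
// Back-of-the-envelope daily cost for an agent workload.
// All figures below are placeholder assumptions, not real pricing.

interface Workload {
  stepsPerRun: number;   // inference calls per agent run
  tokensPerStep: number; // prompt + completion tokens per call
  runsPerDay: number;
}

// Cost in dollars per day at a given price per million tokens.
function dailyCostUsd(w: Workload, pricePerMTokUsd: number): number {
  const tokensPerDay = w.stepsPerRun * w.tokensPerStep * w.runsPerDay;
  return (tokensPerDay / 1_000_000) * pricePerMTokUsd;
}

const workload: Workload = { stepsPerRun: 10, tokensPerStep: 2_000, runsPerDay: 5_000 };

// Hypothetical rates: a premium external API vs a cheaper serverless tier.
console.log(dailyCostUsd(workload, 10.0)); // → 1000 ($10 / M tokens)
console.log(dailyCostUsd(workload, 1.0));  // → 100  ($1 / M tokens)
```

Note that the step count multiplies straight into the total: a 10-step agent consumes ten times the tokens of a single-shot call per run, which is why per-token pricing dominates the economics of agentic workloads.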
Evaluate Kimi K2.5 specifically if its capabilities fit your application. It's a strong reasoning model, and testing it on Workers AI costs little compared to provisioning infrastructure elsewhere. You get real production numbers - actual latency, actual cost - before making infrastructure commitments.
Longer term, treat Workers AI as part of your evaluation set when planning AI agent infrastructure. It won't be the right choice for every workload - some agents need very specialized models or extreme throughput - but for typical applications, serverless inference at the edge is now genuinely competitive. You're no longer choosing between 'convenience' and 'cost-effectiveness'; you can have both. The Cloudflare blog announcement linked in this analysis has the full details.
Thank you for listening. Lead AI Dot Dev