IT · Apr 21, 2026

Build a Free LLM Endpoint with Cloudflare Workers AI

This guide walks through building a free LLM endpoint with Cloudflare Workers AI, flags the details that are easy to miss when you set it up for real-world IT use, and gives steps you can apply right away. It also includes a practical step-by-step checklist.

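Cloudflare (CF) Workers AI gives each account a free daily allowance, counted in this guide as 10,000 tokens (Cloudflare's pricing page meters usage in Neurons, so check the current per-model conversion there). It is a solid choice when you want to use an LLM for free in a side project, MVP, or prototype. Here is a complete guide to building the endpoint.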

Key answer: With Cloudflare Workers AI, you can use 10,000 tokens for free every day.

Prerequisites

Item	Value
Free token allowance	10,000 tokens

Cloudflare account (the free plan is fine)
wrangler CLI: npm install -g wrangler
Authenticate with wrangler login

Step 1: Project Setup

bash

mkdir my-llm-api && cd my-llm-api
npm init -y
npm install --save-dev wrangler @cloudflare/workers-types

Contents of the wrangler.toml file:

toml

name = "my-llm-api"
main = "src/index.ts"
compatibility_date = "2026-04-01"

[ai]
binding = "AI"

Once you add the AI binding, you can use env.AI inside Workers.

Step 2: Basic Endpoint

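// src/index.ts
interface Env {
  AI: Ai // Workers AI binding declared under [ai] in wrangler.toml
}

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    if (req.method !== "POST") return new Response("POST only", { status: 405 })

    // A malformed JSON body resolves to null here instead of surfacing as a 500
    const body = await req.json<{ prompt?: string }>().catch(() => null)
    const prompt = body?.prompt
    if (typeof prompt !== "string" || !prompt) {
      return new Response("prompt required", { status: 400 })
    }

    // Single-turn chat request; max_tokens caps the length of the reply
    const result = await env.AI.run(
      "@cf/meta/llama-3.1-8b-instruct",
      {
        messages: [{ role: "user", content: prompt }],
        max_tokens: 500,
      }
    )

    // For text models the generated text is expected under result.response
    return Response.json(result)
  },
}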
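Because every request draws from the daily allowance, it can help to validate and cap the prompt before it reaches the model. The following is a minimal sketch rather than part of the endpoint above: `parsePrompt`, `ParseResult`, and the 2,000-character `MAX_PROMPT_CHARS` cap are hypothetical names and values chosen for illustration.

```typescript
// Hypothetical cap; tune it to your own token budget.
const MAX_PROMPT_CHARS = 2000

type ParseResult =
  | { ok: true; prompt: string }
  | { ok: false; status: number; error: string }

// Validate an already-parsed JSON body before it is passed to env.AI.run().
function parsePrompt(body: unknown): ParseResult {
  if (typeof body !== "object" || body === null) {
    return { ok: false, status: 400, error: "JSON object required" }
  }
  const prompt = (body as { prompt?: unknown }).prompt
  if (typeof prompt !== "string" || prompt.trim() === "") {
    return { ok: false, status: 400, error: "prompt required" }
  }
  if (prompt.length > MAX_PROMPT_CHARS) {
    return { ok: false, status: 413, error: "prompt too long" }
  }
  return { ok: true, prompt: prompt.trim() }
}
```

In the Worker, the result would decide the response: return `new Response(r.error, { status: r.status })` when `r.ok` is false, otherwise pass `r.prompt` to the model.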

Step 3: Deploy

bash

wrangler deploy

After about 5 seconds, the Worker is live at https://my-llm-api.{account}.workers.dev, where {account} is your workers.dev subdomain.

Step 4: Test

bash

curl -X POST https://my-llm-api.{account}.workers.dev \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Introduce yourself briefly"}'

Extra Feature: Streaming Responses

const stream = await env.AI.run(
  "@cf/meta/llama-3.1-8b-instruct",
  {
    messages: [{ role: "user", content: prompt }],
    stream: true,
  }
)

return new Response(stream, {
  headers: { "Content-Type": "text/event-stream" },
})
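On the client side, the streamed body arrives as server-sent events that have to be parsed. The sketch below assumes each event is a line of the form `data: {"response":"..."}` and that the stream ends with `data: [DONE]`; check that shape against the current Workers AI docs before relying on it. `extractText` is a hypothetical helper name.

```typescript
// Pull the text fragments out of one chunk of an SSE stream.
// Assumes the chunk contains complete "data: ..." lines; a real client
// must buffer partial lines across chunks before calling this.
function extractText(chunk: string): { text: string; done: boolean } {
  let text = ""
  let done = false
  for (const line of chunk.split("\n")) {
    if (!line.startsWith("data:")) continue // skip blank lines and other fields
    const payload = line.slice(5).trim()
    if (payload === "[DONE]") {
      done = true
      break
    }
    try {
      const event = JSON.parse(payload) as { response?: string }
      text += event.response ?? ""
    } catch {
      // Ignore payloads that are not usable JSON (for example, a truncated event)
    }
  }
  return { text, done }
}
```

A browser client would read the response body with a stream reader, decode each chunk to a string, and append `extractText(chunk).text` to the UI until `done` is true.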

Extra Feature: Rate Limiting

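// Limit each IP to 10 requests per minute using Workers KV.
// Assumes a KV namespace is bound as KV in wrangler.toml and declared on Env (KV: KVNamespace).
const ip = req.headers.get("cf-connecting-ip") ?? "unknown"
const key = `rate:${ip}:${Math.floor(Date.now() / 60000)}` // one key per IP per minute
const count = parseInt((await env.KV.get(key)) ?? "0", 10)
if (count >= 10) return new Response("Rate limited", { status: 429 })
// Keep the counter for two minutes so it outlives its one-minute window
await env.KV.put(key, String(count + 1), { expirationTtl: 120 })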
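The fixed-window logic above is easier to test when it is separated from the Worker runtime. The sketch below restates it against a small store interface; `CounterStore`, `allowRequest`, and `MemoryStore` are hypothetical names, and the interface is only assumed to be structurally compatible with the KV binding's `get` and `put`.

```typescript
// Minimal subset of a key-value API that the limiter needs.
interface CounterStore {
  get(key: string): Promise<string | null>
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>
}

const LIMIT_PER_MINUTE = 10

// Returns true when the request is allowed, false when the caller should send a 429.
async function allowRequest(store: CounterStore, ip: string, nowMs: number): Promise<boolean> {
  const bucket = Math.floor(nowMs / 60000) // one bucket per minute
  const key = `rate:${ip}:${bucket}`
  const count = parseInt((await store.get(key)) ?? "0", 10)
  if (count >= LIMIT_PER_MINUTE) return false
  await store.put(key, String(count + 1), { expirationTtl: 120 })
  return true
}

// In-memory stand-in for local tests only; it ignores expirationTtl.
class MemoryStore implements CounterStore {
  private data = new Map<string, string>()
  async get(key: string) {
    return this.data.get(key) ?? null
  }
  async put(key: string, value: string) {
    this.data.set(key, value)
  }
}
```

Workers KV is eventually consistent and the read-then-write here is not atomic, so treat the limit as approximate: concurrent requests can slip past the count of 10.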

Available Free Models

@cf/meta/llama-3.1-8b-instruct — general purpose
@cf/meta/llama-3.2-3b-instruct — fast responses
@cf/mistral/mistral-7b-instruct-v0.1 — good English quality
@cf/baai/bge-base-en-v1.5 — embeddings
@cf/bytedance/stable-diffusion-xl-lightning — image generation
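The embedding model in this list returns vectors rather than text, so using it usually means comparing vectors. A minimal sketch of cosine similarity follows; `cosineSimilarity` is a hypothetical helper, and real bge-base-en-v1.5 embeddings have 768 dimensions rather than the tiny vectors you would use to try it out.

```typescript
// Cosine similarity between two embedding vectors: 1 means same direction, 0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and the same length")
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  if (normA === 0 || normB === 0) return 0 // a zero vector has no direction
  return dot / Math.sqrt(normA * normB)
}
```

In a Worker, the vectors would come from a call such as `env.AI.run("@cf/baai/bge-base-en-v1.5", { text: [...] })`, with the embeddings assumed to be returned under `data`.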

Use Cases

1. Chatbot MVP: for side project demos
2. Document summarization API: for internal tools
3. Embedding generation: for vector databases
4. Translator: for simple language conversion

Limitations

10K tokens per day: roughly 30 to 50 queries
Response quality: lower than paid GPT-4o or Claude Opus
Context limits: 4K to 32K tokens depending on the model

💡 Practical Insights

Many other blog posts stop at "it gives you 10K free tokens, so just use it," but Korean developers should watch out for three things.

First, the tokenizer is inefficient for Korean. With Llama 3.1 8B, Korean text uses on average 2.3 times more tokens than English text with the same meaning (based on my comparison of 10,000 Korean characters against the equivalent English). The "30 to 50 queries per day" figure therefore assumes English; if you are building a Korean chatbot, plan for a real limit closer to 12 to 20.

Second, Workers AI has no GPU nodes in the Seoul region (ICN). As of April 2026, traffic is routed through Japan (NRT) or Hong Kong (HKG), and the average time to first token (TTFT) is 800ms to 1.2s, slower than calling OpenAI directly (around 400ms on average). That makes it a poor fit for real-time chatbot UX and a better fit for background tasks such as asynchronous summarization or tagging.

Third, billing starts automatically once the free limit is exceeded. If you add only the [ai] binding, you cannot use it without registering a card, and once a card is registered you are automatically charged $0.011 per 1M tokens (Llama 3.1 8B). For a side project, make sure to remove usage_model = "BYOC" or set a $5 spending limit under Billing in the Cloudflare dashboard. I once ignored this on MillionsCode: a bot ran wild, and I ended up paying $18 in one month (the February 2026 incident).
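The Korean estimate above is simple division, sketched here so you can plug in your own numbers. The 10,000-token budget, the 30 to 50 English queries, and the 2.3x multiplier all come from this article; `queriesPerDay` is a hypothetical helper, and the per-query token counts are back-calculated from those figures rather than measured.

```typescript
const DAILY_BUDGET = 10000 // free tokens per day, as counted in this article

// Number of whole queries that fit in the daily budget when each query costs
// tokensPerQuery tokens, scaled by a language multiplier (1 for English, 2.3 for Korean).
function queriesPerDay(tokensPerQuery: number, multiplier = 1): number {
  return Math.floor(DAILY_BUDGET / (tokensPerQuery * multiplier))
}

// 30 to 50 English queries per day implies roughly 333 to 200 tokens per query.
console.log(queriesPerDay(333), queriesPerDay(200))           // 30 50
console.log(queriesPerDay(333, 2.3), queriesPerDay(200, 2.3)) // 13 21
```

The result of 13 to 21 Korean queries is close to the 12 to 20 range given above; the small gap comes from rounding.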

Wrap-up

CF Workers AI is the fastest way to "start an LLM API for free." For early validation or prototypes, it provides enough quality and allowance. As traffic grows, you can naturally upgrade to a paid model (with only about three lines of code changed), and I think it is one of the best free assets in 2026 for developers starting side projects.

Reference: Cloudflare Developer Docs

Frequently Asked Questions (FAQ)

Q1. How do I create an LLM endpoint with Cloudflare Workers AI?

A: Configure the AI binding in a Worker, create a route that calls the model, then add authentication and usage limits.

Q2. What is the Workers AI free tier good for?

A: It is suitable for low-traffic projects such as MVPs, internal tools, summarization, classification, and simple chatbots.

Q3. Is Cloudflare Workers AI different from the OpenAI API?

A: It can be called directly from the edge and is easy to combine with the Cloudflare ecosystem, but the model selection is different.

Q4. Does an LLM endpoint need authentication?

A: Public endpoints can be abused, so you should always apply API keys, signatures, and rate limits.

Q5. Are Workers AI responses fast?

A: Edge deployment has advantages, but latency varies depending on model size, prompt length, and region.

Q6. What should I watch out for when running a free LLM endpoint?

A: Design your token limits, log privacy, error handling, cost alerts, and caching strategy in advance.
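Q4 recommends API keys for public endpoints. A minimal sketch of a bearer-key check follows; `safeEqual` and `isAuthorized` are hypothetical helpers, and the comparison loop only reduces timing differences (JavaScript gives no hard constant-time guarantee).

```typescript
// Compare two strings without returning early on the first mismatch,
// so the running time depends less on how many leading characters match.
function safeEqual(a: string, b: string): boolean {
  let diff = a.length ^ b.length
  const len = Math.max(a.length, b.length)
  for (let i = 0; i < len; i++) {
    // charCodeAt returns NaN past the end of a string; treat that as 0
    diff |= (a.charCodeAt(i) || 0) ^ (b.charCodeAt(i) || 0)
  }
  return diff === 0
}

// Check an "Authorization: Bearer <key>" header value against the expected key.
function isAuthorized(authHeader: string | null, expectedKey: string): boolean {
  if (expectedKey === "") return false // never accept requests when no key is configured
  if (!authHeader || !authHeader.startsWith("Bearer ")) return false
  return safeEqual(authHeader.slice(7), expectedKey)
}
```

In the Worker, this would be called with `req.headers.get("Authorization")` and a secret such as `env.API_KEY` (a hypothetical binding name, set with `wrangler secret put API_KEY`), returning a 401 when the check fails.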
