This post makes a point - generative AI is relatively cheap - that might seem too obvious to need making. I'm mostly writing it because I've repeatedly had the same discussion over the past six months with people claiming the opposite. Not only is the misconception still around, it's not even getting less frequent. This is mainly written so I have a document to point people at the next time it comes up.
(Note: all the data below is from the start of May, which is when I originally wrote this blog post. I just ended up not publishing it, since having written it I was convinced that something this obvious didn't actually deserve 2000 words.)
It seems to be a common, if not a majority, belief that Large Language Models (in the colloquial sense of "things that are like ChatGPT") are very expensive to operate. This then leads to a ton of innumerate analyses about how AI companies must be obviously doomed, as well as a myopic view on how consumer AI businesses can/will be monetized.
It's an understandable mistake, since inference was indeed very expensive at the start of the AI boom, and those costs were talked about a lot. But inference has gotten cheaper even faster than models have gotten better, and nobody has an intuition for something becoming 1000x cheaper in two years. It just doesn't happen. It doesn't help that the common pricing model ("$ per million tokens") is very hard to visualize.
So let's compare LLMs to web search. I'm choosing search as the comparison since it's in the same vicinity and since it's something everyone uses and nobody pays for, not because I'm suggesting that ungrounded generative AI is a good substitute for search.
(It should also go without saying that these are just my personal opinions.)
What is the price of a web search?
Here's the public API pricing for some companies operating their own web search infrastructure, retrieved on 2025-05-02:
- The Gemini API pricing lists a "Grounding with Google Search" feature at $35/1k queries. I believe that's the best number we can get for Google; they don't publish prices for a "raw" search result API.
- The Bing Search API is priced at $15/1k queries at the cheapest tier.
- Brave has a price of $5/1k searches at the cheapest tier. Though there's something very strange about their pricing structure, with the unit pricing increasing as the quota increases, which is the opposite of what you'd expect. The tier with real quota is priced at $9/1k searches.
So there's a range of prices, but not a horribly wide one, and with the engines you'd expect to be of higher quality also having higher prices.
What is the price of LLMs in a similar domain?
To make a reasonable comparison between those search prices and LLM prices, we need two numbers:
- How many tokens are output per query?
- What's the price per token?
I picked a few arbitrary queries from my search history, phrased them as questions, and ran them on Gemini 2.5 Flash (thinking mode off) in AI Studio:
- [When was the term LLM first used?] -> 361 tokens, 2.5 seconds
- [What are the top javascript game engines?] -> 1145 tokens, 7.6 seconds
- [What are the typical carry-on bag size limits in europe?] -> 506 tokens, 3.4 seconds
- [List the 10 largest power outages in history] -> 583 tokens, 3.7 seconds
Note that I'm not judging the quality of the answers here. The purpose is just to get rough numbers for how large typical responses are. The average is probably somewhere in the 500-1000 token range.
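For what it's worth, the mean of those four samples confirms that estimate (a trivial sketch using the token counts above):

```python
# Output token counts from the four sample queries above
sample_tokens = [361, 1145, 506, 583]

average = sum(sample_tokens) / len(sample_tokens)
print(average)  # 648.75 -> consistent with a 500-1000 token estimate
```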
What's the price of a token? The pricing is sometimes different for input and output tokens. Input tokens tend to be cheaper, and our inputs are very short compared to the outputs, so for simplicity let's consider all the tokens to be outputs. Here's the pricing of some relevant models, retrieved on 2025-05-02:
| Model | Price / 1M tokens |
|---|---|
| Gemma 3 27B | $0.20 (source) |
| Qwen3 30B A3B | $0.30 (source) |
| Gemini 2.0 Flash | $0.40 (source) |
| GPT-4.1 nano | $0.40 (source) |
| Gemini 2.5 Flash Preview | $0.60 (source) |
| Deepseek V3 | $1.10 (source) |
| GPT-4.1 mini | $1.60 (source) |
| Deepseek R1 | $2.19 (source) |
| Claude 3.5 Haiku | $4.00 (source) |
| GPT-4.1 | $8.00 (source) |
| Gemini 2.5 Pro Preview | $10.00 (source) |
| Claude 3.7 Sonnet | $15.00 (source) |
| o3 | $40.00 (source) |
If we assume the average query uses 1k tokens, these prices would be directly comparable to the prices per 1k search queries. That's convenient.
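The conversion is worth making explicit: at 1k tokens per query, 1k queries consume exactly 1M tokens, so a model's "$ per 1M tokens" price is numerically the same as its "$ per 1k queries" price. A quick Python sketch (using a subset of the table's prices):

```python
# Price per 1M output tokens, taken from the table above
prices_per_m_tokens = {
    "Gemma 3 27B": 0.20,
    "Gemini 2.5 Flash Preview": 0.60,
    "GPT-4.1": 8.00,
    "o3": 40.00,
}

TOKENS_PER_QUERY = 1000  # assumed average response length

for model, price in prices_per_m_tokens.items():
    # 1000 queries * 1000 tokens/query = 1,000,000 tokens,
    # so the per-1k-queries number equals the per-1M-tokens price
    per_1k_queries = price * TOKENS_PER_QUERY * 1000 / 1_000_000
    print(f"{model}: ${per_1k_queries:.2f} per 1k queries")
```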
The low end of that spectrum is at least an order of magnitude cheaper than even the cheapest search API, and even the models at the low end are pretty capable. The high end is about on par with the highest end of search pricing. To compare a midrange pair on quality, the Bing Search vs. a Gemini 2.5 Flash comparison shows the LLM being 1/25th the price.
Note that many of the above models have cheaper pricing in exchange for more flexible scheduling (Anthropic, Google and OpenAI give a 50% discount for batch requests, Deepseek is 50%-75% cheaper during off-peak hours). I've not included those cheaper options in the table to keep things comparable, but the presence of those cheaper tiers is worth keeping in mind when thinking about the next section...
Objection!
I know some people are going to have objections to this back-of-the-envelope calculation, and a lot of them will be totally legit concerns. I'll try to address some of them preemptively. Slightly different assumptions can easily lead to clawing back 10% here and 50% there. But I don't see how to bridge a 25x gap just for breaking even, let alone making the AI significantly more expensive. If you want to play around with different assumptions, there's a little calculator widget below.
Surely the typical LLM response is longer than that - I already picked the upper end of what the (very light) testing suggested as a reasonable range for the type of question I'd use web search for. There are plenty of use cases where the inputs and outputs will be much longer (e.g. coding), but then you'd need to switch the comparison to something in that same domain as well.
The LLM API prices must be subsidized to grab market share -- i.e. the prices might be low, but the costs are high - I don't think the prices are subsidized, for a few reasons. I'd instead assume APIs are typically profitable on a unit basis.
First, there's actually not that much motive to gain API market share with unsustainably cheap prices. Any gains would be temporary, since there's no long-term lock-in, and better models are released weekly. Data from paid API queries will also typically not be used for training or tuning the models, so getting access to more data wouldn't explain it.
Second, some of those models have been released with open weights and API access is also available from third-party providers who would have no motive to subsidize inference. (Or the number in the table isn't even first party hosting -- I sure can't figure out what the Vertex AI pricing for Gemma 3 is). The pricing of those third-party hosted APIs appears competitive with first-party hosted APIs.
Third, Deepseek released actual numbers on their inference efficiency in February. Those numbers suggest that their normal R1 API pricing has about 80% margins when considering the GPU costs, though not any other serving costs.
The search API prices amortize building and updating the search index, LLM inference is based on just the cost of inference - This seems pretty likely to be true, actually? But the effect can't really be that large for a popular model (e.g. the allegedly leaked OpenAI financials claimed $2B/year spent on inference vs. $3B/year on training).
The search API prices must have higher margins than LLM inference - It's possible. I certainly don't know what the margins of any Search API providers are, though it seems fair to assume they're pretty robust. But, well, see the point above about Deepseek's own numbers on R1 profit margins.
Also, it seems quite plausible that some search providers accept lower margins than others, which would explain part of the price spread seen earlier.
Web search returns results 20x-100x faster than an LLM finishes the query, how could it be more expensive? - Search latency can be improved by parallelizing the problem, while LLM inference is serial in nature. The work of predicting a single token can be parallelized, but you can't predict all the output tokens at once.
But OpenAI made a loss, and they don't expect to make a profit for years! - That's because a huge proportion of their usage is not monetized at all, despite the usage pattern being ideal for it. OpenAI reportedly made a loss of $5B in 2024. They also reportedly have 500M MAUs. To reach break-even, they'd just need to monetize those free users for an average of $10/year, or under $1/month. A $1 monthly ARPU for a service like this would be pitifully low.
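That back-of-the-envelope division, spelled out (both inputs are the reported figures, not audited data):

```python
reported_loss_2024 = 5_000_000_000  # reported $5B loss in 2024
reported_maus = 500_000_000         # reported 500M monthly active users

break_even_arpu = reported_loss_2024 / reported_maus
print(break_even_arpu)       # 10.0 -> $10 per user per year
print(break_even_arpu / 12)  # ~0.83 -> under $1 per user per month
```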
If the reported numbers are true, OpenAI doesn't actually have high costs for a consumer service that popular, which is what you'd expect to see if the high cost of inference was the problem. They just have a very low per-user revenue, by choice.
Sensitivity analysis
If you want to play around with different assumptions, here's a calculator:
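For readers of the text version, the widget's logic boils down to a single ratio; here's the same calculation as a minimal Python function (the function and parameter names are mine, not the widget's):

```python
def search_vs_llm_ratio(search_price_per_1k_queries: float,
                        llm_price_per_m_tokens: float,
                        tokens_per_query: int) -> float:
    """How many times more expensive search is than the LLM.
    Values above 1.0 mean the LLM is cheaper per query."""
    llm_price_per_1k_queries = llm_price_per_m_tokens * tokens_per_query / 1000
    return search_price_per_1k_queries / llm_price_per_1k_queries

# The post's midrange comparison: Bing at $15/1k queries vs.
# Gemini 2.5 Flash at $0.60/M tokens, 1000-token answers:
print(search_vs_llm_ratio(15.0, 0.60, 1000))  # 25.0

# Even a pessimistic case - Google grounding at $35/1k vs. o3 at
# $40/M with 2000-token answers - makes the LLM only ~2.3x search:
print(search_vs_llm_ratio(35.0, 40.0, 2000))  # ~0.44
```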
Why does this matter?
I mean, you're right to ask that. Nothing really matters and eventually we'll all be dead.
But it is interesting how many people have built their mental model for the near future on a premise that was true for only a brief moment. Some things that will come as a surprise to them even assuming all progress stops right now:
There's an argument advanced by some people about how low prices mean it'll be impossible for AI companies to ever recoup model training costs. The thinking seems to be that it's just the prices that have been going down, but not the costs, and the low prices must be an unprofitable race to the bottom for what little demand there is. What's happening and will continue to happen instead is that as costs go down, the prices go down too, and demand increases as new uses become viable. For an example, look at the OpenRouter API traffic volumes, both in aggregate and in the relative share of cheaper models.
This post was mainly about APIs, but consumer usage will have exactly the same cost structure, just a different monetization structure. And given how low the unit costs must be, advertising isn't merely viable but lucrative.
From this it follows that the financials of frontier AI labs are a lot better than some innumerate pundits would have you believe. They're making a loss because they're not under pressure to be profitable, and aren't actively trying to monetize consumer traffic yet. This could well be a land grab unlike APIs, since unpaid consumer queries may be used for training while paid API queries typically are not. Even the subscription pricing might be there mainly for demand management rather than trying to run a profit.
The real cost problem isn't going to be with the LLMs themselves, it's with all the backend services that AI agents will want to access if even a rudimentary form of the agentic vision actually materializes. Running the AI is already cheap, will keep getting cheaper, and will always have a monetization model of some sort since it's what the end user is interacting with. None of that is true for the end-user services that have been turned into AI backends without their consent. An AI trying to, I don't know, book concert tickets whenever a band I like plays in my town will probably be phenomenally expensive to its third-party backends (e.g. scraping ticket sites). Those sites bear the expense uncompensated, while the AI traffic bypasses their actual revenue streams.
I don't really know how that plays out.
Obviously many service owners will try to make unauthorized scraping harder, but that's a very hard problem to solve on the web. Maybe some of them give up on the web entirely, and move to mobile where they can at least get device attestations. Some might just give up on the open web, and require all usage to be signed in, with account creation being gated on something scarce. Some might become unviable and close up shop entirely.
If/when that happens, what's the play on the AI agent side? Will they choose an escalating adversarial arms race with increasingly dodgy tactics, or will they eventually decide that it's better to pay for the services they use? The former seems unsustainable. If the latter, then the core engineering challenge becomes building data provider backends optimized specifically for AI use: scaling to massive volumes at cheap unit prices, trading off higher latency, lower reliability and lower quality. That could be quite interesting from a systems perspective. (Yes, I'm aware of MCP, but it's a solution to an orthogonal issue.)
But one thing I'm confident won't be happening is that AIs turn out to be too expensive to run.