‘Reasoning’ will increase the infrastructure footprint of AI

“Reasoning” models — and their offshoots, such as “deep research” — have emerged as a major trend in generative AI, helping improve the accuracy and reliability of responses to complex problems.

However, this approach significantly increases the computational cost of generating responses. Imitating reasoning consumes more server capacity, which in turn requires additional data center space, power and cooling, ultimately driving up operating costs for model owners.

The key question facing businesses is whether the potential benefits of reasoning justify as much as a sixfold increase in the cost of running or buying large language model (LLM)-based services. Even before reasoning, the inference economics of standard LLMs were challenging, requiring a balance between model performance, response latency and overall costs. Reasoning further complicates this equation.

Inference for reasoning capabilities tends to require at least an order of magnitude more computational steps, because it involves an iterative pursuit of the best answer: the model runs several parallel, trial-and-error attempts (often called trajectories). The model creates its own internal checks to benchmark these attempts; this type of inference is often referred to as “test-time.”
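To make the mechanics concrete, the sketch below shows a best-of-n sampling loop of the kind used at test time: several candidate trajectories are generated for the same prompt, each is scored by a verification step, and the highest-scoring one is returned. The functions generate_trajectory and self_verify are hypothetical stand-ins, not any vendor’s API; in a real system both would be calls to the LLM itself or to a companion verifier model.

```python
import random

# Hypothetical stand-ins for LLM calls; in a real deployment these would be
# requests to the model (or a companion verifier model), not local functions.
def generate_trajectory(prompt: str, seed: int) -> str:
    rng = random.Random(seed)
    return f"attempt {seed}: proposed answer {rng.randint(0, 99)} for '{prompt}'"

def self_verify(prompt: str, trajectory: str) -> float:
    # A reasoning model scores its own intermediate output against internal
    # checks; a random score stands in for that judgment here.
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    # Run n parallel trial-and-error attempts. Every attempt consumes extra
    # inference compute, which is why test-time reasoning multiplies the
    # hardware cost per user query.
    trajectories = [generate_trajectory(prompt, seed) for seed in range(n)]
    best_score, best_trajectory = max(
        (self_verify(prompt, t), t) for t in trajectories
    )
    return best_trajectory

if __name__ == "__main__":
    print(best_of_n("Summarize the cooling options for a 50 kW rack"))
```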

While this approach is costly, it is becoming more effective for improving the performance of the largest and most complex LLMs. For these models, scaling up training data and compute now delivers diminishing returns. Reasoning represents the first wave of “test-time compute” models that attempt to continue improving LLM performance through elaborate inference techniques.

The key question for data center operators is what infrastructure will be required to deliver services built on even more computationally intense AI models.

Token intelligence

Reasoning builds on established academic ideas about “self-verification” and “reflection” that enable an LLM (or a companion model) to evaluate its outputs in addition to responding to user requests (commonly called prompts). The objective is to trade some extra compute and time for potentially better-quality outputs.

The first mainstream reasoning model, OpenAI’s o1, was launched to the public in December 2024. It was soon followed by DeepSeek R1, Alibaba’s Qwen, Google’s Gemini 2.0 Flash Thinking, xAI’s Grok 3, Anthropic’s Claude 3.7 and others.

While these models only imitate reasoning, they have been trained to generate an internal “thinking process” based on the prompt and, in some cases, can ask clarifying questions.

A model based on the transformer architecture (i.e., almost any modern LLM) can be turned into a reasoning model by incorporating additional, carefully curated reinforcement learning and supervised fine-tuning stages. The former automatically rewards or penalizes the model for specific behavior; the latter involves training on a specialized dataset where each input is paired with the correct output. Although computationally intensive, these steps require only a fraction of the resources used to train the base model. The main difference between typical LLMs and reasoning models emerges during inference — when the model is deployed in production.
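As an illustration of the reinforcement learning stage, the sketch below shows the kind of simple, verifiable reward signal that can be used to reward or penalize model behavior on tasks with known answers. It is a minimal, hypothetical example: real pipelines wrap a signal like this in a full policy-optimization training loop, and the “Answer:” convention is assumed purely for illustration.

```python
# Minimal sketch of a verifiable reward signal for the reinforcement learning
# stage of reasoning fine-tuning. Real pipelines wrap a signal like this in a
# full training loop; this only shows the reward logic.

def extract_final_answer(model_output: str) -> str:
    # Assumes the model is prompted to end its "thinking" with "Answer: <x>".
    marker = "Answer:"
    if marker not in model_output:
        return ""
    return model_output.rsplit(marker, 1)[-1].strip()

def reward(model_output: str, reference_answer: str) -> float:
    # Reward a correct, well-formed final answer; penalize anything else.
    # A signal like this is what nudges the base model toward longer,
    # self-checking "reasoning" traces during fine-tuning.
    if extract_final_answer(model_output) == reference_answer.strip():
        return 1.0
    return -1.0

print(reward("Check the edge cases first... Answer: 42", "42"))  # 1.0
print(reward("The result is probably 41", "42"))                 # -1.0
```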

When a user enters a prompt into an LLM, it is broken up into tokens — collections of characters that have semantic meaning for the model. The model analyzes the input tokens and generates appropriate output tokens in return, according to the patterns it “learned” during training. Customers are often billed for the number of tokens generated, as this serves as a rough proxy for compute required to respond to a query.
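The sketch below illustrates tokenization and token-based billing using tiktoken, the open-source tokenizer library published by OpenAI. The price is a placeholder, not any vendor’s published rate.

```python
# Rough illustration of how a prompt is split into tokens and how token
# counts translate into a bill. Uses the open-source tiktoken library;
# the price below is a placeholder, not a vendor's published rate.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # one of OpenAI's published encodings
prompt = "Estimate the cooling load for a 1 MW data hall."
tokens = encoding.encode(prompt)

PRICE_PER_MILLION_INPUT_TOKENS = 2.50  # placeholder figure, in USD

print(f"{len(tokens)} input tokens: {tokens}")
print(f"estimated input cost: ${len(tokens) * PRICE_PER_MILLION_INPUT_TOKENS / 1e6:.6f}")
```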

A typical LLM is only capable of analyzing the input tokens created by the user. In contrast, a reasoning model can also analyze the tokens it has generated internally, allowing it to make sure that the results are consistent. In essence, the machine is allowed to talk to itself.

There are several approaches to accounting for the tokens generated internally during the model’s test-time run. OpenAI refers to these as “reasoning tokens” and counts (and charges for) them separately from input and output tokens. Anthropic calls them “thinking tokens” but otherwise treats them as output tokens.
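The difference between the two accounting schemes can be sketched with placeholder numbers. The functions, token counts and per-million-token prices below are illustrative assumptions, not published rate cards; they only show how the same query is metered when hidden reasoning tokens are billed as a separate category versus folded into the output-token count.

```python
# Illustrative comparison of the two accounting approaches described above.
# All token counts and per-million-token prices are placeholders.

def bill_separate_reasoning(inp, reasoning, out, p_in, p_reasoning, p_out):
    # Reasoning tokens metered (and priced) as their own category.
    return (inp * p_in + reasoning * p_reasoning + out * p_out) / 1e6

def bill_reasoning_as_output(inp, reasoning, out, p_in, p_out):
    # "Thinking" tokens simply added to the output-token count.
    return (inp * p_in + (reasoning + out) * p_out) / 1e6

# Hypothetical query: 500 input, 4,000 hidden reasoning, 800 output tokens.
print(bill_separate_reasoning(500, 4_000, 800, p_in=2.0, p_reasoning=6.0, p_out=8.0))
print(bill_reasoning_as_output(500, 4_000, 800, p_in=2.0, p_out=8.0))
```

Under either scheme, the hidden reasoning tokens dominate the bill for this example query, which is the mechanism behind the cost increases discussed below.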

In all cases, reasoning increases the number of tokens generated during a response to a user prompt, which means it increases the hardware resources required per user and the cost of services.  

While the additional compute burden is hard to quantify, the effect on pricing is clear: OpenAI’s o1 model with reasoning features is roughly six times more expensive to use than GPT-4o. DeepSeek has a similar price increase, with R1 costing about six times as much as the standard V3 model.

And yet, spending more time or tokens does not guarantee a better output. Reasoning models still hallucinate and sometimes struggle with relatively simple problems. They are also not suitable for applications that require low latency.

What is reasonable?

Reasoning is seen — perhaps rightly — as a more economical way to improve the performance of the largest LLMs than scaling up training infrastructure and data. At the same time, it makes the business viability of inference even more problematic. Reasoning models are difficult to accommodate because, for now, they rely on expensive hardware, such as high-performance GPUs, rather than much more affordable inference servers.

Many AI developers, including xAI and Anthropic, treat reasoning as an optional feature that can be enabled for specific prompts at an additional cost. Meanwhile, OpenAI has announced it will no longer release standalone reasoning models. The company plans to integrate “chain of thought” reasoning into future products, but customers will no longer be able to choose when it is used. From GPT-5 onwards, OpenAI will introduce “intelligence levels” that vary across subscription tiers.

Exactly how machine reasoning will evolve is difficult to predict. Reasoning (and deep research) might become practical for advanced math problems, challenging coding tasks, or producing complex analytical reports.

Some AI industry watchers speculate that future reasoning models will be allowed to run for much longer — days or even months — to answer hard but loosely structured questions. However, such capabilities are likely to remain in a narrow domain of “test-time supercomputing” even as the costs of inference decrease.

For the broader market, the question is whether reasoning models deliver enough value — generating responses that are distinctly better — to justify the added costs. The technical nature of reasoning means it will inevitably be several times more expensive than “plain” LLM inference.

Inference, in general, offers a much wider choice of platforms, including older GPUs, modern CPUs, and specialized inference accelerators that are cheaper and more energy-efficient than the high-end devices used for training. A surge in demand for inference compute could largely be absorbed by facilities that are not suitable for training. These could be air-cooled and designed for lower rack power densities.

Unexpected demand could also mean that overall AI power consumption projections have to be revised upwards once again. Currently, training is responsible for the bulk of AI power consumption. While the share of power consumed by inference workloads is smaller, it is expected to grow much faster as organizations adopt AI-based products and services. If reasoning becomes popular, it will accelerate the growth rate.


The Uptime Intelligence View

Reasoning significantly increases the amount of compute resources required for model inference, but its hunger for infrastructure capacity will be moderated by the high cost of services.

Even so, the arrival of test-time compute could drive more demand for inference than previously thought, potentially offsetting some of the architectural efficiency gains in inference made in recent months.
