Uptime Institute Blog
AI power fluctuations strain both budgets and hardware

December 17, 2025 / in Design, Executive, Operations / by Douglas Donnellan, Senior Research Associate, Uptime Institute, ddonnellan@uptimeinstitute.com

AI training at scale introduces power consumption patterns that can strain both server hardware and supporting power systems, shortening equipment lifespans and increasing the total cost of ownership (TCO) for operators.

These workloads can cause GPU power draw to spike briefly, sometimes for only a few milliseconds, pushing the GPUs past their nominal thermal design power (TDP) or up against their absolute power limits. Over time, the resulting thermal stress can degrade GPUs and their onboard power delivery components.

Even when average power draw stays within hardware specifications, thermal stress can affect voltage regulators, solder joints and capacitors. This kind of wear is often difficult to detect and may only become apparent after a failure. As a result, hidden hardware degradation can ultimately affect TCO — especially in data centers that are not purpose-built for AI compute.

Strain on supporting infrastructure

AI training power swings can also push server power supply units (PSUs) and connectors beyond their design limits. PSUs may be forced to absorb rapid current fluctuations, straining their internal capacitors and increasing heat generation. In some cases, power swings can trip overcurrent protection circuits, causing unexpected reboots or shutdowns. Certain power connectors, such as the standard 12VHPWR cables used for GPUs, are also vulnerable. High contact resistance can cause localized heating, further compounding the wear and tear effects.

When AI workloads involve many GPUs operating in synchronization, power swing effects multiply. In some cases, simultaneous power spikes across multiple servers may exceed the rated capacity of row-level UPS modules — especially if they were sized following legacy capacity allocation practices. Under such conditions, AI compute clusters can sometimes reach 150% of their steady-state maximum power levels.
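As a rough illustration of the arithmetic, the sketch below checks whether a UPS module has headroom when a cluster peaks at 150% of its steady-state maximum. The function names, the 8-server row and the 100 kW rating are hypothetical values chosen for the example; they are not taken from the survey or from any vendor specification.

```python
def peak_cluster_power_kw(steady_state_kw: float, peak_ratio: float = 1.5) -> float:
    """Estimate synchronized peak draw from the steady-state maximum.

    peak_ratio=1.5 mirrors the 150% figure quoted in the text; real
    clusters may behave differently.
    """
    return steady_state_kw * peak_ratio


def ups_headroom_kw(ups_rating_kw: float, steady_state_kw: float,
                    peak_ratio: float = 1.5) -> float:
    """Positive result: spare capacity at peak. Negative: overload at peak."""
    return ups_rating_kw - peak_cluster_power_kw(steady_state_kw, peak_ratio)


if __name__ == "__main__":
    # Hypothetical row: 8 servers at 10 kW steady-state maximum each,
    # fed by a 100 kW row-level UPS module.
    steady = 8 * 10.0
    print(peak_cluster_power_kw(steady))    # 120.0
    print(ups_headroom_kw(100.0, steady))   # -20.0
```

In this example the module looks comfortably sized against the 80 kW steady-state load, yet it falls 20 kW short during a synchronized spike.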

In extreme cases, load fluctuations of large AI clusters can exceed a UPS system’s capability to source and condition power, forcing it to use its stored energy. This happens when the UPS is overloaded and unable to meet demand using only its internal capacitance. Repeated substantial overloads will put stress on internal components as well as the energy storage subsystem. For batteries, particularly lead-acid cells, this can shorten their service life. In worst-case scenarios, these fluctuations may cause voltage sags or other power quality issues (see Electrical considerations with large AI compute).

Capacity planning challenges

Accounting for the effects of power swings from AI training workloads during the design phase is challenging. Many circuits and power systems are sized based on the average demand of a large and diverse population of IT loads, rather than their theoretical combined peak. In the case of large AI clusters, this approach can lead to a false sense of security in capacity planning.
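The gap between the two sizing approaches can be sketched as follows. The `demand_factor` parameter (the fraction of total rated load assumed to be drawn at once) and all numeric values are assumptions made for illustration; actual sizing practice and terminology vary by jurisdiction and design standard.

```python
def diversified_demand_kw(rated_loads_kw: list[float], demand_factor: float) -> float:
    """Sizing on average demand: total rated load scaled by an assumed
    demand factor between 0 and 1 (hypothetical parameter)."""
    if not 0 < demand_factor <= 1:
        raise ValueError("demand_factor must be in (0, 1]")
    return sum(rated_loads_kw) * demand_factor


def coincident_peak_kw(rated_loads_kw: list[float], peak_ratio: float) -> float:
    """Worst case for synchronized loads: every load peaks at the same time."""
    return sum(load * peak_ratio for load in rated_loads_kw)


if __name__ == "__main__":
    loads = [10.0, 10.0, 10.0, 10.0]          # four hypothetical 10 kW GPU servers
    sized = diversified_demand_kw(loads, 0.5)  # circuit sized for 20.0 kW
    peak = coincident_peak_kw(loads, 1.5)      # synchronized peak of 60.0 kW
    print(sized, peak, peak - sized)           # 20.0 60.0 40.0
```

A demand factor that is reasonable for a mixed IT population understates the load here because synchronized GPUs do not average each other out.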

When peak amplitudes are underestimated, branch circuits can overheat, breakers may trip, and long-term damage can occur to conductors and insulation, particularly in legacy environments that lack the headroom to adapt. Compounding this challenge, typical monitoring tools sample GPU power at intervals of 100 milliseconds or longer, which is too slow to detect the microsecond-speed spikes that can accelerate hardware wear through current inrush.
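The sampling problem can be shown with a small simulation. The sketch below builds a synthetic one-second power trace at 1 ms resolution with three 5 ms spikes, then polls it every 100 ms. The baseline of 700 W, the spike level of 1,000 W and the spike timings are invented for the example, and real spikes may be far shorter than the 1 ms resolution used here.

```python
def build_trace(duration_ms: int, baseline_w: float, spike_w: float,
                spike_starts_ms: list[int], spike_len_ms: int) -> list[float]:
    """Return one power reading per millisecond with short spikes inserted."""
    trace = [baseline_w] * duration_ms
    for start in spike_starts_ms:
        for t in range(start, min(start + spike_len_ms, duration_ms)):
            trace[t] = spike_w
    return trace


def poll(trace: list[float], interval_ms: int) -> list[float]:
    """Keep only the readings a monitoring tool would see at a fixed interval."""
    return trace[::interval_ms]


if __name__ == "__main__":
    trace = build_trace(1000, 700.0, 1000.0, [150, 450, 750], 5)
    polled = poll(trace, 100)
    print(max(trace))    # 1000.0 (true peak)
    print(max(polled))   # 700.0 (peak seen by 100 ms polling)
```

None of the spikes coincide with a polling instant, so the monitored peak equals the baseline even though the hardware saw 1,000 W three times in one second.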

Estimating peak power behavior depends on several factors, including the AI model, training dataset, GPU architecture and workload synchronization. Two training runs on identical hardware can produce vastly different power profiles. This uncertainty significantly complicates capacity planning, leading to under-provisioned resources and increased operational risks.

Facility designs for large-scale AI infrastructure need to account for the impact of dynamic power swings. Operators of dedicated training clusters may overprovision UPS capacity, use rapid-response PSUs, or set absolute power and rate-of-change limits on GPU servers using software tools (e.g., nvidia-smi, the NVIDIA System Management Interface). While these approaches can help reduce the risk of power-related failures, they also increase capital and operational costs and can reduce efficiency under typical load conditions.
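As one possible way to apply an absolute power cap, the sketch below wraps the `nvidia-smi` command line in Python. It relies on the `-i` (GPU index) and `-pl` (power limit in watts) options and on the `power.limit`, `power.min_limit` and `power.max_limit` query fields. The helper names and the 300 W value are hypothetical, the accepted range depends on the GPU model, and changing the limit typically requires administrator privileges. Rate-of-change limits are not covered here.

```python
import subprocess


def power_limit_command(gpu_index: int, limit_watts: int) -> list[str]:
    """Build the nvidia-smi command that caps one GPU's power draw."""
    if limit_watts <= 0:
        raise ValueError("limit_watts must be positive")
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(limit_watts)]


def query_limits_command() -> list[str]:
    """Build the command that reports current, minimum and maximum limits."""
    return [
        "nvidia-smi",
        "--query-gpu=index,power.limit,power.min_limit,power.max_limit",
        "--format=csv,noheader,nounits",
    ]


def apply_power_limit(gpu_index: int, limit_watts: int) -> None:
    """Run the cap command; raises CalledProcessError if nvidia-smi rejects it."""
    subprocess.run(power_limit_command(gpu_index, limit_watts), check=True)


if __name__ == "__main__":
    # Print the commands rather than running them, so the sketch is safe
    # to execute on a machine without GPUs.
    print(" ".join(query_limits_command()))
    print(" ".join(power_limit_command(0, 300)))
```

Checking the reported minimum and maximum before applying a cap avoids requesting a value the driver will refuse.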

Many smaller operators — including colocation tenants and enterprises exploring AI — are likely testing or adopting AI training on general-purpose infrastructure. Nearly three in 10 operators already perform AI training, and of those that do not, nearly half expect to begin in the near future, according to results from the Uptime Institute AI Infrastructure Survey 2025 (see Figure 1).

Figure 1 Three in 10 operators currently perform AI training


Many smaller data center environments may lack workload diversity (non-AI loads) to absorb power swings or the specialized engineering to manage dynamic power consumption behavior. As a result, these operators face a greater risk of failure events, hardware damage, shortened component lifespans and reduced UPS reliability — all of which contribute to higher TCO.

Several low-cost strategies can help mitigate risk. These include oversizing branch circuits — ideally dedicating them to GPU servers — distributing GPUs across racks and data halls to prevent localized hotspots, and setting power caps on GPUs to trade some peak performance for longer hardware lifespan.

For operators considering or already experimenting with AI training, TDP alone is an insufficient design benchmark for capacity planning. Infrastructure needs to account for rapid power transients, workload-specific consumption patterns, and the complex interplay between IT hardware and facility power systems. This is particularly crucial when using shared or legacy systems, where the cost of misjudging these dynamics can quickly outweigh the perceived benefits of performing AI training in-house.


The Uptime Intelligence View

For data centers not specifically designed to support AI training workloads, GPU power swings can quietly accelerate hardware degradation and increase costs. Peak power consumption of these workloads is often difficult to predict, and signs of component wear may remain hidden until failures occur. Larger operators with dedicated AI infrastructure are more likely to address these power dynamics during the design phase, while smaller operators — or those using general-purpose infrastructure — may have fewer options.

To mitigate risk, these operators can consider overprovisioning rack-level UPS capacity for GPU servers, oversizing branch circuits (and dedicating them to GPU loads where possible), distributing heat from GPU servers across racks and rooms to avoid localized hotspots, and applying software-based power caps. Data center operators should also factor in more frequent hardware replacements during financial planning to more accurately reflect the actual cost of running AI training workloads.
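The budgeting effect of shorter hardware lifespans can be approximated with simple straight-line arithmetic. The purchase cost and the lifespans below are hypothetical placeholders, not figures from Uptime Institute research, and the sketch ignores residual value, financing and downtime costs.

```python
def annualized_hardware_cost(purchase_cost: float, lifespan_years: float) -> float:
    """Spread the purchase cost evenly over the expected service life."""
    if lifespan_years <= 0:
        raise ValueError("lifespan_years must be positive")
    return purchase_cost / lifespan_years


if __name__ == "__main__":
    cost = 1_200_000.0  # hypothetical GPU server fleet cost
    planned = annualized_hardware_cost(cost, 5)   # planned 5-year refresh
    actual = annualized_hardware_cost(cost, 3)    # replacement after 3 years
    print(planned, actual, actual - planned)      # 240000.0 400000.0 160000.0
```

In this example, replacing the same hardware after three years instead of five raises the annual hardware cost by about two-thirds.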

The following Uptime Institute experts were consulted for this report:
Chris Brown, Chief Technical Officer, Uptime Institute
Daniel Bizo, Senior Research Director, Uptime Institute Intelligence
Max Smolaks, Research Analyst, Uptime Institute Intelligence

Other related reports published by Uptime Institute include:
Electrical considerations with large AI compute

Tags: AI, artificial intelligence, Data Center, digital Infrastructure
© 2014-2025 Uptime Institute, LLC All rights reserved.