Musings on LLMs

Why Do LLMs Use So Much Energy?

September 29, 2025

Opinion

Authors:

Article shepherded by:

Rik Farrow

With all the hype about Large Language Models (LLMs), I've been wondering about several things. No, I am not worried about General AI taking over the world, like some people. Instead I've been trying to get a handle on why devices used for training LLMs are so expensive, and why training and operating an LLM requires so much energy.

There are some posts and articles about the energy used when creating GPT-4. One article [1], based on leaked data, suggests that GPT-4 was trained using 25 thousand NVIDIA A100 GPUs for 90-100 days. NVIDIA sells HGX servers that can hold eight A100s each and use 6.5 kilowatts under maximum load. As a guesstimate, the author multiplied the number of servers (3,125) by 6.5 kilowatts, 24 hours, and 90 days to come up with a whopping 43 gigawatt-hours of electricity. That's 43 thousand megawatt-hours, and as the price of a megawatt-hour varies with location and time of day, I just went with $50/MWh, so training GPT-4 likely cost over two million dollars in electricity alone. The article's author came up with a range that starts where I do but ends around 60 thousand megawatt-hours. For comparison, it required 854 megawatt-hours to 'mine' one Bitcoin in July of 2025.
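The guesstimate above can be reproduced in a few lines. This is only a sketch of the arithmetic, reading the figures as energy (kilowatts sustained over hours, so kilowatt-hours); the constant names are made up for illustration, and the $50 per megawatt-hour price is the article's own round number.

```python
# Back-of-the-envelope training energy, using the figures quoted in the text.
GPUS = 25_000            # A100 GPUs, per the leaked-data article [1]
GPUS_PER_SERVER = 8      # one HGX server holds eight A100s
SERVER_KW = 6.5          # kilowatts per server under maximum load
DAYS = 90                # low end of the 90-100 day range
PRICE_PER_MWH = 50.0     # assumed flat electricity price in dollars

servers = GPUS // GPUS_PER_SERVER              # 3,125 servers
energy_kwh = servers * SERVER_KW * 24 * DAYS   # kilowatt-hours over the run
energy_mwh = energy_kwh / 1000
cost_dollars = energy_mwh * PRICE_PER_MWH

print(f"{servers:,} servers")
print(f"{energy_mwh:,.0f} MWh ({energy_mwh / 1000:.1f} GWh)")
print(f"${cost_dollars:,.0f} in electricity")
```

Under these assumptions the run comes to 43,875 megawatt-hours and a little under $2.2 million. Cooling, networking, and storage are not counted.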

Just for fun, I decided to compare the GPT-4 estimates to ten-hour flights, roughly Seattle to London, in the fuel-efficient Boeing 787-9. The 787-9 burns about 5,400 liters of fuel per hour, so approximately 54,000 liters of fuel for each trans-Atlantic flight [2]. Each liter of fuel holds about as much energy as 10 kilowatt-hours of electricity, so each flight uses the equivalent of 540,000 kilowatt-hours, or 540 megawatt-hours. Another way of visualizing this is that one of the three nuclear reactors in Arizona can produce almost this much electricity in one hour.

Think about that for a minute. With carbon dioxide levels soaring to the point that some parts of the globe will become deadly within decades, flying across the ocean seems a bit irresponsible to me these days. Then compare that 540 megawatt-hours to the 43,000 megawatt-hours possibly used to train GPT-4: that's about 80 trans-Atlantic flights, using my lower energy figure. And GPT-4 is an older model with fewer parameters.

I think that puts training an LLM into a different sort of perspective.
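The flight comparison is easy to check. The sketch below uses the fuel burn and energy-per-liter figures from the text and treats the results as energy (megawatt-hours); the variable names are illustrative only.

```python
# Energy equivalent of one ten-hour 787-9 flight versus the GPT-4 estimate.
LITERS_PER_HOUR = 5_400    # approximate 787-9 fuel burn
FLIGHT_HOURS = 10          # roughly Seattle to London
KWH_PER_LITER = 10         # rough energy content of a liter of jet fuel
TRAINING_MWH = 43_000      # lower estimate for training GPT-4

flight_liters = LITERS_PER_HOUR * FLIGHT_HOURS       # 54,000 liters
flight_mwh = flight_liters * KWH_PER_LITER / 1000    # 540 MWh per flight
flights_equivalent = TRAINING_MWH / flight_mwh

print(f"one flight: {flight_mwh:,.0f} MWh")
print(f"training is about {flights_equivalent:.0f} flights")
```

This compares raw fuel energy with electricity at the wall, so it is a rough yardstick, not a carbon accounting.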

Training LLMs actually occurs in several phases:

Pre-training, the most energy-intensive part, where the LLM is fed a somewhat curated copy of the Internet, often including pirated copies of books; this produces what's called a foundation or base model.
Training, where LLMs are provided with example questions and answers as a method of improving the way LLMs will respond to queries.
Fine-tuning, the process of adjusting LLMs, similar to training but more focused.

You can read more about this process in an article written by Chinese researchers [3]. The folks at Shanghai AI Laboratory shared their experience with training and deploying an LLM at NSDI'24.

Once an LLM has been prepared, it is ready to be used in a process called inference. The amount of energy used per query at this point is small, but with a popular model, like those of OpenAI, it adds up when people are making millions of queries per day. One analysis [4] suggests that queries made using ChatGPT use only 0.3 watt-hours each, a lot less than the 3 watt-hours per query others had suggested. But that still gives a range of 600,000 watt-hours to 6 megawatt-hours per day for using just one LLM for inference at two million queries a day.

Burning Energy

That brings me to my second question: why so much energy and why do people want to use NVIDIA GPUs? I first did some digging into why NVIDIA GPUs are so popular and what is special about them.

The NVIDIA A100 could be found online for $30,000 apiece, based on my searches in the fall of 2023. The newer GPU, the H100, costs about as much, and is 1.6 to 6 times faster than the A100 [5]. Each GPU is actually a large chip, made by Taiwan Semiconductor Manufacturing Company (TSMC) for NVIDIA, tightly coupled with high-bandwidth memory (HBM2e on the A100, HBM3 on the H100). HBM consists of vertically stacked layers of RAM providing wider transfers of memory than ordinary DRAM.

You might think that other companies would have already copied the manufacturing of these chips, but you'd be wrong because these chips are unusual. Instead of just the usual collection of highly parallel but simple pipelined processing units typically found in a GPU, these chips have up to 80 GB of their own memory on each A100 board and boast up to 1,555 GB/second in access speed. The size of memory and speed of access are significant, as LLMs are immensely large, with sizes in the trillions of parameters, and each step in processing a token during pre-training means executing floating point multiplies and additions on all of these parameters.

If an LLM is advertised as having 1 trillion parameters, and uses 20 billion input tokens during training, that's 10**12 * 2*10**10, or 2*10**22 sets of floating point operations during pre-training. During his FAST'25 keynote [6], Seelam of IBM Cloud uses a larger value, showing in his slides that pre-training takes six times the total number of tokens times parameters.
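To get a feel for these numbers, here is a small sketch that applies the six-times rule of thumb to the example above. The per-GPU rate is the A100's 312 teraflops peak figure, used here as an idealized assumption; real training runs sustain well below peak, so the resulting GPU-days are a lower bound.

```python
# Pre-training compute for the example model, using the 6 * tokens * parameters rule.
PARAMETERS = 10**12            # 1 trillion parameters
TOKENS = 2 * 10**10            # 20 billion training tokens
A100_PEAK_FLOPS = 312 * 10**12 # assumed peak rate; sustained rates are lower

param_token_products = PARAMETERS * TOKENS      # 2 * 10**22
pretrain_flops = 6 * param_token_products       # 1.2 * 10**23

gpu_seconds = pretrain_flops / A100_PEAK_FLOPS
gpu_days = gpu_seconds / 86_400

print(f"{pretrain_flops:.1e} floating point operations")
print(f"{gpu_days:,.0f} GPU-days at peak on a single A100")
```

At peak this works out to roughly 4,450 GPU-days, which is why the work gets spread across thousands of GPUs.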

Ideally, all of the parallel pipelines provided by each NVIDIA GPU would remain busy constantly, but the reality is quite different. A model's parameters are not going to fit into the HBM on a single GPU, but must be distributed among multiple GPUs. This is also done for performance reasons. NVIDIA has its own interconnect, NVLink, allowing the exchange of data between GPUs at 600 GB/second, about one third the maximum memory rate. But model parameters cannot be neatly distributed in memory, even for one GPU.

The A100 has been specifically designed to work well with tensor processing. Tensors are like multidimensional arrays, and are key to performance during pre-training of LLMs, and the A100 can perform 156 to 312 teraflops using 32 bit floating point values.

When I began writing this article, the A100 and GPT-4 were new; today, it's the H100 and GPT-5. Ignoring GPT-5 for now, we can compare the A100 to the H100, both from NVIDIA. Based on an article [5] that itself is based on performance testing by MosaicML, a company offering a managed LLM training and inference service, the H100 is 1.6 to 6 times faster than the A100. There are two big reasons for that.

First are changes in how memory gets transferred to GPU compute cores. The H100 uses HBM3 that supports faster copying of memory to the cores than DRAM. The Tensor Memory Accelerator is a new part of the NVIDIA Hopper architecture that handles memory management, freeing up GPU cores to do what they are best at: math.

Second are improvements in floating point and integer calculations. One of the things that surprised me while researching this article was the use of eight-bit floating point (FP8). Any eight-bit value can take on no more than 256 values, and given that earlier LLMs were using 32-bit floating point, with 2^32 possible values, it didn't seem to me that FP8 would provide enough resolution. I asked Karl Koscher of NVIDIA why they didn't just use eight-bit integers, and he said it's because FP8 has a greater range and that's what's important.
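The point about range can be made concrete by decoding every possible FP8 bit pattern. The sketch below assumes the E4M3 layout (1 sign bit, 4 exponent bits, 3 mantissa bits, exponent bias of 7, no infinities), which is one of the two common FP8 encodings; the article does not say which format was being discussed, and the function name is made up for illustration.

```python
import math

def decode_e4m3(byte):
    """Decode one FP8 E4M3 bit pattern (0-255) into a Python float."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan                       # the only NaN encoding per sign
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 2**-6
        return sign * (mantissa / 8) * 2.0 ** -6
    # Normal: implicit leading 1, exponent bias of 7
    return sign * (1 + mantissa / 8) * 2.0 ** (exponent - 7)

finite = [v for v in map(decode_e4m3, range(256)) if not math.isnan(v)]
positive = [v for v in finite if v > 0]

fp8_max = max(finite)
fp8_min_positive = min(positive)
fp8_dynamic_range = fp8_max / fp8_min_positive
int8_dynamic_range = 127 / 1                  # largest over smallest positive int8

print(f"FP8 E4M3: {fp8_min_positive} to {fp8_max}")
print(f"FP8 ratio {fp8_dynamic_range:,.0f} vs. INT8 ratio {int8_dynamic_range:,.0f}")
```

Both formats have at most 256 codes, but FP8 spaces them unevenly: densely near zero, where most weights and gradients live, and sparsely at large magnitudes.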

But the H100 has the same problem as any other GPU when it comes to tensor operations. Tensors are multidimensional arrays. When a one-dimensional array is stored, programmers want to access each value one after the other, in order, as sequential access is fastest. Tensors represent multidimensional arrays as one-dimensional arrays with offsets (strides) for stepping along each dimension, so accessing parameters during any use of an LLM often means jumping around in memory, which is slower than reading consecutive values.
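The stride idea can be sketched in plain Python. The helper names below are hypothetical, and the layout assumed is row-major with strides counted in elements, which is how a freshly created contiguous PyTorch tensor is laid out.

```python
def row_major_strides(shape):
    """Elements to skip in the flat array for a step of one along each dimension."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))

def flat_offset(index, strides):
    """Position in the flat array of a multidimensional index."""
    return sum(i * s for i, s in zip(index, strides))

shape = (2, 3, 4)                      # a small three-dimensional tensor
strides = row_major_strides(shape)     # (12, 4, 1)

# Walking the last dimension touches consecutive memory locations...
last_dim_walk = [flat_offset((0, 0, k), strides) for k in range(4)]
# ...while walking the other dimensions jumps by 4 or 12 elements each step.
middle_dim_walk = [flat_offset((0, j, 0), strides) for j in range(3)]
first_dim_walk = [flat_offset((i, 0, 0), strides) for i in range(2)]

print(strides, last_dim_walk, middle_dim_walk, first_dim_walk)
```

Only the innermost dimension gives the consecutive access pattern that memory systems reward; with real tensor shapes the jumps along the other dimensions span thousands of elements.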

The sharding of parameters across GPUs is also an issue. Perhaps the LLM has a *small* number of parameters, just 20 billion, that could theoretically fit into the 80 gigabytes of HBM on one GPU. But for performance reasons, people typically use multiple GPUs, so the parameters must be distributed. While eight NVIDIA GPUs will fit in one HGX server, multiple servers, connected using networking, may be in use. All of this slows down processing from peak speeds.
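A rough sizing sketch shows when sharding becomes unavoidable. It counts only the bytes needed to hold the weights, assumes a gigabyte is 10**9 bytes, and uses hypothetical helper names; training also needs room for gradients, optimizer state, and activations, so real requirements are considerably higher.

```python
import math

GPU_MEMORY_GB = 80                                   # memory on one A100 board
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}   # storage per parameter

def weights_gb(parameters, precision):
    """Gigabytes needed just to store the weights at a given precision."""
    return parameters * BYTES_PER_PARAM[precision] / 1e9

def min_gpus_for_weights(parameters, precision):
    """Fewest GPUs whose combined memory can hold the weights alone."""
    return math.ceil(weights_gb(parameters, precision) / GPU_MEMORY_GB)

small_model = 20 * 10**9     # the 20-billion-parameter example
large_model = 18 * 10**11    # 1.8 trillion parameters

for name, params in (("small", small_model), ("large", large_model)):
    for precision in BYTES_PER_PARAM:
        print(name, precision,
              f"{weights_gb(params, precision):,.0f} GB,",
              min_gpus_for_weights(params, precision), "GPU(s)")
```

The 20-billion-parameter model exactly fills one 80 GB board at 32-bit precision with nothing left over for anything else, while the larger model needs dozens of GPUs just to hold its weights.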

I like to compare the sharding of parameters to the way weather-prediction supercomputers work, as I find it easier to visualize. Weather prediction algorithms split up the Earth's atmosphere into 1.5 to 2 kilometer segments with several layers. If supercomputers had a single, shared pool of memory, you might imagine that the parameters for all segments would be stored in order. However, these segments are part of the 3D space representing Earth's atmosphere and cannot be stored as if they were a one-dimensional array. Worse yet, this data is sharded across thousands of pools of memory.

While processing parameters for one group of segments, that data might fit into one pool of memory. But what happens at the borders of this group of segments? That data gets stored in other pools and must be communicated across pools.
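The border problem can be illustrated with a toy one-dimensional version. This is a deliberately simplified sketch, not how any real weather code is written: the function name is hypothetical, the grid is a plain list, and it assumes the number of cells divides evenly among the pools.

```python
def shard_with_halos(cells, pools):
    """Split cells into equal shards, noting the neighboring border value each shard needs."""
    size = len(cells) // pools
    shards = []
    for p in range(pools):
        start, stop = p * size, (p + 1) * size
        shards.append({
            "own": cells[start:stop],
            # Values owned by neighboring pools that must be copied in
            "left_halo": cells[start - 1] if start > 0 else None,
            "right_halo": cells[stop] if stop < len(cells) else None,
        })
    return shards

cells = list(range(12))                  # twelve segments in a row
shards = shard_with_halos(cells, pools=3)

# Every non-empty halo is a value that crosses a pool boundary each time step.
halo_transfers = sum(
    (s["left_halo"] is not None) + (s["right_halo"] is not None) for s in shards
)
print(shards[1], halo_transfers)
```

In one dimension each interior shard needs two border values per step; in three dimensions the borders are whole surfaces, so the communication volume grows quickly.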

For LLMs, the problem is even worse, as there are many more dimensions to the tensors, and parameters will be based on parameters in other tensors, meaning data must be copied between GPUs, and even between GPU servers. All this conspires to make using LLMs slower, and more energy expensive, than they might ideally be.

LLMs and Parameters

LLM base models have gotten much larger. The leap in abilities from GPT-3 to GPT-4 suggested to AI experts that models trained on more data, with more parameters, worked better. Companies have now trained models with over a terabyte of parameters. The amount of training data has not grown as fast because once you have scraped the internet and copied libraries full of books, you're faced with the limit on the amount of training data. Having better training data is actually more important, and filtering training data turns out to be a very important task. You may want to watch the FAST'25 keynote by Seetharami Seelam [6] where he describes the training phases for a foundation model using IBM's Vela architecture.

NVIDIA has been supporting parallel processing using the CUDA libraries since 2007. The most popular libraries for working with ML all support CUDA, and I think this alone is a huge advantage currently for NVIDIA, and owners of NVIDIA GPUs. It's been estimated that the markup on an A100 is 100 times the cost of making one, and that is unlikely to change for years, until other chip manufacturers can build similar hardware and come up with equally efficient software libraries for ML.

CUDA libraries aren't just device drivers for communicating with NVIDIA GPUs. They are also very clever compilers that manage the arrangement of parameters in memory. CUDA has been in development for nearly two decades, meaning that NVIDIA has that much experience in software that best arranges parameters for LLMs and other AI software.

Software

Riley Eller, a friend and fellow security geek, started prodding me to look deeper into ML in 2023. He did this by being excited about ML, using it in his latest startup, and then pointing me in the direction of online tutorials by Andrej Karpathy, who worked for Tesla on their vision recognition systems as well as being a founder of OpenAI.

Karpathy has produced a number of fast-paced videos with examples you can build yourself—for the most part. When I began watching these, I found myself pausing the video often, trying to understand what he just did. Or why he did something. Still, I found it fascinating to watch him build two models in his second tutorial that input a list of first names and use that to predict whether a collection of characters might be a first name. For example, 'ellie' would have a result of 100%, while his own name with an 'x' after it, would be totally unlikely [7].

Riley, being the smart guy that he is, actually first pointed me to a later video where Karpathy uses the collection of all the text in Shakespeare's plays as the input tokens, about one million words. He then builds a Generative Pre-trained Transformer (GPT) and starts training it, then tuning it to reduce its loss. Eventually, he produces a model that outputs lines of text somewhat like a Shakespeare play in style. Less trained versions could only produce random characters, so watching this process, and the tuning required, was enlightening to say the least.

Watching Karpathy only informed me about how little I knew about ML. I found myself annoyed at terms he mentioned, or the uses of variable names like x, y, w, or logits that seemed arbitrary to me. Then I started reading books about ML, and learned that these were not arbitrary at all.

I started with something I thought would be simple and quick: The Hundred-Page Machine Learning Book by Andriy Burkov. I had read the first chapter, found his writing to be clear and concise, and bought the book. The second chapter briefly covers the math he would be using throughout the rest of the book. While I have used math over my career, suddenly the calculus course I took fifty years ago took on new importance, as did statistics.

Burkov, as well as other ML researchers and physicists, uses math as a language. Math is very concise as well as precise. I could understand enough of the math to barely get the points Burkov makes later in the book. Which, by the way, is 136 pages long. The book's name was a marketing trick, writes Burkov. And his book ends with just a couple of pages relating to the technique that makes LLMs possible: transformers. I also read other online sources about the Python libraries used in ML.

I wondered why in the world people would use Python for the compute-heavy tasks found in ML. There are a couple of answers, one being the conciseness of Python and its support for classes. The other is PyTorch, a library written in C++ that includes the tensor class with methods implemented in C++, meaning that compute-heavy operations are done there. So while Karpathy's tutorials use Python, they are all calling C++ routines which may in turn call into CUDA if you have NVIDIA devices. If you want to understand more about this, I suggest reading the first half of the notes about PyTorch in Edward Z. Yang's blog entry about getting started with contributing to PyTorch [8]. You might also want to look at llm.c, a project started by Karpathy for recreating GPT-2 using only C.

I kept watching Karpathy's videos, and worked with the Jupyter notebooks he uses in his tutorials. As I kept at it, I began to learn how machine learning works in LLMs. I had been upset when I learned that weights were initialized to random values, but Karpathy explains why this works, and why some other choices like starting weights set to zero wouldn't work.

The parameters in LLMs are the weights and biases for each 'neuron' in a model. Hyperparameters are the settings chosen when designing and training an LLM, for example, the dimensions used in each layer, the number of hidden layers, or the learning rate, which controls how much of the gradient is used to adjust the weights during backpropagation.

Models mainly learn through the process of backpropagation, where the current prediction for the next token is matched against the actual next token, and the mismatch is the 'loss'. The loss, actually the negative log of the probability assigned to the correct token, is used to calculate the gradient, and the gradient is used to adjust weights moving backwards, up through the model, to the input layer. At this point, it's even easier for me to understand why the process of training LLMs is so energy intensive: every parameter gets changed with each step.
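The loss and update step can be shown at toy scale. The sketch below is a heavy simplification under stated assumptions: the "model" is just four numbers (one score per possible next token) that are adjusted directly, with no layers to propagate through, and the learning rate of 0.5 is an arbitrary choice. It does use the same loss, the negative log of the probability given to the correct token.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_gradient(logits, target):
    """Negative log likelihood of the target token and its gradient w.r.t. the logits."""
    probs = softmax(logits)
    loss = -math.log(probs[target])
    # For softmax followed by negative log likelihood, the gradient is probs minus one-hot
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return loss, grad

logits = [0.0, 0.0, 0.0, 0.0]   # four possible next tokens, all equally likely at first
target = 2                      # index of the token that actually came next
learning_rate = 0.5             # hyperparameter: how much of the gradient to apply

losses = []
for step in range(5):
    loss, grad = loss_and_gradient(logits, target)
    losses.append(loss)
    # Every parameter is adjusted on every step, not just the one for the target
    logits = [w - learning_rate * g for w, g in zip(logits, grad)]

print([round(l, 3) for l in losses])
```

The loss starts at log(4), the cost of a uniform guess among four tokens, and falls with each step. A real LLM does the same thing across billions of weights per step.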

You probably remember the big leap in performance when OpenAI released GPT-4: from 175 billion parameters in GPT-3 to 1.8 trillion parameters in GPT-4. At the time, it appeared that increasing the number of parameters increased the ability of the LLM to create useful output. Increasing the number of parameters by a factor of 10 did make a difference, but so did other changes in the LLM, like having a Mixture of Experts architecture for use during inference. But, I believe, the increase in size and complexity is a huge part of the drive to build enormous data centers for enormous LLMs. Gundlach et al. [11] have theorized that 'meeker' LLMs are the true future, not large ones.

NVIDIA gets to take advantage of this movement, as they will be able to sell more of their costly GPUs. On September 22, 2025, NVIDIA announced it will invest $100 billion in building data centers for OpenAI in exchange for OpenAI equity, an enormous bet on the technology. And many people are worried about how AI will upend the job market, replacing both entry-level jobs and some much more lucrative positions.

Perhaps, just perhaps, things are not quite as rosy as OpenAI, Anthropic, and others think they are. I found research where two groups of senior programmers wrote code, one group using AI while the other didn't. The expectation was that the AI-using group would work faster, but it didn't. It took the group using AI 20% longer to perform similar tasks, probably because they had to first understand and then correct code produced by something other than themselves [9].

LLMs predict the next output token, whether for a marketing blurb or code, based on some amount of context (the previous tokens, including the query) and on having been trained on as much data as can be found online. In some instances, using advanced AIs has proven tremendously helpful, for example, when searching for new molecules for drugs or superconductors. Those AIs were trained specifically for those applications. General-purpose LLMs are not going to be making new discoveries in chemistry, as they predict tokens, that is, fragments of text, based on their training input.

And let's consider biological models, instead of AIs. Scientists have completely mapped a fruit fly's brain with 139,255 neurons, and around 15.1 million weighted edges [parameters] when represented as a graph. With this, a fruit fly's brain supports complex motor control while walking or in flight, courtship behaviour, involved decision making, flexible associative memory, spatial learning and complex multisensory navigation [10].

Or how about the human brain, with about 12 trillion synapses—parameters. Just like an LLM, training a human brain takes enormous effort, especially with the low bandwidth of reading and the spoken word. Still, at 20 watts TDP, many humans can outperform LLMs requiring megawatts of power, even if a lot more slowly. I think the future is going to be very interesting indeed, but I am not expecting General AI anytime soon, barring advances in fusion technology to power them at the very least.

Appendix

References:

[1] Kasper Groes Albin Ludvigsen, The Carbon Footprint of GPT-4: https://towardsdatascience.com/the-carbon-footprint-of-gpt-4-d6c676eb21ae

[2] Fuel Efficiency of Modern Aircraft https://www.smh.com.au/traveller/reviews-and-advice/how-fuel-efficient-a...

[3] Qinghao Hu, Peng Sun, Tianwei Zhang, Understanding the Workload Characteristics of Large Language Model Development: https://www.usenix.org/publications/loginonline/understanding-workload-c...

[4] Josh You, How much energy does ChatGPT use? https://epoch.ai/gradient-updates/how-much-energy-does-chatgpt-use

[5] Gcore, NVIDIA GPUs: H100 vs. A100 comparison: https://gcore.com/blog/NVIDIA-h100-a100

[6] Seetharami R. Seelam, Insights Gained from Delivering Two Generations of AI Supercomputers and Storage Solutions in IBM Cloud: https://www.usenix.org/system/files/fast25_slides-seelam-keynote.pdf

[7] Andrej Karpathy, Neural Networks: Zero to Hero: https://github.com/karpathy/nn-zero-to-hero?tab=readme-ov-file

[8] Edward Z. Yang, PyTorch internals blog entry on contributing to PyTorch: https://blog.ezyang.com/2019/05/pytorch-internals

[9] Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/

[10] Schlegel, P., Yin, Y., Bates, A.S. et al. Whole-brain annotation and multi-connectome cell typing of Drosophila. Nature 634, 139–152 (2024). https://doi.org/10.1038/s41586-024-07686-5 or https://www.nature.com/articles/s41586-024-07686-5

[11] Hans Gundlach, Jayson Lynch, Neil Thompson, Meek Models Shall Inherit the Earth, TAIG ICML 2025: https://arxiv.org/abs/2507.07931

Article Categories:

AI/ML

Hardware

Last updated October 18, 2025

Authors:

Rik Farrow has been a consultant for 45 years. He has written two books, as well as worked as the technical editor for a Unix magazine and for two editions of a popular operating system book. He also taught Unix system administration and Internet security during the 90s and noughts internationally, and worked as a volunteer for USENIX program and steering committees. Rik was the editor of ;login: from 2005 to 2025.
