Meta engineer: Only two nuclear power plants needed to fuel AI inference next year

Meta’s director of engineering for Generative AI, Sergey Edunov, has a surprising answer to the question of how much more power will be needed to handle the growing demand for AI applications next year: just two new nuclear power plants.

Edunov leads the training effort for Llama 2, Meta’s open-source foundation model and one of the leading models in the field. Speaking during a panel session I moderated at the Digital Workers Forum last week in Silicon Valley, he said two power plants appear to be enough to cover humanity’s AI needs for a year, a cost he called acceptable. Addressing questions about whether the world has enough capacity to handle growing AI power needs, especially given the rise of power-hungry generative AI applications, he said: “We can definitely solve this problem.”

Edunov made clear that he was working from back-of-the-envelope math, but said it provides a good ballpark estimate of how much power will be needed for AI “inference,” the process by which a deployed AI model responds to a question or makes a recommendation.

Inference is distinct from AI model “training,” in which a model learns from massive amounts of data before it is ready to perform inference.

Training of large language models (LLMs) has drawn scrutiny recently because it requires massive processing, though only up front. Once a model has been trained, it can be used over and over for inference, which is where the real application of AI happens.

Power needs for inference are under control

Edunov gave two separate answers, one for inference and one for training. His first addressed inference, where the majority of processing will happen as organizations deploy AI applications. He explained his simple calculation for the inference side: Nvidia, the dominant supplier of AI processors, appears set to release between one million and two million of its H100 GPUs next year. If all of those GPUs were used to generate “tokens” for reasonably sized LLMs, the output adds up to about 100,000 tokens per person on the planet per day, which he admitted is quite a lot of tokens.
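
To see where a number like that comes from, here is a minimal sketch of the arithmetic. The GPU count is the low end of the shipment estimate from the panel; the per-GPU throughput of roughly 10,000 tokens per second for a mid-sized LLM and the world population of 8 billion are illustrative assumptions, not figures Edunov gave.

```python
# Back-of-the-envelope token math, roughly reproducing Edunov's figure.
# The GPU count is the low end of the 1M-2M shipment estimate from the
# panel; the throughput and population are illustrative assumptions.
H100_COUNT = 1_000_000
TOKENS_PER_SEC_PER_GPU = 10_000   # assumed throughput for a mid-sized LLM
WORLD_POPULATION = 8_000_000_000
SECONDS_PER_DAY = 86_400

tokens_per_day = H100_COUNT * TOKENS_PER_SEC_PER_GPU * SECONDS_PER_DAY
per_person = tokens_per_day / WORLD_POPULATION
print(f"{per_person:,.0f} tokens per person per day")  # ~108,000
```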

Tokens are the basic units of text that LLMs use to process and generate language. They can be words, parts of words, or even single characters, depending on how the LLM is designed. For example, the word “hello” can be a single token, or it can be split into two tokens: “hel” and “lo”. The more tokens an LLM can handle, the more complex and diverse the language it can produce.
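
To make tokenization concrete, the snippet below runs a few strings through OpenAI’s open-source tiktoken tokenizer; the exact splits are specific to that encoding and will differ from the tokenizers Meta or other labs use.

```python
# Tokenization example using the open-source tiktoken library
# (pip install tiktoken). Splits shown are specific to the
# cl100k_base encoding; other tokenizers split differently.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["hello", "unbelievably", "Generating tokens, the electricity of the GenAI era"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {len(ids)} token(s): {pieces}")
```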

So how much electricity is needed to generate that many tokens? Each H100 GPU draws about 700 watts, and since the surrounding data center and cooling need electricity too, Edunov rounded up to 1 kW per GPU. Add it all up, and that’s just two nuclear reactors needed to power all of those H100s. “At the scale of humanity, it’s not that much,” Edunov said. “I think as humans as a society we can afford to pay up to 100,000 tokens per day per person on this planet. So on the inference side, I feel like it might be okay where we are right now.”
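
The power side of the estimate is even simpler arithmetic: the 1 kW per GPU comes from the panel, while the assumption that a typical large nuclear reactor supplies on the order of 1 GW of electrical output is ours.

```python
# Power back-of-the-envelope: ~1 kW per GPU (700 W for the H100 plus
# data center overhead and cooling, per Edunov's rounding), versus a
# typical large nuclear reactor at roughly 1 GW of electrical output
# (an assumed order-of-magnitude figure for reactor size).
KW_PER_GPU = 1.0
REACTOR_OUTPUT_KW = 1_000_000  # ~1 GW per reactor

for gpu_count in (1_000_000, 2_000_000):
    demand_kw = gpu_count * KW_PER_GPU
    reactors = demand_kw / REACTOR_OUTPUT_KW
    print(f"{gpu_count:,} GPUs -> {demand_kw:,.0f} kW ~ {reactors:.0f} reactor(s)")
```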

(After the session, Edunov clarified to VentureBeat that his remarks referred to the power needed for the added AI compute from the new influx of Nvidia’s H100s, which are designed especially for AI applications and are thus the most notable. Beyond the H100s, there are older Nvidia GPU models, AMD and Intel CPUs, and special-purpose AI accelerators that also perform inference.)

For training generative AI, getting enough data is the problem

Training LLMs is a different challenge, Edunov said. The main constraint is getting enough data to train them. It is widely speculated that GPT-4 was trained on the whole internet, he said. Here he made some more simple assumptions: the entire publicly available internet, simply downloaded, amounts to roughly 100 trillion tokens. Clean it up and de-duplicate the data, and that drops to 10 trillion to 20 trillion tokens; focus on high-quality tokens, and the amount falls even lower. “The amount of distilled knowledge that humanity created over the ages is not that big,” he said, especially when models need ever more data to scale to better performance.

He estimates that next-generation, higher-performing models will require 10 times more data. So if GPT-4 was trained on, say, 20 trillion tokens, the next model would require some 200 trillion tokens, and there may not be enough public data for that. That’s why researchers are working on techniques that let models learn more from smaller amounts of data. LLMs may also have to tap alternative sources, such as multimodal data like video. “Those are vast amounts of data that can enable future scaling,” he said.
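
Put as a short script, the data gap looks like this; the token counts are the rough figures Edunov cited, and the GPT-4 training-set size is the speculative number mentioned on the panel, not a confirmed one.

```python
# Data-availability arithmetic using the rough figures from the panel.
# The GPT-4 training-set size is speculative, not a confirmed number.
RAW_INTERNET_TOKENS = 100e12              # ~100 trillion tokens, uncleaned
CLEANED_LOW, CLEANED_HIGH = 10e12, 20e12  # after cleanup and de-duplication
GPT4_TRAINING_TOKENS = 20e12              # speculative figure cited on the panel
SCALE_FACTOR = 10                         # assumed data multiplier per generation

next_gen = GPT4_TRAINING_TOKENS * SCALE_FACTOR
print(f"Raw public internet: {RAW_INTERNET_TOKENS / 1e12:.0f} trillion tokens")
print(f"Next-generation need: {next_gen / 1e12:.0f} trillion tokens")
print(f"Shortfall vs. cleaned public web: "
      f"{next_gen / CLEANED_HIGH:.0f}x to {next_gen / CLEANED_LOW:.0f}x")
# The cleaned public internet covers only 5-10% of that requirement.
```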

Edunov spoke on a panel titled “Generating Tokens: The Electricity of the GenAI Era,” joined by Nik Spirin, director of GenAI for Nvidia, and Kevin Tsai, head of solution architecture for GenAI at Google.

Spirin agreed with Edunov that reservoirs of data exist beyond the public internet, including data behind corporate firewalls and in private forums, though they are not easily accessible. Organizations with access to that data could use it to customize foundation models.

Society has an interest in getting behind the best open-source foundation models rather than supporting too many independent efforts, Spirin said. That would save compute, since a shared model can be pre-trained once, leaving most of the effort for building intelligent downstream applications, and it would help avoid hitting data limits anytime soon.

Google’s Tsai added that several other technologies can take pressure off training. Retrieval augmented generation (RAG) lets organizations tap their troves of data to ground a foundation model’s responses without retraining it. While RAG has its limits, other technologies Google has experimented with, such as sparse semantic vectors, can help. “The community can come together with useful models that can be repurposed in many places. And that’s probably the way to go right, for the earth,” he said.
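
For readers unfamiliar with the pattern Tsai described, here is a minimal RAG sketch: retrieve the most relevant private documents at query time and prepend them to the prompt, rather than baking that data into model weights. The toy word-overlap retriever below is a simplification; production systems use dense vector embeddings (or the sparse semantic vectors Tsai mentioned) and send the augmented prompt to a real LLM.

```python
# Minimal retrieval-augmented generation (RAG) sketch. A toy
# bag-of-words retriever stands in for a real embedding model,
# and "generation" is just prompt assembly; in practice the
# augmented prompt would be sent to an LLM.
import math
import re
from collections import Counter

DOCS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support hours are 9am to 5pm Pacific, Monday through Friday.",
    "Enterprise plans include a dedicated account manager.",
]

def vectorize(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    q = vectorize(query)
    return sorted(DOCS, key=lambda d: cosine(q, vectorize(d)), reverse=True)[:k]

query = "How many days do I have to return a purchase?"
context = "\n".join(retrieve(query))
print(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```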

Predictions: We’ll know if AGI is possible within three or four years, and LLMs will provide enterprises “massive” value

At the end of the panel, I asked the panelists for their predictions on how LLMs will grow in capability over the next two to three years, and where they will hit limits. In general, they agreed that while it’s unclear just how much LLMs will be able to improve, significant value has already been demonstrated, and enterprises will likely be deploying LLMs en masse within about two years.

Improvements to LLMs could either continue exponentially or start to taper off, said Meta’s Edunov. Either way, he predicted, we’ll know within three to four years whether artificial general intelligence (AGI) is possible with current technology. Judging from previous waves of technology, including earlier AI technologies, enterprises will be slow to adopt at first, Nvidia’s Spirin said. But within two years, he expects companies to be getting “massive” value out of LLMs. “At least that was the case with the previous wave of AI technology,” he said.

Google’s Tsai pointed out that supply-chain limitations, caused by Nvidia’s reliance on high-bandwidth memory for its GPUs, are slowing model improvement, and that this bottleneck has to be solved. But he said he remains encouraged by innovations such as BLIP-2, a research project from Salesforce, that aim to build smaller, more efficient models. These could help LLMs get around supply-chain constraints by reducing their processing requirements, he said.

Author: Matt Marshall
Source: Venturebeat
Reviewed By: Editorial Team
