The industry shift towards deploying smaller, more specialized (and therefore more efficient) AI models mirrors a transformation we’ve previously witnessed in the hardware world: the adoption of graphics processing units (GPUs), tensor processing units (TPUs) and other hardware accelerators as a means to more efficient computing.
There’s a simple explanation for both cases, and it comes down to physics.
The CPU tradeoff
CPUs were built as general computing engines designed to execute arbitrary processing tasks — anything from sorting data, to doing calculations, to controlling external devices. They handle a broad range of memory access patterns, compute operations, and control flow.
However, this generality comes at a cost. CPU hardware must support a broad range of tasks, along with the decisions about what the processor should be doing at any given time, and that support demands more silicon for circuitry, more energy to power it and, of course, more time to execute those operations.
This trade-off, while offering versatility, inherently reduces efficiency.
This directly explains why specialized computing has increasingly become the norm in the past 10-15 years.
GPUs, TPUs, NPUs, oh my
Today you can’t have a conversation about AI without seeing mentions of GPUs, TPUs, NPUs and various forms of AI hardware engines.
These specialized engines are, wait for it, less generalized: they do fewer tasks than a CPU, but because they are less general they are much more efficient. They devote more of their transistors and energy to the computation and data access that the task at hand requires, and less to supporting general-purpose work (and the various decisions about what to compute or access at any given time).
Because they are much simpler and more economical, a system can afford to have many more of those compute engines working in parallel, and hence perform more operations per unit of time and per unit of energy.
The parallel shift in large language models
A parallel evolution is unfolding in the realm of large language models (LLMs).
Like CPUs, general models such as GPT-4 are impressive because of their generality and their ability to perform surprisingly complex tasks. But that generality invariably comes at a cost in the number of parameters (rumor has it that the count is on the order of trillions across the ensemble of models) and in the associated compute and memory access needed to evaluate all the operations required for inference.
This has given rise to specialized models like CodeLlama that can perform coding tasks with good accuracy (potentially even better accuracy) but at a much lower cost. As another example, Llama-2-7B can perform typical language manipulation tasks like entity extraction well, and also at a much lower cost. Mistral, Zephyr and others are all capable smaller models.
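To make the cost gap described above concrete, here is a rough back-of-envelope sketch in Python. It is not a measurement of any real deployment. It assumes 16-bit weights (2 bytes per parameter) and the common rule of thumb that a dense model spends about 2 floating-point operations per parameter for each generated token. The 1-trillion-parameter figure is a hypothetical stand-in for the rumored "order of trillions," not a confirmed size for any model, and the function names are invented for illustration.

```python
# Back-of-envelope inference cost from parameter count alone.
# Assumptions (not from the article): 2 bytes per parameter (16-bit weights)
# and roughly 2 floating-point operations per parameter per generated token,
# which is a rule of thumb for dense models only.

def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the weights, in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


def flops_per_token(num_params: float) -> float:
    """Approximate forward-pass FLOPs per generated token for a dense model."""
    return 2.0 * num_params


def cost_ratio(large_params: float, small_params: float) -> float:
    """How many times more compute per token the larger model needs."""
    return flops_per_token(large_params) / flops_per_token(small_params)


SMALL = 7e9   # a 7-billion-parameter model, such as Llama-2-7B
LARGE = 1e12  # hypothetical 1-trillion-parameter general model

if __name__ == "__main__":
    for name, n in (("7B model", SMALL), ("1T model (hypothetical)", LARGE)):
        print(f"{name}: {weight_memory_gb(n):,.0f} GB of weights, "
              f"{flops_per_token(n):.1e} FLOPs per token")
    print(f"Compute per token, large vs. small: {cost_ratio(LARGE, SMALL):.0f}x")
```

Under these assumptions the smaller model needs about 14 GB for its weights, against roughly 2,000 GB for the hypothetical large one, and each token costs over a hundred times less compute. Real systems will differ: a model built as an ensemble may only activate part of its parameters per token, and lower-precision weights reduce the memory figure.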
This trend echoes the shift from sole reliance on CPUs to a hybrid approach incorporating specialized compute engines like GPUs in modern systems. GPUs excel in tasks requiring parallel processing of simpler operations, such as AI, simulations and graphics rendering, which form the bulk of computing requirements in these domains.
Simpler operations demand fewer electrons
In the world of LLMs, the future lies in deploying a multitude of simpler models for the majority of AI tasks, reserving the larger, more resource-intensive models for tasks that genuinely necessitate their capabilities. Luckily, many enterprise applications, such as unstructured data manipulation, text classification and summarization, can be done with smaller, more specialized models.
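One way to picture this division of labor is a simple router that sends each request to a small specialized model when one is available and falls back to a large general model otherwise. The Python sketch below is purely illustrative. The task labels, the `route` function and the pairing of tasks with particular models are all assumptions made for the example, not a description of any real product or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """The model chosen for a request and the reason it was chosen."""
    model: str
    reason: str


# Hypothetical mapping from task type to a smaller specialized model.
# The model names come from the article; the pairings are assumptions.
SPECIALISTS = {
    "code": "CodeLlama",
    "entity_extraction": "Llama-2-7B",
    "text_classification": "Llama-2-7B",
    "summarization": "Llama-2-7B",
}

# Large general model, reserved for tasks no specialist covers.
GENERALIST = "GPT-4"


def route(task_type: str) -> Route:
    """Prefer a specialized model; fall back to the general one."""
    model = SPECIALISTS.get(task_type)
    if model is not None:
        return Route(model, "a specialized model covers this task")
    return Route(GENERALIST, "no specialist registered; using the general model")


if __name__ == "__main__":
    for task in ("code", "summarization", "open_ended_reasoning"):
        choice = route(task)
        print(f"{task} -> {choice.model} ({choice.reason})")
```

The point of the design is that the expensive model becomes the exception rather than the default: only requests without a registered specialist reach it.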
The underlying principle is straightforward: Simpler operations demand fewer electrons, translating to greater energy efficiency. This isn’t just a technological choice; it’s an imperative dictated by the fundamental principles of physics. The future of AI, therefore, hinges not on building ever-larger general models, but on embracing the power of specialization for sustainable, scalable and efficient AI solutions.
Luis Ceze is CEO of OctoML.
Author: Luis Ceze, OctoML
Source: Venturebeat
Reviewed By: Editorial Team