Meta, the social media giant formerly known as Facebook, has been a pioneer in artificial intelligence (AI) for more than a decade, using it to power its products and services such as News Feed, Facebook Ads, Messenger and virtual reality. But as the demand for more advanced and scalable AI solutions grows, so does the need for more innovative and efficient AI infrastructure.
At the AI Infra @ Scale event today — a one-day virtual conference hosted by Meta’s engineering and infrastructure teams — the company announced a series of new hardware and software projects that aim to support the next generation of AI applications. The event featured speakers from Meta who shared their insights and experiences on building and deploying AI systems at large scale.
Among the announcements was a new AI data center design that will be optimized for both AI training and inference, the two main phases of developing and running AI models. The new data centers will leverage Meta’s own silicon, the Meta training and inference accelerator (MTIA), a chip that will help to accelerate AI workloads across various domains such as computer vision, natural language processing and recommendation systems.
Meta also revealed that it has already built the Research Supercluster (RSC), an AI supercomputer that integrates 16,000 GPUs to help train large language models (LLMs) like the LLaMA project, which Meta announced at the end of February.
“We’ve been building advanced infrastructure for AI for years now, and this work reflects long term efforts that will enable even more advances and better use of this technology across everything we do,” Meta CEO Mark Zuckerberg said in a statement.
Building AI infrastructure is table stakes in 2023
Meta is far from being the only hyperscaler or large IT vendor that is thinking about purpose-built AI infrastructure. In November, Microsoft and Nvidia announced a partnership for an AI supercomputer in the cloud. The system benefits (not surprisingly) from Nvidia GPUs, connected with Nvidia’s Quantum 2 InfiniBand networking technology.
A few months later, in February, IBM outlined details of its AI supercomputer, codenamed Vela. IBM’s system uses x86 silicon alongside Nvidia GPUs and Ethernet-based networking. Each node in the Vela system is packed with eight 80GB A100 GPUs. IBM’s goal is to build out new foundation models that can help serve enterprise AI needs.
Not to be outdone, Google has also jumped into the AI supercomputer race with an announcement on May 10. The Google system uses Nvidia GPUs along with custom-designed infrastructure processing units (IPUs) to enable rapid data flow.
What Meta’s new AI inference accelerator brings to the table
Meta is now also jumping into the custom silicon space with its MTIA chip. Custom-built AI inference chips are not a new thing, however. Google has been building out its tensor processing unit (TPU) for several years, and Amazon has had its own AWS Inferentia chips since 2018.
For Meta, the need for AI inference spans multiple aspects of its operations for its social media sites, including news feeds, ranking, content understanding and recommendations. In a video outlining the MTIA silicon, Meta research scientist for infrastructure Amin Firoozshahian commented that traditional CPUs are not designed to handle the inference demands from the applications that Meta runs. That’s why the company decided to build its own custom silicon.
“MTIA is a chip that is optimized for the workloads we care about and tailored specifically for those needs,” Firoozshahian said.
Meta is also a big user of the open source PyTorch machine learning (ML) framework, which it originally created. Since 2022, PyTorch has been under the governance of the Linux Foundation’s PyTorch Foundation effort. Part of the goal with MTIA is to have highly optimized silicon for running PyTorch workloads at Meta’s large scale.
The MTIA silicon is built on a 7nm (nanometer) process and can provide up to 102.4 TOPS (trillion operations per second). The MTIA is part of a highly integrated approach within Meta to optimize AI operations, including networking, data center optimization and power utilization.
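To put the peak figure in perspective, here is a rough back-of-envelope sketch in Python. It converts a peak rate in TOPS into a theoretical lower bound on the time one inference takes. Only the 102.4 TOPS number comes from Meta's announcement. The function name, the utilization parameter and the per-inference operation count are hypothetical values chosen for illustration, and real latency also depends on memory bandwidth, numeric precision and software overhead, none of which this models.

```python
# Back-of-envelope only: turns a peak TOPS rating into a lower bound on latency.
PEAK_TOPS = 102.4  # peak figure quoted for MTIA in the article


def min_latency_ms(ops_per_inference: float,
                   peak_tops: float = PEAK_TOPS,
                   utilization: float = 1.0) -> float:
    """Return the theoretical minimum time (ms) for one inference.

    ops_per_inference and utilization are assumptions supplied by the caller;
    they are not published figures.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    # 1 TOPS = 1e12 operations per second
    ops_per_second = peak_tops * 1e12 * utilization
    return ops_per_inference / ops_per_second * 1e3


if __name__ == "__main__":
    # Hypothetical model needing 2 billion operations per inference,
    # with the chip assumed to sustain half of its peak rate.
    print(min_latency_ms(2e9, utilization=0.5))
```

The result is a floor, not a prediction: halving the assumed utilization doubles the estimate, which is one reason peak TOPS alone says little about delivered performance.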
The data center of the future is built for AI
Meta has been building its own data centers for over a decade to meet the needs of its billions of users. So far, that approach has served it well, but the explosive growth in AI demands means it’s time to do more.
“Our current generation of data center designs is world class, energy and power efficient,” Rachel Peterson, VP for data center strategy at Meta, said during a roundtable discussion at the AI Infra @ Scale event. “It’s actually really supported us through multiple generations of servers, storage and network and it’s really able to serve our current AI workloads really well.”
As AI use grows across Meta, more compute capacity will be needed. Peterson noted that Meta sees a future where AI chips are expected to consume more than 5x the power of Meta’s typical CPU servers. That expectation has led Meta to rethink data center cooling and to bring liquid cooling to the chips in order to deliver the right level of power efficiency. Providing the right cooling and power for AI is the driving force behind Meta’s new data center designs.
“As we look towards the future, it’s always been about planning for the future of AI hardware and systems and how we can have the most performance systems in our fleet,” Peterson said.
Author: Sean Michael Kerner
Source: Venturebeat