
Tesla unveils Dojo supercomputer: world’s new most powerful AI training machine

August 20, 2021

At its AI Day, Tesla unveiled its Dojo supercomputer technology while flexing its growing in-house chip design talent.

The automaker claims to have developed the fastest AI training machine in the world.

For years now, Tesla has been teasing the development of a new in-house supercomputer optimized for neural net video training.

Tesla is handling an insane amount of video data from its fleet of over 1 million vehicles, which it uses to train its neural nets.

The automaker found itself unsatisfied with current hardware options to train its computer vision neural nets and believed it could do better internally.

Over the last two years, CEO Elon Musk has been teasing the development of Tesla’s own supercomputer called “Dojo.”

Last year, he even teased that Tesla’s Dojo would have a capacity of over an exaflop, which is one quintillion (10¹⁸) floating-point operations per second, or 1,000 petaFLOPS.

That could potentially make Dojo the new most powerful supercomputer in the world.

Today, at Tesla’s AI Day, the company unveiled Dojo.

Ganesh Venkataramanan, Tesla’s senior director of Autopilot hardware and the leader of the Dojo project, led the presentation.

The engineer started by unveiling Dojo’s D1 chip, which uses 7-nanometer technology and, according to Tesla, delivers breakthrough bandwidth and compute performance.

This is the second chip designed internally by the Tesla team, after the FSD chip found in the Hardware 3 FSD computer in Tesla cars.

Venkataramanan had an actual D1 chip on stage.

The engineer commented on the new D1 chip:

“This was entirely designed by Tesla team internally. All the way from the architecture to the package. This chip is like GPU-level compute with a CPU level flexibility and twice the network chip level IO bandwidth.”

Tesla claims to have achieved a significant breakthrough in chip bandwidth.

Tesla designed the chip to “seamlessly connect without any glue to each other,” and the automaker took advantage of that by connecting 500,000 nodes together.

Tesla then adds the interface, power, and thermal management, which results in what it calls a training tile.

The result is a 9 PFlops training tile with 36 TB per second of bandwidth in a format of less than 1 cubic foot.

Venkataramanan also had an actual Dojo training tile on stage.

The engineer commented on the piece of computing technology:

“It’s unprecedented. This is an amazing piece of engineering.”

However, that’s where the unveiling of actual Dojo hardware stopped for Tesla.

The automaker revealed that it only recently ran a neural network on one of the tiles, and Venkataramanan even appeared to surprise Andrej Karpathy, Tesla’s head of AI, on stage by revealing for the first time that a Dojo training tile had run one of his neural networks.

But now it still has to form a compute cluster using those training tiles in order to truly build the first Dojo supercomputer.

Tesla says that it can combine 2 x 3 tiles in a tray and two trays in a computer cabinet for over 100 PFlops per cabinet.

Thanks to the tiles’ high bandwidth, Tesla claims that it can link those cabinets together to create the ExaPod.

In a 10-cabinet system, Tesla’s Dojo ExaPod would break the exaflop barrier, something that supercomputer makers have been trying to achieve for a long time.
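The cabinet and ExaPod figures follow from the per-tile number quoted earlier. As a rough check, here is a minimal Python sketch of that arithmetic. It assumes every tile delivers the quoted 9 PFlops peak and that throughput adds up linearly across tiles, trays, and cabinets, which is an idealization rather than something Tesla demonstrated. The constant and function names are illustrative only.

```python
# Back-of-the-envelope check of the Dojo figures quoted in the article.
# Assumption: peak throughput scales linearly from tile to tray to cabinet
# to ExaPod. Real-world scaling would likely be lower.

TILE_PFLOPS = 9             # peak per training tile, as quoted by Tesla
TILES_PER_TRAY = 2 * 3      # "2 x 3 tiles in a tray"
TRAYS_PER_CABINET = 2       # "two trays in a computer cabinet"
CABINETS_PER_EXAPOD = 10    # "10-cabinet system"
PFLOPS_PER_EXAFLOP = 1_000  # 1 exaflop = 1,000 petaFLOPS


def cabinet_pflops() -> int:
    """Peak PFlops of one cabinet under the linear-scaling assumption."""
    return TILE_PFLOPS * TILES_PER_TRAY * TRAYS_PER_CABINET


def exapod_pflops() -> int:
    """Peak PFlops of a 10-cabinet ExaPod under the same assumption."""
    return cabinet_pflops() * CABINETS_PER_EXAPOD


if __name__ == "__main__":
    print(f"Cabinet: {cabinet_pflops()} PFlops")
    exaflops = exapod_pflops() / PFLOPS_PER_EXAFLOP
    print(f"ExaPod:  {exapod_pflops()} PFlops ({exaflops:.2f} exaflops)")
```

Under these assumptions, a cabinet works out to 108 PFlops, matching the "over 100 PFlops" claim, and ten cabinets come to 1,080 PFlops, just over the one-exaflop mark.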

Tesla hasn’t put that system together yet, but CEO Elon Musk claimed that it will be operational next year.

It would become the fastest AI training computer in the world while being power efficient and in a relatively small format for a supercomputer.

Tesla plans to use the new supercomputer to train its own neural networks to develop self-driving technology, but it also plans to make it available to other AI developers in the future.

Since this was Tesla’s first shot at developing a supercomputer in-house, the company also believes that there is a lot of room for improvement, and it is teasing 10x improvements in some levels of performance in the next version of Dojo.


Author: Fred Lambert
Source: Electrek
