Broadcom's Thor 2 looks to hammer top spot in AI NICs
Monday, May 20, 2024 at 4:34PM
Roy Rubenstein in 400 Gigabit Ethernet, Broadcom, Hasan Siraj, Jas Tremblay, Jericho3-AI, OCP, RoCE, Thor 2, Ultra Ethernet Consortium, artificial intelligence, linear pluggable optics, network interface card

Broadcom has announced the availability of network interface cards (NICs) for large-scale artificial intelligence (AI) computers. 


The NICs use Broadcom's Thor 2 chip, which started sampling in 2023 and is now in volume production.

Jas Tremblay, vice president and general manager of the data center solutions group at Broadcom, says the Thor 2 is the industry's first 400 gigabit Ethernet (GbE) NIC device to be implemented in a 5nm CMOS process.  

"It [the design] gives customers choices and freedom when they're building their AI systems such that they can use different NICs with different [Ethernet] switches," says Tremblay.

 

NICs for AI 

The 400GbE Thor 2 supports 16 lanes of PCI Express 5.0, each lane operating at 32 gigabits per second (Gbps).
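As a back-of-envelope check on those figures, the sketch below compares the host-side PCI Express bandwidth with the 400GbE line rate. The lane count and per-lane rate come from the article; the 128b/130b line-coding efficiency is an added assumption, and packet-level protocol overheads are ignored, so the result is an upper bound per direction rather than a measured throughput.

```python
# Rough comparison of PCIe 5.0 x16 bandwidth with a 400GbE port.
LANES = 16                        # from the article
RAW_GBPS_PER_LANE = 32            # from the article
ETHERNET_GBPS = 400               # from the article
ENCODING_EFFICIENCY = 128 / 130   # assumption: 128b/130b line coding


def pcie_bandwidth_gbps(lanes, raw_per_lane, efficiency=1.0):
    """Return the aggregate one-direction bandwidth in Gbps."""
    return lanes * raw_per_lane * efficiency


raw = pcie_bandwidth_gbps(LANES, RAW_GBPS_PER_LANE)
after_encoding = pcie_bandwidth_gbps(LANES, RAW_GBPS_PER_LANE, ENCODING_EFFICIENCY)

print(f"Raw PCIe bandwidth:      {raw:.0f} Gbps")
print(f"After line coding:       {after_encoding:.1f} Gbps")
print(f"Headroom over 400GbE:    {after_encoding - ETHERNET_GBPS:.1f} Gbps")
```

Under these assumptions, the host interface has roughly a quarter more bandwidth than the Ethernet port before protocol overheads are counted.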

The chip also features eight 112-gigabit serialisers/deserialisers (serdes). Four such serdes would suffice for 400GbE, but eight are provided because some customers run the serdes at the lower 56Gbps speed to match their switches' serdes.
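The serdes count can be checked with simple arithmetic. The sketch below assumes the nominal Ethernet payload per lane is 100Gbps for a 112-gigabit serdes and 50Gbps for a 56-gigabit one (the signalling rate is higher than the payload rate); these nominal values and the helper name `lanes_needed` are illustrative assumptions, not figures from Broadcom.

```python
import math

PORT_GBPS = 400  # from the article

# Assumption: nominal payload carried per lane for each serdes class.
NOMINAL_PAYLOAD_GBPS = {"112G": 100, "56G": 50}


def lanes_needed(port_gbps, serdes_class):
    """Return how many serdes lanes are needed to carry the port rate."""
    return math.ceil(port_gbps / NOMINAL_PAYLOAD_GBPS[serdes_class])


for serdes_class in NOMINAL_PAYLOAD_GBPS:
    print(f"{serdes_class} serdes: {lanes_needed(PORT_GBPS, serdes_class)} lanes for {PORT_GBPS}GbE")
```

On these assumptions, a 400GbE port needs four lanes at the higher speed and all eight at the lower speed, which is consistent with the chip carrying eight serdes.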

Broadcom is bringing to market a variety of NICs using the Thor 2. Tremblay explains that one board is for standard servers while another is designed for an Open Compute Project (OCP) server. In turn, certain customers have custom designs. 

Broadcom has also qualified 100 optical and copper-based connectors used with the NIC boards. "People want to use different cables to connect these cards, and we have to qualify them all," says Tremblay. For the first time, the optical options include linear pluggable optics (LPO).

The requirement for so many connectors is a reflection of several factors: AI's needs, the use of 100-gigabit serdes, and 400GbE. "What's happening is that customers are having to optimise the physical cabling to reduce power and thermal cooling requirements," says Tremblay.

When connecting the Broadcom NIC to a Broadcom switch, a reach of 5m is possible using direct attach copper (DAC) cabling. In contrast, if the Broadcom NIC is connected to another vendor's switch, the link distance may only be half that.

"In the past, people would say: 'I'm not going to have different cable lengths for various types of NICs and switch connections'," says Tremblay. "Now, in the AI world, they have to do that given there's so much focus on power and cooling."  

How the NIC connects to the accelerator chip (in the diagram, a graphics processing unit, or GPU) and the layers of switches that enable the NIC to talk to other NICs. Source: Broadcom.

NIC categories

Many terms exist to describe NICs. Broadcom, which has been making NICs for over two decades, puts NICs into two categories. One, and Broadcom's focus, is Ethernet NICs. The NICs use a hardware-accelerated data path and are optimised for networking, connectivity, security, and RoCE.

RoCE refers to RDMA over Converged Ethernet, while RDMA is short for remote direct memory access. RDMA allows one machine to read from or write to another's memory without involving the remote machine's processor. This frees the processor to concentrate on computation. RoCE uses Ethernet as a low-latency medium for such transfers.

The second NIC category refers to a data processing unit (DPU). Here, the chip has CPU cores to execute the offload tasks, implementing functions that would otherwise burden the main processor. 

Tremblay says the key features that make an Ethernet NIC ideal for AI include using at least a 25Gbps serdes, RoCE, and advanced traffic congestion control.  

 

Switch scheduling or end-point scheduling

Customers no longer buy components but complete AI compute clusters, says Tremblay. They want the cluster to be an open design so that when choosing the particular system elements, they have confidence it will work.

Broadcom cites two approaches - switch scheduling and end-point scheduling - to building AI systems. 

Switch scheduling refers to systems where the switch performs the traffic load balancing to ensure that the networking fabric is used to the full. The switch also oversees congestion control. 


"The switch does perfect load balancing with every packet spread across all the outbound lines and reassembled at the other end," says Hasan Siraj, head of software products and ecosystem at Broadcom. Jericho3-AI, which Broadcom announced last year, is an example of a switch scheduler for AI workloads. 
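The idea Siraj describes, spreading every packet across all outbound links and reassembling at the far end, can be illustrated with a toy simulation. This is a minimal sketch only: the round-robin spraying, the random arrival order, and the function names `spray`, `deliver_out_of_order` and `reassemble` are assumptions made for illustration, not a description of how the Jericho3-AI implements it.

```python
import random


def spray(packets, num_links):
    """Distribute packets round-robin across links, tagging each with a sequence number."""
    links = [[] for _ in range(num_links)]
    for seq, payload in enumerate(packets):
        links[seq % num_links].append((seq, payload))
    return links


def deliver_out_of_order(links, seed=0):
    """Model unequal link delays by shuffling the order in which packets arrive."""
    rng = random.Random(seed)
    arrivals = [packet for link in links for packet in link]
    rng.shuffle(arrivals)
    return arrivals


def reassemble(arrivals):
    """Restore the original order using the sequence numbers."""
    return [payload for _, payload in sorted(arrivals, key=lambda packet: packet[0])]


packets = [f"pkt{i}" for i in range(10)]
links = spray(packets, num_links=4)
print("Packets per link:", [len(link) for link in links])
print("Order restored:", reassemble(deliver_out_of_order(links)) == packets)
```

Because each packet, rather than each flow, is placed on a link, the load per link differs by at most one packet in this sketch; the cost is that the receiving side must reorder.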

The second approach - end-point scheduling - is for customers that prefer the NIC to do the scheduling. Leading cloud-computing players typically have their own congestion control algorithms and favour such flexibility, says Siraj: “But you still need a high-performance fabric that can assist with the load balancing.”

Here, a cloud player will use its own NIC designs or other non-Broadcom NICs for congestion control, paired with a Broadcom switch such as the Tomahawk 5 (see diagram below).

The left diagram shows an end-point scheduler set-up, while the right diagram is an example of a switch scheduler. Source: Broadcom.

Accordingly, the main configuration options are a Broadcom NIC with a non-Broadcom switch, a third-party NIC with the Jericho3-AI, or a full Broadcom NIC-switch solution where the Jericho3-AI does the load balancing and congestion control, while the Thor 2-based NIC takes care of RoCE in a power-efficient way.

“Our strategy is to be the most open solution,” says Tremblay. “Everything we are doing is standards-based.”

And that includes the work of the Ultra Ethernet Consortium, which is focussed on transport and congestion control to tailor Ethernet for AI. The consortium is close to issuing the first revisions of its work.

The Ultra Ethernet Consortium aspires to achieve AI cluster sizes of 1 million accelerator chips. Such a huge computing cluster will not fit within one data centre due to size, power, and thermal constraints, says Siraj. Instead, the cluster will be distributed across several data centres tens of kilometres apart. The challenge here will be how to achieve such connectivity while maintaining job completion time and latency.

 

Thor 3

Meanwhile, Broadcom has started work on an 800-gigabit NIC chip, the Thor 3, and a 1.6-terabit version after that.  

The Jericho3-AI switch chip supports up to 32,000 endpoints, each at 800Gbps. Thus, the AI switch chip is ready for the advent of Thor 3-based NIC boards.
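For a sense of scale, the two figures quoted for the Jericho3-AI can be multiplied out. The sketch below simply assumes every endpoint runs at the full 800Gbps at once, which is an idealisation; the helper name `aggregate_pbps` is illustrative.

```python
ENDPOINTS = 32_000          # from the article
GBPS_PER_ENDPOINT = 800     # from the article
GBPS_PER_PBPS = 1_000_000   # 1 petabit per second = 1,000,000 Gbps


def aggregate_pbps(endpoints, gbps_per_endpoint):
    """Return the total endpoint bandwidth in petabits per second."""
    return endpoints * gbps_per_endpoint / GBPS_PER_PBPS


total = aggregate_pbps(ENDPOINTS, GBPS_PER_ENDPOINT)
print(f"Aggregate endpoint bandwidth: {total:.1f} Pbps")
```

That works out to 25.6 petabits per second of endpoint bandwidth at the maximum configuration, under the stated assumption.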

Article originally appeared on Gazettabyte (https://www.gazettabyte.com/).
See website for complete article licensing information.