Tomahawk 6: The industry’s first 100-terabit switch chip

Part 2: Data Centre Switching
Peter Del Vecchio, product manager for the Tomahawk switch family at Broadcom, outlines the role of the company’s latest Tomahawk 6 Ethernet switch chip in AI data centres.
Broadcom is now shipping samples of its Tomahawk 6, the industry’s first 102.4-terabit-per-second (Tbps) Ethernet switch chip. The chip reflects how AI has reshaped Ethernet switch chip design since Broadcom launched its current flagship, the 51.2-terabit Tomahawk 5. The Tomahawk 6 is evolutionary rather than a complete redesign, notes Del Vecchio. The design doubles bandwidth and adds networking features to support AI scale-up and scale-out networks.
“Nvidia is the only other company that has announced a 102.4-terabit switch, and it’s scheduled for production in 2026,” says Bob Wheeler, analyst-at-large at market research firm LightCounting, adding that Nvidia sells switches, not chips.

Multi-die architecture
The Tomahawk 6 marks a shift from the monolithic chip design of the Tomahawk 5 to a multi-die architecture.
The 102.4-terabit Tomahawk 6 comes in two versions. One has 512 input-output lanes – serialisers/deserialisers (serdes) – each operating at 200 gigabits using 4-level pulse amplitude modulation (PAM-4) signalling. The other version has 1,024 serdes, each using 100-gigabit PAM-4.
“The core die is identical between the two, the only difference are the chiplets that are either for 100 gig or 200 gig PAM-4,” says Del Vecchio. The core die hosts the packet processing and traffic management logic.
The chip uses a 3nm CMOS process node, which improves power efficiency compared to the 5nm CMOS Tomahawk 5.
Broadcom does not quote exact power figures for the chip. “The Tomahawk 6 is significantly less than one watt per 100 gigabits-per-second, well below 1,000 watts,” says Del Vecchio. For comparison, the Tomahawk 5 consumes less than 512 watts.
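A back-of-the-envelope check of what the stated one-watt-per-100-gigabits bound implies at full capacity. The figures come from the quotes above; the arithmetic itself is ours:

```python
# Power envelope implied by "less than 1 W per 100 Gbps", applied to the
# chip's full 102.4 Tbps capacity. Illustrative arithmetic only.
capacity_gbps = 102_400                  # Tomahawk 6: 102.4 Tbps
bound_w = capacity_gbps / 100 * 1.0      # < 1 W per 100 Gbps
print(bound_w)                           # 1024.0 W, hence "well below 1,000 W"

# The same bound applied to the 51.2 Tbps Tomahawk 5 gives its 512 W figure.
print(51_200 / 100 * 1.0)                # 512.0 W
```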
AI networking: Endpoint-scheduled fabrics
The Tomahawk 6 chip is designed for AI clusters requiring near-100 per cent network utilisation.
“With previous data centre networks, it was unusual that the networks would be loaded to more than 60 to 70 per cent utilisation,” says Del Vecchio. “For AI, that’s unacceptable.”
The chip supports endpoint-scheduled fabrics, where traffic scheduling and load balancing occur at the endpoints to ensure the traffic is efficiently distributed across the network. An endpoint could be a network interface card (NIC) or an AI accelerator interface.
This contrasts with Broadcom’s other switch chip family – the Jericho3-AI and the Ramon – which is designed for switch-scheduled fabrics. Here, the switch chips handle the networking and packet spraying, working alongside simpler endpoint hardware.
The type of switch chip used – endpoint-scheduled or switch-scheduled – depends on the preferences of service providers and hyperscalers. Broadcom says there is demand for both networking approaches.
The Tomahawk 6 uses Broadcom’s latest cognitive routing suite and enhanced telemetry to address the evolving AI traffic patterns.
The market shifted dramatically in 2022, says Del Vecchio, with demand moving from general data centre networking to one focused on AI’s needs. The trigger was the generative AI surge caused by the emergence of ChatGPT in November 2022, after the Tomahawk 5 was already shipping.
“There was some thought of AI training and for inference [with the Tomahawk 5], but the primary use case at that point was thought to be general data centre networks,” says Del Vecchio.
Wide and flat topologies
Tomahawk 6 supports two-tier networks connecting up to 128,000 AI accelerator chips, such as graphics processing units (GPUs). This assumes 200 gigabits per endpoint, which may be insufficient for the I/O requirements of the latest AI accelerator chips.
To achieve higher bandwidth per endpoint – 800 gigabits or 1.6 terabits – multiple network planes are used in parallel, each adding 200 gigabits. This way, Broadcom’s design avoids adding a third tier of network switching.

“Rather than having three tiers, you have multiple networking planes, say, eight of those in parallel,” says Del Vecchio. Such a wide-and-flat topology minimises latency and simplifies congestion control, which is critical for AI workloads. “Having a two-tier network versus a three-tier network makes congestion control much easier,” he says.
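A sketch of the arithmetic behind these figures, assuming a standard two-tier leaf-and-spine build using the 512-port, 200-gigabit version of the chip. The topology parameters are ours, chosen to match the numbers quoted above:

```python
# Two-tier leaf/spine sizing with 512-port switches (200G per port).
# Each leaf splits its radix: half down to endpoints, half up to spines,
# one link per spine. Illustrative maths, not Broadcom's blueprint.
ports = 512                    # 200-gigabit ports per Tomahawk 6
leaf_down = ports // 2         # 256 endpoint-facing ports per leaf
spines = ports - leaf_down     # 256 spines, one uplink to each

max_leaves = ports             # each 512-port spine reaches 512 leaves
endpoints = max_leaves * leaf_down
print(endpoints)               # 131,072, i.e. the "up to 128K" accelerators

# Parallel planes raise per-accelerator bandwidth without a third tier.
for planes in (1, 4, 8):
    print(planes, "plane(s):", planes * 200, "Gbps per accelerator")
```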
Tomahawk 6’s enhanced adaptive routing and load balancing features cater to AI’s high-utilisation demands. The aim is to keep the port speed low to maximise the radix, says Del Vecchio, contrasting AI networks with general data centres, where higher 800-gigabit port speeds are typical.
Scale-Up Ethernet
The above discussion refers to the scale-out networking approach. For scale-up networking – the first hop between AI accelerator chips – the devices are densely interconnected using multiple lanes, such as four or eight 200-gigabit lanes, to achieve higher bandwidth within a rack.
Broadcom has taken a different approach to scale-up networking than other companies. It has chosen Ethernet rather than developing a proprietary interface like Nvidia’s NVLink or the industry-backed UALink.
Broadcom has released its Scale-Up Ethernet (SUE) framework, which positions Ethernet as a unified solution for scale-up networks and which it has contributed to the Open Compute Project (OCP).

SUE supports large-scale GPU clusters. “You can do 512 XPUs in a scale-up cluster, connected in a single hop,” says Del Vecchio. SUE’s features include link-level retry, credit-based flow control, and optimised headers for low-latency, reliable transport.
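Of those features, link-level retry is the easiest to picture. Below is a minimal sketch of the mechanism with invented class and method names; SUE’s actual frame formats and signalling are not disclosed in this article:

```python
# Toy link-level retry (LLR): the sender keeps transmitted frames in a
# replay buffer until acknowledged, and replays from the first
# unacknowledged sequence number when the receiver reports corruption.
class LlrSender:
    def __init__(self, link):
        self.link = link
        self.next_seq = 0
        self.replay = {}                 # seq -> frame awaiting ACK

    def send(self, payload):
        frame = (self.next_seq, payload)
        self.replay[self.next_seq] = frame
        self.link.transmit(frame)
        self.next_seq += 1

    def on_ack(self, seq):
        # Cumulative ACK: everything up to seq arrived intact.
        for s in [s for s in self.replay if s <= seq]:
            del self.replay[s]

    def on_nack(self, seq):
        # Replay, in order, from the first corrupted frame onwards.
        for s in sorted(s for s in self.replay if s >= seq):
            self.link.transmit(self.replay[s])

class Wire:
    def transmit(self, frame): pass      # stand-in for the physical link

tx = LlrSender(Wire())
tx.send(b"payload-0"); tx.send(b"payload-1")
tx.on_nack(0)                            # CRC failure: replay both frames
tx.on_ack(1)                             # delivered; replay buffer empties
```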
“There is no one-size-fits-all for scale-up,” says Wheeler. “For example, Google’s ICI [inter-chip interconnect] is a remote direct memory access (RDMA) based interconnect, more like Ethernet than UALink or NVLink. There will likely be multiple camps.”
Broadcom chose Ethernet for several reasons. “One is you can leverage the whole Ethernet ecosystem,” says Del Vecchio, who stresses it results in a unified toolset for front-end, back-end, and scale-up networks.
SUE also aligns with hyperscaler preferences for interchangeable interfaces. “They’d like to have one unified technology for all that,” says Del Vecchio.
Del Vecchio is also an Ultra Ethernet Consortium (UEC) steering committee member. The UEC focuses on scale-out for its 1.0 specification, which is set for public release soon.
Link-level retry (LLR) and credit-based flow control (CBFC) are already being standardised within the UEC, says Del Vecchio, who suggests there will also be scale-up extensions that will benefit Broadcom’s SUE approach.
Interconnects
Tomahawk 6 supports diverse physical interconnects, including 100-gigabit and 200-gigabit PAM-4 serdes and passive copper links of up to 2 metres, enabling custom GPU cluster designs.

“There’s a lot of focus on these custom GPU racks,” says Del Vecchio, highlighting the shift from generic pizza-box switches to highly engineered topologies.
The goal is to increase the power delivered to each rack so that more AI accelerator chips can be crammed in, increasing the degree of scale-up possible over copper interconnect. Copper links could also be used to connect two racks, further doubling scale-up capacity.
Co-packaged optics: Enhancing reliability?
Co-packaged optics (CPO) has also become a design feature of switch chips. The Tomahawk 6 is the third generation of Broadcom switch chip also to be offered with co-packaged optics.
“People are seeing how much power is going into the optics for these GPU racks,” says Del Vecchio. Co-packaged optics eliminates retimers and DSPs, reducing latency and burst errors.
Broadcom and hyperscalers are currently investigating another key potential benefit of co-packaged optics. “There are indications that you wind up with significantly fewer link flaps,” he says. A link flap refers to an instance of link instability.
Unlike pluggable optics, whose DSPs introduce burst errors, co-packaged optics exhibits random Gaussian noise, which is better suited to forward error correction schemes. “If you have an end-to-end CPO link, you have much more random errors,” he explains.
This suggests that using co-packaged optics could benefit the overall runtime of massive AI clusters, a notable development that, if proven, will favour the technology’s use. “We expect the Tomahawk 6 Davisson co-packaged optics version to follow Tomahawk 6 production closely,” says LightCounting’s Wheeler.
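A toy Monte Carlo illustrating the burst-versus-random distinction above. It assumes a KP4-style RS(544,514) code that corrects up to 15 symbol errors per codeword, and compares two error distributions with the same average error count; the burst model is a crude stand-in, not measured DSP behaviour:

```python
# Why random errors suit FEC better than bursts: same mean error count,
# very different tail behaviour per codeword.
import random

SYMBOLS, T = 544, 15           # codeword length, correctable symbol errors
TRIALS, AVG_ERRORS = 20_000, 6

def codeword_fails(burst: bool) -> bool:
    if burst:
        # DSP-style burst: error count fluctuates widely around the mean.
        n = max(0, int(random.gauss(AVG_ERRORS, AVG_ERRORS)))
    else:
        # Gaussian-noise-like errors: independent per symbol, low variance.
        n = sum(random.random() < AVG_ERRORS / SYMBOLS for _ in range(SYMBOLS))
    return n > T

for burst, label in ((True, "bursty"), (False, "random")):
    fails = sum(codeword_fails(burst) for _ in range(TRIALS))
    print(f"{label}: {fails / TRIALS:.4%} uncorrectable codewords")
# Bursty errors blow the correction budget orders of magnitude more often.
```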
Design challenges
Tomahawk 6’s development required overcoming significant hurdles.
Packaging over 1,000 serdes was one. “There were no packages on the market anywhere near that size,” says Del Vecchio, emphasising innovations in controlling warpage, insertion loss, and signal integrity. Del Vecchio also highlights the complexity of fanning out 1,000 lanes. The multi-die design required low-latency, low-power chip-to-chip interfaces, with Broadcom using its experience developing custom ASICs.
Traffic management structures, like the Memory Management Unit (MMU), have also seen exponential complexity increases. “Some structures are 4x the complexity,” says Del Vecchio.
“We spent a lot of time on the packaging technology,” he adds.
Meanwhile, architectural optimisations, such as automatic clock gating and efficient serdes design, improved power efficiency. What about the delay in announcing the latest Tomahawk switch chip compared with the clockwork two-year gaps between previous Tomahawk launches?
Del Vecchio says the delay wasn’t due to a technical issue or getting access to a 3nm CMOS process. Instead, choosing the right market timing drove the release schedule.
Broadcom believes it has a six-month to one-year lead on competing switch chip makers.
Production and market timing
Tomahawk 6 samples are now shipping to hyperscalers and original equipment manufacturers (OEMs). Production is expected within seven months, matching the timeline achieved with the Tomahawk 5. “We feel confident there is no issue with physical IP,” says Del Vecchio, based on the work done with Broadcom’s test chips and verification suites.
The simultaneous availability of 100-gigabit and 200-gigabit SerDes versions of the latest switch chip reflects AI’s bandwidth demands.
“There is such a huge insatiable demand for bandwidth, we could not afford the time delay between the 100-gig and 200-gig versions,” says Del Vecchio.
Broadcom's first Jericho3 takes on AI's networking challenge

Broadcom’s Jericho silicon has taken an exciting turn.
The Jericho devices are used for edge and core routers.
But the first chip of Broadcom’s next-generation Jericho is aimed at artificial intelligence (AI); another indicator, if one is needed, of AI’s predominance.
Dubbed the Jericho3-AI, the device networks AI accelerator chips that run massive machine-learning workloads.
AI supercomputers
AI workloads continue to grow at a remarkable rate.
The most common accelerator chip used to tackle such demanding computations is the graphics processor unit (GPU).
GPUs are expensive, so scaling them efficiently is critical, especially when AI workloads can take days to complete.
“For AI, the network is the bottleneck,” says Oozie Parizer, senior director of product management, core switching group at Broadcom.
Squeezing more out of the network equates to shorter workload completion times.
“This is everything for the hyperscalers,” says Parizer. “How quickly can they finish the job.”
Broadcom shares a chart from Meta showing how much of the run time of its four AI recommender workloads is spent on networking, moving the data between the GPUs.

In the worst case, networking accounts for nearly three-fifths (57 per cent) of the run time, during which the GPUs sit idle, waiting for data.
Scaling
Parizer highlights two trends driving networking for AI supercomputers.
One is the GPU’s growing input-output (I/O), causing a doubling of the interface speed of network interface cards (NICs). The NIC links the GPU to the top-of-rack switch.
The NIC interface speeds have progressed from 100 to 200 to now 400 gigabits and soon 800 gigabits, with 1.6 terabits to follow.
The second trend is the number of GPUs used in an AI cluster.
The largest clusters have used 64 or 256 GPUs, limiting the networking needs. But now machine-learning tasks require clusters of 1,000 to 2,000 GPUs, rising to 16,000 and even 32,000.
Meta’s Research SuperCluster (RSC), one of the largest AI supercomputers, uses 16,000 Nvidia A100 GPUs: 2,000 Nvidia DGX A100 systems each with eight A100 GPUs. The RSC also uses 200-gigabit NICs.
“The number of GPUs participating in an all-to-all exchange [of data] is growing super fast,” says Parizer.
The Jericho3-AI is used in the top-of-rack switch that connects a rack’s GPUs to other racks in the cluster.
The Jericho3-AI enables clusters of up to 32,000 GPUs, each served with an 800-gigabit link.
An AI supercomputer can use all its GPUs to tackle one large training job or split them into pools running AI workloads concurrently.
Either way, the cluster’s network must be ‘flat’, with all the GPU-to-GPU communications having the same latency.
Because the GPUs exchange machine-learning training data in an all-to-all manner, only when the last GPU receives its data can the computation move onto the next stage.
“The primary benefit of Jericho3-AI versus traditional Ethernet is predictable tail latency,” says Bob Wheeler, principal analyst at Wheeler’s Network. “This metric is very important for AI training, as it determines job-completion time.”
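A toy simulation of why tail latency, not average latency, sets job-completion time: an all-to-all step finishes only when the slowest transfer lands, so rare slow transfers dominate as the GPU count grows. All numbers below are illustrative:

```python
# Job-completion time is set by the slowest GPU-to-GPU transfer per step.
import random

def step_time(n_gpus, median_us=100.0, tail_us=1000.0, p_tail=0.001):
    # Each transfer usually takes ~median, but a small fraction hit
    # congestion and take ~tail. The step waits for the maximum.
    return max(
        tail_us if random.random() < p_tail else random.gauss(median_us, 10)
        for _ in range(n_gpus)
    )

for n in (64, 1024, 16_000, 32_000):
    avg = sum(step_time(n) for _ in range(50)) / 50
    print(f"{n:>6} GPUs: mean step time {avg:7.1f} us")
# With 0.1% slow transfers, a 64-GPU step rarely hits the tail;
# a 32,000-GPU step almost always does.
```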
Data spraying
“We realised in the last year that the premium traffic capabilities of the Jericho solution are a perfect fit for AI,” says Parizer.
The Jericho3-AI helps maximise GPU processing performance by using the full network capacity while traffic routing mechanisms help nip congestion in the bud.
The Jericho3-AI also adapts the network when a link fails. Such adaptation must avoid heavy packet loss, otherwise the workload must be restarted, potentially losing days of work.
AI workloads use large packet streams known as ‘elephant’ flows. Such flows tie up their assigned networking path, causing congestion when another flow also needs that path.
“If traffic follows the concept of assigned paths, there is no way you get close to 100 per cent network efficiency,” says Parizer.
The Jericho3-AI, used in a top-of-rack switch, has a different approach.
Of the device’s 28.8 terabits of capacity, half connects to the rack’s GPUs’ NICs and half to the ‘fabric’ that links the rack’s GPUs to all the other GPUs in the cluster.
Broadcom uses the 14.4-terabit fabric link as one huge logical pipe over which traffic is evenly spread. Each destination Jericho3-AI top-of-rack switch then reassembles the ‘sprayed’ traffic.
“From the GPU’s perspective, it is unaware that we are spraying the data,” says Parizer.
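A minimal sketch of the spraying idea: traffic is spread evenly across all fabric links regardless of which flow it belongs to, and the destination switch restores order using sequence numbers. The round-robin policy and tagging scheme here are simplifications, not the silicon’s actual mechanism:

```python
# Per-packet spraying spreads load evenly; sequence numbers let the
# destination reassemble each flow in order.
from itertools import cycle

FABRIC_LINKS = 8

def spray(packets):
    links = cycle(range(FABRIC_LINKS))
    return [(next(links), seq, pkt) for seq, pkt in enumerate(packets)]

def reassemble(tagged):
    # Destination reorders by sequence number, whatever link each cell took.
    return [pkt for _, _, pkt in sorted(tagged, key=lambda t: t[1])]

flow = [f"cell{i}" for i in range(16)]
tagged = spray(flow)
loads = [sum(1 for link, _, _ in tagged if link == l)
         for l in range(FABRIC_LINKS)]
print(loads)                        # [2, 2, 2, 2, 2, 2, 2, 2]: even spread
assert reassemble(tagged) == flow   # order restored at the destination
```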
Receiver-based flow control
Spraying may ensure full use of the network’s capacity, but congestion can still occur. The sprayed traffic may be spread across the fabric to all the spine switches, but for short periods, several GPUs may send data to the same GPU, a condition known as incast.

The Jericho copes with this many-to-one GPU traffic using receiver-based flow control.
Traffic is not sent towards the receiving Jericho chip just because it is ready, says Parizer. Instead, the receiving Jericho tells the GPUs with traffic when to send, scheduling part of the traffic from each.
“Traffic ends up queueing nearer the sender GPUs, notifying each of them to send a little bit now, and now,” says Parizer, who stresses this many-to-one condition is temporary.
Ethernet flow control is used when the Jericho chip senses that too much traffic is being sent.
“There is a temporary stop in data transmission to avoid packet loss in network congestion,” says Parizer. “And it is only that GPU that needs to slow down; it doesn’t impact any adjacent GPUs.”
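A sketch of receiver-based scheduling under incast, with invented function names: the receiving switch grants each active sender a slice of its egress capacity per interval, so each sender transmits "a little bit now, and now" rather than all at once:

```python
# Toy receiver-based flow control: the receiving switch paces N senders
# so their combined rate never exceeds its single egress link.
def schedule_incast(pending, egress_bps, interval_s=1e-6):
    """Yield (sender, bytes_granted) grants, one interval at a time."""
    budget = int(egress_bps / 8 * interval_s)   # egress bytes per interval
    while any(pending.values()):
        active = [s for s, left in pending.items() if left > 0]
        share = budget // len(active)           # equal slice per sender
        for s in active:
            granted = min(share, pending[s])
            pending[s] -= granted               # queues drain near senders
            yield s, granted

# Four GPUs each have 1 MB destined for one GPU behind an 800G port.
pending = {f"gpu{i}": 1_000_000 for i in range(4)}
grants = list(schedule_incast(pending, egress_bps=800e9))
print(len(grants), "grants; no sender ever exceeds its slice")  # 160 grants
```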
Fault control
At Optica’s Executive Forum event, held alongside the OFC show in March, Google discussed using a 6,000 tensor processing unit (TPU) accelerator system to run large language models.
One Google concern is scaling such clusters while ensuring overall reliability and availability, given the frailty of large-scale accelerator clusters.
“With a huge network having thousands of GPUs, there is a lot of fibre,” says Parizer. “And because it is not negligible, faults happen.”
When an optical link goes down in a network arrangement that uses flows and assigned paths, new paths must be calculated, with significant traffic loss likely.
“With a job that has been running for days, significant packet loss means you must do a job restart,” says Parizer.
Broadcom’s solution, which is not based on flows and assigned paths, uses load balancing to send the data over one fewer path overall.
Using the Jericho2C+, Broadcom has shown fault detection and recovery in microseconds such that the packet loss is low and no job restart is needed.
The Jericho portfolio of devices
Broadcom’s existing Jericho2 architecture combines an enhanced packet-processing pipeline with a central modular database and a vast memory holding look-up tables.
Look-up tables are used to determine how the packet is treated: where to send it, wrapping it in another packet (tunnel encapsulation), extracting it (tunnel termination), and access control lists (ACLs).
Different stages in the pipeline can access the central modular database, and the store can be split flexibly without changing the packet-processing code.
The Jericho2 was the family’s first device, with a 4.8-terabit capacity and 8 gigabytes of high-bandwidth memory (HBM) for deep buffering.
The Jericho2C followed, targeting the edge and service router market. Here, streams have lower bandwidths – typically 1 and 10 gigabits – but need better support in the form of queues, counters and metering, used for controlling packets and flows.
Parizer says the disaggregated OpenBNG initiative supported by Deutsche Telekom uses the Jericho2C.
Broadcom followed with a third family device, the Jericho2C+, which combines the attributes of the Jericho2 and Jericho2C.
The Jericho2C+ has 14.4 terabits of capacity and 144 100-gigabit interfaces, with 7.2 terabits of network-interfacing bandwidth and 7.2 terabits for the fabric interface.
“The Jericho2C+ is a device that can target everything,” says Parizer.
Applications include data centre interconnect, edge and core network routing, and even tiered switching in the data centre.
Hardware design
The Jericho3-AI, made up of tens of billions of transistors in a 5nm CMOS process, is now sampling.
Broadcom says it designed the chip to be cost-competitive for AI.
For example, its packet-processing pipeline is simpler than the one used in the Jericho devices for core and edge routing.
“This also translates to lower latency which is something hyperscalers also care about,” says Parizer.
The cost and power savings from optimisations will be relatively minor, says Wheeler.

Broadcom also highlights the electrical performance of the Jericho3-AI’s input-output serialiser-deserialiser (serdes) interfaces.
The serdes allows the Jericho3-AI to be used with 4m-reach copper cables linking the GPUs to the top-of-rack switch.
The serdes performance also enables linear-drive pluggables, which have no digital signal processor (DSP) for retiming; instead, the serdes drives the pluggable directly. Linear drive saves cost and power.
Broadcom’s Ram Velaga, senior vice president and general manager of the core switching group, speaking at the Open Compute Project’s regional event held in Prague in April, said 32,000-GPU AI clusters cost around $1 billion, with 10 per cent being the network cost.

Velaga showed Ethernet outperforming InfiniBand by 10 per cent on a set of networking benchmarks.
“If I can make a $1 billion system ten per cent more efficient, the network pays for itself,” says Velaga.
Wheeler says the comparison predates the recently announced NVLink Network, which will first appear in Nvidia’s DGX GH200 platform.
“It [NVLink Network] should deliver superior performance for training models that won’t fit on a single GPU, like large language models,” says Wheeler.
Enfabrica’s chip tackles AI supercomputing challenges

- Enfabrica’s accelerated compute fabric chip is designed to scale computing clusters comprising CPUs and specialist accelerator chips.
- The chip uses memory disaggregation and high-bandwidth networking for accelerator-based servers tackling artificial intelligence (AI) tasks.
For over a decade, cloud players have packed their data centres with x86-based CPU servers linked using tiers of Ethernet switches.
“The reason why Ethernet networking has been at the core of the infrastructure is that it is incredibly resilient,” says Rochan Sankar, CEO and co-founder of Enfabrica.
But the rise of AI and machine learning is causing the traditional architecture to change.
What is required is a mix of processors: CPUs and accelerators. Accelerators are specialist processors such as graphics processing units (GPUs), programmable logic (FPGAs), and custom ASICs developed by the hyperscalers.
It is the accelerator chips, not the CPUs, that do the bulk of the processing. Accelerators also require vast data, creating challenging input-output (I/O) and memory requirements.
At Optica’s Executive Forum event, held alongside the OFC show in March, Ryohei Urata, director and principal engineer at Google, mentioned how Google uses two computing pods – comprising 6,000 TPU accelerators – to run its large language models.
A key concern for Google is scaling such clusters while ensuring their reliability and availability. It is critical that the system is available when running a large language model, says Urata.
“As an engineer, when you’re putting stuff down, at least when you first start to put it together, you think, okay, this is going to work perfectly,” says Urata. “This is a perfect design; you don’t factor in failing gracefully, so that’s a key lesson.”
Google’s concern highlights that accelerator-based clusters lack the reliability of data centre server-Ethernet networks.
Accelerated compute fabric
Start-up Enfabrica has developed a chip, dubbed the accelerated compute fabric, to scale computing clusters.
“The focus of Enfabrica is on how networking and fabric technologies have to evolve in the age of AI-driven computing,” says Sankar.
AI models are growing between 8x and 275x annually, placing enormous demands on a data centre’s computing and memory resources.
“Two hundred and seventy-five times is of the order of what the large language models are increasing by; 8x is more the other models, including [machine] vision; recommender models are somewhere in between,” says Sankar.
Another AI hardware driver is growing end-user demand; ChatGPT gained 100 million users in the first months after its launch.
Meeting demand involves cascading more accelerators but the I/O bandwidth connected to the compute is lagging. Moreover, that gap is growing.
Sankar includes memory bandwidth as part of the I/O issue and segments I/O scaling into two: connecting CPUs, GPUs, accelerators and memory in the server, and the I/O scaling over the network.
A computing architecture for AI must accommodate more CPUs and accelerators yet tackle the I/O bottleneck.
“To scale, it requires disaggregation; otherwise, it becomes unsustainable and expensive, or it can’t scale enough to meet processing demands,” says Sankar.
“Memory disaggregation represents the last step in server disaggregation, following storage and networking,” says Bob Wheeler, principal analyst at Wheeler’s Network.
Memory expansion through disaggregation has become more urgent as GPUs access larger memories for AI training, particularly for large language models like ChatGPT, says Wheeler.
Rethinking data connectivity
In the data centre, servers in a rack are linked using a top-of-rack switch. The top-of-rack switch also connects to the higher-capacity leaf-spine Ethernet switching layers to link servers across the data centre.
Enfabrica proposes that the higher capacity Ethernet switch leaf layer talks directly to its accelerated compute fabric chip, removing the top-of-rack switch.

In turn, the accelerated compute fabric uses memory mapping to connect CPUs, accelerators, disaggregated memory pools using CXL, and disaggregated storage.
The memory can be a CPU’s DDR DRAM, a GPU’s high-bandwidth memory (HBM), a disaggregated compute express link (CXL) memory array, or storage.
“It [the accelerated compute fabric] connects to them over standard memory-mapped interfaces such as PCI Express (PCIe) or CXL,” says Sankar.

The chip uses ‘copy engines’ to move data to and from any processing element’s native memory. And by performing memory transfers in parallel, the chip is doing what until now has required PCIe switches, network interface cards (NICs), and top-of-rack switches.
The accelerated compute fabric also has 800-gigabit network interfaces so that, overall, the chip has terabits of bandwidth to move data across the network.
“CXL provides a standard way to decouple memories from CPUs, enabling DRAM disaggregation,” says Wheeler. “Enfabrica’s copy engines connect the GPUs to the pool of CXL memory. The network side, using RDMA (remote direct memory access), enables scaling beyond the limits of CXL.”
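A hypothetical sketch of the copy-engine idea described above: data moves between any two memory-mapped regions via descriptors that the fabric executes in parallel. Every name and field here is invented for illustration; the article does not disclose Enfabrica’s actual interface:

```python
# Toy copy engine: any memory-mapped region (CPU DDR, GPU HBM, CXL pool)
# can be a source or destination; descriptors execute in parallel.
from dataclasses import dataclass
from concurrent.futures import ThreadPoolExecutor

@dataclass
class CopyDescriptor:
    src: str       # e.g. "gpu0.hbm", "cxl.pool0" (hypothetical names)
    dst: str
    offset: int
    length: int

def execute(desc: CopyDescriptor, address_map: dict):
    src_mem, dst_mem = address_map[desc.src], address_map[desc.dst]
    dst_mem[desc.offset:desc.offset + desc.length] = \
        src_mem[desc.offset:desc.offset + desc.length]

# Parallel transfers stand in for the NIC/PCIe-switch/ToR-switch hop.
mems = {"gpu0.hbm": bytearray(1 << 20), "cxl.pool0": bytearray(1 << 20)}
descs = [CopyDescriptor("gpu0.hbm", "cxl.pool0", i * 4096, 4096)
         for i in range(4)]
with ThreadPoolExecutor() as pool:
    list(pool.map(lambda d: execute(d, mems), descs))
```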
Sankar stresses that the accelerated compute fabric is much more than an integration exercise using an advanced 5nm CMOS process.
“If you were to integrate eight NICs, four PCIe switches and a top-of-rack switch, it would not fit into a single die,” says Sankar.
As for software, Enfabrica has designed its solution to fit in with how GPUs, CPUs and memory move data.
Significance
Sankar says the accelerated compute fabric IC will shorten job completion time because the scheduler is finer-grained and the chip can steer I/O to resources as required.
Computing clusters will also become larger using the IC’s high-density networking and CXL.
Wheeler says that CXL 3.x fabrics could provide the same capabilities as the accelerated compute fabric, but such advanced features won’t be available for years.
“History suggests some optional features included in the specifications will never gain adoption,” says Wheeler.
“The CXL/PCIe side of the [accelerated compute fabric] chip enables memory disaggregation without relying on CXL 3.x features that aren’t available, whereas the RNIC (RDMA NIC) side allows scaling to very large systems for workloads that can tolerate additional latency,” says Wheeler.
System benefits
Sankar cites two GPU platforms – one proprietary, one an open system – to highlight the chip’s benefits. The platforms are Nvidia’s DGX-H100 box and the open-design Grand Teton announced by Meta.
“The DGX has become a sort of fundamental commodity or a unit of AI computing,” says Sankar.
The DGX uses eight H100 GPUs, CPUs (typically two), I/O devices that link the GPUs using NVLink, and InfiniBand for networking. The Meta platform has a similar specification but uses Ethernet.
Both systems have eight 400-gigabit interfaces. “That is 3.2 terabits coming out of the appliance, and inside the device, there is 3.2 terabit connected to a bunch of compute resources,” says Sankar.
The Meta platform includes layers of PCIe switches, and Open Compute Project (OCP 3.0) NICs running at 200 gigabits, going to 400 gigabits in the next generation.
The Grand Teton platform also uses eight NICs, four PCIe switches, and likely a top-of-rack switch to connect multiple systems.
Enfabrica’s vision is to enable a similarly composable [GPU] system. However, instead of eight NICs, four PCIe switches and the external top-of-rack switch, only three devices would be needed: two Enfabrica accelerated compute fabric chips and a control processor.
Enfabrica says the design would halve the power compared to the existing NICs, PCIe switches and the top-of-rack switch. “That represents 10 per cent of the rack’s power,” says Sankar.
And low-latency memory could be added to the space saved by using three chips instead of 12. Then, the eight GPUs would have tens of terabytes of memory to share whereas now each GPU has 80 gigabytes of HBM.
What next?
Enfabrica is unveiling the architecture first, and will detail its product later this year.
It is key to unveil the accelerated compute fabric concept now given how AI architectures are still nascent, says Sankar.
But to succeed, the start-up must win a sizeable data-centre customer such as a hyperscaler, says Wheeler: “That means there’s a very short list of customers, and winning one is paramount.”
The supplier must deliver high volumes from the start and guarantee supply continuity, and may also have to provide the source code to ensure that a customer can maintain the product under any circumstances.
“These are high hurdles, but Innovium proved it can be done and was rewarded with an exit at a valuation of greater than $1 billion,” says Wheeler.
Broadcom samples the first 51.2-terabit switch chip

- Broadcom’s Tomahawk 5 marks the era of the 51.2-terabit switch chip
- The 5nm CMOS device consumes less than 500W
- The Tomahawk 5 uses 512, 100-gigabit PAM-4 (4-level pulse amplitude modulation) serdes (serialisers-deserialisers)
- Broadcom will offer a co-packaged version combining the chip with eight 6.4 terabit-per-second (Tbps) optical engines
Part 1: Broadcom’s Tomahawk 5
Broadcom is sampling the world’s first 51.2-terabit switch chip.
With the Tomahawk 5, Broadcom continues to double switch silicon capacity every 24 months; the first 3.2-terabit Tomahawk was launched in September 2014.
“Broadcom is once again first to market at 51.2Tbps,” says Bob Wheeler, principal analyst at Wheeler’s Network. “It continues to execute, while competitors have struggled to deliver multiple generations in a timely manner.”
Tomahawk family
Hyperscalers use the Tomahawk switch chip family in their data centres.
Broadcom launched the 25.6-terabit Tomahawk 4 in December 2019. The chip uses 512 serdes, but these are 50-gigabit PAM-4. At the time, 50-gigabit PAM-4 matched the optical modules’ 8-channel input-output (I/O).
Certain hyperscalers wanted to wait for 400-gigabit optical modules using four 100-gigabit PAM-4 electrical channels, so, in late 2020, Broadcom launched the Tomahawk4-100G switch chip, which employs 256, 100-gigabit PAM-4 serdes.
Tomahawk 5 doubles the 100-gigabit PAM-4 serdes to 512. However, given that 200-gigabit electrical interfaces are several years off, Broadcom is unlikely to launch a second-generation Tomahawk 5 with 256, 200-gigabit PAM-4 serdes.

Switch ICs
Broadcom has three switch chip families: Trident, Jericho and the Tomahawk.
The three switch chip families are needed since no one switch chip architecture can meet all the markets’ requirements.
With its programmable pipeline, Trident targets enterprises, while Jericho targets service providers.
According to Peter Del Vecchio, Broadcom’s product manager for the Tomahawk and Trident lines, there is some crossover. For example, certain hyperscalers favour the Trident’s programmable pipeline for their top-of-rack switches, which interface to the higher-capacity Tomahawk switch chips at the aggregation layer.
Monolithic design
The Tomahawk 5 continues Broadcom’s approach of using a monolithic die design.
“It [the Tomahawk 5] is not reticle-limited, and going to [the smaller] 5nm [CMOS process] helps,” says Del Vecchio.
The alternative approach – a die and chiplets – adds overall latency and consumes more power, given the die and chiplets must be interfaced. Power consumption and signal delay also rise whether a high-speed serial or a slower, wider parallel bus is used to interface the two.
Equally, such a disaggregated design requires an interposer on which the two die types sit, adding cost.
Chip features
Broadcom says the capacity of its switch chips has increased 80x in the last 12 years; in 2010, Broadcom launched the 640-gigabit Trident.
Broadcom has also improved energy efficiency by 20x during the same period.
“Delivering less than 1W per 100Gbps is pretty astounding given the diminishing benefits of moving from a 7nm to a 5nm process technology,” says Wheeler.
“In general, we have achieved a 30 per cent plus power savings between Tomahawk generations in terms of Watts-per-gigabit,” says Del Vecchio.

These power savings are not just from advances in CMOS process technology but also architectural improvements, custom physical IP designed for switch silicon and physical design expertise.
“We create six to eight switch chips every year, so we’ve gotten very good at optimising for power,” says Del Vecchio.
The latest switch IC also adds features to support artificial intelligence (AI)/ machine learning, an increasingly important hyperscaler workload.
AI/ machine learning traffic flows have a small number of massive ‘elephant’ flows alongside ‘mice’ flows. The switch chip adds elephant flow load balancing to tackle congestion that can arise when the two flow classes mix.
“The problem with AI workloads is that the flows are relatively static so that traditional hash-based load balancing will send them over the same links,” says Wheeler. “Broadcom has added dynamic balancing that accounts for link utilisation to distribute better these elephant flows.”
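A sketch contrasting the two policies Wheeler describes. Static hashing pins a flow to one link for its lifetime, so two elephants that hash alike collide indefinitely; a utilisation-aware policy steers each to the least-loaded link instead. Toy model with invented names:

```python
# Static hash-based ECMP versus utilisation-aware ("dynamic") balancing.
LINKS = 4

def hash_ecmp(flow_id: str) -> int:
    return hash(flow_id) % LINKS         # fixed for the flow's lifetime

def dynamic(link_load: list) -> int:
    return min(range(LINKS), key=lambda l: link_load[l])

load_hash, load_dyn = [0] * LINKS, [0] * LINKS
for flow, size in [("eleph-a", 100), ("eleph-b", 100), ("mouse", 1)]:
    load_hash[hash_ecmp(flow)] += size
    load_dyn[dynamic(load_dyn)] += size

print("hash-based:", load_hash)   # the elephants may land on one link
print("dynamic:   ", load_dyn)    # [100, 100, 1, 0]: spread by load
```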
The Tomahawk 5 also provides more telemetry information so data centre operators can better see and tackle overall traffic congestion.
The chip has added virtualisation support, including improved security of workloads in a massively shared infrastructure.
Del Vecchio says that with emerging 800-gigabit optical modules and 1.6 terabit ones on the horizon, the Tomahawk 5 is designed to handle multiples of 400 Gigabit Ethernet (GbE) and will support 800-gigabit optical modules.
The chip’s 100-gigabit physical layer interfaces are combined to form 800 gigabits (8 by 100 gigabits), which is fed to the MAC, the packet-processing pipeline and the Memory Management Unit to create a logical 800-gigabit port. “After the MAC, it’s one flow, not at 400 gigabits but now at 800 gigabits,” says Del Vecchio.
Market research firm Dell’Oro says that 400GbE accounts for 15 per cent of port revenues and that by 2026 it will rise to 57 per cent.
Broadcom also cites independent lab test data showing that its support for RDMA over Converged Ethernet (RoCE) matches the performance of InfiniBand.
“We’re attempting to correct the misconception promoted by competition that InfiniBand is needed to provide good performance for AI/ machine learning workloads,” says Del Vecchio. The tests used previous-generation silicon, not the Tomahawk 5.
“We’re saying this now since machine learning workloads are becoming increasingly common in hyperscale data centres,” says Del Vecchio.
As for the chip’s serdes, they can drive 4m of direct attached copper cabling, with sufficient reach to connect equipment within a rack or between two adjacent racks.
Software support
Broadcom offers a software development kit (SDK) to create applications. The same SDK is common to all three of its switch chip families.
Broadcom also supports the Switch Abstraction Interface (SAI). This standards-based programming interface sits on top of the SDK, allowing the programming of switches independent of the silicon provider.
Broadcom says some customers prefer to use its custom SDK. It can take time for changes to filter up, and a customer may want something undertaken that Broadcom can develop quickly using its SDK.
System benefits
Doubling the switch chip’s capacity every 24 months delivers system benefits. Implementing a 51.2-terabit switch using the current-generation Tomahawk 4 requires six such devices.

Now a single Tomahawk 5 chip in a 2-rack-unit (2RU) switch can support 64 by 800-gigabit, 128 by 400-gigabit, or 256 by 200-gigabit modules.
These switch boxes are air-cooled, says Broadcom.
Co-packaged optics
In early 2021, at a J.P. Morgan analyst event, Broadcom revealed its co-packaged optics roadmap, which highlighted Humboldt, a 25.6-terabit switch chip co-packaged with optics, and Bailly, a 51.2-terabit fully co-packaged optics design.
At OFC 2022, Broadcom demonstrated a 25.6Tbps switch that sent half of the traffic using optical engines.
Also shown was a mock-up of Bailly, a 51.2 terabit switch chip co-packaged with eight optical engines, each at 6.4Tbps.
Broadcom will offer customers a fully co-packaged optics Tomahawk 5 design but has not yet given a date.
Broadcom can also support a customer wanting tailored connectivity with, say, three-quarters of the Tomahawk 5 interfaces using optical engines and the remainder using electrical interfaces to front-panel optics.
Xilinx’s Versal Premium ready for the 800-gigabit era
When Xilinx was created in 1984, the founders banked on programmable logic becoming ever more attractive due to Moore’s law.
Making logic programmable requires extra transistors so Xilinx needed them to become cheaper and more plentiful, something Moore’s law has delivered, like clockwork, over decades.
Since then, Xilinx’s field-programmable gate array (FPGA) devices have advanced considerably.
Indeed, Xilinx’s latest programmable logic family, the Versal Premium, is no longer referred to as an FPGA but as an adaptive compute accelerator platform (ACAP).
The Versal Premium series of chips, to be implemented using TSMC’s 7nm CMOS process, was unveiled for the OFC 2020 show. The Premium series will have seven chips with the largest, the VP1802, having 50 billion transistors.
First devices will ship in the second half of 2021.
ACAP series
Xilinx unveiled its adaptive compute acceleration platform in 2018.

“It is a complete rearchitecting of our device technology,” says Kirk Saban, vice president of product and platform marketing at Xilinx. “It is heterogeneous by nature and has multiple types of processing engines.”
“Versal Premium is evolutionary compared with previous FPGAs that have hardened blocks for certain functions,” says Bob Wheeler, principal analyst at The Linley Group. “It is another step along a continuum, not really new.”
Six ACAP families are planned for Versal: three tailored for artificial intelligence (AI) - the AI RF, AI Core and AI Edge - and the others being the Prime, Premium and HBM (high bandwidth memory).
Only the Versal AI series will have AI engines: very-long-instruction-word (VLIW) processor cores that can also be used for computationally intensive tasks such as digital signal processing.
Premium is the third Versal family to be unveiled, joining the AI Core and Prime series.
Versal Prime is Xilinx’s broadest series in the portfolio, featuring a range of device sizes and capabilities. The Prime series is suited to such applications as storage acceleration in the data centre; wired networking such as 5G back-, mid- and front-haul, and passive optical networking; and industrial applications such as machine vision.
Networking needs
Versal Premium has been developed with core networking and data centre acceleration applications in mind.
“The top-end SKU handles high-end networking applications such as optical transport and data centre interconnect as well as the most demanding signal-processing applications such as radar systems,” says Wheeler.
Xilinx defines core networking as the infrastructure beyond the radio access network. “All the wireline infrastructure is what we consider to be the core of the network,” says Saban. “Access, metro, and core networks, all together.”
When Xilinx’s designers sat down to consider the networking needs for the coming six years, they anticipated a huge capacity hike in the core network. Device numbers are set to grow tenfold with each device generating ten times more traffic.
“The bandwidth going through the wired network globally needs to grow at 50 per cent on a compound annual basis to keep pace with the number of devices being connected and the data coming through them,” says Saban.
Versal Premium will deliver three times the bandwidth and nearly twice the logic capacity of the 16nm Virtex UltraScale+ VU13P FPGA, the largest device used currently for networking and data centre applications.
“Shifts are happening that the Virtex FPGAs are not going to be able to handle,” says Saban. “The move to 400 gigabit and then 800 gigabit on the mid-term horizon, the Virtex products can’t handle that kind of throughput.”
Versal Premium architecture. Source: Xilinx
Premium architecture
The Premium devices feature ARM-based scalar processors such as the dual-core Cortex-A72 application processor and the dual-core Cortex-R5F real-time processor.
The application processor is used for general-purpose processing and control. The real-time processor is used for applications that require deterministic processing. Such a processor is key for safety-certified applications.
Also included is a platform management controller that oversees the device. A user can configure many of the ACAP settings using a standard tool flow but the controller’s operation is effectively transparent to the user, says Saban.
The Premium features several types of on-chip memory that Saban likens to the levels of cache memory used by high-performance processors. “We have look-up-table RAM, Block RAM and Ultra RAM, and we can offload to [external] DDR4 [RAM],” he says. “The memory hierarchy can be configured to match the algorithm you are building.”
The various on-chip functional blocks are linked via a programmable network-on-a-chip. Having the network-on-a-chip frees up programmable logic resources that would otherwise be required to connect the design’s functional blocks.
“Equipment manufacturers need to deliver on this core network growth but they also need to do it securely,” says Saban. “With everything shifting to the cloud, there are huge concerns about data privacy; in many instances, security is just as important as performance for the operators.”
To this aim, the Premium’s on-chip peripherals include 400-gigabit crypto-engines that support the AES-GCM-256 and -128, MACsec, and IPSec encryption standards.
“The crypto blocks are unique and save a lot of look-up tables and power compared with implementing these in programmable logic,” says Linley’s Wheeler.
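For a sense of what those engines accelerate, here is the same AES-GCM-256 authenticated encryption done in software using Python’s third-party cryptography package; the hardened blocks do this in-line at 400 gigabits, a rate software cannot approach:

```python
# Software illustration of AES-GCM authenticated encryption, the
# operation the Versal Premium hardens on-chip at 400 Gbps.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)   # AES-GCM-256
aead = AESGCM(key)
nonce = os.urandom(12)                      # unique per frame
frame = b"line-card payload"
header = b"authenticated-but-cleartext header"  # e.g. a MACsec-style tag

ciphertext = aead.encrypt(nonce, frame, header)      # encrypt + integrity
assert aead.decrypt(nonce, ciphertext, header) == frame
```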
Other on-chip features include up to 5 terabits of Ethernet throughput supporting rates from 10 to 400 Gigabit Ethernet. The devices have multiple 600-gigabit Ethernet MAC cores and support such protocols as FlexE, FlexO, enhanced CPRI (eCPRI), Fibre Channel over Ethernet (FCoE), and OTN.
The Premium family delivers up to 1.8 terabits of Interlaken, from 10-gigabit to 600-gigabit interfaces. Interlaken enables chip-to-chip and chip-to-backplane communications.
There are also 112-gigabit 4-level pulse-amplitude modulation (PAM-4) serialisers/ deserialisers (serdes). The VP1802 will have 28, 32-gigabit serdes and either 140, 58-gigabit or 70, 112-gigabit serdes. The electrical transceivers can drive 10m of copper cable, says Saban.
PCI Express Generation 5.0, enabling direct memory access and cache-coherent interconnect, is also supported on-chip. “We can connect to server CPUs and be an extension of their memory map,” says Saban.
Xilinx claims 22 UltraScale+ FPGAs would be needed to implement all the logic and peripherals of the Versal Premium VP1802.
System design
Wireline vendors want to double the performance with each generation of equipment while keeping platform size and power consumption constant.
Xilinx has a diagram (shown) of a generic telecom line-card design using the Versal Premium. “Vendors have different variants but at a high-level, they all look like this,” says Saban.

Generic telecom line card using the Versal Premium. Source: Xilinx
Line-card data arrives via optical modules. At present, 100-gigabit is mainstream, with 400-gigabit coming soon and, eventually, 800-gigabit interfaces. The data is fed to the Premium’s hardened logic blocks: the Ethernet and encryption blocks.
The adaptive logic (in red) is what companies use to implement their unique designs, such as executing virtualised network functions or packet processing.
“We are seeing the need to infuse artificial intelligence and machine learning into these applications in some capacity,” says Saban. Premium devices have no AI VLIW cores but have sufficient resources for some level of artificial intelligence/ machine learning capability.
Interlaken then sends the data to a host chip or across the backplane to another line card.
Software tools
Xilinx stresses the company is no longer a chip provider but a platform provider. This is reflected in the software tools it provides to accompany its silicon.

Vitis software tool. Source: Xilinx
Versal ACAPs come with advanced toolkit libraries so engineers can program the chip with no knowledge of the underlying hardware.
Xilinx is continuing to provide its Vivado toolset that supports register-transfer level (RTL), a design abstraction used by hardware engineers for their circuit designs. “The traditional RTL toolchain is not going away and will continue to evolve,” says Saban.
But coders developing data centre applications with no knowledge of RTL or programmable logic can now use Xilinx’s Vitis toolset that was launched in 2019.
“It is critical to enable software developers and data scientists doing machine learning a way to interface to our [ACAP] products,” says Saban.
Vitis supports programming languages such as C, C++ and Python as well as higher-level machine-learning frameworks such as TensorFlow and Caffe.
Xilinx also has a library of functions for tasks such as data analytics and genomics. Such applications can be switched in and out since they are executed using adaptive hardware.
The Premium software tools will be available in the fourth quarter of the year.
Lifespan
A programmable logic family’s lifespan is five or six years; the Virtex UltraScale family was launched in 2015.
“We added a few kickers [to the Virtex family] such as high bandwidth memory and 58-gigabit serdes,” says Saban. “And we will likely do the same with Versal, add some integrated block in a derivative product.”
Xilinx’s chip designers are likely already working on an ACAP architecture for 2026, supporting 1.6-terabit speeds and implemented using a 5nm CMOS process.
“If we are to deliver twice the bandwidth at half the power, it is not enough to lean on CMOS process technology,” says Saban. “We will need to look at new chip architectures to solve the problems.”
This is challenging. “It gets harder, it gets more expensive, and there are fewer and fewer companies that can afford it,” says Saban.
EZchip targets multi-core processing with Tilera purchase
Network processor specialist, EZchip Semiconductor, is to acquire Tilera. The deal is valued at $130 million in cash: $50 million when the deal closes, and up to $80 million more depending on performance targets being met.
Tilera's products include multi-core processors, intelligent network interface cards (NICs) and one-rack-unit (1RU) 'whitebox' network appliances used for security applications.
Acquiring Tilera will broaden EZchip's market. Tilera's devices are used for network appliances, enterprise routers, cloud computing, video and voice encoders, security, deep-packet inspection, load-balancing, and emerging applications such as software-defined networking (SDN) and network functions virtualisation (NFV).
EZchip's first acquisition will also broaden the company's US presence and customers: Tilera has 100 customers including Brocade, Check Point Software Technologies, Cisco, Fujitsu, Harmonic, MikroTik and ZTE.
EZchip estimates that with the acquisition, its total addressable market will double to $2 billion by 2016.
EZchip's flagship NPS is a high-end network processor family, while Tilera's multi-core general-purpose processors include the Tile-Gx family, with 9, 16, 36 or 72 64-bit cores, programmed using the C language and supporting the Linux operating system.
"The two companies are highly complementary," says Bob Wheeler, principal analyst for networking at the Linley Group. "Beyond the obvious addition of products, markets, and customers, I see Tilera’s software and systems expertise as important to the success of EZchip’s existing NPS programme."
Eli Fruchter, CEO of EZchip, says that the two companies have been discussing co-development of a next-generation multi-core family that will add specialist networking accelerator hardware from EZchip. The resulting family will have the highest core count at the lowest power, while achieving leading networking and packet-processing performance, says the CEO.
Tilera's designs are noted for their processing performance per watt. Wheeler also highlights the company's iMesh tiled architecture which enables efficient scaling as cores are added to a chip. "Tilera’s proprietary 64-bit VLIW [very long instruction word] CPU design is also important in delivering leading power efficiency," he says.
The next-generation device family will use a standard processing core and move away from Tilera's proprietary technology. EZchip's NPS uses the 32-bit ARC core which EZchip has redesigned. "Network security and monitoring are the primary targets [for the next-gen devices]," says Wheeler. "Tilera currently serves other applications, including videoconferencing, but these won’t benefit from EZchip’s accelerators."
Tilera's revenues were $35 million in 2013, suggesting single-digit percent market share using EZchip's $1 billion TAM estimate. It thus has some way to go to compete with Broadcom and Cavium. Near term, customers may be more willing to work with a profitable public company, notes Wheeler, but for EZchip to achieve major share gains will depend on delivering next-generation processors.
Tilera's revenues declined in the first half of 2014. EZchip would not detail why, except to suggest that the decline in orders is temporary and that growth will return in the second half of 2014. EZchip is confident Tilera's revenues will exceed $35 million in 2015.
EZchip will pay Tilera's shareholders up to $80 million if revenue targets are met: $50 million in cash if revenues reach $45 million between when the deal closes in Q3 2014 and June 2015, and a further $30 million if revenues of $31 million are achieved in the second half of 2015.




