Broadcom’s silicon for the PCI Express 6.0 era

Broadcom has detailed its first silicon for the sixth generation of the PCI Express (PCIe 6.0) bus, developed with AI servers in mind.
The two types of PCIe 6.0 devices are a switch chip and a retimer.
Broadcom, working with Teledyne LeCroy, is also making available an interoperability development platform to aid engineers adopting the PCIe 6.0 standard as part of their systems.
Compute servers for AI are placing new demands on the PCIe bus. The standard is no longer just about connecting CPUs to peripherals; it now also serves the communication needs of AI accelerator chips.
“AI servers have become a lot more complicated, and connectivity is now very important,” says Sreenivas Bagalkote, Broadcom’s product line manager for the data center solutions group.
Bagalkote describes Broadcom’s PCIe 6.0 switches as a ‘fabric’ rather than simply silicon that switches between PCIe lanes.
PCI Express
PCIe is a long-standing, widely adopted standard, used not only for computing and servers but across industries such as medical imaging, automotive, and storage.
The first three generations of PCIe evolved around the CPU. There followed a long wait for PCIe 4.0, but since then, a new PCIe generation has appeared every two years, each time doubling the data transfer rate.
Now, PCIe 6.0 silicon is coming to market while work progresses on PCIe 7.0, whose final draft is ready for member review.
The PCIe standard supports various lane configurations from two to 32 lanes. For servers, 8-lane and 16-lane configurations are common.
“Of all the transitions in PCIe technology, generation 6.0 is the most important and most complicated,” says Bagalkote.
PCIe 6.0 introduces several new features. Like previous generations, it doubles the lane rate: PCIe 5.0 supports 32 giga-transfers a second (GT/s) while PCIe 6.0 supports 64GT/s.
The 64GT/s line rate requires the use of 4-level pulse amplitude modulation (PAM-4) for the first time; all previous PCIe generations use non-return-to-zero (NRZ) signalling.
Since PCIe must be backwards compatible, the PCIe 6.0 switch supports both PAM-4 and NRZ signalling. PAM-4’s higher error rate requires more sophisticated circuitry at each end of the link as well as a forward error correction scheme, another first for PCIe.
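As a back-of-the-envelope illustration of the generational doubling, here is a sketch using the per-lane rates quoted above; protocol overheads such as encoding and FLIT framing are ignored:

```python
# Back-of-the-envelope PCIe throughput, ignoring protocol overheads.
# One transfer carries one bit per lane, so GT/s equals Gbit/s per lane.
# PCIe 6.0 reaches 64GT/s by keeping PCIe 5.0's 32Gbaud symbol rate but
# using PAM-4, which carries two bits per symbol instead of NRZ's one.

LANE_RATE_GTS = {"PCIe 4.0": 16, "PCIe 5.0": 32, "PCIe 6.0": 64}

def bandwidth_gbytes_per_s(generation: str, lanes: int) -> float:
    return LANE_RATE_GTS[generation] * lanes / 8  # bits to bytes

for gen in LANE_RATE_GTS:
    print(f"{gen} x16: {bandwidth_gbytes_per_s(gen, 16):.0f} GB/s per direction")
# PCIe 4.0 x16: 32 GB/s
# PCIe 5.0 x16: 64 GB/s
# PCIe 6.0 x16: 128 GB/s
```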
Another new feature is flow control unit (FLIT) encoding, a network packet scheme designed to simplify data transfers.
PCIe 6.0 also adds integrity and data encryption (IDE) to secure the data on the PCIe links.
AI servers
A typical AI server includes CPUs, eight or 16 interconnected GPUs (AI accelerators), and network interface cards (NICs) that connect it to the GPUs making up the rest of the cluster and to storage elements.
A typical server connectivity tray will likely have four switch chips, one for each pair of GPUs, says Bagalkote. Each GPU has a dedicated NIC, typically with a 400 gigabit per second (Gbps) interface. The PCIe switch chips also connect the CPUs and NVMe storage.
Broadcom’s existing PCIe 5.0 switch ICs have been used in over 400 AI server designs, which the company estimates account for 80 to 90 per cent of all deployed AI servers.
Switch and retimer chips
PCIe 6.0’s doubling of the lane data rate makes sending signals across a 15-inch rack server harder.
Broadcom says its switch chip uses serialiser-deserialiser (serdes) circuitry that outperforms the PCIe specification by 4 decibels (dB). If extra link distance is needed, Broadcom also offers PCIe 6.0 retimer chips that add another 4dB.
Using Broadcom’s ICs at both ends results in a 40dB link budget, whereas the specification only calls for 32dB. “This [extra link budget] allows designers to either achieve a longer reach or use cheaper PCB materials,” says Bagalkote.
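The link-budget arithmetic is simple addition in decibels. A minimal sketch using the figures quoted:

```python
# Link-budget arithmetic using the figures quoted above.
SPEC_BUDGET_DB = 32      # channel loss the PCIe 6.0 specification calls for
EXTRA_PER_END_DB = 4     # margin Broadcom claims for its serdes, at each end

total = SPEC_BUDGET_DB + 2 * EXTRA_PER_END_DB
print(f"{total}dB total, {total - SPEC_BUDGET_DB}dB of headroom")
# 40dB total, 8dB of headroom for longer reach or cheaper PCB material
```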
The PCIe switch chip also features added telemetry and diagnostic features. Given the cost of GPUs, such features help data centre operators identify and remedy issues without taking the server offline.
“PCIe has become an important tool for diagnosing in real-time, remotely, and with less human intervention, all the issues that happen in AI servers,” says Bagalkote.
Early PCIe switches were used in a tree-like arrangement with one input – the root complex – connected via the switch to multiple end-points. Now, with AI servers, many devices connect to each other. Broadcom’s largest device – the PEX90144 – can switch between its 144 PCIe 6.0 lanes while supporting 2-, 4-, 8- or 16-lane-wide ports.
Broadcom has also announced switch IC configurations with 104 and 88 lanes. These will be followed by 64- and 32-lane versions. All the switch chips are implemented using a 5nm CMOS process.
Broadcom is shipping “significant numbers” of samples of the chips to certain system developers.
PCIe versus proprietary interconnects
Nvidia and AMD, which develop CPUs and AI accelerators, have created their own proprietary scale-up architectures: Nvidia has NVLink, while AMD has its Infinity Fabric interconnect technology.
Such proprietary interconnect schemes are used in preference to PCIe to connect GPUs to each other, and CPUs to GPUs. However, both vendors use PCIe elsewhere in their systems, to connect to storage, for example.
Broadcom says that for the market in general, open systems have a history of supplanting closed, proprietary ones. It points to the success of its PCIe 4.0 and PCIe 5.0 switch chips and believes PCIe 6.0 will be no different.
Disaggregated system developer Drut Technologies is now shipping a PCIe 5.0-based scalable AI cluster that can support different vendors’ AI accelerators. Its system uses Broadcom’s 144-lane PCIe 5.0 switch silicon for its interconnect fabric.
Drut is working on its next-generation, PCIe 6.0-based design.
Broadcom taps AI to improve switch chip traffic analysis

Broadcom’s Trident 5-X12 networking chip is the company’s first to add an artificial intelligence (AI) inferencing engine.
Data centre operators can use their network traffic to train the chip’s neural network. The Trident 5’s inference engine, dubbed the Networking General-purpose Neural-network Traffic-analyzer or NetGNT, is loaded with the resulting trained model to classify traffic and detect security threats.
“It is the first time we have put a neural network focused on traffic analysis into a chip,” says Robin Grindley, principal product line manager with Broadcom’s Core Switching Group.
Adding an inference engine shows how AI can complement traditional computation, in this case, packet processing.
Trident family
Trident is one of Broadcom’s three main lines of networking and switch chips, the Jericho and Tomahawk being the other two.
Service providers favour the Jericho family for high-end IP routing applications. The Ethernet switch router chip’s features include a programmable pipeline and off-chip store for large traffic buffering and look-up tables.
The latest family member, the 28.8 terabits-per-second (Tbps) Jericho 3, was announced in September. Broadcom launched the first Jericho 3 device, the Jericho3-AI, a chip tailored for AI networking requirements, earlier this year.
In contrast, Broadcom’s Tomahawk Ethernet network switch family addresses the data centre operators’ needs. The Tomahawk has a relatively simple fixed packet-processing pipeline to deliver the highest switching capacity. The Tomahawk 5 has a capacity of 51.2Tbps and includes 512 100-gigabit PAM-4 serialiser-deserialisers (serdes).
“The big hyperscalers want maximum bandwidth and maximum radix [switches],” says Grindley. “The hyperscalers have a pretty simple fabric network and do everything else themselves.”
The third family, the Trident Ethernet switch chips, is popular for enterprise applications. Like the Jericho, the Trident has a programmable pipeline to address enterprise networking tasks such as Virtual Extensible LAN (VXLAN), tunnelling protocols, and segment routing (SRv6).
The speeds and timelines of the various Tomahawk and Trident chips are shown in the chart.

Trident 5-X12
The Trident 5-X12 is implemented using a 5nm CMOS process and has a capacity of 16Tbps. The chip’s input-output includes 160 100-gigabit PAM-4 serdes, the same serdes Broadcom introduced with the Tomahawk 5.
The first chip of each new generation of Trident usually has the highest capacity and is followed by lower-capacity devices tailored to particular markets.

Trident 5 is aimed at top-of-rack switch applications. Typically, 24 or 48 ports of the top-of-rack switch are used for downlinks to connect to servers, while 4 or 8 are used for higher-capacity uplinks (see diagram).
The Trident 5 can support 48 ports of 200 gigabits for the downlinks and eight 800-gigabit ports for the uplinks. To support 800-gigabit interfaces, the chip uses eight 100-gigabit serdes and an on-chip 800-gigabit media access controller (MAC). Other top-of-rack switch configurations are shown in the diagram.
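A quick sanity check shows how this configuration exactly fills the chip's quoted capacity and serdes count:

```python
# Sanity-check the top-of-rack configuration against the Trident 5-X12's
# quoted 16-terabit capacity and 160 x 100-gigabit serdes.
downlinks_gbps = 48 * 200    # 48 server-facing ports at 200Gbps
uplinks_gbps = 8 * 800       # 8 uplink ports at 800Gbps, 8 serdes apiece
print(downlinks_gbps + uplinks_gbps)   # 16000Gbps = 16Tbps: the full capacity
print(160 * 100)                       # raw serdes bandwidth: also 16000Gbps
```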
Currently, 400-gigabit network interface cards are used for demanding applications such as machine learning. The Trident 5 is also ready for the transition to 800-gigabit network interface cards.
Another Tomahawk feature the Trident 5 has adopted is cognitive routing, a collection of congestion management techniques for demanding machine-learning workloads.
One of the techniques is global load balancing. Previous Trident devices supported dynamic load balancing, where the hardware could see a congested port and adapt in real-time. However, such a technique gives no insight into what happens further along the flow path. “If I knew that, downstream, somebody else was congested, then I could make a smarter decision,” says Grindley. Global load balancing does just this: it sends notifications to the switch chips upstream that there is congestion so they can all work together.
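A minimal sketch of the idea follows; the function, names and data structures are hypothetical, as Broadcom has not published the mechanism's internals:

```python
# Hypothetical sketch contrasting dynamic and global load balancing.
# Names and structures are illustrative, not Broadcom's implementation.

def pick_egress(candidates, local_queue_depth, congested_downstream=frozenset()):
    """Dynamic load balancing considers only local_queue_depth.
    Global load balancing also avoids paths that congestion
    notifications have flagged as congested further along the flow."""
    usable = [p for p in candidates if p not in congested_downstream]
    if not usable:
        usable = candidates   # all paths flagged: fall back to the local view
    return min(usable, key=lambda p: local_queue_depth[p])

queues = {"port1": 7, "port2": 3}
print(pick_egress(["port1", "port2"], queues))             # port2: shortest queue
print(pick_egress(["port1", "port2"], queues, {"port2"}))  # port1: a downstream
                                                           # notification flagged port2
```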
Another cognitive routing feature is drop congestion notification. Here, when packets are dropped due to congestion, only their header data and the location of the drop are captured and sent on. This mechanism improves flow completion times compared to normal packet loss, which is costly for machine-learning workloads.
Trident 5, like its predecessor, Trident 4, has a heterogeneous pipeline of tile types. The tiles contain static random-access memory (SRAM), ternary content-addressable memory (TCAM) or arithmetic logic units. The tiles allow multiple look-ups or actions in parallel at each stage in the pipeline.

Broadcom has a compiler that maps high-level packet-processing functions, written in the NPL programming language, to its pipeline. The latency through the device stays constant however the packet processing is changed, says Grindley.
Trident 5’s NetGNT inference engine is a new pipeline resource for higher-level traffic patterns. “NetGNT looks at things not at a packet-by-packet level, but across time and the overall packet flow through the network,” says Grindley.
NetGNT
Until now, system architects and network operation centre staff have defined sets of static rules, written in software, to uncover and treat suspicious packet flows. “A pre-coded set of rules is limited in its ability to catch higher-level traffic patterns,” says Grindley.
When Broadcom started the Trident 5 design, its engineers thought a neural network approach could be used. “We knew it would be useful if you had something that looked at a higher level, and we knew neural networks could do this kind of task,” says Grindley.
The neural network sits alongside the existing traffic analysis logic. Information such as packet headers, or data already monitored and generated by the pipeline, can be fed to the neural network to assess the traffic patterns.
“It sits there and looks for high-level patterns such as the start of a denial of service attack,” says Grindley.
Training
The neural network is trained using supervised learning: a human expert must create the labelled training data and train the model offline. The result is a set of weights loaded onto the Trident 5’s neural network.
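As an illustration only – Broadcom has not published NetGNT's architecture or training flow, so the features, labels and model below are placeholders – the offline workflow might look like this:

```python
# Illustrative only: an offline supervised-learning workflow whose
# output weights would be loaded onto the chip. The features, labels
# and model here are stand-ins, not NetGNT's real design.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 16))        # stand-in per-flow features from captures
y = rng.integers(0, 2, 1000)      # expert labels: 0 = benign, 1 = suspicious

model = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=300)
model.fit(X, y)                   # supervised training, done off-chip

weights = model.coefs_            # the artefact deployed to the switch
print([w.shape for w in weights])  # [(16, 32), (32, 16), (16, 1)]
```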

When the neural network is triggered, i.e. when it identifies a pattern of interest, the Trident 5 must decide what to do. The chip can drop the packets or change their quality of service (QoS). The device can also drop a packet while creating a mirror packet containing its headers and metadata; this can then be sent to a central analyser at the network operations centre, where higher-level management algorithms run.
Performance
The Trident 5 chip is now sampling. Broadcom says there is no performance data as end customers are still to train and run live traffic through the Trident 5’s inference engine.
“What it can do for them depends on getting good data and then running the training,” says Grindley. “Nobody has done this yet.”
Will the inference engine be used in other Broadcom networking chips?
“It depends on the market,” says Grindley. “We can replicate it, just like taking IP from the Tomahawk where appropriate.”
Vodafone's effort to get silicon for telcos

This is an exciting time for semiconductors, says Santiago Tenorio, which is why his company, Vodafone, wants to exploit this period to benefit the radio access network (RAN), the most costly part of the wireless network for telecom operators.
The telecom operators want greater choice when buying RAN equipment.
As Tenorio, a Vodafone Fellow (the company’s first) and its network architecture director, notes, there were more than ten wireless RAN equipment vendors 15 years ago. Now, in some parts of the world, the choice is down to two.
“We were looking for more choice and that is how [the] Open RAN [initiative] started,” says Tenorio. “We are making a lot of progress on that and creating new options.”
But having more equipment suppliers is not all: the choice of silicon inside the equipment is also limited.
“You may have Fujitsu radios or NEC radios, Samsung radios, Mavenir software, whatever; in the end, it’s all down to a couple of big silicon players, which also supply the incumbents,” he says. “So we thought that if Open RAN is to go all the way, we need to create optionality there too to avoid vendor lock-in.”
Vodafone has set up a 50-strong research team at its new R&D centre in Malaga, Spain, that is working with chip and software companies to develop the architecture of choice for Open RAN to expand the chip options.
Open RAN R&D
The R&D centre’s 50 staff are organised into several streams, but their main goal is to answer critical questions regarding the Open RAN silicon architecture.
“Things like whether the acceleration is in-line or look-aside, which is a current controversy in the industry,” says Tenorio. “These are the people who are going to answer that question.”
With Open RAN, the virtualised Distributed Unit (DU) runs on a server. This contrasts with specialised hardware used in traditional baseband units.
Open RAN processes layer 1 data in one of two ways: look-aside or in-line. With look-aside, the server’s CPU performs most layer 1 tasks, offloading functions such as forward error correction to accelerator hardware. This requires frequent communication between the two, which limits processing efficiency.
In-line solves this by performing all the layer 1 processing using a single chip. Dell, for example, has an Open RAN accelerator card that performs in-line processing using Marvell’s silicon.
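The contrast can be sketched as follows; all the class and function names are illustrative stubs, not any vendor's SDK:

```python
# Conceptual contrast between look-aside and in-line layer 1 processing.
# Every name here is an illustrative stub.

class HostCPU:
    def demodulate(self, iq): return iq      # layer 1 work done in software
    def reassemble(self, bits): return bits

class LookAsideCard:
    def ldpc_decode(self, symbols):
        # Every call is a PCIe round trip: symbols out, decoded bits
        # back. The frequent hand-offs are what limits efficiency.
        return symbols

class InlineLayer1Chip:
    def process(self, iq):
        # Demodulation, FEC decode and the rest all happen on one chip;
        # the host only ever sees finished transport blocks.
        return iq

def lookaside_slot(iq, cpu=HostCPU(), card=LookAsideCard()):
    return cpu.reassemble(card.ldpc_decode(cpu.demodulate(iq)))

def inline_slot(iq, chip=InlineLayer1Chip()):
    return chip.process(iq)
```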
When Vodafone announced its Open RAN silicon initiative in January, it was working with 20 chip and software companies. More companies have since joined.
“You have software players like middleware suppliers, also clever software plug-ins that optimise the silicon itself,” says Tenorio. “It’s not only silicon makers attracted by this initiative.”
Vodafone has no preconceived ideas as to the ideal solution. “All we want is the best technical solution in terms of performance and cost,” he says.
By performance, Vodafone means power consumption and processing. “With a more efficient solution, you need less [processing] cores,” says Tenorio.
Vodafone is talking to the different players to understand their architectures and points of view and is doing its own research that may include simulations.
Tenorio does not expect Vodafone to manufacture silicon: “I mean, that’s not necessarily on the cards.” But Vodafone must understand what is possible and will conduct lab testing and benchmark measurements.
“We will do some head-to-head measurements that, to be fair, no one I know does,” he says. Vodafone will then publish its position, create a specification, and drive vendors to comply with it.
“We’ve done that in the past,” says Tenorio. “We have been specifying radios for the last 20 years, and we never had to manufacture one; we just needed to understand how they’re done to take the good from the bad and then put everybody on the art of the possible.”
Industry interest
The companies joining Vodafone’s Open RAN chip venture are motivated for different reasons.
Some have joined to ensure that they have a voice and influence Vodafone’s views. “Which is super,” says Tenorio.
Others are there because they are challengers to the current ecosystem. “They want to get the specs ahead of anybody to have a better chance of succeeding if they listen to our advice, which is also super,” says Tenorio.
Meanwhile, software companies have joined to see whether they can improve hardware performance.
“That is the beauty of having the whole ecosystem,” he says.

Work scale
The work is starting at layer 1, covering not just the RAN’s distributed unit (DU) but also the radio unit (RU), given that the power amplifier is the biggest offender in terms of power consumption.
Layers 2 and 3 will also be tackled. “We’re currently running that on Intel, and we’re finding that there is a lot of room for improvement, which is normal,” says Tenorio. “It’s true that running the three layers on general-purpose hardware has room for improvement.”
That room for improvement is almost equivalent to one full generation of silicon, he says.
Vodafone also says it can’t be the case that Intel remains the only provider of silicon for Open RAN.
The operator expects new hardware variants based on ARM, perhaps AMD, and maybe the RISC-V architecture at some point.
“We will be there to make it happen,” says Tenorio.
Other chip accelerators
Does hardware such as graphics processing units (GPUs), data processing units (DPUs) and programmable logic have a role?
“I think there’s room for that, particularly at the point that we are in,” says Tenorio. “The future is not decided yet.”
The key is to avoid vendor lock-in for layer 1 acceleration, he says.
He highlights the work of companies such as Marvell and Qualcomm to accelerate layer 1 tasks, but he fears this will drive the software suppliers to take sides with one of these accelerators. “This is not what we want,” he says.
What is required is to standardise the interfaces to abstract the accelerator from the software, or steer away from custom hardware and explore the possibilities of general-purpose but specialised processing units.
“I think the future is still open,” says Tenorio. “Right now, I think people tend to go proprietary at layer 1, but we need another plan.”
“As for FPGAs, that is what we’re trying to run away from,” says Tenorio. “If you are an Open RAN vendor and can’t afford to build your ASIC because you don’t have the volume, then, okay, that’s a problem we were trying to solve.”
Improving general-purpose processing avoids having to go to FPGAs, which are bulky, power-hungry and expensive, says Tenorio, but he also notes how FPGAs are evolving.
“I don’t think we should have religious views about it,” he says. “There are semi-programmable arrays that are starting to look better and better, and there are different architectures.”
This is why he describes the chip industry as ‘boiling’: “This is the best moment for us to take a view because it’s also true that, to my knowledge, there is no other kind of player in the industry that will offer you a neutral, unbiased view as to what is best for the industry.”
Without that, the fear is that by acquisition and competition, the chip players will reduce the IC choices to a minimum.
“You will end up with two to three incumbent architectures, and you run a risk of those being suboptimal, and of not having enough competition,” says Tenorio.
Vodafone’s initiative is open for companies to participate, including its telco competitors.
“There are times when it is faster, and you make a bigger impact if you start things on your own, leading the way,” he says.
Vodafone has done this before: In 2014, it started working with Intel on Open RAN.
“We made some progress, we had some field trials, and in 2017, we approached TIP (the Telecom Infra Project), and we offered to contribute our progress for TIP to continue in a project group,” says Tenorio. “At that point, we felt that we would make more progress with others than going alone.”
Vodafone is already deploying Open RAN in the UK and has said that by 2030, 30 per cent of its deployments in Europe will be Open RAN.
“We’ve started deploying open RAN and it works, the performance is on par with the incumbent architecture, and the cost is also on par,” says Tenorio. “So we are creating that optionality without paying any price in terms of performance, or a huge premium cost, regardless of what is inside the boxes.”
Timeline
Vodafone is already looking at in-line versus look-aside.
“We are closing into in-line benefits for the architecture. There is a continuous flow of positions or deliverables to the companies around us,” says Tenorio. “We have tens of meetings per week with interested companies who want to know and contribute to this, and we are exchanging our views in real-time.”
There will also be a white paper published, but for now, there is no deadline.
There is an urgency to the work given that Vodafone is deploying Open RAN, although this research is for the next generation of Open RAN. “We are deploying the previous generation,” he says.
Vodafone is also talking, for example, to the ONF open-source organisation, which announced an interest in defining interfaces to exploit acceleration hardware.
“I think the good thing is that the industry is getting it, and we [Vodafone] are just one factor,” says Tenorio. “But you start these conversations, and you see how they’re going places. So people are listening.”
The industry agrees that layer 1 interfacing needs to be standardised or abstracted to avoid companies ending in particular supplier camps.
“I think there’ll be a debate whether that needs to happen in the ORAN Alliance or somewhere else,” says Tenorio. “I don’t have strong views. The industry will decide.”
Other developments
The Malaga R&D site will not just focus on Open RAN but also on other parts of the network, such as transport.
Transport still makes use of proprietary silicon but there is also more vendor competition.
“The dollars spent by operators in that area is smaller,” says Tenorio. “That’s why it is not making the headlines these days, but that doesn’t mean there is no action.”
Two transport areas where disaggregated designs have started are the disaggregated backbone router, and the disaggregated cell site gateway, both being sensible places to start.
“Disaggregating a full MPLS carrier-grade router is a different thing, but its time will come,” says Tenorio, adding that the centre in Malaga is not just for Open RAN, but silicon for telcos.
BT’s Open RAN trial: A mix of excitement and pragmatism

“We in telecoms, we don’t do complexity very well.” So says Neil McRae, BT’s managing director and chief architect.
He was talking about the trend of making network architectures open and in particular the Open Radio Access Network (Open RAN), an approach that BT is trialling.
“In networking, we are naturally sceptical because these networks are very important and every day become more important,” says McRae.
Whether it is Open RAN or any other technology, it is key for BT to understand its aims and how it helps. “And most importantly, what it means for customers,” says McRae. “I would argue we don’t do enough of that in our industry.”
Open RAN
Open RAN has become a key focus in the development of 5G. Backed by leading operators, Open RAN promises greater vendor choice and helps counter the dependency on the handful of key RAN vendors such as Nokia and Ericsson. There are also geopolitical considerations given that Huawei is no longer a network supplier in certain countries.
“Huawei and China, once they were the flavour of the month and now they no longer are,” says McRae. “That has driven a lot of concern – there are only Nokia and Ericsson as scaled players – and I think that is a thing we need to worry about.”
McRae points out that Open RAN is an interface standard rather than a technology.
“Those creating Open RAN solutions, the only part that is open is that interface side,” he says. ”If you think of Nokia, Ericsson, Mavenir, Rakuten and Altiostar – any of the guys building this technology – none of their technology is specifically open but you can talk to it via this open interface.”

Opportunity
McRae is upbeat about Open RAN but much work is needed to realise its potential.
“Open RAN, and I would probably say the same about NFV (network functions virtualisation), has got a lot of momentum and a lot of hype well before I think it deserves it,” he says.
BT favours open architectures and interoperability. “Why wouldn’t we want to be part of that, part of Open RAN,” says McRae. “But what we are seeing here is people excited about the potential – we are hugely excited about the potential – but are we there yet? Absolutely not.”
BT views Open RAN as a way to support the small-cell neutral host model, whereby a company can offer operators coverage – one way Open RAN can augment macro-cell coverage.
Open RAN can also be used to provide indoor coverage such as in stadiums and shopping centres. McRae says Open RAN could also be used for transportation but there are still some challenges there.
“We see Open RAN and the Open RAN interface specifications as a great way for building innovation into the radio network,” he says. “If there is one part that we are hugely excited about, it is that.”
BT’s Open RAN trial
BT is conducting an Open RAN trial with Nokia in Hull, UK.
“We haven’t just been working with Nokia on this piece of work, other similar experiments are going on with others,” says McRae.
McRae equates Open RAN with software-defined networking (SDN). SDN uses several devices that are largely unintelligent while a central controller – ’the big brain’ – controls the devices and in the process makes them more valuable.
“SDN has this notion of a controller and devices and the Open RAN solution is no different: it uses a different interface but it is largely the same concept,” says McRae.
This central controller in Open RAN is the RAN Intelligent Controller (RIC) and it is this component that is at the core of the Nokia trial.
“That controller allows us to deploy solutions and applications into the network in a really simple and manageable way,” says McRae.
The RIC architecture is composed of a near-real-time RIC that is very close to the radio and that makes almost instantaneous changes based on the current situation.
There is also a non-real-time controller that is used for such tasks as setting policy, the overall run cycle for the network, configuration, and troubleshooting.
“You kind of create and deploy it, adjust it or add or remove things, not in real-time,” says McRae. “It is like with a train track, you change the signalling from red to green long before the train arrives.”
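A conceptual sketch of the two control loops and their very different timescales; function names and hooks are illustrative, not O-RAN APIs (O-RAN scopes the near-real-time loop at roughly 10ms to 1s and the non-real-time loop at over a second):

```python
# Conceptual sketch only: the two RIC control loops. All names are
# illustrative stubs, not O-RAN-defined interfaces.
import time

def tune(kpis):                    # stand-in for an xApp's optimisation step
    return {"handover_margin_db": 3 if kpis["load"] > 0.8 else 6}

def near_rt_loop(read_kpis, apply_config, cycles=3):
    for _ in range(cycles):        # reacts to the current radio situation
        apply_config(tune(read_kpis()))
        time.sleep(0.01)           # ~10ms-1s scale, close to the radio

def non_rt_loop(near_rt_rics, cycles=1):
    for _ in range(cycles):        # policy, configuration, troubleshooting:
        for ric in near_rt_rics:   # "change the signalling long before
            ric({"energy_saving": True})   # the train arrives"
        time.sleep(1.0)

near_rt_loop(lambda: {"load": 0.9}, print)   # demo with stub radio hooks
non_rt_loop([print])
```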
BT views the non-real-time aspect of the RIC as a new way for telcos to automate and optimise the core aspects of radio networking.
McRae says the South Bank in London is one of the busiest parts of BT’s network and the operator has had to keep adding spectrum to the macrocells there.
“It is getting to the point where the macro isn’t going to be precise enough to continue to build a great experience in a location like that,” he says.
One solution is to add small cells and BT has looked at that but has concluded that making macros and small cells work together well is not straightforward. This is where the RIC can optimise the macro and small cells in a way that improves the experience for customers even when the macro equipment is from one vendor and the small cells from another.
“The RIC allows us to build solutions that take the demand and the requirements of the network a huge step forward,” he says. “The RIC makes a massive step – one of the biggest steps in the last decade, probably since massive MIMO – in ensuring we can get the most out of our radio network.”
BT is focussed on the non-real-time RIC for the initial use cases it is trialling. It is using Nokia’s equipment for the Hull trial.
BT is also testing applications such as load optimisation between different layers of the network and between neighbouring sites. And where there is a failure in the network, it is using xApps to reroute traffic or re-optimise the network.
Nokia also has AI and machine learning software which BT is trialling. BT sees AI and machine learning-based solutions as a must as ultimately human operators are too slow.
Trial goals
BT wants to understand how Open RAN works in deployment. For example, how to manage a cell that is part of a RIC cluster.
In a national network, there will likely be multiple RICs used.
“We expect that this will be a distributed architecture,” says McRae. “How do you control that? Well, that’s where the non-real-time RIC has a job to do, effectively to configure the near-real-time RIC, or RICs as we understand more about how many of them we need.”
Another aspect of the trial is to see whether, by using Open RAN, the network performance KPIs can be improved. These include time on 4G versus time on 5G, and the numbers of handovers and dropped calls.
“Our hope and we expect that all of these get better; the early signs in our labs are that they should all get better, the network should perform more effectively,” he says.
BT will also do coverage testing which, with some of the newer radios it is deploying, it expects to improve.
“We’ve done a lot of learning in the lab,” says McRae. “Our experience suggests that translating that into operational knowledge is not perfect. So we’re doing this to learn more about how this will work and how it will benefit customers at the end of the day.”
Openness and diversity
Given that Open RAN aims to open vendor choice, some have questioned whether BT’s trial with Nokia is in the spirit of the initiative.
“We are using the Open RAN architecture and the Open RAN interface specs,” says McRae. “Now, for a lot of people, Open RAN means you have got to have 12 vendors in the network. Let me tell you, good luck to everyone that tries that.”
BT says a set of flavours of Open RAN is appearing. One is Rakuten’s Symphony, another is Mavenir’s. These are end-to-end solutions that can be offered to operators.
“Service providers are terrible at integrating things; it is not our core competency,” says McRae. “We have got better over the years but we want to buy a solution that is tested, that has a set of KPIs around how it operates, that has all the security features we need.”
This is key for a platform that in BT’s case serves 30 million users. As McRae puts it, if Open RAN becomes too complicated, it is not going to get off the ground: “So we welcome partnerships, or ecosystems that are forming because we think that is going to make Open RAN more accessible.”
McRae says some of the reaction to its working with Nokia is about driving vendor diversity.
BT wants diverse vendors that can provide it with greater choice and the benefits of competition. But McRae points out that much of the vendors’ equipment uses the same key components from a handful of chip companies, and chips that are made in two key locations.
“What we want to see is those underlying components, we want to see dozens of companies building them all over the world,” he says. “They are so crucial to everything we do in life today, not just in the network, but in your car, smartphone, TV and the microwave.”
And while more of the network is being implemented in software – BT’s 5G core is all software – hardware is still key where there are packet or traffic flows.
“The challenge in some of these components, particularly in the radio ecosystem, is you need strong parallel processing,” says McRae. “In software that is really difficult to do.”
“Intel, AMD, Broadcom and Qualcomm are all great partners,“ says McRae. “But if any one of these guys, for some reason, doesn’t move the innovation curve in the way we need it to move, then we run into real problems of how to grow and develop the network.”
What BT wants is as much IC choice as possible, but McRae is less certain how that will be achieved. Operators rightly have to be concerned about it, he says.
Nvidia's plans for the data processing unit

When Nvidia’s CEO, Jensen Huang, discussed its latest 400-gigabit BlueField-3 data processing unit (DPU) at the company’s 2021 GTC event, he also detailed its successor.
Companies rarely discuss chip specifications two generations ahead; the BlueField-3 only begins sampling next quarter.
The BlueField-4 will advance Nvidia’s DPU family.
It will double again the traffic throughput to 800 gigabits-per-second (Gbps) and almost quadruple the BlueField-3’s integer processing performance.
But one metric cited stood out. The BlueField-4 will massively increase the number of tera-operations-per-second (TOPS) performed: 1,000 TOPS compared to the BlueField-3’s 1.5 TOPS.
Huang said artificial intelligence (AI) technologies will be added to the BlueField-4, implying that the massively parallel hardware used for Nvidia’s graphics processing units (GPUs) is to be grafted onto its next-but-one DPU.
Why add AI acceleration? And will it change the DPU, a relatively new processor class?
Data processing units
Nvidia defines the DPU as a programmable device for networking.
The chip combines general-purpose processing – multiple RISC cores used for control-plane tasks and programmed in a high-level language – with accelerator units tailored for packet-processing data-plane tasks.
“The accelerators perform functions for software-defined networking, software-defined storage and software-defined security,” says Kevin Deierling, senior vice president of networking at Nvidia.
The DPU can be added to a Smart Network Interface Card (SmartNIC) that complements the server’s CPU, taking over the data-intensive tasks that would otherwise burden the server’s most valuable resource.
Other customers use the DPU as a standalone device. “There is no CPU in their systems,” says Deierling.
Storage platforms are one such example, what Deierling describes as a narrowly-defined workload. “They don’t need a CPU and all its cores; what they need is the acceleration capabilities built into the DPU, and a relatively small amount of compute to perform the control-path operations,” says Deierling.
Since the DPU is the server’s networking gateway, it supports PCI Express (PCIe). The PCIe bus interfaces to the host CPU, to accelerators such as GPUs, and supports NVMe storage. NVMe is a non-volatile memory host controller interface specification.
BlueField 3
When announced in 2021, the 22-billion transistor BlueField-3 chip was scheduled to sample this quarter. “We need to get the silicon back and do some testing and validation before we are sampling,” says Deierling.
The device is a scaled-up version of the BlueField-2: it doubles the throughput to 400Gbps and includes more CPU cores: 16 Cortex-A78 64-bit ARM cores.
Nvidia deliberately chose not to use more powerful ARM cores. “The ARM is important, there is no doubt about it, and there are newer classes of ARM,” says Deierling. “We looked at the power and the performance benefits you’d get by moving to one of the newer classes and it doesn’t buy us what we need.”
The BlueField-3 has the equivalent processing performance of 300 X86 CPU cores, says Nvidia, but this is due mainly to the accelerator units, not the ARM cores.
The BlueField-3 input-output (I/O) includes Nvidia’s ConnectX-7 networking unit that supports 400 Gigabit Ethernet (GbE), which can be split over 1, 2 or 4 ports. The DPU also doubles the InfiniBand interface compared to the BlueField-2: either a single 400Gbps (NDR) port or two 200Gbps (HDR) ports. There are also 32 lanes of PCI Express 5.0, each lane supporting 32 giga-transfers-per-second (GT/s) in each direction.
The memory interface is two DDR5 channels, doubling both the memory performance and the channel count of the BlueField-2.
The data path accelerator (DPA) of the BlueField-3 comprises 16 cores, each supporting 16 instruction threads. Typically, when a packet arrives, it is decrypted and its headers are inspected, after which the accelerators are used. The threads are used if the specific function needed is not accelerated: the packet is then assigned to a thread and processed.
“The DPA is a specialised part of our acceleration core that is highly programmable,” says Deierling.
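A sketch of the dispatch logic just described: fixed-function accelerators handle known tasks, while anything else lands on one of the DPA's 16 cores by 16 threads. The names and the round-robin policy are illustrative, not Nvidia's API:

```python
# Hypothetical dispatch sketch; names are illustrative, not Nvidia's API.
import itertools

ACCELERATED = {"decrypt", "checksum", "overlay_parse"}
N_CORES, THREADS_PER_CORE = 16, 16
_next_thread = itertools.count()

def dispatch(needed_function: str) -> str:
    if needed_function in ACCELERATED:
        return f"accelerator handles {needed_function}"
    slot = next(_next_thread) % (N_CORES * THREADS_PER_CORE)  # 256 threads
    core, thread = divmod(slot, THREADS_PER_CORE)
    return f"core {core}, thread {thread} runs {needed_function}"

print(dispatch("decrypt"))            # offloaded to fixed-function hardware
print(dispatch("custom_telemetry"))   # falls back to a programmable DPA thread
```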
Other programmable logic blocks include the accelerated switching and packet processing (ASAP2) engine that parses packets. It inspects packet fields looking for a match that tells it what to do, such as dropping the packet or rewriting its header.
In-line acceleration
One important task the BlueField-3 implements is security.
A packet can have many fields and encapsulations. For example, the fields can include a TCP header, quality of service, a destination IP and an IP header. These can be encapsulated into an overlay such as VXLAN and further encapsulated into a UDP packet before being wrapped in an outer IP datagram that is encrypted and sent over the network. Only the IPsec header is then exposed; the remaining fields are encrypted.
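A sketch of that nesting, built with the scapy packet library (assumed available); the addresses and VXLAN network identifier are made up for illustration:

```python
# Illustrative encapsulation stack; addresses and VNI are invented.
from scapy.all import Ether, IP, UDP, TCP, VXLAN

inner = Ether() / IP(dst="10.0.0.2") / TCP(dport=443)     # the tenant packet
overlay = IP(dst="192.168.1.2") / UDP(dport=4789) / VXLAN(vni=42) / inner
overlay.show()   # every layer above is visible to the DPU's parsers

# On the wire, this whole stack would be wrapped in an outer IP datagram
# and encrypted with IPsec; only the IPsec header then stays visible,
# which is why in-line decryption, described next, matters.
```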
Deierling says the BlueField-3 does the packet encryption and decryption in-line.
For example, the DPU uses in-line IPsec decode to expose the headers of the various virtual network interfaces – the overlays – of a received packet. Picking the required overlay, the packet is sent through a service-function chain that uses all the available accelerators, tackling distributed denial-of-service attacks and implementing a firewall and load balancing.
“You can do storage, you can do an overlay, receive-side scaling [RSS], checksums,” says Deierling. “All the accelerations built into the DPU become available.”
Without in-line processing, the received packet goes through a NIC and into the memory of the host CPU. There, it is still encrypted and hence opaque; the packet’s fields can’t benefit from the various acceleration techniques. “It is already in memory when it is decrypted,” says Deierling.

Often, with the DPU, the received packet is decrypted and passed to the host CPU where the full packet is visible. Then, once the host application has processed the data, the data and packet may be encrypted again before being sent on.
“In a ‘zero-trust’ environment, there may be a requirement to re-encrypt the data before sending it onto the next hop,” says Deierling. “In this case, we just reverse the pipeline.”
An example is confidential healthcare information where data needs to be encrypted before being sent and stored.
DPU evolution
There are many applications set to benefit from DPU hardware. These cover the many segments Nvidia is addressing, including AI, virtual worlds, robotics, self-driving cars, 5G and healthcare.
All need networking, storage and security. “Those are the three things we do but it is software-defined and hardware-accelerated,” says Deierling.
Nvidia has an ambitious target of launching a new DPU every 18 months. That suggests the BlueField-4 could sample as early as the end of 2023.
The 800-gigabit BlueField-4 will have 64 billion transistors and nearly quadruple the integer processing performance of the BlueField-3: from 42 to 160 SPECint.
Nvidia says its DPUs, including the BlueField-4, are evolutionary in how they scale the ARM cores, accelerators and throughput. However, the AI acceleration hardware added to the BlueField-4 will change the nature of the DPU.
“What is truly salient is that [1,000] TOPS number,” says Deierling. “And that is an AI acceleration; that is leveraging capabilities Nvidia has on the GPU side.”
Self-driving cars, 5G and robotics
An AI-assisted DPU will support such tasks as video analytics, 5G and robotics.
For self-driving cars, the DPU will reside in the data centre, not in the car. But that too will change. “Frankly, the car is becoming a data centre,” notes Deierling.
Deep learning currently takes place in the data centre but as the automotive industry adopts Ethernet, a car’s sensors – lidar, radar and cameras – will send massive amounts of data which an IC must comprehend.
This is relevant not just for automotive but all applications where data from multiple sensors needs to be understood.
Deierling describes Nvidia as an AI-on-5G company.
“We have a ton of different things that we are doing and for that, you need a ton of parallel-processing capabilities,” he says. This is why the BlueField-4 is massively expanding its TOPS rating.
He describes how a robot on an automated factory floor will eventually understand its human colleagues.
“It is going to recognize you as a human being,“ says Deierling. “You are going to tell it: ‘Hey, stand back, I’m coming in to look at this thing’, and the robot will need to respond in real-time.”
Video analytics, voice processing, and natural language processing are all needed while the device will also be running a 5G interface. Here, the DPU will reside in a small mobile box: the robot.
“Our view of 5G is thus more comprehensive than just a fast pipe that you can use with a virtual RAN [radio access network] and Open RAN,” says Deierling. “We are looking at integrating this [BlueField-4] into higher-level platforms.”
Compute vendors set to drive optical I/O innovation

Part 2: Data centre and high-performance computing trends
Professor Vladimir Stojanovic has an engaging mix of roles.
When he is not a professor of electrical engineering and computer science at the University of California, Berkeley, he is the chief architect at optical interconnect start-up, Ayar Labs.
Until recently Stojanovic spent four days each week at Ayar Labs. But last year, more of his week was spent at Berkeley.
Stojanovic is a co-author of a 2015 Nature paper that detailed a monolithic electronic-photonics technology. The paper described a technological first: how a RISC-V processor communicated with the outside world using optical rather than electronic interfaces.
It is this technology that led to the founding of Ayar Labs.
Research focus
“We [the paper’s co-authors] always thought we would use this technology in a much broader sense than just optical I/O [input-output],” says Stojanovic.
This is now Stojanovic’s focus as he investigates applications such as sensing and quantum computing. “All sorts of areas where you can use the same technology – the same photonic devices, the same circuits – arranged in different configurations to achieve different goals,” says Stojanovic.
Stojanovic is also looking at longer-term optical interconnect architectures beyond point-to-point links.
Ayar Labs’ chiplet technology provides optical I/O when co-packaged with chips such as an Ethernet switch or an “XPU” – an IC such as a CPU or a GPU (graphics processing unit). The optical I/O can be used to link sockets, each containing an XPU, or even racks of sockets, to form ever-larger compute nodes to achieve “scale-out”.
But Stojanovic is looking beyond that, including optical switching, so that tens of thousands or even hundreds of thousands of nodes can be connected while still maintaining low latency to boost certain computational workloads.
This, he says, will require not just different optical link technologies but also figuring out how applications can use the software protocol stack to manage these connections. “That is also part of my research,” he says.
Optical I/O
Optical I/O has now become a core industry focus given the challenge of meeting the data needs of the latest chip designs. “The more compute you put into silicon, the more data it needs,” says Stojanovic.
Within the packaged chip, there is efficient, dense, high-bandwidth and low-energy connectivity. But outside the package, there is a very sharp drop in performance, and outside the chassis, the performance hit is even greater.
Optical I/O promises a way to exploit that silicon bandwidth to the full, without dropping the data rate anywhere in a system, whether across a shelf or between racks.
This has the potential to build more advanced computing systems whose performance is already needed today.
Just five years ago, says Stojanovic, artificial intelligence (AI) and machine learning were still in their infancy, and so were the associated massively parallel workloads that required all-to-all communications.
Fast forward to today, such requirements are now pervasive in high-performance computing and cloud-based machine-learning systems. “These are workloads that require this strong scaling past the socket,” says Stojanovic.
He cites natural language processing, which within 18 months has grown 1,000x in terms of the memory required: from hosting a billion to a trillion parameters.
“AI is going through these phases: computer vision was hot, now it’s recommender models and natural language processing,” says Stojanovic. “Each generation of application is two to three orders of magnitude more complex than the previous one.”
Such computational requirements will only be met using massively parallel systems.
“You can’t develop the capability of a single node fast enough, cramming more transistors and using high-bandwidth memory,“ he says. High-bandwidth memory (HBM) refers to stacked memory die that meet the needs of advanced devices such as GPUs.
Co-packaged optics
Yet, if you look at the headlines over the last year, it appears that it is business as usual.
For example, there has been a Multi-Source Agreement (MSA) announcement for new 1.6-terabit pluggable optics. And while co-packaged optics for Ethernet switch chips continues to advance, it remains a challenging technology; Microsoft has said it will only be in late 2023 that it starts using co-packaged optics in its data centres.
Stojanovic stresses there is no inconsistency here: it comes down to what kind of bandwidth barrier is being solved and for what kind of application.
In the data centre, it is clear where the memory fabric ends and where the networking – implemented using pluggable optics – starts. That said, this boundary is blurring: there is a need for transactions between many sockets and their shared memory. He cites Nvidia’s NVLink and AMD’s Infinity Fabric links as examples.
“These fabrics have very different bandwidth densities and latency needs than the traditional networks of Infiniband and Ethernet,” says Stojanovic. “That is where you look at what physical link hardware answers the bottleneck for each of these areas.”
Co-packaged optics is focussed on continuing the scaling of Ethernet switch chips. It is a more scalable solution than pluggables and even on-board optics because it eliminates long copper traces that need to be electrically driven. That electrical interface has to escape the switch package, and that gives rise to that package-bottleneck problem, he says.
There will be applications where pluggables and on-board optics will continue to be used. But they will still need power-consuming retimer chips and they won’t enable architectures where a chip can talk to any other chip as if they were sharing the same package.
“You can view this as several different generations, each trying to address something but the ultimate answer is optical I/O,” says Stojanovic.
How optical connectivity is used also depends on the application, and it is this diversity of workloads that is challenging the best of the system architects.
Application diversity
Stojanovic cites one machine learning approach for natural language processing that Google uses that scales across many compute nodes, referred to as the mixture-of-experts (MoE) technique.

A processing pipeline is replicated across machines, each performing part of the learning. For the algorithm to work in parallel, each must exchange its data set – its learning – with every other processing pipeline, a stage referred to as all-to-all dispatch and combine.
“As you can imagine, all-to-all communications is very expensive,” says Stojanovic. “There is a lot of data from these complex, very large problems.”
Not surprisingly, as the number of parallel nodes used grows, a greater proportion of the overall time is spent exchanging the data.
Using 1,000 AI processors running 2,000 experts, a third of the time is required for data exchange. Scale the hardware to 3,000 to 4,000 AI processors and communications dominate the runtime.
This, says Stojanovic, is a very interesting problem to have: it’s an example where adding more compute simply does not help.
“It is always good to have problems like this,” he says. “You have to look at how you can introduce some new technology that will be able to resolve this to enable further scaling, to 10,000 or 100,000 machines.”
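An illustrative model of this scaling behaviour shows why adding compute does not help; the constants below are chosen only to reproduce the quoted one-third figure at 1,000 processors and are not data from the talk:

```python
# Illustrative model only: per-node compute shrinks as nodes are added,
# while the all-to-all exchange cost per node does not, so
# communication's share of the runtime grows with scale.

def comm_fraction(nodes, total_compute=1e6, exchange_per_node=500):
    compute = total_compute / nodes   # the compute parallelises...
    return exchange_per_node / (compute + exchange_per_node)  # ...exchange doesn't

for n in (1000, 2000, 4000):
    print(f"{n} processors: {comm_fraction(n):.0%} of runtime is data exchange")
# 1000 processors: 33% ... 4000 processors: 67% - communications dominate
```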
He says such examples highlight how optical engineers must also have an understanding of systems and their workloads and not just focus on ASIC specifications such as bandwidth density, latency and energy.
Because of the diverse workloads, what is needed is a mixture of circuit switching and packet switching interconnect.
Stojanovic says high-radix optical switching can connect up to a thousand nodes and, scaling to two hops, up to a million nodes in sub-microsecond latencies. This suits streamed traffic.

But an abundance of I/O bandwidth is also needed to attach to other types of packet switch fabrics. “So that you can also handle cache-line size messages,” says Stojanovic.
These are 64 bytes long and are found with processing tasks such as Graph AI where data searches are required, not just locally but across the whole memory space. Here, transmissions are shorter and involve more random addressing and this is where point-to-point optical I/O plays a role.
“It is an art to architect a machine,” says Stojanovic.
Disaggregation
Another data centre trend is server disaggregation which promises important advantages.
The only memory that meets the GPU requirements is HBM, but it is becoming difficult to realise taller and taller HBM stacks. Stojanovic cites as an example how Nvidia came out with its A100 GPU with 40GB of HBM, quickly followed a year later by an 80GB A100 version.
Some customers had to do a complete overhaul of their systems to upgrade to the newer A100, yet welcomed the doubling of memory because of the exponential growth in AI workloads.
By disaggregating a design – decoupling the compute and memory into separate pools – memory can be upgraded independently of the computing. In turn, pooling memory means multiple devices can share the memory and it avoids ‘stranded memory’ where a particular CPU is not using all its private memory. Having a lot of idle memory in a data centre is costly.
If the I/O to the pooled memory can be made fast enough, it promises to allow GPUs and CPUs to access common DDR memory.
“This pooling, with the appropriate memory controller design, equalises the playing field of GPUs and CPUs being able to access jointly this resource,” says Stojanovic. “That allows you to provide way more capacity – several orders more capacity of memory – to the GPUs but still be within a DRAM read access time.”
Such access time is 50-60ns overall from the DRAM banks and through an optical I/O. The pooling also means that the CPUs no longer have stranded memory.
“Now something that is physically remote can be logically close to the application,” says Stojanovic.
Challenges
For optical I/O to enable such system advances, what is needed is an ecosystem of companies. Adding an optical chiplet alongside an ASIC is not the issue; chiplets are already used by the chip industry. Instead, the ecosystem is needed to address such practical matters as attaching fibres and producing the lasers needed. This requires collaboration among companies across the optical industry.
“That is why the CW-WDM MSA is so important,” says Stojanovic. The MSA defines the wavelength grids for parallel optical channels and is an example of what is needed to launch an ecosystem and enable what system integrators and ultimately the hyperscalers want to do.
Systems and networking
Stojanovic concludes by highlighting an important distinction.
The XPUs have their own design cycles and, with each generation, new features and interfaces are introduced. “These are the hearts of every platform,” says Stojanovic. Optical I/O needs to be aligned with these devices.
The same applies to switch chips that have their own development cycles. “Synchronising these and working across the ecosystem to be able to find these proper insertion points is key,” he says.
But this also implies that the attention given to the interconnects used within a system (or between several systems i.e. to create a larger node) will be different to that given to the data centre network overall.
“The data centre network has its own bandwidth pace and needs, and co-packaged optics is a solution for that,“ says Stojanovic. “But I think a lot more connections get made, and the rules of the game are different, within the node.”
Companies will start building very different machines to differentiate themselves and meet the huge scaling demands of applications.
“There is a lot of motivation from computing companies and accelerator companies to create node platforms, and they are freer to innovate and more quickly adopt new technology than in the broader data centre network environment,” he says.
When will this become evident? In the coming two years, says Stojanovic.
Marvell's 50G PAM-4 DSP for 5G optical fronthaul

- Marvell has announced the first 50-gigabit 4-level pulse-amplitude modulation (PAM-4) physical layer (PHY) for 5G fronthaul.
- The chip completes Marvell’s comprehensive portfolio for 5G radio access network (RAN) and x-haul (fronthaul, midhaul and backhaul).
Marvell has announced what it claims is an industry-first: a 50-gigabit PHY for the 5G fronthaul market.
Dubbed the AtlasOne, the PAM-4 PHY chip also integrates the laser driver. Marvell claims this is another first: implementing the directly modulated laser (DML) driver in CMOS.
“The common thinking in the industry has been that you couldn’t do a DML driver in CMOS due to the current requirements,” says Matt Bolig, director, product marketing, optical connectivity at Marvell. “What we have shown is that we can build that into CMOS.”
Marvell, through its Inphi acquisition, says it has shipped over 100 million ICs for the radio access network (RAN) and estimates that its silicon is in networks supporting 2 billion cellular users.
“We have been in this business for 15 years,” says Peter Carson, senior director, solutions marketing at Marvell. “We consider ourselves the number one merchant RAN silicon provider.”
Inphi started shipping its Polaris PHY for 5G midhaul and backhaul markets in 2019. “We have over a million ships into 5G,” says Bolig. Now Marvell is adding its AtlasOne PHY for 5G fronthaul.
Mobile traffic
Marvell says wireless data has been growing at a compound annual growth rate (CAGR) of over 60 per cent (2015-2021). Such relentless growth is forcing operators to upgrade their radio units and networks.
Stéphane Téral, chief analyst at market research firm, LightCounting, in its latest research note on Marvell’s RAN and x-haul silicon strategy, says that while 5G rollouts are “going gangbusters” around the world, they are traditional RAN implementations.
By that Téral means 5G radio units linked to a baseband unit that hosts both the distributed unit (DU) and centralised unit (CU).
But as 5G RAN architectures evolve, the baseband unit is being disaggregated, separating the DU and CU. This is happening because the RAN is such an integral and costly part of the network, and operators want to move away from vendor lock-in and expand their marketplace options.
For RAN, this means splitting the baseband functions and standardising interfaces that previously were hidden within custom equipment. Splitting the baseband unit also allows the functionality to be virtualised and be located separately, leading to the various x-haul options.
RAN disaggregation takes several forms, including virtualised RAN and Open RAN. Marvell says Open RAN is still in its infancy but is a key part of operators’ desire to virtualise and disaggregate their networks.
“Every Open RAN operator that is doing trials or early-stage deployments is also virtualising and disaggregating,” says Carson.
RAN disaggregation is also occurring in the vertical domain: the baseband functions and how they interface to the higher layers of the network. Such vertical disaggregation is being undertaken by the likes of the ONF and the O-RAN Alliance.
The disaggregated RAN – a mixture of the radio, DU and CU units – can still be located at a common site but more likely will be spread across locations.
Fronthaul is used to link the radio unit and DU when they are at separate locations. In turn, the DU and CU may also be at separate locations with the CU implemented in software running on servers deep within the network. Separating the DU and the CU is leading to the emergence of a new link: midhaul, says Téral.
Fronthaul speeds
Marvell says that the first 5G radio deployments use 8-transmitter/8-receiver (8T/8R) multiple-input multiple-output (MIMO) systems.
MIMO is a signal processing technique for beamforming, allowing operators to localise the capacity offered to users. An operator may use tens of megahertz of radio spectrum in such a configuration, with the result that the radio unit traffic requires a 10Gbps fronthaul link to the DU.
Leading operators are now deploying 100MHz of radio spectrum and massive MIMO – up to 32T/32R. Such a deployment requires 25Gbps fronthaul links.
“What we are seeing now is those leading operators, starting in the Asia Pacific, while the US operators have spectrum footprints at 3GHz and soon 5-6GHz, using 200MHz instantaneous bandwidth on the radio unit,” says Carson.
Here, an even higher-order 64T/64R massive MIMO will be used, driving the need for 50Gbps fronthaul links. Samsung has demonstrated the use of 64T/64R MIMO, enabling up to 16 spatial layers and boosting capacity by 7x.
“Not only do you have wider bandwidth, but you also have this capacity boost from spatial layering which carriers need in the ‘hot zones’ of their networks,” says Carson. “This is driving the need for 50-gigabit fronthaul.”
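The cited deployments suggest the fronthaul rate scales roughly linearly with the radio's occupied spectrum. A back-of-the-envelope sketch (illustrative only; the 40MHz figure for the 8T/8R case and the linear model are assumptions, not an eCPRI dimensioning calculation):

```python
# Linear fronthaul-rate estimate anchored to the 100MHz -> 25Gbps example
# above. With an eCPRI-style split the DU absorbs the spatial processing,
# which is why occupied bandwidth, more than antenna count, sets the rate.
def fronthaul_gbps(bandwidth_mhz: float, gbps_per_mhz: float = 0.25) -> float:
    return bandwidth_mhz * gbps_per_mhz

for antennas, bw in ((8, 40), (32, 100), (64, 200)):
    print(f"{antennas}T/{antennas}R, {bw}MHz -> ~{fronthaul_gbps(bw):.0f}Gbps")
# 8T/8R, 40MHz -> ~10Gbps; 32T/32R, 100MHz -> ~25Gbps; 64T/64R, 200MHz -> ~50Gbps
```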
AtlasOne PHY
Marvell says its AtlasOne PAM-4 PHY chip for fronthaul supports an industrial temperature range and reduces power consumption by a quarter compared to its older PHYs. The power-saving is achieved by optimising the PHY’s digital signal processor and by integrating the DML driver.
Earlier this year Marvell announced its 50G PAM-4 Atlas quad-PHY design for the data centre. The AtlasOne uses the same architecture but differs in that it is integrated into a package for telecom and integrates the DML driver but not the trans-impedance amplifier (TIA).
“In a data centre module, you typically have the TIA and the photo-detector close to the PHY chip; in telecom, the photo-detector has to go into a ROSA (receiver optical sub-assembly),” says Bolig. “And since the photo-detector is in the ROSA, the TIA ends up having to be in the ROSA as well.”
The AtlasOne also supports 10-gigabit and 25-gigabit modes. Not all links will need 50 gigabits but deploying the PHY future-proofs the link.
The device will start going into modules in early 2022, followed by field trials in the summer. Marvell expects the 50G fronthaul market to start in 2023.
RAN and x-haul IC portfolio
One of the challenges of virtualising the RAN is the layer-one processing, which requires more computation than software running on a general-purpose processor can handle.
Marvell supplies two chips for this purpose: the Octeon Fusion and the Octeon 10 data processing unit (DPU), which provide programmability as well as the specialised hardware accelerator blocks needed for 4G and 5G. “You just can’t deploy 4G or 5G on a software-only architecture,” says Carson.
As well as these two ICs and its PHY families for the various x-haul links, Marvell also has a coherent DSP family for backhaul (see diagram). Indeed, LightCounting’s Téral notes how Marvell has all the key components for an all-RAN 5G architecture.
Marvell also offers a 5G virtual RAN (VRAN) DU card that uses the Octeon Fusion IC and says it already has five design wins with major cloud and OEM customers.
Nokia's 4.8-terabit FP5 packet-processing chipset

Part 1: IP routing: Nokia’s latest FP5 and router platforms
Nokia has unveiled its latest packet-processing silicon that will be the mainstay of its IP router platforms for years to come.
The FP5 chipset is rated at 4.8 terabits-per-second (Tbps), a twelvefold improvement in Nokia’s packet-processing silicon performance in a decade. (See chart.)

Communications service provider (CSP) BT says Nokia’s 7750 router platforms equipped with the FP5 chipset will deliver every use case it needs for its Multi Service Edge; from core routing, MPLS-VPN, broadband network gateways (BNG), to mobile backhaul and Ethernet.
The FP5 announcement comes four years after Nokia unveiled its existing flagship router chipset, the FP4. The FP4 was announced as a 2.4Tbps chipset but Nokia upgraded its packet-processing rating to 3Tbps.
“We announced what we knew but then, through subsequent development and testing, the performance ended up at 3Tbps,” says Heidi Adams, head of IP and optical networks marketing at Nokia.
The FP5 may also exceed its initial 4.8Tbps rating.
Nokia will use the FP5 to upgrade its existing platforms and power new router products; it will not license the chipset nor will it offer it for use in open router platforms.
Nokia’s chipset evolution
At the heart of Nokia’s router silicon is a 2D array of packet processing cores.
The FP3, announced in 2011 by Alcatel-Lucent (acquired by Nokia in 2016), used 288 packet processing cores arranged in a 32×9 array. Each row of cores acted as a packet-processing pipeline that could be partitioned to perform independent tasks. The array’s columns performed table look-ups and each column could be assigned several tasks.
Nokia didn’t detail how the FP4 upgraded the array of cores. But the performance enhancement was significant; the FP4 delivers a 7.5x improvement in packet processing performance compared to the FP3.
The 16nm CMOS FP4 chipset includes a traffic manager (q-chip), packet processor (p-chip), the t-chip that interfaces to the router fabric, and what was then a new chip, the e-chip.
The e-chip acts as a media access controller (MAC) that parcels data from the router’s client-side pluggable optical modules for the p-chip.
Nokia even designed custom memory for the FP4 whereby instructions can be executed during memory accesses, and the memory can be allocated to different types of look-up and buffering, depending on requirements.
To maximise the memory’s performance, Nokia used advanced packaging for the FP4’s p-chip and q-chip. The resulting 2.5D-packaged p-chip comprises the packet processor die and stacks of memory. The q-chip is also a multi-chip module containing RISC processors and buffering memory.
The FP4 uses 56Gbps PAM-4 serialiser-deserialiser (serdes) interfaces, technology that Nokia secured from Broadcom.
FP5’s features
The FP5 builds on the major architectural upgrade undertaken with the FP4.
Using a 7nm CMOS process technology, Nokia’s FP5 designers have combined on-chip what were two separate FP4 chips: the packet processor (p-chip) and traffic manager (q-chip).
The FP5 chipset consumes a quarter of the power of the FP4 in terms of watts-per-gigabit (0.1W/Gig for the FP5 compared to the FP4’s 0.4W/Gig).
Consolidating two chips into one accounts for part of the power savings. Using 112Gbps serdes and a more advanced CMOS process are other factors.
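Putting those two ratings together gives a sense of scale; a minimal arithmetic sketch using only the figures quoted above:

```python
# Power at full 4.8Tbps throughput, from the quoted watts-per-gigabit figures.
FP4_W_PER_GBPS = 0.4
FP5_W_PER_GBPS = 0.1
CAPACITY_GBPS = 4800  # the FP5's 4.8Tbps rating

print(f"FP5: ~{CAPACITY_GBPS * FP5_W_PER_GBPS:.0f}W")                       # ~480W
print(f"FP4 efficiency, same load: ~{CAPACITY_GBPS * FP4_W_PER_GBPS:.0f}W")  # ~1920W
print(f"ratio: {FP5_W_PER_GBPS / FP4_W_PER_GBPS:.2f}")                      # 0.25, a quarter
```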
Nokia has also added encryption hardware blocks to the chip’s ports. The hardware blocks implement the MACsec algorithm and can also encrypt layer 2.5 and layer 3 traffic.
The chipset can handle packet flows as large as 1.6 terabits-per-second. “We don’t have any physical interfaces that support flows at that rate,” says Adams. “It’s an indicator that the chipset is ready for much more.”
The e-chip, which Nokia describes as a tremendously important device, has also been upgraded. As well as the MAC function, it acts as an early-stage packet processor, performing pre-processing and pre-classification tasks on the traffic.
The e-chip also performs pre-buffering for the packet processor. Using multiple such devices allows the line card to expand the forwarding limit of the FP5’s packet processor. This enables Nokia’s routers to perform what it calls intelligent aggregation (IA). “We can bring in more traffic, increase the number of ingress ports even if those ports start to get fully loaded, because of the chipset architecture being fully buffered,” says Adams. “The result is a 30 per cent uplift in the stated capacity numbers.”
The FP5 chipset has been taped out and the silicon is being tested in Nokia’s lab.
Router platforms
IP core routers are tasked with moving large amounts of IP traffic across a network backbone. IP edge routers, in contrast, typically aggregate a variety of services such as mobile transport, residential traffic or act as gateways.

The platforms that will use the FP5 are classified by Nokia as edge routers. “The boundaries have blurred,” says Adams. “It is more important to look at how applications are deployed and what the requirements are.”
The platforms using the FP5 are the existing 7750 SR-14s and 7750 SR-7s routers that were announced with the launch of the FP4.
These chassis were designed to accommodate Nokia’s current and next-generation router cards. “This allows operators to retain the same chassis and support a mix of FP4 and FP5 cards, growing into them gradually,” says Adams.
Nokia has announced three other platforms: two mid-range platforms, the 7750 SR-2se and the 7750 SR-1se, and the 7750 SR-1 that will be available in six variants. “They [the SR-1 boxes] are going to be available in a range of configurations and different port speeds,” says Adams.
Platforms using the FP5 chipset will ship in the first half of 2022, starting with the SR-1.
Nokia also announced an FP5 expandable media adaptor (XMA) line card for the non-fixed platforms (the 7750 SR-14s/SR-7s and SR-2se). The card supports 36 pluggable slots and, with 400 Gigabit Ethernet (GbE), has a capacity of 14.4Tbps full-duplex or 19.2Tbps in intelligent aggregation mode.
The card will also support 400ZR and ZR+ coherent modules and is ready for 800GbE pluggables that will double the card’s capacity ratings.
Nokia says the FP5 improves the throughput of the XMA card by a factor of three: Nokia’s 4.8Tbps XMA (12Tbps IA) uses four FP4 chipsets while the latest 14.4Tbps (19.2Tbps IA) XMA uses six FP5 chipsets.
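The card's headline figures are consistent with simple lane arithmetic, sketched below (treating each chipset's rating as covering both directions of a full-duplex port is an assumption):

```python
# XMA capacity arithmetic from the figures quoted above.
slots, port_gbe = 36, 400
card_tbps = slots * port_gbe / 1000
print(f"{slots} x {port_gbe}GbE = {card_tbps}Tbps full-duplex")   # 14.4Tbps

ia_tbps = 19.2
print(f"IA uplift: {ia_tbps / card_tbps - 1:.0%}")  # ~33%, quoted as "30 per cent"

# Six FP5 chipsets at 4.8Tbps each cover the card if each rating spans both
# directions (an assumption): 6 x 4.8 / 2 = 14.4Tbps full-duplex.
print(6 * 4.8 / 2)
```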
Custom silicon versus disaggregated designs
Nokia says the benefits of having its own chipset justify the intellectual effort and development expense, even when advanced merchant silicon is available and certain CSPs are embracing open disaggregated routers.
“We feel there is a need in the industry for platforms based on this kind of technology,” says Adams.
What is important is the total cost of ownership and that Nokia’s systems are deployed in critical networks where resiliency, reliability, the feature set and network security are all critical, says Adams.
Nokia also points to the progress it has made since the launch of the FP4. “We have secured 350 projects, two-thirds of which were new footprints or competitive displacements,” says Adams. Nokia’s IP revenues in 2020 were $3.2 billion.
That said, Nokia also partners with merchant silicon vendors: the 7250 IXR interconnect router uses merchant silicon, for example.
“If I look at disaggregation, absolutely, it is an interesting area,” says Adams. “But I think it is very early days.”
Neil McRae, managing director and chief architect at BT, says that while some operators are looking at disaggregated software and hardware, BT doesn’t believe this is necessarily the best solution in terms of performance, reliability or cost.

“Increasingly, the ratio of capital investment in core networking is moving towards optical transceivers than router silicon,” says McRae. “But to get the most out of the network and the router, using custom silicon for the most demanding cases still delivers the best outcomes.
“In our live network but also in our testing, the integrated solution is more reliable, easier to operate and a significant improvement from a total cost of ownership point of view,” says McRae.
BT says it will be able to scale interfaces on the 7750 from 1Gbps to 400Gbps using the FP5 and Nokia’s SR OS routing software.
BT also highlights the importance of reliability under demand, pointing out how the CSP’s traffic has doubled during the pandemic without impacting its customers.
“Nokia’s understanding of how the underlying silicon is going to react in different situations gives them a significant advantage in building the software on top that performs in challenging situations,” says McRae.
Chip strategy
Nokia says that were it to sell its FP5 silicon as a standalone product, it would enter a very different design environment.
“You are designing to the requirements of multiple customers versus designing for your systems,” says Adams.
Nokia’s belief is that there is strong demand for purpose-designed platforms.
“We are staying true to that strategy,” says Adams.
Microchip’s compact, low-power 1.6-terabit PHY

Microchip Technology’s latest physical layer (PHY) chip has been developed for next-generation line cards.
The PM6200 Meta-DX2L (the ‘L’ is for light) 1.6-terabit chip is implemented using TSMC’s 6nm CMOS process. It is Microchip’s first PHY to use 112-gigabit PAM-4 (4-level pulse-amplitude modulation) serialiser/deserialiser (serdes) interfaces.
Microchip’s existing 16nm CMOS Meta-DX1 PHY devices are rated at 1.2 terabits and use 56-gigabit PAM-4 serdes.
System vendors developing line cards that double the capacity of their switch, router or transport systems are being challenged by space and power constraints, says Microchip. To that end, the company has streamlined the Meta-DX2L to create a compact, lower-power chip.
“One of the things we have focussed on is the overall footprint of our [IC] design to ensure that people can realise their cards as they go to the 112-gigabit PAM-4 generation,” says Stephen Docking, manager, product marketing, communications business unit, at Microchip.
The company says the resulting package measures 23x30mm and reduces the power per port by 35 per cent compared to the Meta-DX1.
IC architecture
The Meta-DX1 family of 1.2-terabit physical layer (PHY) Ethernet chips effectively comprises three 400-gigabit cores and supports the OIF’s Flexible Ethernet (FlexE) protocol and MACsec encryption.

The Meta-DX1 devices, launched in 2019, support the Precision Time Protocol (PTP), used to synchronise clocks across a network with the high accuracy that 5G requires.
The new Meta-DX2L is a single chip although Microchip hints that other family devices will follow.
The Meta-DX2L can be viewed as comprising two 800-gigabit cores. The chip does away with FlexE and the PTP protocol but includes retiming and gearbox modes. The gearbox is used to translate between 28, 56 and 112-gigabit rates.
“We still see customers working on FlexE designs, so the lack of it [with the Meta-DX2L] is not due to limited market demand but how we chose to optimise the chip,” says Docking.
The same applies to PTP. The Meta-DX1 performs time stamping that meets 5G’s Class C and Class D front-haul clocking requirements. “The difference with the Meta-DX2L is that it is not doing time stamping,” says Docking. But it can work with devices doing the time stamping.
“In a 5G system, if you add a PHY, you need to do it in such a way that it doesn’t add any uncertainty in the overall latency of the system,” says Docking. “So we have focussed on the device having a constant latency.” This means the Meta-DX2L can be used in systems meeting Class C or Class D clocking requirements.
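Returning to the gearbox modes mentioned above: translating between 28, 56 and 112-gigabit rates is a matter of bundling or splitting lanes so aggregate bandwidth is conserved. A minimal sketch (gearbox_lanes is a hypothetical helper, not Microchip's API):

```python
# Gearbox arithmetic: how many output lanes carry the same aggregate
# bandwidth as the input lanes. Illustrative only.
def gearbox_lanes(in_rate_g: int, in_lanes: int, out_rate_g: int) -> int:
    total = in_rate_g * in_lanes
    if total % out_rate_g:
        raise ValueError("rates do not divide evenly")
    return total // out_rate_g

print(gearbox_lanes(56, 2, 112))   # 2x56G -> 1x112G
print(gearbox_lanes(28, 4, 112))   # 4x28G -> 1x112G
print(gearbox_lanes(112, 1, 56))   # 1x112G -> 2x56G
```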
The chip also features a 16×16 crosspoint switch that allows customers to use different types of optical modules and interface them to a line card’s ASIC or digital signal processor (DSP).
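A crosspoint of this kind is, in effect, a programmable lane map. A minimal model under that interpretation (the Crosspoint class is hypothetical, not Microchip's configuration interface):

```python
# Toy 16x16 crosspoint: any input lane can be routed to any output lane,
# letting a designer pair arbitrary optical-module lanes with ASIC/DSP
# lanes without re-spinning the board. Illustrative only.
class Crosspoint:
    def __init__(self, ports: int = 16):
        self.ports = ports
        self.route = {}  # output lane -> input lane

    def connect(self, in_lane: int, out_lane: int) -> None:
        if not (0 <= in_lane < self.ports and 0 <= out_lane < self.ports):
            raise ValueError("lane out of range")
        self.route[out_lane] = in_lane

xp = Crosspoint()
for i in range(4):            # module lanes 0-3 onto ASIC lanes 15-12,
    xp.connect(i, 15 - i)     # reversing lane order
print(xp.route)               # {15: 0, 14: 1, 13: 2, 12: 3}
```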
The Meta-DX2L’s two cores are flexible and support rates from 1 to 800 Gigabit Ethernet, says Docking.
As well as Ethernet rates, the device supports proprietary rates common with artificial intelligence (AI) and machine learning.
For AI, arrays of graphics processing units (GPUs) talk to each other on the same line card. “But to scale the system, you have to have multiple line cards talk to each other,” says Docking. “Different companies that design GPUs have chosen their own protocols to optimise their communications.”
Such links are not aligned with the Ethernet rates but the Meta-DX2L supports these proprietary rates.
Microchip says the Meta-DX2L will sample this quarter.
1.6 terabits, system resilience and design challenges
The PHY’s 1.6-terabit capacity was chosen based on customers’ requirements.
“If you look at the number of ports people want to support, it is often an even multiple of 800-gigabit ports,” says Docking.
The Meta-DX2L, like its predecessor PHY family, has a hitless 2:1 multiplexer. The multiplexer function is suited for centralised switch platforms where the system intelligence resides on a central card while the connecting line cards are relatively simple, typically comprising PHYs and optical modules.
In such systems, due to the central role of the platform’s switch card, a spare card is included. Should the primary card fail, the backup card kicks in, whereby all the switch’s line cards connect to the backup. The 2:1 multiplexer in the PHY means each line card is interfaced to both switch cards: the primary one and backup.
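The redundancy logic itself is simple; what makes the real device valuable is that the switchover is hitless, i.e. traffic is not dropped. A minimal sketch of the selection (select_switch_card and the health flags are hypothetical):

```python
# Each line card is wired to both switch cards; the 2:1 mux selects the
# healthy one, preferring the primary. Only the selection is modelled here,
# not the hitless switchover itself.
def select_switch_card(primary_ok: bool, backup_ok: bool) -> str:
    if primary_ok:
        return "primary"
    if backup_ok:
        return "backup"
    raise RuntimeError("no healthy switch card")

print(select_switch_card(True, True))   # primary
print(select_switch_card(False, True))  # backup, after a primary failure
```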

For line cards that will have 32 or 36 QSFP-DD800 pluggable modules, space is a huge challenge, says Docking: “So having a compact PHY is important.”
“The physical form factor has always been a challenge and then density plays into it and thermal issues,” says Kevin So, associate director, product line management and marketing, communications business unit, at Microchip. “And when you overlay the complexity of the transition from 56 to 112 gigabits, that makes it extremely challenging for board designers.”
Applications
The 1.6-terabit PHY is aimed at switching and routing platforms, compact data centre interconnect systems, optical transport and AI designs.
Which application takes off first depends on several developments. On one side of the PHY chip sits the optics and on the other the ASIC, whether a packet processor, switch chip, processor or DSP. “It’s the timing of those pieces that drive what applications you will see first,” says So.

“Switching and packet processor chips are transitioning to 112-gigabit serdes and you are also starting to see QSFP-DD or OSFP optics with 112-gigabit serdes becoming available,” adds Docking. “So the ecosystem is starting for those types of systems.”
The device is also being aimed at routers for 5G backhaul applications. Here data rates are in the 10- to 100-gigabit range. “But you are already starting to hear about 400-gigabit rates for some of these access backhaul routers,” says So.
And with 400 Gigabit Ethernet being introduced on access pizza-box routers for 5G this year, denser versions will likely follow in two years, when Microchip’s customers release their hardware, says So.
“And by then we’ll be talking about a DX3, who knows?” quips So.
Marvell's first Inphi chips following its acquisition

Marvell unveiled three new devices at the recent OFC virtual conference and show.
One chip is its latest coherent digital signal processor (DSP), dubbed Deneb. The other two chips, for use within the data centre, are a PAM-4 (4-level pulse-amplitude modulation) DSP, and a 1.6-terabit Ethernet physical layer device (PHY).
The chips are Marvell’s first announced Inphi products since it acquired the company in April. The Inphi acquisition adds $0.7 billion to Marvell’s $3 billion annual revenues, while Inphi’s more than 1,000 staff bring Marvell’s total headcount to 6,000.
Marvell spends 30 per cent of its revenues on R&D.
Acquisitions
Inphi is the latest of a series of Marvell acquisitions as it focusses on data infrastructure.
Marvell acquired two custom ASIC companies in 2019: Avera Semiconductor, originally the ASIC group of IBM Microelectronics, and Aquantia that has multi-gigabit PHY expertise.
A year earlier Marvell acquired processing and security chip player, Cavium Networks. Cavium had acquired storage specialist, QLogic, in 2017.
These acquisitions have more than doubled Marvell’s staff. Inphi brings electro-optics expertise for the data centre and optical transport and helps Marvell address the cloud and on-premises data centre markets as well as the 5G carrier market.
Marvell is also targeting the enterprise/campus market and what it highlights as a growth area, automotive. Nigel Alvares, vice president of solutions at Marvell, notes the growing importance of in-vehicle networking, what he calls a ‘data-centre-on-wheels’.
“Inphi’s technology could also help us in automotive as optical technologies are used for self-driving initiatives in future,” says Alvares.
Inphi also brings integration, co-packaging and multi-chip module expertise.

Merchant chip and custom ASIC offerings
Cloud operators and 5G equipment vendors are increasingly developing custom chip designs. Marvell says it is combining its portfolio with their intellectual property (IP) to develop and build custom ICs.
Accordingly, in addition to its merchant chips such as the three OFC-announced devices, Marvell partners with cloud players or 5G vendors, providing them with key IP blocks that are integrated with their custom IP. Marvell can also build the ASICs.
Another chip-design business model Marvell offers is the integration of different hardware in a multi-chip package. The components include a custom ASIC, merchant silicon, high-speed memory and third-party chiplets.
“We co-package and deliver it to a cloud hyperscaler or a 5G technical company,” says Alvares.
Marvell says this chip strategy serves two market sectors: the cloud hyperscalers and the telcos.
Cloud players are developing custom solutions as they become more vertically integrated. They also have deep pockets. But they can’t do everything because they are not chip experts, so they partner with companies like Marvell.
“The five to 10 hyperscalers in the world, they are doing so much creative stuff to optimise applications that they need custom silicon,” says Alvares.
The telcos, in contrast, are struggling to grow their revenues and favour merchant ICs, given they no longer have the R&D budgets of the past. It is at this split in the marketplace that Marvell targets its various chip services.
OFC announcements
At OFC, Marvell announced the Deneb coherent DSP, used for optical transport including the linking of equipment between data centres.
The Deneb DSP is designed with open standards in mind and complements the 400-gigabit CMOS Canopus DSP announced by Inphi in 2019.
Deneb adds the oFEC forward error correction scheme to support open standards such as OpenZR+, 100-gigabit ZR, the 400-gigabit OpenROADM MSA and CableLabs’ 100-gigabit standard.
The 100-gigabit ZR is targeted at the 5G access market and mobile backhaul. Like the OIF 400G ZR, it supports reaches of 80-120km but uses quadrature phase-shift keying (QPSK) modulation.
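A rough check on the modulation arithmetic (illustrative; the 15 per cent figure is an assumed oFEC-style overhead, not a number from Marvell): dual-polarisation QPSK carries four bits per symbol, so 100Gbps fits within the symbol rates of mature coherent components.

```python
# Symbol-rate estimate for 100-gigabit coherent QPSK.
bits_per_symbol = 2 * 2   # QPSK (2 bits) on each of two polarisations
payload_gbps = 100
fec_overhead = 0.15       # assumed oFEC-style overhead

symbol_rate = payload_gbps * (1 + fec_overhead) / bits_per_symbol
print(f"~{symbol_rate:.1f}Gbaud")  # ~28.8Gbaud
```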
“Not only do we support 100 gigabit [coherent] but we also have added the full industrial temperature range, from -40°C to 85°C,” says Michael Furlong, associate vice president, product marketing at Marvell.
The Deneb DSP is sampling now. Both the Deneb and Canopus DSPs will have a role in the marketplace, says Furlong.

Atlas PAM-4 DSP and the 1.6-terabit PHY
Marvell also announced at OFC the Atlas PAM-4 DSP and a dual 800-gigabit PHY device, both used within the data centre.
Atlas advances Marvell’s existing family of Polaris PAM-4 DSPs in that it integrates physical media devices. “We are integrating [in CMOS] the trans-impedance amplifier (TIA) and laser drivers,” says Alvares.
Using the 200-gigabit Atlas reduces an optical module design from three chips to two; the Atlas comprises a transmit chip and a receiver chip (see diagram below). Using the Atlas chips reduces the module’s bill of materials, while power consumption is reduced by a quarter.

The Atlas chips, now sampling, are not packaged but offered as bare die and will be used for 200-gigabit SR4 and FR4 modules. Meanwhile, Marvell’s 1.6-terabit PHY – the 88X93160 – is a dual 800-gigabit copper DSP that performs retimer and gearbox functions.
“We view this as the key data centre building block for the next decade,” says Alvares. “The world is just starting to design 100-gigabit serial for their infrastructure.”
The device, supporting 16 100-gigabit lanes, is the industry’s first 100-gigabit serial retimer, says Marvell. The device drives copper cables and backplanes and is being adopted for links between the server and the top-of-rack switch or to connect switches in the data centre.
The device is suitable for next-generation 400-gigabit and 800-gigabit Ethernet links that use 100-gigabit electrical serialisers-deserialisers (serdes).
The 5nm CMOS device supports a link budget of over 38dB (decibels) and reduces I/O power by 40 per cent compared to a 50-gigabit PAM-4-based PHY.
The 100-gigabit serdes design will also be used with Marvell’s Prestera switch portfolio.