The compound complexity of co-packaged optics
Large-scale data centres consume huge amounts of power; one building on a data centre campus can consume 100MW. But there is a limit to the overall power that can be supplied.

The challenge facing data centre operators is that networking, used to link the equipment inside the data centre, is consuming more and more of the power.
That means less power remains for the servers, the computing that does the revenue-generating work.
This is forcing a rethink regarding networking and explains the growing interest in co-packaged optics, a technique that effectively adds optical input-output (I/O) to a chip.
Two industry organisations - the OIF and The Consortium for On-Board Optics (COBO) - have each started work to identify the requirements needed for co-packaged optics adoption.
“We are seeing this activity because co-packaged optics is hard and requires prework to figure out how and when it is going to happen, and how the ecosystem changes,” says Nathan Tracy, TE Connectivity and the OIF’s vice president of marketing.
All change
Semiconductors and optics have always been separate domains but with a co-packaged design, silicon is suddenly only a handful of millimetres away from the optics, says Tracy: “It’s a very different environment.”
Hot chips sit next to the optics, so thermal characteristics must be shared and the cooling needs worked out. The electrical interface linking the optics to the chip will need to be optimised while there are new challenges such as how faults are dealt with.
“All these things come together and it changes what is done in the industry,” says Jeff Hutchins, Ranovus and OIF Physical and Link Layer (PLL) Working Group – Co-Packaging Vice-Chair.
“To be fair, there are companies that are not totally on-board with co-packaging,” says Hutchins. “But if you think about what is driving it, as you go to higher and higher electrical rates to connect things, you start to run more power and it is just more difficult to get a signal from Point A to Point B.”
For next-generation designs, companies are also considering ‘fly-over’ cables as well as the intermediate step of on-board optics, moving optics from the front panel onto the line card to be closer to the ASIC.
“But a good part of the industry thinks that, if you look forward, the only way to get there is co-packaging,” says Hutchins.
Using co-packaged optics will also impact the supply chain. The switch and pluggable modules are typically bought separately whereas a co-packaged design integrates the two. “Economically, it changes the way the industry works,” says Hutchins.
OIF and COBO

Hutchins, who is also a board member of COBO, says the co-packaging work of the two organisations will be complementary.
Co-packaged optics resides deep on the line card and fibre must connect the package to the system’s front panel. In turn, an external laser is commonly used as the light source for the optics. Such a laser is linked to the package using fibre.
“What COBO is doing is focussing on the optical connectivity part of this solution; the stuff outside the co-packaged assembly,” says Hutchins. “The OIF is concentrating on what the co-package assembly is, what goes inside, and what agreements can be made for interoperability for the whole assembly.”
The membership of the two organisations also differs: the OIF members include hyperscalers as well as optical and switch companies. “We have a good cross-section of the membership of this ecosystem,” says Tracy. COBO’s membership includes companies with connector and materials expertise.
Framework project
The OIF Framework Project will first study the applications where co-packaged optics will be used, identifying commonalities. It will then address the technology to determine what interoperability agreements are needed.
Applications for co-packaged optics besides Ethernet switches include machine learning and disaggregation. A disaggregated design refers to separating the chips found on a server motherboard - general processors (CPUs), graphics processor units (GPUs) and memory - into separate pools. A workload can then access the pools and configure the hardware elements it needs.
For each application, issues such as density, power, latency, and wavelength-count-per-fibre will be explored. “These must be understood as they differ as you go across the applications,” says Hutchins.
The OIF will identify what interoperability agreements to pursue and what should remain open for now before kicking-off specific Implementation Agreements.
Hutchins stresses that there are many aspects that can be standardised such as the mechanical design, environmental issues, power, electrical interfaces and reliability. “That is enough work to keep the whole group busy for quite a while,” he says.
As an example, such work could lead to a common socket design that would allow different optical specifications and reliability requirements, says Hutchins.
The OIF expects to complete the first two stages within the coming year.
“People are ready to go but they need to see the whole picture,” says Hutchins.
Roadmap
The OIF expects a gradual introduction of co-packaged Ethernet switches in the data centre with the technology spanning several generations.
Demonstrations could start with 25.6-terabit switches emerging now whereas many think the next-generation 51.2-terabit platforms will be the place to do initial demonstrations and small-scale deployments. After that, 100-terabit switches will likely be the sweet spot for co-packaged optics. And once 200-terabit switches appear, co-packaged optics will be a necessity.
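The capacity steps in this roadmap follow from doubling the switch capacity each generation. A minimal sketch of that arithmetic, assuming a clean doubling per generation (the function name is illustrative only):

```python
def capacity_generations(start_tbps, count):
    """Switch capacities (Tbps) for successive generations, doubling each time."""
    return [start_tbps * 2 ** n for n in range(count)]

# Start from the 25.6-terabit generation mentioned in the text.
roadmap = capacity_generations(25.6, 4)
for tbps in roadmap:
    print(f"{tbps:.1f} Tbps")
```

The third and fourth entries, 102.4 and 204.8 terabits, correspond to the nominal 100-terabit and 200-terabit switches referred to in the text.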
This may be a wide range of entry points, says Hutchins, but technology is being put together in a new way.
“The industry has to learn how to make this cost-effectively and achieve good yields,” says Hutchins. “There has to be a starting point somewhere but where the intercept point is, I don’t know.”
“Pluggables have served the market really well; they are flexible and [optical module] innovation continues,” adds Tracy. “The methodology is working so the question is when does it no longer suit the market.”
Tracy does not rule out pluggables being used for 100-terabit switches but inevitably it will be much harder to satisfy that requirement. “That is when co-packaged optics starts to become compelling,” says Tracy.
Ayar Labs’ TeraPHY chiplet nears volume production
Moving data between processing nodes - whether servers in a data centre or specialised computing nodes used for supercomputing and artificial intelligence (AI) - is becoming a performance bottleneck.
Workloads continue to grow yet networking isn’t keeping pace with processing hardware, resulting in the inefficient use of costly hardware.
Networking also accounts for an increasing proportion of the overall power consumed by such computing systems.
These trends explain the increasing interest in placing optics alongside chips and co-packaging the two to boost input-output (I/O) capacity and reach.
At the ECOC 2020 exhibition and conference held virtually, start-up Ayar Labs showcased its first working TeraPHY, an optical I/O chiplet, manufactured using GlobalFoundries’ 45nm silicon-photonics process.
GlobalFoundries is a strategic investor in Ayar Labs and has been supplying Ayar Labs with TeraPHY chips made using its existing 45nm silicon-on-insulator process for radio frequency (RF) designs.
The foundry’s new 300mm wafer 45nm silicon-photonics process follows joint work with Ayar Labs, including the development of the process design kit (PDK) and standard cells.
“This is a process that mixes optics and electronics,” says Hugo Saleh, vice president of marketing and business development at Ayar Labs (pictured). “We build a monolithic die that has all the logic to control the optics, as well as the optics,” he says.
The latest TeraPHY design is an important milestone for Ayar Labs as it looks to become a volume supplier. “None of the semiconductor manufacturers would consider integrating a solution into their package if it wasn’t produced on a qualified high-volume manufacturing process,” says Saleh.
Applications
The TeraPHY chiplet can be co-packaged with such devices as Ethernet switch chips, general-purpose processors (CPUs), graphics processing units (GPUs), AI processors, and field-programmable gate arrays (FPGAs).
Ayar Labs says it is engaged in several efforts to add optics to Ethernet switch chips, the application most associated with co-packaged optics, but its focus is AI, high-performance computing and aerospace applications.
Last year, Intel and Ayar Labs detailed a Stratix 10 FPGA co-packaged with two TeraPHYs for a phased-array radar design as part of the DARPA PIPES programme and the Electronics Resurgence Initiative backed by the US government.
Adding optical I/O chiplets to FPGAs suits several aerospace applications including avionics, satellite and electronic warfare.
TeraPHY chiplet
The ECOC-showcased TeraPHY uses eight transmitter-receiver pairs, each pair supporting eight channels operating at either 16, 25 or 32 gigabit-per-second (Gbps), to achieve an optical I/O of up to 2.048 terabits.
The chiplet can use either a serial electrical interface or Intel’s Advanced Interface Bus (AIB), a wide-bus design that uses slower 2Gbps channels. The latest TeraPHY uses a 32Gbps non-return-to-zero (NRZ) serial interface and Saleh says the company is working on a 56Gbps version.
The company has also demonstrated 4-level pulse-amplitude modulation (PAM-4) technology but many applications require the lowest latency links possible.
“PAM-4 gives you a higher data rate but it comes with the tax of forward-error correction,” says Saleh. With PAM-4 and forward-error correction, the latency is hundreds of nanoseconds (ns), whereas the latency is 5ns using an NRZ link.
Ayar Labs’ next parallel I/O AIB-based TeraPHY design will use Intel’s AIB 1.0 specification and will use 16 cells, each having 80, 2Gbps channels, to achieve a 2.56Tbps electrical interface.
In contrast, the TeraPHY used with the Stratix 10 FPGA has 24 AIB cells, each having 20, 2Gbps channels for an overall electrical bandwidth of 960 gigabits, while its optical I/O is 2.56Tbps since 10 transmit-receive pairs are used.
The optical bandwidth is deliberately higher than the electrical bandwidth. First, not all the transmit-receive macros on the die need to be used. Second, the chiplet has a crossbar switch that allows one-to-many connections such that an electrical channel can be sent out on more than one optical interface and vice versa.
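The bandwidth figures quoted for the different TeraPHY designs are all products of a group count, a channel count and a per-channel rate. The sketch below reproduces that arithmetic. The helper name is illustrative, and the Stratix 10 chiplet is assumed to use the same eight channels at 32Gbps per transmit-receive pair as the ECOC part, which is consistent with the 2.56Tbps quoted but not stated outright.

```python
def aggregate_gbps(groups, channels_per_group, gbps_per_channel):
    """Aggregate bandwidth in Gbps for identical groups of identical channels."""
    return groups * channels_per_group * gbps_per_channel

# Optical side: transmit-receive pairs x channels per pair x Gbps per channel.
ecoc_optical = aggregate_gbps(8, 8, 32)       # ECOC-showcased chiplet
stratix_optical = aggregate_gbps(10, 8, 32)   # assumed 8 x 32Gbps per pair

# Electrical side: AIB cells x channels per cell x 2Gbps per channel.
next_aib_electrical = aggregate_gbps(16, 80, 2)
stratix_electrical = aggregate_gbps(24, 20, 2)

for name, gbps in [
    ("ECOC chiplet, optical", ecoc_optical),
    ("Stratix 10 chiplet, optical", stratix_optical),
    ("Next AIB chiplet, electrical", next_aib_electrical),
    ("Stratix 10 chiplet, electrical", stratix_electrical),
]:
    print(f"{name}: {gbps / 1000:.3f} Tbps")
```

On these numbers, the electrical interface of the 16-cell AIB design comes to 2.56Tbps, while the Stratix 10 chiplet offers 2.56Tbps of optical I/O against 960 gigabits of electrical bandwidth.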
Architectures
Saleh points to several recent announcements that highlight the changes taking place in the industry that are driving new architectural developments.
He cites AMD acquiring programmable logic player Xilinx; how Apple instances are now being hosted in Amazon Web Services’ (AWS) cloud to aid developers and Apple’s processors; and how AWS and Microsoft are developing their own processors.
“Processors can now be built by companies using TSMC’s leading process technology using the ARM and RISC-V processor ecosystems,” he says. “AWS and Microsoft can target their codebase to whatever processor they want, including one developed by themselves.”
Saleh notes that Ethernet remains a key networking technology in the data centre and will continue to evolve but certain developments do need something else.
Applications such as AI and high-performance computing would benefit from a disaggregated design whereby CPUs, GPUs, AI devices and memory are separated and pooled. An application can then select the hardware it needs from the relevant pools to create the exact architecture it needs.
“Some of these new applications and processors that are popping up, there is a lot of benefit in a one-to-one and one-to-many connections,” he says. “The Achilles heel has always been how you disaggregate the memory because of latency and power concerns. Co-packaged optics with the host ASIC is the only way to do that.”
It will also be the only way such disaggregated designs will work given that far greater connectivity - estimated to be up to 100x that of existing systems - will be needed.
Expansion
Ayar Labs announced in November that it had raised $35 million in the second round of funding which, it says, was oversubscribed. This adds to its previous funding of $25 million.
The latest round includes four new investors and will help the start-up expand and address new markets.
One investor is a UK firm, Downing, that will connect Ayar Labs to European R&D and product opportunities. Saleh mentions the European Processor Initiative (EPI) that is designing a family of low-power European processors for extreme-scale computing. “Working with Downing, we are getting introduced into some of these initiatives including EPI and having conversations with the principals,” he says.
In turn, SGInnovate, a venture capitalist funded by the Singapore government, will help expand Ayar Labs’ activities in Asia. The two other investors are Castor Ventures and Applied Ventures, the investment arm of Applied Materials, the supplier of chip fabrication plant equipment.
“Applied Materials want to partner with us to develop the methodologies and tools to bring the technology to market,” says Saleh.
Meanwhile, Ayar Labs continues to grow, with a staff count approaching 100.
PCI Express back on track with latest specifications
Richard Solomon and Scott Knowlton are waiting for me in the lobby of a well-known Tel Aviv hotel overlooking the sunlit Mediterranean Sea.
Solomon, vice chair of the PCI Special Interest Group (PCI-SIG), and Knowlton, its marketing working group co-chair, are visiting Israel to deliver a training event addressing the PCI Express (PCIe) high-speed serial bus standard.
With over 750 member companies, PCI-SIG conducts several training events around the world each year. The locations are chosen where there is a concentration of companies and engineers undertaking PCIe designs. “These are chip, board and systems architects,” says Solomon.
PCI-SIG has hit its stride after a prolonged quiet period. The group completed the PCIe 4.0 standard in 2017, seven years after it launched PCIe 3.0. PCIe 4.0 doubles the serial bus speed and with the advent of PCIe 5.0, it will double again.
“We were late with PCIe 4.0,” admits Solomon. But with the introduction of the PCIe 5.0 standard in the first quarter of 2019, the serial bus’ speed progression will be back on track. “PCIe 5.0 is where the industry needs it to be.”
The latest training event is addressing the transition to PCIe 5.0. “User implementation stuff; the PHY, controller and verification IP,” says Knowlton. Verification IP refers to the protocols and interfaces needed to verify a PCIe 5.0-enabled chip design.
Markets
PCIe is used in a range of industries. In the cloud, the serial bus is used for servers and storage.
For servers, PCIe has been adopted by general-purpose microprocessors and more specialist devices such as FPGAs, graphics processing units and AI hardware.
The technology is also being used by enterprises, with PCIe switch silicon adopted in data centres to enable server redundancy and failover.
PCIe is also being used for storage and in particular solid-state drives (SSDs). That is because PCIe 4.0 transfers data at 16 gigabit-per-second (Gbps) per lane and can be scaled in parallel, typically in a by-four (x4) or a by-16 (x16) lane configuration.
The proportion of the SSDs that use PCIe is expected to grow from a quarter in 2018 to over three quarters in 2022, according to Forward Insights. Meanwhile, IDC forecasts that the SSD market will grow at a compound annual growth rate of 15 percent from 2016 to 2021.
PCIe is also employed within mobile handsets and for Internet of Things designs. PCI-SIG attributes its adoption in these applications to its speed and lane-width flexibility as well as its power efficiency.
Source: PCI-SIG
Bus specifications
The PCIe bus uses point-to-point communications. The standard uses a simple duplex scheme - serial transmission in both directions - that is referred to as a lane. Lanes can be bundled in a variety of configurations - x1, x2, x4, x8, x12, x16 and x32 - although x2, x12 and x32 are rarely, if ever, used in practice.
The first two iterations of PCIe, versions 1.0 and 2.0, delivered 2.5 and 5 gigatransfers-per-second (GT/s) per lane per direction, respectively.
A transfer refers to an encoded bit. The first two PCIe versions use an 8b/10b encoding scheme such that for every ten bits sent, only 8 bits are data. This is why the data transfer rates per lane per direction are 2Gbps and 4Gbps (250 and 500 megabytes-per-second), respectively (see table).
With PCIe 3.0, the decision was made to increase the transfer rate to 8GT/s per lane based on the assumption that no equalisation would be needed to counter inter-symbol interference at that speed, says Solomon. Equalisation was needed in the end, but the assumption explains why PCIe 3.0 adopted 8GT/s and not 10GT/s.
Another PCIe 3.0 decision was to move to a 128b/130b scheme to reduce the encoding overhead from 20 percent to just over 1 percent. This is why the transfer rate and bit rate are almost equal from the PCIe 3.0 standard onwards (see table).
The recent PCIe 4.0 specification doubles the transfer rate from 8GT/s to 16GT/s while PCIe 5.0 will achieve 32GT/s per lane per direction.
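The relationship between transfer rate, line encoding and usable data rate can be worked through in a few lines. The sketch below uses the transfer rates and encoding schemes given in the text; the function names are illustrative, and the calculation ignores packet and protocol overheads above the line encoding.

```python
# Transfer rate (GT/s per lane per direction), data bits and encoded bits
# for each PCIe version, as described in the text.
PCIE = {
    "1.0": (2.5, 8, 10),
    "2.0": (5.0, 8, 10),
    "3.0": (8.0, 128, 130),
    "4.0": (16.0, 128, 130),
    "5.0": (32.0, 128, 130),
}

def lane_gbps(version):
    """Usable data rate per lane per direction, in Gbps."""
    rate, data_bits, coded_bits = PCIE[version]
    return rate * data_bits / coded_bits

def encoding_overhead_percent(version):
    """Share of transmitted bits spent on line encoding."""
    _, data_bits, coded_bits = PCIE[version]
    return 100 * (coded_bits - data_bits) / coded_bits

def link_gigabytes_per_second(version, lanes):
    """Usable data rate of a multi-lane link per direction, in gigabytes-per-second."""
    return lane_gbps(version) * lanes / 8

for version in PCIE:
    print(
        f"PCIe {version}: {lane_gbps(version):.2f} Gbps per lane, "
        f"{encoding_overhead_percent(version):.1f}% encoding overhead, "
        f"x16 link {link_gigabytes_per_second(version, 16):.1f} GB/s"
    )
```

For PCIe 1.0 and 2.0 this gives 2Gbps and 4Gbps per lane, while from PCIe 3.0 onwards the 128b/130b overhead of about 1.5 percent leaves the data rate within a whisker of the transfer rate.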
When more than one lane is used, the encoded data is distributed across the lanes. A PCIe controller is used at each end of a lane to make sense of the bits. Meanwhile, a PCIe switch, a separate chip, can be used when fan out is needed to distribute the point-to-point links.
Compliance testing and design issues
Compliance testing of PCIe 4.0 will only occur at the beginning of 2019 even though it was standardised in 2017. Solomon says that this length of time is actually one of PCI-SIG’s shorter periods. It takes time to refine the exact electrical testing to be used, he says, and there is only so much that can be done until the silicon arrives.
Given that there are now 28Gbps and 56Gbps serialiser-deserialiser (serdes) technologies available, why were the PCIe 4.0 and PCIe 5.0 lane speeds not faster? Solomon says the latest PCIe standards were chosen to be multiples of the PCIe 3.0’s 8GT/s lane speed to ensure backward compatibility.
That said, designing systems using PCIe 4.0 and PCIe 5.0 signalling speeds is a challenge. Printed circuit boards need to be multi-layer and use higher-quality materials while retimer ICs are needed to achieve signal distances of 20 inches.
Solomon stresses that not all systems require such signal reaches; the dense electronics being developed for automotive designs that use AI techniques to make sense of their environment being one such example.
And with that, Solomon apologises and gets up: “I have a session to present”.
Switch chips not optics set the pace in the data centre
Broadcom is doubling the capacity of its switch silicon every 18-24 months, a considerable achievement given that Moore’s law has slowed down.
Last December, Broadcom announced it was sampling its Tomahawk 3 - the industry’s first 12.8-terabit switch chip - just 14 months after it announced its 6.4-terabit Tomahawk 2.
Such product cycle times are proving beyond the optical module makers; if producing next-generation switch silicon is taking up to two years, optics is taking three, says Broadcom.
“Right now, the problem with optics is that they are the laggards,” says Rochan Sankar, senior director of product marketing at switch IC maker, Broadcom. “The switching side is waiting for the optics to be deployable.”
The consequence, says Broadcom, is that in the three years spanning a particular optical module generation, customers have deployed two generations of switches. For example, the 3.2-terabit Tomahawk-based switches and the higher-capacity Tomahawk 2 ones both use QSFP28 and SFP28 modules.
In future, a closer alignment in the development cycles of the chip and the optics will be required, argues Broadcom.
Switch chips
Broadcom has three switch chip families, each addressing a particular market. As well as the Tomahawk, Broadcom has the Trident and Jericho families (see table).

All three chips are implemented using a 16nm CMOS process. Source: Broadcom/ Gazettabyte.
“You have enough variance in the requirements such that one architecture spanning them all is non-ideal,” says Sankar.
The Tomahawk is a streamlined architecture for use in large-scale data centres. The device is designed to maximise the switching capacity both in terms of bandwidth-per-dollar and bandwidth-per-Watt.
“The hyperscalers are looking for a minimalist feature set,” says Sankar. They consider the switching network as an underlay, a Layer 3 IP fabric, and they want the functionality required for a highly reliable interconnect for the compute and storage, and nothing more, he says.
Production of the Tomahawk 3 integrated circuit (IC) is ramping and the device has already been delivered to several webscale players and switch makers, says Broadcom.
The second, Trident family addresses the enterprise and data centres. The chip includes features deliberately stripped from the Tomahawk 3 such as support for Layer 2 tunnelling and advanced policy to enforce enterprise network security. The Trident also has a programmable packet-processing pipeline deemed unnecessary in large-scale data centres.
But such features are at the expense of switching capacity. “The Trident tends to be one generation behind the Tomahawk in terms of capacity,” says Sankar. The latest Trident 3 is a 3.2-terabit device.
The third, Jericho family is for the carrier market. The chip includes a packet processor and traffic manager and comes with the accompanying switch fabric IC dubbed Ramon. The two devices can be scaled to create huge capacity IP router systems exceeding 200 terabits of capacity. “The chipset is used in many different parts of the service provider’s backbone and access networks,” says Sankar. The Jericho 2, announced earlier this year, has 10 terabits of capacity.
Trends
Broadcom highlights several trends driving the growing networking needs within the data centre.
One is how microprocessors used within servers continue to incorporate more CPU cores while flash storage is becoming disaggregated. “Now the storage is sitting some distance from the compute resource that needs very low access times,” says Sankar.
The growing popularity of public cloud is also forcing data centre operators to seek greater server utilisation to ‘pack more tenants per rack’.
There are also applications such as deep learning that use other computing ICs such as graphics processor units (GPUs) and FPGAs. “These push very high bandwidths through the network and the application creates topologies where any element can talk to any element,” says Sankar. This requires a ‘flat’ networking architecture that uses the fewest networking hops to connect the communicating nodes.
Such developments are reflected in the growth in server links to the first level or top-of-rack (TOR) switches, links that have gone from 10 to 25 to 50 and 100 gigabits. “Now you have the first 200-gigabit network interface cards coming out this year,” says Sankar.
Broadcom says the TOR switch is not the part of the data centre network experiencing greatest growth. Rather, it is the layers above - the leaf-and-spine switching layers - where bandwidth requirements are accelerating the most. This is because the radix - the switch’s inputs and outputs - is increasing with the use of equal-cost multi-path (ECMP) routing. ECMP is a forwarding technique to distribute the traffic over multiple paths of equal cost to a destination port. “The width of the ECMP can be 4-way, 8-way and 16-way,” says Sankar. “That determines the connectivity to the next layer up.”
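One common way to realise ECMP is to hash the fields that identify a flow and use the result to pick one of the equal-cost next hops, so that all packets of a flow follow the same path. The sketch below illustrates the idea only; it is not Broadcom's implementation, and the CRC32 hash, the function name and the uplink names are assumptions made for the example.

```python
import zlib

def ecmp_next_hop(flow, next_hops):
    """Pick one of the equal-cost next hops by hashing the flow's 5-tuple."""
    key = "|".join(str(field) for field in flow).encode("utf-8")
    return next_hops[zlib.crc32(key) % len(next_hops)]

# Hypothetical 4-way ECMP group: four uplinks of equal cost to the next layer.
uplinks = ["spine-1", "spine-2", "spine-3", "spine-4"]

# Flow identified by source IP, destination IP, protocol, source port, destination port.
flow = ("10.0.0.1", "10.0.1.9", 6, 49152, 443)

chosen = ecmp_next_hop(flow, uplinks)
print(chosen)
```

Widening the group to 8-way or 16-way only changes the length of the next-hop list, which is why the ECMP width determines how many links are needed to the next layer up.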
It is such multi-layered leaf-spine architectures that the Tomahawk 3 switch silicon addresses.
Tomahawk 3
The Tomahawk 3 is implemented using a 16nm CMOS process and features 256 50-gigabit PAM-4 serialiser-deserialiser (serdes) interfaces to enable the 12.8-terabit throughput.
“Broadcom has been able to deliver 12.8 terabits-per-second in 16nm, whereas some competitors are waiting for 7nm,” says Bob Wheeler, vice president and principal analyst for networking at the Linley Group.
Sankar says Broadcom undertook significant engineering work to move from the 16nm Tomahawk 2’s 25-gigabit non-return-to-zero serdes to a 16nm-based 50G PAM-4 design. The resulting faster serdes design requires only marginally more die area while reducing the power consumed per gigabit by 40 percent.
The Tomahawk 3 also features a streamlined packet-processing pipeline and improved shared buffering. In the past, a switch chip could implement one packet-processing pipeline, says Wheeler. But at 12.8 terabit-per-second (Tbps), the aggregate packet rate exceeds the capacity of a single pipeline. “Broadcom implements multiple ingress and egress pipelines, each connected with multiple port blocks,” says Wheeler. The port blocks include MACs and serdes. “The hard part is connecting the pipelines to a shared buffer, and Broadcom doesn’t disclose details here.”
Source: Broadcom.
The chip also has telemetry support that exposes packet information to allow the data centre operators to see how their networks are performing.
Adopting a new generation of switch silicon also has system benefits.
One is reducing the number of hops between endpoints to achieve a lower latency. Broadcom cites how a 128x100 Gigabit Ethernet (GbE) platform based on a single Tomahawk 3 can replace six 64x100GbE switches in a two-tier arrangement. This reduces latency by 60 percent, from 1 microsecond to 400 nanoseconds.
There are also system cost and power consumption benefits. Broadcom uses the example of Facebook’s Backpack modular switch platform. The 8 rack unit (RU) chassis uses two tiers of switches - 12 Tomahawk chips in total. Using the Tomahawk 3, the chassis can be replaced with a 1RU platform, reducing the power consumption by 75 percent and system cost by 85 percent.
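The chip counts in both examples are what a non-blocking two-tier fabric built from identical chips would need. The sketch below works through that sizing under stated assumptions: each first-tier chip splits its ports evenly between the front panel and the second tier, the 3.2-terabit Tomahawk is treated as a 32x100GbE device, and the Backpack is taken to offer 128 front-panel ports, a figure not given in the text. The function names are illustrative.

```python
def two_tier_chip_count(radix, front_panel_ports):
    """Chips needed for a non-blocking two-tier fabric of identical chips.

    Each first-tier chip uses half its ports for the front panel and half
    for links to the second tier.
    """
    down_per_leaf = radix // 2
    leaves = -(-front_panel_ports // down_per_leaf)   # ceiling division
    uplinks = leaves * (radix - down_per_leaf)
    spines = -(-uplinks // radix)                     # ceiling division
    return {"leaves": leaves, "spines": spines, "total": leaves + spines}

def percent_reduction(before, after):
    """Percentage saved when going from 'before' to 'after'."""
    return 100 * (before - after) / before

# 128x100GbE from 64-port chips, as in Broadcom's first example.
print(two_tier_chip_count(64, 128))

# 128x100GbE from 32-port chips (assumed port count for the Backpack).
print(two_tier_chip_count(32, 128))

# Latency falling from 1 microsecond to 400 nanoseconds.
print(percent_reduction(1000, 400))
```

Under these assumptions the first case needs four first-tier and two second-tier chips, six in total, and the second needs twelve, matching the figures quoted; a single 128x100GbE chip replaces either fabric.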
Aligning timelines
Both the switch-chip vendors and the optical module players are challenged to keep up with the growing networking capacity demands of the data centre. The fact that next-generation optics takes about a year longer than the silicon is not new. It happened with the transition from 40-gigabit QSFP+ to 100-gigabit QSFP28 optical modules and now from the 100-gigabit QSFP28 to 200-gigabit QSFP56 and 400-gigabit QSFP-DD production.
“400-gigabit optical products are currently sampling in the industry in both OSFP and QSFP-DD form factors, but neither has achieved volume production,” says Sankar.
Broadcom is using 400-gigabit modules with its Tomahawk 3 in the lab, and customers are doing the same. However, the hyperscalers are not deploying Tomahawk 3-based data centre network designs using 400-gigabit optics. Rather, the switches are using existing QSFP28 interfaces, or in some cases 200-gigabit optics. But 400-gigabit optics will follow.
The consequence of the disparity in the silicon and optics development cycles is that while the data centre players want to exploit the full capacity of the switch once it becomes available, they can’t. This means the data centre upgrades conducted - what Sankar calls ‘mid-life kickers’ - are costlier to implement. In addition, given that most cloud data centres are fibre-constrained, doubling the number of fibres to accommodate the silicon upgrade is physically prohibitive, says Broadcom.
“The operator can't upgrade the network any faster than the optics cadence, leading to a much higher overall total cost of ownership,” says Sankar. They must scale out to compensate for the inability to scale up the optics and the silicon simultaneously.
Optical I/O
Scaling the switch chip - its input-output (I/O) - presents its own system challenges. “The switch-port density is becoming limited by the physical fanout a single chip can support,” says Sankar. “You can’t keep doubling pins.”
It will be increasingly challenging to increase the I/O to 512 or 1024 serdes in future switch chips while satisfying the system link budget, and to achieve both in a power-efficient manner. This is another reason why aligning the scaling of the optics and the serdes speeds with the switching element is desirable, says Broadcom.
Broadcom says electrical interfaces will certainly scale for its next-generation 25.6-terabit switch chip.
Linley Group’s Wheeler expects the 25.6-terabit switch will be achieved using 256 100-gigabit PAM4 serdes. “That serdes rate will enable 800 Gigabit Ethernet optical modules,” he says. “The OIF is standardising serdes via CEI-112G while the IEEE 802.3 has the 100/200/400G Electrical Interfaces Task Force running in parallel.”
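Switch capacity is the product of the serdes count and the serdes rate, which is why the pin count becomes the issue if serdes speeds stop rising. A rough sketch of the trade-off follows; the function names are illustrative, and the 51.2-terabit and 102.4-terabit rows are projections assuming serdes stay at 100 gigabits, not figures from Broadcom.

```python
def capacity_gbps(serdes_count, serdes_gbps):
    """Aggregate switch capacity in Gbps."""
    return serdes_count * serdes_gbps

def serdes_needed(target_gbps, serdes_gbps):
    """Serdes required to reach a target capacity (ceiling division)."""
    return -(-target_gbps // serdes_gbps)

# Capacities quoted in the text.
print(capacity_gbps(256, 50) / 1000)    # Tomahawk 3: 256 x 50G PAM-4
print(capacity_gbps(256, 100) / 1000)   # expected next generation: 256 x 100G PAM-4

# Projection: what further doublings need if serdes stay at 100 gigabits.
for target_gbps in (51200, 102400):
    print(target_gbps / 1000, "Tbps needs", serdes_needed(target_gbps, 100), "serdes")
```

The projection lands on the 512 and 1024 serdes mentioned above: without faster serdes, every doubling of capacity doubles the number of high-speed lanes leaving the package.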
But system designers already acknowledge that new ways to combine the switch silicon and optics are needed.
“One level of optimisation is the serdes interconnect between the switch chip and the optical module itself,” says Sankar, referring to bringing optics on-board to shorten the electrical paths the serdes must drive. The Consortium for On-Board Optics (COBO) has specified just such an interoperable on-board optics solution.
“The stage after that is to integrate the optics with the IC in a single package,” says Sankar.
Broadcom is not saying which generation of switch chip capacity will require in-package optics. But given the IC roadmap of doubling switch capacity at least every two years, there is an urgency here, says Sankar.
The fact that there are few signs of in-package developments should not be mistaken for inactivity, he says: “People are being very quiet about it.”
Brad Booth, chair of COBO and principal network architect for Microsoft’s Azure Infrastructure, says COBO does not have a view as to when in-package optics will be needed.
Discussions are underway within the IEEE, OIF and COBO on what might be needed for in-package optics and when, says Booth: “One thing that many people do agree upon is that COBO is solving some of the technical problems that will benefit in-package optics such as optical connectivity inside the box.”
The move to in-package optics represents a considerable challenge for the industry.
“The transition and movement to in-package optics will require the industry to answer a lot of new questions that faceplate pluggable just doesn’t handle,” says Booth. “COBO will answer some of these, but in-package optics is not just a technical challenge, it will challenge the business-operating model.”
Booth says demonstrations of in-package optics can already be done with existing technologies. And given the rapid timelines of switch chip development, many in the industry have discussed the possibility of using the next 25.6-terabit generation of switch chip in early trials of in-package optics, he says.
White boxes
While the dominant market for the Tomahawk family is the data centre, a recent development has been the use of the 3.2-terabit Tomahawk chip within open-source platforms such as the Telecom Infra Project’s (TIP) Voyager and Cassini packet optical platforms.
Ciena has also announced its own 8180 platform that supports 6.4 terabits of switching capacity, yet Ciena says the 8180 uses a Tomahawk 3, implying the platform will scale to 12.8Tbps.
Niall Robinson, vice president, global business development at ADVA, a member of TIP and the Voyager initiative, makes the point that since the bulk of the traffic remains within the data centre, the packet optical switch capacity and the switch silicon it uses need not be the latest generation IC.
“Eventually, the packet-optical boxes will migrate to these larger switching chips but with some considerable time lag compared to their introduction inside the data centre,” says Robinson.
The advent of 400-gigabit client-port optics will drive the move to higher-capacity platforms such as the Voyager because it is these larger chips that can support 400-gigabit ports. “Perhaps a Jericho 2 at 9.6-terabit is sufficient compared to a Tomahawk 3 at 12.8-terabit,” says Robinson.
Edgecore Networks, the originator of the Cassini platform, says it too is interested in the Tomahawk 3 for its Cassini platform.
“We have a Tomahawk 3 platform that is sampling now,” says Bill Burger, vice president, business development and marketing, North America at Edgecore Networks, referring to a 12.8Tbps open networking switch, contributed to the Open Compute Project (OCP), that supports 32 400-gigabit QSFP-DD modules.
Broadcom’s Sankar highlights the work of the OCP and TIP in promoting disaggregated hardware and software. The initiatives have created a forum for open specifications, increased the number of hardware players and therefore competition while reducing platform-development timescales.
“There continues to be strong interest in white-box systems and strong signalling to the market to build white-box platforms,” says Sankar.
The issue, however, is the lack of volume deployments to justify the investment made in disaggregated designs.
“The places in the industry where white boxes have taken off continues to be the hyperscalers, and a handful of hyperscalers at that,” says Sankar. “The industry has yet to take up disaggregated networking hardware at the rate at which it is spreading at least the appearance of demand.”
Sankar is looking for the industry to narrow the choice of white-box solutions available and for the emergence of a consumption model for white boxes beyond just several hyperscalers.
Imec eyes silicon photonics to solve chip I/O bottleneck
In the second and final article, the issue of adding optical input-output (I/O) to ICs is discussed with a focus on the work of the Imec nanoelectronics R&D centre that is using silicon photonics for optical I/O.
Part 2: Optical I/O
Imec has demonstrated a compact low-power silicon-photonics transceiver operating at 40 gigabits per second (Gbps). The silicon photonics transceiver design also uses 14nm FinFET CMOS technology to implement the accompanying driver and receiver electronics.
“We wanted to develop an optical I/O technology that can interface to advanced CMOS technology,” says Joris Van Campenhout, director of the optical I/O R&D programme at Imec. “We want to directly stick our photonics device to that mainstream CMOS technology being used for advanced computing applications.”
Traditionally, the Belgian nanoelectronics R&D centre has focussed on scaling logic and memory but in 2010 it started an optical I/O research programme. “It was driven by the fact that we saw that electrical I/O doesn’t scale that well,” says Van Campenhout. Electrical interfaces have power, space and reach issues that get worse with each hike in transmission speed.
Imec is working with partner companies to research optical I/O. The players are not named but include semiconductor foundries, tool vendors, fabless chip companies and electronic design automation tool firms. The aim is to increase link capacity, bandwidth density - a measure of the link capacity that can be crammed in a given space - and reach using optical I/O. The research’s target is a 10x to 100x improvement in scaling.
The number of silicon photonics optical I/O circuits manufactured each year remains small, says Imec: several thousand to ten thousand semiconductor wafers at most. But Imec expects volumes to grow dramatically over the next five years as optical interconnects are used for ever shorter reaches, a few metres and eventually below one metre.
“That is why we are participating in this research, to put together building blocks to help in the technology pathfinding,” says Van Campenhout.
Silicon photonics transceiver
Imec has demonstrated a 1330nm optical transceiver operating at 40Gbps using non-return-to-zero signalling. The design uses hybrid integration to combine silicon photonics with 14nm FinFET CMOS electronics. The resulting transceiver occupies 0.025 mm², the area across the combined silicon photonics and CMOS stack for a single transceiver channel. This equates to a bandwidth density of 1.6 terabit-per-second/mm².
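The bandwidth-density figure follows directly from the per-channel data rate and footprint quoted for the demonstrator. A minimal sketch of the arithmetic (the function name is chosen here purely for illustration):

```python
def bandwidth_density_tbps_per_mm2(rate_gbps: float, area_mm2: float) -> float:
    """Bandwidth density in terabits-per-second per square millimetre."""
    return rate_gbps / area_mm2 / 1000.0


# Figures quoted for the demonstrator: 40Gbps per channel,
# 0.025 square millimetres per channel.
density = bandwidth_density_tbps_per_mm2(40.0, 0.025)
print(f"{density:.1f} Tbps per square mm")
```

The same function can be used to see how the density would change if the channel rate rose to 56Gbps with the footprint unchanged, although the article does not say whether the area would stay the same.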
The silicon photonics and FinFET test chips each contain circuitry for eight transmit and eight receive channels. Combined, the transmitter path comprises a silicon photonics ring modulator and a FinFET differential driver while the receiver uses a germanium-based photo-detector and a first-stage FinFET trans-impedance amplifier (TIA).
The transceiver has an on-chip power consumption of 230 femtojoules-per-bit, although Van Campenhout stresses that this is a subset of the functionality needed for the complete link. “This number doesn’t include the off-chip laser power,” he says. “We still need to couple 13dBm - 20mW - of optical power in the silicon photonics chip to close the link budget.” Given the laser has an efficiency of 10 to 20 percent, that means another 100mW to 200mW of power.
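The power figures above can be checked with a few lines of arithmetic. The sketch below converts the 13dBm optical requirement to milliwatts, scales it by the quoted laser efficiency range, and turns the 230 femtojoules-per-bit figure into an on-chip power at 40Gbps. The function names are illustrative, and the sketch makes no assumption about whether one laser is shared between channels, which the article does not state.

```python
def dbm_to_mw(dbm: float) -> float:
    """Convert optical power from dBm to milliwatts."""
    return 10 ** (dbm / 10)


def laser_electrical_mw(optical_dbm: float, efficiency: float) -> float:
    """Electrical power drawn by a laser of the given wall-plug efficiency."""
    return dbm_to_mw(optical_dbm) / efficiency


def on_chip_power_mw(fj_per_bit: float, rate_gbps: float) -> float:
    """fJ/bit x Gbit/s = 1e-15 J x 1e9 /s = 1e-6 W, i.e. 1e-3 mW."""
    return fj_per_bit * rate_gbps * 1e-3


print(f"13dBm = {dbm_to_mw(13):.2f} mW of optical power")
print(f"laser at 20% efficiency: {laser_electrical_mw(13, 0.20):.0f} mW")
print(f"laser at 10% efficiency: {laser_electrical_mw(13, 0.10):.0f} mW")
print(f"on-chip transceiver power: {on_chip_power_mw(230, 40):.1f} mW")
```

The on-chip electronics therefore account for roughly 9mW per 40Gbps channel, which is why the off-chip laser power matters so much to the overall link budget.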
That said, an equivalent speed electrical interface has an on-chip power consumption of some 2 picojoules-per-bit so the optical interface still has some margin to better the power efficiency of the equivalent electrical I/O. In turn, the optical I/O’s reach using single-mode fibre is several hundred metres, far greater than any electrical interface.
Imec is confident it can increase the optical interface’s speed to 56Gbps. The layout of the CMOS circuits can be improved to reduce internal parasitic capacitances while Imec has already improved the ring modulator design compared to the one used for the demonstrator.
“We believe that with a few design tweaks we can get to 56Gbps comfortably,” says Van Campenhout. “After that, to go faster will require new technology like PAM-4 rather than non-return-to-zero signalling.”
Imec has also tested four transmit channels using cascaded ring modulators on a common waveguide as part of work to add a wavelength-division multiplexing capability.
Transceiver packaging
The two devices - the silicon photonics die and the associated electronics - are combined using chip-stacking technology.
Both devices use micro-bumps with a 50-micron pitch with the FinFET die flip-chipped onto the silicon photonics die. The combined CMOS and silicon photonics assembly is glued on a test board and wire-bonded, while the v-groove fibre arrays are attached using active alignment. The fibre-to-chip coupling loss, at 4.5dB in the demonstration, remains high but the researchers say this can be reduced, having achieved 2dB coupling losses in separate test chips.
Source: Imec.
Imec is also investigating using through-silicon vias (TSV) technology and a silicon photonics interposer in order to replace the wire-bonding. TSVs deliver better power and ground signals to the two dies and enable high-speed electrical I/O between the transceiver and the ASIC such as a switch chip. The optics and ASIC could be co-packaged or the transceiver used in an on-board optics design next to the chip.
“We have already shown the co-integration of TSVs with our own silicon photonics platform but we are not yet showing the integration with the CMOS die,” says Van Campenhout. “Something we are working on.”
Applications
The first ICs to adopt optical I/O will be used in the data centre and for high-performance computing. The latest data centre switch ICs, with a capacity of 12.8 terabits, are implemented using 16nm CMOS. Moving to a 7nm CMOS process node will enable capacities of 51.2 terabits. “These are the systems where the bandwidth density challenge is the largest,” says Van Campenhout.
But significant challenges must be overcome before this happens, he says: “I think we all agree that bringing optics deeply integrated into such a product is not a trivial thing.”
Co-packaging the optics with silicon will come at a premium cost. There are also reliability issues to be resolved and greater standardisation across the industry will be needed as to how the packaging should be done.
Van Campenhout expects this will only happen in the next four to five years, once the traffic-handling capacity of switch chips doubles and doubles again.
Imec has seen growing industry interest in optical I/O in the last two years. “We have a lot of active interactions so interest is accelerating now,” says Van Campenhout.
Xilinx delivers 58G serdes and showcases a 112G test chip
In the first of two articles, electrical input-output developments are discussed, focussing on Xilinx’s serialiser-deserialiser (serdes) work for its programmable logic chips. In Part 2, the Imec nanoelectronics R&D centre’s latest silicon photonics work to enable optical I/O for chips is detailed.
Part 1: Electrical I/O
Processor and memory chips continue to scale exponentially. The electrical input-output (I/O) used to move data on and off such chips scales less well. Electrical interfaces are now transitioning from 28 gigabit-per-second (Gbps) to 56Gbps and work is already advanced to double the rate again to 112Gbps. But the question as to when electrical interfaces will reach their practical limit continues to be debated.
“Some two years ago, talking to the serdes community, they were seeing 100 gigabits as the first potential wall,” says Gilles Garcia, communications business lead at Xilinx. “In two years, a lot of work has happened and we can now demonstrate 112 gigabits [electrical interfaces].”
The challenge of moving to higher-speed serdes is that the reach shortens with each doubling of speed. The need to move greater amounts of data on- and off-chip also has power-consumption implications, especially with the extra circuitry needed when moving from non-return-to-zero signalling to the more complex 4-level pulse-amplitude modulation (PAM-4) signalling scheme.
PAM-4 is already used for 56-gigabit electrical I/O for such applications as 400 Gigabit Ethernet optical modules and leading edge 12.8-terabit switch chips. Having 112-gigabit serdes at least ensures one further generation of switch chips and optical modules but what comes after that is still to be determined. Even if more can be squeezed out of copper, the trace lengths will shorten and optics will continue to get closer to the chip.
58-gigabit serdes
Xilinx announced in March its first two Virtex Ultrascale+ FPGAs that will feature 58Gbps serdes. The company also demonstrated the technology at the OFC show. “No one else on the show floor had the same [58G serdes] capabilities in terms of bit error rate, noise floor, the demonstration across backplane technology, and transmitting and receiving data simultaneously,” says Garcia.
The two FPGAs are the VU27P that features 32 of the 58Gbps serdes as well as 32, 33Gbps serdes, while the second device, the VU29P, has 48, 58Gbps serdes as well as 32, 33Gbps ones. Both FPGA devices will ship by the year-end, says Xilinx. Moreover, customers have already used Xilinx’s 58Gbps test chip to validate its working over their systems’ backplanes in preparation for the arrival of the FPGAs.
The Ultrascale+ FPGAs are constructed using several dice attached to a single silicon interposer to form a 2.5D chip design, what Xilinx calls its stacked silicon interconnect technology. The 58Gbps serdes are integrated into each FPGA slice. “Consider each slice as a monolithic implementation,” says Garcia.
Source: Xilinx.
The two FPGAs with 58Gbps serdes are suited for such telecom applications as next-generation router and packet optical line cards that will use 200-gigabit and 400-gigabit client-side optical modules. The VU29P with its 48, 58Gbps serdes will be able to support line cards with up to six QSFP-DD or OSFP 400 Gigabit Ethernet modules (see the diagram of an example line card).
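The six-module figure is consistent with each 400 Gigabit Ethernet module using eight 50-gigabit electrical lanes, one per 58Gbps serdes. That lane count is an assumption made for this sketch rather than something Xilinx states here, and the constant and function names are illustrative:

```python
# Assumption: a 400GbE module presents eight 50-gigabit electrical lanes,
# each driven by one 58Gbps serdes.
LANES_PER_400G_MODULE = 8


def max_400g_modules(serdes_count: int) -> int:
    """Number of whole 400GbE modules a given serdes count can drive."""
    return serdes_count // LANES_PER_400G_MODULE


print("VU29P (48 serdes):", max_400g_modules(48))
print("VU27P (32 serdes):", max_400g_modules(32))
```

Under the same assumption the VU27P, with 32 of the 58Gbps serdes, would drive four such modules, although that number is derived here and is not quoted by the company.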
112-gigabit test chip
Xilinx also showcased its 112Gbps serdes test chip at the OFC show in March. “What we showed was it operating in full duplex mode - transmitting and receiving - running on the same board as the 58-gigabit serdes,” says Garcia. “The point being the 112-gigabit demo worked on a printed circuit board not designed for a 112-gigabit serdes.”
Xilinx stresses that the 112-gigabit serdes will appear on its next generation of FPGA devices implemented using a 7nm CMOS process. “It [the FPGA portfolio] will coincide with when the market needs 112 gigabits,” says Garcia.
One obvious market indicator will be the emergence of optical modules that use electrical lanes operating at 112 gigabits. “The holy grail of optical modules is to use four [electrical] lanes for 400 gigabits,” says Garcia. The IEEE is working on such a specification and the work is expected to be completed at the end of 2019. Optical module vendors will likely have first samples in 2020. Then there is the separate timeline associated with next-generation 25.6-terabit switch chips.
“You need to have the full ecosystem before customers really implement 112Gbps serdes,” says Garcia.
Sckipio’s G.fast silicon to enable gigabit services
Sckipio’s newest G.fast broadband chipset family delivers 1.2 gigabits of aggregate bandwidth over 100m of telephone wire.
The start-up’s SCK-23000 chipset family implements the ITU’s G.fast Amendment 3 212a profile. The profile doubles the spectrum used by G.fast from 106MHz to 212MHz, boosting the broadband rates. In contrast, VDSL2 digital subscriber line technology uses only 17MHz of spectrum.
“What the telcos want is gigabit services,” says Michael Weissman, vice president of marketing at Sckipio. “This second-generation [chipset family] allows that.”
G.fast market
AT&T announced in August that it is rolling out G.fast technology in 22 metro regions in the US. The operator already offers G.fast to multi-dwelling units in eight of these metro regions. The rollout adds to the broadband services AT&T offers in 21 states.
AT&T’s purchase of DirecTV in 2015 has given the operator some 20 million coax lines, says Weissman. AT&T can now deliver broadband services to apartments that have the DirecTV satellite service by bringing a connection to the building’s roof. AT&T will deliver such connections using its own fibre or by partnering with an incumbent operator. Once connected, high-speed internet using G.fast can then be delivered over the coax cable, a superior medium compared to telephony wiring.
“This is fundamentally going to change the game,” says Weissman. “AT&T can now compete with cable companies and incumbent operators in markets it couldn’t address before.”
Sckipio has secured four out of the top five telcos in the US that have chosen to do G.fast: AT&T, CenturyLink, Windstream and Frontier. “The two largest - AT&T and CenturyLink - are exclusively ours,” says Weissman.
In markets such as China, the focus is on fibre. The three largest Chinese operators had deployed some 260 million fibre-to-the-home (FTTH) lines by the end of July.
Overall, Sckipio is involved in some 100 G.fast pilots worldwide. The start-up is also the sole supplier of G.fast silicon to broadband vendor Calix and one of two suppliers to Adtran.
“Right now there are only two real deployments that are publicly announced - and I mean deployment volumes - AT&T and BT,” says Weissman. “The point is G.fast is real.”
Telcos have several requirements when it comes to G.fast deployment. One is that the technology delivers competitive broadband rates and that means gigabit services. Another is coverage: the ability to serve as high a percentage of customers as possible in a given region.
Because G.fast works across the broader spectrum - 212MHz - advanced signal processing techniques are required to make the technology work. Known as vectoring, the signal processing technique rejects crosstalk - leaking signals - between the telephone wires at the distribution point. A further operator need is ‘vectoring density’, the ability to vector as many lines as possible.
It is these and other requirements that Sckipio has set out to address with its SCK-23000 chipset family.
SCK-23000 chipset
The SCK-23000 comprises two chipsets. One is the 8-port DP23000 chipset used at the distribution point unit (DPU) while the second chipset is the CP23000, used for customer premise equipment.
Sckipio is not saying what CMOS process is used to implement the chipsets. Nor will it say how many chips make up each of the chipsets.
As for performance, the chipsets enable an aggregate line-rate performance (downstream and upstream) of 1.7 gigabits-per-second (Gbps) over 50m, to 0.4Gbps over 300m. The DP23000 chipset also supports two bonded telephone lines, effectively doubling the line rate. In markets such as the US and Taiwan, a second wire pair to a home is common.
Vectoring density
Vectoring density dictates how many G.fast ports can be deployed at a distribution point. And the computationally-intensive task is even more demanding with the adoption of the 212a profile. “The larger the vector group, the more each subscriber’s line must know what every other subscriber’s signal is to manage the crosstalk - and you are doing it at twice the bandwidth,” says Weissman.
Sckipio says the SCK-23000 supports up to 96 ports (or 48 bonded ports) at the 212a profile. The design uses distributed parallel processing that spreads the vectoring computation among the DP23000 8-port devices used. “We are not specifying data paths between the chips but you are talking about gigabytes of traffic flowing in all directions, all of the time,” says Weissman.
The computation can not only be spread across the devices in a single distribution point box but across devices in different boxes. Operators can thus use a pay-as-you-grow model, adding a new box as required. “A 96-port design could be two 48-port boxes, or an 8-port box could [be combined to] become a 16- or 24-port design if you have a smaller multi-dwelling unit environment,” says Weissman.
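To give a feel for why vectoring density is demanding, the sketch below counts the 8-port devices needed for a vector group and the number of line-to-line crosstalk paths the group must account for. The N x (N - 1) count is a general property of a fully vectored group of N lines, not a figure supplied by Sckipio, and the function names are illustrative.

```python
PORTS_PER_DEVICE = 8  # the DP23000 is described as an 8-port chipset


def devices_needed(ports: int) -> int:
    """8-port devices required to serve the given number of ports."""
    return -(-ports // PORTS_PER_DEVICE)  # ceiling division


def crosstalk_paths(lines: int) -> int:
    """Directed crosstalk couplings in a fully vectored group of lines."""
    return lines * (lines - 1)


for ports in (8, 16, 48, 96):
    print(ports, "ports:", devices_needed(ports), "devices,",
          crosstalk_paths(ports), "crosstalk paths")
```

The path count grows roughly with the square of the group size, so doubling a box from 48 to 96 ports about quadruples the couplings to be cancelled, which is consistent with the case for spreading the computation across devices.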
Sckipio’s design also features a reverse power feed: power is fed to the distribution point to avoid having to install a costly power supply. Since the power must come from a subscriber, the box’s power demand must not be excessive. A 16-port box is a good compromise in that it is not too large and as subscriber-count grows, each new 16-port unit added can be powered by another consumer.
“You can only do that if you can do cross distribution-point-unit vectoring,” says Weissman. “It allows the telcos to do a reverse power feed at the densities they require.”
Dynamic bandwidth allocation
The chipsets also support co-ordinated dynamic bandwidth allocation, what Sckipio refers to as co-ordinated dynamic time assignment.
Unlike DSL where the spectrum is split between upstream and downstream traffic, G.fast partitions the two streams in time: the CPE chipset is either uploading or downloading traffic only.
Until now, an operator would preset a fixed upload-download ratio at installation. Now, with the latest silicon, dynamic bandwidth allocation can take place. The system assesses the changing usage of subscribers and adjusts the upload-download ratio accordingly. However, this must be co-ordinated across all users such that they all send and all receive data simultaneously.
“You can’t, under any circumstances, have lines uploading and downloading at the same time,” says Weissman. “All the systems that are vectored must be communicating in the same direction at the same time.” If they are not co-ordinated, crosstalk occurs. This is another crosstalk, in addition to the crosstalk caused by the adjacency of the telephone wires that is compensated for using vectoring.
“If you don’t co-ordinate across all the pairs, you create a different type of crosstalk which you can’t mitigate,” says Weissman. “This will kill the system.”
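The co-ordination requirement can be illustrated with a toy scheduler that looks at the demand on every line and then picks one downstream/upstream time split that all lines share. This is a sketch of the general idea only, not Sckipio’s algorithm: the demand model, the 36-slot frame length and the function name are all assumptions made for the example.

```python
from typing import List, Tuple


def coordinated_split(demands: List[Tuple[float, float]],
                      slots_per_frame: int = 36) -> Tuple[int, int]:
    """Return one (downstream_slots, upstream_slots) split for all lines.

    demands holds a (downstream, upstream) traffic demand per line.
    Every vectored line gets the same split, so no line transmits
    upstream while another is receiving downstream.
    """
    total_down = sum(down for down, _ in demands)
    total_up = sum(up for _, up in demands)
    total = total_down + total_up
    if total == 0:
        down_slots = slots_per_frame // 2
    else:
        down_slots = round(slots_per_frame * total_down / total)
    # Always leave at least one slot in each direction.
    down_slots = max(1, min(slots_per_frame - 1, down_slots))
    return down_slots, slots_per_frame - down_slots


# Three lines with different usage patterns share a single split.
print(coordinated_split([(300, 100), (50, 150), (250, 50)]))
```

The key design point is that the split is computed once for the whole vectored group rather than per line, so a heavy uploader cannot shift its own ratio independently of its neighbours.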
Sckipio says the SCK-23000 chipsets are already with customers and that the devices are generally available.
MultiPhy unveils 100G single-wavelength PAM-4 chip
A chip to enable 100-gigabit single-wavelength client-side optical modules has been unveiled by MultiPhy. The 100-gigabit 4-level pulse amplitude modulation (PAM-4) circuit will also be a key building block for 400 Gigabit Ethernet interfaces that use four wavelengths.
Source: MultiPhy
Dubbed the MPF3101, the 100-gigabit physical layer (PHY) chip is aimed at such applications as connecting switches within data centres and for 5G cloud radio access network (CRAN).
“The chip has already been sent out to customers and we are heading towards market introductions,” says Avi Shabtai, CEO of MultiPhy.
The MPF3101 will support 100-gigabit over 500m, 2km and 10km.
The IEEE has developed the 100-gigabit 100GBASE-DR standard for 500m while the newly formed 100G Lambda MSA (multi-source agreement) is developing specifications for the 2km 100-gigabit single-channel 100G-FR and the 10km 100G-LR.
MultiPhy says the QSFP28 will be the first pluggable module to implement a 100-gigabit single-wavelength design using its chip. The SFP-DD MSA, currently under development, will be another pluggable form factor for the single-wavelength 100-gigabit designs.
400 Gigabit
The 100-gigabit IP will also be a key building block for a second MultiPhy chip for 400-gigabit optical modules needed for next-generation data centre switches that have 6.4 and 12.8 terabits of capacity. “This is the core engine for all these markets,” says Shabtai.
Companies have differing views as to how best to address the 400-gigabit interconnect market. There is a choice of form factors such as the OSFP, QSFP-DD and embedded optics based on the COBO specification, as well as emerging standards and MSAs.
The dilemma facing companies is what approach will deliver 400-gigabit modules to coincide with the emergence of next-generation data centre switches.
One consideration is the technical risk associated with implementing a particular design. Another is cost, with the assumption that 4-wavelength 400-gigabit designs will be cheaper than 8x50-gigabit based modules but that they may take longer to come to market.
For 400 gigabits, the IEEE 802.3bs 400 Gigabit Ethernet Task Force has specified the 400GBASE-DR4, a 500m-reach four-wavelength specification that uses four parallel single-mode fibres. The 100G Lambda MSA is also working on a 400-gigabit 2km specification based on coarse wavelength-division multiplexing (CWDM), known as 400G-FR4, with work on a 10km reach specification to start in 2018.
And at the ECOC 2017 show, held last week in Gothenburg, another initiative - the CWDM8 MSA - was announced. The CWDM8 is an alternative design to the IEEE specifications that sends eight 50-gigabit non-return-to-zero signals rather than PAM-4 over a fibre.
“We are hearing a lot in the industry about 50-gigabit-per-lambda,” says Shabtai. “For us, this is old news; we are moving to 100-gigabit-per-lambda and we believe the industry will align with us.”
Chip architecture
The MPF3101, implemented using a 16nm CMOS process, supports PAM-4 at symbol rates up to 58 gigabaud.
The chip’s electrical input is four 25-gigabit lanes that are multiplexed and encoded into a 50-plus gigabaud PAM-4 signal that is fed to a modulator driver, part of a 100-gigabit single-channel transmitter optical sub-assembly (TOSA). A 100-gigabit receiver optical sub-assembly (ROSA) feeds the received PAM-4 encoded signal to the chip’s DSP before converting the 100-gigabit signal to 4x25 gigabit electrical signals (see diagram).
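The relationship between bit rate and symbol rate described above can be sketched as follows. PAM-4 carries two bits per symbol, so 100 gigabits of payload needs 50 gigabaud; the “50-plus” reflects overhead such as forward-error correction. The 106.25Gbps line rate used in the second call is the 100GBASE-DR figure, which is not quoted in the article and is included only as an illustration.

```python
import math


def symbol_rate_gbaud(bit_rate_gbps: float, levels: int = 4) -> float:
    """Symbol rate needed to carry a bit rate with an n-level PAM signal."""
    bits_per_symbol = math.log2(levels)
    return bit_rate_gbps / bits_per_symbol


payload_gbps = 4 * 25  # four 25-gigabit electrical input lanes
print(symbol_rate_gbaud(payload_gbps))         # payload only, PAM-4
print(symbol_rate_gbaud(106.25))               # with FEC overhead (assumed)
print(symbol_rate_gbaud(25, levels=2))         # a 25-gigabit NRZ lane
```

Both PAM-4 results sit below the 58 gigabaud ceiling quoted for the MPF3101.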
“If you need now only one laser and one optical path [for 100 gigabits] instead of four [25 gigabits optical paths], that creates a significant cost reduction,” says Shabtai.
The advent of a single-wavelength 100-gigabit module promises several advantages to the industry. One is lower cost. The estimates MultiPhy is hearing are that a single-wavelength 100-gigabit module will be half the cost of existing 4x25-gigabit optical modules. Such modules will also enable higher-capacity switches as well as 100-gigabit breakout channels when connected to a 400-gigabit four-wavelength module. Lastly, MultiPhy expects the overall power consumption to be less.
Availability
MultiPhy says the first 100-gigabit single-wavelength QSFP28s will appear sometime in 2018.
The company is being coy as to when it will have a 400-gigabit PAM-4 chip but it points out that by having working MPF3101 silicon, it is now an integration issue to deliver a 4-channel 400-gigabit design.
As for the overall market, new high-capacity switches using 400-gigabit modules will start to appear next year. The sooner four-channel 400-gigabit PAM-4 silicon and optical modules appear, the less opportunity there will be for eight-wavelength 400-gigabit designs to gain a market foothold.
“That is the race we are in,” says Shabtai.
Inphi unveils a second 400G PAM-4 IC family
Inphi has announced the Vega family of 4-level, pulse-amplitude modulation (PAM-4) chips for 400-gigabit interfaces.
The 16nm CMOS Vega IC family is designed for enterprise line cards and is Inphi’s second family of 400-gigabit chips that support eight lanes of 50-gigabit PAM-4.
Its first 8x50-gigabit family, dubbed Polaris, is used within 400-gigabit optical modules and was announced at the OFC show held in Los Angeles in March.
“Polaris is a stripped-down low-power DSP targeted at optical module applications,” says Siddharth Sheth, senior vice president, networking interconnect at Inphi. “Vega, also eight by 50-gigabits, is aimed at enterprise OEMs for their line-card retimer and gearbox applications.”
A third Inphi 400-gigabit chip family, supporting four channels of 100-gigabit PAM-4 within optical modules, will be announced later this year or early next year.
400G PAM-4 drivers
Inphi’s PAM-4 chips have been developed in anticipation of the emergence of next-generation 6.4-terabit and 12.8-terabit switch silicon and accompanying 400-gigabit optical modules such as the OSFP and QSFP-DD form factors.
Sheth highlights Broadcom’s Tomahawk-III, start-up Innovium’s Teralynx and Mellanox’s Spectrum-2 switch silicon. All have 50-gigabit PAM-4 interfaces implemented using 25-gigabaud signalling and PAM-4 modulation.
“What is required is that such switch silicon is available and mature in order for us to deploy our PAM-4 products,” says Sheth. “Everything we are seeing suggests that the switch silicon will be available by the end of this year and will probably go into production by the end of next year.”
The other key product that needs to be available is the 400-gigabit optical modules. The industry is pursuing two main form factors: the OSFP and the QSFP-DD. Google and switch maker Arista Networks are proponents of the OSFP form factor while the likes of Amazon, Facebook and Cisco back the QSFP-DD. Google has said that it will initially use an 8x50-gigabit module implementation for 400 gigabit. Such a solution uses existing, mature 25-gigabit optics and will be available sooner than the more demanding 4x100-gigabit design that Amazon, Facebook and Cisco are waiting for. The 4x100 gigabit design requires 50Gbaud optics and a 50Gbaud PAM-4 chip.
Inphi says several optical module makers are starting to build 8x50-gigabit OSFP and QSFP-DD products and that its Polaris and Vega families of chips anticipate such deployments.
“We expect 100-gigabit optics to be available sometime around mid-2018 and our next-generation 100-gigabit PAM-4 will be available in the early part of next year,” says Sheth.
Accordingly, the combination of the switch silicon and optics means that the complete ecosystem will already exist next year, he says.
Vega
The Polaris chip, used within an optical module, equalises the optical non-linearities of the incoming 50-gigabit PAM-4 signals. The optical signal is created using 25-gigabit lasers that are modulated using a PAM-4 signal that encodes two bits per symbol. “When you run PAM-4 over fibre - whether multi-mode or single mode - the signal undergoes a lot of distortion,” says Sheth. “You need the DSP to clean up that distortion.”
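How PAM-4 packs two bits into each transmitted level can be shown with a small encoder and decoder. The sketch uses a Gray-coded mapping, so adjacent levels differ by one bit, and normalised levels of -3, -1, +1 and +3; the level values and function names are illustrative and say nothing about how the Polaris DSP is implemented.

```python
from typing import Dict, List, Tuple

# Gray-coded mapping: neighbouring amplitude levels differ in one bit only.
GRAY_PAM4: Dict[Tuple[int, int], int] = {
    (0, 0): -3,
    (0, 1): -1,
    (1, 1): 1,
    (1, 0): 3,
}
INVERSE_PAM4 = {level: bits for bits, level in GRAY_PAM4.items()}


def pam4_encode(bits: List[int]) -> List[int]:
    """Map a bit sequence to PAM-4 levels, two bits per symbol."""
    if len(bits) % 2:
        raise ValueError("PAM-4 needs an even number of bits")
    return [GRAY_PAM4[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]


def pam4_decode(symbols: List[int]) -> List[int]:
    """Recover the bit sequence from ideal (noise-free) PAM-4 levels."""
    bits: List[int] = []
    for symbol in symbols:
        bits.extend(INVERSE_PAM4[symbol])
    return bits


data = [0, 0, 0, 1, 1, 1, 1, 0]
symbols = pam4_encode(data)
print(symbols)               # eight bits become four symbols
print(pam4_decode(symbols) == data)
```

Because the four levels are closer together than the two levels of non-return-to-zero signalling, the received signal is more sensitive to distortion, which is the job the DSP addresses.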
The Vega chip, in contrast, sits on enterprise line cards and adds digital functionality that is not supported by the switch silicon. Most enterprise boxes support legacy data rates such as 10 gigabit and 1 gigabit. The Vega chip supports such legacy rates as well as 25, 50, 100, 200 and 400 gigabit, says Sheth.
The Vega chip can add forward-error correction to a data stream and decode it. As well as FEC, the chip also has physical coding sublayer (PCS) functionality. “Every time you need to encode a signal with FEC or decode it, you need to unravel the Ethernet data stream and then reassemble it,” says Sheth.
Also on-chip is a crossbar that can switch any lane to any other lane before feeding the data to the switch silicon.
Sheth stresses that not all switch chip applications need the Vega. For large-scale data centre applications that use stripped-down systems, the optical module would feed the PAM-4 signal directly into the switch silicon, requiring the use of the Polaris chip only.
A second role for Vega is driving PAM-4 signals across a system. “If you want to drive 50-gigabit PAM-4 signals electrically across a system line card and noisy backplane then you need a chip like Vega,” says Sheth.
A further application for the Vega chip is as a ‘gearbox’, converting between 50-gigabit and 25-gigabit line rates. Once high-capacity switch silicon with 50G PAM-4 signals is deployed, the Vega chip will enable the conversion between 50-gigabit PAM-4 and 25-gigabit non-return-to-zero (NRZ) signals. System vendors will then be able to interface 100-gigabit (4x25-gigabit) QSFP28 modules with these new switch chips.
One hundred gigabit modules will be deployed for at least another three to four years while the price of such modules has come down significantly. “For a lot of the cloud players it comes down to cost: are 128-ports at 100-gigabit cheaper than 32, 400-gigabit modules?” says Sheth. The company says it is seeing a lot of interest in this application.
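The two port configurations Sheth compares offer the same total capacity, and the gearbox role comes down to a two-for-one lane conversion. A minimal sketch of both pieces of arithmetic, with illustrative function names:

```python
def total_capacity_tbps(ports: int, port_rate_gbps: int) -> float:
    """Aggregate switch capacity in terabits per second."""
    return ports * port_rate_gbps / 1000


def nrz_lanes_from_pam4(pam4_lanes: int,
                        pam4_rate_gbps: int = 50,
                        nrz_rate_gbps: int = 25) -> int:
    """NRZ lanes a gearbox produces from a number of PAM-4 lanes."""
    return pam4_lanes * pam4_rate_gbps // nrz_rate_gbps


print(total_capacity_tbps(128, 100))   # 128 ports of 100 gigabit
print(total_capacity_tbps(32, 400))    # 32 ports of 400 gigabit
# Two 50-gigabit PAM-4 switch lanes feed the four 25-gigabit NRZ lanes
# of one QSFP28 module.
print(nrz_lanes_from_pam4(2))
```

With capacity equal at 12.8 terabits either way, the choice between the two configurations turns on module cost, as the quote suggests.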
Availability
Inphi has announced two Vega chips: a 400-gigabit gearbox and a 400-gigabit retimer and gearbox IC. “We are sampling,” says Sheth. “We have got customers running traffic on their line cards.” General availability is expected in the first quarter of 2018.
As for the 4x100-gigabit PAM-4 chips, Sheth expects solutions to appear in the first half of next year: “We have to see how mature the optics are at that point and whether something can go into production in 2018.”
Inphi maintains that the 8x50-gigabit optical module solutions will go to market first and that the 4x100-gigabit variants will appear a year later. “If you look at our schedules, Polaris and the 4x100-gigabit PAM-4 chip are one year apart,” he says.
Cavium broadens its Xpliant switch-chip offerings
- Two families of Xpliant switch chips have been unveiled: the XP60 with sub-terabit switching capacities and the mid-range XP70 devices with 1 to 1.8 terabits of capacity.
- The switch ICs broaden the datacom and telecom markets Cavium can now address.
- Cavium is developing a next-generation high-end switch chip but the company is not saying when it will be announced.
Cavium has broadened its portfolio of switch chips. The two families - the XP60 and the XP70 - have smaller switch capacities than Cavium’s XP80 Xpliant family and feature architectural enhancements.
“The new chips expand Cavium’s addressable markets to include enterprise and carrier-access networks as well as mainstream cloud data centres,” says Bob Wheeler, principal analyst for networking at The Linley Group.
John Harrsen
The switch chips enable Cavium to address 25-gigabit interface switches, power-constrained enclosure designs such as blade servers, and 5G cloud radio access networks (CRAN) and GPON aggregation.
Until now Cavium has offered three XP80 Xpliant switch ICs, the largest being a 3.2-terabit switch. In contrast, the three XP70 devices have switch capacities of 1, 1.4 and 1.8 terabits while the XP60’s three chips have 280, 560 and 720 gigabits of capacity.
“The vast majority of the spend in this market is still the mid-tier; it is not all at the high end,” says John Harrsen, marketing director, switch platform group at Cavium.
Cavium stresses the importance of offering a broad portfolio of switch devices given the high development cost of software for systems vendors. Porting a vendor’s network operating system onto the switch chip is a $5 million to $10 million undertaking, he says: “Customers will not invest in software which is a point solution; it is too damn expensive.”
Programmability enhancements
The Linley Group’s Wheeler points out that traditional Ethernet switch chips are not programmable and that Cavium was the first to production with a programmable switch chip. “Barefoot Networks is the only competitor with a similar level of programmability,” says Wheeler. “So the Xpliant chips are attractive to customers that want to implement custom features or protocols.”
The XP60 and XP70 remain code-compatible with the XP80 devices but the programming model has been enhanced based on three years of experience gained from customers programming the Xpliant architecture.
“You look at how the functionality wanted by a customer gets distributed across the hardware primitives that exist in the switching pipeline,” says Harrsen. “That data and experience are then fed back to the architects that start tinkering with the architecture to make it easier to use and manage.”
Cavium’s switch chips do not use an instruction set because it does not deliver the performance needed by a switch chip, says Harrsen. Instead, a combination of a very long instruction word (VLIW) parallel architecture and look-up tables is used for the programming. “We have primitives dedicated for certain functions that have parameters that can be programmed,” says Harrsen.
One example is parsing packets where the offset into the packet can be programmed. Another is the seed used for a cyclic-redundancy check (CRC) engine used to check packets. Cavium uses a C-like high-level language to program its chips.
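The idea of a fixed primitive with programmable parameters can be illustrated in software. The Python sketch below is not Cavium's programming model or its C-like language; it assumes two hypothetical helpers, one that extracts a field at a configurable offset and one that runs a CRC-32 from a configurable starting value using the standard library's `zlib.crc32`.

```python
import zlib


def extract_field(packet: bytes, offset: int, length: int) -> bytes:
    """Return `length` bytes starting at a programmable `offset`."""
    if offset < 0 or length < 0 or offset + length > len(packet):
        raise ValueError("field lies outside the packet")
    return packet[offset:offset + length]


def crc_with_seed(payload: bytes, seed: int = 0) -> int:
    """Compute CRC-32 over `payload`, starting from a programmable seed.

    zlib.crc32 takes the starting checksum value as its second argument,
    which stands in here for the seed parameter of a hardware CRC engine.
    """
    return zlib.crc32(payload, seed)


packet = bytes(range(32))

# The same parsing primitive serves different header layouts simply by
# changing the offset parameter.
print(extract_field(packet, offset=12, length=2).hex())
print(extract_field(packet, offset=16, length=4).hex())

# The same CRC primitive produces a different check value per seed.
print(hex(crc_with_seed(packet, seed=0)))
print(hex(crc_with_seed(packet, seed=0xFFFF)))
```

The function bodies never change; only the parameters do, which is the sense in which such primitives are programmable without being general-purpose instructions.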
The flexibility of a programmable architecture is also reflected in the ability to support extensible protocols. Such protocols feature a type-length-value field that allows changes to be made to a protocol; in effect, the protocol header can morph into different things.
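A type-length-value (TLV) field is simple to walk in software, which shows why such headers can grow new options without breaking older parsers. The sketch below assumes a one-byte type and a one-byte length, a common but not universal layout; it is not taken from any specific protocol or from Cavium's tools.

```python
from typing import List, Tuple


def parse_tlvs(data: bytes) -> List[Tuple[int, bytes]]:
    """Split `data` into (type, value) pairs.

    Assumed layout: 1-byte type, 1-byte length, then `length` value bytes.
    """
    fields = []
    pos = 0
    while pos < len(data):
        if pos + 2 > len(data):
            raise ValueError("truncated TLV header")
        tlv_type, length = data[pos], data[pos + 1]
        pos += 2
        if pos + length > len(data):
            raise ValueError("truncated TLV value")
        fields.append((tlv_type, data[pos:pos + length]))
        pos += length
    return fields


# Two options: type 1 with a two-byte value, type 7 with no value.
options = bytes([1, 2, 0xAA, 0xBB, 7, 0])
for tlv_type, value in parse_tlvs(options):
    print(tlv_type, value.hex())
```

Because each entry carries its own length, a parser can skip a type it does not recognise and carry on.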
One such extensible protocol is segment routing which is gaining in popularity among data centre operators although it has yet to be deployed. “It is an example of a header that we did not anticipate ever supporting but having a programmable architecture, we can,” says Harrsen.
Segment routing enables data centre operators to differentiate between storage and compute traffic flows even before such traffic enters the network. This allows them to better allocate their networking resources to accommodate large (elephant) storage flows compared to shorter compute (mice) flows to avoid overburdening network resources. “This is something our architecture is very good at doing,” says Harrsen.
Being programmable also enables the switch silicon to support evolving network virtualisation protocols. “Customers are altering their virtualisation protocols and this requires a pretty quick switch upgrade cycle,” says Harrsen. “This is only capable of being implemented in a programmable switch; you do not need to spin silicon to upgrade the switch.”
The network virtualisation protocols include Virtual Extensible LAN (VXLAN), Network Virtualisation using Generic Routing Encapsulation (NVGRE), and the more recent Geneve. VXLAN, for example, allows Layer-2 frames to be tunnelled through a Layer-3 IP network as well as extending the number of virtual LANs that can be supported.
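VXLAN extends the number of virtual networks by using a 24-bit VXLAN Network Identifier (VNI) in place of the 12-bit VLAN ID. The Python sketch below packs and unpacks the 8-byte VXLAN header as laid out in RFC 7348; the function names are hypothetical and the code covers only the header, not the outer UDP/IP encapsulation.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag: the VNI field is valid


def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte header: flags, 24 reserved bits, VNI, 8 reserved bits."""
    if not 0 <= vni < (1 << 24):
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vni(header: bytes) -> int:
    """Extract the VNI from a VXLAN header, checking the valid-VNI flag."""
    flags_word, vni_word = struct.unpack("!II", header[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI flag not set")
    return vni_word >> 8


header = pack_vxlan_header(5000)
print(header.hex())
print(unpack_vni(header))

# 24 bits of VNI give about 16 million segments, against 4,094 usable VLAN IDs.
print(1 << 24)
```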
The programmable nature of the Xpliant chips also means they can support the P4 programming language. The latest version of the P4 language issued in late 2016 is much more generic than previous generations of the open-source language. The P4 language can be used to program functionality into smart network interface cards - another product line of Cavium after its acquisition of QLogic - as well as switches. Cavium is considering P4 as a viable candidate alongside its own C-like compiler for its chips.
Evolving requirements
The XP60 and XP70 switch chips also include new hardware to address emerging requirements.
Enterprises adopting a hybrid cloud model, where part of their data and applications is delivered by a cloud provider, require demanding security in the form of policy enforcement. “I now have multiple domains I have to secure against,” says Harrsen. “I can have a combination of security, quality of service and service-level agreement policies I need to enforce in the network.” This translates to more rules that need to be implemented in more places in the network.
Typically, a switch chip uses ternary content-addressable memory (TCAM) to determine how packets should be handled. Cavium has integrated a policy engine into the two new families. The policy engine is partly algorithmic-based and partly TCAM-based, resulting in a 6x-10x scaling advantage compared to the use of TCAM alone. Cavium has developed a set of hardware primitives such that the number of rules can be boosted without the incremental cost of adding more TCAM as the search engine.
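For readers unfamiliar with TCAM, the matching rule it implements can be modelled in a few lines. The Python sketch below is a software model of ternary matching only: a real TCAM compares all entries in parallel in hardware, and the sketch says nothing about Cavium's algorithmic policy engine, whose design the company has not detailed here. The rule set and action names are hypothetical.

```python
import ipaddress
from typing import List, NamedTuple, Optional


class TernaryRule(NamedTuple):
    value: int
    mask: int      # 1 bits must match; 0 bits are "don't care"
    action: str


def ip(addr: str) -> int:
    """Convert a dotted IPv4 address to an integer key."""
    return int(ipaddress.IPv4Address(addr))


def tcam_lookup(rules: List[TernaryRule], key: int) -> Optional[str]:
    """Return the action of the first (highest-priority) matching rule."""
    for rule in rules:
        if (key & rule.mask) == (rule.value & rule.mask):
            return rule.action
    return None


# Hypothetical policy, ordered from highest to lowest priority.
rules = [
    TernaryRule(ip("10.1.0.0"), ip("255.255.0.0"), "police"),
    TernaryRule(ip("10.0.0.0"), ip("255.0.0.0"), "permit"),
    TernaryRule(0, 0, "drop"),  # all bits "don't care": matches everything
]

for addr in ("10.1.2.3", "10.9.9.9", "192.168.0.1"):
    print(addr, tcam_lookup(rules, ip(addr)))
```

Each rule occupies one entry, so the number of policies is bounded by TCAM capacity, which is the cost the article says the policy engine is meant to reduce.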
Telemetry data has also been enhanced such that a switch chip can document how it is being used and expose data to analytics software that assesses how the network is being run and reallocates network resources as necessary. The chip can report how the packet queuing sub-system is behaving, for example, to identify congestion as well as the characteristics of the traffic the switch chip is encountering. “All this is associated with improving the performance of the data centre,” says Harrsen.
A programmable table controller has also been added to the chips to support denser tables. To understand why this is needed, Harrsen cites the use of containers as an alternative to virtual machines.
Virtual machines allow a server’s processor to be shared across multiple applications, each running their own operating system. A container is another way to virtualise the server’s processor resources but is ‘lighter’ than a virtual machine and does not use its own operating system. Accordingly, the server CPU can support more containers.
“There is a need for the switch chip to be able to identify a container which drives a need to have a denser table inside the chips,” says Harrsen. “We address that with the programmable table controller.”
The XP70 family supports 25 gigabit-per-second (Gbps) serialiser-deserialiser (serdes) interfaces while the XP60 supports 10Gbps serdes.
The XP60 family is targeted at enterprises that are upgrading their networks from Gigabit Ethernet (1000Base-T) to 10 Gigabit Ethernet (10Gbase-T). Enterprises still have a lot of Category 6 cabling deployed and are only now upgrading to 10Gbase-T. Cavium expects this market to grow over the next three years.
The XP70 addresses the build-out of 25Gbps, especially for top-of-rack switches. “The SFP+ and SFP28 [optical modules] are almost at the same price,” says Harrsen. “No one is building an SFP+ switch because they want to support 25-gigabits.” Cavium expects the market for 25-gigabit to grow substantially in the next five years.
Another market is enclosures, where the switch is embedded in the system. “They need a lower-power solution than the existing 3.2 terabit chip,” says Harrsen. The lower-power XP60 and XP70 devices meet such needs given the more limited airflow compared to a top-of-rack switch environment.
“Ethernet switches are embedded in various chassis-based systems including blade servers,” says Wheeler. “In a blade server, the switch resides on a special blade or module.”
The devices are also being aimed at emerging cloud RAN for 5G and for GPON aggregation. The optical line terminals (OLTs) of passive optical networks also use Ethernet backplanes, says Wheeler.
“To get into a 5G network, you are working on it now, even though it is not going to be deployed until 2019 or 2020,” says Harrsen. “We are doing proof-of-concepts with guys right now.”
Cavium says the XP60 and XP70 devices - implemented in 28nm CMOS, the same as its XP80 family - are now sampling. The devices were taped out in the first quarter of this year and are going into production in the coming weeks, says Harrsen.
High-end switch market
Harrsen describes the high-capacity switch chip market as an arms race, with companies like Broadcom and start-ups Barefoot Networks and Innovium chasing the large-scale data centre players with chips with switch capacities of 6.4 terabits and even 12.8 terabits. But Cavium claims only the hyper-scale data centre players are considering the very highest capacity chips, and they are only likely to be deployed in the next two years.
Cavium also points out that such players' resources for developing applications and infrastructure software are limited. They do not have the scale to multi-source switching sub-systems, says Harrsen. This benefits Broadcom, the incumbent, rather than the start-ups.
“The hyper-scale players have to have a long-term strategy to multi-source but this is not their actions right now,” he says. “They are running so fast and so hard just to keep up with what they have.”
“Targeting hyper-scale operators carries great risk because your whole business hinges on winning one of these big customers,” adds Wheeler. “It’s true that Broadcom remains dominant in these data centres at present.”
Cavium may have launched the XP60 and XP70 to expand its total available market but it says it is working on its next-generation high-end switch to follow its XP80 although it is not saying when it will be available.
“This market is incredibly competitive and there is a lot of jockeying around,” says Harrsen. “We are in development and we think we are going to have a very compelling offering when we do talk about a next-generation product.”


