II-VI’s VCSEL approach for co-packaged optics

Co-packaged optics was a central theme at this year’s OFC show, held in San Diego, but the solutions detailed primarily used single-mode lasers and fibre.
The firm II-VI is pursuing a co-packaged optics path based on vertical-cavity surface-emitting lasers (VCSELs) and multi-mode fibre, while also developing single-mode, silicon photonics-based co-packaged optics.
For multi-mode, VCSEL-based co-packaging, II-VI is working with IBM, a collaboration that started as part of a U.S. Advanced Research Projects Agency-Energy (ARPA-E) project to promote energy-saving technologies.
II-VI claims significant system benefits for VCSEL-based co-packaged optics: lower power, cost and latency when compared with pluggable optics.
The two key design decisions behind the power savings are the elimination of the retimer chip – an approach also known as a direct-drive or linear interface – and the use of VCSELs.
The approach – what II-VI calls shortwave co-packaged optics – integrates the VCSELs, chip and optics in the same package.
The design is being promoted as first augmenting pluggables and then, as co-packaged optics become established, becoming the predominant solution for system interconnect.
For every 10,000 QSFP-DD pluggable optical modules used by a supercomputer that are replaced with VCSEL-based co-packaged optics, the yearly electricity bill will be reduced by up to half a million dollars, estimate II-VI and IBM.
VCSEL technology
VCSELs are used for active optical cables and short-reach pluggables for up to 70m or 100m links.
VCSEL-based modules consume fewer watts and are cheaper than single-mode pluggables.
Several factors account for the lower cost, says Vipul Bhatt, vice president of marketing, datacom vertical at II-VI.
The VCSEL emits light vertically from its surface, simplifying laser-fibre alignment, and multi-mode fibre has a larger core than single-mode fibre.
“Having that perpendicular emission from the laser chip makes manufacturing easier,” says Bhatt. “And the device’s small size allows you to get many more per wafer than you can with edge-emitter lasers, benefitting cost.”
The smaller VCSEL also requires a lower drive current: the threshold current of a distributed feedback (DFB) laser used with single-mode fibre is 25-30mA, whereas it is 5-6mA for a VCSEL. “That saves power,” says Bhatt.
Fibre plant
Hyperscalers such as Google favour single-mode fibre for their data centres. Single-mode fibre supports longer reach transmissions, while Google sees its use as future-proofing its data centres for higher-speed transmissions.
Chinese firms Alibaba and Tencent use multi-mode fibre but also view single-mode fibre as desirable longer term.
Bhatt says he has been hearing arguments favouring single-mode fibre for years, yet VCSELs continue to advance in speed, from 25 to 50 to 100 gigabits per lane.
“VCSELs continue to lead in cost and power,” says Bhatt. “And the 100-gigabit-per-lane optical link has a long life ahead of it, not just for networking but machine learning and high-performance computing.”
II-VI says single-mode fibre and silicon photonics modules are suited for the historical IEEE and ITU markets of enterprise and transport where customers have longer-reach applications.
VCSELs are best suited for shorter reaches such as replacing copper interconnects in the data centre.
Copper interconnect reaches are shrinking as interface speeds increase, and a cost-effective optical solution is needed to support short and intermediate spans of up to 70 meters.
“As we look to displace copper, we’re looking at 20 meters, 10 meters, or potentially down to three-meter links using active optical cables instead of copper,” says Bhatt. “This is where the power consumption and cost of VCSELs can be an acceptable premium to copper interconnects today, whereas a jump to silicon photonics may be cost-prohibitive.”
Silicon photonics-based optical modules have higher internal optical losses but they deliver reaches of 2km and 10km.
“If all you’re doing is less than 100 meters, think of the incredible efficiency with which these few milliamps of current pumped into a VCSEL and the resulting light launched directly and efficiently into the fibre,” says Bhatt. “That’s an impressive cost and power saving.”
Applications
The bulk of VCSEL sales for the data centre are active optical cables and short-reach optical transceivers.
“Remember, not every data centre is a hyperscale data centre,” says Bhatt. “So it isn’t true that multi-mode is only for the server to top-of-rack switch links. Hyperscale data centres also have small clusters for artificial intelligence and machine learning.”
The 100m reach of VCSEL-based optics means it can span all three switching tiers in many data centres.
The currently envisioned 400-gigabit VCSEL modules are 400GBASE-SR8 and the 8-by-50Gbps 400G-SR4.2. Both use 50-gigabit VCSELs: 25 gigabaud devices with 4-level pulse amplitude modulation (PAM-4).
The 400GBASE-SR8 module requires 16 fibres, while the 400G-SR4.2, with its two-wavelength bidirectional design, has eight fibres.
The advent of 100-gigabit VCSELs (50 gigabaud with PAM-4) enables 800G-SR8, 400G-SR4 and 100G-SR1 interfaces. II-VI first demonstrated a 100-gigabit VCSEL at ECOC 2019, while 100-gigabit VCSEL-based modules are becoming commercially available this year.
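As a rough cross-check of how these variants relate, the short Python sketch below derives the per-lane rate, module capacity and fibre count from the symbol rate, PAM-4 level count and lane count quoted above. The helper functions are illustrative, and the 16-fibre figure it produces for 800G-SR8 follows from the SR8 naming convention rather than from anything stated here.

import math

def lane_rate_gbps(gigabaud: float, pam_levels: int) -> float:
    """Bit rate per lane = symbol rate x bits per symbol."""
    return gigabaud * math.log2(pam_levels)

def module(name: str, lanes: int, gigabaud: float, pam_levels: int,
           wavelengths_per_fibre: int) -> None:
    rate = lane_rate_gbps(gigabaud, pam_levels)   # e.g. 25 gigabaud PAM-4 -> 50 Gbps
    capacity = lanes * rate
    # Parallel duplex: one transmit and one receive fibre per lane.
    # BiDi: two wavelengths share each fibre, halving the fibre count.
    fibres = 2 * lanes // wavelengths_per_fibre
    print(f"{name}: {capacity:.0f} Gbps over {fibres} fibres "
          f"({lanes} lanes x {rate:.0f} Gbps)")

module("400GBASE-SR8", lanes=8, gigabaud=25, pam_levels=4, wavelengths_per_fibre=1)
module("400G-SR4.2",   lanes=8, gigabaud=25, pam_levels=4, wavelengths_per_fibre=2)
module("800G-SR8",     lanes=8, gigabaud=50, pam_levels=4, wavelengths_per_fibre=1)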
Terabit VCSEL MSA
The Terabit Bidirectional (BiDi) Multi-Source Agreement (MSA) created earlier this year is tasked with developing optical interfaces using 100-gigabit VCSELs.
The industry consortium will define an 800-gigabit interface over parallel multi-mode fibre, using the same four pairs of multi-mode fibre that support the 400-gigabit 400G-BD4.2 interface. It will also define a 1.6-terabit optical interface.
The MSA work will extend the parallel fibre infrastructure from legacy 40 gigabits to 1.6 terabits as data centres embrace 25.6-terabit and soon 51.2-terabit switches.
Founding Terabit BiDi MSA members include II-VI, Alibaba, Arista Networks, Broadcom, Cisco, CommScope, Dell Technologies, HGGenuine, Lumentum, MACOM and Marvell Technology.
200-gigabit lasers and parallelism
The first 200-gigabit electro-absorption modulated lasers (EMLs) were demonstrated at OFC ’22, while next-generation 200-gigabit directly modulated lasers (DMLs) are still in the lab.
When will 200-gigabit VCSELs arrive?
Bhatt says that while 200-gigabit VCSELs had been considered research-stage devices, recent industry interest has spurred VCSEL makers to accelerate their development timelines.
Bhatt repeats that VCSELs are best suited for optimised short-reach links.
“You have the luxury of making tradeoffs that longer-reach designs don’t have,” he says. “For example, you can go parallel: instead of N-by-200-gig lanes, it may be possible to use twice as many 100-gig lanes.”
VCSEL parallelism for short-reach interconnects is just what II-VI and IBM are doing with shortwave co-packaged optics.
Shortwave co-packaged optics
Computer architectures are undergoing significant change with the emergence of accelerator ICs for CPU offloading.
II-VI cites such developments as Nvidia’s Bluefield data processing units (DPUs) and the OpenCAPI Consortium, which is developing interface technology so that any microprocessor can talk to accelerator and I/O devices.
“We’re looking at how to provide a high-speed, low-latency fabric between compute resources for a cohesive fabric,” says Bhatt. The computational resources include processors and accelerators such as graphic processing units (GPUs) and field-programmable gate arrays (FPGAs).
II-VI claims that by using multi-mode optics, one can produce the lowest power consumption optical link feasible, tailored for very-short electrical link budgets.
The issue with pluggable modules is connecting them to the chip’s high-speed signals across the host printed circuit board (PCB).
“We’re paying a premium to have that electrical signal reach through,” says Bhatt. “And where most of the power consumption and cost are is those expensive chips that compensate these high-speed signals over those trace lengths on the PCB.”
Using shortwave co-packaged optics, the ASIC can be surrounded by VCSEL-based interfaces, reducing the electrical link budget from some 30cm for pluggables to links only 2-3cm long.
“We can eliminate those very expensive 5nm or 7nm ICs, saving money and power,” says Bhatt.
The advantages of shortwave co-packaged optics are better performance (a lower error rate) and lower latency (70-100ns), which is significant when connecting to pools of accelerators or memory.
“We can reduce the power from 15W for a QSFP-DD module down to 5W for a link of twice the capacity,” says Bhatt. “We are talking an 80 per cent reduction in power dissipation. Another important point is that when power capacity is finite, every watt saved in interconnects is a watt available to add more servers. And servers bring revenue.”
This is where the estimate of $0.4-0.5 million in yearly electricity savings for 10,000 optical interfaces comes from.
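The arithmetic behind such an estimate is straightforward, even though II-VI and IBM’s exact inputs are not public. A back-of-envelope Python sketch, using the quoted 15W and 5W figures and treating the electricity tariff and the facility (cooling and power-distribution) overhead as assumptions, lands in the same ballpark; the result is highly sensitive to those two assumed inputs.

# Back-of-envelope electricity saving. The 15W/5W module figures are quoted above;
# the tariff and facility overhead (PUE) are assumptions for illustration only.
pluggables        = 10_000    # QSFP-DD modules replaced
pluggable_power_w = 15.0      # per QSFP-DD module
cpo_power_w       = 5.0       # per co-packaged link of twice the capacity
cpo_links         = pluggables / 2        # same aggregate capacity

power_saved_kw   = (pluggables * pluggable_power_w - cpo_links * cpo_power_w) / 1_000
energy_saved_kwh = power_saved_kw * 8_760  # hours per year

price_per_kwh = 0.15   # assumed $/kWh
pue           = 2.0    # assumed cooling and distribution overhead factor

print(f"Power saved: {power_saved_kw:.0f} kW")
print(f"Yearly electricity saving: ${energy_saved_kwh * price_per_kwh * pue:,.0f}")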
The power savings arise from the VCSEL’s low drive current, the use of the OIF’s ultra short-reach (USR) electrical interface and the IBM processor driving the VCSEL directly, what is called a linear analogue electrical interface.
In the first co-packaged optics implementation, IBM and II-VI use non-return-to-zero (NRZ) signalling.
The shortwave co-packaged optics has a reach of 20m which enables the potential elimination of top-of-rack switches, further saving costs. (See diagram.)

II-VI sees co-packaged optics as initially augmenting pluggables. With next-generation architectures using 1.6-terabit OSFP-XD pluggables, 20 to 40 per cent of those ports are for sub-20m links.
“We could have 20 to 40 per cent of the switch box populated with shortwave co-packaged optics to provide those links,” says Bhatt.
The remaining ports could be direct-attached copper, longer-reach silicon-photonics modules, or VCSEL modules, providing the flexibility associated with pluggables.
“We think shortwave co-packaged optics augments pluggables by helping to reduce power and cost of next-generation architectures.”
This is the secret sauce of every hyperscaler. They don’t talk about what they’re doing regarding machine learning and their high-performance systems, but that’s where they strive to differentiate their architectures, he says.
Status
Work has now started on a second-generation shortwave design that will use PAM-4 signalling. “That is targeted as a proof-of-concept in the 2024 timeframe,” says Bhatt.
The second generation will enable a direct comparison in terms of power, speed and bandwidth with single-mode co-packaged optics designs.
Meanwhile, II-VI is marketing its first-phase NRZ-based design.
“Since it is an analogue front end, it’s truly rate agnostic,” says Bhatt. “So we’re pitching it as a low-latency, low-power bandwidth density solution for traditional 100-gigabit Ethernet.”
The design can also be used for next-generation PCI Express and CXL disaggregated designs.
II-VI says there is potential to recycle hyperscaler data centre equipment by adding state-of-the-art network fabric to enable pools of legacy processors. “This technology delivers that,” says Bhatt.
But II-VI says the main focus is accelerator fabrics: proprietary interfaces such as NVLink, Fujitsu’s Tofu interconnect or HPE Cray’s Slingshot.
“At some point, memory pools or storage pools will also work their way into the hyperscalers’ data centres,” says Bhatt.
Silicon photonics adds off-chip comms to a RISC-V processor
"For the first time a system - a microprocessor - has been able to communicate with the external world using something other than electronics," says Vladimir Stojanovic, associate professor of electrical engineering and computer science at the University of California, Berkeley.
The microprocessor is the result of work that started at MIT nearly a decade ago as part of a project sponsored by the US Defense Advanced Research Projects Agency (DARPA) to investigate the integration of photonics and electronics for off-chip and even intra-chip communications.
The chip features a dual-core 1.65GHz RISC-V open instruction set processor and 1 megabyte of static RAM and integrates 70 million transistors and 850 optical components.
The work is also notable in that the optical components were developed without making any changes to an IBM 45nm CMOS process used to fabricate the processor. The researchers have demonstrated two of the processors communicating optically, with the RISC core on one chip reading and writing to the memory of the second device and executing programs such as image rendering.
This CMOS process approach to silicon photonics, dubbed 'zero-change' by the researchers, differs from that of the optical industry. So far silicon photonics players have customised CMOS processes to improve the optical components' performance. Many companies also develop the silicon photonics separately, using a trailing-edge 130nm or 90nm CMOS process while implementing the driver electronics on a separate chip using more advanced CMOS. That is because photonic devices such as a Mach-Zehnder modulator are relatively large and waste expensive silicon real-estate if implemented using a leading-edge process.
IBM is one player that has developed the electronics and optics on one chip using a 90nm CMOS process. However, the company says that the electronics use feature sizes closer to 65nm to achieve electrical speeds of 25 gigabit-per-second (Gbps) and that, the process being a custom one, 50-gigabit rates will only be possible using 4-level pulse amplitude modulation (PAM-4).
"Our approach is that photonics is sort of like a second-class citizen to transistors but it is still good enough," says Stojanovic. This way, photonics can be part of an advanced CMOS process.
Pursuing a zero-change process was initially met with scepticism and took the researchers significant work to develop. "People thought that making no changes to the process would be super-restrictive and lead to very poor [optical] device performance," says Stojanovic. Indeed, the first designs produced didn't work. "We didn't understand the IBM process and the masks enough, or it [the etching] would strip off certain stuff we'd put on to block certain steps."
But the team slowly mastered the process, making simple optical devices before moving on to more complex designs. Now the team believes its building-block components such as its vertical grating couplers have leading-edge performance while its ring-resonator modulator is close to matching the optical performance of designs using custom CMOS processes.
"We are now reaping the benefits of this very precise process which others cannot do because they are operating at larger process nodes," says Stojanovic.
Silicon photonics design
The researchers use a micro ring-resonator for their modulator design. The ring-resonator is much smaller than a Mach-Zehnder design, at 10 microns in diameter. Stojanovic says the vertical grating couplers measure 10 to 20 microns, while the silicon waveguides are 0.5 microns across.
Photonic components are big relative to transistors, but for the links, it is the transistors that occupy more area than the photonics. "You can pack a lot of utilisation in a very small chip area," he says.
A key challenge with a micro ring-resonator is ensuring its stability. As the name implies, modulation of light occurs when the device is in resonance but this drifts with temperature, greatly impairing its performance.
Stojanovic cites how even the bit sequence can affect the modulator's temperature. "Given the microprocessor data is uncoded, you can have random bursts of zeros," he says. "When it [the modulator] drops the light, it self-heats: if it is modulating a [binary] zero it gets heated more than letting a one go through."
The researchers have had to develop circuitry that senses the bit-sequence pattern and counteracts the ring's self-heating. But the example also illustrates the advantage of combining photonics and electronics. "If you have a lot of transistors next to the modulator, it is much easier to tune it and make it work," says Stojanovic.
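The article does not detail the team's compensation circuitry, but the principle - sense the transmitted pattern and inject complementary heater power so the ring's total dissipation stays constant - can be illustrated with a toy feed-forward model in Python. Every parameter below is invented for illustration and is not device data.

import random

# Toy model of pattern-dependent self-heating in a ring modulator: a modulated
# '0' absorbs more optical power (more heating) than a '1'. Feed-forward
# compensation adds heater power so absorbed + heater is constant per bit.
HEAT_ZERO_MW, HEAT_ONE_MW = 0.8, 0.2   # assumed absorbed power per bit value
TAU_BITS = 200.0                       # assumed thermal time constant, in bit periods

def simulate(bits, compensate):
    temp, target, temps = 0.0, HEAT_ZERO_MW, []
    for b in bits:
        absorbed = HEAT_ONE_MW if b else HEAT_ZERO_MW
        heater = (target - absorbed) if compensate else 0.0
        # first-order thermal response toward the instantaneous dissipation
        temp += ((absorbed + heater) - temp) / TAU_BITS
        temps.append(temp)
    return temps

random.seed(0)
bits = [random.random() < 0.5 for _ in range(5_000)]
for mode in (False, True):
    t = simulate(bits, compensate=mode)
    drift = max(t[2_000:]) - min(t[2_000:])   # ignore the initial warm-up
    print(f"compensation={mode}: steady-state temperature drift ~ {drift:.4f} (arb. units)")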
A prototype set-up of the chip-to-chip interconnect using silicon photonics. Source: Vladimir Stojanovic
Demonstration
The team used two microprocessors - one CPU talking to the memory of the second chip 4m away. Two chips were used rather than one - going off-chip before returning - to prove that the communication was indeed optical since there is also an internal electrical bus on-chip linking the CPU and memory. "We wanted to demonstrate chip-to-chip because that is where we think the biggest bang for the buck is," says Stojanovic.
In the demonstration, a single laser operating at 1,183nm feeds the two paths linking the memory and processor. Each link runs at 2.5Gbps for a total bandwidth of 5Gbps. However, the microprocessor was clocked at one-eightieth of its 1.65GHz clock speed because only one wavelength was used to carry data. The microprocessor design can support 11 wavelengths for a total bandwidth of 55Gbps, while the silicon photonics technology itself will support between 16 and 32 wavelengths overall.
The group is already lab-testing a new iteration of the chip that promises to run the processor at full speed. The latest chip also features improved optical functions. "It has better devices all over the place: better modulators, photo-detectors and gratings; it keeps evolving," says Stojanovic.
Ayar Labs
Ayar Labs is a start-up, still in stealth mode, established to use the zero-change silicon photonics technology to make interconnect chips for data centre platforms.
Stojanovic says the microprocessor demonstrator is an example of a product that is two generations beyond existing pluggable modules. Ayar Labs will focus on on-board optics, what he describes as the next generation of product. On-board optics sit on a card, close to the chip. Optics integrated within the chip will eventually be needed, he says, but only once applications require greater bandwidth and denser interfaces.
"One of the nice things is that this technology is malleable; it can be put in various form factors to satisfy different connectivity applications," says Stojanovic.
What Ayar Labs aims to do is replace the QSFP pluggable modules on the face plate of a switch with one chip next to the switch silicon that can have a capacity of 3.2 terabits. "We can ship that kind of bandwidth from a single chip," says Stojanovic.
Such a chip promises a cost reduction, given that a large part of the cost of an optical design is in the packaging. Here, 32 100 Gigabit Ethernet QSFP modules can be replaced with a single optical module based on the chip. "That cost reduction is the key to enabling deeper penetration of photonics, and has been a barrier for silicon photonics [volumes] to ramp," says Stojanovic.
There is also the issue of how to couple the laser to the silicon photonics chip. Stojanovic says such high-bandwidth interface ICs require multiple lasers: "You definitely don't want hundreds of lasers flip-chipped on top [of the optical chip], you have to have a different approach".
Ayar Labs has not detailed what it is doing, but Stojanovic says its approach is more radical than simply sharing one laser across a few links. "Think about the laser as the power supply to the box, or maybe a few racks," he says.
The start-up is also exploring using standard polycrystalline silicon rather than the more specialist silicon-on-insulator wafers.
"Poly-silicon is much more lossy, so we have had to do special tricks in that process to make it less so," says Stojanovic. The result is that changes are needed to be made to the process; this will not be a zero-change process. But Stojanovic says the changes are few in number and relatively simple, and that it has already been shown to work.
Having such a process available would allow photonics to be added to transistors made using the most advanced CMOS processes - 16nm and even 7nm. "Then silicon-on-insulator becomes redundant; that is our end goal,” says Stojanovic.
Further information
Single-chip microprocessor that communicates directly using light, Nature, Volume 528, 24-31 December 2015
Silicon photonics: "The excitement has gone"
The opinion of industry analysts regarding silicon photonics is mixed at best. More silicon photonics products are shipping but challenges remain.
Part 1: An analyst perspective
"The excitement has gone,” says Vladimir Kozlov, CEO of LightCounting Market Research. “Now it is the long hard work to deliver products.”
However, he is less concerned about recent setbacks and slippages for companies such as Intel that are developing silicon photonics products. This is to be expected, he says, as happens with all emerging technologies.
Mark Lutkowitz, principal at consultancy fibeReality, is more circumspect. “As a general rule, the more that reality sets in, the less impressive silicon photonics gets to be,” he says. “The physics is just hard; light is not naturally inclined to work on the silicon the way electronics does.”
LightCounting, which tracks optical components and modules, says silicon photonics products are now shipping in volume. The market research firm cites Cisco’s CPAK transceivers, and 40 gigabit PSM4 modules shipping in excess of 100,000 units, as examples. Six companies now offer 40 gigabit PSM4 products, with Luxtera, a silicon photonics player, having a healthy head start on the other five.
LightCounting also cites Acacia with its silicon photonics-based low-power 100 and 400 gigabit coherent modules. “At OFC, Acacia made a fairly compelling case, but how much of its modules’ optical performance is down to silicon photonics and how much is down to its advanced coherent DSP chip is unclear,” says Dale Murray, principal analyst at LightCounting. Silicon photonics has not shown itself to be the overwhelming solution for metro/regional and long-haul networks to date, but that could change, he says.
Another trend LightCounting notes is how PAM-4 modulation is becoming adopted within standards. PAM-4 modulates two bits of data per symbol and has been adopted for the emerging 400 Gigabit Ethernet standard. Silicon photonics modulators work really well with PAM-4 and getting it into standards benefits the technology, says LightCounting. “All standards were developed around indium phosphide and gallium arsenide technologies until now,” says Kozlov.
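The two-bits-per-symbol idea takes only a few lines of Python to show; the Gray-coded level mapping below is a common choice and is given as an illustration rather than taken from any particular standard.

# PAM-4: two bits per symbol, so the symbol (baud) rate is half the bit rate.
# Gray-coded mapping of bit pairs to the four amplitude levels (a common choice).
GRAY_MAP = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}

def pam4_encode(bits):
    assert len(bits) % 2 == 0
    return [GRAY_MAP[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]

print(pam4_encode([1, 0, 0, 1, 1, 1, 0, 0]))   # four symbols carry eight bits
# Hence a 25 gigabaud PAM-4 lane carries 50 Gbps, versus 25 Gbps with NRZ.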
Silicon photonics has been tainted by the amount of hype it has received in recent years, says Murray, especially the claim that optical products made in a CMOS fabrication plant will be significantly cheaper than traditional III-V-based optical components.
First, Murray highlights that no CMOS production line can make photonic devices without adaptation. “And how many wafer starts are there for the whole industry? How much does a [CMOS] wafer cost?” he says.
“You would be hard pressed to find a lot of OEMs or systems integrators that talk about silicon photonics and what impact it is going to have,” says Lutkowitz. “To me, that has always said everything.”
LightCounting highlights heterogeneous integration as one promising avenue for silicon photonics. Heterogeneous integration involves bonding III-V and silicon wafers before processing the two.
This hybrid approach uses the III-V materials for the active components while benefitting from silicon’s larger (300 mm) wafer sizes and advanced manufacturing techniques.
Such an approach avoids the need to attach and align an external discrete laser. “If that can be integrated into a WDM design, then you have got the potential to realise the dream of silicon photonics,” says Murray. “But it’s not quite there yet.”
Murray says over 30 vendors now make modules at 40 gigabit and above: “There are numerous module types and more are being added all the time.” Then there is silicon photonics, which splits the product pie further. This poses a real challenge for silicon photonics: it will only achieve low cost if there are sufficient volumes, but without such volumes it will not achieve a cost differential.
“Indium phosphide and other technologies will not step back and give silicon photonics a free ride, and are going to fight it,” says Kozlov. Nor is it just VCSELs that are made in high volumes.
LightCounting expects over 100 million indium phosphide transceivers to ship this year. Many of these transceivers use distributed feedback (DFB) lasers and many are at 10 gigabit and are inexpensive, says Kozlov.
For FTTx and GPON, bi-directional optical subassemblies (BOSAs) now cost $9, he says: “How much lower cost can you get?”
IBM demos a 100 Gigabit silicon photonics transceiver
“It is a demonstration vehicle illustrating the complex design capabilities of the technology and the functionality of the optical and electrical components,” says Will Green, manager of IBM’s silicon integrated nano-photonics group.
IBM has been developing silicon photonics technology for over a decade, starting with building-block optical functions based on silicon, to its current monolithic system-on-chip technology that includes design tools, testing and packaging technologies.
Now this technology is nearing commercialisation.
“We do plan to have the technology available for use within IBM’s systems but also within the larger market; large-volume applications such as the data centre and hyper-scale data centres in particular,” says Green.
IBM is already working with companies developing their own optical component designs using its technology and design tools. “These are tools that circuit designers are familiar with, such that they do not need to have an in-depth knowledge of photonics in order to build, for example, an optical transceiver,” says Green.
100 gig demonstrator
IBM refers to its silicon photonics technology as CMOS-integrated nano-photonics. CMOS-integrated refers to the technology’s monolithic nature that combines CMOS electronics with photonics on one substrate. Nano-photonics highlights the dimensions of the feature sizes used.
IBM is rare among the silicon photonics community in combining electronics and photonics on one chip; other players implement photonics and electronics on separate dies before combining the two. What is not included is the laser, which is externally attached using fibre.
The platform supports 25 gigabit speeds as well as wavelength division multiplexing. Originally, IBM started with 90 nm CMOS using bulk silicon before transferring to a silicon-on-insulator (SOI) substrate. An SOI wafer is ideal for creating optical waveguides that confine light using the large refractive index difference between silicon and silicon dioxide. However, to make the electrical devices run at 25 gigabit, the resulting transistor gate length ended up being closer to a 65 nm CMOS process.
IBM's optical waveguides are sub-micron, having dimensions of a few hundred nanometers. This is the middle ground, says Green, trading off the density of smaller-dimensioned waveguides with larger, micron-plus ones that deliver low propagation loss.
Also used are sub-wavelength optical 'metamaterial' structures that transition between the refractive index of the fibre and that of the optical waveguide to achieve a good match between the two. “These very tiny sub-wavelength structures are made using lithography near the limits of what is available,” says Green. “We are engineering the optical properties of the waveguide in order to achieve a low insertion loss when bringing the fibre onto the chip.” The single-mode fibre is attached to the chip using passive alignment.
The 100 gigabit transceiver demonstrator uses four 25 gigabit coarse wavelengths around 1310 nm. The technology is suited to implement the CWDM4 MSA.
“We are working with four wavelengths today but in the same way as telecom uses many wavelengths, we can follow a similar path,” says Green.
The chip design features transmitter electronics - a series of amplifiers that boost the voltage to drive the Mach-Zehnder interferometer modulators - and a multiplexer to combine the four wavelengths onto the fibre. The receiver circuitry includes a demultiplexer, four photo-detectors, trans-impedance amplifiers and limiting amplifiers, says Green. What is lacking to make the 100 gigabit transceiver functional is a micro-controller, feedback loops to control the temperature of key circuits, and the circuitry to interface to standard electrical input/output.
Green highlights how the bill of materials of a chip is only a fraction of the total cost since assembly and testing must also be included.
“We reduce the cost of assembly through automated passive optical alignment and the introduction of custom structures onto the wafer,” he says. “We believe we can make an impact on the cost structure of the optical transceiver and where this technology needs to be to access the data centre.” IBM has also developed a way to test the transceiver chips at the wafer level.
Green admits that the CMOS-integrated nano-photonics process will not scale beyond 25 gigabit, as the 90-65 nm CMOS cannot implement faster serial rates. But IBM has already shown an optical implementation of the PAM-4 modulation scheme that doubles a link's rate to 50 gigabit.
Meanwhile, IBM’s process design kit (PDK) is already with customers. A PDK comprises the documents and data files that describe the fabrication process and enable a user to complete a design: the fab’s process parameters, mask layout instructions, and the library of silicon photonics components - grating couplers, waveguides, modulators and the like [1].
“They [customers] have used the design kit provided by IBM but have built their own designs,” says Green. “And now they are testing hardware.”
IBM is keen that its silicon photonics technology will be licensed and used by circuit design houses. "Houses that bring their own IP [intellectual property], use the enablement tools and manufacture at a site that is licensing the technology from IBM,” says Green. "The whole technology is available to be commercialised by any chip manufacturer.”
Reference
[1] Silicon Photonics Design: From Devices to Systems, Lukas Chrostowski and Michael Hochberg, Cambridge University Press, 2015.
Boosting high-performance computing with optics
Part 2: High-performance computing
IBM has adopted optical interfaces for its latest POWER7-based high-end computer system. Gazettabyte spoke to IBM Fellow, Ed Seminaro, about high-performance computing and the need for optics to address bandwidth and latency requirements.
IBM has used parallel optics for its latest POWER7 computing systems, the Power 775. The optical interfaces are used to connect computing node drawers that make up the high-end computer. Each node comprises 32 POWER7 chips, with each chip hosting eight processor cores, each capable of running up to four separate programming tasks or threads.
Using optical engines, each node – a specialised computing card – has 224 120-Gigabit-per-second (12x10Gbps) VCSEL-based transmitters and 224 120Gbps receivers. The interfaces can interconnect up to 2,048 nodes, over half a million POWER7 cores, with a maximum network diameter of only three link hops.
IBM claims that with the development of the Power 775, it has demonstrated the superiority of optics over copper for high-end computing designs.
High-performance computing
Not so long ago, supercomputers were designed using exotic custom technologies. Each company crafted its own RISC microprocessor that required specialised packaging, interconnect and cooling. Nowadays supercomputers are more likely to be made up of aggregated servers – computing nodes – connected by a high-performance switching fabric. Software then ties the nodes together so that they appear to the user as a single computer.
But clever processor design is still required to meet new computing demands and steal a march on the competition, as are ever-faster links – interconnect bandwidth - to connect the nodes and satisfy their growing data transfer requirements.
High-performance computing (HPC) is another term used for state-of-the-art computing systems, and comes in many flavours and deployments, says Ed Seminaro, IBM Fellow, power systems development in the IBM Systems & Technology Group.
“All it means is that you have a compute-intensive workload – or a workload combining compute and I/O [input-output] intensive aspects," says Seminaro. "These occur in the scientific and technical computing world, and are increasingly being seen in business around large-scale analytics and so called ‘big data’ problem sets.”
Within the platform, the computer’s operating system runs on a processor or a group of processors connected using copper wire on a printed circuit board (PCB), typically a few inches apart, says Seminaro.
The processor hardware is commonly a two-socket server: two processor modules no more than 10 inches apart. The hardware can run a single copy of the operating system – known as an image - or many images.
Running one copy of the operating system, all the memory and all the processing resources are carefully managed, says Seminaro. Alternatively, an image can be broken into hundreds of pieces with a copy of the operating system running on each. “That is what virtualisation means,” says Seminaro. The advent of virtualisation has had a significant impact on the design of data centres and is a key enabler of cloud computing.
“The biggest you can build one of these [compute nodes] is 32 sockets – 32 processor chips - which may be as much as 256 processor cores - close enough that you can run them as what we call a single piece of hardware,” says Seminaro. But this is the current extreme, he says, the industry standard is two or four-socket servers.
That part is well understood, adds Seminaro; the challenge is connecting many of these hardware pieces into a tightly-coupled integrated system. This is where the system performance metrics of latency and bandwidth come to the fore, and why optical interfaces have become a key technology for HPC.
Latency and bandwidth
Two data transfer technologies are commonly used for HPC: Ethernet LAN and Infiniband. The two networking technologies are also defined by two important performance parameters: latency and bandwidth.
Using an Ethernet LAN for connectivity, the latency is relatively high when transferring data between two pieces of hardware. Latency is the time it takes before requested data starts to arrive. Normally when a process running on hardware accesses data from its local memory the latency is below 100ns. In contrast, accessing data between nodes can take more than 100x longer or over 10 microseconds.
For Infiniband, the latency between nodes can be under 1 microsecond, still 10x worse than a local transfer but more than 10x better than Ethernet. “Inevitably there is a middle ground somewhere between 1 and 100 microseconds depending on factors such as the [design of the software] IP stack,” says Seminaro.
If the amount of data requested is minor, the transfer itself typically takes nanoseconds. If a large file is requested, then not only is latency important – the time before asked-for data starts arriving – but also the bandwidth dictating overall file transfer times.
To highlight the impact of latency and bandwidth on data transfers, Seminaro cites the example of a node requesting data using a 1 Gigabit Ethernet (GbE) interface, equating to a 100MByte-per-second (MBps) transfer rate. The first bit of data requested by a node arrives after 100ns but a further second is needed before the 100MB file arrives.
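That example is simply the sum of two terms, the first-byte latency plus the serialisation time. A minimal Python sketch of the arithmetic (the helper is illustrative; the second call uses the multi-microsecond inter-node Ethernet latency mentioned earlier):

def transfer_time_s(file_bytes: float, latency_s: float, rate_bytes_per_s: float) -> float:
    """Time until the last byte arrives = first-byte latency + serialisation time."""
    return latency_s + file_bytes / rate_bytes_per_s

MB = 1e6
# 100 MB file over 1 GbE (~100 MB/s): the one-second serialisation time dwarfs the latency.
print(transfer_time_s(100 * MB, latency_s=100e-9, rate_bytes_per_s=100 * MB))   # ~1.0 s
# A tiny 1 kB request over the same link: the latency now dominates.
print(transfer_time_s(1e3, latency_s=10e-6, rate_bytes_per_s=100 * MB))         # ~20 microseconds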
A state-of-the-art Ethernet interface is 10GbE, says Seminaro: “A 4x QDR [quad data rate] Infiniband link is four times faster again [4x10Gbps].” The cost of 4x QDR Infiniband interconnect is roughly the same as for 10GbE, so most HPC systems use either 1GbE, for lowest-cost networking, or 4x QDR Infiniband, when interconnect performance is a more important consideration. Of the fastest 500 computing systems in the world, over 425 use either 1GbE or Infiniband; only 11 use 10GbE. The remainder use custom or proprietary interconnects, says IBM.
The issue is that going any distance at these speeds using copper interfaces is problematic. “At some point when you go a certain distance you have to go to an optical link,” says Seminaro. “With Gigabit Ethernet there is copper and fibre connectivity; with 10GbE the standard is really fibre connectivity to get any reasonable distance.”
Copper for 10GbE or QDR Infiniband can go 7m, and using active copper cable the reach can be extended to 15m. Beyond that it is optics.
The need for optics
Copper’s 7m reach places an upper limit on the number of computing units – each with 32 processor chips – that can be interconnected. “To go beyond that, I’m going to have to go optical,” says Seminaro.
But reach is not the sole issue. The I/O bandwidth associated with each node is also a factor. “If you want an enormous amount of bandwidth out of each of these [node units], it starts to get physically difficult to externalise from each that many copper cables,” says Seminaro.
Many data centre managers would be overjoyed to finally get rid of copper, adds Seminaro, but unfortunately optical costs more. This has meant people have pushed to keep copper alive, especially for smaller computing clusters.
People accept how much bandwidth they can get between nodes using technologies such as QDR linking two-socket servers, and then design the software around such performance. “They get the best technology and then go the next level and do the best with that,” says Seminaro. “But people are always looking how they can increase the bandwidth dramatically coming out of the node and also how they can make the node more computationally powerful.” Not only that, if the nodes are more powerful, fewer are needed to do a given job, he says.
What IBM has done
IBM’s Power 775 computer system is a sixth-generation design in a line that started in 2002. The Power 775 is currently being previewed and will be generally available in the second half of 2011, says IBM.
At its core is the POWER7 processor, described by Seminaro as highly flexible. The processor can tackle various problems, from commercial applications to high-performance computing, and can scale from a single processing node next to a desk to complete supercomputer configurations.
Applications the POWER7 is used for include large scale data analysis, automobile and aircraft design, weather prediction, and oil exploration, as well as multi-purpose computing systems for national research labs.
In the Power 775, as mentioned, each node has 32 chips comprising 256 cores, and each core can process four [programming] threads. “That is 1,024 threads – a lot of compute power,” says Seminaro, who stresses that the number of cores and the computing capability of each thread are important, as is the clock frequency at which they are run. These threads must access memory and are all tightly coupled.
“That is where it all starts: How much compute power can you cram in one of these units of electronics,” says Seminaro. The node design uses copper interconnect on a PCB and is placed in a water-cooled drawer to ensure a relatively low operating temperature, which improves power utilisation and system reliability.
“We have pulled all the stops out with this drawer,” says Seminaro. “It has the highest bandwidth available in a generally commercially available processor – we have several times the bandwidth of a typical computing platform at all levels of the interconnect hierarchy.”
To connect the computing nodes, or drawers, IBM uses optical interfaces to achieve a low-latency, high-bandwidth interconnect design. Each node uses 224 optical transceivers, with each transceiver consisting of an array of 12 send and 12 receive 10Gbps lanes. This equates to a total bandwidth per 2U-high node of 26.88+26.88 Terabits-per-second.
“That is equivalent to 2,688 10Gig Ethernet connections [each way],” says Seminaro. “Because we have so many links coming out of the drawer it allows us to connect a lot of drawers directly to each other.”
In a 128-drawer system, IBM has a sufficient number of ports and enough interconnect bandwidth to link each drawer to every one of the other 127. Using the switching capacity within the drawer, the Power 775 can be scaled further to build systems of up to 2,048 node drawers, with up to 524,288 POWER7 cores.
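Those headline numbers follow directly from the per-node figures, as a quick Python check shows (the variable names are mine; the values are the ones quoted above).

# Per-node optical bandwidth and system scale for the Power 775.
transceivers_per_node = 224
lanes_per_direction   = 12     # each transceiver has 12 send and 12 receive lanes
lane_rate_gbps        = 10

node_bw_tbps = transceivers_per_node * lanes_per_direction * lane_rate_gbps / 1_000
print(f"Per-node bandwidth: {node_bw_tbps:.2f} Tbps each way")                    # 26.88 Tbps
print(f"Equivalent 10GbE links: {transceivers_per_node * lanes_per_direction}")   # 2,688

chips_per_node, cores_per_chip, max_nodes = 32, 8, 2_048
print(f"Maximum cores: {chips_per_node * cores_per_chip * max_nodes:,}")          # 524,288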
IBM says one concern about using optics was cost. However, working with Avago Technologies, the supplier of the optical transceivers, it has been able to develop the optical-based systems cost-effectively (see the 'Parallel Optics' section within the OFC round-up story). “We have learned that we can do a very large-scale optical configuration cost effectively,” says Seminaro. “We had our doubts about that initially.”
IBM also had concerns about the power consumption of optics. “Copper is high-power but so is optics,” says Seminaro. “Again working with Avago we’ve been able to do this at reasonable power levels.” Even for very short 1m links the power consumption is reasonable, says IBM, and for longer reaches such as connecting widely-separated drawers in a large system, optical interconnect has a huge advantage, since the power required for an 80m link is the same as for a 1m link.
Reliability was also a concern given that optics is viewed as being less reliable than copper. “We have built a large amount of hardware now and we have achieved outstanding reliability,” says Seminaro.
IBM uses 10 out of the 12 lanes - two lanes are spare. If one lane should fail, one of the spare lanes is automatically configured to take its place. Such redundancy improves the failure rate metrics greatly and is needed in systems with a large number of optical interconnects, says Seminaro.
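To see why two spare lanes improve the failure-rate metrics so much, consider a toy model in which lanes fail independently and the link only goes down once three or more of its twelve lanes have failed. The per-lane failure probability below is an assumed, illustrative value, not an IBM figure.

from math import comb

def link_failure_prob(n_lanes: int, n_needed: int, p_lane: float) -> float:
    """Probability that fewer than n_needed lanes survive, assuming independent lane failures."""
    return sum(comb(n_lanes, k) * p_lane**k * (1 - p_lane)**(n_lanes - k)
               for k in range(n_lanes - n_needed + 1, n_lanes + 1))

p = 1e-3   # assumed per-lane failure probability over some service interval
print(f"no sparing (10 of 10): {link_failure_prob(10, 10, p):.2e}")   # ~1e-2
print(f"two spares (10 of 12): {link_failure_prob(12, 10, p):.2e}")   # ~2e-7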
IBM has also done much work to produce an integrated design, placing the optical interfaces close to its hub/switch chip and reducing the discrete components used. And in a future design it will use an optical transceiver that integrates the transmit and receive arrays. IBM also believes it can improve the integration of the VCSEL-drive circuitry and overall packaging.
What next?
For future systems, IBM is investigating increasing the data rate per channel to 20-26Gbps and has already designed the current system to be able to accommodate such rates.
What about bringing optics within the drawer for chip-to-chip and even on-chip communications?
“There is one disadvantage to using optics which is difficult to overcome and that is latency,” says Seminaro. “You will always have higher latency when you go optics and a longer time-of-flight than you have with copper.” That’s because converting from wider, slower electrical buses to narrower optical links at higher bit rate costs a few cycles on each end of the link.
Also, an optical signal in a fibre takes slightly longer to propagate, leading to a total increase in propagation delay of 1-5ns. “When you are within that drawer, especially when you are in some section of that drawer, say between four chips, the added latency and time-of-flight definitely hurts performance,” says Seminaro.
IBM does not rule out such use of optics in the future. However, in the current Power 775 system, using optical links to interconnect the four-chip processor clusters within a node drawer does not deliver any processing performance advantage, it says.
But as application demands rise, and as IBM’s chip and package technologies improve, the need for higher bandwidth interconnect will steadily increase. Optics within the drawer is only a matter of time.
Further reading
Part 1: Optical Interconnect: Fibre-to-the-FPGA
Get on the Optical Bus, IEEE Spectrum, October 2010.
Framing the information age
When writing features for FibreSystems Europe, I repeatedly asked for striking, high-resolution images. The magazine's editors always wanted photos that included people, like Maurice Broomfield's. Getting hold of such images did happen, but not often.
Inspired by the Financial Times’ interview and Maurice Broomfield's beautiful images, I present here some of the better images sent in.
IBM data centre
I’m on the look-out for more. So if you handle media relations for an operator, equipment maker, or optical transceiver or component (optical or IC) vendor, please send some inspiring photos - ideally with people - and I'll create a photo gallery of the best.
Network Operations Centre (NOC) Source: AT&T
Source: Cisco Systems
An Intel silicon photonics device
And here is an image of Tokyo's data centre on Flickr

