Ayar Labs and Intel add optical input-output to an FPGA

Start-up Ayar Labs, working with Intel, has interfaced its TeraPHY optical chiplet to the chip giant’s Stratix 10 FPGA.
Intel has teamed with several partners in addition to Ayar Labs for its FPGA-based silicon-in-package design, part of the US Defense Advanced Research Projects Agency’s (DARPA) project.
Ayar Labs used the Hot Chips conference, held in Palo Alto, California in August, to detail its first TeraPHY chiplet product and its interface to the high-end FPGA.
Origins
Ayar Labs was established to commercialise research that originated at MIT. The MIT team worked on integrating both photonics and electronics on a single die without changing the CMOS process.
The start-up has developed such building-block optical components in CMOS as a vertical coupler grating and a micro-ring resonator for modulation, while the electronic circuitry can be used to control and stabilise the ring resonator’s operation.
Ayar Labs has also developed an external laser source that can power up to 256 optical channels, each operating at 16, 25 or 32 gigabits-per-second (Gbps).
The company has two strategic investors: Intel Capital, the investment arm of Intel, and semiconductor firm GlobalFoundries.
The start-up received $24 million in funding late last year and has used the funding to open a second office in Santa Clara, California, and double its staff to about 40.
Markets
Ayar Labs has identified four markets for its silicon photonics technology.
The first is the military, aerospace and government market segment. Indeed, the Intel FPGA system-in-package is for a phased-array radar application.
Two further markets are high-performance computing and artificial intelligence, and telecommunications and the cloud.
Computer vision and advanced driver-assistance systems form the fourth market segment. Here, the start-up’s expertise in silicon photonics is used not for optical I/O but for a LIDAR sensor, says Hugo Saleh, Ayar Labs’ vice president of marketing and business development.
Stratix 10 system-in-package
The Intel phased-array radar system-in-package is designed to take in huge amounts of RF data that is down-converted and digitised using an RF chiplet. The data is then pre-processed on the FPGA and sent optically using Ayar Labs’ TeraPHY chiplets for further processing in the cloud.

“To digitise all that information you need multiple TeraPHY chiplets per FPGA to pull the information back into the cloud,” says Saleh. A single phased-array radar can use as many as 50,000 FPGAs.
Such a radar design can be applied to both civilian and military applications, where it can track tens of thousands of objects.
Moreover, it is not just FPGAs that the TeraPHY chiplet can be interfaced to.
Large aerospace companies developing flight control systems also develop their own ASICs. “Almost every single aerospace company we have talked to as part of our collaboration with Intel has said they have custom ASICs,” says Saleh. “They want to know how they can procure, package and test the chiplets and bring them to market.”
TeraPHY chiplet
Two Intel-developed technologies are used to interface the TeraPHY chiplet to the Stratix 10 FPGA.
The first is Intel’s Advanced Interface Bus (AIB), a parallel electrical interface technology. The second is the Embedded Multi-die Interconnect Bridge (EMIB), which supports the dense I/O needed to interface the main chip, in this case the FPGA, to a chiplet.
EMIB is a sliver of silicon designed to support I/O. The EMIBs are embedded in an organic substrate on which the dies sit; one is for each chiplet-FPGA interface. The EMIB supports various bump pitches to enable dense I/O connections.
Ayar Labs’ first TeraPHY product uses 24 AIB cells for its electrical interface. Each cell supports 20 channels, each operating at 2Gbps. The result is that each cell supports 40Gbps and the overall electrical bandwidth of the chiplet is 960 gigabits.
The TeraPHY’s optical interface uses 10 transmitter-receiver pairs, each pair supporting 8 optical channels that can operate at 16Gbps, 25Gbps or 32Gbps. The result is that the TeraPHY supports a total optical bandwidth ranging from 1.28Tbps to 2.56Tbps.
The optical bandwidth is deliberately higher than the electrical bandwidth, says Saleh: “Just because you have ten [transmit/ receive] macros on the die doesn’t mean you have to use all ten.”
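The bandwidth figures quoted above follow from simple multiplication, which the short Python sketch below reproduces. The constants come from the article; the `macros_needed` helper is an illustrative assumption that ignores any encoding or crossbar overhead, and is not a figure given by Ayar Labs.

```python
# Figures quoted in the article for the first TeraPHY product.
AIB_CELLS = 24
CHANNELS_PER_CELL = 20
GBPS_PER_AIB_CHANNEL = 2

OPTICAL_MACROS = 10        # transmitter-receiver pairs
CHANNELS_PER_MACRO = 8
OPTICAL_RATES_GBPS = (16, 25, 32)


def electrical_bandwidth_gbps():
    """Total AIB bandwidth: cells x channels per cell x rate per channel."""
    return AIB_CELLS * CHANNELS_PER_CELL * GBPS_PER_AIB_CHANNEL


def optical_bandwidth_gbps(rate_gbps, macros=OPTICAL_MACROS):
    """Total optical bandwidth for a given per-channel rate."""
    return macros * CHANNELS_PER_MACRO * rate_gbps


def macros_needed(rate_gbps):
    """Hypothetical helper: fewest macros whose raw optical bandwidth
    covers the electrical side (overheads ignored)."""
    per_macro = CHANNELS_PER_MACRO * rate_gbps
    return -(-electrical_bandwidth_gbps() // per_macro)  # ceiling division


if __name__ == "__main__":
    print("electrical:", electrical_bandwidth_gbps(), "Gbps")
    for rate in OPTICAL_RATES_GBPS:
        print(rate, "Gbps channels:", optical_bandwidth_gbps(rate), "Gbps optical,",
              macros_needed(rate), "macros cover the electrical interface")
```

Under these assumptions, fewer than ten macros are enough to carry the full 960Gbps electrical interface at any of the three channel rates, which is consistent with Saleh's point that not all ten need to be used.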
Also, the chiplet supports a crossbar switch that allows one-to-many connections such that an electrical channel can be sent out on more than one optical interface and vice versa.
For the Intel FPGA system-in-package, two TeraPHY chiplets are used, each supporting 16Gbps channels such that the chiplets’ total optical I/O is up to 5.12 terabits.
Ramifications
Saleh stresses the achievement in integrating optics in-package: “It is one thing to integrate a chiplet but photonics is tricky.”
Ayar Labs flip-chips its silicon and etches on the backside. “Besides all the hard work that goes into figuring how to do that, and keeping it hermetically sealed, you still have to escape light,” he says. “Escaping light out of the package that is intended to be high volume requires significant engineering work.” This required working very closely with Intel’s packaging department.
Now the challenge is to take the demonstrator chip to volume manufacturing.
Saleh also points to a more fundamental change that will need to take place with the advent of chip designs using optical I/O.
Over many years compute power in the form of advanced microprocessors that incorporate ever more CPU cores has doubled every two years or so. In contrast, I/O has advanced at a much slower pace – 5 or 10 per cent annually.
This has resulted in application software for high-performance computing being written to take this bandwidth-compute disparity into account, reducing the number of memory accesses and minimising I/O transactions.
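To see how quickly this disparity compounds, the sketch below converts "doubling every two years" into an annual rate and compares it with 5 and 10 per cent annual I/O growth. The growth rates are those quoted above; the 5- and 10-year horizons are arbitrary choices for illustration.

```python
def annual_rate_from_doubling(years_to_double):
    """Annual growth rate equivalent to doubling every `years_to_double` years."""
    return 2 ** (1 / years_to_double) - 1


def growth_factor(annual_rate, years):
    """Cumulative growth after compounding `annual_rate` for `years` years."""
    return (1 + annual_rate) ** years


compute_rate = annual_rate_from_doubling(2)  # roughly 41 per cent a year

if __name__ == "__main__":
    for years in (5, 10):  # illustrative horizons, not from the article
        compute = growth_factor(compute_rate, years)
        io_low = growth_factor(0.05, years)
        io_high = growth_factor(0.10, years)
        print(f"{years} years: compute x{compute:.1f}, "
              f"I/O x{io_low:.2f} to x{io_high:.2f}")
```

On these figures, a decade of growth multiplies compute by 32 while I/O grows by a factor of only about 1.6 to 2.6.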
“Software now has to be architected to take advantage of all this new performance and all this new bandwidth,” he says. “We are going to see tremendous gains in performance because of it.”
Ayar Labs says it is on schedule to deliver its first TeraPHY chiplet product in volume to lead customers by the second half of 2020.
FPGAs with 56-gigabit transceivers set for 2017
Xilinx demonstrated a 56-gigabit transceiver using 4-level pulse-amplitude modulation (PAM-4) at the recent OFC show. The 56-gigabit transceiver, also referred to as a serialiser-deserialiser (serdes), was shown working successfully over a backplane specified for 25-gigabit signalling only.
Xilinx's 56-gigabit serdes is implemented using a 16nm CMOS process node but the first FPGAs featuring the design will be made using a 7nm process. Gilles Garcia says the choice of 7nm CMOS is solely a business decision and not a technical one.
“Optical module [makers] will take another year to make something decent using PAM-4,” says Garcia, Xilinx's director of marketing and business development, wired communications. “Our 7nm FPGAs will follow very soon afterwards.”
The company is still to detail its next-generation FPGA family but says that it will include an FPGA capable of supporting 1.6 terabits of Optical Transport Network (OTN) traffic using 56-gigabit serdes only. At first glance that implies at least 28 PAM-4 transceivers on a chip, but OTN is a complex design that is logic-limited rather than I/O-limited, suggesting that the FPGA will feature more than 28 56-gigabit serdes.
Applications
Xilinx’s Virtex UltraScale and its latest UltraScale+ FPGA families feature 16-gigabit and 25-gigabit transceivers. Managing power consumption and maximising reach of the high-speed serdes are key challenges for its design engineers. Xilinx says it has 150 engineers for serdes design.
“Power is always a key challenge because as soon as you talk about 400-gigabit to 1-terabit per line card, you need to be cautious about the power your serdes will use,” says Garcia. He says the serdes need to adapt to the quality of the traces for backplane applications. Customers want serdes that will support 25 gigabit on existing 10-gigabit backplane equipment.
Xilinx describes its Virtex UltraScale as a 400-gigabit capable single-chip system supporting up to 104 serdes: 52 at 16 gigabit and 52 at 25 gigabit.
The UltraScale+ is rated as a 500-gigabit to 600-gigabit capable system, depending on the application. For example, the FPGA could support three, 200-gigabit OTN wavelengths, says Garcia.
Xilinx says the UltraScale+ reduces power consumption by 35% to 50% compared to the same designs implemented on the UltraScale. The Virtex UltraScale+ devices also feature dedicated hardware to implement RS-FEC, freeing up programmable logic for other uses. RS-FEC is used with multi-mode fibre or copper interconnects for error correction, says Xilinx. Six UltraScale+ FPGAs are available and the VU13P, not yet out, will feature up to 128 serdes, each capable of up to 32 gigabit.
The UltraScale and UltraScale+ FPGAs are being used in several telecom and datacom applications.
For telecom, 500-gigabit and 1-terabit OTN designs are an important market for the UltraScale FPGAs. Another use for the FPGA serdes is for backplane applications. “We don’t need retimers so customers can connect directly to the backplane at 25 gigabit, thereby saving space, power and cost,” says Garcia. Such backplane uses include OTN platforms and data centre interconnect systems.
The FPGA family’s 16-gigabit serdes are also being used in 10-gigabit PON and NG-PON2 systems. “When you have an 8-port or 16-port system, you need to have a dense serdes capability to drive the [PON optical line terminal’s] uplink,” says Garcia.
For data centre applications, the FPGAs are being employed in disaggregated storage systems that involve pooled storage devices. The result is many 16-gigabit and 25-gigabit streams accessing the storage while the links to the data centre and its servers are served using 100-gigabit links. The FPGA serdes are used to translate between the two domains (see diagram).
Source: Xilinx
For its next-generation 7nm FPGAs with 56-gigabit transceivers, Xilinx is already seeing demand for several applications.
Data centre uses include server-to-top-of-rack links as the large Internet providers look to move from 25-gigabit to 50- and 100-gigabit links. Another application is to connect adjacent buildings that make up a mega data centre, which can involve hundreds of 100-gigabit links. A third application is meeting the growing demands of disaggregated storage.
For telecom, the interest is being able to connect directly to new optical modules over 50-gigabit lanes, without the need for gearbox ICs.
Optical FPGAs
Altera, now part of Intel, developed an optical FPGA demonstrator that used co-packaged VCSELs for off-chip optical links. Since then Altera announced its Stratix 10 FPGAs that include connectivity tiles - transceiver logic co-packaged and linked with the FPGA using interposer technology.
Xilinx says it has studied the issue of optical I/O and that there is no technical reason why it can’t be done. But integrating optics in an FPGA raises a business issue, says Garcia: “Who is responsible for the yield? For the support?”
Garcia admits Xilinx could develop its own I/O designs using silicon photonics and then it would be responsible for the logic and the optics. “But this is not where we are seeing the business growing,” he says.
Altera’s 30 billion transistor FPGA
- The Stratix 10 features a routing architecture that doubles overall clock speed and core performance
- The programmable family supports the co-packaging of transceiver chips to enable custom FPGAs
- The Stratix 10 family supports up to 5.5 million logic elements
- Enhanced security features stop designs from being copied or tampered with
Altera has detailed its most powerful FPGA family to date. Two variants of the Stratix 10 family have been announced: 10 FPGAs and 10 system-on-chip (SoC) devices that include a quad-core 64-bit architecture Cortex-A53 ARM processor alongside the programmable logic. The ARM processor can be clocked at up to 1.5 GHz.
The Stratix 10 family is implemented using Intel’s 14nm FinFET process and supports up to 5.5 million logic elements. The largest device in Altera’s 20nm Arria family of FPGAs has 1.15 million logic elements, equating to 6.4 billion transistors. “Extrapolating, this gives a figure of some 30 billion transistors for the Stratix 10,” says Craig Davies, senior product marketing manager at Altera.
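The extrapolation can be reproduced with a simple proportional scaling, sketched below. It assumes the transistor count scales linearly with the number of logic elements, which is only an approximation since a device's transceivers, memory and hard blocks do not necessarily scale the same way.

```python
# Figures quoted in the article.
ARRIA_LOGIC_ELEMENTS = 1.15e6
ARRIA_TRANSISTORS = 6.4e9
STRATIX10_LOGIC_ELEMENTS = 5.5e6


def extrapolate_transistors(target_logic_elements,
                            ref_logic_elements=ARRIA_LOGIC_ELEMENTS,
                            ref_transistors=ARRIA_TRANSISTORS):
    """Scale a reference transistor count by the ratio of logic elements.

    Assumes transistors per logic element stay constant (a simplification).
    """
    return ref_transistors * target_logic_elements / ref_logic_elements


estimate = extrapolate_transistors(STRATIX10_LOGIC_ELEMENTS)

if __name__ == "__main__":
    print(f"Estimated Stratix 10 transistors: {estimate / 1e9:.1f} billion")
```

The result, a little over 30 billion, matches the "some 30 billion" figure in the quote.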
Altera's HyperFlex routing architecture. Shown (pointed to by the blue arrow) are the HyperFlex registers that sit at the junction of the interconnect traces. Also shown are the adaptive logic module blocks. Source: Altera.
The FPGA family uses a routing fabric, dubbed HyperFlex, to connect the logic blocks. HyperFlex is claimed to double the clock speed compared to designs implemented using Altera’s Stratix V devices, to achieve gigahertz rates. “Having that high level of performance allows us to get to 400 gigabit and one terabit OTN (Optical Transport Network) systems,” says Davies.
The FPGA company detailed the Stratix 10 a week after Intel announced its intention to acquire Altera for US $16.7 billion.
Altera is also introducing with the FPGA family what it refers to as heterogeneous 3D system packaging and integration. The technology enables a designer to customise the FPGA’s transceivers by co-packaging separate transceiver integrated circuits (ICs) alongside the FPGA.
Different line-rate transceivers can be supported to meet a design's requirements: 10, 28 or 56 gigabit-per-second (Gbps), for example. It also allows different protocols such as PCI Express (PCIe), and different modulation formats including optical interfaces. Altera has already demonstrated a prototype FPGA co-packaged with optical interfaces, while Intel is developing silicon photonics technology.
HyperFlex routing
The maximum speed at which an FPGA design can be clocked is determined by the speed of its logic and the time it takes to move data from one part of the chip to another. Increasingly, it is the routing fabric rather than the logic itself that dictates the total delay, says Davies.
This has led the designers of the Stratix 10 to develop the HyperFlex architecture that adds a register at each junction of the lines interconnecting the logic elements.
Altera first tackled routing delay a decade ago by redesigning the FPGA’s logic building block. Altera went from a 4-input look-up table logic building block to a more powerful 8-input one that includes output registers. Using the more complex logic element - the adaptive logic module (ALM) - simplifies the overall routing. “You are essentially removing one layer of routing from your system,” says Davies.
When an FPGA is programmed, a file is presented that dictates how the wires, and hence the device’s logic, are connected. The refinement with HyperFlex is that there are now registers at those locations where the switching between the traces occurs. A register can either be bypassed or used.
“It allows us to put the registers anywhere in the design, essentially placing them in an optimum place for a given route across the FPGA,” says Davies. The hyper-registers in the device's routing outnumber the standard registers in the ALM blocks by a factor of ten.
Using the registers, designers can introduce data pipelining to reduce overall delay and it is this pipelining, combined with the advanced 14nm CMOS process, that allows a design to run at gigahertz rates.
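The effect of enabling registers along a route can be illustrated with a toy model, sketched below. A route is treated as a chain of wire segments, a register may be enabled at any junction between segments, and the clock period is set by the slowest register-to-register stage. The segment delays are invented for illustration and are not Altera figures; register setup and clock-to-output overheads are ignored, and `min_clock_period_ps` is a hypothetical helper rather than part of any Altera tool.

```python
def min_clock_period_ps(segment_delays_ps, registers_used):
    """Smallest clock period when up to `registers_used` junction registers
    split a chain of segments into consecutive pipeline stages."""
    n = len(segment_delays_ps)
    stages = min(registers_used + 1, n)  # cannot have more stages than segments

    # prefix[i] is the total delay of the first i segments.
    prefix = [0]
    for delay in segment_delays_ps:
        prefix.append(prefix[-1] + delay)

    inf = float("inf")
    # best[k][i]: minimal worst-stage delay covering the first i segments
    # with exactly k non-empty stages.
    best = [[inf] * (n + 1) for _ in range(stages + 1)]
    best[0][0] = 0
    for k in range(1, stages + 1):
        for i in range(1, n + 1):
            for j in range(k - 1, i):
                candidate = max(best[k - 1][j], prefix[i] - prefix[j])
                if candidate < best[k][i]:
                    best[k][i] = candidate
    return best[stages][n]


# Hypothetical route made of six segments (delays in picoseconds).
ROUTE_PS = [600, 500, 700, 400, 600, 500]

if __name__ == "__main__":
    for registers in (0, 1, 2, 5):
        period = min_clock_period_ps(ROUTE_PS, registers)
        print(f"{registers} registers: period {period} ps, "
              f"max clock {1e6 / period:.0f} MHz")
```

In this model each enabled register adds a clock cycle of latency along the route, so the gain is in clock rate and throughput rather than in the time a single data item takes to cross the chip.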
“We have made the registers small but they add one or two percent to the total die area, but in return it gives us the ability to go to twice the performance,” says Davies. “That is a good trade-off.”
The biggest change in getting HyperFlex to work has been with the software tools, says Davies. HyperFlex and the associated tools have taken over three years to develop.
“This is a fundamental change,” says Davies. “It [HyperFlex] is relatively simple but it is key; and it is this that allows customers to get to this doubling of core performance.”
Applications
Altera says that over 100 customer designs have now been processed using the Stratix 10 development tools.
It cites as an example a current 400 gigabit design implemented using a Stratix V FPGA that requires a bus 1024-bits wide, clocked at 390MHz. The wide bus consumes considerable chip area and routing it to avoid congestion is non-trivial.
Porting the design to a Stratix 10 enables the bus to be clocked at 781MHz such that the bus width can be halved to 512 bits. “It reduces congestion, makes it easier to do timing closure and ship the design,” says Davies. “This is why we think Stratix 10 is so important for high-performance applications like OTN and data centres.” Timing closure refers to the tricky part of a design where the engineer may have to iterate to ensure that a design meets all the timing requirements.
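The trade between bus width and clock rate can be checked with a one-line calculation: raw throughput is the bus width multiplied by the clock frequency. The sketch below uses the figures quoted for the two designs; the function name is illustrative.

```python
def bus_throughput_gbps(width_bits, clock_mhz):
    """Raw bus throughput in Gbps: bits per cycle x cycles per second."""
    return width_bits * clock_mhz / 1000


stratix_v = bus_throughput_gbps(1024, 390)   # wide, slower bus
stratix_10 = bus_throughput_gbps(512, 781)   # half the width, twice the clock

if __name__ == "__main__":
    print(f"Stratix V:  {stratix_v:.2f} Gbps")
    print(f"Stratix 10: {stratix_10:.2f} Gbps")
```

Both configurations deliver just under 400Gbps, so doubling the clock allows the bus width, and with it the routing congestion, to be halved without losing throughput.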
In another example, a data centre design, a single Stratix 10 device can replace five Stratix V ICs on one card. The five FPGAs are clocked at 250MHz, run PCIe Gen2 x8 interfaces and DDR3 x72 memory clocked at 800MHz. Overall the power consumed is 120W. Using one Stratix 10 chip clocked at 500MHz, faster PCIe Gen3 x8 can be supported, as can a wider DDR3 x144 memory clocked at 1.2GHz, with only 44W consumed.
Loring Wirbel, senior analyst at The Linley Group, says that Altera’s insertion of pipelined registers to cut average trace lengths is unique.
“The more important question is, can the hyper-register topology regularly gain the type of advantages claimed?” says Wirbel. “The examples cited by Altera certainly suggest significant improvements in speed, density, power dissipation, but I want to see that in real-world designs.”
Connectivity tiles
Altera recognises that future FPGAs will support a variety of transceiver types. Not only are there different line speeds to be supported but also different modulation schemes. “You can’t build one transceiver that fits all of these requirements and even if you could, it would not be an optimised design,” says Davies.
Instead, Altera is exploiting Intel’s embedded multi-die interconnect bridge (EMIB) technology to interface the FPGA to the transceivers, which it dubs connectivity tiles. The bridge technology is embedded into the chip’s substrate and enables dense interconnect between the core FPGA and the transceiver IC.
Intel claims fewer wafer processing steps are required to make the EMIB compared to other 2.5D interposer processes. An interposer is an electrical design that provides connectivity. “This is a very simple ball-grid sort of interposer, nothing like the Xilinx interposer,” says Wirbel. “But it is lower cost and not intended for the wide range of applications that more advanced interposers use.”
Using this approach, a customer can add to their design the desired interface, including optical interfaces as well as electrical ones. “We are also looking at optical transceivers directly connected to the FPGA,” says Davies.
Wirbel says such links would simplify interfacing to OTN mappers, and data centre designs that use optical links between racks and for the top-of-rack switch.
“Intel wants to see a lot more use of optics directly on the server CPU board, something that the COBO Alliance agrees with in part, and they may steer the on-chip TOSA/ ROSA (transmitter and receiver optical sub-assembly) toward intra-board applications,” he says.
But this is more into the future. “It's fine if Intel wants to pursue those things, but it should not neglect common MSAs for OTN and Ethernet applications of a more traditional sort,” says Wirbel.
The benefit of the system-in-package integration is that different FPGAs can be built without having to create a new expensive mask set each time. “You can build a modular lego-block FPGA and all that it has different is the packaged substrate,” says Davies.
Security and software
Stratix 10 also includes security features to protect companies’ intellectual property from being copied or manipulated.
The FPGA features security hardware that protects circuitry from being tampered with; the bitstream that is loaded to configure the FPGA must be decrypted first.
The FPGA is also split into sectors such that parts of the device can have different degrees of security. The sectoring is useful for cloud-computing applications where the FPGA is used as an accelerator to the server host processor. As a result, different customers’ applications can be run in separate sectors of the FPGA to ensure that they are protected from each other.
The security hardware also allows features to be included in a design that the customer can unlock and pay for once needed. For example, a telecom platform could be upgraded to 100 Gigabit while the existing 40 Gig live network traffic runs unaffected in a separate sector.
Altera has upgraded its FPGA software tools in anticipation of the Stratix 10. Features include a hierarchical design flow to simplify the partitioning of a design project across a team of engineers, and the ability to use cloud computing to speed up design compilation time.
What applications will require such advanced FPGAs, and which customers will be willing to pay a premium price for them? Wirbel says the top applications will remain communications.
“The emergence of new 400 Gig OTN transport platforms, and the emergence of all kinds of new routers and switches with 400 Gig interfaces, will keep a 40 percent communication base for FPGAs overall solid at Altera,” he says.
Wirbel also expects server accelerator boards, where FPGA-based accelerators are used for such applications as financial trading and physics simulation, to be an important market. “But Intel must consider the accelerator board market as an ideal place for Stratix 10 on its own, and not merely as a vehicle for promoting a future Xeon-plus-FPGA hybrid,” he says.
Altera will have engineering samples of the Stratix 10 towards the end of 2015, before the devices are shipped to customers.
