The APC’s blueprint for silicon photonics

Jeffery Maki

The Advanced Photonics Coalition (APC) wants to smooth the path for silicon photonics to become a high-volume manufacturing technology.

The organisation is talking to companies about tackling issues whose solutions would benefit the technology as a whole.

The Advanced Photonics Coalition wants to act as an industry catalyst to prove technologies and reduce the risk associated with their development, says Jeffery Maki, Distinguished Engineer at Juniper Networks and a member of the Advanced Photonics Coalition’s board.

Origins

The Advanced Photonics Coalition was unveiled at the Photonic-Enabled Cloud Computing (PECC) Industry Summit jointly held with Optica last October.

The Coalition was formerly known as the Consortium for On-Board Optics (COBO), an industry initiative led by Microsoft.

Microsoft wanted a standard for on-board optics, which until then had been a proprietary technology. At the time, on-board optics was seen as an important stepping stone between pluggable optical modules and their ultimate successor, co-packaged optics.

After years of work developing specifications and products, Microsoft chose not to adopt on-board optics in its data centres. Although COBO added other work activities, such as co-packaged optics, the organisation lost momentum and members.

Maki stresses that COBO always intended to tackle other work besides its on-board optics starting point.

That broader remit is now the Advanced Photonics Coalition’s goal: to create working groups addressing a range of issues.

Tackling technologies

Many standards organisations publish specifications but leave the implementation technologies to their member companies. In contrast, the Advanced Photonics Coalition is taking a technology focus. It wants to remove hurdles associated with silicon photonics to ease its adoption.

“Today, we see the artificial intelligence and machine learning opportunities growing, both in software and hardware,” says Maki. “We see a need in the coming years for more hardware and innovative solutions, especially in power, latency, and interconnects.”

Work Groups

In the past, systems vendors like Cisco or Juniper drove industry initiatives, and other companies fell in line. More recently, it was the hyperscalers that took on the role.

There is less of that now, says Maki: “We have a lot of companies with technologies and good ideas, but there is not a strong leadership.”

The Advanced Photonics Coalition wants to fill that void and address companies’ common concerns in critical areas. “Key customers will then see the value of, and be able to access, that standard or technology that’s then fostered,” says Maki.

The Advanced Photonics Coalition has yet to announce new working groups, but it expects to do so in 2024.

One area of interest is silicon photonics foundries and their process design kits (PDKs). Each foundry has a PDK, made up of tools, models, and documentation, to help engineers with the design and manufacture of photonic integrated devices.

“A starting point might be support for more than one foundry in a multi-foundry PDK,” says Maki. “Perhaps a menu item to select the desired foundry where more than one foundry has been verified to support.”
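
As a purely illustrative sketch of what such a foundry “menu” could look like to a designer (the foundry names, process parameters and interface below are invented, not taken from any real PDK):

    # Hypothetical multi-foundry PDK "menu": one design flow, several verified targets.
    VERIFIED_FOUNDRIES = {
        "foundry_a": {"waveguide_width_nm": 450, "min_bend_radius_um": 5.0},
        "foundry_b": {"waveguide_width_nm": 500, "min_bend_radius_um": 10.0},
    }

    def select_foundry(name):
        """Return process parameters for a foundry verified against this PDK."""
        if name not in VERIFIED_FOUNDRIES:
            raise ValueError(f"{name} is not a verified target of this multi-foundry PDK")
        return VERIFIED_FOUNDRIES[name]

    params = select_foundry("foundry_b")   # the "menu item" Maki describes
    print(f"Targeting waveguide width of {params['waveguide_width_nm']} nm")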

Silicon photonics has long been promoted as a high-volume manufacturing technology for the optical industry. “But it is not if it has been siloed into separate efforts such that there is not that common volume,” says Maki.

Such a PDK effort would identify gaps that each foundry would need to fill. “The point is to provide for more than one foundry to be able to produce the item,” he says.

A company is also talking to the Advanced Photonics Coalition about co-packaged optics. The company has developed an advanced co-packaged optics solution, but it is proprietary.

“Even with a proprietary offering, one can make changes to improve market acceptance,” says Maki. The aim is to identify the areas of greatest contention and remedy them first, for example, the external laser source. “Opening that up to other suppliers through standards adoption, existing or new, is one possibility,” he says.

The Advanced Photonics Coalition is also exploring optical interconnect definitions with companies. “How we do fibre-attached to silicon photonics, there’s a desire that there is standardisation to open up the market more,” says Maki. “That’s more surgical but still valuable.”

And there are discussions about a working group to address co-packaged optics for the radio access network (RAN). Ericsson is one company interested in co-packaged optics for the RAN. Another working group being discussed could tackle optical backplanes.

Maki says there are opportunities here to benefit the industry.

“Companies should understand that nothing is slowing them down or blocking them from doing something other than their ingenuity or their own time,” he says.

Status

COBO had 50 members earlier in 2023. Now, the membership listed on the website has dropped to 39 and the number could dip further; companies that joined for COBO may still decide to leave.

At the time of writing, a new, as-yet-unannounced member has joined the Advanced Photonics Coalition, taking the membership to 40.

“Some of those companies that left, we think they will return once we get the working groups formed,” says Maki, who remains confident that the organisation will play an important industry role.

“Every time I have a conversation with a company about the status of the market and the needs that they see for the coming years, there’s good alignment amongst multiple companies,” he says.

There is an opportunity for an organisation to focus on implementation aspects across the various technology platforms and bring more harmony to them, something other standards organisations don’t do, says Maki.


Can a think tank tackle telecoms’ innovation deficit?

Source: Telecom Ecosystem Group

The Telecom Ecosystem Group (TEG) will shortly publish its final paper, which concludes two years of industry discussion on ways to spur innovation in telecommunications.

The paper, entitled Addressing the Telecom Innovation Deficit, says telcos have lost much of their influence in shaping the technologies on which they depend.

“They have become ageing monocultures; disruptive innovators have left the industry and innovation is outsourced,” says the report.

Over those two years, the TEG has held three colloquia and numerous discussion groups, soliciting views from experienced individuals across the industry.

The latest paper names eight authors but many more contributed to the document and its recommendations.

Network transformation

Don Clarke, formerly of BT and CableLabs, is one of the authors of the latest paper. He also co-authored ETSI’s Network Functions Virtualisation (NFV) paper that kickstarted the telcos’ network transformation strategies of the last decade.

Many of the changes sought in the original NFV paper have come to pass.

Networking functions now run as software and no longer require custom platforms. To do that, the operators have embraced open interfaces that allow disaggregated designs to tackle vendor lock-in. The telcos have also adopted open-source software practices and spurred the development of white boxes to expand equipment choice.

Yet the TEG paper laments the industry’s continued reliance on large vendors while smaller telecom vendors – seen as vital to generate much-needed competition and innovation – struggle to get a look-in.

The telecom ecosystem

The TEG segments the telecommunications ecosystem into three domains (see diagram).

The large-scale data centre players are the digital services providers (top layer). In this domain, innovation and competition are greatest.

The digital network provider domain (middle layer) is served by a variety of players, notably the cloud providers, while it is the telcos that dominate the physical infrastructure provider domain.

At this bottom layer, competition is low and overall investment in infrastructure is inadequate. A third of the world’s population still has no access to the internet, notes the report.

The telcos should also be exploiting the synergies between the domains, says the TEG, yet struggle to do so. But more than that, the telcos can be a barrier.

Clarke cites the emerging metaverse that will support immersive virtual worlds as an example.

Metaverse

The “Metaverse” is a concept being promoted by the likes of Meta and Microsoft and has been picked up by the telcos, as evident at this week’s MWC Barcelona 22 show.

Meta’s Mark Zuckerberg recently encouraged his staff to focus on long-term thinking as the company transitions to become a metaverse player. “We should take on the challenges that will be the most impactful, even if the full results won’t be seen for years,” he said.

Telcos should be thinking about how to create a network that enables the metaverse, given the data for rendering metaverse environments will come through the telecom network, says Clarke.

Don Clarke

“The real innovation will come when you try and understand the needs of the metaverse in terms of networking, and then you get into the telco game,” he says.

Any concentration of metaverse users will generate a data demand likely to exhaust the network capacity available.

“Telcos will say, ‘We aren’t upgrading capacity because we are not getting a return,’ and then metaverse innovation will be slowed down,” says Clarke.

He says much of the innovation needed for the metaverse will be in the network and telcos need to understand the opportunities for them.  “The key is what role will the telcos have, not in dollars but network capability, then you start to see where the innovation needs to be done.”

The challenge is that the telcos can’t see beyond their immediate operational challenges, says Clarke: “Anything new creates more operational challenges and therefore needs to be rejected because they don’t have the resources to do anything meaningful.”

He stresses he is full of admiration for telcos’ operations staff: “They know their game.” But in an environment where operational challenges are avoided, innovation is less important.

TEG’s action plan

TEG’s report lists direct actions telcos can take regarding innovation. These cover funding, innovation processes, procurement and increasing competition.

Many of the proposals are designed to help smaller vendors overcome the challenges they face in telecoms. TEG views small vendors and start-ups as vital for the industry to increase competition and innovation.

Under the funding category, TEG wants telcos to allocate at least 5 per cent of procurement to start-ups and small vendors. The group also calls for investment funds to be set up that back infrastructure and middleware vendors, not just over-the-top start-ups.

For innovation, it wants greater disaggregation so as to steer away from monolithic solutions. The group also wants commitments to fast lab-to-field trials (a year) and shorter deployment cycles (two years maximum) of new technologies.

Competition will require a rethink regarding small vendors. At present, all the advantages are with the large vendors. The report lists six measures for how telcos can help small vendors win business, one being to stop forcing them to partner with large vendors. The TEG also wants telcos to commit enough personnel so that small vendors get all the “airtime” they need.

Lastly, concerning procurement, telcos can do much more.

One suggestion is to stop sending small vendors large, complex requests for proposals (RFPs) that they must respond to in short timescales; small vendors cannot compete with the large RFP teams available to the big vendors.

Also, telcos should stop harsh negotiating tactics such as demanding a 30 per cent additional discount. Such demands can hobble a small vendor.

Innovation

“Innovation comes from left field and if you try to direct it with a telco mindset, you miss it,” says Clarke. “Telcos think they know what ‘good’ looks like when it comes to innovation, but they don’t because they come at it from a monoculture mindset.”

He says that in the TEG discussions, the idea of incubators for start-ups was raised. “We have all done incubators,” he says. But success has been limited for the reasons cited above.

He also laments the lack of visionaries in the telecom industry.

A monoculture organisation rejects such individuals. “Telcos don’t like visionaries because culturally they are annoying and they make their life harder,” he says. “Disruptors have left the industry.”

Prospects

The authors are realistic.

Even if their report is taken seriously, they note any change will take time. They also do not expect the industry to be able to effect change without help. The TEG wants government and regulator involvement if the long-term prospects of a crucial industry are to be ensured.

The key is to create an environment that nurtures innovation and here telcos could work collectively to make that happen.

“No telco has it all, but individual ones have strengths,” says Clarke. “If you could somehow combine the strengths of the particular telcos and create such an environment, things will emerge.”

The trick is diversity – get people from different domains together to make judgements as to what promising innovation looks like.

“Bring together the best people and marvelous things happen when you give them a few beers and tell them to solve a problem impacting all of them,” says Clarke. “How can we make that happen?”

 


The various paths to co-packaged optics

Brad Booth

Near package optics has emerged as companies have encountered the complexities of co-packaged optics. It should not be viewed as an alternative to co-packaged optics but rather a pragmatic approach for its implementation.

Co-packaged optics will be one of several hot topics at the upcoming OFC show in March.

Placing optics next to silicon is seen as the only way to meet the future input-output (I/O) requirements of ICs such as Ethernet switches and high-end processors.

For now, pluggable optics do the job of routing traffic between Ethernet switch chips in the data centre. The pluggable modules sit on the switch platform’s front panel at the edge of the printed circuit board (PCB) hosting the switch chip.

But with switch silicon capacity doubling every two years, engineers are being challenged to get data into and out of the chip while ensuring power consumption does not rise.

One way to boost I/O and reduce power is to use on-board optics, bringing the optics onto the PCB nearer the switch chip to shorten the electrical traces linking the two.

The Consortium for On-Board Optics (COBO), set up in 2015, has developed specifications to ensure interoperability between on-board optics products from different vendors.

However, the industry has favoured a shorter still link distance, coupling the optics and ASIC in one package. Such co-packaging is tricky, which explains why yet another approach has emerged: near package optics.

I/O bottleneck

“Everyone is looking for tighter and tighter integration between a switch ASIC, or ‘XPU’ chip, and the optics,” says Brad Booth, president at COBO and principal engineer, Azure hardware architecture at Microsoft. XPU is the generic term for an IC such as a CPU, a graphics processing unit (GPU) or even a data processing unit (DPU).

What kick-started interest in co-packaged optics was the desire to reduce power consumption and cost, says Booth. These remain important considerations but the biggest concern is getting sufficient bandwidth on and off these chips.

“The volume of high-speed signalling is constrained by the beachfront available to us,” he says.

Booth cites the example of a 16-lane PCI Express bus that requires 64 electrical traces for data alone, not including the power and ground signalling. “I can do that with two fibres,” says Booth.
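
That trace count follows from how PCIe lanes are wired: each lane uses a differential pair in each direction, i.e. four data wires per lane. A quick check:

    # Each PCIe lane carries one differential pair per direction: 2 x 2 = 4 data wires.
    lanes = 16
    wires_per_lane = 2 * 2            # transmit pair plus receive pair
    print(lanes * wires_per_lane)     # 64 electrical traces for data alone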

Nhat Nguyen

Near package optics

With co-packaged optics, the switch chip is typically surrounded by 16 optical modules, all placed on an organic substrate (see diagram below).

“Another name for it is a multi-chip module,” says Nhat Nguyen, senior director, solutions architecture at optical I/O specialist, Ayar Labs.

A 25.6-terabit Ethernet switch chip requires sixteen 1.6-terabit-per-second (1.6Tbps) optical modules, while upcoming 51.2-terabit switch chips will use 3.2Tbps modules.

“The issue is that the multi-chip module can only be so large,” says Nguyen. “It is challenging with today’s technology to surround the 51.2-terabit ASIC with 16 optical modules.”
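
The module counts follow from dividing the switch bandwidth across the 16 optical engines:

    # Switch capacity split evenly across 16 optical modules around the package.
    for switch_tbps in (25.6, 51.2):
        per_module_tbps = switch_tbps / 16
        print(f"{switch_tbps}-terabit switch -> 16 x {per_module_tbps} Tbps modules")
    # 25.6 Tbps gives 1.6 Tbps per module; 51.2 Tbps gives 3.2 Tbps per module.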

A 51.2-terabit Ethernet switch chip surrounded by sixteen 3.2Tbps optical modules. Source: OIF.

Near package optics tackles this by using a high-performance PCB substrate – an interposer – that sits on the host board, in contrast to co-packaged optics where the modules surround the chip on a multi-chip module substrate.

The near package optics’ interposer is more spacious, making the signal routing between the chip and optical modules easier while still meeting signal integrity requirements. Using the interposer means the whole PCB doesn’t need upgrading which would be extremely costly.

Some co-packaged optics designs will use components from multiple suppliers. One concern is how to service a failed optical engine when testing the design before deployment. “That is one reason why a connector-based solution is being proposed,” says Booth. “And that also impacts the size of the substrate.”

A larger substrate is also needed to support both electrical and optical interfaces from the switch chip.

Platforms will not become all-optical immediately and direct-attached copper cabling will continue to be used in the data centre. However, the issue with electrical signalling, as mentioned, is it needs more space than fibre.

“We are in a transitional phase: we are not 100 per cent optics, we are not 100 per cent electrical anymore,” says Booth. “How do you make that transition and still build these systems?”

Perspectives

Ayar Labs views near package optics as akin to COBO. “It’s an attempt to bring COBO much closer to the ASIC,” says Hugo Saleh, senior vice president of commercial operations and managing director of Ayar Labs U.K.

However, COBO’s president, Booth, stresses that near package optics is different from COBO’s on-board optics work.

“The big difference is that COBO uses a PCB motherboard to do the connection whereas near package optics uses a substrate,” he says. “It is closer than where COBO can go.”

It means that with near package optics, there is no high-speed data bandwidth going through the PCB.

Booth says near package optics came about once it became obvious that the latest 51.2-terabit designs – the silicon, optics and the interfaces between them – cannot fit on even the largest organic substrates.

“It was beyond the current manufacturing capabilities,” says Booth. “That was the feedback that came back to Microsoft and Facebook (Meta) as part of our Joint Development Foundation.”

Near package optics is thus a pragmatic solution to an engineering challenge, says Booth. The larger substrate remains a form of co-packaging but it has been given a distinct name to highlight that it is different to the early-version approach.

Nathan Tracy, TE Connectivity and the OIF’s vice president of marketing, admits he is frustrated that the industry is using two terms since co-packaged optics and near package optics achieve the same thing. “It’s just a slight difference in implementation,” says Tracy.

The OIF is an industry forum studying the applications and technology issues of co-packaging and this month published its framework Implementation Agreement (IA) document.

COBO is another organisation working on specifications for co-packaged optics, focussing on connectivity issues.

The two design approaches: co-packaged optics and near package optics. Source: OIF.

Technical differences

Ayar Labs highlights the power penalty of near package optics, a consequence of its longer channel lengths.

For near package optics, lengths between the ASIC and optics can be up to 150mm with the channel loss constrained to 13dB. This is why the OIF is developing the XSR+ electrical interface, to expand the XSR’s reach for near package optics.

In contrast, co-packaged optics confines the modules and host ASIC to within 50mm of each other. “The channel loss here is limited to 10dB,” says Nguyen. Co-packaged optics has lower power consumption because of the shorter spans and the 3dB lower channel loss.

Ayar Labs highlights its optical engine technology, the TeraPHY chiplet that combines silicon photonics and electronics in one die. The optical module surrounding the ASIC in a co-packaged design typically comprises three chips: the DSP, electrical interface and photonics.

“We can place the chiplet very close to the ASIC,” says Nguyen. The distance between the ASIC and the chiplet, sitting on the same interposer, can be as close as 3-5mm. Ayar Labs refers to such a design using a third term: in-package optics.

Ayar Labs says its chiplet can also be used for optical modules as part of a co-packaged design.

The very short distances using the chiplet result in a power efficiency of 5pJ/bit whereas that of an optical module is 15pJ/bit. Using TeraPHY for an optical module co-packaged design, the power efficiency is some 7.5pJ/bit, half that of a 3-chip module.

A 3-5mm distance also reduces the latency, while the bandwidth density of the chiplet, measured in Gigabit/s/mm, is higher than that of the optical module.
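
To put the energy-per-bit figures in context, power is simply energy per bit multiplied by the bit rate; the sketch below pairs Ayar Labs’ quoted efficiencies with the 3.2Tbps per-module rate from the switch example above (that pairing is illustrative):

    # Power = energy per bit x bit rate; 1 pJ/bit at 1 Tbps is exactly 1 W.
    rate_tbps = 3.2   # per-module rate for a 51.2-terabit switch
    for label, pj_per_bit in [("3-chip optical module", 15.0),
                              ("TeraPHY-based module", 7.5),
                              ("in-package chiplet", 5.0)]:
        print(f"{label}: {pj_per_bit * rate_tbps:.0f} W at {rate_tbps} Tbps")
    # 48 W, 24 W and 16 W respectively for the same bandwidth.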

 

Co-existence

Booth refers to near package optics as ‘CPO Gen-1’, the first generation of co-packaged optics.

“In essence, you have got to use technologies you have in hand to be able to build something,” says Booth. “Especially in the timeline that we want to demonstrate the technology.”

Is Microsoft backing near package optics?

Hugo Saleh

“We are definitely saying yes if this is what it takes to get the first level of specifications developed,” says Booth.

But that does not mean the first products will be exclusively near package optics.

“Both will be available and around the same time,” says Booth. “There will be near packaged optics solutions that will be multi-vendor and there will be more vertically-integrated designs; like Broadcom, Intel and others can do.”

From an end-user perspective, a multi-vendor capability is desirable, says Booth.

Ayar Labs’ Saleh sees two developing paths.

The first is optical I/O to connect chips in a mesh or as part of memory semantic designs used for high-performance computing and machine learning. Here, the highest bandwidth and lowest power are key design goals.

Ayar Labs has just announced a strategic partnership with high-performance computing leader HPE to design future silicon photonics solutions for HPE’s Slingshot interconnect, which is used in upcoming exascale supercomputers and also in the data centre.

The second path concerns Ethernet switch chips and here Saleh expects both solutions to co-exist: near package optics will be an interim solution with co-packaged optics dominating longer term. “This will move more slowly as there needs to be interoperability and a wide set of suppliers,” says Saleh.

Booth expects continual design improvements to co-packaged optics. Further out, he expects 2.5D and 3D chip packaging techniques, where silicon is stacked vertically, to be used as part of co-packaged optics designs.


Compute vendors set to drive optical I/O innovation

Professor Vladimir Stojanovic

Part 2: Data centre and high-performance computing trends

Professor Vladimir Stojanovic has an engaging mix of roles.

When he is not a professor of electrical engineering and computer science at the University of California, Berkeley, he is the chief architect at optical interconnect start-up, Ayar Labs.

Until recently Stojanovic spent four days each week at Ayar Labs. But last year, more of his week was spent at Berkeley.

Stojanovic is a co-author of a 2015 Nature paper that detailed a monolithic electronic-photonics technology. The paper described a technological first: how a RISC-V processor communicated with the outside world using optical rather than electronic interfaces.

It is this technology that led to the founding of Ayar Labs.

Research focus

“We [the paper’s co-authors] always thought we would use this technology in a much broader sense than just optical I/O [input-output],” says Stojanovic.

This is now Stojanovic’s focus as he investigates applications such as sensing and quantum computing. “All sorts of areas where you can use the same technology – the same photonic devices, the same circuits – arranged in different configurations to achieve different goals,” says Stojanovic.

Stojanovic is also looking at longer-term optical interconnect architectures beyond point-to-point links.

Ayar Labs’ chiplet technology provides optical I/O when co-packaged with chips such as an Ethernet switch or an “XPU” – an IC such as a CPU or a GPU (graphics processing unit). The optical I/O can be used to link sockets, each containing an XPU, or even racks of sockets, to form ever-larger compute nodes to achieve “scale-out”.

But Stojanovic is looking beyond that, including optical switching, so that tens of thousands or even hundreds of thousands of nodes can be connected while still maintaining low latency to boost certain computational workloads.

This, he says, will require not just different optical link technologies but also figuring out how applications can use the software protocol stack to manage these connections. “That is also part of my research,” he says.

Optical I/O

Optical I/O has now become a core industry focus given the challenge of meeting the data needs of the latest chip designs. “The more compute you put into silicon, the more data it needs,” says Stojanovic.

Within the packaged chip, there is efficient, dense, high-bandwidth and low-energy connectivity. But outside the package, there is a very sharp drop in performance, and outside the chassis, the performance hit is even greater.

Optical I/O promises a way to exploit that silicon bandwidth to the full, without dropping the data rate anywhere in a system, whether across a shelf or between racks.

This has the potential to build more advanced computing systems whose performance is already needed today.

Just five years ago, says Stojanovic, artificial intelligence (AI) and machine learning were still in their infancy and so were the associated massively parallel workloads that required all-to-all communications.

Fast forward to today and such requirements are now pervasive in high-performance computing and cloud-based machine-learning systems. “These are workloads that require this strong scaling past the socket,” says Stojanovic.

He cites natural language processing, where the memory required has grown 1,000-fold within 18 months: from hosting a billion parameters to a trillion.
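
As a rough illustration of what that growth means in bytes (assuming two bytes per parameter, i.e. 16-bit weights, an assumption not stated in the article):

    # Memory for the model weights alone, assuming 2 bytes (16 bits) per parameter.
    bytes_per_param = 2
    for params in (1e9, 1e12):
        gigabytes = params * bytes_per_param / 1e9
        print(f"{params:.0e} parameters -> about {gigabytes:,.0f} GB of weights")
    # Roughly 2 GB for a billion parameters and 2,000 GB (2 TB) for a trillion,
    # far beyond the high-bandwidth memory attached to any single device.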

“AI is going through these phases: computer vision was hot, now it’s recommender models and natural language processing,” says Stojanovic. “Each generation of application is two to three orders of magnitude more complex than the previous one.”

Such computational requirements will only be met using massively parallel systems.

“You can’t develop the capability of a single node fast enough, cramming more transistors and using high-bandwidth memory,” he says. High-bandwidth memory (HBM) refers to stacked memory die that meet the needs of advanced devices such as GPUs.

Co-packaged optics

Yet, if you look at the headlines over the last year, it appears that it is business as usual.

For example, there has been a Multi-Source Agreement (MSA) announcement for new 1.6-terabit pluggable optics. And while co-packaged optics for Ethernet switch chips continues to advance, it remains a challenging technology; Microsoft has said it will only be late 2023 when it starts using co-packaged optics in its data centres.

Stojanovic stresses there is no inconsistency here: it comes down to what kind of bandwidth barrier is being solved and for what kind of application.

In the data centre, it is clear where the memory fabric ends and where the networking – implemented using pluggable optics – starts. That said, this boundary is blurring: there is a need for transactions between many sockets and their shared memory. He cites Nvidia’s NVLink and AMD’s Infinity Fabric links as examples.

“These fabrics have very different bandwidth densities and latency needs than the traditional networks of Infiniband and Ethernet,” says Stojanovic. “That is where you look at what physical link hardware answers the bottleneck for each of these areas.”

Co-packaged optics is focussed on continuing the scaling of Ethernet switch chips. It is a more scalable solution than pluggables and even on-board optics because it eliminates long copper traces that need to be electrically driven. That electrical interface has to escape the switch package, and that gives rise to that package-bottleneck problem, he says.

There will be applications where pluggables and on-board optics will continue to be used. But they will still need power-consuming retimer chips and they won’t enable architectures where a chip can talk to any other chip as if they were sharing the same package.

“You can view this as several different generations, each trying to address something but the ultimate answer is optical I/O,” says Stojanovic.

How optical connectivity is used also depends on the application, and it is this diversity of workloads that is challenging the best of the system architects.

Application diversity

Stojanovic cites one machine learning approach for natural language processing used by Google that scales across many compute nodes: the ‘mixture of experts’ (MoE) technique.

Z. Chen, Hot Chips 2020

A processing pipeline is replicated across machines, each performing part of the learning. For the algorithm to work in parallel, each must exchange its data set – its learning – with every other processing pipeline, a stage referred to as all-to-all dispatch and combine.

“As you can imagine, all-to-all communications is very expensive,” says Stojanovic. “There is a lot of data from these complex, very large problems.”

Not surprisingly, as the number of parallel nodes used grows, a greater proportion of the overall time is spent exchanging the data.

Using 1,000 AI processors running 2,000 experts, a third of the time is taken up by data exchange. Scale the hardware to 3,000 to 4,000 AI processors and communications dominate the runtime.

This, says Stojanovic, is a very interesting problem to have: it’s an example where adding more compute simply does not help.

“It is always good to have problems like this,” he says. “You have to look at how you can introduce some new technology that will be able to resolve this to enable further scaling, to 10,000 or 100,000 machines.”

He says such examples highlight how optical engineers must also have an understanding of systems and their workloads and not just focus on ASIC specifications such as bandwidth density, latency and energy.

Because of the diverse workloads, what is needed is a mixture of circuit-switching and packet-switching interconnects.

Stojanovic says high-radix optical switching can connect up to a thousand nodes and, scaling to two hops, up to a million nodes with sub-microsecond latencies. This suits streamed traffic.
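
Those node counts follow from the switch radix raised to the number of hops; a sketch with an illustrative radix of 1,000 ports:

    # Nodes reachable through a high-radix optical switch fabric: radix ** hops.
    radix = 1000   # illustrative port count per optical switch
    for hops in (1, 2):
        print(f"{hops} hop(s): up to {radix ** hops:,} nodes")
    # One hop reaches ~1,000 nodes; two hops reach ~1,000,000.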

Professor Stojanovic, ECOC 21

But an abundance of I/O bandwidth is also needed to attach to other types of packet switch fabrics. “So that you can also handle cache-line size messages,” says Stojanovic.

These are 64 bytes long and are found with processing tasks such as Graph AI where data searches are required, not just locally but across the whole memory space. Here, transmissions are shorter and involve more random addressing and this is where point-to-point optical I/O plays a role.

“It is an art to architect a machine,” says Stojanovic.

Disaggregation

Another data centre trend is server disaggregation, which promises important advantages.

The only memory that meets the GPU requirements is HBM. But it is becoming difficult to realise taller and taller HBM stacks. Stojanovic cites as an example how Nvidia came out with its A100 GPU with 40GB of HBM, which was quickly followed, a year later, by an 80GB A100 version.

Some customers had to do a complete overhaul of their systems to upgrade to the newer A100, yet they welcomed the doubling of memory because of the exponential growth in AI workloads.

By disaggregating a design – decoupling the compute and memory into separate pools – memory can be upgraded independently of the computing. In turn, pooling memory means multiple devices can share the memory and it avoids ‘stranded memory’ where a particular CPU is not using all its private memory. Having a lot of idle memory in a data centre is costly.

If the I/O to the pooled memory can be made fast enough, it promises to allow GPUs and CPUs to access common DDR memory.

“This pooling, with the appropriate memory controller design, equalises the playing field of GPUs and CPUs being able to access jointly this resource,” says Stojanovic. “That allows you to provide way more capacity – several orders more capacity of memory – to the GPUs but still be within a DRAM read access time.”

Such access time is 50-60ns overall from the DRAM banks and through an optical I/O. The pooling also means that the CPUs no longer have stranded memory.
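
A back-of-the-envelope budget shows why such a figure is plausible. The split below between DRAM access and optical time of flight is illustrative; only the 50-60ns total comes from Stojanovic:

    # Illustrative remote-DRAM access budget over optical I/O.
    dram_access_ns = 40        # assumed access time at the DRAM banks
    fibre_ns_per_m = 5         # light in fibre covers roughly 0.2 m per ns
    distance_m = 2             # assumed reach to an in-rack memory pool
    total_ns = dram_access_ns + 2 * distance_m * fibre_ns_per_m   # round trip
    print(f"about {total_ns} ns")   # sits within the 50-60 ns window quoted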

“Now something that is physically remote can be logically close to the application,” says Stojanovic.

Challenges

For optical I/O to enable such system advances, what is needed is an ecosystem of companies. Adding an optical chiplet alongside an ASIC is not the issue; chiplets are already used by the chip industry. Instead, the ecosystem is needed to address such practical matters as attaching fibres and producing the lasers needed. This requires collaboration among companies across the optical industry.

“That is why the CW-WDM MSA is so important,” says Stojanovic. The MSA defines the wavelength grids for parallel optical channels and is an example of what is needed to launch an ecosystem and enable what system integrators and ultimately the hyperscalers want to do.

Systems and networking

Stojanovic concludes by highlighting an important distinction.

The XPUs have their own design cycles and, with each generation, new features and interfaces are introduced. “These are the hearts of every platform,” says Stojanovic. Optical I/O needs to be aligned with these devices.

The same applies to switch chips that have their own development cycles. “Synchronising these and working across the ecosystem to be able to find these proper insertion points is key,” he says.

But this also implies that the attention given to the interconnects used within a system (or between several systems, i.e. to create a larger node) will be different to that given to the data centre network overall.

“The data centre network has its own bandwidth pace and needs, and co-packaged optics is a solution for that,” says Stojanovic. “But I think a lot more connections get made, and the rules of the game are different, within the node.”

Companies will start building very different machines to differentiate themselves and meet the huge scaling demands of applications.

“There is a lot of motivation from computing companies and accelerator companies to create node platforms, and they are freer to innovate and more quickly adopt new technology than in the broader data centre network environment,” he says.

When will this become evident? In the coming two years, says Stojanovic.


Data centre disaggregation with Gen-Z and CXL

Hiren Patel

Part 1: CXL and Gen-Z

  • The Gen-Z and Compute Express Link (CXL) protocols have been shown working in unison to implement a disaggregated processor and memory system at the recent Supercomputing 21 show.
  • The Gen-Z Consortium’s assets are being subsumed within the CXL Consortium. CXL will become the sole industry standard moving forward.
  • Microsoft and Meta are two data centre operators backing CXL.

Pity Hiren Patel, tasked with explaining the Gen-Z and CXL networking demonstration operating across several booths at the Supercomputing 21 (SC21) show held in St. Louis, Missouri in November.

Not only was Patel wearing a sanitary mask while describing the demo, but he had to battle to be heard above cooling fans so loud that he might as well have been standing at St. Louis Lambert International Airport.

Gen-Z and CXL are key protocols supporting memory and server disaggregation in the data centre.

The SC21 demo showed Gen-Z and CXL linking compute nodes to remote ‘media boxes’ filled with memory in a distributed multi-node network (see diagram, bottom).

CXL was used as the host interface on the various nodes while Gen-Z created and oversaw the mesh network linking equipment up to tens of meters apart.

“What our demo showed is that it is finally coming to fruition, albeit with FPGAs,” says Patel, CEO of IP specialist, IntelliProp, and President of the Gen-Z Consortium.

Interconnects

Gen-Z and CXL are two of a class of interconnect schemes announced in recent years.

The interconnects came about to enable efficient ways to connect CPUs, accelerators and memory. They also address a desire among data centre operators to disaggregate servers so that key components such as memory can be pooled separately from the CPUs.

The idea of disaggregation is not new. The Gen-Z protocol emerged from HPE’s development of The Machine, a novel memory-centric computer architecture. The Gen-Z Consortium was formed in 2016, backed by HPE and Dell, another leading high-performance computing specialist. The CXL consortium was formed in 2019.

Other interconnects of recent years include the Open Coherent Accelerator Processor Interface (Open-CAPI), Intel’s own interconnect scheme Omni-Path which it subsequently sold off, Nvidia’s NVLink, and the Cache Coherent Interconnect for Accelerators (CCIX).

The emergence of the host buses was also a result of industry frustration with the prolonged delay in the release of the then PCI Express (PCIe) 4.0 specification.

All these interconnects are valuable, says Patel, but many are implemented in a proprietary manner whereas CXL and Gen-Z are open standards that have gained industry support.

“There is value moving away from proprietary to an industry standard,” says Patel.

Merits of pooling

Disaggregated designs with pooled memory deliver several advantages: memory can be upgraded separately from the CPUs, with extra memory added as required. “Memory growth is outstripping CPU core growth,” says Patel. “Now you need banks of memory outside of the server box.”

A disaggregated memory architecture also supports multiple compute nodes – CPUs and accelerators such as graphics processor units (GPUs) or FPGAs – collaborating on a common data set.

Such resources also become configurable: in artificial intelligence, training workloads require a hardware configuration different to inferencing. With disaggregation, resources can be requested for a workload and then released once a task is completed.

Memory disaggregation also helps data centre operators drive down the cost-per-bit of memory. “What data centres spend just on DRAM is extraordinarily high,” says Erich Hanke, senior principal engineer, storage and memory products, at IntelliProp.

Memory can be used more efficiently and need no longer be stranded. A server can be designed for average workloads, not worst-case ones as is done now. And when worst-case scenarios arise, extra memory can be requested.

Erich Hanke

“This allows the design of efficient data centres that are cost optimised while not losing out on the aggregate performance,” says Hanke.

Hanke also highlights another advantage, minimising data loss during downtimes. Given the huge number of servers in a data centre, reboots and kernel upgrades are a continual occurrence. With disaggregated memory, active memory resources need not be lost.

Gen-Z and CXL

The Gen-Z protocol allows for the allocation and deallocation of resources, whether memory, accelerators or networking. “It can be used to create a temporary or permanent binding of that resource to one or more CPU nodes,” says Hanke.

Gen-Z supports native peer-to-peer requests flowing in any direction through a fabric, says Hanke. This is different to PCIe which supports tree-type topologies.

Gen-Z and CXL are also memory-semantic protocols whereas PCIe is not.

With a memory-semantic protocol, a processor natively issues data loads and stores into fabric-attached components. “No layer of software or a driver is needed to DMA (direct memory access) data out of a storage device if you have a memory-semantic fabric,” says Hanke.
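
To make the distinction concrete, here is a minimal sketch of what memory-semantic access means for software. An ordinary file stands in for the fabric-attached memory window, and the contrasting driver call is hypothetical; the point is that fabric memory is mapped once and then used with plain loads and stores:

    import mmap
    import os

    # Memory-semantic model (Gen-Z/CXL): fabric-attached memory appears in the
    # host address space. A local file stands in for that window here; on a real
    # system it would be a mapping exposed by the fabric adapter.
    fd = os.open("fabric_window.bin", os.O_RDWR | os.O_CREAT)
    os.ftruncate(fd, 1 << 20)                    # 1 MiB stand-in window
    fam = mmap.mmap(fd, 1 << 20)
    fam[0:8] = (42).to_bytes(8, "little")        # an ordinary store
    value = int.from_bytes(fam[0:8], "little")   # an ordinary load

    # Contrast with a block-storage model, where a driver call (DMA under the
    # hood) must stage the data before the CPU can touch it, e.g. (hypothetical):
    # buf = storage_driver.read(lba=0, blocks=1)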

Gen-Z is also hugely scalable. It supports 4,096 nodes per subnet and 64,000 subnets, a total of 256 million nodes per fabric.
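
The fabric-size figure follows from the addressing arithmetic, reading the article’s 64,000 as 64K (2 to the power 16) subnets:

    # Gen-Z fabric scale: nodes per subnet x number of subnets.
    nodes_per_subnet = 4_096      # 2**12
    subnets = 65_536              # 2**16; the "64,000 subnets" figure read as 64K
    print(f"{nodes_per_subnet * subnets:,} nodes per fabric")
    # 268,435,456 nodes, i.e. 256 x 2**20, the "256 million" quoted above.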

The Gen-Z specification is designed modularly, comprising a core specification and other components such as for the physical layer to accommodate changes in serialiser-deserialiser (serdes) speeds.

Disaggregation using Gen-Z and CXL. Source: IntelliProp

For example, the SC21 demo using an FPGA implemented 25 giga-transfers a second (25GT/s) but the standard will support 50 and 112GT/s rates. In effect, the Gen-Z specification is largely done.

What Gen-Z does not support is cache coherency but that is what CXL is designed to do. Version 2.0 of the CXL specification has already been published and version 3.0 is expected in the first half of 2022.

CXL 2.0 supports three protocols: CXL.io, which is similar to PCIe (CXL uses the physical layer of the PCIe bus); CXL.memory for host-memory accesses; and CXL.cache for coherent host-cache accesses.

“More and more processors will have CXL as their connect point,” says Patel. “You may not see Open-CAPI as a connect point, you may not see NVLink as a connect point, you won’t see Gen-Z as a connect point but you will see CXL on processors.”

SC21 demo

The demo’s goal was to show how computing nodes – hosts – could be connected to memory modules through a switched Gen-Z fabric.

The equipment included a server hosting the latest Intel Sapphire Rapids processor, a quad-core A53 ARM processor on a Xilinx FPGA implemented with a Bittware 250SoC FPGA card, as well as several media boxes housing memory modules.

The ARM processor was used as the Fabric Manager node which oversees the network to allow access to the storage endpoints. There is also a Fabric Adaptor that connects to the Intel processor’s CXL bus on one side and the other to the memory-semantic fabric.

“CXL is in the hosts and everything outside that is Gen-Z,” says Patel.

The CXL V1.1 interface is used with four hosts (see diagram below). The V1.1 specification is point-to-point and as such can’t be used for any of the fabric implementations, says Patel. The 128Gbps CXL host interfaces were implemented as eight lanes of 16Gbps, using the PCIe 4.0 physical layer.

The Intel Sapphire Rapids processor supports a CXL Gen5x16 bus delivering 512Gbps (PCIe 5.0 x 16 lanes), but that is too fast for IntelliProp’s FPGA implementation. “An ASIC implementation of the IntelliProp CXL host fabric adapter would run at the 512Gbps full rate,” says Patel. With an ASIC, the Gen-Z port count could be increased from 12 to 48 ports, while the latency of each hop would be only 35ns.
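
The interface rates quoted are just lanes multiplied by the per-lane rate of the underlying PCIe generation:

    # Host-interface bandwidth = number of lanes x per-lane rate.
    demo_gbps = 8 * 16      # 8 lanes at 16 Gbps (PCIe 4.0 PHY)  -> 128 Gbps
    full_gbps = 16 * 32     # 16 lanes at 32 Gbps (PCIe 5.0 PHY) -> 512 Gbps
    print(demo_gbps, full_gbps)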

The media box is a two-rack-unit (2RU) server without a CPU but with fabric-attached memory modules. Each memory module has a switch that enables multipath accesses. A memory module of 256Gbytes could be partitioned across all four hosts, for example. Equally, memory can be shared among the hosts. In the SC21 demo, memory in a media box was accessed by a server 30m away.

The SC21 demo representation showing the 4 hosts, the Fabric Manager (FM) and the switching that allows multiple paths to the memory end-points (purple nodes). Source: IntelliProp

IntelliProp implemented the Host Fabric Adapter, which included integrated switching; a 12-port Gen-Z switch; and the memory modules, which also feature integrated switching. All of the SC21 demonstration, outside of the Intel host, was done using FPGAs.

For a data centre, the media boxes would connect to a top-of-rack switch and fan out to multiple servers. “The media box could be co-located in a rack with CPU servers, or adjacent racks or a pod,” says Hanke.

The distances of a Gen-Z network in a data centre would typically be row- or pod-scale, says Hanke. IntelliProp has had enquiries about going greater distances, but above 30m the fibre length starts to dictate latency. It’s a 10ns round trip for each metre of cable, says IntelliProp.
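
That rule of thumb is the speed of light in fibre applied in both directions; a quick check of the 30m case:

    # Light in fibre covers roughly 0.2 m per ns, so about 5 ns per metre each way.
    ns_per_m_round_trip = 10
    for metres in (1, 10, 30):
        print(f"{metres} m of cable adds about {metres * ns_per_m_round_trip} ns round trip")
    # At 30 m the cable alone adds ~300 ns, an order of magnitude more than the
    # ~35 ns per-hop switch latency cited earlier.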

What the demo also showed was how well the Gen-Z and CXL protocols combine. “Gen-Z converts the host physical address to a fabric address in a very low latency manner; this is how they will eventually blend,” says Hanke.

What next?

The CXL Consortium and The Gen-Z Consortium signed a memorandum of understanding in 2020 and now Gen-Z’s assets are being transferred to the CXL Consortium. Going forward, CXL will become the sole industry standard.

Meanwhile, Microsoft, speaking at SC21, expressed its interest in CXL to support disaggregated memory and to grow memory dynamically in real time. Meta is also backing the standard. But both cloud companies need the standard to be easily manageable in software, and stress that CXL and its evolutions must have minimal impact on overall latency.


Telecoms' innovation problem and its wider cost

Source: Accelerating Innovation in the Telecommunications Arena

Imagine how useful 3D video calls would have been this last year.

The technologies needed – a light field display and digital compression techniques to send the vast data generated across a network – do exist but practical holographic systems for communication remain years off.

But this is just the sort of application that telcos should be pursuing to benefit their businesses.

A call for innovation

“Innovation in our industry has always been problematic,” says Don Clarke, formerly of BT and CableLabs and co-author of a recent position paper outlining why telecoms needs to be more innovative.

Entitled Accelerating Innovation in the Telecommunications Arena, the paper’s co-authors include representatives from communications service providers (CSPs), Telefonica and Deutsche Telekom.

In an era of accelerating and disruptive change, CSPs are proving to be an impediment, argues the paper.

The CSPs’ networking infrastructure has its own inertia; the networks are complex, vast in scale and costly. The operators also require a solid business case before undertaking expensive network upgrades.

Such inertia is costly, not only for the CSPs but for the many industries that depend on connectivity.

But if the telecom operators are to boost innovation, practices must change. This is what the position paper looks to tackle.

NFV White Paper

Clarke was one of the authors of the original Network Functions Virtualisation (NFV) White Paper, published by ETSI in 2012.

The paper set out a blueprint as to how the telecom industry could adopt IT practices and move away from specialist telecom platforms running custom software. Such proprietary platforms made the CSPs beholden to systems vendors when it came to service upgrades.

Don Clarke, formerly of BT and CableLabs and co-author of a recent position paper outlining why telecoms needs to be more innovative.

The NFV paper also highlighted a need to attract new innovative players to telecoms.

“I see that paper as a catalyst,” says Clarke. “The ripple effect it has had has been enormous; everywhere you look, you see its influence.”

Clarke cites how the Linux Foundation has re-engineered its open-source activities around networking while Amazon Web Services now offers a cloud-native 5G core. Certain application programming interfaces (APIs) cited by Amazon as part of its 5G core originated in the NFV paper, says Clarke.

Software-based networking would have happened without the ETSI NFV white paper, stresses Clarke, but its backing by leading CSPs spurred the industry.

However, building a software-based network is hard, as the subsequent experiences of the CSPs have shown.

“You need to be a master of cloud technology, and telcos are not,” says Clarke. “But guess what? Riding to the rescue are the cloud operators; they are going to do what the telcos set out to do.”

For example, as well as hosting a 5G core, AWS is active at the network edge, including with its Internet of Things (IoT) Greengrass service. Microsoft, having acquired telecom vendors Metaswitch and Affirmed Networks, has launched ‘Azure for Operators’ to offer 5G, cloud and edge services. Meanwhile, Google has signed agreements with several leading CSPs to advance 5G mobile edge computing services.

“They [the hyperscalers] are creating the infrastructure within a cloud environment that will be carrier-grade and cloud-native, and they are competitive,” says Clarke.

The new ecosystem

The position paper describes the telecommunications ecosystem in three layers (see diagram).

The CSPs are examples of the physical infrastructure providers (bottom layer) that have fixed and wireless infrastructure providing connectivity. The physical infrastructure layer is where the telcos have their value – their ‘centre of gravity’ – and this won’t change, says Clarke.

The infrastructure layer also includes the access network, which is the CSPs’ crown jewels.

“The telcos will always defend and upgrade that asset,” says Clarke, adding that the CSPs have never cut access R&D budgets. Access is the part of the network that accounts for the bulk of their spending. “Innovation in access is happening all the time but it is never fast enough.”

The middle, digital network layer is where the nodes responsible for switching and routing reside, as do the NFV and software-defined networking (SDN) functions. It is here where innovation is needed most.

Clarke points out that the middle and upper layers are blurring; they are shown separately in the diagram for historical reasons, since the CSPs own the big switching centres and the fibre that connects them.

But the hyperscalers – with their data centres, fibre backbones, and NFV and SDN expertise – play in the middle layer too even if they are predominantly known as digital service providers, the uppermost layer.

The position paper’s goal is to address how CSPs can better address the upper two network layers while also attracting smaller players and start-ups to fuel innovation across all three.

Paper proposal

The paper identifies several key issues that curtail innovation in telecoms.

One is the difficulty for start-ups and small companies to play a role in telecoms and build a business.

Just how difficult it can be is highlighted by the closure of SDN-controller specialist, Lumina Networks, which was already engaged with two leading CSPs.

In a Telecom TV panel discussion about innovation in telecoms, that accompanied the paper’s publication, Andrew Coward, the then CEO of Lumina Networks, pointed out how start-ups require not just financial backing but assistance from the CSPs due to their limited resources compared to the established systems vendors.

It is hard for a start-up to respond to an operator’s request for proposals that can be thousands of pages long. And when they do, will the CSPs’ procurement departments consider them, given their size?

Coward argues that a portion of the CSPs’ capital expenditure should be committed to start-ups. That, in turn, would instil greater venture capital confidence in telecoms.

The CSPs also have ‘organisational inertia’ in contrast to the hyperscalers, says Clarke.

“Big companies tend towards monocultures and that works very well if you are not doing anything from one year to the next,” he says.

The hyperscalers’ edge is their intellectual capital and they work continually to produce new capabilities. “They consume innovative brains far faster and with more reward than telcos do, and have the inverse mindset of the telcos,” says Clarke.

The goals of the innovation initiative are to get CSPs and the hyperscalers – the key digital service providers – to work more closely.

“The digital service providers need to articulate the importance of telecoms to their future business model instead of working around it,” says Clarke.

Clarke hopes the digital service providers will step up and help the telecom industry be more dynamic given the future of their businesses depend on the infrastructure improving.

In turn, the CSPs need to stand up and articulate their value. This will attract investors and encourage start-ups to become engaged. It will also force the telcos to be more innovative and overcome some of the procurement barriers, he says.

Ultimately, new types of collaboration need to emerge that will address the issue of innovation.

Next steps

Work has advanced since the paper was published in June and additional players have joined the initiative, to be detailed soon.

“This is the beginning of what we hope will be a much more interesting dialogue, because of the diversity of players we have in the room,” says Clarke. “It is time to wake up, not only because of the need for innovation in our industry but because we are an innovation retardant everywhere else.”

Further information:

Telecom TV’s panel discussion: Part 2

Tom Nolle’s response to the Accelerating Innovation in the Telecommunications Arena paper


Open Eye gets webscale attention

Microsoft has trialled optical modules that use signalling technology developed by the Open Eye Consortium.

The webscale player says optical modules using the Open Eye’s analogue 4-level pulse-amplitude modulation (PAM-4) technology consume less power than modules with a PAM-4 digital signal processor (DSP).

Brad Booth

“Open Eye has shown us at least an ability that we can do better on power,” says Brad Booth, director, next cloud system architecture, Azure hardware systems and infrastructure at Microsoft, during an Open Eye webinar.

Optical module power consumption is a key element of the total power budget of data centres that can have as many as 100,000 servers and 50,000 switches.

“You want to avoid running past your limit because then you have to build another data centre,” says Booth.

But challenges remain before Open Eye becomes a mainstream technology, says Dale Murray, principal analyst at market research firm, LightCounting.

Open Eye MSA

When the IEEE standards body developed specifications using 50-gigabit PAM-4 optical signals, the assumption was that a DSP would be needed for signal recovery given the optics’ limited bandwidth.

But as optics improved, companies wondered if analogue circuitry could be used after all.

Such PAM-4 analogue chips would be similar to non-return-to-zero (NRZ) signalling chips used in modules, as would the chip assembly and testing, says Timothy Vang, vice president of marketing and applications, signal integrity products group, Semtech. The analogue chips also promised to be cheaper than DSPs.

This led to the formation of the Open Eye multi-source agreement (MSA) in January 2019. Led by MACOM and Semtech, the MSA now has 37 member companies.

“We felt that if we could enable that capability, you could use the same low-cost optics and, with an Open Eye specification - an eye-mask specification - you get a manufacturable low-cost ecosystem,” says Vang. “That was our goal and we were not alone.”

But a key issue is whether Open Eye solutions will work with existing DSP-based PAM-4 modules that have their own testing procedure.

“Can they eliminate all concerns for interoperability between analogue and DSP based modules without dual testing?” says Murray. “And will end users go with a non-standard solution rather than an IEEE-standard solution?”

“We do face the dilemma LightCounting points out,” says Vang. “It is possible there are poor or older DSP-based modules that wouldn’t pass the Open Eye test, and that could lead data centres to say: ‘Well, that is not good enough’.”

Dale Murray

“It is a concern,” says Microsoft’s Booth. The first Open Eye samples Microsoft received didn't talk to all the DSP-based modules, he says, but the next revision appeared to address the issue.

“Digital interfaces are certainly easier, but we're burning a lot of power with the DSPs, in the modules and the switch ASIC,” says Booth. “The switch ASIC needs it for direct attach copper (DAC) cables.”

However, the MSA believes that the cost, power and latency advantages of the Open Eye ICs will prove decisive.

Data centre considerations

Microsoft’s Booth outlined the challenges data centre operators face as bandwidth requirements grow exponentially.

The drivers for greater bandwidth include more home-workers using cloud services during the Covid-19 pandemic and the adoption of artificial intelligence and machine learning.

“With machine learning, the more machines you have talking to each other, the more intensive jobs you can handle,” says Booth. “But for distances greater than a few meters you fall into the realm of the 100m range, and that drives you to an optical solution.”

But optics are costly, and moving from 100-gigabit to 400-gigabit optical modules has not reduced power consumption. Booth says 400-gigabit SR8 modules consume about 10W, while the 400-gigabit DR4 and FR4 consume 12W. For 100-gigabit modules, the power consumed is a quarter of these figures.
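
To put Booth's figures in per-bit terms, here is a minimal Python sketch normalising the quoted module powers to watts per 100 gigabits. The 100-gigabit values are simply taken as a quarter of the 400-gigabit ones, as stated above, rather than being measured figures; the labels are illustrative.

```python
# Hedged illustration: normalise the module powers quoted above to W per 100 Gb/s.
# The 100-gigabit figures are assumed to be exactly a quarter of the 400-gigabit ones.
modules = {
    "400G SR8":           (400, 10.0),      # (capacity in Gb/s, power in W)
    "400G DR4/FR4":       (400, 12.0),
    "100G (short reach)": (100, 10.0 / 4),
    "100G (DR/FR class)": (100, 12.0 / 4),
}

for name, (gbps, watts) in modules.items():
    per_100g = watts / (gbps / 100)
    print(f"{name}: {watts:.1f} W total, {per_100g:.2f} W per 100 Gb/s")
```

Both generations come out at roughly 2.5W to 3W per 100 gigabits, which is Booth's point: the move to 400 gigabits has not improved power per bit.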

Low latency is another requirement if data centres are to adopt disaggregated servers where memory is pooled and shared between platforms. “Adding latency to these links, which are fairly short, is an impediment to do this disaggregation scenario,” says Booth.

Microsoft trialled an eight-lane COBO on-board optics module using Open Eye and achieved a 30 per cent power saving compared to QSFP-DD or OSFP DSP-based pluggable modules.

Open Eye technology could also be used for co-packaged optics, promising a further 10 per cent power saving, says Booth.

With future 51.2-terabit and 102.4-terabit switch silicon and their considerable connectivity, such savings will help reduce the overall thermal load, and hence the cooling, which is part of a data centre’s overall power consumption.

“Anything that keeps that heat lower as I increase the bandwidth is an advantage,” says Booth.

Cost, power and latency

The Open Eye MSA claims it will cost a company $80 million to develop a next-generation 5nm CMOS PAM-4 DSP. Such a hefty development cost will need to be recouped, adding to a module's price.

Semtech says its Open Eye analogue ICs use a BiCMOS process which is a far cheaper approach.

Timothy Vang

The PAM-4 DSPs may consume more power, says Vang, but that will improve with newer CMOS processes. First-generation DSPs were implemented using 16nm CMOS while the latest devices are at 7nm CMOS.

So the power advantage of Open Eye devices will shrink, says Vang, although Semtech claims its second-generation Open Eye devices will reduce power by 20 per cent.

Open Eye also has a latency advantage. Citing analysis from Nvidia (Mellanox), the MSA says a PAM-4 DSP-based optical module adds 100ns of latency per link.

In a multi-hop network linking servers, the optical modules account for 40 per cent of the total latency, the rest being the switch, the network interface card and the optical flight time. Using Open Eye-based modules, the optical module portion shrinks to just eight per cent.
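
Those percentages imply a much smaller per-link figure for an Open Eye module. The Python sketch below makes that explicit, assuming the 100ns DSP figure is the only module contribution per link and that the switch, NIC and fibre flight time stay fixed when the module is swapped; both are simplifying assumptions made here, not stated by the MSA.

```python
# Back-of-envelope latency arithmetic from the figures quoted above.
dsp_module_ns = 100.0   # per-link latency attributed to a DSP-based PAM-4 module
dsp_share = 0.40        # modules' share of total link latency with DSP modules
open_eye_share = 0.08   # modules' share with Open Eye modules

total_with_dsp = dsp_module_ns / dsp_share      # ~250 ns per link
fixed_ns = total_with_dsp - dsp_module_ns       # ~150 ns: switch, NIC, fibre flight

# Solve module / (fixed + module) = open_eye_share for the Open Eye module latency.
open_eye_module_ns = open_eye_share * fixed_ns / (1 - open_eye_share)
print(f"Implied Open Eye module latency: {open_eye_module_ns:.0f} ns per link")  # ~13 ns
```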

Specification status

The Open Eye MSA has specified 53-gigabit PAM-4 signalling for long-reach and short-reach optical links.

The MSA is adding a 50-gigabit LR1 to its 200-gigabit FR4 specification, while an ER1 lite and a 200-gigabit LR4 will be completed in early 2021. Meanwhile, the multi-mode 50-gigabit SR1, 200-gigabit SR4 and 400-gigabit SR8 specifications are done.

The third phase of the Open Eye work, producing a 100-gigabit PAM-4 specification, is starting now. Achieving the specification is important for Open Eye since modules are moving to 100-gigabit PAM-4, says Murray.

 

A 200-gigabit QSFP56-FR4 module block diagram. Source: CIG.

 

Products

Semtech is already selling 200-gigabit Open Eye short-reach chips, part of its Tri-Edge family. The two 4x50-gigabit devices are dubbed the GN2558 and GN2559.

The GN2558 is the transmitter chip. It retimes four 50-gigabit signals from the host and feeds them to the integrated VCSEL drivers that generate the optical PAM-4 signals sent over four fibres. At the receiver, the four photo-detector outputs are fed to the GN2559, which includes trans-impedance amplifiers (TIAs) and clock-data recovery.

Equalisation is used within both devices. “The eye is opened on the transmitter as well as on the receiver; they equalise the signal in each direction,” says Vang.

The Semtech devices are being used for a 200-gigabit SR4 module and for a 400-gigabit SR8 active optical cable where two pairs of each chip are used.

Semtech will launch Tri-Edge long-reach Open Eye chips. The chips will drive externally-modulated lasers (EMLs), directly-modulated lasers (DMLs) and silicon photonics-based designs for single-mode fibre applications.

“We have early versions of these chips sampled and demonstrated,” says Vang. “In the Open Eye MSA, we have shown the chips interoperating with, for example, MACOM’s chipset.”

Semtech’s Tri-Edge solutions are in designs with over two dozen module customers, says Vang.

Meanwhile, pluggable module maker CIG detailed a 200-gigabit QSFP56-FR4 while Optomind discussed a 400-gigabit QSFP56-DD active optical cable design as part of the Open Eye webinar.


Habana Labs unveils its AI processor plans

Start-up Habana Labs has developed a chip architecture that promises to speed up the execution of machine-learning tasks. 

The Israeli start-up came out of secrecy in September to announce two artificial intelligence (AI) processor chips. One, dubbed Gaudi, is designed to tackle the training of large-scale neural networks. The chip will be available in 2019. 

Eitan Medina

Goya, the start-up’s second device, is an inference processor that implements the optimised, trained neural network.

The Goya chip is already in prospective customers’ labs undergoing evaluation, says Eitan Medina, Habana’s chief business officer.

Habana has just raised $75 million in a second round of funding, led by Intel Capital. Overall, the start-up has raised a total of $120 million in funding. 

 

Deep learning

Deep learning is a key approach used to perform machine learning. It uses an artificial neural network with many hidden layers; a hidden layer is a layer of nodes between the neural network’s input and output layers.

To benefit from deep learning, the neural network must first be trained with representative data. This is an iterative and computationally-demanding process. 

 

The computing resources used to train the largest AI jobs have doubled every 3.5 months since 2012

 

Once trained, a neural network is ready to analyse data. Common uses of trained neural networks include image classification and autonomous vehicles.

 

Source: Habana Labs

Two types of silicon are used for deep learning: general-purpose server CPUs such as from Intel and graphics processing units (GPUs) from the likes of Nvidia. 

Most of the growth has been in the training of neural networks, and this is where Nvidia has done very well. The company has a run rate close to $3 billion just building chips for training neural networks, says Karl Freund, senior analyst, HPC and deep learning at Moor Insights & Strategy. “They own that market.”

Now custom AI processors are emerging from companies such as Habana that are looking to take business from Nvidia and exploit the emerging market for inference chips. 

“Use of neural networks outside of the Super Seven [hyperscalers] is still a nascent market but it could be potentially a $20 billion market in the next 10 years,” says Freund. “Unlike in training where you have a very strong incumbent, in inference - which could be a potentially larger market - there is no incumbent.”  

This is where many new chip entrants are focussed. After all, it is a lot easier to go after an emerging market than to displace a strong competitor such as Nvidia, says Freund, who adds that Nvidia has its own inference hardware but it is suited to solving really difficult problems such as autonomous vehicles.  

“For any new processor architecture to have any justification, it needs to be significantly better than previous ones,” says Medina. 

Habana cites the ResNet-50 image classification algorithm to highlight its silicon’s merits. ResNet-50 refers to a 50-layer neural network that makes use of a technique called residual learning that improves the efficacy of image classification.    

Habana’s Goya HL-1000 processor can classify 15,000 images-per-second using ResNet-50, while Nvidia’s V100 GPU classifies 2,657 and Intel’s dual-socket Platinum 8180 CPU achieves 1,225 images-per-second.
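
The quoted throughputs translate directly into speed-up factors; a quick calculation in Python:

```python
# ResNet-50 inference throughputs quoted above, in images per second.
throughput = {
    "Habana Goya HL-1000": 15000,
    "Nvidia V100 GPU": 2657,
    "Intel dual-socket Platinum 8180": 1225,
}

goya = throughput["Habana Goya HL-1000"]
for name, images_per_s in throughput.items():
    if name != "Habana Goya HL-1000":
        print(f"Goya advantage over {name}: {goya / images_per_s:.1f}x")
```

The roughly 5.6x figure over the V100 is consistent with the 5-6x advantage Gwennap cites below.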

“What we have architected is fundamentally better than CPUs and GPUs in terms of processing performance and the processing-power factor,” says Medina.

“Habana appears to be one of the first start-ups to bring an AI accelerator to the market, that is, to actually deliver a product for sale,” says Linley Gwennap, president and principal analyst of The Linley Group. 

Both Habana and start-up Graphcore expect to have final products for sale this year, he says, while Wave Computing, another start-up, expects to enter production early next year. 

“It is also impressive that Habana is reporting 5-6x better performance than Nvidia, whereas Graphcore’s lead is less than 2x,” says Gwennap. “Graphcore focuses on training, however, whereas the Goya chip is for inference.”

 

Habana appears to be one of the first start-ups to bring an AI accelerator to the market


Gaudi training processor

Habana’s Gaudi chip is a neural-network training processor. Once trained, the neural network is optimised and loaded into the inference chip such as Habana’s Goya to implement what has been learnt.

“The process of getting to a trained model involves a very different compute, scale-out and power-envelope environment to that of inference,” says Medina.

To put this in perspective, the computing resources used to train the largest AI jobs have doubled every 3.5 months since 2012. The finding, from AI research company OpenAI, means that the computing power employed now has grown by over one million times since 2012.
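
The arithmetic behind that claim is straightforward. A short sketch, assuming roughly six years (72 months) between 2012 and the time of the OpenAI analysis - an assumption made here for illustration:

```python
# Compound growth implied by a doubling every 3.5 months.
months = 72               # assumed span: 2012 to roughly 2018
doubling_period = 3.5
growth = 2 ** (months / doubling_period)
print(f"Growth in training compute over {months} months: about {growth:,.0f}x")
```

The result is roughly 1.5 million, consistent with "over one million times".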

Habana remains secretive about the details of its chips. It has said that the 16nm CMOS Gaudi chip can scale to thousands of units and that each device will have 2 terabits of input-output (I/O). This contrasts with GPUs used for training, which do have scaling issues, it says.

First, GPUs are expensive and power-hungry devices. The data set used for training such as for image classification needs to be split across the GPUs. If the number of images - the batch size - given to each one is too large, the training model may not converge. If the model doesn't converge, the neural network will not learn to do its job. 

In turn, reducing the batch size affects the overall throughput. “GPUs and CPUs want you to feed them with a lot of data to increase throughput,” says Medina.     

Habana says that unlike GPUs, its training processor’s performance will scale with the number of devices used.  

“We will show with the Gaudi that we can scale performance linearly,” says Medina. “Training jobs will finish faster and models could be much deeper and more complex.”

 The Goya IC architecture. Habana says this is a general representation of the chip and what is shown is not the actual number of tensor processor cores (TPCs). Source: Habana Labs

 

Goya inference processor 

The Goya processor comprises multiple tensor processor cores (TPCs); see diagram. Habana is not saying how many, but each TPC is capable of processing vectors and matrices efficiently using several data types - eight-, 16- and 32-bit signed and unsigned integers and 32-bit floating point. To achieve this, each TPC is a very-long-instruction-word (VLIW), single-instruction, multiple-data (SIMD) vector processor. Each TPC also has its own local memory.
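
Habana has not disclosed the TPC's instruction set, but the data types it lists hint at the style of arithmetic involved. The NumPy sketch below is purely illustrative and is not Habana's API: an 8-bit integer matrix multiply accumulated in 32 bits, the sort of low-precision tensor operation an inference core is built to do quickly.

```python
import numpy as np

# Illustrative only - not Habana's API. An int8 matrix multiply with an int32
# accumulator, one of the data-type combinations the TPC supports.
rng = np.random.default_rng(0)
activations = rng.integers(-128, 128, size=(4, 8), dtype=np.int8)
weights = rng.integers(-128, 128, size=(8, 4), dtype=np.int8)

# Widen to int32 before multiplying so the 8-bit products cannot overflow.
result = activations.astype(np.int32) @ weights.astype(np.int32)
print(result)
```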

Other on-chip hardware blocks include a general matrix multiply (GEMM) engine, shared memory, an interface to external DDR4 SDRAM memory and support for PCI Express (PCIe) 4.0.

 

What we have architected is fundamentally better than CPUs and GPUs in terms of processing performance and the processing-power factor

 

Habana claims its inference chip has a key advantage when it comes to latency, the time it takes for the inference chip to deliver its answer. 

Latency, too, is a function of the batch size - the number of jobs presented to the device. Pooling the jobs presented to the chip is a benefit, but not if the resulting delay exceeds the latency the application requires.

“If you listen to what Google says about real-time applications, to meet the 99th percentile of real-time user interaction, they need the inference to be accelerated to under 7 milliseconds,” says Medina. “Microsoft also says that latency is incredibly important and that is why they can’t use a batch size of 64.”
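
A toy latency model makes the trade-off concrete. The numbers below are invented for illustration - they are not Habana's, Google's or Microsoft's - but they show why batching helps throughput while quickly breaking a real-time budget such as the 7 millisecond figure Medina cites.

```python
# Toy model: per-batch latency = fixed overhead + per-image compute time.
overhead_ms = 0.5      # assumed fixed cost of launching a batch
per_item_ms = 0.3      # assumed compute time per image
budget_ms = 7.0        # the real-time target quoted above

for batch in (1, 8, 64):
    latency_ms = overhead_ms + batch * per_item_ms
    throughput = batch / latency_ms * 1000    # images per second
    verdict = "within" if latency_ms <= budget_ms else "misses"
    print(f"batch={batch:2d}: {latency_ms:5.1f} ms ({verdict} the 7 ms budget), "
          f"{throughput:5.0f} images/s")
```

In this model, throughput keeps rising with batch size, but a batch of 64 blows well past the 7 ms target, echoing Microsoft's comment.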

Habana and other entrants are going after applications where their AI processors are efficient at real-time tasks with a batch size of one. “Everyone is focussing on what Nvidia can’t do well so they are building inference chips that do very well with low batch sizes,” says Freund.    

Having a low-latency device will not only enable all sorts of real-time applications but will also allow a data centre operator to rent out the AI processor to multiple customers, knowing what the latency will be for each job.

“This will generate more revenue and lower the cost of AI,” says Medina.

 

AI PCIe cards

Habana is offering two PCIe 4.0 card versions of its Goya chip: one a single-slot card and the other double width. The double-width card is to accommodate customers whose platforms already house double-width GPU cards.

Habana’s PCIe 4.0 card includes the Goya chip and external memory and consumes around 100W, the majority of it consumed by the inference chip.

The card’s PCIe 4.0 interface has 16 lanes (x16) but nearly all the workloads can manage with a single lane.   

“The x16 is in case you go for more complicated topologies where you can split the model between adjacent cards and then we need to pass information between our processors,” says Medina. 

Here, a PCIe switch chip would be put on the motherboard to enable the communications between the Goya processors.
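
For context on the single-lane versus x16 point, the bandwidth gap is large. A quick calculation using the PCIe 4.0 line rate of 16 GT/s per lane and 128b/130b encoding, with packet and protocol overheads ignored:

```python
# Raw PCIe 4.0 bandwidth per direction, ignoring packet and protocol overheads.
line_rate = 16e9            # transfers per second per lane
encoding = 128 / 130        # 128b/130b line coding
bytes_per_lane = line_rate * encoding / 8

for lanes in (1, 16):
    print(f"x{lanes:<2d}: ~{lanes * bytes_per_lane / 1e9:.1f} GB/s per direction")
```

That is roughly 2 GB/s for a single lane against about 31 GB/s for the full x16 interface.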

 

Do start-ups have a sustainable architectural roadmap that offers innovation beyond just such single-cycle operations? 

 

Applications

Habana has developed demonstrations of four common applications to run on the Goya cards: image classification, machine translation, recommendations, and the classification of text known as sentiment analysis.

The four were chosen as potential customers want to see these working. “If they are going to buy your hardware for inference, they want to make sure it can deal with any topology they come up with in future,” says Medina.

Habana says it is already engaged with customers other than the largest data centre operators.  And with time, the start-up expects to develop inference chips with tailored I/O to address dedicated applications such as autonomous vehicles.

There are also other markets emerging beside data centres and self-driving cars.

“Mythic, for example, targets security cameras while other start-ups offer IP cores, and some target the Internet of Things and other low-cost applications,” says Gwennap. “Eventually, most processors will have some sort of AI accelerator built-in, so there are many different opportunities for this technology.”

  

Start-up challenge

The challenge facing all the AI processor start-ups, says Freund, is doing more than developing an architecture that can do a multiply-accumulate operation in a single processor clock cycle, and not just with numbers but with n-dimensional matrices.

“That is really hard but eventually - give or take a year - everyone will figure it out,” says Freund. 

The question for the start-ups is: do they have a sustainable architectural roadmap that offers innovation beyond just such single-cycle operations? 

“What architecturally are you able to do beyond that to avoid being crushed by Nvidia, and if not Nvidia then Intel, because they haven't finished yet?” says Freund.

This is what all these start-ups are going to struggle with whereas Nvidia has 10,000 engineers figuring it out, he warns.

 

Article updated on Nov 16 to report the latest Series B funding.  


COBO targets year-end to complete specification

Part 3: 400-gigabit on-board optics

  • COBO will support 400-gigabit and 800-gigabit interfaces 
  • Three classes of module have been defined, the largest supporting at least 17.5W 

The Consortium for On-board Optics (COBO) is scheduled to complete its module specification this year.

A draft specification defining the mechanical aspects of the embedded optics - the dimensions, connector and electrical interface - is already being reviewed by the consortium’s members.

Brad Booth

“The draft specification encompasses what we will do inside the data centre and what will work for the coherent market,” says Brad Booth, chair of COBO and principal network architect for Microsoft’s Azure Infrastructure.

COBO was established in 2015 to create an embedded optics multi-source agreement (MSA). On-board optics have long been available but until now these have been proprietary solutions. 

“Our goal [with COBO] was to get past that proprietary aspect,” says Booth. “That is its true value - it can be used for optical backplane or for optical interconnect and now designers will have a standard to build to.” 

 

The draft specification encompasses what we will do inside the data centre and what will work for the coherent market

 

Specification

The COBO modules are designed to be interchangeable. Unlike front-panel optical modules, COBO modules are not ‘hot-pluggable’ - they cannot be replaced while the card is powered - but the design does allow modules to be swapped.

The COBO design supports 400-gigabit multi-mode and single-mode optical interfaces. The electrical interface chosen is the IEEE-defined CDAUI-8: eight lanes, each at 50 gigabits-per-second, implemented using a 25-gigabaud symbol rate and 4-level pulse-amplitude modulation (PAM-4). COBO also supports an 800-gigabit interface using two tightly-coupled COBO modules.
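
The CDAUI-8 arithmetic is simple enough to show directly. The sketch below uses the round 25-gigabaud figure quoted above; the exact IEEE symbol rate is slightly higher.

```python
# CDAUI-8: eight electrical lanes of PAM-4, two bits per symbol.
lanes = 8
bits_per_symbol = 2          # PAM-4
symbol_rate_gbaud = 25       # round figure used above; the exact rate is a little higher

lane_gbps = bits_per_symbol * symbol_rate_gbaud
print(f"Per lane: {lane_gbps} Gb/s, total: {lanes * lane_gbps} Gb/s")   # 50 and 400
```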

The consortium has defined three module categories that vary in length. The module classes reflect the power envelope requirements; the shortest module supports multi-mode and the lower-power module designs while the longest format supports coherent designs. “The beauty of COBO is that the connectors and the connector spacings are the same no matter what length [of module] you use,” says Booth.

The COBO module is described as table-like, a very small printed circuit board that sits on two connectors. One connector is for the high-speed signals and the other for the power and control signals. “You don't have to have the cage [of a pluggable module] to hold it because of the two-structure support,” says Booth.

To be able to interchange classes of module, a ‘keep-out’ area is used. This area refers to board space that is deliberately left empty to ensure the largest COBO module form factor will fit. A module is inserted onto the board by first pushing it downwards and then sliding it along the board to fit the connection.

Booth points out that module failures are typically due to the optical and electrical connections rather than the optics itself. This is why the repeatable accuracy of pick-and-place machines is favoured for the module’s insertion. “The thing you want to avoid is having touch points in the field,” he says.

 

Coherent

A working group was set up after the Consortium first started to investigate using the MSA for coherent interfaces. This work has now been included in the draft specification. “We realised that leaving it [the coherent work] out was going to be a mistake,” says Booth.

The main coherent application envisaged is the 400ZR specification being developed by the Optical Internetworking Forum (OIF).

The OIF 400ZR interface is the result of Microsoft’s own Madison project specification work. Microsoft went to the industry with several module requirements for metro and data centre interconnect applications.

Madison 1.0 was a two-wavelength 100-gigabit module using PAM-4 that resulted in Inphi’s 80km ColorZ module that supports up to 4 terabits over a fibre. Madison 1.5 defines a single-wavelength 100-gigabit module to support 6.4 to 7.2 terabits on a fibre. “Madison 1.5 is probably not going to happen,” says Booth. “We have left it to the industry to see if they want to build it and we have not had anyone come forward yet.”

Madison 2.0 specified a 400-gigabit coherent-based design to support a total capacity of 38.4 terabits - 96 wavelengths of 400 gigabits.

Microsoft initially envisioned a 43 gigabaud 64-QAM module. However, the OIF's 400ZR project has since adopted a 60-gigabaud 16-QAM module which will achieve either 48 wavelengths at 100GHz spacing or 64 wavelengths at 75GHz spacing, capacities of 19.2Tbps and 25.6Tbps, respectively. 
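
The two capacity figures follow from simple channel-count arithmetic. The sketch below assumes a usable C-band of about 4.8 THz - an assumption made here rather than a figure from the article.

```python
# Channel counts and total capacity for the two 400ZR grid options quoted.
usable_band_ghz = 4800          # assumed usable C-band width
per_wavelength_gbps = 400

for spacing_ghz in (100, 75):
    channels = usable_band_ghz // spacing_ghz
    capacity_tbps = channels * per_wavelength_gbps / 1000
    print(f"{spacing_ghz} GHz spacing: {channels} wavelengths, {capacity_tbps:.1f} Tb/s")
```

This reproduces the 48-wavelength, 19.2Tbps and 64-wavelength, 25.6Tbps options.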

 

In 2017, the number of coherent metro links Microsoft will use will be 10x greater than the number of metro and long-haul coherent links it used in 2016.

 

Once Microsoft started talking about Madison 2.0, other large internet content providers came forward saying they had similar requirements, which led to the initiative being taken to the OIF. The result is 400ZR, which the large-scale data centre players want built by as many module companies as possible.

Booth highlights the difference in Microsoft’s coherent interface volume requirements just in the last year. In 2017, the number of coherent metro links Microsoft will use will be 10x greater than the number of metro and long-haul coherent links it used in 2016.

“Because it is an order of magnitude more, we need to have some level of specification, some level of interop because now we're getting to the point where if I have an issue with any single supplier, I do not want my business impeded by it,” he says.     

Regarding the COBO module, Booth stresses that it will be the optical designers that will determine the different coherent specifications possible. Thermal simulation work already shows that the module will support 17.5W and maybe more.

“There is a lot more capability in this module than there is in a standard pluggable only because we don't have the constraint of a cage,” says Booth. “We can always go up in height and we can always add more heat sink.”

Booth says the COBO specification will likely need a couple more members’ reviews before its completion. “Our target is still to have this done by the end of the year,” he says.

 

Amended on Sept 4th, added comment about the 400ZR wavelength plans and capacity options


FPGAs embrace data centre co-processing role

Part 1: Xilinx's SDAccel development tool


The PCIe accelerator card has a power budget of 25W. Hyper data centres can host hundreds of thousands of servers whereas other industries with more specialist computation requirements use far fewer servers. As such, they can afford a higher power budget per card. Source: Xilinx

Xilinx has developed a software-design environment that simplifies the use of an FPGA as a co-processor alongside the server's x86 instruction set microprocessor.

Dubbed SDAccel, the development environment enables a software engineer to write applications using OpenCL, C or the C++ programming language running on servers in the data centre.   

Applications can be developed to run on the server's FPGA-based acceleration card without requiring design input from a hardware designer. Until now, a hardware engineer has been needed to convert the code into the RTL hardware description language that is mapped onto the FPGA's logic gates using synthesis tools.

"[Now with SDAccel] you suffer no degradation in [processing] performance/ Watt compared to hand-crafted RTL on an FPGA," says Giles Peckham, regional americas and EMEA marketing director at Xilinx. "And you move the entire design environment into the software domain; you don't need a hardware designer to create it."   

 

Data centre acceleration

The data centre is the first application targeted for SDAccel along with the accompanying FPGA accelerator cards developed by Xilinx's three hardware partners: Alpha Data, Convey and Pico Computing.

The FPGA cards, which connect to the server's host processor via the PCI Express (PCIe) interface, are aimed not just at leading internet content providers but also at institutions and industries with custom computational needs. These include oil and gas, financial services, medical and defence companies.

PCIe cards have a power budget of 25W, says Xilinx. The card's power can be extended by adding power cables but considering that hyper data centres can have hundreds of thousands of servers, every extra Watt consumed comes at a cost.

 

Microsoft has reported that a production pilot of 1,632 servers fitted with PCIe-based FPGA cards achieved a doubling of throughput, 29 percent lower latency, and a 30 percent cost reduction compared to servers without accelerator cards

 

In contrast, institutions and industries use far fewer servers in their data centres. "They can stomach the higher power consumption, from a cost perspective and in terms of dissipating the heat, up to a point," says Peckham. Their accelerator cards may consume up to 100W. "But both have this limitation because of the power ceiling," he says.     

China’s largest search-engine specialist, Baidu, uses neural-network processing to solve problems in speech recognition, image search, and natural language processing, according to The Linley Group senior analyst, Loring Wirbel.

Baidu has developed a 400-gigaflop software-defined accelerator board, based on a Xilinx Kintex-7 FPGA, that plugs into any 1U or 2U server using PCIe. Baidu says the board achieves four times the performance of graphics processing units (GPUs) and nine times that of CPUs, while consuming between 10W and 20W.

Microsoft has reported that a production pilot of 1,632 servers fitted with PCIe-based FPGA cards achieved a doubling of throughput, 29 percent lower latency, and a 30 percent cost reduction compared to servers without accelerator cards.

"The FPGA can implement highly parallel applications with the exact hardware required," says Peckham. Since the dynamic power consumed by the FPGA depends on clock frequency and the amount of logic used, the overall power consumption is lower than a CPU or GPU. That is because the FPGA's clock frequency may be 100MHz compared to a CPU's or GPU's 1 GHz, and the FPGA implements algorithms in parallel using hardware tailored to the task.

 

FPGA processing performance/W for data centre acceleration tasks compared to GPUs and CPUs. Note the FPGA's performance/W advantage increases with the number of software threads. Source: Xilinx

 

SDAccel

To develop a design environment that a software developer alone can use, Xilinx has to make SDAccel aware of the FPGA card's hardware, using what is known as a board support package. "There needs to be an understanding of the memory and communications available to the FPGA processor," says Peckham. "The processor then knows all the hardware around it."

Xilinx claims SDAccel is the industry's first architecturally optimising compiler for FPGAs. "It is as good as hand-coding [RTL]," says Peckham. The tool also delivers a CPU/GPU-like design environment. "It is also the first tool that enables designs to have multiple operations at different times on the same FPGA," he says. "You can reconfigure the accelerator card in runtime without powering down the rest of the chip."

SDAccel and the FPGA cards are available, and the tool is with several customers. "We have proven the tool, debugged it, created a GUI as opposed to a command line interface, and have three FPGA boards being sold by our partners," says Peckham. "More partners and more boards will be available in 2015."

Peckham says the simplified design environment appeals to companies not addressing the data centre. "One company in Israel uses a lot of Virtex-6 FPGAs to accelerate functions that start in C code," he says. "They are using FPGAs but the whole design process is drawn-out; they were very happy to learn that [with SDAccel] they don't have to hand-code RTL to program them."    

Xilinx is working to extend OpenCL for computing tasks beyond the data centre. "It is still a CPU-PCIe-to-co-processor architecture but for wider applications," says Peckham.

 

For Part 2, click here

For Part 3, click here

