The computing problem of our time: Moving data

David Lazovsky

  • Celestial AI’s Photonic Fabric technology can deliver up to 700 terabits per second of bidirectional bandwidth per chip package.
  • The start-up has recently raised $100 million in funding.

The size of the AI models that implement machine learning continues to grow staggeringly fast.

Such AI models are used for computer vision, large language models such as ChatGPT, and recommendation systems that rank items such as search results and music playlists.

The workhorse silicon used to build such AI models is the graphics processing unit (GPU). GPU processing performance and memory capacity may be advancing impressively, but AI model growth is far outpacing their processing and input-output (I/O) capabilities.

To tackle large AI model workloads, hundreds and even thousands of GPUs are deployed in parallel to boost overall processing performance and aggregate high-performance memory capacity.

But it is proving hugely challenging to scale such parallel systems and feed sufficient data to the expensive processing nodes so they can do their work.

Or as David Lazovsky, CEO of start-up Celestial AI puts it, data movement has become the computing problem of our time.

Input-output bottleneck

The challenges of data movement and of scaling hardware for machine learning have caused certain AI start-ups to refocus, looking beyond AI processor development to how silicon photonics can tackle the input-output (I/O) bottleneck.

Lightelligence is one such start-up; Celestial AI is another.

Founded in 2020, Celestial AI has raised $100 million in its latest round of funding, and $165 million overall.

Celestial AI’s products include the Orion AI processor and its Photonic Fabric, an optoelectronic system-in-package comprising a silicon photonics chip and the associated electronics IC.

The Photonic Fabric uses two technological differentiators: a thermally stable optical modulator, and an electrical IC implemented in advanced CMOS.

The Photonic Fabric. Source: Celestial AI

Thermally stable modulation

Many companies use a ring resonator modulator for their co-packaged optics designs, says Lazovsky. Ring resonator modulators are tiny but sensitive to heat, so they must be temperature-controlled to work optimally.

“The challenge of rings is that they are thermally stable to about one degree Celsius,” says Lazovsky.

Celestial AI uses silicon photonics as an interposer such that it sits under the ASIC, a large chip operating at high temperatures.

“Using silicon photonics to deliver optical bandwidth to a GPU that’s running at 500-600 Watts, that’s just not going to work for a ring,” says Lazovsky, adding that even integrating silicon photonics into memory chips that consume 30W will not work.

Celestial AI uses a modulator that is 60 times more thermally stable than a ring modulator.

The start-up uses continuous-wave distributed feedback (DFB) lasers as the light source, the same lasers used for 400-gigabit DR4 and FR4 pluggable transceivers, and sets their wavelength to the high end of the operating window.

The result is a 60-degree window within which the silicon photonics circuits can operate. “We can also add closed-loop control if necessary,” says Lazovsky.

Celestial AI is not revealing the details of its technology, but the laser source is believed to be external to the silicon photonics chip.

Thus a key challenge is getting the modulator to work stably so close to the ASIC, and this Celestial AI says it has done.

Advanced CMOS electronics

The start-up says TSMC’s 4nm and 5nm CMOS are the process nodes to be used for the Photonic Fabric’s electronics IC accompanying the optics.

“We are qualifying our technology for both 4nm and 5nm,” says Lazovsky. “Celestial AI’s current products are built using TSMC 5nm, but we have also validated the Photonic Fabric using 4nm for the ASIC in support of our IP licensing business.”

The electronics IC includes the modulator’s drive circuitry and the receiver’s trans-impedance amplifier (TIA).

Celestial AI has deliberately chosen to implement the electronics in a separate chip rather than use a monolithic design as done by other companies. With a monolithic chip, the optics and electronics are implemented using the same 45nm silicon photonics process.

But a 45nm process for the electronics is already an old process, says the start-up.

Using state-of-the-art 4nm or 5nm CMOS cuts down the area and the power requirements of the modulation driver and TIA. The optics and electronics are tightly aligned, less than 150 microns apart.

“We are mirroring the layout of our drivers and TIAs in electronics with the modulator and the photodiode in silicon photonics such that they are directly on top of each other,” says Lazovsky.

The proximity ensures a high signal-to-noise ratio; no advanced forward error correction (FEC) scheme or digital signal processor (DSP) is needed. The short distances also reduce latency.

This contrasts with co-packaged optics, where chiplets surround the ASIC to provide optical I/O but take up valuable space alongside the ASIC edge, referred to as beachfront.

If the ASIC is a GPU, such chiplets must compete with stacked memory packages – the latest version being High Bandwidth Memory 3 (HBM3) – that also must be placed close to the ASIC.

There is also only so much space for the HBM3’s 1024-bit wide interface to move data, a problem also shared by co-packaged optics, says Lazovsky.

Using the Universal Chiplet Interconnect Express (UCIe) interface, for example, there is a limit to the bandwidth that can be distributed, not just to the chip but across the chip too.

“The beauty of the Photonic Fabric is not just that we have much higher bandwidth density, but that we can deliver that bandwidth anywhere within the system,” says Lazovsky.

The interface comes from below the ASIC and can deliver data to where it is needed: to the ASIC’s compute engines and on-chip Level 2 cache memory.

Bandwidth density

Celestial AI’s first-generation implementation uses four channels of 56-gigabit-per-second non-return-to-zero (NRZ) signalling to deliver up to 700 terabits per second (Tbps) of total bidirectional bandwidth per package.

How this number is arrived at has not been disclosed, but it is based on feeding the I/O through the ASIC’s surface area rather than the chip’s edges.
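
Since the derivation is not public, the arithmetic below is only a back-of-envelope sketch: it assumes the 700Tbps figure is the aggregate of many four-lane ports spread across the package surface, with each lane running at the quoted 56Gbps.

```python
# Back-of-envelope sketch only; Celestial AI has not disclosed how the 700Tbps
# figure is derived. Assumes the total aggregates many four-lane ports.
lane_rate_gbps = 56                       # quoted per-lane NRZ rate
lanes_per_port = 4                        # four channels per port, per the article
package_bw_tbps = 700                     # claimed total bidirectional bandwidth

port_rate_gbps = lane_rate_gbps * lanes_per_port            # 224 Gbps per port
ports_needed = package_bw_tbps * 1_000 / port_rate_gbps     # ~3,125 ports
print(f"{port_rate_gbps} Gbps per port; ~{ports_needed:.0f} ports for {package_bw_tbps} Tbps")
```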

To put that in perspective, Nvidia’s latest Hopper H100 Tensor Core GPU uses five HBM3 sites. These sites deliver 80 gigabytes of memory and over three terabytes per second – 30Tbps – of total memory bandwidth.

The industry trend is to add more HBM memory in-package, but AI models are growing hundreds of times faster. “You need orders of magnitude more memory for a single workload than can fit on a chip,” he says.

Accordingly, vast amounts of efficient I/O are needed to link AI processors to remote pools of high-bandwidth memory by disaggregating memory from compute.

Celestial AI is now working on its second-generation interface, expected in 18 months. The newer interface quadruples the package bandwidth to more than 2,000Tbps: it uses 4-level pulse amplitude modulation (PAM-4) signalling to deliver 112Gbps per channel and doubles the channel count from four to eight.
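
The generational step can be sanity-checked with simple arithmetic on the quoted figures: doubling the per-lane rate and doubling the channel count gives a factor of four. A minimal sketch:

```python
# Scaling from the first to the second generation, using the quoted figures:
# 56Gbps NRZ -> 112Gbps PAM-4 per channel, and four -> eight channels.
gen1_tbps = 700
rate_scaling = 112 / 56                   # PAM-4 doubles the per-channel rate
channel_scaling = 8 / 4                   # channel count doubles
gen2_tbps = gen1_tbps * rate_scaling * channel_scaling
print(f"Second generation: ~{gen2_tbps:.0f} Tbps per package")   # ~2,800 Tbps, i.e. >2,000 Tbps
```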

“The fight is about bandwidth density, getting large-scale parameters from external memory to the point of computing as efficiently as possible,” says Lazovsky.

By efficiently, Lazovsky means bandwidth, energy, and latency. And low latency for AI applications translates to revenues.

Celestial AI believes its Photonic Fabric technology is game-changing due to the bandwidth density achieved while overcoming the beachfront issue.

Composable memory

Celestial AI changed its priorities to focus on memory disaggregation after working with hyperscalers for the last two years.

The start-up will use its latest funding to expand its commercial activities.

“We’re building optically interconnected, high-capacity and high-bandwidth memory systems to allow our customers to develop composable resources,” says Lazovsky.

Celestial AI is using its Photonic Fabric to enable 16 servers (via PCI Express cards) to access a single high-capacity, optically enabled pool of DDR, HBM and hybrid memory.

Another implementation will use its technology in chiplet form via the UCIe interface. Here, the bandwidth is 14.4Tbps, more than twice the speed of the leading co-packaged optics solutions.

Celestial AI also has an optical multi-chip interconnect bridge (OMIB), enabling an ASIC to access pooled high-capacity external memory in a 40ns round trip. OMIB can also be used to link chips optically on a multi-chip module.

Celestial AI stressed that its technology is not limited to memory disaggregation. The Photonic Fabric came out of the company’s efforts to scale multiple Orion AI processors.

Celestial AI supports the JEDEC HBM standard and CXL 2.0 and 3.0, as well as other physical interface technologies such as Nvidia’s NVLink and AMD’s Infinity Fabric.

“It is not limited to our proprietary protocol,” says Lazovsky.

The start-up is in discussions with ‘multiple’ companies interested in its technology, while Broadcom is a design services partner. Near Margalit, vice president and general manager of Broadcom’s optical systems division, is a technical advisor to the start-up.

Overall, the industry trend is to move from general computing to accelerated computing in data centres. That will drive more AI processors and more memory and compute disaggregation.

“It is optical,” says Lazovsky: “There is no other way to do it.”


Habana Labs unveils its AI processor plans

Start-up Habana Labs has developed a chip architecture that promises to speed up the execution of machine-learning tasks. 

The Israeli start-up came out of secrecy in September to announce two artificial intelligence (AI) processor chips. One, dubbed Gaudi, is designed to tackle the training of large-scale neural networks. The chip will be available in 2019. 

Eitan Medina

Goya, the start-up’s second device, is an inference processor that implements the optimised, trained neural network.

The Goya chip is already in prospective customers’ labs undergoing evaluation, says Eitan Medina, Habana’s chief business officer.

Habana has just raised $75 million in a second round of funding, led by Intel Capital. Overall, the start-up has raised a total of $120 million in funding. 

 

Deep learning

Deep learning is a key approach used to perform machine learning. It makes use of an artificial neural network with many hidden layers. A hidden layer is a layer of nodes between the neural network’s input and output layers.

To benefit from deep learning, the neural network must first be trained with representative data. This is an iterative and computationally demanding process.
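
To make the hidden-layer idea concrete, the sketch below builds a tiny fully-connected network in NumPy and runs a single forward pass; the layer sizes and ReLU activation are arbitrary illustrative choices, not anything specific to Habana’s designs.

```python
import numpy as np

# A tiny fully-connected network: input -> two hidden layers -> output.
# All sizes and the ReLU activation are illustrative choices only.
rng = np.random.default_rng(0)
layer_sizes = [8, 16, 16, 4]                      # input, two hidden layers, output
weights = [rng.standard_normal((m, n)) * 0.1
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x):
    """One forward pass; training would iteratively adjust the weights."""
    for w in weights[:-1]:
        x = np.maximum(x @ w, 0.0)                # hidden layers with ReLU
    return x @ weights[-1]                        # output layer

print(forward(rng.standard_normal(8)).shape)      # (4,)
```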

 

The computing resources used to train the largest AI jobs have doubled every 3.5 months since 2012

 

Once trained, a neural network is ready to analyse data. Common examples of where trained neural networks are used include image classification and autonomous vehicles.

 

Source: Habana Labs

Two types of silicon are used for deep learning: general-purpose server CPUs, such as those from Intel, and graphics processing units (GPUs) from the likes of Nvidia.

Most of the growth has been in the training of neural networks and this is where Nvidia has done very well. Nvidia has a run rate close to $3 billion just building chips to do the training of neural networks, says Karl Freund, senior analyst, HPC and deep learning at Moor Insights & Strategy. “They own that market.”

Now custom AI processors are emerging from companies such as Habana that are looking to take business from Nvidia and exploit the emerging market for inference chips. 

“Use of neural networks outside of the Super Seven [hyperscalers] is still a nascent market but it could be potentially a $20 billion market in the next 10 years,” says Freund. “Unlike in training where you have a very strong incumbent, in inference - which could be a potentially larger market - there is no incumbent.”  

This is where many new chip entrants are focussed. After all, it is a lot easier to go after an emerging market than to displace a strong competitor such as Nvidia, says Freund, who adds that Nvidia has its own inference hardware but it is suited to solving really difficult problems such as autonomous vehicles.  

“For any new processor architecture to have any justification, it needs to be significantly better than previous ones,” says Medina. 

Habana cites the ResNet-50 image classification algorithm to highlight its silicon’s merits. ResNet-50 refers to a 50-layer neural network that makes use of a technique called residual learning that improves the efficacy of image classification.    

Habana’s Goya HL-1000 processor can classify 15,000 images per second using ResNet-50, while Nvidia’s V100 GPU classifies 2,657 and Intel’s dual-socket Platinum 8180 CPU achieves 1,225 images per second.
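
Taken at face value, those throughput figures imply the following ratios (simple arithmetic on the reported numbers, not an independent benchmark):

```python
# Ratios implied by the quoted ResNet-50 throughput figures (reported numbers only).
goya, v100, xeon = 15_000, 2_657, 1_225          # images per second, as reported
print(f"Goya vs V100: {goya / v100:.1f}x")       # ~5.6x
print(f"Goya vs Xeon: {goya / xeon:.1f}x")       # ~12.2x
```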

“What we have architected is fundamentally better than CPUs and GPUs in terms of processing performance and the processing-power factor,” says Medina.

“Habana appears to be one of the first start-ups to bring an AI accelerator to the market, that is, to actually deliver a product for sale,” says Linley Gwennap, president and principal analyst of The Linley Group. 

Both Habana and start-up Graphcore expect to have final products for sale this year, he says, while Wave Computing, another start-up, expects to enter production early next year. 

“It is also impressive that Habana is reporting 5-6x better performance than Nvidia, whereas Graphcore’s lead is less than 2x,” says Gwennap. “Graphcore focuses on training, however, whereas the Goya chip is for inference.”

 

Habana appears to be one of the first start-ups to bring an AI accelerator to the market


Gaudi training processor

Habana’s Gaudi chip is a neural-network training processor. Once trained, the neural network is optimised and loaded into an inference chip such as Habana’s Goya to implement what has been learnt.

“The process of getting to a trained model involves a very different compute, scale-out and power-envelope environment to that of inference,” says Medina.

To put this in perspective, the computing resources used to train the largest AI jobs have doubled every 3.5 months since 2012. The finding, from AI research company OpenAI, means that the computing power being employed has grown by over one million times since 2012.
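
As a rough worked example of that doubling rate (the six-year window to 2018 is an assumption made here purely for illustration):

```python
# Compute growth implied by a 3.5-month doubling time (OpenAI's observation).
# The 2012-2018 window is an illustrative assumption.
months = 6 * 12
doublings = months / 3.5
growth = 2 ** doublings
print(f"{doublings:.1f} doublings -> roughly {growth:,.0f}x more compute")   # ~1.6 million x
```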

Habana remains secretive about the details of its chips. It has said that the 16nm CMOS Gaudi chip can scale to thousands of units and that each device will have 2 terabits per second of input-output (I/O). This contrasts with GPUs used for training, which do have scaling issues, it says.

First, GPUs are expensive and power-hungry devices. The data set used for training such as for image classification needs to be split across the GPUs. If the number of images - the batch size - given to each one is too large, the training model may not converge. If the model doesn't converge, the neural network will not learn to do its job. 

In turn, reducing the batch size affects the overall throughput. “GPUs and CPUs want you to feed them with a lot of data to increase throughput,” says Medina.     
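
One common way to frame this trade-off is sketched below; the batch sizes and GPU counts are purely illustrative assumptions, not figures from Habana or Nvidia:

```python
# If the global batch is capped to protect convergence, adding GPUs shrinks the
# per-GPU batch, and small per-GPU batches leave each device under-fed.
# All numbers are illustrative assumptions.
global_batch_cap = 4_096                          # assumed convergence-safe global batch
for num_gpus in (64, 256, 1_024):
    per_gpu_batch = global_batch_cap // num_gpus
    print(f"{num_gpus:5d} GPUs -> {per_gpu_batch:3d} images per GPU per step")
```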

Habana says that unlike GPUs, its training processor’s performance will scale with the number of devices used.  

“We will show with the Gaudi that we can scale performance linearly,” says Medina. “Training jobs will finish faster and models could be much deeper and more complex.”

 The Goya IC architecture. Habana says this is a general representation of the chip and what is shown is not the actual number of tensor processor cores (TPCs). Source: Habana Labs

 

Goya inference processor 

The Goya processor comprises multiple tensor processor cores (TPCs), see diagram. Habana is not saying how many, but each TPC is capable of processing vectors and matrices efficiently using several data types - eight-, 16- and 32-bit signed and unsigned integers, and 32-bit floating point. To achieve this, the TPC is architected as a very-long-instruction-word (VLIW), single-instruction, multiple-data (SIMD) vector processor. Each TPC also has its own local memory.

Other on-chip hardware blocks include a general matrix multiply (GEMM) engine, shared memory, an interface to external DDR4 SDRAM, and support for PCI Express (PCIe) 4.0.
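
Habana has not detailed the GEMM engine’s internals; the sketch below simply shows the kind of operation such an engine accelerates - multiplying 8-bit integer matrices while accumulating in a wider type - with the shapes and int32 accumulator chosen here for illustration.

```python
import numpy as np

# The kind of operation a GEMM engine accelerates: multiply 8-bit integer
# matrices and accumulate the products in a wider type to avoid overflow.
# Shapes and the int32 accumulator are illustrative assumptions.
rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(64, 256), dtype=np.int8)   # e.g. activations
b = rng.integers(-128, 128, size=(256, 32), dtype=np.int8)   # e.g. weights

acc = a.astype(np.int32) @ b.astype(np.int32)                # accumulate in int32
print(acc.shape, acc.dtype)                                  # (64, 32) int32
```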

 

What we have architected is fundamentally better than CPUs and GPUs in terms of processing performance and the processing-power factor

 

Habana claims its inference chip has a key advantage when it comes to latency, the time it takes for the inference chip to deliver its answer. 

Latency, too, is a function of the batch size - the number of jobs - presented to the device. Pooling jobs presented to the chip benefits throughput, but not if doing so exceeds the required latency.

“If you listen to what Google says about real-time applications, to meet the 99th percentile of real-time user interaction, they need the inference to be accelerated to under 7 milliseconds,” says Medina. “Microsoft also says that latency is incredibly important and that is why they can’t use a batch size of 64.”
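
A simple, purely illustrative calculation shows why large batches and tight latency budgets collide: before a batched request can even be computed, the batch has to fill. The request rate assumed below is not a figure from Google or Microsoft.

```python
# Illustrative only: time spent just collecting a batch at an assumed arrival
# rate of 1,000 requests per second, before any compute happens.
arrival_rate_per_s = 1_000
for batch in (1, 8, 64):
    fill_time_ms = batch / arrival_rate_per_s * 1_000
    print(f"batch {batch:2d}: ~{fill_time_ms:.0f} ms just to fill the batch")
```

At that rate a batch of 64 takes about 64ms to assemble, far beyond a 7-millisecond real-time budget, which is why performance at small batch sizes matters.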

Habana and other entrants are going after applications where their AI processors are efficient at real-time tasks with a batch size of one. “Everyone is focussing on what Nvidia can’t do well so they are building inference chips that do very well with low batch sizes,” says Freund.    

Having a low-latency device not only will enable all sorts of real-time applications but will also allow a data centre operator to rent out the AI processor to multiple customers, knowing what the latency will be for each job.

“This will generate more revenue and lower the cost of AI,” says Medina.

 

AI PCIe cards

Habana is offering two PCIe 4.0 card versions of its Goya chip: one a single slot wide and the second double width. This is to accommodate customers whose platforms already use double-width GPU cards.

Habana’s PCIe 4.0 card includes the Goya chip and external memory and consumes around 100W, the majority of which is consumed by the inference chip.

The card’s PCIe 4.0 interface has 16 lanes (x16) but nearly all the workloads can manage with a single lane.   

“The x16 is in case you go for more complicated topologies where you can split the model between adjacent cards and then we need to pass information between our processors,” says Medina. 

Here, a PCIe switch chip would be put on the motherboard to enable the communications between the Goya processors.

 

Do start-ups have a sustainable architectural roadmap that offers innovation beyond just such single-cycle operations? 

 

Applications

Habana has developed demonstrations of four common applications to run on the Goya cards. These include image classification, machine translation, recommendations, and the classification of text known as sentiment analysis.  

The four were chosen as potential customers want to see these working. “If they are going to buy your hardware for inference, they want to make sure it can deal with any topology they come up with in future,” says Medina.

Habana says it is already engaged with customers other than the largest data centre operators.  And with time, the start-up expects to develop inference chips with tailored I/O to address dedicated applications such as autonomous vehicles.

There are also other markets emerging besides data centres and self-driving cars.

“Mythic, for example, targets security cameras while other start-ups offer IP cores, and some target the Internet of Things and other low-cost applications,” says Gwennap. “Eventually, most processors will have some sort of AI accelerator built-in, so there are many different opportunities for this technology.”

  

Start-up challenge

The challenge facing all the AI processor start-ups, says Freund, is doing more than developing an architecture that can do a multiply-accumulate operation in a single processor clock cycle, and not just with numbers but with n-dimensional matrices.

“That is really hard but eventually - give or take a year - everyone will figure it out,” says Freund. 

The question for the start-ups is: do they have a sustainable architectural roadmap that offers innovation beyond just such single-cycle operations? 

“What architecturally are you able to do beyond that to avoid being crushed by Nvidia, and if not Nvidia then Intel because they haven't finished yet,” says Freund. 

This is what all these start-ups are going to struggle with whereas Nvidia has 10,000 engineers figuring it out, he warns.

 

Article updated on Nov 16 to report the latest Series B funding.  

