Habana Labs, a start-up based in Israel with offices in Silicon Valley, emerged from stealth to unveil its first AI processor.
Habana's deep learning inference processor, named Goya, delivers more than two orders of magnitude better throughput and power efficiency than commonly deployed CPUs, according to the company. The company will offer a PCIe 4.0 card that incorporates a single Goya HL-1000 processor and is designed to accelerate various AI inference workloads, such as image recognition, neural machine translation, sentiment analysis, and recommender systems. The card delivers 15,000 images/second of throughput on the ResNet-50 inference benchmark, with 1.3 milliseconds of latency, while consuming only 100 watts of power.
Habana is also developing an inference software toolkit to simplify the development and deployment of deep learning models (topologies) for mass-market use. The idea is to provide an inference network model compilation and runtime that eliminates low-level programming of the processor.
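Habana has not published that toolkit's API, but the pattern it describes -- hand a trained model file to a runtime and let it handle execution, with no device-level programming -- is the same one generic inference runtimes expose. As a rough illustration only, here is that pattern using ONNX Runtime as a stand-in; the model path and input shape are placeholders.

```python
# Illustration of the "load a trained model, then just run it" workflow,
# using ONNX Runtime as a stand-in. Habana's own toolkit API was not
# public at the time of this interview; the model path is a placeholder.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("resnet50.onnx")   # trained model file
input_name = session.get_inputs()[0].name         # name of the image tensor

batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: batch})  # runtime handles execution
print(outputs[0].shape)                           # e.g. (1, 1000) class scores
```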
I recently sat down with Eitan Medina, Habana Labs' Chief Business Officer, to discuss the development of this new class of AI processors and what it means for the cloud business.
Jim Carroll: Who is Habana Labs and how did you guys get started?
Eitan Medina: Habana was founded in 2016 with the goal of building AI processors for inference and training. Currently, we have about 120 people on board, mostly in R&D and based in Israel. We have a business headquarters here in Silicon Valley. In terms of the background of the management team, most of us have deep expertise in processors, DSPs, and communications semiconductors. I was previously the CTO of Galileo Technology (acquired by Marvell), and now I am on the business side. I would say we have a very strong, multidisciplinary team for machine learning. We certainly have the expertise in processing, software, and networking to architect a complete hardware and software solution for deep learning.
In building this company, we identified the AI space as one that deserves its own class of processors. We believe that the existing CPUs and GPUs are not good enough.
The first wave of these AI processors is arriving or being announced now. Habana decided that, unlike other semiconductor companies, we would only emerge from stealth once we had an actual product. We have production samples now, and that is why we are officially launching the company.
Jim Carroll: Who are the founders and what motivated them to enter this market segment?
Eitan Medina: The two co-founders are David Dahan (CEO) and Ran Halutz (VP R&D), who worked together at PrimeSense, a company that was acquired by Apple. We also have on board Shlomo Raikin (CTO), who was the Chief SoC Architect at Mellanox and holds 45 patents. We've also been able to recruit top talent from across the R&D ecosystem in Israel. The lead investors are Avigdor Willenz (Chairman), Bessemer, and WALDEN (Lip-Bu Tan).
Jim Carroll: What does the name "Habana" refer to?
Eitan Medina: In Hebrew, Habana means "understanding" -- a good name for an AI company.
Jim Carroll: The market for AI processors, obviously, is in its infancy. How do you see it developing?
Eitan Medina: Well, some analysts are already projecting a market for a new class of chipsets for deep learning. Tractica, for instance, divides the emerging market into CPUs, GPUs, FPGAs, ASICs, SoC accelerators, and other devices. We see the need for a different type of processor because of the huge gap between the computational requirements for AI and the incremental improvements that vendors have delivered over the past few years, which so far amount to small refinements of CPUs and GPUs. Look at the best-in-class deep learning models and calculate how much computing power is needed to train them, and look at how these requirements have grown over the past few years. Try graphing this progression and you will see a log-scale curve with a doubling time of three and a half months. That's roughly 10x every year. Initially, people were running machine learning on CPUs, and then they adopted Nvidia's GPUs. What we see in the market today is that training is dominated by GPUs, while inference is dominated by CPUs.
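Medina's arithmetic checks out: a doubling time of three and a half months compounds to just under 11x per year, so "roughly 10x" is fair.

```python
# Compound growth check: a 3.5-month doubling time implies
# 2 ** (12 / 3.5), roughly 10.8x growth per year.
doubling_months = 3.5
yearly_growth = 2 ** (12 / doubling_months)
print(f"{yearly_growth:.1f}x per year")  # -> 10.8x per year
```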
Jim Carroll: So what is Habana's approach?
Eitan Medina: When we looked at the overall deep learning space, we began with the workflows. It is important to understand that there's a training workflow, and there's an inference workflow. What we are introducing today is our "Goya" inference processor. Our "Gaudi" training processor will be introduced in the second quarter of 2019. It will feature a 2Tbps interface per device and its training performance scales linearly to thousands of processors. We intend to sell line cards equipped with these processors, which you can then plug into your existing servers.
The inference processor offloads this workload completely from the CPU, so you will not need to replace your existing servers with more advanced CPUs. What can this do for you? This is where our story gets really interesting: we are talking about more than an order of magnitude of improvement.
Look at this graph showing our ResNet-50 inference throughput and latency performance. On the left side is the best performance Intel has shown to date on a dual-socket Xeon Platinum. Latency is not reported, which could be a critical issue. In the middle is Nvidia's V100 Tensor Core GPU, which shows about 6ms of latency -- not bad, but we can do better. Our performance, shown on the right, exceeds 15,000 images per second with just 1.3ms of latency. Our card is just 100 watts, whereas we estimate at least 400 watts for the other guys.
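Taking the numbers Medina cites at face value, the perf-per-watt gap is easy to quantify. Note that the 400-watt figure is his estimate for the competing setups, not a measured spec.

```python
# Perf-per-watt check using the figures cited in the interview.
# The 15,000 images/sec at 100 W is Habana's claim; 400 W is Medina's
# estimate for competing setups, not a measured spec.
goya_ips, goya_watts = 15_000, 100
print(f"Goya: {goya_ips / goya_watts:.0f} images/sec per watt")  # 150

# A device drawing ~400 W would need 400 * 150 = 60,000 images/sec
# on ResNet-50 to match that efficiency.
competitor_watts = 400
breakeven = competitor_watts * goya_ips / goya_watts
print(f"Break-even throughput at 400 W: {breakeven:,.0f} images/sec")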
Jim Carroll: Where are you getting these gains? Are you processing the images in a different way?
Eitan Medina: Well, I can say that we are not changing the topology. If you are an AI researcher with a ResNet-50 topology, we will take your topology and ingest it into our compiler. We're not forcing you to change anything in your model.
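The "ingest an unmodified topology" idea is easy to picture with a concrete export step. The sketch below uses PyTorch to serialize a stock, pretrained ResNet-50 to ONNX, the kind of artifact a vendor compiler can then consume as-is; the Goya-specific compile step is omitted because that API has not been published.

```python
# Exporting an unmodified, pretrained ResNet-50 to ONNX -- the kind of
# artifact a vendor compiler can ingest without any changes to the model.
# (The Goya-specific compile step is omitted; that API was not public.)
import torch
import torchvision.models as models

model = models.resnet50(pretrained=True).eval()
dummy = torch.randn(1, 3, 224, 224)  # standard ResNet-50 input shape
torch.onnx.export(model, dummy, "resnet50.onnx", opset_version=11)
```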
Jim Carroll: So, if we try to understand the magic inside a GPU, Nvidia will talk about their ability to process polygons in parallel with large numbers of cores. Where is the magic for Habana?
Eitan Medina: Yeah, Nvidia will say they are really good at figuring out polygons, and may tell you about the massive memory bandwidth they can provide to the many cores. But, at the end of the day, if you are interested in doing image recognition, you only really care about application performance, not the stories of how wonderful the technology is.
Let's assume for a second that there's a guy with a very inefficient image processing architecture, ok? What would this guy do to give you better performance from generation to generation? He would just pack in more of the same stuff each time -- more memory, more bandwidth, and more power. And then he would tell you to "buy more to save more". Sound familiar? This guy can show you improvements, but if he's carrying that inefficiency throughout the stack, it is just going to be more of the same. When a new guy comes to market, what you want to see is application performance. What's your latency? What's your throughput? What's your accuracy? What's your power? What's your cost? If we can show all of that, then we don't have to have a debate about architecture.
Jim Carroll: So, are you guys using the same "magic" to deliver inference performance?
Eitan Medina: No, but for now, I want to show you what we can do. The lion's share of inference processing at cloud operators today runs on CPUs -- an estimated 91% of these workloads. Nvidia so far has not come up with a solution to move this market to GPUs; the market is using their GPUs mainly for training.
Our line card, installed in this server, can ingest and process 15,000 frames per second through the PCIe bus. Because our chip is so efficient, we don't need exotic memory technologies or specialized manufacturing techniques. In fact, this chip is built with 16 nanometer technology, which is quite mature and well understood. When we got the first device back from TSMC, we had ResNet up and running almost immediately.
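A quick sanity check, assuming typical 224x224 RGB inputs, shows why a standard PCIe link comfortably carries that frame rate.

```python
# Bandwidth sanity check: can a PCIe link carry 15,000 frames/sec?
# Assumes typical ResNet-50 inputs: 224 x 224 pixels, 3 channels, 1 byte each.
frames_per_sec = 15_000
bytes_per_frame = 224 * 224 * 3          # ~150 KB per frame as uint8
gb_per_sec = frames_per_sec * bytes_per_frame / 1e9
print(f"Input bandwidth: {gb_per_sec:.2f} GB/s")  # ~2.26 GB/s

# For comparison, PCIe 3.0 x16 offers ~16 GB/s and PCIe 4.0 x16 ~32 GB/s,
# so the input stream uses well under a quarter of the link.
```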
In a cloud data center, three of our line cards could deliver the inference processing equivalent of 169 Intel-powered servers or eight of Nvidia's latest Tesla V100 GPUs.
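Those equivalence figures imply per-device baselines worth reading off: three cards at 15,000 images/second each is 45,000 images/second in total.

```python
# Implied baselines behind the "three cards = 169 servers or eight V100s"
# claim, taking the interview's numbers at face value.
total_ips = 3 * 15_000                                       # three Goya cards
print(f"{total_ips / 169:.0f} images/sec per Intel server")  # ~266
print(f"{total_ips / 8:,.0f} images/sec per V100")           # ~5,625
```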
Habana Labs is showcasing a Goya inference processor card in a live server, running multiple neural-network topologies, at the AI Hardware Summit on September 18 – 19, 2018, in Mountain View, CA.